## Import UDF: Simple Parser with Troubleshooting and Debugging

<HTML>
    <br>
In the previous example, <a href="./1%20-%20Import%20UDF%20Simple%20Parser.ipynb" target="_self">Import UDF Simple Parser</a>, you generated random stock data and imported it into Xcalar using a simple parser UDF. In this Jupyter Notebook you will introduce errors into the same data to demonstrate techniques for logging and debugging such errors when they occur at run time. (The Xcalar Jupyter template described in the
<a href="#test_and_upload">Testing & Uploading UDF to Workbook</a> section also provides convenient tooling for debugging your UDFs as you design them.)

The following code randomly generates non-numeric ask and bid values, which the Import UDF will handle later in this Jupyter Notebook. Highlight the cell and click <b>Run</b>.
</HTML>

In [34]:
import random
from datetime import datetime, timedelta

def genRandomMarketData(fileDesc, dayCount):
    headers ="Security,date,Bid,Ask,Bid size,Ask size,Last Sale,Last size,Volume"
    #fake stock names
    securities = ["GGX" ,"ABC" , "ZZM" , "XEW" , "FFG" ,"UYT","RTF"]
    fileDesc.write(headers + "\n")   # write through the file object passed in as an argument
    for rec in range (1, dayCount +1):
        date = datetime.today() - timedelta(days=rec)
        for security in securities:
            badDataProb = random.randint(1,50)
            # bids with ask = 0 will be tracked later in the Map UDF example;
            # although they are bad data, we want to analyze these records and their source later
            ask = random.randint(-1,200)         
            bid = ask + random.uniform(1,3)
            bidSize = random.randint(400,1500)
            askSize = random.randint(400,1500)
            lastSale = bid + random.uniform(-1,+1)
            lastSize = random.randint(400,1500)
            volume = random.randint(5000,15000)
            if badDataProb < 20:  # generate bad data in about 19 of every 50 records (roughly 38%)
                ask = "abc" 
                bid= "doRayMe" 
            record =  security + "," + date.strftime('%m.%d.%Y') + "," + str(ask) + "," + str(bid) + "," + str(bidSize)
            record = record +  "," + str(askSize) + "," + str(lastSale) + "," + str(lastSize) + "," + str(volume)  
            fileDesc.write(record + "\n")

# write the file
file_path = "/tmp/stocks_bad.csv"
with open(file_path, "w") as file: 
    genRandomMarketData(file,10000)

# test the file was created
import os
print("Created {} file: {} bytes".format(file_path, os.stat(file_path).st_size))

Created /tmp/stocks_bad.csv file: 4934574 bytes
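
As a quick sanity check on the generated file, the sketch below counts how many data rows carry a non-numeric value in the third or fourth column. The helper name `count_bad_rows` is hypothetical and not part of the original notebook; it assumes the comma-separated layout written above, with the header row first.

```python
def count_bad_rows(lines):
    """Return (bad, total) for CSV lines whose first line is a header.

    A row counts as bad when column 2 or column 3 (zero-based) cannot be
    converted to float. Hypothetical helper, for illustration only.
    """
    bad = total = 0
    for line in lines[1:]:                     # skip the header row
        fields = line.rstrip("\n").split(",")
        total += 1
        try:
            float(fields[2])
            float(fields[3])
        except (ValueError, IndexError):
            bad += 1
    return bad, total


sample = [
    "Security,date,Bid,Ask,Bid size,Ask size,Last Sale,Last size,Volume\n",
    "GGX,05.23.2018,abc,doRayMe,883,734,143.07,1084,11266\n",
    "XEW,05.23.2018,162,163.87,900,800,163.50,1000,8983\n",
]
bad, total = count_bad_rows(sample)
print("{} of {} rows are bad".format(bad, total))
```

To run it against the generated file, pass in a list of lines, for example `open(file_path).readlines()`.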



## Catching Invalid Data

The code above randomly creates some non-numeric values that could break data processing if not handled properly. In this example, we will show how you can write your Import UDF to catch these data violations and log them for later analysis.

Let's modify the Import UDF parser created in the previous Jupyter Notebook to log erroneous records that contain strings where floats are expected. The code within the function explicitly converts column values to float, and that conversion fails on non-numeric values. If the failure is not handled, a single invalid record throws an exception and none of the source records are loaded.

The code below will catch these (and any other) exceptions and continue, inserting erroneous records into the Xcalar table as clearly marked errors. 


In [17]:
# Following function yields the bad records

def parse_stocks_file(inFile, inStream):
    import codecs 
    firstRow = True
    Utf8Reader = codecs.getreader("utf-8") #  Xcalar opens and streams files in binary mode,
    utf8Stream = Utf8Reader(inStream)      #  We need a codec to convert it to UTF-8
    for line in utf8Stream:                                  
        fields = line.split(",")                           # split comma separated fields
        if firstRow:                                       # skip first row (headers)
            headers = fields                              
            firstRow = False
            continue
        record = {}                                                       
        try:
            record["security"] = fields[0]                                    # Security name
            record["Date"] = fields[1]                                        # Date
            record["Ask"] = float(fields[3])                                  # Asked Price
            record["Bid"] = float(fields[2])                                  # Bid Price
            record["Avg"] = (float(fields[2]) + float(fields[3])) / 2.0       # New Field , (Ask + Bid) / 2.0
            record["Sale"] = float(fields[6])                                 # Handshake price
            record["Volume"] = float(fields [8])                              # Number of stocks
            record["Total"] = float(fields[8]) * float (fields[6])            # New Field , Total Price 
            yield (record)
        except Exception as e:
            yield {"Bad Record":line, "Exception": str(e)}


### Testing your parser function

One way to test your parser is by using the code block below that iterates over several records as they are inserted into the target Xcalar table. 

Note that a better way to test your parser is provided by the Xcalar template discussed in the <a href="#test_and_upload">Testing & Uploading UDF to Workbook</a> section later in this tutorial.

In [19]:
import itertools

with open("/tmp/stocks_bad.csv","rb") as file: 
    data = parse_stocks_file("inFile", file)
    for record in itertools.islice(data, 10):
        print(record)

{'Bad Record': 'GGX,05.23.2018,abc,doRayMe,883,734,143.06817083846994,1084,11266\n', 'Exception': "could not convert string to float: 'doRayMe'"}
{'Bad Record': 'ABC,05.23.2018,abc,doRayMe,1030,787,14.206858342432081,1350,14262\n', 'Exception': "could not convert string to float: 'doRayMe'"}
{'Bad Record': 'ZZM,05.23.2018,abc,doRayMe,1364,1212,72.97317426906935,1400,7619\n', 'Exception': "could not convert string to float: 'doRayMe'"}
{'security': 'XEW', 'Date': '05.23.2018', 'Ask': 163.87247521450925, 'Bid': 162.0, 'Avg': 162.9362376072546, 'Sale': 163.49517836154794, 'Volume': 8983.0, 'Total': 1468677.187221785}
{'security': 'FFG', 'Date': '05.23.2018', 'Ask': 163.82133369399244, 'Bid': 161.0, 'Avg': 162.4106668469962, 'Sale': 164.70378387106237, 'Volume': 10743.0, 'Total': 1769412.750126823}
{'Bad Record': 'UYT,05.23.2018,abc,doRayMe,1162,425,102.32482347654988,1308,13168\n', 'Exception': "could not convert string to float: 'doRayMe'"}
{'security': 'RTF', 'Date': '05.23.2018', 'Ask'
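
The same kind of check can be run without touching the filesystem by feeding a parser an in-memory binary stream. The sketch below is a minimal illustration, assuming a compact stand-in parser (`parse_prices`, a hypothetical name) that follows the same generator pattern as `parse_stocks_file`; `io.BytesIO` plays the role of the binary stream that Xcalar would supply.

```python
import codecs
import io


def parse_prices(inFile, inStream):
    """Hypothetical stand-in parser: yields one dict per data row."""
    utf8Stream = codecs.getreader("utf-8")(inStream)   # decode the binary stream
    firstRow = True
    for line in utf8Stream:
        if firstRow:                                   # skip the header row
            firstRow = False
            continue
        fields = line.split(",")
        try:
            yield {"security": fields[0],
                   "Bid": float(fields[2]),
                   "Ask": float(fields[3])}
        except Exception as e:
            # Keep the raw line and the error text instead of aborting the import
            yield {"Bad Record": line, "Exception": str(e)}


raw = (b"Security,date,Bid,Ask\n"
       b"XEW,05.23.2018,162,163.5\n"
       b"GGX,05.23.2018,abc,doRayMe\n")
records = list(parse_prices("in-memory", io.BytesIO(raw)))
for record in records:
    print(record)
```

Small hand-written inputs like `raw` make it easy to confirm that one malformed row produces a marked error record while the rows around it still parse.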

### Logging Errors

Additionally, you may choose to log errors from within the UDF using Python's standard logging module.

<b>Note</b>: All outputs from this function are redirected to a Xcalar log file called xpu.out for later analysis. You can analyze this log after the import operation in order to troubleshoot potential problems.

In [2]:
# Following function logs and yields the bad records

def parse_and_log(inFile, inStream):
    import logging                                  # importing logging module 
    import datetime                                 # to get current date-time for our error logs
    import codecs 
    logging.basicConfig(level=logging.INFO)         # setting logging 
    firstRow = True
    Utf8Reader = codecs.getreader("utf-8")          #  Xcalar opens and streams files in binary mode,
    utf8Stream = Utf8Reader(inStream)               #  We need a codec to convert it to UTF-8
    for line in utf8Stream:                                  
        fields = line.split(",")                    # split comma separated fields
        if firstRow:                                # skip first row (headers)
            headers = fields                              
            firstRow = False
            continue
        try:
            record = {}                                                       # record dictionary
            record["security"] = fields[0]                                    # Security name
            record["Date"] = fields[1]                                        # Date
            record["Ask"] = float(fields[3])                                  # Asked Price
            record["Bid"] = float(fields[2])                                  # Bid Price
            record["Avg"] = (float(fields[2]) + float(fields[3])) / 2.0       # New Field , (Ask + Bid) / 2.0
            record["Sale"] = float(fields[6])                                 # Handshake price
            record["Volume"] = float(fields [8])                              # Number of stocks
            record["Total"] = float(fields[8]) * float (fields[6])            # New Field , Total Price 
            yield record
        except Exception as e:
            #add error logging. These logs will appear in file xpu.out
            log_dict = {}
            log_dict ["Time Stamp"] = datetime.datetime.utcnow().strftime("%I:%M%p on %B %d, %Y")
            log_dict ["Source UDF"] = "parse_stocks_file"
            log_dict ["Description"] = "Parser Error " + str(e)
            log_dict ["File"] = inFile
            logging.error(log_dict)
            yield {"Bad Record":line, "Exception": str(e)}
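
To see what such a log line looks like before running an import, the sketch below sends one error dictionary through Python's standard `logging` module into an in-memory buffer. It assumes the default `basicConfig` layout (`logging.BASIC_FORMAT`) and uses a logger named `udf_demo`, a hypothetical name chosen so the example does not touch the root logger. It says nothing about how Xcalar itself routes output to xpu.out.

```python
import io
import logging

buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter(logging.BASIC_FORMAT))

logger = logging.getLogger("udf_demo")    # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False                  # keep the demo out of the root logger
logger.addHandler(handler)

try:
    float("doRayMe")
except ValueError as e:
    # Log a dictionary, as the UDF above does; logging stores its str() form
    logger.error({"Description": "Parser Error " + str(e),
                  "File": "/tmp/stocks_bad.csv"})

logged = buffer.getvalue()
print(logged)
```

Because the dictionary is written in its string form, each log line has the shape `LEVEL:logger name:{...}`.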


<HTML>
<br>
<div style="background-color : blue; color : white
    width: 284px;
    padding: 20px 20px 20px 100px;
    border: 1px solid #BFBFBF;
    background-color: white;box-shadow: 0px 0px 0px 0px #aaaaaa; position: relative;"><font style="font-size:20px">
Debugging Best Practices</font>
    <br><b>Note</b>: In some cases you may prefer to filter out erroneous records during import, but the best practice is to bring every record to Xcalar and tag data quality issues and other errors so that they can be filtered and collected into an erroneous rows table for further analysis.
    <img src="xi-questionmark_yellow.png" 
         style="position: absolute;top: 5px;left: 30px;width:40px ;height:40px" />
</div>
</HTML>
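
The tagging approach described above can be sketched in plain Python: every record is kept, and the presence of the `Bad Record` key decides which collection it lands in. The function name `split_records` is hypothetical, and the snippet works on dictionaries shaped like the parser output shown earlier rather than on an actual Xcalar table.

```python
def split_records(records):
    """Partition parser output into (good, bad) lists without dropping rows."""
    good, bad = [], []
    for record in records:
        # Records produced by the except branch carry a "Bad Record" key
        (bad if "Bad Record" in record else good).append(record)
    return good, bad


parsed = [
    {"security": "XEW", "Ask": 163.87, "Bid": 162.0},
    {"Bad Record": "GGX,05.23.2018,abc,doRayMe\n",
     "Exception": "could not convert string to float: 'doRayMe'"},
    {"security": "FFG", "Ask": 163.82, "Bid": 161.0},
]
good, bad = split_records(parsed)
print(len(good), "good,", len(bad), "bad")
```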

## Adding this UDF to Xcalar
<HTML>
    <br>
Adding a UDF to Xcalar has already been covered in <a href="./1%20-%20Import%20UDF%20Simple%20Parser.ipynb" target="_self">Import UDF Simple Parser</a>. As a reminder, the steps you need to take are:
    <br>
</HTML>

1.   Click the <b>CODE SNIPPETS</b> dropdown menu in the top right corner of Jupyter.
2.   Select <b>Connect to Xcalar workbook</b>.
3.   Run the code cell containing the generated code to connect Jupyter to your current workbook.

In [3]:
# Xcalar Notebook Connector
# 
# Connects this Jupyter Notebook to the Xcalar Workbook <TutorialNotebooks5-14-18>
#
# To use any data from your Xcalar Workbook, run this snippet before other 
# Xcalar Snippets in your workbook. 
# 
# A best practice is not to edit this cell.
#
# If you wish to use this Jupyter Notebook with a different Xcalar Workbook 
# delete this cell and click CODE SNIPPETS --> Connect to Xcalar Workbook.

%matplotlib inline

# Importing third-party modules to facilitate data work. 
import pandas as pd
import matplotlib.pyplot as plt

# Importing Xcalar packages and modules. 
# For more information, search and post questions on discourse.xcalar.com
from xcalar.compute.api.XcalarApi import XcalarApi
from xcalar.compute.api.Session import Session
from xcalar.compute.api.WorkItem import WorkItem
from xcalar.compute.api.ResultSet import ResultSet

# Create a XcalarApi object
xcalarApi = XcalarApi()
# Connect to current workbook that you are in
workbook = Session(xcalarApi, "xdpadmin", "xdpadmin", 4399150, True, "TutorialNotebooks-HelloUDF-Full")
xcalarApi.setSession(workbook)

### Using the UDF Template

4.   Again, click the <b>CODE SNIPPETS</b> dropdown.
5.   This time select <b>Create Import UDF</b>.

A form will pop up with the following fields:

*   <b>Data Target</b>: Select 'Default Shared Root'.
*   <b>Data Source Path</b>: Type the path to the file you created earlier, '/tmp/stocks_bad.csv'.
*   <b>Module Name</b>: Choose a name for your UDF group, e.g. 'log_stocks'.
*   <b>Function Name</b>: Enter the name of your parser function, 'parse_and_log'.

Once you have entered these settings, click the <b>CONFIRM</b> button.

### Testing & Uploading UDF to Workbook
<a name="test_and_upload"></a>

You will see a new code cell added to your Jupyter Notebook. Below you will find these steps already implemented. The code below does the following:
- Declares the UDF function (parse_and_log).
- Provides a function to upload the UDF to Xcalar (uploadUDF).
- Tests the UDF once it has been uploaded (testImportUDF).

Locate the function 'parse_and_log' and replace it with our version of that function. Highlight the cell and click <b>Run</b>.

You will see sample rows displayed in a table under the code cell. Note that this is an alternative way to test your UDF. It is preferred because it tests the UDF function as actually uploaded, and because the test code is generated for you automatically when you add the code snippet.


In [4]:
# Xcalar Import UDF Template
#
# This is a function definition for a Python UDF to import external data source
# file <Default Shared Root:/tmp/stocks_bad.csv>
#
# Module name: <log_stocks>
# Function name: <parse_and_log>
#
# REQUIREMENTS: Import UDF functions take two arguments...
#   fullPath: The file path to the data source file being imported.
#   inStream: A binary stream of the data source file.
#
#   Your Import UDF function must be a generator, a Python function which
#   processes and returns a stream of data.
#
# To create an import UDF, modify the function definition immediately below this
# comment, as necessary.
#
# To test your UDF, run this cell. (Hit <control> + <enter>.)
#
# To apply it to your dataset, click the "Apply UDF on Dataset Panel" button.
#
#
# NOTE: Use discipline before replacing this module. Consider whether the import of older
# data source files using this UDF will be affected by this change. If so, versioning this
# module may be appropriate.
#
# Best practice is to name helper functions by starting with __. Such
# functions will be considered private functions and will not be directly
# invokable from Xcalar tools.

# Function definition for your Import UDF.
def parse_and_log(inFile, inStream):
    import logging                                  # importing logging module 
    import datetime                                 # to get current date-time for our error logs
    import codecs 
    logging.basicConfig(level=logging.INFO)         # setting logging 
    firstRow = True
    Utf8Reader = codecs.getreader("utf-8")          #  Xcalar opens and streams files in binary mode,
    utf8Stream = Utf8Reader(inStream)               #  We need a codec to convert it to UTF-8
    for line in utf8Stream:                                  
        fields = line.split(",")                    # split comma separated fields
        if firstRow:                                # skip first row (headers)
            headers = fields                              
            firstRow = False
            continue
        try:
            record = {}                                                       # record dictionary
            record["security"] = fields[0]                                    # Security name
            record["Date"] = fields[1]                                        # Date
            record["Ask"] = float(fields[3])                                  # Asked Price
            record["Bid"] = float(fields[2])                                  # Bid Price
            record["Avg"] = (float(fields[2]) + float(fields[3])) / 2.0       # New Field , (Ask + Bid) / 2.0
            record["Sale"] = float(fields[6])                                 # Handshake price
            record["Volume"] = float(fields [8])                              # Number of stocks
            record["Total"] = float(fields[8]) * float (fields[6])            # New Field , Total Price 
            yield record
        except Exception as e:
            #add error logging. These logs will appear in file xpu.out
            log_dict = {}
            log_dict ["Time Stamp"] = datetime.datetime.utcnow().strftime("%I:%M%p on %B %d, %Y")
            log_dict ["Source UDF"] = "parse_stocks_file"
            log_dict ["Description"] = "Parser Error " + str(e)
            log_dict ["File"] = inFile
            logging.error(log_dict)
            yield {"Bad Record":line, "Exception": str(e)}

### WARNING DO NOT EDIT CODE BELOW THIS LINE ###
from xcalar.compute.api.Dataset import *
from xcalar.compute.coretypes.DataFormatEnums.ttypes import DfFormatTypeT
from xcalar.compute.api.Udf import Udf
from xcalar.compute.coretypes.LibApisCommon.ttypes import XcalarApiException
import random

def uploadUDF():
    import inspect
    sourceCode = "".join(inspect.getsourcelines(parse_and_log)[0])
    try:
        Udf(xcalarApi).add("log_stocks", sourceCode)
    except XcalarApiException as e:
        if e.status == StatusT.StatusUdfModuleAlreadyExists:
            Udf(xcalarApi).update("log_stocks", sourceCode)

def testImportUDF():
    from IPython.core.display import display, HTML
    userName = "nogievetsky@xcalar.com"
    tempDatasetName = userName + "." + str(random.randint(10000,99999)) + "jupyterDS" + str(random.randint(10000,99999))
    dataset = UdfDataset(xcalarApi,
        "Default Shared Root",
        "/tmp/stocks_bad.csv",
        tempDatasetName,
        "log_stocks:parse_and_log")

    dataset.load()

    resultSet = ResultSet(xcalarApi, datasetName=dataset.name, maxRecords=100)

    NUMROWS = 100
    rowN = 0
    numCols = 0
    headers = []
    data = []
    for row in resultSet:
        if rowN >= NUMROWS:
            break
        newRow = [""] * numCols
        for key in row:
            idx = headers.index(key) if key in headers else -1
            if idx > -1:
                newRow[idx] = row[key]
            else:
                numCols += 1
                newRow.append(row[key])
                headers.append(key)
        data.append(newRow)
        rowN += 1
    data = [row + [""] * (numCols - len(row)) for row in data]

    print("The following should look like a proper table with headings.")
    display(HTML(
            '<table><tr><th>{}</th></tr><tr>{}</tr></table>'.format(
            '</th><th>'.join(headers),
            '</tr><tr>'.join('<td>{}</td>'.format('</td><td>'.join(str(_) for _ in row)) for row in data)
            )))

    dataset.delete()
    print("End of UDF")

# Test import UDF on file
uploadUDF()
testImportUDF()

The following should look like a proper table with headings.


security,Date,Ask,Bid,Avg,Sale,Volume,Total,Bad Record,Exception
GGX,05.20.2018,69.61643997435232,67.0,68.30821998717616,69.20759126547209,11242.0,778031.7410064372,,
,,,,,,,,"ABC,05.20.2018,abc,doRayMe,544,1091,2.443585987169952,877,13719",could not convert string to float: 'doRayMe'
ZZM,05.20.2018,7.411274958262378,6.0,6.705637479131189,6.873990656526375,10807.0,74287.21702508054,,
XEW,05.20.2018,202.54357063085752,200.0,201.2717853154288,203.02589577089168,5246.0,1065073.8492140975,,
FFG,05.20.2018,199.9157254985934,198.0,198.95786274929668,200.34635720766207,14878.0,2980753.102535596,,
,,,,,,,,"UYT,05.20.2018,abc,doRayMe,546,1251,92.74649743358962,954,7916",could not convert string to float: 'doRayMe'
RTF,05.20.2018,79.60955409712923,78.0,78.80477704856462,78.63145269041719,8507.0,668917.768037379,,
GGX,05.19.2018,119.80744868058504,118.0,118.90372434029253,119.68703128300128,7769.0,929848.5460376368,,
ABC,05.19.2018,134.8556076209867,132.0,133.42780381049334,134.0314289887087,7722.0,1034990.6946508088,,
ZZM,05.19.2018,43.12895337852567,42.0,42.56447668926283,43.11669527286721,6487.0,279698.0022350896,,


End of UDF


## Creating a Table from your Data
<HTML>
    <br>
We are almost done. Now, as covered in <a href="./1%20-%20Import%20UDF%20Simple%20Parser.ipynb" target="_self">Import UDF Simple Parser</a>, you can use your UDF to create a Xcalar table. Please do so, because this table will be used in a later Map UDF tutorial.
    <br>
</HTML>

1. Click the Datasets icon in the XD menu.
2. In the <b>Import Data Source</b> form select 'Default Shared Root' for <b>Data Target</b> and '/tmp/stocks_bad.csv' for <b>Data Source Path</b>.
3. Click <b>NEXT</b>.
4. On the next page, change the <b>Format</b> to 'Custom Format'. This changes the other fields and allows you to select a UDF.
5. Select the module you created, and select 'parse_and_log' in <b>Function</b>.
6. Click <b>CREATE DATASET</b>.
7. The next page will show a preview of your table. Select all columns and click <b>CREATE TABLE</b>.

<img src="importStockBad.png" style="width: 800px; border: 1px solid #CCC;"/>

## Checking the log file

After the import operation has run and the table has been created, the errors raised by the bad records are recorded in the Xcalar UDF log file <i>xpu.out</i>. The code below prints the last 10 lines of that log. Note that one of the later tutorials on log analytics explores how to get more insights from this log.

In [15]:
import os

#FINDS THE xpu.out LOG FILE AND READS DATA
def getConfigDict():
    from xcalar.compute.api.Env import XcalarConfigPath
    localExportDir = XcalarConfigPath
    cfgData = None
    with open(XcalarConfigPath, 'r') as f:
        cfgData = f.read()
    configdict = {}
    for line in cfgData.splitlines():
        if "=" in line:
            name, var = line.partition("=")[::2]
            configdict[name] = var.strip()
    return configdict
config = getConfigDict()
jnpath = os.getcwd()
fallbackpath = os.path.join(jnpath.split("opt/xcalar/jupyterNotebooks")[0],"log/xcalar")
logpath = config.get('Constants.XcalarLogCompletePath', fallbackpath)
with open(os.path.join(logpath,'xpu.out'),"r") as file:
    content = file.readlines()
content[-10:]

['ERROR:root:{\'Time Stamp\': \'12:37PM on May 22, 2018\', \'Source UDF\': \'parse_stocks_file\', \'Description\': "Parser Error could not convert string to float: \'doRayMe\'", \'File\': \'/tmp/stocks_bad.csv\'}\n',
 'ERROR:root:{\'Time Stamp\': \'12:37PM on May 22, 2018\', \'Source UDF\': \'parse_stocks_file\', \'Description\': "Parser Error could not convert string to float: \'doRayMe\'", \'File\': \'/tmp/stocks_bad.csv\'}\n',
 'ERROR:root:{\'Time Stamp\': \'12:37PM on May 22, 2018\', \'Source UDF\': \'parse_stocks_file\', \'Description\': "Parser Error could not convert string to float: \'doRayMe\'", \'File\': \'/tmp/stocks_bad.csv\'}\n',
 'ERROR:root:{\'Time Stamp\': \'12:37PM on May 22, 2018\', \'Source UDF\': \'parse_stocks_file\', \'Description\': "Parser Error could not convert string to float: \'doRayMe\'", \'File\': \'/tmp/stocks_bad.csv\'}\n',
 'ERROR:root:{\'Time Stamp\': \'12:37PM on May 22, 2018\', \'Source UDF\': \'parse_stocks_file\', \'Description\': "Parser Error cou
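
Because each entry was logged as the string form of a Python dictionary, a line can be turned back into a dictionary for analysis. The following is a minimal sketch, assuming the `ERROR:root:` prefix seen in the output above; `parse_log_line` is a hypothetical helper, and lines that do not match the expected shape are returned as `None`.

```python
import ast


def parse_log_line(line, prefix="ERROR:root:"):
    """Return the logged dictionary, or None if the line has another shape."""
    if not line.startswith(prefix):
        return None
    try:
        # literal_eval only accepts Python literals, so it is safe on log text
        value = ast.literal_eval(line[len(prefix):].strip())
    except (ValueError, SyntaxError):
        return None
    return value if isinstance(value, dict) else None


sample = ("ERROR:root:{'Time Stamp': '12:37PM on May 22, 2018', "
          "'Source UDF': 'parse_stocks_file', "
          "'Description': \"Parser Error could not convert string to float: 'doRayMe'\", "
          "'File': '/tmp/stocks_bad.csv'}\n")
entry = parse_log_line(sample)
print(entry["Description"])
```

Applied to the `content` list read above, `[parse_log_line(l) for l in content]` would give one entry per log line, with `None` for lines written by other sources.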

<html>
 Next: <a href="./3%20-%20Import%20UDF%20-%20Simple%20Connector.ipynb" target="_self">3. Import UDF: Simple Connector.ipynb</a><br>
 Back to <a href="./0%20-%20Introduction.ipynb" target="_self">Introduction</a><br>
</html>