## Import UDF: Simple Parser 

<HTML>
<br>
As you learned in the <a href="./0%20-%20Introduction.ipynb" target="_self">Introduction</a>, UDFs are used to import data, transform it, and export it to downstream systems. In this Jupyter Notebook you will learn how to create a simple Import UDF Parser.
    
<br>
<br>
<div style="background-color : blue; color : white
    width: 284px;
    padding: 20px 20px 20px 100px;
    border: 1px solid #BFBFBF;
    background-color: white;box-shadow: 0px 0px 0px 0px #aaaaaa; position: relative;"><font style="font-size:20px">
What is a Parser?</font>
    <br>Parsers are a subset of Import UDFs that read data from a file, optionally amend the data, and insert it into a Xcalar table. For more on Import UDFs and other Parsers, see [User-defined function (UDF) examples](https://www.xcalar.com/documentation/help/XD/1.3.1/Content/D_Reference/H1_UdfExamples.htm?Highlight=import%20udf). It may also be useful to watch the introductory video tutorial [How-to Series (UDFs) - How to write import UDFs](https://www.youtube.com/watch?v=RFc_Ks8MgvU) on the Xcalar YouTube channel.
    <img src="xi-questionmark_yellow.png" 
         style="position: absolute;top: 20px;left: 30px;width:40px ;height:40px" />
</div>
</HTML>


<HTML>
<div style="background-color : blue; color : white
    width: 284px;
    padding: 20px 20px 20px 100px;
    border: 1px solid #BFBFBF;
    background-color: white;box-shadow: 0px 0px 0px #aaaaaa;"><font style="font-size:20px">
    Authoring UDFs</font>
    <br>There are two ways to create a UDF: with Xcalar's native UDF panel or with Jupyter Notebook. This tutorial covers only authoring UDFs with Jupyter. To learn more about the native Xcalar Design UDF panel, refer to
[Creating a UDF in Xcalar Design](https://www.xcalar.com/documentation/help/XD/1.3.1/Content/C_AdvancedTasks/C_UDFTasks.htm).
    <img src="xi-unlock icon_blue.png" 
         style="position: absolute;top: 26px;left: 125px;width:35px ;height:45.94px" />
</div>

</HTML>



### Creating Sample Data
We will write a Parser UDF that imports stock data from a local file, calculates the total sale price, and inserts the result into a Xcalar table.

First, you need to create a source file with random data. The code block below creates the file `/tmp/stocks.csv` and writes a header row followed by one record per security for each of 10,000 days (seven securities, so 70,000 records in total). Highlight the cell and click <b>Run</b>.

In [8]:
import random
from datetime import datetime, timedelta

def genRandomMarketData(fileDesc, dayCount):
    # column names
    headers = "security,date,Bid,Ask,Bid size,Ask size,Last Sale,Last size,Volume"
    # stock names
    securities = ["GGX", "ABC", "ZZM", "XEW", "FFG", "UYT", "RTF"]

    # write to the file object passed in, not to a global variable
    fileDesc.write(headers + "\n")
    for rec in range(1, dayCount + 1):
        date = datetime.today() - timedelta(days=rec)
        for security in securities:
            # records with Bid <= 0 will be tracked later in the Map UDF example;
            # although this is bad data, we want to analyze these records and their source later
            bid = random.uniform(-1, 200)
            ask = bid + random.uniform(1, 3)
            bidSize = random.randint(400, 1500)
            askSize = random.randint(400, 1500)
            lastSale = bid + random.uniform(-3, +3)
            lastSize = random.randint(400, 1500)
            volume = random.randint(5000, 15000)
            # values are written in the same order as the column names
            record = security + "," + date.strftime('%m.%d.%Y') + "," + str(bid) + "," + str(ask) + "," + str(bidSize)
            record = record + "," + str(askSize) + "," + str(lastSale) + "," + str(lastSize) + "," + str(volume)
            fileDesc.write(record + "\n")

# write the file
file_path = "/tmp/stocks.csv"
with open(file_path, "w") as file: 
    genRandomMarketData(file,10000)

# test the file was created
import os
print("Created {} file: {} bytes".format(file_path, os.stat(file_path).st_size))

Created /tmp/stocks.csv file: 6246506 bytes
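
As a quick sanity check on the generated file, the sketch below counts the columns and data rows with Python's standard `csv` module. The helper name `summarize_csv` is hypothetical and not part of the tutorial or of Xcalar. With the generator above (10,000 days and seven securities) it should report 9 columns and 70,000 data rows.

```python
import csv
import os

def summarize_csv(path):
    """Return (header, data_row_count) for a comma-separated file."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)               # first row holds the column names
        row_count = sum(1 for _ in reader)  # count the remaining rows
    return header, row_count

path = "/tmp/stocks.csv"
if os.path.exists(path):                    # only report if the file was generated
    header, rows = summarize_csv(path)
    print(len(header), "columns,", rows, "data rows")
```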


### Authoring a UDF

Now we are ready to write a Parser UDF that will process the sample data we just generated. 

Import UDFs take two arguments: the full path of the source file and a binary stream of its contents. An Import UDF is written as a Python generator function: each record it yields is inserted as a row in the target table.

In the code below we build a `record` dictionary that maps column names to column values. This dictionary represents one row to be inserted into the Xcalar table. Note that in addition to reading data from the source file, the code also computes a "Total Sale" column that is not present in the source. The code iterates over the rows of the file, and each record is handed back to Xcalar with the `yield` statement.
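
To make the generator pattern concrete, here is a minimal sketch that is independent of the stock data. The function name `parse_lines` and the in-memory `io.BytesIO` stream are assumptions made for illustration; in Xcalar the binary stream is supplied by the import itself.

```python
import codecs
import io

def parse_lines(fullPath, inStream):
    """Yield one record (a dict) per line of a binary stream."""
    reader = codecs.getreader("utf-8")(inStream)   # decode bytes to text
    for lineNumber, line in enumerate(reader, start=1):
        yield {"line_number": lineNumber, "text": line.rstrip("\r\n")}

stream = io.BytesIO(b"first\nsecond\n")      # stands in for the stream Xcalar would pass
records = parse_lines("in-memory", stream)   # calling the function reads nothing yet
print(next(records))                         # the body runs only up to the first yield
print(list(records))                         # the remaining records
```

Because the function is a generator, records are produced one at a time as the caller asks for them, so the whole file never has to be held in memory.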

Keeping this in mind, let's implement our import function.

In [2]:
def parse_stocks_file(inFile, inStream):
    firstRow = True
    import codecs                          # imported inside the function so the UDF is self-contained
    Utf8Reader = codecs.getreader("utf-8") # Xcalar opens and streams files in binary mode,
    utf8Stream = Utf8Reader(inStream)      # so we need a codec to decode the bytes as UTF-8
    for line in utf8Stream:
        # strip the line ending (also safe if the last line has none), then split the fields
        fields = line.rstrip("\r\n").split(",")
        if firstRow:                       # skip first row (headers)
            headers = fields
            firstRow = False
            continue
        record = {}                        # record dictionary
        for i, field in enumerate(fields):
            record[headers[i]] = field
        record["Total Sale"] = float(record["Last Sale"]) * float(record["Volume"])
        yield record                       # hands a single row to Xcalar to be inserted


### Testing your Parser Function

One way to test your parser is to run the snippet below, which prints the first few records as they would be inserted into the target Xcalar table.

Note that a better way to test your parser is provided by the Xcalar template, which we will discuss in the <a href="#test_and_upload">Testing & Uploading UDF to your Workbook</a> section later in this tutorial.

In [11]:
import itertools

with open("/tmp/stocks.csv","rb") as file: 
    data = parse_stocks_file("inFile", file)
    # print top three records
    for record in itertools.islice(data, 3):
        print(record)

{'security': 'GGX', 'date': '06.04.2018', 'Bid': '97.01519269012385', 'Ask': '98.58351312990226', 'Bid size': '1000', 'Ask size': '1191', 'Last Sale': '94.98348332118351', 'Last size': '703', 'Volume': '13756', 'Total Sale': 1306592.7965662004}
{'security': 'ABC', 'date': '06.04.2018', 'Bid': '159.16283761369965', 'Ask': '160.17275961434677', 'Bid size': '1289', 'Ask size': '666', 'Last Sale': '157.7267128144177', 'Last size': '938', 'Volume': '5071', 'Total Sale': 799832.1606819122}
{'security': 'ZZM', 'date': '06.04.2018', 'Bid': '178.55466743419694', 'Ask': '180.7752692892218', 'Bid size': '1207', 'Ask size': '624', 'Last Sale': '177.98573961358767', 'Last size': '1249', 'Volume': '8443', 'Total Sale': 1502733.5995575206}
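
A parser can also be checked without touching the file system by feeding it an in-memory binary stream. The sketch below does this with a stand-in parser, `parse_with_csv_module`, which is hypothetical and swaps the manual `split(",")` for Python's standard `csv.DictReader`. It assumes the stream can be wrapped in `io.TextIOWrapper`, which holds for `io.BytesIO` but has not been verified for the stream Xcalar passes in.

```python
import csv
import io

def parse_with_csv_module(inFile, inStream):
    """Stand-in parser: same idea as parse_stocks_file, built on csv.DictReader."""
    textStream = io.TextIOWrapper(inStream, encoding="utf-8", newline="")
    for record in csv.DictReader(textStream):   # the header row supplies the keys
        record["Total Sale"] = float(record["Last Sale"]) * float(record["Volume"])
        yield dict(record)

# Two made-up rows with values chosen so the product is easy to verify by hand.
sample = (b"security,Last Sale,Volume\n"
          b"GGX,2.5,4\n"
          b"ABC,10.0,3\n")
rows = list(parse_with_csv_module("in-memory", io.BytesIO(sample)))
assert rows[0]["Total Sale"] == 10.0    # 2.5 * 4
assert rows[1]["security"] == "ABC"
print(rows)
```

One reason to consider the `csv` module is that it handles quoted fields containing commas, which a plain `split(",")` would break apart.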


### Connecting Jupyter Notebook to Xcalar

Before utilizing this UDF, you need to establish a connection to the Xcalar session. 

1.   Click the <b>CODE SNIPPETS</b> dropdown menu in the top right corner of Jupyter.
2.   Select <b>Connect to Xcalar workbook</b>.
3.   Run the code cell containing the generated code to connect Jupyter to your current workbook.

Note that when you create a new Jupyter Notebook, the "Xcalar Notebook Connector" cell is created automatically.

Highlight the following cell and click <b>Run</b>.

In [4]:
# Xcalar Notebook Connector
# 
# Connects this Jupyter Notebook to the Xcalar Workbook <wb-1>
#
# To use any data from your Xcalar Workbook, run this snippet before other 
# Xcalar Snippets in your workbook. 
# 
# A best practice is not to edit this cell.
#
# If you wish to use this Jupyter Notebook with a different Xcalar Workbook 
# delete this cell and click CODE SNIPPETS --> Connect to Xcalar Workbook.

%matplotlib inline

# Importing third-party modules to facilitate data work. 
import pandas as pd
import matplotlib.pyplot as plt

# Importing Xcalar packages and modules. 
# For more information, search and post questions on discourse.xcalar.com
from xcalar.compute.api.XcalarApi import XcalarApi
from xcalar.compute.api.Session import Session
from xcalar.compute.api.WorkItem import WorkItem
from xcalar.compute.api.ResultSet import ResultSet

# Create a XcalarApi object
xcalarApi = XcalarApi()
# Connect to current workbook that you are in
workbook = Session(xcalarApi, "xdpadmin", "xdpadmin", 4399150, True, "TutorialNotebooks-HelloUDF-Full")
xcalarApi.setSession(workbook)

### Using the UDF Template to Create an Import UDF


1.   Again, click the <b>CODE SNIPPETS</b> dropdown.
2.   This time select <b>Create Import UDF</b>.

A form will pop up with the following fields:

*   <b>Data Target</b>: Select 'Default Shared Root'.
*   <b>Data Source Path</b>: Type the path to the file you created earlier, '/tmp/stocks.csv'.
*   <b>Module Name</b>: Choose a name for your UDF group, e.g. 'my_stocks'.
*   <b>Function Name</b>: Enter the name of your parser function, 'parse_stocks_file'.

Once you enter these settings click the <b>CONFIRM</b> button.


   <HTML> <img src="parser_udf_steps.png" style="width:440px" /></HTML>

<a name="test_and_upload"></a>
### Testing & Uploading UDF to your Workbook

You should see a new code cell added to your Jupyter Notebook. In this new cell locate the function 'parse_stocks_file' and replace it with our version of that function. 

Now run the cell; sample rows appear in a table under it. The cell below has these steps already applied. Its code does the following:
- Declares the UDF function (parse_stocks_file).
- Provides a function to upload the UDF to Xcalar (uploadUDF).
- Tests the UDF once it has been uploaded (testImportUDF). This is an alternative way to test your UDF. It is preferable because it exercises the function as it was actually uploaded, and the template generates it for you automatically.
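
The template's `uploadUDF` sends only the source text of the function itself (it collects it with `inspect.getsourcelines`), so anything defined outside the function, including module-level imports, is not uploaded. The sketch below illustrates the consequence in plain Python by executing source text in an empty namespace. It demonstrates Python scoping only and does not claim to show how Xcalar loads UDF modules; the names `decode` and `load_function` are hypothetical.

```python
import io

# Source text as it would look if only the function were uploaded.
SOURCE_IMPORT_INSIDE = '''
def decode(inStream):
    import codecs                      # the import travels with the function
    return codecs.getreader("utf-8")(inStream).read()
'''

SOURCE_IMPORT_OUTSIDE = '''
def decode(inStream):
    return codecs.getreader("utf-8")(inStream).read()   # relies on an import that was left behind
'''

def load_function(source):
    """Execute source text in a fresh namespace and return its 'decode' function."""
    namespace = {}
    exec(source, namespace)
    return namespace["decode"]

print(load_function(SOURCE_IMPORT_INSIDE)(io.BytesIO(b"ok")))

try:
    load_function(SOURCE_IMPORT_OUTSIDE)(io.BytesIO(b"ok"))
except NameError as err:
    print("NameError:", err)
```

This is why the parser in this tutorial keeps `import codecs` inside the function body.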


In [5]:
# Xcalar Import UDF Template
#
# This is a function definition for a Python UDF to import external data source
# file <Default Shared Root:/tmp/stocks.csv>
#
# Module name: <my_stocks>
# Function name: <parse_stocks_file>
#
# REQUIREMENTS: Import UDF functions take two arguments...
#   fullPath: The file path to the data source file being imported.
#   inStream: A binary stream of the data source file.
#
#   Your Import UDF function must be a generator, a Python function which
#   processes and returns a stream of data.
#
# To create an import UDF, modify the function definition immediately below this
# comment, as necessary.
#
# To test your UDF, run this cell. (Hit <control> + <enter>.)
#
# To apply it to your dataset, click the "Apply UDF on Dataset Panel" button.
#

# NOTE: Use discipline before replacing this module. Consider whether the import of older
# data source files using this UDF will be affected by this change. If so, versioning this
# module may be appropriate.
#
# Best practice is to name helper functions by starting with __. Such
# functions will be considered private functions and will not be directly
# invokable from Xcalar tools.

# Function definition for your Import UDF.
def parse_stocks_file(inFile, inStream):
    firstRow = True
    import codecs                          # imported inside the function so it is uploaded with it
    Utf8Reader = codecs.getreader("utf-8") # Xcalar opens and streams files in binary mode,
    utf8Stream = Utf8Reader(inStream)      # so we need a codec to decode the bytes as UTF-8
    for line in utf8Stream:
        # strip the line ending (also safe if the last line has none), then split the fields
        fields = line.rstrip("\r\n").split(",")
        if firstRow:                       # skip first row (headers)
            headers = fields
            firstRow = False
            continue
        record = {}                        # record dictionary
        for i, field in enumerate(fields):
            record[headers[i]] = field
        record["Total Sale"] = float(record["Last Sale"]) * float(record["Volume"])
        yield record                       # hands a single row to Xcalar to be inserted

### WARNING DO NOT EDIT CODE BELOW THIS LINE ###
from xcalar.compute.api.Dataset import *
from xcalar.compute.coretypes.DataFormatEnums.ttypes import DfFormatTypeT
from xcalar.compute.api.Udf import Udf
from xcalar.compute.coretypes.LibApisCommon.ttypes import XcalarApiException
import random

def uploadUDF():
    import inspect
    sourceCode = "".join(inspect.getsourcelines(parse_stocks_file)[0])
    try:
        Udf(xcalarApi).add("my_stocks", sourceCode)
    except XcalarApiException as e:
        if e.status == StatusT.StatusUdfModuleAlreadyExists:
            Udf(xcalarApi).update("my_stocks", sourceCode)

def testImportUDF():
    from IPython.core.display import display, HTML
    userName = "temp"
    tempDatasetName = userName + "." + str(random.randint(10000,99999)) + "jupyterDS" + str(random.randint(10000,99999))
    dataset = UdfDataset(xcalarApi,
        "Default Shared Root",
        "/tmp/stocks.csv",
        tempDatasetName,
        "my_stocks:parse_stocks_file")

    dataset.load()

    resultSet = ResultSet(xcalarApi, datasetName=dataset.name, maxRecords=100)

    NUMROWS = 100
    rowN = 0
    numCols = 0
    headers = []
    data = []
    for row in resultSet:
        if rowN >= NUMROWS:
            break
        newRow = [""] * numCols
        for key in row:
            idx = headers.index(key) if key in headers else -1
            if idx > -1:
                newRow[idx] = row[key]
            else:
                numCols += 1
                newRow.append(row[key])
                headers.append(key)
        data.append(newRow)
        rowN += 1
    data = [row + [""] * (numCols - len(row)) for row in data]

    print("The following should look like a proper table with headings.")
    display(HTML(
            '<table><tr><th>{}</th></tr><tr>{}</tr></table>'.format(
            '</th><th>'.join(headers),
            '</tr><tr>'.join('<td>{}</td>'.format('</td><td>'.join(str(_) for _ in row)) for row in data)
            )))

    dataset.delete()
    print("End of UDF")

# Test import UDF on file
uploadUDF()
testImportUDF()

The following should look like a proper table with headings.


security,date,Bid,Ask,Bid size,Ask size,Last Sale,Last size,Volume,Total Sale
GGX,05.24.2018,103.31345139991043,105.6778701874851,1357,1410,103.33112454568308,1490,14428,1490861.4649451154
ABC,05.24.2018,169.90860774729515,172.4875960759087,1079,475,171.98397799793744,425,12364,2126409.9039664986
ZZM,05.24.2018,118.19393752442696,120.51581817275006,564,1357,120.470786645656,1425,12308,1482754.442034734
XEW,05.24.2018,25.056889047063542,26.34779477289431,1446,617,27.275088644956277,1304,5056,137902.8481888989
FFG,05.24.2018,48.67056445690629,49.73149689545219,584,829,46.568009326936135,764,8967,417575.3396346363
UYT,05.24.2018,94.6684222982964,96.29528501314402,1093,1033,92.22536902189536,435,5383,496449.16144486266
RTF,05.24.2018,-0.4175023514063348,2.001814138336352,702,1157,1.350552830023628,1431,5781,7807.5459103665935
GGX,05.23.2018,158.1672249053461,159.2968332531478,985,634,161.13852477336044,1464,7503,1209022.3513745237
ABC,05.23.2018,22.50383286927212,24.279247873536505,431,820,21.3744937488841,1093,14813,316620.37590222014
ZZM,05.23.2018,75.25134452433052,77.96135545881455,1111,1002,76.40839421814982,1464,9676,739327.6224548176


End of UDF


### Applying your UDF to Create a Table

Now that your UDF is set up, you can use it to create a table in Xcalar Design. Note that we will use this table in later tutorials on Map UDFs. You must first select the stocks.csv file you created as a data source.

<HTML>
<br>
<div style="background-color : blue; color : white
    width: 284px;
    padding: 20px 20px 20px 100px;
    border: 1px solid #BFBFBF;
    background-color: white;box-shadow: 0px 0px 0px 0px #aaaaaa; position: relative;"><font style="font-size:20px">
Selecting a datasource</font>
    <br>For more information on selecting a data source, see [Selecting a data source](https://www.xcalar.com/documentation/help/XD/1.3.1/Content/A_GettingStarted/D_DetailedStepsForPointing.htm).
    <img src="xi-questionmark_yellow.png" 
         style="position: absolute;top: 5px;left: 30px;width:40px ;height:40px" />
</div>
</HTML>


1. Click the Datasets icon in the XD menu.
2. In the <b>Import Data Source</b> form select 'Default Shared Root' for <b>Data Target</b> and '/tmp/stocks.csv' for <b>Data Source Path</b>.
3. Click <b>NEXT</b>.

<img src="importDS1.png" style="width: 800px; border: 1px solid #CCC;"/>
<HTML>
<div style="background-color : blue; color : white
    width: 284px;
    padding: 20px 20px 20px 100px;
    border: 1px solid #BFBFBF;
    background-color: white;box-shadow: 0px 0px 0px 0px #aaaaaa; position: relative;"><font style="font-size:20px">
Importing a datasource</font>
    <br>For more information on how to import a data source, see [Importing a data source](https://www.xcalar.com/documentation/help/XD/1.3.1/Content/A_GettingStarted/E_SpecifyHowToImport.htm).
    <img src="xi-questionmark_yellow.png" 
         style="position: absolute;top: 5px;left: 30px;width:40px ;height:40px" />
</div>
</HTML>
4. On the next page, change the <b>Format</b> to 'Custom Format'. This changes the other fields and lets you select a UDF.
5. Select 'my_stocks' in the <b>Module</b> dropdown, and 'parse_stocks_file' in <b>Function</b>.
6. Click <b>CREATE DATASET</b>.
7. The next page shows a preview of your table. Select all columns and click <b>CREATE TABLE</b>.
<img src="importDS2.png" style="width: 600px; border: 1px solid #CCC;"/>

Note that the parser we created in this example is rather simple. Parsers are typically built to read semi-structured or unstructured files, or files with a non-standard format, compression, or encoding.
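
As an illustration of a less standard input, the sketch below parses a gzip-compressed CSV that is encoded as Latin-1 rather than UTF-8. Everything here is hypothetical: the function name `parse_gzip_csv`, the sample data, and the assumption that the stream behaves like a regular Python binary file object. It follows the same two-argument generator shape used throughout this tutorial.

```python
import gzip
import io

def parse_gzip_csv(inFile, inStream):
    """Hypothetical parser for a gzip-compressed, Latin-1 encoded CSV stream."""
    import csv, gzip, io                          # kept inside, as in the tutorial's parser
    unzipped = gzip.GzipFile(fileobj=inStream)    # decompress on the fly
    text = io.TextIOWrapper(unzipped, encoding="latin-1", newline="")
    for record in csv.DictReader(text):           # the header row supplies the keys
        yield dict(record)

# Build a small compressed payload in memory to exercise the parser.
payload = gzip.compress("name,city\nZoë,Málaga\n".encode("latin-1"))
for row in parse_gzip_csv("in-memory.csv.gz", io.BytesIO(payload)):
    print(row)
```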

<html>
 Next: <a href="./2%20-%20Import%20UDF%20Parser%20Debugging%20and%20Troubleshooting.ipynb" target="_self">2. Import UDF: Simple Parser with Troubleshooting and Debugging</a><br>
 Back to <a href="./0%20-%20Introduction.ipynb" target="_self">Introduction</a><br>
</html>