# Data Ingestion

The first step of the data science process (excluding business understanding) is data ingestion. Doing data science requires data, so it is important to be able to ingest many different formats. Vertica can ingest many kinds of data files thanks to its built-in parsers.

Most of the time, we know the file's type and structure in advance and write the entire SQL query to ingest it. Sometimes, however, we do not know the column names and types in advance. To solve this problem, Vertica lets users create Flex Tables, an efficient way to ingest a data file without knowing its column types, or even its structure, in advance.

Vertica ML Python uses Flex Tables to auto-ingest JSON and CSV files. For other file types, it is advised to ingest them with direct SQL queries. Be careful when using the following functions: the detected data types may not be optimal, and it is always preferable to write SQL queries that use optimized types and segmentations.

Remember that Vertica ML Python uses Vertica SQL as its back end, so optimizing the table structure also improves Vertica ML Python performance.
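To make the point about optimized types concrete, here is a minimal sketch of how one might measure the longest value in each column of a small sample before hand-writing a CREATE TABLE statement with tighter Varchar sizes. It uses only Python's standard csv module. The sample rows and the helper name `max_lengths` are illustrative assumptions and are not part of Vertica ML Python.

```python
import csv
import io

# Small in-memory sample standing in for a real CSV file (illustrative data only).
sample = io.StringIO(
    'name,sex,age\n'
    '"Allen, Miss. Elisabeth Walton",female,29.0\n'
    '"Allison, Master. Hudson Trevor",male,0.92\n'
)

def max_lengths(handle):
    """Hypothetical helper: longest value length seen in each column."""
    reader = csv.DictReader(handle)
    lengths = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            # Treat missing values as empty strings.
            lengths[name] = max(lengths[name], len(value or ""))
    return lengths

lengths = max_lengths(sample)
print(lengths)
```

The resulting lengths are only a lower bound taken from a sample, so some headroom is usually sensible when choosing the final column sizes.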

# Ingesting CSV

CSV is data scientists' favourite format. It has an internal structure that makes it easy to ingest. To ingest a CSV file, we will use the 'read_csv' function.

In [13]:
from vertica_ml_python import read_csv
help(read_csv)

Help on function read_csv in module vertica_ml_python.utilities:

read_csv(path:str, cursor=None, schema:str='public', table_name:str='', sep:str=',', header:bool=True, header_names:list=[], na_rep:str='', quotechar:str='"', escape:str='\\', genSQL:bool=False, parse_n_lines:int=-1, insert:bool=False)
    ---------------------------------------------------------------------------
    Ingests a CSV file using flex tables.
    
    Parameters
    ----------
    path: str
            Absolute path where the CSV file is located.
    cursor: DBcursor, optional
            Vertica DB cursor.
    schema: str, optional
            Schema where the CSV file will be ingested.
    table_name: str, optional
            Final relation name.
    sep: str, optional
            Column separator.
    header: bool, optional
            If set to False, the parameter 'header_names' will be used to name the 
            different columns.
    header_names: list, optional
            List of the columns nam

You can easily ingest a CSV file by entering the correct parameters.

In [16]:
read_csv("titanic.csv",
         schema = "public",
         table_name = "titanic",
         sep = ",")

The table "public"."titanic" has been successfully created.


0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
,survived,boat,ticket,embarked,home.dest,sibsp,fare,sex,body,pclass,age,name,cabin,parch
0.0,1,2,24160,S,"St Louis, MO",0,211.3375,female,,1,29.0,"Allen, Miss. Elisabeth Walton",B5,0
1.0,1,11,113781,S,"Montreal, PQ / Chesterville, ON",1,151.55,male,,1,0.92,"Allison, Master. Hudson Trevor",C22 C26,2
2.0,0,,113781,S,"Montreal, PQ / Chesterville, ON",1,151.55,female,,1,2.0,"Allison, Miss. Helen Loraine",C22 C26,2
3.0,0,,113781,S,"Montreal, PQ / Chesterville, ON",1,151.55,male,135,1,30.0,"Allison, Mr. Hudson Joshua Creighton",C22 C26,2
4.0,0,,113781,S,"Montreal, PQ / Chesterville, ON",1,151.55,female,,1,25.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",C22 C26,2
,...,...,...,...,...,...,...,...,...,...,...,...,...,...


<object>  Name: titanic, Number of rows: 1234, Number of columns: 14

If no schema is passed as a parameter, the 'public' schema will be used. 
If 'table_name' is not defined, the final relation will be named after the CSV file.
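As an illustration of this default naming rule, the sketch below derives a relation name from a file path by dropping the directory and the extension. The helper `default_table_name` and the example path are hypothetical; the library's actual rule may differ in details such as character handling.

```python
import os

def default_table_name(path: str) -> str:
    """Hypothetical helper: file name without its directory or extension."""
    return os.path.splitext(os.path.basename(path))[0]

# Example path chosen for illustration only.
name = default_table_name("/home/user/titanic.csv")
print(name)
```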

It is also possible to skip the ingestion and only generate the SQL query that can be used to create the final relation.

In [19]:
read_csv("titanic.csv",
         schema = "public",
         table_name = "titanic",
         sep = ",",
         genSQL = True)

CREATE TABLE "public"."titanic"("pclass" Integer, "survived" Integer, "name" Varchar(164), "sex" Varchar(20), "age" Numeric(6,3), "sibsp" Integer, "parch" Integer, "ticket" Varchar(36), "fare" Numeric(10,5), "cabin" Varchar(30), "embarked" Varchar(20), "boat" Varchar(100), "body" Integer, "home.dest" Varchar(100));
COPY "public"."titanic"("pclass", "survived", "name", "sex", "age", "sibsp", "parch", "ticket", "fare", "cabin", "embarked", "boat", "body", "home.dest") FROM {} DELIMITER ',' NULL '' ENCLOSED BY '"' ESCAPE AS '\' SKIP 1;
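The generated COPY statement contains a `{}` where the data source would go. Assuming that `{}` is a placeholder for the source clause (an assumption, not something this notebook states), a minimal sketch of filling it with a local file path could look like the following. The helper `fill_source`, the shortened two-column template, and the path are all hypothetical.

```python
# Shortened version of the generated statement, kept to two columns for readability.
copy_template = (
    'COPY "public"."titanic"("pclass", "survived") '
    "FROM {} DELIMITER ',' SKIP 1;"
)

def fill_source(template: str, path: str) -> str:
    """Hypothetical helper: substitute the first {} with a LOCAL file source."""
    # Double any single quote so the path stays a valid SQL string literal.
    source = "LOCAL '{}'".format(path.replace("'", "''"))
    # Replace only the first placeholder and leave the rest of the text untouched.
    return template.replace("{}", source, 1)

statement = fill_source(copy_template, "/tmp/titanic.csv")
print(statement)
```

Using `str.replace` on the first occurrence, rather than `str.format` on the whole statement, avoids surprises if the generated SQL ever contains other braces.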


You can also use the parameter 'insert' to insert new data into the existing relation. Here the same file is inserted a second time, so the row count doubles from 1234 to 2468.

In [22]:
read_csv("titanic.csv",
         schema = "public",
         table_name = "titanic",
         sep = ",",
         insert = True)

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
,survived,boat,ticket,embarked,home.dest,sibsp,fare,sex,body,pclass,age,name,cabin,parch
0.0,1,2,24160,S,"St Louis, MO",0,211.3375,female,,1,29.0,"Allen, Miss. Elisabeth Walton",B5,0
1.0,1,11,113781,S,"Montreal, PQ / Chesterville, ON",1,151.55,male,,1,0.92,"Allison, Master. Hudson Trevor",C22 C26,2
2.0,0,,113781,S,"Montreal, PQ / Chesterville, ON",1,151.55,female,,1,2.0,"Allison, Miss. Helen Loraine",C22 C26,2
3.0,0,,113781,S,"Montreal, PQ / Chesterville, ON",1,151.55,male,135,1,30.0,"Allison, Mr. Hudson Joshua Creighton",C22 C26,2
4.0,0,,113781,S,"Montreal, PQ / Chesterville, ON",1,151.55,female,,1,25.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",C22 C26,2
,...,...,...,...,...,...,...,...,...,...,...,...,...,...


<object>  Name: titanic, Number of rows: 2468, Number of columns: 14

# Ingesting JSON

JSON is also a popular format and you can ingest JSON files using the 'read_json' function.

In [23]:
from vertica_ml_python import read_json
help(read_json)

Help on function read_json in module vertica_ml_python.utilities:

read_json(path:str, cursor=None, schema:str='public', table_name:str='', usecols:list=[], new_name:dict={}, insert:bool=False)
    ---------------------------------------------------------------------------
    Ingests a JSON file using flex tables.
    
    Parameters
    ----------
    path: str
            Absolute path where the JSON file is located.
    cursor: DBcursor, optional
            Vertica DB cursor.
    schema: str, optional
            Schema where the JSON file will be ingested.
    table_name: str, optional
            Final relation name.
    usecols: list, optional
            List of the JSON parameters to ingest. The other ones will be ignored. If
            empty all the JSON parameters will be ingested.
    new_name: dict, optional
            Dictionary of the new columns name. If the JSON file is nested, it is advised
            to change the final names as special characters will be include

This function works the same way as 'read_csv' but has fewer parameters, thanks to the standardization of the JSON format.

In [25]:
read_json("titanic.json",
          schema = "public",
          table_name = "titanic")

The table "public"."titanic" has been successfully created.


0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
,fields.parch,fields.name,fields.survived,fields.embarked,fields.sibsp,record_timestamp,fields.passengerid,fields.pclass,fields.fare,fields.sex,fields.ticket,recordid,fields.age,fields.cabin,datasetid
0.0,0,"Collander, Mr. Erik Gustaf",False,S,0,2016-09-20 15:34:51.313,343,2,13.0,male,248740,835634b93c8f759537a89daa01c3c3658e934617,28.0,,titanic-passengers
1.0,0,"Moen, Mr. Sigurd Hansen",False,S,0,2016-09-20 15:34:51.313,76,3,7.65,male,348123,97941a419e5cf6a4bb65147a7a21d7025c8a6e1b,25.0,F G73,titanic-passengers
2.0,0,"Jensen, Mr. Hans Peder",False,S,0,2016-09-20 15:34:51.313,641,3,7.8542,male,350050,b762da1fa9f7f7765bc14006d9f5b8fc1d2d5177,20.0,,titanic-passengers
3.0,4,"Palsson, Mrs. Nils (Alma Cornelia Berglund)",False,S,0,2016-09-20 15:34:51.313,568,3,21.075,female,349909,dc455b086d203605705820911c0aaa98467bcd41,29.0,,titanic-passengers
4.0,0,"Davidson, Mr. Thornton",False,S,1,2016-09-20 15:34:51.313,672,1,52.0,male,F.C. 12750,5aa00b39a93376656528f1c7d929a297e31e1a20,31.0,B71,titanic-passengers
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...


<object>  Name: titanic, Number of rows: 891, Number of columns: 15

Our JSON file was nested, which led to column names containing a dot ('.') separator. You can use the parameters 'usecols' and 'new_name' to select only the needed columns and rename them before ingestion.
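When many nested columns are needed, the renaming dictionary can be built programmatically rather than typed by hand. The sketch below strips a common prefix from each selected column name. The helper `flatten_names` is hypothetical, and the 'fields.' prefix is taken from the column names shown in the output above.

```python
def flatten_names(columns, prefix="fields."):
    """Hypothetical helper: map each prefixed column name to its bare name."""
    return {
        col: col[len(prefix):]
        for col in columns
        if col.startswith(prefix)
    }

usecols = ["fields.survived", "fields.pclass", "fields.fare"]
new_name = flatten_names(usecols)
print(new_name)
```

The resulting `usecols` list and `new_name` dictionary have the shapes expected by the 'usecols' and 'new_name' parameters documented in the help text.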

In [31]:
read_json("titanic.json",
          schema = "public",
          table_name = "titanic",
          usecols = ["fields.survived",
                     "fields.pclass",
                     "fields.fare"],
          new_name = {"fields.survived": "survived",
                      "fields.pclass": "pclass",
                      "fields.fare": "fare"})

The table "public"."titanic" has been successfully created.


0,1,2,3
,fare,pclass,survived
0.0,13.0,2,False
1.0,7.65,3,False
2.0,7.8542,3,False
3.0,21.075,3,False
4.0,52.0,1,False
,...,...,...


<object>  Name: titanic, Number of rows: 891, Number of columns: 3

You are now ready to explore the Vertica ML Python Data Exploration functionalities.