# Feature Engineering

Throughout the data preparation cycle, we need to think about which features are best suited to solving the business problem. Many techniques are available; this lesson covers the most popular ones.

# Customized Feature Engineering

To build a customized feature, you can use the 'eval' method of the vDataFrame. Let's look at an example using the well-known Titanic dataset.

In [8]:
from vertica_ml_python import *
vdf = vDataFrame("titanic")
print(vdf)

,fare,sex,body,pclass,age,name,cabin,parch,survived,boat,ticket,embarked,home.dest,sibsp
0.0,151.55,female,,1,2.0,"Allison, Miss. Helen Loraine",C22 C26,2,0,,113781,S,"Montreal, PQ / Chesterville, ON",1
1.0,151.55,male,135,1,30.0,"Allison, Mr. Hudson Joshua Creighton",C22 C26,2,0,,113781,S,"Montreal, PQ / Chesterville, ON",1
2.0,151.55,female,,1,25.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",C22 C26,2,0,,113781,S,"Montreal, PQ / Chesterville, ON",1
3.0,0.0,male,,1,39.0,"Andrews, Mr. Thomas Jr",A36,0,0,,112050,S,"Belfast, NI",0
4.0,49.5042,male,22,1,71.0,"Artagaveytia, Mr. Ramon",,0,0,,PC 17609,C,"Montevideo, Uruguay",0
,...,...,...,...,...,...,...,...,...,...,...,...,...,...


<object>  Name: titanic, Number of rows: 1234, Number of columns: 14


The feature 'parch' corresponds to the number of parents and children on board. The feature 'sibsp' corresponds to the number of siblings and spouses on board. We can create the feature 'family_size', which is equal to parch + sibsp + 1 (the extra 1 counts the passenger).

In [10]:
vdf.eval(name = "family_size",
         expr = "parch + sibsp + 1")
vdf.select(["parch", "sibsp", "family_size"])

,parch,sibsp,family_size
0.0,2,1,4
1.0,2,1,4
2.0,2,1,4
3.0,0,0,1
4.0,0,0,1
,...,...,...


<object>  Name: titanic, Number of rows: 1234, Number of columns: 3

When using the 'eval' method, you can enter any SQL expression and Vertica ML Python will evaluate it!
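To make the idea of a SQL expression evaluated row by row concrete, here is a minimal sketch that runs the same expression against an in-memory SQLite table. SQLite only stands in for Vertica here, the table name `titanic_sample` is hypothetical, and this is not how Vertica ML Python is implemented internally. The five rows are the ones displayed above.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Vertica (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE titanic_sample (parch INTEGER, sibsp INTEGER)")

# (parch, sibsp) for the five passengers shown in the output above.
conn.executemany(
    "INSERT INTO titanic_sample VALUES (?, ?)",
    [(2, 1), (2, 1), (2, 1), (0, 0), (0, 0)],
)

# The same expression that was passed to 'eval'.
expr = "parch + sibsp + 1"
query = (
    f"SELECT parch, sibsp, {expr} AS family_size "
    "FROM titanic_sample ORDER BY rowid"
)
rows = conn.execute(query).fetchall()
conn.close()

family_sizes = [row[2] for row in rows]
for parch, sibsp, family_size in rows:
    print(parch, sibsp, family_size)
```

The new column is simply the expression computed for each row, which is why any valid SQL expression over existing columns can define a feature.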

# Regular Expressions

To compute features using regular expressions, we will use the 'regexp' method.

In [31]:
help(vDataFrame.regexp)

Help on function regexp in module vertica_ml_python.vdataframe:

regexp(self, column:str, pattern:str, method:str='substr', position:int=1, occurrence:int=1, replacement:str='', return_position:int=0, name:str='')
    ---------------------------------------------------------------------------
    Computes a new vcolumn based on regular expressions. 
    
    Parameters
    ----------
    column: str
            Input vcolumn used to compute the regular expression.
    pattern: str
            The regular expression.
    method: str, optional
            Method used to compute the regular expressions.
                    count     : Returns the number times a regular expression matches 
                            each element of the input vcolumn. 
                    ilike     : Returns True if the vcolumn element contains a match 
                            for the regular expression.
                    instr     : Returns the starting or ending position in a vcolumn 
             

Let's consider the following example. Notice that each passenger's title is included in their name.

In [12]:
vdf["name"]

,name
0.0,"Allison, Miss. Helen Loraine"
1.0,"Allison, Mr. Hudson Joshua Creighton"
2.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)"
3.0,"Andrews, Mr. Thomas Jr"
4.0,"Artagaveytia, Mr. Ramon"
,...


<object>  Name: name, Number of rows: 1234, dtype: varchar(164)

Let's extract it using regular expressions.

In [15]:
vdf.regexp(column = "name",
           name = "title",
           pattern = r" ([A-Za-z])+\.",
           method = "substr")
vdf.select(["name", "title"])

,name,title
0.0,"Allison, Miss. Helen Loraine",Miss.
1.0,"Allison, Mr. Hudson Joshua Creighton",Mr.
2.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",Mrs.
3.0,"Andrews, Mr. Thomas Jr",Mr.
4.0,"Artagaveytia, Mr. Ramon",Mr.
,...,...


<object>  Name: titanic, Number of rows: 1234, Number of columns: 2
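The same extraction can be tried locally with Python's standard `re` module. This is only a sketch for checking the pattern on the five names shown above: Python's regex engine is not the one Vertica uses, and the helper `extract_title` is a hypothetical name introduced for illustration.

```python
import re

names = [
    "Allison, Miss. Helen Loraine",
    "Allison, Mr. Hudson Joshua Creighton",
    "Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",
    "Andrews, Mr. Thomas Jr",
    "Artagaveytia, Mr. Ramon",
]

# Same pattern as in the 'regexp' call above: a space, one or more
# letters, then a literal dot.
pattern = re.compile(r" ([A-Za-z])+\.")


def extract_title(name):
    """Return the first substring matching the pattern, or None."""
    match = pattern.search(name)
    # The match starts with the space required by the pattern; strip it.
    return match.group(0).strip() if match else None


titles = [extract_title(name) for name in names]
for name, title in zip(names, titles):
    print(f"{name!r} -> {title!r}")
```

Because the pattern takes the first run of letters followed by a dot, names containing an earlier abbreviation would return that abbreviation instead of the title.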

# Advanced Analytical Functions

Vertica ML Python advanced analytical functions are managed by the 'analytic' method. 

In [17]:
help(vDataFrame.analytic)

Help on function analytic in module vertica_ml_python.vdataframe:

analytic(self, func:str, column:str='', by:list=[], order_by=[], column2:str='', name:str='', offset:int=1, x_smoothing:float=0.5, add_count:bool=True)
    ---------------------------------------------------------------------------
    Adds a new vcolumn to the vDataFrame by using an advanced analytical function 
    on one or two specific vcolumns.
    
    Parameters
    ----------
    func: str
            Function to apply.
                    beta         : Beta Coefficient between 2 vcolumns
                    count        : number of non-missing elements
                    corr         : Pearson correlation between 2 vcolumns
                    cov          : covariance between 2 vcolumns
                    dense_rank   : dense rank
                    ema          : exponential moving average
                    first_value  : first non null lead
                    iqr          : interquartile range
       

Many different techniques are available. Let's use the 'USA 2015 Flights' dataset to do some computations.

In [25]:
from vertica_ml_python import *
vdf = vDataFrame("flights")
print(vdf)

,departure_delay,origin_airport,scheduled_departure,airline,destination_airport,arrival_delay
0.0,6,MDW,2015-05-06 21:30:00,WN,CMH,-8
1.0,0,ORD,2015-05-06 21:40:00,MQ,CMH,-13
2.0,23,BWI,2015-05-06 22:15:00,WN,CMH,28
3.0,-1,ATL,2015-05-06 22:20:00,WN,CMH,0
4.0,-7,ATL,2015-05-06 22:22:00,DL,CMH,-16
,...,...,...,...,...,...


<object>  Name: flights, Number of rows: 4068736, Number of columns: 6


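For each flight, let's compute the departure delay of the previous flight operated by the same airline on the same route (same origin and destination airports), ordered by scheduled departure.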

In [26]:
vdf.analytic(name = "previous_departure_delay",
             func = "lag",
             column = "departure_delay",
             by = ["airline", "destination_airport", "origin_airport"],
             order_by = {"scheduled_departure": "asc"})

,departure_delay,origin_airport,scheduled_departure,airline,destination_airport,arrival_delay,previous_departure_delay
0.0,-3,10397,2015-10-01 10:27:00,EV,10135,-14,
1.0,0,10397,2015-10-01 14:44:00,EV,10135,-1,-3
2.0,12,10397,2015-10-02 10:27:00,EV,10135,4,0
3.0,-4,10397,2015-10-02 14:44:00,EV,10135,-6,12
4.0,-4,10397,2015-10-03 10:27:00,EV,10135,-2,-4
,...,...,...,...,...,...,...


<object>  Name: flights, Number of rows: 4068736, Number of columns: 7
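The 'lag' computation above can be sketched in plain Python: split the rows into partitions (the `by` columns), sort each partition (the `order_by` column), and take the value from the preceding row. The function `lag` below is a hypothetical helper, not part of Vertica ML Python. The first five rows come from the output above; the last row is invented only to show that a new partition starts with a missing value.

```python
from collections import defaultdict


def lag(rows, column, by, order_col, offset=1):
    """Return, for each row, the value of `column` found `offset` rows
    earlier in the same partition, or None when no such row exists."""
    partitions = defaultdict(list)
    for i, row in enumerate(rows):
        partitions[tuple(row[key] for key in by)].append(i)

    result = [None] * len(rows)
    for indices in partitions.values():
        indices.sort(key=lambda i: rows[i][order_col])
        for pos, i in enumerate(indices):
            if pos >= offset:
                result[i] = rows[indices[pos - offset]][column]
    return result


def flight(airline, origin, dest, departure, delay):
    return {
        "airline": airline,
        "origin_airport": origin,
        "destination_airport": dest,
        "scheduled_departure": departure,
        "departure_delay": delay,
    }


flights = [
    flight("EV", "10397", "10135", "2015-10-01 10:27:00", -3),
    flight("EV", "10397", "10135", "2015-10-01 14:44:00", 0),
    flight("EV", "10397", "10135", "2015-10-02 10:27:00", 12),
    flight("EV", "10397", "10135", "2015-10-02 14:44:00", -4),
    flight("EV", "10397", "10135", "2015-10-03 10:27:00", -4),
    # Hypothetical row from another airline (not in the source data).
    flight("AA", "10397", "10135", "2015-10-01 11:00:00", 5),
]

previous = lag(
    flights,
    column="departure_delay",
    by=["airline", "destination_airport", "origin_airport"],
    order_col="scheduled_departure",
)
print(previous)
```

The first flight of every partition has no predecessor, which explains the empty value in the first row of the output above.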

# Moving Windows

Moving windows are a powerful source of features: they aggregate values over the rows surrounding each record, which brings in information about its context. In Vertica ML Python, moving windows are managed by the 'rolling' method.

In [27]:
help(vDataFrame.rolling)

Help on function rolling in module vertica_ml_python.vdataframe:

rolling(self, func:str, column:str, preceding, following, column2:str='', name:str='', by:list=[], order_by=[], method:str='rows', rule:str='auto')
    ---------------------------------------------------------------------------
    Adds a new vcolumn to the vDataFrame by using an advanced analytical window 
    function on one or two specific vcolumns.
    
    Parameters
    ----------
    func: str
            Function to use.
                    beta        : Beta Coefficient between 2 vcolumns
                    count       : number of non-missing elements
                    corr        : Pearson correlation between 2 vcolumns
                    cov         : covariance between 2 vcolumns
                    kurtosis    : kurtosis
                    jb          : Jarque Bera index 
                    mae         : mean absolute error (deviation)
                    max         : maximum
                    mean 

For example, let's compute the number of flights that the same airline has to manage at the same origin airport, from two hours before each flight's scheduled departure to one hour after it.

In [30]:
vdf.rolling(name = "number_flights_to_manage_by_airline_2hp_1hf",
            func = "count",
            column = "airline",
            by = ["airline", "origin_airport"],
            order_by = {"scheduled_departure": "asc"},
            preceding = "2 hours",
            following = "1 hour",
            method = "range")

,departure_delay,origin_airport,scheduled_departure,airline,destination_airport,arrival_delay,previous_departure_delay,number_flights_to_manage_by_airline_2hp_1hf
0.0,-11,10140,2015-10-01 10:55:00,AA,11298,-20,,1
1.0,-10,10140,2015-10-01 12:09:00,AA,11298,-18,-11,2
2.0,-2,10140,2015-10-01 14:20:00,AA,11298,-11,-10,1
3.0,-6,10140,2015-10-01 16:19:00,AA,11298,5,-2,2
4.0,-9,10140,2015-10-02 10:55:00,AA,11298,-28,-6,1
,...,...,...,...,...,...,...,...


<object>  Name: flights, Number of rows: 4068736, Number of columns: 8
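The range-based window above can be sketched in plain Python. For each flight, the sketch counts the flights in the same partition whose scheduled departure falls between two hours before and one hour after it, bounds included. The function `rolling_count` is a hypothetical helper written for clarity rather than speed. The sample contains only the five rows displayed above, so it assumes no other flight of that partition falls inside their windows.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def rolling_count(rows, by, order_col, preceding, following):
    """Count, for each row, the rows of the same partition whose
    `order_col` lies in [value - preceding, value + following]."""
    partitions = defaultdict(list)
    for i, row in enumerate(rows):
        partitions[tuple(row[key] for key in by)].append(i)

    counts = [0] * len(rows)
    for indices in partitions.values():
        for i in indices:
            low = rows[i][order_col] - preceding
            high = rows[i][order_col] + following
            counts[i] = sum(
                1 for j in indices if low <= rows[j][order_col] <= high
            )
    return counts


# Scheduled departures of the five AA flights shown in the output above.
departures = [
    datetime(2015, 10, 1, 10, 55),
    datetime(2015, 10, 1, 12, 9),
    datetime(2015, 10, 1, 14, 20),
    datetime(2015, 10, 1, 16, 19),
    datetime(2015, 10, 2, 10, 55),
]
flights = [
    {"airline": "AA", "origin_airport": "10140", "scheduled_departure": d}
    for d in departures
]

counts = rolling_count(
    flights,
    by=["airline", "origin_airport"],
    order_col="scheduled_departure",
    preceding=timedelta(hours=2),
    following=timedelta(hours=1),
)
print(counts)
```

Each flight counts itself, so the minimum value is 1. A rows-based window would instead count a fixed number of neighboring rows regardless of how far apart they are in time.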

Moving windows open up a very wide range of possibilities for creating new features. Once data preparation is finished, it is time to create a machine learning model. Our next lesson will introduce the different types of ML algorithms!