# **Ctrl4AI**

A helper package for Machine Learning and Deep Learning solutions

**Developers:** Shaji, Charu, Selva

![AutoML](https://raw.githubusercontent.com/vkreat-tech/ctrl4ai/master/design/AutoML_Preprocess.png)

**Highlights**

- Open-source package that, so far, focuses on data preprocessing.
- Self-configuring methods that can be used at either level: fully abstracted (automatic) or customized.
- The auto-preprocessing flow adapts to the learning type (supervised or unsupervised).
- Parameter tuning lets users transform the data precisely to their specifications.
- Built-in computations inspect the data in the background to discover its type, distribution, correlation, etc.









# **Install & Import**

In [None]:
pip install ctrl4ai --upgrade

Requirement already up-to-date: ctrl4ai in /usr/local/lib/python3.6/dist-packages (1.0.0)


In [None]:
from ctrl4ai import preprocessing
from ctrl4ai import automl

# **Usage**

For documentation, please read [HELP.md](https://github.com/vkreat-tech/ctrl4ai/blob/master/HELP.md)

In [None]:
help(automl.preprocess)

Help on function preprocess in module ctrl4ai.automl:

preprocess(dataset, learning_type, target_variable=None, target_type=None, impute_null_method='central_tendency', tranform_categorical='label_encoding', categorical_threshold=0.3, remove_outliers=False, log_transform=None, drop_null_dominated=True, dropna_threshold=0.7, derive_from_datetime=True, ohe_ignore_cols=[], feature_selection=True, define_continuous_cols=[], define_categorical_cols=[])
    dataset=pandas DataFrame (required)
    learning_type='supervised'/'unsupervised' (required)
    target_variable=Target/Dependent variable (required for supervised learning type)
    target_type='continuous'/'categorical' (required for supervised learning type)
    impute_null_method='central_tendency' (optional) [Choose between 'central_tendency' and 'KNN']
    tranform_categorical='label_encoding' (optional) [Choose between 'label_encoding' and 'one_hot_encoding']
    categorical_threshold=0.3 (optional) [Threshold for determining categ

# **Inbuilt datasets**



In [None]:
from ctrl4ai import datasets

In [None]:
dataset1=datasets.trip_fare()
dataset1.head()

Unnamed: 0,TID,vendor_id,new_user,tolls_amount,tip_amount,mta_tax,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,rate_code,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,surcharge,fare_amount
0,AIX000345001,DST000401,NO,,1.4,,2015-04-20 04:18:00,2015-04-20 04:24:00,1.0,-74.003939,40.742894,1.0,N,-73.993369,40.734247,CRD,0.5,8.4
1,AIX000345002,DST000401,NO,,1.0,,2015-04-19 18:16:00,2015-04-19 18:22:00,3.0,-73.973864,40.752194,1.0,N,-73.958701,40.772533,CRD,0.0,8.5
2,AIX000345003,DST000401,NO,,0.0,,2015-04-06 08:04:00,2015-04-06 08:10:00,2.0,-73.954406,40.76442,1.0,N,-73.97078,40.75835,CSH,0.0,7.0
3,AIX000345004,DST000532,NO,,1.8,,2015-04-10 09:48:00,2015-04-10 10:00:00,2.0,-73.962345,40.767215,1.0,,-73.975512,40.756867,CRD,0.0,11.3
4,AIX000345005,DST000401,NO,,0.0,,2015-04-15 13:12:00,2015-04-15 13:26:00,1.0,-74.004657,40.707434,1.0,N,-73.999369,40.721517,CSH,0.0,10.0


In [None]:
dataset2=datasets.titanic()
dataset2.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


# **AutoML**

## **Preprocessing**

In [None]:
dataset1_labels,dataset1_processed=automl.preprocess(dataset1,'supervised',target_variable='fare_amount',target_type='continuous')
dataset1_processed.head()

Dropping single valued column(s) new_user,year_of_pickup_datetime,year_of_dropoff_datetime
Columns identified as continuous are pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude
Columns identified as categorical are vendor_id,tolls_amount,tip_amount,mta_tax,passenger_count,rate_code,store_and_fwd_flag,payment_type,surcharge,hour_of_pickup_datetime,weekday_of_pickup_datetime,month_of_pickup_datetime,hour_of_dropoff_datetime,weekday_of_dropoff_datetime,month_of_dropoff_datetime
Replaced nulls in tolls_amount with mode
Replaced nulls in tip_amount with mean
Replaced nulls in mta_tax with mode
Replaced nulls in passenger_count with mode
Replaced nulls in rate_code with mode
Replaced nulls in store_and_fwd_flag with mode
Replaced nulls in payment_type with mode
Replaced nulls in surcharge with mode
Labels for vendor_id: {'DST000401': 0, 'DST000532': 1}
Labels for store_and_fwd_flag: {'N': 0, 'Y': 1}
Labels for payment_type: {'CRD': 0, 'CSH': 1, 'DIS': 2, 'NOC': 3, 'UNK': 4

Unnamed: 0,pickup_longitude,dropoff_longitude,vendor_id,tolls_amount,tip_amount,passenger_count,rate_code,store_and_fwd_flag,surcharge,hour_of_pickup_datetime,weekday_of_pickup_datetime,month_of_pickup_datetime,hour_of_dropoff_datetime,weekday_of_dropoff_datetime,month_of_dropoff_datetime,fare_amount
0,-74.003939,-73.993369,0,0.0,1.4,1.0,1.0,0,0.5,4,0,4,4,0,4,8.4
1,-73.973864,-73.958701,0,0.0,1.0,3.0,1.0,0,0.0,18,6,4,18,6,4,8.5
2,-73.954406,-73.97078,0,0.0,0.0,2.0,1.0,0,0.0,8,0,4,8,0,4,7.0
3,-73.962345,-73.975512,1,0.0,1.8,2.0,1.0,0,0.0,9,4,4,10,4,4,11.3
4,-74.004657,-73.999369,0,0.0,0.0,1.0,1.0,0,0.0,13,2,4,13,2,4,10.0


In [None]:
dataset2_labels,dataset2_processed=automl.preprocess(dataset2,'supervised',target_variable='Survived',target_type='categorical',impute_null_method='KNN',tranform_categorical='one_hot_encoding',define_continuous_cols=['Fare'])
dataset2_processed.head()

Dropping null dominated column(s) Cabin
Columns identified as continuous are PassengerId,Age,Fare
Columns identified as categorical are Pclass,Sex,SibSp,Parch,Embarked
Replaced nulls in Embarked with mode
One hot encoding Pclass
One hot encoding Sex
One hot encoding SibSp
One hot encoding Parch
One hot encoding Embarked


Unnamed: 0,Fare,Pclass_1,Pclass_2,Pclass_3,Sex_female,Sex_male,SibSp_0,SibSp_1,Parch_0,Parch_1,Embarked_C,Embarked_S,Survived
0,7.25,0,0,1,0,1,0,1,1,0,0,1,0
1,71.2833,1,0,0,1,0,0,1,1,0,1,0,1
2,7.925,0,0,1,1,0,1,0,1,0,0,1,1
3,53.1,1,0,0,1,0,0,1,1,0,0,1,1
4,8.05,0,0,1,0,1,1,0,1,0,0,1,0


## **Collinearity Check**

- Calculates the association between variables in a dataset.
- Auto-detects the type of each variable and computes the Pearson correlation between two continuous variables, Cramér's V between two categorical variables, and Kendall's tau between a categorical and a continuous variable.
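
To make the measures named above concrete, here is a minimal plain-Python sketch of two of them, Pearson correlation and Cramér's V. It is an illustration only, not the code behind `automl.master_correlation`; the function names `pearson` and `cramers_v` are hypothetical, the Cramér's V shown has no bias correction, and Kendall's tau is left out for brevity.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    # A constant column has zero variance; report 0.0 instead of dividing by zero
    return cov / denom if denom else 0.0


def cramers_v(x, y):
    """Cramér's V between two equal-length categorical sequences."""
    n = len(x)
    observed = Counter(zip(x, y))            # joint cell counts
    row_totals, col_totals = Counter(x), Counter(y)
    chi2 = 0.0
    for row, row_count in row_totals.items():
        for col, col_count in col_totals.items():
            expected = row_count * col_count / n
            chi2 += (observed.get((row, col), 0) - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    # With a single category on either side the statistic is undefined; use 0.0
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0


linear = pearson([1, 2, 3], [2, 4, 6])
matched = cramers_v(list("aabb"), list("uuvv"))
unrelated = cramers_v(list("aabb"), list("uvuv"))
```

Here `linear` and `matched` come out at 1 (perfect association) and `unrelated` at 0, which is the range a correlation matrix built from these measures would be read against.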

In [None]:
automl.master_correlation(dataset1_processed)

Unnamed: 0,vendor_id,tolls_amount,tip_amount,passenger_count,rate_code,store_and_fwd_flag,surcharge,hour_of_pickup_datetime,weekday_of_pickup_datetime,month_of_pickup_datetime,hour_of_dropoff_datetime,weekday_of_dropoff_datetime,month_of_dropoff_datetime,pickup_longitude,dropoff_longitude,fare_amount
vendor_id,1.0,0.0306154,0.172872,0.313218,0.0139889,0.111517,0.00976769,0.0234221,0.0453754,0.340028,0.0238844,0.0451747,0.340069,0.00116159,-0.000559229,0.0126753
tolls_amount,0.0306154,1.0,0.381564,0.0794493,0.363726,0.0177564,0.0294706,0.0182884,0.0138935,0.00612421,0.0163107,0.0136376,0.00408199,0.12991,0.101988,0.294791
tip_amount,0.172872,0.381564,1.0,0.190126,0.334327,0.0253169,0.0229813,0.0299072,0.023174,0.0282974,0.0300026,0.0228443,0.0249727,-0.0129766,0.00177814,0.387572
passenger_count,0.313218,0.0794493,0.190126,1.0,0.0150161,0.0348196,0.0147011,0.0263799,0.0245439,0.0375292,0.026624,0.0247046,0.0375335,-0.00867234,-0.00682593,0.0159067
rate_code,0.0139889,0.363726,0.334327,0.0150161,1.0,0.00915828,0.0530044,0.0332395,0.00750144,0.0057245,0.0292611,0.00733837,0.00576959,0.0908719,0.0480962,0.209065
store_and_fwd_flag,0.111517,0.0177564,0.0253169,0.0348196,0.00915828,1.0,0.0,0.00848857,0.00818684,0.0364353,0.00921027,0.00810627,0.0363995,0.00265454,0.00859467,0.00590623
surcharge,0.00976769,0.0294706,0.0229813,0.0147011,0.0530044,0.0,1.0,0.417918,0.104652,0.0129465,0.407737,0.104988,0.0130368,-0.0436306,0.00841056,0.0357015
hour_of_pickup_datetime,0.0234221,0.0182884,0.0299072,0.0263799,0.0332395,0.00848857,0.417918,1.0,0.0980588,0.0121762,0.805584,0.0971608,0.012558,-0.00536169,-0.0236183,0.0177913
weekday_of_pickup_datetime,0.0453754,0.0138935,0.023174,0.0245439,0.00750144,0.00818684,0.104652,0.0980588,1.0,0.0611606,0.0963378,0.988596,0.0610759,-0.01218,-0.013825,0.0043371
month_of_pickup_datetime,0.340028,0.00612421,0.0282974,0.0375292,0.0057245,0.0364353,0.0129465,0.0121762,0.0611606,1.0,0.012122,0.0608922,0.999515,0.000928034,0.0011195,0.0264664


# **Preprocessing - Custom Methods**

## **Derived Features**

Raw timestamp fields and geographical coordinates are of little direct use to classification or regression algorithms, so the goal is to derive as much information from them as possible.
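
As a rough illustration of what such derived features can look like, the sketch below computes a trip duration in seconds and a great-circle distance in kilometres for the first row of the `trip_fare` sample. The helper names, the haversine formula, and the Earth-radius constant are assumptions made for this example; the source does not say how `preprocessing.get_timediff` or `preprocessing.get_distance` are implemented.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def seconds_between(start, end, fmt="%Y-%m-%d %H:%M:%S"):
    """Elapsed seconds between two timestamp strings."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Values taken from the first row of the trip_fare sample
trip_seconds = seconds_between("2015-04-20 04:18:00", "2015-04-20 04:24:00")
trip_km = haversine_km(40.742894, -74.003939, 40.734247, -73.993369)
```

`trip_seconds` is 360.0 for this six-minute ride, and `trip_km` comes out at roughly 1.3 km; the exact figure depends on the radius and distance formula chosen.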

In [None]:
dataset1= preprocessing.get_timediff(dataset1,'pickup_datetime','dropoff_datetime')
dataset1.head()

Unnamed: 0,TID,vendor_id,new_user,tolls_amount,tip_amount,mta_tax,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,rate_code,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,surcharge,fare_amount,secs_diff_pickup_datetime_dropoff_datetime
0,AIX000345001,DST000401,NO,,1.4,,2015-04-20 04:18:00,2015-04-20 04:24:00,1.0,-74.003939,40.742894,1.0,N,-73.993369,40.734247,CRD,0.5,8.4,360.0
1,AIX000345002,DST000401,NO,,1.0,,2015-04-19 18:16:00,2015-04-19 18:22:00,3.0,-73.973864,40.752194,1.0,N,-73.958701,40.772533,CRD,0.0,8.5,360.0
2,AIX000345003,DST000401,NO,,0.0,,2015-04-06 08:04:00,2015-04-06 08:10:00,2.0,-73.954406,40.76442,1.0,N,-73.97078,40.75835,CSH,0.0,7.0,360.0
3,AIX000345004,DST000532,NO,,1.8,,2015-04-10 09:48:00,2015-04-10 10:00:00,2.0,-73.962345,40.767215,1.0,,-73.975512,40.756867,CRD,0.0,11.3,720.0
4,AIX000345005,DST000401,NO,,0.0,,2015-04-15 13:12:00,2015-04-15 13:26:00,1.0,-74.004657,40.707434,1.0,N,-73.999369,40.721517,CSH,0.0,10.0,840.0


In [None]:
dataset1=preprocessing.get_distance(dataset1,'pickup_latitude','pickup_longitude','dropoff_latitude','dropoff_longitude')
dataset1.head()

Unnamed: 0,TID,vendor_id,new_user,tolls_amount,tip_amount,mta_tax,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,rate_code,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,surcharge,fare_amount,secs_diff_pickup_datetime_dropoff_datetime,kms_pickup_latitude_dropoff_latitude
0,AIX000345001,DST000401,NO,,1.4,,2015-04-20 04:18:00,2015-04-20 04:24:00,1.0,-74.003939,40.742894,1.0,N,-73.993369,40.734247,CRD,0.5,8.4,360.0,1.311173
1,AIX000345002,DST000401,NO,,1.0,,2015-04-19 18:16:00,2015-04-19 18:22:00,3.0,-73.973864,40.752194,1.0,N,-73.958701,40.772533,CRD,0.0,8.5,360.0,2.59627
2,AIX000345003,DST000401,NO,,0.0,,2015-04-06 08:04:00,2015-04-06 08:10:00,2.0,-73.954406,40.76442,1.0,N,-73.97078,40.75835,CSH,0.0,7.0,360.0,1.538152
3,AIX000345004,DST000532,NO,,1.8,,2015-04-10 09:48:00,2015-04-10 10:00:00,2.0,-73.962345,40.767215,1.0,,-73.975512,40.756867,CRD,0.0,11.3,720.0,1.598931
4,AIX000345005,DST000401,NO,,0.0,,2015-04-15 13:12:00,2015-04-15 13:26:00,1.0,-74.004657,40.707434,1.0,N,-73.999369,40.721517,CSH,0.0,10.0,840.0,1.626473


In [None]:
dataset1=preprocessing.derive_from_datetime(dataset1)
dataset1.head()

Unnamed: 0,TID,vendor_id,new_user,tolls_amount,tip_amount,mta_tax,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,rate_code,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,surcharge,fare_amount,secs_diff_pickup_datetime_dropoff_datetime,kms_pickup_latitude_dropoff_latitude,hour_of_pickup_datetime,weekday_of_pickup_datetime,year_of_pickup_datetime,month_of_pickup_datetime,hour_of_dropoff_datetime,weekday_of_dropoff_datetime,year_of_dropoff_datetime,month_of_dropoff_datetime
0,AIX000345001,DST000401,NO,,1.4,,2015-04-20 04:18:00,2015-04-20 04:24:00,1.0,-74.003939,40.742894,1.0,N,-73.993369,40.734247,CRD,0.5,8.4,360.0,1.311173,4,0,2015,4,4,0,2015,4
1,AIX000345002,DST000401,NO,,1.0,,2015-04-19 18:16:00,2015-04-19 18:22:00,3.0,-73.973864,40.752194,1.0,N,-73.958701,40.772533,CRD,0.0,8.5,360.0,2.59627,18,6,2015,4,18,6,2015,4
2,AIX000345003,DST000401,NO,,0.0,,2015-04-06 08:04:00,2015-04-06 08:10:00,2.0,-73.954406,40.76442,1.0,N,-73.97078,40.75835,CSH,0.0,7.0,360.0,1.538152,8,0,2015,4,8,0,2015,4
3,AIX000345004,DST000532,NO,,1.8,,2015-04-10 09:48:00,2015-04-10 10:00:00,2.0,-73.962345,40.767215,1.0,,-73.975512,40.756867,CRD,0.0,11.3,720.0,1.598931,9,4,2015,4,10,4,2015,4
4,AIX000345005,DST000401,NO,,0.0,,2015-04-15 13:12:00,2015-04-15 13:26:00,1.0,-74.004657,40.707434,1.0,N,-73.999369,40.721517,CSH,0.0,10.0,840.0,1.626473,13,2,2015,4,13,2,2015,4
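
The column names in the output above suggest that each datetime field is split into hour, weekday, year, and month. A minimal standard-library sketch of that split follows; `datetime_parts` is a hypothetical helper, not the library's function, and it assumes the Monday = 0 weekday convention of Python's `datetime`.

```python
from datetime import datetime


def datetime_parts(timestamp, prefix, fmt="%Y-%m-%d %H:%M:%S"):
    """Split one timestamp string into hour, weekday (Monday=0), year and month."""
    dt = datetime.strptime(timestamp, fmt)
    return {
        f"hour_of_{prefix}": dt.hour,
        f"weekday_of_{prefix}": dt.weekday(),
        f"year_of_{prefix}": dt.year,
        f"month_of_{prefix}": dt.month,
    }


# Pickup time of the second row in the trip_fare sample (a Sunday)
parts = datetime_parts("2015-04-19 18:16:00", "pickup_datetime")
```

For this timestamp the result is hour 18, weekday 6, year 2015, month 4, matching the corresponding row in the table.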


## **Feature Elimination**

In [None]:
dataset1=preprocessing.drop_null_fields(dataset1,dropna_threshold=0.5)

Dropping null dominated column(s) store_and_fwd_flag


In [None]:
dataset1=preprocessing.drop_single_valued_cols(dataset1)

Dropping single valued column(s) new_user,year_of_pickup_datetime,year_of_dropoff_datetime
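
The two elimination steps above can be pictured with a small dependency-free sketch. It assumes that `dropna_threshold` is the maximum tolerated share of nulls per column, which the source does not state explicitly; the helper names and the toy table are hypothetical.

```python
def null_fraction(values):
    """Share of entries that are None."""
    return sum(v is None for v in values) / len(values)


def drop_null_dominated(columns, threshold=0.5):
    """Keep columns whose null share does not exceed `threshold` (assumed rule)."""
    return {
        name: values
        for name, values in columns.items()
        if null_fraction(values) <= threshold
    }


def drop_single_valued(columns):
    """Keep columns that hold more than one distinct non-null value."""
    return {
        name: values
        for name, values in columns.items()
        if len({v for v in values if v is not None}) > 1
    }


# Toy table stored as column name -> list of values (hypothetical data)
table = {
    "new_user": ["NO", "NO", "NO", "NO"],
    "mostly_missing": [None, None, None, "N"],
    "fare_amount": [8.4, 8.5, 7.0, 11.3],
}
cleaned = drop_single_valued(drop_null_dominated(table, threshold=0.5))
```

Only `fare_amount` survives: `mostly_missing` is 75% null and `new_user` never varies, so neither carries usable signal for a model.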


## **Dealing with Categorical Data**

In [None]:
dataset2=preprocessing.get_ohe_df(dataset2,ignore_cols=['Age'])
dataset2.head()

One hot encoding Survived
One hot encoding Pclass
One hot encoding Sex
One hot encoding SibSp
One hot encoding Parch
One hot encoding Cabin
One hot encoding Embarked


Unnamed: 0,PassengerId,Name,Age,Ticket,Fare,Survived_0,Survived_1,Pclass_1,Pclass_2,Pclass_3,Sex_female,Sex_male,SibSp_0,SibSp_1,SibSp_2,SibSp_3,SibSp_4,SibSp_5,SibSp_8,Parch_0,Parch_1,Parch_2,Parch_3,Parch_4,Parch_5,Parch_6,Cabin_A10,Cabin_A14,Cabin_A16,Cabin_A19,Cabin_A20,Cabin_A23,Cabin_A24,Cabin_A26,Cabin_A31,Cabin_A32,Cabin_A34,Cabin_A36,Cabin_A5,Cabin_A6,...,Cabin_D50,Cabin_D56,Cabin_D6,Cabin_D7,Cabin_D9,Cabin_E10,Cabin_E101,Cabin_E12,Cabin_E121,Cabin_E17,Cabin_E24,Cabin_E25,Cabin_E31,Cabin_E33,Cabin_E34,Cabin_E36,Cabin_E38,Cabin_E40,Cabin_E44,Cabin_E46,Cabin_E49,Cabin_E50,Cabin_E58,Cabin_E63,Cabin_E67,Cabin_E68,Cabin_E77,Cabin_E8,Cabin_F E69,Cabin_F G63,Cabin_F G73,Cabin_F2,Cabin_F33,Cabin_F38,Cabin_F4,Cabin_G6,Cabin_T,Embarked_C,Embarked_Q,Embarked_S
0,1,"Braund, Mr. Owen Harris",22.0,A/5 21171,7.25,1,0,0,0,1,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,PC 17599,71.2833,0,1,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,3,"Heikkinen, Miss. Laina",26.0,STON/O2. 3101282,7.925,0,1,0,0,1,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,113803,53.1,0,1,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
4,5,"Allen, Mr. William Henry",35.0,373450,8.05,1,0,0,0,1,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1


In [None]:
col_labels,dataset1=preprocessing.get_label_encoded_df(dataset1)
dataset1.head()

Labels for vendor_id: {'DST000401': 0, 'DST000532': 1}
Labels for payment_type: {'CRD': 0, 'CSH': 1, 'DIS': 2, 'NOC': 3, 'UNK': 4, 'nan': 5}


Unnamed: 0,TID,vendor_id,tolls_amount,tip_amount,mta_tax,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,rate_code,dropoff_longitude,dropoff_latitude,payment_type,surcharge,fare_amount,secs_diff_pickup_datetime_dropoff_datetime,kms_pickup_latitude_dropoff_latitude,hour_of_pickup_datetime,weekday_of_pickup_datetime,month_of_pickup_datetime,hour_of_dropoff_datetime,weekday_of_dropoff_datetime,month_of_dropoff_datetime
0,AIX000345001,0,,1.4,,2015-04-20 04:18:00,2015-04-20 04:24:00,1.0,-74.003939,40.742894,1.0,-73.993369,40.734247,0,0.5,8.4,360.0,1.311173,4,0,4,4,0,4
1,AIX000345002,0,,1.0,,2015-04-19 18:16:00,2015-04-19 18:22:00,3.0,-73.973864,40.752194,1.0,-73.958701,40.772533,0,0.0,8.5,360.0,2.59627,18,6,4,18,6,4
2,AIX000345003,0,,0.0,,2015-04-06 08:04:00,2015-04-06 08:10:00,2.0,-73.954406,40.76442,1.0,-73.97078,40.75835,1,0.0,7.0,360.0,1.538152,8,0,4,8,0,4
3,AIX000345004,1,,1.8,,2015-04-10 09:48:00,2015-04-10 10:00:00,2.0,-73.962345,40.767215,1.0,-73.975512,40.756867,0,0.0,11.3,720.0,1.598931,9,4,4,10,4,4
4,AIX000345005,0,,0.0,,2015-04-15 13:12:00,2015-04-15 13:26:00,1.0,-74.004657,40.707434,1.0,-73.999369,40.721517,1,0.0,10.0,840.0,1.626473,13,2,4,13,2,4
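
For readers new to the two encodings used in this section, here is a minimal sketch of each on a single column. The helpers `label_encode` and `one_hot_encode` are hypothetical and written for illustration; assigning codes in sorted order is an assumption that happens to agree with the `vendor_id` labels printed above.

```python
def label_encode(values):
    """Map each distinct value to an integer code, assigned in sorted order."""
    labels = {value: code for code, value in enumerate(sorted(set(values)))}
    return labels, [labels[v] for v in values]


def one_hot_encode(values, prefix):
    """Expand one categorical column into 0/1 indicator columns."""
    return {
        f"{prefix}_{category}": [int(v == category) for v in values]
        for category in sorted(set(values))
    }


vendor_labels, vendor_codes = label_encode(["DST000401", "DST000401", "DST000532"])
sex_columns = one_hot_encode(["male", "female", "female"], "Sex")
```

Label encoding keeps one column but imposes an arbitrary order on the categories, while one-hot encoding avoids that order at the cost of one new column per category, as the `Cabin_*` columns in the Titanic output illustrate.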


## **Data Cleansing**

In [None]:
dataset2=preprocessing.drop_non_numeric(dataset2)

Dropping non categorical/continuous column(s):Name,Ticket


In [None]:
dataset2.isnull().sum()

PassengerId      0
Age            177
Fare             0
Survived_0       0
Survived_1       0
              ... 
Cabin_G6         0
Cabin_T          0
Embarked_C       0
Embarked_Q       0
Embarked_S       0
Length: 174, dtype: int64

In [None]:
dataset2=preprocessing.impute_nulls(dataset2,method='KNN')
dataset2.isnull().sum()

PassengerId    0
Age            0
Fare           0
Survived_0     0
Survived_1     0
              ..
Cabin_G6       0
Cabin_T        0
Embarked_C     0
Embarked_Q     0
Embarked_S     0
Length: 174, dtype: int64

In [None]:
dataset1.isnull().sum()

TID                                                0
vendor_id                                          0
tolls_amount                                     273
tip_amount                                    104794
mta_tax                                          138
pickup_datetime                                    0
dropoff_datetime                                   0
passenger_count                                  188
pickup_longitude                               31400
pickup_latitude                                21118
rate_code                                         52
dropoff_longitude                               3163
dropoff_latitude                                5080
payment_type                                       0
surcharge                                      62873
fare_amount                                        0
secs_diff_pickup_datetime_dropoff_datetime         0
kms_pickup_latitude_dropoff_latitude           59681
hour_of_pickup_datetime                       

In [None]:
dataset1=preprocessing.impute_nulls(dataset1)
dataset1.isnull().sum()

Replaced nulls in tolls_amount with mode
Replaced nulls in tip_amount with mean
Replaced nulls in mta_tax with mode
Replaced nulls in passenger_count with mode
Replaced nulls in pickup_longitude with mean
Replaced nulls in pickup_latitude with mean
Replaced nulls in rate_code with mode
Replaced nulls in dropoff_longitude with mean
Replaced nulls in dropoff_latitude with mean
Replaced nulls in surcharge with mode
Replaced nulls in kms_pickup_latitude_dropoff_latitude with mean


TID                                           0
vendor_id                                     0
tolls_amount                                  0
tip_amount                                    0
mta_tax                                       0
pickup_datetime                               0
dropoff_datetime                              0
passenger_count                               0
pickup_longitude                              0
pickup_latitude                               0
rate_code                                     0
dropoff_longitude                             0
dropoff_latitude                              0
payment_type                                  0
surcharge                                     0
fare_amount                                   0
secs_diff_pickup_datetime_dropoff_datetime    0
kms_pickup_latitude_dropoff_latitude          0
hour_of_pickup_datetime                       0
weekday_of_pickup_datetime                    0
month_of_pickup_datetime                
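
The log messages above show nulls being replaced with the mean in some columns and the mode in others. A minimal sketch of that idea, assuming the mean is used for continuous columns and the mode for categorical ones, is shown below; `impute_central_tendency` and its `is_continuous` flag are hypothetical names, not part of the package.

```python
from statistics import mean, mode


def impute_central_tendency(values, is_continuous):
    """Fill None entries with the mean (continuous) or the mode (categorical)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if is_continuous else mode(observed)
    return [fill if v is None else v for v in values]


# Hypothetical toy columns
tips = impute_central_tendency([1.0, None, 3.0, 2.0], is_continuous=True)
rate_codes = impute_central_tendency([1.0, 1.0, None, 2.0], is_continuous=False)
```

The missing tip becomes 2.0 (the mean of the observed values) and the missing rate code becomes 1.0 (the most frequent value).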

## **Feature Selection**

In [None]:
col_corr,correlated_features=preprocessing.get_correlated_features(dataset1,'fare_amount','continuous')
dataset1=dataset1[correlated_features]
dataset1.head()

Unnamed: 0,tip_amount,pickup_longitude,dropoff_longitude,kms_pickup_latitude_dropoff_latitude,vendor_id,tolls_amount,passenger_count,rate_code,surcharge,secs_diff_pickup_datetime_dropoff_datetime,hour_of_pickup_datetime,weekday_of_pickup_datetime,month_of_pickup_datetime,hour_of_dropoff_datetime,weekday_of_dropoff_datetime,month_of_dropoff_datetime
0,1.4,-74.003939,-73.993369,1.311173,0,0.0,1.0,1.0,0.5,360.0,4,0,4,4,0,4
1,1.0,-73.973864,-73.958701,2.59627,0,0.0,3.0,1.0,0.0,360.0,18,6,4,18,6,4
2,0.0,-73.954406,-73.97078,1.538152,0,0.0,2.0,1.0,0.0,360.0,8,0,4,8,0,4
3,1.8,-73.962345,-73.975512,1.598931,1,0.0,2.0,1.0,0.0,720.0,9,4,4,10,4,4
4,0.0,-74.004657,-73.999369,1.626473,0,0.0,1.0,1.0,0.0,840.0,13,2,4,13,2,4


In [None]:
col_corr

{'dropoff_latitude': -0.017373606423785472,
 'dropoff_longitude': 0.01742209912372559,
 'hour_of_dropoff_datetime': 0.016189237760396855,
 'hour_of_pickup_datetime': 0.01779134715625042,
 'kms_pickup_latitude_dropoff_latitude': 0.025835253751848367,
 'month_of_dropoff_datetime': 0.026537887967730687,
 'month_of_pickup_datetime': 0.02646639410009602,
 'mta_tax': -0.0736925287661054,
 'passenger_count': 0.015906711152525267,
 'payment_type': -0.21648939742422965,
 'pickup_latitude': -0.014342803960358372,
 'pickup_longitude': 0.01484391868557368,
 'rate_code': 0.2090648217377861,
 'secs_diff_pickup_datetime_dropoff_datetime': 0.7831307009611339,
 'surcharge': 0.03570152243714292,
 'tip_amount': 0.6379550906504223,
 'tolls_amount': 0.29479111622992427,
 'vendor_id': 0.012675288180895921,
 'weekday_of_dropoff_datetime': 0.0038024380132646645,
 'weekday_of_pickup_datetime': 0.004337095703898054}
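
Comparing `col_corr` with the columns kept in `correlated_features`, every retained column has a positive coefficient and the four columns with negative coefficients are absent. Whether that is the intended rule is not stated in the source. The sketch below shows a generic filter over such a dictionary; the function name, `min_corr`, and `use_absolute` are hypothetical, with the absolute-value variant included as a common alternative.

```python
def select_by_correlation(col_corr, min_corr=0.0, use_absolute=False):
    """Return column names whose correlation with the target exceeds `min_corr`."""
    selected = []
    for name, corr in col_corr.items():
        score = abs(corr) if use_absolute else corr
        if score > min_corr:
            selected.append(name)
    return selected


# A few coefficients copied from the col_corr output above
sample_corr = {
    "tip_amount": 0.6379550906504223,
    "payment_type": -0.21648939742422965,
    "mta_tax": -0.0736925287661054,
    "weekday_of_dropoff_datetime": 0.0038024380132646645,
}
signed = select_by_correlation(sample_corr)
by_strength = select_by_correlation(sample_corr, min_corr=0.05, use_absolute=True)
```

The signed filter keeps `tip_amount` and `weekday_of_dropoff_datetime`, whereas filtering on strength keeps `tip_amount`, `payment_type`, and `mta_tax`, so the choice between the two changes which features reach the model.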

# **Standardization/Normalization**

In [None]:
dataset2=preprocessing.log_transform(dataset1)
dataset2.head()

Log Normalization(yeojohnson) applied for tip_amount
Log Normalization(yeojohnson) applied for pickup_longitude
Log Normalization(yeojohnson) applied for dropoff_longitude
Log Normalization(yeojohnson) applied for kms_pickup_latitude_dropoff_latitude
Log Normalization(yeojohnson) applied for secs_diff_pickup_datetime_dropoff_datetime


Unnamed: 0,tip_amount,pickup_longitude,dropoff_longitude,kms_pickup_latitude_dropoff_latitude,vendor_id,tolls_amount,passenger_count,rate_code,surcharge,secs_diff_pickup_datetime_dropoff_datetime,hour_of_pickup_datetime,weekday_of_pickup_datetime,month_of_pickup_datetime,hour_of_dropoff_datetime,weekday_of_dropoff_datetime,month_of_dropoff_datetime
0,0.875469,-4.317541,-4.3174,0.837755,0,0.0,1.0,1.0,0.5,5.888878,4,0,4,4,0,4
1,0.693147,-4.31714,-4.316937,1.279897,0,0.0,3.0,1.0,0.0,5.888878,18,6,4,18,6,4
2,0.0,-4.31688,-4.317098,0.931436,0,0.0,2.0,1.0,0.0,5.888878,8,0,4,8,0,4
3,1.029619,-4.316986,-4.317162,0.9551,1,0.0,2.0,1.0,0.0,6.580639,9,4,4,10,4,4
4,0.0,-4.31755,-4.31748,0.965642,0,0.0,1.0,1.0,0.0,6.734592,13,2,4,13,2,4
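
The log lines above mention Yeo-Johnson. As a reference for what that transform does to a single value, here is a sketch of the standard Yeo-Johnson formula for a given lambda. The function name and the choice of lambda are assumptions; the source does not show which lambda `preprocessing.log_transform` uses.

```python
import math


def yeo_johnson(x, lmbda):
    """Yeo-Johnson power transform of a single value for a given lambda."""
    if x >= 0:
        if lmbda == 0:
            return math.log1p(x)                      # log(1 + x)
        return ((x + 1) ** lmbda - 1) / lmbda
    if lmbda == 2:
        return -math.log1p(-x)                        # -log(1 - x)
    return -(((-x + 1) ** (2 - lmbda)) - 1) / (2 - lmbda)


# tip_amount of the first row, transformed with lambda = 0
tip_transformed = yeo_johnson(1.4, 0)
```

With lambda set to 0 a non-negative value reduces to log(1 + x), so a tip of 1.4 maps to about 0.8755, in line with the first `tip_amount` entry in the table above. With lambda set to 1 the transform leaves values unchanged.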


In [None]:
automl.scale_transform(dataset1_processed,method='robust')

array([[-0.83972552, -0.46615732, -1.        , ..., -0.75      ,
        -0.33333333, -0.33333333],
       [ 0.27581602,  0.7378017 , -1.        , ...,  0.75      ,
        -0.33333333, -0.32222222],
       [ 0.99755193,  0.31831915, -1.        , ..., -0.75      ,
        -0.33333333, -0.48888889],
       ...,
       [ 0.56628338, -0.72856399, -1.        , ...,  0.25      ,
         0.83333333,  0.61111111],
       [ 0.05942136, -0.05455808,  0.        , ...,  0.5       ,
         0.83333333, -0.48888889],
       [ 0.46520772, -0.88904324, -1.        , ...,  0.25      ,
         0.83333333,  0.73333333]])
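
Robust scaling is commonly defined as centring each column on its median and dividing by its interquartile range, which makes it less sensitive to outliers than mean and standard deviation scaling. The sketch below applies that definition to one column with the standard library; it is an illustration under that assumption, not the code behind `automl.scale_transform`, and `robust_scale` is a hypothetical name.

```python
from statistics import median, quantiles


def robust_scale(values):
    """Centre on the median and divide by the interquartile range."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = (q3 - q1) or 1.0  # avoid dividing by zero for constant columns
    centre = median(values)
    return [(v - centre) / iqr for v in values]


scaled = robust_scale([1, 2, 3, 4, 5])
```

For this toy column the median is 3 and the interquartile range is 2, so the values map to -1.0, -0.5, 0.0, 0.5, and 1.0.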