# **Ctrl4AI**

A helper package for Machine Learning and Deep Learning solutions

**Developers:** Shaji, Charu, Selva

**Highlights**

1.   An open-source machine learning / deep learning package that currently focuses only on data preprocessing.
2.   The package has many methods that can be used independently, but its major highlight is a single method whose hyperparameters cover the entire preprocessing flow.
3.   Users can start by running with the default parameters and then tune them to their requirements.
4.   Every method carries a proper description to keep it user-friendly.
5.   Self-intelligent methods detect the type of data, its distribution, and so on, and compute accordingly.
6.   Minimises the number of checks the user has to perform during preprocessing.









# **Install & Import**

In [8]:
pip install -i https://test.pypi.org/simple/ ctrl4ai-shaji==0.0.19

Looking in indexes: https://test.pypi.org/simple/


In [9]:
from ctrl4ai import preprocessing

# **Usage**

In [10]:
help(preprocessing.impute_nulls)

Help on function impute_nulls in module ctrl4ai.preprocessing:

impute_nulls(dataset, method='central_tendency')
    Usage: [arg1]:[pandas dataframe],[method(default=central_tendency)]:[Choose either central_tendency or KNN]
    Description: Auto identifies the type of distribution in the column and imputes null values
    Note: Consumes more system mermory if the size of the dataset is huge
    Returns: Dataframe [with separate column for each categorical values]
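
The `central_tendency` option described above can be pictured with plain pandas. The sketch below is only an illustration of the idea, not the package's implementation: the function name `impute_central_tendency`, the `skew_limit` parameter, and the mean-versus-median rule are all assumptions made for this example.

```python
import numpy as np
import pandas as pd


def impute_central_tendency(df: pd.DataFrame, skew_limit: float = 1.0) -> pd.DataFrame:
    """Fill nulls per column with a mean, median, or mode (illustrative only)."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isnull().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            # Assumption: roughly symmetric columns get the mean, skewed ones the median.
            use_median = abs(out[col].skew()) > skew_limit
            fill = out[col].median() if use_median else out[col].mean()
        else:
            # Non-numeric columns get their most frequent value.
            fill = out[col].mode().iloc[0]
        out[col] = out[col].fillna(fill)
    return out


sample = pd.DataFrame({
    "tip_amount": [1.0, np.nan, 2.0, 3.0],
    "payment_type": ["CRD", "CRD", None, "CSH"],
})
filled = impute_central_tendency(sample)
print(filled)
```

Here the numeric column is symmetric, so the gap is filled with the mean, and the categorical column is filled with its mode.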



In [11]:
help(preprocessing.derive_from_datetime)

Help on function derive_from_datetime in module ctrl4ai.preprocessing:

derive_from_datetime(dataset)
    Usage: [arg1]:[pandas dataframe]
    Prerequisite: Type for datetime columns to be defined correctly
    Description: Derives the hour, weekday, year and month from a datetime column
    Returns: Dataframe [with new columns derived from datetime columns]
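
For readers who want to see what such a derivation looks like in plain pandas, here is a minimal sketch. The helper name `derive_datetime_parts` is hypothetical, and the `hour_of_<column>` naming simply mirrors the column names that appear in this notebook's output. It also shows why the prerequisite matters: only columns that already have a datetime dtype are picked up.

```python
import pandas as pd


def derive_datetime_parts(df: pd.DataFrame) -> pd.DataFrame:
    """Add hour, weekday, year and month columns for every datetime column."""
    out = df.copy()
    for col in list(out.columns):
        # Columns stored as strings are skipped, hence the dtype prerequisite.
        if pd.api.types.is_datetime64_any_dtype(out[col]):
            out[f"hour_of_{col}"] = out[col].dt.hour
            out[f"weekday_of_{col}"] = out[col].dt.weekday  # Monday is 0
            out[f"year_of_{col}"] = out[col].dt.year
            out[f"month_of_{col}"] = out[col].dt.month
    return out


trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2015-04-20 04:18:00", "2015-04-19 18:16:00"]),
})
derived = derive_datetime_parts(trips)
print(derived)
```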



# **Inbuilt datasets**



In [12]:
from ctrl4ai import datasets

In [13]:
dataset2=datasets.titanic()
dataset2.head()

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [14]:
import pandas as pd

pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 100)

In [15]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [16]:
from datetime import datetime  # pd.datetime is deprecated; use the standard-library class
dateparse = lambda x: datetime.strptime(x, '%m/%d/%Y %H:%M')
dataset1 = pd.read_csv('/content/gdrive/My Drive/Colab Notebooks/Dataset.csv',parse_dates=['pickup_datetime','dropoff_datetime'], date_parser=dateparse)
dataset1.shape


(1048575, 18)

In [17]:
dataset1.head()

Unnamed: 0,TID,vendor_id,new_user,tolls_amount,tip_amount,mta_tax,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,rate_code,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,surcharge,fare_amount
0,AIX000345001,DST000401,NO,,1.4,,2015-04-20 04:18:00,2015-04-20 04:24:00,1.0,-74.003939,40.742894,1.0,N,-73.993369,40.734247,CRD,0.5,8.4
1,AIX000345002,DST000401,NO,,1.0,,2015-04-19 18:16:00,2015-04-19 18:22:00,3.0,-73.973864,40.752194,1.0,N,-73.958701,40.772533,CRD,0.0,8.5
2,AIX000345003,DST000401,NO,,0.0,,2015-04-06 08:04:00,2015-04-06 08:10:00,2.0,-73.954406,40.76442,1.0,N,-73.97078,40.75835,CSH,0.0,7.0
3,AIX000345004,DST000532,NO,,1.8,,2015-04-10 09:48:00,2015-04-10 10:00:00,2.0,-73.962345,40.767215,1.0,,-73.975512,40.756867,CRD,0.0,11.3
4,AIX000345005,DST000401,NO,,0.0,,2015-04-15 13:12:00,2015-04-15 13:26:00,1.0,-74.004657,40.707434,1.0,N,-73.999369,40.721517,CSH,0.0,10.0


# **Derived Features**

Timestamp fields and geographical coordinates in their raw form are of little use to classification or regression algorithms. The goal, therefore, is to derive as much information as possible from them.

In [18]:
dataset1= preprocessing.get_timediff(dataset1,'pickup_datetime','dropoff_datetime')
dataset1.head()

Unnamed: 0,TID,vendor_id,new_user,tolls_amount,tip_amount,mta_tax,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,rate_code,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,surcharge,fare_amount,secs_diff_pickup_datetime_dropoff_datetime
0,AIX000345001,DST000401,NO,,1.4,,2015-04-20 04:18:00,2015-04-20 04:24:00,1.0,-74.003939,40.742894,1.0,N,-73.993369,40.734247,CRD,0.5,8.4,360.0
1,AIX000345002,DST000401,NO,,1.0,,2015-04-19 18:16:00,2015-04-19 18:22:00,3.0,-73.973864,40.752194,1.0,N,-73.958701,40.772533,CRD,0.0,8.5,360.0
2,AIX000345003,DST000401,NO,,0.0,,2015-04-06 08:04:00,2015-04-06 08:10:00,2.0,-73.954406,40.76442,1.0,N,-73.97078,40.75835,CSH,0.0,7.0,360.0
3,AIX000345004,DST000532,NO,,1.8,,2015-04-10 09:48:00,2015-04-10 10:00:00,2.0,-73.962345,40.767215,1.0,,-73.975512,40.756867,CRD,0.0,11.3,720.0
4,AIX000345005,DST000401,NO,,0.0,,2015-04-15 13:12:00,2015-04-15 13:26:00,1.0,-74.004657,40.707434,1.0,N,-73.999369,40.721517,CSH,0.0,10.0,840.0
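
The new `secs_diff_...` column is the elapsed time between the two timestamps, in seconds. A rough pandas equivalent is sketched below. The helper name `seconds_between` is hypothetical and the column name just copies the pattern seen in the output above.

```python
import pandas as pd


def seconds_between(df: pd.DataFrame, start_col: str, end_col: str) -> pd.DataFrame:
    """Add a column holding the difference end - start in seconds."""
    out = df.copy()
    delta = out[end_col] - out[start_col]  # a Timedelta series
    out[f"secs_diff_{start_col}_{end_col}"] = delta.dt.total_seconds()
    return out


trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2015-04-20 04:18:00"]),
    "dropoff_datetime": pd.to_datetime(["2015-04-20 04:24:00"]),
})
timed = seconds_between(trips, "pickup_datetime", "dropoff_datetime")
print(timed["secs_diff_pickup_datetime_dropoff_datetime"])
```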


In [19]:
dataset1=preprocessing.get_distance(dataset1,'pickup_latitude','pickup_longitude','dropoff_latitude','dropoff_longitude')
dataset1.head()

Unnamed: 0,TID,vendor_id,new_user,tolls_amount,tip_amount,mta_tax,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,rate_code,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,surcharge,fare_amount,secs_diff_pickup_datetime_dropoff_datetime,kms_pickup_latitude_dropoff_latitude
0,AIX000345001,DST000401,NO,,1.4,,2015-04-20 04:18:00,2015-04-20 04:24:00,1.0,-74.003939,40.742894,1.0,N,-73.993369,40.734247,CRD,0.5,8.4,360.0,1.311173
1,AIX000345002,DST000401,NO,,1.0,,2015-04-19 18:16:00,2015-04-19 18:22:00,3.0,-73.973864,40.752194,1.0,N,-73.958701,40.772533,CRD,0.0,8.5,360.0,2.59627
2,AIX000345003,DST000401,NO,,0.0,,2015-04-06 08:04:00,2015-04-06 08:10:00,2.0,-73.954406,40.76442,1.0,N,-73.97078,40.75835,CSH,0.0,7.0,360.0,1.538152
3,AIX000345004,DST000532,NO,,1.8,,2015-04-10 09:48:00,2015-04-10 10:00:00,2.0,-73.962345,40.767215,1.0,,-73.975512,40.756867,CRD,0.0,11.3,720.0,1.598931
4,AIX000345005,DST000401,NO,,0.0,,2015-04-15 13:12:00,2015-04-15 13:26:00,1.0,-74.004657,40.707434,1.0,N,-73.999369,40.721517,CSH,0.0,10.0,840.0,1.626473
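
The `kms_...` values above are consistent with a great-circle distance between the pickup and dropoff points. Whether the package uses exactly the haversine formula is an assumption. The sketch below, with the hypothetical helper `haversine_km` and an assumed mean Earth radius of 6371 km, gives roughly the same figure for the first row.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# First row of the table: pickup and dropoff latitude/longitude.
first_trip_km = haversine_km(40.742894, -74.003939, 40.734247, -73.993369)
print(round(float(first_trip_km), 2))
```

The function also accepts NumPy arrays or pandas Series, so it can be applied to whole columns at once.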


In [20]:
dataset1=preprocessing.derive_from_datetime(dataset1)
dataset1.head()

Unnamed: 0,TID,vendor_id,new_user,tolls_amount,tip_amount,mta_tax,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,rate_code,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,surcharge,fare_amount,secs_diff_pickup_datetime_dropoff_datetime,kms_pickup_latitude_dropoff_latitude,hour_of_pickup_datetime,weekday_of_pickup_datetime,year_of_pickup_datetime,month_of_pickup_datetime,hour_of_dropoff_datetime,weekday_of_dropoff_datetime,year_of_dropoff_datetime,month_of_dropoff_datetime
0,AIX000345001,DST000401,NO,,1.4,,2015-04-20 04:18:00,2015-04-20 04:24:00,1.0,-74.003939,40.742894,1.0,N,-73.993369,40.734247,CRD,0.5,8.4,360.0,1.311173,4,0,2015,4,4,0,2015,4
1,AIX000345002,DST000401,NO,,1.0,,2015-04-19 18:16:00,2015-04-19 18:22:00,3.0,-73.973864,40.752194,1.0,N,-73.958701,40.772533,CRD,0.0,8.5,360.0,2.59627,18,6,2015,4,18,6,2015,4
2,AIX000345003,DST000401,NO,,0.0,,2015-04-06 08:04:00,2015-04-06 08:10:00,2.0,-73.954406,40.76442,1.0,N,-73.97078,40.75835,CSH,0.0,7.0,360.0,1.538152,8,0,2015,4,8,0,2015,4
3,AIX000345004,DST000532,NO,,1.8,,2015-04-10 09:48:00,2015-04-10 10:00:00,2.0,-73.962345,40.767215,1.0,,-73.975512,40.756867,CRD,0.0,11.3,720.0,1.598931,9,4,2015,4,10,4,2015,4
4,AIX000345005,DST000401,NO,,0.0,,2015-04-15 13:12:00,2015-04-15 13:26:00,1.0,-74.004657,40.707434,1.0,N,-73.999369,40.721517,CSH,0.0,10.0,840.0,1.626473,13,2,2015,4,13,2,2015,4


# **Feature Elimination**

In [21]:
dataset1=preprocessing.drop_null_fields(dataset1,dropna_threshold=0.5)

Dropping store_and_fwd_flag
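
A threshold of this kind is usually applied to the fraction of missing values per column. The following sketch assumes that a column is dropped when its null fraction exceeds the threshold. The exact comparison used by the package is not shown here, and `drop_mostly_null_columns` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def drop_mostly_null_columns(df: pd.DataFrame, threshold: float = 0.5):
    """Drop columns whose share of nulls is above the threshold."""
    null_fraction = df.isnull().mean()  # per-column fraction of missing values
    to_drop = null_fraction[null_fraction > threshold].index.tolist()
    return df.drop(columns=to_drop), to_drop


sample = pd.DataFrame({
    "fare_amount": [8.4, 8.5, 7.0, 11.3],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
})
kept, dropped = drop_mostly_null_columns(sample, threshold=0.5)
print("Dropping", ",".join(dropped))
```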


In [22]:
dataset1=preprocessing.drop_single_valued_cols(dataset1)

Dropping new_user,year_of_pickup_datetime,year_of_dropoff_datetime
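
Columns that hold a single value carry no signal for a model, which is why `new_user` and the two year columns go. A minimal pandas sketch of the same idea follows. `drop_constant_columns` is a hypothetical helper, and treating nulls as "not a value" is an assumption.

```python
import pandas as pd


def drop_constant_columns(df: pd.DataFrame):
    """Drop columns that contain at most one distinct non-null value."""
    constant = [col for col in df.columns if df[col].nunique() <= 1]
    return df.drop(columns=constant), constant


sample = pd.DataFrame({
    "new_user": ["NO", "NO", "NO"],
    "payment_type": ["CRD", "CSH", "CRD"],
})
kept, dropped = drop_constant_columns(sample)
print("Dropping", ",".join(dropped))
```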


# **Dealing with Categorical Data**

In [23]:
dataset2=preprocessing.get_ohe_df(dataset2,ignore_cols=['Age'])
dataset2.head()

Unnamed: 0.1,Unnamed: 0,PassengerId,Name,Age,Ticket,Survived_0,Survived_1,Pclass_1,Pclass_2,Pclass_3,Sex_female,Sex_male,SibSp_0,SibSp_1,SibSp_2,SibSp_3,SibSp_4,SibSp_5,SibSp_8,Parch_0,Parch_1,Parch_2,Parch_3,Parch_4,Parch_5,Parch_6,Fare_0.0,Fare_4.0125,Fare_5.0,Fare_6.2375,Fare_6.4375,Fare_6.45,Fare_6.4958,Fare_6.75,Fare_6.8583,Fare_6.95,Fare_6.975,Fare_7.0458,Fare_7.05,Fare_7.0542,Fare_7.125,Fare_7.1417,Fare_7.225,Fare_7.2292,Fare_7.25,Fare_7.3125,Fare_7.4958,Fare_7.5208,Fare_7.55,Fare_7.6292,Fare_7.65,Fare_7.725,Fare_7.7292,Fare_7.7333,Fare_7.7375,Fare_7.7417,Fare_7.75,Fare_7.775,Fare_7.7875,Fare_7.7958,Fare_7.8,Fare_7.8292,Fare_7.8542,Fare_7.875,Fare_7.8792,Fare_7.8875,Fare_7.8958,Fare_7.925,Fare_8.0292,Fare_8.05,Fare_8.1125,Fare_8.1375,Fare_8.1583,Fare_8.3,Fare_8.3625,Fare_8.4042,Fare_8.4333,Fare_8.4583,Fare_8.5167,Fare_8.6542,Fare_8.6625,Fare_8.6833,Fare_8.7125,Fare_8.85,Fare_9.0,Fare_9.2167,Fare_9.225,Fare_9.35,Fare_9.475,Fare_9.4833,Fare_9.5,Fare_9.5875,Fare_9.825,Fare_9.8375,Fare_9.8417,Fare_9.8458,Fare_10.1708,Fare_10.4625,Fare_10.5,Fare_10.5167,Fare_11.1333,Fare_11.2417,Fare_11.5,Fare_12.0,Fare_12.275,Fare_12.2875,Fare_12.35,Fare_12.475,Fare_12.525,Fare_12.65,Fare_12.875,Fare_13.0,Fare_13.4167,Fare_13.5,Fare_13.7917,Fare_13.8583,Fare_13.8625,Fare_14.0,Fare_14.1083,Fare_14.4,Fare_14.4542,Fare_14.4583,Fare_14.5,Fare_15.0,Fare_15.0458,Fare_15.05,Fare_15.1,Fare_15.2458,Fare_15.5,Fare_15.55,Fare_15.7417,Fare_15.75,Fare_15.85,Fare_15.9,Fare_16.0,Fare_16.1,Fare_16.7,Fare_17.4,Fare_17.8,Fare_18.0,Fare_18.75,Fare_18.7875,Fare_19.2583,Fare_19.5,Fare_19.9667,Fare_20.2125,Fare_20.25,Fare_20.525,Fare_20.575,Fare_21.0,Fare_21.075,Fare_21.6792,Fare_22.025,Fare_22.3583,Fare_22.525,Fare_23.0,Fare_23.25,Fare_23.45,Fare_24.0,Fare_24.15,Fare_25.4667,Fare_25.5875,Fare_25.925,Fare_25.9292,Fare_26.0,Fare_26.25,Fare_26.2833,Fare_26.2875,Fare_26.3875,Fare_26.55,Fare_27.0,Fare_27.7208,Fare_27.75,Fare_27.9,Fare_28.5,Fare_28.7125,Fare_29.0,Fare_29.125,Fare_29.7,Fare_30.0,Fare_30.0708,Fare_30.5,Fare_30.6958,Fare_31.0,Fare_31.275,Fare_31.3875,Fare_32.3208,Fare_32.5,Fare_33.0,Fare_33.5,Fare_34.0208,Fare_34.375,Fare_34.6542,Fare_35.0,Fare_35.5,Fare_36.75,Fare_37.0042,Fare_38.5,Fare_39.0,Fare_39.4,Fare_39.6,Fare_39.6875,Fare_40.125,Fare_41.5792,Fare_42.4,Fare_46.9,Fare_47.1,Fare_49.5,Fare_49.5042,Fare_50.0,Fare_50.4958,Fare_51.4792,Fare_51.8625,Fare_52.0,Fare_52.5542,Fare_53.1,Fare_55.0,Fare_55.4417,Fare_55.9,Fare_56.4958,Fare_56.9292,Fare_57.0,Fare_57.9792,Fare_59.4,Fare_61.175,Fare_61.3792,Fare_61.9792,Fare_63.3583,Fare_65.0,Fare_66.6,Fare_69.3,Fare_69.55,Fare_71.0,Fare_71.2833,Fare_73.5,Fare_75.25,Fare_76.2917,Fare_76.7292,Fare_77.2875,Fare_77.9583,Fare_78.2667,Fare_78.85,Fare_79.2,Fare_79.65,Fare_80.0,Fare_81.8583,Fare_82.1708,Fare_83.1583,Fare_83.475,Fare_86.5,Fare_89.1042,Fare_90.0,Fare_91.0792,Fare_93.5,Fare_106.425,Fare_108.9,Fare_110.8833,Fare_113.275,Fare_120.0,Fare_133.65,Fare_134.5,Fare_135.6333,Fare_146.5208,Fare_151.55,Fare_153.4625,Fare_164.8667,Fare_211.3375,Fare_211.5,Fare_221.7792,Fare_227.525,Fare_247.5208,Fare_262.375,Fare_263.0,Fare_512.3292,Cabin_A10,Cabin_A14,Cabin_A16,Cabin_A19,Cabin_A20,Cabin_A23,Cabin_A24,Cabin_A26,Cabin_A31,Cabin_A32,Cabin_A34,Cabin_A36,Cabin_A5,Cabin_A6,Cabin_A7,Cabin_B101,Cabin_B102,Cabin_B18,Cabin_B19,Cabin_B20,Cabin_B22,Cabin_B28,Cabin_B3,Cabin_B30,Cabin_B35,Cabin_B37,Cabin_B38,Cabin_B39,Cabin_B4,Cabin_B41,Cabin_B42,Cabin_B49,Cabin_B5,Cabin_B50,Cabin_B51 B53 B55,Cabin_B57 B59 B63 B66,Cabin_B58 B60,Cabin_B69,Cabin_B71,Cabin_B73,Cabin_B77,Cabin_B78,Cabin_B79,Cabin_B80,Cabin_B82 B84,Cabin_B86,Cabin_B94,Cabin_B96 B98,Cabin_C101,Cabin_C103,Cabin_C104,Cabin_C106,Cabin_C110,Cabin_C111,Cabin_C118,Cabin_C123,Cabin_C124,Cabin_C125,Cabin_C126,Cabin_C128,Cabin_C148,Cabin_C2,Cabin_C22 C26,Cabin_C23 C25 C27,Cabin_C30,Cabin_C32,Cabin_C45,Cabin_C46,Cabin_C47,Cabin_C49,Cabin_C50,Cabin_C52,Cabin_C54,Cabin_C62 C64,Cabin_C65,Cabin_C68,Cabin_C7,Cabin_C70,Cabin_C78,Cabin_C82,Cabin_C83,Cabin_C85,Cabin_C86,Cabin_C87,Cabin_C90,Cabin_C91,Cabin_C92,Cabin_C93,Cabin_C95,Cabin_C99,Cabin_D,Cabin_D10 D12,Cabin_D11,Cabin_D15,Cabin_D17,Cabin_D19,Cabin_D20,Cabin_D21,Cabin_D26,Cabin_D28,Cabin_D30,Cabin_D33,Cabin_D35,Cabin_D36,Cabin_D37,Cabin_D45,Cabin_D46,Cabin_D47,Cabin_D48,Cabin_D49,Cabin_D50,Cabin_D56,Cabin_D6,Cabin_D7,Cabin_D9,Cabin_E10,Cabin_E101,Cabin_E12,Cabin_E121,Cabin_E17,Cabin_E24,Cabin_E25,Cabin_E31,Cabin_E33,Cabin_E34,Cabin_E36,Cabin_E38,Cabin_E40,Cabin_E44,Cabin_E46,Cabin_E49,Cabin_E50,Cabin_E58,Cabin_E63,Cabin_E67,Cabin_E68,Cabin_E77,Cabin_E8,Cabin_F E69,Cabin_F G63,Cabin_F G73,Cabin_F2,Cabin_F33,Cabin_F38,Cabin_F4,Cabin_G6,Cabin_T,Embarked_C,Embarked_Q,Embarked_S
0,0,1,"Braund, Mr. Owen Harris",22.0,A/5 21171,1,0,0,0,1,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,PC 17599,0,1,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,2,3,"Heikkinen, Miss. Laina",26.0,STON/O2. 3101282,0,1,0,0,1,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,113803,0,1,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
4,4,5,"Allen, Mr. William Henry",35.0,373450,1,0,0,0,1,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
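
One-hot encoding of this kind can also be reproduced with `pandas.get_dummies`, which is a standard pandas function. The sketch below is an illustration on a three-row frame with an explicitly chosen column list. It does not claim to match how `get_ohe_df` decides which columns to encode.

```python
import pandas as pd

passengers = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Pclass": [3, 1, 3],
    "Age": [22.0, 38.0, 26.0],
})

# Encode only the listed columns; Age is left out and stays numeric.
encoded = pd.get_dummies(passengers, columns=["Sex", "Pclass"], dtype=int)
print(encoded)
```

In the output above, `Fare` was expanded into one column per distinct value. If that is not wanted, listing such continuous columns in `ignore_cols`, as was done for `Age`, would presumably keep them as a single numeric column.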


In [24]:
rate_code_labels,dataset1=preprocessing.label_encode(dataset1,'rate_code')
print(rate_code_labels)
dataset1.head()

{1.0: 0, 2.0: 1, 4.0: 2, 5.0: 3, 3.0: 4, 0.0: 5, 6.0: 6, 210.0: 7}


Unnamed: 0,TID,vendor_id,tolls_amount,tip_amount,mta_tax,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,rate_code,dropoff_longitude,dropoff_latitude,payment_type,surcharge,fare_amount,secs_diff_pickup_datetime_dropoff_datetime,kms_pickup_latitude_dropoff_latitude,hour_of_pickup_datetime,weekday_of_pickup_datetime,month_of_pickup_datetime,hour_of_dropoff_datetime,weekday_of_dropoff_datetime,month_of_dropoff_datetime
0,AIX000345001,DST000401,,1.4,,2015-04-20 04:18:00,2015-04-20 04:24:00,1.0,-74.003939,40.742894,0,-73.993369,40.734247,CRD,0.5,8.4,360.0,1.311173,4,0,4,4,0,4
1,AIX000345002,DST000401,,1.0,,2015-04-19 18:16:00,2015-04-19 18:22:00,3.0,-73.973864,40.752194,0,-73.958701,40.772533,CRD,0.0,8.5,360.0,2.59627,18,6,4,18,6,4
2,AIX000345003,DST000401,,0.0,,2015-04-06 08:04:00,2015-04-06 08:10:00,2.0,-73.954406,40.76442,0,-73.97078,40.75835,CSH,0.0,7.0,360.0,1.538152,8,0,4,8,0,4
3,AIX000345004,DST000532,,1.8,,2015-04-10 09:48:00,2015-04-10 10:00:00,2.0,-73.962345,40.767215,0,-73.975512,40.756867,CRD,0.0,11.3,720.0,1.598931,9,4,4,10,4,4
4,AIX000345005,DST000401,,0.0,,2015-04-15 13:12:00,2015-04-15 13:26:00,1.0,-74.004657,40.707434,0,-73.999369,40.721517,CSH,0.0,10.0,840.0,1.626473,13,2,4,13,2,4
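
Label encoding replaces each distinct value with an integer code and returns the mapping so it can be reversed later. A comparable result can be sketched with `pandas.factorize`, which numbers values in order of first appearance. Whether `label_encode` orders its codes the same way is an assumption, and `label_encode_column` is a hypothetical name.

```python
import pandas as pd


def label_encode_column(df: pd.DataFrame, col: str):
    """Replace a column with integer codes and return (mapping, new frame)."""
    out = df.copy()
    codes, uniques = pd.factorize(out[col])  # nulls, if any, get the code -1
    labels = {value: code for code, value in enumerate(uniques)}
    out[col] = codes
    return labels, out


trips = pd.DataFrame({"rate_code": [1.0, 2.0, 1.0, 4.0]})
labels, encoded_trips = label_encode_column(trips, "rate_code")
print(labels)
```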


# **Data Cleansing**

In [25]:
dataset2=preprocessing.drop_non_numeric(dataset2)

Dropping Name,Ticket
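
Dropping the remaining text columns leaves a frame that numeric algorithms can consume directly. In plain pandas the same effect can be had with `select_dtypes`. The sketch below is an equivalent under that assumption, not the package's code.

```python
import pandas as pd

passengers = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "Age": [22.0, 26.0],
    "Sex_male": [1, 0],
})

# Keep only integer and float columns; text columns such as Name are dropped.
numeric_only = passengers.select_dtypes(include="number")
dropped = [col for col in passengers.columns if col not in numeric_only.columns]
print("Dropping", ",".join(dropped))
```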


In [26]:
dataset2.isnull().sum()

Unnamed: 0       0
PassengerId      0
Age            177
Survived_0       0
Survived_1       0
              ... 
Cabin_G6         0
Cabin_T          0
Embarked_C       0
Embarked_Q       0
Embarked_S       0
Length: 422, dtype: int64

In [27]:
dataset2=preprocessing.impute_nulls(dataset2,method='KNN')
dataset2.head()

Unnamed: 0.1,Unnamed: 0,PassengerId,Age,Survived_0,Survived_1,Pclass_1,Pclass_2,Pclass_3,Sex_female,Sex_male,SibSp_0,SibSp_1,SibSp_2,SibSp_3,SibSp_4,SibSp_5,SibSp_8,Parch_0,Parch_1,Parch_2,Parch_3,Parch_4,Parch_5,Parch_6,Fare_0.0,Fare_4.0125,Fare_5.0,Fare_6.2375,Fare_6.4375,Fare_6.45,Fare_6.4958,Fare_6.75,Fare_6.8583,Fare_6.95,Fare_6.975,Fare_7.0458,Fare_7.05,Fare_7.0542,Fare_7.125,Fare_7.1417,Fare_7.225,Fare_7.2292,Fare_7.25,Fare_7.3125,Fare_7.4958,Fare_7.5208,Fare_7.55,Fare_7.6292,Fare_7.65,Fare_7.725,Fare_7.7292,Fare_7.7333,Fare_7.7375,Fare_7.7417,Fare_7.75,Fare_7.775,Fare_7.7875,Fare_7.7958,Fare_7.8,Fare_7.8292,Fare_7.8542,Fare_7.875,Fare_7.8792,Fare_7.8875,Fare_7.8958,Fare_7.925,Fare_8.0292,Fare_8.05,Fare_8.1125,Fare_8.1375,Fare_8.1583,Fare_8.3,Fare_8.3625,Fare_8.4042,Fare_8.4333,Fare_8.4583,Fare_8.5167,Fare_8.6542,Fare_8.6625,Fare_8.6833,Fare_8.7125,Fare_8.85,Fare_9.0,Fare_9.2167,Fare_9.225,Fare_9.35,Fare_9.475,Fare_9.4833,Fare_9.5,Fare_9.5875,Fare_9.825,Fare_9.8375,Fare_9.8417,Fare_9.8458,Fare_10.1708,Fare_10.4625,Fare_10.5,Fare_10.5167,Fare_11.1333,Fare_11.2417,Fare_11.5,Fare_12.0,Fare_12.275,Fare_12.2875,Fare_12.35,Fare_12.475,Fare_12.525,Fare_12.65,Fare_12.875,Fare_13.0,Fare_13.4167,Fare_13.5,Fare_13.7917,Fare_13.8583,Fare_13.8625,Fare_14.0,Fare_14.1083,Fare_14.4,Fare_14.4542,Fare_14.4583,Fare_14.5,Fare_15.0,Fare_15.0458,Fare_15.05,Fare_15.1,Fare_15.2458,Fare_15.5,Fare_15.55,Fare_15.7417,Fare_15.75,Fare_15.85,Fare_15.9,Fare_16.0,Fare_16.1,Fare_16.7,Fare_17.4,Fare_17.8,Fare_18.0,Fare_18.75,Fare_18.7875,Fare_19.2583,Fare_19.5,Fare_19.9667,Fare_20.2125,Fare_20.25,Fare_20.525,Fare_20.575,Fare_21.0,Fare_21.075,Fare_21.6792,Fare_22.025,Fare_22.3583,Fare_22.525,Fare_23.0,Fare_23.25,Fare_23.45,Fare_24.0,Fare_24.15,Fare_25.4667,Fare_25.5875,Fare_25.925,Fare_25.9292,Fare_26.0,Fare_26.25,Fare_26.2833,Fare_26.2875,Fare_26.3875,Fare_26.55,Fare_27.0,Fare_27.7208,Fare_27.75,Fare_27.9,Fare_28.5,Fare_28.7125,Fare_29.0,Fare_29.125,Fare_29.7,Fare_30.0,Fare_30.0708,Fare_30.5,Fare_30.6958,Fare_31.0,Fare_31.275,Fare_31.3875,Fare_32.3208,Fare_32.5,Fare_33.0,Fare_33.5,Fare_34.0208,Fare_34.375,Fare_34.6542,Fare_35.0,Fare_35.5,Fare_36.75,Fare_37.0042,Fare_38.5,Fare_39.0,Fare_39.4,Fare_39.6,Fare_39.6875,Fare_40.125,Fare_41.5792,Fare_42.4,Fare_46.9,Fare_47.1,Fare_49.5,Fare_49.5042,Fare_50.0,Fare_50.4958,Fare_51.4792,Fare_51.8625,Fare_52.0,Fare_52.5542,Fare_53.1,Fare_55.0,Fare_55.4417,Fare_55.9,Fare_56.4958,Fare_56.9292,Fare_57.0,Fare_57.9792,Fare_59.4,Fare_61.175,Fare_61.3792,Fare_61.9792,Fare_63.3583,Fare_65.0,Fare_66.6,Fare_69.3,Fare_69.55,Fare_71.0,Fare_71.2833,Fare_73.5,Fare_75.25,Fare_76.2917,Fare_76.7292,Fare_77.2875,Fare_77.9583,Fare_78.2667,Fare_78.85,Fare_79.2,Fare_79.65,Fare_80.0,Fare_81.8583,Fare_82.1708,Fare_83.1583,Fare_83.475,Fare_86.5,Fare_89.1042,Fare_90.0,Fare_91.0792,Fare_93.5,Fare_106.425,Fare_108.9,Fare_110.8833,Fare_113.275,Fare_120.0,Fare_133.65,Fare_134.5,Fare_135.6333,Fare_146.5208,Fare_151.55,Fare_153.4625,Fare_164.8667,Fare_211.3375,Fare_211.5,Fare_221.7792,Fare_227.525,Fare_247.5208,Fare_262.375,Fare_263.0,Fare_512.3292,Cabin_A10,Cabin_A14,Cabin_A16,Cabin_A19,Cabin_A20,Cabin_A23,Cabin_A24,Cabin_A26,Cabin_A31,Cabin_A32,Cabin_A34,Cabin_A36,Cabin_A5,Cabin_A6,Cabin_A7,Cabin_B101,Cabin_B102,Cabin_B18,Cabin_B19,Cabin_B20,Cabin_B22,Cabin_B28,Cabin_B3,Cabin_B30,Cabin_B35,Cabin_B37,Cabin_B38,Cabin_B39,Cabin_B4,Cabin_B41,Cabin_B42,Cabin_B49,Cabin_B5,Cabin_B50,Cabin_B51 B53 B55,Cabin_B57 B59 B63 B66,Cabin_B58 B60,Cabin_B69,Cabin_B71,Cabin_B73,Cabin_B77,Cabin_B78,Cabin_B79,Cabin_B80,Cabin_B82 B84,Cabin_B86,Cabin_B94,Cabin_B96 B98,Cabin_C101,Cabin_C103,Cabin_C104,Cabin_C106,Cabin_C110,Cabin_C111,Cabin_C118,Cabin_C123,Cabin_C124,Cabin_C125,Cabin_C126,Cabin_C128,Cabin_C148,Cabin_C2,Cabin_C22 C26,Cabin_C23 C25 C27,Cabin_C30,Cabin_C32,Cabin_C45,Cabin_C46,Cabin_C47,Cabin_C49,Cabin_C50,Cabin_C52,Cabin_C54,Cabin_C62 C64,Cabin_C65,Cabin_C68,Cabin_C7,Cabin_C70,Cabin_C78,Cabin_C82,Cabin_C83,Cabin_C85,Cabin_C86,Cabin_C87,Cabin_C90,Cabin_C91,Cabin_C92,Cabin_C93,Cabin_C95,Cabin_C99,Cabin_D,Cabin_D10 D12,Cabin_D11,Cabin_D15,Cabin_D17,Cabin_D19,Cabin_D20,Cabin_D21,Cabin_D26,Cabin_D28,Cabin_D30,Cabin_D33,Cabin_D35,Cabin_D36,Cabin_D37,Cabin_D45,Cabin_D46,Cabin_D47,Cabin_D48,Cabin_D49,Cabin_D50,Cabin_D56,Cabin_D6,Cabin_D7,Cabin_D9,Cabin_E10,Cabin_E101,Cabin_E12,Cabin_E121,Cabin_E17,Cabin_E24,Cabin_E25,Cabin_E31,Cabin_E33,Cabin_E34,Cabin_E36,Cabin_E38,Cabin_E40,Cabin_E44,Cabin_E46,Cabin_E49,Cabin_E50,Cabin_E58,Cabin_E63,Cabin_E67,Cabin_E68,Cabin_E77,Cabin_E8,Cabin_F E69,Cabin_F G63,Cabin_F G73,Cabin_F2,Cabin_F33,Cabin_F38,Cabin_F4,Cabin_G6,Cabin_T,Embarked_C,Embarked_Q,Embarked_S
0,0.0,1.0,22.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,1.0,2.0,38.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,2.0,3.0,26.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,3.0,4.0,35.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,4.0,5.0,35.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [28]:
dataset2.isnull().sum()

Unnamed: 0     0
PassengerId    0
Age            0
Survived_0     0
Survived_1     0
              ..
Cabin_G6       0
Cabin_T        0
Embarked_C     0
Embarked_Q     0
Embarked_S     0
Length: 422, dtype: int64

In [29]:
dataset1.isnull().sum()

TID                                                0
vendor_id                                          0
tolls_amount                                     273
tip_amount                                    104794
mta_tax                                          138
pickup_datetime                                    0
dropoff_datetime                                   0
passenger_count                                  188
pickup_longitude                               31400
pickup_latitude                                21118
rate_code                                          0
dropoff_longitude                               3163
dropoff_latitude                                5080
payment_type                                     152
surcharge                                      62873
fare_amount                                        0
secs_diff_pickup_datetime_dropoff_datetime         0
kms_pickup_latitude_dropoff_latitude           59681
hour_of_pickup_datetime                       

In [30]:
# Impute nulls with the default method ('central_tendency')
dataset1 = preprocessing.impute_nulls(dataset1)

Replaced nulls in tolls_amount with mean
Replaced nulls in tip_amount with mean
Replaced nulls in mta_tax with mean
Replaced nulls in passenger_count with mean
Replaced nulls in pickup_longitude with mean
Replaced nulls in pickup_latitude with mean
Replaced nulls in dropoff_longitude with mean
Replaced nulls in dropoff_latitude with mean
Replaced nulls in payment_type with mode
Replaced nulls in surcharge with mean
Replaced nulls in kms_pickup_latitude_dropoff_latitude with mean
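
The log above shows each column being filled with a central-tendency statistic (mean or mode). The following is a minimal pandas sketch of that idea, not the package's actual implementation. The helper name `impute_central_tendency`, the `skew_threshold` parameter, and the rule of using the median for strongly skewed numeric columns are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def impute_central_tendency(df, skew_threshold=1.0):
    """Hypothetical helper: fill nulls per column with mean, median, or mode."""
    out = df.copy()
    for col in out.columns:
        # Skip columns without nulls, and columns with no observed values at all
        if not out[col].isnull().any() or out[col].notna().sum() == 0:
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            skew = out[col].skew()
            # Assumption: median for strongly skewed columns, mean otherwise
            if pd.notna(skew) and abs(skew) > skew_threshold:
                fill, label = out[col].median(), "median"
            else:
                fill, label = out[col].mean(), "mean"
        else:
            # Non-numeric columns get the most frequent value
            fill, label = out[col].mode(dropna=True).iloc[0], "mode"
        out[col] = out[col].fillna(fill)
        print(f"Replaced nulls in {col} with {label}")
    return out


# Tiny illustrative frame (made-up values, not the taxi dataset)
demo = pd.DataFrame({
    "tip_amount": [1.0, 2.0, np.nan, 3.0],
    "payment_type": ["CRD", "CSH", "CRD", None],
})
imputed = impute_central_tendency(demo)
```

Working on a copy keeps the input frame untouched, which makes it easy to compare null counts before and after with `isnull().sum()`.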


In [31]:
dataset1.isnull().sum()

TID                                           0
vendor_id                                     0
tolls_amount                                  0
tip_amount                                    0
mta_tax                                       0
pickup_datetime                               0
dropoff_datetime                              0
passenger_count                               0
pickup_longitude                              0
pickup_latitude                               0
rate_code                                     0
dropoff_longitude                             0
dropoff_latitude                              0
payment_type                                  0
surcharge                                     0
fare_amount                                   0
secs_diff_pickup_datetime_dropoff_datetime    0
kms_pickup_latitude_dropoff_latitude          0
hour_of_pickup_datetime                       0
weekday_of_pickup_datetime                    0
month_of_pickup_datetime                

# **Feature Selection**

In [33]:
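# Select the features correlated with the target column 'fare_amount'
correlated_features = preprocessing.get_correlated_features(dataset1, 'fare_amount')
# Keep only the selected columns (the list includes the target itself)
dataset1 = dataset1[correlated_features]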

Selected Features - tolls_amount,tip_amount,passenger_count,pickup_longitude,dropoff_longitude,fare_amount,secs_diff_pickup_datetime_dropoff_datetime,kms_pickup_latitude_dropoff_latitude,month_of_pickup_datetime,month_of_dropoff_datetime
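
How `get_correlated_features` picks columns is not documented here. As a rough sketch of one common approach, the snippet below keeps the numeric columns whose absolute Pearson correlation with the target meets a threshold. The helper name `select_correlated` and the `threshold` value are hypothetical and may differ from what the package does.

```python
import pandas as pd


def select_correlated(df, target, threshold=0.1):
    """Hypothetical helper: keep numeric columns correlated with the target."""
    numeric = df.select_dtypes(include="number")
    # Absolute Pearson correlation of every numeric column with the target
    corr = numeric.corr()[target].abs()
    # Always keep the target; keep others only if they reach the threshold
    return [c for c in numeric.columns if c == target or corr[c] >= threshold]


# Tiny illustrative frame (made-up values, not the taxi dataset)
demo = pd.DataFrame({
    "distance_km": [1.0, 2.0, 3.0, 4.0, 5.0],
    "noise": [1.0, -1.0, 0.0, -1.0, 1.0],
    "fare_amount": [3.0, 5.0, 7.0, 9.0, 11.0],
})
features = select_correlated(demo, "fare_amount", threshold=0.5)
print("Selected Features - " + ",".join(features))
```

Pearson correlation only captures linear relationships, so a column with a strong non-linear link to the target could still be dropped by a filter like this.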


# **Main Method - Complete Preprocessing Flow**

A single method that covers the entire preprocessing flow is a work in progress.