# Overall Analysis and Implementation Plan
- `id` - Set as row index - all values are unique
- `user_id` - Drop - PII (mobile number); we do not use personally identifiable information as machine learning features
- `vehicle_model_id` - Drop - more than 70% of the values are 12

- Divide the data into 3 datasets based on travel_type_id, since different features are applicable to different travel types

- After dividing the data into 3 subsets, create 3 separate machine learning models:
1. ML Model 1 for travel_type_1
2. ML Model 2 for travel_type_2
3. ML Model 3 for travel_type_3

# Column-wise analysis for each subset: which columns to drop, keep or transform

## For travel_type_id = 1

- package_id - drop - all null values
- travel_type_id - drop - all same values
- from_area_id - transform into cancellation volume, i.e. Zero, High, Medium or Low cancellation area, then drop the original column
- to_area_id - drop - all values null
- from_city_id - use mode to impute the null values
- to_city_id - use mode to impute the null values
- Make a column for inter-city routes: from_city_id -> to_city_id
- city_routes - transform into cancellation volume, i.e. Zero, High, Medium or Low cancellation routes
- to_city_id, from_city_id, routes, cancellation_perc - drop once the cancellation volume has been derived
- from_date - transform into dayOfWeek, Month, Weekday/Weekend and TimeOfDay (Hour, or Morning / Afternoon / Evening / Night)
- time_diff - calculate the number of hours between booking creation (booking_created) and the trip start time (from_date)
- booking_nature - transform time_diff into Urgent, SameDay, Regular or Advance bookings, since time_diff has >10% outliers
- online_booking | mobile_site_booking - keep
- from_date - drop
- booking_created - drop
- from_lat, from_long, to_lat, to_long - drop all four: to_lat and to_long are entirely null, and the information in from_lat and from_long is already captured by from_area_id
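
The date-derived features in the list above can be sketched as follows. This is a minimal sketch, not the notebook's final implementation: the helper names `part_of_day` and `add_time_features`, the hour cut-offs for the time of day, and the 2 / 24 / 72 hour thresholds for `booking_nature` are all assumptions made for illustration. It also assumes every date string is month-first, which is what the sampled rows suggest.

```python
import pandas as pd


def part_of_day(hour: int) -> str:
    """Map an hour (0-23) to a coarse bucket; the cut-offs are assumptions."""
    if 6 <= hour < 12:
        return "Morning"
    if 12 <= hour < 17:
        return "Afternoon"
    if 17 <= hour < 21:
        return "Evening"
    return "Night"


def add_time_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Derive calendar features, time_diff (hours) and booking_nature."""
    out = frame.copy()
    # The raw strings mix "07-09-2013 04:00" and "3/17/2013 11:30";
    # unify the separator first, then parse as month/day/year (assumed).
    for col in ("from_date", "booking_created"):
        out[col] = pd.to_datetime(
            out[col].str.replace("-", "/", regex=False),
            format="%m/%d/%Y %H:%M",
        )
    out["day_of_week"] = out["from_date"].dt.dayofweek  # Monday = 0
    out["month"] = out["from_date"].dt.month
    out["is_weekend"] = (out["day_of_week"] >= 5).astype(int)
    out["hour"] = out["from_date"].dt.hour
    out["time_of_day"] = out["hour"].map(part_of_day)
    # Hours between booking creation and trip start.
    delta = out["from_date"] - out["booking_created"]
    out["time_diff"] = delta.dt.total_seconds() / 3600
    # Hypothetical thresholds: <=2h, <=24h, <=72h, beyond.
    out["booking_nature"] = pd.cut(
        out["time_diff"],
        bins=[float("-inf"), 2, 24, 72, float("inf")],
        labels=["Urgent", "SameDay", "Regular", "Advance"],
    )
    return out.drop(columns=["from_date", "booking_created"])


# Two rows taken from the sampled output of the dataset.
demo = pd.DataFrame({
    "from_date": ["3/17/2013 11:30", "07-09-2013 04:00"],
    "booking_created": ["3/17/2013 9:39", "07-05-2013 20:37"],
})
features = add_time_features(demo)
```

Binning `time_diff` rather than using it raw is what makes the outliers mentioned above harmless: a booking made months ahead simply lands in the last bucket.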


## For travel_type_id = 2

- package_id - drop - all null values
- travel_type_id - drop - all same values
- from_area_id, to_area_id - combine into an intra-city route column (from_area_id -> to_area_id), then transform it into cancellation volume, i.e. Zero, High, Medium or Low cancellation routes
- from_city_id - drop - only partially populated for this travel type (not entirely null; some rows carry a value such as 15.0)
- to_city_id - drop - all values null
- from_date - transform into dayOfWeek, Month, Weekday/Weekend and TimeOfDay (Hour, or Morning / Afternoon / Evening / Night)
- time_diff - calculate the number of hours between booking creation (booking_created) and the trip start time (from_date)
- booking_nature - transform time_diff into Urgent, SameDay, Regular or Advance bookings, since time_diff has >10% outliers
- online_booking | mobile_site_booking - keep
- from_date - drop
- booking_created - drop
- from_lat, from_long, to_lat, to_long - impute missing values using the median
- from these lat/long columns - calculate the distance in km (geopy), then drop all four lat/long columns
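
The distance step can be illustrated without any third-party dependency. The plan names geopy for this; the sketch below swaps in the haversine great-circle formula as a stand-in, assuming a spherical Earth with a mean radius of 6371 km. The function name `haversine_km` is hypothetical. Results will differ slightly from an ellipsoidal geodesic distance, which should not matter much for intra-city trips.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, long) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Coordinates taken from one sampled travel_type_id = 2 row.
trip_km = haversine_km(12.90245, 77.66081, 13.19956, 77.70688)
```

On a DataFrame this could be applied row-wise, for example with `tt2.apply(lambda r: haversine_km(r.from_lat, r.from_long, r.to_lat, r.to_long), axis=1)`, after the median imputation.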


## For travel_type_id = 3

- package_id - keep
- travel_type_id - drop - all same values
- from_area_id - transform into cancellation volume, i.e. Zero, High, Medium or Low cancellation area, then drop the original column
- to_area_id - drop - all values null
- from_city_id - drop - all values null
- to_city_id - drop - all values null
- from_date - transform into dayOfWeek, Month, Weekday/Weekend and TimeOfDay (Hour, or Morning / Afternoon / Evening / Night)
- time_diff - calculate the number of hours between booking creation (booking_created) and the trip start time (from_date)
- booking_nature - transform time_diff into Urgent, SameDay, Regular or Advance bookings, since time_diff has >10% outliers
- online_booking | mobile_site_booking - keep
- from_date - drop
- booking_created - drop
- from_lat, from_long, to_lat, to_long - drop all four: to_lat and to_long are entirely null, and the information in from_lat and from_long is already captured by from_area_id
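
The "cancellation volume" transform used for areas and routes in all three subsets might look like the sketch below. The plan does not define where Low ends and High begins, so the 5% and 15% cut-offs are placeholders, and the names `cancellation_bucket` and `add_cancellation_volume` are hypothetical.

```python
import pandas as pd


def cancellation_bucket(rate: float, low: float = 0.05, high: float = 0.15) -> str:
    """Bucket a cancellation rate; the low/high cut-offs are assumptions."""
    if rate == 0:
        return "Zero"
    if rate < low:
        return "Low"
    if rate < high:
        return "Medium"
    return "High"


def add_cancellation_volume(
    frame: pd.DataFrame, key: str, target: str = "Car_Cancellation"
) -> pd.DataFrame:
    """Attach a Zero/Low/Medium/High label based on the per-key cancellation rate."""
    # Share of cancelled bookings for each area (or route) id.
    rates = frame.groupby(key)[target].mean()
    out = frame.copy()
    # Rows whose key is missing stay NaN instead of being bucketed.
    out[f"{key}_cancel_volume"] = (
        out[key].map(rates).map(cancellation_bucket, na_action="ignore")
    )
    return out


demo = pd.DataFrame({
    "from_area_id": [1, 1, 2, 2, None],
    "Car_Cancellation": [0, 0, 1, 0, 1],
})
labelled = add_cancellation_volume(demo, "from_area_id")
```

Because the label is derived from the target, the rates would normally be computed on the training split only and then mapped onto the test split, otherwise the feature leaks the outcome.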


## Preprocess the subsets as required (e.g. label encoding)
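
As one possible way to label-encode the derived categorical columns, the sketch below uses pandas category codes and keeps the code-to-label mapping for later inspection. The helper name `label_encode` is hypothetical, and the column names in the demo are the assumed outputs of the earlier transforms.

```python
import pandas as pd


def label_encode(frame: pd.DataFrame, columns: list[str]):
    """Replace each listed column with integer codes; return the frame and mappings."""
    out = frame.copy()
    mappings = {}
    for col in columns:
        as_category = out[col].astype("category")
        # code -> original label, so encoded values can be decoded later
        mappings[col] = dict(enumerate(as_category.cat.categories))
        # Missing values receive the code -1.
        out[col] = as_category.cat.codes
    return out, mappings


demo = pd.DataFrame({"booking_nature": ["Urgent", "Advance", "Urgent"]})
encoded, mappings = label_encode(demo, ["booking_nature"])
```

Tree-based models generally tolerate such integer codes; for Naive Bayes the choice of encoding interacts with the variant used, so this step may need revisiting per model.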


## Machine Learning - Classification
- Decision Tree
- Random Forest
- Naive Bayes Classifier

## Compare performance and select the best model for each of the three datasets
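
A comparison of the three classifiers could be sketched with scikit-learn as below. Several choices here are assumptions rather than part of the plan: `GaussianNB` as the Naive Bayes variant, 5-fold cross-validation, F1 as the metric (accuracy can mislead if cancellations turn out to be rare), and the helper name `compare_models`. The demo runs on synthetic data, not on the cab bookings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier


def compare_models(X, y, cv: int = 5, scoring: str = "f1") -> dict:
    """Return the mean cross-validated score for each candidate model."""
    candidates = {
        "Decision Tree": DecisionTreeClassifier(random_state=0),
        "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
        "Naive Bayes": GaussianNB(),
    }
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring=scoring).mean()
        for name, model in candidates.items()
    }


# Synthetic stand-in for one preprocessed subset (e.g. tt1).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
scores = compare_models(X_demo, y_demo)
best_name = max(scores, key=scores.get)
```

Running the same function once per subset gives three score tables, from which the best model for each travel type can be picked.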


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('YourCabs.csv')

In [3]:
df.sample(5)

,id,user_id,vehicle_model_id,travel_type_id,package_id,from_area_id,to_area_id,from_city_id,to_city_id,from_date,online_booking,mobile_site_booking,booking_created,from_lat,from_long,to_lat,to_long,Car_Cancellation
21221,158627,29351,12,2,,1155.0,393.0,,,07-09-2013 04:00,0,0,07-05-2013 20:37,12.90245,77.66081,13.19956,77.70688,0
7228,141362,26695,12,3,1.0,846.0,,,,3/17/2013 11:30,0,0,3/17/2013 9:39,12.98635,77.58203,,,0
37774,178971,45225,12,3,1.0,293.0,,,,10/14/2013 10:00,1,0,10/13/2013 21:10,12.849482,77.663187,,,1
5566,139335,23786,12,2,,393.0,1044.0,,,3/22/2013 22:00,0,0,2/26/2013 13:50,13.19956,77.70688,12.968887,77.644329,0
13848,149488,31042,12,2,,410.0,1194.0,,,5/16/2013 12:30,0,0,5/16/2013 11:31,13.05121,77.54113,13.00446,77.56923,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43431 entries, 0 to 43430
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   43431 non-null  int64  
 1   user_id              43431 non-null  int64  
 2   vehicle_model_id     43431 non-null  int64  
 3   travel_type_id       43431 non-null  int64  
 4   package_id           7550 non-null   float64
 5   from_area_id         43343 non-null  float64
 6   to_area_id           34293 non-null  float64
 7   from_city_id         16345 non-null  float64
 8   to_city_id           1588 non-null   float64
 9   from_date            43431 non-null  object 
 10  online_booking       43431 non-null  int64  
 11  mobile_site_booking  43431 non-null  int64  
 12  booking_created      43431 non-null  object 
 13  from_lat             43338 non-null  float64
 14  from_long            43338 non-null  float64
 15  to_lat               34293 non-null 

## Set id as Index

In [5]:
df.set_index('id', inplace=True)

In [6]:
df.sample(5)

id,user_id,vehicle_model_id,travel_type_id,package_id,from_area_id,to_area_id,from_city_id,to_city_id,from_date,online_booking,mobile_site_booking,booking_created,from_lat,from_long,to_lat,to_long,Car_Cancellation
147642,30060,12,2,,1097.0,393.0,,,05-05-2013 16:15,0,0,05-05-2013 12:49,12.97943,77.66125,13.19956,77.70688,0
134354,23168,12,2,,1391.0,393.0,,,1/18/2013 16:15,0,0,1/18/2013 15:41,12.970926,77.647321,13.19956,77.70688,0
140188,19206,12,2,,949.0,393.0,,,03-06-2013 10:45,0,0,03-06-2013 08:53,12.98275,77.61582,13.19956,77.70688,0
183436,29648,12,3,6.0,1015.0,,,,11-07-2013 08:30,0,0,11-06-2013 23:08,12.93086,77.57769,,,0
181134,46261,65,1,,571.0,,15.0,146.0,10/28/2013 18:30,0,0,10/26/2013 11:44,12.95185,77.69642,,,0


## Dropping Duplicates

In [7]:
df.duplicated().sum()

np.int64(41)

In [8]:
df.drop_duplicates(inplace=True)

In [9]:
print(df.duplicated().sum())

0


In [10]:
pd.set_option('display.max_rows', None)  # show every row of the value counts
df['vehicle_model_id'].value_counts()

vehicle_model_id
12    31822
85     2407
89     2391
65     1911
28     1701
24     1494
87      563
90      312
23      297
86      123
10      104
64       85
54       73
17       40
91       25
30       14
36        9
13        7
72        2
43        2
1         2
76        1
69        1
14        1
75        1
70        1
39        1
Name: count, dtype: int64

## Dropping user_id and vehicle_model_id

In [11]:
df.drop(columns=['user_id', 'vehicle_model_id'], inplace=True)

In [12]:
df.sample(2)

id,travel_type_id,package_id,from_area_id,to_area_id,from_city_id,to_city_id,from_date,online_booking,mobile_site_booking,booking_created,from_lat,from_long,to_lat,to_long,Car_Cancellation
146615,2,,1052.0,393.0,,,05-02-2013 05:00,1,0,4/28/2013 22:14,12.912695,77.576265,13.19956,77.70688,0
148229,2,,149.0,1189.0,,,05-09-2013 19:30,0,0,05-09-2013 18:31,12.93022,77.56039,12.91873,77.61494,0


## Checking datatypes

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 43390 entries, 132512 to 185941
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   travel_type_id       43390 non-null  int64  
 1   package_id           7541 non-null   float64
 2   from_area_id         43302 non-null  float64
 3   to_area_id           34261 non-null  float64
 4   from_city_id         16325 non-null  float64
 5   to_city_id           1588 non-null   float64
 6   from_date            43390 non-null  object 
 7   online_booking       43390 non-null  int64  
 8   mobile_site_booking  43390 non-null  int64  
 9   booking_created      43390 non-null  object 
 10  from_lat             43297 non-null  float64
 11  from_long            43297 non-null  float64
 12  to_lat               34261 non-null  float64
 13  to_long              34261 non-null  float64
 14  Car_Cancellation     43390 non-null  int64  
dtypes: float64(9), int64(4), object(2)


In [14]:
df['travel_type_id'].value_counts()

travel_type_id
2    34260
3     7541
1     1589
Name: count, dtype: int64

## Dividing the datasets into 3 based on travel_type_id

In [15]:
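# .copy() makes each subset independent, so later column edits do not raise SettingWithCopyWarning
tt1 = df.loc[df.travel_type_id == 1].copy()
tt2 = df.loc[df.travel_type_id == 2].copy()
tt3 = df.loc[df.travel_type_id == 3].copy()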
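
Before preprocessing each subset, it can help to confirm which columns really are all null per travel type, since the plan relies on that. The sketch below computes the share of missing values per column within each group; the helper name `null_share_by_group` is hypothetical and the demo frame is made up for illustration.

```python
import pandas as pd


def null_share_by_group(frame: pd.DataFrame, by: str) -> pd.DataFrame:
    """Fraction of missing values per column, computed within each group of `by`."""
    # isna() yields booleans; their mean per group is the share of nulls.
    return frame.drop(columns=by).isna().groupby(frame[by]).mean()


demo = pd.DataFrame({
    "travel_type_id": [1, 1, 2, 2],
    "package_id": [None, None, None, 1.0],
    "from_city_id": [15.0, 15.0, 15.0, None],
})
shares = null_share_by_group(demo, "travel_type_id")
```

A value of 1.0 means the column is entirely null for that travel type and can be dropped safely; anything lower deserves a second look before dropping.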

In [16]:
tt1.sample(3)

id,travel_type_id,package_id,from_area_id,to_area_id,from_city_id,to_city_id,from_date,online_booking,mobile_site_booking,booking_created,from_lat,from_long,to_lat,to_long,Car_Cancellation
149974,1,,1243.0,,15.0,29.0,5/19/2013 8:00,0,0,5/18/2013 21:26,12.98275,77.61582,,,0
178582,1,,1063.0,,15.0,45.0,10-12-2013 03:15,1,0,10-11-2013 22:15,12.934477,77.611284,,,0
161805,1,,1096.0,,15.0,32.0,7/21/2013 10:30,0,0,7/21/2013 9:10,12.96519,77.71932,,,0


In [17]:
tt2.sample(3)

id,travel_type_id,package_id,from_area_id,to_area_id,from_city_id,to_city_id,from_date,online_booking,mobile_site_booking,booking_created,from_lat,from_long,to_lat,to_long,Car_Cancellation
182821,2,,156.0,293.0,15.0,,11-03-2013 20:00,0,0,11-02-2013 09:15,13.02622,77.70143,12.849482,77.663187,0
149107,2,,136.0,142.0,,,5/14/2013 19:45,0,0,5/13/2013 19:38,12.90796,77.62418,12.91281,77.60923,0
145524,2,,393.0,271.0,,,4/21/2013 12:00,0,0,4/21/2013 10:31,13.19956,77.70688,12.95641,77.64076,0


In [18]:
tt3.sample(3)

id,travel_type_id,package_id,from_area_id,to_area_id,from_city_id,to_city_id,from_date,online_booking,mobile_site_booking,booking_created,from_lat,from_long,to_lat,to_long,Car_Cancellation
137424,3,1.0,357.0,,,,02-11-2013 10:00,0,0,02-11-2013 07:47,13.03064,77.6491,,,0
149849,3,2.0,1330.0,,,,5/19/2013 7:30,0,0,5/18/2013 12:36,12.953434,77.70651,,,0
159267,3,2.0,149.0,,,,07-08-2013 07:45,0,0,07-07-2013 22:38,12.93022,77.56039,,,0


## Preprocessing `tt1`