# Regression - Predict 

# <center>Definition</center>

## Predict Overview

Logistics in Sub-Saharan Africa increases the cost of manufactured goods by up to 320%, while in Europe it accounts for only up to 90% of the manufacturing cost.
Delivery time prediction has long been a part of city logistics, but improving its accuracy has recently become very important for services such as Sendy, Mr delivery and Uber Eats, which deliver goods on demand.

These and similar services must receive an order and deliver it in the shortest possible time to satisfy their users. In these situations a difference of ±20 minutes matters, so the initial prediction must be highly accurate and any delays must be communicated effectively to protect the customer experience. In addition, the solution will enable service providers to realise cost savings, and ultimately reduce the cost of doing business, through improved resource management and planning for order scheduling.
This project was hosted by https://www.sendyit.com/ in partnership with the insight2impact facility.

<img src="https://raw.githubusercontent.com/rufusseopa/Team_18_JHB_WhatTheHack_regression-predict-api-template/master/meme.png" width="300" height="300" align="center"/>

## Problem Statement 

The goal is to build a general model that takes data about a goods delivery order as input and outputs the estimated delivery time, from the point of driver pickup to the point of arrival at the final destination.

##  Performance Metrics

There are various metrics which we can use to evaluate the performance of ML algorithms, classification as well as regression algorithms. We must carefully choose the metrics for evaluating ML model performance because they will be used to judge our model’s effectiveness.

We evaluate the model on the testing data using the Root Mean Square Error (RMSE):
$$
RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^N (p_i - y_i)^2}
$$
where $p_i$ refers to the $i^{\rm th}$ prediction made from `X_test`, $y_i$ refers to the $i^{\rm th}$ value in `y_test`, and $N$ is the length of `y_test`.

We will also use the R-squared ($R^2$) metric, which is generally used for explanatory purposes and indicates the goodness of fit of a set of predicted values $\hat{y}_i$ (the predictions written as $p_i$ above) to the actual values $y_i$.

It is computed as follows, where $\bar{y}$ is the mean of the actual values:


$$R^2 = 1 - \frac{\sum_{i=1}^N(y_i-\hat{y}_i)^2}{\sum_{i=1}^N(y_i-\bar{y})^2}$$ 
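Both metrics can be computed directly from their definitions. The following is a minimal sketch with NumPy; the function names `rmse` and `r_squared` are illustrative, the `y_test` values are the first four target values shown in the data preview, and the predictions are hypothetical numbers chosen so that every error is exactly 100 seconds.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return float(1 - ss_res / ss_tot)

# Actual delivery times in seconds and hypothetical predictions
y_test = [745, 2886, 2615, 2986]
predictions = [845, 2786, 2715, 2886]

rmse_value = rmse(y_test, predictions)
r2_value = r_squared(y_test, predictions)
print(rmse_value, r2_value)
```

RMSE is expressed in the same unit as the target (seconds), which makes it easy to interpret, while $R^2$ is unitless and compares the model against always predicting the mean.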




# <center>Exploratory Data Analysis</center>

## Data Exploration 

### Modules Imports 

In [141]:
# Import modules
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import numpy as np
import datetime
import category_encoders as ce

# Figures inline and set visualization style
%matplotlib inline
sns.set()


### Data Loading

In [111]:
train_df=pd.read_csv('https://raw.githubusercontent.com/rufusseopa/Team_18_JHB_WhatTheHack_regression-predict-api-template/master/Data/Train.csv')
test_df=pd.read_csv('https://raw.githubusercontent.com/rufusseopa/Team_18_JHB_WhatTheHack_regression-predict-api-template/master/Data/Test.csv')
riders_df=pd.read_csv('https://raw.githubusercontent.com/rufusseopa/Team_18_JHB_WhatTheHack_regression-predict-api-template/master/Data/Riders.csv')
variable_df=pd.read_csv('https://raw.githubusercontent.com/rufusseopa/Team_18_JHB_WhatTheHack_regression-predict-api-template/master/Data/VariableDefinitions.csv')

In [129]:
train_df.head(10)

Unnamed: 0,Platform Type,Personal or Business,Placement - Day of Month,Placement - Weekday (Mo = 1),Placement - Time,Confirmation - Day of Month,Confirmation - Weekday (Mo = 1),Confirmation - Time,Arrival at Pickup - Day of Month,Arrival at Pickup - Weekday (Mo = 1),...,Pickup Lat,Pickup Long,Destination Lat,Destination Long,Rider Id,Time from Pickup to Arrival,No_Of_Orders,Age,Average_Rating,No_of_Ratings
0,3,Business,9,5,9:35:46 AM,9,5,9:40:10 AM,9,5,...,-1.317755,36.83037,-1.300406,36.829741,Rider_Id_432,745,1637,1309,13.8,549
1,3,Personal,18,5,3:41:17 PM,18,5,3:41:30 PM,18,5,...,-1.326774,36.787807,-1.356237,36.904295,Rider_Id_432,2886,1637,1309,13.8,549
2,3,Business,31,5,12:51:41 PM,31,5,1:12:49 PM,31,5,...,-1.255189,36.782203,-1.273412,36.818206,Rider_Id_432,2615,1637,1309,13.8,549
3,3,Personal,2,2,7:12:10 AM,2,2,7:12:29 AM,2,2,...,-1.290315,36.757377,-1.22352,36.802061,Rider_Id_432,2986,1637,1309,13.8,549
4,2,Personal,22,2,10:40:58 AM,22,2,10:42:24 AM,22,2,...,-1.273524,36.79922,-1.300431,36.752427,Rider_Id_432,1602,1637,1309,13.8,549
5,3,Business,29,3,12:14:43 PM,29,3,12:15:51 PM,29,3,...,-1.267427,36.787083,-1.34364,36.892534,Rider_Id_432,2313,1637,1309,13.8,549
6,1,Business,2,2,9:08:42 AM,2,2,9:08:57 AM,2,2,...,-1.226887,36.807395,-1.259102,36.800577,Rider_Id_432,1638,1637,1309,13.8,549
7,1,Personal,19,5,4:12:56 PM,19,5,4:16:19 PM,19,5,...,-1.25628,36.807586,-1.298135,36.821382,Rider_Id_432,1897,1637,1309,13.8,549
8,1,Personal,28,1,3:03:10 PM,28,1,3:04:38 PM,28,1,...,-1.27952,36.816829,-1.328128,36.8719,Rider_Id_432,1698,1637,1309,13.8,549
9,3,Business,14,4,2:42:58 PM,14,4,2:44:37 PM,14,4,...,-1.255189,36.782203,-1.261323,36.865718,Rider_Id_432,2693,1637,1309,13.8,549


In [130]:
train_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Platform Type,21201.0,2.752182,0.625178,1.0,3.0,3.0,3.0,4.0
Placement - Day of Month,21201.0,15.653696,8.798916,1.0,8.0,15.0,23.0,31.0
Placement - Weekday (Mo = 1),21201.0,3.240083,1.567295,1.0,2.0,3.0,5.0,7.0
Confirmation - Day of Month,21201.0,15.653837,8.798886,1.0,8.0,15.0,23.0,31.0
Confirmation - Weekday (Mo = 1),21201.0,3.240225,1.567228,1.0,2.0,3.0,5.0,7.0
Arrival at Pickup - Day of Month,21201.0,15.653837,8.798886,1.0,8.0,15.0,23.0,31.0
Arrival at Pickup - Weekday (Mo = 1),21201.0,3.240225,1.567228,1.0,2.0,3.0,5.0,7.0
Pickup - Day of Month,21201.0,15.653837,8.798886,1.0,8.0,15.0,23.0,31.0
Pickup - Weekday (Mo = 1),21201.0,3.240225,1.567228,1.0,2.0,3.0,5.0,7.0
Arrival at Destination - Day of Month,21201.0,15.653837,8.798886,1.0,8.0,15.0,23.0,31.0


In [113]:
test_df.head(2)

Unnamed: 0,Order No,User Id,Vehicle Type,Platform Type,Personal or Business,Placement - Day of Month,Placement - Weekday (Mo = 1),Placement - Time,Confirmation - Day of Month,Confirmation - Weekday (Mo = 1),...,Pickup - Weekday (Mo = 1),Pickup - Time,Distance (KM),Temperature,Precipitation in millimeters,Pickup Lat,Pickup Long,Destination Lat,Destination Long,Rider Id
0,Order_No_19248,User_Id_3355,Bike,3,Business,27,3,4:44:10 PM,27,3,...,3,5:06:47 PM,8,,,-1.333275,36.870815,-1.305249,36.82239,Rider_Id_192
1,Order_No_12736,User_Id_3647,Bike,3,Business,17,5,12:57:35 PM,17,5,...,5,1:25:37 PM,5,,,-1.272639,36.794723,-1.277007,36.823907,Rider_Id_868


In [114]:
riders_df.head(2)

Unnamed: 0,Rider Id,No_Of_Orders,Age,Average_Rating,No_of_Ratings
0,Rider_Id_396,2946,2298,14.0,1159
1,Rider_Id_479,360,951,13.5,176


In [115]:
variable_df.head(2)

Unnamed: 0,Order No,Unique number identifying the order
0,User Id,Unique number identifying the customer on a pl...
1,Vehicle Type,"For this competition limited to bikes, however..."


The `train_df`, `test_df`, and `riders_df` datasets together have the following fields, listed with their data types:

**Features**

* Order No – Unique number identifying the order (object)
* User Id – Unique number identifying the customer on a platform (object)
* Vehicle Type – For this competition limited to bikes, however in practice, Sendy service extends to trucks and vans (object)
* Platform Type – Platform used to place the order, there are 4 types (int64)
* Personal or Business – Customer type (object)

<u> Placement times </u> 

* Placement - Day of Month i.e 1-31 (int64)
* Placement - Weekday (Monday = 1) (int64)
* Placement - Time - Time of day the order was placed (object)

<u>Confirmation times</u> 

* Confirmation - Day of Month i.e 1-31 (int64)
* Confirmation - Weekday (Monday = 1) (int64)
* Confirmation - Time - time of day the order was confirmed by a rider (object)


<u>Arrival at Pickup times</u> 
* Arrival at Pickup - Day of Month i.e 1-31 (int64)
* Arrival at Pickup - Weekday (Monday = 1) (int64)
* Arrival at Pickup - Time - Time of day the rider arrived at the location to pick up the order - as marked by the rider through the Sendy application (object)


<u>Pickup times</u> 

* Pickup - Day of Month i.e 1-31 (int64)
* Pickup - Weekday (Monday = 1) (int64)
* Pickup - Time - Time of day the rider picked up the order - as marked by the rider through the Sendy application (object)

<u> Arrival at Destination times</u> 

* Arrival at Destination - Day of Month i.e 1-31 (int64)
* Arrival at Destination - Weekday (Monday = 1) (int64)
* Arrival at Destination - Time - Time of day the rider arrived at the destination to deliver the order - as marked by the rider through the Sendy application (object)

<u> Location</u> 

* Distance covered (KM) - The distance from Pickup to Destination (int64)
* Pickup Latitude and Longitude - Latitude and longitude of pick up location (float64)
* Destination Latitude and Longitude - Latitude and longitude of delivery location (float64)

<u> Weather </u>

* Temperature - Temperature at the time of order placement in Degrees Celsius (measured every three hours) (float64)
* Precipitation in Millimeters - Precipitation at the time of order placement (measured every three hours) (float64)


<u> Rider metrics </u>

* Rider ID – Unique number identifying the rider (same as in order details) (object)
* No of Orders – Number of Orders the rider has delivered  (int64)
* Age – Number of days since the rider delivered the first order (int64)
* Average Rating – Average rating of the rider (float64)
* No of Ratings - Number of ratings the rider has received. Rating an order is optional for the customer. (int64)

**Response Variable**
* Time from Pickup to Arrival - Time in seconds between ‘Pickup’ and ‘Arrival at Destination’ - calculated from the columns for the purpose of facilitating the task  (int64)
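The time-of-day fields are stored as strings such as `9:35:46 AM`, so they need parsing before any arithmetic. The sketch below shows how the response variable relates to the two timestamps. The sample values are hypothetical, and the column name `Arrival at Destination - Time` is assumed by analogy with the other time columns.

```python
import pandas as pd

# Hypothetical sample rows; the real values come from train_df
sample = pd.DataFrame({
    'Pickup - Time': ['10:27:30 AM', '11:57:54 AM'],
    'Arrival at Destination - Time': ['10:39:55 AM', '12:31:20 PM'],
})

# 12-hour clock with seconds and an AM/PM marker
time_format = '%I:%M:%S %p'
pickup = pd.to_datetime(sample['Pickup - Time'], format=time_format)
arrival = pd.to_datetime(sample['Arrival at Destination - Time'], format=time_format)

# Elapsed time in seconds between pickup and arrival
sample['duration_seconds'] = (arrival - pickup).dt.total_seconds().astype(int)
print(sample)
```

This simple subtraction assumes both timestamps fall on the same day; a delivery that crosses midnight would need the day-of-month columns as well.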

In [116]:
variable_df.head()

Unnamed: 0,Order No,Unique number identifying the order
0,User Id,Unique number identifying the customer on a pl...
1,Vehicle Type,"For this competition limited to bikes, however..."
2,Platform Type,"Platform used to place the order, there are 4 ..."
3,Personal or Business,Customer type
4,Placement - Day of Month,Placement - Day of Month i.e 1-31


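`variable_df` holds the variable information that was used to compile the above list of variables and their descriptions. Because the file was read with the default header setting, its first definition (`Order No`) appears as the column header rather than as a row.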
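If the definitions file has no header row, as the output above suggests, it can be read with `header=None` so that the first definition is kept as data. This is a minimal sketch on an in-memory copy of the first three definitions; the column names `Variable` and `Definition` are hypothetical labels, not names taken from the file.

```python
import io
import pandas as pd

# In-memory stand-in for the first lines of VariableDefinitions.csv
csv_text = '''Order No,Unique number identifying the order
User Id,Unique number identifying the customer on a platform
Vehicle Type,"For this competition limited to bikes, however in practice, Sendy service extends to trucks and vans"
'''

# header=None keeps the first line as a data row; names supplies column labels
definitions = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=['Variable', 'Definition'],
)
print(definitions)
```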

In [117]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21201 entries, 0 to 21200
Data columns (total 29 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   Order No                                   21201 non-null  object 
 1   User Id                                    21201 non-null  object 
 2   Vehicle Type                               21201 non-null  object 
 3   Platform Type                              21201 non-null  int64  
 4   Personal or Business                       21201 non-null  object 
 5   Placement - Day of Month                   21201 non-null  int64  
 6   Placement - Weekday (Mo = 1)               21201 non-null  int64  
 7   Placement - Time                           21201 non-null  object 
 8   Confirmation - Day of Month                21201 non-null  int64  
 9   Confirmation - Weekday (Mo = 1)            21201 non-null  int64  
 10  Confirmation - Time   

In [118]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7068 entries, 0 to 7067
Data columns (total 25 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Order No                              7068 non-null   object 
 1   User Id                               7068 non-null   object 
 2   Vehicle Type                          7068 non-null   object 
 3   Platform Type                         7068 non-null   int64  
 4   Personal or Business                  7068 non-null   object 
 5   Placement - Day of Month              7068 non-null   int64  
 6   Placement - Weekday (Mo = 1)          7068 non-null   int64  
 7   Placement - Time                      7068 non-null   object 
 8   Confirmation - Day of Month           7068 non-null   int64  
 9   Confirmation - Weekday (Mo = 1)       7068 non-null   int64  
 10  Confirmation - Time                   7068 non-null   object 
 11  Arrival at Pickup

As the fields above suggest, some of the features may not be needed in our model, which means they must be discarded during preprocessing.

Lastly, the test set has 25 columns against 29 in the training set. In particular, the `Time from Pickup to Arrival` column is missing from the test set because it is the value that our model will predict.
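A quick way to see exactly which columns differ between the two sets is to compare their column lists. The sketch below uses two small hypothetical frames in place of `train_df` and `test_df`; only the column names matter here.

```python
import pandas as pd

# Hypothetical miniature frames standing in for train_df and test_df
toy_train = pd.DataFrame(columns=[
    'Order No', 'Distance (KM)', 'Rider Id', 'Time from Pickup to Arrival',
])
toy_test = pd.DataFrame(columns=['Order No', 'Distance (KM)', 'Rider Id'])

# Columns present in the training set but absent from the test set
missing_in_test = [col for col in toy_train.columns if col not in toy_test.columns]
print(missing_in_test)
```

Any column reported here is either the target or information that only exists after delivery, so it cannot be used as a model input.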

## Exploratory Visualization 

## Algorithms and Techniques

# <center>Methodology</center>

## Data Preprocessing 

In [119]:
## dealing with missing values

In [153]:
train_df=pd.read_csv('https://raw.githubusercontent.com/rufusseopa/Team_18_JHB_WhatTheHack_regression-predict-api-template/master/Data/Train.csv', index_col=0)
test_df=pd.read_csv('https://raw.githubusercontent.com/rufusseopa/Team_18_JHB_WhatTheHack_regression-predict-api-template/master/Data/Test.csv',index_col=0)
riders_df=pd.read_csv('https://raw.githubusercontent.com/rufusseopa/Team_18_JHB_WhatTheHack_regression-predict-api-template/master/Data/Riders.csv')
variable_df=pd.read_csv('https://raw.githubusercontent.com/rufusseopa/Team_18_JHB_WhatTheHack_regression-predict-api-template/master/Data/VariableDefinitions.csv')

In [154]:
test_df.head()

Order No,User Id,Vehicle Type,Platform Type,Personal or Business,Placement - Day of Month,Placement - Weekday (Mo = 1),Placement - Time,Confirmation - Day of Month,Confirmation - Weekday (Mo = 1),Confirmation - Time,...,Pickup - Weekday (Mo = 1),Pickup - Time,Distance (KM),Temperature,Precipitation in millimeters,Pickup Lat,Pickup Long,Destination Lat,Destination Long,Rider Id
Order_No_19248,User_Id_3355,Bike,3,Business,27,3,4:44:10 PM,27,3,4:44:29 PM,...,3,5:06:47 PM,8,,,-1.333275,36.870815,-1.305249,36.82239,Rider_Id_192
Order_No_12736,User_Id_3647,Bike,3,Business,17,5,12:57:35 PM,17,5,12:59:17 PM,...,5,1:25:37 PM,5,,,-1.272639,36.794723,-1.277007,36.823907,Rider_Id_868
Order_No_768,User_Id_2154,Bike,3,Business,27,4,11:08:14 AM,27,4,11:25:05 AM,...,4,11:57:54 AM,5,22.8,,-1.290894,36.822971,-1.276574,36.851365,Rider_Id_26
Order_No_15332,User_Id_2910,Bike,3,Business,17,1,1:51:35 PM,17,1,1:53:27 PM,...,1,2:16:52 PM,5,24.5,,-1.290503,36.809646,-1.303382,36.790658,Rider_Id_685
Order_No_21373,User_Id_1205,Bike,3,Business,11,2,11:30:28 AM,11,2,11:34:45 AM,...,2,11:56:04 AM,6,24.4,,-1.281081,36.814423,-1.266467,36.792161,Rider_Id_858


## Dropping Columns

In [155]:
### START FUNCTION
def drop_columns(input_df, threshold=70):
    """Return a copy of input_df without the columns whose percentage of missing values exceeds threshold."""
    # Percentage of missing values in each column
    null_pct = input_df.isnull().mean() * 100
    # Names of the columns above the threshold
    drop_names = null_pct[null_pct > threshold].index
    return input_df.drop(drop_names, axis=1)

In [156]:
drop_columns(train_df).info()

<class 'pandas.core.frame.DataFrame'>
Index: 21201 entries, Order_No_4211 to Order_No_9836
Data columns (total 27 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   User Id                                    21201 non-null  object 
 1   Vehicle Type                               21201 non-null  object 
 2   Platform Type                              21201 non-null  int64  
 3   Personal or Business                       21201 non-null  object 
 4   Placement - Day of Month                   21201 non-null  int64  
 5   Placement - Weekday (Mo = 1)               21201 non-null  int64  
 6   Placement - Time                           21201 non-null  object 
 7   Confirmation - Day of Month                21201 non-null  int64  
 8   Confirmation - Weekday (Mo = 1)            21201 non-null  int64  
 9   Confirmation - Time                        21201 non-null  object 
 10  Arrival
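The call above only displays the result of `drop_columns`; the returned frame is not assigned, which is why `Precipitation in millimeters` (552 non-null values out of 21,201 rows, roughly 97% missing) still appears in the summary statistics later on. A minimal sketch of keeping the result and applying the same decision to the test set follows, using hypothetical toy frames.

```python
import numpy as np
import pandas as pd

def drop_columns(input_df, threshold=70):
    """Return a copy without columns whose missing percentage exceeds threshold."""
    null_pct = input_df.isnull().mean() * 100
    return input_df.drop(columns=null_pct[null_pct > threshold].index)

# Hypothetical toy data: precipitation is 75% missing, temperature 25%
toy_train = pd.DataFrame({
    'Distance (KM)': [4, 16, 3, 9],
    'Temperature': [20.4, np.nan, 24.5, 19.2],
    'Precipitation in millimeters': [np.nan, np.nan, np.nan, 1.2],
})
toy_test = pd.DataFrame({
    'Distance (KM)': [8, 5],
    'Temperature': [np.nan, 22.8],
    'Precipitation in millimeters': [np.nan, np.nan],
})

# Assign the result so the drop takes effect
cleaned_train = drop_columns(toy_train)

# Drop the same columns from the test set so both frames stay aligned
dropped = [col for col in toy_train.columns if col not in cleaned_train.columns]
cleaned_test = toy_test.drop(columns=dropped)
print(dropped)
```

Deciding which columns to drop on the training set alone, and then mirroring that decision on the test set, keeps the two feature sets identical.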

In [157]:
train_df = train_df.drop(['User Id', 'Vehicle Type'], axis=1)

In [158]:
train_df = train_df.merge(riders_df, on='Rider Id')
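By default `merge` performs an inner join and returns a fresh integer index, which is why the `Order No` index no longer appears in the output that follows. One possible alternative, sketched here on hypothetical toy frames, is a left join that preserves every order, checks that each rider appears only once in the rider table, and restores the order index.

```python
import pandas as pd

# Hypothetical toy frames standing in for train_df and riders_df
orders = pd.DataFrame({
    'Order No': ['Order_No_1', 'Order_No_2', 'Order_No_3'],
    'Rider Id': ['Rider_Id_432', 'Rider_Id_856', 'Rider_Id_432'],
}).set_index('Order No')
riders = pd.DataFrame({
    'Rider Id': ['Rider_Id_432', 'Rider_Id_856'],
    'No_Of_Orders': [1637, 396],
})

merged = (
    orders.reset_index()                      # keep 'Order No' as a column during the merge
    .merge(riders, on='Rider Id', how='left',  # left join keeps every order, in order
           validate='many_to_one')             # raises if a rider id is duplicated in riders
    .set_index('Order No')
)

# The row count must be unchanged by the join
assert len(merged) == len(orders)
print(merged)
```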

In [159]:
train_df.head()

Unnamed: 0,Platform Type,Personal or Business,Placement - Day of Month,Placement - Weekday (Mo = 1),Placement - Time,Confirmation - Day of Month,Confirmation - Weekday (Mo = 1),Confirmation - Time,Arrival at Pickup - Day of Month,Arrival at Pickup - Weekday (Mo = 1),...,Pickup Lat,Pickup Long,Destination Lat,Destination Long,Rider Id,Time from Pickup to Arrival,No_Of_Orders,Age,Average_Rating,No_of_Ratings
0,3,Business,9,5,9:35:46 AM,9,5,9:40:10 AM,9,5,...,-1.317755,36.83037,-1.300406,36.829741,Rider_Id_432,745,1637,1309,13.8,549
1,3,Personal,18,5,3:41:17 PM,18,5,3:41:30 PM,18,5,...,-1.326774,36.787807,-1.356237,36.904295,Rider_Id_432,2886,1637,1309,13.8,549
2,3,Business,31,5,12:51:41 PM,31,5,1:12:49 PM,31,5,...,-1.255189,36.782203,-1.273412,36.818206,Rider_Id_432,2615,1637,1309,13.8,549
3,3,Personal,2,2,7:12:10 AM,2,2,7:12:29 AM,2,2,...,-1.290315,36.757377,-1.22352,36.802061,Rider_Id_432,2986,1637,1309,13.8,549
4,2,Personal,22,2,10:40:58 AM,22,2,10:42:24 AM,22,2,...,-1.273524,36.79922,-1.300431,36.752427,Rider_Id_432,1602,1637,1309,13.8,549



In [None]:
fig, ax = plt.subplots(1, 2, figsize=(20, 8))
sns.heatmap(train_df.isna(), ax = ax[0])
sns.heatmap(test_df.isna(), ax = ax[1])
ax[0].set_title('Train missing values heatmap')
ax[1].set_title('Test missing values heatmap')

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(20, 8))
# histplot replaces the deprecated distplot and draws a plain histogram (no KDE) by default
sns.histplot(train_df['Temperature'], bins=30, color='blue', ax=ax[0])
sns.histplot(test_df['Temperature'], bins=30, color='green', ax=ax[1])
ax[0].set_title('Temperature Distribution for the train dataset')
ax[1].set_title('Temperature Distribution for the test dataset')
# Red line marks the mean, orange line marks the median
ax[0].axvline(x=train_df['Temperature'].mean(), color='red')
ax[0].axvline(x=train_df['Temperature'].median(), color='orange')
ax[1].axvline(x=test_df['Temperature'].mean(), color='red')
ax[1].axvline(x=test_df['Temperature'].median(), color='orange')

In [None]:
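# Fill missing temperatures with the column mean; assigning back avoids chained in-place modification
train_df['Temperature'] = train_df['Temperature'].fillna(train_df['Temperature'].mean())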
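The test set has missing temperatures too. A common approach, sketched here on hypothetical values, is to compute the fill value on the training data only and reuse it for the test data, so that no information from the test set influences preprocessing.

```python
import numpy as np
import pandas as pd

# Hypothetical temperature readings with gaps
train_temperature = pd.Series([20.0, np.nan, 24.0], name='Temperature')
test_temperature = pd.Series([np.nan, 18.0], name='Temperature')

# The fill value is learned from the training data only
fill_value = train_temperature.mean()

train_filled = train_temperature.fillna(fill_value)
test_filled = test_temperature.fillna(fill_value)
print(fill_value, train_filled.tolist(), test_filled.tolist())
```

The median is a reasonable substitute for the mean when the distribution is skewed; the histograms above help with that choice.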

In [None]:
train_df.info()

## Outliers

In [161]:
train_df.describe()

Unnamed: 0,Platform Type,Placement - Day of Month,Placement - Weekday (Mo = 1),Confirmation - Day of Month,Confirmation - Weekday (Mo = 1),Arrival at Pickup - Day of Month,Arrival at Pickup - Weekday (Mo = 1),Pickup - Day of Month,Pickup - Weekday (Mo = 1),Arrival at Destination - Day of Month,...,Precipitation in millimeters,Pickup Lat,Pickup Long,Destination Lat,Destination Long,Time from Pickup to Arrival,No_Of_Orders,Age,Average_Rating,No_of_Ratings
count,21201.0,21201.0,21201.0,21201.0,21201.0,21201.0,21201.0,21201.0,21201.0,21201.0,...,552.0,21201.0,21201.0,21201.0,21201.0,21201.0,21201.0,21201.0,21201.0,21201.0
mean,2.752182,15.653696,3.240083,15.653837,3.240225,15.653837,3.240225,15.653837,3.240225,15.653837,...,7.905797,-1.28147,36.811264,-1.282581,36.81122,1556.920947,1692.423706,984.742842,13.88252,341.067119
std,0.625178,8.798916,1.567295,8.798886,1.567228,8.798886,1.567228,8.798886,1.567228,8.798886,...,17.089971,0.030507,0.037473,0.034824,0.044721,987.270788,1574.308302,646.652835,0.916071,402.867746
min,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.1,-1.438302,36.653621,-1.430298,36.606594,1.0,2.0,96.0,0.0,0.0
25%,3.0,8.0,2.0,8.0,2.0,8.0,2.0,8.0,2.0,8.0,...,1.075,-1.300921,36.784605,-1.301201,36.785661,882.0,557.0,495.0,13.6,61.0
50%,3.0,15.0,3.0,15.0,3.0,15.0,3.0,15.0,3.0,15.0,...,2.9,-1.279395,36.80704,-1.284382,36.808002,1369.0,1212.0,872.0,14.0,161.0
75%,3.0,23.0,5.0,23.0,5.0,23.0,5.0,23.0,5.0,23.0,...,4.9,-1.257147,36.829741,-1.261177,36.829477,2040.0,2311.0,1236.0,14.3,495.0
max,4.0,31.0,7.0,31.0,7.0,31.0,7.0,31.0,7.0,31.0,...,99.1,-1.14717,36.991046,-1.030225,37.016779,7883.0,9756.0,3764.0,15.2,2298.0


In [162]:
train_df

Unnamed: 0,Platform Type,Personal or Business,Placement - Day of Month,Placement - Weekday (Mo = 1),Placement - Time,Confirmation - Day of Month,Confirmation - Weekday (Mo = 1),Confirmation - Time,Arrival at Pickup - Day of Month,Arrival at Pickup - Weekday (Mo = 1),...,Pickup Lat,Pickup Long,Destination Lat,Destination Long,Rider Id,Time from Pickup to Arrival,No_Of_Orders,Age,Average_Rating,No_of_Ratings
0,3,Business,9,5,9:35:46 AM,9,5,9:40:10 AM,9,5,...,-1.317755,36.830370,-1.300406,36.829741,Rider_Id_432,745,1637,1309,13.8,549
1,3,Personal,18,5,3:41:17 PM,18,5,3:41:30 PM,18,5,...,-1.326774,36.787807,-1.356237,36.904295,Rider_Id_432,2886,1637,1309,13.8,549
2,3,Business,31,5,12:51:41 PM,31,5,1:12:49 PM,31,5,...,-1.255189,36.782203,-1.273412,36.818206,Rider_Id_432,2615,1637,1309,13.8,549
3,3,Personal,2,2,7:12:10 AM,2,2,7:12:29 AM,2,2,...,-1.290315,36.757377,-1.223520,36.802061,Rider_Id_432,2986,1637,1309,13.8,549
4,2,Personal,22,2,10:40:58 AM,22,2,10:42:24 AM,22,2,...,-1.273524,36.799220,-1.300431,36.752427,Rider_Id_432,1602,1637,1309,13.8,549
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21196,3,Business,13,2,11:09:37 AM,13,2,11:32:18 AM,13,2,...,-1.265003,36.812624,-1.265432,36.795034,Rider_Id_528,919,5770,1793,14.2,2205
21197,3,Personal,21,4,4:33:17 PM,21,4,4:47:27 PM,21,4,...,-1.269609,36.825741,-1.278067,36.783487,Rider_Id_638,2331,102,873,13.7,32
21198,3,Business,10,1,5:00:40 PM,10,1,5:11:21 PM,10,1,...,-1.250823,36.789526,-1.285850,36.830629,Rider_Id_773,2418,5,105,0.0,0
21199,3,Business,29,2,2:31:55 PM,29,2,2:32:43 PM,29,2,...,-1.291787,36.787267,-1.298575,36.808800,Rider_Id_860,717,5,448,15.0,2


## Changing DataTypes

In [163]:
train_df['Personal or Business']=train_df['Personal or Business'].astype('category')
train_df['Platform Type']=train_df['Platform Type'].astype('category')

## Variable Encoding

In [164]:
train_df['Platform Type'].unique()

[3, 2, 1, 4]
Categories (4, int64): [3, 2, 1, 4]

In [165]:
train_df['Personal or Business'].unique()

[Business, Personal]
Categories (2, object): [Business, Personal]

In [166]:
one = ce.OneHotEncoder(cols=['Platform Type','Personal or Business'])
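The encoder above is created but not applied in the cells shown. As an illustration of what one-hot encoding does to these two columns, the sketch below uses pandas' `get_dummies` as a stand-in for the `category_encoders` encoder; this is a swapped-in technique on hypothetical sample rows, not the notebook's own encoding step.

```python
import pandas as pd

# Hypothetical sample rows with the two categorical columns
sample = pd.DataFrame({
    'Platform Type': [3, 2, 1, 4],
    'Personal or Business': ['Business', 'Personal', 'Personal', 'Business'],
    'Distance (KM)': [4, 16, 3, 9],
})

# Each category becomes its own 0/1 indicator column
encoded = pd.get_dummies(
    sample,
    columns=['Platform Type', 'Personal or Business'],
    dtype=int,
)
print(encoded.columns.tolist())
```

Whichever encoder is used, it should be fitted on the training data and then applied unchanged to the test data so that both end up with the same indicator columns.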

In [169]:
train_df.head(100)

Unnamed: 0,Platform Type,Personal or Business,Placement - Day of Month,Placement - Weekday (Mo = 1),Placement - Time,Confirmation - Day of Month,Confirmation - Weekday (Mo = 1),Confirmation - Time,Arrival at Pickup - Day of Month,Arrival at Pickup - Weekday (Mo = 1),...,Pickup Lat,Pickup Long,Destination Lat,Destination Long,Rider Id,Time from Pickup to Arrival,No_Of_Orders,Age,Average_Rating,No_of_Ratings
0,3,Business,9,5,9:35:46 AM,9,5,9:40:10 AM,9,5,...,-1.317755,36.830370,-1.300406,36.829741,Rider_Id_432,745,1637,1309,13.8,549
1,3,Personal,18,5,3:41:17 PM,18,5,3:41:30 PM,18,5,...,-1.326774,36.787807,-1.356237,36.904295,Rider_Id_432,2886,1637,1309,13.8,549
2,3,Business,31,5,12:51:41 PM,31,5,1:12:49 PM,31,5,...,-1.255189,36.782203,-1.273412,36.818206,Rider_Id_432,2615,1637,1309,13.8,549
3,3,Personal,2,2,7:12:10 AM,2,2,7:12:29 AM,2,2,...,-1.290315,36.757377,-1.223520,36.802061,Rider_Id_432,2986,1637,1309,13.8,549
4,2,Personal,22,2,10:40:58 AM,22,2,10:42:24 AM,22,2,...,-1.273524,36.799220,-1.300431,36.752427,Rider_Id_432,1602,1637,1309,13.8,549
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,3,Business,16,4,2:08:27 PM,16,4,2:09:22 PM,16,4,...,-1.255189,36.782203,-1.263813,36.784978,Rider_Id_155,341,1023,242,12.5,114
96,3,Business,29,3,2:20:18 PM,29,3,2:36:34 PM,29,3,...,-1.263818,36.793006,-1.300406,36.829741,Rider_Id_155,1522,1023,242,12.5,114
97,3,Business,13,6,9:06:14 AM,13,6,9:06:32 AM,13,6,...,-1.263818,36.793006,-1.300406,36.829741,Rider_Id_155,1975,1023,242,12.5,114
98,3,Business,4,2,3:18:06 PM,4,2,3:24:35 PM,4,2,...,-1.301520,36.765846,-1.319279,36.711299,Rider_Id_155,1207,1023,242,12.5,114


In [137]:
def traffic_day_month(input_df):
    """Add a categorical 'no_of_month' column holding the week of the month of the pickup day."""
    # Bin edges reproduce the original ranges: days 1-7, 8-14, 15-22 and 23-31
    input_df['no_of_month'] = pd.cut(
        input_df['Pickup - Day of Month'],
        bins=[0, 7, 14, 22, 31],
        labels=['1st week', '2nd week', '3rd week', '4th week'],
    )
traffic_day_month(train_df)
traffic_day_month(test_df)



In [139]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21201 entries, 0 to 21200
Data columns (total 31 columns):
 #   Column                                     Non-Null Count  Dtype   
---  ------                                     --------------  -----   
 0   Platform Type                              21201 non-null  int64   
 1   Personal or Business                       21201 non-null  object  
 2   Placement - Day of Month                   21201 non-null  int64   
 3   Placement - Weekday (Mo = 1)               21201 non-null  int64   
 4   Placement - Time                           21201 non-null  object  
 5   Confirmation - Day of Month                21201 non-null  int64   
 6   Confirmation - Weekday (Mo = 1)            21201 non-null  int64   
 7   Confirmation - Time                        21201 non-null  object  
 8   Arrival at Pickup - Day of Month           21201 non-null  int64   
 9   Arrival at Pickup - Weekday (Mo = 1)       21201 non-null  int64   
 10  Arrival at


### Exploratory Visualization 

## Implementation 

## Refinement 

## Model Evaluation and Validation

# <center>Conclusion</center>

## Reflection 

## Improvement 

## References 