# Regression - Predict 

# <center>Definition</center>

## Predict Overview

Logistics in Sub-Saharan Africa increases the cost of manufactured goods by up to 320%, whereas in Europe it accounts for only up to 90% of the manufacturing cost.
Delivery time prediction has long been a part of city logistics, but improving its accuracy has recently become very important for services such as Sendy, Mr delivery and Uber Eats, which deliver goods on demand.

These and similar services must receive an order and deliver it in the shortest possible time to satisfy their users. In these situations a difference of +/- 20 minutes matters, so it is important for customer satisfaction that the initial prediction is highly accurate and that any delays are communicated effectively. In addition, accurate predictions enable service providers to realise cost savings, and ultimately reduce the cost of doing business, through improved resource management and planning for order scheduling.
This project was hosted by Sendy (https://www.sendyit.com/) in partnership with the insight2impact facility.

<img src="https://raw.githubusercontent.com/rufusseopa/Team_18_JHB_WhatTheHack_regression-predict-api-template/master/meme.png" width="300" height="300" align="center"/>

## Problem Statement 

The goal is to build a general model that takes data about a goods delivery order as input and outputs the estimated delivery time, measured from driver pickup to arrival at the final destination.

## Performance Metrics

Various metrics can be used to evaluate the performance of ML algorithms, for both classification and regression. The choice matters because the chosen metrics will be used to judge our model’s effectiveness.

We will evaluate the model on the testing data using the Root Mean Square Error (RMSE):
$$
RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^N (\hat{y}_i - y_i)^2}
$$
where $\hat{y}_i$ refers to the $i^{\rm th}$ prediction made from `X_test`, $y_i$ refers to the $i^{\rm th}$ value in `y_test`, and $N$ is the length of `y_test`.
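
As a minimal sketch of the RMSE formula above, the function below computes it with NumPy. The function name `rmse` and the sample delivery times are illustrative assumptions, not values from the dataset.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Square the errors, average them, then take the square root
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


# Made-up delivery times in seconds, for illustration only
y_test = [700, 1900, 1200]
predictions = [760, 1840, 1200]
print(rmse(y_test, predictions))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones, and the result is expressed in the same unit as the target (seconds here).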

We will also use the R Squared ($R^2$) metric, which is generally used for explanatory purposes and indicates the goodness of fit of a set of predicted output values $\hat{y}_i$ to the actual output values $y_i$:

$$R^2 = 1 - \frac{\sum_{i=1}^N(y_i-\hat{y}_i)^2}{\sum_{i=1}^N(y_i-\bar{y})^2}$$

where $\bar{y}$ is the mean of the actual output values $y_i$.
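
The $R^2$ formula can be sketched in a few lines of NumPy. The function name `r_squared` and the sample values are assumptions made for illustration. The sketch also assumes the actual values are not all identical, since the denominator would otherwise be zero.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return float(1.0 - ss_res / ss_tot)


# Made-up delivery times in seconds, for illustration only
actual = np.array([700.0, 1900.0, 1200.0])
perfect_score = r_squared(actual, actual)
baseline_score = r_squared(actual, np.full(3, actual.mean()))
print(perfect_score, baseline_score)
```

A perfect prediction scores 1, while always predicting the mean of the actual values scores 0, which makes $R^2$ a convenient comparison against that naive baseline. scikit-learn's `r2_score` computes the same quantity.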




# <center>Analysis</center>

## Data Exploration 

### Module Imports

In [26]:
# Import modules
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import numpy as np
import datetime

# Figures inline and set visualization style
%matplotlib inline
sns.set()


### Data Loading

In [42]:
# Base URL of the project's data folder on GitHub
DATA_URL = 'https://raw.githubusercontent.com/rufusseopa/Team_18_JHB_WhatTheHack_regression-predict-api-template/master/Data/'

train_df = pd.read_csv(DATA_URL + 'Train.csv')
test_df = pd.read_csv(DATA_URL + 'Test.csv')
riders_df = pd.read_csv(DATA_URL + 'Riders.csv')
variable_df = pd.read_csv(DATA_URL + 'VariableDefinitions.csv')

In [43]:
train_df.head(2)

|  | Order No | User Id | Vehicle Type | Platform Type | Personal or Business | Placement - Day of Month | Placement - Weekday (Mo = 1) | Placement - Time | Confirmation - Day of Month | Confirmation - Weekday (Mo = 1) | ... | Arrival at Destination - Time | Distance (KM) | Temperature | Precipitation in millimeters | Pickup Lat | Pickup Long | Destination Lat | Destination Long | Rider Id | Time from Pickup to Arrival |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Order_No_4211 | User_Id_633 | Bike | 3 | Business | 9 | 5 | 9:35:46 AM | 9 | 5 | ... | 10:39:55 AM | 4 | 20.4 | NaN | -1.317755 | 36.83037 | -1.300406 | 36.829741 | Rider_Id_432 | 745 |
| 1 | Order_No_25375 | User_Id_2285 | Bike | 3 | Personal | 12 | 5 | 11:16:16 AM | 12 | 5 | ... | 12:17:22 PM | 16 | 26.4 | NaN | -1.351453 | 36.899315 | -1.295004 | 36.814358 | Rider_Id_856 | 1993 |


In [44]:
test_df.head(2)

|  | Order No | User Id | Vehicle Type | Platform Type | Personal or Business | Placement - Day of Month | Placement - Weekday (Mo = 1) | Placement - Time | Confirmation - Day of Month | Confirmation - Weekday (Mo = 1) | ... | Pickup - Weekday (Mo = 1) | Pickup - Time | Distance (KM) | Temperature | Precipitation in millimeters | Pickup Lat | Pickup Long | Destination Lat | Destination Long | Rider Id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Order_No_19248 | User_Id_3355 | Bike | 3 | Business | 27 | 3 | 4:44:10 PM | 27 | 3 | ... | 3 | 5:06:47 PM | 8 | NaN | NaN | -1.333275 | 36.870815 | -1.305249 | 36.82239 | Rider_Id_192 |
| 1 | Order_No_12736 | User_Id_3647 | Bike | 3 | Business | 17 | 5 | 12:57:35 PM | 17 | 5 | ... | 5 | 1:25:37 PM | 5 | NaN | NaN | -1.272639 | 36.794723 | -1.277007 | 36.823907 | Rider_Id_868 |


In [45]:
riders_df.head(2)

|  | Rider Id | No_Of_Orders | Age | Average_Rating | No_of_Ratings |
|---|---|---|---|---|---|
| 0 | Rider_Id_396 | 2946 | 2298 | 14.0 | 1159 |
| 1 | Rider_Id_479 | 360 | 951 | 13.5 | 176 |


In [46]:
variable_df.head(2)

|  | Order No | Unique number identifying the order |
|---|---|---|
| 0 | User Id | Unique number identifying the customer on a pl... |
| 1 | Vehicle Type | For this competition limited to bikes, however... |


The `train_df`, `test_df`, and `riders_df` datasets together contain the following fields, listed with their datatypes:

**Features**

* Order No – Unique number identifying the order (object)
* User Id – Unique number identifying the customer on a platform (object)
* Vehicle Type – For this competition limited to bikes, however in practice, Sendy service extends to trucks and vans (object)
* Platform Type – Platform used to place the order, there are 4 types (int64)
* Personal or Business – Customer type (object)

<u>Placement times</u>

* Placement - Day of Month, i.e. 1-31 (int64)
* Placement - Weekday (Monday = 1) (int64)
* Placement - Time - Time of day the order was placed (object)

<u>Confirmation times</u>

* Confirmation - Day of Month, i.e. 1-31 (int64)
* Confirmation - Weekday (Monday = 1) (int64)
* Confirmation - Time - Time of day the order was confirmed by a rider (object)

<u>Arrival at Pickup times</u>

* Arrival at Pickup - Day of Month, i.e. 1-31 (int64)
* Arrival at Pickup - Weekday (Monday = 1) (int64)
* Arrival at Pickup - Time - Time of day the rider arrived at the location to pick up the order - as marked by the rider through the Sendy application (object)

<u>Pickup times</u>

* Pickup - Day of Month, i.e. 1-31 (int64)
* Pickup - Weekday (Monday = 1) (int64)
* Pickup - Time - Time of day the rider picked up the order - as marked by the rider through the Sendy application (object)

<u>Arrival at Destination times</u>

* Arrival at Destination - Day of Month, i.e. 1-31 (int64)
* Arrival at Destination - Weekday (Monday = 1) (int64)
* Arrival at Destination - Time - Time of day the rider arrived at the destination to deliver the order - as marked by the rider through the Sendy application (object)

<u>Location</u>

* Distance covered (KM) - The distance from Pickup to Destination (int64)
* Pickup Latitude and Longitude - Latitude and longitude of the pickup location (float64)
* Destination Latitude and Longitude - Latitude and longitude of the delivery location (float64)

<u>Weather</u>

* Temperature - Temperature at the time of order placement in degrees Celsius (measured every three hours) (float64)
* Precipitation in Millimeters - Precipitation at the time of order placement (measured every three hours) (float64)


<u> Rider metrics </u>

* Rider ID – Unique number identifying the rider (same as in order details) (object)
* No of Orders – Number of Orders the rider has delivered  (int64)
* Age – Number of days since the rider delivered the first order (int64)
* Average Rating – Average rating of the rider (float64)
* No of Ratings - Number of ratings the rider has received. Rating an order is optional for the customer. (int64)
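
Because the rider metrics live in a separate table keyed by the rider identifier, they would typically be attached to each order with a join. The sketch below shows one way to do this on two tiny made-up frames. It assumes the key column is named `Rider Id` in both tables, as in the previews above. The order numbers, rider ids, and the name `orders_with_riders` are hypothetical.

```python
import pandas as pd

# Tiny illustrative frames (made-up rows) that mirror the relevant columns
orders = pd.DataFrame({
    "Order No": ["Order_A", "Order_B", "Order_C"],
    "Rider Id": ["Rider_1", "Rider_2", "Rider_1"],
    "Distance (KM)": [4, 16, 8],
})
riders = pd.DataFrame({
    "Rider Id": ["Rider_1", "Rider_2"],
    "No_Of_Orders": [2946, 360],
    "Age": [2298, 951],
    "Average_Rating": [14.0, 13.5],
    "No_of_Ratings": [1159, 176],
})

# A left join keeps every order and attaches the metrics of its rider;
# validate raises an error if a rider id appears more than once in `riders`
orders_with_riders = orders.merge(
    riders, on="Rider Id", how="left", validate="many_to_one"
)
print(orders_with_riders)
```

A left join is chosen here so that no order is dropped if its rider happens to be missing from the rider table. Such rows would simply carry missing values in the rider columns.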

**Response Variable**
* Time from Pickup to Arrival - Time in seconds between ‘Pickup’ and ‘Arrival at Destination’ - calculated from the columns for the purpose of facilitating the task  (int64)
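
To illustrate how such a response variable can be derived from the time columns, the sketch below parses 12-hour clock strings and takes the difference in seconds. The two arrival times echo the `train_df.head(2)` preview. The pickup times are illustrative values chosen for this example, not read from the dataset. The sketch also assumes that pickup and arrival happen on the same day.

```python
import pandas as pd

# Arrival times come from the preview; pickup times are illustrative assumptions
sample = pd.DataFrame({
    "Pickup - Time": ["10:27:30 AM", "11:44:09 AM"],
    "Arrival at Destination - Time": ["10:39:55 AM", "12:17:22 PM"],
})

# 12-hour clock with seconds and an AM/PM marker
TIME_FORMAT = "%I:%M:%S %p"
pickup = pd.to_datetime(sample["Pickup - Time"], format=TIME_FORMAT)
arrival = pd.to_datetime(sample["Arrival at Destination - Time"], format=TIME_FORMAT)

# Elapsed time between pickup and arrival, in whole seconds
sample["Time from Pickup to Arrival"] = (
    (arrival - pickup).dt.total_seconds().astype("int64")
)
print(sample)
```

Orders that cross midnight would need the day-of-month columns as well, since a time-only difference would turn negative.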

In [47]:
variable_df.head()

|  | Order No | Unique number identifying the order |
|---|---|---|
| 0 | User Id | Unique number identifying the customer on a pl... |
| 1 | Vehicle Type | For this competition limited to bikes, however... |
| 2 | Platform Type | Platform used to place the order, there are 4 ... |
| 3 | Personal or Business | Customer type |
| 4 | Placement - Day of Month | Placement - Day of Month i.e 1-31 |


The `variable_df` dataset holds the variable information that was used to formulate the above list of variables and their descriptions.
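
In the preview above, the first definition (`Order No`) appears as the column header rather than as a data row, which suggests the file has no header line of its own. Assuming that is the case, a sketch like the following keeps every definition as data. The column names `Variable` and `Description` are hypothetical, and the in-memory text stands in for the real file.

```python
import io

import pandas as pd

# Two definition rows taken from the field list stand in for the real file
raw_csv = io.StringIO(
    "Order No,Unique number identifying the order\n"
    "User Id,Unique number identifying the customer on a platform\n"
)

# header=None stops pandas from treating the first row as column names;
# "Variable" and "Description" are hypothetical names chosen for this sketch
definitions = pd.read_csv(raw_csv, header=None, names=["Variable", "Description"])
print(definitions)
```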

In [48]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21201 entries, 0 to 21200
Data columns (total 29 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   Order No                                   21201 non-null  object 
 1   User Id                                    21201 non-null  object 
 2   Vehicle Type                               21201 non-null  object 
 3   Platform Type                              21201 non-null  int64  
 4   Personal or Business                       21201 non-null  object 
 5   Placement - Day of Month                   21201 non-null  int64  
 6   Placement - Weekday (Mo = 1)               21201 non-null  int64  
 7   Placement - Time                           21201 non-null  object 
 8   Confirmation - Day of Month                21201 non-null  int64  
 9   Confirmation - Weekday (Mo = 1)            21201 non-null  int64  
 10  Confirmation - Time   

In [49]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7068 entries, 0 to 7067
Data columns (total 25 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Order No                              7068 non-null   object 
 1   User Id                               7068 non-null   object 
 2   Vehicle Type                          7068 non-null   object 
 3   Platform Type                         7068 non-null   int64  
 4   Personal or Business                  7068 non-null   object 
 5   Placement - Day of Month              7068 non-null   int64  
 6   Placement - Weekday (Mo = 1)          7068 non-null   int64  
 7   Placement - Time                      7068 non-null   object 
 8   Confirmation - Day of Month           7068 non-null   int64  
 9   Confirmation - Weekday (Mo = 1)       7068 non-null   int64  
 10  Confirmation - Time                   7068 non-null   object 
 11  Arrival at Pickup

As the fields above suggest, some of the features may not be needed in our model and will be discarded during preprocessing.

Lastly, note that the `Time from Pickup to Arrival` column is missing from the test set because it is the value our model will predict.
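
The `info()` outputs report 29 training columns against 25 test columns, so more than the target is absent from the test set. One way to list such columns and keep only the features available at prediction time is sketched below on minimal stand-in frames. The rows are made up, and the names `train_sample`, `test_sample`, `X_train`, and `y_train` are assumptions for illustration.

```python
import pandas as pd

# Minimal stand-ins for train_df and test_df (made-up rows, real column names)
train_sample = pd.DataFrame({
    "Order No": ["Order_A"],
    "Distance (KM)": [4],
    "Arrival at Destination - Time": ["10:39:55 AM"],
    "Time from Pickup to Arrival": [745],
})
test_sample = pd.DataFrame({
    "Order No": ["Order_B"],
    "Distance (KM)": [8],
})

target = "Time from Pickup to Arrival"

# Columns present in the training data but unavailable at prediction time
train_only = [col for col in train_sample.columns if col not in test_sample.columns]

# Features are restricted to the shared columns; the target is kept separately
X_train = train_sample.drop(columns=train_only)
y_train = train_sample[target]
print(train_only)
```

Dropping the training-only columns keeps the model from learning from information, such as the arrival time, that would reveal the answer and is not known when a prediction is requested.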

## Exploratory Visualization 

## Algorithms and Techniques

# <center>Methodology</center>

## Data Preprocessing 

## Implementation 

## Refinement 

## Model Evaluation and Validation

# <center>Conclusion</center>

## Reflection 

## Improvement 

## References 