# Sendy Logistics Analysis

## 1. Defining the Question

### a) Data Analysis Question

Can we predict the delivery time of Sendy orders?

### b) Metric for Success

The model should accurately predict delivery time, from the moment a package is picked up until it arrives at the final destination. Accuracy is measured with the root mean squared error (RMSE) of the predicted time.

### c) Understanding the context 

Sendy has hired you to help predict the estimated time of delivery of orders, from the point of driver pickup to the point of arrival at the final destination. 

Build a model that predicts an accurate delivery time, from picking up a package to arriving at the final destination. An accurate arrival time prediction will help businesses improve their logistics and communicate accurate delivery times to their customers.

You will be required to perform various feature engineering techniques while preparing your data for further analysis.

### d) Experimental Design

1. Defining the Research Question
2. Data Importation
3. Data Exploration
4. Data Cleaning
5. Data Analysis (Univariate and Bivariate)
6. Data Preparation
7. Data Modeling
8. Model Evaluation
9. Challenging your Solution
10. Recommendations / Conclusion

### e) Data Relevance

The given data set is relevant in answering the research question.

## 2. Reading the Data

In [None]:
# Importing our libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.feature_selection import RFE
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
pd.set_option('display.max_columns', None)


In [None]:
# Additional packages

# importing six and sys
import six
import sys
sys.modules['sklearn.externals.six'] = six

# installing mlrose
!pip install mlrose
import mlrose

# importing joblib
import joblib
sys.modules['sklearn.externals.joblib'] = joblib
from mlxtend.feature_selection import SequentialFeatureSelector



In [None]:
# Load the data
# --- 
data_url = "https://raw.githubusercontent.com/wambasisamuel/DE_Week04_Friday/main/sendy_logistics.csv"
df = pd.read_csv(data_url) 

In [None]:
# Description of the data
# --- 
description_url = "https://raw.githubusercontent.com/wambasisamuel/DE_Week04_Friday/main/VariableDefinitions.csv"
data_desc = pd.read_csv(description_url)
data_desc

Unnamed: 0,Order No,Unique number identifying the order
0,User Id,Unique number identifying the customer on a pl...
1,Vehicle Type,"For this competition limited to bikes, however..."
2,Platform Type,"Platform used to place the order, there are 4 ..."
3,Personal or Business,Customer type
4,Placement - Day of Month,Placement - Day of Month i.e 1-31
5,Placement - Weekday (Mo = 1),Placement - Weekday (Monday = 1)
6,Placement - Time,Placement - Time - Time of day the order was p...
7,Confirmation - Day of Month,Confirmation - Day of Month i.e 1-31
8,Confirmation - Weekday (Mo = 1),Confirmation - Weekday (Monday = 1)
9,Confirmation - Time,Confirmation - Time - Time of day the order wa...


In [None]:
# Checking the first 5 rows of data
df.head()

Unnamed: 0,Order No,User Id,Vehicle Type,Platform Type,Personal or Business,Placement - Day of Month,Placement - Weekday (Mo = 1),Placement - Time,Confirmation - Day of Month,Confirmation - Weekday (Mo = 1),Confirmation - Time,Arrival at Pickup - Day of Month,Arrival at Pickup - Weekday (Mo = 1),Arrival at Pickup - Time,Pickup - Day of Month,Pickup - Weekday (Mo = 1),Pickup - Time,Arrival at Destination - Day of Month,Arrival at Destination - Weekday (Mo = 1),Arrival at Destination - Time,Distance (KM),Temperature,Precipitation in millimeters,Pickup Lat,Pickup Long,Destination Lat,Destination Long,Rider Id,Time from Pickup to Arrival
0,Order_No_4211,User_Id_633,Bike,3,Business,9,5,9:35:46 AM,9,5,9:40:10 AM,9,5,10:04:47 AM,9,5,10:27:30 AM,9,5,10:39:55 AM,4,20.4,,-1.317755,36.83037,-1.300406,36.829741,Rider_Id_432,745
1,Order_No_25375,User_Id_2285,Bike,3,Personal,12,5,11:16:16 AM,12,5,11:23:21 AM,12,5,11:40:22 AM,12,5,11:44:09 AM,12,5,12:17:22 PM,16,26.4,,-1.351453,36.899315,-1.295004,36.814358,Rider_Id_856,1993
2,Order_No_1899,User_Id_265,Bike,3,Business,30,2,12:39:25 PM,30,2,12:42:44 PM,30,2,12:49:34 PM,30,2,12:53:03 PM,30,2,1:00:38 PM,3,,,-1.308284,36.843419,-1.300921,36.828195,Rider_Id_155,455
3,Order_No_9336,User_Id_1402,Bike,3,Business,15,5,9:25:34 AM,15,5,9:26:05 AM,15,5,9:37:56 AM,15,5,9:43:06 AM,15,5,10:05:27 AM,9,19.2,,-1.281301,36.832396,-1.257147,36.795063,Rider_Id_855,1341
4,Order_No_27883,User_Id_1737,Bike,1,Personal,13,1,9:55:18 AM,13,1,9:56:18 AM,13,1,10:03:53 AM,13,1,10:05:23 AM,13,1,10:25:37 AM,9,15.4,,-1.266597,36.792118,-1.295041,36.809817,Rider_Id_770,1214


In [None]:
# Checking the last 5 rows of data
# ---
df.tail(5)

Unnamed: 0,Order No,User Id,Vehicle Type,Platform Type,Personal or Business,Placement - Day of Month,Placement - Weekday (Mo = 1),Placement - Time,Confirmation - Day of Month,Confirmation - Weekday (Mo = 1),Confirmation - Time,Arrival at Pickup - Day of Month,Arrival at Pickup - Weekday (Mo = 1),Arrival at Pickup - Time,Pickup - Day of Month,Pickup - Weekday (Mo = 1),Pickup - Time,Arrival at Destination - Day of Month,Arrival at Destination - Weekday (Mo = 1),Arrival at Destination - Time,Distance (KM),Temperature,Precipitation in millimeters,Pickup Lat,Pickup Long,Destination Lat,Destination Long,Rider Id,Time from Pickup to Arrival
21196,Order_No_8834,User_Id_2001,Bike,3,Personal,20,3,3:54:38 PM,20,3,3:55:09 PM,20,3,3:58:49 PM,20,3,4:20:08 PM,20,3,4:20:17 PM,3,28.6,,-1.258414,36.8048,-1.275285,36.802702,Rider_Id_953,9
21197,Order_No_22892,User_Id_1796,Bike,3,Business,13,6,10:13:34 AM,13,6,10:13:41 AM,13,6,10:20:04 AM,13,6,10:33:27 AM,13,6,10:46:17 AM,7,26.0,,-1.307143,36.825009,-1.331619,36.847976,Rider_Id_155,770
21198,Order_No_2831,User_Id_2956,Bike,3,Business,7,4,5:06:16 PM,7,4,5:07:09 PM,7,4,5:30:17 PM,7,4,5:50:52 PM,7,4,6:40:05 PM,20,29.2,,-1.286018,36.897534,-1.258414,36.8048,Rider_Id_697,2953
21199,Order_No_6174,User_Id_2524,Bike,1,Personal,4,3,9:31:39 AM,4,3,9:31:53 AM,4,3,9:38:59 AM,4,3,9:45:15 AM,4,3,10:08:15 AM,13,15.0,,-1.25003,36.874167,-1.27921,36.794872,Rider_Id_347,1380
21200,Order_No_9836,User_Id_718,Bike,3,Business,26,2,2:19:47 PM,26,2,2:20:01 PM,26,2,2:24:29 PM,26,2,2:41:55 PM,26,2,3:17:23 PM,12,30.9,,-1.255189,36.782203,-1.320157,36.830887,Rider_Id_177,2128


In [None]:
# Checking number of rows and columns
df.shape

(21201, 29)

In [None]:
# Checking datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21201 entries, 0 to 21200
Data columns (total 29 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   Order No                                   21201 non-null  object 
 1   User Id                                    21201 non-null  object 
 2   Vehicle Type                               21201 non-null  object 
 3   Platform Type                              21201 non-null  int64  
 4   Personal or Business                       21201 non-null  object 
 5   Placement - Day of Month                   21201 non-null  int64  
 6   Placement - Weekday (Mo = 1)               21201 non-null  int64  
 7   Placement - Time                           21201 non-null  object 
 8   Confirmation - Day of Month                21201 non-null  int64  
 9   Confirmation - Weekday (Mo = 1)            21201 non-null  int64  
 10  Confirmation - Time   

Observations:

*   There are 21201 observations in the dataset.
*   The dataset has 29 features.
*   There are 10 categorical features
*   There are 19 numerical features



## 3. External Data Source Validation

The provided dataset has enough features to help in developing a machine learning model that can predict delivery time.

## 4. Data Preparation

### Data Standardisation

In [None]:
# Standardise column names
# ---
df.columns = df.columns.str.strip().str.lower().str.replace(" ","_")
df.columns

Index(['order_no', 'user_id', 'vehicle_type', 'platform_type',
       'personal_or_business', 'placement_-_day_of_month',
       'placement_-_weekday_(mo_=_1)', 'placement_-_time',
       'confirmation_-_day_of_month', 'confirmation_-_weekday_(mo_=_1)',
       'confirmation_-_time', 'arrival_at_pickup_-_day_of_month',
       'arrival_at_pickup_-_weekday_(mo_=_1)', 'arrival_at_pickup_-_time',
       'pickup_-_day_of_month', 'pickup_-_weekday_(mo_=_1)', 'pickup_-_time',
       'arrival_at_destination_-_day_of_month',
       'arrival_at_destination_-_weekday_(mo_=_1)',
       'arrival_at_destination_-_time', 'distance_(km)', 'temperature',
       'precipitation_in_millimeters', 'pickup_lat', 'pickup_long',
       'destination_lat', 'destination_long', 'rider_id',
       'time_from_pickup_to_arrival'],
      dtype='object')

### Data Cleaning

#### Irrelevant Data

The columns *order_no*, *user_id* and *rider_id* are identifier variables that serve no major purpose in modelling, so I will remove them.

In [None]:
df.drop(columns=['order_no', 'user_id', 'rider_id'],inplace=True)

In [None]:
vehicle_type_count = df['vehicle_type'].value_counts()
vehicle_type_count

Bike    21201
Name: vehicle_type, dtype: int64

There is only one vehicle type, hence it won't have an impact on modelling.

In [None]:
df.drop(columns=['vehicle_type'],inplace=True)

In my opinion, the time-related variables most relevant to delivery time are those recorded from the moment an order is picked up. I will therefore drop the confirmation and arrival-at-pickup columns, along with the pickup 'Day of Month' and 'Weekday' columns, which leaves the placement, pickup time and arrival-at-destination columns.

In [None]:
time_cols_drop = ['confirmation_-_day_of_month', 'confirmation_-_weekday_(mo_=_1)', \
           'confirmation_-_time', 'arrival_at_pickup_-_day_of_month', 'arrival_at_pickup_-_weekday_(mo_=_1)', \
           'arrival_at_pickup_-_time','pickup_-_day_of_month', 'pickup_-_weekday_(mo_=_1)']

df.drop(time_cols_drop, inplace=True, axis=1)

#### Duplicate data

In [None]:
# Find the total duplicate records
df.duplicated().sum()

0

#### Missing Data

In [None]:
# Checking missing entries of all the variables
# ---
# 
df.isnull().sum()

platform_type                                    0
personal_or_business                             0
placement_-_day_of_month                         0
placement_-_weekday_(mo_=_1)                     0
placement_-_time                                 0
pickup_-_time                                    0
arrival_at_destination_-_day_of_month            0
arrival_at_destination_-_weekday_(mo_=_1)        0
arrival_at_destination_-_time                    0
distance_(km)                                    0
temperature                                   4366
precipitation_in_millimeters                 20649
pickup_lat                                       0
pickup_long                                      0
destination_lat                                  0
destination_long                                 0
time_from_pickup_to_arrival                      0
dtype: int64

Since Nairobi is arid to semi-arid, precipitation in this case refers to rainfall. Rainfall affects traffic and road conditions, so I won't remove this variable. I will replace the null values with 0 mm of rainfall, assuming that a missing value implies no rainfall.

In [None]:
# Assign the result back instead of filling in place on a column selection
df['precipitation_in_millimeters'] = df['precipitation_in_millimeters'].fillna(0)
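
The assumption that a missing precipitation reading means no rain can be made explicit by keeping a flag for the rows that were filled in. The following is a small sketch on a toy frame; the `precipitation_missing` column is a hypothetical addition for illustration and is not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are illustrative)
toy = pd.DataFrame({"precipitation_in_millimeters": [np.nan, 1.2, np.nan, 0.4]})

# Record which rows had no reading before filling them in
toy["precipitation_missing"] = toy["precipitation_in_millimeters"].isna().astype(int)

# Assumption: no recorded value means 0 mm of rain
toy["precipitation_in_millimeters"] = toy["precipitation_in_millimeters"].fillna(0)
print(toy)
```

Keeping the flag lets a model distinguish a measured 0 mm from an assumed one, should that distinction turn out to matter.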

I will replace the missing values in the temperature column with the mean.

In [None]:
# Replace null values in the `temperature` column with the column mean
df['temperature'] = df['temperature'].fillna(df['temperature'].mean())
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21201 entries, 0 to 21200
Data columns (total 17 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   platform_type                              21201 non-null  int64  
 1   personal_or_business                       21201 non-null  object 
 2   placement_-_day_of_month                   21201 non-null  int64  
 3   placement_-_weekday_(mo_=_1)               21201 non-null  int64  
 4   placement_-_time                           21201 non-null  object 
 5   pickup_-_time                              21201 non-null  object 
 6   arrival_at_destination_-_day_of_month      21201 non-null  int64  
 7   arrival_at_destination_-_weekday_(mo_=_1)  21201 non-null  int64  
 8   arrival_at_destination_-_time              21201 non-null  object 
 9   distance_(km)                              21201 non-null  int64  
 10  temperature           

I will encode the `personal_or_business` variable as integer category codes.

In [None]:
df['personal_or_business'] = df['personal_or_business'].astype('category').cat.codes
df['personal_or_business'].value_counts()

0    17384
1     3817
Name: personal_or_business, dtype: int64
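
To see which label each integer code stands for, the category order can be inspected before the codes replace the labels. A minimal sketch on a toy series (the three values are illustrative):

```python
import pandas as pd

customer_type = pd.Series(["Business", "Personal", "Business"]).astype("category")

# Codes follow the sorted order of the categories
code_to_label = dict(enumerate(customer_type.cat.categories))
print(code_to_label)                      # {0: 'Business', 1: 'Personal'}
print(customer_type.cat.codes.tolist())   # [0, 1, 0]
```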

I will convert the time columns from 12-hour clock strings to 24-hour HHMMSS strings, e.g. 3:54:38 PM becomes 155438.

In [None]:
from datetime import datetime

def convert_time(time):
  formatted_time = datetime.strptime(time, "%I:%M:%S %p")
  str_from_time = datetime.strftime(formatted_time, "%H%M%S")
  return str_from_time

time_columns = ['placement_-_time', 'pickup_-_time', 'arrival_at_destination_-_time']

for col in time_columns:
  df[col] = df[col].apply(convert_time)

df.sample(5)

Unnamed: 0,platform_type,personal_or_business,placement_-_day_of_month,placement_-_weekday_(mo_=_1),placement_-_time,pickup_-_time,arrival_at_destination_-_day_of_month,arrival_at_destination_-_weekday_(mo_=_1),arrival_at_destination_-_time,distance_(km),temperature,precipitation_in_millimeters,pickup_lat,pickup_long,destination_lat,destination_long,time_from_pickup_to_arrival
17615,3,1,6,4,110125,115131,6,4,120707,9,21.8,0.0,-1.265533,36.809465,-1.296974,36.785661,936
18149,3,0,21,5,105920,113958,21,5,124719,12,22.7,0.0,-1.285991,36.875681,-1.28878,36.816831,4041
9933,3,0,21,3,121815,125522,21,3,130314,6,21.9,0.0,-1.272828,36.816608,-1.262847,36.781805,472
1869,3,0,5,2,133312,140358,5,2,141947,6,27.5,0.0,-1.277071,36.823109,-1.263605,36.7851,949
3349,3,0,11,6,93818,100048,11,6,100230,8,23.258889,0.0,-1.300406,36.829741,-1.263818,36.793006,102
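
An HHMMSS code is not evenly spaced: one second after 9:59:59 AM the value jumps from 95959 to 100000. One alternative, shown here only as a sketch and not what this notebook does, is to encode each time as seconds since midnight. The helper name `seconds_since_midnight` is hypothetical:

```python
from datetime import datetime

def seconds_since_midnight(time_str):
    """Convert a 12-hour clock string such as '3:54:38 PM' to seconds since midnight."""
    parsed = datetime.strptime(time_str, "%I:%M:%S %p")
    return parsed.hour * 3600 + parsed.minute * 60 + parsed.second

# Pickup and arrival times of the first row in the dataset
pickup = seconds_since_midnight("10:27:30 AM")
arrival = seconds_since_midnight("10:39:55 AM")
print(pickup, arrival, arrival - pickup)
```

For the first row, the difference is 745 seconds, which equals its `time_from_pickup_to_arrival` value, so the arrival time column already encodes the target.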


## 5. Data Modelling

#### Base model - Ensemble Regressor

In [None]:
features = df.drop(['time_from_pickup_to_arrival'], axis=1)
target = df['time_from_pickup_to_arrival']

# Splitting the data into training and testing sets: 80% for training, 20% for testing.
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.20, random_state=12345)

# Model 
base_regressor = RandomForestRegressor(n_estimators = 10, random_state = 12345)

# Train the model
base_regressor.fit(features_train, target_train)

# Predict using the model
predictions = base_regressor.predict(features_test)

# RMSE
rmse = mean_squared_error(target_test, predictions) ** 0.5

print('RMSE: ', rmse)

RMSE:  525.3969582871707
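
For reference, RMSE is the square root of the mean squared difference between observed and predicted values, expressed in the same unit as the target (seconds here). A minimal sketch on made-up numbers; the helper name `rmse_manual` and the values are illustrative only and are not taken from the dataset:

```python
import math

def rmse_manual(actual, predicted):
    """Root mean squared error of two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("inputs must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical delivery times in seconds
actual_times = [600, 1200, 1800]
predicted_times = [600, 1203, 1796]
example_rmse = rmse_manual(actual_times, predicted_times)
print(example_rmse)
```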


## 6. Model Optimization

#### Feature Scaling

Standardization

In [None]:
# Scaler
scaler = StandardScaler()

# Scale the features: fit on the training set only, then apply the same transform to the test set
features_train_scaled = scaler.fit_transform(features_train)
features_test_scaled = scaler.transform(features_test)

# fit the model
regressor_scaled = RandomForestRegressor(n_estimators = 10, random_state = 12345)
regressor_scaled.fit(features_train_scaled, target_train)

# predict
predictions_scaled = regressor_scaled.predict(features_test_scaled)

rmse = np.sqrt(mean_squared_error(target_test, predictions_scaled))

print('RMSE: ', rmse)



RMSE:  526.5920410670499
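
A scaler is usually fitted on the training set only and then applied unchanged to the test set, so that both sets end up on the same scale. A minimal sketch with toy arrays standing in for `features_train` and `features_test` (the values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the training and test features
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

scaler = StandardScaler().fit(train)   # statistics come from the training set only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)   # reuses the training mean and standard deviation

print(scaler.mean_, test_scaled.ravel())
```

Here the test value 2.0 maps to 0 because it equals the training mean; refitting the scaler on the test set would map it to a different value.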


Normalization

In [None]:
# Normalized model
normalized = MinMaxScaler().fit(features_train) 
features_train_normalized = normalized.transform(features_train) 
features_test_normalized = normalized.transform(features_test)

# fit the model
regressor_normalized = RandomForestRegressor(n_estimators = 10, random_state = 12345)
regressor_normalized.fit(features_train_normalized, target_train)

# predict
predictions_normalized = regressor_normalized.predict(features_test_normalized)

# RMSE
rmse = np.sqrt(mean_squared_error(target_test, predictions_normalized))

print('RMSE:', rmse)

RMSE: 525.5213235397118


Normalization achieves a slightly better RMSE than standardization (525.52 vs 526.59); neither improves on the unscaled base model (525.40).

#### Feature Selection

**Step Forward Feature Selection**

In [None]:
# I'll pass the normalised model defined above to the SequentialFeatureSelector function. 
# k_features: number of features to select. 
# forward: if set to True, performs step forward feature selection. 
# verbose: used for logging the progress of the feature selector
# scoring: defines the performance evaluation criteria 
# cv: refers to cross-validation folds.

feature_selector_sf = SequentialFeatureSelector(regressor_normalized,
           k_features=15,
           forward=True,
           verbose=2,
           scoring='r2',
           cv=4)
 
# Step forward feature selection
feature_selector_sf = feature_selector_sf.fit(features_train_normalized, target_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  16 out of  16 | elapsed:    7.7s finished

[2022-11-29 05:01:07] Features: 1/15 -- score: 0.3391752466522554[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:    7.4s finished

[2022-11-29 05:01:15] Features: 2/15 -- score: 0.33770154263715374[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  14 out of  14 | elapsed:    8.0s finished

[2022-11-29 05:01:23] Features: 3/15 -- score: 0.33470381059571186[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done

In [None]:
selected_features_sf = list(feature_selector_sf.k_feature_idx_)
#selected_features

# Modelling with features from step forward selection
step_forward_regressor = RandomForestRegressor(n_estimators = 10, random_state = 12345)
step_forward_regressor.fit(features_train_normalized[:, selected_features_sf], target_train)

# Making Predictions
step_forward_predictions = step_forward_regressor.predict(features_test_normalized[:, selected_features_sf])

# RMSE
rmse = np.sqrt(mean_squared_error(target_test, step_forward_predictions))

print('RMSE with step forward features: ', rmse)

RMSE with step forward features:  490.43403365549784


**Step Backward Feature Selection**

In [None]:
# This time I set the `forward` parameter to False

feature_selector_sb = SequentialFeatureSelector(regressor_normalized,
           k_features=15,
           forward=False,
           verbose=2,
           scoring='r2',
           cv=4)
 
# Step backward feature selection
feature_selector_sb = feature_selector_sb.fit(features_train_normalized, target_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    4.7s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  16 out of  16 | elapsed:  1.3min finished

[2022-11-29 05:05:37] Features: 15/15 -- score: 0.6827128528080655

In [None]:
selected_features_sb = list(feature_selector_sb.k_feature_idx_)
#selected_features

# Modelling with features from step backward selection
step_backward_regressor = RandomForestRegressor(n_estimators = 10, random_state = 12345)
step_backward_regressor.fit(features_train_normalized[:, selected_features_sb], target_train)

# Making predictions
step_backward_predictions = step_backward_regressor.predict(features_test_normalized[:, selected_features_sb])

# RMSE
rmse = np.sqrt(mean_squared_error(target_test, step_backward_predictions))

print('RMSE with step backward features: ', rmse)

RMSE with step backward features:  490.43403365549784


**Recursive Feature Elimination**

In [None]:
# I'm selecting the best 15 features for the model.
# n_features_to_select counts predictor columns only; the response variable is passed separately to fit()
regressor_rfe = RFE(regressor_normalized, n_features_to_select = 15, step=1)

regressor_rfe.fit(features_train_normalized, target_train) 

# Predict using the model  
rfe_predictions = regressor_rfe.predict(features_test_normalized)

# RMSE
rmse = np.sqrt(mean_squared_error(target_test, rfe_predictions))

print('RMSE for RFE:', rmse)

RMSE for RFE: 526.1907049308937
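
After fitting, RFE exposes which columns it kept through `support_` and `ranking_`. A small sketch on synthetic data generated with `make_regression`; the dataset and the choice of four features are assumptions made for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Synthetic regression data: 6 features, 3 of them informative
X, y = make_regression(n_samples=200, n_features=6, n_informative=3, random_state=0)

selector = RFE(RandomForestRegressor(n_estimators=10, random_state=0),
               n_features_to_select=4, step=1)
selector.fit(X, y)

# Column indices retained by RFE
kept = [i for i, keep in enumerate(selector.support_) if keep]
print(kept)
print(selector.ranking_)   # a rank of 1 marks a retained feature
```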


**Linear Discriminant Analysis**

In [None]:
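# Note: LinearDiscriminantAnalysis is a classifier, so each distinct delivery time is treated as a class label
lda = LinearDiscriminantAnalysis()
lda.fit(features_train_normalized, target_train)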

# Predictions
lda_predictions = lda.predict(features_test_normalized)

# Evaluation
rmse = np.sqrt(mean_squared_error(target_test, lda_predictions))

print('RMSE with Linear Discriminant Analysis:', rmse)

RMSE with Linear Discriminant Analysis: 705.8346313142732


Step forward and step backward feature selection give the same RMSE (490.43), the lowest of the methods tried.

#### Feature Construction

In [None]:
# I will create a new feature: speed = distance / time
# Convert time from seconds to hours so that speed is in km/h
df['speed'] = df['distance_(km)'] / (df['time_from_pickup_to_arrival'] / 3600)
df.head()

Unnamed: 0,platform_type,personal_or_business,placement_-_day_of_month,placement_-_weekday_(mo_=_1),placement_-_time,pickup_-_time,arrival_at_destination_-_day_of_month,arrival_at_destination_-_weekday_(mo_=_1),arrival_at_destination_-_time,distance_(km),temperature,precipitation_in_millimeters,pickup_lat,pickup_long,destination_lat,destination_long,time_from_pickup_to_arrival,speed
0,3,0,9,5,93546,102730,9,5,103955,4,20.4,0.0,-1.317755,36.83037,-1.300406,36.829741,745,19.328859
1,3,1,12,5,111616,114409,12,5,121722,16,26.4,0.0,-1.351453,36.899315,-1.295004,36.814358,1993,28.901154
2,3,0,30,2,123925,125303,30,2,130038,3,23.258889,0.0,-1.308284,36.843419,-1.300921,36.828195,455,23.736264
3,3,0,15,5,92534,94306,15,5,100527,9,19.2,0.0,-1.281301,36.832396,-1.257147,36.795063,1341,24.161074
4,1,1,13,1,95518,100523,13,1,102537,9,15.4,0.0,-1.266597,36.792118,-1.295041,36.809817,1214,26.688633
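
Because `speed` is computed from the target, the target can be recovered from `distance_(km)` and `speed` alone, which is worth bearing in mind when interpreting the RMSE values that follow. A small check using the first row of the table above:

```python
# Values from the first row: 4 km covered in 745 seconds
distance_km = 4
time_seconds = 745

speed_kmh = distance_km / (time_seconds / 3600)

# Inverting the formula gives back the delivery time
recovered_seconds = distance_km / speed_kmh * 3600
print(round(speed_kmh, 6), round(recovered_seconds, 6))
```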


#### Final Modelling

Without Normalisation

In [None]:
# Split data
features = df.drop(['time_from_pickup_to_arrival'], axis=1)
target = df['time_from_pickup_to_arrival']

features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2, random_state=12345)

# The model
base_regressor_new = RandomForestRegressor(n_estimators = 10, random_state = 12345)
base_regressor_new.fit(features_train, target_train)

# Predict 
new_predictions = base_regressor_new.predict(features_test)

# RMSE
rmse = np.sqrt(mean_squared_error(target_test, new_predictions))
print('RMSE:', rmse)

RMSE: 51.31816703528443


With Normalisation

In [None]:
norm = MinMaxScaler().fit(features_train)
features_train_normalized = norm.transform(features_train) 
features_test_normalized = norm.transform(features_test)

# Model
new_base_regressor = RandomForestRegressor(n_estimators = 10, random_state = 12345)
new_base_regressor.fit(features_train_normalized, target_train)

# Predict
new_predictions_normalized = new_base_regressor.predict(features_test_normalized)

# RMSE
rmse = np.sqrt(mean_squared_error(target_test, new_predictions_normalized))
print('RMSE:', rmse)

RMSE: 49.63455594714024


With Step Backward Feature Selection

I chose this because it takes less time than step forward feature selection.

In [None]:
# Modelling with features from step backward selection

# Normalized model
normalized = MinMaxScaler().fit(features_train) 
features_train_normalized = normalized.transform(features_train) 
features_test_normalized = normalized.transform(features_test)

new_sb_regressor = RandomForestRegressor(n_estimators = 10, random_state = 12345)

# We pass the regressor as the estimator to the SequentialFeatureSelector function.
# k_features specifies the number of features to select.
# forward=False performs step backward feature selection (True would perform step forward selection).
# verbose controls how much progress the feature selector logs.
# scoring defines the performance evaluation criterion.
# cv is the number of cross-validation folds.

feature_selector_sb_new = SequentialFeatureSelector(new_sb_regressor,
           k_features=15,
           forward=False,
           verbose=2,
           scoring='r2',
           cv=4)
 
# Perform step backward feature selection
feature_selector_sb_new = feature_selector_sb_new.fit(features_train_normalized, target_train)

In [None]:
#selected_features
selected_features_sb_new = list(feature_selector_sb_new.k_feature_idx_)

step_backward_regressor_new = RandomForestRegressor(n_estimators = 10, random_state = 12345)
step_backward_regressor_new.fit(features_train_normalized[:, selected_features_sb_new], target_train)

# Making predictions
step_backward_predictions_new = step_backward_regressor_new.predict(features_test_normalized[:, selected_features_sb_new])

# RMSE
rmse = np.sqrt(mean_squared_error(target_test, step_backward_predictions_new))

print('RMSE with step backward features: ', rmse)

## Summary

The RMSE dropped from about 525 to about 49 after creating the new `speed` feature, normalising, and performing step backward feature selection.

Feature construction accounted for most of the improvement, giving an RMSE of about 51. The RMSE improved further to about 49 after normalisation followed by step backward feature selection.

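On this dataset the model predicts delivery time well. Note, however, that `speed` is derived from the target itself, so this result should be confirmed using only features that are available before delivery.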

## Challenge the Solution

Did we have the right question? Yes

What can be done to improve the solution?

* Handle any outliers in the dataset
* Hyperparameter tuning
* Construct more features
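
As one way to approach the hyperparameter tuning suggested above, `GridSearchCV` (already imported at the top of the notebook) can search over random forest settings. This is a sketch on synthetic data; the parameter grid, the data and the three folds are assumptions chosen to keep the example quick, not tuned recommendations:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the Sendy features
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Illustrative grid; a real search would cover more values
param_grid = {"n_estimators": [10, 50], "max_depth": [3, None]}

search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                      scoring="neg_root_mean_squared_error", cv=3)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)   # cross-validated RMSE of the best combination
```

Scoring with the negated RMSE keeps the search aligned with the metric reported throughout this notebook.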