# Sendy Logistics
In Sub-Saharan Africa, logistics can add as much as 320% to the cost of manufactured goods, whereas in Europe the figure is only 90% of the manufacturing cost. Economies perform better when logistics are efficient and affordable.

Sendy is a logistics company based in Nairobi, Kenya. It helps the men and women behind every type of business to trade easily, deliver more competitively and build extraordinary businesses.


***“We believe in them; we believe that logistics should be an enabler for them to achieve their goals, rather than a hindrance. We believe that everyone should be able to participate and thrive in the economy and that no small business should be left out because the cost of logistics is either too high or inaccessible.”***


The purpose of this notebook is to predict the delivery time of an order, measured from the moment the order is picked up until it arrives at its final destination.

This will help Sendy improve customer satisfaction as well as the reliability of its service. It will also help ensure that Sendy's resources are used efficiently, through lower costs of doing business, better order scheduling and improved resource management.

# Baseline Model

The models in the Baseline section are trained with very little data transformation: the data is cleaned only as much as is needed to fit a model and make predictions. These Baseline models are later compared with the final models, which are trained after further data preprocessing.

***This notebook was designed with the following libraries. Should you not have them installed already, uncomment the lines in the cell below and run it to install them with pip.***

In [378]:
# !pip install xgboost
# !pip install ipython
# !pip install geopy

### Importing the libraries

In [417]:
#Import python libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

#Packages used to display the Exploratory Data Analysis(EDA)
from pandas_profiling import ProfileReport
from IPython.display import IFrame

#Train_test_split used to split the x dataframe into the training set and test set
from sklearn.model_selection import train_test_split

#Training the simple and multiple linear regression model on the split data
from sklearn.linear_model import LinearRegression

#Training the XGBoost regression model on the split data
import xgboost as xgb

#Metrics used to assess model accuracy
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

### Importing the datasets

> **Variable definitions**

The Variable definitions dataframe lists the column names found in the Rider, Train, Test and Sample Submissions dataframes and gives a brief description of the data found within the column.

*Aside note:* Some column names listed in this DataFrame may not appear in the Rider, Train, Test and Sample Submission DataFrames later on, because columns that prove not to be useful during modelling are removed.

In [380]:
# Importing the VariableDefinitions.csv file from github as a Pandas DataFrame.
# Creating new column names for the Pandas DataFrame.
baseline_vardefinitions = pd.read_csv(
    "https://raw.githubusercontent.com/thembeks/Regression-Sendy-Logistics-Challenge-Team-14/Predict/VariableDefinitions.csv", 
    names=['Column_Name', 'Description'])

# Set the Pandas DataFrame style to display all the contents within the columns.
baseline_vardefinitions.style.set_properties(
    subset=['Description'], **{'width': '600px'})

Unnamed: 0,Column_Name,Description
0,Order No,Unique number identifying the order
1,User Id,Unique number identifying the customer on a platform
2,Vehicle Type,"For this competition limited to bikes, however in practice Sendy service extends to trucks and vans"
3,Platform Type,"Platform used to place the order, there are 4 types"
4,Personal or Business,Customer type
5,Placement - Day of Month,Placement - Day of Month i.e 1-31
6,Placement - Weekday (Mo = 1),Placement - Weekday (Monday = 1)
7,Placement - Time,Placement - Time - Time of day the order was placed
8,Confirmation - Day of Month,Confirmation - Day of Month i.e 1-31
9,Confirmation - Weekday (Mo = 1),Confirmation - Weekday (Monday = 1)


> **Rider DataFrame**

The Rider Dataframe lists all the riders that have delivered orders for Sendy and any information pertaining to that particular rider.

In [381]:
# Importing the Riders.csv file from github as a Pandas DataFrame and setting the RiderId as the index.
baseline_rider = pd.read_csv(
    'https://raw.githubusercontent.com/thembeks/Regression-Sendy-Logistics-Challenge-Team-14/Predict/Riders.csv', 
    index_col= 0)

# Replacing all the blank spaces between words in the column names with an underscore.
baseline_rider.columns = [col.replace(' ', '_')
                          for col in baseline_rider.columns]

# Displays the first 10 rows of the rider DataFrame to show the layout of the DataFrame.
baseline_rider.head(10)


Rider Id,No_Of_Orders,Age,Average_Rating,No_of_Ratings
Rider_Id_396,2946,2298,14.0,1159
Rider_Id_479,360,951,13.5,176
Rider_Id_648,1746,821,14.3,466
Rider_Id_753,314,980,12.5,75
Rider_Id_335,536,1113,13.7,156
Rider_Id_720,2608,1798,13.2,504
Rider_Id_95,3464,1304,13.4,950
Rider_Id_122,4831,2124,14.1,1469
Rider_Id_900,1936,1436,14.2,610
Rider_Id_196,550,2379,13.4,224


In [382]:
# A Pandas method that prints the information about a DataFrame.
# The information printed: the index dtype, column dtypes, non-null values and memory usage.
baseline_rider.info()

# Returns a tuple representing the dimensionality of the DataFrame.
baseline_rider.shape


<class 'pandas.core.frame.DataFrame'>
Index: 960 entries, Rider_Id_396 to Rider_Id_904
Data columns (total 4 columns):
No_Of_Orders      960 non-null int64
Age               960 non-null int64
Average_Rating    960 non-null float64
No_of_Ratings     960 non-null int64
dtypes: float64(1), int64(3)
memory usage: 37.5+ KB


(960, 4)

> **Training set DataFrame**

Most of the modelling will be done with this dataframe to get a prediction of the arrival time for motorbike deliveries in Nairobi.

In [383]:
# Importing the Train.csv file from github as a Pandas DataFrame and setting the OrderNo as the index.
baseline_train = pd.read_csv(
    'https://raw.githubusercontent.com/thembeks/Regression-Sendy-Logistics-Challenge-Team-14/Predict/Train.csv',
    index_col= 0)

# Replacing all the blank spaces between words in the column names with an underscore.
baseline_train.columns = [col.replace(' ', '_')
                          for col in baseline_train.columns]

# Displays the first 10 rows of the training DataFrame to show the layout of the DataFrame.
baseline_train.head(10)


Order No,User_Id,Vehicle_Type,Platform_Type,Personal_or_Business,Placement_-_Day_of_Month,Placement_-_Weekday_(Mo_=_1),Placement_-_Time,Confirmation_-_Day_of_Month,Confirmation_-_Weekday_(Mo_=_1),Confirmation_-_Time,...,Arrival_at_Destination_-_Time,Distance_(KM),Temperature,Precipitation_in_millimeters,Pickup_Lat,Pickup_Long,Destination_Lat,Destination_Long,Rider_Id,Time_from_Pickup_to_Arrival
Order_No_4211,User_Id_633,Bike,3,Business,9,5,9:35:46 AM,9,5,9:40:10 AM,...,10:39:55 AM,4,20.4,,-1.317755,36.83037,-1.300406,36.829741,Rider_Id_432,745
Order_No_25375,User_Id_2285,Bike,3,Personal,12,5,11:16:16 AM,12,5,11:23:21 AM,...,12:17:22 PM,16,26.4,,-1.351453,36.899315,-1.295004,36.814358,Rider_Id_856,1993
Order_No_1899,User_Id_265,Bike,3,Business,30,2,12:39:25 PM,30,2,12:42:44 PM,...,1:00:38 PM,3,,,-1.308284,36.843419,-1.300921,36.828195,Rider_Id_155,455
Order_No_9336,User_Id_1402,Bike,3,Business,15,5,9:25:34 AM,15,5,9:26:05 AM,...,10:05:27 AM,9,19.2,,-1.281301,36.832396,-1.257147,36.795063,Rider_Id_855,1341
Order_No_27883,User_Id_1737,Bike,1,Personal,13,1,9:55:18 AM,13,1,9:56:18 AM,...,10:25:37 AM,9,15.4,,-1.266597,36.792118,-1.295041,36.809817,Rider_Id_770,1214
Order_No_7408,User_Id_1342,Bike,3,Business,14,5,3:07:35 PM,14,5,3:08:57 PM,...,4:23:41 PM,9,27.2,,-1.302583,36.767081,-1.257309,36.806008,Rider_Id_124,3191
Order_No_22680,User_Id_2803,Bike,3,Business,9,5,9:33:45 AM,9,5,9:49:47 AM,...,10:19:45 AM,5,20.3,,-1.279395,36.825364,-1.276574,36.851365,Rider_Id_114,1380
Order_No_21578,User_Id_1075,Bike,3,Business,11,1,2:13:01 PM,11,1,2:14:13 PM,...,2:33:26 PM,3,28.7,,-1.252796,36.800313,-1.255189,36.782203,Rider_Id_913,646
Order_No_5234,User_Id_733,Bike,3,Business,30,2,11:10:44 AM,30,2,11:15:49 AM,...,1:19:35 PM,9,,,-1.255189,36.782203,-1.300255,36.825657,Rider_Id_394,3398
Order_No_1768,User_Id_2112,Bike,3,Business,23,5,4:48:54 PM,23,5,5:17:56 PM,...,6:31:57 PM,14,24.6,,-1.225322,36.80855,-1.215601,36.891686,Rider_Id_660,3439


In [384]:
# A Pandas method that prints the information about a DataFrame.
# The information printed: the index dtype, column dtypes, non-null values and memory usage.
baseline_train.info()

# Returns a tuple representing the dimensionality of the DataFrame.
baseline_train.shape


<class 'pandas.core.frame.DataFrame'>
Index: 21201 entries, Order_No_4211 to Order_No_9836
Data columns (total 28 columns):
User_Id                                      21201 non-null object
Vehicle_Type                                 21201 non-null object
Platform_Type                                21201 non-null int64
Personal_or_Business                         21201 non-null object
Placement_-_Day_of_Month                     21201 non-null int64
Placement_-_Weekday_(Mo_=_1)                 21201 non-null int64
Placement_-_Time                             21201 non-null object
Confirmation_-_Day_of_Month                  21201 non-null int64
Confirmation_-_Weekday_(Mo_=_1)              21201 non-null int64
Confirmation_-_Time                          21201 non-null object
Arrival_at_Pickup_-_Day_of_Month             21201 non-null int64
Arrival_at_Pickup_-_Weekday_(Mo_=_1)         21201 non-null int64
Arrival_at_Pickup_-_Time                     21201 non-null object
Pickup_-_Day

(21201, 28)

> **Testing set DataFrame**

In [385]:
# Importing the Test.csv file from github as a Pandas DataFrame and setting the OrderNo as the index.
baseline_test = pd.read_csv(
    'https://raw.githubusercontent.com/thembeks/Regression-Sendy-Logistics-Challenge-Team-14/Predict/Test.csv', 
    index_col= 0)

# Replacing all the blank spaces between words in the column names with an underscore.
baseline_test.columns = [col.replace(' ', '_')
                         for col in baseline_test.columns]

# Displays the first 10 rows of the test DataFrame to show the layout of the DataFrame.
baseline_test.head(10)


Order No,User_Id,Vehicle_Type,Platform_Type,Personal_or_Business,Placement_-_Day_of_Month,Placement_-_Weekday_(Mo_=_1),Placement_-_Time,Confirmation_-_Day_of_Month,Confirmation_-_Weekday_(Mo_=_1),Confirmation_-_Time,...,Pickup_-_Weekday_(Mo_=_1),Pickup_-_Time,Distance_(KM),Temperature,Precipitation_in_millimeters,Pickup_Lat,Pickup_Long,Destination_Lat,Destination_Long,Rider_Id
Order_No_19248,User_Id_3355,Bike,3,Business,27,3,4:44:10 PM,27,3,4:44:29 PM,...,3,5:06:47 PM,8,,,-1.333275,36.870815,-1.305249,36.82239,Rider_Id_192
Order_No_12736,User_Id_3647,Bike,3,Business,17,5,12:57:35 PM,17,5,12:59:17 PM,...,5,1:25:37 PM,5,,,-1.272639,36.794723,-1.277007,36.823907,Rider_Id_868
Order_No_768,User_Id_2154,Bike,3,Business,27,4,11:08:14 AM,27,4,11:25:05 AM,...,4,11:57:54 AM,5,22.8,,-1.290894,36.822971,-1.276574,36.851365,Rider_Id_26
Order_No_15332,User_Id_2910,Bike,3,Business,17,1,1:51:35 PM,17,1,1:53:27 PM,...,1,2:16:52 PM,5,24.5,,-1.290503,36.809646,-1.303382,36.790658,Rider_Id_685
Order_No_21373,User_Id_1205,Bike,3,Business,11,2,11:30:28 AM,11,2,11:34:45 AM,...,2,11:56:04 AM,6,24.4,,-1.281081,36.814423,-1.266467,36.792161,Rider_Id_858
Order_No_14573,User_Id_2338,Bike,1,Personal,13,1,6:29:29 PM,13,1,6:29:33 PM,...,1,6:39:02 PM,16,19.3,,-1.256606,36.795974,-1.223983,36.898452,Rider_Id_452
Order_No_6731,User_Id_488,Bike,2,Personal,17,3,9:53:29 AM,17,3,9:53:50 AM,...,3,10:08:00 AM,18,20.9,,-1.225272,36.875672,-1.304713,36.808955,Rider_Id_704
Order_No_18436,User_Id_3764,Bike,3,Business,28,4,8:51:13 AM,28,4,8:52:46 AM,...,4,8:58:53 AM,8,22.7,,-1.273539,36.833775,-1.297299,36.789446,Rider_Id_62
Order_No_2288,User_Id_2866,Bike,3,Business,28,4,8:58:21 AM,28,4,8:58:40 AM,...,4,9:30:35 AM,8,19.4,,-1.255189,36.782203,-1.28577,36.759172,Rider_Id_177
Order_No_9063,User_Id_1329,Bike,3,Business,4,5,4:28:28 PM,4,5,4:29:22 PM,...,5,4:44:01 PM,15,21.7,,-1.273056,36.811298,-1.330552,36.714289,Rider_Id_674


In [386]:
# A Pandas method that prints the information about a DataFrame.
# The information printed: the index dtype, column dtypes, non-null values and memory usage.
baseline_test.info()

# Returns a tuple representing the dimensionality of the DataFrame.
baseline_test.shape


<class 'pandas.core.frame.DataFrame'>
Index: 7068 entries, Order_No_19248 to Order_No_1603
Data columns (total 24 columns):
User_Id                                 7068 non-null object
Vehicle_Type                            7068 non-null object
Platform_Type                           7068 non-null int64
Personal_or_Business                    7068 non-null object
Placement_-_Day_of_Month                7068 non-null int64
Placement_-_Weekday_(Mo_=_1)            7068 non-null int64
Placement_-_Time                        7068 non-null object
Confirmation_-_Day_of_Month             7068 non-null int64
Confirmation_-_Weekday_(Mo_=_1)         7068 non-null int64
Confirmation_-_Time                     7068 non-null object
Arrival_at_Pickup_-_Day_of_Month        7068 non-null int64
Arrival_at_Pickup_-_Weekday_(Mo_=_1)    7068 non-null int64
Arrival_at_Pickup_-_Time                7068 non-null object
Pickup_-_Day_of_Month                   7068 non-null int64
Pickup_-_Weekday_(Mo_=_1)     

(7068, 24)

> **Sample submission DataFrame**

This DataFrame shows the format in which we will submit our predicted test values to Zindi.

In [387]:
# Importing the SampleSubmission.csv file from github as a Pandas DataFrame and setting the OrderNo as the index.
baseline_samplesubmission = pd.read_csv(
    'https://raw.githubusercontent.com/thembeks/Regression-Sendy-Logistics-Challenge-Team-14/Predict/SampleSubmission.csv',
    index_col=0)

# Replacing all the blank spaces between words in the column names with an underscore.
baseline_samplesubmission.columns= [col.replace(' ', '_') 
                                    for col in baseline_samplesubmission.columns]

# Displays the first 5 rows of the sample submission DataFrame to show the layout of the DataFrame.
baseline_samplesubmission.head()


Order_No,Time_from_Pickup_to_Arrival
Order_No_19248,567.0
Order_No_12736,4903.0
Order_No_768,5649.0
Order_No_15332,
Order_No_21373,


In [388]:
# A Pandas method that prints the information about a DataFrame.
# The information printed: the index dtype, column dtypes, non-null values and memory usage.
baseline_samplesubmission.info()

# Returns a tuple representing the dimensionality of the DataFrame.
baseline_samplesubmission.shape


<class 'pandas.core.frame.DataFrame'>
Index: 7068 entries, Order_No_19248 to Order_No_1603
Data columns (total 1 columns):
Time_from_Pickup_to_Arrival    3 non-null float64
dtypes: float64(1)
memory usage: 110.4+ KB


(7068, 1)

### Transformation and Processing of data

> **Splitting the data**

Separating the training dataset into the independent variables (features) and the dependent variable (target).

In [389]:
# Splitting the baseline_train DataFrame into the X and Y variable using the Pandas .iloc[] method.

baseline_X = baseline_train.iloc[:, :-1]
baseline_Y = baseline_train.iloc[:, -1]

> **Alignment of datasets**

Aligning the dataset of baseline_X to the baseline_test dataset by dropping extra columns in baseline_X.

In [390]:
# Using the Pandas .drop() method.
# Remove columns by specifying the column names and corresponding axis.

baseline_X = baseline_X.drop(['Arrival_at_Destination_-_Day_of_Month', 'Arrival_at_Destination_-_Weekday_(Mo_=_1)',
                              'Arrival_at_Destination_-_Time'], axis=1)
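
The columns to drop can also be derived rather than typed out. The sketch below is only an illustration on two hypothetical miniature frames (`train_features` and `test_features` are stand-ins, not the notebook's DataFrames); it uses `Index.difference` to find the columns that exist in the training features but not in the test set.

```python
import pandas as pd

# Hypothetical miniature frames standing in for baseline_X and baseline_test.
train_features = pd.DataFrame({
    "Pickup_-_Time": ["9:35:46 AM"],
    "Arrival_at_Destination_-_Time": ["10:39:55 AM"],
    "Distance_(KM)": [4],
})
test_features = pd.DataFrame({
    "Pickup_-_Time": ["5:06:47 PM"],
    "Distance_(KM)": [8],
})

# Columns present in the training features but absent from the test set.
extra_columns = train_features.columns.difference(test_features.columns)

# Dropping them leaves both frames with the same set of columns.
aligned_train = train_features.drop(columns=extra_columns)
```

Deriving the list this way avoids spelling mistakes in long column names, at the cost of hiding which columns were removed, so printing `extra_columns` once is a sensible check.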

> **Missing data**

For the purpose of the Baseline section any missing data will be replaced with a zero.

In [391]:
# Using the Pandas .replace() method, replace all NaN values with 0.

baseline_X = baseline_X.replace(np.nan, 0)

In [392]:
# Using the Pandas .replace() method, replace all NaN values with 0.

baseline_test= baseline_test.replace(np.nan, 0)
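
Before filling, it can help to see how many values are actually missing in each column. The sketch below assumes a small hypothetical frame (`sample`, with one invented precipitation reading of 1.5) and shows the count of missing values followed by the same zero fill used in this section.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the two weather columns that contain gaps.
sample = pd.DataFrame({
    "Temperature": [20.4, np.nan, 19.2],
    "Precipitation_in_millimeters": [np.nan, np.nan, 1.5],
})

# Number of missing values per column, before any filling.
missing_counts = sample.isna().sum()

# Baseline treatment: every missing value becomes 0.
filled = sample.fillna(0)
```

One consequence of this choice is that a filled value cannot be told apart from a genuine reading of zero: in the transformed table shown later, Order_No_1899 has a Temperature of 0.0 only because the original value was missing.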

> **Transformation of data**

**Converting time strings into seconds**

In [393]:
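# Converting time strings to seconds using the Pandas .to_timedelta() method.
# Using the .dt accessor object for datetimelike properties of the Series values to convert to seconds.
# Note: in the output further down, '3:07:35 PM' becomes 11255.0 seconds, so the AM/PM marker is not applied.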

baseline_X['Placement_-_Time'] = pd.to_timedelta(
    baseline_X['Placement_-_Time']).dt.total_seconds()

baseline_X['Confirmation_-_Time'] = pd.to_timedelta(
    baseline_X['Confirmation_-_Time']).dt.total_seconds()

baseline_X['Arrival_at_Pickup_-_Time'] = pd.to_timedelta(
    baseline_X['Arrival_at_Pickup_-_Time']).dt.total_seconds()

baseline_X['Pickup_-_Time'] = pd.to_timedelta(
    baseline_X['Pickup_-_Time']).dt.total_seconds()

In [394]:
# Converting time strings to seconds using the Pandas .to_timedelta() method.
# Using the .dt accessor object for datetimelike properties of the Series values to convert to seconds.

baseline_test['Placement_-_Time'] = pd.to_timedelta(
    baseline_test['Placement_-_Time']).dt.total_seconds()

baseline_test['Confirmation_-_Time'] = pd.to_timedelta(
    baseline_test['Confirmation_-_Time']).dt.total_seconds()

baseline_test['Arrival_at_Pickup_-_Time'] = pd.to_timedelta(
    baseline_test['Arrival_at_Pickup_-_Time']).dt.total_seconds()

baseline_test['Pickup_-_Time'] = pd.to_timedelta(
    baseline_test['Pickup_-_Time']).dt.total_seconds()
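
In the transformed table shown later, the placement time '3:07:35 PM' appears as 11255.0 seconds, which is 3 h 7 min 35 s, so afternoon orders end up with the same values as early-morning ones. A possible alternative, sketched here under the assumption that every time string follows the `H:MM:SS AM/PM` pattern seen in the data, is to parse with `pd.to_datetime` and an explicit 12-hour format, then compute seconds since midnight. The `times` Series is a hypothetical sample, not the notebook's data.

```python
import pandas as pd

# Hypothetical sample of time strings in the same layout as the dataset.
times = pd.Series(["9:35:46 AM", "12:39:25 PM", "3:07:35 PM"])

# %I is the 12-hour clock hour and %p the AM/PM marker.
parsed = pd.to_datetime(times, format="%I:%M:%S %p")

# Seconds elapsed since midnight on a 24-hour clock.
seconds = (parsed.dt.hour * 3600
           + parsed.dt.minute * 60
           + parsed.dt.second).astype(float)
```

With this approach '3:07:35 PM' maps to 54455.0 seconds rather than 11255.0, while morning times are unchanged.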

**Dropping columns**

In [395]:
# Using the Pandas .drop() method.
# Remove columns that are not useful by specifying the column names and corresponding axis.

baseline_X = baseline_X.drop(['User_Id', 'Rider_Id'], axis=1)

In [396]:
# Using the Pandas .drop() method.
# Remove columns that are not useful by specifying the column names and corresponding axis.

baseline_test = baseline_test.drop(['User_Id', 'Rider_Id'], axis=1)

> **Encoding categorical data**

In [397]:
# Encoding categorical data using Pandas .get_dummies() method.

baseline_X = pd.get_dummies(baseline_X, columns=[
                            'Vehicle_Type', 'Platform_Type', 'Personal_or_Business'], 
                            drop_first=True)

In [398]:
# Encoding categorical data using Pandas .get_dummies() method.

baseline_test = pd.get_dummies(baseline_test, columns=[
                            'Vehicle_Type', 'Platform_Type', 'Personal_or_Business'], 
                            drop_first=True)
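
Encoding the training and test sets separately only yields matching columns when both contain the same categories. The sketch below, on hypothetical miniature frames (`train` and `test`), shows one way to guard against a mismatch: encode each frame, then reindex the test frame to the training columns and fill any missing dummy column with 0.

```python
import pandas as pd

# Hypothetical frames: the test sample lacks platform types 2 and 4.
train = pd.DataFrame({"Platform_Type": [1, 2, 3, 4],
                      "Distance_(KM)": [4, 16, 3, 9]})
test = pd.DataFrame({"Platform_Type": [1, 3, 3],
                     "Distance_(KM)": [8, 5, 5]})

train_encoded = pd.get_dummies(train, columns=["Platform_Type"], drop_first=True)
test_encoded = pd.get_dummies(test, columns=["Platform_Type"], drop_first=True)

# Give the test frame exactly the training columns, in the same order;
# dummy columns that never occurred in the test data are filled with 0.
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
```

Note also that `drop_first=True` removes the only dummy of a single-valued column, which is why no Vehicle_Type column remains in the transformed DataFrames shown below.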

> **Rename of columns**

In [399]:
baseline_X.columns= ['Placement(Day)', 'Placement(Weekday)', 'Placement(Time)', 'Confirmation(Day)',
                     'Confirmation(Weekday)', 'Confirmation(Time)', 'Arrival(Day)', 'Arrival(Weekday)',
                     'Arrival(Time)', 'Pickup(Day)', 'Pickup(Weekday)', 'Pickup(Time)', 'Distance(KM)',
                     'Temperature', 'Precipitation(mm)', 'Pickup(Lat)', 'Pickup(Long)', 
                     'Destination(Lat)', 'Destination(Long)', 'Platform(Type2)', 'Platform(Type3)', 
                     'Platform(Type4)', 'Personal/Business']

In [400]:
baseline_test.columns= ['Placement(Day)', 'Placement(Weekday)', 'Placement(Time)', 'Confirmation(Day)',
                        'Confirmation(Weekday)', 'Confirmation(Time)', 'Arrival(Day)', 'Arrival(Weekday)',
                        'Arrival(Time)', 'Pickup(Day)', 'Pickup(Weekday)', 'Pickup(Time)', 'Distance(KM)',
                        'Temperature', 'Precipitation(mm)', 'Pickup(Lat)', 'Pickup(Long)', 
                        'Destination(Lat)', 'Destination(Long)', 'Platform(Type2)', 'Platform(Type3)', 
                        'Platform(Type4)', 'Personal/Business']

> **How the DataFrame looks after the data transformation process**

In [401]:
# Displays the first 10 rows of the transformed training features to show the layout of the DataFrame.

baseline_X.head(10)

Order No,Placement(Day),Placement(Weekday),Placement(Time),Confirmation(Day),Confirmation(Weekday),Confirmation(Time),Arrival(Day),Arrival(Weekday),Arrival(Time),Pickup(Day),...,Temperature,Precipitation(mm),Pickup(Lat),Pickup(Long),Destination(Lat),Destination(Long),Platform(Type2),Platform(Type3),Platform(Type4),Personal/Business
Order_No_4211,9,5,34546.0,9,5,34810.0,9,5,36287.0,9,...,20.4,0.0,-1.317755,36.83037,-1.300406,36.829741,0,1,0,0
Order_No_25375,12,5,40576.0,12,5,41001.0,12,5,42022.0,12,...,26.4,0.0,-1.351453,36.899315,-1.295004,36.814358,0,1,0,1
Order_No_1899,30,2,45565.0,30,2,45764.0,30,2,46174.0,30,...,0.0,0.0,-1.308284,36.843419,-1.300921,36.828195,0,1,0,0
Order_No_9336,15,5,33934.0,15,5,33965.0,15,5,34676.0,15,...,19.2,0.0,-1.281301,36.832396,-1.257147,36.795063,0,1,0,0
Order_No_27883,13,1,35718.0,13,1,35778.0,13,1,36233.0,13,...,15.4,0.0,-1.266597,36.792118,-1.295041,36.809817,0,0,0,1
Order_No_7408,14,5,11255.0,14,5,11337.0,14,5,12096.0,14,...,27.2,0.0,-1.302583,36.767081,-1.257309,36.806008,0,1,0,0
Order_No_22680,9,5,34425.0,9,5,35387.0,9,5,35592.0,9,...,20.3,0.0,-1.279395,36.825364,-1.276574,36.851365,0,1,0,0
Order_No_21578,11,1,7981.0,11,1,8053.0,11,1,8493.0,11,...,28.7,0.0,-1.252796,36.800313,-1.255189,36.782203,0,1,0,0
Order_No_5234,30,2,40244.0,30,2,40549.0,30,2,43998.0,30,...,0.0,0.0,-1.255189,36.782203,-1.300255,36.825657,0,1,0,0
Order_No_1768,23,5,17334.0,23,5,19076.0,23,5,19961.0,23,...,24.6,0.0,-1.225322,36.80855,-1.215601,36.891686,0,1,0,0


In [402]:
baseline_X.info()

<class 'pandas.core.frame.DataFrame'>
Index: 21201 entries, Order_No_4211 to Order_No_9836
Data columns (total 23 columns):
Placement(Day)           21201 non-null int64
Placement(Weekday)       21201 non-null int64
Placement(Time)          21201 non-null float64
Confirmation(Day)        21201 non-null int64
Confirmation(Weekday)    21201 non-null int64
Confirmation(Time)       21201 non-null float64
Arrival(Day)             21201 non-null int64
Arrival(Weekday)         21201 non-null int64
Arrival(Time)            21201 non-null float64
Pickup(Day)              21201 non-null int64
Pickup(Weekday)          21201 non-null int64
Pickup(Time)             21201 non-null float64
Distance(KM)             21201 non-null int64
Temperature              21201 non-null float64
Precipitation(mm)        21201 non-null float64
Pickup(Lat)              21201 non-null float64
Pickup(Long)             21201 non-null float64
Destination(Lat)         21201 non-null float64
Destination(Long)        21201

In [403]:
# Displays the first 10 rows of the transformed test DataFrame to show the layout of the DataFrame.

baseline_test.head(10)

Order No,Placement(Day),Placement(Weekday),Placement(Time),Confirmation(Day),Confirmation(Weekday),Confirmation(Time),Arrival(Day),Arrival(Weekday),Arrival(Time),Pickup(Day),...,Temperature,Precipitation(mm),Pickup(Lat),Pickup(Long),Destination(Lat),Destination(Long),Platform(Type2),Platform(Type3),Platform(Type4),Personal/Business
Order_No_19248,27,3,17050.0,27,3,17069.0,27,3,17584.0,27,...,0.0,0.0,-1.333275,36.870815,-1.305249,36.82239,0,1,0,0
Order_No_12736,17,5,46655.0,17,5,46757.0,17,5,4827.0,17,...,0.0,0.0,-1.272639,36.794723,-1.277007,36.823907,0,1,0,0
Order_No_768,27,4,40094.0,27,4,41105.0,27,4,41600.0,27,...,22.8,0.0,-1.290894,36.822971,-1.276574,36.851365,0,1,0,0
Order_No_15332,17,1,6695.0,17,1,6807.0,17,1,7361.0,17,...,24.5,0.0,-1.290503,36.809646,-1.303382,36.790658,0,1,0,0
Order_No_21373,11,2,41428.0,11,2,41685.0,11,2,42439.0,11,...,24.4,0.0,-1.281081,36.814423,-1.266467,36.792161,0,1,0,0
Order_No_14573,13,1,23369.0,13,1,23373.0,13,1,23806.0,13,...,19.3,0.0,-1.256606,36.795974,-1.223983,36.898452,0,0,0,1
Order_No_6731,17,3,35609.0,17,3,35630.0,17,3,35791.0,17,...,20.9,0.0,-1.225272,36.875672,-1.304713,36.808955,1,0,0,1
Order_No_18436,28,4,31873.0,28,4,31966.0,28,4,32265.0,28,...,22.7,0.0,-1.273539,36.833775,-1.297299,36.789446,0,1,0,0
Order_No_2288,28,4,32301.0,28,4,32320.0,28,4,34037.0,28,...,19.4,0.0,-1.255189,36.782203,-1.28577,36.759172,0,1,0,0
Order_No_9063,4,5,16108.0,4,5,16162.0,4,5,16953.0,4,...,21.7,0.0,-1.273056,36.811298,-1.330552,36.714289,0,1,0,0


> **Splitting the dataset into the training and test set**

In [404]:
# Using the train_test_split() function from sklearn.model_selection to split baseline_X and baseline_Y.
# Test size will be 0.2 (20% of the data will be the test set).

baseline_Xtrain, baseline_Xtest, baseline_Ytrain, baseline_Ytest = train_test_split(
    baseline_X, baseline_Y, 
    test_size=0.2, 
    random_state=1)
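
The effect of `test_size` and `random_state` can be checked on placeholder data. The sketch below assumes only the row count reported earlier by `baseline_train.shape` (21201); the arrays `X` and `y` are hypothetical stand-ins for the real features and target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 21201  # row count reported by baseline_train.shape

# Placeholder features and target with the same number of rows.
X = np.arange(n_rows).reshape(-1, 1)
y = np.arange(n_rows)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Repeating the call with the same random_state reproduces the split.
_, _, _, y_test_again = train_test_split(
    X, y, test_size=0.2, random_state=1)
```

Fixing `random_state` matters here because every model in the Baseline section is scored on the same held-out rows, which keeps their metrics comparable.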

### Simple Linear Regression 

> **Training the model**

In [405]:
# Using the LinearRegression() method from sklearn.linear_model.
# Train a model on the distance column only, selected by name as a
# single-column DataFrame so that scikit-learn receives a 2-D input.

baseline_simpregression = LinearRegression()
baseline_simpregression.fit(
    baseline_Xtrain[['Distance(KM)']], baseline_Ytrain)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

> **Predictions**

In [406]:
# Using the trained model from above to predict the outcomes for the distance column of baseline_Xtest.

baseline_simppredict = baseline_simpregression.predict(
    baseline_Xtest[['Distance(KM)']])
baseline_simppredict[:, None]

array([[1303.53247603],
       [3222.21384152],
       [1404.51570579],
       ...,
       [1101.5660165 ],
       [1101.5660165 ],
       [1101.5660165 ]])

> **Assessing model accuracy**

**Mean Squared Error**

In [427]:
# Calculating the mean squared error between the predicted values and test set values.

def mse(baseline_Ytest, baseline_simppredict):

    MSE = mean_squared_error(baseline_Ytest, baseline_simppredict)
    return MSE

In [428]:
mse(baseline_Ytest, baseline_simppredict)

630189.2670758963

**Residual Sum of Squares**

In [429]:
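# Calculating the residual sum of squares as MSE multiplied by a row count.
# Note: the multiplier here is len(baseline_X), the full training DataFrame,
# not the number of rows in baseline_Ytest that the MSE was computed on.

def rss(baseline_Ytest, baseline_simppredict):

    RSS = mean_squared_error(
        baseline_Ytest, baseline_simppredict)*len(baseline_X)
    return RSS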

In [430]:
rss(baseline_Ytest, baseline_simppredict)

13360642651.276077
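
For reference, the residual sum of squares is the sum of squared differences between actual and predicted values, which equals the MSE multiplied by the number of rows that were scored. The sketch below checks that identity on four target values taken from the head of the training data and four hypothetical predictions; `y_true` and `y_pred` are illustrative names.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([745.0, 1993.0, 455.0, 1341.0])  # targets from the train head
y_pred = np.array([800.0, 1900.0, 500.0, 1300.0])  # hypothetical predictions

# Direct definition: sum of squared residuals.
rss_direct = float(np.sum((y_true - y_pred) ** 2))

# Equivalent form: MSE times the number of scored rows.
rss_from_mse = mean_squared_error(y_true, y_pred) * len(y_true)
```

For predictions on the held-out split, the matching multiplier would therefore be `len(baseline_Ytest)`; multiplying by `len(baseline_X)` scales the figure up by the ratio of the two row counts.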

**R squared**

In [431]:
# Calculating the R squared between the predicted values and test set values.

def r2(baseline_Ytest, baseline_simppredict):

    R2 = r2_score(baseline_Ytest, baseline_simppredict)
    return R2

In [432]:
r2(baseline_Ytest, baseline_simppredict)

0.33885619187981375

**Root Mean Squared Error**

In [433]:
# Calculating the root mean squared error between the predicted values and test set values.

def rmse(baseline_Ytest, baseline_simppredict):
    RMSE = np.sqrt(mean_squared_error(baseline_Ytest, baseline_simppredict))

    return RMSE

In [434]:
rmse(baseline_Ytest, baseline_simppredict)

793.8446114170557

### Multiple Linear Regression

> **Training the model**

In [327]:
# Create a model using the LinearRegression() method from sklearn.linear_model.

baseline_multiregression = LinearRegression()
baseline_multiregression.fit(baseline_Xtrain, baseline_Ytrain)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [328]:
# extract model intercept

beta_bimultiregression = float(baseline_multiregression.intercept_)
print("Intercept:", beta_bimultiregression)

Intercept: 46606.70020303298


In [329]:
# extract model coefficients

beta_bcmultiregression = pd.DataFrame(
    baseline_multiregression.coef_, 
    baseline_X.columns, 
    columns=['Coefficient'])
display(beta_bcmultiregression)

Unnamed: 0,Coefficient
Placement(Day),-161.273015
Placement(Weekday),-159.274513
Placement(Time),0.002279
Confirmation(Day),53.443611
Confirmation(Weekday),55.442113
Confirmation(Time),-0.001688
Arrival(Day),53.443611
Arrival(Weekday),55.442113
Arrival(Time),-0.002015
Pickup(Day),53.443611


> **Predictions**

In [331]:
# Using the trained model from above to predict the outcomes for the baseline_Xtest.

baseline_multipredict= baseline_multiregression.predict(baseline_Xtest)
baseline_multipredict[:, None]

array([[1312.15014999],
       [3272.450594  ],
       [1356.8443304 ],
       ...,
       [1046.81838199],
       [1060.02741748],
       [1081.22697426]])

> **Assessing model accuracy**

**Mean Squared Error**

In [435]:
# Calculating the mean squared error between the predicted values and test set values.

def mse(baseline_Ytest, baseline_multipredict):

    MSE = mean_squared_error(baseline_Ytest, baseline_multipredict)
    return MSE

In [436]:
mse(baseline_Ytest, baseline_multipredict)

626904.8833432547

**Residual Sum of Squares**

In [440]:
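# Calculating the residual sum of squares as MSE multiplied by a row count.
# Note: the multiplier here is len(baseline_X), the full training DataFrame,
# not the number of rows in baseline_Ytest that the MSE was computed on.

def rss(baseline_Ytest, baseline_multipredict):
    
    RSS = mean_squared_error(
        baseline_Ytest, baseline_multipredict)*len(baseline_X)
    return RSS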

In [441]:
rss(baseline_Ytest, baseline_multipredict)

13291010431.760342

**R squared**

In [442]:
# Calculating the R squared between the predicted values and test set values.

def r2(baseline_Ytest, baseline_multipredict):
    
    R2 = r2_score(baseline_Ytest, baseline_multipredict)
    return R2

In [443]:
r2(baseline_Ytest, baseline_multipredict)

0.3423019026873022

**Root Mean Squared Error**

In [444]:
# Calculating the root mean squared error between the predicted values and test set values.

def rmse(baseline_Ytest, baseline_multipredict):
    RMSE = np.sqrt(mean_squared_error(baseline_Ytest, baseline_multipredict))

    return RMSE

In [445]:
rmse(baseline_Ytest, baseline_multipredict)

791.7732524803138
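
The four metrics are defined twice in this section, once per model. A single helper that takes any pair of actual and predicted values could serve both models; the sketch below is one possible shape for it. The name `regression_report` and the example numbers are hypothetical, and RSS is scaled by the number of rows passed in.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return MSE, RSS, R squared and RMSE for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse_value = mean_squared_error(y_true, y_pred)
    return {
        "MSE": mse_value,
        "RSS": mse_value * len(y_true),   # sum of squared residuals
        "R2": r2_score(y_true, y_pred),
        "RMSE": float(np.sqrt(mse_value)),
    }


# Hypothetical values, used only to exercise the helper.
report = regression_report([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
```

In the notebook this could be called as `regression_report(baseline_Ytest, baseline_simppredict)` and `regression_report(baseline_Ytest, baseline_multipredict)`, giving two dictionaries that are easy to place side by side in a DataFrame.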

# Data Preprocessing

### Importing the libraries

In [332]:
# Import python libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Packages used to display the Exploratory Data Analysis(EDA)
from pandas_profiling import ProfileReport
from IPython.display import IFrame

# Using the distance package from geopy to calculate the distance between coordinates
from geopy.distance import distance

# Train_test_split used to split the x dataframe into the training set and test set
from sklearn.model_selection import train_test_split

# Training the multiple linear regression model on the split data
from sklearn.linear_model import LinearRegression

# Training the XGBoost regression model on the split data
import xgboost as xgb

# Accuracy packages
from sklearn.metrics import mean_squared_error, r2_score

### Importing the dataset

##### Variable definitions

In [333]:
# Importing the VariableDefinitions.csv file from github as a Pandas DataFrame.
# Creating new column names for the Pandas DataFrame.
vardefinitions = pd.read_csv(
    "https://raw.githubusercontent.com/thembeks/Regression-Sendy-Logistics-Challenge-Team-14/Predict/VariableDefinitions.csv", 
    names=['Column_Name', 'Description'])

# Set the Pandas DataFrame style to display all the contents within the columns.
vardefinitions.style.set_properties(
    subset=['Description'], **{'width': '600px'})

Unnamed: 0,Column_Name,Description
0,Order No,Unique number identifying the order
1,User Id,Unique number identifying the customer on a platform
2,Vehicle Type,"For this competition limited to bikes, however in practice Sendy service extends to trucks and vans"
3,Platform Type,"Platform used to place the order, there are 4 types"
4,Personal or Business,Customer type
5,Placement - Day of Month,Placement - Day of Month i.e 1-31
6,Placement - Weekday (Mo = 1),Placement - Weekday (Monday = 1)
7,Placement - Time,Placement - Time - Time of day the order was placed
8,Confirmation - Day of Month,Confirmation - Day of Month i.e 1-31
9,Confirmation - Weekday (Mo = 1),Confirmation - Weekday (Monday = 1)


##### Rider DataFrame

In [334]:
# Importing the Riders.csv file from github as a Pandas DataFrame and setting the RiderId as the index.
rider = pd.read_csv(
    'https://raw.githubusercontent.com/thembeks/Regression-Sendy-Logistics-Challenge-Team-14/Predict/Riders.csv', 
    index_col=0)

# Replacing all the blank spaces between words in the column names with an underscore.
rider.columns = [col.replace(' ', '_') 
                 for col in rider.columns]

# Displays the first 10 rows of the rider DataFrame to show the layout of the DataFrame.
rider.head(10)

Rider Id,No_Of_Orders,Age,Average_Rating,No_of_Ratings
Rider_Id_396,2946,2298,14.0,1159
Rider_Id_479,360,951,13.5,176
Rider_Id_648,1746,821,14.3,466
Rider_Id_753,314,980,12.5,75
Rider_Id_335,536,1113,13.7,156
Rider_Id_720,2608,1798,13.2,504
Rider_Id_95,3464,1304,13.4,950
Rider_Id_122,4831,2124,14.1,1469
Rider_Id_900,1936,1436,14.2,610
Rider_Id_196,550,2379,13.4,224


In [335]:
# A Pandas method that prints the information about a DataFrame.
# The information printed: the index dtype,column dtypes, non-null values and memory usage.
rider.info()

# Returns a tuple representing the dimensionality of the DataFrame.
rider.shape

<class 'pandas.core.frame.DataFrame'>
Index: 960 entries, Rider_Id_396 to Rider_Id_904
Data columns (total 4 columns):
No_Of_Orders      960 non-null int64
Age               960 non-null int64
Average_Rating    960 non-null float64
No_of_Ratings     960 non-null int64
dtypes: float64(1), int64(3)
memory usage: 37.5+ KB


(960, 4)

##### Training set DataFrame

In [336]:
# Importing the Train.csv file from github as a Pandas DataFrame and setting the OrderNo as the index.
train = pd.read_csv(
    'https://raw.githubusercontent.com/thembeks/Regression-Sendy-Logistics-Challenge-Team-14/Predict/Train.csv', 
    index_col=0)

# Replacing all the blank spaces between words in the column names with an underscore.
train.columns = [col.replace(' ', '_') 
                 for col in train.columns]

# Displays the first 10 rows of the train DataFrame to show the layout of the DataFrame.
train.head(10)

Order No,User_Id,Vehicle_Type,Platform_Type,Personal_or_Business,Placement_-_Day_of_Month,Placement_-_Weekday_(Mo_=_1),Placement_-_Time,Confirmation_-_Day_of_Month,Confirmation_-_Weekday_(Mo_=_1),Confirmation_-_Time,...,Arrival_at_Destination_-_Time,Distance_(KM),Temperature,Precipitation_in_millimeters,Pickup_Lat,Pickup_Long,Destination_Lat,Destination_Long,Rider_Id,Time_from_Pickup_to_Arrival
Order_No_4211,User_Id_633,Bike,3,Business,9,5,9:35:46 AM,9,5,9:40:10 AM,...,10:39:55 AM,4,20.4,,-1.317755,36.83037,-1.300406,36.829741,Rider_Id_432,745
Order_No_25375,User_Id_2285,Bike,3,Personal,12,5,11:16:16 AM,12,5,11:23:21 AM,...,12:17:22 PM,16,26.4,,-1.351453,36.899315,-1.295004,36.814358,Rider_Id_856,1993
Order_No_1899,User_Id_265,Bike,3,Business,30,2,12:39:25 PM,30,2,12:42:44 PM,...,1:00:38 PM,3,,,-1.308284,36.843419,-1.300921,36.828195,Rider_Id_155,455
Order_No_9336,User_Id_1402,Bike,3,Business,15,5,9:25:34 AM,15,5,9:26:05 AM,...,10:05:27 AM,9,19.2,,-1.281301,36.832396,-1.257147,36.795063,Rider_Id_855,1341
Order_No_27883,User_Id_1737,Bike,1,Personal,13,1,9:55:18 AM,13,1,9:56:18 AM,...,10:25:37 AM,9,15.4,,-1.266597,36.792118,-1.295041,36.809817,Rider_Id_770,1214
Order_No_7408,User_Id_1342,Bike,3,Business,14,5,3:07:35 PM,14,5,3:08:57 PM,...,4:23:41 PM,9,27.2,,-1.302583,36.767081,-1.257309,36.806008,Rider_Id_124,3191
Order_No_22680,User_Id_2803,Bike,3,Business,9,5,9:33:45 AM,9,5,9:49:47 AM,...,10:19:45 AM,5,20.3,,-1.279395,36.825364,-1.276574,36.851365,Rider_Id_114,1380
Order_No_21578,User_Id_1075,Bike,3,Business,11,1,2:13:01 PM,11,1,2:14:13 PM,...,2:33:26 PM,3,28.7,,-1.252796,36.800313,-1.255189,36.782203,Rider_Id_913,646
Order_No_5234,User_Id_733,Bike,3,Business,30,2,11:10:44 AM,30,2,11:15:49 AM,...,1:19:35 PM,9,,,-1.255189,36.782203,-1.300255,36.825657,Rider_Id_394,3398
Order_No_1768,User_Id_2112,Bike,3,Business,23,5,4:48:54 PM,23,5,5:17:56 PM,...,6:31:57 PM,14,24.6,,-1.225322,36.80855,-1.215601,36.891686,Rider_Id_660,3439


In [337]:
# A Pandas method that prints the information about a DataFrame.
# The information printed: the index dtype,column dtypes, non-null values and memory usage.
train.info()

# Returns a tuple representing the dimensionality of the DataFrame.
train.shape


<class 'pandas.core.frame.DataFrame'>
Index: 21201 entries, Order_No_4211 to Order_No_9836
Data columns (total 28 columns):
User_Id                                      21201 non-null object
Vehicle_Type                                 21201 non-null object
Platform_Type                                21201 non-null int64
Personal_or_Business                         21201 non-null object
Placement_-_Day_of_Month                     21201 non-null int64
Placement_-_Weekday_(Mo_=_1)                 21201 non-null int64
Placement_-_Time                             21201 non-null object
Confirmation_-_Day_of_Month                  21201 non-null int64
Confirmation_-_Weekday_(Mo_=_1)              21201 non-null int64
Confirmation_-_Time                          21201 non-null object
Arrival_at_Pickup_-_Day_of_Month             21201 non-null int64
Arrival_at_Pickup_-_Weekday_(Mo_=_1)         21201 non-null int64
Arrival_at_Pickup_-_Time                     21201 non-null object
Pickup_-_Day

(21201, 28)

> **Observations**


- The training set DataFrame has 28 columns (29 when the Order No index is counted) and over 21 000 rows.
- Upon closer observation, the dataset has missing values in the Temperature and Precipitation_in_millimeters columns.
    - The Precipitation_in_millimeters column is missing 20 649 values, which is over 97% of the column.
    - The Temperature column is missing 4 366 values, which is over 20% of the column.
- The dataset has 10 object data types, which will have to be converted into integers or floats before they can be used to build a predictive model.
    - Looking further into the data, 5 of these columns contain a time in the format below. They will either have to be converted into an integer or a float, or be dropped.
    Example: 9:35:46 AM or 12:39:25 PM
    - 2 of the other columns contain categorical data, which will have to be encoded before it can be used to build a predictive model.
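
The missing-value percentages quoted above can be derived directly from a DataFrame. The following is a minimal sketch on a made-up frame (`toy` and its values are assumptions for illustration, not the Sendy data); the same expression could be applied to `train`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training set; the values are made up.
toy = pd.DataFrame({
    "Temperature": [20.0, 26.0, np.nan, 19.0, 15.0],
    "Precipitation_in_millimeters": [np.nan, np.nan, np.nan, 2.9, np.nan],
    "Distance_(KM)": [4, 16, 3, 9, 9],
})

# Share of missing values per column, expressed as a percentage.
missing_pct = toy.isna().mean().mul(100).round(1)

# Show only the columns that actually have gaps.
print(missing_pct[missing_pct > 0])
```

Because `isna()` yields booleans, taking the mean gives the fraction of missing entries per column without counting rows by hand.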

##### Testing set DataFrame

In [338]:
# Importing the Test.csv file from github as a Pandas DataFrame and setting the OrderNo as the index.
test = pd.read_csv(
    'https://raw.githubusercontent.com/thembeks/Regression-Sendy-Logistics-Challenge-Team-14/Predict/Test.csv', 
    index_col=0)

# Replacing all the blank spaces between words in the column names with an underscore.
test.columns = [col.replace(' ', '_') for col in test.columns]

# Displays the first 10 rows of the test DataFrame to show the layout of the DataFrame.
test.head(10)

Order No,User_Id,Vehicle_Type,Platform_Type,Personal_or_Business,Placement_-_Day_of_Month,Placement_-_Weekday_(Mo_=_1),Placement_-_Time,Confirmation_-_Day_of_Month,Confirmation_-_Weekday_(Mo_=_1),Confirmation_-_Time,...,Pickup_-_Weekday_(Mo_=_1),Pickup_-_Time,Distance_(KM),Temperature,Precipitation_in_millimeters,Pickup_Lat,Pickup_Long,Destination_Lat,Destination_Long,Rider_Id
Order_No_19248,User_Id_3355,Bike,3,Business,27,3,4:44:10 PM,27,3,4:44:29 PM,...,3,5:06:47 PM,8,,,-1.333275,36.870815,-1.305249,36.82239,Rider_Id_192
Order_No_12736,User_Id_3647,Bike,3,Business,17,5,12:57:35 PM,17,5,12:59:17 PM,...,5,1:25:37 PM,5,,,-1.272639,36.794723,-1.277007,36.823907,Rider_Id_868
Order_No_768,User_Id_2154,Bike,3,Business,27,4,11:08:14 AM,27,4,11:25:05 AM,...,4,11:57:54 AM,5,22.8,,-1.290894,36.822971,-1.276574,36.851365,Rider_Id_26
Order_No_15332,User_Id_2910,Bike,3,Business,17,1,1:51:35 PM,17,1,1:53:27 PM,...,1,2:16:52 PM,5,24.5,,-1.290503,36.809646,-1.303382,36.790658,Rider_Id_685
Order_No_21373,User_Id_1205,Bike,3,Business,11,2,11:30:28 AM,11,2,11:34:45 AM,...,2,11:56:04 AM,6,24.4,,-1.281081,36.814423,-1.266467,36.792161,Rider_Id_858
Order_No_14573,User_Id_2338,Bike,1,Personal,13,1,6:29:29 PM,13,1,6:29:33 PM,...,1,6:39:02 PM,16,19.3,,-1.256606,36.795974,-1.223983,36.898452,Rider_Id_452
Order_No_6731,User_Id_488,Bike,2,Personal,17,3,9:53:29 AM,17,3,9:53:50 AM,...,3,10:08:00 AM,18,20.9,,-1.225272,36.875672,-1.304713,36.808955,Rider_Id_704
Order_No_18436,User_Id_3764,Bike,3,Business,28,4,8:51:13 AM,28,4,8:52:46 AM,...,4,8:58:53 AM,8,22.7,,-1.273539,36.833775,-1.297299,36.789446,Rider_Id_62
Order_No_2288,User_Id_2866,Bike,3,Business,28,4,8:58:21 AM,28,4,8:58:40 AM,...,4,9:30:35 AM,8,19.4,,-1.255189,36.782203,-1.28577,36.759172,Rider_Id_177
Order_No_9063,User_Id_1329,Bike,3,Business,4,5,4:28:28 PM,4,5,4:29:22 PM,...,5,4:44:01 PM,15,21.7,,-1.273056,36.811298,-1.330552,36.714289,Rider_Id_674


In [339]:
# A Pandas method that prints the information about a DataFrame.
#The information printed: the index dtype,column dtypes, non-null values and memory usage.
test.info()

#Returns a tuple representing the dimensionality of the DataFrame.
test.shape

<class 'pandas.core.frame.DataFrame'>
Index: 7068 entries, Order_No_19248 to Order_No_1603
Data columns (total 24 columns):
User_Id                                 7068 non-null object
Vehicle_Type                            7068 non-null object
Platform_Type                           7068 non-null int64
Personal_or_Business                    7068 non-null object
Placement_-_Day_of_Month                7068 non-null int64
Placement_-_Weekday_(Mo_=_1)            7068 non-null int64
Placement_-_Time                        7068 non-null object
Confirmation_-_Day_of_Month             7068 non-null int64
Confirmation_-_Weekday_(Mo_=_1)         7068 non-null int64
Confirmation_-_Time                     7068 non-null object
Arrival_at_Pickup_-_Day_of_Month        7068 non-null int64
Arrival_at_Pickup_-_Weekday_(Mo_=_1)    7068 non-null int64
Arrival_at_Pickup_-_Time                7068 non-null object
Pickup_-_Day_of_Month                   7068 non-null int64
Pickup_-_Weekday_(Mo_=_1)     

(7068, 24)

> **Observations**

- The testing set DataFrame contains 24 columns (25 when the Order No index is counted) and 7068 rows.
- Just like the Training set DataFrame, the Precipitation_in_millimeters and Temperature columns are missing data.
    - The missing data will be dealt with in the same way as in the Training set DataFrame.
- The differences between the Training set and Testing set DataFrames are:
    - The dependent variable (DV) is not in the Testing set DataFrame.
    - The Training set DataFrame has 3 extra columns (Arrival_at_Destination_-_Day_of_Month, Arrival_at_Destination_-_Weekday_(Mo_=_1) and Arrival_at_Destination_-_Time). These columns will most likely be dropped during further analysis to make the Training set DataFrame congruent with the Testing set DataFrame.
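
One way to confirm which columns exist only in the training set is to compare the two column indexes. This is a small sketch using empty toy frames (`train_toy` and `test_toy` are hypothetical stand-ins with a handful of column names); the same call could be made on `train.columns` and `test.columns`.

```python
import pandas as pd

# Toy column sets standing in for the two DataFrames (no rows needed).
train_toy = pd.DataFrame(columns=[
    "Pickup_-_Time", "Arrival_at_Destination_-_Time",
    "Distance_(KM)", "Time_from_Pickup_to_Arrival"])
test_toy = pd.DataFrame(columns=["Pickup_-_Time", "Distance_(KM)"])

# Columns present in the training frame but absent from the test frame.
only_in_train = train_toy.columns.difference(test_toy.columns).tolist()
print(only_in_train)
```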

### Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an approach to data analysis that uses many tools (mainly graphical) to maximize insight into a data set, extract important variables, and detect outliers and anomalies, amongst other details that are missed when looking only at the DataFrame.

##### Training set DataFrame

In [340]:
# Pandas Profile Report on the train DataFrame and converting the report to an html file.
profile_train = ProfileReport(train, check_correlation = True)
profile_train.to_file(outputfile='output.html')
HtmlFile = open('output.html', 'r', encoding='utf-8')
source_code = HtmlFile.read()

IFrame(src='./output.html', width=900, height=800)





>**Observations**

**Aside**: The Profile report is interactive and contains the statistics and graphs associated with the DataFrame (just click Toggle details).

**Information about the dataset:**
- The train DataFrame contains 29 columns and 21 201 observations, with 4.1% of the data missing.
- There are 11 numeric variables, 8 categorical variables, 1 unique text variable and 9 rejected variables.
- The User_Id, Placement_-_Time, Confirmation_-_Time, Arrival_at_Pickup_-_Time, Pickup_-_Time, Arrival_at_Destination_-_Time and Rider_Id columns have a high cardinality.
    - This means that each of these columns contains a large number of distinct values: 3 186, 15 686, 15 742, 15 767, 15 690, 15 725 and 924 respectively.
- The Vehicle_Type column has only one value, Bike, and should be ignored for analysis.
- There is a high correlation between (Confirmation_-_Day_of_Month and Placement_-_Day_of_Month), (Arrival_at_Pickup_-_Day_of_Month and Confirmation_-_Day_of_Month), (Arrival_at_Pickup_-_Weekday_(Mo_=_1) and Confirmation_-_Weekday_(Mo_=_1)), (Pickup_-_Day_of_Month and Arrival_at_Pickup_-_Day_of_Month), (Pickup_-_Weekday_(Mo_=_1) and Arrival_at_Pickup_-_Weekday_(Mo_=_1)), (Arrival_at_Destination_-_Day_of_Month and Pickup_-_Day_of_Month), (Arrival_at_Destination_-_Weekday_(Mo_=_1) and Pickup_-_Weekday_(Mo_=_1)), where the correlation coefficient is approximately 1.
    - This means that there is a direct relationship between the variables and in essence they are the same. When one variable moves up or down, the other variable moves in the same direction, as it is a near perfect positive correlation.
- The Temperature and Precipitation_in_millimeters columns are missing 20.6% and 97.4% of their data respectively.
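
The highly correlated pairs listed above come from the profile report. A comparable check can be sketched with a plain correlation matrix, as below. The frame `toy`, its values and the 0.9 `threshold` are assumptions chosen for illustration, not values from the report.

```python
import numpy as np
import pandas as pd

# Toy data: the confirmation day always equals the placement day.
toy = pd.DataFrame({
    "Placement_-_Day_of_Month": [9, 12, 30, 15, 13, 14],
    "Confirmation_-_Day_of_Month": [9, 12, 30, 15, 13, 14],
    "Distance_(KM)": [4, 16, 3, 9, 9, 9],
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair is reported once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cut-off, chosen for illustration
high_pairs = [(row, col)
              for row in upper.index
              for col in upper.columns
              if pd.notna(upper.loc[row, col]) and upper.loc[row, col] > threshold]
print(high_pairs)
```

Masking the lower triangle and the diagonal avoids listing each pair twice and avoids reporting every column as correlated with itself.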


**Order_no:**
- The values are categorical and unique; no value is repeated.

**User_Id:**
- The values are categorical and there are no missing values.
- 15% (3 186) of the values are distinct.
    - This means that 3 186 customers have used Sendy as their logistics company for the transportation of goods.
    - Many customers have used Sendy's services repeatedly. The most frequent is User_Id_393, who used Sendy 645 times, which accounts for 3% of the values within the column.
    
**Platform_Type:**
- The values are numeric and range from 1 to 4.
    - 1, 2, 3 and 4 each represent a method by which an order is placed with Sendy.
- There are no missing values. Since each value represents a method, the mean is irrelevant, as it would only average the method numbers.
- From the data, the most used platform is 3, accounting for 85.2% of the values, and the least used is 4 (0.1%).
    - Platform 3 is the first quartile, the median and the third quartile of the dataset, which means most of the dataset falls on that value.
    
**Personal_or_Business:**
- The values are categorical and there are no missing or unique values.
- There are 2 distinct values:
    - Business accounts for 82% of the values and Personal for 18%.
    
**Placement_-_Day_of_Month:**
- The values are numeric, there are no missing values and there are 31 distinct values, one for each day of the month.
- Confirmation_-_Day_of_Month, Arrival_at_Pickup_-_Day_of_Month, Pickup_-_Day_of_Month and Arrival_at_Destination_-_Day_of_Month are highly correlated with this column and can be ignored.

**Placement_-_Weekday_(Mo_=_1):**
- The values are numeric, there are no missing values and there are 7 distinct values, one for each day of the week.
- Confirmation_-_Weekday_(Mo_=_1), Arrival_at_Pickup_-_Weekday_(Mo_=_1), Pickup_-_Weekday_(Mo_=_1) and Arrival_at_Destination_-_Weekday_(Mo_=_1) are highly correlated with this column and can be ignored to avoid multicollinearity.

**Placement_-_Time:**
- The values are categorical, there are no missing values and 74% of the values are distinct.
- Since the placement time is measured down to the second, repeated values are rare.
    - The maximum number of orders placed at the exact same time is 6, a count that was also reached at 6 other times.

**Confirmation_-_Time:**
- The values are categorical, there are no missing values and 74.3% of the values are distinct.
- Since the confirmation time is measured down to the second, repeated values are rare.
    - The maximum number of orders confirmed at the exact same time is 6.

**Arrival_at_Pickup_-_Time:**
- The values are categorical, there are no missing values and 74.4% of the values are distinct.
- Since the arrival time is measured down to the second, repeated values are rare.
    - The maximum number of orders sharing the exact same arrival time is 6, a count that was also reached at 5 other times.

**Pickup_-_Time:**
- The values are categorical, there are no missing values and 74% of the values are distinct.
- Since the pickup time is measured down to the second, repeated values are rare.
    - The maximum number of orders picked up at the exact same time is 6, a count that was also reached at 4 other times.

**Arrival_at_Destination_-_Time:**
- The values are categorical, there are no missing values and 74.2% of the values are distinct.
- Since the arrival time is measured down to the second, repeated values are rare.
    - The maximum number of orders arriving at the exact same time is 7.

**Distance_(KM):**
- The values are numerical, there are no missing values and 0.2% are unique.
- The shortest distance a Sendy bike had to travel for a delivery was 1 km, the furthest was 49 km and the most common distance travelled is 8 km.
- The average distance travelled was 9.5 km with a standard deviation of 5.669 km.

**Temperature:**
- The values are numerical, 20.6% of the data is missing and 0.9% are unique.
- The average temperature in the dataset is 23.259, with the minimum being 11.2 and the maximum 32.1.

**Precipitation_in_millimeters:**
- The values are numerical, 97.4% of the data is missing and 0.3% are unique.
- The average precipitation in the dataset is 7.9 mm, with the minimum being 0.1 mm and the maximum 99.1 mm.
- It has a standard deviation of 17.09 and a coefficient of variation of 2.16.

**Pickup_Lat:**
- The values are numerical, there are no missing values and 17.3% of the values are unique.
- Since the values are co-ordinates, summary statistics such as the mean are not meaningful on their own.
- The most frequent co-ordinates show where orders are picked up most often.

**Pickup_Long:**
- The values are numerical, there are no missing values and 17.2% of the values are unique.
- Since the values are co-ordinates, summary statistics such as the mean are not meaningful on their own.
- The most frequent co-ordinates show where orders are picked up most often.

**Destination_Lat:**
- The values are numerical, there are no missing values and 25% of the values are unique.
- Since the values are co-ordinates, summary statistics such as the mean are not meaningful on their own.
- The most frequent co-ordinates show where deliveries are made most often.

**Destination_Long:**
- The values are numerical, there are no missing values and 24.8% of the values are unique.
- Since the values are co-ordinates, summary statistics such as the mean are not meaningful on their own.
- The most frequent co-ordinates show where deliveries are made most often.

**Rider_Id:**
- The values are categorical, there are no missing values and 4.4% of the values are unique.
- One rider has done 247 deliveries for Sendy.

**Time_from_Pickup_to_Arrival:**
- The values are numerical, there are no missing values and 19.2% of the values are unique.
- The average time taken is 1556.9 seconds, with the minimum being 1 second and the maximum 7883 seconds.

##### Rider DataFrame

In [341]:
# Pandas Profile Report on the rider DataFrame and converting the report to an html file.
profile_rider = ProfileReport(rider, check_correlation = True)
profile_rider.to_file(outputfile='output.html')

HtmlFile = open('output.html', 'r', encoding='utf-8')
source_code = HtmlFile.read()

IFrame(src='./output.html', width=900, height=800)



> **Observations**

### Transformation and Processing of data

> **Splitting the data**

In [342]:
# Splitting the train DataFrame into the X and Y variables using the Pandas .iloc[] indexer.

X = train.iloc[:, :-1]
Y = train.iloc[:, -1]

> **Alignment of the datasets**


In [343]:
# Using the Pandas .drop() method.
# Remove columns by specifying the column names and corresponding axis.

X = X.drop(['Arrival_at_Destination_-_Day_of_Month',
            'Arrival_at_Destination_-_Weekday_(Mo_=_1)', 'Arrival_at_Destination_-_Time'], axis=1)

> **Missing data**

**Precipitation_in_mm**

Research shows that Nairobi receives an average of 19 mm of rainfall in July (the driest month) and 206 mm in April (the wettest month). From the Exploratory Data Analysis, the mean rainfall in the dataset is 7.9 mm and the median is 2.9 mm, both far below the researched averages. With 97.4% of its values missing, the Precipitation_in_millimeters column will therefore be dropped, because replacing the missing values with the mean or median may distort what the models learn from the data.

In [344]:
# Using the Pandas .drop() method.
# Remove columns that are not useful by specifying the column names and corresponding axis.

X = X.drop(['Precipitation_in_millimeters'], axis=1)

In [345]:
# Using the Pandas .drop() method.
# Remove columns that are not useful by specifying the column names and corresponding axis.

test = test.drop(['Precipitation_in_millimeters'], axis=1)

**Temperature**

Research shows that Nairobi's average temperature ranges from about 10 degrees Celsius during the winter months to 29 degrees Celsius during the summer months. From the Exploratory Data Analysis, the average temperature is 23.3 degrees Celsius, with the minimum and maximum being 11.2 and 32.1 degrees Celsius respectively. This is close to the researched values. The median and modal temperatures are 23.5 and 24.7 degrees Celsius respectively, which are also close to the average temperature of the dataset.

Since only 20.6% of the values are missing, we will fill the missing data using the mean temperature of the dataset (approximately 23.3 degrees Celsius).

In [346]:
# Replacing NaN values using the Pandas .fillna() method with the mean of the column.

X['Temperature']= X['Temperature'].fillna(X['Temperature'].mean())

In [347]:
# Replacing NaN values using the Pandas .fillna() method with the mean of the column.

test['Temperature']= test['Temperature'].fillna(test['Temperature'].mean())
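
The cells above fill each DataFrame with its own mean. A common alternative, sketched here under the assumption that the test set should only see statistics learned from the training data, is to compute the fill value once from the training features and reuse it. The frames `X_toy` and `test_toy` and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the training features and the test set (made-up values).
X_toy = pd.DataFrame({"Temperature": [20.0, np.nan, 26.0, 23.0]})
test_toy = pd.DataFrame({"Temperature": [np.nan, 22.0]})

# Learn the fill value from the training data only (NaN is skipped by default).
train_mean = X_toy["Temperature"].mean()

# Apply that same value to both frames.
X_toy["Temperature"] = X_toy["Temperature"].fillna(train_mean)
test_toy["Temperature"] = test_toy["Temperature"].fillna(train_mean)

print(train_mean)
```

Whether this matters in practice depends on how different the two means are; the sketch only shows the mechanics.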

> **Transformation of data**

**Dropping columns**

From the Exploratory Data Analysis, there is a high correlation between (Confirmation_-_Day_of_Month and Placement_-_Day_of_Month), (Arrival_at_Pickup_-_Day_of_Month and Confirmation_-_Day_of_Month), (Arrival_at_Pickup_-_Weekday_(Mo_=_1) and Confirmation_-_Weekday_(Mo_=_1)), (Pickup_-_Day_of_Month and Arrival_at_Pickup_-_Day_of_Month), (Pickup_-_Weekday_(Mo_=_1) and Arrival_at_Pickup_-_Weekday_(Mo_=_1)), (Arrival_at_Destination_-_Day_of_Month and Pickup_-_Day_of_Month), (Arrival_at_Destination_-_Weekday_(Mo_=_1) and Pickup_-_Weekday_(Mo_=_1)).

There is a direct relationship between the variables and in essence they are the same. Therefore, we will keep only the Placement_-_Day_of_Month and Placement_-_Weekday_(Mo_=_1) columns and drop the rest.

In [348]:
# Using the Pandas .drop() method.
# Remove columns that are not useful by specifying the column names and corresponding axis.

X = X.drop(['Confirmation_-_Day_of_Month', 'Confirmation_-_Weekday_(Mo_=_1)',
            'Arrival_at_Pickup_-_Day_of_Month', 'Arrival_at_Pickup_-_Weekday_(Mo_=_1)',
            'Pickup_-_Day_of_Month', 'Pickup_-_Weekday_(Mo_=_1)'], axis=1)

In [349]:
# Using the Pandas .drop() method.
# Remove columns that are not useful by specifying the column names and corresponding axis.

test = test.drop(['Confirmation_-_Day_of_Month', 'Confirmation_-_Weekday_(Mo_=_1)',
                  'Arrival_at_Pickup_-_Day_of_Month', 'Arrival_at_Pickup_-_Weekday_(Mo_=_1)',
                  'Pickup_-_Day_of_Month', 'Pickup_-_Weekday_(Mo_=_1)'], axis=1)

In [350]:
# Using the Pandas .drop() method.
# Remove columns that are not useful by specifying the column names and corresponding axis.

X = X.drop(['Vehicle_Type', 'User_Id', 'Rider_Id'], axis=1)

In [351]:
# Using the Pandas .drop() method.
# Remove columns that are not useful by specifying the column names and corresponding axis.

test = test.drop(['Vehicle_Type', 'User_Id', 'Rider_Id'], axis=1)

**Converting time strings into seconds**

The time columns are stored as strings (object dtype). To build a predictive model, they need to be either integers or floats. Therefore, we convert the strings into floats, expressed as seconds, so that the data can be used in our model.

In [352]:
# Converting time strings to seconds using the Pandas .to_timedelta() method.
# Using the .dt accessor object for datetimelike properties of the Series values to convert to seconds.

X['Placement_-_Time'] = pd.to_timedelta(
    X['Placement_-_Time']).dt.total_seconds()

X['Confirmation_-_Time'] = pd.to_timedelta(
    X['Confirmation_-_Time']).dt.total_seconds()

X['Arrival_at_Pickup_-_Time'] = pd.to_timedelta(
    X['Arrival_at_Pickup_-_Time']).dt.total_seconds()

X['Pickup_-_Time'] = pd.to_timedelta(
    X['Pickup_-_Time']).dt.total_seconds()

In [353]:
# Converting time strings to seconds using the Pandas .to_timedelta() method.
# Using the .dt accessor object for datetimelike properties of the Series values to convert to seconds.

test['Placement_-_Time'] = pd.to_timedelta(
    test['Placement_-_Time']).dt.total_seconds()

test['Confirmation_-_Time'] = pd.to_timedelta(
    test['Confirmation_-_Time']).dt.total_seconds()

test['Arrival_at_Pickup_-_Time'] = pd.to_timedelta(
    test['Arrival_at_Pickup_-_Time']).dt.total_seconds()

test['Pickup_-_Time'] = pd.to_timedelta(
    test['Pickup_-_Time']).dt.total_seconds()
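
`pd.to_timedelta` is designed for durations rather than clock times, so it is worth checking how the AM/PM suffix is handled by the installed pandas version. As a hedged alternative, the sketch below parses the 12-hour clock explicitly with `pd.to_datetime` and then computes seconds since midnight. The helper name `clock_to_seconds` and the sample Series are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

def clock_to_seconds(times: pd.Series) -> pd.Series:
    """Convert 'h:mm:ss AM/PM' strings to seconds since midnight."""
    # %I is the 12-hour clock hour and %p is the AM/PM marker.
    parsed = pd.to_datetime(times, format="%I:%M:%S %p")
    seconds = parsed.dt.hour * 3600 + parsed.dt.minute * 60 + parsed.dt.second
    return seconds.astype(float)

sample = pd.Series(["9:35:46 AM", "12:39:25 PM", "3:07:35 PM"])
seconds = clock_to_seconds(sample)
print(seconds.tolist())
```

With this approach an afternoon time such as 3:07:35 PM maps to hour 15, so morning and afternoon orders are not collapsed onto the same values.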

**Calculating the distance between the order Pickup and order Destination**

In [354]:
# Using the geopy.distance to calculate the distance between the two latitude and longitude points

def distance_calc(X):
    '''Calculate the distance (km) between two lat/long points using geopy's distance function.'''

    dist_calc = distance((X.Pickup_Lat, X.Pickup_Long),
                         (X.Destination_Lat, X.Destination_Long)).km
    return dist_calc

In [355]:
# Using a lambda function to apply distance_calc() to each row of the X DataFrame and create a new column.

X['Distance_(lat/long)_(KM)'] = X.apply(lambda r: distance_calc(r), axis=1)

In [356]:
# Using the geopy.distance to calculate the distance between the two latitude and longitude points

def distance_calc(test):

    dist_calc = distance((test.Pickup_Lat, test.Pickup_Long),
                         (test.Destination_Lat, test.Destination_Long)).km
    return dist_calc

In [357]:
# Using a lambda function to apply distance_calc() to each row of the test DataFrame and create a new column.

test['Distance_(lat/long)_(KM)'] = test.apply(lambda r: distance_calc(r), axis=1)
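
As a rough cross-check on the geopy result, the haversine formula gives the great-circle distance on a spherical Earth. It is a different and slightly less accurate method than geopy's ellipsoidal calculation, so small differences are expected. The function name `haversine_km` and the 6371 km mean Earth radius are assumptions made for this sketch.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres on a sphere of the given radius."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# Pickup and destination of Order_No_4211 from the training set preview.
d = haversine_km(-1.317755, 36.83037, -1.300406, 36.829741)
print(round(d, 2))
```

The result should land close to the geopy value for the same order, which makes it a quick sanity check that latitude and longitude were passed in the right order.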

> **Encoding categorical data**

In [358]:
# Encoding categorical data using pd.get_dummies for the X DataFrame

X = pd.get_dummies(X, columns=['Platform_Type', 'Personal_or_Business'], drop_first=True)

In [359]:
# Encoding categorical data using pd.get_dummies for the test dataframe

test = pd.get_dummies(test, columns=['Platform_Type', 'Personal_or_Business'], drop_first=True)
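
Encoding the two DataFrames separately works when both contain every category. If a rare category were absent from one of them, the resulting columns would differ. The sketch below shows one way to guard against that by reindexing the test columns to the training columns; `X_toy`, `test_toy` and their values are hypothetical.

```python
import pandas as pd

# Toy frames: the test sample happens to contain no Platform_Type 4.
X_toy = pd.DataFrame({"Platform_Type": [1, 2, 3, 4]})
test_toy = pd.DataFrame({"Platform_Type": [1, 2, 3, 3]})

X_enc = pd.get_dummies(X_toy, columns=["Platform_Type"], drop_first=True)
test_enc = pd.get_dummies(test_toy, columns=["Platform_Type"], drop_first=True)

# Give the test frame exactly the training columns, filling absent dummies with 0.
test_enc = test_enc.reindex(columns=X_enc.columns, fill_value=0)

print(list(test_enc.columns))
```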

> **Rename columns**

In [360]:
X.columns= ['Placement(Day)', 'Placement(Weekday)', 'Placement(Time)', 'Confirmation(Time)', 
            'Arrival(Time)', 'Pickup(Time)', 'Distance(KM)','Temperature', 'Pickup(Lat)', 
            'Pickup(Long)', 'Destination(Lat)', 'Destination(Long)', 'Distance(lat/long_KM)',
            'Platform(Type2)', 'Platform(Type3)', 'Platform(Type4)', 'Personal/Business']

In [361]:
test.columns= ['Placement(Day)', 'Placement(Weekday)', 'Placement(Time)', 'Confirmation(Time)', 
               'Arrival(Time)', 'Pickup(Time)', 'Distance(KM)','Temperature', 'Pickup(Lat)', 
               'Pickup(Long)', 'Destination(Lat)', 'Destination(Long)', 'Distance(lat/long_KM)',
               'Platform(Type2)', 'Platform(Type3)', 'Platform(Type4)', 'Personal/Business']

> **How the transformed data look**

In [362]:
X.head()

Order No,Placement(Day),Placement(Weekday),Placement(Time),Confirmation(Time),Arrival(Time),Pickup(Time),Distance(KM),Temperature,Pickup(Lat),Pickup(Long),Destination(Lat),Destination(Long),Distance(lat/long_KM),Platform(Type2),Platform(Type3),Platform(Type4),Personal/Business
Order_No_4211,9,5,34546.0,34810.0,36287.0,37650.0,4,20.4,-1.317755,36.83037,-1.300406,36.829741,1.919586,0,1,0,0
Order_No_25375,12,5,40576.0,41001.0,42022.0,42249.0,16,26.4,-1.351453,36.899315,-1.295004,36.814358,11.329354,0,1,0,1
Order_No_1899,30,2,45565.0,45764.0,46174.0,46383.0,3,23.258889,-1.308284,36.843419,-1.300921,36.828195,1.879806,0,1,0,0
Order_No_9336,15,5,33934.0,33965.0,34676.0,34986.0,9,19.2,-1.281301,36.832396,-1.257147,36.795063,4.939253,0,1,0,0
Order_No_27883,13,1,35718.0,35778.0,36233.0,36323.0,9,15.4,-1.266597,36.792118,-1.295041,36.809817,3.711035,0,0,0,1


##### Splitting the dataset

In [363]:
# Using the sklearn.model_selection train_test_split() function to split X and Y.
# Test size will be 0.2 (20% of the data will be the test set).

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=1)
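
As a quick sanity check on the split, the sketch below repeats the same `train_test_split` call on placeholder arrays. The row count of 21201 is an assumption inferred from the ratio of the RSS and MSE values reported further down in this notebook, and `X_demo` and `Y_demo` are hypothetical stand-ins for `X` and `Y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for X and Y (assumed 21201 rows, 17 feature columns).
n_rows, n_features = 21201, 17
X_demo = np.zeros((n_rows, n_features))
Y_demo = np.zeros(n_rows)

X_train_demo, X_test_demo, Y_train_demo, Y_test_demo = train_test_split(
    X_demo, Y_demo, test_size=0.2, random_state=1)

# Every row lands in exactly one of the two partitions.
print(X_train_demo.shape, X_test_demo.shape)
test_fraction = len(X_test_demo) / n_rows
print(f"Test fraction: {test_fraction:.3f}")
```

Checking the shapes right after the split makes it easier to spot a mismatch between features and targets before any model is fitted.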

# Multiple Linear Regression

> **Training the model**

In [364]:
multiregression= LinearRegression()

In [365]:
multiregression.fit(X_train, Y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [366]:
# extract model intercept
beta0_multiregression = float(multiregression.intercept_)
print("Intercept:", beta0_multiregression)

Intercept: 49439.484226825574


In [367]:
# extract model coefficients
betaj_multiregression = pd.DataFrame(multiregression.coef_, X.columns, columns=['Coefficient'])


In [368]:
betaj_multiregression

Feature,Coefficient
Placement(Day),-0.998106
Placement(Weekday),6.955475
Placement(Time),0.002388
Confirmation(Time),-0.001727
Arrival(Time),-0.00203
Pickup(Time),0.000181
Distance(KM),82.200409
Temperature,2.091895
Pickup(Lat),592.083498
Pickup(Long),-1072.912732
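
To make the role of the intercept and coefficients concrete, the sketch below fits a `LinearRegression` on synthetic data and rebuilds its predictions by hand as the intercept plus the coefficient-weighted sum of the features. The data, `X_demo`, `y_demo` and `true_coef` are all hypothetical and are not taken from the Sendy dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: three features with known coefficients plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y_demo = 10.0 + X_demo @ true_coef + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X_demo, y_demo)

# Prediction by hand: intercept plus the coefficient-weighted sum of the features.
manual_pred = model.intercept_ + X_demo @ model.coef_
same = np.allclose(manual_pred, model.predict(X_demo))
print("Manual prediction matches predict():", same)
```

Each coefficient is the change in the predicted target for a one-unit increase in that feature with the others held fixed, so its magnitude depends on the feature's units. This is one plausible reason the latitude and longitude coefficients above are so much larger than the time-of-day ones.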


> **Predictions**

In [370]:
multiregression_predict= multiregression.predict(X_test)
multiregression_predict[:, None]

array([[1305.76041207],
       [3348.13407729],
       [1305.9846667 ],
       ...,
       [1067.22323318],
       [1001.11261072],
       [1104.39400009]])

> **Assessing model accuracy**

**Mean Squared Error**

In [446]:
# Calculating the mean squared error between the predicted values and test set values.

def mse(Y_test, multiregression_predict):

    MSE = mean_squared_error(Y_test, multiregression_predict)
    return MSE

In [447]:
mse(Y_test, multiregression_predict)

624340.5561579801

**Residual Sum of Squares**

In [440]:
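# Calculating the residual sum of squares from the MSE between the predicted values and test set values.
# Note: the MSE is scaled by len(X), the size of the full dataset, rather than len(Y_test),
# so the value reported below is larger than the RSS over the test set alone.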

def rss(Y_test, multiregression_predict):
    
    RSS = mean_squared_error(
        Y_test, multiregression_predict)*len(X)
    return RSS

In [448]:
rss(Y_test, multiregression_predict)

13236644131.105337
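
The `rss` helper above scales the test-set MSE by `len(X)`. For comparison, here is a minimal sketch of the residual sum of squares taken over only the evaluated samples, computed both directly and as MSE times the sample count. The arrays `y_true` and `y_pred` are made-up values for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions for illustration only.
y_true = np.array([100.0, 250.0, 400.0, 550.0])
y_pred = np.array([110.0, 240.0, 430.0, 500.0])

# RSS is the sum of squared residuals over the evaluated samples.
rss_direct = float(np.sum((y_true - y_pred) ** 2))

# Equivalent form: MSE multiplied by the number of evaluated samples.
rss_from_mse = mean_squared_error(y_true, y_pred) * len(y_true)
print(rss_direct, rss_from_mse)
```

Because RSS grows with the number of samples, it is only comparable between models evaluated on the same test set.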

**R squared**

In [449]:
# Calculating the R squared between the predicted values and test set values.

def r2(Y_test, multiregression_predict):
    
    R2 = r2_score(Y_test, multiregression_predict)
    return R2

In [450]:
r2(Y_test, multiregression_predict)

0.3449921881763034
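
As a reminder of what `r2_score` measures, the sketch below recomputes R squared by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The arrays are made-up values, and `rss_demo` and `tss_demo` are hypothetical names.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions for illustration only.
y_true = np.array([100.0, 250.0, 400.0, 550.0])
y_pred = np.array([110.0, 240.0, 430.0, 500.0])

rss_demo = np.sum((y_true - y_pred) ** 2)         # variation left unexplained
tss_demo = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1.0 - rss_demo / tss_demo
print(r2_manual, r2_score(y_true, y_pred))
```

Read this way, the score of about 0.345 above means the linear model accounts for roughly 34.5% of the variance in the test-set targets.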

**Root Mean Squared Error**

In [444]:
# Calculating the root mean squared error between the predicted values and test set values.

def rmse(Y_test, multiregression_predict):
    RMSE = np.sqrt(mean_squared_error(Y_test, multiregression_predict))

    return RMSE

In [451]:
rmse(Y_test, multiregression_predict)

790.1522360646587
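
Since the same metrics are computed again for each model, one option is a single helper that returns them together. The sketch below is one possible shape for it. `evaluate_regression` is a hypothetical name, not a function defined elsewhere in this notebook, and the arrays are made-up values.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def evaluate_regression(y_true, y_pred):
    """Return MSE, RMSE and R squared for one set of predictions (hypothetical helper)."""
    mse_value = mean_squared_error(y_true, y_pred)
    return {
        "MSE": mse_value,
        "RMSE": float(np.sqrt(mse_value)),
        "R2": r2_score(y_true, y_pred),
    }

# Made-up targets and predictions for illustration only.
scores = evaluate_regression(
    np.array([100.0, 250.0, 400.0, 550.0]),
    np.array([110.0, 240.0, 430.0, 500.0]))
print(scores)
```

In the notebook this could be called as `evaluate_regression(Y_test, multiregression_predict)`, which would avoid redefining `mse`, `r2` and `rmse` for every model.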

# XGBoost Regression

> **Training the model**

In [373]:
xgbregression= xgb.XGBRegressor()
xgbregression.fit(X_train, Y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
             n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, verbosity=1)

> **Predictions**

In [374]:
xgbregression_predict= xgbregression.predict(X_test)
xgbregression_predict[:, None]

array([[1123.529 ],
       [3080.5867],
       [1437.8683],
       ...,
       [1527.0631],
       [ 940.1639],
       [1397.5447]], dtype=float32)

> **Assessing model accuracy**

**Mean Squared Error**

In [453]:
# Calculating the mean squared error between the predicted values and test set values.

def mse(Y_test, xgbregression_predict):

    MSE = mean_squared_error(Y_test, xgbregression_predict)
    return MSE

In [454]:
mse(Y_test, xgbregression_predict)

587086.046361122

**Residual Sum of Squares**

In [455]:
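# Calculating the residual sum of squares from the MSE between the predicted values and test set values.
# Note: the MSE is scaled by len(X), the size of the full dataset, rather than len(Y_test),
# so the value reported below is larger than the RSS over the test set alone.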

def rss(Y_test, xgbregression_predict):
    
    RSS = mean_squared_error(
        Y_test, xgbregression_predict)*len(X)
    return RSS

In [456]:
rss(Y_test, xgbregression_predict)

12446811268.90215

**R squared**

In [457]:
# Calculating the R squared between the predicted values and test set values.

def r2(Y_test, xgbregression_predict):
    
    R2 = r2_score(Y_test, xgbregression_predict)
    return R2

In [458]:
r2(Y_test, xgbregression_predict)

0.3840766184634653

**Root Mean Squared Error**

In [459]:
# Calculating the root mean squared error between the predicted values and test set values.

def rmse(Y_test, xgbregression_predict):
    RMSE = np.sqrt(mean_squared_error(Y_test, xgbregression_predict))

    return RMSE

In [460]:
rmse(Y_test, xgbregression_predict)

766.2154046749008
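
To compare the two baselines side by side, the sketch below collects the test-set metrics printed above into one table and computes the relative RMSE reduction of XGBoost over the linear model. The numbers are copied from the outputs in this notebook, and `baseline_scores` is a hypothetical name.

```python
import pandas as pd

# Test-set metrics copied from the outputs reported in this notebook.
baseline_scores = pd.DataFrame(
    {
        "MSE": [624340.5561579801, 587086.046361122],
        "RMSE": [790.1522360646587, 766.2154046749008],
        "R2": [0.3449921881763034, 0.3840766184634653],
    },
    index=["Multiple Linear Regression", "XGBoost Regression"],
)

# Relative RMSE reduction of XGBoost over the linear baseline, in percent.
rmse_linear, rmse_xgb = baseline_scores["RMSE"]
rmse_reduction_pct = 100 * (rmse_linear - rmse_xgb) / rmse_linear
print(baseline_scores.round(3))
print(f"RMSE reduction: {rmse_reduction_pct:.1f}%")
```

On these figures XGBoost with default settings lowers the RMSE by about 3% and raises R squared from roughly 0.345 to 0.384, a modest improvement over the linear baseline.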

# Polynomial Regression

# Support Vector Regression

# Decision Tree Regression

# Random Forest Regression