# Sendy Logistics

Logistics costs in Sub-Saharan Africa increase the cost of manufactured goods by up to 320%, whereas in Europe the figure is only 90% of the manufacturing cost. Economies perform better when logistics are efficient and affordable.

Sendy is a logistics company based in Nairobi, Kenya. They help the men and women behind every type of business to trade easily, deliver more competitively and build extraordinary businesses.


***“We believe in them; we believe that logistics should be an enabler for them to achieve their goals, rather than a hindrance. We believe that everyone should be able to participate and thrive in the economy and that no small business should be left out because the cost of logistics is either too high or inaccessible.”***


The purpose of this notebook is to predict the estimated time of delivery of orders from the time the order is picked up till it is delivered at its final destination.

This will help Sendy improve customer satisfaction as well as the reliability of their service. It will also help ensure that Sendy's resources are used efficiently by lowering the cost of doing business, supporting order scheduling and improving resource management.

# Data Preprocessing
The models in this Baseline notebook are trained with very little data transformation. The data is cleaned only as much as is needed to build models and make predictions. The Baseline models will then be compared with the final models, which are built after further data preprocessing.

### Importing the libraries

In [104]:
#Import python libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

#Packages used to display the Exploratory Data Analysis(EDA)
from pandas_profiling import ProfileReport
from IPython.display import IFrame

#Train_test_split used to split the x dataframe into the training set and test set
from sklearn.model_selection import train_test_split

#Training the multiple linear regression model on the split data
from sklearn.linear_model import LinearRegression

#Training the XGBoost regression model on the split data
import xgboost as xgb
#Accuracy packages
from sklearn.metrics import mean_squared_error

### Importing the datasets

##### **Variable definitions**
The Variable definitions dataframe lists the column names found in the Rider, Train, Test and Sample Submissions dataframes and gives a brief description of the data found within the column.

*Aside note: Column names listed in this dataframe may not appear in the Rider, Train, Test and Sample Submissions dataframes, because columns that prove not to be useful during modelling will be removed.*

In [65]:
var_definitions = pd.read_csv("https://raw.githubusercontent.com/thembeks/Regression-Sendy-Logistics-Challenge-Team-14/Predict/VariableDefinitions.csv", names=['Column_Name', 'Description'])
var_definitions.style.set_properties(subset=['Description'], **{'width': '600px'})

Unnamed: 0,Column_Name,Description
0,Order No,Unique number identifying the order
1,User Id,Unique number identifying the customer on a platform
2,Vehicle Type,"For this competition limited to bikes, however in practice Sendy service extends to trucks and vans"
3,Platform Type,"Platform used to place the order, there are 4 types"
4,Personal or Business,Customer type
5,Placement - Day of Month,Placement - Day of Month i.e 1-31
6,Placement - Weekday (Mo = 1),Placement - Weekday (Monday = 1)
7,Placement - Time,Placement - Time - Time of day the order was placed
8,Confirmation - Day of Month,Confirmation - Day of Month i.e 1-31
9,Confirmation - Weekday (Mo = 1),Confirmation - Weekday (Monday = 1)


##### **Rider dataframe**
The Rider Dataframe lists all the riders that have delivered orders for Sendy and any information pertaining to that particular rider.

In [66]:
rider= pd.read_csv('https://raw.githubusercontent.com/thembeks/Regression-Sendy-Logistics-Challenge-Team-14/Predict/Riders.csv')
rider.columns= [col.replace(' ', '_') for col in rider.columns]
rider.head()

Unnamed: 0,Rider_Id,No_Of_Orders,Age,Average_Rating,No_of_Ratings
0,Rider_Id_396,2946,2298,14.0,1159
1,Rider_Id_479,360,951,13.5,176
2,Rider_Id_648,1746,821,14.3,466
3,Rider_Id_753,314,980,12.5,75
4,Rider_Id_335,536,1113,13.7,156


In [67]:
rider.info()
rider.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 960 entries, 0 to 959
Data columns (total 5 columns):
Rider_Id          960 non-null object
No_Of_Orders      960 non-null int64
Age               960 non-null int64
Average_Rating    960 non-null float64
No_of_Ratings     960 non-null int64
dtypes: float64(1), int64(3), object(1)
memory usage: 37.6+ KB


(960, 5)

##### **Training set dataframe**
Most of the modelling will be done with this dataframe to get a prediction of the arrival time for motorbike deliveries in Nairobi.

In [68]:
train= pd.read_csv('https://raw.githubusercontent.com/thembeks/Regression-Sendy-Logistics-Challenge-Team-14/Predict/Train.csv')
train.columns= [col.replace(' ', '_') for col in train.columns]
train.head()

Unnamed: 0,Order_No,User_Id,Vehicle_Type,Platform_Type,Personal_or_Business,Placement_-_Day_of_Month,Placement_-_Weekday_(Mo_=_1),Placement_-_Time,Confirmation_-_Day_of_Month,Confirmation_-_Weekday_(Mo_=_1),...,Arrival_at_Destination_-_Time,Distance_(KM),Temperature,Precipitation_in_millimeters,Pickup_Lat,Pickup_Long,Destination_Lat,Destination_Long,Rider_Id,Time_from_Pickup_to_Arrival
0,Order_No_4211,User_Id_633,Bike,3,Business,9,5,9:35:46 AM,9,5,...,10:39:55 AM,4,20.4,,-1.317755,36.83037,-1.300406,36.829741,Rider_Id_432,745
1,Order_No_25375,User_Id_2285,Bike,3,Personal,12,5,11:16:16 AM,12,5,...,12:17:22 PM,16,26.4,,-1.351453,36.899315,-1.295004,36.814358,Rider_Id_856,1993
2,Order_No_1899,User_Id_265,Bike,3,Business,30,2,12:39:25 PM,30,2,...,1:00:38 PM,3,,,-1.308284,36.843419,-1.300921,36.828195,Rider_Id_155,455
3,Order_No_9336,User_Id_1402,Bike,3,Business,15,5,9:25:34 AM,15,5,...,10:05:27 AM,9,19.2,,-1.281301,36.832396,-1.257147,36.795063,Rider_Id_855,1341
4,Order_No_27883,User_Id_1737,Bike,1,Personal,13,1,9:55:18 AM,13,1,...,10:25:37 AM,9,15.4,,-1.266597,36.792118,-1.295041,36.809817,Rider_Id_770,1214


In [69]:
train.info()
train.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21201 entries, 0 to 21200
Data columns (total 29 columns):
Order_No                                     21201 non-null object
User_Id                                      21201 non-null object
Vehicle_Type                                 21201 non-null object
Platform_Type                                21201 non-null int64
Personal_or_Business                         21201 non-null object
Placement_-_Day_of_Month                     21201 non-null int64
Placement_-_Weekday_(Mo_=_1)                 21201 non-null int64
Placement_-_Time                             21201 non-null object
Confirmation_-_Day_of_Month                  21201 non-null int64
Confirmation_-_Weekday_(Mo_=_1)              21201 non-null int64
Confirmation_-_Time                          21201 non-null object
Arrival_at_Pickup_-_Day_of_Month             21201 non-null int64
Arrival_at_Pickup_-_Weekday_(Mo_=_1)         21201 non-null int64
Arrival_at_Pickup_-_Time   

(21201, 29)

> **Observations**


- The training set dataframe has a total of 29 columns and over 21 000 rows. 
- On closer inspection, the dataset is missing values in the Temperature and Precipitation in mm columns.
    - The Precipitation column is missing 20 649 values, which is over 97% of the column.
    - The Temperature column is missing 4 366 values, which is over 20% of the column.
- The dataset has 10 object-type columns, which will have to be converted into integers or floats before a model can be built.
    - Of these, 5 columns contain a time in the format below, so they will either have to be converted or dropped altogether.
    Example: 9:35:46 AM or 12:39:25 PM
    - 2 of the other columns contain categorical data, which will have to be encoded before it can be used to build a predictive model.
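
The counts quoted above can be reproduced without reading `info()` by eye. The following is a minimal sketch, assuming a hypothetical helper named `missing_summary` and a small toy frame standing in for `train`; neither is part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return missing counts and percentages for columns that have gaps."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
        "dtype": df.dtypes.astype(str),
    })
    # Keep only columns with at least one missing value, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Hypothetical toy frame standing in for `train`
toy = pd.DataFrame({
    "Temperature": [20.4, 26.4, np.nan, 19.2, 15.4],
    "Precipitation_in_millimeters": [np.nan, np.nan, np.nan, np.nan, 1.2],
    "Placement_-_Time": ["9:35:46 AM", "11:16:16 AM", "12:39:25 PM",
                         "9:25:34 AM", "9:55:18 AM"],
})

summary = missing_summary(toy)
# Columns that are not numeric and therefore need converting or encoding
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

print(summary)
print(non_numeric)
```

Calling `missing_summary(train)` on the real dataframe should list the Temperature and Precipitation columns with their missing shares.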
    
    

##### **Testing set dataframe**
The models built during this process will be tested on this dataframe. It will make a prediction for the time of arrival for motorbike deliveries based on the information in this dataset.

In [70]:
test= pd.read_csv('https://raw.githubusercontent.com/thembeks/Regression-Sendy-Logistics-Challenge-Team-14/Predict/Test.csv')
test.columns= [col.replace(' ', '_') for col in test.columns]
test.head()

Unnamed: 0,Order_No,User_Id,Vehicle_Type,Platform_Type,Personal_or_Business,Placement_-_Day_of_Month,Placement_-_Weekday_(Mo_=_1),Placement_-_Time,Confirmation_-_Day_of_Month,Confirmation_-_Weekday_(Mo_=_1),...,Pickup_-_Weekday_(Mo_=_1),Pickup_-_Time,Distance_(KM),Temperature,Precipitation_in_millimeters,Pickup_Lat,Pickup_Long,Destination_Lat,Destination_Long,Rider_Id
0,Order_No_19248,User_Id_3355,Bike,3,Business,27,3,4:44:10 PM,27,3,...,3,5:06:47 PM,8,,,-1.333275,36.870815,-1.305249,36.82239,Rider_Id_192
1,Order_No_12736,User_Id_3647,Bike,3,Business,17,5,12:57:35 PM,17,5,...,5,1:25:37 PM,5,,,-1.272639,36.794723,-1.277007,36.823907,Rider_Id_868
2,Order_No_768,User_Id_2154,Bike,3,Business,27,4,11:08:14 AM,27,4,...,4,11:57:54 AM,5,22.8,,-1.290894,36.822971,-1.276574,36.851365,Rider_Id_26
3,Order_No_15332,User_Id_2910,Bike,3,Business,17,1,1:51:35 PM,17,1,...,1,2:16:52 PM,5,24.5,,-1.290503,36.809646,-1.303382,36.790658,Rider_Id_685
4,Order_No_21373,User_Id_1205,Bike,3,Business,11,2,11:30:28 AM,11,2,...,2,11:56:04 AM,6,24.4,,-1.281081,36.814423,-1.266467,36.792161,Rider_Id_858


In [71]:
test.info()
test.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7068 entries, 0 to 7067
Data columns (total 25 columns):
Order_No                                7068 non-null object
User_Id                                 7068 non-null object
Vehicle_Type                            7068 non-null object
Platform_Type                           7068 non-null int64
Personal_or_Business                    7068 non-null object
Placement_-_Day_of_Month                7068 non-null int64
Placement_-_Weekday_(Mo_=_1)            7068 non-null int64
Placement_-_Time                        7068 non-null object
Confirmation_-_Day_of_Month             7068 non-null int64
Confirmation_-_Weekday_(Mo_=_1)         7068 non-null int64
Confirmation_-_Time                     7068 non-null object
Arrival_at_Pickup_-_Day_of_Month        7068 non-null int64
Arrival_at_Pickup_-_Weekday_(Mo_=_1)    7068 non-null int64
Arrival_at_Pickup_-_Time                7068 non-null object
Pickup_-_Day_of_Month                   7068 n

(7068, 25)

> **Observations**

- The testing set dataframe contains 25 columns and 7068 rows.
- Just like the Training set dataframe, the Precipitation in mm and Temperature columns are missing data.
    - The missing data will be dealt with in the same way as in the Training set dataframe.
- The differences between the Training set and Testing set dataframes are:
    - The dependent variable (DV) is not in the Testing set dataframe.
    - The Training set dataframe has 3 extra columns: Arrival_at_Destination_-_Day_of_Month, Arrival_at_Destination_-_Weekday_(Mo_=_1) and Arrival_at_Destination_-_Time. These columns will most likely be dropped when further analysis is done.
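
The column differences listed above can also be found programmatically. The sketch below assumes a hypothetical helper `column_difference` and two empty toy frames that carry only a few of the real column names.

```python
import pandas as pd

def column_difference(train_df, test_df):
    """Return the columns found in only one of the two frames, in original order."""
    only_train = [c for c in train_df.columns if c not in test_df.columns]
    only_test = [c for c in test_df.columns if c not in train_df.columns]
    return only_train, only_test

# Hypothetical toy frames with a subset of the real column names
toy_train = pd.DataFrame(columns=[
    "Order_No", "Pickup_-_Time",
    "Arrival_at_Destination_-_Time", "Time_from_Pickup_to_Arrival",
])
toy_test = pd.DataFrame(columns=["Order_No", "Pickup_-_Time"])

only_train, only_test = column_difference(toy_train, toy_test)
print(only_train)
print(only_test)
```

On the real data, `column_difference(train, test)` would be expected to return the three Arrival_at_Destination columns plus the dependent variable.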

##### **Sample submissions dataframe**

In [72]:
sample_submission= pd.read_csv('https://raw.githubusercontent.com/thembeks/Regression-Sendy-Logistics-Challenge-Team-14/Predict/SampleSubmission.csv')
sample_submission.columns= [col.replace(' ', '_') for col in sample_submission.columns]
sample_submission.head()

Unnamed: 0,Order_No,Time_from_Pickup_to_Arrival
0,Order_No_19248,567.0
1,Order_No_12736,4903.0
2,Order_No_768,5649.0
3,Order_No_15332,
4,Order_No_21373,


In [73]:
sample_submission.info()
sample_submission.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7068 entries, 0 to 7067
Data columns (total 2 columns):
Order_No                       7068 non-null object
Time_from_Pickup_to_Arrival    3 non-null float64
dtypes: float64(1), object(1)
memory usage: 110.6+ KB


(7068, 2)

### Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an approach to data analysis that uses many tools (mainly graphical) to maximize insight into a data set, extract important variables, and detect outliers and anomalies, amongst other things.

##### **Training set dataframe**

In [74]:
prof_train = ProfileReport(train, check_correlation = True)
prof_train.to_file(outputfile='output.html')

HtmlFile = open('output.html', 'r', encoding='utf-8')
source_code = HtmlFile.read()

IFrame(src='./output.html', width=900, height=800)



> **Observations**

Enter observations

##### **Rider dataframe**

In [75]:
prof_rider = ProfileReport(rider, check_correlation = True)
prof_rider.to_file(outputfile='output.html')

HtmlFile = open('output.html', 'r', encoding='utf-8')
source_code = HtmlFile.read()

IFrame(src='./output.html', width=900, height=800)



>**Observations**

Enter observations

##### **Visuals and Insights**

### Transformation and Processing of data

##### **Splitting the data**
Separating the Training dataset into the independent variables (x) and the dependent variable (y).

In [76]:
x= train.iloc[:, :-1]
y= train.iloc[:, -1]

##### **Alignment of datasets**
Aligning x with the Testing dataset by dropping the columns that the Testing dataset does not contain.

In [77]:
x= x.drop(['Arrival_at_Destination_-_Day_of_Month', 'Arrival_at_Destination_-_Weekday_(Mo_=_1)', 
           'Arrival_at_Destination_-_Time'], axis= 1)

##### **Missing data**

In [78]:
x= x.replace(np.nan, 0)
x.head()

Unnamed: 0,Order_No,User_Id,Vehicle_Type,Platform_Type,Personal_or_Business,Placement_-_Day_of_Month,Placement_-_Weekday_(Mo_=_1),Placement_-_Time,Confirmation_-_Day_of_Month,Confirmation_-_Weekday_(Mo_=_1),...,Pickup_-_Weekday_(Mo_=_1),Pickup_-_Time,Distance_(KM),Temperature,Precipitation_in_millimeters,Pickup_Lat,Pickup_Long,Destination_Lat,Destination_Long,Rider_Id
0,Order_No_4211,User_Id_633,Bike,3,Business,9,5,9:35:46 AM,9,5,...,5,10:27:30 AM,4,20.4,0.0,-1.317755,36.83037,-1.300406,36.829741,Rider_Id_432
1,Order_No_25375,User_Id_2285,Bike,3,Personal,12,5,11:16:16 AM,12,5,...,5,11:44:09 AM,16,26.4,0.0,-1.351453,36.899315,-1.295004,36.814358,Rider_Id_856
2,Order_No_1899,User_Id_265,Bike,3,Business,30,2,12:39:25 PM,30,2,...,2,12:53:03 PM,3,0.0,0.0,-1.308284,36.843419,-1.300921,36.828195,Rider_Id_155
3,Order_No_9336,User_Id_1402,Bike,3,Business,15,5,9:25:34 AM,15,5,...,5,9:43:06 AM,9,19.2,0.0,-1.281301,36.832396,-1.257147,36.795063,Rider_Id_855
4,Order_No_27883,User_Id_1737,Bike,1,Personal,13,1,9:55:18 AM,13,1,...,1,10:05:23 AM,9,15.4,0.0,-1.266597,36.792118,-1.295041,36.809817,Rider_Id_770


In [79]:
test= test.replace(np.nan, 0)
test.head()

Unnamed: 0,Order_No,User_Id,Vehicle_Type,Platform_Type,Personal_or_Business,Placement_-_Day_of_Month,Placement_-_Weekday_(Mo_=_1),Placement_-_Time,Confirmation_-_Day_of_Month,Confirmation_-_Weekday_(Mo_=_1),...,Pickup_-_Weekday_(Mo_=_1),Pickup_-_Time,Distance_(KM),Temperature,Precipitation_in_millimeters,Pickup_Lat,Pickup_Long,Destination_Lat,Destination_Long,Rider_Id
0,Order_No_19248,User_Id_3355,Bike,3,Business,27,3,4:44:10 PM,27,3,...,3,5:06:47 PM,8,0.0,0.0,-1.333275,36.870815,-1.305249,36.82239,Rider_Id_192
1,Order_No_12736,User_Id_3647,Bike,3,Business,17,5,12:57:35 PM,17,5,...,5,1:25:37 PM,5,0.0,0.0,-1.272639,36.794723,-1.277007,36.823907,Rider_Id_868
2,Order_No_768,User_Id_2154,Bike,3,Business,27,4,11:08:14 AM,27,4,...,4,11:57:54 AM,5,22.8,0.0,-1.290894,36.822971,-1.276574,36.851365,Rider_Id_26
3,Order_No_15332,User_Id_2910,Bike,3,Business,17,1,1:51:35 PM,17,1,...,1,2:16:52 PM,5,24.5,0.0,-1.290503,36.809646,-1.303382,36.790658,Rider_Id_685
4,Order_No_21373,User_Id_1205,Bike,3,Business,11,2,11:30:28 AM,11,2,...,2,11:56:04 AM,6,24.4,0.0,-1.281081,36.814423,-1.266467,36.792161,Rider_Id_858
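
Replacing every missing value with 0 means that a missing Temperature is treated as a reading of 0.0, as the tables above show. This is acceptable for a minimal baseline. One possible alternative for later notebooks is median imputation with a missing-value flag. The helper `impute_with_train_median` and the toy frames below are hypothetical and are not what this notebook does.

```python
import numpy as np
import pandas as pd

def impute_with_train_median(train_df, test_df, column):
    """Fill gaps in `column` with the training median and add a 0/1 missing flag."""
    train_out, test_out = train_df.copy(), test_df.copy()
    # The median is computed on the training rows only, then reused for test
    median = train_out[column].median()
    for frame in (train_out, test_out):
        frame[column + "_missing"] = frame[column].isna().astype(int)
        frame[column] = frame[column].fillna(median)
    return train_out, test_out

# Hypothetical toy frames standing in for `x` and `test`
toy_train = pd.DataFrame({"Temperature": [20.4, 26.4, np.nan, 19.2, 15.4]})
toy_test = pd.DataFrame({"Temperature": [np.nan, 22.8]})

filled_train, filled_test = impute_with_train_median(toy_train, toy_test, "Temperature")
print(filled_train)
print(filled_test)
```

Taking the median from the training rows only keeps information from the test set out of the preprocessing step.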


##### **Transformation of data**

###### Converting time strings into seconds

In [80]:
#Converting time strings to seconds for the x dataframe

x['Placement_-_Time']= pd.to_timedelta(x['Placement_-_Time']).dt.total_seconds()
x['Confirmation_-_Time']= pd.to_timedelta(x['Confirmation_-_Time']).dt.total_seconds()
x['Arrival_at_Pickup_-_Time']= pd.to_timedelta(x['Arrival_at_Pickup_-_Time']).dt.total_seconds()
x['Pickup_-_Time']= pd.to_timedelta(x['Pickup_-_Time']).dt.total_seconds()

In [81]:
#Converting time strings to seconds for the test dataframe

test['Placement_-_Time']= pd.to_timedelta(test['Placement_-_Time']).dt.total_seconds()
test['Confirmation_-_Time']= pd.to_timedelta(test['Confirmation_-_Time']).dt.total_seconds()
test['Arrival_at_Pickup_-_Time']= pd.to_timedelta(test['Arrival_at_Pickup_-_Time']).dt.total_seconds()
test['Pickup_-_Time']= pd.to_timedelta(test['Pickup_-_Time']).dt.total_seconds()
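
A point to check: in the encoded tables further down, the test order placed at 4:44:10 PM is stored as 17050.0 seconds, which equals 4 h 44 min 10 s. The AM/PM marker therefore appears to be ignored by this conversion. The sketch below shows one way to parse the 12-hour format explicitly. The helper name `clock_to_seconds` is an assumption, and using it would change the model results reported in this notebook.

```python
import pandas as pd

def clock_to_seconds(series):
    """Convert 12-hour clock strings such as '4:44:10 PM' to seconds since midnight."""
    # %I is the 12-hour clock hour and %p is the AM/PM marker
    parsed = pd.to_datetime(series, format="%I:%M:%S %p")
    return parsed.dt.hour * 3600 + parsed.dt.minute * 60 + parsed.dt.second

# Sample values taken from the tables in this notebook, plus one midnight case
times = pd.Series(["9:35:46 AM", "4:44:10 PM", "12:39:25 PM", "12:05:00 AM"])
seconds = clock_to_seconds(times)
print(seconds.tolist())
```

With this parsing, afternoon orders sort after morning orders, which the current values do not guarantee.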

###### **Dropping columns** 

In [82]:
#Column drop for data not useful in the x dataframe

x= x.drop(['Order_No', 'User_Id', 'Rider_Id' ], axis= 1)

In [83]:
x.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21201 entries, 0 to 21200
Data columns (total 22 columns):
Vehicle_Type                            21201 non-null object
Platform_Type                           21201 non-null int64
Personal_or_Business                    21201 non-null object
Placement_-_Day_of_Month                21201 non-null int64
Placement_-_Weekday_(Mo_=_1)            21201 non-null int64
Placement_-_Time                        21201 non-null float64
Confirmation_-_Day_of_Month             21201 non-null int64
Confirmation_-_Weekday_(Mo_=_1)         21201 non-null int64
Confirmation_-_Time                     21201 non-null float64
Arrival_at_Pickup_-_Day_of_Month        21201 non-null int64
Arrival_at_Pickup_-_Weekday_(Mo_=_1)    21201 non-null int64
Arrival_at_Pickup_-_Time                21201 non-null float64
Pickup_-_Day_of_Month                   21201 non-null int64
Pickup_-_Weekday_(Mo_=_1)               21201 non-null int64
Pickup_-_Time                

In [84]:
#Column drop for data not useful in the test dataframe

test= test.drop(['Order_No', 'User_Id', 'Rider_Id' ], axis= 1)

In [85]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7068 entries, 0 to 7067
Data columns (total 22 columns):
Vehicle_Type                            7068 non-null object
Platform_Type                           7068 non-null int64
Personal_or_Business                    7068 non-null object
Placement_-_Day_of_Month                7068 non-null int64
Placement_-_Weekday_(Mo_=_1)            7068 non-null int64
Placement_-_Time                        7068 non-null float64
Confirmation_-_Day_of_Month             7068 non-null int64
Confirmation_-_Weekday_(Mo_=_1)         7068 non-null int64
Confirmation_-_Time                     7068 non-null float64
Arrival_at_Pickup_-_Day_of_Month        7068 non-null int64
Arrival_at_Pickup_-_Weekday_(Mo_=_1)    7068 non-null int64
Arrival_at_Pickup_-_Time                7068 non-null float64
Pickup_-_Day_of_Month                   7068 non-null int64
Pickup_-_Weekday_(Mo_=_1)               7068 non-null int64
Pickup_-_Time                           7068 

##### **Encoding categorical data**

In [86]:
#Encoding categorical data using pd.get_dummies for the x dataframe

x=  pd.get_dummies(x, columns=['Vehicle_Type', 'Platform_Type', 'Personal_or_Business'], 
                   drop_first= True)
x.head()

Unnamed: 0,Placement_-_Day_of_Month,Placement_-_Weekday_(Mo_=_1),Placement_-_Time,Confirmation_-_Day_of_Month,Confirmation_-_Weekday_(Mo_=_1),Confirmation_-_Time,Arrival_at_Pickup_-_Day_of_Month,Arrival_at_Pickup_-_Weekday_(Mo_=_1),Arrival_at_Pickup_-_Time,Pickup_-_Day_of_Month,...,Temperature,Precipitation_in_millimeters,Pickup_Lat,Pickup_Long,Destination_Lat,Destination_Long,Platform_Type_2,Platform_Type_3,Platform_Type_4,Personal_or_Business_Personal
0,9,5,34546.0,9,5,34810.0,9,5,36287.0,9,...,20.4,0.0,-1.317755,36.83037,-1.300406,36.829741,0,1,0,0
1,12,5,40576.0,12,5,41001.0,12,5,42022.0,12,...,26.4,0.0,-1.351453,36.899315,-1.295004,36.814358,0,1,0,1
2,30,2,45565.0,30,2,45764.0,30,2,46174.0,30,...,0.0,0.0,-1.308284,36.843419,-1.300921,36.828195,0,1,0,0
3,15,5,33934.0,15,5,33965.0,15,5,34676.0,15,...,19.2,0.0,-1.281301,36.832396,-1.257147,36.795063,0,1,0,0
4,13,1,35718.0,13,1,35778.0,13,1,36233.0,13,...,15.4,0.0,-1.266597,36.792118,-1.295041,36.809817,0,0,0,1


In [87]:
#Encoding categorical data using pd.get_dummies for the test dataframe

test=  pd.get_dummies(test, columns=['Vehicle_Type', 'Platform_Type', 'Personal_or_Business'], 
                      drop_first= True)
test.head()

Unnamed: 0,Placement_-_Day_of_Month,Placement_-_Weekday_(Mo_=_1),Placement_-_Time,Confirmation_-_Day_of_Month,Confirmation_-_Weekday_(Mo_=_1),Confirmation_-_Time,Arrival_at_Pickup_-_Day_of_Month,Arrival_at_Pickup_-_Weekday_(Mo_=_1),Arrival_at_Pickup_-_Time,Pickup_-_Day_of_Month,...,Temperature,Precipitation_in_millimeters,Pickup_Lat,Pickup_Long,Destination_Lat,Destination_Long,Platform_Type_2,Platform_Type_3,Platform_Type_4,Personal_or_Business_Personal
0,27,3,17050.0,27,3,17069.0,27,3,17584.0,27,...,0.0,0.0,-1.333275,36.870815,-1.305249,36.82239,0,1,0,0
1,17,5,46655.0,17,5,46757.0,17,5,4827.0,17,...,0.0,0.0,-1.272639,36.794723,-1.277007,36.823907,0,1,0,0
2,27,4,40094.0,27,4,41105.0,27,4,41600.0,27,...,22.8,0.0,-1.290894,36.822971,-1.276574,36.851365,0,1,0,0
3,17,1,6695.0,17,1,6807.0,17,1,7361.0,17,...,24.5,0.0,-1.290503,36.809646,-1.303382,36.790658,0,1,0,0
4,11,2,41428.0,11,2,41685.0,11,2,42439.0,11,...,24.4,0.0,-1.281081,36.814423,-1.266467,36.792161,0,1,0,0
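
Encoding the two dataframes separately only works if both contain the same categories. Here they happen to produce identical columns, but that is not guaranteed in general. A minimal sketch of one way to guard against a mismatch follows; the helper `encode_aligned` and the toy frames are hypothetical.

```python
import pandas as pd

def encode_aligned(train_df, test_df, columns):
    """One-hot encode both frames and force test onto the training columns."""
    train_enc = pd.get_dummies(train_df, columns=columns, drop_first=True, dtype=int)
    # Encode every level in test, then keep exactly the training columns:
    # the baseline level is dropped and unseen levels become all-zero columns
    test_enc = pd.get_dummies(test_df, columns=columns, dtype=int)
    test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
    return train_enc, test_enc

# Hypothetical toy frames: test never sees platform types 2 and 4
toy_train = pd.DataFrame({"Platform_Type": [1, 2, 3, 4], "Distance_(KM)": [4, 16, 3, 9]})
toy_test = pd.DataFrame({"Platform_Type": [1, 3, 3], "Distance_(KM)": [8, 5, 5]})

train_enc, test_enc = encode_aligned(toy_train, toy_test, ["Platform_Type"])
print(train_enc.columns.tolist())
print(test_enc)
```

The model can then be fitted on `train_enc` and applied to `test_enc` without a column mismatch.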


##### **Splitting the dataset**

In [89]:
# Using just the training dataset to test model accuracy 

x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state= 1) 

##### **Feature scaling**

# Multiple Linear Regression

##### **Training the model**

In [91]:
multi_reg= LinearRegression()

In [92]:
multi_reg.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

##### **Predicting the test set results**

In [93]:
y_predict= multi_reg.predict(x_test)

In [98]:
display(y_predict)

array([1312.15014999, 3272.450594  , 1356.8443304 , ..., 1046.81838199,
       1060.02741748, 1081.22697426])

##### **Test results accuracy**

In [99]:
def rmse(y_test, y_predict):
    return np.sqrt(mean_squared_error(y_test, y_predict))

In [100]:
rmse(y_test, y_predict)

791.7732524803138
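
RMSE is expressed in the same units as the target, which appears to be seconds: the first training order was picked up at 10:27:30 AM, arrived at 10:39:55 AM, and has a target of 745. To judge whether a score is good, it helps to compare it with a naive baseline that always predicts the mean. The sketch below writes the metric out with NumPy only. The name `rmse_manual` is an assumption, and the five targets are the ones shown in `train.head()`.

```python
import numpy as np

def rmse_manual(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# The five targets shown in train.head(), used here purely as an illustration
y_true = np.array([745, 1993, 455, 1341, 1214], dtype=float)

# Naive baseline: predict the mean delivery time for every order
baseline_pred = np.full_like(y_true, y_true.mean())
baseline_rmse = rmse_manual(y_true, baseline_pred)
print(baseline_rmse)
```

A model is only useful if its RMSE on held-out data is clearly below the baseline computed on the same rows.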

# XGBoost Regression

##### **Training the model**

In [105]:
xg_reg = xgb.XGBRegressor()

In [106]:
xg_reg.fit(x_train,y_train)





##### **Predicting the test result**

In [107]:
pred_xgb = xg_reg.predict(x_test)

In [108]:
display(pred_xgb)

array([1074.8295, 3033.379 , 1562.9337, ..., 1421.3206, 1118.4666,
       1339.4319], dtype=float32)

##### **Test result accuracy**

In [110]:
rmse(y_test, pred_xgb)

768.5642701414943