# Demand prediction

In [2]:
from IPython.display import HTML
from IPython.display import Image
Image(url = "https://media.credencys.com/wp-content/uploads/2017/10/Taxi-App.jpg")

## Introduction

Yellow taxis (medallion taxis) are able to pick up passengers anywhere in the five boroughs. Taxicab vehicles, each of which must have a medallion to operate, are driven an average of 180 miles per shift. As of March 14, 2014, there were 51,398 individuals licensed to drive medallion taxicabs. There were 13,605 taxicab medallion licenses in existence. By July 2015, that number had dropped slightly to 13,587 medallions, or 18 lower than the 2014 total. Taxi patronage has declined since 2011 due to competition from rideshare services.

### Objective

The main objective is to predict, as accurately as possible, the number of pickups in each region for a given time interval. To do so, New York City is divided into regions.

## And here it starts...

#### Importing the libraries

In [1]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
import seaborn as sns
import statsmodels.api as sm
from sklearn import linear_model
from statsmodels.formula.api import ols
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_squared_log_error
from sklearn.preprocessing import scale
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn import preprocessing
from pylab import rcParams
from scipy.stats import spearmanr
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import accuracy_score
import itertools
import plotly.offline as py  # visualization
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import plotly.figure_factory as ff

#### Loading the dataset

In [2]:
df = pd.read_csv("Mergedd.csv")
df.head()

Unnamed: 0,Zone,PickupTime,service_zone,CabRequest,total_amount,tip_amount
0,Allerton/Pelham Gardens,2018-04-01 11,Boro Zone,1,41.8,0.0
1,Allerton/Pelham Gardens,2018-04-01 2,Boro Zone,1,23.8,0.0
2,Allerton/Pelham Gardens,2018-04-01 8,Boro Zone,1,44.8,0.0
3,Allerton/Pelham Gardens,2018-04-01 9,Boro Zone,2,69.6,0.0
4,Allerton/Pelham Gardens,2018-04-03 12,Boro Zone,1,11.3,0.0


#### Checking for the null values

In [3]:
df.isnull().sum()

Zone            0
PickupTime      0
service_zone    0
CabRequest      0
total_amount    0
tip_amount      0
dtype: int64

No column contains null values, so no imputation is needed.

#### Generating descriptive statistics of the dataset distribution

In [4]:
df.describe()

Unnamed: 0,CabRequest,total_amount,tip_amount
count,1048564.0,1048564.0,1048564.0
mean,89.54993,1461.547,167.0406
std,161.8861,3402.06,371.8295
min,1.0,0.0,0.0
25%,2.0,43.1,1.0
50%,10.0,178.98,15.97
75%,101.0,1531.173,169.63
max,2677.0,907106.1,5991.91


#### Splitting the PickupTime column into two new columns, i.e. Date and hour
#### Retrieving the first five rows of the dataset

In [5]:
df['Date']=df.PickupTime.str.split(' ').str[0].str.strip()
df['hour']=df.PickupTime.str.split(' ').str[1].str.strip()
df.head()

Unnamed: 0,Zone,PickupTime,service_zone,CabRequest,total_amount,tip_amount,Date,hour
0,Allerton/Pelham Gardens,2018-04-01 11,Boro Zone,1,41.8,0.0,2018-04-01,11
1,Allerton/Pelham Gardens,2018-04-01 2,Boro Zone,1,23.8,0.0,2018-04-01,2
2,Allerton/Pelham Gardens,2018-04-01 8,Boro Zone,1,44.8,0.0,2018-04-01,8
3,Allerton/Pelham Gardens,2018-04-01 9,Boro Zone,2,69.6,0.0,2018-04-01,9
4,Allerton/Pelham Gardens,2018-04-03 12,Boro Zone,1,11.3,0.0,2018-04-03,12
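The split above leaves hour as a string. The following is a minimal sketch, on a toy frame with the same "YYYY-MM-DD H" layout, of how the hour could instead be kept as an integer and combined with the date into a full timestamp. The names sample, parts and pickup_ts are illustrative assumptions and do not appear in the notebook.

```python
import pandas as pd

# Toy rows in the same "YYYY-MM-DD H" layout as the PickupTime column.
sample = pd.DataFrame({"PickupTime": ["2018-04-01 11", "2018-04-01 2", "2018-04-03 12"]})

# Split once on the space: column 0 is the date text, column 1 the hour text.
parts = sample["PickupTime"].str.split(" ", n=1, expand=True)

sample["Date"] = pd.to_datetime(parts[0], format="%Y-%m-%d")
sample["hour"] = parts[1].astype(int)  # integer hour, so 2 sorts before 11

# Date plus an hour offset gives one sortable timestamp per row.
sample["pickup_ts"] = sample["Date"] + pd.to_timedelta(sample["hour"], unit="h")
print(sample)
```

An integer hour and a real timestamp sort chronologically, which a string such as "2" versus "11" does not.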


#### Converting column Date from string to datetime format
#### Creating a new column Year from the Date column

In [6]:
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year

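#### Keeping only the rows from the year 2018

In [7]:
# .copy() gives an independent frame, so the column assignments below do not raise SettingWithCopyWarning
df_Output = df[df['Year'] == 2018].copy()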

#### Creating new columns such as week_day, month, month_name and month_year using the existing columns in the dataset

In [8]:
df_Output["week_day"]   = df_Output["Date"].dt.day_name()  # dt.weekday_name was removed in pandas 1.0
df_Output["month"] = pd.DatetimeIndex(df_Output["Date"]).month
df_Output["month_name"] = df_Output["month"].map({1:"JAN",2:"FEB",3:"MAR",
                                                4:"APR",5:"MAY",6:"JUN",
                                                7:"JUL",8:"AUG",9:"SEP",
                                                10:"OCT",11:"NOV",12:"DEC"
                                               })
df_Output["month_year"] = df_Output["Year"].astype(str) + " - " + df_Output["month_name"]

#### A map function is used above to map each month number to its three-letter abbreviation

In [9]:
df_Output.head()

Unnamed: 0,Zone,PickupTime,service_zone,CabRequest,total_amount,tip_amount,Date,hour,Year,week_day,month,month_name,month_year
0,Allerton/Pelham Gardens,2018-04-01 11,Boro Zone,1,41.8,0.0,2018-04-01,11,2018,Sunday,4,APR,2018 - APR
1,Allerton/Pelham Gardens,2018-04-01 2,Boro Zone,1,23.8,0.0,2018-04-01,2,2018,Sunday,4,APR,2018 - APR
2,Allerton/Pelham Gardens,2018-04-01 8,Boro Zone,1,44.8,0.0,2018-04-01,8,2018,Sunday,4,APR,2018 - APR
3,Allerton/Pelham Gardens,2018-04-01 9,Boro Zone,2,69.6,0.0,2018-04-01,9,2018,Sunday,4,APR,2018 - APR
4,Allerton/Pelham Gardens,2018-04-03 12,Boro Zone,1,11.3,0.0,2018-04-03,12,2018,Tuesday,4,APR,2018 - APR


#### Creating a list of public holiday dates in the year 2018

In [10]:
# Note: '2018-20-08' is not a valid YYYY-MM-DD date (month 20), so it never matches any row
datearr = ['2018-01-01','2018-01-15','2018-02-19','2018-04-16','2018-05-13','2018-05-28','2018-06-17','2018-04-04','2018-05-07','2018-20-08','2018-11-11','2018-11-12','2018-11-22','2018-11-23','2018-12-05','2018-12-24','2018-12-25']

#### Creating a list of dates with extreme weather conditions in the year 2018

In [11]:
datearr1 = ['2018-01-04','2018-01-12','2018-01-13','2018-01-17','2018-01-30','2018-02-10','2018-02-12','2018-02-11','2018-02-18','2018-02-21','2018-02-24','2018-02-25','2018-03-02','2018-03-05','2018-02-17','2018-03-18','2018-02-02','2018-05-15','2018-09-25','2018-11-15','2018-04-01']

#### Assigning the binary values to the public holidays corresponding to their respective dates

In [12]:
df_Output['is_public_holidays'] = ['yes' if x.strftime('%Y-%m-%d') in datearr else 'no' for x in df_Output['Date']]


#### Assigning the binary values to the extreme weather condition corresponding to their respective dates

In [13]:
df_Output['is_extreme_cliamte'] = ['yes' if y.strftime('%Y-%m-%d') in datearr1 else 'no' for y in df_Output['Date']]

#### Checking the dataset with all the added columns and values

In [14]:
df_Output.head()

Unnamed: 0,Zone,PickupTime,service_zone,CabRequest,total_amount,tip_amount,Date,hour,Year,week_day,month,month_name,month_year,is_public_holidays,is_extreme_cliamte
0,Allerton/Pelham Gardens,2018-04-01 11,Boro Zone,1,41.8,0.0,2018-04-01,11,2018,Sunday,4,APR,2018 - APR,no,yes
1,Allerton/Pelham Gardens,2018-04-01 2,Boro Zone,1,23.8,0.0,2018-04-01,2,2018,Sunday,4,APR,2018 - APR,no,yes
2,Allerton/Pelham Gardens,2018-04-01 8,Boro Zone,1,44.8,0.0,2018-04-01,8,2018,Sunday,4,APR,2018 - APR,no,yes
3,Allerton/Pelham Gardens,2018-04-01 9,Boro Zone,2,69.6,0.0,2018-04-01,9,2018,Sunday,4,APR,2018 - APR,no,yes
4,Allerton/Pelham Gardens,2018-04-03 12,Boro Zone,1,11.3,0.0,2018-04-03,12,2018,Tuesday,4,APR,2018 - APR,no,no


The output shows that each row is flagged yes or no according to whether its date appears in the public holiday list or in the extreme weather list.
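The flagging step can also be written without a Python-level loop. The sketch below is one possible variant, not the notebook's own method: it first drops entries that are not valid calendar dates and then uses a vectorised isin. The helper is_valid_iso_date, the short candidate_dates list and the demo frame are hypothetical and exist only for illustration.

```python
from datetime import datetime
import pandas as pd

def is_valid_iso_date(text):
    """Return True if text parses as a YYYY-MM-DD calendar date."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False

# Hypothetical short list; the middle entry has month 20 and is not a valid date.
candidate_dates = ["2018-01-01", "2018-20-08", "2018-12-25"]
holiday_dates = [d for d in candidate_dates if is_valid_iso_date(d)]

demo = pd.DataFrame({"Date": pd.to_datetime(["2018-01-01", "2018-04-03", "2018-12-25"])})

# Compare on the formatted string, as the notebook does, but in one vectorised pass.
demo["is_public_holidays"] = (
    demo["Date"].dt.strftime("%Y-%m-%d").isin(holiday_dates).map({True: "yes", False: "no"})
)
print(demo)
```

Validating the list up front makes a malformed entry visible instead of letting it silently match nothing.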

#### Checking the unique values of column "is_extreme_cliamte"

In [15]:
df_Output['is_extreme_cliamte'].unique()

array(['yes', 'no'], dtype=object)

The column "is_extreme_cliamte" contains only the two values yes and no, so it is a binary column.

#### Creating a new dataframe with only the columns needed for modelling

In [23]:
df_demand = pd.DataFrame({"CabRequest":df_Output['CabRequest'], "Zone":df_Output['Zone'],"Date":df_Output['Date'],"week_day":df_Output['week_day'],"month":df_Output['month'],"hour":df_Output['hour'],"is_public_holidays":df_Output['is_public_holidays'],"is_extreme_cliamte":df_Output['is_extreme_cliamte']})
df_demand.head()

Unnamed: 0,CabRequest,Zone,Date,week_day,month,hour,is_public_holidays,is_extreme_cliamte
0,1,Allerton/Pelham Gardens,2018-04-01,Sunday,4,11,no,yes
1,1,Allerton/Pelham Gardens,2018-04-01,Sunday,4,2,no,yes
2,1,Allerton/Pelham Gardens,2018-04-01,Sunday,4,8,no,yes
3,2,Allerton/Pelham Gardens,2018-04-01,Sunday,4,9,no,yes
4,1,Allerton/Pelham Gardens,2018-04-03,Tuesday,4,12,no,no


We can see that only the required columns are considered for the further analysis

#### Creating a new column "just_date" from "date" column

In [24]:
df_demand['just_date'] = df_demand['Date'].dt.date

#### Creating the column "is_Date" by joining "just_date" and "hour" into a single date-hour string

In [25]:
df_demand['is_Date']= df_demand['just_date'].map(str) +" " +df_demand['hour'].map(str)

#### Dropping the redundant columns( Date and hour) from the dataset

In [26]:
df_demand=df_demand.drop(['Date','hour'] ,axis='columns')

#### Checking the unique values present in the column week_day

In [27]:
df_demand.week_day.unique()

array(['Sunday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Monday',
       'Saturday'], dtype=object)

In [28]:
df_demand.head()

Unnamed: 0,CabRequest,Zone,week_day,month,is_public_holidays,is_extreme_cliamte,just_date,is_Date
0,1,Allerton/Pelham Gardens,Sunday,4,no,yes,2018-04-01,2018-04-01 11
1,1,Allerton/Pelham Gardens,Sunday,4,no,yes,2018-04-01,2018-04-01 2
2,1,Allerton/Pelham Gardens,Sunday,4,no,yes,2018-04-01,2018-04-01 8
3,2,Allerton/Pelham Gardens,Sunday,4,no,yes,2018-04-01,2018-04-01 9
4,1,Allerton/Pelham Gardens,Tuesday,4,no,no,2018-04-03,2018-04-03 12


#### Using label encoder to transform binary and multi-categorical values into numerical values

In [29]:
# Import label encoder 
from sklearn import preprocessing 
  
# LabelEncoder sorts the distinct labels and assigns each one an integer code.
label_encoder = preprocessing.LabelEncoder() 
df_demand['Zone']= label_encoder.fit_transform(df_demand['Zone']) 
df_demand['is_public_holidays']= label_encoder.fit_transform(df_demand['is_public_holidays']) 
df_demand['is_extreme_cliamte']= label_encoder.fit_transform(df_demand['is_extreme_cliamte']) 
df_demand['week_day']= label_encoder.fit_transform(df_demand['week_day']) 


In [30]:
df_demand.head()

Unnamed: 0,CabRequest,Zone,week_day,month,is_public_holidays,is_extreme_cliamte,just_date,is_Date
0,1,0,3,4,0,1,2018-04-01,2018-04-01 11
1,1,0,3,4,0,1,2018-04-01,2018-04-01 2
2,1,0,3,4,0,1,2018-04-01,2018-04-01 8
3,2,0,3,4,0,1,2018-04-01,2018-04-01 9
4,1,0,5,4,0,0,2018-04-03,2018-04-03 12
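In the output above, Sunday is encoded as 3 and Tuesday as 5. This follows from LabelEncoder sorting the labels alphabetically, so the codes carry no calendar order. The sketch below shows the resulting mapping and, as one possible alternative rather than the notebook's approach, the calendar-ordered codes that pandas provides through dt.dayofweek. The variable names are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

days = pd.Series(["Sunday", "Tuesday", "Wednesday", "Thursday", "Friday", "Monday", "Saturday"])

encoder = LabelEncoder()
encoder.fit(days)

# classes_ holds the labels in sorted order; the position is the assigned code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# Alternative: calendar-ordered codes taken from the dates (Monday=0 ... Sunday=6).
dates = pd.to_datetime(pd.Series(["2018-04-01", "2018-04-03"]))
calendar_codes = dates.dt.dayofweek
print(calendar_codes.tolist())
```

A linear model treats these codes as a numeric scale, so the choice of ordering affects what the week_day coefficient can express.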


#### Sorting the values of the column 'is_Date'

In [31]:
df_demand = df_demand.sort_values(by = 'is_Date')

#### Converting is_date to numerical values

In [32]:
df_demand['is_Date']= label_encoder.fit_transform(df_demand['is_Date']) 
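Because is_Date is a string such as "2018-04-01 2", both sort_values and LabelEncoder order it lexicographically, which places hour 11 before hour 2 within the same day. The sketch below illustrates the effect on three toy keys and shows one possible remedy, zero-padding the hour before encoding. This is an assumption about how the ordering could be made chronological, not a step taken in the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

keys = pd.Series(["2018-04-01 2", "2018-04-01 11", "2018-04-01 9"])

# Raw strings: "…11" sorts before "…2", which sorts before "…9".
codes_raw = LabelEncoder().fit_transform(keys)

# Zero-pad the hour so that string order equals chronological order.
date_part = keys.str.split(" ").str[0]
hour_part = keys.str.split(" ").str[1].str.zfill(2)
padded = date_part + " " + hour_part
codes_padded = LabelEncoder().fit_transform(padded)

print(list(codes_raw))     # hour 11 gets the smallest code
print(list(codes_padded))  # hour 2 gets the smallest code
```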

#### Checking for the number of rows and columns in the dataset df_demand

In [33]:
df_demand.shape

(1047611, 8)

#### Checking for the unique values present in column is_date after converting it into numerical values

In [34]:
df_demand.is_Date.unique()

array([   0,    1,    2, ..., 8757, 8758, 8759], dtype=int64)

#### Checking the unique values present in column zone after converting it into numerical values

In [35]:
df_demand.Zone.unique()

array([252, 235,  85, 178, 229, 124, 193, 182, 137,  45,  39, 156, 221,
        14, 125, 136,  96, 250, 154,  86, 128,  60, 157, 175,  65, 123,
        19, 256,  18, 100, 158, 141, 198,  44, 233,  40, 251, 177, 225,
        42, 242,  49,  15, 211, 239, 243, 253,  62,   4, 130,  47,  64,
        37, 186, 147, 224, 132, 258,  79,  70, 133,  88,  89, 189, 148,
       185, 129, 191,  78,  46,  71, 257, 135, 227, 143,  93, 209, 153,
       207, 208, 184, 222,  91,  90, 234, 228,  28, 121,  34, 146,  21,
       161, 163, 169,  74, 119, 139,  76,  66,  38,  10, 231, 166,  52,
        53, 219, 140,  57, 113, 107, 226,  75, 138, 215, 232, 230, 245,
       106,  33, 108, 259, 110, 220,  22,   1, 165, 164,  84, 216,  83,
       101, 159,  30, 240,   0, 205, 162, 248, 238, 246, 131, 152,  31,
       160,  48,  11,  72, 170, 212,  58, 213, 120,   6,  23,   7, 155,
       114, 181, 117,   9, 196,  81,  17, 142, 192, 194,  87, 237,  25,
       168,  98, 122, 188, 105, 134,  29, 214, 174, 102, 249, 19

#### Retrieving the rows that fall on neither a public holiday nor an extreme weather day

In [29]:
s1i1 = df_demand.loc[(df_demand['is_public_holidays'] == 0) & (df_demand['is_extreme_cliamte'] == 0)]
s1i1.head()

Unnamed: 0,CabRequest,Zone,week_day,is_public_holidays,is_extreme_cliamte,just_date,is_Date
420635,58,157,5,0,0,2018-01-02,24
450240,92,245,5,0,0,2018-01-02,24
405055,11,119,5,0,0,2018-01-02,24
381262,10,42,5,0,0,2018-01-02,24
398503,10,93,5,0,0,2018-01-02,24


#### Assigning the columns to be used for further predictions

In [36]:
cols_to_use = ['Zone','week_day','month','is_public_holidays','is_extreme_cliamte','is_Date']
X = df_demand[cols_to_use]
y = df_demand.CabRequest

From the above assignment, the response variable y is CabRequest and the independent variables are collected in X.

#### A simple ordinary least squares model.

In [37]:
model = sm.OLS(y, X)
results = model.fit()
# Statsmodels gives R-like statistical output
results.summary()
#Ordinary Least Squares regression (OLS) is more commonly named
#linear regression (simple or multiple depending on the number of explanatory variables)
#The OLS method corresponds to minimizing the sum of square differences between the observed and predicted values.

0,1,2,3
Dep. Variable:,CabRequest,R-squared:,0.253
Model:,OLS,Adj. R-squared:,0.253
Method:,Least Squares,F-statistic:,59130.0
Date:,"Thu, 25 Apr 2019",Prob (F-statistic):,0.0
Time:,10:58:15,Log-Likelihood:,-6803100.0
No. Observations:,1047611,AIC:,13610000.0
Df Residuals:,1047605,BIC:,13610000.0
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Zone,0.4465,0.002,247.603,0.000,0.443,0.450
week_day,3.4800,0.071,48.892,0.000,3.341,3.620
month,30.6671,0.458,66.999,0.000,29.770,31.564
is_public_holidays,-9.3933,0.738,-12.720,0.000,-10.841,-7.946
is_extreme_cliamte,8.3942,0.690,12.170,0.000,7.042,9.746
is_Date,-0.0420,0.001,-65.224,0.000,-0.043,-0.041

0,1,2,3
Omnibus:,514143.934,Durbin-Watson:,1.878
Prob(Omnibus):,0.0,Jarque-Bera (JB):,3017562.787
Skew:,2.362,Prob(JB):,0.0
Kurtosis:,9.843,Cond. No.,24100.0


#### Summary of OLS Regression : AIC, BIC and adjusted R-squared
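The AIC (Akaike information criterion) measures the relative quality of a statistical model by trading goodness of fit against the number of parameters; lower values are better. The BIC (Bayesian information criterion) serves the same model-selection purpose among parametric models with different numbers of parameters, but penalises additional parameters more heavily as the sample grows.

R2 shows how well the data points fit the regression line or curve. Adjusted R2 measures the same thing but adjusts for the number of terms in the model. If you add useless variables to a model, adjusted R-squared will decrease. If you add useful variables, adjusted R-squared will increase.

The OLS model has an R-squared of only 0.253, so it explains little of the variation in CabRequest and is not up to the mark. Note that sm.OLS does not add an intercept on its own and X contains no constant column, so this model is fitted through the origin and the reported R-squared is the uncentered one.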
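The summary above lists six coefficients and no constant term, since sm.OLS fits only the columns it is given. Without an intercept, R-squared is measured against zero (uncentered) rather than against the mean of the target, so it is not comparable to the usual R-squared. The sketch below illustrates the difference with NumPy on synthetic data; the data, the function fit_r2 and all variable names are assumptions made for illustration. In statsmodels, a constant column can be added with sm.add_constant(X).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 50.0 + 2.0 * x + rng.normal(0, 1, size=200)  # synthetic target with a large intercept

def fit_r2(design, target, centered):
    """Least-squares fit; R-squared against the mean (centered) or against zero."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    baseline = target - target.mean() if centered else target
    return 1.0 - (resid @ resid) / (baseline @ baseline)

no_const = x.reshape(-1, 1)                         # slope only, line through the origin
with_const = np.column_stack([np.ones_like(x), x])  # intercept column plus slope

r2_uncentered = fit_r2(no_const, y, centered=False)
r2_centered = fit_r2(with_const, y, centered=True)
print("no intercept, uncentered R2:", r2_uncentered)
print("with intercept, centered R2:", r2_centered)
```

The two numbers answer different questions, which is why an R-squared from a no-intercept fit should be read with care.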

In [38]:
df_demand.head()

Unnamed: 0,CabRequest,Zone,week_day,month,is_public_holidays,is_extreme_cliamte,just_date,is_Date
451913,32,252,1,1,1,0,2018-01-01,0
446598,571,235,1,1,1,0,2018-01-01,0
395844,2,85,1,1,1,0,2018-01-01,0
427945,1,178,1,1,1,0,2018-01-01,0
442663,172,229,1,1,1,0,2018-01-01,0


#### Splitting the data into train and test data
#### Assigning the training and testing features to train and test data

In [41]:
train, test = train_test_split(df_demand, test_size=0.3, shuffle=False)
training_features = ['Zone','week_day','month','is_public_holidays','is_extreme_cliamte','is_Date']
target = 'CabRequest'
train_X = train[training_features]
train_Y = train[target]
test_X = test[training_features]
test_Y = test[target]
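With shuffle=False, train_test_split keeps the row order, and since df_demand was sorted by is_Date beforehand, the first 70% of rows form the training set and the last 30% the test set. The sketch below demonstrates this behaviour on a toy frame; demo, train_demo and test_demo are illustrative names.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten rows already sorted by the time index, as df_demand is after sort_values.
demo = pd.DataFrame({"is_Date": range(10), "CabRequest": [3, 5, 2, 8, 7, 1, 4, 6, 9, 0]})

# shuffle=False slices the frame: the head goes to train, the tail to test.
train_demo, test_demo = train_test_split(demo, test_size=0.3, shuffle=False)

print(train_demo["is_Date"].tolist())
print(test_demo["is_Date"].tolist())
```

Every test row therefore comes later in the ordering than every training row, so the models are evaluated on periods they were not fitted on.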

## Linear Regression

Linear regression is a linear approach to modelling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression.

In [42]:
model = LinearRegression()
model.fit(train_X, train_Y)
train_pred_Y = model.predict(train_X)
test_pred_Y = model.predict(test_X)
train_pred_Y = pd.Series(train_pred_Y.clip(0, train_pred_Y.max()), index=train_Y.index)
test_pred_Y = pd.Series(test_pred_Y.clip(0, test_pred_Y.max()), index=test_Y.index)

# sklearn metrics take (y_true, y_pred); both metrics are symmetric, so the values are unchanged
rmse_train = np.sqrt(mean_squared_error(train_Y, train_pred_Y))
msle_train = mean_squared_log_error(train_Y, train_pred_Y)
rmse_test = np.sqrt(mean_squared_error(test_Y, test_pred_Y))
msle_test = mean_squared_log_error(test_Y, test_pred_Y)

print('rmse_train:',rmse_train,'msle_train:',msle_train)
print('rmse_test:',rmse_test,'msle_test:',msle_test)

rmse_train: 163.16381262382555 msle_train: 6.289202100000228
rmse_test: 150.39163246717686 msle_test: 5.795741623167363


### Calculating the root mean square error and mean squared log error for train and test datasets

#### Linear Regression is applied on the dataset and Root mean square error and Mean Squared log error are calculated as below:
Root-mean-square error (RMSE) (or sometimes root-mean-squared error) is a frequently used measure of the differences between values (sample or population values) predicted by a model or an estimator and the values observed. The RMSE represents the square root of the second sample moment of the differences between predicted values and observed values or the quadratic mean of these differences. These deviations are called residuals when the calculations are performed over the data sample that was used for estimation and are called errors (or prediction errors) when computed out-of-sample.

Mean Squared Error is a measure of how close a fitted line is to data points. It is the sum, over all the data points, of the square of the difference between the predicted and actual target variables, divided by the number of data points. RMSE is the square root of MSE.
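The second metric used in this notebook, mean squared log error (MSLE), is the mean of the squared differences between log(1 + actual) and log(1 + predicted), so it penalises relative rather than absolute errors. The sketch below computes both metrics by hand with NumPy on made-up values; the arrays are illustrative and not taken from the dataset.

```python
import numpy as np

y_true = np.array([1.0, 10.0, 100.0])   # made-up observed values
y_pred = np.array([2.0, 10.0, 50.0])    # made-up predictions

# RMSE: square root of the mean squared difference.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# MSLE: mean squared difference of log(1 + value); inputs must be non-negative.
msle = np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)

print("rmse:", rmse)
print("msle:", msle)
```

Since the logarithm is undefined for values below -1 and scikit-learn rejects negative inputs to this metric, the notebook clips the predictions at 0 before scoring.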

For model 1, the test RMSE is 150.39163246717686 and the test MSLE is 5.795741623167363.

## Model 2

#### Creating a new dataframe for model 2

In [43]:
df_demand['is_Date_2'] = df_demand['is_Date']**2

In [44]:
df_demand1 = pd.DataFrame({"CabRequest":df_demand['CabRequest'], "Zone":df_demand['Zone'],"month":df_demand['month'],"is_Date_2":df_demand['is_Date_2'],"is_Date":df_demand['is_Date'],"week_day":df_demand['week_day'],"is_public_holidays":df_demand['is_public_holidays'],"is_extreme_cliamte":df_demand['is_extreme_cliamte']})
df_demand1.head()

Unnamed: 0,CabRequest,Zone,month,is_Date_2,is_Date,week_day,is_public_holidays,is_extreme_cliamte
451913,32,252,1,0,0,1,1,0
446598,571,235,1,0,0,1,1,0
395844,2,85,1,0,0,1,1,0
427945,1,178,1,0,0,1,1,0
442663,172,229,1,0,0,1,1,0


#### Assignment of columns to be used in model 2

In [46]:
cols_to_use = ['Zone','week_day','month','is_public_holidays','is_extreme_cliamte','is_Date_2','is_Date']
X = df_demand1[cols_to_use]
y = df_demand1.CabRequest

In [47]:
train, test = train_test_split(df_demand1, test_size=0.3, shuffle=False)
training_features = ['Zone','week_day','month','is_public_holidays','is_extreme_cliamte','is_Date_2','is_Date']
target = 'CabRequest'
train_X = train[training_features]
train_Y = train[target]
test_X = test[training_features]
test_Y = test[target]

In [48]:
model1 = LinearRegression()
model1.fit(train_X, train_Y)
train_pred_Y = model1.predict(train_X)
test_pred_Y = model1.predict(test_X)
train_pred_Y = pd.Series(train_pred_Y.clip(0, train_pred_Y.max()), index=train_Y.index)
test_pred_Y = pd.Series(test_pred_Y.clip(0, test_pred_Y.max()), index=test_Y.index)

# sklearn metrics take (y_true, y_pred); both metrics are symmetric, so the values are unchanged
rmse_train = np.sqrt(mean_squared_error(train_Y, train_pred_Y))
msle_train = mean_squared_log_error(train_Y, train_pred_Y)
rmse_test = np.sqrt(mean_squared_error(test_Y, test_pred_Y))
msle_test = mean_squared_log_error(test_Y, test_pred_Y)

print('rmse_train:',rmse_train,'msle_train:',msle_train)
print('rmse_test:',rmse_test,'msle_test:',msle_test)

rmse_train: 163.13741034118868 msle_train: 6.287014794172203
rmse_test: 153.60146980114249 msle_test: 4.964743551764663


For model 2, the test RMSE is 153.60146980114249 and the test MSLE is 4.964743551764663.

## Decision Tree Regressor

A decision tree builds regression or classification models in the form of a tree structure. It breaks a dataset down into smaller and smaller subsets while the associated decision tree is developed incrementally. The final result is a tree with decision nodes and leaf nodes.
A leaf node represents a decision on the numerical target. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.

In [49]:
regressor = DecisionTreeRegressor(random_state = 0) 
regressor.fit(train_X, train_Y)
train_pred_Y = regressor.predict(train_X)
test_pred_Y = regressor.predict(test_X)
train_pred_Y = pd.Series(train_pred_Y.clip(0, train_pred_Y.max()), index=train_Y.index)
test_pred_Y = pd.Series(test_pred_Y.clip(0, test_pred_Y.max()), index=test_Y.index)

# sklearn metrics take (y_true, y_pred); both metrics are symmetric, so the values are unchanged
rmse_train = np.sqrt(mean_squared_error(train_Y, train_pred_Y))
msle_train = mean_squared_log_error(train_Y, train_pred_Y)
rmse_test = np.sqrt(mean_squared_error(test_Y, test_pred_Y))
msle_test = mean_squared_log_error(test_Y, test_pred_Y)

print('rmse_train:',rmse_train,'msle_train:',msle_train)
print('rmse_test:',rmse_test,'msle_test:',msle_test)

rmse_train: 15.752281011632805 msle_train: 0.04882461568590056
rmse_test: 125.97500502340405 msle_test: 2.2486474646906824


For the decision tree regressor, the test RMSE is 125.97500502340405 and the test MSLE is 2.2486474646906824.

### Comparing the above models

| Model | Test RMSE | Test MSLE |
|---|---|---|
| Linear regression (model 1) | 150.39163246717686 | 5.795741623167363 |
| Linear regression (model 2) | 153.60146980114249 | 4.964743551764663 |
| Decision tree regressor | 125.97500502340405 | 2.2486474646906824 |




### Conclusion

The decision tree regressor has a train RMSE of about 15.75 but a test RMSE of about 125.98, and this large gap indicates that it is overfitting.
Hence, model 1 of linear regression is preferred.
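A large gap between train and test error is typical of an unrestricted tree, which can grow until it memorises the training rows. One common way to examine this is to compare it with a depth-limited tree. The sketch below does so on synthetic data; the data, the helper rmse and the value max_depth=3 are arbitrary assumptions for illustration and were not tuned or run on the taxi dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.5, size=400)  # noisy synthetic target

X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

def rmse(model, features, target):
    """Root mean squared error of a fitted model on the given data."""
    return np.sqrt(mean_squared_error(target, model.predict(features)))

full_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train)

for name, model in [("unrestricted", full_tree), ("max_depth=3", shallow_tree)]:
    print(name, "train:", rmse(model, X_train, y_train), "test:", rmse(model, X_test, y_test))
```

The unrestricted tree reproduces the training targets almost exactly, noise included, while the shallow tree cannot. Comparing the two train/test gaps is a simple diagnostic for overfitting.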

#### Checking the average number of CabRequest in a particular zone and sorting them in descending order

In [60]:
df_Output[["Zone", "CabRequest"]].groupby(['Zone'], as_index=False).mean().sort_values(by='CabRequest', ascending=False)

Unnamed: 0,Zone,CabRequest
233,Upper East Side South,481.978135
156,Midtown Center,452.690283
232,Upper East Side North,435.788525
157,Midtown East,411.589156
226,Times Sq/Theatre District,396.638308
182,Penn Station/Madison Sq West,396.123676
165,Murray Hill,375.748771
45,Clinton East,368.655643
230,Union Sq,364.107503
137,Lincoln Square East,350.419788


The table above shows that Upper East Side South has the highest average number of cab requests.

## Summary

A public dataset was chosen and cleaned. Data analysis was carried out to uncover interesting insights, and the data was transformed for better analysis. Linear regression and a decision tree regressor are the models used. The models were trained and tested, and the respective results were noted down. The three models had different RMSE and MSLE results, and the model with the least RMSE and MSLE score is chosen. The decision tree regressor has the lowest MSLE, which indicates that its predictions are closest to the observed values on a logarithmic scale.