![New York Taxi](https://media.wired.com/photos/595485ddce3e5e760d52d542/master/pass/GettyImages-182859572.jpg)

# Welcome onboard
Hi everyone, thanks for joining me here. In this notebook, I will walk you through my analysis of the New York City Taxi Fare Prediction problem. It seemed like an interesting problem, so I thought of creating a kernel.


Let's get started!

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from math import sin, cos, sqrt, atan2, radians
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn import ensemble
from sklearn.preprocessing import RobustScaler
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
import warnings
from sklearn.model_selection import train_test_split
warnings.filterwarnings('ignore')
%matplotlib inline  
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

Let's import the data and get a high-level summary, which will set the stage for further analysis. Since the training set is too large to fit comfortably into memory, we will load only a portion of it (the first 99,999 rows) for our analysis.
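As a side note, `nrows` is not the only way to cope with a large CSV. The sketch below contrasts it with the `chunksize` option of `pd.read_csv`, which streams the file in pieces. The in-memory CSV and its values are made up for illustration and stand in for `train.csv`.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for train.csv (made-up rows, illustration only).
csv_text = """key,fare_amount,passenger_count
a,4.5,1
b,16.9,1
c,5.7,2
d,7.7,1
e,5.3,1
"""

# nrows reads only the first N data rows, which is what this notebook does.
head_sample = pd.read_csv(io.StringIO(csv_text), index_col="key", nrows=3)

# chunksize returns an iterator of DataFrames, so the whole file can be
# scanned without holding all of it in memory at once.
row_count = 0
fare_total = 0.0
for chunk in pd.read_csv(io.StringIO(csv_text), index_col="key", chunksize=2):
    row_count += len(chunk)
    fare_total += chunk["fare_amount"].sum()

print(len(head_sample), row_count, fare_total / row_count)
```

Reading the first rows is simpler, but keep in mind that the first N rows are not necessarily a random sample of the file.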

In [None]:
taxi_ride_train= pd.read_csv("../input/train.csv", sep=",", index_col="key", header=0, parse_dates=["pickup_datetime"], nrows=99999)
taxi_ride_test= pd.read_csv("../input/test.csv", sep=",", index_col="key", header=0, parse_dates=["pickup_datetime"])
taxi_ride_train.head()

In [None]:
print("The shape of the train data is {0}".format((taxi_ride_train.shape)))
print("The shape of the test data is {0}".format((taxi_ride_test.shape)))

In [None]:
taxi_ride_train.info()

In [None]:
taxi_ride_test.info()

In [None]:
taxi_ride_train.dtypes.value_counts().reset_index()

So our dataset consists of 6 numerical columns, of which 1 is discrete (passenger_count) and the rest are continuous. We also have one date-time variable.

In [None]:
taxi_ride_train.isnull().sum().sum()

In [None]:
taxi_ride_test.isnull().sum().sum()

In case we have any null entries, we drop them and then re-check the counts to be sure none remain.

In [None]:
taxi_ride_train=taxi_ride_train.dropna(axis=0)
taxi_ride_test=taxi_ride_test.dropna(axis=0)
print(taxi_ride_train.isnull().sum().sum())
print(taxi_ride_test.isnull().sum().sum())

We will start by creating some additional features that are not provided explicitly.

# Feature Generation

In [None]:
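def calculate_distance(row):
    """Haversine (great-circle) distance in km between pickup and dropoff."""
    R = 6373.0 # approximate radius of earth in km
    # look columns up by name so the result does not depend on column order
    lat1 = radians(row['pickup_latitude'])
    lon1 = radians(row['pickup_longitude'])
    lat2 = radians(row['dropoff_latitude'])
    lon2 = radians(row['dropoff_longitude'])
    longitude_distance = lon2 - lon1
    latitude_distance = lat2 - lat1
    a = sin(latitude_distance / 2)**2 + cos(lat1) * cos(lat2) * sin(longitude_distance / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    distance = R * c
    return distance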

In [None]:
taxi_ride_train['ride_distance_km']=taxi_ride_train[['pickup_latitude','pickup_longitude','dropoff_latitude','dropoff_longitude']].apply(calculate_distance, axis=1)
taxi_ride_test['ride_distance_km']=taxi_ride_test[['pickup_latitude','pickup_longitude','dropoff_latitude','dropoff_longitude']].apply(calculate_distance, axis=1)

Here we computed the great-circle (haversine) distance between the pickup and dropoff points for both the training and test sets. Next we will spend some time investigating and improving the quality of this newly created variable.
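The row-wise `apply` works, but it can be slow on larger samples. Here is a minimal sketch of the same haversine formula vectorized with NumPy. The name `haversine_km` and the sample coordinates are my own for illustration; the radius matches the value used in this notebook.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6373.0):
    """Great-circle distance in km; accepts scalars or NumPy arrays in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

# Made-up coordinates: a zero-length ride and a ride of one degree of latitude.
pickup_lat = np.array([40.75, 40.0])
pickup_lon = np.array([-73.98, -74.0])
dropoff_lat = np.array([40.75, 41.0])
dropoff_lon = np.array([-73.98, -74.0])

distances = haversine_km(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon)
print(distances)
```

With a DataFrame, the four arguments would be the `.values` of the corresponding coordinate columns, so the whole column is computed in one call.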

In [None]:
taxi_ride_train['ride_distance_km'].describe()

In [None]:
sns.boxplot(taxi_ride_train['ride_distance_km'])

From the above analysis it is clear that this variable has outliers; normally we would not expect a ride longer than, let's say, 30 km. Let's fix this outlier problem.

In [None]:
IQR = taxi_ride_train.ride_distance_km.quantile(0.75) - taxi_ride_train.ride_distance_km.quantile(0.25)
Lower_fence = taxi_ride_train.ride_distance_km.quantile(0.25) - (IQR * 3)
Upper_fence = taxi_ride_train.ride_distance_km.quantile(0.75) + (IQR * 3)
print('Distance outliers are values < {lowerboundary} or > {upperboundary}'.format(lowerboundary=Lower_fence, upperboundary=Upper_fence))

The IQR gives us an estimate of the range within which most values should fall. Since this variable is a ride distance, we expect very few rides beyond the upper fence, but that does not mean nobody can book a longer ride. So in this case I will set the upper boundary manually to 30 km: any distance above that is replaced with 30. For the lower boundary, a distance can't be negative, so the minimum is 0.
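To make the two ideas concrete, here is a small sketch on made-up distances: first the 3 × IQR fences, then the manual clipping to the 0 to 30 km range. `Series.clip` is used as a compact alternative to `np.where`; the numbers are purely illustrative.

```python
import pandas as pd

# Made-up ride distances in km, with one obvious outlier.
distances = pd.Series([0.5, 1.2, 2.0, 2.8, 3.5, 4.1, 120.0])

# Fences at 3 * IQR beyond the quartiles.
q1, q3 = distances.quantile(0.25), distances.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 3 * iqr
upper_fence = q3 + 3 * iqr
n_outliers = int(((distances < lower_fence) | (distances > upper_fence)).sum())

# Manual boundaries: values outside [0, 30] are replaced by the boundary.
clipped = distances.clip(lower=0.0, upper=30.0)

print(lower_fence, upper_fence, n_outliers)
print(clipped.tolist())
```

Clipping keeps every row and only caps the extreme values, which is why a small pile-up at the boundary is expected afterwards.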

In [None]:
distance_outlier_train=len(taxi_ride_train[taxi_ride_train['ride_distance_km']>=30])
distance_outlier_test=len(taxi_ride_test[taxi_ride_test['ride_distance_km']>=30])
print("There are {0} train rows and {1} test rows that have a distance of 30 km or more".format(distance_outlier_train,distance_outlier_test))

In [None]:
taxi_ride_train['ride_distance_km'] = np.where(taxi_ride_train['ride_distance_km'].astype("float64") <= 30.0, taxi_ride_train['ride_distance_km'], 30.0)
taxi_ride_train['ride_distance_km'] = np.where(taxi_ride_train['ride_distance_km'].astype("float64") >= 0.0 , taxi_ride_train['ride_distance_km'], 0.0)

taxi_ride_test['ride_distance_km'] = np.where(taxi_ride_test['ride_distance_km'].astype("float64") <= 30.0, taxi_ride_test['ride_distance_km'], 30.0)
taxi_ride_test['ride_distance_km'] = np.where(taxi_ride_test['ride_distance_km'].astype("float64") >= 0.0 , taxi_ride_test['ride_distance_km'], 0.0)

In [None]:
sns.boxplot(taxi_ride_train['ride_distance_km'])

In [None]:
sns.jointplot(x="ride_distance_km", y="fare_amount", data=taxi_ride_train);

Now this is much better. The box plot still shows some upper outliers, but we are okay with that; the distance now ranges from 0 to 30 km. The joint plot shows a good relationship with the fare, although there is a pile-up of points at the left and right edges due to the clipping we performed, but that's OK. We will apply a transformation to this variable later.

Let's see what additional features we can create from the existing data.

Next we have the pickup timestamp. We can pull some useful information out of this variable, and we will do that for both the train and test sets.

In [None]:
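pick_up_date_train = taxi_ride_train.loc[:,'pickup_datetime']
pick_up_date_test = taxi_ride_test.loc[:,'pickup_datetime']
# note: .dt.week (used below) was removed in pandas 2.0; .dt.isocalendar().week replaces it there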
temp_df_train=pd.DataFrame({"year": pick_up_date_train.dt.year,
              "month": pick_up_date_train.dt.month,
              "day": pick_up_date_train.dt.day,
              "hour": pick_up_date_train.dt.hour,
              "dayofyear": pick_up_date_train.dt.dayofyear,
              "week": pick_up_date_train.dt.week,
              "weekday": pick_up_date_train.dt.weekday,
              "quarter": pick_up_date_train.dt.quarter,
             })

temp_df_test=pd.DataFrame({"year": pick_up_date_test.dt.year,
              "month": pick_up_date_test.dt.month,
              "day": pick_up_date_test.dt.day,
              "hour": pick_up_date_test.dt.hour,
              "dayofyear": pick_up_date_test.dt.dayofyear,
              "week": pick_up_date_test.dt.week,
              "weekday": pick_up_date_test.dt.weekday,
              "quarter": pick_up_date_test.dt.quarter,
             })

taxi_ride_train= pd.concat([taxi_ride_train, temp_df_train], axis=1)
taxi_ride_test= pd.concat([taxi_ride_test, temp_df_test], axis=1)
taxi_ride_train.drop("pickup_datetime", inplace=True, axis=1)
taxi_ride_test.drop("pickup_datetime", inplace=True, axis=1)
taxi_ride_train.head()
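Since the train and test sets get identical treatment, the extraction above can also be written once as a helper. This is only a sketch: `add_datetime_parts` is a name I made up, the sample rows are invented, and the `week` feature is left out because its accessor differs between pandas versions.

```python
import pandas as pd

def add_datetime_parts(df, column="pickup_datetime"):
    """Return a copy of df with calendar features replacing the datetime column."""
    stamps = df[column]
    parts = pd.DataFrame({
        "year": stamps.dt.year,
        "month": stamps.dt.month,
        "day": stamps.dt.day,
        "hour": stamps.dt.hour,
        "dayofyear": stamps.dt.dayofyear,
        "weekday": stamps.dt.weekday,  # Monday=0 ... Sunday=6
        "quarter": stamps.dt.quarter,
    }, index=df.index)
    return pd.concat([df.drop(columns=column), parts], axis=1)

# Made-up rides for illustration.
rides = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2012-04-21 04:30:42", "2010-01-05 16:52:16"]),
    "passenger_count": [1, 2],
})
features = add_datetime_parts(rides)
print(features)
```

Calling the same helper on both frames guarantees that train and test end up with the same columns in the same order.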

I think we are good with the features now. Let's move to the next section of our analysis.

# Univariate analysis

In [None]:
taxi_ride_train.dtypes.value_counts().reset_index()

Perfect, all the variables are numeric, so we can start by looking at them individually to capture their distributions and identify outliers.

In [None]:
print("The new dataset contains {0} null entries ".format(taxi_ride_train.isnull().sum().sum()))

In [None]:
sns.distplot(taxi_ride_train['fare_amount'])

In [None]:
taxi_ride_train['fare_amount'].describe()

In [None]:
length_before=len(taxi_ride_train)
taxi_ride_train= taxi_ride_train[taxi_ride_train.fare_amount>=0.0]
length_after=len(taxi_ride_train)
print("No of rows removed {0}".format(length_before-length_after))

Since a fare amount can't be negative, I removed the rows that reported a negative fare. The distribution also looks right-skewed; we will fix this below.

We will try making the distribution more normal by applying a logarithmic transformation.
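As a quick illustration of what the transformation does and how to undo it, here is a sketch on made-up fares. `np.log1p(x)` computes `log(x + 1)` and `np.expm1(y)` computes `exp(y) - 1`, so the two are exact inverses of each other.

```python
import numpy as np
import pandas as pd

# Made-up fares in dollars with a long right tail.
fares = pd.Series([2.5, 4.5, 6.0, 8.1, 12.5, 52.0, 95.0])

log_fares = np.log1p(fares)      # same as np.log(fares + 1)
recovered = np.expm1(log_fares)  # inverse transform: exp(y) - 1

print("skewness before:", fares.skew())
print("skewness after: ", log_fares.skew())
```

Any prediction made on the log scale has to go through the same inverse before it is reported as a fare.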

In [None]:
print("Skewness before transformation {0}".format( taxi_ride_train.fare_amount.skew()))
sns.distplot(np.log(taxi_ride_train['fare_amount']+1))
taxi_ride_train['fare_amount']=np.log(taxi_ride_train['fare_amount']+1)
print("Skewness after transformation {0}".format( taxi_ride_train.fare_amount.skew()))

Before moving on to the other variables, I will quickly create the training, validation and test sets and separate the dependent variable from the independent variables.

In [None]:
Y_train=taxi_ride_train.fare_amount
X_train=taxi_ride_train.drop("fare_amount", axis=1)
X_test=taxi_ride_test
X_train, X_valid, Y_train, Y_valid = train_test_split(X_train, Y_train, test_size=0.33, random_state=42)
print("Shape of training set is {0}".format(X_train.shape))
print("Shape of Validation set is {0}".format(X_valid.shape))
print("Shape of testing set is {0}".format(X_test.shape))

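Let's quickly separate the discrete and continuous variables, since the analysis differs a bit between them. A column is treated as discrete when its number of distinct values is less than 10% of the number of rows.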
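Here is a minimal sketch of that heuristic on synthetic data. The helper name `split_columns_by_cardinality` and the random columns are assumptions for illustration; the 0.1 threshold is the one used in this notebook.

```python
import numpy as np
import pandas as pd

def split_columns_by_cardinality(df, ratio_threshold=0.1):
    """Split column names into (discrete, continuous) by unique-value ratio."""
    discrete, continuous = [], []
    for col in df.columns:
        unique_ratio = df[col].nunique() / len(df)
        if unique_ratio < ratio_threshold:
            discrete.append(col)
        else:
            continuous.append(col)
    return discrete, continuous

# Synthetic columns: one with a handful of values, one with many.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "passenger_count": rng.integers(1, 7, size=1000),
    "pickup_longitude": rng.normal(-73.98, 0.05, size=1000),
})
discrete_cols, continuous_cols = split_columns_by_cardinality(toy)
print(discrete_cols, continuous_cols)
```

The ratio depends on the sample size, so a column can switch sides when more rows are loaded.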

In [None]:
discrete_col_list=[]
continous_col_list=[]
for col in X_train.columns.tolist():
    if(taxi_ride_train[col].value_counts().count()/len(taxi_ride_train)) < 0.1:
        discrete_col_list.append(col)
    else:
        continous_col_list.append(col)
print("The discrete columns in our data are {0}".format(discrete_col_list))
print("The continuous columns in our data are {0}".format(continous_col_list))

In [None]:
# box plot and histogram of each continuous variable
for var in continous_col_list:
    plt.figure(figsize=(15,6))
    plt.subplot(1, 2, 1)
    fig = taxi_ride_train.boxplot(column=var)
    fig.set_title('')
    
    plt.subplot(1, 2, 2)
    fig = taxi_ride_train[var].hist(bins=20)
    fig.set_xlabel(var)
 
    plt.show()

It seems we have outliers in all the continuous variables, and that is why outliers showed up in our distance calculation. I will use the clipping method again to fix these columns.

In [None]:
# Fix the range of latitude 
latitude_upper_range=90.0
latitude_lower_range=-90.0
for var in ['pickup_latitude','dropoff_latitude']:
    taxi_ride_train[var] = np.where(taxi_ride_train[var].astype("float64") <= latitude_upper_range, taxi_ride_train[var], latitude_upper_range)
    taxi_ride_train[var] = np.where(taxi_ride_train[var].astype("float64") >= latitude_lower_range , taxi_ride_train[var], latitude_lower_range)
    
    taxi_ride_test[var] = np.where(taxi_ride_test[var].astype("float64") <= latitude_upper_range, taxi_ride_test[var], latitude_upper_range)
    taxi_ride_test[var] = np.where(taxi_ride_test[var].astype("float64") >= latitude_lower_range , taxi_ride_test[var], latitude_lower_range)
    
# Fix the range of longitude 
longitude_upper_range=180.0
longitude_lower_range=-180.0
for var in ['pickup_longitude','dropoff_longitude']:
    taxi_ride_train[var] = np.where(taxi_ride_train[var].astype("float64") <= longitude_upper_range, taxi_ride_train[var], longitude_upper_range)
    taxi_ride_train[var] = np.where(taxi_ride_train[var].astype("float64") >= longitude_lower_range , taxi_ride_train[var], longitude_lower_range)
    
    taxi_ride_test[var] = np.where(taxi_ride_test[var].astype("float64") <= longitude_upper_range, taxi_ride_test[var], longitude_upper_range)
    taxi_ride_test[var] = np.where(taxi_ride_test[var].astype("float64") >= longitude_lower_range , taxi_ride_test[var], longitude_lower_range)

In [None]:
for var in continous_col_list:
    plt.figure(figsize=(15,6))
    plt.subplot(1, 2, 1)
    fig = taxi_ride_train.boxplot(column=var)
    fig.set_title('')
    
    plt.subplot(1, 2, 2)
    fig = taxi_ride_train[var].hist(bins=20)
    fig.set_xlabel(var)

In [None]:
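sns.distplot(np.sqrt(taxi_ride_train["ride_distance_km"]))
# X_train and X_valid are copies made by train_test_split, while X_test is the same
# object as taxi_ride_test, so transform every frame to keep the feature consistent
for frame in (taxi_ride_train, taxi_ride_test, X_train, X_valid):
    frame["ride_distance_km"] = np.sqrt(frame["ride_distance_km"])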

In [None]:
for var in discrete_col_list:
    fig, ax = plt.subplots()
    fig.set_size_inches(8, 8)
    sns.countplot(x=taxi_ride_train[var], ax=ax)

The discrete variables look fine in terms of cardinality and frequency and do not require any processing.

# Bivariate analysis

Let's explore some interesting patterns by studying the relationship between pairs of variables.

In [None]:
#sns.pairplot(X_train[continous_col_list])
sns.pairplot(taxi_ride_train, x_vars=continous_col_list, y_vars='fare_amount', height=15, aspect=0.7, kind='reg')  # 'size' was renamed to 'height' in seaborn 0.9

In [None]:
sns.heatmap(X_train.corr())

It seems there is some interaction between some of the variables, which suggests collinearity. We will remove it during feature selection.

In [None]:
taxi_ride_train.groupby("hour")['fare_amount'].sum().plot()

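Here we can start seeing the effect of our discrete variables on taxi ride sales. The total peaks at around 19:00, which seems like a reasonable bet. Note that fare_amount is already log-transformed at this point, so these sums are totals of log fares rather than dollars.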

In [None]:
taxi_ride_train.groupby("weekday")['fare_amount'].sum().plot()

The totals are highest on Thursday and Friday, a nod to all the party-goers. There is also an interesting trend: I expected sales to be roughly comparable across working days, but that is not the case. The curve rises as the week progresses. Why would someone not use a cab on Monday or Tuesday but start using one on Wednesday or Thursday? This is worth investigating further.

In [None]:
taxi_ride_train.groupby("passenger_count")['fare_amount'].sum().plot()

So we have most of the rides booked by 1-2 passengers. 

In [None]:
taxi_ride_train.groupby("month")['fare_amount'].sum().plot()

Sales were quite good in the first half of the year and then started to decrease, which is another point to investigate.

In [None]:
taxi_ride_train.groupby("year")['fare_amount'].sum().plot()

So the ride bookings dipped in 2010 and then picked up again. We see a drop from 2014, probably because the sample does not contain enough data for that period.

In [None]:
pd.crosstab(index=taxi_ride_train.quarter, columns="count", margins=True) # number of rides per quarter

So sales were really good in the 1st and 2nd quarters, and not too bad in the 3rd and 4th quarters either.

# Feature selection

In [None]:
constant_features = [
    feat for feat in taxi_ride_train.columns if taxi_ride_train[feat].std() == 0
]
print(constant_features)

In [None]:
sel_ = SelectFromModel(RandomForestRegressor(n_estimators=100))
sel_.fit(X_train, Y_train)
selected_feat = X_train.columns[(sel_.get_support())]
print("The features with the highest importance are {0}".format(list(selected_feat)))

In [None]:
def correlation(dataset, threshold):
    col_corr = set()  # Set of all the names of correlated columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold: # we are interested in absolute coeff value
                colname = corr_matrix.columns[i]  # getting the name of column
                col_corr.add(colname)
    return col_corr

corr_features = correlation(X_train, 0.8)
print("The features that are correlated with each other are {0}".format(corr_features))
X_train.drop(labels=corr_features, axis=1, inplace=True)
X_valid.drop(labels=corr_features, axis=1, inplace=True)
X_test.drop(labels=corr_features, axis=1, inplace=True)
print(X_train.shape)
print(X_valid.shape)
print(X_test.shape)

Here, for each pair of features whose absolute correlation exceeds 0.8, we dropped one of the two, since the second feature adds little information that the first does not already carry.
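To show which member of a correlated pair gets dropped, here is a sketch of the same lower-triangle scan on synthetic columns. The function name `correlated_columns` and the toy data are made up; the logic mirrors the `correlation` function above.

```python
import numpy as np
import pandas as pd

def correlated_columns(df, threshold):
    """Names of columns highly correlated with an earlier column in df."""
    corr = df.corr().abs()
    to_drop = set()
    for i in range(len(corr.columns)):
        for j in range(i):  # only compare with columns that come before column i
            if corr.iloc[i, j] > threshold:
                to_drop.add(corr.columns[i])
    return to_drop

# Synthetic data: dayofyear is almost a function of month, hour is independent.
rng = np.random.default_rng(42)
month = rng.integers(1, 13, size=500)
toy = pd.DataFrame({
    "month": month,
    "dayofyear": month * 30 + rng.integers(0, 30, size=500),
    "hour": rng.integers(0, 24, size=500),
})
dropped = correlated_columns(toy, 0.8)
print(dropped)
```

Because only the later column of each pair is collected, the result depends on column order: the feature that appears first in the frame is the one that survives.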

# Model Generation

Next we perform feature scaling. Since our variables have some skewness and outliers, we will use the robust scaler, which centers on the median and scales by the interquartile range and is therefore less affected by extreme values.
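A small sketch of what `RobustScaler` computes with its default settings, checked against a manual calculation. The single-column array is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One made-up feature with an extreme value.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(x)

# Manual version: subtract the median, divide by the interquartile range.
median = np.median(x, axis=0)
q25, q75 = np.percentile(x, [25, 75], axis=0)
manual = (x - median) / (q75 - q25)

print(scaled.ravel())
print(manual.ravel())
```

The extreme value stays extreme after scaling, but it does not distort the center or spread used to scale the other rows.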

In [None]:
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train) #  fit  the scaler to the train set and then transform it
X_valid_scaled = scaler.transform(X_valid)
X_test_scaled = scaler.transform(X_test) # transform (scale) the test set

We start by fitting the simplest model, a linear regression, and see what it brings to the table. We will use residual plots to get an intuition of whether the relationship is linear or not.

In [None]:
regr = linear_model.LinearRegression()
regr.fit(X_train_scaled, Y_train)
Y_valid_pred = regr.predict(X_valid_scaled)
Y_test_pred = regr.predict(X_test_scaled)
print('Coefficients: \n', regr.coef_)
print("Mean squared error: %.2f"
      % mean_squared_error(Y_valid, Y_valid_pred))
print('Variance score: %.2f' % r2_score(Y_valid, Y_valid_pred))

In [None]:
# the function can be used to generate residual plots
def generate_residual_plot(label, prediction, type):
    plt.scatter(prediction, np.subtract(label, prediction))  # scatter plot
    title = 'Residual plot for predicting ' + type
    plt.title(title)  # set title
    plt.xlabel("Fitted Value")
    plt.ylabel("Residuals")
    plt.tight_layout()
    plt.hlines(y=0, xmin=min(prediction), xmax=max(prediction), colors='orange', linewidth=3)  # plot ref line

In [None]:
# function that can be used to generate a scatter plot of actual vs prediction values
def generate_actual_vs_predicted_plot(label, prediction, type):
    plt.scatter(prediction, label, s=30, c='r', marker='+', zorder=10)  # scatter plot
    title = 'Actual vs Predicted values for ' + type
    plt.title(title)  # set title
    plt.xlabel("Predicted Values from model")  # set the xlabel
    plt.ylabel("Actual Values")  # set the ylabel
    plt.tight_layout()

In [None]:
generate_residual_plot(Y_valid, Y_valid_pred,
                       "Taxi fares")

In [None]:
generate_actual_vs_predicted_plot(Y_valid, Y_valid_pred,
                       "Taxi fares")

From the above plots it is clear that the relationship between the dependent and independent variables is somewhat non-linear.

In [None]:
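# loss is left at its default (least squares; named 'ls' in older scikit-learn, 'squared_error' in newer releases)
params = {'n_estimators': 700, 'max_depth': 2, 'min_samples_split': 2,
          'learning_rate': 0.01}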
clf = ensemble.GradientBoostingRegressor(**params)

clf.fit(X_train_scaled, Y_train)
mse = mean_squared_error(Y_valid, clf.predict(X_valid_scaled))
print("MSE: %.4f" % mse)
print('Variance score: %.2f' % r2_score(Y_valid, clf.predict(X_valid_scaled)))
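Keep in mind that the MSE above is measured on log-transformed fares. The sketch below shows one way to express the error in dollars instead, by inverting the `log(fare + 1)` transform before computing the RMSE. The arrays are made-up stand-ins for `Y_valid` and the model's predictions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted fares, stored on the log(fare + 1) scale.
y_true_log = np.log1p(np.array([5.0, 10.0, 20.0, 40.0]))
y_pred_log = np.log1p(np.array([6.0, 9.0, 25.0, 30.0]))

# Error on the log scale, comparable to the MSE printed above.
mse_log = mean_squared_error(y_true_log, y_pred_log)

# Back to dollars with the inverse transform, then RMSE.
y_true = np.expm1(y_true_log)
y_pred = np.expm1(y_pred_log)
rmse_dollars = np.sqrt(mean_squared_error(y_true, y_pred))

print("MSE (log scale): %.4f" % mse_log)
print("RMSE (dollars):  %.2f" % rmse_dollars)
```

The two numbers are not interchangeable: a small log-scale error can still be several dollars on expensive rides.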

In [None]:
test_pred=pd.DataFrame(clf.predict(X_test_scaled), index=X_test.index)
test_pred.columns=["fare_amount"]
test_pred['fare_amount']= np.exp(test_pred.fare_amount) - 1  # invert the log(fare + 1) transform
test_pred.to_csv("my_submission.csv")


In [None]:
# Plot feature importance
feature_importance = clf.feature_importances_
# make importances relative to max importance
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
plt.subplot(1, 2, 2)
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, X_train.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()

In [None]:
generate_actual_vs_predicted_plot(Y_valid, clf.predict(X_valid_scaled),
                       "Taxi fares")

In [None]:
generate_residual_plot(Y_valid, clf.predict(X_valid_scaled),
                       "Taxi fares")

# Conclusion

This was an interesting regression problem. The main variable that seems to influence the prediction is ride distance. The GBR gave better performance than the simple linear model.

Thanks, everyone, for going through the analysis. If you enjoyed it, do like it!