<h1>Abstract

This notebook investigates the competition "New York City Taxi Fare Prediction" by Google Cloud, which aims to find the relationship between taxi fares and other features and then predict fares. The notebook covers data engineering, exploratory analysis, feature engineering, modeling, and validation. Statistical functions are used to identify fare-related factors, and the Seaborn package is used to visualize the relationship between fares and other factors. The sklearn package is used to perform Linear Regression, Random Forest Regression, and Gradient Boosting Regression to predict the taxi fare. The notebook also provides some examples of feature selection. In conclusion, we built three models using three different regression methods and chose the best one to predict fares for the test dataset.

<h1>Libraries

In [1]:
# importing libraries

# This Python 3 environment comes with many helpful analytics libraries installed

import numpy as np 
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import seaborn as sns
%matplotlib inline
import os


<h1>Dataset

In [None]:
# read data in pandas dataframe
df_train =  pd.read_csv('train.csv',nrows= 5000000,parse_dates=["pickup_datetime"])

# list first few rows (datapoints)
df_train.head()

<h1>A Glimpse of the Dataset

In [None]:
df_train.columns.values

In [None]:
df_train.describe()

In [None]:
df_train.dtypes

<h1>Check for Missing Data

In [None]:
df_train.isnull().sum()

In [None]:
df_train = df_train.dropna()

<h1> Data Manipulation and Exploratory Analysis for Outliers Detection

* Prior knowledge and findings based on the dataset:

   * Initial charge: $2.50 (taxi fares begin at $2.50 regardless of distance)

   * Mileage: 40 cents per 1/5 mile

   * Waiting charge: 40 cents per 120 seconds

   * JFK flat fare: $45 (was $35)

   * Newark surcharge: $15 (was $10) 4 p.m.–8 p.m. weekday

   * Latitude is bounded by (-90, 90) and longitude by (-180, 180), so anything outside these ranges is an error

   * The pickup location should not be identical to the dropoff location

   * Any value that is unrealistic or abnormal should be treated as an outlier
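
The metered rules listed above can be turned into a rough fare estimate. The sketch below is only an illustration: the function name `estimate_metered_fare` and its parameters are hypothetical, it charges fractions of a unit proportionally, and it ignores flat fares, surcharges, tolls, and tips.

```python
INITIAL_CHARGE = 2.50      # dollars, charged regardless of distance
CHARGE_PER_UNIT = 0.40     # dollars per distance unit or waiting unit
UNITS_PER_MILE = 5         # one distance unit is 1/5 mile
SECONDS_PER_UNIT = 120     # one waiting unit is 120 seconds


def estimate_metered_fare(miles, waiting_seconds=0):
    """Rough metered fare in dollars (hypothetical helper, not the official meter logic)."""
    distance_charge = CHARGE_PER_UNIT * UNITS_PER_MILE * miles
    waiting_charge = CHARGE_PER_UNIT * waiting_seconds / SECONDS_PER_UNIT
    return INITIAL_CHARGE + distance_charge + waiting_charge


# A 3-mile trip with 10 minutes of waiting time
print(f"Estimated fare: ${estimate_metered_fare(3.0, waiting_seconds=600):.2f}")
```

Under these rules the distance component works out to $2.00 per mile, which gives a simple yardstick for judging whether a recorded fare is plausible for a given trip.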

In [None]:
# Fares should not be negative; in addition, standard taxi fares begin at $2.50
df_train = df_train[df_train['fare_amount'] >= 2.5]
    
# Latitude and longitude should not be equal to 0 because the dataset is based in New York
df_train = df_train[df_train['pickup_latitude']!= 0]
df_train = df_train[df_train['pickup_longitude'] != 0]
df_train = df_train[df_train['dropoff_latitude'] != 0]
df_train = df_train[df_train['dropoff_longitude'] != 0]

# Latitude is bounded by [-90, 90] and longitude by [-180, 180]; drop coordinates outside those ranges
df_train = df_train[(df_train['pickup_latitude']<=90) & (df_train['pickup_latitude']>=-90)]
df_train = df_train[(df_train['pickup_longitude']<=180) & (df_train['pickup_longitude']>=-180)]
df_train = df_train[(df_train['dropoff_latitude']<=90) & (df_train['dropoff_latitude']>=-90)]
df_train = df_train[(df_train['dropoff_longitude']<=180) & (df_train['dropoff_longitude']>=-180)]
    
# Drop trips whose dropoff coordinates are identical to their pickup coordinates (no movement);
# a trip is kept if at least one of the two coordinates differs
df_train = df_train[(df_train['pickup_latitude'] != df_train['dropoff_latitude']) | (df_train['pickup_longitude'] != df_train['dropoff_longitude'])]

In [None]:
# list first few rows (datapoints)
df_train.head()

In [None]:
#Plot variables using only the first 10,000 rows for efficiency
df_train.iloc[:10000].plot.scatter('pickup_longitude', 'pickup_latitude')
df_train.iloc[:10000].plot.scatter('dropoff_longitude', 'dropoff_latitude')

Some latitudes and longitudes have values near 0, which cannot be correct since NYC is at approximately (40, -74). We will remove points that are not near these coordinates.

In [None]:
##Clean dataset
def clean_df(df):
    return df[(df.fare_amount > 0) & 
            (df.pickup_longitude > -80) & (df.pickup_longitude < -70) &
            (df.pickup_latitude > 35) & (df.pickup_latitude < 45) &
            (df.dropoff_longitude > -80) & (df.dropoff_longitude < -70) &
            (df.dropoff_latitude > 35) & (df.dropoff_latitude < 45)]

df_train= clean_df(df_train)

In [None]:
print(len(df_train))

In [None]:
# Distribution of fares
sns.distplot(df_train['fare_amount'])

plt.title('Distribution of Fare Amount')

In [None]:
%matplotlib inline
sns.boxplot(x=df_train['fare_amount'], palette="Set2")

plt.title('Looking for Outliers with a Boxplot')

We cannot yet tell whether these very large values are outliers; we will revisit this later.
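
One common heuristic for a first look at such values, which is not used elsewhere in this notebook, is the 1.5 x IQR rule that a boxplot applies when drawing its whiskers. The sketch below uses a small array of invented fares in place of `df_train['fare_amount']`; the numbers are illustrative only.

```python
import numpy as np

# Invented fares for illustration; not taken from the dataset
fares = np.array([4.5, 6.0, 7.5, 8.0, 9.5, 12.0, 15.0, 45.0, 250.0])

q1, q3 = np.percentile(fares, [25, 75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # values above this are drawn as individual points in a boxplot

flagged = fares[fares > upper_fence]
print(f"upper fence: {upper_fence:.2f}, flagged fares: {flagged.tolist()}")
```

A flagged value is not necessarily an error: a $45 fare matches the JFK flat fare listed earlier, whereas a $250 fare would need a closer look.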

* Double-check that there are no coordinate outliers

In [None]:
# Double check the coordinate by adding new features 'abs_diff_longitude' and 'abs_diff_latitude'
def add_travel_vector_features(df):
    df['abs_diff_longitude'] = (df.dropoff_longitude - df.pickup_longitude).abs()
    df['abs_diff_latitude'] = (df.dropoff_latitude - df.pickup_latitude).abs()




In [None]:
add_travel_vector_features(df_train)

In [None]:
len(df_train)

In [None]:
plot = df_train.iloc[:4841558].plot.scatter('abs_diff_longitude', 'abs_diff_latitude')

We expect most of these values to be very small (likely between 0 and 1) since it should all be differences between GPS coordinates within one city. For reference, one degree of latitude is about 69 miles.
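
To make the one-degree threshold concrete, the sketch below converts coordinate differences into approximate miles. It assumes about 69 miles per degree of latitude, as stated above, and scales longitude by the cosine of the latitude. The helper name `degree_diff_to_miles` and the reference latitude of 40.7 are illustrative assumptions.

```python
import numpy as np

MILES_PER_DEGREE_LAT = 69.0  # approximate value quoted in the text
NYC_LATITUDE = 40.7          # assumed reference latitude for New York City


def degree_diff_to_miles(abs_diff_latitude, abs_diff_longitude, latitude=NYC_LATITUDE):
    """Approximate north-south and east-west extents in miles (hypothetical helper)."""
    north_south = abs_diff_latitude * MILES_PER_DEGREE_LAT
    # Meridians converge toward the poles, so a degree of longitude shrinks with cos(latitude)
    east_west = abs_diff_longitude * MILES_PER_DEGREE_LAT * np.cos(np.radians(latitude))
    return north_south, east_west


ns, ew = degree_diff_to_miles(1.0, 1.0)
print(f"1 degree of latitude  ~ {ns:.0f} miles")
print(f"1 degree of longitude ~ {ew:.0f} miles near NYC")
```

Either way, a difference of a full degree corresponds to a trip of tens of miles, which is far longer than a typical ride within one city.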

In [None]:
# remove unrealistic data
print('Old size: %d' % len(df_train))
df_train = df_train[(df_train.abs_diff_longitude < 1.0) & (df_train.abs_diff_latitude < 1.0)]
print('New size: %d' % len(df_train))

In [None]:
#Distribution of Fare Amount after removing outliers
sns.distplot(df_train['fare_amount'])

plt.title('Distribution of Fare Amount after removing outliers')

In [None]:
len(df_train)

In [None]:
# Check the distribution of "passenger_count"
passenger_count = df_train.groupby(['passenger_count']).count()

fig, ax = plt.subplots(figsize=(15,8))

sns.barplot(x=passenger_count.index, y=passenger_count['key'], palette = "Set3")

plt.xlabel('Number of Passengers')
plt.ylabel('Count')
plt.title('Count of Passengers')
plt.show()

In [None]:
passenger_fare = df_train.groupby(['passenger_count']).mean(numeric_only=True)

fig, ax = plt.subplots(figsize=(15,8))

sns.barplot(x=passenger_fare.index, y=passenger_fare['fare_amount'], palette = "Set3")

plt.xlabel('Number of Passengers')
plt.ylabel('Average Fare Price')
plt.title('Average Fare Price for Number of Passengers')
plt.show()

In [None]:
df_train = df_train[(df_train['passenger_count']<=7) & (df_train['passenger_count']>=1)]

* Map Plot

In [None]:
import folium

In [None]:
coordinates = [[40.711303, -74.016048], [40.782004, -73.979268]]

# Create the map and add the line
m = folium.Map(location=[40.730610,-73.935242], zoom_start=12)
my_PolyLine=folium.PolyLine(locations=coordinates,weight=5, color = "black")
m.add_child(my_PolyLine)

# Feature Engineering

* Trip Distance

In [None]:
#calculate trip distance in miles
def distance(lat1, lat2, lon1,lon2):
    p = 0.017453292519943295 # Pi/180
    a = 0.5 - np.cos((lat2 - lat1) * p)/2 + np.cos(lat1 * p) * np.cos(lat2 * p) * (1 - np.cos((lon2 - lon1) * p)) / 2
    return 0.6213712 * 12742 * np.arcsin(np.sqrt(a))
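
As a quick sanity check of the haversine-based `distance` function above, the sketch below repeats the definition so that it runs on its own and compares two results against known values: identical points should give 0 miles, and one degree of latitude should come out at about 69 miles. The test coordinates are illustrative and are not taken from the dataset.

```python
import numpy as np


def distance(lat1, lat2, lon1, lon2):
    """Haversine distance in miles; note the (lat1, lat2, lon1, lon2) argument order."""
    p = np.pi / 180  # degrees to radians
    a = (0.5 - np.cos((lat2 - lat1) * p) / 2
         + np.cos(lat1 * p) * np.cos(lat2 * p) * (1 - np.cos((lon2 - lon1) * p)) / 2)
    # 12742 km is the Earth's diameter; 0.6213712 converts kilometres to miles
    return 0.6213712 * 12742 * np.arcsin(np.sqrt(a))


same_point = distance(40.75, 40.75, -73.99, -73.99)
one_degree_lat = distance(40.0, 41.0, -74.0, -74.0)
print(f"same point: {same_point:.2f} miles, one degree of latitude: {one_degree_lat:.2f} miles")
```

Because the function uses only NumPy operations, it also accepts whole pandas columns or NumPy arrays and returns one distance per row.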

In [None]:
# distance() uses only NumPy operations, so it can be applied to whole columns at once
df_train['trip_distance'] = distance(df_train['pickup_latitude'], df_train['dropoff_latitude'], df_train['pickup_longitude'], df_train['dropoff_longitude'])

In [None]:
df_train['trip_distance'].head(10)

In [None]:
plot = df_train.iloc[:4823477].plot.scatter('fare_amount', 'trip_distance')

In [None]:
df_train = df_train[(df_train['trip_distance']>=0.2)]

According to the billing rules, many rows in the data are completely unreasonable. For example, some long-distance trips cost less than $10. We should try to reduce the amount of such data.

Let's make a basic assumption that the part of a fare not explained by distance, that is, the $2.50 initial charge plus time spent in traffic jams, is at most $35, which corresponds to roughly one hour of traffic jams. We do not consider more extreme situations (such as traffic jams lasting two hours).
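
Under this assumption, each fare implies a plausible range of trip distances. The sketch below works through one fare by hand. The rate of 1.56 dollars per mile and the 0.2-mile allowance come from this notebook's own `min_distance` and `max_distance` formulas; the function name `distance_bounds` is hypothetical, and unlike `min_distance` it clamps the lower bound at zero.

```python
BASE_FARE = 2.50          # initial charge in dollars
MAX_NON_DISTANCE = 35.0   # assumed upper limit for initial charge plus waiting time
RATE_PER_MILE = 1.56      # per-mile rate used in this notebook's formulas


def distance_bounds(fare_amount):
    """Return (min_miles, max_miles) consistent with a fare (hypothetical helper)."""
    # Lower bound: as much of the fare as allowed is attributed to waiting in traffic
    min_miles = max((fare_amount - MAX_NON_DISTANCE) / RATE_PER_MILE, 0.0)
    # Upper bound: everything above the initial charge pays for distance, plus a 0.2-mile allowance
    max_miles = (fare_amount - BASE_FARE) / RATE_PER_MILE + 0.2
    return min_miles, max_miles


low, high = distance_bounds(50.0)
print(f"A $50.00 fare is consistent with {low:.1f} to {high:.1f} miles")
```

A trip whose measured distance falls outside this range for its fare is a candidate for correction or removal.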

In [None]:
# Create two features : "min_distance" and " max_distance "

In [None]:
def min_distance(df):
    df['min_distance'] = (df.fare_amount - 35)/1.56
   

In [None]:
min_distance(df_train)

In [None]:
def max_distance(df):
    df['max_distance'] = (df.fare_amount - 2.5) /1.56 +0.2
   

In [None]:
max_distance(df_train)

In [None]:
df_train.head()

In [None]:
# According to min_distance and max_distance, we need to remove trips whose min_distance >= 100 (miles)

In [None]:
df_train = df_train[(df_train['min_distance'] < 100)]

In [None]:
# Check for unreasonable data according to rules

In [None]:
unreal_data_1 = df_train.loc[(df_train['trip_distance']>df_train['max_distance'])]

In [None]:
unreal_data_2 = df_train.loc[(df_train['trip_distance']<df_train['min_distance'])]

In [None]:
# Keep the original row index so that update() can align these rows with df_train later
unreal_data = pd.concat([unreal_data_1, unreal_data_2])



In [None]:
unreal_data

In [None]:
len(unreal_data)

Replace the original distance values of these rows with distances derived from the fare, using the following formula:

distance = (fare_amount - 2.5)/1.56

In [None]:
unreal_data['trip_distance'] = (unreal_data['fare_amount'] - 2.50) / 1.56

In [None]:
# The distance values have been replaced by the newly calculated ones according to the fare
unreal_data

In [None]:
# Sync the training data with the newly computed distance values from the unreal_data dataframe
df_train.update(unreal_data[['trip_distance']])

In [None]:
len(df_train)

In [None]:
# Add some useful features instead of the feature "key" (Date)
def date_columns(data):
    data['key'] = pd.to_datetime(data['key'], yearfirst=True)
    data['year'] = data['key'].dt.year
    data['month'] = data['key'].dt.month
    data['day'] = data['key'].dt.day
    data['weekday'] = data['key'].dt.weekday
    data['hour'] = data['key'].dt.hour
    #data['day_of_week'] = data['key'].dt.day_name()

In [None]:
date_columns(df_train)
df_train.columns.values

In [None]:
df_train.dtypes

In [None]:
df_train.head(10)

In [None]:
#Hours_Plot
time_of_day = df_train.groupby(['hour']).mean(numeric_only=True)

plt.figure(figsize=(20,8))
plt.plot(time_of_day.index, time_of_day.fare_amount, color = 'blue')

plt.xlabel('Hour')
plt.ylabel('Fare Price')
plt.title('Average Fare Price During Time of Day')
plt.show()

In [None]:
#Time Series Plot

taxi = df_train.sort_values(by='key').reset_index()

year = taxi['key'].dt.year.astype(str)
month = taxi['key'].dt.month.astype(str)
day = taxi['key'].dt.day.astype(str)

date = year+"-"+month+"-"+day
date = pd.to_datetime(date)
year_month = year +'-'+month
year_month = pd.to_datetime(year_month)
taxi['year_month'] = year_month
taxi['date'] = date


taxi_rate = taxi.groupby(['date']).mean(numeric_only=True)

In [None]:
plt.figure(figsize=(20,8))

plt.plot(taxi_rate.index, taxi_rate.fare_amount, color = "#C2A0FA")



plt.xlabel('Year')
plt.ylabel('Average Fare Price Per Day')
plt.title('Average Fare Price Over the Years')
plt.show()

There has been a significant increase in fare price since 2013.
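
One way to check this observation numerically is to compare the mean fare per year. The sketch below uses a tiny made-up frame in place of `df_train`; the values are invented for illustration and are not results from the dataset.

```python
import pandas as pd

# Toy stand-in for df_train; the fares below are invented for illustration
toy = pd.DataFrame({
    "year": [2011, 2011, 2012, 2012, 2013, 2013],
    "fare_amount": [9.0, 11.0, 10.0, 12.0, 12.0, 14.0],
})

yearly_mean = toy.groupby("year")["fare_amount"].mean()
yearly_change = yearly_mean.pct_change() * 100  # percent change from the previous year

print(pd.DataFrame({"mean_fare": yearly_mean, "pct_change": yearly_change.round(1)}))
```

With the real data, the same two lines applied to `df_train` (which already has a `year` column) would show how large the change between consecutive years is.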

# Modelling

Linear Regression

In [None]:
from sklearn import metrics
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
import xgboost as xgb

In [None]:
X = df_train.drop(['fare_amount','key', 'pickup_datetime','max_distance','min_distance'],axis = 1)
y = df_train['fare_amount']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.01, random_state=46)
print (X_train.shape, y_train.shape)
print (X_test.shape, y_test.shape)

In [None]:
lm = LinearRegression()
lm.fit(X_train,y_train)
print(lm.score(X_train,y_train))
print(lm.score(X_test,y_test))

In [None]:
y_pred = lm.predict(X_test)
lrmse = np.sqrt(metrics.mean_squared_error(y_pred, y_test))
lrmse

Random Forest Regression

In [None]:
from sklearn.ensemble import RandomForestRegressor

randomForest = RandomForestRegressor(random_state=42)
randomForest.fit(X_train, y_train)

In [None]:
randomForestPredict = randomForest.predict(X_test)
randomForest_mse = mean_squared_error(y_test, randomForestPredict)
randomForestMSE = np.sqrt(randomForest_mse)
randomForestMSE

Gradient Boosting Regression

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
model_gradient= GradientBoostingRegressor(n_estimators=100, learning_rate=1, max_depth=3, random_state=0)
gradientBoost = model_gradient.fit(X_train, y_train)

In [None]:
predicted = model_gradient.predict(X_test)
grmse = np.sqrt(metrics.mean_squared_error(predicted, y_test))
grmse


In [None]:
# Generate bar chart of results

In [None]:
regression = pd.DataFrame({"regression": ['Multi Linear Regression','Random Forest',  'Gradient Boosting Regression'], 
                           "rmse": [lrmse,randomForestMSE,grmse]},columns = ['regression','rmse'])

In [None]:
regression = regression.sort_values(by='rmse', ascending = False)

In [None]:
sns.barplot(x=regression['rmse'], y=regression['regression'], palette = 'Set2')
plt.xlabel("Root Mean Square Error")
plt.ylabel('Regression Type')
plt.title('Comparing the different types of Regressions used')

# Test Submission

In [None]:
test = pd.read_csv('test.csv')

test.head(10)

In [None]:
# generate new features instead of Date
date_columns(test)

In [None]:

test.columns.values

In [None]:
add_travel_vector_features(test)


In [None]:
test.columns.values

In [None]:
test['trip_distance'] = distance(test['pickup_latitude'], test['dropoff_latitude'], test['pickup_longitude'], test['dropoff_longitude'])

In [None]:
test.sort_values(by='trip_distance',ascending= False)

In [None]:
# Drop unused columns and reorder the rest to match the feature order used for training
testing = test.drop(['key','pickup_datetime'], axis = 1)[X.columns]

In [None]:
sample_submission =  pd.read_csv('sample_submission.csv')

In [None]:
len(test)

In [None]:
randomForestPredict = randomForest.predict(testing)

In [None]:


submission = pd.DataFrame({"key": sample_submission['key'], "fare_amount": randomForestPredict},columns = ['key','fare_amount'])

In [None]:
submission.to_csv('submission.csv', index = False)

# <h1>Conclusion





1. Only about 10% of the training data was used; more rows of data could mean more possibilities and better insights.

2. Performed data manipulation and visualized the distribution of taxi fares to remove some outliers.

3. Created many new features for a better model, including absolute longitude and latitude differences, trip distance, year, month, day, weekday, and hour.

4. Built three models using three different regression methods (Linear Regression, Random Forest, Gradient Boosting Regression) and compared their RMSE on a held-out validation split.




# <h1>Contributions

* 90% of the code was written by myself; the remaining 10% was adapted from online package tutorials.
* Data Engineering
* Exploratory Analysis
* Modeling
* Feature Engineering
* Regression analysis



# <h1>Citations

All the regression analyses with the sklearn package are based on https://scikit-learn.org/stable/supervised_learning.html#supervised-learning

Some of the visualization functions from Seaborn are based on https://seaborn.pydata.org/tutorial.html

The map visualization of NYC with Folium is based on https://github.com/python-visualization/folium

# <h1>License

The text in the document by Yufan Yang is licensed under CC BY 3.0 (https://creativecommons.org/licenses/by/3.0/us/). The code in the document by Yufan Yang is licensed under the MIT License (https://opensource.org/licenses/MIT).