# **New York City Taxi Trip Duration**

The competition dataset is based on the 2016 NYC Yellow Cab trip record data made available in BigQuery on Google Cloud Platform. The data was originally published by the NYC Taxi and Limousine Commission (TLC) and was sampled and cleaned for the purposes of this playground competition. Based on individual trip attributes, participants should predict the duration of each trip in the test set.

# **Setup**

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from pandas import Series
from datetime import datetime

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict

# **Load the Data**

In [None]:
train_df = pd.read_csv('/kaggle/input/nyc-taxi-trip-duration/train.zip')
test_df = pd.read_csv('/kaggle/input/nyc-taxi-trip-duration/test.zip')

**Take a Quick Look at the Data Structure**

In [None]:
train_df.shape, test_df.shape

In [None]:
train_df.info()

In [None]:
train_df.duplicated().sum()

In [None]:
train_df.isna().sum()

> ***Finding:** There are no duplicated or missing values*

# **Visualization**

In [None]:
round(train_df.describe())

> ***Finding**: We clearly see trip_duration takes strange values for min and max. Let's have a quick visualization with a boxplot.*

**Outlier visualization**

In [None]:
plt.subplots(figsize=(18,6))
plt.title("Outliers visualization")
train_df.boxplot();

> ***Finding:** We are asked to predict the trip_duration of the test set, so I first check what kinds of trip durations are present in the dataset. Because of the outliers, I decided to use a log scale.*

**Visualize the trip duration**

In [None]:
%matplotlib inline

sns.set(style="white", palette="muted", color_codes=True)
f, axes = plt.subplots(1, 1, figsize=(11, 7), sharex=True)
sns.despine(left=True)
sns.distplot(np.log(train_df['trip_duration'].values+1), axlabel = 'Log(trip_duration)', label = 'log(trip_duration)', bins = 50, color="y")
plt.setp(axes, yticks=[])
plt.tight_layout()
plt.show()

> ***Finding**: There are outliers in trip_duration. They will probably hurt my model, so I choose to get rid of them.*

**Visualize pickup and dropoff coordinates**



In [None]:
df = train_df.loc[(train_df.pickup_latitude > 40.6) & (train_df.pickup_latitude < 40.9)]
df = df.loc[(df.dropoff_latitude>40.6) & (df.dropoff_latitude < 40.9)]
df = df.loc[(df.dropoff_longitude > -74.05) & (df.dropoff_longitude < -73.7)]
df = df.loc[(df.pickup_longitude > -74.05) & (df.pickup_longitude < -73.7)]
train_data_new = df.copy()
sns.set(style="white", palette="muted", color_codes=True)
f, axes = plt.subplots(2, 2, figsize=(12, 12), sharex=False, sharey=False)
sns.despine(left=True)
sns.distplot(train_data_new['pickup_latitude'].values, label = 'pickup_latitude',color="m",bins = 100, ax=axes[0,0])
sns.distplot(train_data_new['pickup_longitude'].values, label = 'pickup_longitude',color="g",bins =100, ax=axes[0,1])
sns.distplot(train_data_new['dropoff_latitude'].values, label = 'dropoff_latitude',color="m",bins =100, ax=axes[1, 0])
sns.distplot(train_data_new['dropoff_longitude'].values, label = 'dropoff_longitude',color="g",bins =100, ax=axes[1, 1])
plt.setp(axes, yticks=[])
plt.tight_layout()
plt.show()

**Findings** - It is clear that pickup and dropoff latitudes are centered around 40 to 41, and longitudes are situated around -74 to -73.

**Plot pickup positions**

In [None]:
pickup_longitude = list(train_df.pickup_longitude)
pickup_latitude = list(train_df.pickup_latitude)
plt.subplots(figsize=(10,5))
plt.plot(pickup_longitude, pickup_latitude, '.', alpha = 0.8, markersize = 10)
plt.xlabel('pickup_longitude')
plt.ylabel('pickup_latitude')
plt.show()

> ***Finding:** A few pickups lie far away from the rest. I decided to remove these long-duration trips by capping the coordinates.*

**Plot dropoff positions**

In [None]:
dropoff_longitude = list(train_df.dropoff_longitude)
dropoff_latitude = list(train_df.dropoff_latitude)
plt.subplots(figsize=(10,5))
plt.xlim(-120,-60)
plt.plot(dropoff_longitude, dropoff_latitude, '.', alpha = 0.8, markersize = 10)
plt.xlabel('dropoff_longitude')
plt.ylabel('dropoff_latitude')
plt.show()

**Remove pickup point outliers**

In [None]:
train_df = train_df[(train_df.pickup_longitude > -100)]
train_df = train_df[(train_df.pickup_latitude < 50)]
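
The two cuts above can be generalized into a reusable bounding-box check. The following is a minimal sketch on toy data; the helper name `within_bbox` and the bounds passed to it are assumptions for illustration, not values tuned in this notebook.

```python
import pandas as pd

def within_bbox(df, lon_col, lat_col, lon_range, lat_range):
    """Return a boolean mask for rows whose point lies inside the box."""
    lon_ok = df[lon_col].between(lon_range[0], lon_range[1])
    lat_ok = df[lat_col].between(lat_range[0], lat_range[1])
    return lon_ok & lat_ok

# Toy data: two plausible NYC pickups and one far-away point.
toy = pd.DataFrame({
    "pickup_longitude": [-73.98, -73.95, -121.93],
    "pickup_latitude": [40.75, 40.78, 37.39],
})

# Illustrative bounds (assumption), loosely matching the cuts used above.
mask = within_bbox(toy, "pickup_longitude", "pickup_latitude",
                   lon_range=(-100, -60), lat_range=(30, 50))
filtered = toy[mask]
```

The same mask could be applied to the dropoff columns by passing `dropoff_longitude` and `dropoff_latitude` instead.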

**The average time taken by two different vendors vs weekday**

In [None]:
train_data = train_df.copy()
train_data['pickup_datetime'] = pd.to_datetime(train_data.pickup_datetime)
train_data.loc[:, 'pick_month'] = train_data['pickup_datetime'].dt.month
train_data.loc[:, 'hour'] = train_data['pickup_datetime'].dt.hour
# .dt.weekofyear was removed in pandas 2.0; isocalendar().week is the replacement
train_data.loc[:, 'week_of_year'] = train_data['pickup_datetime'].dt.isocalendar().week
train_data.loc[:, 'day_of_year'] = train_data['pickup_datetime'].dt.dayofyear
train_data.loc[:, 'day_of_week'] = train_data['pickup_datetime'].dt.dayofweek

In [None]:
summary_wdays_avg_duration = pd.DataFrame(train_data.groupby(['vendor_id','day_of_week'])['trip_duration'].mean())
summary_wdays_avg_duration.reset_index(inplace = True)
summary_wdays_avg_duration['unit']=1
# pivot() requires keyword arguments in pandas 2.0 and later
summary_wdays_avg_duration_piv = summary_wdays_avg_duration.pivot(index="day_of_week", columns="vendor_id", values="trip_duration")

plt.figure(figsize=(8,5))
sns.set(style="white", palette="muted", color_codes=True)
sns.lineplot(data=summary_wdays_avg_duration_piv)


> ***Finding**: There is a clear, explainable pattern between vendor_id and average trip time, so I decided to use vendor_id in my model. It is also clear that one vendor's trips take longer on average than the other's on every day of the week.*

In [None]:
plt.figure(figsize=(8,8))
sns.set(style="whitegrid", color_codes=True)
sns.set_context("poster")
train_data2 = train_data.copy()
train_data2['trip_duration']= np.log(train_data['trip_duration'])
sns.violinplot(x="passenger_count", y="trip_duration", hue="vendor_id", data=train_data2, split=True,
               inner="quart",palette={1: "b", 2: "r"})

sns.despine(left=True)


**Findings** -
* There are trips for both vendors with zero passengers, and a few of these trips show negative values on the log-duration axis, so I decided to drop these outliers.
* There are very few trips with a passenger count of 7, 8 or 9.



In [None]:
plt.figure(figsize=(10,5))
summary_hour_duration = pd.DataFrame(train_data.groupby(['day_of_week','hour'])['trip_duration'].mean())
summary_hour_duration.reset_index(inplace = True)
summary_hour_duration['unit']=1
# pivot() requires keyword arguments in pandas 2.0 and later
summary_hour_duration_piv = summary_hour_duration.pivot(index="hour", columns="day_of_week", values="trip_duration")

sns.set(style="white", palette="muted", color_codes=True)
sns.lineplot(data=summary_hour_duration_piv)

**Findings** -

* It is clear from the above plot that on day 5 (Saturday) and day 6 (Sunday), the trip duration is much lower than on weekdays between 5 AM and 3 PM.
* On Saturday (5) around midnight, rides take far longer than usual, while on Sunday (6) morning they take far less time than usual. This is expected, and it is now confirmed by the data.

**Travel time VS Trip Duration**

Check whether trip_duration equals the difference between the dropoff and pickup times.

In [None]:
train_data=train_df.copy()
train_data['pickup_datetime'] = pd.to_datetime(train_data.pickup_datetime)
train_data['dropoff_datetime'] = pd.to_datetime(train_data.dropoff_datetime)

train_df["travel_time"]=(train_data['dropoff_datetime'] - train_data['pickup_datetime']).dt.total_seconds()
difference=train_df['trip_duration']-train_df['travel_time']

In [None]:
round(difference.describe())
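
Beyond the summary statistics above, the mismatching rows can be counted directly. This is a small sketch on made-up toy rows; the one-second tolerance is an assumption chosen only for illustration.

```python
import pandas as pd

# Toy rows (made up) in the same layout as the training data.
toy = pd.DataFrame({
    "pickup_datetime": ["2016-03-14 17:24:55", "2016-06-12 00:43:35"],
    "dropoff_datetime": ["2016-03-14 17:32:30", "2016-06-12 00:54:38"],
    "trip_duration": [455, 663],
})

pickup = pd.to_datetime(toy["pickup_datetime"])
dropoff = pd.to_datetime(toy["dropoff_datetime"])
elapsed = (dropoff - pickup).dt.total_seconds()

# Flag rows where the recorded duration deviates from the timestamps
# by more than one second (tolerance is an assumption).
mismatch = (elapsed - toy["trip_duration"]).abs() > 1
n_mismatch = int(mismatch.sum())
```

If `n_mismatch` is zero on the real data, `trip_duration` and the timestamp difference carry the same information.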

# **Prepare the Data for Machine Learning Algorithms**

**Drop outliers from trip_duration**

In [None]:
round(train_df["trip_duration"].describe([0.99,0.995,0.998]))

In [None]:
train_df = train_df[(train_df.trip_duration < 5500)]

train_df.shape
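
Instead of hard-coding the cutoff, it can be derived from a quantile of the data. The sketch below uses synthetic durations; the 0.998 quantile mirrors one of the percentiles inspected above, but the data and the choice of quantile are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed durations in seconds (not the real data).
rng = np.random.default_rng(0)
durations = pd.Series(rng.lognormal(mean=6.5, sigma=0.7, size=10_000))
durations.iloc[:5] = 2_000_000  # inject a few extreme outliers

# Derive the cutoff from the data instead of hard-coding it.
cutoff = durations.quantile(0.998)
kept = durations[durations < cutoff]
```

A quantile-based cutoff adapts when the data changes, while a fixed threshold such as 5500 seconds is easier to explain.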

**Only keep trips with passengers >0**

In [None]:
train_df = train_df[(train_df.passenger_count > 0)]

train_df.shape

**Conduct log-transformation of trip_duration**

In [None]:
#Visualize the distribution of trip_duration values
plt.subplots(figsize=(18,6))
plt.xlim(0,4000)
plt.hist(train_df['trip_duration'].values, bins=100,color="b")
plt.xlabel('trip_duration')
plt.ylabel('number of train records')
plt.show()

In [None]:
train_df['trip_duration'] = np.log(train_df['trip_duration'].values)

In [None]:
#Log-transformation
plt.subplots(figsize=(18,6))
plt.xlim(1,10)
plt.hist(train_df['trip_duration'].values, bins=100,color="y")
plt.xlabel('log(trip_duration)')
plt.ylabel('number of train records')
plt.show()
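
Training on a log-transformed target means predictions must later be mapped back with the inverse transform. The sketch below illustrates the round trip on toy values, and shows that an RMSE computed on the log scale corresponds to a root mean squared logarithmic error on the original scale. It uses `np.log1p`/`np.expm1`, a variant of the plain `np.log`/`np.exp` used in this notebook that also tolerates zero durations; the helper `rmsle` is hypothetical.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on the original (seconds) scale."""
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

durations = np.array([0.0, 60.0, 600.0, 3600.0])  # toy values in seconds

log_target = np.log1p(durations)   # forward transform, defined at 0
recovered = np.expm1(log_target)   # inverse transform

# Pretend model output on the log scale, off by 0.1 everywhere.
log_pred = log_target + 0.1
rmse_log = float(np.sqrt(np.mean((log_pred - log_target) ** 2)))
rmsle_orig = rmsle(durations, np.expm1(log_pred))
```

Whichever pair is used, the forward and inverse transforms must match: `np.log` with `np.exp`, or `np.log1p` with `np.expm1`.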

**Add Distance (miles)**

This uses the **haversine** formula to calculate the great-circle distance between two points. The great-circle distance is the shortest distance between two points over the surface of a sphere; it equals the angular distance between the points multiplied by the sphere's radius.

In [None]:
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance in miles between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 3956    # Radius of the Earth in miles; use 6371 for kilometers. Determines the return value units.
    return c * r

In [None]:
x = lambda train_df: haversine(train_df.pickup_longitude,train_df.pickup_latitude,train_df.dropoff_longitude,train_df.dropoff_latitude) #lambda array function
train_df["distance"] = train_df.apply(x, axis=1)

y = lambda test_df: haversine(test_df.pickup_longitude,test_df.pickup_latitude,test_df.dropoff_longitude,test_df.dropoff_latitude)
test_df["distance"] = test_df.apply(y, axis=1)
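
Row-wise `apply` calls the Python function once per trip, which can be slow on large frames. A vectorized NumPy version of the same formula is sketched below; `haversine_np` is a hypothetical name, and the radius matches the 3956 miles used above.

```python
import numpy as np

EARTH_RADIUS_MILES = 3956  # same radius as the row-wise version

def haversine_np(lon1, lat1, lon2, lat2, radius=EARTH_RADIUS_MILES):
    """Great-circle distance for arrays of coordinates in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius * np.arcsin(np.sqrt(a))

# Toy check: identical points, and two points one degree of latitude apart.
lon_a = np.array([-73.98, -73.98])
lat_a = np.array([40.75, 40.00])
lon_b = np.array([-73.98, -73.98])
lat_b = np.array([40.75, 41.00])
dist = haversine_np(lon_a, lat_a, lon_b, lat_b)
```

On the real frames this could be called with whole columns, for example `haversine_np(train_df.pickup_longitude, train_df.pickup_latitude, train_df.dropoff_longitude, train_df.dropoff_latitude)`.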


**Remove distance outliers**

In [None]:
#Visualize distance outliers
train_df.boxplot(column='distance', return_type='axes');

In [None]:
round(train_df.distance.describe([0.998,0.99999,0.999995]))

In [None]:
train_df = train_df[(train_df.distance < 185)]
train_df.shape

**Add a speed-like ratio (miles per log-second) and drop outliers**

In [None]:
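# Note: the denominator is log(seconds), so this is a speed-like ratio, not a true speed in miles per second
train_df['speed'] = train_df.distance / np.log(train_df.travel_time)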
train_df.head()

In [None]:
train_df.boxplot(column='speed', return_type='axes');

In [None]:
train_df = train_df[(train_df.speed < 10)]
train_df.drop(['speed'], axis=1, inplace=True)
train_df.shape
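
For comparison, a physically interpretable speed can be computed in miles per hour from the raw travel time. This is a sketch on toy data, not the filter used in this notebook; the column name `speed_mph` and the 100 mph cap are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "distance": [1.0, 5.0, 50.0],           # miles
    "travel_time": [300.0, 1200.0, 60.0],   # seconds
})

# Convert seconds to hours before dividing to get miles per hour.
toy["speed_mph"] = toy["distance"] / (toy["travel_time"] / 3600)

MAX_MPH = 100  # hypothetical plausibility cap
plausible = toy[toy["speed_mph"] < MAX_MPH]
```

A threshold in miles per hour is easier to justify than one on a log-based ratio, at the cost of changing which rows get dropped.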

**Add direction** (values 0 to 7)
> ["north","north east","east","south east","south","south west","west","north west"]

In [None]:
import math
def calcBearing (lat1, long1, lat2, long2):
    dLon = (long2 - long1)
    x = math.cos(math.radians(lat2)) * math.sin(math.radians(dLon))
    y = math.cos(math.radians(lat1)) * math.sin(math.radians(lat2)) - math.sin(math.radians(lat1)) * math.cos(math.radians(lat2)) * math.cos(math.radians(dLon))
    bearing = math.atan2(x,y)   # use atan2 to determine the quadrant
    bearing = math.degrees(bearing)
    bearing += 22.5
    bearing = bearing % 360
    bearing = int(bearing / 45) # values 0 to 7 ["north", "north east", "east", "south east", "south", "south west", "west", "north west"]
    return bearing


In [None]:
t = lambda train_df: calcBearing(train_df.pickup_latitude,train_df.pickup_longitude,train_df.dropoff_latitude,train_df.dropoff_longitude,) #lambda array function
train_df["direction"] = train_df.apply(t, axis=1)

f = lambda test_df: calcBearing(test_df.pickup_latitude,test_df.pickup_longitude,test_df.dropoff_latitude,test_df.dropoff_longitude,) #lambda array function
test_df["direction"] = test_df.apply(f, axis=1)
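
To make the 0 to 7 encoding concrete, the sketch below applies the same bearing formula in vectorized NumPy form to four toy trips heading roughly north, east, south and west from one starting point. The name `bearing_bin` and the coordinates are assumptions for illustration.

```python
import numpy as np

DIRECTIONS = ["north", "north east", "east", "south east",
              "south", "south west", "west", "north west"]

def bearing_bin(lat1, lon1, lat2, lon2):
    """Map the initial bearing of each trip to one of eight 45-degree sectors."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlon = lon2 - lon1
    x = np.cos(lat2) * np.sin(dlon)
    y = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(dlon)
    bearing = np.degrees(np.arctan2(x, y))
    # Shift by half a sector so that "north" spans -22.5 to 22.5 degrees.
    return (((bearing + 22.5) % 360) // 45).astype(int)

# One starting point, four destinations (toy coordinates).
start_lat = np.full(4, 40.75)
start_lon = np.full(4, -73.98)
end_lat = np.array([40.85, 40.75, 40.65, 40.75])
end_lon = np.array([-73.98, -73.88, -73.98, -74.08])

bins = bearing_bin(start_lat, start_lon, end_lat, end_lon)
labels = [DIRECTIONS[b] for b in bins]
```

The half-sector shift of 22.5 degrees is what centers each label on its compass direction.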

In [None]:
train_df.head()

**Add variables related to pickup time**

In [None]:
train_df['pickup_datetime'] = pd.to_datetime(train_df.pickup_datetime)
test_df['pickup_datetime'] = pd.to_datetime(test_df.pickup_datetime)

train_df.loc[:, 'pick_month'] = train_df['pickup_datetime'].dt.month
train_df.loc[:, 'day_of_year'] = train_df['pickup_datetime'].dt.dayofyear
train_df.loc[:, 'day_of_week'] = train_df['pickup_datetime'].dt.dayofweek
train_df.loc[:, 'hour'] = train_df['pickup_datetime'].dt.hour

test_df.loc[:, 'pick_month'] = test_df['pickup_datetime'].dt.month
test_df.loc[:, 'day_of_year'] = test_df['pickup_datetime'].dt.dayofyear
test_df.loc[:, 'day_of_week'] = test_df['pickup_datetime'].dt.dayofweek
test_df.loc[:, 'hour'] = test_df['pickup_datetime'].dt.hour




In [None]:
train_df.head()

**Add weekday/weekend Boolean**

In [None]:
train_df["weekday_weekend"]=np.where(train_df['day_of_week']>4,0,1)
test_df["weekday_weekend"]=np.where(test_df['day_of_week']>4,0,1)

In [None]:
train_df.head()

**Convert categorical columns to category dtype**

In [None]:
train_df['vendor_id'] = train_df['vendor_id'].astype('category')
test_df['vendor_id'] = test_df['vendor_id'].astype('category')
train_df['store_and_fwd_flag'] = train_df['store_and_fwd_flag'].astype('category')
test_df['store_and_fwd_flag'] = test_df['store_and_fwd_flag'].astype('category')

**Encoding**

In [None]:
y_train = train_df["trip_duration"]


features = ["vendor_id",'store_and_fwd_flag']
features_num=["passenger_count","pick_month","day_of_year","day_of_week","weekday_weekend","hour","distance","direction"]

X_train = pd.get_dummies(train_df[features])
X_test = pd.get_dummies(test_df[features])

In [None]:
X_train = pd.concat([X_train,train_df[features_num]],axis=1)
X_test = pd.concat([X_test,test_df[features_num]],axis=1)
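
Because `pd.get_dummies` is applied to the training and test sets separately, a category that is missing from one of them yields different columns. A common safeguard is to reindex the test matrix to the training columns; the sketch below shows this on toy data and is a suggestion, not a step taken in this notebook.

```python
import pandas as pd

# Toy frames: the test set happens to contain no "Y" flag.
train_cat = pd.DataFrame({"store_and_fwd_flag": ["N", "Y", "N"]})
test_cat = pd.DataFrame({"store_and_fwd_flag": ["N", "N"]})

X_tr = pd.get_dummies(train_cat)
X_te = pd.get_dummies(test_cat)

# Give the test matrix exactly the training columns, filling gaps with 0.
X_te = X_te.reindex(columns=X_tr.columns, fill_value=0)
```

After reindexing, both matrices have the same columns in the same order, which is what a fitted model expects at prediction time.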

In [None]:
X_train.head()

**Correlation Heatmap**

In [None]:
plt.figure(figsize=(10,10))
corr = X_train.corr()
sns.heatmap(corr, cmap='RdYlGn', vmin=-1, vmax=1, square=True)
plt.title("Correlation Heatmap", fontsize=16)
plt.show()

In [None]:
X_train.shape

# **Model_XGBoost**

In [None]:
from xgboost import XGBRegressor
from sklearn import metrics
from sklearn.metrics import explained_variance_score

In [None]:
xgb = XGBRegressor(n_jobs=-1)
cross_val_score(xgb, X_train, y_train, scoring='r2', cv=5)

In [None]:
xgb.fit(X_train,y_train)

In [None]:
y_pred = xgb.predict(X_test)

In [None]:
submission = pd.DataFrame({'id': test_df.id, 'trip_duration': np.exp(y_pred)})
submission.head()

In [None]:
submission.to_csv('submission.csv', index=False)