## NYC Taxi Trip Duration Modeling
#### <i>How long will my taxi trip take?</i>

<p style="font-size:16px">This research explores the public dataset used for the <a href="https://www.kaggle.com/c/nyc-taxi-trip-duration" target="_blank" rel="noopener noreferrer">NYC Taxi Trip Duration</a> competition, processes the features, and builds a prediction model. Based on this research, <a href="https://nyc-taxi-trip.herokuapp.com/" target="_blank" rel="noopener noreferrer">this Flask app</a> has been deployed and runs on Heroku. Details about the app development can be found in <a href="https://github.com/Q-shick/taxi_trip_duration" target="_blank" rel="noopener noreferrer">this GitHub repo</a>.</p>

<ol style="font-size:16px; margin-left:40px">
    <li>Preparation - Reads the data and removes implausible observations</li>
    <li>Analysis - Identifies the variables that affect trip durations</li>
    <li>Features - Processes variables and creates new ones using external data</li>
    <li>Modeling - Builds a neural network for prediction</li>
</ol>

<br>
<hr>

## Preparation
<p style="font-size:16px">First, we import the necessary Python libraries, then read the datasets and check the data quality.</p>

In [None]:
# Basic data handling
import numpy as np
import pandas as pd 
import json 

# Plotting
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sb
import plotly.express as px

# Statistical testing
from scipy.stats import ranksums

# Datetime handling
from datetime import datetime as dt
import calendar
import holidays as hd

# Geographical processing
import geopandas as gpd 
import geopy.distance as gpy
from geopy.geocoders import Nominatim
from shapely.geometry import LineString, Point, Polygon, LinearRing, shape, asShape
import shapely.ops as so
from rtree import index # for fast look-up

# Model preparing
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Lasso regression
from sklearn.linear_model import Lasso

# Neural network
from keras.models import Model, Sequential, load_model
from keras.layers import Input, Dense, Dropout, BatchNormalization, Activation, Add
from keras.metrics import RootMeanSquaredError
from keras.optimizers import Adam
from keras.regularizers import l1

# Others
import os
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
!unzip '../input/nyc-taxi-trip-duration/train.zip'

In [None]:
taxi_trip_data = pd.read_csv('./train.csv')
print("Dataset Rows and Columns: ", taxi_trip_data.shape)

In [None]:
taxi_trip_data.head()

<p style="font-size:16px">The following descriptions are from the competition introduction. We will create new variables from these and use external data such as map data and historical weather.</p>

<ul style="font-size:16px; margin-left:40px">
    <li>id - a unique identifier for each trip</li>
    <li>vendor_id - a code indicating the provider associated with the trip record</li>
    <li>pickup_datetime - date and time when the meter was engaged</li>
    <li>dropoff_datetime - date and time when the meter was disengaged</li>
    <li>passenger_count - the number of passengers in the vehicle (driver entered value)</li>
    <li>pickup_longitude - the longitude where the meter was engaged</li>
    <li>pickup_latitude - the latitude where the meter was engaged</li>
    <li>dropoff_longitude - the longitude where the meter was disengaged</li>
    <li>dropoff_latitude - the latitude where the meter was disengaged</li>
    <li>store_and_fwd_flag - this flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server - Y=store and forward; N=not a store and forward trip</li>
    <li>trip_duration - duration of the trip in seconds</li>
</ul>

In [None]:
print("[Taxi Trip Data Types]")
taxi_trip_data.info()  # info() prints its report directly and returns None

<p style="font-size:16px">Now, we can remove observations that do not make sense in terms of trip distance and speed. Some observations cover almost no distance but still have a duration. Others cover a real distance, but at an unrealistic speed such as 2 miles per hour or 100 miles per hour. These trips are likely recording errors and would make it harder for the model to generalize.</p>
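<p style="font-size:16px">As a rough illustration of the screen described above, the sketch below computes straight-line distance with a vectorized haversine formula and applies the distance and speed thresholds used in the next cells. The helper <code>haversine_miles</code>, the spherical Earth radius, and the toy coordinates are assumptions made for this example. The notebook itself uses geopy's geodesic distance, so its values will differ slightly.</p>

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumption of the spherical model

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between coordinates given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))

# Toy trips (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    'pickup_latitude':   [40.75, 40.75, 40.64],
    'pickup_longitude':  [-73.99, -73.99, -73.78],
    'dropoff_latitude':  [40.76, 40.75, 40.76],
    'dropoff_longitude': [-73.98, -73.99, -73.98],
    'trip_duration':     [300, 600, 1800],  # seconds
})

toy['dist_mile'] = haversine_miles(toy['pickup_latitude'], toy['pickup_longitude'],
                                   toy['dropoff_latitude'], toy['dropoff_longitude'])
toy['speed'] = toy['dist_mile'] / toy['trip_duration'] * 3600  # miles per hour

# Keep trips longer than 0.25 miles with a speed between 4 and 80 mph
plausible = toy[(toy['dist_mile'] > 0.25) & (toy['speed'] > 4) & (toy['speed'] < 80)]
print(plausible[['dist_mile', 'speed']])
```

<p style="font-size:16px">The second toy trip has identical pickup and dropoff points, so it is the one dropped by the screen. A vectorized formula like this avoids a row-wise <code>apply</code>, at the cost of ignoring the Earth's ellipsoidal shape.</p>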

In [None]:
# Straight line between pickup and dropoff points - Actual trip should be longer
taxi_trip_data['dist_mile'] = taxi_trip_data.apply(lambda d : \
                              gpy.distance((d.pickup_latitude, d.pickup_longitude), 
                                           (d.dropoff_latitude, d.dropoff_longitude)).miles, axis=1)

# Speed derived from duration and distance - Converted to per hour
taxi_trip_data['speed'] = taxi_trip_data['dist_mile'] / taxi_trip_data['trip_duration'] * 3600 

In [None]:
print("Trips of 0.25 miles or less: ", sum(taxi_trip_data['dist_mile']<=0.25))
taxi_trip_data = taxi_trip_data[taxi_trip_data['dist_mile']>0.25] 
print("Trips with speed of 4 mph or less: ", sum(taxi_trip_data['speed']<=4))
taxi_trip_data = taxi_trip_data[taxi_trip_data['speed']>4] 
print("Trips with speed of 80 mph or more: ", sum(taxi_trip_data['speed']>=80))
taxi_trip_data = taxi_trip_data[taxi_trip_data['speed']<80] 

<p style="font-size:16px">The filters above remove most unrealistic trips, but we can double-check by looking at very long trips. Given that NYC is a busy city, the speeds of the trips below might be plausible, but their distances are too long to believe that they were recorded properly. The distances are mostly longer than the distance from the top to the bottom of Manhattan, and even if a trip were entirely within busy Manhattan at rush hour, it is hard to imagine a passenger staying in the taxi instead of getting out and taking other transportation such as the subway. We also drop trips shorter than one minute.</p>

In [None]:
print("Trips of 2 hours or more: ", sum(taxi_trip_data['trip_duration']>=7200))
taxi_trip_data = taxi_trip_data[taxi_trip_data['trip_duration']<7200]
print("Trips shorter than 60 seconds: ", sum(taxi_trip_data['trip_duration']<60))
taxi_trip_data = taxi_trip_data[taxi_trip_data['trip_duration']>=60]

<p style="font-size:16px">Lastly, we need to remove trips with no passengers.</p>

In [None]:
print("Trips with no passengers: ", sum(taxi_trip_data['passenger_count']==0))
taxi_trip_data = taxi_trip_data[taxi_trip_data['passenger_count']>0]
print("After removing error observations: ", taxi_trip_data.shape[0])

<br>
<hr>

## Analysis
<p style="font-size:16px">Because the data mainly consists of pickup/dropoff locations and pickup time (the dropoff time is already reflected in the trip duration), we will focus on those variables to find busy locations and times.</p>

In [None]:
print("Basic Statistics of Trip Duration\n", sep='')
taxi_trip_data.describe().apply(lambda s : s.apply('{0:.0f}'.format))['trip_duration']

In [None]:
# Exclude very long trips from the plot
trip_duration_hist = taxi_trip_data[taxi_trip_data['trip_duration']<=3600]\
                        .sample(n=10000, replace=True, random_state=123)['trip_duration'] / 60

fig = plt.figure(figsize=(12, 5))
sb.histplot(data=trip_duration_hist, x=trip_duration_hist.values, alpha=0.7, bins=60, kde=True)
plt.axvline(x=trip_duration_hist.median(), color='g')
plt.xlabel("Minutes (Median at Green Line)")
plt.title("Distribution of Trip Duration")

<p style="font-size:16px">First off, we take a look at the target variable. Most trips are short, around 10 minutes, with a few exceptionally long trips, as seen in the right-skewed distribution.</p>

In [None]:
# Trip percent by vendor id
vendor_id_bar = pd.DataFrame(taxi_trip_data['vendor_id'].value_counts(normalize=True)).reset_index().\
    rename(columns={'index':'vendor_id', 'vendor_id':'trips'}).\
        sort_values(by='vendor_id')
vendor_id_bar['trips'] = vendor_id_bar['trips']*100

# Trip percent by store and forward
store_and_fwd_flag_bar = pd.DataFrame(taxi_trip_data['store_and_fwd_flag'].value_counts(normalize=True)).reset_index().\
    rename(columns={'index':'store_and_fwd_flag', 'store_and_fwd_flag':'trips'}).\
        sort_values(by='store_and_fwd_flag', ascending=False)
store_and_fwd_flag_bar['trips'] = store_and_fwd_flag_bar['trips']*100

fig, ax = plt.subplots(1, 2, figsize=(12, 5))

sb.barplot(ax=ax[0], data=vendor_id_bar, x='vendor_id', y='trips')
ax[0].set_title("Trips by Vendor ID")
ax[0].set_xlabel("Vendor ID")
ax[0].set_ylabel("Trip %")

sb.barplot(ax=ax[1], data=store_and_fwd_flag_bar, x='store_and_fwd_flag', y='trips')
ax[1].set_title("Trips by Store and Forward")
ax[1].set_xlabel("Store and Forward")
ax[1].set_ylabel("Trip %")

<p style="font-size:16px">Next, we can see that both Vendor ID and Store and Forward Flag have two categories. While Store and Forward Flag is lopsided, Vendor ID is rather evenly divided, so we would like to know how much discriminating power it has for trip durations. Because the target variable is not normally distributed, we use a nonparametric test to check whether Vendor 1 and Vendor 2 have different trip duration distributions. As shown below, the result suggests that they come from different distributions.</p>
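<p style="font-size:16px">To show how the rank sum test behaves, here is a minimal sketch on synthetic data. The two exponential samples, their scales, and the seed are assumptions chosen only to mimic right-skewed durations; they are not drawn from the taxi dataset.</p>

```python
import numpy as np
from scipy.stats import ranksums

# Two synthetic right-skewed samples (hypothetical durations in seconds)
rng = np.random.default_rng(123)
group_a = rng.exponential(scale=600, size=2000)
group_b = rng.exponential(scale=900, size=2000)

# ranksums returns the test statistic and the two-sided p-value
statistic, p_value = ranksums(group_a, group_b)
print("Statistic:", statistic)
print("P-value:", p_value)
```

<p style="font-size:16px">A negative statistic indicates that the first sample tends to have smaller values than the second. With samples as large as the taxi data, even small differences can produce very small p-values, so the p-value alone does not say how large the difference is.</p>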

In [None]:
# Encoding for later use
taxi_trip_data['store_and_fwd_flag'] = taxi_trip_data['store_and_fwd_flag'].apply(lambda x : 1 if x=='Y' else 0)

# Wilcoxon rank sum test for non-normal distributions - A small P-value means they are different
rank_test = ranksums(taxi_trip_data[taxi_trip_data['vendor_id']==1]['trip_duration'],
                     taxi_trip_data[taxi_trip_data['vendor_id']==2]['trip_duration'])

print("Vendor ID Rank Test P-value: ", rank_test[1])

<p style="font-size:16px">The following process parses pickup datetimes into month, date, hour, and minute. This way, we can also extract the day of week.</p>
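<p style="font-size:16px">As an alternative sketch, the same fields can be extracted with pandas' vectorized <code>pd.to_datetime</code> and the <code>.dt</code> accessor instead of row-wise <code>apply</code>. The two timestamps below are made-up examples in the dataset's format.</p>

```python
import pandas as pd

# Hypothetical pickup timestamps in the same string format as the dataset
toy = pd.DataFrame({'pickup_datetime': ['2016-03-14 17:24:55', '2016-06-12 00:43:35']})

# Parse the strings once, then read each component through the .dt accessor
toy['pickup_datetime'] = pd.to_datetime(toy['pickup_datetime'], format='%Y-%m-%d %H:%M:%S')
toy['pickup_month'] = toy['pickup_datetime'].dt.month
toy['pickup_date'] = toy['pickup_datetime'].dt.day
toy['pickup_hour'] = toy['pickup_datetime'].dt.hour
toy['pickup_minute'] = toy['pickup_datetime'].dt.minute
toy['pickup_day'] = toy['pickup_datetime'].dt.weekday  # Monday=0, Sunday=6

print(toy)
```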

In [None]:
# Convert objects to datetimes
taxi_trip_data['pickup_datetime'] = taxi_trip_data['pickup_datetime'].\
    apply(lambda t : dt.strptime(t, '%Y-%m-%d %H:%M:%S'))

# Parse datetimes
taxi_trip_data['pickup_month'] = taxi_trip_data.pickup_datetime.apply(lambda M : M.month)
taxi_trip_data['pickup_date'] = taxi_trip_data.pickup_datetime.apply(lambda D : D.day)
taxi_trip_data['pickup_hour'] = taxi_trip_data.pickup_datetime.apply(lambda h : h.hour)
taxi_trip_data['pickup_minute'] = taxi_trip_data.pickup_datetime.apply(lambda m : m.minute)
taxi_trip_data['pickup_day'] = taxi_trip_data.pickup_datetime.apply(lambda d : d.weekday())

# Print out for checking
# Use .iloc[0] because the row labeled 0 may have been removed during cleaning
print("Pick Up Time Example: ", taxi_trip_data.pickup_datetime.iloc[0])
print("Month : ", taxi_trip_data.pickup_month.iloc[0],
      "\nDate : ", taxi_trip_data.pickup_date.iloc[0],
      "\nHour : ", taxi_trip_data.pickup_hour.iloc[0],
      "\nMinute : ", taxi_trip_data.pickup_minute.iloc[0],
      "\nDay of Week (Mon=0): ", taxi_trip_data.pickup_day.iloc[0])

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(15, 5))

ax[0].plot(np.arange(1, len(taxi_trip_data['pickup_month'].unique())+1), \
           taxi_trip_data['pickup_month'].value_counts().sort_index().values, '-o')
ax[0].set_xlabel("Month")
ax[0].set_ylabel("Trips")
ax[0].set_title("Trips by Month")

# Hours run from 0 to 23, so the x-axis starts at 0
ax[1].plot(np.arange(0, len(taxi_trip_data['pickup_hour'].unique())), \
        taxi_trip_data['pickup_hour'].value_counts().sort_index().values, '-o')
ax[1].set_xlabel("Hour")
ax[1].set_ylabel("Trips")
ax[1].set_title("Trips by Hour")

ax[2].plot(np.arange(0, len(taxi_trip_data['pickup_day'].unique())), \
           taxi_trip_data['pickup_day'].value_counts().sort_index().values, '-o')
ax[2].set_xlabel("Day of Week")
ax[2].set_xticks(list(dict(enumerate(calendar.day_name)).keys()))
ax[2].set_xticklabels(list(dict(enumerate(calendar.day_name)).values()), rotation=90)
ax[2].set_ylabel("Trips")
ax[2].set_title("Trips by Day")

plt.tight_layout()

In [None]:
# Select the column before aggregating so non-numeric columns are not averaged
trips_by_day = taxi_trip_data.groupby('pickup_day')['trip_duration'].mean().reset_index()
trips_by_day['pickup_day'] = trips_by_day['pickup_day'].map(dict(enumerate(calendar.day_name)))

plt.plot(trips_by_day['pickup_day'], trips_by_day['trip_duration'], '-o')
plt.title("Average Trip Duration by Day")
plt.xlabel("Day of Week")
plt.xticks(trips_by_day['pickup_day'], rotation=90)
plt.ylabel("Average Duration in Seconds")

<p style="font-size:16px">Not surprisingly, the average durations follow a trend similar to the trip counts by day of week, because fewer trips on a day like Sunday mean fewer cars on the road and therefore faster trips. We can also extract holidays while we are handling datetimes.</p>

In [None]:
us_holidays = hd.US()
taxi_trip_data['holiday_ind'] = taxi_trip_data.pickup_datetime.apply \
    (lambda d : 1 if dt.strftime(d, "%Y-%m-%d") in us_holidays else 0)

In [None]:
fig = plt.figure(figsize=(4,4))

passenger_bar = pd.DataFrame(taxi_trip_data['passenger_count'].value_counts(normalize=True)).reset_index().\
    rename(columns={'index':'passenger_count', 'passenger_count':'trips'})
passenger_bar['trips'] = passenger_bar['trips']*100

sb.set_color_codes("muted")
sb.barplot(data=passenger_bar, x='passenger_count', y='trips')
plt.title("Passenger Counts")
plt.xlabel("Passengers")
plt.ylabel("Trip %")

In [None]:
# Exclude very long trips from the distance plot
fig, ax = plt.subplots(2, 1, figsize=(12, 8))

trip_mile_hist = taxi_trip_data[taxi_trip_data['dist_mile']<=15].\
    sample(n=10000, replace=True, random_state=123)['dist_mile']
sb.histplot(ax=ax[0], data=trip_mile_hist, x=trip_mile_hist.values, alpha=0.7, bins=60, kde=True)
ax[0].axvline(x=trip_mile_hist.median(), color='g')
ax[0].set_title("Distribution of Trip Distances")
ax[0].set_xlabel("Miles (Median at Green Line)")
ax[0].set_ylabel("Trips")

trip_speed_hist = taxi_trip_data['speed'].sample(n=10000, replace=True, random_state=123)

sb.histplot(ax=ax[1], data=trip_speed_hist, x=trip_speed_hist.values, alpha=0.7, bins=60, kde=True)
ax[1].axvline(x=trip_speed_hist.median(), color='g')
ax[1].set_title("Distribution of Speed")
ax[1].set_xlabel("Miles per Hour (Median at Green Line)")
ax[1].set_ylabel("Trips")

plt.tight_layout()

In [None]:
corr_vars = ['trip_duration','dist_mile','passenger_count',
             'pickup_latitude','pickup_longitude','dropoff_latitude','dropoff_longitude',
             'holiday_ind','vendor_id','store_and_fwd_flag']
taxi_trip_corr = taxi_trip_data[corr_vars].corr()

fig = plt.figure(figsize=(8,6))
sb.heatmap(taxi_trip_corr, cmap="YlGnBu")

<p style="font-size:16px">First, Speed is left out because it is derived from the target variable, so it will not be available at prediction time. The following location variables have a modest to strong relationship with trip durations.</p>
<ul style="font-size:16px; margin-left:40px">
    <li>Distance - The longer a trip is, the more time it takes</li>
    <li>Latitude - Trips in the southern part of NYC (lower latitude), such as Manhattan, take longer</li>
    <li>Longitude - Trips in the eastern part of NYC (higher longitude), such as JFK Airport, take longer</li>
</ul>

In [None]:
taxi_trip_corr.iloc[0,1:7]  # distance, passenger count, and all four coordinate columns

<p style="font-size:16px">The DBSCAN process below finds dense pickup/dropoff spots for slow and fast trips.</p>

In [None]:
# Trips slower than 5 mph
slow_trip = taxi_trip_data[taxi_trip_data['speed'] < 5] \
    [['pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude']]. \
        sample(n=3000, replace=True, random_state=123)

# Dense areas
pickup_cluster = DBSCAN(eps=0.0012, min_samples=20).fit(slow_trip[['pickup_longitude','pickup_latitude']])
dropoff_cluster = DBSCAN(eps=0.0012, min_samples=20).fit(slow_trip[['dropoff_longitude','dropoff_latitude']])

# -1 for all the other areas
slow_trip['pickup_cluster'] = pickup_cluster.labels_
slow_trip['dropoff_cluster'] = dropoff_cluster.labels_

fig_slow_pickup = px.scatter_mapbox(slow_trip, 
    lon='pickup_longitude', lat='pickup_latitude',
    center={"lat": 40.75, "lon": -73.98},
    color='pickup_cluster', 
    mapbox_style="carto-positron", zoom=11,
    width=400, height=400)
fig_slow_pickup.update_layout(margin={"r":0,"t":0,"l":0,"b":0}, coloraxis_showscale=False)
fig_slow_pickup.show()

<p style="font-size:16px">Most slow trips are in Manhattan, regardless of pickup/dropoff. The dense spots include bus terminals and shopping areas.</p>

In [None]:
print("<Clusters from 3000 Samples>")
print("Pick-up Busy Locations: ", dict(Counter(pickup_cluster.labels_)))
print("Drop-off Busy Locations: ", dict(Counter(dropoff_cluster.labels_)))

In [None]:
# Trips faster than 50 mph
fast_trip = taxi_trip_data[taxi_trip_data['speed'] > 50] \
    [['pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude']]. \
        sample(n=3000, replace=True, random_state=123)

# Dense areas
pickup_cluster = DBSCAN(eps=0.001, min_samples=30).fit(fast_trip[['pickup_longitude','pickup_latitude']])
dropoff_cluster = DBSCAN(eps=0.001, min_samples=30).fit(fast_trip[['dropoff_longitude','dropoff_latitude']])

# -1 for all the other areas
fast_trip['pickup_cluster'] = pickup_cluster.labels_
fast_trip['dropoff_cluster'] = dropoff_cluster.labels_

fig_fast_pickup = px.scatter_mapbox(fast_trip, 
    lon='pickup_longitude', lat='pickup_latitude',
    center={"lat": 40.71, "lon": -73.89},
    color='pickup_cluster', 
    mapbox_style="carto-positron", zoom=10,
    width=400, height=400)
fig_fast_pickup.update_layout(margin={"r":0,"t":0,"l":0,"b":0}, coloraxis_showscale=False)
fig_fast_pickup.show()

<p style="font-size:16px">Unlike slow trips, the only meaningful dense spot for fast trips is JFK Airport. Based on these results, we will create two features next. One indicates whether a trip starts or ends in one of the slow pickup/dropoff areas, and the other measures how close the point is to the center of that area. We won't process the fast-trip spots here because a later set of features will take care of JFK Airport.</p>

In [None]:
# Busy pick up areas
slow_trip_pickup = slow_trip[slow_trip['pickup_cluster']>=0].groupby(['pickup_cluster']).\
    agg(['mean','count'])[['pickup_longitude','pickup_latitude']].reset_index().\
    sort_values(by=[('pickup_longitude','count')], ascending=False).\
    droplevel(level=1, axis=1).iloc[:,[0,1,3]].\
    rename(columns = {'pickup_cluster':'cluster','pickup_longitude':'longitude','pickup_latitude':'latitude'})

# Busy drop off areas
slow_trip_dropoff = slow_trip[slow_trip['dropoff_cluster']>=0].groupby(['dropoff_cluster']).\
    agg(['mean','count'])[['dropoff_longitude','dropoff_latitude']].reset_index().\
    sort_values(by=[('dropoff_longitude','count')], ascending=False).\
    droplevel(level=1, axis=1).iloc[:,[0,1,3]].\
    rename(columns = {'dropoff_cluster':'cluster','dropoff_longitude':'longitude','dropoff_latitude':'latitude'})

def dist_from_center(lat, long, centers, radius):
    for _, row in centers.iterrows():
        radius_measured = gpy.distance((lat, long), (row['latitude'], row['longitude'])).miles
        if radius_measured < radius:
            return row['cluster'], radius_measured
    return -1, 0

# taxi_trip_data[['busy_pickup_spot','busy_pickup_dist']] = taxi_trip_data.apply(lambda d : \
#     dist_from_center(d.pickup_latitude, d.pickup_longitude, slow_trip_pickup, 0.2), axis=1).tolist()
# taxi_trip_data[['busy_dropoff_spot','busy_dropoff_dist']] = taxi_trip_data.apply(lambda d : \
#     dist_from_center(d.dropoff_latitude, d.dropoff_longitude, slow_trip_dropoff, 0.2), axis=1).tolist()

<p style="font-size:16px">While we are handling locations, we can create two more features that describe movement, as follows. They are not directly interpretable as distances (miles or km) because they are differences in geographic coordinates. This unit issue won't matter for modeling because all the features will be normalized anyway.</p>

In [None]:
taxi_trip_data['horizontal_move'] = taxi_trip_data['dropoff_longitude'] - taxi_trip_data['pickup_longitude']
taxi_trip_data['vertical_move'] = taxi_trip_data['dropoff_latitude'] - taxi_trip_data['pickup_latitude']

<br>
<hr>

## Features
<p style="font-size:16px">In this section, we will create features from the existing variables as well as external datasets such as NYC Congested Areas and Historical Weather. First, we can utilize the dataset we already have because we can estimate the road situation by aggregating the trips for trip count, average duration and speed.</p>

<p style="font-size:16px">Next, we will create multiple features from congestion data based on an <a href="https://abc7ny.com/traffic-commuting-lincoln-tunnel-roads/1095908/">ABC News article</a>. First, the functions below calculate how much of a trip overlaps each congested area. For example, a 10-mile trip that passes through the Brooklyn Bridge area for 1 mile is assigned a value of 10% for that area.</p>
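<p style="font-size:16px">The overlap idea can be sketched in a few lines with Shapely. This is a simplified illustration, not the notebook's implementation: the helper <code>overlap_fraction</code> and the unit-square area are hypothetical, and lengths are measured in planar coordinate units rather than geodesic miles, which is only a reasonable approximation over short, city-scale distances.</p>

```python
from shapely.geometry import LineString, Polygon

def overlap_fraction(area, pickup, dropoff):
    """Fraction of the straight pickup-dropoff segment that lies inside `area`."""
    line = LineString([pickup, dropoff])
    if line.length == 0:
        return 0.0
    # intersection() returns the part of the segment inside the polygon
    return line.intersection(area).length / line.length

# Hypothetical congested area: a unit square in (lon, lat)-like coordinates
square = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])

crossing = overlap_fraction(square, (-1, 0.5), (3, 0.5))    # passes through the area
inside = overlap_fraction(square, (0.2, 0.2), (0.8, 0.8))   # entirely within the area
outside = overlap_fraction(square, (2, 2), (3, 3))          # never touches the area
print(crossing, inside, outside)
```

<p style="font-size:16px">The crossing segment is 4 units long and spends 1 unit inside the square, so a quarter of the trip overlaps the area.</p>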

In [None]:
# Read congested area data and prepare geographical inputs
import ast  # literal_eval parses the coordinate lists without executing arbitrary code

congested_df = pd.read_csv('../input/taxi-data-temp/congested_areas.csv')
congested_areas = [{area[1][0] : Polygon(ast.literal_eval(area[1][1]))} for area in congested_df.iterrows()]

def within_congested_area(area, pickup_lon, pickup_lat, dropoff_lon, dropoff_lat):
    """ Return miles overlapping the given area 
        if a straight line (trip) is within the area """
    ext = LinearRing(area.exterior.coords)
    line = LineString([(pickup_lon, pickup_lat),(dropoff_lon, dropoff_lat)])
    inter_p = line.intersection(ext)
    
    if (Point(pickup_lon, pickup_lat).within(area) == True) & (Point(dropoff_lon, dropoff_lat).within(area) == True): 
        return -1
    elif Point(pickup_lon, pickup_lat).within(area) == True:
        return gpy.distance((pickup_lat, pickup_lon),(inter_p.coords[0][1], inter_p.coords[0][0])).miles
    elif Point(dropoff_lon, dropoff_lat).within(area) == True:
        return gpy.distance((dropoff_lat, dropoff_lon),(inter_p.coords[0][1], inter_p.coords[0][0])).miles
    elif inter_p.is_empty == False:
        # .geoms iterates the crossing points; iterating a MultiPoint directly fails in Shapely 2.x
        coords = [(p.x, p.y) for p in inter_p.geoms]
        return gpy.distance((coords[0][1], coords[0][0]),(coords[1][1], coords[1][0])).miles
    else:
        return 0
    
def congested_area_processing(df):
    """ Call within congested area for congested areas and 
        calculate the within area percentage """
    for idx, row in congested_df.iterrows():
        df[row['area']] = df.apply(lambda p : within_congested_area(congested_areas[idx][row['area']], \
            p.pickup_longitude, p.pickup_latitude, p.dropoff_longitude, p.dropoff_latitude), axis=1)
        df[row['area']] = df.apply(lambda p : p[row['area']]/p['dist_mile'] if p['dist_mile'] > 0 else 0, axis=1)
        df[row['area']] = df[row['area']].apply(lambda p : 1 if p < 0 else p)
        
def congested_percent_group(df):
    """ Group within area percentages """
    for idx, row in congested_df.iterrows():
        df[row['area']+'_group']=df[row['area']].apply(lambda x : \
            'high' if x > 0.66 else 'mid' if x > 0.33 else 'low' if x > 0 else 'n/a')
    

# congested_area_processing(taxi_trip_data)
# congested_percent_group(taxi_trip_data)

<p style="font-size:16px">Now we have all the percentages calculated for the trips that go through one or more congested areas. Instead of just the congested portions, we can give the model more detailed information. The function below pairs the congested portions with the average speeds of the areas. By doing so, we tell the model not only how much of each trip was congested but also how congested those parts were. One caution is that trips not going through any congested area will have a speed of 0, which could give the model a wrong signal. Thus, we add one more feature to mark those trips as 'N/A'.</p>
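<p style="font-size:16px">The pattern of a left join followed by a missing-value indicator can be illustrated on toy data. The column name <code>area_group</code> and all values below are hypothetical; only the join keys mirror the ones used in this notebook.</p>

```python
import pandas as pd

# Toy trips: one congested-area group per trip (hypothetical values)
trips = pd.DataFrame({
    'area_group': ['high', 'n/a', 'low'],
    'pickup_day': [0, 0, 1],
    'pickup_hour': [8, 8, 9],
})

# Toy lookup of average speed by overlap group, day, and hour
area_speed = pd.DataFrame({
    'dist_percent': ['high', 'low'],
    'pickup_day': [0, 1],
    'pickup_hour': [8, 9],
    'avg_speed': [7.5, 12.0],
})

# A left join keeps every trip; trips without a match get NaN speed
merged = trips.merge(area_speed, how='left',
                     left_on=['area_group', 'pickup_day', 'pickup_hour'],
                     right_on=['dist_percent', 'pickup_day', 'pickup_hour'])

# Record the missingness first, then fill, so 0 is not mistaken for a real speed
merged['area_speed_na'] = merged['avg_speed'].isna().astype(int)
merged['area_speed'] = merged['avg_speed'].fillna(0)
print(merged[['area_group', 'area_speed', 'area_speed_na']])
```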

In [None]:
taxi_trip_data = pd.read_csv('../input/taxi-data-temp/taxi_trip_temp.csv')
group_names = taxi_trip_data.columns[taxi_trip_data.columns.str.contains('_group')]

# Mean aggregation
congested_speed_mean = pd.DataFrame([taxi_trip_data.groupby([area,'pickup_day','pickup_hour']).mean()['speed'] 
                                     for area in group_names], index=congested_df['area']).reset_index()
congested_speed_mean = congested_speed_mean.melt(id_vars='area')
congested_speed_mean.columns = ['area','dist_percent','pickup_day','pickup_hour','avg_speed']

# Count aggregation
congested_speed_count = pd.DataFrame([taxi_trip_data.groupby([area,'pickup_day','pickup_hour']).count()['speed'] 
                                      for area in group_names], index=congested_df['area']).reset_index()
congested_speed_count = congested_speed_count.melt(id_vars='area')
congested_speed_count.columns = ['area','dist_percent','pickup_day','pickup_hour','count']

# Complete aggregation
congested_agg = pd.merge(congested_speed_mean, congested_speed_count, \
                         how='inner', on=['area','dist_percent','pickup_day','pickup_hour'])
congested_agg['avg_speed'] = congested_agg.apply(lambda s : s['avg_speed'] if s['count'] >= 5 else np.nan, axis=1)
congested_agg = congested_agg.sort_values(by=['area','dist_percent','pickup_day','pickup_hour'])
congested_agg['avg_speed'] = congested_agg['avg_speed'].fillna(method='ffill')
congested_agg = congested_agg[(congested_agg['dist_percent'] != 'n/a') & \
                              (np.isnan(congested_agg['avg_speed'])==False)]

def congested_speed_process(df, areas):
    """ Multiply congested portion with congested speed """
    for area in areas:
        df[area+'_speed'] = pd.merge(df, congested_agg[congested_agg['area']==area], how='left', \
                                     left_on=[area+'_group','pickup_day','pickup_hour'], \
                                     right_on=['dist_percent','pickup_day','pickup_hour'])['avg_speed']
        df[area+'_speed_na'] = df[area+'_speed'].apply(lambda s : 1 if np.isnan(s)==True else 0)
        df[area+'_speed'] = df[area+'_speed'].fillna(0)
        
        
# congested_speed_process(taxi_trip_data, congested_df['area'])

<p style="font-size:16px">As we know the pickup/dropoff locations for every trip, we can bring in population data based on them. <a href="https://data.cityofnewyork.us/City-Government/New-York-City-Population-By-Neighborhood-Tabulatio/swpk-hqdp">Population by Neighborhood Data</a> and <a href="https://data.cityofnewyork.us/City-Government/Neighborhood-Tabulation-Areas-NTA-/cpf4-rkhq">NYC Neighborhood Map Data</a> allow us to do the job.</p>

In [None]:
# Read NTA data
neighborhoods = gpd.read_file('../input/taxi-data-temp/Neighborhood_Tabulation_Areas.geojson')
population = pd.read_csv('../input/taxi-data-temp/New_York_City_Population_By_Neighborhood_Tabulation_Areas.csv')
population = population[population['Year']==2010]

print("Neighborhoods Columns :", neighborhoods.columns)
print("Population Columns :", population.columns)

In [None]:
neighbor_pop = pd.merge(neighborhoods, population, how='inner', left_on='ntacode', right_on='NTA Code')
neighbor_pop = neighbor_pop[['Borough','ntaname','Population','geometry']]

# Exclude NTA with population less than 2000
print(neighbor_pop[neighbor_pop['Population']<2000][['ntaname','Population']])
neighbor_pop = neighbor_pop[neighbor_pop['Population']>=2000].reset_index()

<p style="font-size:16px">Searching for points in geographical data is computationally expensive. An R-tree index (the rtree package) speeds up the lookup of pickup/dropoff locations, as implemented below. Because the function returns populations, trips with no matching neighborhood are flagged with a separate 'N/A' feature.</p>
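<p style="font-size:16px">The two-stage idea behind this lookup (a cheap bounding-box filter, then an exact point-in-polygon test to discard false positives) can be sketched in plain Python. This is an illustration only: it scans the boxes linearly instead of using a spatial index, uses a ray-casting test in place of Shapely, and the shapes and the helper names <code>bounds</code>, <code>in_bounds</code>, <code>in_polygon</code>, and <code>locate</code> are hypothetical.</p>

```python
def bounds(polygon):
    """Bounding box (min_x, min_y, max_x, max_y) of a list of (x, y) vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)

def in_bounds(box, x, y):
    min_x, min_y, max_x, max_y = box
    return min_x <= x <= max_x and min_y <= y <= max_y

def in_polygon(polygon, x, y):
    """Ray casting: toggle on each edge crossed by a rightward ray from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # the edge spans the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Hypothetical neighborhoods
areas = {
    'triangle': [(0, 0), (4, 0), (0, 4)],
    'square': [(5, 0), (9, 0), (9, 4), (5, 4)],
}
boxes = {name: bounds(poly) for name, poly in areas.items()}

def locate(x, y):
    """Return (bounding-box candidates, exact hits) for a point."""
    candidates = [name for name, box in boxes.items() if in_bounds(box, x, y)]
    hits = [name for name in candidates if in_polygon(areas[name], x, y)]
    return candidates, hits

print(locate(3, 3))  # inside the triangle's box but outside the triangle itself
print(locate(1, 1))
print(locate(7, 2))
```

<p style="font-size:16px">The point (3, 3) shows why the second stage is needed: it falls inside the triangle's bounding box but not inside the triangle.</p>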

In [None]:
def rtree_build():
    """ Build search trees from neighborhoods' geographical data """
    global idx 
    idx = index.Index()
    for fid, feature in neighbor_pop['geometry'].items():
        idx.insert(fid, feature.bounds)

def neighbor_population(df, lon, lat):
    """ Search points with trees and remove false positives """
    eps = 1e-7 # to make squares
    all_hits = idx.intersection([lon, lat, lon+eps, lat+eps]) # rtree intersection not allowing points
    real_hits = []
    
    for p in all_hits:
        if Point(lon, lat).within(df.iloc[p]['geometry']):
            real_hits.append(p)
    
    if len(real_hits) > 0:
        return [df.loc[real_hits[0]]['Borough'], df.loc[real_hits[0]]['ntaname'], df.loc[real_hits]['Population'].mean(), 0]
    else: 
        return ['unknown', 'unknown', 0, 1]

    
# rtree_build()

# taxi_trip_data[['pickup_borough','pickup_nta','pickup_pop','pickup_pop_na']] = taxi_trip_data.apply(lambda p : \
#     neighbor_population(neighbor_pop, p['pickup_longitude'], p['pickup_latitude']), axis=1).tolist()
# taxi_trip_data[['dropoff_borough','dropoff_nta','dropoff_pop','dropoff_pop_na']] = taxi_trip_data.apply(lambda p : \
#     neighbor_population(neighbor_pop, p['dropoff_longitude'], p['dropoff_latitude']), axis=1).tolist()

<p style="font-size:16px">Lastly, we will use weather data downloaded from <a href="https://openweathermap.org/history-bulk">Open Weather Map</a>. The data includes 40 years of weather history for a selected location, but there is a charge per location. Because the boroughs have virtually the same weather at any given time, we will use the weather dataset for Manhattan.</p>

In [None]:
# Trip count/average by neighborhood
nta_agg = taxi_trip_data.groupby(['pickup_nta','pickup_day','pickup_hour']).\
    agg(['count','mean'])['trip_duration'].reset_index().\
    rename(columns={'count':'nta_trips','mean':'nta_mean_duration'})

# pd.isna is required here because a comparison with np.nan is always False
nta_agg['nta_trips'] = nta_agg.apply(lambda x : 0 
    if x['nta_trips']<5 or pd.isna(x['nta_trips']) or x['pickup_nta']=='unknown' else x['nta_trips'], axis=1)
nta_agg['nta_mean_duration'] = nta_agg.apply(lambda x : 0 if x['nta_trips'] == 0 else x['nta_mean_duration'], axis=1)
nta_agg['nta_na'] = nta_agg['nta_trips'].apply(lambda x : 1 if x == 0 else 0)
                                               
taxi_trip_data = pd.merge(taxi_trip_data, nta_agg, how='left', 
    on=['pickup_nta','pickup_day','pickup_hour'])

# Average speed by borough, day, hour, and distance quantile
taxi_trip_data['dist_bins'] = pd.qcut(taxi_trip_data['dist_mile'], q=np.arange(0, 1.1, 0.1))
speed_agg = taxi_trip_data.groupby(['pickup_borough','pickup_day','pickup_hour','dist_bins']).\
    mean()['speed'].reset_index().rename(columns={'speed':'mean_speed'})
speed_agg['mean_speed'].fillna(method='ffill', inplace=True)
                                               
taxi_trip_data = pd.merge(taxi_trip_data, speed_agg, how='left', \
    on=['pickup_borough','pickup_day','pickup_hour','dist_bins'])

<p style="font-size:16px">Finally, we want to find the <a href="https://openweathermap.org/" target="_blank" rel="noopener noreferrer">weather status</a> for each trip by hour to use it for prediction. We already know that weather affects driving, so we will skip the analysis and let the model figure out how to use it for prediction.</p>
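<p style="font-size:16px">Because one hour can carry more than one weather status, the statuses need to be spread into indicator columns before joining to trips. The sketch below shows the idea on toy rows using <code>pd.crosstab</code>, which is a substitute chosen for this example rather than the pivot used in this notebook. The rows and values are made up.</p>

```python
import pandas as pd

# Toy hourly weather: the second hour has two statuses (hypothetical values)
toy_weather = pd.DataFrame({
    'dt_iso': ['2016-01-01 00:00:00', '2016-01-01 01:00:00', '2016-01-01 01:00:00'],
    'weather_main': ['Clear', 'Fog', 'Rain'],
    'temp': [5.0, 4.0, 4.0],
})

# One row per hour, one 0/1 column per status
status = pd.crosstab(toy_weather['dt_iso'], toy_weather['weather_main']).clip(upper=1)

# Keep one row of numeric measurements per hour and attach the indicators
hourly = (toy_weather[['dt_iso', 'temp']]
          .drop_duplicates(subset='dt_iso')
          .set_index('dt_iso')
          .join(status)
          .reset_index())
print(hourly)
```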

In [None]:
taxi_trip_data = pd.read_csv('../input/taxi-data-temp/taxi_trip_temp.csv')
taxi_trip_data = taxi_trip_data.drop(columns=['temp','wind_deg','wind_speed','Clear', 
                                              'Clouds','Drizzle','Fog','Haze','Mist','Rain','Snow',
                                              'rain_1h', 'rain_3h', 'snow_1h', 'snow_3h'])
congested_df = pd.read_csv('../input/taxi-data-temp/congested_areas.csv')
group_names = taxi_trip_data.columns[taxi_trip_data.columns.str.contains('_group')]

hourly_weather = pd.read_csv('../input/taxi-data-temp/nyc_weather_history.csv')

hourly_weather['datetime'] = hourly_weather['dt_iso'].apply(\
    lambda t : dt.strptime(t[0:19], "%Y-%m-%d %H:%M:%S"))

hourly_weather['year'] = hourly_weather.datetime.apply(lambda Y : Y.year)
hourly_weather['month'] = hourly_weather.datetime.apply(lambda M : M.month)
hourly_weather['date'] = hourly_weather.datetime.apply(lambda D : D.day)
hourly_weather['hour'] = hourly_weather.datetime.apply(lambda h : h.hour)

hourly_weather = hourly_weather[(hourly_weather['year'] == 2016) & (hourly_weather['month'] < 7)]
hourly_weather.head()

In [None]:
print("Number of Weather Observations with Multiple Status: ", \
    sum(hourly_weather.groupby(['dt_iso'])['weather_main'].nunique() > 1))

In [None]:
# Weather status values to columns - "weather_main" can have multiple statuses (e.g. Fog and Rain)
hourly_weather['main_value'] = 1
weather_main = pd.pivot_table(hourly_weather, index=hourly_weather.datetime, \
                     columns='weather_main', values='main_value').reset_index()

# Join the statuses and precipitation columns
weather_cols = ['datetime','month','date','hour','temp','clouds_all','wind_deg','wind_speed']
hourly_weather = hourly_weather[weather_cols].drop_duplicates(keep='first')
hourly_weather = hourly_weather.set_index('datetime').\
                    join(weather_main.reset_index().set_index('datetime'))
hourly_weather = hourly_weather.reset_index().fillna(0).drop(columns='index')
hourly_weather.describe()
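
<p style="font-size:16px">The idea of turning weather statuses into indicator columns can be sketched on a toy log. The rows below are invented, and <code>pd.crosstab</code> is used here as an alternative to the pivot table above, not as the notebook's method. An hour that carries two statuses ends up with both flags set.</p>

```python
import pandas as pd

# Toy hourly log; 01:00 carries two statuses (Fog and Rain)
toy_weather = pd.DataFrame({
    'datetime': pd.to_datetime(['2016-01-01 00:00', '2016-01-01 01:00',
                                '2016-01-01 01:00', '2016-01-01 02:00']),
    'weather_main': ['Clear', 'Fog', 'Rain', 'Rain'],
    'temp': [270.1, 271.3, 271.3, 272.0],
})

# Count rows per (hour, status), then cap the counts at 1 to get 0/1 flags
flags = pd.crosstab(toy_weather['datetime'], toy_weather['weather_main']).clip(upper=1)

# One row per hour for the numeric readings, then attach the flags
hourly = (toy_weather[['datetime', 'temp']]
          .drop_duplicates(keep='first')
          .set_index('datetime')
          .join(flags)
          .reset_index())
print(hourly)
```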

In [None]:
taxi_trip_data = pd.merge(taxi_trip_data, hourly_weather, how='left', 
                 left_on=['pickup_month','pickup_date','pickup_hour'], right_on=['month','date','hour'])
taxi_trip_data = taxi_trip_data.drop(columns=['datetime','month','date','hour'])
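
<p style="font-size:16px">A left merge silently duplicates trips if the weather table has more than one row per month, date, and hour. One way to guard against that, sketched on toy frames with made-up values, is to use the <code>validate</code> and <code>indicator</code> arguments of <code>pd.merge</code>.</p>

```python
import pandas as pd

toy_trip_keys = pd.DataFrame({'pickup_month': [1, 1, 2],
                              'pickup_date': [5, 5, 9],
                              'pickup_hour': [8, 9, 8]})
toy_hourly = pd.DataFrame({'month': [1, 1, 2], 'date': [5, 5, 9], 'hour': [8, 9, 8],
                           'temp': [270.0, 271.0, 268.5]})

merged = pd.merge(toy_trip_keys, toy_hourly, how='left',
                  left_on=['pickup_month', 'pickup_date', 'pickup_hour'],
                  right_on=['month', 'date', 'hour'],
                  validate='many_to_one',  # raises MergeError if weather keys repeat
                  indicator=True)          # adds a _merge column ('both' / 'left_only')

# Trips that found no weather row would show up as 'left_only'
unmatched = int((merged['_merge'] == 'left_only').sum())
print("Trips without weather:", unmatched)
```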

<br>
<hr>

## Modeling
<p style="font-size:16px">Our ultimate goal is to predict the trip duration for given pickup and dropoff points. To do that, we need to prepare the processed data once more, which includes encoding and normalization.</p>

In [None]:
# Explicitly convert types to category
taxi_trip_data['pickup_month'] = taxi_trip_data.pickup_month.astype("category")
taxi_trip_data['pickup_day'] = taxi_trip_data.pickup_day.astype("category")
taxi_trip_data['pickup_hour'] = taxi_trip_data.pickup_hour.astype("category")
taxi_trip_data['pickup_minute'] = (taxi_trip_data.pickup_minute//10).astype("category")

taxi_trip_data['vendor_id'] = taxi_trip_data.vendor_id.astype("category")
taxi_trip_data['store_and_fwd_flag'] = taxi_trip_data.store_and_fwd_flag.astype("category")

taxi_trip_data['busy_pickup_spot'] = taxi_trip_data.busy_pickup_spot.astype("category")
taxi_trip_data['busy_dropoff_spot'] = taxi_trip_data.busy_dropoff_spot.astype("category")

# Speed to be dropped as derived from the target variable 
taxi_trip_data.drop(columns='speed', inplace=True)

# No longer needed
taxi_trip_data.drop(columns='dist_bins', inplace=True)
taxi_trip_data.drop(columns=group_names, inplace=True)
taxi_trip_data.drop(columns=congested_df['area'], inplace=True)

# Encode categories
categorical_cols = ['pickup_day','pickup_hour','pickup_minute',
                    'vendor_id','store_and_fwd_flag',
                    'busy_pickup_spot','busy_dropoff_spot',
                    'pickup_borough','dropoff_borough']
# dtype is pinned so the dummy columns are uint8 regardless of the pandas version
taxi_trip_data = pd.get_dummies(taxi_trip_data, columns=categorical_cols, dtype='uint8')

In [None]:
target = 'trip_duration'
features = list(taxi_trip_data.columns[(taxi_trip_data.dtypes == 'float64') | 
    (taxi_trip_data.dtypes == 'int64') | (taxi_trip_data.dtypes == 'uint8')])

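# Exclude the target and raw datetime parts (a missing comma previously merged the last two names)
features = [item for item in features if item not in \
    [target, 'pickup_month','pickup_date','pickup_hour','pickup_minute']]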
print("Selected Features: ", features)
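
<p style="font-size:16px">Selecting features by exact dtype name depends on what <code>pd.get_dummies</code> returns, which is <code>uint8</code> in older pandas releases and <code>bool</code> since pandas 2.0 unless a dtype is specified. A version-independent sketch, on a made-up frame, uses <code>select_dtypes</code> instead.</p>

```python
import pandas as pd

# Toy frame; values are invented
toy_enc = pd.DataFrame({'trip_duration': [600, 900],
                        'dist_mile': [1.2, 3.4],
                        'vendor_id': ['1', '2'],
                        'pickup_borough': ['Queens', 'Bronx']})
toy_enc = pd.get_dummies(toy_enc, columns=['vendor_id', 'pickup_borough'])

# 'number' covers ints and floats (including uint8); 'bool' covers newer dummy columns
numeric_like = toy_enc.select_dtypes(include=['number', 'bool']).columns
toy_features = [c for c in numeric_like if c != 'trip_duration']
print(toy_features)
```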

In [None]:
# Scale observations
scaler = StandardScaler()
scaler.fit(taxi_trip_data[features])
taxi_trip_data_scaled = scaler.transform(taxi_trip_data[features])

# Split into training and test set
df_train, df_test, Ytrain, Ytest = train_test_split(
    taxi_trip_data_scaled, taxi_trip_data[target].to_numpy(), test_size=0.25, random_state=1234)

INPUT_DIM = df_train.shape[1]
print("Number of Features: ", INPUT_DIM)
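
<p style="font-size:16px">The cell above fits the scaler on all rows before splitting, so the test rows contribute to the scaling statistics. A common alternative, sketched here on random stand-in data rather than the taxi features, is to split first and fit the scaler on the training rows only.</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1234)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))  # stand-in feature matrix
y = rng.normal(size=200)                            # stand-in target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1234)

scaler_sketch = StandardScaler().fit(X_train)       # statistics from training rows only
X_train_scaled = scaler_sketch.transform(X_train)
X_test_scaled = scaler_sketch.transform(X_test)     # reuses the training statistics
```

<p style="font-size:16px">With this ordering, the test set plays the role of unseen data for the scaler as well as for the model.</p>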

<p style="font-size:16px">Lasso regression is linear regression with L1 regularization, which makes it a reasonable baseline for this prediction task. We will compare its performance with the neural network built later.</p>

In [None]:
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score

lasso_lr = Lasso(alpha=0.1, fit_intercept=True, max_iter=100)
lasso_lr.fit(df_train, Ytrain)

lasso_pred = lasso_lr.predict(df_test)
lasso_pred[lasso_pred < 0] = 0

lasso_features = pd.DataFrame()
lasso_features['feature'] = features
lasso_features['lasso_coef'] = lasso_lr.coef_
lasso_features = lasso_features.sort_values(by='lasso_coef', ascending=False)

# Plain RMSE in seconds (no log transform is applied)
lasso_error = np.sqrt(mean_squared_error(Ytest, lasso_pred))

print("Lasso Regression Root Mean Squared Error: %8.2f" % lasso_error)
print("Lasso R-squared: %8.2f" % r2_score(Ytest, lasso_lr.predict(df_test)))
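
<p style="font-size:16px">The Kaggle competition scores submissions with the root mean squared logarithmic error (RMSLE), whereas the square root of <code>mean_squared_error</code> is the plain RMSE. A minimal sketch of the log-based metric follows; the helper name <code>rmsle</code> is hypothetical and not part of the notebook.</p>

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error; negative predictions are clipped to 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0, None)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Durations in seconds (made-up values): relative, not absolute, errors drive the score
print(rmsle([600, 1200], [660, 1100]))
```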

In [None]:
lasso_features.head(10)

In [None]:
lasso_features.tail(10)

<p style="font-size:16px">Finally, we build a neural network. One advantage of neural networks over other machine learning models is that they need less manual feature selection and engineering, because they can learn useful feature combinations during training. Some model optimization is still needed, as seen below, but because the problem is a simple regression, it does not take much tuning.</p>

In [None]:
model = Sequential()

model.add(Dense(128, input_dim=INPUT_DIM, activation='elu', kernel_initializer='he_normal'))
model.add(Dropout(3e-2))
# Only the first layer needs input_dim; later layers infer their input size
model.add(Dense(64, activation='relu', activity_regularizer=l1(1e-4)))
model.add(Dropout(2e-2))
model.add(Dense(32, activation='elu', activity_regularizer=l1(1e-5)))
model.add(Dropout(1e-2))
model.add(Dense(16, activation='relu', activity_regularizer=l1(1e-6)))
model.add(Dense(1, activation='linear'))

# metrics is passed as a list
model.compile(loss='mean_squared_error', optimizer=Adam(learning_rate=5e-4), metrics=[RootMeanSquaredError()])
result = model.fit(df_train, Ytrain, validation_data=(df_test, Ytest), epochs=50, batch_size=32, verbose=0)

In [None]:
model.save('./model.h5')
model.summary()

plt.plot(result.history['root_mean_squared_error'], label='train_error')
plt.plot(result.history['val_root_mean_squared_error'], label='val_error')
plt.legend()
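
<p style="font-size:16px">The parameter total reported by <code>model.summary()</code> can be checked by hand: each dense layer has (inputs + 1) × units parameters, and dropout layers add none. In the sketch below, the helper name is hypothetical and the input width of 60 is a placeholder, not the notebook's actual <code>INPUT_DIM</code>.</p>

```python
def dense_param_count(input_dim, layer_units):
    """Weights plus biases for a stack of fully connected layers."""
    total, fan_in = 0, input_dim
    for units in layer_units:
        total += (fan_in + 1) * units  # +1 accounts for each unit's bias
        fan_in = units
    return total

# Layer widths follow the model above; 60 input features is an assumed value
example_total = dense_param_count(60, [128, 64, 32, 16, 1])
print(example_total)
```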

In [None]:
train_pred = model.predict(df_train)
print("Training Set R-squared: %8.3f" % r2_score(Ytrain, train_pred))

test_pred = model.predict(df_test)
print("Test Set R-squared: %8.3f" % r2_score(Ytest, test_pred))
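
<p style="font-size:16px">For reference, the R-squared reported by <code>r2_score</code> is one minus the ratio of the residual sum of squares to the total sum of squares. A small sketch with a hypothetical helper name:</p>

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # variation around the mean
    return float(1.0 - ss_res / ss_tot)

# Predicting the mean for every trip gives 0; perfect predictions give 1
print(r_squared([1, 2, 3], [2, 2, 2]), r_squared([1, 2, 3], [1, 2, 3]))
```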

In [None]:
pred_plot = pd.DataFrame(zip(Ytrain, train_pred.flatten()), columns=['True', 'Pred'])

plt.figure(figsize=(7,7))
sb.scatterplot(data=pred_plot.sample(n=1000, random_state=1234), x='True', y='Pred', size=0.1, legend=None)
plt.gca().set_aspect('equal', adjustable='box')
plt.plot(np.linspace(0, 5000, 2), np.linspace(0, 5000, 2), color='r')

<br>
<hr>

## Conclusion
<p style="font-size:16px">Predicted durations are generally close to their true values (on or near the red line), with an R-squared of almost 0.9. The model performance came mainly from adding variables rather than from heavy feature engineering such as transformations or interaction terms. The core of the modeling was figuring out what could help predict trip durations, adding those variables carefully, and harnessing the neural network's ability to capture interactions and non-linearity.</p>