# Objective

This dataset is a record of taxi trips in New York City, split into training and test datasets.

Training and test observations both have the same features: passenger count, pickup longitude, pickup latitude, dropoff longitude, dropoff latitude and pickup datetime.

Training observations are labeled with the fare amount paid.

Our objective is to create a model that will accurately estimate the fare amount of trips in the test dataset. Accuracy is measured by root mean square error.

We will also be performing exploratory data analysis in an attempt to better understand the dataset.

# Results

### Modelling

Our final model is a weighted average of two separate models:

1. A gradient boosting decision tree model with 94% weight. This model has 28 features, making the most use of the time of year, time of day, trip distance and proximity to JFK airport.
2. A k-nearest neighbours model with 6% weight. This model uses only pickup and dropoff locations.

The leaderboard RMSE obtained by this model is 2.93.

### EDA Insights

* Most taxi rides have 1 passenger.
* Average fare saw a large increase between 2011 and 2013, going from \$10.43 to \$12.58.
* Fares are highest around 5am, primarily due to long trips leaving the city.
* People take long trips later in the day on weekends.
* Airports and the city centre are taxi hotspots.
* There is an area to the west of the city centre with very high fares per km travelled.
* There is a location about 90km from the city centre from which long trips come and go with substantially lower fares than we would expect.
* The street grid is visible in the data: fare per km of point to point distance varies with the direction of the trip and is lowest for trips running along the grid.

# Setup

In [None]:
import pandas as pd
import numpy as np
import pandas_profiling
import os
import seaborn as sns
import matplotlib.pyplot as plt
from pyproj import Geod
import scipy

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import lightgbm as lgbm

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.float_format', lambda x: '%.4f' % x)

TRAIN_PATH = '../input/new-york-city-taxi-fare-prediction/train.csv'
TEST_PATH = '../input/new-york-city-taxi-fare-prediction/test.csv'

# Data Validation and Cleaning

We load test.csv to check for missing values and see what value range the features lie in.

In [None]:
test = pd.read_csv(TEST_PATH)
print('Null values:',test.isnull().sum().sum())
test.head()

The test dataset is already clean, with no missing or implausible values.

The training dataset consists of ~55 million rows. A dataset this large will be slow to load and perform operations on, so we load a subset for quick examination.

In [None]:
df_temp = pd.read_csv(TRAIN_PATH, nrows=100000)
profile = pandas_profiling.ProfileReport(df_temp, title="Profile Report", minimal=True, progress_bar=False)
profile.to_notebook_iframe()

The key column is used as an index and can be removed while working with the data.

fare_amount contains some negative values. We will remove these rows and set a plausible upper limit.

passenger_count contains some trips with zero passengers. We will remove these rows and set a plausible upper limit.

The longitude/latitude for locations in NYC should be around -73/40. The minimum latitude in this sample is -74 while the maximum longitude is 40. It seems possible that these values have been switched in some rows.

In [None]:
df_temp[df_temp['pickup_longitude']>0].head()

This confirms the suspicion that some rows have reversed longitude/latitude values. We will switch these values back to the correct orientation then remove rows that do not contain longitude/latitude values within NYC.

The functions below will load the dataset and apply the outlined changes.

In [None]:
def clean_df(df):
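    #keep 'YYYY-MM-DD HH:M', which truncates timestamps to 10-minute resolution (dt.minute then holds the tens digit, 0-5)
    df['pickup_datetime'] = df['pickup_datetime'].str.slice(0, 15)
    df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'], utc=True, format='%Y-%m-%d %H:%M')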
    
    #reverse incorrectly assigned longitude/latitude values
    df = df.assign(rev=df.dropoff_latitude<df.dropoff_longitude)
    idx = (df['rev'] == 1)
    df.loc[idx,['dropoff_longitude','dropoff_latitude']] = df.loc[idx,['dropoff_latitude','dropoff_longitude']].values
    df.loc[idx,['pickup_longitude','pickup_latitude']] = df.loc[idx,['pickup_latitude','pickup_longitude']].values
    
    #remove data points outside appropriate ranges
    criteria = (
    " 0 < fare_amount <= 500"
    " and 0 < passenger_count <= 6 "
    " and -75 <= pickup_longitude <= -72 "
    " and -75 <= dropoff_longitude <= -72 "
    " and 40 <= pickup_latitude <= 42 "
    " and 40 <= dropoff_latitude <= 42 "
    )
    df = (df
          .dropna()
          .query(criteria)
          .reset_index()
          .drop(columns=['rev', 'index'])          
         )
    return df

def load_df(nrows=None, features=None):
    #load the dataframe in chunks so that cleaning and feature extraction run on at most 5 million rows at a time
    cols = [
        'fare_amount', 'pickup_datetime','pickup_longitude', 'pickup_latitude',
        'dropoff_longitude', 'dropoff_latitude', 'passenger_count'
    ]
    df_as_list = []
    for df_chunk in pd.read_csv(TRAIN_PATH, usecols=cols, nrows=nrows, chunksize=5000000):
        df_chunk = clean_df(df_chunk) 
        if features == 'explore':
            df_chunk = exploration_features(df_chunk)
        elif features == 'model':
            df_chunk = modelling_features(df_chunk)
        else:
            df_chunk = df_chunk.drop(columns='pickup_datetime')
        df_as_list.append(df_chunk)
    df = pd.concat(df_as_list)
    return df

To keep run times short we will use at most 10 million rows of the data. We load this subset below, and see that all values are now valid and in a reasonable range.

In [None]:
train = load_df(10000000)
train.describe()

# Testing Baseline Models
Given the non-linearity of the relationship I expect to exist between location and fare, more flexible models should outperform the linear regression, which will have little to work with before further features are extracted. 

Fares have a standard deviation of $9.63. This is the first score any model we make should beat. Before exploring the dataset further and creating more features, let's get a stronger baseline to improve on by measuring the RMSE of some models trained on just the initial features.
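
The standard deviation is a natural first benchmark because it equals the RMSE of a constant model that always predicts the mean fare. A minimal sketch of that equivalence follows. The `fares` array is made up for illustration and is not taken from the dataset.

```python
import numpy as np

# Hypothetical fares for illustration only; not values from the dataset.
fares = np.array([4.5, 6.0, 7.5, 9.0, 12.5, 30.0])

# A constant model that predicts the mean fare for every trip.
mean_pred = np.full_like(fares, fares.mean())

# RMSE of the constant model.
baseline_rmse = np.sqrt(np.mean((fares - mean_pred) ** 2))

# The population standard deviation (ddof=0) is the same quantity.
print(baseline_rmse, fares.std(ddof=0))
```

Note that pandas' describe() reports the sample standard deviation (ddof=1), which is practically identical to the population value at the sample sizes used here.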

We will use 4 models: linear regression, KNN, gradient boosted trees, and an ensemble of the best performers.

In [None]:
def get_split_sets(train):
    x = train.drop(columns=['fare_amount'])
    y = train['fare_amount'].values
    x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.1, random_state=0)
    return x_train, x_val, y_train, y_val


def lin_model(x_train, x_val, y_train, y_val):
    model = LinearRegression()
    model.fit(x_train, y_train)
    pred = model.predict(x_val)
    rmse = np.sqrt(mean_squared_error(y_val, pred))
    return model, rmse, pred


def knn_model(x_train, x_val, y_train, y_val, neighbors):
    min_rmse = 1000
    for n in neighbors:
        knn = KNeighborsRegressor(n_neighbors=n)
        knn.fit(x_train, y_train)
        pred = knn.predict(x_val)
        rmse = np.sqrt(mean_squared_error(y_val, pred))
        if rmse < min_rmse:
            min_rmse = rmse
            model = knn
            best_pred = pred
        print('Neighbours', n, 'RMSE', rmse)
    return model, min_rmse, best_pred


def lgbm_model(params,x_train, x_val, y_train, y_val):
    lgbm_train = lgbm.Dataset(x_train, y_train, silent=True)
    lgbm_val = lgbm.Dataset(x_val, y_val, silent=True)
    model = lgbm.train(params=params, train_set=lgbm_train, valid_sets=lgbm_val, verbose_eval=100)
    pred = model.predict(x_val, num_iteration=model.best_iteration)
    rmse = np.sqrt(mean_squared_error(y_val, pred))
    return model, rmse, pred

In [None]:
train = load_df(1000000)
x_train, x_val, y_train, y_val= get_split_sets(train)
test = pd.read_csv(TEST_PATH)
x_test = test.drop(columns=['key'])

In [None]:
lin_init_model, lin_init_rmse, lin_init_pred = lin_model(x_train, x_val, y_train, y_val)

In [None]:
k_choices = [10,20,30,40,50,60]
knn_cols = ['pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude']
knn_init, knn_init_rmse, knn_init_pred = knn_model(x_train[knn_cols], x_val[knn_cols], y_train, y_val, k_choices)

In [None]:
lgbm_params = {
    'objective': 'regression',
    'boosting': 'gbdt',
    'num_leaves': 400,
    'learning_rate': 0.1,
    'max_bin': 3000,
    'num_rounds': 5000,
    'early_stopping_rounds': 100,
    'metric' : 'rmse'
}
lgbm_init_model, lgbm_init_rmse, lgbm_init_pred = lgbm_model(lgbm_params, x_train, x_val, y_train, y_val)

In [None]:
lgbm.plot_importance(lgbm_init_model)

In [None]:
print('Linear Regression RMSE', lin_init_rmse)
print('KNN RMSE', knn_init_rmse)
print('LightGBM RMSE', lgbm_init_rmse)

Linear regression is predictably our worst performer by far, with a validation RMSE of 8.3020.

KNN regression achieves a validation RMSE of 4.0210, a huge improvement on the linear regression model.

The LightGBM model obtains a validation RMSE of 4.0261, almost identical to KNN regression. We see in the feature importance graph that the model makes relatively little use of passenger count.

In [None]:
init_preds_ave = (lgbm_init_pred+knn_init_pred)/2
rmse = np.sqrt(mean_squared_error(y_val, init_preds_ave))
print('Combined RMSE: ', rmse)

The performance of the linear regression model was poor, so we left it out of the ensemble. The averaged prediction of the KNN and LightGBM models produces the lowest RMSE of these initial attempts, with a validation RMSE of 3.958.
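
Averaging tends to help when the two models make errors that are not perfectly correlated. The sketch below illustrates this on synthetic data, assuming two hypothetical predictors whose errors are independent and equally large. Real model errors are usually positively correlated, so the gain is smaller in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

def rmse(y_true, y_pred):
    """Root mean square error between two arrays."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Synthetic targets and two hypothetical predictors with independent errors.
y_true = rng.gamma(shape=2.0, scale=5.0, size=100_000)
pred_a = y_true + rng.normal(0, 4, size=y_true.size)
pred_b = y_true + rng.normal(0, 4, size=y_true.size)
pred_avg = (pred_a + pred_b) / 2

print('Model A RMSE:', rmse(y_true, pred_a))
print('Model B RMSE:', rmse(y_true, pred_b))
print('Average RMSE:', rmse(y_true, pred_avg))
```

With fully independent errors the averaged RMSE falls by a factor of about the square root of 2.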

# Exploration and Feature Extraction

We create a function to extract more features which will be helpful in visualising the information in this dataset.

**Created Features**

Datetime: year, day of the year, day of the week, time of day.  

Geographic: Trip distance, direction of the trip, binned direction, and distances of the pickup and dropoff points from the city centre, the airports, a hotspot of long distance trips and an area with high fares per km. Our longitude range is 3 degrees and our latitude range is 2 degrees. We bin these values into ~100m wide squares using the fact that a degree of latitude is ~111km and a degree of longitude is ~85km at New York's latitude. We also create 1km square bins.

Price: Fare per KM.
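
The bin counts follow from simple arithmetic on the coordinate ranges kept during cleaning (-75 to -72 longitude, 40 to 42 latitude). A short worked example, where `n_bins` is a hypothetical helper used only for this illustration:

```python
# Approximate km per degree near New York's latitude, as quoted above.
KM_PER_DEG_LAT = 111
KM_PER_DEG_LON = 85

LON_RANGE_DEG = 3  # -75 to -72
LAT_RANGE_DEG = 2  # 40 to 42

def n_bins(range_deg, km_per_deg, bin_km):
    """Number of bins of width bin_km needed to span range_deg degrees."""
    return round(range_deg * km_per_deg / bin_km)

print(n_bins(LON_RANGE_DEG, KM_PER_DEG_LON, 0.1))  # 2550 longitude bins of ~100m
print(n_bins(LAT_RANGE_DEG, KM_PER_DEG_LAT, 0.1))  # 2220 latitude bins of ~100m
print(n_bins(LON_RANGE_DEG, KM_PER_DEG_LON, 1.0))  # 255 longitude bins of ~1km
print(n_bins(LAT_RANGE_DEG, KM_PER_DEG_LAT, 1.0))  # 222 latitude bins of ~1km
```

The latitude counts are rounded to 2200 and 220 in the code. The widths are approximate in any case, because pd.cut with an integer number of bins spans the observed minimum and maximum of the column rather than the full filter range.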

In [None]:
def distance(lon1,lat1,lon2,lat2):
    az12,az21,dist = Geod(ellps='WGS84').inv(lon1,lat1,lon2,lat2)
    return dist
def direction(lon1,lat1,lon2,lat2):
    az12,az21,dist = Geod(ellps='WGS84').inv(lon1,lat1,lon2,lat2)
    return az12

def shared_features(df):
    """adds features that will be used by both the modelling and EDA dataframes"""
    rows = len(df)
    #these long/lat values are needed as lists to hand to the distance function
    nyc_long, nyc_lat = [-74.001541]*rows, [40.724944]*rows    
    jfk_long, jfk_lat = [-73.785937]*rows, [40.645494]*rows
    lga_long, lga_lat = [-73.872067]*rows, [40.774071]*rows
    nla_long, nla_lat = [-74.177721]*rows, [40.690764]*rows
    chp_long, chp_lat = [-73.137393]*rows, [41.366138]*rows
    exp_long, exp_lat = [-74.0375]*rows, [40.736]*rows
    pickup_long = df.pickup_longitude.tolist()
    pickup_lat = df.pickup_latitude.tolist()
    dropoff_long = df.dropoff_longitude.tolist()
    dropoff_lat = df.dropoff_latitude.tolist()
    
    #add features to the data
    df = df.assign(
        #time features
        year=df.pickup_datetime.dt.year,
        dayofyear=df.pickup_datetime.dt.dayofyear,
        weekday=df.pickup_datetime.dt.dayofweek,
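        #pickup_datetime was truncated to tens of minutes during cleaning, so dt.minute holds 0-5 and dividing by 6 gives the fraction of the hour
        time=(df.pickup_datetime.dt.hour+df.pickup_datetime.dt.minute/6),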
        
        #distance between pickup and dropoff, and bearing from pickup to dropoff
        distance=distance(pickup_long, pickup_lat, dropoff_long, dropoff_lat),
        direction=direction(pickup_long, pickup_lat, dropoff_long, dropoff_lat),
        
        #distance from locations
        pickup_dist_nyc=pd.Series(distance(pickup_long, pickup_lat, nyc_long, nyc_lat)),
        dropoff_dist_nyc=pd.Series(distance(dropoff_long, dropoff_lat, nyc_long, nyc_lat)),
        pickup_dist_jfk=pd.Series(distance(pickup_long, pickup_lat, jfk_long, jfk_lat)),
        dropoff_dist_jfk=pd.Series(distance(dropoff_long, dropoff_lat, jfk_long, jfk_lat)),
        pickup_dist_lga=pd.Series(distance(pickup_long, pickup_lat, lga_long, lga_lat)),
        dropoff_dist_lga=pd.Series(distance(dropoff_long, dropoff_lat, lga_long, lga_lat)),
        pickup_dist_nla=pd.Series(distance(pickup_long, pickup_lat, nla_long, nla_lat)),
        dropoff_dist_nla=pd.Series(distance(dropoff_long, dropoff_lat, nla_long, nla_lat)),
        pickup_dist_chp=pd.Series(distance(pickup_long, pickup_lat, chp_long, chp_lat)),
        dropoff_dist_chp=pd.Series(distance(dropoff_long, dropoff_lat, chp_long, chp_lat)),
        pickup_dist_exp=pd.Series(distance(pickup_long, pickup_lat, exp_long, exp_lat)),
        dropoff_dist_exp=pd.Series(distance(dropoff_long, dropoff_lat, exp_long, exp_lat))
    )
    return df


def exploration_features(df):
    """adds features for use in the EDA section"""
    df = shared_features(df)
    df = (
        df
        .assign(
            hour=df.pickup_datetime.dt.hour,
            close_to_airport='No',
            fare_per_km=df.fare_amount*1000/df.distance,
            direction_bucket = pd.cut(df.direction, np.linspace(-180, 180, 37)),

            #small location buckets
            pickup_long_bucket=pd.cut(df.pickup_longitude, bins=2550, labels=False),
            pickup_lat_bucket=pd.cut(df.pickup_latitude, bins=2200, labels=False),
            dropoff_long_bucket=pd.cut(df.dropoff_longitude, bins=2550, labels=False),
            dropoff_lat_bucket=pd.cut(df.dropoff_latitude, bins=2200, labels=False),

            #large location buckets
            pickup_long_bucket_big=pd.cut(df.pickup_longitude, bins=255, labels=False),
            pickup_lat_bucket_big=pd.cut(df.pickup_latitude, bins=220, labels=False),
            dropoff_long_bucket_big=pd.cut(df.dropoff_longitude, bins=255, labels=False),
            dropoff_lat_bucket_big=pd.cut(df.dropoff_latitude, bins=220, labels=False)
        )
        .drop(columns='pickup_datetime')
        .query("0 < distance")
    )
    df.loc[((df['pickup_dist_jfk']<1500) | (df['dropoff_dist_jfk']<1500)), 'close_to_airport'] = 'JFK'
    df.loc[((df['pickup_dist_lga']<1500) | (df['dropoff_dist_lga']<1500)), 'close_to_airport'] = 'LaGuardia'
    df.loc[((df['pickup_dist_nla']<1500) | (df['dropoff_dist_nla']<1500)), 'close_to_airport'] = 'Newark'  
    return df

We load 5 million rows of the dataset and take a look at passenger counts.

In [None]:
train = load_df(5000000, features='explore')

In [None]:
fig, ax = plt.subplots(1,2, figsize=(15,5))
sns.countplot(x=train.passenger_count, ax=ax[0])
ax[0].set_xlabel('Passenger Count')
ax[0].set_ylabel('Frequency')
ax[0].set_title('Distribution of Passenger Count')
sns.barplot(x=train.passenger_count, y=train.fare_amount, ax=ax[1], ci=None)
ax[1].set_xlabel('Passenger Count')
ax[1].set_ylabel('Fare ($)')
ax[1].set_title('Average Fare by Passenger Count')
fig.show()

Most taxi rides only have 1 passenger, but passenger count doesn't show any relationship to the fare amount.

Let's check how the trip distance relates to fare.

In [None]:
fig = plt.figure(figsize=(12, 5))
sns.scatterplot(x='distance', y='fare_amount', data=train).set_title('Fare vs Distance')
plt.xlabel('Distance (m)')
plt.ylabel('Fare ($)')

We see an expected relationship, with fares increasing as distance increases. There are, however, a large number of data points that seem to hug each axis. 

1. Many taxi rides with close to 0 distance between the pickup and dropoff locations have much higher fares than we would expect, with the fare distribution of very low distance trips seeming similar to the distribution of all other trips. Could this have some unknown but real cause, or is it erroneous data?

2. As distance goes above ~75km the average fare actually decreases due to a mass of points around 80-110km with fares below $50. Are these trips on routes with less traffic? Fixed price trips? Or is it erroneous data?

**Answering Q1**

If we assumed that all trips with a pickup to dropoff point distance of less than 50m have had location data entered incorrectly, then we would expect fares for this group to come from the same distribution as properly recorded trips. A confounding factor we should look out for is the potential for some true very short trips to coexist with misrecorded data. 

We will explore the distributions of these two subsets by using a Q-Q plot and the Kolmogorov-Smirnov test.

In [None]:
percs = np.linspace(0,99,34)
short = np.percentile(train[train['distance']<=50].fare_amount, percs)
long = np.percentile(train[train['distance']>50].fare_amount, percs)
sns.scatterplot(x=short, y=long)
x = np.linspace(np.min((short.min(),long.min())), np.max((short.max(),long.max())))
plt.plot(x,x, color="k", ls="--")
plt.title('Q-Q Plot of Fares: Short vs Long Trips')
plt.xlabel('Short Trip Fare ($)')
plt.ylabel('Long Trip Fare ($)')


In [None]:
ks = scipy.stats.ks_2samp(
    train.where(train.distance > 50).dropna()['fare_amount'],
    train.where(train.distance <= 50).dropna()['fare_amount']
)
print('p-value:', ks[1])

The tiny p-value output by the Kolmogorov-Smirnov test and the shape of the QQ-plot mean we can be confident that these subsets follow different distributions. It is tough to say what the cause of the large fare values is but there is some information encoded here, so we will not drop these rides from the dataset or impute the fare amount.
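
For readers unfamiliar with the test, here is a minimal sketch of how the two-sample Kolmogorov-Smirnov test behaves. It uses synthetic gamma-distributed values as stand-ins for fares; the distributions, sample sizes and the shift of 3 are arbitrary assumptions, not properties of the dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic "fares": illustrative gamma draws, not taken from the dataset.
reference = rng.gamma(shape=2.0, scale=5.0, size=5000)
same_dist = rng.gamma(shape=2.0, scale=5.0, size=5000)
shifted = rng.gamma(shape=2.0, scale=5.0, size=5000) + 3.0

# Two samples from the same distribution: small KS statistic.
same = stats.ks_2samp(reference, same_dist)
# Samples from different distributions: larger statistic, tiny p-value.
diff = stats.ks_2samp(reference, shifted)

print('same distribution:', same.statistic, same.pvalue)
print('shifted distribution:', diff.statistic, diff.pvalue)
```

With millions of rows even small differences give tiny p-values, so the KS statistic and the Q-Q plot are the better guide to how large the difference is.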

**Answering Q2**

Let's see how many trips are over 75km long.

In [None]:
long_trips = train[train.distance>75000].fare_amount.count()
print(long_trips, 'trips over 75km.')

There are 788 trips made with a pickup to dropoff point distance of over 75km.

Now let's see what the average fare is for these trips compared to shorter trips, 50-75km long.

In [None]:
print('Average fare for distance over 75km:', train[train.distance>75000].fare_amount.mean())
print('Average fare for distance 50-75km:', train.query('50000 < distance < 75000').fare_amount.mean())
sns.barplot(x=['50-75km', '>75km'], y=[train.query('50000 < distance < 75000').fare_amount.mean(), train[train.distance>75000].fare_amount.mean()])
plt.title('Fare by Trip Distance')
plt.ylabel('Fare ($)')
plt.xlabel('Distance')

This confirms the huge decrease in fare that is happening for the longest trips. We will plot these trips to see if we can narrow down the cause.

In [None]:
def plot_long_trips(df):
    rows=len(df)
    fig, ax = plt.subplots(1, 1, figsize=(12, 12))
    for i in range(rows):
        plt.plot([df.pickup_longitude[i],df.dropoff_longitude[i]], [df.pickup_latitude[i], df.dropoff_latitude[i]], marker='o', color='b', alpha=0.1)
    plt.title('Linked Pickup and Dropoff Points for Trips longer than 75km')
    plt.ylabel('Latitude')
    plt.xlabel('Longitude')
    plt.show()    

plot_long_trips(train[train.distance>75000].reset_index())

There is a clear source of most of the long distance trips, but is this location responsible for the low fares?

In [None]:
print(train[train.distance>75000].query('41.36 < pickup_latitude < 41.37 or 41.36 < dropoff_latitude < 41.37')
      .fare_amount.count(), f'of the {long_trips} trips with distance>75km start or end in this area')

print(f'The average fare of these trips is',
      train[train.distance>75000].query('41.36 < pickup_latitude < 41.37 or 41.36 < dropoff_latitude < 41.37')
      .fare_amount.mean())

print(f'The average fare of long trips starting and ending elsewhere is',
      train[train.distance>75000].query('(41.36 > pickup_latitude or pickup_latitude > 41.37) and (41.36 > dropoff_latitude or dropoff_latitude > 41.37)')
      .fare_amount.mean())

Though it does not explain the drop entirely, this location is a huge contributor to the low fares we are seeing for long trips. Checking the coordinates on a map, there does not seem to be anything there, so I'm not sure whether these data points are correct. Regardless, because it has such a big impact on fares in this set, we will create a feature for our model that measures distance from this area.

Now we look at how time features interact with fares.

In [None]:
train.pivot_table('fare_amount', index='year').plot(figsize=(15,2))
plt.title('Fare Paid by Year')
plt.ylabel('Fare ($)')
plt.xlabel('Year')
train.pivot_table('fare_amount', index='dayofyear').plot(figsize=(15,2))
plt.title('Fare Paid by Day of Year')
plt.ylabel('Fare ($)')
plt.xlabel('Day')
train.pivot_table('fare_amount', index='weekday').plot(figsize=(15,2))
plt.title('Fare Paid by Weekday (Monday-Sunday)')
plt.ylabel('Fare ($)')
plt.xlabel('Day')
train.pivot_table('fare_amount', index='time').plot(figsize=(15,2))
plt.ylabel('Fare ($)')
plt.xlabel('Time')
plt.title('Fare Paid by Time of Day')
plt.show()

There has been a steady increase in the average fare by year between 2009 and 2015, with a clear jump between 2012 and 2013. 

Throughout the year the average fare fluctuates as one would expect given the smaller sample size per data point, but clear seasonality is present.

Fares are lowest on Saturdays and highest on Sundays, but are overall quite stable throughout the week (note the scale of the y-axis).

Fares show a very large spike around 5am. Could this be commutes into the city in heavy traffic? Let's see how fare per km and trip distance vary with pickup time.

In [None]:
train.query('50 < distance').pivot_table('fare_per_km', index='hour', columns='weekday').plot(figsize=(15,2))
plt.title('$ per KM vs Time of Day')
plt.ylabel('$ per KM')
plt.xlabel('24hr Time')
train.pivot_table('distance', index='time', columns='weekday').plot(figsize=(15,2))
plt.ylabel('Meters')
plt.xlabel('24hr Time')
plt.title('Trip Distance vs Time of Day')
plt.show()

We see that the spike in average fare around 5am is probably people commuting for the day, and is a result of longer trip distances rather than a high price per km. Price per km is actually at its highest during standard work hours from 9am-5pm.

Interestingly, we can also see that the peak in trip distance occurs later in the day on Saturday and Sunday, evidence of what one would assume - that people take long trips slightly later in the day on the weekend.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(15, 4))
#average distance from the city centre (in km) for each time of day
dist_by_time = train.groupby('time')[['pickup_dist_nyc', 'dropoff_dist_nyc']].mean()/1000
plt.plot(dist_by_time.index, dist_by_time['pickup_dist_nyc'], label='Pickup Distance')
plt.plot(dist_by_time.index, dist_by_time['dropoff_dist_nyc'], label='Dropoff Distance')
plt.legend(loc="upper right")
plt.xlabel('24hr Time')
plt.ylabel('Distance (km)')
plt.title('Distance from City Centre vs Time of Day')

I was expecting the long trips to be people coming in to the city, but it turns out the longer trip distances in the morning are caused by people travelling outwards from the city centre.

In [None]:
def density_heatmap(direction):
    df = np.log(
        train
        .query(f'-74.3 < {direction}_longitude < -73.6')
        .query(f'40.4 < {direction}_latitude < 41')
        .groupby([f'{direction}_long_bucket_big',f'{direction}_lat_bucket_big'])
        .distance
        .count()
        .unstack(level=0)
        .iloc[::-1]
    )
    sns.heatmap(df, cmap="plasma", vmax=8, cbar_kws={'label':f'log({direction}s per sq. km)'}, ax=ax)
    plt.title(f'{direction} density')
    plt.ylabel('Latitude Bucket')
    plt.xlabel('Longitude Bucket')
    
fig = plt.figure(figsize=(10, 9))
fig.subplots_adjust(wspace=0.2, right=1.8)
ax = fig.add_subplot(1, 2, 1)
density_heatmap('pickup')
ax = fig.add_subplot(1, 2, 2)
density_heatmap('dropoff')
plt.show()

Most pickups and dropoffs occur in the city centre. Checking the location of the three spots that are brighter than we would expect them to be given their distance from the city centre, we can see that they are Newark Liberty Airport, LaGuardia Airport and JFK Airport.

In [None]:
sns.boxplot(x=train['close_to_airport'], y=train['fare_amount'])
plt.title('Fare Distribution by Proximity to Airports')
plt.ylim(-1,200)
plt.xlabel('Close to Airport')
plt.ylabel('Fare ($)')

We can see that fares differ substantially based on whether they are close to an airport, and which airport they are close to. We will include features in the model measuring distance from each of these airports.

We now see if fares themselves differ based on pickup and dropoff location.

In [None]:
def fare_heatmap(direction):
    df = (
        train
        .groupby([f'{direction}_long_bucket', f'{direction}_lat_bucket'])
        .fare_amount
        .mean()
        .unstack(level=0)
        .iloc[::-1]
    )
    sns.heatmap(df, cmap="Blues", vmin= 0, vmax=30, ax=ax, cbar_kws={'label':'Average Fare ($)'})
    plt.title(f'Average Fare by {direction} Location')
    plt.ylabel('Latitude Bucket')
    plt.xlabel('Longitude Bucket')
    
fig = plt.figure(figsize=(10, 9))
fig.subplots_adjust(wspace=0.2, right=1.8)
ax = fig.add_subplot(1, 2, 1)
fare_heatmap('pickup')
ax = fig.add_subplot(1, 2, 2)
fare_heatmap('dropoff')
plt.show()

In [None]:
def dist_heatmap(direction):
    df = (
        train
        .groupby([f'{direction}_long_bucket', f'{direction}_lat_bucket'])
        .distance
        .mean()
        .unstack(level=0)
        .iloc[::-1]
    )
    sns.heatmap(df, cmap="Blues", vmin= 0, vmax=10000, ax=ax, cbar_kws={'label':f'Distance (m)'})
    plt.title(f'Average Distance by {direction} Location')
    plt.ylabel('Latitude Bucket')
    plt.xlabel('Longitude Bucket')
    
fig = plt.figure(figsize=(10, 9))
fig.subplots_adjust(wspace=0.2, right=1.8)
ax = fig.add_subplot(1, 2, 1)
dist_heatmap('pickup')
ax = fig.add_subplot(1, 2, 2)
dist_heatmap('dropoff')
plt.show()

These heatmaps show that as you get further away from the city centre, fares and trip distances increase, an expected pattern.

The largest point of interest is the area directly to the west of the city centre. In this area fares are high, but this is not due to a higher average trip distance as there is no such hotspot on the distance heatmap.

In [None]:
def ratio_heatmap(direction):
    df = (
        train
        .query(f'-74.1 < {direction}_longitude < -73.95')
        .query(f'40.65 < {direction}_latitude < 40.8')
        .groupby([f'{direction}_long_bucket', f'{direction}_lat_bucket'])
        .fare_per_km
        .mean()
        .unstack(level=0)
        .iloc[::-1]
    )
    sns.heatmap(df, cmap="viridis", vmin= 0, vmax=80, ax=ax, cbar_kws={'label':f'Fare per km ($)'})
    plt.title(f'Average Fare Per km by {direction} Location')
    plt.ylabel('Latitude Bucket')
    plt.xlabel('Longitude Bucket')
    
fig = plt.figure(figsize=(10, 9))
fig.subplots_adjust(wspace=0.2, right=1.8)
ax = fig.add_subplot(1, 2, 1)
ratio_heatmap('pickup')
ax = fig.add_subplot(1, 2, 2)
ratio_heatmap('dropoff')
plt.show()

This heatmap highlights the area in question, an area of mostly yellow where the fare per km travelled is consistently very high. Could this be due to tolls in the area?

We'll create a feature for use in our model that measures proximity to this area.

Finally, we examine whether the direction of travel relates to the fare paid.

In [None]:
train.query('distance > 5').pivot_table('fare_per_km', index='direction_bucket', aggfunc='mean').plot(figsize=(15,2))
plt.ylabel('Fare per km($)')
plt.xlabel('Direction')
plt.title('Fare Paid vs Direction')
plt.show()

If you travel up or down roads along the length of central NYC, you will be travelling on a bearing of ~20 degrees or ~-160 degrees depending on your direction. Fare per km being lowest on these bearings makes sense because we are using a point to point distance, meaning the real distance travelled will be closest to our measure when the journey is directly along a road - on other bearings our distance metric is more heavily underestimating the real distance that had to be travelled.

We will create an adjusted distance feature based on the bearing of the trip for use in our model.
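
One way to see why point to point distance understates some trips more than others is to picture an idealised street grid. The sketch below is an illustration only. It assumes a perfect grid whose main axis runs on a bearing of 20 degrees (taken from the estimate above) and that every trip follows the grid. `grid_distance` is a hypothetical helper and is not the feature used in the model, which relies on the empirical fare per km by bearing instead.

```python
import numpy as np

GRID_BEARING_DEG = 20  # assumed orientation of the street grid

def grid_distance(straight_line_m, bearing_deg, grid_bearing_deg=GRID_BEARING_DEG):
    """Distance travelled along an idealised grid for a given straight-line distance."""
    # Angle between the trip and the grid's main axis.
    theta = np.radians(bearing_deg - grid_bearing_deg)
    # The components along and across the grid add up to the path driven.
    return straight_line_m * (np.abs(np.cos(theta)) + np.abs(np.sin(theta)))

for bearing in (20, 40, 65, -160):
    print(bearing, round(float(grid_distance(1000, bearing)), 1))
```

Under these assumptions a trip along the grid covers exactly its straight-line distance, while a trip at 45 degrees to the grid covers about 1.41 times as much.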

# Final Modelling

We will now add all the features that might prove useful and train the model which we will use for our submission.

In [None]:
def modelling_features(df):
    df = shared_features(df)
    # using alternative representation of cyclic features
    df = df.assign(
        sin_time=np.sin(2*np.pi*df['time']/24),
        cos_time=np.cos(2*np.pi*df['time']/24),
        sin_direction=np.sin(2*np.pi*df['direction']/360),
        cos_direction=np.cos(2*np.pi*df['direction']/360),
        sin_dayofyear=np.sin(2*np.pi*df['dayofyear']/365),
        cos_dayofyear=np.cos(2*np.pi*df['dayofyear']/365),
        #weekday runs 0-6, so the cycle length is 7
        sin_weekday=np.sin(2*np.pi*df['weekday']/7),
        cos_weekday=np.cos(2*np.pi*df['weekday']/7),
        direction_bucket=pd.cut(df['direction'], bins=37, labels=False)
        ).drop(columns=['pickup_datetime', 'time', 'direction', 'weekday', 'dayofyear'])
    return df
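
The sine and cosine pairs place each cyclic value on a unit circle, so that the end of a cycle sits next to its start. A minimal sketch of the idea, independent of the notebook's dataframe (`encode_cyclic` is a hypothetical helper). It assumes the period equals the number of distinct values in the cycle, for example 7 for a weekday coded 0-6:

```python
import numpy as np

def encode_cyclic(values, period):
    """Map cyclic values onto the unit circle as (sin, cos) pairs."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])

weekdays = np.arange(7)  # Monday=0 ... Sunday=6, as in pandas dayofweek
encoded = encode_cyclic(weekdays, period=7)

# Distance between consecutive days, including the Sunday -> Monday wrap-around.
steps = np.linalg.norm(encoded - np.roll(encoded, -1, axis=0), axis=1)
print(steps)  # all seven steps are equal

# With period=6 the first and last day would land on the same point.
collapsed = encode_cyclic(weekdays, period=6)
print(np.allclose(collapsed[0], collapsed[6]))
```

Choosing a period one smaller than the number of distinct values makes the first and last value indistinguishable to the model.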

In [None]:
train = load_df(10000000, features='model')

test['pickup_datetime'] = test['pickup_datetime'].str.slice(0, 15)
test['pickup_datetime'] = pd.to_datetime(test['pickup_datetime'], utc=True, format='%Y-%m-%d %H:%M')
test = modelling_features(test)

train = (train
    .query(f'{test.pickup_longitude.min()-0.1} <= pickup_longitude <= {test.pickup_longitude.max()+0.1}')
    .query(f'{test.pickup_latitude.min()-0.1} <= pickup_latitude <= {test.pickup_latitude.max()+0.1}')
    .query(f'{test.dropoff_longitude.min()-0.1} <= dropoff_longitude <= {test.dropoff_longitude.max()+0.1}')
    .query(f'{test.dropoff_latitude.min()-0.1} <= dropoff_latitude <= {test.dropoff_latitude.max()+0.1}')
)

x_train, x_val, y_train, y_val = get_split_sets(train)

x_train['fare_per_km'] = y_train*1000/(x_train.distance+5)
fares_by_direction = x_train.query('5 < distance').groupby('direction_bucket')['fare_per_km'].mean()

x_train['adj_dist'] = [fares_by_direction[i] for i in x_train.direction_bucket]*x_train.distance/fares_by_direction.max()
x_val['adj_dist'] = [fares_by_direction[i] for i in x_val.direction_bucket]*x_val.distance/fares_by_direction.max()
test['adj_dist'] = [fares_by_direction[i] for i in test.direction_bucket]*test.distance/fares_by_direction.max()

x_train = x_train.drop(columns=['fare_per_km', 'direction_bucket'])
x_val = x_val.drop(columns=['direction_bucket'])
x_test = test.drop(columns=['key', 'direction_bucket'])

In [None]:
lin_final_model, lin_final_rmse, lin_final_pred = lin_model(x_train, x_val, y_train, y_val)

In [None]:
knn_cols = ['pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude']
k_choices = [18,24,30,40]
knn_final_model, knn_final_rmse, knn_final_pred = knn_model(x_train[knn_cols], x_val[knn_cols], y_train, y_val, k_choices)

In [None]:
lgbm_params = {
    'objective': 'regression',
    'boosting': 'gbdt',
    'reg_sqrt': True,
    'learning_rate': 0.03,
    'num_leaves': 1200,
    'max_depth': -1,
    'max_bin': 5000,
    'num_rounds': 5000,
    'early_stopping_round': 50,
    'metric': 'rmse'
}
lgbm_final_model, lgbm_final_rmse, lgbm_final_pred = lgbm_model(lgbm_params, x_train, x_val, y_train, y_val)

In [None]:
lgbm.plot_importance(lgbm_final_model)

In [None]:
print('Linear Regression RMSE', lin_final_rmse)
print('KNN RMSE', knn_final_rmse)
print('LightGBM RMSE', lgbm_final_rmse)

The features we added have resulted in much better performance by the linear regression model, though the features we extracted were not molded for use with linear regression. Techniques such as target encoding of the time features would probably prove more fruitful if this model was our primary focus.

Using KNN in a high dimensional space would hurt its performance and slow model fitting to a crawl, so I have rerun it on this larger subset using just the original pickup and dropoff coordinates.

The LightGBM model was the primary focus of our feature extraction efforts, and it now achieves an RMSE on the validation set of 3.393. Time of year, time of day, trip distance and proximity to JFK airport are the features this model makes the most use of.

We will see if ensembling the LightGBM and KNN models still yields any gains now that there is a gap in their individual performances.

In [None]:
d = {}
for a in np.linspace(0,1,101):
    final_preds_ave = (lgbm_final_pred*(1-a) + knn_final_pred * a)
    rmse = np.sqrt(mean_squared_error(y_val, final_preds_ave))
    d[a] = rmse
alpha = min(d, key=d.get)
print('Best weight to give KNN: ', alpha)

A weighted average ensemble with a weight of 0.06 for the KNN model results in the best performance on the validation set. We will use this weighting for our final submission.

In [None]:
lgbm_test_pred = lgbm_final_model.predict(x_test, num_iteration=lgbm_final_model.best_iteration)
knn_test_pred = knn_final_model.predict(x_test[['pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude']])
submission_pred = (lgbm_test_pred*(1-alpha) + knn_test_pred * alpha)
submission = pd.DataFrame({'key': test.key, 'fare_amount': submission_pred})
submission.to_csv('submission_10_10_20_comb.csv', index=False)

This submission achieves an RMSE of 2.93 on the leaderboard set.