# New York City Taxi Fare Prediction

We'll train a machine learning model to predict the fare for a taxi ride in New York City, given information such as the pickup date and time, pickup location, drop-off location, and number of passengers.

Dataset Link: https://www.kaggle.com/c/new-york-city-taxi-fare-prediction

## 1. Download the Dataset

- Install required libraries
- Download data from Kaggle
- View dataset files
- Load training set with Pandas
- Load test set with Pandas





### Install Required Libraries

In [273]:
# !pip install xgboost
import random
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import opendatasets as od

### Download Data from Kaggle

We'll use the opendatasets library: https://github.com/JovianML/opendatasets

In [275]:
dataset_url =  'https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/overview'
od.download(dataset_url)

Skipping, found downloaded files in "./new-york-city-taxi-fare-prediction" (use force=True to force download)


### View Dataset Files

In [277]:
data_dir = 'new-york-city-taxi-fare-prediction'
!ls -lh {data_dir}

total 5.4G
-rw-rw-r-- 1 lenny lenny  486 Feb 17 06:11 GCP-Coupons-Instructions.rtf
-rw-rw-r-- 1 lenny lenny 336K Feb 17 06:11 sample_submission.csv
-rw-rw-r-- 1 lenny lenny 960K Feb 17 06:11 test.csv
-rw-rw-r-- 1 lenny lenny 5.4G Feb 17 06:12 train.csv


In [278]:
# Training set
!head {data_dir}/train.csv

key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
2009-06-15 17:26:21.0000001,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.84161,40.712278,1
2010-01-05 16:52:16.0000002,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1
2011-08-18 00:35:00.00000049,5.7,2011-08-18 00:35:00 UTC,-73.982738,40.76127,-73.991242,40.750562,2
2012-04-21 04:30:42.0000001,7.7,2012-04-21 04:30:42 UTC,-73.98713,40.733143,-73.991567,40.758092,1
2010-03-09 07:51:00.000000135,5.3,2010-03-09 07:51:00 UTC,-73.968095,40.768008,-73.956655,40.783762,1
2011-01-06 09:50:45.0000002,12.1,2011-01-06 09:50:45 UTC,-74.000964,40.73163,-73.972892,40.758233,1
2012-11-20 20:35:00.0000001,7.5,2012-11-20 20:35:00 UTC,-73.980002,40.751662,-73.973802,40.764842,1
2012-01-04 17:22:00.00000081,16.5,2012-01-04 17:22:00 UTC,-73.9513,40.774138,-73.990095,40.751048,1
2012-12-03 13:10:00.000000125,9,2012-12-03 13:10:00 UTC,-74.006462,40.726713,-73.99

In [279]:
# Test set
!head {data_dir}/test.csv

key,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
2015-01-27 13:08:24.0000002,2015-01-27 13:08:24 UTC,-73.973320007324219,40.7638053894043,-73.981430053710938,40.74383544921875,1
2015-01-27 13:08:24.0000003,2015-01-27 13:08:24 UTC,-73.986862182617188,40.719383239746094,-73.998886108398438,40.739200592041016,1
2011-10-08 11:53:44.0000002,2011-10-08 11:53:44 UTC,-73.982524,40.75126,-73.979654,40.746139,1
2012-12-01 21:12:12.0000002,2012-12-01 21:12:12 UTC,-73.98116,40.767807,-73.990448,40.751635,1
2012-12-01 21:12:12.0000003,2012-12-01 21:12:12 UTC,-73.966046,40.789775,-73.988565,40.744427,1
2012-12-01 21:12:12.0000005,2012-12-01 21:12:12 UTC,-73.960983,40.765547,-73.979177,40.740053,1
2011-10-06 12:10:20.0000001,2011-10-06 12:10:20 UTC,-73.949013,40.773204,-73.959622,40.770893,1
2011-10-06 12:10:20.0000003,2011-10-06 12:10:20 UTC,-73.777282,40.646636,-73.985083,40.759368,1
2011-10-06 12:10:20.0000002,2011-10-06 12:10:20 UTC,-74.01409

In [280]:
# Sample submission file
!head {data_dir}/sample_submission.csv

key,fare_amount
2015-01-27 13:08:24.0000002,11.35
2015-01-27 13:08:24.0000003,11.35
2011-10-08 11:53:44.0000002,11.35
2012-12-01 21:12:12.0000002,11.35
2012-12-01 21:12:12.0000003,11.35
2012-12-01 21:12:12.0000005,11.35
2011-10-06 12:10:20.0000001,11.35
2011-10-06 12:10:20.0000003,11.35
2011-10-06 12:10:20.0000002,11.35


In [281]:
!wc -l {data_dir}/train.csv

55423856 new-york-city-taxi-fare-prediction/train.csv


In [282]:
!wc -l {data_dir}/test.csv

9914 new-york-city-taxi-fare-prediction/test.csv


Observations:

- This is a supervised learning regression problem
- Training data is 5.4 GB in size
- Training data has ~55.4 million rows
- Test set is much smaller (< 10,000 rows)
- The training set has 8 columns:
    - `key` (a unique identifier)
    - `fare_amount` (target column)
    - `pickup_datetime`
    - `pickup_longitude`
    - `pickup_latitude`
    - `dropoff_longitude`
    - `dropoff_latitude`
    - `passenger_count`
- The test set has all columns except the target column `fare_amount`.
- The submission file should contain the `key` and `fare_amount` for each test sample.


### Loading Training Set

Loading the entire dataset into Pandas is going to be slow, so we can use the following optimizations:

- Ignore the `key` column
- Parse pickup datetime while loading data 
- Specify data types for other columns
   - `float32` for geo coordinates
   - `float32` for fare amount
   - `float32` for passenger count (`uint8` would be more compact, but the code below uses `float32`)
- Work with a ~2% sample of the data (~1.16 million rows)

We can apply these optimizations while using [`pd.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)

In [463]:
sample_frac = 0.021

In [465]:
%%time
selected_cols = 'fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count'.split(',')
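# Note: 'dropoff_latitude' is not listed here, so pandas loads it as float64
# (visible in the df.info() output further down).
dtypes = {
    'fare_amount': 'float32',
    'pickup_longitude': 'float32',
    'pickup_latitude': 'float32',
    'dropoff_longitude': 'float32',
    'passenger_count': 'float32'
}

def skip_row(row_idx):
    # Always keep row 0, which is the header line
    if row_idx == 0:
        return False
    # Skip each data row with probability (1 - sample_frac)
    return random.random() > sample_frac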

random.seed(42)
df = pd.read_csv(
    data_dir+'/train.csv',
    usecols=selected_cols,
    dtype=dtypes,
    parse_dates=['pickup_datetime'],
    skiprows=skip_row
)

CPU times: user 1min 20s, sys: 4.58 s, total: 1min 24s
Wall time: 1min 29s
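
The sampling trick above can be tried on a small in-memory CSV first. The following is a minimal sketch, assuming a toy two-column file in place of `train.csv`; the helper `make_skip_row` and the toy data are hypothetical and only illustrate how a callable passed to `skiprows` keeps the header and a random fraction of rows.

```python
import io
import random

import pandas as pd

# Toy CSV standing in for train.csv: a header plus 1,000 data rows (made-up values).
csv_text = "fare_amount,passenger_count\n" + "".join(
    f"{i}.5,{i % 6 + 1}\n" for i in range(1000)
)

def make_skip_row(sample_frac):
    """Return a callable for `skiprows` that keeps roughly sample_frac of the rows."""
    def skip_row(row_idx):
        if row_idx == 0:
            return False  # never skip the header line
        return random.random() > sample_frac
    return skip_row

# Seeding the generator makes the sample reproducible across runs.
random.seed(42)
sample_df = pd.read_csv(io.StringIO(csv_text), skiprows=make_skip_row(0.1))

random.seed(42)
sample_df_again = pd.read_csv(io.StringIO(csv_text), skiprows=make_skip_row(0.1))

print(len(sample_df), sample_df.equals(sample_df_again))
```

Because each row is kept independently, the sample size is only approximately `sample_frac` times the number of rows, which is why the full dataset yields about 1.16 million rows rather than an exact count.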


In [466]:
df

Unnamed: 0,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,4.0,2014-12-06 20:36:22+00:00,-73.979813,40.751904,-73.979446,40.755481,1.0
1,10.0,2014-11-12 12:40:29+00:00,-74.002579,40.739571,-73.994583,40.760682,1.0
2,8.0,2013-01-17 17:22:00+00:00,0.000000,0.000000,0.000000,0.000000,2.0
3,19.0,2013-09-17 04:22:00+00:00,-73.987213,40.729324,-73.931984,40.697207,1.0
4,8.9,2011-06-15 18:07:00+00:00,-73.996330,40.753223,-73.978897,40.766963,3.0
...,...,...,...,...,...,...,...
1162445,7.0,2013-12-06 13:46:56+00:00,-73.986145,40.772259,-73.976654,40.785374,4.0
1162446,4.5,2013-02-17 22:27:00+00:00,-73.992531,40.748619,-73.998436,40.740142,1.0
1162447,14.5,2013-01-27 12:41:00+00:00,-74.012115,40.706635,-73.988724,40.756217,1.0
1162448,6.0,2014-10-18 07:51:00+00:00,-73.997681,40.724380,-73.994148,40.717797,1.0


### Load Test Set

For the test set, we'll simply provide the data types.

In [468]:
test_df = pd.read_csv(data_dir+'/test.csv', dtype=dtypes, parse_dates=['pickup_datetime'])

In [469]:
test_df

Unnamed: 0,key,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,2015-01-27 13:08:24.0000002,2015-01-27 13:08:24+00:00,-73.973320,40.763805,-73.981430,40.743835,1.0
1,2015-01-27 13:08:24.0000003,2015-01-27 13:08:24+00:00,-73.986862,40.719383,-73.998886,40.739201,1.0
2,2011-10-08 11:53:44.0000002,2011-10-08 11:53:44+00:00,-73.982521,40.751259,-73.979652,40.746139,1.0
3,2012-12-01 21:12:12.0000002,2012-12-01 21:12:12+00:00,-73.981163,40.767807,-73.990448,40.751635,1.0
4,2012-12-01 21:12:12.0000003,2012-12-01 21:12:12+00:00,-73.966049,40.789776,-73.988564,40.744427,1.0
...,...,...,...,...,...,...,...
9909,2015-05-10 12:37:51.0000002,2015-05-10 12:37:51+00:00,-73.968124,40.796997,-73.955643,40.780388,6.0
9910,2015-01-12 17:05:51.0000001,2015-01-12 17:05:51+00:00,-73.945511,40.803600,-73.960213,40.776371,6.0
9911,2015-04-19 20:44:15.0000001,2015-04-19 20:44:15+00:00,-73.991600,40.726608,-73.789742,40.647011,6.0
9912,2015-01-31 01:05:19.0000005,2015-01-31 01:05:19+00:00,-73.985573,40.735432,-73.939178,40.801731,6.0


## 2. Explore the Dataset

- Basic info about training set
- Basic info about test set
- Exploratory data analysis & visualization
- Ask & answer questions

### Training Set

In [472]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1162450 entries, 0 to 1162449
Data columns (total 7 columns):
 #   Column             Non-Null Count    Dtype              
---  ------             --------------    -----              
 0   fare_amount        1162450 non-null  float32            
 1   pickup_datetime    1162450 non-null  datetime64[ns, UTC]
 2   pickup_longitude   1162450 non-null  float32            
 3   pickup_latitude    1162450 non-null  float32            
 4   dropoff_longitude  1162444 non-null  float32            
 5   dropoff_latitude   1162444 non-null  float64            
 6   passenger_count    1162450 non-null  float32            
dtypes: datetime64[ns, UTC](1), float32(5), float64(1)
memory usage: 39.9 MB


In [473]:
df.isna().sum()

fare_amount          0
pickup_datetime      0
pickup_longitude     0
pickup_latitude      0
dropoff_longitude    6
dropoff_latitude     6
passenger_count      0
dtype: int64

In [474]:
df.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,1162450.0,1162450.0,1162450.0,1162444.0,1162444.0,1162450.0
mean,11.34398,-72.51142,39.91786,-72.51435,39.92763,1.684722
std,9.780316,12.17245,8.149201,12.42203,9.197481,1.342856
min,-300.0,-3439.245,-3084.49,-3356.73,-3084.324,0.0
25%,6.0,-73.99203,40.73488,-73.9914,40.73397,1.0
50%,8.5,-73.98181,40.75262,-73.98017,40.75312,1.0
75%,12.5,-73.96713,40.76706,-73.96365,40.76807,2.0
max,499.0,2420.209,2560.143,3440.82,3351.403,208.0


In [475]:
df.pickup_datetime.min(), df.pickup_datetime.max()

(Timestamp('2009-01-01 00:11:46+0000', tz='UTC'),
 Timestamp('2015-06-30 23:59:54+0000', tz='UTC'))

Observations about training data:

- ~1.16 million rows (a ~2% sample), as expected
- 6 rows have missing drop-off coordinates (`dropoff_longitude` and `dropoff_latitude`)
- `fare_amount` ranges from \$-300.0 to \$499.0
- `passenger_count` ranges from 0 to 208
- There seem to be some errors in the latitude & longitude values
- Dates range from 1st Jan 2009 to 30th June 2015
- The dataset takes up ~40 MB of space in the RAM

We may need to deal with outliers and data entry errors before we train our model.

### Test Set

In [478]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9914 entries, 0 to 9913
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype              
---  ------             --------------  -----              
 0   key                9914 non-null   object             
 1   pickup_datetime    9914 non-null   datetime64[ns, UTC]
 2   pickup_longitude   9914 non-null   float32            
 3   pickup_latitude    9914 non-null   float32            
 4   dropoff_longitude  9914 non-null   float32            
 5   dropoff_latitude   9914 non-null   float64            
 6   passenger_count    9914 non-null   float32            
dtypes: datetime64[ns, UTC](1), float32(4), float64(1), object(1)
memory usage: 387.4+ KB


In [479]:
test_df.isna().sum()

key                  0
pickup_datetime      0
pickup_longitude     0
pickup_latitude      0
dropoff_longitude    0
dropoff_latitude     0
passenger_count      0
dtype: int64

In [480]:
test_df.describe()

Unnamed: 0,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,9914.0,9914.0,9914.0,9914.0,9914.0
mean,-73.974716,40.751041,-73.973656,40.751743,1.671273
std,0.042799,0.033542,0.039093,0.035435,1.278756
min,-74.25219,40.573143,-74.263245,40.568973,1.0
25%,-73.9925,40.736125,-73.991249,40.735254,1.0
50%,-73.982327,40.753052,-73.980015,40.754065,1.0
75%,-73.968012,40.767113,-73.964062,40.768757,2.0
max,-72.986534,41.709557,-72.990967,41.696683,6.0


In [481]:
test_df.pickup_datetime.min(),test_df.pickup_datetime.max()

(Timestamp('2009-01-01 11:04:24+0000', tz='UTC'),
 Timestamp('2015-06-30 20:03:50+0000', tz='UTC'))

Some observations about the test set:

- 9914 rows of data
- No missing values
- No obvious data entry errors
- 1 to 6 passengers (we can limit training data to this range)
- Latitudes lie between 40 and 42
- Longitudes lie between -75 and -72
- Pickup dates range from Jan 1st 2009 to Jun 30th 2015 (same as training set)

We can use the ranges of the test set to drop outliers/invalid data from the training set.
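
One way this filtering might look is sketched below. The latitude, longitude and passenger ranges come from the test set observations above; the fare bounds (\$1 to \$500) are an assumption, since the test set has no fares, and the function name `remove_outliers` is hypothetical.

```python
import pandas as pd

def remove_outliers(df):
    """Keep rows whose values fall inside ranges suggested by the test set."""
    mask = (
        df["fare_amount"].between(1.0, 500.0)  # assumed bounds, not from the test set
        & df["pickup_longitude"].between(-75, -72)
        & df["dropoff_longitude"].between(-75, -72)
        & df["pickup_latitude"].between(40, 42)
        & df["dropoff_latitude"].between(40, 42)
        & df["passenger_count"].between(1, 6)
    )
    return df[mask]

# Toy rows modelled on problems seen in the training sample:
# zeroed coordinates, a negative fare, and 208 passengers.
toy_df = pd.DataFrame({
    "fare_amount": [4.0, 8.0, -300.0, 7.0],
    "pickup_longitude": [-73.979813, 0.0, -73.987213, -73.986145],
    "pickup_latitude": [40.751904, 0.0, 40.729324, 40.772259],
    "dropoff_longitude": [-73.979446, 0.0, -73.931984, -73.976654],
    "dropoff_latitude": [40.755481, 0.0, 40.697207, 40.785374],
    "passenger_count": [1, 2, 1, 208],
})

clean_df = remove_outliers(toy_df)
print(clean_df)
```

Only the first toy row survives; the other three each violate one of the ranges.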

### Exploratory Data Analysis and Visualization

**Exercise**: Create graphs (histograms, line charts, bar charts, scatter plots, box plots, geo maps etc.) to study the distribution of values in each column, and the relationship of each input column to the target.


## 3. Prepare Dataset for Training

- Split Training & Validation Set
- Fill/Remove Missing Values
- Extract Inputs & Outputs
   - Training
   - Validation
   - Test

### Split Training & Validation Set

We'll set aside 20% of the training data as the validation set, to evaluate the models we train on previously unseen data. 

Since the test set and training set have the same date ranges, we can pick a random 20% fraction.

In [486]:
from sklearn.model_selection import train_test_split

In [487]:
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

### Fill/Remove Missing Values

Our sample has only a handful of rows with missing values (6 rows lack drop-off coordinates). Since we have a lot of training data, we can simply drop these rows instead of trying to fill them.

In [489]:
train_df = train_df.dropna()
val_df = val_df.dropna()

In [490]:
train_df.isna().sum()

fare_amount          0
pickup_datetime      0
pickup_longitude     0
pickup_latitude      0
dropoff_longitude    0
dropoff_latitude     0
passenger_count      0
dtype: int64

### Extract Inputs and Outputs

In [492]:
df.columns

Index(['fare_amount', 'pickup_datetime', 'pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'passenger_count'],
      dtype='object')

In [493]:
input_cols = ['pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'passenger_count']
target_col = 'fare_amount'

#### Training

In [495]:
train_inputs = train_df[input_cols]
train_targets = train_df[target_col]

In [496]:
train_inputs

Unnamed: 0,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
789591,-73.956841,40.770977,-73.956848,40.770957,1.0
635325,-74.000282,40.742729,-73.956993,40.771600,1.0
7279,-73.974815,40.757458,-73.990517,40.751049,2.0
1153110,-73.992493,40.689709,-73.861954,40.768505,1.0
583881,-73.981110,40.753277,-73.998474,40.755687,5.0
...,...,...,...,...,...
110268,-73.988640,40.722504,-73.996338,40.734616,1.0
259178,-73.966225,40.753452,-73.980568,40.787482,1.0
131932,-73.782219,40.644695,-73.761742,40.721727,6.0
671155,-73.994278,40.756229,-74.002220,40.709110,1.0


In [497]:
train_targets

789591     13.000000
635325     15.500000
7279        9.000000
1153110    25.299999
583881      9.000000
             ...    
110268      5.700000
259178     12.100000
131932     28.000000
671155     13.000000
121958      9.300000
Name: fare_amount, Length: 929954, dtype: float32

#### Validation

In [499]:
val_inputs = val_df[input_cols]
val_targets = val_df[target_col]

In [500]:
val_inputs

Unnamed: 0,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
762494,0.000000,0.000000,0.000000,0.000000,1.0
963013,-73.975189,40.751469,-74.011002,40.708541,1.0
774022,0.000000,0.000000,0.000000,0.000000,1.0
580088,-73.987564,40.732635,-73.971443,40.749628,1.0
333716,-73.977936,40.752434,-73.994263,40.766364,1.0
...,...,...,...,...,...
787517,-73.994217,40.751373,-73.988800,40.722212,2.0
1030205,-73.963928,40.776848,-73.976318,40.744305,2.0
16709,-73.967140,40.772366,-73.986137,40.776920,1.0
880926,-73.959610,40.772522,-73.871796,40.775018,5.0


In [501]:
val_targets

762494      5.00
963013     23.50
774022      9.00
580088     11.00
333716     11.50
           ...  
787517     12.10
1030205    11.50
16709       7.30
880926     30.67
741524      5.00
Name: fare_amount, Length: 232490, dtype: float32

#### Test

In [503]:
test_inputs = test_df[input_cols]

In [504]:
test_inputs

Unnamed: 0,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,-73.973320,40.763805,-73.981430,40.743835,1.0
1,-73.986862,40.719383,-73.998886,40.739201,1.0
2,-73.982521,40.751259,-73.979652,40.746139,1.0
3,-73.981163,40.767807,-73.990448,40.751635,1.0
4,-73.966049,40.789776,-73.988564,40.744427,1.0
...,...,...,...,...,...
9909,-73.968124,40.796997,-73.955643,40.780388,6.0
9910,-73.945511,40.803600,-73.960213,40.776371,6.0
9911,-73.991600,40.726608,-73.789742,40.647011,6.0
9912,-73.985573,40.735432,-73.939178,40.801731,6.0


## 4. Train Hardcoded & Baseline Models

- Hardcoded model: always predict average fare
- Baseline model: Linear regression 

The competition uses root mean squared error (RMSE) for evaluation:
https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/overview/evaluation
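
RMSE is the square root of the average squared difference between predictions and targets, $\sqrt{\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2}$. The sketch below computes it with NumPy; it is an illustration rather than Kaggle's scoring code, and the helper name `rmse` and the example predictions are made up (the targets are the first three fares from the training file preview).

```python
import numpy as np

def rmse(targets, preds):
    """Root mean squared error between two equal-length sequences."""
    targets = np.asarray(targets, dtype=np.float64)
    preds = np.asarray(preds, dtype=np.float64)
    return float(np.sqrt(np.mean((targets - preds) ** 2)))

example_targets = [4.5, 16.9, 5.7]
example_preds = [5.5, 14.9, 5.7]  # hypothetical predictions

# Errors are -1, 2 and 0, so the mean squared error is 5/3.
example_rmse = rmse(example_targets, example_preds)
print(example_rmse)
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.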

### Train & Evaluate Hardcoded Model

Let's create a simple model that always predicts the average.

In [507]:
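class MeanRegressor:
    """Hardcoded model that ignores the inputs and always predicts the mean target."""

    def fit(self, inputs, targets):
        # Remember the average of the training targets
        self.mean = targets.mean()

    def predict(self, inputs):
        # Return the stored mean once for every input row
        return np.full(inputs.shape[0], self.mean)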
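
The mean is a natural choice for a hardcoded model: among all constant predictions, the mean of the targets gives the lowest RMSE. The sketch below checks this numerically on the first five fares from the training file preview; the helper `constant_rmse` and the grid of candidate constants are assumptions for illustration.

```python
import numpy as np

# The first few fares from the training file preview shown earlier.
fares = np.array([4.5, 16.9, 5.7, 7.7, 5.3])

def constant_rmse(targets, constant):
    """RMSE of a model that predicts `constant` for every row."""
    return float(np.sqrt(np.mean((targets - constant) ** 2)))

# Try constants from 0.00 to 20.00 in steps of 0.01 and keep the best one.
candidates = np.linspace(0.0, 20.0, 2001)
errors = [constant_rmse(fares, c) for c in candidates]
best_constant = float(candidates[int(np.argmin(errors))])

print(best_constant, fares.mean())
```

The best constant on the grid lands on the mean of the fares, so any model worth keeping should beat the RMSE of `MeanRegressor`.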

In [508]:
mean_model = MeanRegressor()

In [509]:
mean_model.fit(train_inputs, train_targets)

In [510]:
mean_model.mean

11.346068

In [511]:
train_preds = mean_model.predict(train_inputs)
train_preds

array([11.346068, 11.346068, 11.346068, ..., 11.346068, 11.346068,
       11.346068], dtype=float32)

In [512]:
val_preds = mean_model.predict(val_inputs)
val_preds

array([11.346068, 11.346068, 11.346068, ..., 11.346068, 11.346068,
       11.346068], dtype=float32)

In [513]:
from sklearn.metrics import mean_squared_error

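# squared=False returns the RMSE instead of the MSE (recent scikit-learn
# releases deprecate this argument in favor of root_mean_squared_error)
train_rmse = mean_squared_error(train_targets, train_preds, squared=False)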
train_rmse



9.792213

In [514]:
val_rmse = mean_squared_error(val_targets, val_preds, squared=False)
val_rmse



9.75117

### Train & Evaluate Baseline Model

We'll train a linear regression model as our baseline, which tries to express the target as a weighted sum of the inputs.

In [516]:
from sklearn.linear_model import LinearRegression

linreg_model = LinearRegression()
linreg_model.fit(train_inputs,train_targets)
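
To make "weighted sum of the inputs" concrete, the sketch below fits weights and a bias by ordinary least squares on synthetic data and then predicts with `X @ weights + bias`. It uses `np.linalg.lstsq` rather than scikit-learn, so it illustrates the idea and is not the code path of `LinearRegression`; the data and the names `true_w` and `true_b` are made up.

```python
import numpy as np

# Synthetic, noise-free data: the target is an exact weighted sum of 3 inputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
true_b = 3.0
y = X @ true_w + true_b

# Append a column of ones so the bias is learned as one extra weight.
X_with_ones = np.hstack([X, np.ones((X.shape[0], 1))])
solution, *_ = np.linalg.lstsq(X_with_ones, y, rcond=None)
weights, bias = solution[:-1], solution[-1]

# A prediction is the weighted sum of the inputs plus the bias.
preds = X @ weights + bias
print(weights, bias)
```

On the fitted scikit-learn model, the corresponding values are available as `linreg_model.coef_` and `linreg_model.intercept_`.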

In [517]:
train_inputs

Unnamed: 0,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
789591,-73.956841,40.770977,-73.956848,40.770957,1.0
635325,-74.000282,40.742729,-73.956993,40.771600,1.0
7279,-73.974815,40.757458,-73.990517,40.751049,2.0
1153110,-73.992493,40.689709,-73.861954,40.768505,1.0
583881,-73.981110,40.753277,-73.998474,40.755687,5.0
...,...,...,...,...,...
110268,-73.988640,40.722504,-73.996338,40.734616,1.0
259178,-73.966225,40.753452,-73.980568,40.787482,1.0
131932,-73.782219,40.644695,-73.761742,40.721727,6.0
671155,-73.994278,40.756229,-74.002220,40.709110,1.0


In [518]:
train_preds = linreg_model.predict(train_inputs)
train_preds

array([11.26425679, 11.26465027, 11.36480324, ..., 11.76877688,
       11.26429269, 11.26426526])

In [519]:
val_preds = linreg_model.predict(val_inputs)
val_preds

array([11.89691619, 11.26425258, 11.89691619, ..., 11.2641173 ,
       11.66677872, 11.26420928])

In [520]:
train_rmse = mean_squared_error(train_targets, train_preds, squared=False)
train_rmse



9.790707109258614

In [521]:
val_rmse = mean_squared_error(val_targets, val_preds, squared=False)
val_rmse



9.7506941243395

## 5. Make Predictions and Submit to Kaggle

- Make predictions for test set
- Generate submissions CSV
- Submit to Kaggle
- Record in experiment tracking sheet

In [523]:
test_inputs

Unnamed: 0,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,-73.973320,40.763805,-73.981430,40.743835,1.0
1,-73.986862,40.719383,-73.998886,40.739201,1.0
2,-73.982521,40.751259,-73.979652,40.746139,1.0
3,-73.981163,40.767807,-73.990448,40.751635,1.0
4,-73.966049,40.789776,-73.988564,40.744427,1.0
...,...,...,...,...,...
9909,-73.968124,40.796997,-73.955643,40.780388,6.0
9910,-73.945511,40.803600,-73.960213,40.776371,6.0
9911,-73.991600,40.726608,-73.789742,40.647011,6.0
9912,-73.985573,40.735432,-73.939178,40.801731,6.0


In [524]:
test_preds = linreg_model.predict(test_inputs)

In [525]:
test_preds

array([11.26425638, 11.26466075, 11.26441572, ..., 11.76831537,
       11.7674002 , 11.76692227])

In [526]:
def generate_submission(test_preds, fname):
    sub_df = pd.read_csv(data_dir+'/sample_submission.csv')
    sub_df['fare_amount'] = test_preds
    sub_df.to_csv(fname, index=False)

In [527]:
generate_submission(test_preds, 'linreg_submission.csv')
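
Before uploading, it can help to sanity-check the generated file. The sketch below is one possible check, assuming the submission must have exactly the `key` and `fare_amount` columns with one row per test sample; the helper `find_submission_problems` is hypothetical and the toy keys are taken from the sample submission preview.

```python
import numpy as np
import pandas as pd

def find_submission_problems(sub_df, expected_keys):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if list(sub_df.columns) != ["key", "fare_amount"]:
        problems.append("columns must be exactly ['key', 'fare_amount']")
        return problems
    if list(sub_df["key"]) != list(expected_keys):
        problems.append("keys do not match the test set")
    if sub_df["fare_amount"].isna().any():
        problems.append("fare_amount contains missing values")
    return problems

keys = ["2015-01-27 13:08:24.0000002", "2015-01-27 13:08:24.0000003"]

good_sub = pd.DataFrame({"key": keys, "fare_amount": [11.35, 11.35]})
bad_sub = pd.DataFrame({"key": keys, "fare_amount": [11.35, np.nan]})

print(find_submission_problems(good_sub, keys))
print(find_submission_problems(bad_sub, keys))
```

In the notebook, the same check could be run on `pd.read_csv('linreg_submission.csv')` with `test_df['key']` as the expected keys.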