<a href="https://colab.research.google.com/github/yugal82/Machine-Learning/blob/master/Projects/NYC_Taxi_fare_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install opendatasets

Collecting opendatasets
  Downloading opendatasets-0.1.22-py3-none-any.whl (15 kB)
Installing collected packages: opendatasets
Successfully installed opendatasets-0.1.22


In [2]:
import opendatasets  as od

In [3]:
dataset_url = 'https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/'
od.download(dataset_url)

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: yugalkhanter
Your Kaggle Key: ··········
Downloading new-york-city-taxi-fare-prediction.zip to ./new-york-city-taxi-fare-prediction


100%|██████████| 1.56G/1.56G [00:21<00:00, 77.0MB/s]



Extracting archive ./new-york-city-taxi-fare-prediction/new-york-city-taxi-fare-prediction.zip to ./new-york-city-taxi-fare-prediction


In [4]:
import pandas as pd
import numpy as np
import random

In [5]:
selected_cols = "fare_amount,pickup_datetime,pickup_latitude,pickup_longitude,dropoff_longitude,dropoff_latitude,passenger_count".split(",")
selected_cols

['fare_amount',
 'pickup_datetime',
 'pickup_latitude',
 'pickup_longitude',
 'dropoff_longitude',
 'dropoff_latitude',
 'passenger_count']

In [6]:
dtypes = {
    'fare_amount': 'float32',
    'pickup_latitude': 'float32',
    'pickup_longitude': 'float32',
    'dropoff_longitude': 'float32',
    'dropoff_latitude': 'float32',
    'passenger_count': 'uint8'
}

### Loading Training Set

Loading the entire dataset into Pandas is going to be slow, so we can use the following optimizations:

- Ignore the `key` column
- Parse pickup datetime while loading data
- Specify data types for other columns
   - `float32` for geo coordinates
   - `float32` for fare amount
   - `uint8` for passenger count
- Work with a 1% sample of the data (~500k rows)

We can apply these optimizations while using [`pd.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)

In [7]:
random.seed(8)
def skip_row(row_idx):
  if row_idx == 0:
    return False
  return random.random() > 0.01

train_ds = pd.read_csv(
    "/content/new-york-city-taxi-fare-prediction/train.csv",
    usecols=selected_cols,
    parse_dates=['pickup_datetime'],
    dtype=dtypes,
    skiprows=skip_row
)

**Load Test Set**

For the test set, we'll simply provide the data types.

In [81]:
test_df = pd.read_csv("/content/new-york-city-taxi-fare-prediction/test.csv", dtype=dtypes, parse_dates=["pickup_datetime"])

## Explore the Dataset

- Basic info about training set
- Basic info about test set
- Exploratory data analysis & visualization

In [46]:
# train_ds.head(10)
print("Describe: \n", train_ds.describe())
print("Info: \n", train_ds.info())
print("Null values: \n", train_ds.isna().sum())

Describe: 
          fare_amount  pickup_longitude  pickup_latitude  dropoff_longitude  \
count  553708.000000     553708.000000    553708.000000      553699.000000   
mean       11.343386        -72.525970        39.912071         -72.497482   
std         9.744153         13.166656         7.654707          11.747736   
min       -52.000000      -3383.284912     -2555.488037       -1301.503662   
25%         6.000000        -73.992088        40.734921         -73.991394   
50%         8.500000        -73.981827        40.752670         -73.980141   
75%        12.500000        -73.967079        40.767113         -73.963661   
max       450.000000        728.531738       430.516663        2497.105713   

       dropoff_latitude  passenger_count  pickup_datetime_year  \
count     553699.000000    553708.000000         553708.000000   
mean          39.920895         1.686246           2011.738162   
std            8.819188         1.310580              1.858516   
min        -2475.7185

In [47]:
# test_ds.head(10)
print("Describe: \n", test_df.describe())
print("Info: \n", test_df.info())
print("Null values: \n", test_df.isna().sum())

Describe: 
        pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  \
count       9914.000000      9914.000000        9914.000000       9914.000000   
mean         -73.974716        40.751041         -73.973656         40.751740   
std            0.042774         0.033541           0.039072          0.035435   
min          -74.252190        40.573143         -74.263245         40.568974   
25%          -73.992500        40.736125         -73.991249         40.735253   
50%          -73.982327        40.753052         -73.980015         40.754065   
75%          -73.968012        40.767113         -73.964062         40.768757   
max          -72.986534        41.709557         -72.990967         41.696682   

       passenger_count  
count      9914.000000  
mean          1.671273  
std           1.278747  
min           1.000000  
25%           1.000000  
50%           1.000000  
75%           2.000000  
max           6.000000  
<class 'pandas.core.frame.DataFra

## Prepare Dataset for Training

- Split Training & Validation Set
- Fill/Remove Missing Values
- Extract Inputs & Outputs
   - Training
   - Validation
   - Test

In [48]:
from sklearn.model_selection import train_test_split

### Split Training & Validation Set

We'll set aside 20% of the training data as the validation set, to evaluate the models we train on previously unseen data.

Since the test set and training set have the same date ranges, we can pick a random 20% fraction.

In [49]:
train_df, val_df = train_test_split(train_ds, test_size=0.2, random_state=8)

In [50]:
len(train_df), len(val_df)

(442966, 110742)

In [51]:
train_df = train_df.dropna()
val_df = val_df.dropna()

In [52]:
train_df.columns

Index(['fare_amount', 'pickup_datetime', 'pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'passenger_count',
       'pickup_datetime_year', 'pickup_datetime_month', 'pickup_datetimeday',
       'pickup_datetime_weekday', 'pickup_datetime_hour'],
      dtype='object')

In [53]:
input_features = ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'passenger_count']
target_label = "fare_amount"

In [54]:
train_ds_input = train_df[input_features]
train_ds_target = train_df[target_label]

In [55]:
print(train_ds_input)
print(train_ds_target)

        pickup_longitude  pickup_latitude  dropoff_longitude  \
327779        -74.011520        40.708054         -73.993057   
509461        -73.974068        40.753933         -73.978043   
456430        -73.990227        40.751686         -73.986275   
284224        -73.990784        40.755768         -73.973396   
198663        -73.991020        40.742054         -73.998093   
...                  ...              ...                ...   
403592        -73.955635        40.779488         -73.951050   
324570        -74.007660        40.709560         -73.999107   
231557        -73.979118        40.787209         -73.960762   
149489        -73.989250        40.731621         -73.982498   
550228        -73.972328        40.790688         -73.978043   

        dropoff_latitude  passenger_count  
327779         40.742462                1  
509461         40.747253                5  
456430         40.744461                6  
284224         40.763702                1  
198663     

In [56]:
val_ds_input = val_df[input_features]
val_ds_target = val_df[target_label]

In [57]:
print(val_ds_input)
print(val_ds_target)

        pickup_longitude  pickup_latitude  dropoff_longitude  \
16227         -73.961151        40.768848         -73.966934   
449789        -73.955368        40.782791         -73.975891   
269481        -73.863579        40.770000         -74.000244   
141054        -73.997261        40.724819         -73.972557   
376590        -73.999847        40.726799         -73.981049   
...                  ...              ...                ...   
512280        -73.980789        40.779812         -73.989197   
123562        -73.982407        40.764572         -73.984886   
452177          0.000000         0.000000           0.000000   
414910        -73.995934        40.726151         -73.941605   
46335         -73.961617        40.719326         -74.000954   

        dropoff_latitude  passenger_count  
16227          40.767139                1  
449789         40.754559                1  
269481         40.714333                1  
141054         40.753582                1  
376590     

In [58]:
test_ds = test_df[input_features]

## Train Hardcoded & Baseline Models

- Hardcoded model: always predict average fare
- Baseline model: Linear regression

For evaluation the dataset uses RMSE error:
https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/overview/evaluation

In [59]:
class MeanRegressor:
  def fit(self, inputs, targets):
    self.mean = targets.mean()

  def predict(self, inputs):
    return np.full(inputs.shape[0], self.mean)

In [60]:
mean_model = MeanRegressor()

In [61]:
mean_model.fit(train_ds_input, train_ds_target)

In [62]:
mean_model.mean

11.346553

In [63]:
train_preds = mean_model.predict(train_ds_input)
train_preds

array([11.346553, 11.346553, 11.346553, ..., 11.346553, 11.346553,
       11.346553], dtype=float32)

In [64]:
train_ds_target

327779     8.9
509461     3.3
456430     6.0
284224     7.0
198663    10.5
          ... 
403592     4.9
324570     6.9
231557     8.1
149489    17.5
550228     4.0
Name: fare_amount, Length: 442959, dtype: float32

In [65]:
val_preds = mean_model.predict(val_ds_input)
val_preds

array([11.346553, 11.346553, 11.346553, ..., 11.346553, 11.346553,
       11.346553], dtype=float32)

## Error between predicted and actual

Above, the predicted mean by the `MeanRegressor` model is 11.364 for every record in the training dataset.

And, the actual target is 8.9, 3.3, 6.0, etc

1. The error for first record is 11.364 - 8.9 = 2.464
2. The error for second record is 11.364 - 3.3 = 8.064
3. The error for third record is 11.364 - 6.0 = 5.364

and so on for each record.

We make use of root mean squared error to see how badly our model performs.

In [66]:
from sklearn.metrics import mean_squared_error

In [67]:
def rmse(actual, predictions):
  return mean_squared_error(actual, predictions, squared=False)

In [68]:
train_rmse = rmse(train_ds_target, train_preds)
train_rmse

9.771893

In [69]:
val_rmse = rmse(val_ds_target, val_preds)
val_rmse

9.632431

## RMSE = 9.771893

That means that, on average, for each record, the prediction of the model of fare amount is approx. 9.7 dollars off the actual fare amount, which is quite bad.

### Train & Evaluate Baseline Model

We'll traina linear regression model as our baseline, which tries to express the target as a weighted sum of the inputs.

In [70]:
from sklearn.linear_model import LinearRegression
linreg_model = LinearRegression()
linreg_model.fit(train_ds_input, train_ds_target)
train_preds = linreg_model.predict(train_ds_input)
train_preds

array([11.246101 , 11.730324 , 11.8510895, ..., 11.73045  , 11.488152 ,
       11.488618 ], dtype=float32)

In [71]:
train_ds_target

327779     8.9
509461     3.3
456430     6.0
284224     7.0
198663    10.5
          ... 
403592     4.9
324570     6.9
231557     8.1
149489    17.5
550228     4.0
Name: fare_amount, Length: 442959, dtype: float32

In [72]:
test_preds = linreg_model.predict(test_ds)
test_preds

array([11.246746 , 11.246227 , 11.246632 , ..., 11.8532095, 11.851213 ,
       11.850919 ], dtype=float32)

## 6. Feature Engineering


- Extract parts of date
- Remove outliers & invalid data
- Add distance between pickup & drop
- Add distance from landmarks

Exercise: We're going to apply all of the above together, but you should observer the effect of adding each feature individually.


### Extract Parts of Date

- Year
- Month
- Day
- Weekday
- Hour


In [73]:
def add_dateparts(ds, col):
  ds[col+ '_year'] = ds[col].dt.year
  ds[col+ '_month'] = ds[col].dt.month
  ds[col+ 'day'] = ds[col].dt.day
  ds[col+ '_weekday'] = ds[col].dt.weekday
  ds[col+ '_hour'] = ds[col].dt.hour

In [74]:
add_dateparts(train_ds, "pickup_datetime")

In [75]:
add_dateparts(val_df, "pickup_datetime")

In [82]:
add_dateparts(test_df, "pickup_datetime")

### Add Distance Between Pickup and Drop

We can use the haversine distance:
- https://en.wikipedia.org/wiki/Haversine_formula
- https://stackoverflow.com/questions/29545704/fast-haversine-approximation-python-pandas

In [84]:
import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)

    All args must be of equal length.

    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km

In [85]:
def add_trip_distance(ds):
    ds['trip_distance'] = haversine_np(ds['pickup_longitude'], ds['pickup_latitude'], ds['dropoff_longitude'], ds['dropoff_latitude'])

In [86]:
add_trip_distance(train_ds)

In [87]:
add_trip_distance(val_df)

In [88]:
add_trip_distance(test_df)