
In [1]:
!pip install opendatasets



In [2]:
import opendatasets as od
dataset_url = 'https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/'
od.download(dataset_url)

Skipping, found downloaded files in "./new-york-city-taxi-fare-prediction" (use force=True to force download)


In [3]:
import pandas as pd
import numpy as np
import random

In [4]:
selected_cols = "fare_amount,pickup_datetime,pickup_latitude,pickup_longitude,dropoff_longitude,dropoff_latitude,passenger_count".split(",")

dtypes = {
    'fare_amount': 'float32',
    'pickup_latitude': 'float32',
    'pickup_longitude': 'float32',
    'dropoff_longitude': 'float32',
    'dropoff_latitude': 'float32',
    'passenger_count': 'uint8'
}

### Loading Training Set

Loading the entire dataset into Pandas is going to be slow, so we can use the following optimizations:

- Ignore the `key` column
- Parse pickup datetime while loading data
- Specify data types for other columns
   - `float32` for geo coordinates
   - `float32` for fare amount
   - `uint8` for passenger count
- Work with a 1% sample of the data (~500k rows)

We can apply these optimizations while using [`pd.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)
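To get a feel for how much the narrower data types save, here is a minimal sketch on synthetic data. The values and the row count are made up for illustration; only the dtype choices mirror the list above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a few taxi columns (values are invented for illustration).
rng = np.random.default_rng(0)
n_rows = 100_000
wide = pd.DataFrame({
    "fare_amount": rng.uniform(2.5, 60.0, n_rows),
    "pickup_longitude": rng.uniform(-74.3, -73.7, n_rows),
    "pickup_latitude": rng.uniform(40.5, 41.0, n_rows),
    "passenger_count": rng.integers(1, 7, n_rows),
})

# Downcast to the narrower dtypes listed above.
narrow = wide.astype({
    "fare_amount": "float32",
    "pickup_longitude": "float32",
    "pickup_latitude": "float32",
    "passenger_count": "uint8",
})

wide_bytes = wide.memory_usage(index=False).sum()
narrow_bytes = narrow.memory_usage(index=False).sum()
print(f"default dtypes: {wide_bytes / 1e6:.1f} MB, narrow dtypes: {narrow_bytes / 1e6:.1f} MB")
```

The trade-off is precision: `float32` keeps roughly seven significant digits, which is usually enough for fares and city-scale coordinates.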

In [5]:
random.seed(8)
def skip_row(row_idx):
  if row_idx == 0:
    return False
  return random.random() > 0.01

train_ds = pd.read_csv(
    "/content/new-york-city-taxi-fare-prediction/train.csv",
    usecols=selected_cols,
    parse_dates=['pickup_datetime'],
    dtype=dtypes,
    skiprows=skip_row
)
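The sampling trick used above relies on `skiprows` accepting a callable: it receives the row index and returns `True` for rows to skip. The sketch below shows the same idea on a small in-memory CSV; the data, the 10% rate, and the helper name `keep_about_ten_percent` are hypothetical.

```python
import io
import random

import pandas as pd

# Build a small in-memory CSV (made-up data) with 10,000 data rows.
csv_text = "key,value\n" + "\n".join(f"k{i},{i}" for i in range(10_000))

random.seed(8)

def keep_about_ten_percent(row_idx):
    # Row 0 is the header and must never be skipped.
    if row_idx == 0:
        return False
    # Returning True skips the row, so each data row is kept with probability ~0.1.
    return random.random() > 0.1

sample = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["value"],              # drop the `key` column while reading
    skiprows=keep_about_ten_percent,
)
print(len(sample), "of 10000 rows kept")
```

Because rows are dropped while parsing, the full file never has to fit in memory as a DataFrame.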

### Load Test Set

For the test set, we'll simply provide the data types.

In [6]:
test_df = pd.read_csv(
    "/content/new-york-city-taxi-fare-prediction/test.csv",
    dtype=dtypes,
    parse_dates=["pickup_datetime"]
)

## Explore the Dataset

- Basic info about training set
- Basic info about test set
- Exploratory data analysis & visualization

In [7]:
# train_ds.head(10)
print("Describe: \n", train_ds.describe())
print("Info:")
train_ds.info()  # info() prints its report directly and returns None, so it is not wrapped in print()
print("Null values: \n", train_ds.isna().sum())

Describe: 
          fare_amount  pickup_longitude  pickup_latitude  dropoff_longitude  \
count  553708.000000     553708.000000    553708.000000      553699.000000   
mean       11.343386        -72.525970        39.912071         -72.497482   
std         9.744153         13.166656         7.654707          11.747736   
min       -52.000000      -3383.284912     -2555.488037       -1301.503662   
25%         6.000000        -73.992088        40.734921         -73.991394   
50%         8.500000        -73.981827        40.752670         -73.980141   
75%        12.500000        -73.967079        40.767113         -73.963661   
max       450.000000        728.531738       430.516663        2497.105713   

       dropoff_latitude  passenger_count  
count     553699.000000    553708.000000  
mean          39.920895         1.686246  
std            8.819188         1.310580  
min        -2475.718506         0.000000  
25%           40.734089         1.000000  
50%           40.753220    

In [10]:
# test_df.head(10)
print("Describe: \n", test_df.describe())
print("Info:")
test_df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()
print("Null values: \n", test_df.isna().sum())

Describe: 
        pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  \
count       9914.000000      9914.000000        9914.000000       9914.000000   
mean         -73.974716        40.751041         -73.973656         40.751740   
std            0.042774         0.033541           0.039072          0.035435   
min          -74.252190        40.573143         -74.263245         40.568974   
25%          -73.992500        40.736125         -73.991249         40.735253   
50%          -73.982327        40.753052         -73.980015         40.754065   
75%          -73.968012        40.767113         -73.964062         40.768757   
max          -72.986534        41.709557         -72.990967         41.696682   

       passenger_count  
count      9914.000000  
mean          1.671273  
std           1.278747  
min           1.000000  
25%           1.000000  
50%           1.000000  
75%           2.000000  
max           6.000000  
<class 'pandas.core.frame.DataFra

## Prepare Dataset for Training

- Split Training & Validation Set
- Fill/Remove Missing Values
- Extract Inputs & Outputs
   - Training
   - Validation
   - Test

In [11]:
from sklearn.model_selection import train_test_split

### Split Training & Validation Set

We'll set aside 20% of the training data as the validation set, to evaluate the models we train on previously unseen data.

Since the test set and training set have the same date ranges, we can pick a random 20% fraction.

In [12]:
train_df, val_df = train_test_split(train_ds, test_size=0.2, random_state=8)

In [13]:
len(train_df), len(val_df)

(442966, 110742)

In [14]:
train_df = train_df.dropna()
val_df = val_df.dropna()

In [15]:
train_df.columns

Index(['fare_amount', 'pickup_datetime', 'pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'passenger_count'],
      dtype='object')

In [16]:
input_features = ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'passenger_count']
target_label = "fare_amount"

In [17]:
train_ds_input = train_df[input_features]
train_ds_target = train_df[target_label]

In [18]:
print(train_ds_input)
print(train_ds_target)

        pickup_longitude  pickup_latitude  dropoff_longitude  \
327779        -74.011520        40.708054         -73.993057   
509461        -73.974068        40.753933         -73.978043   
456430        -73.990227        40.751686         -73.986275   
284224        -73.990784        40.755768         -73.973396   
198663        -73.991020        40.742054         -73.998093   
...                  ...              ...                ...   
403592        -73.955635        40.779488         -73.951050   
324570        -74.007660        40.709560         -73.999107   
231557        -73.979118        40.787209         -73.960762   
149489        -73.989250        40.731621         -73.982498   
550228        -73.972328        40.790688         -73.978043   

        dropoff_latitude  passenger_count  
327779         40.742462                1  
509461         40.747253                5  
456430         40.744461                6  
284224         40.763702                1  
198663     

In [19]:
val_ds_input = val_df[input_features]
val_ds_target = val_df[target_label]

In [22]:
print(val_ds_input)
print(val_ds_target)

        pickup_longitude  pickup_latitude  dropoff_longitude  \
16227         -73.961151        40.768848         -73.966934   
449789        -73.955368        40.782791         -73.975891   
269481        -73.863579        40.770000         -74.000244   
141054        -73.997261        40.724819         -73.972557   
376590        -73.999847        40.726799         -73.981049   
...                  ...              ...                ...   
512280        -73.980789        40.779812         -73.989197   
123562        -73.982407        40.764572         -73.984886   
452177          0.000000         0.000000           0.000000   
414910        -73.995934        40.726151         -73.941605   
46335         -73.961617        40.719326         -74.000954   

        dropoff_latitude  passenger_count  
16227          40.767139                1  
449789         40.754559                1  
269481         40.714333                1  
141054         40.753582                1  
376590     

In [39]:
test_ds = test_df[input_features]
test_ds

| | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|
| 0 | -73.973320 | 40.763805 | -73.981430 | 40.743835 | 1 |
| 1 | -73.986862 | 40.719383 | -73.998886 | 40.739201 | 1 |
| 2 | -73.982521 | 40.751259 | -73.979652 | 40.746140 | 1 |
| 3 | -73.981163 | 40.767807 | -73.990448 | 40.751637 | 1 |
| 4 | -73.966049 | 40.789776 | -73.988564 | 40.744427 | 1 |
| ... | ... | ... | ... | ... | ... |
| 9909 | -73.968124 | 40.796997 | -73.955643 | 40.780388 | 6 |
| 9910 | -73.945511 | 40.803600 | -73.960213 | 40.776371 | 6 |
| 9911 | -73.991600 | 40.726608 | -73.789742 | 40.647011 | 6 |
| 9912 | -73.985573 | 40.735432 | -73.939178 | 40.801731 | 6 |


## Train Hardcoded & Baseline Models

- Hardcoded model: always predict average fare
- Baseline model: Linear regression

The competition is scored using root mean squared error (RMSE):
https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/overview/evaluation

In [23]:
class MeanRegressor:
  def fit(self, inputs, targets):
    self.mean = targets.mean()

  def predict(self, inputs):
    return np.full(inputs.shape[0], self.mean)
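scikit-learn ships a ready-made equivalent of this hardcoded model, `DummyRegressor(strategy="mean")`. The sketch below uses made-up toy fares purely to show that it predicts the training mean for every row; it is an alternative, not what the notebook uses.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Toy inputs and fares (invented); a mean-only model ignores the inputs entirely.
X = np.zeros((4, 2))
y = np.array([8.9, 3.3, 6.0, 7.0])

dummy = DummyRegressor(strategy="mean")
dummy.fit(X, y)
preds = dummy.predict(X)

print(preds)     # every entry equals the mean of y
print(y.mean())
```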

In [24]:
mean_model = MeanRegressor()

In [25]:
mean_model.fit(train_ds_input, train_ds_target)

In [26]:
mean_model.mean

11.346553

In [27]:
train_preds = mean_model.predict(train_ds_input)
train_preds

array([11.346553, 11.346553, 11.346553, ..., 11.346553, 11.346553,
       11.346553], dtype=float32)

In [28]:
train_ds_target

327779     8.9
509461     3.3
456430     6.0
284224     7.0
198663    10.5
          ... 
403592     4.9
324570     6.9
231557     8.1
149489    17.5
550228     4.0
Name: fare_amount, Length: 442959, dtype: float32

In [29]:
val_preds = mean_model.predict(val_ds_input)
val_preds

array([11.346553, 11.346553, 11.346553, ..., 11.346553, 11.346553,
       11.346553], dtype=float32)

## Error between predicted and actual

Above, the `MeanRegressor` model predicts the mean training fare, about 11.347, for every record in the training dataset.

The actual targets are 8.9, 3.3, 6.0, and so on.

1. The error for the first record is 11.347 - 8.9 = 2.447
2. The error for the second record is 11.347 - 3.3 = 8.047
3. The error for the third record is 11.347 - 6.0 = 5.347

and so on for each record.

We use the root mean squared error (RMSE) to summarize these per-record errors in a single number and see how badly the model performs.
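As a small worked example, the RMSE over just the first three training records can be computed by hand with NumPy. This is only an illustration on three values taken from the outputs above; the notebook computes the metric over the full dataset.

```python
import numpy as np

actual = np.array([8.9, 3.3, 6.0])             # first three training fares shown above
predicted = np.full(actual.shape, 11.346553)   # the constant mean prediction

errors = predicted - actual                    # per-record error
rmse_by_hand = np.sqrt(np.mean(errors ** 2))   # square, average, then take the root

print(errors)
print(rmse_by_hand)
```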

In [30]:
from sklearn.metrics import mean_squared_error

In [31]:
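def rmse(actual, predictions):
  # Take the square root explicitly: the `squared=False` argument is deprecated
  # in newer scikit-learn releases and removed in the latest ones.
  return np.sqrt(mean_squared_error(actual, predictions))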

In [32]:
train_rmse = rmse(train_ds_target, train_preds)
train_rmse

9.771893

In [33]:
val_rmse = rmse(val_ds_target, val_preds)
val_rmse

9.632431

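## RMSE = 9.771893

Roughly speaking, the model's fare predictions are off by about \$9.77 for a typical training record (\$9.63 on the validation set), which is quite bad given that the mean fare is only about \$11.35. Because RMSE squares the errors before averaging, large misses count more heavily than they would in a plain average.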

### Train & Evaluate Baseline Model

We'll train a linear regression model as our baseline, which tries to express the target as a weighted sum of the inputs.

In [34]:
from sklearn.linear_model import LinearRegression
linreg_model = LinearRegression()
linreg_model.fit(train_ds_input, train_ds_target)
train_preds = linreg_model.predict(train_ds_input)
train_preds

array([11.246101 , 11.730324 , 11.8510895, ..., 11.73045  , 11.488152 ,
       11.488618 ], dtype=float32)

In [35]:
train_ds_target

327779     8.9
509461     3.3
456430     6.0
284224     7.0
198663    10.5
          ... 
403592     4.9
324570     6.9
231557     8.1
149489    17.5
550228     4.0
Name: fare_amount, Length: 442959, dtype: float32

In [36]:
test_preds = linreg_model.predict(test_ds)
test_preds

array([11.246746 , 11.246227 , 11.246632 , ..., 11.8532095, 11.851213 ,
       11.850919 ], dtype=float32)
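The "weighted sum of the inputs" description of linear regression can be checked directly: a fitted model's predictions equal the inputs multiplied by `coef_` plus `intercept_`. The sketch below uses synthetic data with made-up weights, not the taxi data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: the target is a known weighted sum plus a little noise.
rng = np.random.default_rng(8)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + 4.0 + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)

# Reproduce predict() by hand: one weight per input column, plus the intercept.
manual = X @ model.coef_ + model.intercept_

print(model.coef_, model.intercept_)
print(np.allclose(manual, model.predict(X)))
```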

## Feature Engineering


- Extract parts of date
- Remove outliers & invalid data
- Add distance between pickup & drop
- Add distance from landmarks

Exercise: We're going to apply all of the above together, but you should observe the effect of adding each feature individually.


### Extract Parts of Date

- Year
- Month
- Day
- Weekday
- Hour


In [37]:
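def add_dateparts(ds, col):
  # Adds the date parts as new columns on `ds` in place.
  ds[col + '_year'] = ds[col].dt.year
  ds[col + '_month'] = ds[col].dt.month
  # Note: this suffix has no underscore, so the column is named e.g. 'pickup_datetimeday';
  # later cells refer to it by that exact name.
  ds[col + 'day'] = ds[col].dt.day
  ds[col + '_weekday'] = ds[col].dt.weekday  # Monday=0 ... Sunday=6
  ds[col + '_hour'] = ds[col].dt.hour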

In [40]:
add_dateparts(train_ds, "pickup_datetime")

In [41]:
add_dateparts(val_df, "pickup_datetime")

In [43]:
add_dateparts(test_df, "pickup_datetime")
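To see what the `.dt` accessor returns for each date part, here is a minimal sketch on two timestamps taken from the training rows shown in this notebook. The short column names are chosen only for this illustration.

```python
import pandas as pd

# Two pickup times that appear in the training data preview.
ts = pd.Series(pd.to_datetime(["2012-03-25 09:22:45", "2009-07-16 20:33:00"]))

parts = pd.DataFrame({
    "year": ts.dt.year,
    "month": ts.dt.month,
    "day": ts.dt.day,
    "weekday": ts.dt.weekday,  # Monday=0 ... Sunday=6
    "hour": ts.dt.hour,
})
print(parts)
```

The first timestamp falls on a Sunday (weekday 6) and the second on a Thursday (weekday 3).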

### Add Distance Between Pickup and Drop

We can use the haversine distance:
- https://en.wikipedia.org/wiki/Haversine_formula
- https://stackoverflow.com/questions/29545704/fast-haversine-approximation-python-pandas

In [44]:
import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)

    All args must be of equal length.

    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km
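For reference, the function above corresponds to the standard haversine formula. Here $\varphi$ denotes latitude and $\lambda$ longitude, both in radians, and $R$ is the Earth radius (the code uses 6367 km):

$$
a = \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right),
\qquad
d = 2R \,\arcsin\!\left(\sqrt{a}\right)
$$

The intermediate value $a$ matches the variable `a` in the code, and $d$ is the returned distance `km`.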

In [45]:
def add_trip_distance(ds):
    ds['trip_distance'] = haversine_np(ds['pickup_longitude'], ds['pickup_latitude'], ds['dropoff_longitude'], ds['dropoff_latitude'])

In [46]:
add_trip_distance(train_ds)

In [47]:
add_trip_distance(val_df)

In [48]:
add_trip_distance(test_df)

### Add Distance From Popular Landmarks

- JFK Airport
- LGA Airport
- EWR Airport
- Times Square
- Met Museum
- World Trade Center

We'll add the distance from each landmark to the drop-off location.

In [49]:
jfk_lonlat = -73.7781, 40.6413
lga_lonlat = -73.8740, 40.7769
ewr_lonlat = -74.1745, 40.6895
met_lonlat = -73.9632, 40.7794
wtc_lonlat = -74.0099, 40.7126

In [50]:
def add_landmark_dropoff_distance(ds, landmark_name, landmark_lonlat):
    lon, lat = landmark_lonlat
    ds[landmark_name + '_drop_distance'] = haversine_np(lon, lat, ds['dropoff_longitude'], ds['dropoff_latitude'])

In [51]:
def add_landmark(ds):
  landmarks = [('jfk', jfk_lonlat), ('lga', lga_lonlat), ('ewr', ewr_lonlat), ('met', met_lonlat), ('wtc', wtc_lonlat)]
  for name, lonlat in landmarks:
    add_landmark_dropoff_distance(ds, name, lonlat)

In [52]:
add_landmark(train_ds)
add_landmark(val_df)
add_landmark(test_df)
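A quick sanity check on the distance helper is to compute the distance between two of the landmarks themselves. The sketch below re-implements the same haversine calculation under the hypothetical name `haversine_km` so that it runs on its own; the JFK and LGA coordinates are the ones defined above.

```python
import numpy as np

def haversine_km(lon1, lat1, lon2, lat2, radius_km=6367):
    # Same great-circle calculation as haversine_np, repeated here to be self-contained.
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return radius_km * 2 * np.arcsin(np.sqrt(a))

jfk = (-73.7781, 40.6413)
lga = (-73.8740, 40.7769)

jfk_to_lga = haversine_km(*jfk, *lga)   # distance between the two airports
jfk_to_jfk = haversine_km(*jfk, *jfk)   # a point is at distance zero from itself

print(jfk_to_lga, jfk_to_jfk)
```

The two airports come out on the order of 17 km apart, which is plausible for a straight-line distance across Queens.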

In [54]:
train_ds.head()

| | fare_amount | pickup_datetime | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count | pickup_datetime_year | pickup_datetime_month | pickup_datetimeday | pickup_datetime_weekday | pickup_datetime_hour | trip_distance | jfk_drop_distance | lga_drop_distance | ewr_drop_distance | met_drop_distance | wtc_drop_distance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.5 | 2012-03-25 09:22:45+00:00 | -73.986679 | 40.725658 | -73.977066 | 40.754505 | 1 | 2012 | 3 | 25 | 6 | 9 | 3.306201 | 20.958735 | 9.024495 | 18.129211 | 3.002265 | 5.41572 |
| 1 | 12.9 | 2009-07-16 20:33:00+00:00 | -73.980759 | 40.680557 | -73.989113 | 40.726788 | 2 | 2009 | 7 | 16 | 3 | 20 | 5.185302 | 20.16004 | 11.17634 | 16.157242 | 6.24012 | 2.356164 |
| 2 | 4.0 | 2014-02-22 18:41:34+00:00 | -73.990364 | 40.757343 | -73.985374 | 40.760605 | 1 | 2014 | 2 | 22 | 5 | 18 | 0.55452 | 21.92494 | 9.546513 | 17.779884 | 2.800522 | 5.720479 |
| 3 | 9.7 | 2012-07-10 13:13:07+00:00 | -73.991005 | 40.733334 | -73.987 | 40.747791 | 1 | 2012 | 7 | 10 | 1 | 13 | 1.641773 | 21.209339 | 10.045919 | 17.069136 | 4.043235 | 4.360438 |
| 4 | 4.0 | 2014-01-18 23:03:00+00:00 | -74.005432 | 40.736958 | -73.998642 | 40.744995 | 1 | 2014 | 1 | 18 | 5 | 23 | 1.060889 | 21.864847 | 11.073668 | 16.044769 | 4.848884 | 3.722812 |


### Remove Outliers and Invalid Data

There seems to be some invalid data in each of the following columns:

- Fare amount
- Passenger count
- Pickup latitude & longitude
- Drop latitude & longitude


We'll use the following ranges:

- `fare_amount`: \$1 to \$500
- `longitudes`: -75 to -72
- `latitudes`: 40 to 42
- `passenger_count`: 1 to 6

In [55]:
def remove_outliers(ds):
    return ds[(ds['fare_amount'] >= 1.) &
              (ds['fare_amount'] <= 500.) &
              (ds['pickup_longitude'] >= -75) &
              (ds['pickup_longitude'] <= -72) &
              (ds['dropoff_longitude'] >= -75) &
              (ds['dropoff_longitude'] <= -72) &
              (ds['pickup_latitude'] >= 40) &
              (ds['pickup_latitude'] <= 42) &
              (ds['dropoff_latitude'] >=40) &
              (ds['dropoff_latitude'] <= 42) &
              (ds['passenger_count'] >= 1) &
              (ds['passenger_count'] <= 6)]

In [56]:
train_ds = remove_outliers(train_ds)

In [57]:
val_df = remove_outliers(val_df)
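The same kind of range filter can be written more compactly with `Series.between`, which is inclusive on both ends by default. The sketch below is an alternative formulation on a made-up toy frame with only three of the columns, not a replacement for `remove_outliers`.

```python
import pandas as pd

# Toy rows (invented): only the first one is valid on every column.
toy = pd.DataFrame({
    "fare_amount": [8.5, -52.0, 12.0, 450.0],
    "pickup_longitude": [-73.98, -73.99, 0.0, -73.97],
    "passenger_count": [1, 2, 1, 0],
})

# Each between() call yields a boolean Series; & keeps rows valid on all columns.
mask = (
    toy["fare_amount"].between(1.0, 500.0)
    & toy["pickup_longitude"].between(-75, -72)
    & toy["passenger_count"].between(1, 6)
)
cleaned = toy[mask]
print(cleaned)
```

Rows with missing values fail these comparisons as well, so they are dropped by the same mask.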

## Train & Evaluate Different Models

We'll train each of the following & submit predictions to Kaggle:

- Linear Regression
- Random Forests
- Gradient Boosting

Exercise: Train Ridge, SVM, KNN, Decision Tree models

In [58]:
train_ds.columns

Index(['fare_amount', 'pickup_datetime', 'pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'passenger_count',
       'pickup_datetime_year', 'pickup_datetime_month', 'pickup_datetimeday',
       'pickup_datetime_weekday', 'pickup_datetime_hour', 'trip_distance',
       'jfk_drop_distance', 'lga_drop_distance', 'ewr_drop_distance',
       'met_drop_distance', 'wtc_drop_distance'],
      dtype='object')

In [59]:
input_cols = ['pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'passenger_count',
       'pickup_datetime_year', 'pickup_datetime_month', 'pickup_datetimeday',
       'pickup_datetime_weekday', 'pickup_datetime_hour', 'trip_distance',
       'jfk_drop_distance', 'lga_drop_distance', 'ewr_drop_distance',
       'met_drop_distance', 'wtc_drop_distance']

target_col = 'fare_amount'

In [60]:
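# Note: train_ds is the full sample loaded before the split, so it still contains
# the rows that went into val_df; validation metrics below are likely optimistic.
train_inputs = train_ds[input_cols]
train_target = train_ds[target_col]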

In [61]:
val_inputs = val_df[input_cols]
val_target = val_df[target_col]

In [64]:
test_inputs = test_df[input_cols]
test_inputs.head()

| | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count | pickup_datetime_year | pickup_datetime_month | pickup_datetimeday | pickup_datetime_weekday | pickup_datetime_hour | trip_distance | jfk_drop_distance | lga_drop_distance | ewr_drop_distance | met_drop_distance | wtc_drop_distance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -73.97332 | 40.763805 | -73.98143 | 40.743835 | 1 | 2015 | 1 | 27 | 1 | 13 | 2.321899 | 20.574911 | 9.760167 | 17.346842 | 4.239343 | 4.218709 |
| 1 | -73.986862 | 40.719383 | -73.998886 | 40.739201 | 1 | 2015 | 1 | 27 | 1 | 13 | 2.423777 | 21.550976 | 11.31599 | 15.789623 | 5.382879 | 3.098136 |
| 2 | -73.982521 | 40.751259 | -73.979652 | 40.74614 | 1 | 2011 | 10 | 8 | 5 | 11 | 0.618015 | 20.594069 | 9.526829 | 17.576965 | 3.946721 | 4.514503 |
| 3 | -73.981163 | 40.767807 | -73.990448 | 40.751637 | 1 | 2012 | 12 | 1 | 5 | 21 | 1.959681 | 21.689365 | 10.195091 | 16.96965 | 3.843892 | 4.637048 |
| 4 | -73.966049 | 40.789776 | -73.988564 | 40.744427 | 1 | 2012 | 12 | 1 | 5 | 21 | 5.383829 | 21.113993 | 10.295857 | 16.808367 | 4.433764 | 3.967223 |


In [69]:
def evaluate_model(model):
  train_preds = model.predict(train_inputs)
  train_rmse = rmse(train_target, train_preds)
  val_preds = model.predict(val_inputs)
  val_rmse = rmse(val_target, val_preds)
  return train_rmse, val_rmse, train_preds, val_preds

In [107]:
def evaluate_model_metrics(model):
  train_score = model.score(train_inputs, train_target)
  val_score = model.score(val_inputs, val_target)

  return train_score, val_score
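For scikit-learn regressors, `model.score` returns the coefficient of determination (R²) rather than a classification-style accuracy. The sketch below checks this against `r2_score` on synthetic data; the data and weights are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

# Synthetic regression problem with known weights plus noise.
rng = np.random.default_rng(8)
X = rng.normal(size=(300, 4))
y = X @ np.array([3.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=300)

model = Ridge(alpha=0.8).fit(X, y)

score = model.score(X, y)             # what evaluate_model_metrics relies on
r2 = r2_score(y, model.predict(X))    # the same quantity computed explicitly

print(score, r2)
```

An R² of 1.0 means a perfect fit, while 0.0 means the model does no better than always predicting the mean.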

---------------------------------------------------------------------------
## RIDGE REGRESSION

In [115]:
from sklearn.linear_model import Ridge
def ridge_regression(alpha):
  ridge_model = Ridge(random_state=8, alpha=alpha)
  ridge_model.fit(train_inputs, train_target)
  train_rmse, val_rmse, train_preds, val_preds = evaluate_model(ridge_model)

  print("The RMSE for training dataset is: ", train_rmse)
  print("The RMSE for validation dataset is: ", val_rmse)
  print("-----------------------------------------------------------\n")
  print("Predictions for training dataset are: \n", train_preds)
  print("-----------------------------------------------------------\n")
  print("Predictions for validation dataset are: \n", val_preds)

  print("-----------------------------------------------------------\n")
  train_score, val_score = evaluate_model_metrics(ridge_model)
  print("Train dataset score: ", train_score)
  print("Validation dataset score: ", val_score)

  # predict on test data
  test_preds = ridge_model.predict(test_inputs)
  print("-----------------------------------------------------------\n")
  print("Predictions for test dataset are: \n", test_preds)

In [116]:
ridge_regression(0.8)

The RMSE for training dataset is:  5.246131853884653
The RMSE for validation dataset is:  5.155613745395518
-----------------------------------------------------------

Predictions for training dataset are: 
 [10.47759894 13.84739549  6.52853133 ...  8.34081656 12.34794325
 12.93271625]
-----------------------------------------------------------

Predictions for validation dataset are: 
 [ 4.74031108 12.29895367 27.71814394 ...  6.01143214 14.70602713
 20.44239657]
-----------------------------------------------------------

Train dataset score:  0.7059962179900492
Validation dataset score:  0.7090483094061852
-----------------------------------------------------------

Predictions for test dataset are: 
 [10.14843198 11.49244385  5.54210768 ... 47.32475444 22.27505766
  9.14263889]


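## Results of Ridge Regression:

As the output above shows:
1. RMSE for `train_dataset = 5.246131853884653`
2. RMSE for `val_dataset = 5.155613745395518`

## R² score:
1. Train dataset score:  `0.7059962179900492`
2. Validation dataset score:  `0.7090483094061852`


The RMSE (about 5.2, down from about 9.7) is much better than that of the hardcoded mean model we created earlier. The reported score is the R² value returned by `model.score`, not a classification accuracy; about 0.71 means the model explains roughly 71% of the variance in fares.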
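The `alpha` argument passed to `ridge_regression` controls the regularization strength, and only one value (0.8) is tried above. A minimal sketch of comparing several values by validation RMSE follows; it runs on synthetic data, and the candidate list is an arbitrary assumption rather than a recommendation.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (invented), split the same way as in this notebook.
rng = np.random.default_rng(8)
X = rng.normal(size=(500, 6))
y = X @ rng.normal(size=6) + rng.normal(scale=1.0, size=500)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=8)

# Fit one model per candidate alpha and record its validation RMSE.
results = {}
for alpha in [0.01, 0.1, 0.8, 10.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    results[alpha] = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))

best_alpha = min(results, key=results.get)
print(results)
print("lowest validation RMSE at alpha =", best_alpha)
```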




-------------------------------------------------------------------
## RANDOM FOREST

In [120]:
from sklearn.ensemble import RandomForestRegressor
def random_forest(max_depth, n_estimators):
  rf_model = RandomForestRegressor(random_state=8, n_jobs=-1, max_depth=max_depth, n_estimators=n_estimators)
  rf_model.fit(train_inputs, train_target)

  train_rmse, val_rmse, train_preds, val_preds = evaluate_model(rf_model)
  print("The RMSE for training dataset is: ", train_rmse)
  print("The RMSE for validation dataset is: ", val_rmse)

  print("-----------------------------------------------------------\n")
  print("Predictions for training dataset are: \n", train_preds)

  print("-----------------------------------------------------------\n")
  print("Predictions for validation dataset are: \n", val_preds)

  train_score, val_score = evaluate_model_metrics(rf_model)
  print("Train dataset score: ", train_score)
  print("Validation dataset score: ", val_score)

  # predict on test data
  test_preds = rf_model.predict(test_inputs)
  print("-----------------------------------------------------------\n")
  print("Predictions for test dataset are: \n", test_preds)

In [121]:
random_forest(10, 50)

The RMSE for training dataset is:  3.653980451528581
The RMSE for validation dataset is:  3.646048926699294
-----------------------------------------------------------

Predictions for training dataset are: 
 [10.17363158 13.74625213  5.10990187 ...  9.0421932  14.16669307
 13.62496878]
-----------------------------------------------------------

Predictions for validation dataset are: 
 [ 4.8776264  11.10075099 30.3818388  ...  5.10990187 18.23924426
 15.68748861]
Train dataset score:  0.8573712904805945
Validation dataset score:  0.8544858845554287
-----------------------------------------------------------

Predictions for test dataset are: 
 [10.72852293 10.78620383  5.06933965 ... 55.27381741 21.88656283
  6.97351737]



## Results of Random Forest Regression:

As the output above shows:
1. RMSE for `train_dataset = 3.653980451528581`
2. RMSE for `val_dataset = 3.646048926699294`

## R² score:
1. Train dataset score:  `0.8573712904805945`
2. Validation dataset score:  `0.8544858845554287`


Both the RMSE (about 3.65) and the R² score (about 0.85) are much better than those of the hardcoded mean model and the Ridge regression model we created above.


