# New York City Taxi Fare Prediction

We'll train a machine learning model to predict the fare for a taxi ride in New York City, given information such as the pickup date and time, pickup location, dropoff location, and number of passengers.

Dataset Link: [new york city taxi fare prediction](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction)

## Dataset Description
**File descriptions**
* **train.csv** - Input features and target fare_amount values for the training set (about 55M rows).
* **test.csv** - Input features for the test set (about 10K rows). Your goal is to predict fare_amount for each row.
* **sample_submission.csv** - a sample submission file in the correct format (columns key and fare_amount). This file 'predicts' fare_amount to be $11.35 for all rows, which is the mean fare_amount from the training set.
### Data fields
#### ID
* **key** - Unique string identifying each row in both the training and test sets. Comprised of pickup_datetime plus a unique integer, but this doesn't matter, it should just be used as a unique ID field.
Required in your submission CSV. Not necessarily needed in the training set, but could be useful to simulate a 'submission file' while doing cross-validation within the training set.
#### Features
* **pickup_datetime** - timestamp value indicating when the taxi ride started.
* **pickup_longitude** - float for longitude coordinate of where the taxi ride started.
* **pickup_latitude** - float for latitude coordinate of where the taxi ride started.
* **dropoff_longitude** - float for longitude coordinate of where the taxi ride ended.
* **dropoff_latitude** - float for latitude coordinate of where the taxi ride ended.
* **passenger_count** - integer indicating the number of passengers in the taxi ride.
#### Target
* **fare_amount** - float dollar amount of the cost of the taxi ride. This value is only in the training set; this is what you are predicting in the test set and it is required in your submission CSV.

## Planning
* Approach the problem with classic machine learning algorithms
  * Random Forest
  * XGBoost
  * LightGBM (LGBM)

  Find the best-performing algorithm and tune its hyperparameters.
* Approach the problem using deep learning
  * ANN (with hyperparameter tuning)

Import the required libraries and modules.

In [153]:
import opendatasets as od
import numpy as np
import pandas as pd
import random

## 1. Download the Dataset

- Install required libraries
- Download data from Kaggle
- View dataset files
- Load training set with Pandas
- Load test set with Pandas


Dataset link:  [new york city taxi fare prediction](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/overview)

### Download Data from Kaggle

We'll use the [opendatasets](https://github.com/JovianML/opendatasets) library.

In [154]:
dataset_url = 'https://www.kaggle.com/competitions/new-york-city-taxi-fare-prediction/data'

In [155]:
od.download(dataset_url)

Skipping, found downloaded files in "./new-york-city-taxi-fare-prediction" (use force=True to force download)


In [156]:
data_dir = './new-york-city-taxi-fare-prediction'

We use shell commands here because Python modules are very slow on files this large, so instead of the `os` module we inspect the files directly from the shell.

In [157]:
# List of files with size
!ls -lh {data_dir}

total 5,4G
-rw-r--r-- 1 root root  486 29. Dez 13:42 GCP-Coupons-Instructions.rtf
-rw-r--r-- 1 root root 336K 29. Dez 13:42 sample_submission.csv
-rw-r--r-- 1 root root 960K 29. Dez 13:42 test.csv
-rw-r--r-- 1 root root 5,4G 29. Dez 13:42 train.csv


In [158]:
# Training set
!head {data_dir}/train.csv

key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
2009-06-15 17:26:21.0000001,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.84161,40.712278,1
2010-01-05 16:52:16.0000002,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1
2011-08-18 00:35:00.00000049,5.7,2011-08-18 00:35:00 UTC,-73.982738,40.76127,-73.991242,40.750562,2
2012-04-21 04:30:42.0000001,7.7,2012-04-21 04:30:42 UTC,-73.98713,40.733143,-73.991567,40.758092,1
2010-03-09 07:51:00.000000135,5.3,2010-03-09 07:51:00 UTC,-73.968095,40.768008,-73.956655,40.783762,1
2011-01-06 09:50:45.0000002,12.1,2011-01-06 09:50:45 UTC,-74.000964,40.73163,-73.972892,40.758233,1
2012-11-20 20:35:00.0000001,7.5,2012-11-20 20:35:00 UTC,-73.980002,40.751662,-73.973802,40.764842,1
2012-01-04 17:22:00.00000081,16.5,2012-01-04 17:22:00 UTC,-73.9513,40.774138,-73.990095,40.751048,1
2012-12-03 13:10:00.000000125,9,2012-12-03 13:10:00 UTC,-74.006462,40.726713,-73.99

In [159]:
# Test set
!head {data_dir}/test.csv

key,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
2015-01-27 13:08:24.0000002,2015-01-27 13:08:24 UTC,-73.973320007324219,40.7638053894043,-73.981430053710938,40.74383544921875,1
2015-01-27 13:08:24.0000003,2015-01-27 13:08:24 UTC,-73.986862182617188,40.719383239746094,-73.998886108398438,40.739200592041016,1
2011-10-08 11:53:44.0000002,2011-10-08 11:53:44 UTC,-73.982524,40.75126,-73.979654,40.746139,1
2012-12-01 21:12:12.0000002,2012-12-01 21:12:12 UTC,-73.98116,40.767807,-73.990448,40.751635,1
2012-12-01 21:12:12.0000003,2012-12-01 21:12:12 UTC,-73.966046,40.789775,-73.988565,40.744427,1
2012-12-01 21:12:12.0000005,2012-12-01 21:12:12 UTC,-73.960983,40.765547,-73.979177,40.740053,1
2011-10-06 12:10:20.0000001,2011-10-06 12:10:20 UTC,-73.949013,40.773204,-73.959622,40.770893,1
2011-10-06 12:10:20.0000003,2011-10-06 12:10:20 UTC,-73.777282,40.646636,-73.985083,40.759368,1
2011-10-06 12:10:20.0000002,2011-10-06 12:10:20 UTC,-74.01409

In [160]:
# Sample submission file
!head {data_dir}/sample_submission.csv

key,fare_amount
2015-01-27 13:08:24.0000002,11.35
2015-01-27 13:08:24.0000003,11.35
2011-10-08 11:53:44.0000002,11.35
2012-12-01 21:12:12.0000002,11.35
2012-12-01 21:12:12.0000003,11.35
2012-12-01 21:12:12.0000005,11.35
2011-10-06 12:10:20.0000001,11.35
2011-10-06 12:10:20.0000003,11.35
2011-10-06 12:10:20.0000002,11.35


In [161]:
# No. of lines in training set
!wc -l {data_dir}/train.csv

55423856 ./new-york-city-taxi-fare-prediction/train.csv


In [162]:
# No. of lines in test set
!wc -l {data_dir}/test.csv

9914 ./new-york-city-taxi-fare-prediction/test.csv


In [163]:
# No. of lines in submission file
!wc -l {data_dir}/sample_submission.csv

9915 ./new-york-city-taxi-fare-prediction/sample_submission.csv


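Observations:
* This is a supervised learning (regression) problem
* The data takes about 5.4 GB on disk, almost all of it in `train.csv`
* The training set has about 55.4 million rows (`wc -l` reports 55423856 lines)
* The test set has 9,914 rows
* The target to predict is `fare_amount`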

### Loading Training Set

Loading the entire dataset into Pandas is going to be slow, so we can use the following optimizations:

- Ignore the `key` column
- Parse pickup datetime while loading data 
- Specify data types for other columns
   - `float32` for geo coordinates
   - `float32` for fare amount
   - `uint8` for passenger count
- Work with a 1% sample of the data (~500k rows)

We can apply these optimizations while using [`pd.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)
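To get a feel for why the smaller data types matter, here is a minimal sketch that compares the memory footprint of default `float64` columns with the same columns stored as `float32`. The data is synthetic and the names `demo_df`, `bytes_float64`, and `bytes_float32` are made up for this illustration; they are not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for two coordinate columns (values are made up).
rng = np.random.default_rng(0)
n_rows = 100_000
demo_df = pd.DataFrame({
    'pickup_longitude': rng.uniform(-75, -72, n_rows),
    'pickup_latitude': rng.uniform(40, 42, n_rows),
})

# Default float64 columns versus the same columns downcast to float32.
bytes_float64 = demo_df.memory_usage(index=False).sum()
bytes_float32 = demo_df.astype('float32').memory_usage(index=False).sum()

print(f'float64: {bytes_float64:,} bytes')
print(f'float32: {bytes_float32:,} bytes')
```

The trade-off is precision: near a longitude of -74, `float32` resolves coordinates to roughly a metre, which should be adequate for fare prediction.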

In [164]:
sample_frac = 0.01 # to sample only 1% of the data from whole dataset (~500_000)

In [165]:
selected_cols = 'fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count'.split(',')
# Specify dtypes for the numeric columns to reduce memory use and speed up parsing.
# pickup_datetime is handled by parse_dates in pd.read_csv, so it gets no dtype here.
dtypes = {
    'fare_amount': 'float32',
    'pickup_longitude': 'float32',
    'pickup_latitude': 'float32',
    'dropoff_longitude': 'float32',
    'dropoff_latitude': 'float32',
    'passenger_count': 'uint8'
}

In [166]:
def skip_rows(row_idx):
    if row_idx == 0:
        return False

    # This is True about 99% of the time, so ~99% of rows are skipped and only ~1% are kept
    return random.random() > sample_frac

In [167]:
random.seed(42) # to produce same results

taxi_fare_train_org_df = pd.read_csv(
    data_dir + '/train.csv',
    usecols=selected_cols,
    dtype=dtypes,
    parse_dates=['pickup_datetime'],
    skiprows=skip_rows
)
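The row-sampling trick can be tried on a small in-memory CSV before running it on the 5.4 GB file. The sketch below is only an illustration: the CSV contents, `demo_frac`, and `skip_row_demo` are hypothetical, and a 10% fraction is used so the effect is visible on a small file.

```python
import io
import random

import pandas as pd

# Build a small in-memory CSV: one header line plus 10,000 data rows.
n_rows = 10_000
csv_text = 'row_id,value\n' + ''.join(f'{i},{i * 2}\n' for i in range(n_rows))

demo_frac = 0.1  # keep roughly 10% of the rows (hypothetical value)

def skip_row_demo(row_idx):
    # Never skip the header (row 0); skip any other row with probability 1 - demo_frac.
    if row_idx == 0:
        return False
    return random.random() > demo_frac

random.seed(42)  # make the sample reproducible
sample_df = pd.read_csv(io.StringIO(csv_text), skiprows=skip_row_demo)

print(list(sample_df.columns))
print(len(sample_df))
```

Because each row is kept or skipped independently, the sample size is only approximately the requested fraction, which is why the notebook's 1% sample is not exactly 1% of the rows.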

### Load Test Set

For the test set, we'll simply provide the data types.

In [168]:
taxi_fare_test_df = pd.read_csv(
    data_dir + '/test.csv',
    dtype=dtypes,
    parse_dates=['pickup_datetime'],
)

## 2. Exploratory Data Analysis (EDA)

- Basic info about training set
- Basic info about test set
- Exploratory data analysis & visualization
- Ask & answer questions

In [169]:
taxi_fare_train_org_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 552450 entries, 0 to 552449
Data columns (total 7 columns):
 #   Column             Non-Null Count   Dtype              
---  ------             --------------   -----              
 0   fare_amount        552450 non-null  float32            
 1   pickup_datetime    552450 non-null  datetime64[ns, UTC]
 2   pickup_longitude   552450 non-null  float32            
 3   pickup_latitude    552450 non-null  float32            
 4   dropoff_longitude  552450 non-null  float32            
 5   dropoff_latitude   552450 non-null  float32            
 6   passenger_count    552450 non-null  uint8              
dtypes: datetime64[ns, UTC](1), float32(5), uint8(1)
memory usage: 15.3 MB


In [170]:
taxi_fare_train_org_df.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,552450.0,552450.0,552450.0,552450.0,552450.0,552450.0
mean,11.354059,-72.497063,39.9105,-72.504326,39.934265,1.684983
std,9.811924,11.618246,8.061114,12.074346,9.255057,1.337664
min,-52.0,-1183.362793,-3084.490234,-3356.729736,-2073.150635,0.0
25%,6.0,-73.99202,40.734875,-73.991425,40.73399,1.0
50%,8.5,-73.981819,40.752621,-73.980179,40.753101,1.0
75%,12.5,-73.967155,40.767036,-73.963737,40.768059,2.0
max,499.0,2420.209473,404.983337,2467.752686,3351.403076,208.0


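Observations
* The minimum fare_amount is negative, which is not realistic, so we need to drop the rows with negative fares
* The mean fare_amount is $11.35 and the standard deviation is $9.81. A model that always predicts the mean would have an RMSE roughly equal to the standard deviation, so our models need an RMSE below about $9.81
* The maximum passenger count is 208, which looks like an outlier. We can explore it during EDA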

In [171]:
# drop all the rows where fare_amount is negative
print(f'before dropping fare_amount rows with negative value, Total Rows: {len(taxi_fare_train_org_df)}')
taxi_fare_train_df = taxi_fare_train_org_df[taxi_fare_train_org_df.fare_amount>=0]
print(f'after dropping fare_amount rows with negative value, Total Rows: {len(taxi_fare_train_df)}')

before dropping fare_amount rows with negative value, Total Rows: 552450
after dropping fare_amount rows with negative value, Total Rows: 552426


In [172]:
taxi_fare_train_df.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,552426.0,552426.0,552426.0,552426.0,552426.0,552426.0
mean,11.3549,-72.497543,39.910751,-72.504684,39.934444,1.684979
std,9.81104,11.616858,8.060573,12.073421,9.254789,1.337657
min,0.0,-1183.362793,-3084.490234,-3356.729736,-2073.150635,0.0
25%,6.0,-73.99202,40.734875,-73.991425,40.73399,1.0
50%,8.5,-73.981819,40.752621,-73.980179,40.753101,1.0
75%,12.5,-73.967163,40.767036,-73.963737,40.768059,2.0
max,499.0,2420.209473,404.983337,2467.752686,3351.403076,208.0


In [173]:
taxi_fare_test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9914 entries, 0 to 9913
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype              
---  ------             --------------  -----              
 0   key                9914 non-null   object             
 1   pickup_datetime    9914 non-null   datetime64[ns, UTC]
 2   pickup_longitude   9914 non-null   float32            
 3   pickup_latitude    9914 non-null   float32            
 4   dropoff_longitude  9914 non-null   float32            
 5   dropoff_latitude   9914 non-null   float32            
 6   passenger_count    9914 non-null   uint8              
dtypes: datetime64[ns, UTC](1), float32(4), object(1), uint8(1)
memory usage: 319.6+ KB


In [174]:
taxi_fare_test_df.describe()

Unnamed: 0,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,9914.0,9914.0,9914.0,9914.0,9914.0
mean,-73.974716,40.751041,-73.973656,40.75174,1.671273
std,0.042774,0.033541,0.039072,0.035435,1.278747
min,-74.25219,40.573143,-74.263245,40.568974,1.0
25%,-73.9925,40.736125,-73.991249,40.735253,1.0
50%,-73.982327,40.753052,-73.980015,40.754065,1.0
75%,-73.968012,40.767113,-73.964062,40.768757,2.0
max,-72.986534,41.709557,-72.990967,41.696682,6.0


In [175]:
taxi_fare_train_df.isnull().sum()

fare_amount          0
pickup_datetime      0
pickup_longitude     0
pickup_latitude      0
dropoff_longitude    0
dropoff_latitude     0
passenger_count      0
dtype: int64

No null or NaN values

In [176]:
import matplotlib.pyplot as plt

In [183]:
def remove_outliers_passenger_count(df: pd.DataFrame) -> pd.DataFrame:
    # Keep only rides with a plausible number of passengers (1 to 8)
    return df[(df['passenger_count'] >= 1) & (df['passenger_count'] <= 8)]

In [205]:
def add_part_dates_time(df: pd.DataFrame, col_name: str) -> None:
    """
    Split a datetime column into separate year, month, day, weekday and hour columns,
    then drop the original datetime column. Modifies df in place.
    """
    df[f'{col_name}_year'] = df[col_name].dt.year
    df[f'{col_name}_month'] = df[col_name].dt.month
    df[f'{col_name}_day'] = df[col_name].dt.day
    df[f'{col_name}_weekday'] = df[col_name].dt.weekday
    df[f'{col_name}_hour'] = df[col_name].dt.hour
    
    # we do not need original datetime column, so we can drop it
    df.drop(col_name, axis=1, inplace=True)


To find the average ride distance, we first have to convert the pickup and dropoff longitudes and latitudes into a distance.

#### Add Distance Between Pickup and Drop

We can use the haversine distance:

* https://en.wikipedia.org/wiki/Haversine_formula
* https://stackoverflow.com/questions/29545704/fast-haversine-approximation-python-pandas

In [218]:
def haversine_np(lon1:float, lat1:float, lon2:float, lat2:float)->float:
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)

    All args must be of equal length.    

    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km

In [219]:
def add_distance(df:pd.DataFrame)-> None:
    df['distance_km'] = haversine_np(
        df['pickup_longitude'], 
        df['pickup_latitude'], 
        df['dropoff_longitude'], 
        df['dropoff_latitude']
        )
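As a quick sanity check of the haversine formula, the sketch below applies it to the pickup and dropoff coordinates of one row from the `X_train` preview in this notebook. `haversine_km` is a compact, self-contained restatement of `haversine_np` written for this illustration, with the same Earth radius of 6367 km.

```python
import numpy as np

def haversine_km(lon1, lat1, lon2, lat2, radius_km=6367):
    # Same formula as haversine_np; inputs are in decimal degrees.
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Pickup and dropoff coordinates of row 106816 from the X_train preview
trip_km = haversine_km(-74.002106, 40.740608, -73.992126, 40.753681)
print(round(float(trip_km), 2))

# The function is vectorized, so equal-length arrays (or pandas Series) also work
lons = np.array([-74.002106, -73.99173])
lats = np.array([40.740608, 40.744503])
print(haversine_km(lons, lats, -73.992126, 40.753681))
```

The single-trip result is about 1.68 km, in line with the `distance_km` value the notebook reports for that row.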


We can remove location outliers by keeping only rows whose pickup and dropoff coordinates fall inside a bounding box around NYC.

In [227]:
def remove_outliers_location(df:pd.DataFrame) -> pd.DataFrame:
    return df[
                (df['pickup_longitude'] >= -75) & 
                (df['pickup_longitude'] <= -72) & 
                (df['dropoff_longitude'] >= -75) & 
                (df['dropoff_longitude'] <= -72) & 
                (df['pickup_latitude'] >= 40) & 
                (df['pickup_latitude'] <= 42) & 
                (df['dropoff_latitude'] >=40) & 
                (df['dropoff_latitude'] <= 42)
            ]

## 3. Prepare Dataset for Training

- Split Training & Validation Set
- Data Cleaning
- Extract Inputs & Outputs
   - Training
   - Validation
   - Test

### Split Training & Validation Set

We'll set aside 20% of the training data as the validation set, to evaluate the models we train on previously unseen data. 

Since the test set and training set have the same date ranges, we can pick a random 20% fraction.

In [234]:
from sklearn.model_selection import train_test_split

Let's start from a fresh training set, so that
* we can automate the whole preprocessing using the functions we already created
* we can easily skip the EDA section to save time

Before splitting a dataset that contains dates and times, we have to check whether we need to predict future data, past data, or data from the same period as the training set.
We can check this by comparing the dates in the training and test sets.

In [235]:
taxi_fare_train_org_df.pickup_datetime.sample(10)

367044   2014-10-19 19:25:36+00:00
417626   2015-06-09 23:31:15+00:00
511655   2012-12-16 10:20:00+00:00
308295   2009-07-26 21:26:44+00:00
96786    2011-01-14 15:31:22+00:00
28530    2014-07-07 23:44:40+00:00
372281   2011-12-27 21:00:53+00:00
199765   2011-07-08 15:14:48+00:00
48585    2013-05-28 20:11:00+00:00
8467     2009-07-07 20:40:34+00:00
Name: pickup_datetime, dtype: datetime64[ns, UTC]

In [236]:
taxi_fare_test_df.pickup_datetime.sample(10)

9499   2014-07-21 18:19:00+00:00
9188   2011-06-01 07:37:00+00:00
2437   2010-09-20 16:48:00+00:00
3397   2012-11-20 21:54:00+00:00
2708   2010-02-06 19:59:16+00:00
1774   2014-12-24 03:00:00+00:00
427    2011-06-24 12:03:00+00:00
7399   2011-12-13 22:00:00+00:00
3481   2009-03-16 10:30:12+00:00
9642   2012-11-20 21:54:00+00:00
Name: pickup_datetime, dtype: datetime64[ns, UTC]

### Fill/Remove Missing Values

There are no missing values in our sample, but if there were, we could simply drop the rows with missing values instead of trying to fill them (since we have a lot of training data).

In [237]:
taxi_fare_train_df = taxi_fare_train_df.dropna()

The training and test sets cover the same years, so we do not need to predict future data and a random split is fine.
If we did have to predict future data, we would have to keep that in mind while splitting the training data into training and validation sets: the split would have to be based on date, with the validation set containing only rides later than those in the training set. For example, the training set could contain data from 2009-2014 and the validation set only data from 2015.
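For comparison, a date-based split could look like the sketch below. It is not used in this notebook, since a random split is appropriate here. The data is synthetic, and `demo_df`, `cutoff`, `train_part`, and `val_part` are hypothetical names chosen for the illustration.

```python
import pandas as pd

# Synthetic rides spread over 2009-2015 (values are made up for illustration).
demo_df = pd.DataFrame({
    'pickup_datetime': pd.date_range('2009-01-01', '2015-06-30', periods=1000, tz='UTC'),
    'fare_amount': range(1000),
})

# Hypothetical cutoff: train on everything before 2015, validate on 2015.
cutoff = pd.Timestamp('2015-01-01', tz='UTC')
train_part = demo_df[demo_df['pickup_datetime'] < cutoff]
val_part = demo_df[demo_df['pickup_datetime'] >= cutoff]

print(len(train_part), len(val_part))
```

With this kind of split, every validation ride is later than every training ride, which mimics predicting the future.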

In [238]:
# Separate input feature and target
X = taxi_fare_train_df.drop('fare_amount', axis=1)
y = taxi_fare_train_df.fare_amount

In [239]:
X.columns

Index(['pickup_datetime', 'pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'passenger_count'],
      dtype='object')

In [240]:
X_train, X_test, y_train, y_test = train_test_split(
        X, 
        y, 
        test_size=0.2,
        shuffle=True, 
        random_state=42
    )

## Preprocessing

### Remove Outliers

We have already removed the rows with a negative fare amount.

In [241]:
def remove_outliers(df):
    """
    fare_amount outliers are already removed
    """
    df = remove_outliers_passenger_count(df=df)
    df = drop_longitude_lattitude_outliers(    
        df=df, 
        lat_col_name='pickup_latitude', 
        long_col_name='pickup_longitude'
    )
    df = drop_longitude_lattitude_outliers(
        df=df, 
        lat_col_name='dropoff_latitude', 
        long_col_name='dropoff_longitude'
    )
    df = remove_outliers_location(df=df)
    return df

In [242]:
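X_train = remove_outliers(X_train)
X_test = remove_outliers(X_test)
y_train, y_test = y_train.loc[X_train.index], y_test.loc[X_test.index]  # keep targets aligned with the filtered rows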

In [243]:
X_train.head(5)

Unnamed: 0,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
106816,2009-04-01 14:44:04+00:00,-74.002106,40.740608,-73.992126,40.753681,1
299923,2013-07-18 08:34:10+00:00,-73.99173,40.744503,-73.973915,40.751446,1
108730,2009-09-24 06:11:00+00:00,-73.994629,40.748676,-73.864937,40.770443,1
472304,2013-07-22 08:07:00+00:00,-73.999039,40.734142,-73.984924,40.751049,1
27344,2011-02-04 06:51:00+00:00,-73.97567,40.782063,-73.978851,40.757328,1


## Feature Engineering

#### Separate date and time

In [244]:
add_part_dates_time(df=X_train, col_name='pickup_datetime')
add_part_dates_time(df=X_test, col_name='pickup_datetime')

In [245]:
X_train.head(5)

Unnamed: 0,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,pickup_datetime_year,pickup_datetime_month,pickup_datetime_day,pickup_datetime_weekday,pickup_datetime_hour
106816,-74.002106,40.740608,-73.992126,40.753681,1,2009,4,1,2,14
299923,-73.99173,40.744503,-73.973915,40.751446,1,2013,7,18,3,8
108730,-73.994629,40.748676,-73.864937,40.770443,1,2009,9,24,3,6
472304,-73.999039,40.734142,-73.984924,40.751049,1,2013,7,22,0,8
27344,-73.97567,40.782063,-73.978851,40.757328,1,2011,2,4,4,6


#### Calculate trip distance

In [246]:
add_distance(X_train)
add_distance(X_test)

In [247]:
X_train.head(5)

Unnamed: 0,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,pickup_datetime_year,pickup_datetime_month,pickup_datetime_day,pickup_datetime_weekday,pickup_datetime_hour,distance_km
106816,-74.002106,40.740608,-73.992126,40.753681,1,2009,4,1,2,14,1.67816
299923,-73.99173,40.744503,-73.973915,40.751446,1,2013,7,18,3,8,1.686465
108730,-73.994629,40.748676,-73.864937,40.770443,1,2009,9,24,3,6,11.181247
472304,-73.999039,40.734142,-73.984924,40.751049,1,2013,7,22,0,8,2.223336
27344,-73.97567,40.782063,-73.978851,40.757328,1,2011,2,4,4,6,2.76176


In [248]:
X_train.distance_km.describe()

count    431117.000000
mean          3.326101
std           3.731683
min           0.000000
25%           1.253458
50%           2.154325
75%           3.910396
max         113.474625
Name: distance_km, dtype: float64

#### Add Distance From Popular Landmarks

- JFK Airport
- LGA Airport
- EWR Airport
- Times Square
- Met Museum
- World Trade Center

We'll add the distance from the dropoff location to each landmark.

In [249]:
jfk_lonlat = -73.7781, 40.6413
lga_lonlat = -73.8740, 40.7769
ewr_lonlat = -74.1745, 40.6895
met_lonlat = -73.9632, 40.7794
wtc_lonlat = -74.0099, 40.7126

In [250]:
def add_landmark_dropoff_distance(df, landmark_name, landmark_lonlat):
    lon, lat = landmark_lonlat
    df[landmark_name + '_drop_distance'] = haversine_np(lon, lat, df['dropoff_longitude'], df['dropoff_latitude'])

In [251]:
for a_df in [X_train, X_test, taxi_fare_test_df]:
    for name, lonlat in [('jfk', jfk_lonlat), ('lga', lga_lonlat), ('ewr', ewr_lonlat), ('met', met_lonlat), ('wtc', wtc_lonlat)]:
        add_landmark_dropoff_distance(a_df, name, lonlat)

In [253]:
X_train.head()

Unnamed: 0,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,pickup_datetime_year,pickup_datetime_month,pickup_datetime_day,pickup_datetime_weekday,pickup_datetime_hour,distance_km,jfk_drop_distance,lga_drop_distance,ewr_drop_distance,met_drop_distance,wtc_drop_distance
106816,-74.002106,40.740608,-73.992126,40.753681,1,2009,4,1,2,14,1.67816,21.934467,10.271628,16.934586,3.754353,4.804076
299923,-73.99173,40.744503,-73.973915,40.751446,1,2013,7,18,3,8,1.686465,20.542608,8.872175,18.242617,3.234372,5.274363
108730,-73.994629,40.748676,-73.864937,40.770443,1,2009,9,24,3,6,11.181247,16.108179,1.047018,27.576614,8.329019,13.794492
472304,-73.999039,40.734142,-73.984924,40.751049,1,2013,7,22,0,8,2.223336,21.269541,9.767874,17.370049,3.642241,4.76246
27344,-73.97567,40.782063,-73.978851,40.757328,1,2011,2,4,4,6,2.76176,21.267338,9.088562,18.119427,2.783923,5.616056


#### Scaling and One-Hot Encoding

We won't do this because we'll be training tree-based models which are generally able to do a good job even without the above.


##  Train & Evaluate Different Models

We'll train each of the following & submit predictions to Kaggle:

- Random Forests

Exercise: Train Ridge, SVM, KNN, and Decision Tree models.

## 4. Train Hardcoded & Baseline Models

- Hardcoded model: always predict average fare
- Baseline model: Linear regression 

The competition is evaluated using the root mean squared error (RMSE):
https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/overview/evaluation
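A minimal sketch of the two baselines on synthetic data might look as follows. Everything here is an assumption made for illustration: the fake fare formula (`2.5 + 1.6 * distance_km` plus noise), the single distance feature, and the names `rmse`, `mean_model`, and `linear_model`. On the real data, the inputs would be the engineered features from the previous sections.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def rmse(y_true, y_pred):
    # Root mean squared error, the competition metric
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

# Synthetic data: fares that grow roughly linearly with distance (made-up coefficients)
rng = np.random.default_rng(42)
distance_km = rng.uniform(0.5, 20, size=2000)
fare = 2.5 + 1.6 * distance_km + rng.normal(0, 2, size=2000)

X_demo = distance_km.reshape(-1, 1)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, fare, test_size=0.2, random_state=42)

# Hardcoded model: always predict the mean training fare
mean_model = DummyRegressor(strategy='mean').fit(X_tr, y_tr)
# Baseline model: ordinary least squares linear regression
linear_model = LinearRegression().fit(X_tr, y_tr)

mean_rmse = rmse(y_val, mean_model.predict(X_val))
linear_rmse = rmse(y_val, linear_model.predict(X_val))
print(f'Mean-prediction RMSE: {mean_rmse:.2f}')
print(f'Linear regression RMSE: {linear_rmse:.2f}')
```

The hardcoded model gives the score that any useful model has to beat, and the linear model shows how much a simple fit improves on it.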

In [252]:
taxi_fare_train_without_outlier_df = remove_outliers(taxi_fare_train_df)

## References
* https://www.kaggle.com/code/madhurisivalenka/cleansing-eda-modelling-lgbm-xgboost-starters
* https://www.kaggle.com/code/breemen/nyc-taxi-fare-data-exploration
* https://www.openstreetmap.org/export#map=10/41.1528/-73.6496
* https://www.kaggle.com/code/yekahaaagayeham/new-york-city-taxi-fare-prediction-eda-baseline#2.-Explore-the-Dataset