# New York City Taxi Fare Prediction
<img src= "https://miro.medium.com/max/1200/1*-Oa3eUBRoF4uzvJkp9OV_Q.jpeg" alt ="New York City taxi fare prediction banner" style='width:8600px;'>

Image Credit : <a href="https://medium.com/analytics-vidhya/new-york-city-taxi-fare-prediction-1ba96223ba7e">Medium article</a>


Let's train a machine learning model to predict the fare for a taxi ride in New York City, given the pickup date and time, pickup location, drop-off location and number of passengers.

Dataset Link: https://www.kaggle.com/c/new-york-city-taxi-fare-prediction


## Step 1. Loading the Dataset

- Install required libraries
- Download the data from Kaggle, or use a Kaggle Notebook to access it without downloading
- View dataset files
- Load training and test set with Pandas

### Load dataset

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
data_dir = '/kaggle/input/new-york-city-taxi-fare-prediction'

### View Dataset Files

In [None]:
# List of files with size
!ls -lh {data_dir}

As we can see, the training file is very large, so before loading it into a dataframe let's take a look at the data using shell commands.

In [None]:
# Training dataset
!head {data_dir}/train.csv

In [None]:
# Test dataset
!head {data_dir}/test.csv

In [None]:
# sample_submission file
!head {data_dir}/sample_submission.csv

In [None]:
# count number of lines in training dataset
!wc -l {data_dir}/train.csv

In [None]:
# count number of lines in test dataset
!wc -l {data_dir}/test.csv

In [None]:
# No. of lines in sample_submission file
!wc -l {data_dir}/sample_submission.csv

Observations:

- This is a supervised learning regression problem
- Training data is 5.5 GB in size and consists of 55.4 M rows
- Test set is much smaller (only ≈ 10 K rows)
- 8 columns are present:
    - `key` (unique ID field, used in submission)
    - `pickup_datetime`
    - `pickup_longitude`
    - `pickup_latitude`
    - `dropoff_longitude`
    - `dropoff_latitude`
    - `passenger_count`
    - `fare_amount` (target column)

- The test set has all columns except the target column.

### Loading Training and Test Datasets into pandas

Loading the entire dataset into a pandas dataframe would be slow, so let's take the following measures:

- Ignore the `key` column, because it cannot be used for prediction
- Parse `pickup_datetime` as a datetime
- Specify data types for other columns
   - `uint8` for passenger count
   - `float32` for geo coordinates
   - `float32` for fare amount
   
- Use only a 2% sample of the data for model training for now (≈1.1M rows)
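
The row-sampling idea can be tried on a tiny in-memory CSV before running it against the full file. The sketch below is illustrative only: `demo_csv`, `demo_frac` and `keep_row_or_skip` are hypothetical names, and the fraction is chosen arbitrarily. It relies on `pandas.read_csv` accepting a callable for `skiprows`, which receives each row index and skips the row when it returns `True`.

```python
import io
import random

import pandas as pd

# Hypothetical toy data standing in for train.csv: a header plus 1,000 rows.
demo_csv = "fare_amount,passenger_count\n" + "\n".join(
    f"{5 + i % 20}.5,{1 + i % 4}" for i in range(1000)
)

demo_frac = 0.1  # arbitrary fraction of rows to keep, for illustration only
rng = random.Random(10)  # a private generator keeps the sample reproducible


def keep_row_or_skip(row_idx):
    """Return True to skip a row; row 0 is the header and is always kept."""
    if row_idx == 0:
        return False
    return rng.random() > demo_frac


sample_df = pd.read_csv(
    io.StringIO(demo_csv),
    dtype={"fare_amount": "float32", "passenger_count": "uint8"},
    skiprows=keep_row_or_skip,
)

print(len(sample_df), "rows kept out of 1000")
print(sample_df.dtypes)
```

Because rows are dropped while the file is being parsed, the full dataset never has to fit in memory, and the compact dtypes reduce the footprint of the rows that are kept.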

In [None]:
import pandas as pd
import random
from datetime import datetime

In [None]:
selected_cols = 'fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count'.split(',')
dtypes = {
    'fare_amount': 'float32',
    'pickup_longitude': 'float32',
    'pickup_latitude': 'float32',
    'dropoff_longitude': 'float32',
    'dropoff_latitude' : 'float32',
    'passenger_count': 'uint8'
}

In [None]:
%%time

frac = 0.02
def skip_row(row_idx):
    if row_idx == 0:
        return False
    return random.random() > frac

random.seed(10)
# dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

taxi_df = pd.read_csv(data_dir+"/train.csv", 
                 usecols=selected_cols, 
                 dtype=dtypes, 
                 parse_dates=['pickup_datetime'],
                 skiprows=skip_row)

In [None]:
taxi_df

In [None]:
type(taxi_df.pickup_datetime[0])

In [None]:
# Load Test Set
test_df = pd.read_csv(data_dir+'/test.csv', dtype=dtypes, parse_dates=['pickup_datetime'])

In [None]:
test_df

## Step 2. Data Cleaning, Data Visualization and Feature Engineering

- Basic info about training set
- Basic info about test set
- Remove noise and outliers
- Exploratory data analysis & visualization
- Ask & answer questions
- Add features to dataset

### Training Dataset

In [None]:
taxi_df.info()

In [None]:
taxi_df.describe().apply(lambda s: s.apply(lambda x: format(x, 'g')))

In [None]:
taxi_df.isnull().sum()

In [None]:
taxi_df.fare_amount.lt(0).sum()

In [None]:
taxi_df.pickup_datetime.min(), taxi_df.pickup_datetime.max()

In [None]:
sum(taxi_df['fare_amount']>100)

Observations about training data:

- Some values are missing
- `fare_amount` is negative in some rows, which is not realistic, so I will drop these rows from the dataset.
- There seem to be some errors in the latitude & longitude values
- Dates range from 1st Jan 2009 to 30th June 2015
- The dataset takes up ~15 MB of space in the RAM

### Test Set

In [None]:
test_df.info()

In [None]:
test_df.describe()

In [None]:
test_df.pickup_datetime.min(), test_df.pickup_datetime.max()

Some observations about the test set:

- 9914 rows of data
- No missing values
- No obvious data entry errors
- 1 to 6 passengers (we can limit training data to this range)
- Latitudes lie between 40 and 42
- Longitudes lie between -75 and -72
- Pickup dates range from Jan 1st 2009 to Jun 30th 2015 (same as training set)

We can use the ranges of the test set to drop outliers/invalid data from the training set.

In [None]:
taxi_df = taxi_df.dropna()

In [None]:
# import libraries for data visualization
import matplotlib
import matplotlib.pyplot as plt
import plotly.express as px
import plotly
import seaborn as sns


In [None]:
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (12, 8)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

In [None]:
sns.displot(taxi_df['fare_amount'], kde=False, bins=100)

`fare_amount` is negative in some rows and absurdly large (around $500) in others. I am removing these outliers.

In [None]:
taxi_df = taxi_df[(taxi_df['fare_amount']>0) & (taxi_df['fare_amount'] <= 200)]

In [None]:
sns.displot(taxi_df[taxi_df['fare_amount']<=100]['fare_amount'], kde=False, bins=100)

In [None]:
sns.displot(taxi_df['passenger_count'], kde=False)

In [None]:
sum(taxi_df['passenger_count']>10)

In [None]:
sns.displot(taxi_df[taxi_df['passenger_count']<=10].passenger_count, kde=False, bins=10)

In [None]:
sum(taxi_df['passenger_count']==0)

In [None]:
sns.boxplot(data=taxi_df[taxi_df['passenger_count']<=10], x="passenger_count", y="fare_amount")

In [None]:
taxi_df[taxi_df['passenger_count']>8]

There are some rows in which the passenger count is 0, so I am going to drop those rows. It is also unlikely for a taxi to carry more than 5 passengers, and even in an extreme scenario a taxi can't carry more than 8. So let's drop these data points as well.

In [None]:
taxi_df = taxi_df[(taxi_df['passenger_count']<=8) & (taxi_df['passenger_count']>0)]

Since we have longitude and latitude values, we can plot these coordinates on a map to get a better view and see whether there is more data to clean.
First let's define the bounding box from the test dataset. The bounding box is the area defined by two longitudes and two latitudes that includes all spatial points.

In [None]:
bbox = (min(test_df.pickup_longitude.min(), test_df.dropoff_longitude.min()),
        max(test_df.pickup_longitude.max(), test_df.dropoff_longitude.max()),
        min(test_df.pickup_latitude.min(), test_df.dropoff_latitude.min()),
        max(test_df.pickup_latitude.max(), test_df.dropoff_latitude.max())
)
       
bbox

We can go to https://www.openstreetmap.org/export#map=5/51.500/-0.100 to get the desired map.
I followed this Medium article to get the map and plot the pickup and dropoff locations on it: https://towardsdatascience.com/easy-steps-to-plot-geographic-data-on-a-map-python-11217859a2db

In [None]:
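import PIL.Image        # importing the submodule explicitly makes PIL.Image available
import urllib.request   # likewise for urllib.request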
import io

url = 'https://i.imgur.com/xx2b9dC.png'
nyc_map = PIL.Image.open(urllib.request.urlopen(url))
nyc_map

In [None]:
nyc_map = np.array(nyc_map)

I do not have to predict fares for trips outside the bounding box, and the dataset is already large enough, so let's drop data points that fall outside it.

In [None]:
taxi_df = taxi_df[(taxi_df.pickup_longitude >= bbox[0]) & (taxi_df.pickup_longitude <= bbox[1]) &
            (taxi_df.pickup_latitude >= bbox[2]) & (taxi_df.pickup_latitude <= bbox[3]) & 
            (taxi_df.dropoff_longitude >= bbox[0]) & (taxi_df.dropoff_longitude <= bbox[1]) & 
            (taxi_df.dropoff_latitude >= bbox[2]) & (taxi_df.dropoff_latitude <= bbox[3])]

In [None]:
fig, ax = plt.subplots(1, 2, figsize = (15,9))
ax[0].scatter(taxi_df['pickup_longitude'], taxi_df['pickup_latitude'], zorder=1, alpha= 0.2, c='b', s=1)
ax[0].set_title('Pickup Locations')
ax[0].set_xlim(bbox[0],bbox[1])
ax[0].set_ylim(bbox[2],bbox[3])
ax[0].imshow(nyc_map, zorder=0, extent = bbox, aspect= 'equal')

ax[1].scatter(taxi_df['dropoff_longitude'], taxi_df['dropoff_latitude'], zorder=1, alpha= 0.2, c='b', s=1)
ax[1].set_title('Dropoff Locations')
ax[1].set_xlim(bbox[0],bbox[1])
ax[1].set_ylim(bbox[2],bbox[3])
ax[1].imshow(nyc_map, zorder=0, extent = bbox, aspect= 'equal')

plt.show()

In [None]:
longitude = list(taxi_df.pickup_longitude) + list(taxi_df.dropoff_longitude)
latitude = list(taxi_df.pickup_latitude) + list(taxi_df.dropoff_latitude)
plt.figure(figsize = (10,10))
plt.plot(longitude,latitude,'.', alpha = 0.4, markersize = 0.05)
plt.xlim(-74.05, -73.75)
plt.ylim(40.6, 40.9)
plt.show()

In [None]:
fig, ax = plt.subplots(1, 1, figsize = (15,9))
xdf = taxi_df[taxi_df['fare_amount']<80]

sp = ax.scatter(xdf.dropoff_longitude, xdf.dropoff_latitude, c=xdf.fare_amount,alpha= 0.4, s=5, cmap='Spectral')
fig.colorbar(sp)
ax.set_xlim(-74.05, -73.75)
ax.set_ylim(40.6, 40.9)
plt.show()

Places that are farther away have higher taxi fares, which makes sense.

It can be seen from the previous plots that some location points are in the water. Let's try to remove them using a mask image of the map in which land is shown in black and water in white. I took help from this notebook for that: https://www.kaggle.com/breemen/nyc-taxi-fare-data-exploration

In [None]:
url = 'https://i.imgur.com/ZGg3Bry.png'
nyc_mask = np.array(PIL.Image.open(urllib.request.urlopen(url)))[:,:,0]>(255*0.7)

In [None]:
nyc_mask.shape

In [None]:
plt.imshow(nyc_map, zorder=0)
plt.imshow(nyc_mask, alpha=0.7, cmap='gray')

In [None]:
def location_to_coor(longitude, latitude, dx, dy, bbox):
    return (dx*(longitude - bbox[0])/(bbox[1]-bbox[0])).astype('int'), (dy - dy*(latitude - bbox[2])/(bbox[3]-bbox[2])).astype('int')
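
One caveat with this mapping: a point lying exactly on the maximum longitude or the minimum latitude of the bounding box maps to pixel index `dx` or `dy`, which is one past the last valid index of the mask. A hedged variant that clips the result into range is sketched below; `location_to_pixel` and the toy bounding box are hypothetical and not part of the original notebook.

```python
import numpy as np


def location_to_pixel(longitude, latitude, dx, dy, bbox):
    """Map lon/lat arrays to integer pixel (x, y) indices of a dy-by-dx image.

    bbox is (min_lon, max_lon, min_lat, max_lat). Results are clipped so that
    points on the box edge still index a valid pixel.
    """
    longitude = np.asarray(longitude, dtype="float64")
    latitude = np.asarray(latitude, dtype="float64")
    x = (dx * (longitude - bbox[0]) / (bbox[1] - bbox[0])).astype("int")
    # Image rows grow downwards, so the latitude axis is flipped.
    y = (dy - dy * (latitude - bbox[2]) / (bbox[3] - bbox[2])).astype("int")
    return np.clip(x, 0, dx - 1), np.clip(y, 0, dy - 1)


# Toy bounding box and a 4x4 "mask", for illustration only.
toy_bbox = (-75.0, -72.0, 40.0, 42.0)
toy_mask = np.zeros((4, 4), dtype=bool)

# The second point sits on the max-longitude / min-latitude corner.
px, py = location_to_pixel([-75.0, -72.0], [42.0, 40.0], 4, 4, toy_bbox)
print(px, py)
print(toy_mask[py, px])
```

Without the clipping step, the second point would produce index 4 on both axes and the mask lookup would raise an `IndexError`.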

In [None]:
pickup_x, pickup_y = location_to_coor(taxi_df.pickup_longitude, taxi_df.pickup_latitude, 
                                  nyc_mask.shape[1], nyc_mask.shape[0], bbox)
dropoff_x, dropoff_y = location_to_coor(taxi_df.dropoff_longitude, taxi_df.dropoff_latitude, 
                                  nyc_mask.shape[1], nyc_mask.shape[0], bbox)

In [None]:
idx = (nyc_mask[pickup_y, pickup_x] & nyc_mask[dropoff_y, dropoff_x])
print("Number of trips in water: {}".format(np.sum(idx)))

In [None]:
np.count_nonzero(idx==0)

In [None]:
taxi_df[idx]

All of these coordinates lie on water. I've checked two of them using <a href="https://www.google.com/maps/place/41%C2%B002'57.1%22N+73%C2%B016'51.6%22W/@40.9199231,-73.5469829,8.24z/data=!4m5!3m4!1s0x0:0x4004ca2e07ed014b!8m2!3d41.049183!4d-73.281006">Google Maps</a>.
Let's drop these data points.

In [None]:
taxi_df = taxi_df[~idx]

### Extract Parts of Date

- Year
- Month
- Day
- Weekday
- Hour
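
To make these parts concrete, here is a small sketch using the pandas `.dt` accessor on a single, hypothetical pickup time (the timestamp is chosen only for illustration).

```python
import pandas as pd

# A single hypothetical pickup time, used only to show what each part looks like.
demo = pd.DataFrame({"pickup_datetime": pd.to_datetime(["2015-06-30 23:59:00"])})

parts = {
    "year": demo["pickup_datetime"].dt.year.iloc[0],
    "month": demo["pickup_datetime"].dt.month.iloc[0],
    "day": demo["pickup_datetime"].dt.day.iloc[0],
    # pandas numbers weekdays from Monday=0 to Sunday=6
    "weekday": demo["pickup_datetime"].dt.weekday.iloc[0],
    "hour": demo["pickup_datetime"].dt.hour.iloc[0],
}
print(parts)
```

The weekday convention matters when reading plots of the weekday feature: 0 is Monday and 6 is Sunday.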

In [None]:
def add_dateparts(df, col):
    df[col + '_year'] = df[col].dt.year
    df[col + '_month'] = df[col].dt.month
    df[col + '_day'] = df[col].dt.day
    df[col + '_weekday'] = df[col].dt.weekday
    df[col + '_hour'] = df[col].dt.hour

In [None]:
add_dateparts(taxi_df, 'pickup_datetime')

In [None]:
add_dateparts(test_df, 'pickup_datetime')

In [None]:
test_df

### Add Distance Between Pickup and Drop

I have calculated the haversine distance: 
- https://en.wikipedia.org/wiki/Haversine_formula
- https://stackoverflow.com/questions/29545704/fast-haversine-approximation-python-pandas

In [None]:
def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)

    All args must be of equal length.    

    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km
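
A few sanity checks help confirm that the formula behaves as expected. The sketch below re-implements it for scalar inputs with the standard `math` module; `haversine_km` is a hypothetical helper, and the check points are arbitrary.

```python
import math

EARTH_RADIUS_KM = 6367  # same radius as used in haversine_np


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return EARTH_RADIUS_KM * 2 * math.asin(math.sqrt(a))


same_point = haversine_km(-73.98, 40.75, -73.98, 40.75)
one_degree_north = haversine_km(-73.98, 40.0, -73.98, 41.0)
there = haversine_km(-73.7781, 40.6413, -73.8740, 40.7769)
back = haversine_km(-73.8740, 40.7769, -73.7781, 40.6413)

print(same_point)                 # a point is at distance 0 from itself
print(round(one_degree_north, 1)) # one degree of latitude is roughly 111 km
print(math.isclose(there, back))  # the distance is symmetric
```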

In [None]:
def add_trip_distance(df):
    df['trip_distance'] = haversine_np(df['pickup_longitude'], df['pickup_latitude'], df['dropoff_longitude'], df['dropoff_latitude'])

In [None]:
%%time
add_trip_distance(taxi_df)
add_trip_distance(test_df)

In [None]:
taxi_df.describe().apply(lambda s: s.apply(lambda x: format(x, 'g')))

In [None]:
taxi_df

In [None]:
sns.scatterplot(data = taxi_df, x="pickup_datetime_day", y="fare_amount")

Fares seem to be uniformly distributed across the days of the month.

In [None]:
sns.scatterplot(data = taxi_df, x="pickup_datetime_year", y="fare_amount")

In [None]:
taxi_df.groupby('pickup_datetime_year')['fare_amount'].mean()

Fare amount seems to be steadily increasing by year, as expected.

In [None]:
sns.scatterplot(data = taxi_df, x="pickup_datetime_weekday", y="fare_amount")

In [None]:
sns.histplot(x=taxi_df['pickup_datetime_hour'], bins=24)  # histplot replaces the deprecated distplot

The number of trips is lowest from midnight to 5 am and highest in the evening, when people are returning from their workplaces. Nothing unexpected. Time of day also plays an important role.

In [None]:
sns.barplot(x='pickup_datetime_hour',y='fare_amount', data=taxi_df)

In [None]:
sns.barplot(x='pickup_datetime_hour',y='trip_distance', data=taxi_df)

Fare is higher between 3-6 am and 2-4 pm. It may be that people living far away from their workplaces prefer to leave early to avoid rush hour.

In [None]:
sns.histplot(x=taxi_df['pickup_datetime_weekday'], bins=7)  # histplot replaces the deprecated distplot

Trips are divided fairly uniformly across all days of the week.

In [None]:
sns.barplot(x='pickup_datetime_weekday',y='fare_amount', data=taxi_df)

There seems to be a slight increase in the average fare amount on Sunday. Maybe people are going on weekend trips. Or maybe I am overthinking it?

In [None]:
sns.histplot(taxi_df['trip_distance'], kde=True, stat='density')  # histplot replaces the deprecated distplot

In [None]:
taxi_df[taxi_df['trip_distance']>100]

### Add Distance From Popular Landmarks

- JFK Airport
- LGA Airport
- EWR Airport
- Met Museum
- World Trade Center

We'll add the distance from each landmark to the drop location.

In [None]:
jfk_lonlat = -73.7781, 40.6413
lga_lonlat = -73.8740, 40.7769
ewr_lonlat = -74.1745, 40.6895
met_lonlat = -73.9632, 40.7794
wtc_lonlat = -74.0099, 40.7126

In [None]:
def add_landmark_dropoff_distance(df, landmark_name, landmark_lonlat):
    lon, lat = landmark_lonlat
    df[landmark_name + '_drop_distance'] = haversine_np(lon, lat, df['dropoff_longitude'], df['dropoff_latitude'])

In [None]:
%%time
for a_df in [taxi_df, test_df]:
    for name, lonlat in [('jfk', jfk_lonlat), ('lga', lga_lonlat), ('ewr', ewr_lonlat), ('met', met_lonlat), ('wtc', wtc_lonlat)]:
        add_landmark_dropoff_distance(a_df, name, lonlat)

In [None]:
test_df

## Step 3. Prepare Dataset for Training

- Split Training & Validation Set
- Fill/Remove Missing Values
- Extract Inputs & Outputs
   - Training
   - Validation
   - Test

### Split Training & Validation Set

The test set covers the same time range as the training set (2009-2015), so a random split is appropriate. I'll set aside a random 20% of the training data as the validation set, to evaluate the models on previously unseen data.


In [None]:
from sklearn.model_selection import train_test_split

In [None]:
train_df, val_df = train_test_split(taxi_df, test_size=0.2, random_state=10)

In [None]:
len(train_df), len(val_df)

### Extract Inputs and Outputs

In [None]:
taxi_df.columns

In [None]:
input_cols = ['pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'passenger_count',
       'pickup_datetime_year', 'pickup_datetime_month', 'pickup_datetime_day',
       'pickup_datetime_weekday', 'pickup_datetime_hour', 'trip_distance',
       'jfk_drop_distance', 'lga_drop_distance', 'ewr_drop_distance',
       'met_drop_distance', 'wtc_drop_distance']

In [None]:
target_col = 'fare_amount'

#### Training

In [None]:
train_inputs = train_df[input_cols]

In [None]:
train_targets = train_df[target_col]

In [None]:
train_inputs

In [None]:
train_targets

#### Validation

In [None]:
val_inputs = val_df[input_cols]

In [None]:
val_targets = val_df[target_col]

In [None]:
val_inputs

In [None]:
val_targets

#### Test

In [None]:
test_inputs = test_df[input_cols]

In [None]:
test_inputs

### Scaling and One-Hot Encoding

I am not going to do this because I'll be training tree-based models which are generally able to do a good job even without the above.
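
The claim about scaling can be illustrated for a single regression-tree split. The sketch below is a simplified, pure-Python illustration (not how any library implements trees): it finds the split that minimises the squared error on a toy feature and shows that standardising the feature leaves the chosen partition unchanged. The data and the helper `best_split_partition` are hypothetical, and the sketch covers scaling only, not one-hot encoding.

```python
def best_split_partition(xs, ys):
    """Return the set of row indices sent left by the best single-feature split.

    The best split minimises the summed squared error around each side's mean,
    which is the criterion a regression tree uses at one node.
    """
    def sse(values):
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_cost, best_left = float("inf"), None
    for cut in range(1, len(order)):
        left, right = order[:cut], order[cut:]
        cost = sse([ys[i] for i in left]) + sse([ys[i] for i in right])
        if cost < best_cost:
            best_cost, best_left = cost, frozenset(left)
    return best_left


# Hypothetical toy data: trip distance in km and fare in dollars.
distance_km = [0.8, 1.5, 2.1, 9.0, 12.5, 15.0]
fare = [4.5, 6.0, 7.5, 28.0, 35.0, 41.0]

# Standardise the feature by hand (subtract the mean, divide by the std).
mean = sum(distance_km) / len(distance_km)
std = (sum((d - mean) ** 2 for d in distance_km) / len(distance_km)) ** 0.5
distance_scaled = [(d - mean) / std for d in distance_km]

same = best_split_partition(distance_km, fare) == best_split_partition(distance_scaled, fare)
print(same)
```

A split depends only on the ordering of the feature values, and standardisation preserves that ordering, so the tree groups the rows identically either way.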

### Save Intermediate DataFrames

Let's save the processed datasets in the Apache Parquet format, so that I will be able to download them easily and continue model training on my local machine.

They can also be used to create a separate notebook for training and evaluating models after EDA and feature engineering.

In [None]:
train_df.to_parquet('train.parquet')

In [None]:
val_df.to_parquet('val.parquet')

In [None]:
test_df.to_parquet('test.parquet')

## Step 4. Train & Evaluate Different Models

I will train each of the following & submit predictions to Kaggle:

- Gradient Boosting
- LightGBM
- ANN

Linear Regression and Random Forests could also be trained for this prediction task.

In [None]:
train_df = pd.read_parquet('train.parquet', engine='pyarrow')
val_df = pd.read_parquet('val.parquet', engine='pyarrow')
test_df = pd.read_parquet('test.parquet', engine='pyarrow')

Let's define a helper function to evaluate models and generate test predictions

In [None]:
from sklearn.metrics import mean_squared_error
def evaluate(model):
    # RMSE is computed as the square root of the MSE, which works across scikit-learn versions
    train_preds = model.predict(train_inputs)
    train_rmse = np.sqrt(mean_squared_error(train_targets, train_preds))
    val_preds = model.predict(val_inputs)
    val_rmse = np.sqrt(mean_squared_error(val_targets, val_preds))
    return train_rmse, val_rmse, train_preds, val_preds
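
For reference, the metric returned by `evaluate` is the root mean squared error: the square root of the average squared difference between targets and predictions. A minimal sketch with hypothetical numbers (the `rmse` helper is for illustration only):

```python
import math


def rmse(targets, preds):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(targets, preds)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical fares (in dollars) and predictions, for illustration only.
actual = [10.0, 20.0, 30.0]
predicted = [12.0, 20.0, 26.0]

# Errors are 2, 0 and 4, so the mean squared error is (4 + 0 + 16) / 3.
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few badly mispredicted fares raise the RMSE more than many small errors do.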

In [None]:
def predict_and_submit(model, fname):
    test_preds = model.predict(test_inputs)
    sub_df = pd.read_csv(data_dir+'/sample_submission.csv')
    sub_df['fare_amount'] = test_preds
    sub_df.to_csv(fname, index=None)
    return sub_df

## Gradient Boosting

### 1. XGBoost

https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn

In [None]:
from xgboost import XGBRegressor

In [None]:
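# Note: XGBoost 2.0+ deprecates 'gpu_hist'; there, use tree_method='hist' together with device='cuda'
xgb_model_final = XGBRegressor(objective='reg:squarederror', n_jobs=-1, random_state=42,
                               n_estimators=500, max_depth=5, learning_rate=0.1, 
                               subsample=0.8, colsample_bytree=0.8, tree_method= 'gpu_hist')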

In [None]:
%%time
xgb_model_final.fit(train_inputs, train_targets)

In [None]:
evaluate(xgb_model_final)

This model gives considerably better predictions than the base XGBoost model. Let's also plot the feature importances to see which features matter most for the predictions.

In [None]:
importance_df = pd.DataFrame({
    'feature': train_inputs.columns,
    'importance': xgb_model_final.feature_importances_
}).sort_values('importance', ascending=False)

In [None]:
def plot_importance(importance_df):
    plt.figure(figsize=(10,6))
    plt.title('Feature Importance')
    sns.barplot(data=importance_df.head(10), x='importance', y='feature')

In [None]:
plot_importance(importance_df)

In [None]:
predict_and_submit(xgb_model_final, 'xgb_tuned_submission.csv')

### 2. LightGBM

https://lightgbm.readthedocs.io/en/latest/Python-Intro.html

In [None]:
import lightgbm as lgb

In [None]:
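# `silent` is no longer accepted by lgb.Dataset in recent LightGBM versions; verbosity goes through `params`
dtrain = lgb.Dataset(train_inputs, label=train_targets, params={'verbose': -1}, free_raw_data=False)
dval = lgb.Dataset(val_inputs, label=val_targets, reference=dtrain, params={'verbose': -1}, free_raw_data=False)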

In [None]:
params = {
        'boosting_type':'gbdt',
        'objective': 'regression',
        'nthread': -1,
        'verbose': -1,
        'metric': 'rmse',
    }

In [None]:
lgbm_base_model = lgb.train(params, train_set = dtrain, valid_sets = [dval])

In [None]:
evaluate(lgbm_base_model)

Tuning Hyperparameters for LGBM

https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRegressor.html

https://towardsdatascience.com/kagglers-guide-to-lightgbm-hyperparameter-tuning-with-optuna-in-2021-ed048d9838b5

I'm using Optuna for hyperparameter tuning. This is the first time I am using it in a Kaggle competition.

In [None]:
import optuna

In [None]:
from optuna.integration import LightGBMPruningCallback


def objective(trial):
    
    param_grid = {
        "objective": trial.suggest_categorical("objective", ["regression"]),
        'metric': trial.suggest_categorical("metric", ['rmse']),
        "boosting_type": trial.suggest_categorical("boosting_type", ['gbdt']),
        "verbose" :trial.suggest_categorical("verbose", [-1]),
        "device_type": trial.suggest_categorical("device_type", ['gpu']),
        "num_boost_round": trial.suggest_categorical("num_boost_round", [1000]),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3),
        "num_leaves": trial.suggest_int("num_leaves", 20, 3000, step=20),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 200, 10000, step=100),
        "lambda_l1": trial.suggest_int("lambda_l1", 0, 100, step=5),
        "lambda_l2": trial.suggest_int("lambda_l2", 0, 100, step=5),
        "min_gain_to_split": trial.suggest_float("min_gain_to_split", 0, 15),
        "bagging_fraction": trial.suggest_float(
            "bagging_fraction", 0.2, 0.9, step=0.1
        ),
        "bagging_freq": trial.suggest_categorical("bagging_freq", [1]),
        "feature_fraction": trial.suggest_float(
            "feature_fraction", 0.2, 0.9, step=0.1
        ),
    }

    pruning_callback = optuna.integration.LightGBMPruningCallback(trial, 'rmse')


    model = lgb.train(param_grid, dtrain,
        valid_sets=[dval],
        # recent LightGBM versions take early stopping as a callback, not as a keyword argument
        callbacks=[lgb.early_stopping(stopping_rounds=100), pruning_callback]
    )
    
    val_preds = model.predict(val_inputs)
    val_rmse = np.sqrt(mean_squared_error(val_targets, val_preds))


    return val_rmse
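
The objective above relies on early stopping to cut training short once the validation score stops improving. The sketch below is a simplified, pure-Python illustration of that idea, not LightGBM's actual implementation; `rounds_until_early_stop` and the score list are hypothetical.

```python
def rounds_until_early_stop(val_scores, patience):
    """Return (best_round, rounds_run) for a list of per-round validation RMSEs.

    Training stops once `patience` consecutive rounds fail to improve on the
    best score seen so far; rounds are numbered from 1.
    """
    best_score, best_round = float("inf"), 0
    for round_no, score in enumerate(val_scores, start=1):
        if score < best_score:
            best_score, best_round = score, round_no
        elif round_no - best_round >= patience:
            return best_round, round_no
    return best_round, len(val_scores)


# Hypothetical validation RMSE per boosting round.
scores = [5.0, 4.2, 3.9, 3.95, 4.0, 3.97, 4.1, 3.8]
print(rounds_until_early_stop(scores, patience=3))
```

With a patience of 3, training stops at round 6 and keeps round 3 as the best iteration, so it never sees the lower score at round 8. A patience that is too small can stop too early; a larger one trades extra training time for a lower risk of that.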


In [None]:
study = optuna.create_study(direction="minimize", study_name="LGBM Regressor")
func = lambda trial: objective(trial)
study.optimize(func, n_trials=25)

In [None]:
print(f"\tBest value (rmse): {study.best_value:.5f}")
print(f"\tBest params:")

for key, value in study.best_params.items():
    print(f"\t\t{key}: {value}")

In [None]:
lgbm_final_model = lgb.train(study.best_params, train_set = dtrain, valid_sets = [dval], callbacks=[lgb.early_stopping(stopping_rounds=100)])

In [None]:
evaluate(lgbm_final_model)

In [None]:
importance_df = pd.DataFrame({
    'feature': train_inputs.columns,
    'importance': lgbm_final_model.feature_importance()
}).sort_values('importance', ascending=False)

In [None]:
plot_importance(importance_df)

In [None]:
predict_and_submit(lgbm_final_model, 'lgbm_tuned_submission.csv')

## ANN

Finally, I'm training a neural network for this regression task. I will be using a network with 4 hidden layers.

https://www.tensorflow.org/tutorials/keras/regression

https://keras.io/guides/sequential_model/

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

In [None]:
train_inputs.shape[1]

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

train_scalin = scaler.fit_transform(train_inputs)
val_scalin = scaler.transform(val_inputs)
test_scalin = scaler.transform(test_inputs)
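
Unlike the tree-based models, the neural network benefits from standardised inputs. The scaler is fit on the training inputs only, and the same statistics are then applied to the validation and test inputs. A minimal NumPy sketch of that fit/transform split, on hypothetical toy arrays:

```python
import numpy as np

# Hypothetical toy inputs: two features on very different scales.
toy_train = np.array([[1.0, 2009.0], [3.0, 2012.0], [5.0, 2015.0]])
toy_val = np.array([[2.0, 2010.0]])

# "Fit": statistics come from the training rows only.
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# "Transform": the same statistics are reused for every split.
train_scaled = (toy_train - mean) / std
val_scaled = (toy_val - mean) / std

print(train_scaled.mean(axis=0))  # approximately 0 for each feature
print(train_scaled.std(axis=0))   # approximately 1 for each feature
print(val_scaled)
```

Reusing the training statistics keeps information from the validation and test sets out of preprocessing, so the validation score stays an honest estimate.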

In [None]:
len(test_scalin)

In [None]:
def root_mean_squared_error(y_true, y_pred):
    # RMSE loss written with TensorFlow ops, so it does not depend on keras.backend
    return tf.sqrt(tf.reduce_mean(tf.square(y_pred - y_true)))

In [None]:
# define a deep neural network model
def build_and_compile_model(dim):
    model = keras.Sequential([

      layers.Dense(128, activation='relu', input_dim=dim),
      layers.BatchNormalization(),

      layers.Dense(64, activation='relu'),
      layers.BatchNormalization(),

      layers.Dense(32, activation='relu'),
      layers.BatchNormalization(),

      layers.Dense(8, activation='relu'),
      layers.BatchNormalization(),

      layers.Dense(1)
    ])

    model.compile(loss=root_mean_squared_error,
                optimizer=tf.keras.optimizers.Adam(0.001), metrics=['mae'])
    return model


In [None]:
dnn_model = build_and_compile_model(dim=train_inputs.shape[1])
dnn_model.summary()

In [None]:
ep_no = 50
Batch = 128

In [None]:
%%time
history = dnn_model.fit(
    train_scalin,
    train_targets,
    validation_data=(val_scalin, val_targets),
    validation_steps=len(val_scalin) // Batch,
    batch_size=    Batch,
    epochs=ep_no, verbose=1)

In [None]:
plt.plot(history.history['loss'], label='loss')
plt.plot(history.history['val_loss'], label='val_loss')
plt.ylim([0, 10])
plt.xlabel('Epoch')
plt.ylabel('Error')
plt.legend()
plt.grid(True)

In [None]:
preds = dnn_model.predict(test_scalin, batch_size=Batch, verbose=1)

In [None]:
sub_df = pd.read_csv(data_dir+'/sample_submission.csv')
sub_df['fare_amount'] = preds.ravel()  # flatten the (n, 1) prediction array to 1-D
sub_df.to_csv('DNN_Submission.csv', index=None)

## Future Work

- Calculate the density of dropoff locations and see how it affects the fare amount
- Train on GPU with the entire dataset using `dask`, `cudf` and `cuml`