Weverton Domingos de Medeiros - weverton.medeiros@ee.ufcg.edu.br
![](https://www.peoplesworld.org/wp-content/uploads/2017/08/taxisinnewyork960.jpg)

This project applies machine learning techniques and data handling skills to the New York City Taxi Fare Prediction competition. The dataset is first pre-processed and enriched through feature engineering to obtain good model performance, and the resulting predictions are then submitted to the competition.

# Dataset description

### Name: 
* New York City Taxi Fare Prediction

### File descriptions:
* train.csv - Input features and target fare_amount values for the training set (about 55M rows).
* test.csv - Input features for the test set (about 10K rows). Your goal is to predict fare_amount for each row.
* sample_submission.csv - a sample submission file in the correct format (columns key and fare_amount). This file 'predicts' fare_amount to be $11.35 for all rows, which is the mean fare_amount from the training set.

### Data fields
#### ID
* key - Unique string identifying each row in both the training and test sets. Comprised of pickup_datetime plus a unique integer, but this doesn't matter, it should just be used as a unique ID field. Required in your submission CSV. Not necessarily needed in the training set, but could be useful to simulate a 'submission file' while doing cross-validation within the training set.

#### Features
* pickup_datetime - timestamp value indicating when the taxi ride started.
* pickup_longitude - float for longitude coordinate of where the taxi ride started.
* pickup_latitude - float for latitude coordinate of where the taxi ride started.
* dropoff_longitude - float for longitude coordinate of where the taxi ride ended.
* dropoff_latitude - float for latitude coordinate of where the taxi ride ended.
* passenger_count - integer indicating the number of passengers in the taxi ride.

#### Target
* fare_amount - float dollar amount of the cost of the taxi ride. This value is only in the training set; this is what you are predicting in the test set and it is required in your submission CSV.

#### Support - This analysis draws on the following notebook: https://www.kaggle.com/breemen/nyc-taxi-fare-data-exploration

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Libraries used for the EDA, pre-processing and model evaluation.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import lightgbm as lgbm
from sklearn import metrics
from sklearn.metrics import mean_squared_error
import category_encoders as ce
import itertools
import math
import datetime

Functions developed for the project to simplify feature engineering and pre-processing.

* **remove_geog_outliers(df), remove_passenger_outliers(df), remove_fare_outliers(df):** functions that return the dataframe with the outlier rows removed.

* **haversine_dist(long_pickup, long_dropoff, lat_pickup, lat_dropoff):** a function that calculates the haversine distance between two geolocation points.

* **datetime_features(df):** a function that extracts the individual components of the datetime feature, such as *year*, *month*, *hour*, etc.

* **rush_hour(row):** a function that flags whether the trip started during rush hour. It was created to help the model improve its predictions.

* **is_weekend(row):** a function that flags whether the trip took place on the weekend or during the week. It was created to help the model improve its predictions.

* **define_airport(row):** a function that flags whether the pickup or dropoff location is at an airport. It was created to help the model improve its predictions.

Loading the data. Only the first 5 million rows of the training set are read.

In [None]:
X = pd.read_csv('../input/new-york-city-taxi-fare-prediction/train.csv', nrows = 5_000_000)
X_test = pd.read_csv('../input/new-york-city-taxi-fare-prediction/test.csv')

X.head()

# Import and initial processing of data
Overview of the dataset and the characteristics of its variables.

In [None]:
X.info()

In [None]:
X.describe()

The output of the *describe()* method shows that some features contain unrealistic values. For example, the minimum of *fare_amount* is negative, and its maximum is very high. The same approach makes it easy to spot outliers in the other features. These outliers are therefore removed in the pre-processing stage.
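
The same check can be made explicit by counting the rows that break simple plausibility rules. The sketch below is only an illustration: it uses a small hand-made frame (`sample`, with invented values) in place of `X`, and the three rules are assumptions chosen because they hold for any real taxi trip.

```python
import pandas as pd

# Tiny stand-in for X; the values are invented for illustration only.
sample = pd.DataFrame({
    'fare_amount': [4.5, -2.9, 16.9, 500.0],
    'pickup_latitude': [40.72, 40.71, 401.08, 40.76],
    'passenger_count': [1, 2, 1, 0],
})

# Each rule is a boolean mask; summing a mask counts the offending rows.
checks = {
    'negative fare': sample.fare_amount < 0,
    'latitude outside [-90, 90]': sample.pickup_latitude.abs() > 90,
    'no passengers': sample.passenger_count == 0,
}
outlier_counts = {name: int(mask.sum()) for name, mask in checks.items()}
print(outlier_counts)
```

Running the same dictionary of masks against `X` would give a quick count of how many rows each rule affects before anything is removed.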

In [None]:
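# Restrict to numeric columns: 'key' and 'pickup_datetime' are still strings here
X_correlation = X.select_dtypes(include = 'number').corr()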
sns.heatmap(X_correlation, annot = True, cmap = 'viridis')
plt.title('Correlation among features')
plt.show()

From this correlation map, no conclusion can be drawn about the relationship between *fare_amount* and the other features, since none of them is strongly correlated with it. However, some features are correlated with each other, such as *pickup_latitude* with *dropoff_latitude* and *pickup_longitude* with *dropoff_longitude*. This suggests that, with good feature engineering, the model can be improved to predict *fare_amount* with less error.

# Pre-processing and Feature Engineering

In [None]:
print('Size before removal of negative values: ', len(X))

X = X[X.fare_amount >= 0]
print('Size after removal of negative values: ', len(X))

In [None]:
s = (X.dtypes == 'object')
object_cols = list(s[s].index)

print("Categorical variables:")
print(object_cols)

In [None]:
# minimum and maximum longitude test
min(X.pickup_longitude.min(), X.dropoff_longitude.min()), \
max(X.pickup_longitude.max(), X.dropoff_longitude.max())

In [None]:
# minimum and maximum latitude test
min(X.pickup_latitude.min(), X.dropoff_latitude.min()), \
max(X.pickup_latitude.max(), X.dropoff_latitude.max())

As the *describe()* output above already indicated, some variables contain outliers, and the cells above demonstrate a few of them.

In [None]:
fig, (ax1,ax2) = plt.subplots(figsize = [12,12], nrows = 1, ncols = 2)
plt.subplots_adjust(left = 0, bottom = None, right = 1, top = 0.5, 
                    wspace = 0.2, hspace = None)

ax1.boxplot(X['fare_amount'])
ax1.set_title('Fare amount')
ax1.set_ylabel('Fare amount',fontsize = 10)

ax2.boxplot(X['passenger_count'])
ax2.set_title('Passenger count')
ax2.set_ylabel('Passenger count',fontsize = 10)

fig, (ax3,ax4) = plt.subplots(figsize = [12,12], nrows = 1, ncols = 2)
plt.subplots_adjust(left = 0, bottom = None, right = 1, top = 0.5, 
                    wspace = 0.2, hspace = None)

ax3.boxplot(X['pickup_longitude'])
ax3.set_title('Pickup longitude')
ax3.set_ylabel('Pickup longitude',fontsize = 10)

ax4.boxplot(X['pickup_latitude'])
ax4.set_title('Pickup latitude')
ax4.set_ylabel('Pickup latitude',fontsize = 10)

The plots above show that, in addition to treating missing data, it is important to remove disproportionate and unrealistic values. The cells below remove the values that could hinder the performance of the model.

New York City is located at approximately latitude 40.730610 and longitude -73.935242, so valid coordinates are expected to lie in a region around that point.

In [None]:
def remove_geog_outliers(df):
    return df[(df.pickup_latitude < 49.7) & (df.pickup_latitude > 31.77) &
             (df.dropoff_latitude < 49.7) & (df.dropoff_latitude > 31.77) &
             (df.pickup_longitude > -77) & (df.pickup_longitude < -71) &
             (df.dropoff_longitude > -77) & (df.dropoff_longitude < -71)]

In [None]:
init_amount, _ = X.shape
print('Removing outliers from geographic data features')
X = remove_geog_outliers(X)
final_amount, _ = X.shape
print('Amount of data removed in (%): {:.03f}%'.format((1 - final_amount/init_amount)*100))

In [None]:
def remove_fare_outliers(df):
    return df[(df.fare_amount >= 2.5) & (df.fare_amount <= 250)]

In [None]:
init_amount, _ = X.shape
print('Removing outliers from fare amount feature')
X = remove_fare_outliers(X)
final_amount, _ = X.shape
print('Amount of data removed in (%): {:.03f}%'.format((1 - final_amount/init_amount)*100))

In [None]:
def remove_passenger_outliers(df):
    return df[(df.passenger_count > 0) & (df.passenger_count <= 7)]

In [None]:
init_amount, _ = X.shape
print('Removing outliers from passenger count feature')
X = remove_passenger_outliers(X)
final_amount, _ = X.shape
print('Amount of data removed in (%): {:.03f}%'.format((1 - final_amount/init_amount)*100))
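
The three removal cells above repeat the same pattern: measure the size, filter, measure again, and print the percentage removed. As a sketch only, that pattern could be wrapped in one helper. `filter_and_report` is a hypothetical name, not part of the original notebook, and the demo frame holds invented values.

```python
import pandas as pd

def filter_and_report(df, mask, label):
    """Keep the rows where mask is True and print the share of rows removed."""
    kept = df[mask]
    removed_pct = (1 - len(kept) / len(df)) * 100
    print('{}: removed {:.3f}% of rows'.format(label, removed_pct))
    return kept

# Invented demo values; the bounds mirror the fare and passenger filters above.
demo = pd.DataFrame({'fare_amount': [1.0, 5.0, 12.5, 300.0],
                     'passenger_count': [1, 0, 2, 3]})
demo = filter_and_report(demo, demo.fare_amount.between(2.5, 250), 'fare')
demo = filter_and_report(demo, demo.passenger_count.between(1, 7), 'passengers')
```

`Series.between` is inclusive on both ends by default, which matches `>= 2.5` and `<= 250` for the fare and, for integer counts, `> 0` and `<= 7` for the passengers.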

The distance traveled and the time the taxi takes to bring the passenger to the destination are among the main drivers of the fare amount, but neither is given in the dataset. Adding new variables through feature engineering can therefore make the relationship between the route and the final fare clearer.

In [None]:
def haversine_dist(long_pickup, long_dropoff, lat_pickup, lat_dropoff):
    
    distance = []
    
    for i in range(len(long_pickup)):
        long1, long2, lat1, lat2 = map(math.radians, 
                                       (long_pickup[i], long_dropoff[i], 
                                        lat_pickup[i], lat_dropoff[i]))
        dlat = (lat2 - lat1)
        dlong = (long2 - long1)    
        a = math.sin(dlat/2)**2 + math.cos(lat1) * math.cos(lat2) * (math.sin(dlong/2)**2)

        distance.append(2 * math.asin(math.sqrt(a)) * 6371)

    return distance
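
The loop above evaluates the haversine formula one row at a time, which can be slow on millions of rows. A vectorized variant with NumPy is sketched below. `haversine_dist_vec` is a hypothetical name introduced here; it keeps the same argument order and the same Earth radius of 6371 km as the function above.

```python
import numpy as np

EARTH_RADIUS_KM = 6371  # same mean Earth radius as in haversine_dist

def haversine_dist_vec(long_pickup, long_dropoff, lat_pickup, lat_dropoff):
    """Haversine distance in km, computed on whole arrays at once."""
    long1, long2, lat1, lat2 = (np.radians(np.asarray(v, dtype=float))
                                for v in (long_pickup, long_dropoff,
                                          lat_pickup, lat_dropoff))
    dlat = lat2 - lat1
    dlong = long2 - long1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlong / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Two sanity checks: identical points, and one degree of latitude along a meridian.
dist = haversine_dist_vec([-73.98, -73.98], [-73.98, -73.98],
                          [40.75, 40.00], [40.75, 41.00])
print(dist)
```

The first distance is zero and the second is one degree of arc on a sphere of radius 6371 km, about 111.19 km, which gives a quick way to check the formula.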

In [None]:
X['dist_km'] = haversine_dist(X['pickup_longitude'].to_numpy(),
                              X['dropoff_longitude'].to_numpy(),
                              X['pickup_latitude'].to_numpy(),
                              X['dropoff_latitude'].to_numpy())
X_test['dist_km'] = haversine_dist(X_test['pickup_longitude'].to_numpy(),
                                   X_test['dropoff_longitude'].to_numpy(),
                                   X_test['pickup_latitude'].to_numpy(),
                                   X_test['dropoff_latitude'].to_numpy())

# Distance outliers are removed from the training data only: every test row
# needs a prediction, so filtering X_test could break the submission file.
X = X[X.dist_km <= 130]

X.head(5)

To understand how the fare amount changes with the time of day and the day of the week, the function *datetime_features* is implemented below.

In [None]:
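def datetime_features(df):
    # Strip the ' UTC' suffix (including its leading space) before parsing
    df['date'] = df['pickup_datetime'].str.replace(' UTC', '', regex = False)
    df['date'] = pd.to_datetime(df['date'], format = '%Y-%m-%d %H:%M:%S')
    df['hour_of_day'] = df.date.dt.hour
    # Series.dt.week was removed in pandas 2.0; isocalendar() gives the ISO week
    df['week'] = df.date.dt.isocalendar().week.astype(int)
    df['day_of_week'] = df.date.dt.weekday
    df['month'] = df.date.dt.month
    df['year'] = df.date.dt.year
    
    df.drop('date', inplace = True, axis = 1)
    
    return df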

In [None]:
X = datetime_features(X)
X_test = datetime_features(X_test)

X.drop(columns = ['key', 'pickup_datetime'], axis = 1, 
       inplace = True, errors = 'ignore')
X_test.drop(columns = ['key', 'pickup_datetime'], axis = 1, 
            inplace = True, errors = 'ignore')

X.head()

In [None]:
rush_time = X.groupby(['hour_of_day']).size()

print('Number of trips by hour of the day:')
print(rush_time)

Two intervals stand out during the day because of the continuous increase in the number of trips: 6-9 am and 4-8 pm. That is the reason for creating the feature *rush_hour*. For the same reason, namely to find new relationships with *fare_amount*, the features *is_weekend* and *airport* were created.

In [None]:
def rush_hour(row):
    if ((row.hour_of_day >= 6 and row.hour_of_day <= 9) or 
    (row.hour_of_day >= 16 and row.hour_of_day <= 20)):
        return 1
    else:
        return 0

In [None]:
X['rush_hour'] = X.apply(lambda row: rush_hour(row), axis = 1)
X_test['rush_hour'] = X_test.apply(lambda row: rush_hour(row), axis = 1)
X.head()

In [None]:
def is_weekend(row):
    if (row.day_of_week == 5 or row.day_of_week == 6):
        return 1
    else:
        return 0

In [None]:
X['is_weekend'] = X.apply(lambda row: is_weekend(row), axis = 1)
X_test['is_weekend'] = X_test.apply(lambda row: is_weekend(row), axis = 1)
X.head()
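
Applying a Python function row by row with `apply(..., axis = 1)` is slow on millions of rows. Both flags can also be computed on whole columns at once. The sketch below uses a small invented frame (`demo`) in place of `X` and is meant to give the same 0/1 values as `rush_hour` and `is_weekend` above.

```python
import pandas as pd

# Invented demo rows standing in for X after datetime_features
demo = pd.DataFrame({'hour_of_day': [5, 6, 9, 12, 16, 20, 21],
                     'day_of_week': [0, 1, 4, 5, 6, 2, 3]})

# between() is inclusive on both ends, matching the >= / <= tests in rush_hour
demo['rush_hour'] = (demo.hour_of_day.between(6, 9)
                     | demo.hour_of_day.between(16, 20)).astype(int)

# weekday 5 is Saturday and 6 is Sunday, as in is_weekend
demo['is_weekend'] = demo.day_of_week.isin([5, 6]).astype(int)
print(demo)
```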

In [None]:
def define_airport(row):
    if ((row.pickup_latitude >= 40.63 and row.pickup_latitude <= 40.68)
       and (row.pickup_longitude >= -73.79 and row.pickup_longitude <= -73.75) 
       or (row.dropoff_latitude >= 40.63 and row.dropoff_latitude <= 40.68)
       and (row.dropoff_longitude >= -73.79 and row.dropoff_longitude <= -73.75)):
        return 1 #JFK airport
    
    elif ((row.pickup_latitude >= 40.76 and row.pickup_latitude <= 40.79)
       and (row.pickup_longitude >= -73.89 and row.pickup_longitude <= -73.85)
         or (row.dropoff_latitude >= 40.76 and row.dropoff_latitude <= 40.79)
       and (row.dropoff_longitude >= -73.89 and row.dropoff_longitude <= -73.85)):
        return 2 #LGA airport
    else:
        return 0 #None

In [None]:
X['airport'] = X.apply(lambda row: define_airport(row), axis = 1)
X_test['airport'] = X_test.apply(lambda row: define_airport(row), axis = 1)
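
The airport flag can be vectorized in the same spirit. The sketch below reuses the bounding boxes from `define_airport`; the helper names `in_box` and `airport_code` and the demo coordinates are assumptions made for illustration. `np.select` returns the value of the first condition that holds, which reproduces the `if`/`elif` order (JFK is checked before LGA).

```python
import numpy as np
import pandas as pd

# (lat_min, lat_max, lon_min, lon_max), copied from define_airport
JFK = (40.63, 40.68, -73.79, -73.75)
LGA = (40.76, 40.79, -73.89, -73.85)

def in_box(lat, lon, lat_min, lat_max, lon_min, lon_max):
    """True where the point lies inside the bounding box (bounds included)."""
    return lat.between(lat_min, lat_max) & lon.between(lon_min, lon_max)

def airport_code(df):
    """1 for JFK, 2 for LGA, 0 otherwise, for pickup or dropoff in the box."""
    jfk = (in_box(df.pickup_latitude, df.pickup_longitude, *JFK)
           | in_box(df.dropoff_latitude, df.dropoff_longitude, *JFK))
    lga = (in_box(df.pickup_latitude, df.pickup_longitude, *LGA)
           | in_box(df.dropoff_latitude, df.dropoff_longitude, *LGA))
    return np.select([jfk, lga], [1, 2], default=0)

# Invented trips: JFK pickup, LGA dropoff, no airport, JFK pickup with LGA dropoff
demo = pd.DataFrame({
    'pickup_latitude':   [40.64, 40.75, 40.75, 40.64],
    'pickup_longitude':  [-73.78, -73.99, -73.99, -73.78],
    'dropoff_latitude':  [40.75, 40.77, 40.74, 40.77],
    'dropoff_longitude': [-73.99, -73.87, -73.98, -73.87],
})
demo['airport'] = airport_code(demo)
print(demo['airport'].tolist())
```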

# Exploratory Data Analysis (EDA)
Overview of the variables after pre-processing and feature engineering, and of their relationship with the fare amount.

In [None]:
plt.figure(figsize=(16,8))
sns.set_style("whitegrid")
plt.title('Distance distribution (in km)')
plt.xlabel('Distance (km)')
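# Distances were already limited to 130 km above
plt.xticks(range(0, 135, 5))

# 'shade' is deprecated in recent seaborn releases in favor of 'fill'
sns.kdeplot(X[X.dist_km < 200].dist_km, fill = True)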

As can be seen above, most trips are between approximately 5 km and 25 km. Trips with a very long total distance (in km) are probably outliers and were removed to make the *dist_km* feature more correlated with the target variable. The remaining distances are quite reasonable, since the New York City area extends over a radius of around 130 km from the centre.

In [None]:
X.pivot_table('fare_amount', index='hour_of_day', columns='year').plot(figsize = (14,6))
plt.ylabel('Fare amount $USD');

The plot above shows that the mean taxi fare grew over the years. A possible explanation is the increase in the number of vehicles on the streets, which lengthens the time a taxi spends on the road during each trip.

In [None]:
hour_of_day = X.groupby('hour_of_day').fare_amount.agg(['mean'])
day_of_week = X.groupby('day_of_week').fare_amount.agg(['mean'])
month = X.groupby('month').fare_amount.agg(['mean'])


fig, (ax1,ax2, ax3) = plt.subplots(figsize = [14,12], nrows = 3, ncols = 1)

ax1.plot(hour_of_day, 'b')
ax2.plot(day_of_week,'g')
ax3.plot(month, 'r')

ax1.set_title('Fare by hour', fontsize = 18)
ax2.set_title('Fare by day', fontsize = 18)
ax3.set_title('Fare by month', fontsize = 18)

# plt.style.use('seaborn')
sns.set_style("white")
# plt.grid()
plt.show()


The fare amount is highest early in the day, perhaps because of traffic and people travelling to their jobs, classes or chores. During the week the fare amount stays nearly constant, but from Saturday to Sunday there is a notable increase. A possible explanation is that people usually go out alone or with family and friends to visit tourist attractions and other places, so demand is high, traffic may increase, and trips go to specific destinations.

In [None]:
plt.figure(figsize = (12,6))
plt.title('Number of taxi trips per day of the week (weekend or not)')
sns.set_style("white")
sns.countplot(x = 'day_of_week', hue = 'is_weekend', data = X)

In [None]:
fig, (ax1, ax2) = plt.subplots(figsize = [14,10], nrows = 2, ncols = 1)


sns.set_style("white")
sns.countplot(x = 'hour_of_day', data = X[X.day_of_week <= 4], ax = ax1, palette = 'viridis')
ax1.set_title('Number of taxi trips by the hour during the week')

sns.set_style("white")
sns.countplot(x = 'hour_of_day', data = X[X.day_of_week >= 5], ax = ax2, palette = 'viridis')
ax2.set_title('Number of taxi trips by the hour during the weekend')

There is a slight difference in the number of taxi trips between weekdays and the weekend. People tend to go out to parties and events during the weekend, which explains the high number of taxi trips in the early morning hours. The opposite happens on weekdays, when the number of trips is low in the early morning hours and higher at conventional hours.

In [None]:
X[X.fare_amount < 150].fare_amount.hist(bins = 100, figsize=(14,5))
plt.xlabel('Fare amount $USD')
plt.title('Histogram')

In [None]:
plt.figure(figsize = (14,10))

# zoom in on part of data
idx = (X.dist_km <= 30) & (X.fare_amount <= 150)
plt.scatter(X[idx].dist_km, X[idx].fare_amount, alpha = 0.2)
plt.xlabel('distance km')
plt.ylabel('fare amount $USD')
plt.title('Zoom in on distance <= 30 km, fare amount <= $150');

A few fare values lie above the bulk of the distribution (roughly USD 0 to 30) and repeat exactly, so they are likely fixed fares for trips to selected places. The horizontal lines in the scatter plot may likewise indicate fixed-fare trips to or from an airport, or to another place with a flat taxi fare.
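
One way to probe the fixed-fare hypothesis is to check whether a single exact fare value dominates among airport trips. The sketch below only illustrates the idea: the frame `demo` and all of its values are invented, and on the real data the same two lines would be run on `X`.

```python
import pandas as pd

# Invented demo data: airport == 1 marks JFK trips, as in define_airport
demo = pd.DataFrame({
    'airport': [1, 1, 1, 1, 0, 0, 0, 2],
    'fare_amount': [50.0, 50.0, 50.0, 41.3, 8.5, 12.1, 6.9, 31.0],
})

# Share of each exact fare value among JFK trips, most frequent first
airport_fares = demo.loc[demo.airport == 1, 'fare_amount']
top_share = airport_fares.value_counts(normalize=True).head(3)
print(top_share)
```

A single value with a large share would support the reading of the horizontal lines as flat fares, while a spread of values would point to metered trips.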

In [None]:
X_correlation2 = X.corr()
plt.figure(figsize = (18,18))
sns.heatmap(X_correlation2, annot = True, cmap = 'viridis')
plt.title('Correlation among features after feature engineering')
plt.show()

After the pre-processing and feature engineering work, several features are now correlated with *fare_amount*, which should lead the model to better performance and lower error. In addition, some features that are not strongly correlated with *fare_amount* are now correlated with features that are.

# Model

### LightGBM Regressor

LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It grows trees leaf-wise rather than level-wise. For a small number of nodes, leaf-wise growth will probably outperform level-wise growth, and another great advantage is its high speed.

Initially, the parameter dictionary was defined with the suggested values. An initial study was then carried out to find the number of estimators that gave the best evaluation metric.

In [None]:
y = X['fare_amount']
X.drop(columns = ['fare_amount'], axis = 1, 
       inplace = True, errors = 'ignore')

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size = 0.8, 
                                                           test_size = 0.2, random_state = 0)

In [None]:
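params = {
        'boosting_type': 'gbdt',
        'objective': 'regression',
        'num_leaves': 38,
        'learning_rate': 0.08,
        'max_depth': -1,
        # 'subsample' is an alias of 'bagging_fraction', so only one is kept.
        # With bagging_fraction = 1 no row subsampling takes place.
        'bagging_fraction': 1,
        'max_bin': 5000,
        'bagging_freq': 20,
        'colsample_bytree': 0.6,
        'metric': 'rmse',
        'min_split_gain': 0.5,
        'min_child_weight': 1,
        'min_child_samples': 10,
        # 'scale_pos_weight' was dropped: it only applies to classification.
        'zero_as_missing': True,
        'seed': 0
    }


model_train = lgbm.Dataset(X_train, y_train)
model_test = lgbm.Dataset(X_valid, y_valid, reference = model_train)

# 'num_rounds' in params is an alias that overrides num_boost_round, so the
# number of rounds is set once, here. Early stopping and logging are passed
# as callbacks, the interface required by LightGBM >= 4.0.
model = lgbm.train(params = params, train_set = model_train, num_boost_round = 500,
                   valid_sets = [model_test],
                   callbacks = [lgbm.early_stopping(stopping_rounds = 250),
                                lgbm.log_evaluation(period = 100)])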

In [None]:
prediction = model.predict(X_valid)
rmse = math.sqrt(mean_squared_error(y_valid, prediction))

print(rmse)
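
The metric printed above is the root mean squared error, i.e. the square root of the mean of the squared differences between true and predicted fares. A useful reference point is the RMSE of a constant prediction equal to the mean fare, which is what the sample submission does. The sketch below uses three invented fares. The function name `rmse` and the arrays are illustrative and not part of the original notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Invented values, for illustration only
y_true = np.array([5.0, 10.0, 15.0])
y_model = np.array([6.0, 9.0, 17.0])
y_baseline = np.full_like(y_true, y_true.mean())  # always predict the mean fare

model_rmse = rmse(y_true, y_model)        # sqrt((1 + 1 + 4) / 3) = sqrt(2)
baseline_rmse = rmse(y_true, y_baseline)  # sqrt((25 + 0 + 25) / 3)
print(model_rmse, baseline_rmse)
```

A model is only useful if its validation RMSE is clearly below this mean-fare baseline.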

In [None]:
lgbm.plot_importance(model)
plt.show()

In [None]:
y_prediction = model.predict(X_test)

submission = pd.read_csv('../input/new-york-city-taxi-fare-prediction/sample_submission.csv')
submission['fare_amount'] = y_prediction

submission.to_csv('submission_taxifare_lgbm.csv', index = False)
submission.head(5)

# Conclusions

**Based on previous activities and analyses, the LGBM model was chosen because it performed better on the evaluation metrics for tabular data and is faster than other boosting techniques, such as XGBoost. The LGBM model achieved an RMSE of 3.61, and the submission for the test dataset reached XXX place on the leaderboard with a score of 3.12518. The competition is closed, but this score would now correspond to position 372 on the leaderboard.**

As seen in the EDA, some features that are not strongly correlated with the target were still used to create new features in order to improve the model performance. Even so, some variables did not add any relevant information, even after processing.

A better pre-processing technique might improve the result. For example, instead of the haversine distance, the real distance along the streets between the two geolocation points could be computed. Knowing how long each taxi trip lasted, had this information been given in the dataset, or how traffic at a specific hour affected the trip time and distance, would also help. *dist_km* stands out as the main feature.

After the feature engineering and processing stages, some variables became more strongly correlated with the target, and the model performance improved accordingly.