# <center> New York Taxi Fares Prediction </center>
## <center> Supervised Learning with XGBoost </center>

![NY taxis](https://assets.bwbx.io/images/users/iqjWHBFdfxIU/iaFilM3php7g/v1/-1x-1.jpg)

## Introduction

Hi Kagglers,

Here I come with another notebook, this time trying to predict New York taxi fares from the pickup and dropoff locations and a few other features. We will stick with the XGBoost algorithm, which gives very good predictions for this data once we have explored the dataset and done some interesting feature engineering. Check it out!

Taxicabs are, and always will be, an iconic part of New York. The history of their characteristic yellow color goes as follows: owners of cab companies painted their fleets in a distinct signature color, resulting in cabs that ranged from brown, white, and red to checkered ones. And some were yellow. After a few years, two big cab companies decided that yellow was the way to go, with both ultimately contributing to the tradition of yellow cabs in New York City. These companies were the Yellow Cab Company, started by John Hertz in Chicago in 1910, and the Yellow Taxicab Company, which was incorporated in New York by Albert Rockwell in 1912. More information on this can be found [here](https://untappedcities.com/2017/07/12/nyc-fun-facts-why-are-most-nyc-taxi-cabs-yellow/).

My notebook has been inspired by others on the same topic, which I list below. My thanks to their authors; I recommend taking a look at them:

- Ravi tanwar's Data Cleaning + Eda + Modelling: https://www.kaggle.com/ravijoe/data-cleaning-eda-modelling

- Jesús Ros' XGBoost'ing Taxi Fares: https://www.kaggle.com/gunbl4d3/xgboost-ing-taxi-fares

- Vinod R's EDA + XGBoost For Predicting Fare Amount: https://www.kaggle.com/vinodsunny1/eda-xgboost-for-predicting-fare-amount

- AlexS2020's Rapdis and XG Boost running on GPU: https://www.kaggle.com/alexs2020/rapdis-and-xg-boost-running-on-gpu

- Nicapotato's Taxi Rides Time Analysis and OOF LGBM: https://www.kaggle.com/nicapotato/taxi-rides-time-analysis-and-oof-lgbm

Remember to upvote if you really enjoyed it. And feel free to comment, suggest or even complain in the comment section. Check my other notebooks, which are also great.

Cheers!

## Index

[The data](#section0)

1. [Load the libraries](#section1)
2. [Load the dataset](#section2)
3. [Basic exploration](#section3)
4. [Dataset Cleaning](#section4)
5. [Feature engineering](#section5)
6. [Further EDA](#section6)
7. [Model training](#section7)
8. [Predictions](#section8)

## <a id="section0">The data</a>

The data comes from a Kaggle Playground Prediction Competition and can be found [here](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/data).

The main signature of this dataset is its massive size: the train set has 55 million observations. It is therefore vital to handle this in some way, so that we can build a decent model without turning the computer into a toaster. In particular, we will work with a subset of 7 million rows, which is more than enough data to get good performance.

Apart from this, the data is very clean and simple in its dimensions, with only 6 features, an id column (`key`), and the dependent variable, the fare amount (`fare_amount`). These are the features:

    pickup_datetime - timestamp value indicating when the taxi ride started. 
    pickup_longitude - float for longitude coordinate of where the taxi ride started.
    pickup_latitude - float for latitude coordinate of where the taxi ride started.
    dropoff_longitude - float for longitude coordinate of where the taxi ride ended.
    dropoff_latitude - float for latitude coordinate of where the taxi ride ended.
    passenger_count - integer indicating the number of passengers in the taxi ride. 

### <a id="section1">1. Load the libraries</a>

In [None]:
# For processing the data
import numpy as np
import pandas as pd
import datetime as dt

# Visualization tools
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style("white") # set style for seaborn plots

# Machine learning
from sklearn.model_selection import train_test_split
import xgboost as xgb

# Ignore warnings
import warnings 
warnings.filterwarnings('ignore')

# Time-related functions
import time

### <a id="section2">2. Load the dataset</a>

We will load our train dataset from the downloaded csv file. Since, as stated, this dataset has 55 million rows, we will set the `nrows` parameter to 7M to prevent memory issues and speed everything up. Note that `nrows` reads the first 7M rows of the file; assuming there is no particular order among the observations, there is no need to randomize this selection.

Feel free to adapt `nrows` to your computer's capabilities or to the model performance you want to attain. If you only want to read and learn something from the notebook, you can lower it to 1M, for example.

In [None]:
data = pd.read_csv("../input/new-york-city-taxi-fare-prediction/train.csv", nrows = 7000000)
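If memory is still tight, one option is to tell `read_csv` which dtypes to use. The sketch below runs on a tiny in-memory CSV instead of the real file. The column names match the competition file, but the choice of `float32` and `uint8` is my assumption: it roughly halves the memory of the numeric columns at the cost of a little coordinate precision.

```python
import io
import pandas as pd

# Tiny stand-in for train.csv (illustrative values only)
csv_text = """key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
a,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.84161,40.712278,1
b,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1
"""

# Assumed dtypes: 32-bit floats and an 8-bit passenger count
dtypes = {
    "fare_amount": "float32",
    "pickup_longitude": "float32",
    "pickup_latitude": "float32",
    "dropoff_longitude": "float32",
    "dropoff_latitude": "float32",
    "passenger_count": "uint8",
}

default = pd.read_csv(io.StringIO(csv_text), nrows=2)
sample = pd.read_csv(io.StringIO(csv_text), dtype=dtypes, nrows=2)

print(sample.dtypes)
print("Bytes with default dtypes:", default.memory_usage().sum())
print("Bytes with smaller dtypes:", sample.memory_usage().sum())
```

The same `dtype` dictionary can be passed to the real `pd.read_csv` call together with `nrows`.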

### <a id="section3">3. Basic exploration</a>

Now we will explore the loaded data and see whether we find problems that require some sort of fixing, such as missing values, imbalanced data, inconsistent observations, outliers, etc.

In [None]:
print("Dimensions of our training set: ", data.shape)
data.dtypes

Here are the first 5 rows of our data. They let us become familiar with the features and their values, and give a solid picture of what our set looks like.

In [None]:
data.head()

Here we have the distribution of our dependent variable:

In [None]:
f, ax = plt.subplots(1, 1, figsize=(8,5))
sns.distplot(data["fare_amount"], kde=True, color="#fdb813")
plt.xlim(0, 700)
plt.ylim(0, 0.08)
plt.title("Distribution of the target: fare_amount")
plt.xlabel("Fare amount")
plt.ylabel("Density")
plt.show()

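We won't remove anything for now, but it is useful to know whether we have outliers in our target. A common rule of thumb flags values more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile.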
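As a minimal sketch of the IQR rule, here is a small helper applied to a handful of made-up fares. The name `iqr_bounds` and the sample values are hypothetical and only serve to show how the bounds flag extreme values.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) bounds of the IQR rule for a numeric Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up fares: one negative value and one very large value
fares = pd.Series([-5.0, 6.0, 8.0, 9.0, 10.0, 12.0, 14.0, 250.0])

lower, upper = iqr_bounds(fares)
outliers = fares[(fares < lower) | (fares > upper)]

print("Bounds:", lower, upper)
print("Flagged values:", outliers.tolist())
```

On the real data the rule is only a guide: a long airport ride can be expensive and still perfectly valid.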

In [None]:
q1  = data['fare_amount'].quantile(0.25)
q3  = data['fare_amount'].quantile(0.75)
iqr =  q3 - q1
print("Fare Amount lower bound : ", q1 - (1.5 * iqr), 
      "Fare Amount upper bound : ", q3 + (1.5 * iqr))

In [None]:
print("Total null values:\n", data.isnull().sum())
print("Percentage of null values:\n",
      100 * data[["dropoff_longitude", "dropoff_latitude"]].isnull().sum() / data.shape[0])

We have very few rows with null values, only 47 of the 7M, which is less than 0.0007%. With this in mind, it is safe to just remove them: we will not lose any valuable information.

In [None]:
data.dropna(how='any', axis='rows', inplace=True)

Having solved the missing values issue, let us look at how the coordinates are distributed in a scatter plot. Given that the taxi rides take place in New York, they should be clustered around specific longitude and latitude values. We will only plot the first 10,000 rows, which should be sufficient to show what we want.

In [None]:
f, ax = plt.subplots(1, 2, figsize=(16, 5))
sns.scatterplot(x="pickup_longitude", y="pickup_latitude", data=data.iloc[:10000], 
                color="#fdb813", ax=ax[0])
sns.scatterplot(x="dropoff_longitude", y="dropoff_latitude", data=data.iloc[:10000], 
                color="#fdb813", ax=ax[1])
ax[0].set_title("Pickup Coordinates")
ax[1].set_title("Dropoff Coordinates")
plt.show()

The latitude and longitude of New York City are around 40.730610 and -73.935242, respectively, but some values in our set are far from them. It may be true that some taxis have gone outside the city to drop off commuters, but it is implausible that they went that far and that many times. What should we believe? That the observations with 0 latitude went to pick up or drop off passengers at the Earth's Equator?

We will remove the points that are not near these coordinates in the cleaning section.

In [None]:
data.describe()

With the descriptive statistics of the data, we find some interesting yet problematic insights: 

- Some `fare_amount` values are negative. Either the taxi driver is a modern Robin Hood, or they are errors that should be removed.

- The `passenger_count` also has unrealistic values, ranging from 0 to 200. Maybe I am missing something, but here in the European Union, which I know better, the maximum is 7 passengers for the biggest cars: two rows of three back seats plus the front seat next to the driver.

So we will deal with these problems in the next section.

### <a id="section4">4. Dataset Cleaning</a>

In [None]:
def get_cleaned(df):
    return df[(df.fare_amount > 0) &
              (df.pickup_latitude > 35) & (df.pickup_latitude < 45) &
              (df.pickup_longitude > -80) & (df.pickup_longitude < -68) &
              (df.dropoff_latitude > 35) & (df.dropoff_latitude < 45) &
              (df.dropoff_longitude > -80) & (df.dropoff_longitude < -68) &
              (df.passenger_count > 0) & (df.passenger_count < 8)]

data = get_cleaned(data)
print(len(data))
print("Data lost after the cleaning process: ", 7000000 - len(data))

We have lost 169,718 observations due to incongruous data, a fairly high number but with minimal impact considering that 6,830,282 rows remain.
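To see what each condition removes, here is the same kind of filter applied to four made-up rides. The helper name `in_box` and the toy values are hypothetical; the bounds are the ones used in the cleaning step.

```python
import pandas as pd

toy = pd.DataFrame({
    "fare_amount":       [8.5, -3.0, 12.0, 7.0],
    "pickup_latitude":   [40.75, 40.70, 0.0, 40.76],
    "pickup_longitude":  [-73.99, -73.98, 0.0, -73.97],
    "dropoff_latitude":  [40.76, 40.71, 40.74, 40.77],
    "dropoff_longitude": [-73.98, -73.97, -73.99, -73.95],
    "passenger_count":   [1, 2, 1, 0],
})

def in_box(df, prefix):
    """True where the pickup or dropoff point lies inside the bounding box."""
    lat = df[prefix + "_latitude"]
    lon = df[prefix + "_longitude"]
    return (lat > 35) & (lat < 45) & (lon > -80) & (lon < -68)

mask = (
    (toy.fare_amount > 0)                   # drops the negative fare
    & in_box(toy, "pickup")                 # drops the (0, 0) pickup
    & in_box(toy, "dropoff")
    & toy.passenger_count.between(1, 7)     # drops the ride with 0 passengers
)
kept = toy[mask]
print(kept)
```

Only the first ride survives: the second has a negative fare, the third starts at coordinates (0, 0), and the fourth carries no passengers.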

### <a id="section5">5. Feature engineering</a>

Now that we have cleaned our data, we will continue by adding some interesting features. The ideas for these new variables have been drawn from the notebooks cited above.

#### 5.1 Distance measurement from pickup to dropoff

The haversine formula determines the great-circle distance between two points on a sphere given their longitudes and latitudes. Let us calculate it for each observation; in other words, let us calculate the distance along the great circle between the pickup and dropoff coordinates of each individual ride. This is the straight-line distance over the Earth's surface, so it is a lower bound on the distance actually driven.

In [None]:
def sphere_dist(pick_lat, pick_lon, drop_lat, drop_lon):
    R_earth = 6371 # Earth radius (in km)
    # Convert degrees to radians
    pick_lat, pick_lon, drop_lat, drop_lon = map(np.radians, [pick_lat, pick_lon,
                                                              drop_lat, drop_lon])
    # Compute distances along lat, lon dimensions
    dlat = drop_lat - pick_lat
    dlon = drop_lon - pick_lon
    
    # Compute haversine distance
    a = np.sin(dlat/2.0)**2 + np.cos(pick_lat) * np.cos(drop_lat) * np.sin(dlon/2.0)**2
    return 2 * R_earth * np.arcsin(np.sqrt(a))
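A quick sanity check helps to trust the function. Along a meridian the great-circle distance is simply the Earth radius times the latitude difference in radians, so one degree of latitude should give 6371 * pi / 180 km. The sketch repeats `sphere_dist` so that it runs on its own; the test coordinates are arbitrary.

```python
import numpy as np

def sphere_dist(pick_lat, pick_lon, drop_lat, drop_lon):
    R_earth = 6371  # Earth radius (in km)
    # Convert degrees to radians
    pick_lat, pick_lon, drop_lat, drop_lon = map(
        np.radians, [pick_lat, pick_lon, drop_lat, drop_lon])
    dlat = drop_lat - pick_lat
    dlon = drop_lon - pick_lon
    # Haversine formula
    a = np.sin(dlat / 2.0)**2 + np.cos(pick_lat) * np.cos(drop_lat) * np.sin(dlon / 2.0)**2
    return 2 * R_earth * np.arcsin(np.sqrt(a))

# One degree of latitude at constant longitude
one_degree = sphere_dist(40.0, -74.0, 41.0, -74.0)
expected = 6371 * np.pi / 180

# Identical pickup and dropoff points
same_point = sphere_dist(40.73, -73.93, 40.73, -73.93)

print("One degree of latitude (km):", round(one_degree, 2))
print("R * pi / 180 (km):          ", round(expected, 2))
print("Distance to itself (km):    ", same_point)
```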

#### 5.2 Distance to the city airports

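Some trips to or from the New York airports are priced differently from regular metered rides (for example, there is a flat fare between JFK and Manhattan), so it is worth giving the model the distance from each ride to the airports.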

In [None]:
def airport_dist(df):
    """
    JFK: John F. Kennedy International Airport
    EWR: Newark Liberty International Airport
    LGA: LaGuardia Airport
    """
    jfk_coord = (40.639722, -73.778889)
    ewr_coord = (40.6925, -74.168611)
    lga_coord = (40.77725, -73.872611)
    
    pick_lat = df['pickup_latitude']
    pick_lon = df['pickup_longitude']
    drop_lat = df['dropoff_latitude']
    drop_lon = df['dropoff_longitude']
    
    pickup_jfk = sphere_dist(pick_lat, pick_lon, jfk_coord[0], jfk_coord[1])
    dropoff_jfk = sphere_dist(jfk_coord[0], jfk_coord[1], drop_lat, drop_lon) 
    pickup_ewr = sphere_dist(pick_lat, pick_lon, ewr_coord[0], ewr_coord[1])
    dropoff_ewr = sphere_dist(ewr_coord[0], ewr_coord[1], drop_lat, drop_lon) 
    pickup_lga = sphere_dist(pick_lat, pick_lon, lga_coord[0], lga_coord[1]) 
    dropoff_lga = sphere_dist(lga_coord[0], lga_coord[1], drop_lat, drop_lon)
    
    df['jfk_dist'] = pd.concat([pickup_jfk, dropoff_jfk], axis=1).min(axis=1)
    df['ewr_dist'] = pd.concat([pickup_ewr, dropoff_ewr], axis=1).min(axis=1)
    df['lga_dist'] = pd.concat([pickup_lga, dropoff_lga], axis=1).min(axis=1)
    
    return df
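The `pd.concat(...).min(axis=1)` step keeps, for every ride, the smaller of the pickup and dropoff distances to an airport. A possible alternative, sketched below on two made-up rides, is `np.minimum`, which compares the two Series element-wise without building an intermediate DataFrame. The ride coordinates are hypothetical and `sphere_dist` is repeated so that the example runs on its own.

```python
import numpy as np
import pandas as pd

def sphere_dist(pick_lat, pick_lon, drop_lat, drop_lon):
    R_earth = 6371  # Earth radius (in km)
    pick_lat, pick_lon, drop_lat, drop_lon = map(
        np.radians, [pick_lat, pick_lon, drop_lat, drop_lon])
    dlat = drop_lat - pick_lat
    dlon = drop_lon - pick_lon
    a = np.sin(dlat / 2.0)**2 + np.cos(pick_lat) * np.cos(drop_lat) * np.sin(dlon / 2.0)**2
    return 2 * R_earth * np.arcsin(np.sqrt(a))

jfk_coord = (40.639722, -73.778889)

# Ride 0 starts next to JFK; ride 1 stays within Manhattan (made-up points)
rides = pd.DataFrame({
    "pickup_latitude":   [40.6413, 40.7580],
    "pickup_longitude":  [-73.7781, -73.9855],
    "dropoff_latitude":  [40.7580, 40.7484],
    "dropoff_longitude": [-73.9855, -73.9857],
})

pickup_jfk = sphere_dist(rides["pickup_latitude"], rides["pickup_longitude"],
                         jfk_coord[0], jfk_coord[1])
dropoff_jfk = sphere_dist(jfk_coord[0], jfk_coord[1],
                          rides["dropoff_latitude"], rides["dropoff_longitude"])

via_concat = pd.concat([pickup_jfk, dropoff_jfk], axis=1).min(axis=1)
via_minimum = np.minimum(pickup_jfk, dropoff_jfk)

print(via_minimum.round(2))
```

Both approaches give the same values; the first ride gets a `jfk_dist` of well under a kilometre, the second one of roughly 20 km.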

#### 5.3 Information from datetime (day of the week, month, hour, day). 

In most cities, taxi fares depend on the time of day (day/night) and on the type of day (weekdays/holidays), so we extract these parts from the pickup timestamp.

In [None]:
def datetime_info(df):
    #Convert to datetime format
    df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'],format="%Y-%m-%d %H:%M:%S UTC")
    
    df['hour'] = df.pickup_datetime.dt.hour
    df['day'] = df.pickup_datetime.dt.day
    df['month'] = df.pickup_datetime.dt.month
    df['weekday'] = df.pickup_datetime.dt.weekday
    df['year'] = df.pickup_datetime.dt.year
    
    return df


data = datetime_info(data)
data = airport_dist(data)
data['distance'] = sphere_dist(data['pickup_latitude'], data['pickup_longitude'], 
                               data['dropoff_latitude'], data['dropoff_longitude'])

data.head()
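One detail to keep in mind: the timestamps in the file are labelled UTC, so the `hour` feature above is the UTC hour, not the hour on a clock in New York. If local time is preferred, one option (an assumption on my side, not something this notebook does) is to convert before extracting the parts. `local_hour` is a hypothetical helper name.

```python
import pandas as pd

def local_hour(timestamps, tz="America/New_York"):
    """Parse 'YYYY-mm-dd HH:MM:SS UTC' strings and return the hour in local time."""
    parsed = pd.to_datetime(timestamps, format="%Y-%m-%d %H:%M:%S UTC")
    utc = parsed.dt.tz_localize("UTC")       # mark the naive timestamps as UTC
    return utc.dt.tz_convert(tz).dt.hour     # shift to local time, then take the hour

# 05:00 UTC in winter and in summer
stamps = pd.Series(["2012-01-01 05:00:00 UTC", "2012-07-01 05:00:00 UTC"])
hours = local_hour(stamps)
print(hours.tolist())
```

The two results differ by one hour because of daylight saving time, something a fixed offset would miss.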

### <a id="section6">6. Further EDA</a>

Now that we have our brand new features let us continue with further exploration of the set.

In [None]:
plt.figure(figsize=(10,5))
sns.lineplot(x="year", y="fare_amount", data=data, color="#fdb813")
plt.title("Mean fare by year")
plt.show()

Fares have steadily increased over the years. This is important information that our model should take into account.

In [None]:
f, ax = plt.subplots(1, 2, figsize=(12,5))
ax[0].hist(data["passenger_count"], bins=7, color=("#fdb813"))
ax[0].set_title("Number of passengers frequency")
ax[0].set_xlabel('No. of Passengers')
ax[0].set_ylabel('Frequency')

ax[1].scatter(x=data["passenger_count"], y=data["fare_amount"], s=1.5, 
              color=("#3D2C05"))
ax[1].set_title("Fare amount by number of passengers")
ax[1].set_xlabel('No. of Passengers')
ax[1].set_ylabel('Fare');

From the above graphs, we can see that single passengers are by far the most frequent, and the highest fare also seems to come from cabs which carry just one commuter.

In [None]:
f, ax = plt.subplots(1, 3, figsize=(16,5))
ax[0].hist(data["hour"], bins=24, color="#fdb813")
ax[0].set_title("Frequency of rides by Hour of the day")
ax[0].set_xlabel('Hour of the day')
ax[0].set_ylabel('Frequency')

ax[1].scatter(x=data["hour"], y=data["fare_amount"], s=1.5, c="#3D2C05")
ax[1].set_title("Fares by Hour of the day")
ax[1].set_xlabel('Hour of the day')
ax[1].set_ylabel('Fare')

sns.barplot(x="hour", y="fare_amount", data=data, ax=ax[2], color="#fdb813")
ax[2].set_title("Mean Fares by Hour of the day")
ax[2].set_xlabel('Hour of the day')
ax[2].set_ylabel('Mean fare')
plt.show()

If we focus on the time of day, the mean fares seem to be higher between 22h and 5h, and between 14h and 16h.
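The bar heights in the third panel are per-hour means. To read exact numbers instead of estimating them from the plot, a `groupby` gives the same aggregation as a table. The sketch uses a few made-up rows; on the real data the same call would be applied to `data`.

```python
import pandas as pd

# Made-up rides: two per hour for three different hours
toy = pd.DataFrame({
    "hour":        [0, 0, 9, 9, 15, 15],
    "fare_amount": [14.0, 12.0, 8.0, 10.0, 13.0, 11.0],
})

# Mean fare and number of rides for each hour
mean_by_hour = toy.groupby("hour")["fare_amount"].agg(["mean", "count"])
print(mean_by_hour)
```

Keeping the `count` column next to the mean is handy, since an hour with few rides can show a high mean just by chance.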

In [None]:
plt.figure(figsize=(15,7))
sns.barplot(x='weekday', y="fare_amount", data=data, palette=("#fdb813", "#3D2C05"))
plt.ylim(0, 14)
plt.title("Mean Fares among Days of the week")
plt.xlabel('Day of Week')
plt.ylabel('Mean fare')
plt.show()

The highest fares seem to be on Sundays, but the differences are minimal, and I would say that they are not even significant.

Does the distance affect the fare? This is a no-brainer. I am confident enough to bet good money that the distance would affect the fare by a great deal. Let us visualize this phenomenon:

In [None]:
f, ax = plt.subplots(1, 2, figsize=(16, 5))
sns.regplot(x="distance", y="fare_amount", data=data, color="#fdb813", ax=ax[0])
sns.regplot(x="distance", y="fare_amount", data=data, color="#fdb813", ax=ax[1])
ax[1].set_xlim(0, 1000)
ax[1].set_ylim(0, 300)
f.suptitle("Positive relation between distance and fare")
plt.show()

In [None]:
f = plt.figure(figsize=(14, 8))
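# numeric_only skips the non-numeric key and pickup_datetime columns (pandas >= 1.5)
sns.heatmap(data.corr(numeric_only=True), annot=True, linewidths=0.2, cmap="viridis")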
plt.title("Correlation Heatmap")
plt.show()

As suspected, the distance feature is highly correlated with `fare_amount`. It will be crucial for the prediction.
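With many features the heatmap gets crowded, so it can help to pull out only the column for the target and rank it. The sketch below does this on synthetic data; the fare formula is invented purely to generate numbers and says nothing about real prices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
distance = rng.uniform(0.5, 20.0, size=500)

# Synthetic rides: the fare depends on distance plus noise, not on passengers
toy = pd.DataFrame({
    "distance": distance,
    "passenger_count": rng.integers(1, 7, size=500),
    "fare_amount": 2.5 + 1.6 * distance + rng.normal(0.0, 1.0, size=500),
})

# Correlation of every feature with the target, ranked by absolute value
target_corr = toy.corr()["fare_amount"].drop("fare_amount")
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)
```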

In [None]:
dropoff_longitude = data['dropoff_longitude'].to_numpy()
dropoff_latitude = data['dropoff_latitude'].to_numpy()

plt.figure(figsize=(12,8))
plt.scatter(dropoff_longitude, dropoff_latitude,
                color="#fdb813", 
                s=.02, alpha=.2)
plt.title("Dropoffs through the city")
# Borders of the city
plt.xlim(-74.03, -73.75)
plt.ylim(40.63, 40.85)
plt.show()

Now we need to drop the columns that we will not use to train our model:
- `key`: Identifier with no information about the fare. Its function was merely for identification purposes.
- `pickup_datetime`: We split this variable into multiple ones. Once that is done, keeping it in the set would be detrimental, since we would be counting date-related information twice.

In [None]:
data.drop(columns=["key", "pickup_datetime"], inplace=True)
data.head()

### <a id="section7">7. Model training</a>

Now that we have the dataframe that we wanted, we can start to train the XGBoost model. First, we will split the dataset into train (95%) and test (5%). With this amount of data, 5% should be enough to test performance.

In [None]:
y = data["fare_amount"]
train = data.drop(columns=["fare_amount"])

x_train, x_test, y_train, y_test = train_test_split(train, y, random_state=2666, test_size=0.05)

Through tuning with CV we found the optimal parameters, which we record in the following dictionary to train the model:

In [None]:
params = {
    "max_depth": 7,
    "subsample": 0.9,
    "eta": 0.03,
    "colsample_bytree": 0.9,
    "random_state": 2666,
    "objective": "reg:squarederror",  # current name of the deprecated "reg:linear"
    "eval_metric": "rmse",
    "verbosity": 0  # replaces the deprecated "silent" flag
}
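The `eval_metric` in the dictionary is the root mean squared error (RMSE), which is also what early stopping will monitor. As a reminder of what the number means, here is a minimal NumPy sketch with three made-up fares; the helper `rmse` is written for illustration and is not part of XGBoost.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

actual = [10.0, 20.0, 30.0]      # made-up true fares
predicted = [12.0, 18.0, 33.0]   # made-up predictions, off by 2, 2 and 3

score = rmse(actual, predicted)
print(round(score, 3))
```

Because the errors are squared before averaging, the single 3-dollar miss weighs more than the two 2-dollar misses, so the RMSE is in dollars but penalizes large errors more heavily.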

In [None]:
def XGBmodel(x_train, x_test, y_train, y_test, params):
    matrix_train = xgb.DMatrix(x_train, label=y_train)
    matrix_test = xgb.DMatrix(x_test, label=y_test)
    model = xgb.train(params=params,
                      dtrain=matrix_train,num_boost_round=5000, 
                      early_stopping_rounds=10,evals=[(matrix_test,'test')])
    return model

start_time = time.time()
model = XGBmodel(x_train, x_test, y_train, y_test, params)

In [None]:
time_taken = time.time() - start_time
time_taken

This took ages... but at least we have a pretty good model. Let us finally make the predictions of the test set and prepare the submission file.

### <a id="section8">8. Predictions</a>

In [None]:
test =  pd.read_csv('../input/new-york-city-taxi-fare-prediction/test.csv')
test = datetime_info(test)
test = airport_dist(test)
test['distance'] = sphere_dist(test['pickup_latitude'], test['pickup_longitude'], 
                               test['dropoff_latitude'] , test['dropoff_longitude'])
test_key = test['key']
x_pred = test.drop(columns=['key', 'pickup_datetime']) 

#Predict from test set
# iteration_range replaces the deprecated ntree_limit argument (XGBoost >= 1.4)
prediction = model.predict(xgb.DMatrix(x_pred), iteration_range=(0, model.best_iteration + 1))

In [None]:
#Create submission file
submission = pd.DataFrame({
        "key": test_key,
        "fare_amount": prediction.round(2)
})

submission.to_csv('taxi_fare_submission.csv',index=False)
submission.head()