# Predicting Taxi Fares

1. **In supervised learning**, the model is trained using labels: for example, we give it images of animals and tell it which animal each one shows. Then, we give it unlabelled images and the model tries to guess what they show based on what it learned. Supervised learning is commonly used for classification and regression problems. Examples of popular algorithms: Linear regression, Support Vector Machines (SVM), Neural networks, Decision Trees (and ensembles of them, such as random forests), Naive Bayes, Nearest Neighbors. It is widely used in predictive modelling.

2. **In unsupervised learning**, the model isn't given any labels. Clustering (grouping of similar points into clusters) and anomaly detection are examples of unsupervised learning. Popular algorithms: k-means clustering and association rules.

3. **Semi-supervised learning** is a combination of supervised and unsupervised learning ; some data are labelled, some are not.

4. **In reinforcement learning**, the model learns by trial and error. It learns from past experience and tries to use this knowledge (feedback) to make better decisions. It is used in self-driving car research and was one of the techniques used by AlphaGo. It is also used in many robotics applications. Popular algorithms are Q-learning and deep Q-networks (DQN).

As you probably guessed, predicting taxi fares is a supervised learning task.


In [None]:
import numpy as np
import pandas as pd

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/sample_submission.csv
/kaggle/input/test.csv
/kaggle/input/GCP-Coupons-Instructions.rtf
/kaggle/input/train.csv


Here, we are interested in the test and train files. The idea is to use the train data to build a model, and then to use the model to predict taxi fares on unseen data (the test data).

So, let's read the files we need to create and test a model :

In [None]:
trn_data = pd.read_csv("../input/train.csv", nrows = 2_000_000, parse_dates=["pickup_datetime"])
tst_data = pd.read_csv("../input/test.csv")

The original train file has 55'423'856 rows. It takes time to read it, and it uses almost all the RAM available (16 GB). If we go over this 16 GB limit, the kernel dies. Therefore, we will use a subset of 2'000'000 rows.
Now that we have the files, we will look at what they contain (the first 5 rows), starting with the train data set :

In [None]:
trn_data.head()

Unnamed: 0,key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,2009-06-15 17:26:21.0000001,4.5,2009-06-15 17:26:21,-73.844311,40.721319,-73.84161,40.712278,1
1,2010-01-05 16:52:16.0000002,16.9,2010-01-05 16:52:16,-74.016048,40.711303,-73.979268,40.782004,1
2,2011-08-18 00:35:00.00000049,5.7,2011-08-18 00:35:00,-73.982738,40.76127,-73.991242,40.750562,2
3,2012-04-21 04:30:42.0000001,7.7,2012-04-21 04:30:42,-73.98713,40.733143,-73.991567,40.758092,1
4,2010-03-09 07:51:00.000000135,5.3,2010-03-09 07:51:00,-73.968095,40.768008,-73.956655,40.783762,1


In [None]:
print(trn_data.shape)

(2000000, 8)


We see here that the file contains 2'000'000 rows and 8 variables. The *key* variable is the ID, and *fare_amount* is the target variable that we want to predict, using the information of the remaining columns. In other words, *fare_amount* is actually the label. Finally, let's see if there are any missing values (NAs) in the data :

In [None]:
print(trn_data.isnull().sum())

key                   0
fare_amount           0
pickup_datetime       0
pickup_longitude      0
pickup_latitude       0
dropoff_longitude    14
dropoff_latitude     14
passenger_count       0
dtype: int64


The variables *dropoff_longitude* and *dropoff_latitude* both contain 14 missing values. It is important to deal with NAs before doing any modelling. Handling NAs is a big and complex subject, but here we will simply drop these 14 rows for the sake of simplicity. Also, 14 rows out of 2'000'000 is a very small number, so dropping them shouldn't really affect our model.

In [None]:
trn_data = trn_data.dropna(how = 'any', axis = 'rows')
trn_data.isnull().values.any()

False

The code above dropped the 14 rows containing NAs and checked that the dataset doesn't contain NA's anymore. Note that now, the new train dataset contains 1'999'986 rows.

What does the test set contain ?

In [None]:
tst_data.head()

Unnamed: 0,key,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,2015-01-27 13:08:24.0000002,2015-01-27 13:08:24 UTC,-73.97332,40.763805,-73.98143,40.743835,1
1,2015-01-27 13:08:24.0000003,2015-01-27 13:08:24 UTC,-73.986862,40.719383,-73.998886,40.739201,1
2,2011-10-08 11:53:44.0000002,2011-10-08 11:53:44 UTC,-73.982524,40.75126,-73.979654,40.746139,1
3,2012-12-01 21:12:12.0000002,2012-12-01 21:12:12 UTC,-73.98116,40.767807,-73.990448,40.751635,1
4,2012-12-01 21:12:12.0000003,2012-12-01 21:12:12 UTC,-73.966046,40.789775,-73.988565,40.744427,1


In [None]:
tst_data.shape

(9914, 7)

We see that the test set contains the same variables as the train set, except *fare_amount*. This is normal, because it is the column we want to predict.
We also see that this file has 9'914 rows, whereas the original train set had 55'423'856 rows. This is because when we split the original dataset into train and test sets, we want to have more data to create the model, and less to test it. A general rule of thumb is to use ~80% for training and ~20% for testing. Here, the split is far more lopsided: more than 99.9% of the data (55'423'856 rows) is used for training, and less than 0.02% (9'914 rows) for testing. In reality, it is a little bit more complicated: Kaggle internally splits the test set into 2 subsets, one public and one private. When submitting results in the competition, the leaderboard shows performance on the public test data. The private test data are used at the end of the competition to determine the final standings.

Usually, we wouldn't want to subset the train data the way we did, because we lose information and might decrease our model's performance. Also, we departed from the rule of thumb (80% train, 20% test). However, to keep things simple, we will assume it is acceptable here.

What is the sample_submission.csv file that was listed earlier ? Let's import it and look at the first 5 rows of its content :

In [None]:
sample_sub = pd.read_csv("../input/sample_submission.csv")
sample_sub.head()

Unnamed: 0,key,fare_amount
0,2015-01-27 13:08:24.0000002,11.35
1,2015-01-27 13:08:24.0000003,11.35
2,2011-10-08 11:53:44.0000002,11.35
3,2012-12-01 21:12:12.0000002,11.35
4,2012-12-01 21:12:12.0000003,11.35


This file shows the expected format for the submission of results. Indeed, once we're satisfied with our model (when predictions are good enough according to some metrics), the predictions made on the test set are saved in a CSV file that will be submitted to Kaggle.
Kaggle knows the actual fare amount values, and compares them with our predictions. It then calculates a metric from this comparison and shows the results on the leaderboard.

Each competition specifies a metric that will be used to rank the participants. For example, common metrics for classification problems are the Area Under the ROC Curve (AUC) and the F1 Score, whereas the Mean Squared/Absolute Error (MSE/MAE) are common for regression problems. Our goal is then to build a model that optimizes the given metric.
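To make the regression metrics mentioned above concrete, here is a minimal sketch that computes the MSE, its square root (RMSE) and the MAE with NumPy. The arrays `y_true` and `y_pred` are made-up values chosen for illustration; they are not results from this dataset.

```python
import numpy as np

# Hypothetical target values and predictions (illustrative only)
y_true = np.array([4.5, 16.9, 5.7, 7.7])
y_pred = np.array([5.0, 15.0, 6.0, 8.0])

errors = y_pred - y_true
mse = np.mean(errors ** 2)      # Mean Squared Error
rmse = np.sqrt(mse)             # Root Mean Squared Error
mae = np.mean(np.abs(errors))   # Mean Absolute Error

print(mse, rmse, mae)
```

Because the MSE squares each error, a single large error weighs much more in the MSE than in the MAE.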

Now that we know the essentials about the data and the submitting form, before creating a model, we must identify the problem type ; for example, is it a classification problem ? Or is it a regression problem ?

## 2. Identifying the problem and creating a model

We saw earlier that in this case, we want to predict the fare amount of taxi rides. Because it is a continuous variable, this problem will be addressed with regression models.

The simplest regression method is, of course, linear regression. More complex regression methods include artificial neural networks (after all, they are "just" regression models). However, it is important to use appropriate tools and start with simple models. This is why we will first build a simple linear regression.

### 2.1 Linear regression model

Let's start by building a simple linear regression model using Scikit-Learn. Scikit-learn is a popular Python library for machine learning.
In the first line of the following chunk, we import the linear regression model from the Scikit-learn library. We then create a linear regression object (called lr) and fit it to the train data :

In [None]:
from sklearn.linear_model import LinearRegression

# Create a linear regression object
lr = LinearRegression()

# Fit the model on the train data
lr.fit(X = trn_data[['pickup_longitude', 'pickup_latitude',
                     'dropoff_longitude', 'dropoff_latitude', 'passenger_count']],
       y = trn_data['fare_amount'])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Now that we have our model, we want to predict taxi fares on the test set. This is done by applying the predict method on the linear regression object we created. In the code below, a new column called *fare_amount* is created in the test set (previously, this column was only present in the train set). This column contains the predictions made by our model from the columns listed in the variable *features*.

In [None]:
# Select variables with which the model has been trained
features = ['pickup_longitude', 'pickup_latitude',
            'dropoff_longitude', 'dropoff_latitude', 'passenger_count']

tst_data['fare_amount'] = lr.predict(tst_data[features])

To submit these results, we have to create a file with the format specified in sample_submission.csv. The latter requires 2 variables : *key* and *fare_amount*.
In the following code, we format our submission so that it meets the requirements (select the 2 variables) :

In [None]:
my_submission = tst_data[['key', 'fare_amount']]

Check that the output is what we want :

In [None]:
my_submission.head()

Unnamed: 0,key,fare_amount
0,2015-01-27 13:08:24.0000002,11.269973
1,2015-01-27 13:08:24.0000003,11.27003
2,2011-10-08 11:53:44.0000002,11.270001
3,2012-12-01 21:12:12.0000002,11.269899
4,2012-12-01 21:12:12.0000003,11.269872


Perfect, this is the correct format! The last thing to do would be to save this file to disk as a CSV using pandas' to_csv method.
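A minimal sketch of that last step is shown below. To keep the example self-contained, it rebuilds a two-row stand-in for `my_submission`; the output file name `submission.csv` is an assumption, not something specified in this notebook.

```python
import pandas as pd

# Small stand-in for the `my_submission` DataFrame built above
my_submission = pd.DataFrame({
    "key": ["2015-01-27 13:08:24.0000002", "2015-01-27 13:08:24.0000003"],
    "fare_amount": [11.269973, 11.270030],
})

# index=False keeps the row index out of the file, so it only
# contains the `key` and `fare_amount` columns
my_submission.to_csv("submission.csv", index=False)
```

In the notebook itself, only the final `to_csv` line would be needed, applied to the real `my_submission`.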

### 2.2 A better linear regression model

In order to improve our model, we must first understand how it really works. Let's go through some theory !

As we saw earlier, regression methods are part of supervised learning, which requires that both inputs (variables) and desired outputs (labels or target values) be provided to the algorithm. The algorithm uses the data to infer a function from which it makes predictions. These predictions are then compared to the target values and the prediction error can be quantified using different metrics (such as the MSE). Once training is complete, the algorithm applies what it learned to new data (the test set).

The important words in the previous paragraph are that the **algorithm uses the data to infer a function from which it makes predictions**. In fact, model training consists of finding a function that explains the data points observed in the training set. Then, the model obtained (=the function) is used to predict new data points.

In a linear regression with 2 variables, the function used by the model to fit the data is of the form *y = ax + b*, where *y* is the target, *x* is a single variable, and *a* and *b* are the parameters of the model that we want to learn. More precisely, *a* is the coefficient (the slope) and *b* is the intercept.

The question becomes: how do we choose *a* and *b*? --> We use Ordinary least squares (OLS), which minimizes a loss function: the sum of the squared differences between the predictions and the target values. In Scikit-learn, this is done under the hood. The idea is to find the *a* and *b* that minimize this loss.

In higher dimensions, the formula above generalizes to *y = a1x1 + a2x2 + ... + anxn + b*. Note that here, one coefficient *ai* per variable, plus the intercept *b*, must be estimated. Concretely, we pass to Scikit-learn 2 arrays: one containing the variables, and one with the target.
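To illustrate what happens under the hood, the sketch below estimates *a* and *b* by least squares with NumPy. The data are synthetic (generated from *y = 2x + 1* plus a little noise), and the variable names are chosen for this example only.

```python
import numpy as np

# Synthetic data generated from y = 2x + 1 plus a little noise (illustrative only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, size=200)

# Design matrix: one column for x, one column of ones for the intercept b
A = np.column_stack([x, np.ones_like(x)])

# Least-squares solution: minimizes sum((A @ [a, b] - y) ** 2)
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(a, b)
```

The estimated values should land close to the 2 and 1 used to generate the data. With more variables, the design matrix simply gets one extra column per variable.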

Another very important concept is the notion of overfitting.

**Overfitting**

In the previous model, data were split into 2 subsets ; one for training and one for testing. The model used the training data to estimate a function describing the link between inputs and outputs.
Two undesirable cases may occur when learning from data :

1. **Overfitting** : this phenomenon occurs when the model learns the training data too accurately. The estimated function can thus be very complex and result in a model that has very poor generalization performance (generalization is the ability to make predictions based on new, unused data points)

2. **Underfitting** (or oversmoothing): in this case, the model is too simple (for example, linear or close to linear when the true relationship is not) and is unable to make good predictions.

The figure below illustrates these cases :


![](https://i.imgur.com/1DUUmAd.png)

Imagine that the points represent the price of a taxi ride as a function of distance (*x* = distance, *y* = price). Blue dots are used to create the model, and red ones are used to test it. The *true function* is the unknown function that we want our model to find. In this scenario, we have the following cases :

* Underfitting: we see that a simple linear regression is too simple to capture the relationship between *x* and *y*. Both training and testing errors (blue and red vertical lines) are large. This could be the reason why our model had large testing errors.

* Overfitting: in this case, the model is very complex. Training errors are close to 0 but testing errors are large. This is unlikely with a simple linear model fitted on a single variable, but it can happen with more complex models such as neural networks. Because data consist of information and noise, it is not desirable for the model to learn the data by heart.

* Optimal: we almost found the *true function*; training and testing errors are both low.

Finding a good model is a challenging task. However, there is something we can do: instead of using 2 subsets (train and test), we can use 3! In this approach, the train set is used for parameter estimation (for example *a* and *b* from *y = ax + b*), the validation set is used to select a model (the one with the best generalization properties), and the test set is used to assess the final performance of the model. In other words, we build several models and use the validation set to select the best one. Then, we use this model on the test set.
The problem is that creating 3 subsets reduces the size of the training set, which can decrease performance. To address this issue, a very common method consists of doing a so-called cross-validation: instead of 3 subsets, we split the data in 2. But this time, the train set becomes the validation-train set, while the test set remains unchanged. Then, a K-fold cross-validation is performed on the validation-train subset for parameter estimation.


**Cross-validation**

The results on the test set depend on how the data were split. It is possible that a particular train/validation split is not representative of the data, so that the measured performance reflects the split rather than the model. Cross-validation consists in splitting the validation-train set into K subsets (folds): the training is done on K-1 subsets, and the remaining one is used for validation. This process is repeated K times, each time with a different subset held out for validation. We then have K trained models and K validation scores, which are usually averaged. To summarize, CV makes the chosen metric much less dependent on a single train/validation split.
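The procedure can be sketched with Scikit-learn's `KFold` and `cross_val_score`. The data below are synthetic and only serve to make the example runnable; K = 5 and the random seeds are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.5, size=500)

# 5 folds: each fold is used once for validation, the other 4 for training
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=kfold)

print(scores)          # one (negative) MSE per fold
print(-scores.mean())  # average validation MSE
```

Scikit-learn reports the MSE with a negative sign because its scorers follow a "higher is better" convention.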

How many folds should be used ?

To answer this question, let's first talk about the bias, the variance, and the bias-variance tradeoff, since CV and this tradeoff are related.

The **bias** represents the difference between the average prediction of the model and the true target value. Models with high bias are oversimplified (underfitting), with high errors both on training and test data.

The **variance** is the variability of model prediction for a given data point. Models with high variance perform well on training data, but have high errors on test data. In other words, they overfit and do not generalize well.

The **bias-variance tradeoff**: a model that is too simple, with very few parameters, may have a high bias and a low variance. A more complex model with many parameters may have a high variance and a low bias. Therefore, a good model has to be somewhere in between; we want to keep both the bias and the variance low. The same tradeoff applies to the choice of K: we want to have enough data to train, but also enough data to validate. If K is too large, there are not enough data points in each validation fold to confidently evaluate the model. Usually, a large K decreases the bias but increases the variance and the computation time (because we fit and predict more times). A small K means fewer trained models to evaluate.

In short :

Low K = faster, less variance, more bias

High K = slower, more variance, less bias

Usually, 5- or 10-fold CV is a good starting point. Increasing K to the point where K = n (the number of samples) is called *leave-one-out cross-validation* (LOOCV).

**Regularization**

We saw that performing a CV can help to obtain a better model. What else can we do? We can **regularize** the regression. We also saw that fitting a linear regression minimizes a loss function, the sum of squared residuals of Ordinary least squares (OLS), to choose a coefficient (or parameter) for each feature variable. The problem is that if those coefficients are large, we can get overfitting (especially in high-dimensional space, that is, if we have many variables). This is why we alter the loss function so that it penalizes large coefficients (=regularization).

There are different types of regularized regressions :

* **Lasso** (Least absolute shrinkage and selection operator) regression, in which *loss function = OLS + alpha x sum(abs(ai))*. Lasso can be used for feature selection, that is to select important variables of the dataset, because it shrinks the coefficients of less important features to 0. Lasso is also called L1 regularization because the regularization term is the L1 norm of the coefficients.

* **Ridge** regression, in which *loss function = OLS + alpha x sum(ai^2)*, where *alpha* is a parameter that controls model complexity and that we have to choose. We also see that when *alpha* = 0, we get back OLS. When *alpha* is very large, the penalization is strong and the model is simple, which can lead to underfitting. Ridge regression is also called L2 regularization and is the one that interests us here.

* **Elastic net** regularization, that combines L1 and L2 regularization.
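The difference between the L1 and L2 penalties can be seen on a small synthetic example in which only the first of 5 features influences the target. The data and the *alpha* values below are arbitrary choices for illustration. Note also that Scikit-learn's `Lasso` scales the squared-error term by 1/(2n), so its *alpha* is not directly comparable to the one in the formula above.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first of 5 features influences the target
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] + rng.normal(0, 0.5, size=300)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print(ridge.coef_)  # all coefficients shrunk a little, typically none exactly 0
print(lasso.coef_)  # coefficients of irrelevant features are typically exactly 0
```

This is why Lasso is often used for feature selection, while Ridge keeps all features but with smaller coefficients.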

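In both regressions, we have to choose an *alpha*, in addition to the coefficients *a* and the intercept *b*. The coefficients and the intercept are learned from the data when the model is fitted, whereas *alpha* must be set before fitting. Such a parameter is called a hyperparameter. How do we choose it?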

**Hyperparameter tuning**

This approach consists of trying different hyperparameter values and seeing how they perform. In the end, we keep those which give the best model. It is important to use CV when tuning.
In practice, we choose a grid of values that we want to try for the hyperparameter and we perform K-fold CV for each value in the grid. This method is called **grid search**.

Before we do this, there is one last thing we can do to improve our model : preprocessing the data. It is often said that this step takes 80% of the time when preparing a model.

**Preprocessing the data**

This process is generally done during the so-called EDA (exploratory data analysis). In it, we usually use visualizations to see the distribution of the data, to spot outliers, trends, NAs, and so on. We also look at summary statistics, which can give a lot of information.
In ML, categorical variables are transformed into numerical variables and data are often normalized.

In this kernel, we already removed the NAs, so we don't have to worry about them. We still have to scale our data, because many models use some form of distance to inform them. If features have different scales, those with the largest values can dominate the model. Putting all features on a comparable scale is called data normalization. There are different ways to perform it, such as standardization (subtract the mean and divide by the standard deviation -> all features are centered around 0 and have variance 1). We can also subtract the minimum and divide by the range (minimum 0 and maximum 1) or normalize so that the data range from -1 to 1.
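As a minimal sketch of two of these options, the example below applies Scikit-learn's `StandardScaler` (subtract the mean, divide by the standard deviation) and `MinMaxScaler` (subtract the minimum, divide by the range) to a tiny made-up array. The values are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales (illustrative values only)
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Standardization: subtract the mean, divide by the standard deviation
X_std = StandardScaler().fit_transform(X)

# Min-max scaling: subtract the minimum, divide by the range
X_minmax = MinMaxScaler().fit_transform(X)

print(X_std.mean(axis=0), X_std.std(axis=0))
print(X_minmax.min(axis=0), X_minmax.max(axis=0))
```

In practice, the scaler is fitted on the training data only and then reused, unchanged, to transform the test data.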

Now, let's look at some statistics of the data to find potential outliers.

In [None]:
trn_data.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,1999986.0,1999986.0,1999986.0,1999986.0,1999986.0,1999986.0
mean,11.34768,-72.52324,39.92965,-72.52395,39.92808,1.684125
std,9.852609,12.86798,7.98333,12.77497,10.32382,1.314979
min,-62.0,-3377.681,-3458.665,-3383.297,-3461.541,0.0
25%,6.0,-73.99208,40.73491,-73.99141,40.734,1.0
50%,8.5,-73.98181,40.75263,-73.98016,40.75312,1.0
75%,12.5,-73.96713,40.7671,-73.96369,40.76809,2.0
max,1273.31,2856.442,2621.628,3414.307,3345.917,208.0


Looking at these statistics gives useful information: we see that the minimum *fare_amount* is negative and that the maximum *passenger_count* is 208. Also, there are implausible values for all the min/max longitudes and latitudes. Let's filter all these outliers :

In [None]:
trn_data = trn_data.drop(trn_data[(trn_data.fare_amount <= 2.5)].index, axis = 0)
trn_data = trn_data.drop(trn_data[(trn_data.passenger_count > 8)].index, axis = 0)
trn_data = trn_data.drop(trn_data[(trn_data.pickup_latitude < 40) |
                          (trn_data.pickup_latitude > 42)].index, axis = 0)
trn_data = trn_data.drop(trn_data[(trn_data.pickup_longitude < -75) |
                          (trn_data.pickup_longitude > -73)].index, axis = 0)
trn_data = trn_data.drop(trn_data[(trn_data.dropoff_latitude < 40) |
                          (trn_data.dropoff_latitude > 42)].index, axis = 0)
trn_data = trn_data.drop(trn_data[(trn_data.dropoff_longitude < -75) |
                          (trn_data.dropoff_longitude > -73)].index, axis = 0)
trn_data.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,1950929.0,1950929.0,1950929.0,1950929.0,1950929.0,1950929.0
mean,11.36409,-73.97525,40.75104,-73.97437,40.75139,1.685153
std,9.731438,0.0387881,0.02990528,0.0379305,0.03304212,1.307223
min,2.51,-74.96814,40.05272,-74.96426,40.01931,0.0
25%,6.0,-73.99229,40.73655,-73.9916,40.73554,1.0
50%,8.5,-73.98211,40.75335,-73.98063,40.75384,1.0
75%,12.5,-73.96838,40.76753,-73.96542,40.7684,2.0
max,500.0,-73.01175,41.92279,-73.01178,41.95112,6.0


As you can see, *fare_amount* was filtered to remove everything at or below $2.50, the minimum fare. *passenger_count* was filtered to remove everything above 8, and latitudes and longitudes were filtered to keep only points within the NYC area.

If you think about predicting the price of a taxi ride, you might think that one of the most important variables would probably be some sort of distance. It is not present in the data, but we can create one from the existing variables. The process of extracting features and transforming them is called **feature engineering**. This is done in order to prepare the inputs so that they are compatible with the ML algorithm. In the following chunk, we will define a function that creates a distance that we will use in our model.

In [None]:
def distance_between_points(df):
    df['diff_lat'] = abs(df['dropoff_latitude'] - df['pickup_latitude'])
    df['diff_long'] = abs(df['dropoff_longitude'] - df['pickup_longitude'])
    df['manhattan_dist'] = df['diff_lat'] + df['diff_long']

distance_between_points(trn_data)
distance_between_points(tst_data)

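Note that Manhattan distance is also called *taxicab distance*; it is the distance between 2 points in a grid-like street layout. Here it is computed directly on latitudes and longitudes, so it is expressed in degrees rather than in kilometres. This new variable should probably help our model! Let's look at the 3 new variables :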

In [None]:
trn_data.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,diff_lat,diff_long,manhattan_dist
count,1950929.0,1950929.0,1950929.0,1950929.0,1950929.0,1950929.0,1950929.0,1950929.0,1950929.0
mean,11.36409,-73.97525,40.75104,-73.97437,40.75139,1.685153,0.02139504,0.02287349,0.04426853
std,9.731438,0.0387881,0.02990528,0.0379305,0.03304212,1.307223,0.02425309,0.03532813,0.05313563
min,2.51,-74.96814,40.05272,-74.96426,40.01931,0.0,0.0,0.0,0.0
25%,6.0,-73.99229,40.73655,-73.9916,40.73554,1.0,0.00699234,0.006186,0.016456
50%,8.5,-73.98211,40.75335,-73.98063,40.75384,1.0,0.014214,0.012742,0.028193
75%,12.5,-73.96838,40.76753,-73.96542,40.7684,2.0,0.027257,0.023973,0.050924
max,500.0,-73.01175,41.92279,-73.01178,41.95112,6.0,0.8809624,1.190012,1.710989


To summarize, the following methods can be used to improve the simple linear model we created before (some of the following were already done, and some do not need to be done here) :

* Deal with NAs (remove them, impute a value like the mean, ...)
* Deal with outliers
* Transform categorical variables to numerical variables
* Create new variables
* Normalize the data
* Use K-fold cross-validation to create the model
* Tune the hyperparameter (we will use GridSearch)
* Use a regularized regression (we will use L2 (ridge)) to prevent overfitting

We changed quite a lot of things compared to our first model. Hopefully, this will pay off ! So, let's build this model and check !

In [None]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

ridge = Ridge()

parameters = {'alpha':np.linspace(0,1,20)}

ridge_regressor = GridSearchCV(ridge, parameters, scoring ='neg_mean_squared_error', cv = 5)

ridge_regressor.fit(X = trn_data[['pickup_longitude', 'pickup_latitude',
                     'dropoff_longitude', 'dropoff_latitude', 'passenger_count',
                     'manhattan_dist']],
       y = trn_data['fare_amount'])

print(ridge_regressor.best_params_)
print(ridge_regressor.best_score_)

{'alpha': 1.0}
-32.59745553905192


We see here that the best model on the validation-train set scored an RMSE of ~5.7 (sqrt(abs(-32.60))). This gives us an estimate of how it will perform on the test set. Now, let's use the test set to assess the final performance of the model, submit the file and see if it beats the previous one...
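The prediction and submission steps could look like the sketch below. To stay self-contained and runnable, it uses a tiny synthetic stand-in for the taxi data (the helper `make_fake_rides`, the made-up fare rule and the file name `submission_ridge.csv` are all assumptions for illustration). In the notebook, the same last two lines would be applied to `ridge_regressor` and `tst_data`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

features = ['pickup_longitude', 'pickup_latitude',
            'dropoff_longitude', 'dropoff_latitude',
            'passenger_count', 'manhattan_dist']

def make_fake_rides(n, seed):
    """Build a tiny synthetic stand-in for the taxi data (not real rides)."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        'key': [f'ride_{i}' for i in range(n)],
        'pickup_longitude': rng.uniform(-74.05, -73.9, n),
        'pickup_latitude': rng.uniform(40.65, 40.85, n),
        'dropoff_longitude': rng.uniform(-74.05, -73.9, n),
        'dropoff_latitude': rng.uniform(40.65, 40.85, n),
        'passenger_count': rng.integers(1, 7, n),
    })
    df['manhattan_dist'] = (abs(df['dropoff_latitude'] - df['pickup_latitude'])
                            + abs(df['dropoff_longitude'] - df['pickup_longitude']))
    return df

trn = make_fake_rides(200, seed=0)
trn['fare_amount'] = 2.5 + 100 * trn['manhattan_dist']  # made-up fare rule
tst = make_fake_rides(20, seed=1)

search = GridSearchCV(Ridge(), {'alpha': [0.1, 1.0]},
                      scoring='neg_mean_squared_error', cv=5)
search.fit(trn[features], trn['fare_amount'])

# GridSearchCV refits the best model on all training data by default,
# so predict() uses the best alpha found during the search
tst['fare_amount'] = search.predict(tst[features])
tst[['key', 'fare_amount']].to_csv('submission_ridge.csv', index=False)
```

Since the true fares of the Kaggle test set are hidden, the test score itself is only known after submitting the file.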

### 2.3 A simple neural network

We hear popular terms such as **convolutional neural networks**, **recurrent neural networks** and **deep learning** all the time. In this section, we will look at one of the most basic neural network architectures, called **feedforward neural networks**. Once again, it is important to start with simple models.

Artificial neural networks aim to reproduce the way their biological counterparts work. The basic idea is that a neuron receives information and processes it to produce an output. A neural network consists of 3 parts: an input layer, one or more hidden layers, and an output layer. The following image illustrates the basic concept of NN :

![](https://i.imgur.com/nBSgAhw.png)

This setup contains one hidden layer consisting of 4 neurons. However, the size of each layer is not arbitrary :

- The size (number of neurons) of the input layer depends on the number of features in the data used for the prediction ; if 5 features are used, then the input layer contains 5 neurons.

- The size of the hidden layer can be found by trial and error. We can also use a popular rule of thumb: *Nh = Ns / (a * (Ni + No))*, where *Nh* is the number of neurons in the hidden layer, *Ns* is the number of samples in the training set, *a* is a scaling factor, *Ni* is the number of input neurons and *No* is the number of output neurons. The number of hidden layers depends on the task. Some networks use many hidden layers and up to billions of parameters. The term deep learning is derived from how deep the model is (how many hidden layers it has).

- The size of the output layer depends on the desired outcome. For example, 1 neuron is used for regression, while multiclass classification uses more than 1.

Let's come back to the model. Each input (*x*) has its own weight (*w*), and each neuron has a weight (*b*) called bias.

A neuron receives multiple inputs which pass through a combination function (sum) that combines them into a single value. Although different operators can be used (max/min of weighted inputs, logical AND or OR of the values), a weighted summation (each input is multiplied by its weight and these products are added together) is the most common one.

Then, a transfer function (*f*) calculates the output value from the result of the combination function. Together, these two functions constitute the so-called **activation function**. The most common transfer functions are based on the biological model in which the output remains very low until the combined inputs reach a threshold value. When this value is reached, the neuron is activated and the output is high. Thus, a neuron has the property that small changes in the inputs can have relatively large effects on the output when the combined values lie within a middle range. On the other hand, large changes in the inputs can have a very limited effect on the output for combined values far from the middle range. This is a typical nonlinear behavior, and partly constitutes the power of neural networks. Note that the case in which neurons use a linear transfer function is equivalent to performing a linear regression!
The most common transfer functions are S-shaped, among which the logistic function (sigmoid) produces output values in the range (0, 1), and the hyperbolic tangent in the range (-1, 1).
Activation functions are important because they introduce nonlinear properties to the networks, which are then able to model linear, nearly linear and nonlinear problems.
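The two S-shaped functions mentioned above can be sketched in a few lines of NumPy. The helper name `sigmoid` and the sample inputs are chosen for this example only.

```python
import numpy as np

def sigmoid(x):
    """Logistic function: output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

print(sigmoid(x))   # close to 0 for very negative x, close to 1 for very positive x
print(np.tanh(x))   # hyperbolic tangent: output in (-1, 1)
```

Comparing the outputs for -1, 0 and 1 with those for -10 and 10 shows the behavior described above: the functions react strongly in the middle range and saturate far from it.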

The following image summarizes these concepts by showing a detailed view of a neuron:

![](https://i.imgur.com/TYyvcOi.png)

The different transfer functions are: green = sigmoid, orange = linear, blue = hyperbolic tangent. Note that any infinitely differentiable nonlinear function would work, and a network can use different functions in its neurons.

The hyperparameters (that can be tuned) in NNs include the activation function, the number of hidden layers and the number of neurons in each of them. The weights and biases, in contrast, are parameters that are learned during training.

A limitation of NNs is their opaqueness; even though the weights and biases are known, it is not clear why the network produces the observed results. NNs act like a black box, making the results difficult to interpret. The question is whether the NN actually understood the concept it had to learn, or if it just memorized the answer. A technique called sensitivity analysis gives an idea of what is going on in the network. In a nutshell, it uses the test set to determine how sensitive the output is to each input. To do so, it first finds the average value of each input. Then, it measures the output of the neural network with the averaged inputs. Finally, each input is modified, one at a time, to be at its minimum and maximum values, and the output of the network is calculated. Each input is thus used with three values: minimum, average and maximum. This method allows assessing the sensitivity of the output to each input individually.
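The sensitivity analysis procedure can be sketched as follows. The function `sensitivity_analysis` and the `toy_model` standing in for a trained network are hypothetical names written for this example; any model exposing a prediction function on a 2D array could be plugged in instead.

```python
import numpy as np

def sensitivity_analysis(predict, X):
    """For each input, return the model output at its min, mean and max,
    with all other inputs held at their mean."""
    mean = X.mean(axis=0)
    results = {}
    for i in range(X.shape[1]):
        outputs = []
        for value in (X[:, i].min(), mean[i], X[:, i].max()):
            probe = mean.copy()
            probe[i] = value          # change one input at a time
            outputs.append(float(predict(probe.reshape(1, -1))[0]))
        results[i] = outputs
    return results

# Hypothetical stand-in for a trained network: depends strongly on
# input 0, weakly on input 1, and not at all on input 2
def toy_model(X):
    return 5.0 * X[:, 0] + 0.1 * X[:, 1]

rng = np.random.default_rng(0)
X_test = rng.uniform(0, 1, size=(100, 3))

for i, (low, mid, high) in sensitivity_analysis(toy_model, X_test).items():
    print(f"input {i}: output range = {high - low:.3f}")
```

A large output range for an input suggests that the model relies heavily on it, while a range near 0 suggests the input is mostly ignored.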

**Backpropagation**

Backpropagation is a classical algorithm used to train feedforward neural networks. Its purpose is to minimize the error between the value predicted by the network and the desired output by adjusting the weights and biases. Concretely, the algorithm starts by initializing the weights (there are useful techniques such as simulated annealing that help to find good starting points). Then, the inputs and desired outputs (=target) are presented to the network. The network calculates the output based on the weights, biases and activation function. Finally, the algorithm calculates the error and updates the weights iteratively. In other words, the error is propagated backwards to optimize the weights.

Let's look at an example to make sure we understand what is happening:

![](https://i.imgur.com/WzGdcDx.png)

Here, the weights were randomly initialized in the range [0, 1] for the first iteration (note that the final weights, after optimization, might not be in that interval). First, the inputs are multiplied by their corresponding weights, the products are summed, and the bias of each neuron is added:

* (1 x 0.7 + 0.5 x 0.5) + 0.4 = 0.95 + 0.4 = 1.35

* (1 x 0.2 + 0.5 x 0.9) + 0.3 = 0.65 + 0.3 = 0.95

* (1 x 0.3 + 0.5 x 0.8) + 0.8 = 0.7 + 0.8 = 1.5

Then, the transfer function is applied to these values in each neuron. For example, if a sigmoid (defined as y = 1 / (1 + e^-x)) is used, we get the following values :

* S(1.35) = 0.794

* S(0.95) = 0.721

* S(1.5) = 0.818

These are the outputs of the hidden layer. Each of them is multiplied by the weight linking its neuron to the output, and the products are summed as follows:

* 0.794 x 0.4 + 0.721 x 0.5 + 0.818 x 0.6 = 1.17

The output value of the network is normally obtained by applying a transfer function to this result. In our case, we do not need to apply a function to the output layer because it is a regression problem. The network predicted a value of 1.17, which is compared with the actual value (the target), 1.10. As we can see, the output of the first forward propagation depends on the weights and biases that were randomly generated, so the calculated output can be far from the target value. The backpropagation algorithm optimizes these weights and biases to reduce the error between the output and the target values. It does so by minimizing a cost function. For example, if the prediction is *y*, the target is *t* and the metric is the squared error, then the cost function is *J(W) = (y-t)^2*.
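The forward pass of this example can be reproduced with NumPy, which is a quick way to check the hand calculation. The weights, biases, inputs and target below are the ones quoted in the text; the variable names are chosen here for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Values taken from the worked example in the text
x = np.array([1.0, 0.5])                 # inputs
W_hidden = np.array([[0.7, 0.5],         # one row of weights per hidden neuron
                     [0.2, 0.9],
                     [0.3, 0.8]])
b_hidden = np.array([0.4, 0.3, 0.8])     # hidden-layer biases
w_out = np.array([0.4, 0.5, 0.6])        # hidden-to-output weights
target = 1.10

z = W_hidden @ x + b_hidden              # weighted sums plus biases
h = sigmoid(z)                           # hidden-layer outputs
y = w_out @ h                            # network output (no transfer function)
cost = (y - target) ** 2                 # squared error J(W)

print("weighted sums:", np.round(z, 2))
print("hidden outputs:", np.round(h, 3))
print("prediction:", round(float(y), 3), "cost:", round(float(cost), 5))
```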

**Gradient descent**

Gradient descent is an iterative method that optimizes the weights and biases. The idea is to estimate, by calculating the gradient, how the cost function changes for weights that are close to the current ones. Then, the algorithm moves (tries new weights) in the direction that reduces the cost function. This step is repeated until a minimum is found.
Be aware that gradient descent can be slow and risks getting stuck in a local minimum.

The steps of gradient descent are the following :

1. Weights *W* are randomly initialized
2. The network calculates a prediction
3. The gradients *G* of the cost function are calculated with respect to the parameters, using partial differentiation. The value of the gradient therefore depends on the inputs, the weights, the biases and the cost function
4. The weights are updated by an amount proportional to the gradients: *W = W - nG*, where *n* is the **learning rate**. The latter determines the size of the steps we take to reach a minimum. It is generally chosen manually, starting with small values such as 0.1, 0.01 or 0.001, and then adapted depending on how fast the cost function decreases. There are also optimizers that adapt the step size automatically during training, such as **Adam**, **AdaGrad** and **RMSProp**. In our model, we will use the Adam optimizer.
5. Repeat until the cost function stops decreasing or until a predefined criterion is met.
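The steps above can be sketched on the small network from the backpropagation example. This is an illustration under stated assumptions, not how Keras does it internally: the starting weights are those of the example rather than fresh random values, the learning rate of 0.1 and the 100 iterations are arbitrary choices, and the gradients are derived by hand with the chain rule for the squared error *J(W) = (y-t)^2*.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Step 1: initial weights and biases (values from the worked example)
x = np.array([1.0, 0.5])
t = 1.10
W1 = np.array([[0.7, 0.5], [0.2, 0.9], [0.3, 0.8]])
b1 = np.array([0.4, 0.3, 0.8])
w2 = np.array([0.4, 0.5, 0.6])

learning_rate = 0.1   # assumption: chosen manually for this sketch
losses = []

for step in range(100):
    # Step 2: the network calculates a prediction
    h = sigmoid(W1 @ x + b1)
    y = w2 @ h
    losses.append(float((y - t) ** 2))

    # Step 3: gradients of J = (y - t)^2, obtained with the chain rule
    dJ_dy = 2.0 * (y - t)
    grad_w2 = dJ_dy * h
    dJ_dz = dJ_dy * w2 * h * (1.0 - h)   # sigmoid derivative is h * (1 - h)
    grad_W1 = np.outer(dJ_dz, x)
    grad_b1 = dJ_dz

    # Step 4: update rule W = W - nG
    w2 = w2 - learning_rate * grad_w2
    W1 = W1 - learning_rate * grad_W1
    b1 = b1 - learning_rate * grad_b1

# Step 5 is simplified here to a fixed number of iterations
print("first loss:", losses[0])
print("last loss: ", losses[-1])
```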

There are multiple variants of gradient descent, defined by the amount of data that is used to calculate the gradient:

- **Batch gradient descent** : computes the gradient of the cost function for the entire training data. This method can be very slow if the training set contains many observations.

- **Stochastic gradient descent (SGD)** : computes the gradient for each update using a single, randomly chosen data point. This gradient is a stochastic approximation of the one calculated by batch gradient descent, and each update is much faster to compute.

- **Mini-batch gradient descent** : the training set is divided into small batches and the gradient is calculated for each of them. Typically, batches of roughly 30 to 500 observations are used, and one update is made for each batch.
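In code, the three variants differ only in how many observations feed each gradient computation. The sketch below fits a one-variable linear model on synthetic, noise-free data. The helper `fit_linear`, the data and all settings are assumptions made for illustration.

```python
import numpy as np

def fit_linear(X, y, batch_size, learning_rate=0.05, epochs=100, seed=0):
    """Fit y = X @ w + b by gradient descent on the mean squared error.
    batch_size = len(X) gives batch gradient descent, 1 gives SGD,
    and anything in between gives mini-batch gradient descent."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        order = rng.permutation(n_samples)        # shuffle once per epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            error = X[idx] @ w + b - y[idx]
            grad_w = 2 * X[idx].T @ error / len(idx)
            grad_b = 2 * error.mean()
            w -= learning_rate * grad_w           # one update per batch
            b -= learning_rate * grad_b
    return w, b

# Synthetic data: y = 3x + 1 exactly (illustrative only)
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 1.0

results = {}
for name, batch_size in [("batch", 200), ("mini-batch", 32), ("stochastic", 1)]:
    results[name] = fit_linear(X, y, batch_size=batch_size)
    w, b = results[name]
    print(f"{name:>10}: w = {w[0]:.4f}, b = {b:.4f}")
```

With 200 observations, one epoch corresponds to 1 update for batch gradient descent, 7 updates for mini-batches of 32, and 200 updates for SGD.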

Enough theory: we now have the basics to build a simple neural network model and use it to predict taxi fares! To do so, we will use the popular Keras API, a high-level interface that runs on top of the efficient TensorFlow library.

In [None]:
from keras.models import Sequential
from keras.layers import Dense

# define base model
def baseline_model():
    # create model
    model = Sequential()
    model.add(Dense(10, input_dim = 6, kernel_initializer = 'normal', activation = 'sigmoid'))
    model.add(Dense(1, kernel_initializer = 'normal'))
    # Compile model
    model.compile(loss = 'mean_squared_error', optimizer = 'adam')
    return model

Using TensorFlow backend.


In the code above, we defined a function that creates a baseline model. First, we created a sequential model, which is a linear stack of layers. Then, we specified the following layers: the hidden layer, consisting of 10 neurons with the *sigmoid* transfer function. Inside the declaration of this layer, we specified that the input layer has 6 neurons. Finally, we specified the output layer, consisting of one neuron (the predicted value). Note that we do not apply any transfer function to the output layer. The *kernel_initializer* parameter defines the way to set the initial random weights. Here, they are generated from a normal distribution. We use the MSE as the loss function, and the Adam optimizer, which adapts the step size during training.
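To get a feel for the size of this network, the number of trainable parameters can be counted by hand. The sketch below assumes that each Dense layer has one weight per input-neuron pair plus one bias per neuron, which is the Keras default. `dense_param_count` is a helper written for this illustration, not a Keras function.

```python
def dense_param_count(n_inputs, n_units):
    # One weight per (input, neuron) pair, plus one bias per neuron
    return n_inputs * n_units + n_units

hidden_params = dense_param_count(6, 10)   # 6 inputs -> 10 hidden neurons
output_params = dense_param_count(10, 1)   # 10 hidden neurons -> 1 output

print("hidden layer:", hidden_params)
print("output layer:", output_params)
print("total:", hidden_params + output_params)
```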

In the next chunk of code, we train this model and use it to predict NYC taxi fares :

In [None]:
# Input columns of the network (6 features, matching input_dim = 6)
features = ['pickup_longitude', 'pickup_latitude',
            'dropoff_longitude', 'dropoff_latitude', 'passenger_count',
            'manhattan_dist']

X = trn_data[features]
y = trn_data['fare_amount']

# Train the baseline model
model = baseline_model()
model.fit(X, y, epochs = 5, batch_size = 100, verbose = 1)

# Predict the fares of the test set and write the submission file
tst_data['fare_amount'] = model.predict(tst_data[features])

my_submission = tst_data[['key', 'fare_amount']]
my_submission.to_csv('final_submission_NN', index = False)
print(os.listdir('.'))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
['__notebook__.ipynb', '__output__.json', 'final_submission_NN', 'final_submission_lr', 'final_submission_L2r']


Here, the number of epochs means that the model makes 5 complete passes over the training data, each consisting of many forward and backward passes (one per batch), with the expectation that the loss decreases with each epoch. The output confirms this, and we see that it converges to a loss of ~27. The batch size is the number of samples per gradient update, and verbose can be: 0 = silent, 1 = progress bar, 2 = one line per epoch.
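The relationship between epochs, batch size and number of weight updates is simple arithmetic. The sketch below assumes the full 2,000,000 rows read from train.csv are used for training; the actual count is lower once the preprocessing has filtered out rows, so the figures are upper bounds.

```python
import math

n_samples = 2_000_000   # assumption: rows read from train.csv, before filtering
batch_size = 100        # samples per gradient update
epochs = 5              # complete passes over the training data

updates_per_epoch = math.ceil(n_samples / batch_size)
total_updates = updates_per_epoch * epochs

print("updates per epoch:", updates_per_epoch)
print("total updates:", total_updates)
```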

Let's see how this model performed :

## Conclusion

In this tutorial, we first saw how to compete in a Kaggle competition using a simple supervised learning algorithm: a linear regression. The only preprocessing we did before training the model was removing the NAs. The resulting model scored ~9.4, a fairly poor performance.

Then, we saw how to improve this simple model. In particular, we covered the notions of overfitting, cross-validation, regularization (Lasso (L1) and Ridge (L2)), hyperparameter tuning and preprocessing of the data. In the latter, implausible values for longitude, latitude, fare amount and number of passengers were filtered out. Then, a new variable was created, the Manhattan distance. I didn't show all the submissions in the notebook, but it is the creation of this distance variable that really improved the model. This improved linear regression scored ~5.6.

Finally, we saw a basic architecture of neural networks : feedforward neural networks. We also saw what backpropagation and gradient descent are. With a pretty simple setup, and without trying to optimize parameters such as the number of neurons and the transfer function, the model scored ~4.5.

In this case, the neural network was the best model. However, it is important to think about the problem before trying an algorithm, and not to go straight for the most complicated models. While linear regressions can be very effective (both in terms of computational efficiency and accuracy) for linear relationships, more complex models such as NNs are better at capturing nonlinear relationships. The downsides of NNs are their opacity and their computation time.

In ML, we generally don't know in advance which algorithm will give the best results, and have to try several.

**How could we achieve a better score ?**

There are different ways of improving this score:

- Use more data : the original training set had ~55,000,000 rows. We only used ~2,000,000

- Feature engineering : do a more in-depth, accurate data preprocessing. For example, create new variables, or select/extract features (with feature selection algorithms such as Lasso for selection or PCA for extraction)

- Optimize this model : try other transfer functions, number of neurons and parameters

- Use another model : one could try tree-based models such as random forests, or other types of neural networks. We could also use transfer learning (reusing a model)

- Use ensemble learning methods : combine decisions from multiple models


