# Classroom Project

To build this notebook first was created individuals ones for each model, then used [Grid Search](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) from [scikit-learn](http://scikit-learn.org/), to make exaustive search from models to find top1. So, this notebook compiles, just the top1 models.

## Used Models
   * Linear Regression
   * Decision Tree 
   * Neural Networks

## Task

Based on individual trip attributes, should be predict the duration of each trip in the test set.

## Dataset

Dataset is from [New York City Taxi Trip Duration](https://www.kaggle.com/c/nyc-taxi-trip-duration), a playground competition from Kaggle.

## Data fields
   * **id** - a unique identifier for each trip
   * **vendor_id** - a code indicating the provider associated with the trip record
   * **pickup_datetime** - date and time when the meter was engaged
   *  **dropoff_datetime** - date and time when the meter was disengaged
   * **passenger_count** - the number of passengers in the vehicle (driver entered value)
   * **pickup_longitude** - the longitude where the meter was engaged
   * **pickup_latitude** - the latitude where the meter was engaged
   * **dropoff_longitude** - the longitude where the meter was disengaged
   * **dropoff_latitude** - the latitude where the meter was disengaged
   * **store_and_fwd_flag** - This flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server - Y=store and forward; N=not a store and forward trip
   * **trip_duration** - duration of the trip in seconds

# Import base Packages

It is common in Python community usage of alias, like:
  * pandas as pd
  * numpy as np

In [1]:
# used to work with dataframes(excels like tables)
import pandas as pd

# used to work properly with numerical operations
import numpy as np

# used to avoid deprecated messages
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Show files in Kernel Virtual Machine

## File descriptions
* **train.csv** - the training set (contains 1458644 trip records)
* **test.csv** - the testing set (contains 625134 trip records)
* **sample_submission.csv** - a sample submission file in the correct format


In [2]:
!ls ../input

# Load train dataset

Other common alias, is *dataframe* as *df*

In [3]:
df_train = pd.read_csv('../input/train.csv')
df_train.head()

# Create shortest variables to "large" string labels

In [4]:
plg, plt = 'pickup_longitude', 'pickup_latitude'
dlg, dlt = 'dropoff_longitude', 'dropoff_latitude'
pdt, ddt = 'pickup_datetime', 'dropoff_datetime'

# Clean Train dataset

Basically in this section all unecessary or "unmeaningful" data from dataset was removed.

## Remove missing values

In [5]:
df_train.dropna(inplace=True)

## Remove outliers respect to *trip_duration* column

[Here](https://www.kdnuggets.com/2017/02/removing-outliers-standard-deviation-python.html) is why data that has more than two, absolute, standard deviation from mean are considered outliers.

In [6]:
mean, std_deviation = np.mean(df_train['trip_duration']), np.std(df_train['trip_duration'])
df_train = df_train[df_train['trip_duration'] <= mean + 2 * std_deviation]
df_train = df_train[df_train['trip_duration'] >= mean - 2 * std_deviation]

## Function to calculate distance from *pickup* to *dropoff*

To make this calculation, has to be used [Haversine Formula](https://en.wikipedia.org/wiki/Haversine_formula), instead [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance), because latitude and longitue are points in sphere and not in plain.

The implementation below to Haversine Formula it is from [Aaron D](https://stackoverflow.com/users/399704/aaron-d), on StackOverflow and is available [here](https://stackoverflow.com/questions/15736995/how-can-i-quickly-estimate-the-distance-between-two-latitude-longitude-points). To extend precision of this calculation was used double precision, from [numpy](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.types.html).

In [7]:
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    # Radius of earth in kilometers is 6371
    km = 6371* c
    return km

def haversine_distance(x):
    x1, y1 = np.float64(x[plg]), np.float64(x[plt])
    x2, y2 = np.float64(x[dlg]), np.float64(x[dlt])    
    return haversine(x1, y1, x2, y2)

## Create column with calculated distance from *pickup* to *dropoff*

In [8]:
%%time
df_train['distance'] = df_train[[plg, plt, dlg, dlt]].apply(haversine_distance, axis=1)
df_train.head()

## Convert string to datetime

Columns *pickup_datetime* and *dropoff_datetime* are strings that are in format of *datetime*, and contains information about date and time. But to get date and time from strings in this columns, first it has to be converted to *datetime*. Code below to that.

In [9]:
from datetime import datetime

df_train[pdt] = df_train[pdt].apply(lambda x : datetime.strptime(x, "%Y-%m-%d %H:%M:%S"))
df_train[ddt] = df_train[ddt].apply(lambda x : datetime.strptime(x, "%Y-%m-%d %H:%M:%S"))

## Create colums from *pickuptime*

Date in format YEAR-MONTH-DAY HOUR:MINUTES:SECONDS, aka %Y-%m-%d %H:%M:%S, has no meaningful information to Machine Learning models. Because of that, some information has to be extracted from *pickup_time*, that are:
* **month** - integer from 1 to 12, where january is 1, february is 2, and so on;
* **weekDay** - integer from 0 to 5, where monday is 0, tuesday is 1, and so on;
* **dayMonth** - integer from 0 to 30, where 0 is the first day of month.
* **pickupTimeMinutes** - integer that contains time in minutes of *pickupTime*. Seconds are removed because will insert much specificty in dataset, that can be problematic to generalize.

In [10]:
df_train['month'] = df_train[pdt].apply(lambda x : x.month)
df_train['weekDay'] = df_train[pdt].apply(lambda x : x.weekday())
df_train['dayMonth'] = df_train[pdt].apply(lambda x : x.day)
df_train['pickupTimeMinutes'] = df_train[pdt].apply(lambda x : x.hour * 60.0 + x.minute)
df_train.head()

## Remove unecessary columns

Remove columns that not will be used to train models

In [11]:
df_train.drop(['id', pdt, ddt, dlg, dlt, 'store_and_fwd_flag'], inplace=True, axis=1)
df_train.head()

## Rearrange columns

Just organize columns to put output as last column, and all attributes that has similiar information together, for example, latitude besides longitude.

In [12]:
df_train = df_train[
    [
        plg, 
        plt, 
        'distance', 
        'month', 
        'dayMonth', 
        'weekDay', 
        'pickupTimeMinutes', 
        'passenger_count', 
        'vendor_id', 
        'trip_duration'
    ]
]
df_train.head()

# Prepare data to train models

## Get train and test data

To separate dataset will be use holdout, that consists of split dataset in two parts, usually that division is 70% to train and 30% to test or 66% to train and 34% to test. But in this project was used 70/30 division. But this test set was named *val*(validation), because it is used to validate model, and test set used was Kaggle PrivateTest, that are contents of *test.csv* file. Just Kaggle knows this PrivateTest output, on file are just inputs os tests.

At this point, if all cells are execute properly, dataset will clean. So, last colum contains output, named *y*, and all others columns(attributes) contains input, named *X*. Yes, upper X! For some reason, Python community that works with Machine Learning use this pattern, upper case x to input, and lower case y to output.

In [13]:
from sklearn.model_selection import train_test_split

X, y = df_train.iloc[:, :-1], df_train.iloc[:, -1]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=4305)
X_train.shape, y_train.shape, X_val.shape, y_val.shape

## Standardization input

Brief explanation what is Normalization and Standardization:
* **Normalization**: rescales the values into a range of [0,1]. This might be useful in some cases where all parameters need to have the same positive scale. However, the outliers from the data set are lost.
* **Standardization**: rescales data to have a mean of 0 and standard deviation of 1 (unit variance).

More detailed description [here](http://sebastianraschka.com/Articles/2014_about_feature_scaling.html).

In [14]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)

scaler = StandardScaler().fit(X_val)
X_val = scaler.transform(X_val)

## Use KFold to Cross-Validation validation

[Link](https://www.analyticsvidhya.com/blog/2018/05/improve-model-performance-cross-validation-in-python-r/) to explanation about why cross-validation method is import to evaluate performance of Machine Learn models, and how its work.

This dataset was separeted using k = 3, because validation of this models are just for study purpose, and not of extremelly guaranteed precision. Scientific reports and experiments recommend k = 10, to good model evaluation.

For guarantee that model will be good at generalization, dataset was shuffled to remove any data dependence.

In [15]:
from sklearn.model_selection import KFold, cross_val_score

kf = KFold(n_splits=3, shuffle=True, random_state=4305)

# Create Models

In [16]:
models = {}

## Create Neural Network Model

In [17]:
from sklearn.neural_network import MLPRegressor

mlp = MLPRegressor(
    activation='relu',
    alpha=0.0001, 
    batch_size='auto',
    beta_1=0.9,
    beta_2=0.999, 
    early_stopping=False, 
    epsilon=1e-08,
    hidden_layer_sizes=(3, 3),
    learning_rate='adaptive',
    learning_rate_init=0.001, 
    max_iter=1000, 
    momentum=0.5,
    nesterovs_momentum=True,
    power_t=0.5,
    random_state=None,
    shuffle=True, 
    solver='adam',
    tol=0.0001, 
    validation_fraction=0.1,
    verbose=False, 
    warm_start=True
)

models['mlp'] = mlp

## Create Decision Tree Model

In [18]:
from sklearn.tree import DecisionTreeRegressor

dtree = DecisionTreeRegressor(
    criterion='mse', 
    max_depth=17, 
    max_features=None,       
    max_leaf_nodes=None,
    min_impurity_decrease=0.0,  
    min_impurity_split=None, 
    min_samples_leaf=1,      
    min_samples_split=2,
    min_weight_fraction_leaf=0.0,  
    presort=False, 
    random_state=None,
    splitter='best'
)

models['dtree'] = dtree

## Create Linear Regression Model

In [19]:
from sklearn.linear_model import LinearRegression

lreg = LinearRegression(
    copy_X=True, 
    fit_intercept=True, 
    n_jobs=1, 
    normalize=False
)

models['lreg'] = lreg

# Show created models

In [20]:
models

# Create function to calculate error

This competitions evalution use RMSLE (Root Mean Square Logarithmic Error) function. That works like explained [here](https://www.kaggle.com/c/nyc-taxi-trip-duration#evaluation), and scikit-learn does not implement that like a scorer function. Because of that, this function has to be implemented. Thanks to [jpopham91](https://www.kaggle.com/jpopham91) at Kaggle,  it is already implemented in vectorized version, and is available [here](https://www.kaggle.com/jpopham91/rmlse-vectorized).

\begin{equation*}
    \epsilon = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(p_i + 1) - \log(a_i+1))^2 }
\end{equation*}

Where:

- $\epsilon$ - is the RMSLE value (score);
- $n$ - is the total number of observations in the (public/private) data set;
- $p_{i}$ - is model prediction of trip duration;
- $a_{i}$ - is the actual trip duration for $i$;
- $log(x)$ - is the natural logarithm of $x$.


In [21]:
from sklearn.metrics import make_scorer

def rmsle(y_true, y_pred):
    assert len(y_true) == len(y_pred)
    return np.sqrt(np.mean(np.square(np.subtract(np.log1p(y_true), np.log1p(y_pred)))))

rmsle_scorer = make_scorer(rmsle, greater_is_better=False)

# Train models

In [22]:
%%time
for model in models.values():    
    model.fit(X_train, y_train)

# Validate Models

In [23]:
for model_name, model in models.items():    
    score = cross_val_score(model, X_val, y_val, cv=kf)
    print(f'model name: {model_name} \t score: {score} \t mean_score: {np.mean(score)}')

# Submit to Kaggle

Will be generated files, one for each model trained, after that this notebook must be commited to be possible output files submission.

## Load test dataset

To test models Kaggle maintains a PrivateTest set, where isgiven to users just inputs, for them fill output and send to Kaggle evaluate model performance.

In [24]:
df_test = pd.read_csv('../input/test.csv')
df_test.head()

## Clean Test dataset

In [25]:
df_test['distance'] = df_test[[plg, plt, dlg, dlt]].apply(haversine_distance, axis=1)
df_test[pdt] = df_test[pdt].apply(lambda x : datetime.strptime(x, "%Y-%m-%d %H:%M:%S"))
df_test['month'] = df_test[pdt].apply(lambda x : x.month)
df_test['weekDay'] = df_test[pdt].apply(lambda x : x.weekday())
df_test['dayMonth'] = df_test[pdt].apply(lambda x : x.day)
df_test['pickupTimeMinutes'] = df_test[pdt].apply(lambda x : x.hour * 60.0 + x.minute)
df_test.drop(['pickup_datetime', dlg, dlt, 'store_and_fwd_flag'], inplace=True, axis=1)
df_test = df_test[['id', plg, plt, 'distance', 'month', 'dayMonth', 'weekDay', 'pickupTimeMinutes', 'passenger_count', 'vendor_id']]
df_test.head()

## Get Test data

In [26]:
X_id, X_test = df_test.iloc[:, 0], df_test.iloc[:, 1:]
X_id.shape, X_test.shape

## Standardization input

In [27]:
scaler = StandardScaler().fit(X_test)
X_test = scaler.transform(X_test)

## Predict outputs

Machine Learn models are programming/statistics/artificial intelligence tools to make prediction, because of that, process of a model generate an output is called prediction.

In [28]:
models_output = {}
for model_name, model in models.items():
    models_output[model_name] = model.predict(X_test)

## Generate output file

In [29]:
for model_name, model_output in models_output.items():
    df_output = pd.DataFrame({'id' : X_id, 'trip_duration': model_output})
    df_output.to_csv(model_name + '.csv', index=False)

# Results

## Proposed Models

| Model name            | Local score| Kaggle Public | Kaggle Private |
|-----------------------|------------|---------------|----------------|
| Decision Tree         | 0.61409    | 0.50989       | 0.50703        |
| Linear Regression     | 0.49877    | 0.62139       | 0.62284        |
| Multilayer Perceptron | 0.56493    | 0.53855       | 0.53848        |

## Kaggle

| Team Name      | Public  | Private |
|----------------|---------|---------|
| L2F            | 0.28882 | 0.28976 |
| Swimming       | 0.30566 | 0.30664 |
| Tomohiko Itano | 0.30831 | 0.30942 |