# Training <i>RandomForest</i> - Lap records in Formula 1

## Overview

Time series analysis and forecasting are used to predict future trends and behaviors based on historical data.

A time series is a sequence of data points collected, recorded, or measured at successive, evenly spaced time intervals. Each data point represents an observation or measurement taken at a point in time, such as a stock price, a temperature reading, or a sales figure.
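
As a concrete illustration, an evenly spaced series can be represented in pandas as values indexed by timestamps. The dates and temperature values below are made up for the example and are not part of the lap data.

```python
import pandas as pd

# daily temperature readings (made-up values) at evenly spaced timestamps
timestamps = pd.date_range(start="2024-01-01", periods=5, freq="D")
temperatures = pd.Series([3.1, 2.8, 4.0, 5.2, 4.7], index=timestamps)

# every gap between consecutive observations is exactly one day
gaps = temperatures.index.to_series().diff().dropna()
evenly_spaced = (gaps == pd.Timedelta(days=1)).all()
print(evenly_spaced)
```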

In this exercise, [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) is adopted as the time-series forecasting algorithm. Other prediction algorithms can be used as well, such as autoregression, the linear regression family, or deep neural networks.
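
A tree ensemble has no built-in notion of time, so one common way to use it for forecasting is to recast the series as a supervised problem with lagged values as features. The following is a minimal sketch on a synthetic series. The helper `make_lag_features` and the choice of three lags are illustrative assumptions, not the feature set used later in this exercise.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def make_lag_features(series, n_lags):
    """Hypothetical helper: add one column per previous value of the series."""
    frame = pd.DataFrame({"y": series})
    for lag in range(1, n_lags + 1):
        frame[f"lag_{lag}"] = frame["y"].shift(lag)
    # the first n_lags rows have no complete history, so drop them
    return frame.dropna().reset_index(drop=True)


# synthetic, evenly spaced series: a noisy sine wave
rng = np.random.default_rng(0)
series = pd.Series(np.sin(np.arange(200) / 5.0) + rng.normal(0.0, 0.05, 200))

lagged = make_lag_features(series, n_lags=3)
X = lagged[["lag_1", "lag_2", "lag_3"]].values
y = lagged["y"].values

# keep chronological order: train on the first 150 rows, forecast the rest
model = RandomForestRegressor(n_estimators=20, random_state=0)
model.fit(X[:150], y[:150])
forecast = model.predict(X[150:])
print(forecast.shape)
```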

## Taxonomy

When it comes to time-series forecasting, there are usually four categories of prediction algorithms to choose from:

(1) Statistical models 
* AutoRegression
    *  ARIMA: AutoRegressive Integrated Moving Average
    *  SARIMA: Seasonal AutoRegressive Integrated Moving Average
    *  VAR: Vector AutoRegression
* Exponential Smoothing <br>

(2) Linear Regression <br>
(3) Machine Learning Regressor <br>
(4) Deep Learning: RNN/LSTM, CNN + LSTM, etc. <br>

In addition, Meta/Facebook provides an interesting tool for producing high-quality forecasts for time series data: [Prophet](https://facebook.github.io/prophet/)

The following diagram shows a taxonomy of time-series forecasting algorithms; it is not exhaustive.
<img src="../pictures/forecasting-taxonomy.png" width="800">

In [63]:
# import libraries

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import pandas as pd
pd.options.mode.chained_assignment = None
import numpy as np
import sklearn as sk
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import pickle
import matplotlib.pyplot as plt
from ipywidgets import widgets, interact

## (1) Loading processed data 

The data produced by the previous pre-processing step is loaded here.

Following the __classic__ convention for splitting training and testing data, roughly 2/3 of the processed data is used for training and 1/3 for testing.

In [64]:
pd_full_lap_info = pd.read_csv('../data/03-processed/processed-data.csv')
print(pd_full_lap_info.shape)
pd_full_lap_info.columns

(551742, 7)


Index(['driverId', 'lap', 'milliseconds_each_lap', 'event_name',
       'lap_time_wth_pit', 'avg_race_lap_time',
       'event_fastestLapTime_seconds'],
      dtype='object')

## (2) Splitting training/testing datasets

To split the data into training and testing sets, the function [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) from [scikit-learn](https://scikit-learn.org/stable/index.html) is used here.
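
By default, `train_test_split` shuffles the rows before splitting, and the result differs from run to run unless `random_state` is set. The sketch below uses toy arrays with made-up values to show a reproducible shuffled split next to an order-preserving split with `shuffle=False`, which may be preferable when the order of the rows matters.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data: 9 rows with 2 features each; the values are illustrative only
X = np.arange(18).reshape(9, 2)
y = np.arange(9)

# shuffled split, made reproducible with a fixed seed
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=42)

# order-preserving split: the last rows become the test set
X_tr_o, X_te_o, y_tr_o, y_te_o = train_test_split(X, y, test_size=0.33, shuffle=False)

print(len(y_tr), len(y_te))
print(y_te_o)
```

The test set size is rounded up, which is why 9 rows with `test_size=0.33` give 3 test rows here.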

In [65]:
training_data = pd_full_lap_info[['lap', 'lap_time_wth_pit', 'avg_race_lap_time', 'event_fastestLapTime_seconds']].values
ground_truth = pd_full_lap_info['milliseconds_each_lap'].values
# also split the row index, so that each test row can be traced back to its driver and event
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    training_data, ground_truth, pd_full_lap_info.index, test_size=0.33)

In [66]:
# write the split datasets to files

df_test_features = pd.DataFrame(X_test)
df_test_labels = pd.DataFrame(y_test)
df_test_features.to_csv('../data/04-train-test-data/test_data.csv', index=False)
df_test_labels.to_csv('../data/04-train-test-data/test_labels.csv', index=False)

In [67]:
# write the identifier data to files, in order to associate it with the predictions
# select the identifiers of exactly the rows that ended up in the test set, in the same order

lap_identifier = pd_full_lap_info.loc[idx_test, ['driverId', 'event_name']]
print(lap_identifier.shape)
lap_identifier.columns

(182075, 2)


Index(['driverId', 'event_name'], dtype='object')

In [68]:
lap_identifier.to_csv('../data/04-train-test-data/pred_ident.csv', index=False)
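
The identifier file is meant to be combined with the model's predictions later on. One way this could look is sketched below with made-up values; the column name `predicted_ms` and the toy frames are assumptions for illustration only. The join is positional, so it is only meaningful when the identifier rows are in the same order as the test features.

```python
import numpy as np
import pandas as pd

# toy stand-ins for the identifier file and the model output (values are made up)
identifiers = pd.DataFrame({
    "driverId": [1, 20, 44],
    "event_name": ["Race A", "Race A", "Race B"],
})
predictions = np.array([95429.0, 96110.5, 88302.0])

# align by position: row i of the identifiers describes prediction i
result = identifiers.reset_index(drop=True).copy()
result["predicted_ms"] = predictions
print(result)
```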

## (3) Looking into data before training

In [69]:
print(X_train.shape)
print(y_train.shape)

(369667, 4)
(369667,)


In [70]:
print(X_train[:1])
print(y_train[:1])

[[4.30000000e+01 9.54290000e+04 1.55397077e+05 5.54040000e+01]]
[95429.]


## (4) Training RandomForestRegressor 

In [71]:
# rf_regressor = RandomForestRegressor(n_estimators=50, max_depth=3, max_features="sqrt", criterion="absolute_error", warm_start=True)
rf_regressor = RandomForestRegressor(n_estimators=50, warm_start=True)
rf_regressor.fit(X_train, y_train)
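
After fitting, the model can be checked against the held-out test set. The sketch below uses synthetic data rather than the lap data, and the choice of mean absolute error and R² as metrics is an assumption, not something prescribed by this exercise.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# synthetic regression problem: the target is a noisy linear function of 4 features
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(0.0, 0.05, 500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# compare predictions on unseen rows with the true targets
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:.3f}, R^2: {r2:.3f}")
```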

## (5) Save the trained model

In [72]:
with open('../data/05-trained-models/rf-trained-model.pkl', 'wb') as f:
    pickle.dump(rf_regressor, f)
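
To reuse a saved model, it can be loaded back with `pickle.load`. The sketch below round-trips a small stand-in model through a temporary file instead of the path above; the file name `rf-demo-model.pkl` is made up. Unpickling should only be done with files from a trusted source, and ideally with the same scikit-learn version that wrote them.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# small stand-in model trained on made-up data
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(100, 4))
y = X.sum(axis=1)
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# save to a temporary location, then load it back
model_path = os.path.join(tempfile.mkdtemp(), "rf-demo-model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

with open(model_path, "rb") as f:
    restored = pickle.load(f)

# the restored model reproduces the original predictions
same = np.allclose(model.predict(X[:5]), restored.predict(X[:5]))
print(same)
```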