# Complete tutorial covering ML pipeline

A typical machine learning pipeline consists of:

1. data collection
2. exploratory data analysis (EDA)
3. data preprocessing
4. data modeling

  
- Since Kaggle has already done the `data collection` for you, we will focus on steps 2 to 4.

- In this tutorial, we will practice using the popular packages for those steps.

- Specifically, we will practice using the following packages for each step:

Step 2: `matplotlib`, `seaborn`  
Step 3: `scikit-learn`  
Step 4: `scikit-learn` 

## Setup

We will first read the data using `pandas`. 

In [None]:
import pandas as pd
import numpy as np

In [None]:
path = '../input/tabular-playground-series-feb-2021/'

train_path = path + 'train.csv'
test_path = path + 'test.csv'
sub_path = path + 'sample_submission.csv'

train = pd.read_csv(train_path)
test = pd.read_csv(test_path)
sub = pd.read_csv(sub_path)

## Exploratory Data Analysis

- Let's use the `pd.DataFrame.head()` method, which shows the first 5 rows of a dataframe, to verify that the data has been read correctly.

In [None]:
train.head()

In [None]:
test.head()

In [None]:
sub.head()

In [None]:
train.shape, test.shape, sub.shape

- We have 300k samples for training and 200k samples for testing. 

- 300k samples is a large amount, and training a model on all of it takes quite some time.

- For this tutorial, we will only use 10k randomly sampled rows for training.
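
- One way to draw such a random subset is `DataFrame.sample` with a fixed seed. The sketch below is only an illustration: the small frame `toy` is a made-up stand-in for `train`, and a subset size of 3 stands in for 10k.

```python
import pandas as pd

# Hypothetical stand-in for `train`; the real data has 300k rows.
toy = pd.DataFrame({'feature': range(10), 'target': range(10, 20)})

# Draw 3 rows at random without replacement.
# A fixed random_state makes the draw repeatable.
subset = toy.sample(n=3, random_state=0)

print(subset.shape)
```

- Later in this tutorial the same effect is obtained by shuffling with `sample(frac=1)` and keeping the first rows with `head()`.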

In [None]:
# tutorial on seaborn and matplotlib in progress...
import seaborn as sns
import matplotlib.pyplot as plt

## Data Preprocessing 

- *in progress...*

## Data Modeling

- In this section, we will use a `for` loop to train several machine learning models at once. 

- In particular, we will train `DecisionTree`, `Random Forest`, and `LightGBM`. 

- As stated in the `EDA` section, we will only use 10k of the 300k training samples due to time constraints.

- However, it is recommended to train on all 300k samples for higher accuracy.

- Let's first import the models.

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import lightgbm as lgb

- Then, we will import the `KFold` class and the `cross_val_score` function in order to validate on our training data.

- Both are needed to perform `Shuffled K-fold Cross Validation`.

![](http://ethen8181.github.io/machine-learning/model_selection/img/kfolds.png)
- Image Reference: http://ethen8181.github.io/machine-learning/model_selection/model_selection.html

- As shown in the image above, `Cross Validation` allows you to validate your model on your training data by holding out a part of the training data for validation.

- If you divide the whole training data into 5 parts and use each part for validating your model, it becomes `5-Fold Cross Validation`.

- When dividing the training data, it is important to shuffle it first so that any ordering in the rows does not bias the folds. This can be done with `KFold` by setting `shuffle=True`.

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score 

- `KFold` has the following parameters:

    - `n_splits`: number of folds (int)
    - `shuffle`: whether to shuffle the data (boolean)
    - `random_state`: seed number for reproducibility (int)
    
- By setting these parameters to 5, True, and 0 respectively, we perform shuffled `5-Fold Cross-Validation`.

- We will save the `KFold` object to the `k_fold` variable, which will be passed to the `cross_val_score` function.

In [None]:
k_fold = KFold(n_splits=5, shuffle=True, random_state=0)
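
- To see what a shuffled 5-fold split looks like, the sketch below applies the same settings to ten toy rows. The array `X` and the name `kf` are assumptions made for illustration; they are not part of the competition data.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples standing in for the training data.
X = np.arange(10).reshape(-1, 1)

# Same settings as in the tutorial: 5 folds, shuffled, fixed seed.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

val_seen = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    # Each fold holds out 2 of the 10 rows for validation.
    print(f'fold {fold}: train on {train_idx}, validate on {val_idx}')
    val_seen.extend(val_idx.tolist())

# Across the 5 folds, every row is used for validation exactly once.
print(sorted(val_seen) == list(range(10)))
```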

- In the cell below, we will define 3 models which will be used for modeling. 

In [None]:
model_dict = {'DT':DecisionTreeRegressor(),
              'RF':RandomForestRegressor(n_jobs=-1, random_state=0), 
              'LGB':lgb.LGBMRegressor()}

- Next, we will define the `compare_models` function, which will

    1. iterate through the models in `model_dict` and perform `5-Fold Cross Validation` on each
    2. save each model's mean score in the `score` dictionary

In [None]:
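def compare_models(X_train, y_train, model_dict):

    # Shuffle and keep the first 10k rows.
    # Same seed and same length, so X and y end up in the same row order.
    X_sample = X_train.sample(frac=1, random_state=0).head(10000)
    y_sample = y_train.sample(frac=1, random_state=0).head(10000)

    score = {}

    for model_name, model in model_dict.items():

        # Mean negative MSE over the folds defined by the global `k_fold`
        cv_scores = cross_val_score(model, X_sample, y_sample,
                                    scoring='neg_mean_squared_error',
                                    cv=k_fold, n_jobs=-1)
        score[model_name] = np.mean(cv_scores)

        print(f'{model_name} validation completed')

    return score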

- `sample(frac=1)` randomly shuffles the dataframe, and `head(10000)` then keeps the first 10k rows. Using the same `random_state` for `X_train` and `y_train` keeps their rows aligned.

- In the cell below, we split the `train` and `test` data into `X_train`, `y_train`, and `X_test` to train the models.
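
- The shuffle-then-`head()` pattern relies on the features and the target being shuffled into the same order. A minimal sketch on made-up data (the names `X`, `y`, `X_sub`, and `y_sub` are hypothetical) shows that the same seed on objects of the same length gives matching indices:

```python
import pandas as pd

# Hypothetical toy features and target that share the same index.
X = pd.DataFrame({'cont0': [0.1, 0.2, 0.3, 0.4, 0.5]})
y = pd.Series([1, 2, 3, 4, 5], name='target')

# Shuffle both with the same seed, then keep the first 3 rows of each.
X_sub = X.sample(frac=1, random_state=0).head(3)
y_sub = y.sample(frac=1, random_state=0).head(3)

# The selected rows line up, so each feature row keeps its own target.
print(X_sub.index.equals(y_sub.index))
```

- If the two seeds differed, the rows would no longer match and the model would be trained on mismatched targets.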

In [None]:
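# Features: every column from position 11 up to, but not including, the last one
X_train = train.iloc[:, 11:-1]
# Target column
y_train = train['target']
# Use the same feature columns for the test set
X_test = test[X_train.columns]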
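
- The positional slice in `iloc` can be checked on a miniature frame. The column layout below is an assumption chosen for illustration (it is much smaller than the real data), so the slice starts at position 2 instead of 11.

```python
import pandas as pd

# Hypothetical miniature layout: an id, one categorical column,
# two continuous columns, and the target as the last column.
toy = pd.DataFrame({'id': [1, 2], 'cat0': ['A', 'B'],
                    'cont0': [0.1, 0.2], 'cont1': [0.3, 0.4],
                    'target': [7.0, 8.0]})

# Positional slice: start at column 2 and stop before the last column.
X_toy = toy.iloc[:, 2:-1]
y_toy = toy['target']

print(list(X_toy.columns))
```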

- Executing the cell below validates the 3 models in `model_dict` and saves the validation scores to the `score` variable.

In [None]:
score = compare_models(X_train, y_train, model_dict)

In [None]:
score

- The result is reported as `negative MSE`, so the higher the score (the closer to zero), the better the model performs.

- Since `Random Forest` has the highest score of `-0.784`, we will use `Random Forest` to make predictions on the test data.
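
- Negative MSE can be hard to read at a glance. A small sketch, using made-up scores (`example_score` and its values are hypothetical, not results from this notebook), shows how to turn such scores into RMSE and pick the best model:

```python
import numpy as np

# Hypothetical scores in the same format that `compare_models` returns.
example_score = {'model_a': -0.81, 'model_b': -0.64}

# Flip the sign to recover the MSE, then take the square root for RMSE.
rmse = {name: float(np.sqrt(-value)) for name, value in example_score.items()}

# The best model has the highest negative MSE, which is the lowest RMSE.
best = max(example_score, key=example_score.get)

print(rmse, best)
```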

In [None]:
model = model_dict['RF']

model.fit(X_train.sample(frac=1, random_state=0).head(10000), y_train.sample(frac=1, random_state=0).head(10000))

- After training the model, we will predict on `X_test` and store the result in the `sub` dataframe.

In [None]:
sub['target'] = model.predict(X_test)

In [None]:
sub.to_csv('submission.csv', index=False)

- The resulting score should be around `0.87`, which is not very good.

- However, keep in mind that we only used 10k of the 300k training samples for educational purposes.

- Now that you are familiar with the whole machine learning pipeline, try changing the number of training samples and the types of models to improve your score.
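
- One way to run such an experiment is to loop over a few training set sizes and compare the cross-validated scores. The sketch below is an assumption-laden illustration: it uses synthetic data from `make_regression` instead of the competition data, a single `DecisionTreeRegressor`, and 3 folds to keep it fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the competition data (an assumption for illustration).
X, y = make_regression(n_samples=2000, n_features=5, noise=10.0, random_state=0)

cv = KFold(n_splits=3, shuffle=True, random_state=0)

results = {}
for n in (200, 2000):
    # Cross-validate on the first n rows and store the mean negative MSE.
    scores = cross_val_score(DecisionTreeRegressor(random_state=0),
                             X[:n], y[:n],
                             scoring='neg_mean_squared_error', cv=cv)
    results[n] = float(np.mean(scores))

print(results)
```

- The same loop can be extended to several models by iterating over a dictionary such as `model_dict`.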