---
---
# 1) General Model Investigation

Purpose:
* Explore models with stratified k-fold cross-validation using PyCaret (10 folds by default; reduced to 3 in this notebook to limit run time)
* Select 3 best models for further optimization

---
# 2) Installs & Imports
PyCaret is not preinstalled in this environment and must be installed before it can be imported.

In [None]:
!pip install pycaret

In [None]:
import os
import numpy as np
import pandas as pd
from pycaret.classification import setup, compare_models
import matplotlib.pyplot as plt
%matplotlib inline

---
# 3) Load & Format Data
PyCaret performs stratified k-fold cross-validation by default, so there is no need to split the data into training and validation sets manually. Pixel values are scaled to the range [0, 1] by dividing by 255 before model creation.
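To make the idea of stratification concrete, here is a minimal pure-Python sketch that deals each class's examples out across folds so every fold keeps roughly the same class balance. It is a simplified illustration, not PyCaret's implementation: the function name `stratified_folds` and the toy labels are hypothetical, and no shuffling is done.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, n_folds=3):
    """Assign each example index to a fold, keeping class proportions similar."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        # Deal this class's examples out round-robin so every fold gets a share
        for position, idx in enumerate(by_class[label]):
            folds[position % n_folds].append(idx)
    return folds

# Toy labels (hypothetical): six examples of class 0 and three of class 1
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1]
folds = stratified_folds(labels, n_folds=3)

# Count how many examples of each class landed in each fold
fold_class_counts = [Counter(labels[i] for i in fold) for fold in folds]
print(fold_class_counts)
```

Each fold in turn serves as the validation set while the remaining folds are used for training, so fewer folds means fewer model fits per candidate.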

In [None]:
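# Read in the training data and drop rows with missing values
train = pd.read_csv('../input/overheadmnist/version2/train.csv')
train = train.dropna(axis = 0)

# Scale pixel values to [0, 1]; assumes column 0 is 'label' and the rest are pixels
pixel_cols = train.columns[1:]
train[pixel_cols] = train[pixel_cols] / 255.

# Preview the data and confirm that no missing values remain
print(train.head().iloc[:, :5])
print(f'\nThere are {train.isna().sum().sum()} missing values.')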

---
# 4) Model Creation
* This process can take several minutes
* Reduce folds to avoid notebook timeout

In [None]:
# Initialize the experiment; fold = 3 replaces the default 10 folds to shorten run time
model_setup = setup(data = train, target = 'label', n_jobs = -1,
                     session_id = 42, log_data = True, verbose = True,
                     fold = 3, use_gpu = True, silent = True)

---
# 5) Compare Classification Models
* This takes several hours for large data sets
* ***GPU REQUIRED***

In [None]:
# Train and rank the available classifiers, then return the 3 best models as a list
model_comp = compare_models(n_select = 3, verbose = True)

---
# 6) Results & Discussion
* The CatBoost classifier has a very long training time
* The gradient boosting family has the best performance
* Generality is not lost by reducing the number of cross-validation folds

The larger number of examples in Version 2 changes model performance relative to the initial evaluation on Version 1. Here SVM performs poorly, while the gradient boosting models excel. The top models and the hyper-parameters they used are displayed below.

In [None]:
# View parameters of the final models
model_comp
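Displaying the list prints every hyper-parameter of every model, which can be hard to scan. As a rough sketch, a small helper could pull out just a few parameters of interest. This assumes each entry in `model_comp` exposes a scikit-learn style `get_params()` method; the helper `summarize_models`, the chosen keys, and the `StandInModel` class are hypothetical and exist only so the example runs without PyCaret.

```python
def summarize_models(models, keys=("learning_rate", "n_estimators")):
    """Return (class name, selected hyper-parameters) for each model."""
    summary = []
    for model in models:
        params = model.get_params()
        # Keep only the requested keys that this model actually defines
        chosen = {k: params[k] for k in keys if k in params}
        summary.append((type(model).__name__, chosen))
    return summary

class StandInModel:
    """Hypothetical placeholder for a fitted estimator with get_params()."""
    def __init__(self, **params):
        self._params = params

    def get_params(self):
        return dict(self._params)

demo = [StandInModel(learning_rate=0.1, n_estimators=100),
        StandInModel(max_depth=6)]
demo_summary = summarize_models(demo)
print(demo_summary)
```

In the notebook itself, the equivalent call would be `summarize_models(model_comp)`, provided the assumption about `get_params()` holds for all three models.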

---
# 7) Conclusion
> 1. CatBoost ---> .8285
> 2. Light Gradient Boosting Machine ---> .7998
> 3. Extreme Gradient Boosting ---> .7921

Given its long training time, CatBoost may not be worth using for a gain of about 3 percentage points in accuracy over LightGBM. The most promising model is XGBoost, which scored just below LightGBM but trained in one-third the time.
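The size of these gaps can be checked with a few lines of arithmetic. The sketch below uses only the three scores listed above; the dictionary keys and the helper name `gap_in_points` are illustrative choices, not part of the original analysis.

```python
# Scores copied from the ranked list in this section
accuracy = {"catboost": 0.8285, "lightgbm": 0.7998, "xgboost": 0.7921}

def gap_in_points(better, worse):
    """Accuracy difference between two models, in percentage points."""
    return round((accuracy[better] - accuracy[worse]) * 100, 2)

catboost_vs_lightgbm = gap_in_points("catboost", "lightgbm")
lightgbm_vs_xgboost = gap_in_points("lightgbm", "xgboost")

print(f"CatBoost over LightGBM: {catboost_vs_lightgbm} points")
print(f"LightGBM over XGBoost:  {lightgbm_vs_xgboost} points")
```

The gap between LightGBM and XGBoost is under one point, which is why the shorter training time weighs so heavily in XGBoost's favor.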

## Next Steps
* Optimize top three models
* Test results of image formatting
* Explore feature engineering
---
---