# Supervised Learning

In this notebook, we try out several tree-based supervised learning models:
- Decision Tree
- Random Forest
- CatBoost
- XGBoost
- LightGBM

**To run this notebook, we need to:**
- load the clean, preprocessed dataset, with all unnecessary columns removed, as `df`
- edit the column name for the risk labels (currently `'label'`)


# Load in the Dataset + Packages

In [None]:
# pip install graphviz
import numpy as np
import pandas as pd
import sklearn
import graphviz
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.dummy import DummyClassifier
from sklearn import metrics
from catboost import CatBoostClassifier
from lightgbm.sklearn import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_val_score,
    cross_validate,
    train_test_split,
)

In [None]:
#dataset
# df = ...

In [None]:
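# Requires `df` to be defined in the cell above.
X = df.drop(columns=['label'])
y = df['label']
# random_state fixed so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=37)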

## 1. Decision Tree

### Basic Approach

In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
clf = tree.DecisionTreeClassifier(random_state=37)

In [None]:
clf = clf.fit(X_train,y_train)

In [None]:
# Visualise the tree
tree.plot_tree(clf)

In [None]:
# Alternative visualisation
visualise_tree = tree.export_graphviz(clf, out_file=None)
graph = graphviz.Source(visualise_tree)
graph
# See here for more [https://scikit-learn.org/stable/modules/tree.html]

In [None]:
y_pred = clf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
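Accuracy alone can hide poor performance on a minority class, and the later sections already score on F1 and recall. The cell below is a minimal sketch of looking at those per-class numbers for a single tree. It runs on a synthetic, imbalanced stand-in dataset from `make_classification` (an assumption, since the real `df` is not available here), and all the `*_demo` names are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data with roughly an 80/20 class split (stand-in for df).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=37
)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=37
)

clf_demo = DecisionTreeClassifier(random_state=37).fit(X_tr_demo, y_tr_demo)
y_pred_demo = clf_demo.predict(X_te_demo)

# Rows are true classes, columns are predicted classes.
cm_demo = confusion_matrix(y_te_demo, y_pred_demo)
print(cm_demo)

# Precision, recall and F1 for each class.
print(classification_report(y_te_demo, y_pred_demo))
```

With the real data, the same two calls can be applied to `y_test` and `y_pred` from the cell above.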

### Various hyperparameters to attempt

In [None]:
# max depth
clf_maxdepth = DecisionTreeClassifier(random_state=37, max_depth=5)
clf_maxdepth = clf_maxdepth.fit(X_train,y_train)

In [None]:
# max_leaf_nodes
clf_maxleafnode = DecisionTreeClassifier(random_state=37, max_leaf_nodes=5)
clf_maxleafnode = clf_maxleafnode.fit(X_train,y_train)
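Rather than fitting one tree per hand-picked value, the two hyperparameters above could be compared with cross-validation. The following is a minimal sketch using `GridSearchCV` (already imported at the top of the notebook). The grid values and the synthetic data are assumptions for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; replace with X_train / y_train in the notebook.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=37)

# Illustrative grid: None means "no limit" for both parameters.
param_grid = {
    "max_depth": [2, 5, None],
    "max_leaf_nodes": [5, 20, None],
}

tree_search = GridSearchCV(
    DecisionTreeClassifier(random_state=37),
    param_grid,
    cv=3,
    scoring="accuracy",
)
tree_search.fit(X_demo, y_demo)

# Best combination and its mean cross-validated accuracy.
print(tree_search.best_params_, round(tree_search.best_score_, 3))
```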

## 2. Random Forest

### Basic Approach

In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
rf_clf = RandomForestClassifier(n_estimators=100, random_state=37) # n_estimators = no. of trees in the forest
rf_clf.fit(X_train, y_train)

In [None]:
y_pred_rf_clf = rf_clf.predict(X_test)

In [None]:
print("Accuracy:", metrics.accuracy_score(y_test, y_pred_rf_clf))

In [None]:
# feature importance
feature_imp = pd.Series(rf_clf.feature_importances_, 
                        index = X_train.columns).sort_values(ascending = False)
feature_imp

### Various hyperparameters to attempt

In [None]:
random_grid = {'bootstrap': [True, False],
               'max_depth': [2, 4, 6, 8, 10, 12, None],
               'max_features': ['sqrt', 'log2'],  # 'auto' was removed in scikit-learn 1.3
               'min_samples_leaf': [1, 2, 4],
               'min_samples_split': [2, 5, 10],
               'n_estimators': [130, 180, 230]}

In [None]:
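# "f1" and "recall" with their default settings assume binary labels
scoring_metric = ["accuracy", "f1", "recall"]
rf_random = RandomizedSearchCV(estimator = RandomForestClassifier(random_state=37), 
                               param_distributions = random_grid, 
                               n_iter = 100, 
                               cv = 3, 
                               verbose=2, 
                               random_state=37, 
                               n_jobs = -1,
                               scoring = scoring_metric,
                               refit = "f1",  # required with several metrics; "f1" is an assumed choice
                               return_train_score=True)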

In [None]:
rf_random.fit(X_train, y_train)

In [None]:
cv_results = rf_random.cv_results_
# With several scoring metrics the keys are mean_test_<metric>; there is no "mean_test_score"
for mean_score, params in zip(cv_results["mean_test_f1"], cv_results["params"]):
    print(round(mean_score, 3), params)

In [None]:
# find best params
best_params = rf_random.best_params_
best_params

In [None]:
# best score
rf_random.best_score_

In [None]:
df_rf_random = pd.DataFrame(rf_random.cv_results_)
df_rf_random

## 3. All the boosts

### Reference:
https://pages.github.ubc.ca/mds-2021-22/DSCI_573_feat-model-select_students/lectures/05_ensembles.html

In [None]:
classifiers = {
    "CatBoost": CatBoostClassifier(verbose=0, random_state=37),
    "XGBoost": XGBClassifier(random_state=37, eval_metric='logloss', verbosity=0),
    "LightGBM": LGBMClassifier(random_state=37),
    "decision tree": DecisionTreeClassifier(random_state=37),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=37)
}

In [None]:
results = {}
scoring_metric = ["accuracy", "f1", "recall"]
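The cells below call `mean_std_cross_val_scores`, which is not part of scikit-learn and is not defined anywhere in this notebook. A minimal sketch of such a helper is shown here, assuming it wraps `cross_validate` and reports each score as a `mean (+/- std)` string. The exact behaviour of the helper in the reference may differ, and the demo data at the end is synthetic.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate


def mean_std_cross_val_scores(model, X_train, y_train, **kwargs):
    """Run cross_validate and return 'mean (+/- std)' strings per output column."""
    scores = pd.DataFrame(cross_validate(model, X_train, y_train, **kwargs))
    mean_scores = scores.mean()
    std_scores = scores.std()
    formatted = [
        f"{mean:0.3f} (+/- {std:0.3f})"
        for mean, std in zip(mean_scores, std_scores)
    ]
    return pd.Series(data=formatted, index=mean_scores.index)


# Quick check on synthetic data (stand-in for X_train / y_train).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=37)
demo_scores = mean_std_cross_val_scores(
    DummyClassifier(strategy="stratified", random_state=37),
    X_demo,
    y_demo,
    return_train_score=True,
    scoring=["accuracy", "f1", "recall"],
)
print(demo_scores)
```

Each returned Series has one entry per `cross_validate` output (fit time, score time, and a train and test value for every metric), so `pd.DataFrame(results).T` gives one row per model.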

In [None]:
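# mean_std_cross_val_scores is a helper, not a scikit-learn function;
# it must be defined before running this cell.
dummy = DummyClassifier(strategy="stratified", random_state=37)
results["Dummy"] = mean_std_cross_val_scores(
    dummy, X_train, y_train, return_train_score=True, scoring=scoring_metric
)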

In [None]:
for (name, model) in classifiers.items():
    results[name] = mean_std_cross_val_scores(
        model, X_train, y_train, return_train_score=True, scoring=scoring_metric
    )

In [None]:
pd.DataFrame(results).T