# Model Tuning

This notebook runs a feature selection and hyperparameter tuning exercise for the top-performing classifier. The objective is to decide which TML features and hyperparameters will be used in the next phase, scaling to jurisdictional-scale maps.

In [1]:
import matplotlib.pyplot as plt
import sys
sys.path.append('../src/prototype')
import prepare_data as pp
import run_preds as rp
import score_classifier as score
import numpy as np
import pandas as pd
import pickle
from catboost import CatBoostClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve, f1_score, precision_score, recall_score, confusion_matrix, ConfusionMatrixDisplay

%load_ext autoreload
%autoreload 2

## Feature Selection
Evaluate feature importance for the top-performing MVP model (CatBoost). Training on only the most important features should reduce overfitting and training time, and may improve accuracy by removing noisy or misleading inputs. The goal is to narrow the 65 TML features down to 10-15.  
  
Index 0: slope  
Index 1-2: s1  
Index 3-12: s2  
Index 13-77: TML features (77 is TML tree probability)
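
The column layout can be written down once and reused when interpreting importance indices. The sketch below assumes slope in column 0, s1 in the next two columns, s2 in the next ten, and the 65 TML features after that; `FEATURE_BLOCKS` and `block_of` are hypothetical names, not part of `prepare_data`.

```python
# Assumed 78-column layout, expressed as half-open Python slices.
FEATURE_BLOCKS = {
    "slope": slice(0, 1),
    "s1": slice(1, 3),
    "s2": slice(3, 13),
    "tml": slice(13, 78),
}


def block_of(index):
    """Return the name of the feature block that a column index falls in."""
    for name, block in FEATURE_BLOCKS.items():
        if block.start <= index < block.stop:
            return name
    raise IndexError(f"column {index} is outside the assumed 78-column layout")


# Number of columns per block under this assumption.
block_sizes = {name: b.stop - b.start for name, b in FEATURE_BLOCKS.items()}
print(block_sizes)
print(block_of(16), block_of(7))
```

Under this reading the TML block holds 65 columns, which matches the feature count stated above.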

In [40]:
df = pd.read_csv('../models/mvp_scores.csv')
len(df)

46

In [41]:
df[df['model'].str.contains('^cat_model_v10.*') == True]

| Unnamed: 0 | model | cv | train_score | test_score | roc_auc | precision | recall | f1 |
|---|---|---|---|---|---|---|---|---|
| 16 | cat_model_v10_np | 0.8908 | 0.9969 | 0.8518 | 0.9302 | 0.921 | 0.7865 | 0.8485 |
| 21 | cat_model_v10_nf | 0.8957 | 0.985 | 0.8262 | 0.9101 | 0.9053 | 0.7489 | 0.8197 |
| 29 | cat_model_v10 | 0.8884 | 0.9972 | 0.8523 | 0.9291 | 0.9239 | 0.7847 | 0.8487 |
| 44 | cat_model_v10 | 0.9073 | 0.9982 | 0.8269 | 0.9126 | 0.8695 | 0.7876 | 0.8265 |
| 45 | cat_model_v10_nf | 0.895 | 0.9854 | 0.8293 | 0.9098 | 0.8685 | 0.7942 | 0.8297 |


In [42]:
original_model = df[44:45]
original_model

| Unnamed: 0 | model | cv | train_score | test_score | roc_auc | precision | recall | f1 |
|---|---|---|---|---|---|---|---|---|
| 44 | cat_model_v10 | 0.9073 | 0.9982 | 0.8269 | 0.9126 | 0.8695 | 0.7876 | 0.8265 |


In [43]:
# load original model
filename = '../models/cat_model_v10.pkl'
with open(filename, 'rb') as file:
    model = pickle.load(file)

# importance of every feature, then keep only the TML block (columns 13 onward)
most_important = model.get_feature_importance()
tml_importance = most_important[13:]

# positions within the TML block of the 15 most important features, highest first
tml_feats = np.argsort(tml_importance)[::-1][:15]

# convert back to original column indices by adding 13
top15 = [index + 13 for index in tml_feats]
v10 = top15
v10

[62, 71, 74, 32, 64, 54, 33, 38, 46, 16, 77, 36, 60, 19, 13]
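
The slice, sort, offset pattern used above can be wrapped in a small helper so the offset is always applied to a sliced array. This is a minimal sketch on toy importances; `top_k_in_block` is a hypothetical name and the values are made up for illustration.

```python
import numpy as np


def top_k_in_block(importances, start, k):
    """Return the k highest-importance column indices at or after `start`.

    Indices are reported in the original column numbering.
    """
    block = np.asarray(importances)[start:]
    # argsort is ascending, so reverse it to rank from most to least important
    order = np.argsort(block)[::-1][:k]
    return [int(i) + start for i in order]


# Toy importances for 8 columns; only columns 3..7 are candidates.
toy_importances = [5.0, 1.0, 0.5, 9.0, 0.2, 3.0, 7.0, 0.1]
top3 = top_k_in_block(toy_importances, start=3, k=3)
print(top3)
```

Column 0 has a high importance in the toy data but is never returned, because the ranking only considers columns from `start` onward.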

In [45]:
# quick test to see if we can get better performance using only top 15 feats
X, y = pp.create_xy((14, 14), ['v03', 'v04', 'v10'], drop_prob=False, drop_feats=False, verbose=False)
X_train_ss, X_test_ss, y_train, y_test = pp.reshape_and_scale_manual(X, y)

Plot id 04003 has no cloud free imagery and will be removed.
Plot id 04005 has no cloud free imagery and will be removed.
Plot id 04007 has no cloud free imagery and will be removed.
Plot id 04009 has no cloud free imagery and will be removed.
Plot id 10043 has no cloud free imagery and will be removed.
Plot id 10067 has no cloud free imagery and will be removed.
Baseline: 0.483
The data has been scaled to -1.0000000000000002, 1.0000000000000002


In [46]:
# filter X (all 13 regular features + 15 TML top features)

feats = X_train_ss[:, :13]
tml_feats = X_train_ss[:, top15]

empty = np.empty((X_train_ss.shape[0], 28))
empty[:,:13] = feats
empty[:,13:] = tml_feats

# create new x_train
X_train_selected = empty
X_train_selected.shape

(33908, 28)

In [47]:
# same for test
feats = X_test_ss[:, :13]
tml_feats = X_test_ss[:, top15]

empty = np.empty((X_test_ss.shape[0], 28))
empty[:,:13] = feats
empty[:,13:] = tml_feats

# create new x_test
X_test_selected = empty
X_test_selected.shape

(16856, 28)

In [48]:
# check shapes
X_train_selected.shape, X_test_selected.shape, y_train.shape, y_test.shape

((33908, 28), (16856, 28), (33908,), (16856,))
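
The train and test filtering steps are identical, so they could share one function. A minimal sketch, assuming the arrays are 2-D NumPy arrays with the base features in the leading columns; `select_columns` is a hypothetical helper and the toy array stands in for `X_train_ss`.

```python
import numpy as np


def select_columns(X, n_base, extra_cols):
    """Keep the first `n_base` columns plus the columns listed in `extra_cols`."""
    return np.hstack([X[:, :n_base], X[:, list(extra_cols)]])


# Toy data: 4 samples, 6 columns. Keep 2 base columns plus columns 5 and 3.
X_toy = np.arange(24, dtype=float).reshape(4, 6)
X_sel = select_columns(X_toy, n_base=2, extra_cols=[5, 3])
print(X_sel.shape)
print(X_sel[0])
```

The selected columns keep the order given in `extra_cols`, so passing a list sorted by importance preserves that ranking in the output.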

In [49]:
# fit and score new model 
cat_15feats = CatBoostClassifier(verbose=False, random_state=22)
cat_15feats.fit(X_train_selected, y_train)
score.print_scores(cat_15feats, X_train_selected, X_test_selected, y_train, y_test)

cv: 0.9072
train: 0.9966
test: 0.834
roc_auc: 0.9138
precision: 0.8762
recall: 0.7954
f1: 0.8338


In [39]:
original_model

| Unnamed: 0 | model | cv | train_score | test_score | roc_auc | precision | recall | f1 |
|---|---|---|---|---|---|---|---|---|
| 44 | cat_model_v10 | 0.9073 | 0.9982 | 0.8269 | 0.9126 | 0.8695 | 0.7876 | 0.8265 |
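
To compare the two runs directly, the per-metric differences and the train/test gap can be computed from the scores reported above. The sketch copies the numbers by hand from the two outputs (the top-15 model and row 44 of the scores table); the dictionary names are illustrative.

```python
# Scores copied from the outputs above.
original = {"cv": 0.9073, "train": 0.9982, "test": 0.8269, "roc_auc": 0.9126,
            "precision": 0.8695, "recall": 0.7876, "f1": 0.8265}
top15_model = {"cv": 0.9072, "train": 0.9966, "test": 0.834, "roc_auc": 0.9138,
               "precision": 0.8762, "recall": 0.7954, "f1": 0.8338}

# Positive delta means the top-15 model scored higher on that metric.
deltas = {name: round(top15_model[name] - original[name], 4) for name in original}

# A smaller train/test gap is a rough indicator of less overfitting.
gap_original = round(original["train"] - original["test"], 4)
gap_top15 = round(top15_model["train"] - top15_model["test"], 4)

for name, delta in deltas.items():
    print(f"{name:>9}: {delta:+.4f}")
print(f"train/test gap: {gap_original} -> {gap_top15}")
```

The differences are all below 0.01, so a single train/test split is probably not enough to say whether they are meaningful.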


### Check consistency across regions

In [36]:
# are the same features important for catboost in west africa?
model_name, v_train_data = 'cat', 'v11'
filename = f'../models/{model_name}_model_{v_train_data}.pkl'

with open(filename, 'rb') as file:
    model = pickle.load(file)

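# get initial read of feature importance
# NOTE: the importances here cover all columns (no [13:] slice as in the v10 cell),
# so the +13 offset below shifts every index by 13 relative to the true column numbers
feats = model.get_feature_importance()
feats_ordered = np.argsort(feats)[::-1]

# get original indices by adding 13
top15 = [index + 13 for index in feats_ordered][:15]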
v11 = top15

In [37]:
# how about south america? (v9)
model_name, v_train_data = 'cat', 'v09'
filename = f'../models/{model_name}_model_{v_train_data}.pkl'

with open(filename, 'rb') as file:
    model = pickle.load(file)

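# get initial read of feature importance
# NOTE: the importances here cover all columns (no [13:] slice as in the v10 cell),
# so the +13 offset below shifts every index by 13 relative to the true column numbers
feats = model.get_feature_importance()
feats_ordered = np.argsort(feats)[::-1]

# get original indices by adding 13
top15 = [index + 13 for index in feats_ordered][:15]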
v09 = top15

In [40]:
# what are the common best features across three regions
v09

[18, 17, 29, 24, 46, 13, 16, 25, 23, 64, 75, 26, 85, 37, 19]

In [41]:
v10

[62, 71, 74, 32, 64, 54, 33, 38, 46, 16, 77, 36, 60, 19, 13]

In [42]:
v11

[16, 84, 20, 29, 25, 24, 67, 77, 89, 18, 75, 88, 85, 87, 26]

In [46]:
# check latin america, then africa
first_set = set(v09).intersection(set(v10))    
result_set = first_set.intersection(set(v11))
first_set, result_set

({13, 16, 19, 46, 64}, {16})
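
Chained intersections only show features shared by every region in the chain. Counting how many regions each index appears in gives a fuller picture. The sketch below reuses the three top-15 lists printed above; the variable names are illustrative.

```python
from collections import Counter

# Top-15 lists as printed above.
region_top15 = {
    "v09": [18, 17, 29, 24, 46, 13, 16, 25, 23, 64, 75, 26, 85, 37, 19],
    "v10": [62, 71, 74, 32, 64, 54, 33, 38, 46, 16, 77, 36, 60, 19, 13],
    "v11": [16, 84, 20, 29, 25, 24, 67, 77, 89, 18, 75, 88, 85, 87, 26],
}

# Count, for each index, the number of regions whose top 15 includes it.
counts = Counter(i for feats in region_top15.values() for i in set(feats))

shared_by_all = sorted(i for i, n in counts.items() if n == len(region_top15))
shared_by_two_or_more = sorted(i for i, n in counts.items() if n >= 2)

print(shared_by_all)
print(shared_by_two_or_more)
```

This view also surfaces pairs that the chained intersection skips, such as the overlap between v09 and v11.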

## Hyperparameter Tuning
Hyperparameter tuning was informed by the [CatBoost documentation](https://catboost.ai/en/docs/concepts/parameter-tuning#iterations). CatBoost's default parameters generally give strong results, so tuning is expected to bring only minor improvements.

In [51]:
# use central america training data
X, y = pp.create_xy((14, 14), ['v03', 'v04', 'v10'], drop_prob=False, drop_feats=False, verbose=False)
X_train_ss, X_test_ss, y_train, y_test = pp.reshape_and_scale_manual(X, y)

Plot id 04003 has no cloud free imagery and will be removed.
Plot id 04005 has no cloud free imagery and will be removed.
Plot id 04007 has no cloud free imagery and will be removed.
Plot id 04009 has no cloud free imagery and will be removed.
Plot id 10043 has no cloud free imagery and will be removed.
Plot id 10067 has no cloud free imagery and will be removed.
Baseline: 0.483
The data has been scaled to -1.0000000000000002, 1.0000000000000002


In [52]:
iterations = [int(x) for x in np.linspace(200, 1100, 10)]            # equiv to n_estimators
depth = [int(x) for x in np.linspace(4, 10, 4)]                      # equiv to max_depth (must be <= 16)
l2_reg = [int(x) for x in np.linspace(2, 30, 4)]
learning_rate = [.01, .02, .03]                                      # decrease learning rate if overfitting 

param_dist = {'iterations': iterations,
              'depth': depth,
              'l2_leaf_reg': l2_reg,
              'learning_rate': learning_rate}
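
It helps to see the concrete values the `linspace` calls produce and how large the full grid is. The sketch below rebuilds the same lists in plain Python; `int_linspace` is a hypothetical stand-in for `np.linspace` followed by `int`, which truncates rather than rounds.

```python
from math import prod


def int_linspace(start, stop, num):
    """Evenly spaced values from start to stop (inclusive), truncated to int."""
    return [int(start + i * (stop - start) / (num - 1)) for i in range(num)]


param_dist = {
    "iterations": int_linspace(200, 1100, 10),
    "depth": int_linspace(4, 10, 4),
    "l2_leaf_reg": int_linspace(2, 30, 4),
    "learning_rate": [0.01, 0.02, 0.03],
}

# Size of the full grid if every combination were tried.
n_combinations = prod(len(values) for values in param_dist.values())

for name, values in param_dist.items():
    print(name, values)
print("combinations:", n_combinations)
```

Because of the truncation, the `l2_leaf_reg` candidates are not evenly spaced integers.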

In [57]:
# perform the random search
cat = CatBoostClassifier(random_seed=22, verbose=False)

rds = RandomizedSearchCV(estimator=cat,
                        param_distributions=param_dist, 
                        n_iter=30,
                        cv=3)

In [58]:
# Achieves 0.907
rds.fit(X_train_ss, y_train)
rds_best = rds.best_params_
print(f"The best parameters are {rds.best_params_} with a score of {rds.best_score_}")

The best parameters are {'learning_rate': 0.03, 'l2_leaf_reg': 20, 'iterations': 1100, 'depth': 4} with a score of 0.9076614979982308
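
With `n_iter=30`, the search evaluates only a small share of the possible combinations. The sketch below imitates that sampling step in plain Python, assuming candidates are drawn without replacement (which is how scikit-learn documents the behaviour when every parameter is given as a list). It does not reproduce which candidates `RandomizedSearchCV` actually picked, and the seed is arbitrary.

```python
import itertools
import random

param_dist = {
    "iterations": [200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100],
    "depth": [4, 6, 8, 10],
    "l2_leaf_reg": [2, 11, 20, 30],
    "learning_rate": [0.01, 0.02, 0.03],
}

# Enumerate the full grid as a list of parameter dictionaries.
names = list(param_dist)
grid = [dict(zip(names, combo)) for combo in itertools.product(*param_dist.values())]

# Draw 30 distinct candidates, mirroring n_iter=30.
rng = random.Random(22)
candidates = rng.sample(grid, 30)
coverage = len(candidates) / len(grid)

print(f"{len(candidates)} of {len(grid)} combinations ({coverage:.2%})")
```

The reported best parameters include the largest `iterations` and `learning_rate` values and the smallest `depth` in the grid, so the optimum may lie outside the ranges searched.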


In [60]:
# now fit classifier with best params and get all scores
cat_best_params = CatBoostClassifier(random_seed=22,
                                     learning_rate=0.03,
                                     l2_leaf_reg=20,
                                     iterations=1100,
                                     depth=4,
                                     verbose=False)

cat_best_params.fit(X_train_ss, y_train) 

# save trained model (the tuned classifier, not the previously loaded `model`)
filename = '../models/cat_model_10tuned.pkl'
with open(filename, 'wb') as file:
    pickle.dump(cat_best_params, file)

score.print_scores(cat_best_params, X_train_ss, X_test_ss, y_train, y_test)

cv: 0.9077
train: 0.9788
test: 0.8293
roc_auc: 0.9138
precision: 0.8726
recall: 0.7893
f1: 0.8288


In [3]:
original_model

| Unnamed: 0 | model | cv | train_score | test_score | roc_auc | precision | recall | f1 |
|---|---|---|---|---|---|---|---|---|
| 29 | cat_model_v10 | 0.8884 | 0.9972 | 0.8523 | 0.9291 | 0.9239 | 0.7847 | 0.8487 |


## Conclusions

#### Round 2
- Fitting with the top 15 TML features shows a minor improvement in test score (0.834 vs. 0.8269) and slightly reduced overfitting (train score 0.9966 vs. 0.9982).
- There are commonalities across regions. In Latin America (Central + South), 5 of the top 15 features were shared: {13, 16, 19, 46, 64}. Across all three regions, only feature {16} made the top 15. All are TML features.
- Hyperparameter tuning indicated the best parameters are {'learning_rate': 0.03, 'l2_leaf_reg': 20, 'iterations': 1100, 'depth': 4}.

#### Round 1
- Fitting the CatBoostClassifier with the top 15 TML features rather than all 65 brought only minor improvements in accuracy.
- The feature selection exercise revealed that different features are important for different regions.
- Index 7 (s2) and 67 (TML feat) ranked highly across all three regions.
- Index 77 (TML tree probability) had surprisingly low importance.
- Fitting the CatBoostClassifier with the best features and best parameters resulted in improvements between .01 - .1% across accuracy metrics.