# Optimizing Random Forest (Model Selection - 3)

We know from the start that the class distribution for this dataset is not balanced. In this round we add different class-weight policies (`balanced` and `balanced_subsample`) to a `BalancedRandomForestClassifier`, run them through the grid searches, and see whether accuracy improves.

- The next model selection (say, Model Selection - 4) could try oversampling and undersampling to see whether that works better.

In [1]:
import pandas as pd
import numpy as np
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder 
import joblib

from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import classification_report

# plot_confusion_matrix was removed in scikit-learn 1.2; use ConfusionMatrixDisplay (imported above) instead

In [2]:
from datetime import datetime
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score
import joblib
import os

**custom weight calculation function based on the classes**

In [3]:
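def class_weight(labels_dict, mu=0.15):
    """Return each class's share of the total count, capped at 1.

    Note: `mu` is not used by the calculation; it is kept so existing calls still work.
    """
    total = np.sum(list(labels_dict.values()))
    weight = dict()
    for key, count in labels_dict.items():
        score = float(count) / total
        weight[key] = score if score < 1 else 1
    return weight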

In [4]:
import re

# function to save the per-class rows of a classification report to a shared CSV
def classification_report_csv(report, classifier_name, ascore):
    report_path = 'classification_reports/classification_report.csv'
    os.makedirs('classification_reports', exist_ok=True)
    report_data = []

    # lines[2:-5] keeps only the per-class rows: it skips the header and the blank
    # line before them, and the accuracy / macro avg / weighted avg rows after them
    for line in report.split('\n')[2:-5]:
        # columns are separated by runs of two or more spaces,
        # while class names only contain single spaces
        row_data = re.split(r'\s{2,}', line.strip())
        report_data.append({
            'classifier': classifier_name,
            'accuracy_score': ascore,
            'class': row_data[0],
            'precision': float(row_data[1]),
            'recall': float(row_data[2]),
            'f1_score': float(row_data[3]),
        })

    dataframe = pd.DataFrame(report_data)

    if os.path.exists(report_path):
        df_cr = pd.read_csv(report_path)
        # replace any rows saved earlier for the same classifier
        df_cr = df_cr[df_cr['classifier'] != classifier_name]
        df_cr = pd.concat([df_cr, dataframe])
        df_cr.to_csv(report_path, index=False)
    else:
        dataframe.to_csv(report_path, index=False)

**reading all the feature set files**

In [5]:
# base feature set with advanced and mean encoded features
df_base_adv_mean = pd.read_csv('input/feature_sets/base_adv_mean.csv')

# 1. Model with Base + Advanced + Mean Features

In [6]:
X = df_base_adv_mean.drop(['status_group','id','functional needs repair','non functional'], axis=1)
y = df_base_adv_mean['status_group'].values

# test size changed to 0.21 based on earlier results
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.21, random_state=42)

In [7]:
# class counts per label, converted to frequency-based weights
labels_dict = df_base_adv_mean['status_group'].value_counts().to_dict()
weights = class_weight(labels_dict)
print(weights)

{'functional': 0.543080808080808, 'non functional': 0.3842424242424242, 'functional needs repair': 0.07267676767676767}
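
The weights printed above are class frequencies, so the majority class receives the largest weight. The `'balanced'` option used in the grids below goes the other way: scikit-learn documents it as `n_samples / (n_classes * np.bincount(y))`, which is the same as `1 / (n_classes * proportion)`. A minimal sketch of that arithmetic, using only the proportions printed above (the helper name `balanced_weights` is illustrative and not part of this notebook):

```python
# Class proportions copied from the output of the cell above
proportions = {
    'functional': 0.543080808080808,
    'non functional': 0.3842424242424242,
    'functional needs repair': 0.07267676767676767,
}

def balanced_weights(proportions):
    """Inverse-frequency weights: 1 / (n_classes * proportion) for each class."""
    n_classes = len(proportions)
    return {label: 1.0 / (n_classes * p) for label, p in proportions.items()}

balanced = balanced_weights(proportions)
for label, w in balanced.items():
    print(f"{label}: {w:.3f}")
```

With inverse-frequency weights the rarest class, `functional needs repair`, gets the largest weight, which is the opposite ordering to the frequency-based dictionary above.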


### Halving Grid Search CV - 1 (re-test)

In [8]:
# parameter grid
pgrid = {    
    'max_depth' : [101],    
    'min_samples_split' : [3,4],
    'min_samples_leaf' : [2],
    'class_weight': ['balanced','balanced_subsample']
}

# specifying the cv
cv_skf = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)

# specifying the model 
rfgs = BalancedRandomForestClassifier(n_jobs=-1)

# keep track of the date and time
dt_string = datetime.now().strftime("%d/%m/%Y %H:%M:%S")


# specify the grid search cv
cv = HalvingGridSearchCV(estimator=rfgs,param_grid=pgrid,cv=cv_skf,n_jobs=-1,verbose=10, scoring='accuracy',random_state=0)

# print the date and time
print("date and time =", dt_string)

date and time = 05/09/2021 23:41:48


**execution of halving grid search cv**

In [9]:
%%time
joblib.dump(cv.fit(X_train,y_train),'models/HalvingGridSearchCV_ms3.pkl')

n_iterations: 2
n_required_iterations: 2
n_possible_iterations: 2
min_resources_: 15642
max_resources_: 46926
aggressive_elimination: False
factor: 3
----------
iter: 0
n_candidates: 4
n_resources: 15642
Fitting 5 folds for each of 4 candidates, totalling 20 fits

['models/HalvingGridSearchCV_ms3.pkl']
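
The log above reports `min_resources_: 15642`, `max_resources_: 46926`, `factor: 3` and 4 starting candidates. Those numbers can be reproduced with a small sketch of the successive-halving schedule. This is a simplified illustration, not scikit-learn's implementation: it assumes the default `min_resources='exhaust'` and `aggressive_elimination=False`, and it ignores the lower bound scikit-learn applies to very small sample counts. The function name `halving_schedule` is hypothetical.

```python
import math

def halving_schedule(n_candidates, max_resources, factor=3):
    """Return (iteration, n_candidates, n_resources) tuples for successive halving.

    Simplified sketch: assumes min_resources='exhaust' and no aggressive elimination.
    """
    # iterations needed to narrow the candidates down to at most `factor`
    n_required = 1 + math.floor(math.log(n_candidates, factor))
    # start small enough that the last iteration uses (close to) max_resources
    min_resources = max_resources // factor ** (n_required - 1)

    schedule = []
    for i in range(n_required):
        schedule.append((i, n_candidates, min_resources * factor ** i))
        # only the best 1/factor of the candidates move on to the next iteration
        n_candidates = math.ceil(n_candidates / factor)
    return schedule

for iteration, candidates, resources in halving_schedule(4, 46926):
    print(f"iter {iteration}: {candidates} candidates, {resources} resources")
```

Calling `halving_schedule(8, 1100)` applies the same arithmetic to a search with 8 candidates where the resource is `n_estimators` capped at 1100, as in the second halving search.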

**displaying the best param**

In [10]:
loaded_cv = joblib.load('models/HalvingGridSearchCV_ms3.pkl')
loaded_cv.best_params_

{'class_weight': 'balanced_subsample',
 'max_depth': 101,
 'min_samples_leaf': 2,
 'min_samples_split': 3}

**get the best estimator and classification report**

In [11]:
rf_best = loaded_cv.best_estimator_
rf_best.fit(X_train,y_train)

# get the prediction
rfpred = rf_best.predict(X_test)

# print classification report
cr = classification_report(y_test, rfpred)

print(cr)

classification_report_csv(cr,'HalvingGridSearchCV_ms3.pkl',accuracy_score(y_test, rfpred))

                         precision    recall  f1-score   support

             functional       0.85      0.70      0.77      6790
functional needs repair       0.26      0.78      0.38       885
         non functional       0.84      0.73      0.78      4799

               accuracy                           0.72     12474
              macro avg       0.65      0.74      0.65     12474
           weighted avg       0.81      0.72      0.75     12474
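
The `macro avg` and `weighted avg` recall rows in the report above can be rebuilt from the per-class rows. The macro average treats every class equally, which is what `scoring='balanced_accuracy'` measures, while the support-weighted average of recall equals plain accuracy. A small sketch using the rounded values from the report, so the results are only approximate:

```python
# (recall, support) per class, copied from the report above (rounded to two decimals)
per_class = {
    'functional': (0.70, 6790),
    'functional needs repair': (0.78, 885),
    'non functional': (0.73, 4799),
}

recalls = [recall for recall, _ in per_class.values()]
supports = [support for _, support in per_class.values()]

# macro average: every class counts equally, regardless of its size
macro_recall = sum(recalls) / len(recalls)

# weighted average: each class counts in proportion to its support
weighted_recall = sum(r * s for r, s in per_class.values()) / sum(supports)

print(f"macro recall    ~ {macro_recall:.2f}")
print(f"weighted recall ~ {weighted_recall:.2f}")
```

Because the small class carries one third of the macro average but only about 7% of the weighted one, a model can look better on balanced accuracy than on accuracy when it favours the minority class.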



### Halving Grid Search CV - 2

In [12]:
# parameter grid
pgrid = {        
    'max_depth' : [121,151],
    'min_samples_split' : [2,3],
    'min_samples_leaf' : [1], 
    'class_weight': ['balanced','balanced_subsample']
}

# specifying the cv
cv_skf = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)

# specifying the model 
rfgs = BalancedRandomForestClassifier(n_jobs=-1, verbose=1)

# keep track of the date and time
dt_string = datetime.now().strftime("%d/%m/%Y %H:%M:%S")


# specify the grid search cv
cv = HalvingGridSearchCV(
    estimator=rfgs,param_grid=pgrid,cv=cv_skf,n_jobs=-1,verbose=10,scoring='balanced_accuracy',random_state=0,
    resource='n_estimators',max_resources=1100)

# print the date and time
print("date and time =", dt_string)

date and time = 05/09/2021 23:42:19


**execution**

In [13]:
%%time
joblib.dump(cv.fit(X_train,y_train),'models/HalvingGridSearchCV_2_ms3.pkl') 

n_iterations: 2
n_required_iterations: 2
n_possible_iterations: 2
min_resources_: 366
max_resources_: 1100
aggressive_elimination: False
factor: 3
----------
iter: 0
n_candidates: 8
n_resources: 366
Fitting 5 folds for each of 8 candidates, totalling 40 fits

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    0.9s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    4.7s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   11.8s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:   21.9s
[Parallel(n_jobs=-1)]: Done 1098 out of 1098 | elapsed:   31.3s finished


Wall time: 6min 9s


['models/HalvingGridSearchCV_2_ms3.pkl']

**displaying the best params**

In [14]:
loaded_cv = joblib.load('models/HalvingGridSearchCV_2_ms3.pkl')
loaded_cv.best_params_

{'class_weight': 'balanced_subsample',
 'max_depth': 151,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'n_estimators': 1098}

**get the best estimator and classification report**

In [15]:
rf_best = loaded_cv.best_estimator_
rf_best.fit(X_train,y_train)

# get the prediction
rfpred = rf_best.predict(X_test)

# print classification report
cr = classification_report(y_test, rfpred)

print(cr)

classification_report_csv(cr,'HalvingGridSearchCV_2_ms3.pkl',accuracy_score(y_test, rfpred))

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    1.3s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    6.5s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   15.5s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:   28.2s
[Parallel(n_jobs=-1)]: Done 1098 out of 1098 | elapsed:   40.2s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    0.3s
[Parallel(n_jobs=8)]: Done 434 tasks      | elapsed:    0.9s
[Parallel(n_jobs=8)]: Done 784 tasks      | elapsed:    1.6s
[Parallel(n_jobs=8)]: Done 1098 out of 1098 | elapsed:    2.3s finished


                         precision    recall  f1-score   support

             functional       0.86      0.71      0.78      6790
functional needs repair       0.26      0.78      0.39       885
         non functional       0.85      0.75      0.80      4799

               accuracy                           0.73     12474
              macro avg       0.66      0.74      0.65     12474
           weighted avg       0.81      0.73      0.76     12474



### Halving Grid Search CV - 3

In [16]:
# parameter grid
pgrid = {    
    'n_estimators':[1100],
    'min_samples_split' : [2,3],
    'min_samples_leaf' : [2],  
    'class_weight': ['balanced','balanced_subsample']
}

# specifying the cv
cv_skf = StratifiedKFold(n_splits=5, random_state=None, shuffle=False)

# specifying the model 
rfgs = BalancedRandomForestClassifier(n_jobs=-1, verbose=1)

# keep track of the date and time
dt_string = datetime.now().strftime("%d/%m/%Y %H:%M:%S")


# specify the grid search cv
cv = HalvingGridSearchCV(
    estimator=rfgs,param_grid=pgrid,cv=cv_skf,n_jobs=-1,verbose=10,scoring='accuracy',random_state=0,
    resource='max_depth',max_resources=500)

# print the date and time
print("date and time =", dt_string)

date and time = 05/09/2021 23:49:22


**execution**

In [17]:
%%time
joblib.dump(cv.fit(X_train,y_train),'models/HalvingGridSearchCV_3_ms3.pkl') # based on halving_grid_search_cv_2_ms2.pkl

n_iterations: 2
n_required_iterations: 2
n_possible_iterations: 2
min_resources_: 166
max_resources_: 500
aggressive_elimination: False
factor: 3
----------
iter: 0
n_candidates: 4
n_resources: 166
Fitting 5 folds for each of 4 candidates, totalling 20 fits

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    1.2s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    6.2s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   15.1s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:   27.7s
[Parallel(n_jobs=-1)]: Done 1100 out of 1100 | elapsed:   39.6s finished


Wall time: 6min 20s


['models/HalvingGridSearchCV_3_ms3.pkl']

**best params**

In [18]:
loaded_cv = joblib.load('models/HalvingGridSearchCV_3_ms3.pkl')
loaded_cv.best_params_

{'class_weight': 'balanced_subsample',
 'min_samples_leaf': 2,
 'min_samples_split': 3,
 'n_estimators': 1100,
 'max_depth': 498}

**classification report**

In [19]:
rf_best = loaded_cv.best_estimator_
rf_best.fit(X_train,y_train)

# get the prediction
rfpred = rf_best.predict(X_test)

# print classification report
cr = classification_report(y_test, rfpred)

print(cr)

classification_report_csv(cr,'HalvingGridSearchCV_3_ms3.pkl',accuracy_score(y_test, rfpred))

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    1.2s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    6.0s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   14.6s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:   27.2s
[Parallel(n_jobs=-1)]: Done 1100 out of 1100 | elapsed:   38.9s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    0.3s
[Parallel(n_jobs=8)]: Done 434 tasks      | elapsed:    0.8s
[Parallel(n_jobs=8)]: Done 784 tasks      | elapsed:    1.4s
[Parallel(n_jobs=8)]: Done 1100 out of 1100 | elapsed:    2.0s finished


                         precision    recall  f1-score   support

             functional       0.85      0.70      0.77      6790
functional needs repair       0.26      0.79      0.39       885
         non functional       0.84      0.73      0.78      4799

               accuracy                           0.72     12474
              macro avg       0.65      0.74      0.65     12474
           weighted avg       0.81      0.72      0.75     12474



### Randomized Grid Search - 1

In [20]:
# parameter grid
pgrid = {    
    'max_depth' : [200,300],    
    'max_features' : ['sqrt'],
    'min_samples_split' : [3,4],
    'min_samples_leaf' : [2,3],
    'criterion' : ['entropy'],
    'class_weight': ['balanced','balanced_subsample']
}

# specifying the cv
cv_skf = StratifiedKFold(n_splits=5, random_state=None, shuffle=False)

# specifying the model 
rfgs = BalancedRandomForestClassifier(n_jobs=-1, verbose=1)

# keep track of the date and time
dt_string = datetime.now().strftime("%d/%m/%Y %H:%M:%S")

# specify the randomized search cv
cv = RandomizedSearchCV(estimator=rfgs,param_distributions=pgrid,cv=cv_skf,n_jobs=-1, 
                        verbose=10, scoring='accuracy',random_state=0)

# print the date and time
print("date and time =", dt_string)

date and time = 05/09/2021 23:56:36


**execution**

In [21]:
%%time
joblib.dump(cv.fit(X_train,y_train),'models/RandomizedSearchCV_ms3.pkl')

Fitting 5 folds for each of 10 candidates, totalling 50 fits

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    1.3s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    3.3s finished


Wall time: 47.5 s


['models/RandomizedSearchCV_ms3.pkl']
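
The log above reports 10 candidates and 50 fits, although the parameter grid contains more combinations than that. `RandomizedSearchCV` draws `n_iter` candidates (10 by default), and when every parameter is given as a list it samples them without replacement. The sketch below imitates that with the standard library only. It does not reproduce scikit-learn's exact draw, and the seed is an arbitrary choice.

```python
import itertools
import random

# same grid as in the cell above
pgrid = {
    'max_depth': [200, 300],
    'max_features': ['sqrt'],
    'min_samples_split': [3, 4],
    'min_samples_leaf': [2, 3],
    'criterion': ['entropy'],
    'class_weight': ['balanced', 'balanced_subsample'],
}

# enumerate every combination in the grid
keys = sorted(pgrid)
all_candidates = [dict(zip(keys, values))
                  for values in itertools.product(*(pgrid[k] for k in keys))]

# draw n_iter distinct candidates (sampling without replacement)
rng = random.Random(0)  # arbitrary seed, for a repeatable illustration
n_iter = 10             # RandomizedSearchCV default
sampled = rng.sample(all_candidates, n_iter)

n_splits = 5
print(len(all_candidates), "possible combinations")
print(len(sampled), "sampled candidates ->", len(sampled) * n_splits, "fits")
```

With only 16 combinations in the grid, the randomized search already covers most of it, so an exhaustive `GridSearchCV` would cost little more here.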

**best params**

In [22]:
loaded_cv = joblib.load('models/RandomizedSearchCV_ms3.pkl')
loaded_cv.best_params_

{'min_samples_split': 4,
 'min_samples_leaf': 2,
 'max_features': 'sqrt',
 'max_depth': 300,
 'criterion': 'entropy',
 'class_weight': 'balanced_subsample'}

**classification report**

In [23]:
rf_best = loaded_cv.best_estimator_
rf_best.fit(X_train,y_train)

# get the prediction
rfpred = rf_best.predict(X_test)

# print classification report
cr = classification_report(y_test, rfpred)

print(cr)

classification_report_csv(cr,'RandomizedSearchCV_ms3.pkl',accuracy_score(y_test, rfpred))

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    0.9s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    2.5s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed:    0.0s finished


                         precision    recall  f1-score   support

             functional       0.85      0.70      0.77      6790
functional needs repair       0.26      0.80      0.39       885
         non functional       0.84      0.73      0.78      4799

               accuracy                           0.72     12474
              macro avg       0.65      0.74      0.65     12474
           weighted avg       0.81      0.72      0.75     12474



### Randomized Grid Search - 2

In [24]:
# parameter grid
pgrid = {    
    'n_estimators': [1000,1100],
    'max_depth' : [300,500],    
    'min_samples_split' : [3],
    'min_samples_leaf' : [1,2],
    'criterion' : ['entropy'],
    'class_weight': ['balanced','balanced_subsample']
}

# specifying the cv
cv_skf = StratifiedKFold(n_splits=5, random_state=None, shuffle=False)

# specifying the model 
rfgs = BalancedRandomForestClassifier(n_jobs=-1, verbose=1)

# keep track of the date and time
dt_string = datetime.now().strftime("%d/%m/%Y %H:%M:%S")

# specify the randomized search cv
cv = RandomizedSearchCV(estimator=rfgs,param_distributions=pgrid,cv=cv_skf,n_jobs=-1,verbose=10,
                        scoring='balanced_accuracy',random_state=0)

# print the date and time
print("date and time =", dt_string)

date and time = 05/09/2021 23:57:29


**execution**

In [25]:
%%time
joblib.dump(cv.fit(X_train,y_train),'models/RandomizedSearchCV_2_ms3.pkl')

Fitting 5 folds for each of 10 candidates, totalling 50 fits

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    1.0s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    4.9s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   11.3s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:   20.6s
[Parallel(n_jobs=-1)]: Done 1100 out of 1100 | elapsed:   29.3s finished


Wall time: 8min 20s


['models/RandomizedSearchCV_2_ms3.pkl']

**best params**

In [26]:
loaded_cv = joblib.load('models/RandomizedSearchCV_2_ms3.pkl')
loaded_cv.best_params_

{'n_estimators': 1100,
 'min_samples_split': 3,
 'min_samples_leaf': 1,
 'max_depth': 500,
 'criterion': 'entropy',
 'class_weight': 'balanced_subsample'}

**classification report**

In [27]:
rf_best = loaded_cv.best_estimator_
rf_best.fit(X_train,y_train)

# get the prediction
rfpred = rf_best.predict(X_test)

# print classification report
cr = classification_report(y_test, rfpred)

print(cr)

classification_report_csv(cr,'RandomizedSearchCV_2_ms3.pkl',accuracy_score(y_test, rfpred))

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    1.3s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    6.3s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   15.1s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:   27.8s
[Parallel(n_jobs=-1)]: Done 1100 out of 1100 | elapsed:   39.4s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    0.3s
[Parallel(n_jobs=8)]: Done 434 tasks      | elapsed:    0.8s
[Parallel(n_jobs=8)]: Done 784 tasks      | elapsed:    1.5s
[Parallel(n_jobs=8)]: Done 1100 out of 1100 | elapsed:    2.1s finished


                         precision    recall  f1-score   support

             functional       0.86      0.70      0.77      6790
functional needs repair       0.26      0.79      0.39       885
         non functional       0.85      0.74      0.79      4799

               accuracy                           0.73     12474
              macro avg       0.66      0.75      0.65     12474
           weighted avg       0.81      0.73      0.75     12474



### Grid Search - 1

In [28]:
# parameter grid
pgrid = {
    'n_estimators'      : [1001,1100],
    'bootstrap'         : [True],
    'criterion'         : ['gini','entropy'],
    'max_depth'         : [300],        
    'min_samples_split' : [2,3],
    'min_samples_leaf'  : [1,2],
    'class_weight': ['balanced','balanced_subsample']
}

# specifying the cv
cv_ss = StratifiedShuffleSplit(n_splits=3, train_size=0.75, test_size=.25,random_state=0)

# specifying the model 
rfgs = BalancedRandomForestClassifier(n_jobs=-1, verbose=1)

# keep track of the date and time
dt_string = datetime.now().strftime("%d/%m/%Y %H:%M:%S")

# specify the grid search cv
cv = GridSearchCV(estimator=rfgs, param_grid=pgrid, cv=cv_ss, n_jobs=-1, verbose=10, scoring='accuracy')

# print the date and time
print("date and time =", dt_string)

date and time = 06/09/2021 00:06:42


**execution**

In [29]:
%%time
joblib.dump(cv.fit(X_train,y_train),'models/GridSearchCV_ms3.pkl')

Fitting 3 folds for each of 32 candidates, totalling 96 fits

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    1.3s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    6.4s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   15.6s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:   28.6s
[Parallel(n_jobs=-1)]: Done 1100 out of 1100 | elapsed:   40.6s finished


Wall time: 14min 21s


['models/GridSearchCV_ms3.pkl']
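
Unlike the randomized search, `GridSearchCV` fits every combination in the grid, so the candidate count is the product of the list lengths. A quick check of the "32 candidates, totalling 96 fits" line in the log above:

```python
import math

# same grid as in the cell above
pgrid = {
    'n_estimators': [1001, 1100],
    'bootstrap': [True],
    'criterion': ['gini', 'entropy'],
    'max_depth': [300],
    'min_samples_split': [2, 3],
    'min_samples_leaf': [1, 2],
    'class_weight': ['balanced', 'balanced_subsample'],
}
n_splits = 3  # StratifiedShuffleSplit(n_splits=3, ...)

# exhaustive search: one candidate per combination of parameter values
n_candidates = math.prod(len(values) for values in pgrid.values())
n_fits = n_candidates * n_splits

print(f"{n_candidates} candidates x {n_splits} splits = {n_fits} fits")
```

Each of those fits trains 1001 or 1100 trees, which is consistent with the 14-minute wall time reported above.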

**best params**

In [30]:
loaded_cv = joblib.load('models/GridSearchCV_ms3.pkl')
loaded_cv.best_params_

{'bootstrap': True,
 'class_weight': 'balanced_subsample',
 'criterion': 'gini',
 'max_depth': 300,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'n_estimators': 1100}

**classification report**

In [31]:
rf_best = loaded_cv.best_estimator_
rf_best.fit(X_train,y_train)

# get the prediction
rfpred = rf_best.predict(X_test)

# print classification report
cr = classification_report(y_test, rfpred)

print(cr)

classification_report_csv(cr,'GridSearchCV_ms3.pkl',accuracy_score(y_test, rfpred))

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    1.2s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    6.1s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   14.6s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:   27.6s
[Parallel(n_jobs=-1)]: Done 1100 out of 1100 | elapsed:   39.2s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    0.3s
[Parallel(n_jobs=8)]: Done 434 tasks      | elapsed:    0.9s
[Parallel(n_jobs=8)]: Done 784 tasks      | elapsed:    1.6s
[Parallel(n_jobs=8)]: Done 1100 out of 1100 | elapsed:    2.3s finished


                         precision    recall  f1-score   support

             functional       0.86      0.71      0.78      6790
functional needs repair       0.26      0.78      0.39       885
         non functional       0.85      0.75      0.79      4799

               accuracy                           0.73     12474
              macro avg       0.66      0.74      0.65     12474
           weighted avg       0.81      0.73      0.76     12474



### Grid Search - 2

In [32]:
# parameter grid 
pgrid = {
    'n_estimators' : [1000,1100],
    'max_depth' : [None,151],        
    'min_samples_split' : [2,3,6],
    'min_samples_leaf' : [1],
    'bootstrap': [False],
    'class_weight': ['balanced','balanced_subsample']
}

# specifying the cv
cv_skf = StratifiedKFold(n_splits=3, random_state=None, shuffle=False) # changed to 3 splits based on results

# specifying the model 
rfgs = BalancedRandomForestClassifier(n_jobs=-1, verbose=1)

# keep track of the date and time
dt_string = datetime.now().strftime("%d/%m/%Y %H:%M:%S")

# specify the grid search cv
cv = GridSearchCV(estimator=rfgs, param_grid=pgrid, cv=cv_skf, n_jobs=-1, verbose=10, scoring='balanced_accuracy')

# print the date and time
print("date and time =", dt_string)

date and time = 06/09/2021 00:22:00


**execution**

In [33]:
%%time
joblib.dump(cv.fit(X_train,y_train),'models/GridSearchCV_2_ms3.pkl')

Fitting 3 folds for each of 24 candidates, totalling 72 fits

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    1.1s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    5.3s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   12.4s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:   22.6s
[Parallel(n_jobs=-1)]: Done 1100 out of 1100 | elapsed:   32.1s finished


Wall time: 11min 22s


['models/GridSearchCV_2_ms3.pkl']

**best params**

In [34]:
loaded_cv = joblib.load('models/GridSearchCV_2_ms3.pkl')
loaded_cv.best_params_

{'bootstrap': False,
 'class_weight': 'balanced',
 'max_depth': 151,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'n_estimators': 1100}

**classification report**

In [35]:
rf_best = loaded_cv.best_estimator_
rf_best.fit(X_train,y_train)

# get the prediction
rfpred = rf_best.predict(X_test)

# print classification report
cr = classification_report(y_test, rfpred)

print(cr)

classification_report_csv(cr,'GridSearchCV_2_ms3.pkl',accuracy_score(y_test, rfpred))

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    1.0s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    4.7s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   11.3s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:   20.6s
[Parallel(n_jobs=-1)]: Done 1100 out of 1100 | elapsed:   29.3s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    0.3s
[Parallel(n_jobs=8)]: Done 434 tasks      | elapsed:    0.7s
[Parallel(n_jobs=8)]: Done 784 tasks      | elapsed:    1.3s
[Parallel(n_jobs=8)]: Done 1100 out of 1100 | elapsed:    1.9s finished


                         precision    recall  f1-score   support

             functional       0.86      0.71      0.78      6790
functional needs repair       0.26      0.75      0.39       885
         non functional       0.85      0.77      0.80      4799

               accuracy                           0.74     12474
              macro avg       0.66      0.74      0.66     12474
           weighted avg       0.81      0.74      0.76     12474



# Showing the Results - Recall for the smallest class by frequency

In [37]:
df_cr = pd.read_csv('classification_reports/classification_report.csv')
df_cr[df_cr['class']=='functional needs repair']\
[['classifier','recall']].drop_duplicates().sort_values(by='recall', ascending=False).reset_index(drop=True)

|   | classifier | recall |
|---|---|---|
| 0 | RandomizedSearchCV_ms3.pkl | 0.8 |
| 1 | RandomizedSearchCV_2_ms3.pkl | 0.79 |
| 2 | HalvingGridSearchCV_3_ms3.pkl | 0.79 |
| 3 | GridSearchCV_ms3.pkl | 0.78 |
| 4 | HalvingGridSearchCV_2_ms3.pkl | 0.78 |
| 5 | HalvingGridSearchCV_ms3.pkl | 0.78 |
| 6 | GridSearchCV_2_ms3.pkl | 0.75 |
| 7 | GridSearchCV_2_ms2.pkl | 0.37 |
| 8 | RandomizedSearchCV_2.pkl | 0.37 |
| 9 | GridSearchCV_2.pkl | 0.37 |


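This is clear evidence that the balanced models from this round (the `_ms3` rows) treat the weaker class much better: recall on `functional needs repair` is 0.75 to 0.80, compared with 0.37 for the other classifiers in the report file. The trade-off is visible in the reports above, where precision for that class is 0.26 and overall accuracy is 0.72 to 0.74.

We are going to use the balanced approach for the submissions.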