# EXPLORING BAGGING TECHNIQUES

#### We explore bagging techniques such as the Bagging Classifier and the Random Forest Classifier in this notebook.

**BAGGING CLASSIFIER:** A Bagging classifier is an ensemble meta-estimator that fits each base classifier on a random subset of the original dataset and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator is typically used to reduce the variance of a black-box estimator (e.g., a decision tree) by introducing randomization into its construction procedure and then building an ensemble out of it.
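The fit-on-random-subsets-then-aggregate idea can be sketched by hand. This is only an illustration under stated assumptions: the data is synthetic (`make_classification`, not the survey data used in this notebook), the names `X_demo`, `y_demo`, and `trees` are hypothetical, and predictions are aggregated by averaging class probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the notebook's survey data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

rng = np.random.default_rng(0)
n_estimators = 10
trees = []
for _ in range(n_estimators):
    # Bootstrap sample: draw as many row indices as there are rows, with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_demo[idx], y_demo[idx]))

# Aggregate by averaging the per-tree class probabilities, then pick the most likely class
avg_proba = np.mean([tree.predict_proba(X_demo) for tree in trees], axis=0)
y_pred = avg_proba.argmax(axis=1)

print(f"Training accuracy of the hand-built ensemble: {(y_pred == y_demo).mean():.3f}")
```

`BaggingClassifier` wraps this loop and adds options such as sampling without replacement and feature subsampling.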

This algorithm encompasses several works from the literature. When random subsets of the dataset are drawn as random subsets of the samples (without replacement), the algorithm is known as [Pasting](https://link.springer.com/article/10.1023/A:1007563306331). If samples are drawn with replacement, the method is known as [Bagging](https://link.springer.com/article/10.1007/BF00058655). When random subsets of the dataset are drawn as random subsets of the features, the method is known as [Random Subspaces](https://ieeexplore.ieee.org/document/709601). Finally, when base estimators are built on subsets of both samples and features, the method is known as [Random Patches](https://link.springer.com/chapter/10.1007/978-3-642-33460-3_28).
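As a rough guide, the four variants map onto `BaggingClassifier` parameters as sketched here. The specific fractions (0.5) and the synthetic data are assumptions chosen for illustration, and the Random Patches setting shown here samples rows without replacement, which is one of several possible choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Synthetic stand-in data (assumption: 200 rows, 8 features)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

variants = {
    # Subsets of the samples, drawn without replacement
    "Pasting": dict(max_samples=0.5, bootstrap=False),
    # Subsets of the samples, drawn with replacement
    "Bagging": dict(max_samples=0.5, bootstrap=True),
    # All samples, random subsets of the features
    "Random Subspaces": dict(max_samples=1.0, bootstrap=False, max_features=0.5),
    # Random subsets of both samples and features
    "Random Patches": dict(max_samples=0.5, bootstrap=False, max_features=0.5),
}

fitted = {
    name: BaggingClassifier(n_estimators=5, random_state=0, **params).fit(X_demo, y_demo)
    for name, params in variants.items()
}

for name, model in fitted.items():
    n_rows = len(model.estimators_samples_[0])
    n_cols = len(model.estimators_features_[0])
    print(f"{name}: first estimator sees {n_rows} rows and {n_cols} features")
```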

**RANDOM FOREST:** The random forest algorithm extends the bagging method: it combines bagging with feature randomness to build a forest of largely uncorrelated decision trees. Feature randomness, also known as feature bagging or [the random subspace method](https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf), restricts each split to a random subset of the features, which keeps the correlation among the trees low. This is a key difference between a single decision tree and a random forest: a decision tree considers every feature when searching for the best split, whereas each tree in a random forest considers only a random subset of the features at each split.
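The difference in how many features are considered per split can be inspected directly in scikit-learn. A small sketch on synthetic data (an assumption, with 16 features so that the square root is a whole number):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 16 features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=16, n_informative=6, random_state=0
)

single_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
forest = RandomForestClassifier(
    n_estimators=25, max_features="sqrt", random_state=0
).fit(X_demo, y_demo)

# max_features_ is the number of features examined at each split
print(single_tree.max_features_)            # 16
print(forest.estimators_[0].max_features_)  # 4
```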

### IMPORTING PREPROCESSED DATA

In [1]:
%run Selected_Questions_Combined.ipynb
dataframes = preprocessed_data()
salary_data = dataframes["all_questions_dataframe"]
salary_data_as_num = dataframes["selected_numeric_questions"]
salary_data_selected_questions = dataframes["selected_questions_dataframe"]

### FURTHER DATA PREPROCESSING

#### Features and Target

In [2]:
null_indices = salary_data[salary_data['q24'].isnull()].index

In [3]:
y = salary_data['q24'].dropna()
X = salary_data_as_num.drop(index=null_indices)
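Dropping the rows with a missing target by index keeps the features and the target aligned. A minimal sketch on a hypothetical four-row frame (the column `q1` and the frame itself are made up for illustration; only `q24` comes from this notebook):

```python
import pandas as pd

# Hypothetical miniature of the survey data: q24 is the target, q1 a feature
df_small = pd.DataFrame({"q1": [1, 2, 3, 4], "q24": ["a", None, "b", None]})
features_small = df_small[["q1"]]

# Index labels of the rows whose target is missing
null_idx = df_small[df_small["q24"].isnull()].index

y_small = df_small["q24"].dropna()
X_small = features_small.drop(index=null_idx)

# Both objects now share exactly the same index
print(X_small.index.equals(y_small.index))  # True
```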

In [4]:
# Train-test split

from sklearn.model_selection import train_test_split

X_dev, X_test, y_dev, y_test = train_test_split(X, y, random_state=42, test_size=0.2, stratify = y)
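The effect of `stratify` can be checked on a toy target. In this sketch the class counts (80/15/5) are an assumption chosen so the proportions split evenly:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy target: 80, 15, and 5 examples of classes 0, 1, and 2
y_toy = np.array([0] * 80 + [1] * 15 + [2] * 5)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_a, X_b, y_a, y_b = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Each class keeps its 80/15/5 share in both parts
print(np.bincount(y_a))  # [64 12  4]
print(np.bincount(y_b))  # [16  3  1]
```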


In [5]:
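# Label-encode the target. The encoded arrays are kept in separate variables;
# the models below are still fit on the original labels in y_dev and y_test.

from sklearn.preprocessing import LabelEncoder

l_enc = LabelEncoder()
y_dev_enc = l_enc.fit_transform(y_dev)   # learn the label-to-integer mapping on the development set
y_test_enc = l_enc.transform(y_test)     # reuse the same mapping for the test set
y_test_enc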

array([ 0, 22,  1, ...,  5, 23,  2])
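For reference, this is how `LabelEncoder` maps labels to integers and back. The label strings are hypothetical and are not the actual `q24` categories:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels
train_labels = ["low", "high", "medium", "low"]
test_labels = ["medium", "low"]

enc = LabelEncoder()
train_codes = enc.fit_transform(train_labels)  # classes are sorted alphabetically
test_codes = enc.transform(test_labels)        # same mapping applied to unseen rows

print(list(enc.classes_))                      # ['high', 'low', 'medium']
print(train_codes)                             # [1 0 2 1]
print(list(enc.inverse_transform(test_codes))) # ['medium', 'low']
```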

In [6]:
from sklearn.metrics import roc_auc_score

## BAGGING CLASSIFIER

In [7]:
from sklearn.ensemble import BaggingClassifier

### BASELINE MODEL

#### Training

In [8]:
bgc = BaggingClassifier(random_state = 84)
bgc.fit(X_dev, y_dev)

#### Evaluation

In [9]:
print(f"The ROC-AUC score for this model is: {roc_auc_score(y_test, bgc.predict_proba(X_test), average='weighted', multi_class='ovr'):.4f}")


The ROC-AUC score for this model is: 0.6430
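The score above is a one-vs-rest ROC-AUC averaged over classes with weights proportional to class frequency. The sketch below reproduces that averaging by hand on synthetic three-class data (an assumption; it is not the survey data) and compares it with the library value:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic three-class problem
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)
proba = BaggingClassifier(random_state=0).fit(X_tr, y_tr).predict_proba(X_te)

# One binary AUC per class (that class vs. the rest), weighted by class support
classes = np.unique(y_te)
per_class_auc = [roc_auc_score(y_te == k, proba[:, i]) for i, k in enumerate(classes)]
support = [np.sum(y_te == k) for k in classes]
manual = np.average(per_class_auc, weights=support)

library = roc_auc_score(y_te, proba, average="weighted", multi_class="ovr")
print(f"manual: {manual:.4f}, library: {library:.4f}")
```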


### HYPERPARAMETER OPTIMIZATION USING GRID SEARCH


In [10]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold

#### Preparing the hyperparameter space and performing GridSearch CV

In [11]:
'''
Input:

n_estimators = [i for i in range(1, 50)]  # n_estimators must be at least 1
sfold = StratifiedKFold(n_splits=10)

grid_search = GridSearchCV(estimator = bgc, param_grid = {'n_estimators': n_estimators}, scoring='roc_auc_ovr', cv=sfold, n_jobs=-1)
grid_search.fit(X_dev, y_dev)
grid_search.best_params_
'''

'''
Output:

{'n_estimators': 49}

'''
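Because the search above is stored as a string rather than executed, here is a scaled-down version that runs in a few seconds. It is a sketch on synthetic data with a deliberately small grid and three folds (both assumptions made for speed), not a reproduction of the search reported above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic three-class stand-in data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=6, n_classes=3, random_state=0
)

param_grid = {"n_estimators": [5, 10, 20]}  # every value must be >= 1
cv = StratifiedKFold(n_splits=3)

search = GridSearchCV(
    estimator=BaggingClassifier(random_state=0),
    param_grid=param_grid,
    scoring="roc_auc_ovr",
    cv=cv,
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 4))
```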

#### Re-training Bagging Classifier using the best parameters obtained

In [12]:
bgc_best = BaggingClassifier(random_state = 84, n_estimators = 49)
bgc_best.fit(X_dev, y_dev)

#### Evaluation

In [13]:
bagging_roc = roc_auc_score(y_test, bgc_best.predict_proba(X_test), average='weighted', multi_class='ovr')
print(f"The ROC-AUC score for this model is: {bagging_roc}")

The ROC-AUC score for this model is: 0.7073646465306128


## RANDOM FOREST

In [14]:
from sklearn.ensemble import RandomForestClassifier

### BASELINE MODEL

#### Training

In [15]:
rf = RandomForestClassifier(random_state = 84)
rf.fit(X_dev, y_dev)

#### Evaluation

In [16]:
print(f"The ROC-AUC score for this model is: {roc_auc_score(y_test, rf.predict_proba(X_test), average='weighted', multi_class='ovr'):.4f}")


The ROC-AUC score for this model is: 0.7029


### RANDOMIZED HYPERPARAMETER OPTIMIZATION

In [17]:
from sklearn.model_selection import RandomizedSearchCV

#### Preparing the hyperparameter space

In [18]:
'''
import numpy as np

n_estimators = [np.random.randint(75,200) for i in range(50)]
criterion = ["gini", "entropy", "log_loss"]
max_depth = [np.random.randint(1,200) for i in range(50)]  # max_depth must be at least 1

param_distributions = {"n_estimators": n_estimators, "criterion": criterion, "max_depth": max_depth}
'''

#### Performing RandomizedSearch CV

In [19]:
'''
Input:

sfold = StratifiedKFold(n_splits=10)

rm_search = RandomizedSearchCV(estimator = rf, param_distributions = param_distributions, scoring='roc_auc_ovr', cv=sfold, n_jobs=-1, n_iter=20)
rm_search.fit(X_dev, y_dev)
print(rm_search.best_params_)

'''

'''
Output : 

{'n_estimators': 115,
 'max_depth': 13,
 'criterion': 'log_loss'}
 
'''

200 fits
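An alternative to pre-sampling lists of values is to pass distributions and let `RandomizedSearchCV` draw from them. The sketch below uses `scipy.stats.randint`; the ranges, the synthetic data, the five iterations, and the three folds are all assumptions chosen to keep it fast, and `"log_loss"` is left out of the criteria so the example also runs on older scikit-learn versions.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic three-class stand-in data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=6, n_classes=3, random_state=0
)

param_distributions = {
    "n_estimators": randint(20, 60),  # integers in [20, 60)
    "max_depth": randint(1, 20),      # integers in [1, 20), so never 0
    "criterion": ["gini", "entropy"],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    scoring="roc_auc_ovr",
    cv=StratifiedKFold(n_splits=3),
    random_state=0,
)
search.fit(X_demo, y_demo)

# 5 sampled candidates times 3 folds gives 15 fits
print(search.best_params_)
```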

#### Re-training Random Forest using the best parameters obtained

In [20]:
rf_new = RandomForestClassifier(random_state = 84, n_estimators = 115, criterion = 'log_loss', max_depth = 13)
rf_new.fit(X_dev, y_dev)

#### Evaluation

In [21]:
print(f"The ROC-AUC score for this model is: {roc_auc_score(y_test, rf_new.predict_proba(X_test), average='weighted', multi_class='ovr'):.4f}")


The ROC-AUC score for this model is: 0.7185


### HYPERPARAMETER OPTIMIZATION USING GRID SEARCH
##### (In the neighborhood of the values obtained using Randomized Search)

#### Preparing the hyperparameter space

In [22]:
'''
n_estimators = [i for i in range(109,119)]
criterion = ["gini", "entropy", "log_loss"]
max_depth = [i for i in range(8,18)]

param_grid = {"n_estimators": n_estimators, "criterion": criterion, "max_depth": max_depth}
'''

#### Performing GridSearch CV

In [23]:
'''
Input:

sfold = StratifiedKFold(n_splits=10)

grid_search = GridSearchCV(estimator = rf_new, param_grid = param_grid, scoring='roc_auc_ovr', cv = sfold, n_jobs=-1)
grid_search.fit(X_dev, y_dev)
print(grid_search.best_params_)

'''

'''
Output:

{'criterion': 'entropy',
 'max_depth': 8,
 'n_estimators': 118}
 
'''

3000 fits
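The number of fits in a grid search is the number of parameter combinations times the number of folds. For the grid defined above with 10-fold cross-validation, this can be checked without running any model:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
param_grid = {
    "n_estimators": list(range(109, 119)),
    "criterion": ["gini", "entropy", "log_loss"],
    "max_depth": list(range(8, 18)),
}

n_candidates = len(ParameterGrid(param_grid))  # 10 * 3 * 10 combinations
n_splits = 10
n_fits = n_candidates * n_splits

print(n_candidates, n_fits)  # 300 3000
```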

#### Re-training Random Forest using the best parameters obtained

In [24]:
rf_best = RandomForestClassifier(random_state = 84, n_estimators = 118, criterion = 'entropy', max_depth = 8)
rf_best.fit(X_dev, y_dev)

#### Evaluation

In [25]:
rf_roc = roc_auc_score(y_test, rf_best.predict_proba(X_test), average='weighted', multi_class='ovr')
print(f"The ROC-AUC score for this model is: {rf_roc}")

The ROC-AUC score for this model is: 0.7277023339438273


In [26]:
def BaggingOT_ROC_scores():
    # Test-set ROC-AUC of the tuned models, keyed by model name
    return {"Bagging Classifier": bagging_roc,
            "Random Forest Classifier": rf_roc}

### COMMENTS:

The dataset, with 10,000 non-null observations, is challenging because of its high dimensionality. Notably, the bagging techniques explored here (the Bagging Classifier and Random Forest) underperform the simpler Logistic Regression, possibly because random forest methods are sensitive to noise in the data. The stronger performance of Logistic Regression suggests that the relationships between the features and the target variable are largely linear. Another possible factor is that the features differ in importance: Logistic Regression assigns each feature its own weight, so it can capture varying degrees of influence directly when the relationships are linear.
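One way to probe this kind of comparison is to cross-validate both model families on the same data with the same metric. The sketch below does so on synthetic data (an assumption; the Logistic Regression results referred to above are not reproduced here), so its scores say nothing about the survey data itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic three-class data with many uninformative features
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=5,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean one-vs-rest ROC-AUC over 5 stratified folds for each model
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, scoring="roc_auc_ovr", cv=5).mean()
    for name, model in models.items()
}

for name, score in cv_scores.items():
    print(f"{name}: {score:.4f}")
```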