Assignment Specs

You should compare AdaBoost to at least one of the following: a bagging model, a stacking model.
Based on the visualizations seen at the links above, you're probably also thinking that this classification task should not be that difficult. So, a secondary goal of this assignment is to test the effects of the AdaBoost function arguments on the algorithm's performance.
You should explore at least 3 different sets of settings for the function inputs, and you should do your best to find values for these inputs that actually change the results of your modelling. That is, try not to run three different sets of inputs that result in the same performance. The goal here is for you to better understand how to set these input values yourself in the future. Comment on what you discover about these inputs and how they behave.
Your submission should be built and written with non-experts as the target audience. All of your code should still be included, but do your best to narrate your work in accessible ways.

In [1]:
# load in data
import pandas as pd
penguins = pd.read_csv(r"C:\Users\achur\OneDrive\Desktop\School\CP Spring 2024\545\GSB545\Labs\penguins_size.csv")

## Bagging Model

In [24]:
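# Imports for the bagging model, the train/test split, and scoring
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

# Drop rows with missing values
penguins = penguins.dropna()

# Encode every column as integers. Text columns need this; for numeric
# columns it replaces each measurement with its rank among the unique values.
for label in penguins.columns:
    penguins[label] = LabelEncoder().fit_transform(penguins[label])

# Features are everything except the species; the species is the target
X = penguins.drop(['species'], axis=1)
Y = penguins['species']

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Bagging: 100 depth-3 trees, each trained on a bootstrap sample of the training rows
base_tree = DecisionTreeClassifier(criterion='entropy', max_depth=3)
bagging = BaggingClassifier(estimator=base_tree, n_estimators=100, random_state=42)

bagging.fit(X_train, y_train)

# Accuracy on the held-out test set
y_pred = bagging.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Bagging Test Accuracy:", round(accuracy * 100, 2), "%")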


Bagging Test Accuracy: 98.51 %


## AdaBoost

In [20]:
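# first adaboost - 400 estimators, learning rate of 1
from sklearn.ensemble import AdaBoostClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Encode every column as integers (has no further effect if the columns are already encoded)
for label in penguins.columns:
    penguins[label] = LabelEncoder().fit_transform(penguins[label])

# Define the features and target before splitting
X = penguins.drop(['species'], axis=1)
Y = penguins['species']

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=2)

# NOTE: this tree is defined but never passed to AdaBoostClassifier below,
# so the ensemble uses its default base learner (a depth-1 decision tree).
model = DecisionTreeClassifier(criterion='entropy', max_depth=1)

AdaBoost = AdaBoostClassifier(n_estimators=400, learning_rate=1, algorithm='SAMME')

AdaBoost.fit(X_train, y_train)

# score() returns the accuracy on the held-out test set
accuracy = AdaBoost.score(X_test, y_test)

print('The accuracy is: ', accuracy*100, '%')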



The accuracy is:  98.50746268656717 %
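To give a feel for what `learning_rate` does, here is a minimal sketch of how the SAMME variant of AdaBoost turns a weak learner's error rate into the weight of its vote. The formula is my understanding of the standard SAMME update scaled by the learning rate; the function name `samme_vote_weight` and the 20% error rate are illustrative assumptions, not values taken from the fitted model above.

```python
import math

def samme_vote_weight(error_rate, n_classes, learning_rate):
    """Vote weight of one weak learner under the SAMME update (sketch).

    error_rate is the learner's weighted error on the training rows,
    and must be strictly between 0 and 1.
    """
    return learning_rate * (
        math.log((1 - error_rate) / error_rate) + math.log(n_classes - 1)
    )

# Hypothetical stump that is wrong 20% of the time on a 3-class problem
# (the penguins data has three species).
for lr in (1.0, 0.4):
    print(lr, round(samme_vote_weight(0.2, 3, lr), 3))
# prints:
# 1.0 2.079
# 0.4 0.832
```

A smaller learning rate shrinks every vote, and with it the amount by which misclassified rows are up-weighted for the next round, so each round changes the ensemble less. That is why a lower `learning_rate` is usually paired with a larger `n_estimators`.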


In [26]:
# second adaboost - more estimators, lower learning rate, deeper base tree, 50/50 split
from sklearn.model_selection import train_test_split

# Define the features and target before splitting
X = penguins.drop(['species'], axis=1)
Y = penguins['species']

# Hold out half of the rows for testing, with a different random seed
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.5, random_state=6)

# Depth-2 trees as the weak learners
model = DecisionTreeClassifier(criterion='entropy', max_depth=2)
AdaBoost = AdaBoostClassifier(estimator=model, n_estimators=500, learning_rate=0.4, algorithm='SAMME')
AdaBoost.fit(X_train, y_train)

# score() returns the accuracy on the held-out test set
accuracy = AdaBoost.score(X_test, y_test)

print('The accuracy is: ', accuracy*100, '%')



The accuracy is:  97.60479041916167 %


In [21]:
# third adaboost - same settings as the second, scored with cross validation
from sklearn.model_selection import cross_val_score

X = penguins.drop(['species'], axis=1)
Y = penguins['species']

model = DecisionTreeClassifier(criterion='entropy', max_depth=2)
AdaBoost = AdaBoostClassifier(estimator=model, n_estimators=500, learning_rate=0.4, algorithm='SAMME')

# cross_val_score makes its own five train/test splits of the full data,
# so no separate train_test_split is needed here
scores = cross_val_score(AdaBoost, X, Y, cv=5)
print("Cross-validated accuracy:", round(scores.mean() * 100, 2), '%')



Cross-validated accuracy: 98.8 %


All of the models scored within about one percentage point of each other. The bagging model (100 trees of depth 3) reached a test accuracy of 98.51%. The first AdaBoost model, with 400 estimators and a learning rate of 1, also reached 98.51%, which corresponds to 66 of the 67 test penguins being classified correctly. For the second AdaBoost model, I raised the number of estimators to 500, lowered the learning rate to 0.4, and used depth-2 base trees, but I also changed the test size from 0.2 to 0.5 and changed the seed; it scored 97.60%. Because the split changed along with the settings, that small drop cannot be attributed to the AdaBoost inputs alone. For the third AdaBoost model, I kept the second model's settings but scored it with 5-fold cross validation and got a mean accuracy of 98.8%.