You should compare XGBoost or Gradient Boosting to the results of your previous AdaBoost activity.
Based on the visualizations seen at the links above you're probably also thinking that this classification task should not be that difficult. So, a secondary goal of this assignment is to test the effects of the XGBoost (or Gradient Boosting) function arguments on the algorithm's performance. 
You should explore at least 3 different sets of settings for the function inputs, and you should do your best to find values for these inputs that actually change the results of your modelling. That is, try not to run three different sets of inputs that result in the same performance. The goal here is for you to better understand how to set these input values yourself in the future. Comment on what you discover about these inputs and how they behave.
Your submission should be built and written with non-experts as the target audience. All of your code should still be included, but do your best to narrate your work in accessible ways.

In [3]:
# load in data
import pandas as pd
penguins = pd.read_csv(r"C:\Users\achur\OneDrive\Desktop\School\CP Spring 2024\545\GSB545\Labs\penguins_size.csv")

In [4]:
pip install xgboost

Collecting xgboost
Note: you may need to restart the kernel to use updated packages.

  Downloading xgboost-3.0.0-py3-none-win_amd64.whl.metadata (2.1 kB)
Downloading xgboost-3.0.0-py3-none-win_amd64.whl (150.0 MB)




In [17]:
# first xgboost
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder

# Drop missing values
penguins = penguins.dropna()

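# Encode every column as integer codes. Note that LabelEncoder also recodes the
# numeric columns, replacing each measurement with the rank of its distinct value.
for label in penguins.columns:
    penguins[label] = LabelEncoder().fit_transform(penguins[label])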

# Split features and target
X = penguins.drop(['species'], axis=1)
Y = penguins['species']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state = 4)

# XGBoost model
xgb_model = XGBClassifier(use_label_encoder=False, eval_metric='mlogloss', random_state=7)

# Fit model
xgb_model.fit(X_train, y_train)

# Make predictions
y_pred = xgb_model.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print("XGBoost Test Accuracy:", round(accuracy * 100, 2), "%")


XGBoost Test Accuracy: 97.01 %


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
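The encoding loop above runs LabelEncoder over every column, including the numeric measurements. LabelEncoder assigns each distinct value its position in sorted order, so the small pure-Python sketch below mimics that mapping to show what happens to a numeric column. The helper name `label_encode` and the sample values are hypothetical and are not taken from the penguins file.

```python
def label_encode(values):
    """Mimic LabelEncoder: map each distinct value to its index in sorted order."""
    classes = sorted(set(values))
    lookup = {value: code for code, value in enumerate(classes)}
    return [lookup[value] for value in values]

# Hypothetical values for illustration only
flipper_length_mm = [181.0, 186.0, 195.0, 230.0, 181.0]
species = ["Adelie", "Gentoo", "Adelie", "Chinstrap", "Gentoo"]

encoded_flipper = label_encode(flipper_length_mm)
encoded_species = label_encode(species)

print(encoded_flipper)  # [0, 1, 2, 3, 0]
print(encoded_species)  # [0, 2, 0, 1, 2]
```

The numeric column keeps its ordering but loses its spacing: the jump from 195 to 230 becomes a single step from 2 to 3. Tree-based models split on order rather than distance, so this is unlikely to hurt XGBoost much, but it would matter for a model that is sensitive to scale.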


In [7]:
# second xgboost: shallower trees, lower learning rate, 200 trees, 80% row and column subsampling
xgb2 = XGBClassifier(max_depth=2, learning_rate=0.1, n_estimators=200, subsample=0.8,
                     colsample_bytree=0.8, eval_metric='mlogloss', random_state=42)
xgb2.fit(X_train, y_train)
acc2 = accuracy_score(y_test, xgb2.predict(X_test))
print("Set 2 Accuracy:", round(acc2 * 100, 2), "%")


Set 2 Accuracy: 98.51 %


In [38]:
# third xgboost: one-split trees (max_depth=1), 1000 trees, 60% row and column subsampling
xgb3 = XGBClassifier(
    max_depth=1,
    learning_rate=0.5,
    n_estimators=1000,
    subsample=0.6,
    colsample_bytree=0.6,
    eval_metric='mlogloss',
    random_state=15
)

xgb3.fit(X_train, y_train)
acc3 = accuracy_score(y_test, xgb3.predict(X_test))
print("Set 3 Accuracy:", round(acc3 * 100, 2), "%")


Set 3 Accuracy: 100.0 %
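The three models above were each scored on a single train/test split, and each used a different random_state. A steadier way to compare settings is to hold the seed fixed and average the accuracy over several cross-validation folds. The sketch below does this with scikit-learn's GradientBoostingClassifier (the assignment allows either library) on synthetic stand-in data. The data, the setting names, and the scaled-down n_estimators values are assumptions chosen so the example runs quickly; they are not the values or results from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 3-class stand-in data (an assumption; swap in the penguin X and Y)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)

# Hypothetical settings, loosely mirroring the three runs above
settings = {
    "defaults": {},
    "shallow, slow learner": {
        "max_depth": 2, "learning_rate": 0.1, "n_estimators": 100, "subsample": 0.8,
    },
    "stumps, fast learner": {
        "max_depth": 1, "learning_rate": 0.5, "n_estimators": 200, "subsample": 0.6,
    },
}

results = {}
for name, params in settings.items():
    # Same seed for every model, so only the settings differ
    model = GradientBoostingClassifier(random_state=0, **params)
    # Mean accuracy over 5 folds instead of one train/test split
    results[name] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

for name, score in results.items():
    print(f"{name}: {score:.3f}")
```

Averaging over folds makes it easier to tell whether a change in a setting really moves performance or whether the difference comes from which rows happened to land in the test set.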


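All three XGBoost models scored within three percentage points of each other on the test set. The first model used XGBoost's default settings and reached 97.01% accuracy. For the second model, I lowered max_depth to 2 and learning_rate to 0.1, raised n_estimators to 200, and had each tree train on 80% of the rows and columns (subsample and colsample_bytree), which gave 98.51%. For the third model, I lowered max_depth to 1, raised learning_rate to 0.5 and n_estimators to 1000, and cut subsample and colsample_bytree to 0.6, which gave 100%. In plain terms, many very simple trees did at least as well here as fewer, more complex ones. Each model also used a different random_state, so part of the difference may come from random subsampling rather than from the settings alone.

For comparison, my three AdaBoost models produced accuracies of 98.057%, 97.60%, and 98.8%. XGBoost averaged slightly higher (about 98.5% versus about 98.2%), but the gap is small enough that both methods handle this task well.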
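Because the XGBoost accuracies are so close, it helps to translate them into counts of misclassified penguins. The sketch below searches for the smallest test-set size at which every reported percentage corresponds to a whole number of correct predictions. The function name `smallest_test_size` and the `max_n` search limit are hypothetical, and the result is an inference from the rounded percentages, not a value read from the data.

```python
def smallest_test_size(reported_percentages, max_n=500):
    """Return (n, misses) for the smallest n consistent with every percentage."""
    for n in range(1, max_n + 1):
        misses = []
        for pct in reported_percentages:
            correct = round(n * pct / 100)  # nearest whole number of correct predictions
            if round(100 * correct / n, 2) != pct:
                break  # this n cannot produce the reported percentage
            misses.append(n - correct)
        else:
            return n, misses
    return None

# Accuracies printed by the three XGBoost cells
xgb_accuracies = [97.01, 98.51, 100.0]
result = smallest_test_size(xgb_accuracies)
print(result)  # (67, [2, 1, 0])
```

If the test set does hold 67 penguins, the three models differ by a single penguin each, which is worth keeping in mind before reading much into the ranking.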