# Case Study 3: XGBoost Classifier (Wine Dataset)

This notebook demonstrates **XGBoost** (`xgboost.XGBClassifier`) on the Wine dataset. It covers an installation note, training, evaluation, feature importance, and basic hyperparameter tuning.

In [None]:
# Install xgboost if not available
# Uncomment the following line if running in an environment without xgboost
# !pip install xgboost

In [None]:
# Imports
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

In [None]:
# Load dataset
wine = load_wine()
X = wine.data
y = wine.target
feature_names = wine.feature_names
pd.DataFrame(X, columns=feature_names).head()

In [None]:
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
print(X_train.shape, X_test.shape)

In [None]:
# Fit XGBoost
xgb = XGBClassifier(n_estimators=100, learning_rate=0.1, eval_metric='mlogloss', random_state=42)
xgb.fit(X_train, y_train)

# Evaluate
y_pred = xgb.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
print('\nClassification report:\n', classification_report(y_test, y_pred))

In [None]:
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6,5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=wine.target_names, yticklabels=wine.target_names)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix - XGBoost')
plt.show()

In [None]:
# Feature importances (built-in)
importances = xgb.feature_importances_
feat_imp = pd.Series(importances, index=feature_names).sort_values(ascending=True)
plt.figure(figsize=(8,10))
feat_imp.plot(kind='barh')
plt.title('Feature Importances - XGBoost')
plt.show()

In [None]:
# Basic hyperparameter tuning (small grid)
param_grid = {
    'n_estimators': [50, 100],
    'learning_rate': [0.05, 0.1],
    'max_depth': [3, 4]
}
gs = GridSearchCV(XGBClassifier(eval_metric='mlogloss', random_state=42), param_grid, cv=3, scoring='accuracy')
gs.fit(X_train, y_train)
print('Best params:', gs.best_params_)
print('Best CV score:', gs.best_score_)

best = gs.best_estimator_
y_pred_best = best.predict(X_test)
print('Test accuracy (best):', accuracy_score(y_test, y_pred_best))

## Takeaways

- XGBoost is an optimized, regularized implementation of gradient boosting that typically performs well on tabular data.
- It exposes many tuning parameters (such as `tree_method`, `subsample`, and `colsample_bytree`) and supports early stopping and GPU acceleration.
- Set `eval_metric` and track it on a held-out validation set (for example via `eval_set` in `fit`) to detect overfitting.