<a href="https://colab.research.google.com/github/isa-ulisboa/greends-pml/blob/main/notebooks/wine_region_pipeline_XGB_CV_gridsearch_featselection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [49]:
from sklearn import datasets
from sklearn.metrics import make_scorer, f1_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV,  StratifiedKFold, cross_val_score, train_test_split


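Define a pipeline that combines preprocessing and classification. Because the imputer and the scaler are steps of the pipeline, they are refitted on the training portion of every cross-validation split, which prevents data leakage from the validation data.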

In [74]:
# Create pipeline with preprocessing and classifier
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),  # Handle missing values
    ('scaler', StandardScaler()),                # Standardize features
    ('classifier', xgb.XGBClassifier())
])
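
The leakage argument can be checked directly. The following is a minimal sketch, not part of the original notebook: it fits a pipeline on one training fold and compares the scaler's learned means with the training-fold means and with the full-data means. `LogisticRegression` is a stand-in classifier chosen so the sketch depends on scikit-learn only, and all `demo_*` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_wine(return_X_y=True)

demo_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000)),  # stand-in (assumption)
])

# Take the first train/validation split of a stratified 5-fold scheme
demo_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
train_idx, val_idx = next(demo_cv.split(X_demo, y_demo))

# Fitting the pipeline fits the imputer and scaler on the training fold only
demo_pipeline.fit(X_demo[train_idx], y_demo[train_idx])

scaler_mean = demo_pipeline.named_steps['scaler'].mean_
matches_train_fold = np.allclose(scaler_mean, X_demo[train_idx].mean(axis=0))
matches_full_data = np.allclose(scaler_mean, X_demo.mean(axis=0))

print('scaler means match training fold:', matches_train_fold)
print('scaler means match full data:   ', matches_full_data)
```

If the scaler had been fitted on all of `X_demo` before splitting, its statistics would include the validation rows, which is the leakage the pipeline avoids.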


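Optional: add feature selection to the pipeline. In this case, features are ranked by the feature importances of a `RandomForestClassifier` and eliminated recursively, with the number of features to keep chosen by cross-validation (`RFECV`).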

In [75]:
# Create pipeline with preprocessing, automatic feature selection and classifier
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),  # Handle missing values
    ('scaler', StandardScaler()),                # Standardize features
    ('feature_selection', RFECV(                 # Recursive Feature Elimination (automatic version with CV)
        estimator=RandomForestClassifier(),
        step=1,
        cv=StratifiedKFold(3),
        scoring='accuracy'
    )),
    ('classifier', xgb.XGBClassifier())
])
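
After fitting, the selection step can be inspected to see which features were kept. The sketch below is an illustration under assumptions rather than the notebook's own code: it uses `LogisticRegression` as a stand-in final classifier, a smaller forest (`n_estimators=50`) to keep the run short, and hypothetical names such as `selector_pipeline`.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

wine = load_wine()

selector_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('feature_selection', RFECV(
        estimator=RandomForestClassifier(n_estimators=50, random_state=0),
        step=1,                      # drop one feature per elimination round
        cv=StratifiedKFold(3),
        scoring='accuracy'
    )),
    ('classifier', LogisticRegression(max_iter=1000)),  # stand-in (assumption)
])
selector_pipeline.fit(wine.data, wine.target)

# support_ is a boolean mask over the input features; n_features_ is its sum
selector = selector_pipeline.named_steps['feature_selection']
selected_names = [
    name for name, keep in zip(wine.feature_names, selector.support_) if keep
]
print(selector.n_features_, 'features kept:', selected_names)
```

The downstream classifier only ever sees the columns marked `True` in `support_`.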


Define the cross-validation splitting strategy

In [76]:
# Initialize StratifiedKFold for cross-validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
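
To see what stratification does, the class counts of each validation fold can be printed. This is a small sketch on the wine dataset; the names `strat_cv` and `fold_counts` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold

X_cv, y_cv = load_wine(return_X_y=True)
strat_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Count how many samples of each class land in every validation fold
fold_counts = np.array([
    np.bincount(y_cv[val_idx], minlength=3)
    for _, val_idx in strat_cv.split(X_cv, y_cv)
])

print('overall class counts:', np.bincount(y_cv))
print('per-fold class counts (rows = folds):')
print(fold_counts)
```

Each column of `fold_counts` should be nearly constant across folds, so every fold reflects the class proportions of the full dataset.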


Load data: X, y

In [77]:
X, y = datasets.load_wine(return_X_y=True)  # Wine region dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
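
The split above is not stratified. As an optional variation, and not something the original notebook does, `train_test_split` accepts a `stratify` argument that keeps the class proportions similar in the training and test sets. A minimal sketch, with hypothetical `*_s` variable names:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X_s, y_s = load_wine(return_X_y=True)

# stratify=y_s asks for the same class proportions in both subsets
X_s_train, X_s_test, y_s_train, y_s_test = train_test_split(
    X_s, y_s, test_size=0.2, random_state=42, stratify=y_s
)

full_props = np.bincount(y_s) / len(y_s)
train_props = np.bincount(y_s_train) / len(y_s_train)
test_props = np.bincount(y_s_test) / len(y_s_test)

print('full :', np.round(full_props, 3))
print('train:', np.round(train_props, 3))
print('test :', np.round(test_props, 3))
```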

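Optional: tune hyperparameters, in this case with `RandomizedSearchCV`. The code then replaces the pipeline with the best estimator found by the search, refitted on the training data. Because the search space here contains only three candidates, fewer than the default `n_iter=10`, the search evaluates all of them.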

In [79]:
# Search space: keys use the '<step name>__<parameter>' convention
param_grid = {
    'classifier__max_depth': [2, 3, 4]
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_grid,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    n_jobs=-1,
    random_state=42
)
search.fit(X_train, y_train)

# print best parameters
print(search.best_params_)

# Replace the pipeline with the best estimator (already refitted on X_train, y_train)
pipeline = search.best_estimator_




{'classifier__max_depth': 3}
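
With a search space this small, an exhaustive `GridSearchCV` is an alternative, and `cv_results_` shows the score of every candidate rather than only the winner. The sketch below is an assumption-laden illustration: `GradientBoostingClassifier` stands in for `xgb.XGBClassifier` so that only scikit-learn is required, and the `gs_*` names are hypothetical.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_gs, y_gs = load_wine(return_X_y=True)
X_gs_train, X_gs_test, y_gs_train, y_gs_test = train_test_split(
    X_gs, y_gs, test_size=0.2, random_state=42
)

gs_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    # Stand-in for XGBClassifier (assumption); it also exposes max_depth
    ('classifier', GradientBoostingClassifier(n_estimators=50, random_state=0)),
])

gs_grid = {'classifier__max_depth': [2, 3, 4]}

grid_search = GridSearchCV(
    gs_pipeline,
    param_grid=gs_grid,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    scoring='accuracy'
)
grid_search.fit(X_gs_train, y_gs_train)

# One row per candidate: parameter setting and mean validation accuracy
for params, mean_score in zip(grid_search.cv_results_['params'],
                              grid_search.cv_results_['mean_test_score']):
    print(params, round(float(mean_score), 3))
print('best:', grid_search.best_params_)
```

Comparing the per-candidate scores helps judge whether the best setting is clearly better or only marginally ahead of the others.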


Apply cross-validation to obtain scores. The arguments are the pipeline, the data, the cross-validation scheme, and the scoring metric. Note that the default, accuracy, is not a good scoring metric if the classes are imbalanced.

In [78]:
# Example usage with cross-validation
scores = cross_val_score(
    estimator=pipeline,
    X=X_train,  # Your feature matrix
    y=y_train,  # Your target vector
    cv=skf,
    scoring='accuracy'
)

print(scores)

[1.         0.82758621 0.96428571 0.92857143 1.        ]
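
For imbalanced classes, one possible alternative to accuracy is the macro-averaged F1 score, built with the `make_scorer` and `f1_score` functions imported at the top of the notebook. The following is a minimal sketch under assumptions: `LogisticRegression` is a stand-in classifier and the `f1_*` names are hypothetical.

```python
from sklearn.datasets import load_wine
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_f1, y_f1 = load_wine(return_X_y=True)

f1_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000)),  # stand-in (assumption)
])

# Macro averaging computes F1 per class and takes the unweighted mean,
# so small classes count as much as large ones
macro_f1 = make_scorer(f1_score, average='macro')

f1_scores = cross_val_score(
    estimator=f1_pipeline,
    X=X_f1,
    y=y_f1,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring=macro_f1
)

print(f1_scores)
print('mean:', f1_scores.mean(), 'std:', f1_scores.std())
```

Passing the string `scoring='f1_macro'` selects the same metric without building the scorer by hand.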
