# Gradient Boosting Model with Stratified k-Fold Cross-Validation

We will train a gradient boosting model (XGBoost) using stratified k-fold cross-validation on four different training datasets to determine which dataset is the most suitable for training the final model. We will evaluate model performance using accuracy and F1-score.



## Import the required libraries

In [21]:
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.model_selection import StratifiedKFold, GridSearchCV, RandomizedSearchCV, cross_validate
from sklearn.metrics import accuracy_score, f1_score, make_scorer
from xgboost import XGBClassifier

## Load Datasets

The datasets are stored in the `numerical_datasets` folder. We list their file paths here and load each one as a DataFrame inside the cross-validation loop.

1. train_data_mod_fasttext_300d_numerical.csv
2. train_data_mod_glove_50d_0v_numerical.csv
3. train_data_mod_glove_50d_custom_numerical.csv
4. train_data_mod_word2vec_50d_numerical.csv

In [22]:
# List of dataset filenames
datasets = [
    "numerical_datasets/train_data_mod_fasttext_300d_numerical.csv",
    "numerical_datasets/train_data_mod_glove_50d_0v_numerical.csv",
    "numerical_datasets/train_data_mod_glove_50d_custom_numerical.csv",
    "numerical_datasets/train_data_mod_word2vec_50d_numerical.csv",
]

## Perform Stratified k-Fold Cross-Validation

We'll perform stratified k-fold cross-validation for each dataset. Stratification ensures that each validation fold has approximately the same distribution of target values as the entire dataset. We'll use 5 folds in this example.
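
To make the effect of stratification concrete, here is a minimal sketch on synthetic labels (the 70/30 class split and the placeholder feature matrix are assumptions for illustration, not the notebook's data). It compares the positive-class rate in each validation fold with the overall rate:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced labels: 70 negatives and 30 positives (illustrative only)
y = np.array([0] * 70 + [1] * 30)
X = np.zeros((len(y), 1))  # placeholder features; the split depends only on y

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Collect the share of positive labels in every validation fold
fold_positive_rates = []
for train_idx, val_idx in skf.split(X, y):
    fold_positive_rates.append(y[val_idx].mean())

print(f"Overall positive rate: {y.mean():.2f}")
print("Per-fold positive rates:", [f"{rate:.2f}" for rate in fold_positive_rates])
```

Each fold's positive rate should stay close to the overall rate, which is what keeps accuracy and F1 comparable across folds.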

In [23]:
# Perform stratified k-fold cross-validation for each dataset
for dataset in datasets:
    print(f"Processing dataset: {dataset}")

    # Load dataset
    df = pd.read_csv(dataset)

    # Define target variable and feature columns
    target = "target"
    features = df.columns.drop(["id", "target"])

    # Set up stratified k-fold cross-validation
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    accuracies = []
    f1_scores = []

    # Perform cross-validation
    for train_index, val_index in skf.split(df, df[target]):
        # skf.split yields positional indices, so select rows with .iloc rather than .loc
        X_train, X_val = df.iloc[train_index][features], df.iloc[val_index][features]
        y_train, y_val = df.iloc[train_index][target], df.iloc[val_index][target]

        # Train the Gradient Boosting model
        model = XGBClassifier(random_state=42)
        model.fit(X_train, y_train)

        # Predict on validation set
        y_pred = model.predict(X_val)

        # Evaluate the model
        accuracy = accuracy_score(y_val, y_pred)
        f1 = f1_score(y_val, y_pred)

        # Store evaluation metrics
        accuracies.append(accuracy)
        f1_scores.append(f1)

    # Print average evaluation metrics for the dataset
    print(f"Average accuracy: {np.mean(accuracies):.4f}")
    print(f"Average F1-score: {np.mean(f1_scores):.4f}")
    print("\n")

Processing dataset: numerical_datasets/train_data_mod_fasttext_300d_numerical.csv
Average accuracy: 0.7612
Average F1-score: 0.7086


Processing dataset: numerical_datasets/train_data_mod_glove_50d_0v_numerical.csv
Average accuracy: 0.7580
Average F1-score: 0.7067


Processing dataset: numerical_datasets/train_data_mod_glove_50d_custom_numerical.csv
Average accuracy: 0.7579
Average F1-score: 0.7048


Processing dataset: numerical_datasets/train_data_mod_word2vec_50d_numerical.csv
Average accuracy: 0.7571
Average F1-score: 0.7048
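
The averages above differ by less than half a percentage point in accuracy across datasets (0.7612 vs. 0.7571), so it can help to look at the spread across folds as well as the mean. The following is a minimal sketch of that idea, not the notebook's actual run: it uses synthetic data from `make_classification` and scikit-learn's `GradientBoostingClassifier` as a stand-in for `XGBClassifier`, both of which are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic binary classification data (stand-in for one of the CSV datasets)
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# cross_validate returns one score per fold under the keys "test_<metric>"
scores = cross_validate(
    GradientBoostingClassifier(n_estimators=20, random_state=42),
    X,
    y,
    scoring=["accuracy", "f1"],
    cv=cv,
)

# Summarize each metric as (mean, standard deviation) over the folds
summary = {}
for metric in ("accuracy", "f1"):
    fold_scores = scores[f"test_{metric}"]
    summary[metric] = (fold_scores.mean(), fold_scores.std())
    print(f"{metric}: {fold_scores.mean():.4f} +/- {fold_scores.std():.4f}")
```

If the gap between two datasets is smaller than the fold-to-fold standard deviation, the ranking between them is probably not reliable.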




## Hyperparameter Tuning

We will tune hyperparameters with randomized search (`RandomizedSearchCV`, 50 sampled candidates) using stratified 5-fold cross-validation on each dataset. This will help us find the best hyperparameters for the XGBoost model.
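
Randomized search draws each candidate from one distribution per hyperparameter. With `scipy.stats`, `uniform(loc, scale)` covers the interval from `loc` to `loc + scale` (not from `loc` to `scale`), and `randint(low, high)` excludes `high`. The short sketch below checks this empirically by sampling; the variable names are illustrative.

```python
from scipy import stats

# uniform(loc, scale) samples from [loc, loc + scale], here [0.01, 0.21]
learning_rate_dist = stats.uniform(0.01, 0.2)
# randint(low, high) samples integers low..high-1, here 3..9
max_depth_dist = stats.randint(3, 10)

lr_samples = learning_rate_dist.rvs(size=10_000, random_state=0)
depth_samples = max_depth_dist.rvs(size=10_000, random_state=0)

print(f"learning_rate range: {lr_samples.min():.4f} to {lr_samples.max():.4f}")
print(f"max_depth values: {sorted(set(depth_samples.tolist()))}")
```

This means a distribution written as `stats.uniform(0.01, 0.2)` can produce learning rates slightly above 0.2.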

In [26]:
# Define the parameter distributions for Random Search
param_distributions = {
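    'learning_rate': stats.uniform(0.01, 0.2),      # uniform(loc, scale): samples from [0.01, 0.21]
    'max_depth': stats.randint(3, 10),              # integers 3..9 (upper bound is exclusive)
    'n_estimators': stats.randint(50, 201),         # integers 50..200
    'subsample': stats.uniform(0.5, 0.5),           # samples from [0.5, 1.0]
    'colsample_bytree': stats.uniform(0.5, 0.5),    # samples from [0.5, 1.0]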
}

# Initialize the XGBoost classifier
xgb_clf = XGBClassifier(random_state=42)

# Create a StratifiedKFold object for cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Use RandomizedSearchCV for hyperparameter tuning
random_search = RandomizedSearchCV(
    estimator=xgb_clf,
    param_distributions=param_distributions,
    scoring={'accuracy': make_scorer(accuracy_score), 'f1': make_scorer(f1_score)},
    refit='accuracy',
    cv=cv,
    verbose=2,
    n_jobs=-1,
    n_iter=50,
    random_state=42,
)

# Define a function to perform hyperparameter tuning on each dataset
def evaluate_datasets(datasets):
    for dataset in datasets:
        print(f"Processing {dataset}")
        data = pd.read_csv(dataset)
        
        X = data.drop(columns=['id', 'target'])
        y = data['target']
        
        random_search.fit(X, y)
        
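        # best_score_ is the mean cross-validated accuracy, because refit='accuracy'
        print(f"Best score: {random_search.best_score_}")
        print(f"Best params: {random_search.best_params_}")
        print("\n")

        # Re-run cross-validation with the best parameters to report accuracy and F1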
        clf = XGBClassifier(**random_search.best_params_, random_state=42)
        scores = cross_validate(
            estimator=clf,
            X=X,
            y=y,
            scoring={'accuracy': make_scorer(accuracy_score), 'f1': make_scorer(f1_score)},
            cv=cv,
            n_jobs=-1,
        )
        print(f"Average accuracy: {np.mean(scores['test_accuracy']):.4f}")
        print(f"Average F1 Score: {np.mean(scores['test_f1']):.4f}")
        print("\n")

evaluate_datasets(datasets)

Processing numerical_datasets/train_data_mod_fasttext_300d_numerical.csv
Fitting 5 folds for each of 50 candidates, totalling 250 fits
[CV] END colsample_bytree=0.5780093202212182, learning_rate=0.041198904067240534, max_depth=5, n_estimators=137, subsample=0.6668543055695109; total time=  19.8s
[CV] END colsample_bytree=0.5780093202212182, learning_rate=0.041198904067240534, max_depth=5, n_estimators=137, subsample=0.6668543055695109; total time=  20.2s
[CV] END colsample_bytree=0.5780093202212182, learning_rate=0.041198904067240534, max_depth=5, n_estimators=137, subsample=0.6668543055695109; total time=  20.6s
[CV] END colsample_bytree=0.6872700594236812, learning_rate=0.20014286128198325, max_depth=5, n_estimators=121, subsample=0.7993292420985183; total time=  22.3s
[CV] END colsample_bytree=0.6872700594236812, learning_rate=0.20014286128198325, max_depth=5, n_estimators=121, subsample=0.7993292420985183; total time=  22.5s
[CV] END colsample_bytree=0.6872700594236812, learning_ra

KeyboardInterrupt: 
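
Because the search is configured with both an `accuracy` and an `f1` scorer, `cv_results_` already holds the mean of each metric per candidate over the same folds. A minimal sketch of reading both at `best_index_`, which may make the second `cross_validate` pass unnecessary. It uses synthetic data and `LogisticRegression` as a fast stand-in for `XGBClassifier`; both are assumptions for illustration, not part of the notebook.

```python
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic binary classification data (stand-in for one of the CSV datasets)
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = RandomizedSearchCV(
    estimator=LogisticRegression(max_iter=1000),       # stand-in estimator
    param_distributions={"C": stats.uniform(0.1, 10)},  # samples C from [0.1, 10.1]
    scoring={"accuracy": "accuracy", "f1": "f1"},
    refit="accuracy",  # best_index_ and best_score_ are chosen by accuracy
    cv=cv,
    n_iter=5,
    random_state=42,
)
search.fit(X, y)

# Mean fold scores of the winning candidate, taken from the search itself
best = search.best_index_
best_accuracy = search.cv_results_["mean_test_accuracy"][best]
best_f1 = search.cv_results_["mean_test_f1"][best]

print(f"Average accuracy: {best_accuracy:.4f}")
print(f"Average F1 Score: {best_f1:.4f}")
```

With roughly 20 seconds per fit in the log above and 250 fits per dataset, skipping the extra five fits per dataset is a small saving, but it also keeps the reported metrics tied to exactly the folds used for model selection.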