# Gradient Boosting Model with Stratified k-Fold Cross-Validation

We will train a gradient boosting model (XGBoost) using stratified k-fold cross-validation on four training datasets to determine which dataset is the most suitable for training the final model. We will evaluate the model's performance using accuracy and F1-score.

## Import the required libraries

In [8]:
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, f1_score, make_scorer
from sklearn.model_selection import cross_validate, StratifiedKFold
from skopt import BayesSearchCV
from skopt.space import Real, Integer

## Load Datasets

The datasets are stored in the `numerical_datasets` folder. We list their file paths here and load each one into a DataFrame when it is processed.

1. train_data_mod_fasttext_300d_numerical.csv
2. train_data_mod_glove_50d_0v_numerical.csv
3. train_data_mod_glove_50d_custom_numerical.csv
4. train_data_mod_word2vec_50d_numerical.csv

In [9]:
# List of train dataset filenames
datasets = [
    "../numerical_datasets/train_data_mod_fasttext_300d_numerical.csv",
    "../numerical_datasets/train_data_mod_glove_50d_0v_numerical.csv",
    "../numerical_datasets/train_data_mod_glove_50d_custom_numerical.csv",
    "../numerical_datasets/train_data_mod_word2vec_50d_numerical.csv",
]
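
The later cells assume that every CSV has an `id` column, a `target` column, and numeric feature columns. A minimal sketch of a schema check is shown here, run on a small in-memory stand-in rather than the real files. The helper name `validate_schema`, the sample values, and the assumption that `target` is binary (0/1) are illustrative and not part of the original notebook.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for one of the CSV files (hypothetical values).
sample_csv = io.StringIO(
    "id,target,f0,f1\n"
    "1,0,0.12,-0.40\n"
    "2,1,0.33,0.08\n"
    "3,0,-0.25,0.51\n"
)


def validate_schema(df):
    """Check the layout the later cells rely on and return the feature names."""
    missing = {"id", "target"} - set(df.columns)
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")

    features = df.columns.drop(["id", "target"])
    non_numeric = [c for c in features if not pd.api.types.is_numeric_dtype(df[c])]
    if non_numeric:
        raise ValueError(f"Non-numeric feature columns: {non_numeric}")

    # Assumption: the task is binary classification with labels 0 and 1.
    if not set(df["target"].unique()) <= {0, 1}:
        raise ValueError("Expected a binary target with labels 0 and 1")

    return list(features)


feature_names = validate_schema(pd.read_csv(sample_csv))
print(f"Feature columns: {feature_names}")
```

Running a check like this on each file before training would surface a missing or mistyped column early, instead of partway through a long cross-validation run.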

## Perform Stratified k-Fold Cross-Validation

We'll perform stratified k-fold cross-validation for each dataset. Stratification ensures that each validation fold has approximately the same class distribution as the entire dataset, which keeps the per-fold metrics comparable. We'll use 5 folds in this example.
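
To make the effect of stratification concrete, the following sketch splits a synthetic, imbalanced label array and prints the share of positive labels in each validation fold. The label array and the placeholder feature matrix are made up for illustration. The splitter settings mirror the ones used in this notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced binary labels: 30 positives out of 100 samples.
y = np.array([1] * 30 + [0] * 70)
# Placeholder features; only y drives the stratification.
X = np.zeros((len(y), 1))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Share of positive labels in each validation fold.
fold_ratios = [float(y[val_idx].mean()) for _, val_idx in skf.split(X, y)]

print(f"Overall positive ratio: {y.mean():.2f}")
for fold, ratio in enumerate(fold_ratios, start=1):
    print(f"Fold {fold} positive ratio: {ratio:.2f}")
```

Each fold's ratio should stay close to the overall ratio, whereas a plain `KFold` on unshuffled, sorted labels could produce folds that contain only one class.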

In [3]:
# Perform stratified k-fold cross-validation for each dataset
for dataset in datasets:
    print(f"Processing dataset: {dataset}")

    # Load dataset
    df = pd.read_csv(dataset)

    # Define target variable and feature columns
    target = "target"
    features = df.columns.drop(["id", "target"])

    # Set up stratified k-fold cross-validation
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    accuracies = []
    f1_scores = []

    # Perform cross-validation
    for train_index, val_index in skf.split(df, df[target]):
        # split() yields positional indices, so select rows with iloc
        X_train, X_val = df[features].iloc[train_index], df[features].iloc[val_index]
        y_train, y_val = df[target].iloc[train_index], df[target].iloc[val_index]

        # Train the Gradient Boosting model
        model = XGBClassifier(random_state=42)
        model.fit(X_train, y_train)

        # Predict on validation set
        y_pred = model.predict(X_val)

        # Evaluate the model
        accuracy = accuracy_score(y_val, y_pred)
        f1 = f1_score(y_val, y_pred)

        # Store evaluation metrics
        accuracies.append(accuracy)
        f1_scores.append(f1)

    # Print average evaluation metrics for the dataset
    print(f"Average accuracy: {np.mean(accuracies):.4f}")
    print(f"Average F1-score: {np.mean(f1_scores):.4f}")
    print("\n")

Processing dataset: numerical_datasets/train_data_mod_fasttext_300d_numerical.csv
Average accuracy: 0.7612
Average F1-score: 0.7086


Processing dataset: numerical_datasets/train_data_mod_glove_50d_0v_numerical.csv
Average accuracy: 0.7580
Average F1-score: 0.7067


Processing dataset: numerical_datasets/train_data_mod_glove_50d_custom_numerical.csv
Average accuracy: 0.7579
Average F1-score: 0.7048


Processing dataset: numerical_datasets/train_data_mod_word2vec_50d_numerical.csv
Average accuracy: 0.7571
Average F1-score: 0.7048
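
The averages above differ by less than half a percentage point across datasets, so it helps to look at the fold-to-fold spread before ranking them. The sketch below reports the mean and sample standard deviation of per-fold scores. The per-fold values are hypothetical placeholders, not results from this notebook. In practice they would come from the `accuracies` and `f1_scores` lists collected in the loop.

```python
from statistics import mean, stdev

# Hypothetical per-fold accuracies for two datasets (illustrative values only).
fold_scores = {
    "dataset_a": [0.76, 0.77, 0.75, 0.76, 0.77],
    "dataset_b": [0.75, 0.76, 0.76, 0.75, 0.77],
}


def summarize(scores):
    """Return the mean and sample standard deviation of a list of fold scores."""
    return mean(scores), stdev(scores)


summary = {name: summarize(scores) for name, scores in fold_scores.items()}
for name, (avg, spread) in summary.items():
    print(f"{name}: {avg:.4f} +/- {spread:.4f}")

# Gap between the two means, for comparison with the spread.
gap = abs(summary["dataset_a"][0] - summary["dataset_b"][0])
print(f"Gap between means: {gap:.4f}")
```

When the gap between two means is smaller than the spread across folds, as in this toy data, the ranking of the two datasets is not well supported by cross-validation alone.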




## Hyperparameter Tuning

We will perform hyperparameter tuning using Bayesian optimization (`BayesSearchCV` from scikit-optimize) with 5-fold stratified cross-validation on each dataset. The search maximizes an equal-weight average of accuracy and F1-score to find good hyperparameters for the XGBoost model.
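
The scorer defined in the next cell averages accuracy and F1-score with equal weights. As a worked example of that arithmetic, the sketch below computes both metrics by hand on toy labels. It uses plain Python instead of scikit-learn, the helper names are illustrative, and it assumes binary labels with 1 as the positive class.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_binary(y_true, y_pred):
    """F1-score for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def combined(y_true, y_pred):
    """Equal-weight average of accuracy and F1-score."""
    return 0.5 * accuracy(y_true, y_pred) + 0.5 * f1_binary(y_true, y_pred)


# Toy labels: 2 true positives, 1 false positive, 1 false negative, 4 true negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

acc = accuracy(y_true, y_pred)    # 6 of 8 correct
f1 = f1_binary(y_true, y_pred)    # precision = recall = 2/3
score = combined(y_true, y_pred)
print(f"accuracy={acc:.4f}, f1={f1:.4f}, combined={score:.4f}")
```

Because the combined score mixes two metrics, the `Best score` values reported by the search are not directly comparable to the separate accuracy and F1 averages printed afterwards.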

In [10]:
def combined_scorer(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    return 0.5 * accuracy + 0.5 * f1

# Define the search space for Bayesian Optimization
param_space = {
    'learning_rate': Real(0.01, 0.2),
    'max_depth': Integer(3, 10),
    'n_estimators': Integer(50, 201),
    'subsample': Real(0.5, 1.0),
    'colsample_bytree': Real(0.5, 1.0),
}

# Initialize the XGBoost classifier
xgb_clf = XGBClassifier(random_state=42)

# Create a StratifiedKFold object for cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Use BayesSearchCV for hyperparameter tuning
bayes_search = BayesSearchCV(
    estimator=xgb_clf,
    search_spaces=param_space,
    scoring=make_scorer(combined_scorer),
    cv=cv,
    n_jobs=-1,
    n_iter=50,
    random_state=42,
)

def evaluate_datasets(datasets):
    best_models = []
    for dataset in datasets:
        print(f"Processing {dataset}")
        data = pd.read_csv(dataset)
        
        X = data.drop(columns=['id', 'target'])
        y = data['target']
        
        bayes_search.fit(X, y)
        
        print(f"Best score: {bayes_search.best_score_}")
        print(f"Best params: {bayes_search.best_params_}")
        print("\n")
        
        clf = XGBClassifier(**bayes_search.best_params_, random_state=42)
        scores = cross_validate(
            estimator=clf,
            X=X,
            y=y,
            scoring={'accuracy': make_scorer(accuracy_score), 'f1': make_scorer(f1_score)},
            cv=cv,
            n_jobs=-1,
        )
        print(f"Average accuracy: {np.mean(scores['test_accuracy']):.4f}")
        print(f"Average F1 Score: {np.mean(scores['test_f1']):.4f}")
        print("\n")
        
        best_models.append(clf.fit(X, y))
    
    return best_models

best_models = evaluate_datasets(datasets)

Processing numerical_datasets/train_data_mod_fasttext_300d_numerical.csv
Best score: 0.747682378581122
Best params: OrderedDict([('colsample_bytree', 0.8212749643012855), ('learning_rate', 0.010428072688169624), ('max_depth', 10), ('n_estimators', 198), ('subsample', 0.5815134565446674)])


Average accuracy: 0.7741
Average F1 Score: 0.7213


Processing numerical_datasets/train_data_mod_glove_50d_0v_numerical.csv
Best score: 0.7448216322459841
Best params: OrderedDict([('colsample_bytree', 1.0), ('learning_rate', 0.01), ('max_depth', 10), ('n_estimators', 92), ('subsample', 0.6265872392721268)])


Average accuracy: 0.7703
Average F1 Score: 0.7194


Processing numerical_datasets/train_data_mod_glove_50d_custom_numerical.csv
Best score: 0.7461805091520255
Best params: OrderedDict([('colsample_bytree', 0.864064272685718), ('learning_rate', 0.11111325363758288), ('max_depth', 6), ('n_estimators', 53), ('subsample', 0.7246541244464286)])


Average accuracy: 0.7718
Average F1 Score: 0.7205




## Make predictions on Test Data

In [11]:
# List of test dataset filenames
test_datasets = [
    "../numerical_datasets/test_data_mod_fasttext_300d_numerical.csv",
    "../numerical_datasets/test_data_mod_glove_50d_0v_numerical.csv",
    "../numerical_datasets/test_data_mod_glove_50d_custom_numerical.csv",
    "../numerical_datasets/test_data_mod_word2vec_50d_numerical.csv",
]

# Save the outputs in the gb_predictions folder
output_filenames = [
    "gb_predictions/test_data_mod_fasttext_300d_predictions.csv",
    "gb_predictions/test_data_mod_glove_50d_0v_predictions.csv",
    "gb_predictions/test_data_mod_glove_50d_custom_predictions.csv",
    "gb_predictions/test_data_mod_word2vec_50d_predictions.csv",
]

In [12]:
def make_predictions_and_save(models, test_datasets, output_filenames):
    for model, test_dataset, output_filename in zip(models, test_datasets, output_filenames):
        print(f"Processing {test_dataset}")
        
        data = pd.read_csv(test_dataset)
        X_test = data.drop(columns=['id'])
        ids = data['id']
        
        y_pred = model.predict(X_test)
        
        output = pd.DataFrame({'id': ids, 'target': y_pred})
        output.to_csv(output_filename, index=False)
        print(f"Saved predictions to {output_filename}")
        print("\n")

make_predictions_and_save(best_models, test_datasets, output_filenames)

Processing numerical_datasets/test_data_mod_fasttext_300d_numerical.csv
Saved predictions to predictions/test_data_mod_fasttext_300d_predictions.csv


Processing numerical_datasets/test_data_mod_glove_50d_0v_numerical.csv
Saved predictions to predictions/test_data_mod_glove_50d_0v_predictions.csv


Processing numerical_datasets/test_data_mod_glove_50d_custom_numerical.csv
Saved predictions to predictions/test_data_mod_glove_50d_custom_predictions.csv


Processing numerical_datasets/test_data_mod_word2vec_50d_numerical.csv
Saved predictions to predictions/test_data_mod_word2vec_50d_predictions.csv
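
The prediction step pairs models, test files, and output paths with `zip`, which stops at the shortest input without warning. It also relies on each test file having the same feature columns, in the same order, as the matching training file. The sketch below shows two small guards for these assumptions. The helper names `pair_up` and `check_feature_columns` are hypothetical, and the toy usage passes placeholder strings in place of fitted models and real paths.

```python
def pair_up(models, test_paths, output_paths):
    """Return (model, test_path, output_path) triples, refusing mismatched lengths."""
    if not (len(models) == len(test_paths) == len(output_paths)):
        raise ValueError(
            f"Length mismatch: {len(models)} models, "
            f"{len(test_paths)} test files, {len(output_paths)} output files"
        )
    return list(zip(models, test_paths, output_paths))


def check_feature_columns(train_columns, test_columns):
    """Raise if the test features differ from the training features in name or order."""
    train_columns, test_columns = list(train_columns), list(test_columns)
    if train_columns != test_columns:
        missing = [c for c in train_columns if c not in test_columns]
        extra = [c for c in test_columns if c not in train_columns]
        raise ValueError(
            f"Feature mismatch. Missing: {missing}; unexpected: {extra}; "
            "if both are empty, the column order differs."
        )


# Toy usage with placeholder strings standing in for fitted models and file paths.
triples = pair_up(
    ["model_a", "model_b"],
    ["test_a.csv", "test_b.csv"],
    ["out_a.csv", "out_b.csv"],
)
check_feature_columns(["f0", "f1"], ["f0", "f1"])  # Passes silently.
print(f"Paired {len(triples)} models with their test files")
```

In the notebook, the training column names could be taken from `X.columns` inside `evaluate_datasets` and compared against `X_test.columns` before calling `model.predict`.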




## Results

Here are the scores obtained by each prediction file:

1. test_data_mod_fasttext_300d_predictions.csv: 0.75574
2. test_data_mod_glove_50d_0v_predictions.csv: 0.74348
3. test_data_mod_glove_50d_custom_predictions.csv: 0.7441
4. test_data_mod_word2vec_50d_predictions.csv: 0.75206

## Conclusion

Based on these results, the dataset with fastText 300D embeddings performed best, with an accuracy of 0.75574, followed by the dataset with Word2Vec 50D embeddings (0.75206). The two datasets with GloVe 50D embeddings performed similarly, with the custom variant slightly ahead of the zero-vector variant (0.7441 vs. 0.74348).

In summary, the fastText 300D dataset appears to be the most suitable for training the final model, although its margin over the Word2Vec 50D dataset is under 0.004.