<a href="https://colab.research.google.com/github/xesmaze/cpsc541-fall2024/blob/main/lectures/Stratified_Kfold_Abalone.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**This assignment addresses class-conditional variation in the abalone data set across the Sex categories (male, female, infant) by adjusting the cross-validation splits.**

To handle imbalance in how the Sex categories are represented across folds, we build the folds so that each one receives an equal or near-equal share of every Sex category (male, female, and infant), instead of relying on the standard `StratifiedKFold`, which in typical usage stratifies on the model target.

For this, we can create a custom approach to ensure that each fold contains a balanced number of instances of each sex category.

**Key Steps**

- **Manual Balancing:** Instead of using `StratifiedKFold`, we split the dataset by the Sex column and then generate K-fold splits within each of these subsets.
- **Concatenation of Folds:** For each fold number, we combine the corresponding training parts and the corresponding test parts from all Sex categories, so that every fold keeps a balanced representation of each category.


In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load the dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'
column_names = ['Sex', 'Length', 'Diameter', 'Height', 'WholeWeight', 'ShuckedWeight', 'VisceraWeight', 'ShellWeight', 'Rings']
df = pd.read_csv(url, header=None, names=column_names)

# Convert Sex to numeric categories
df['Sex'] = df['Sex'].map({'M': 0, 'F': 1, 'I': 2})

# Prepare features and target
X = df[['Length']]  # Using Length as the feature
y = df['Rings']  # Using Rings as the target

# Number of splits for K-Fold
n_splits = 5

# Separate data by Sex categories
df_male = df[df['Sex'] == 0]
df_female = df[df['Sex'] == 1]
df_infant = df[df['Sex'] == 2]

# Initialize KFold
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)

# Store MSE scores
mse_scores = []

# Loop through the K-Fold splits for each Sex group
kf_male = kf.split(df_male)
kf_female = kf.split(df_female)
kf_infant = kf.split(df_infant)

for (train_index_male, test_index_male), \
    (train_index_female, test_index_female), \
    (train_index_infant, test_index_infant) in zip(kf_male, kf_female, kf_infant):

    # Split each Sex group into training and test sets
    train_male, test_male = df_male.iloc[train_index_male], df_male.iloc[test_index_male]
    train_female, test_female = df_female.iloc[train_index_female], df_female.iloc[test_index_female]
    train_infant, test_infant = df_infant.iloc[train_index_infant], df_infant.iloc[test_index_infant]

    # Combine the training sets and test sets from all Sex groups
    train_set = pd.concat([train_male, train_female, train_infant])
    test_set = pd.concat([test_male, test_female, test_infant])

    # Prepare training and test sets for model fitting
    X_train = train_set[['Length']]
    y_train = train_set['Rings']
    X_test = test_set[['Length']]
    y_test = test_set['Rings']

    # Linear regression
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Prediction and evaluation
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)

# Output mean MSE across folds
mean_mse = np.mean(mse_scores)
print(f"Mean MSE across folds: {mean_mse}")


Mean MSE across folds: 7.177193646238633
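
The balancing itself can be checked by counting the Sex categories in each combined test fold. The following is a minimal sketch on a synthetic stand-in frame: the name `toy` and the group sizes are made up for illustration and are not taken from the abalone data. It reuses the same per-group `KFold` and concatenation pattern as the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Synthetic stand-in for the abalone frame; group sizes are made up (assumption)
sizes = {0: 150, 1: 130, 2: 120}  # 0 = male, 1 = female, 2 = infant
toy = pd.DataFrame({'Sex': np.repeat(list(sizes.keys()), list(sizes.values()))})

kf = KFold(n_splits=5, shuffle=True, random_state=42)
groups = [toy[toy['Sex'] == code] for code in sizes]

fold_counts = []
# zip() pairs up fold i of every Sex group, as in the cell above
for splits in zip(*(kf.split(g) for g in groups)):
    # Concatenate the test part of every Sex group for this fold
    test_set = pd.concat([g.iloc[test_idx] for g, (_, test_idx) in zip(groups, splits)])
    fold_counts.append(test_set['Sex'].value_counts().sort_index().to_dict())

for i, counts in enumerate(fold_counts):
    print(f"Fold {i}: {counts}")
```

Because each group is split into five equal parts, every test fold here holds one fifth of each category. The same count can be run on the real `test_set` inside the loop above.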


To compare the balanced K-fold results with those of standard (unbalanced) K-fold, we run both procedures on the same data:

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load the dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'
column_names = ['Sex', 'Length', 'Diameter', 'Height', 'WholeWeight', 'ShuckedWeight', 'VisceraWeight', 'ShellWeight', 'Rings']
df = pd.read_csv(url, header=None, names=column_names)

# Convert Sex to numeric categories
df['Sex'] = df['Sex'].map({'M': 0, 'F': 1, 'I': 2})

# Prepare features and target
X = df[['Length']]  # Using Length as the feature
y = df['Rings']  # Using Rings as the target

# Number of splits for K-Fold
n_splits = 5

# Separate data by Sex categories
df_male = df[df['Sex'] == 0]
df_female = df[df['Sex'] == 1]
df_infant = df[df['Sex'] == 2]

# Initialize KFold
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)

# Store MSE scores
balanced_mse_scores = []
unbalanced_mse_scores = []

# ---- Balanced K-Fold CV ----
kf_male = kf.split(df_male)
kf_female = kf.split(df_female)
kf_infant = kf.split(df_infant)

for (train_index_male, test_index_male), \
    (train_index_female, test_index_female), \
    (train_index_infant, test_index_infant) in zip(kf_male, kf_female, kf_infant):

    # Split each Sex group into training and test sets
    train_male, test_male = df_male.iloc[train_index_male], df_male.iloc[test_index_male]
    train_female, test_female = df_female.iloc[train_index_female], df_female.iloc[test_index_female]
    train_infant, test_infant = df_infant.iloc[train_index_infant], df_infant.iloc[test_index_infant]

    # Combine the training sets and test sets from all Sex groups
    train_set = pd.concat([train_male, train_female, train_infant])
    test_set = pd.concat([test_male, test_female, test_infant])

    # Prepare training and test sets for model fitting
    X_train = train_set[['Length']]
    y_train = train_set['Rings']
    X_test = test_set[['Length']]
    y_test = test_set['Rings']

    # Linear regression
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Prediction and evaluation
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    balanced_mse_scores.append(mse)

# ---- Unbalanced K-Fold CV (Standard K-Fold) ----
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # Linear regression
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Prediction and evaluation
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    unbalanced_mse_scores.append(mse)

# Output results
mean_balanced_mse = np.mean(balanced_mse_scores)
mean_unbalanced_mse = np.mean(unbalanced_mse_scores)

print(f"Mean MSE (Balanced K-Fold CV): {mean_balanced_mse}")
print(f"Mean MSE (Unbalanced K-Fold CV): {mean_unbalanced_mse}")


Mean MSE (Balanced K-Fold CV): 7.177193646238633
Mean MSE (Unbalanced K-Fold CV): 7.19052977744533
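
When two mean scores are this close, it can help to look at the fold-to-fold spread as well as the mean. The sketch below assumes a hypothetical helper, `summarize_scores`, and uses placeholder score lists that are not results from this notebook. In practice `balanced_mse_scores` and `unbalanced_mse_scores` from the cell above would be passed in instead.

```python
import numpy as np

def summarize_scores(scores):
    """Return the mean and sample standard deviation of per-fold scores."""
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.std(ddof=1)

# Placeholder values for illustration only, NOT results from this notebook
example_scores = {
    'Balanced': [7.0, 7.4, 6.9, 7.3, 7.2],
    'Unbalanced': [6.8, 7.6, 7.1, 7.5, 6.9],
}

for name, scores in example_scores.items():
    mean, std = summarize_scores(scores)
    print(f"{name}: mean MSE = {mean:.3f}, std across folds = {std:.3f}")
```

If the difference between the two means is small relative to the spread across folds, the two splitting schemes are hard to tell apart on this metric alone.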


Alternatively, we can run K-Fold cross-validation independently within each Sex category, fitting and evaluating a separate model per category:

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load the dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'
column_names = ['Sex', 'Length', 'Diameter', 'Height', 'WholeWeight', 'ShuckedWeight', 'VisceraWeight', 'ShellWeight', 'Rings']
df = pd.read_csv(url, header=None, names=column_names)

# Convert Sex to numeric categories
df['Sex'] = df['Sex'].map({'M': 0, 'F': 1, 'I': 2})

# Prepare features and target
X = df[['Length']]  # Using Length as the feature
y = df['Rings']  # Using Rings as the target

# Number of splits for K-Fold
n_splits = 5

# Separate data by Sex categories
df_male = df[df['Sex'] == 0]
df_female = df[df['Sex'] == 1]
df_infant = df[df['Sex'] == 2]

# Initialize KFold
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)

# Store MSE scores
balanced_mse_scores = []
unbalanced_mse_scores = []
male_mse_scores = []
female_mse_scores = []
infant_mse_scores = []

# ---- Balanced K-Fold CV ----
kf_male = kf.split(df_male)
kf_female = kf.split(df_female)
kf_infant = kf.split(df_infant)

for (train_index_male, test_index_male), \
    (train_index_female, test_index_female), \
    (train_index_infant, test_index_infant) in zip(kf_male, kf_female, kf_infant):

    # Split each Sex group into training and test sets
    train_male, test_male = df_male.iloc[train_index_male], df_male.iloc[test_index_male]
    train_female, test_female = df_female.iloc[train_index_female], df_female.iloc[test_index_female]
    train_infant, test_infant = df_infant.iloc[train_index_infant], df_infant.iloc[test_index_infant]

    # Combine the training sets and test sets from all Sex groups
    train_set = pd.concat([train_male, train_female, train_infant])
    test_set = pd.concat([test_male, test_female, test_infant])

    # Prepare training and test sets for model fitting
    X_train = train_set[['Length']]
    y_train = train_set['Rings']
    X_test = test_set[['Length']]
    y_test = test_set['Rings']

    # Linear regression
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Prediction and evaluation
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    balanced_mse_scores.append(mse)

# ---- Unbalanced K-Fold CV (Standard K-Fold) ----
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # Linear regression
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Prediction and evaluation
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    unbalanced_mse_scores.append(mse)

# ---- Separate K-Fold CV for Each Sex Category ----

# Run the same K-Fold procedure within each Sex subset (male, female, infant),
# appending the per-fold MSE to that group's score list
for group_df, group_scores in [(df_male, male_mse_scores),
                               (df_female, female_mse_scores),
                               (df_infant, infant_mse_scores)]:
    X_group, y_group = group_df[['Length']], group_df['Rings']

    for train_index, test_index in kf.split(X_group):
        X_train, X_test = X_group.iloc[train_index], X_group.iloc[test_index]
        y_train, y_test = y_group.iloc[train_index], y_group.iloc[test_index]

        # Linear regression
        model = LinearRegression()
        model.fit(X_train, y_train)

        # Prediction and evaluation
        y_pred = model.predict(X_test)
        mse = mean_squared_error(y_test, y_pred)
        group_scores.append(mse)

# ---- Output results ----
mean_balanced_mse = np.mean(balanced_mse_scores)
mean_unbalanced_mse = np.mean(unbalanced_mse_scores)
mean_male_mse = np.mean(male_mse_scores)
mean_female_mse = np.mean(female_mse_scores)
mean_infant_mse = np.mean(infant_mse_scores)

print(f"Mean MSE (Balanced K-Fold CV): {mean_balanced_mse}")
print(f"Mean MSE (Unbalanced K-Fold CV): {mean_unbalanced_mse}")
print(f"Mean MSE (Male K-Fold CV): {mean_male_mse}")
print(f"Mean MSE (Female K-Fold CV): {mean_female_mse}")
print(f"Mean MSE (Infant K-Fold CV): {mean_infant_mse}")


Mean MSE (Balanced K-Fold CV): 7.177193646238633
Mean MSE (Unbalanced K-Fold CV): 7.19052977744533
Mean MSE (Male K-Fold CV): 7.961104882985151
Mean MSE (Female K-Fold CV): 9.141149627805742
Mean MSE (Infant K-Fold CV): 3.3461574891102805


### Why Stratified K-Fold from Scikit-learn Won’t Work Directly in This Case

**Stratified K-Fold** from **Scikit-learn** won’t work directly in our scenario because `StratifiedKFold` is designed primarily for **classification problems**: it builds the folds so that each one contains roughly the same proportion of each class label (e.g., 0s and 1s in binary classification) as the full data set.

However, in our case:

1. **Regression Problem:**
   - We are dealing with a **regression problem**: the target variable (`Rings`) is a numeric count that the model treats as continuous, not a set of class labels. Stratified K-Fold expects discrete labels to stratify on, so passing `Rings` to it as-is is not appropriate.

2. **Stratification Variable Differs from the Target:**
   - We want to balance the folds by **Sex** (categorical) while performing regression on **Rings** (continuous). `StratifiedKFold` stratifies on the label array handed to its `split` method, which in typical usage is the model target, so in that usage the folds are not stratified by a separate categorical feature such as **Sex**.

### Why StratifiedKFold Fails for Regression:

- **StratifiedKFold** stratifies based on the distribution of the **target variable** (e.g., class labels in classification problems). When your target is a continuous variable (like **Rings**), there’s no clear notion of "classes" to maintain a balanced distribution across folds.
- Even if you were to manually bin the continuous target into discrete intervals, that would solve balancing for **Rings**, but it wouldn't handle balancing based on the **Sex** category, which we’re also concerned about.
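
The binning idea mentioned above can be sketched as follows. This is an illustration on synthetic data, not the abalone set: the frame `toy`, the quartile binning, and the variable names are assumptions made for the example. `pd.qcut` turns the continuous target into discrete bins, and those bin labels are then passed to `StratifiedKFold.split` as the stratification labels.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Synthetic data for illustration only (assumption, not the abalone set)
rng = np.random.default_rng(0)
toy = pd.DataFrame({'Length': rng.uniform(0.1, 0.8, size=200)})
toy['Rings'] = 5 + 15 * toy['Length'] + rng.normal(0, 2, size=200)

# Bin the continuous target into quartiles; duplicates='drop' merges bins
# whose edges coincide, which can happen when the target has many ties
rings_bin = pd.qcut(toy['Rings'], q=4, labels=False, duplicates='drop')

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
bin_counts_per_fold = []
# The bin labels drive the fold construction; Sex plays no role here
for train_idx, test_idx in skf.split(toy[['Length']], rings_bin):
    counts = rings_bin.iloc[test_idx].value_counts().sort_index().to_dict()
    bin_counts_per_fold.append(counts)

for i, counts in enumerate(bin_counts_per_fold):
    print(f"Fold {i}: {counts}")
```

Each test fold ends up with a similar number of samples from every target bin, which is the balancing for **Rings** described above. Nothing in this sketch controls the **Sex** composition of the folds.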

### Why a Custom Approach is Needed:

- **Balancing by Sex:** We need a method that ensures each fold represents the **Sex** categories (Male, Female, Infant) in a balanced way. Stratified K-Fold in its typical usage, with the model target supplied as the stratification labels, does not stratify by a separate feature such as **Sex**.

- **Stratifying by Rings & Sex Simultaneously:** Even with binned values of **Rings**, we would still need to balance by **Sex**. `StratifiedKFold` stratifies on a single label array, so it does not handle both at once unless the two are first combined into one composite label.

Thus, to balance the folds by **Sex** category while performing regression on **Rings**, we use the custom approach shown in the previous examples, where each **Sex** group is split separately and the per-group folds are then combined.
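
As a cross-check on the custom splits, note that `StratifiedKFold.split(X, y)` stratifies on whatever label array is passed as its second argument, and that array does not have to be the target the model is fitted on. The sketch below, on synthetic data with made-up group sizes (the frame `toy` is an assumption, not the abalone set), passes the Sex codes as the stratification labels while the regression still uses `Rings`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data; sizes and coefficients are made up (assumption)
rng = np.random.default_rng(0)
sex = np.repeat([0, 1, 2], [150, 130, 120])  # 0 = male, 1 = female, 2 = infant
toy = pd.DataFrame({'Sex': sex, 'Length': rng.uniform(0.1, 0.8, size=sex.size)})
toy['Rings'] = 5 + 15 * toy['Length'] + rng.normal(0, 2, size=len(toy))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
sex_counts, mse_scores = [], []

# The second argument to split() is used only to build the folds
for train_idx, test_idx in skf.split(toy[['Length']], toy['Sex']):
    train, test = toy.iloc[train_idx], toy.iloc[test_idx]

    # The regression itself is still fitted on Rings
    model = LinearRegression().fit(train[['Length']], train['Rings'])
    mse_scores.append(mean_squared_error(test['Rings'], model.predict(test[['Length']])))
    sex_counts.append(test['Sex'].value_counts().sort_index().to_dict())

for i, counts in enumerate(sex_counts):
    print(f"Fold {i}: {counts}")
```

The per-fold Sex counts match what the per-group approach produces for the same group sizes. The individual rows assigned to each fold will generally differ, because the shuffling is done differently, so the MSE values need not be identical between the two methods.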
