## Step 1 - Importing Necessary Libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, roc_auc_score
from imblearn.over_sampling import SMOTE
from sklearn.feature_selection import SelectFromModel, RFE

## Step 2 - Importing Necessary Files


In [2]:
path_to_fluprint = r"C:\Users\Dana\OneDrive\Documents\Applied Data science\FluPRINT_database\FluPRINT_filtered_data\Fluprint.csv"
fluprint_data = pd.read_csv(path_to_fluprint)

## Step 3 - Data Preprocessing Steps for the Gradient Boosting Model

Before training the model, we carry out four preprocessing steps so that the data is in a suitable format for machine learning.

1. Convert categorical variables (specifically "name_formatted") into a format suitable for machine learning algorithms.

2. Separate data into training and testing sets to evaluate model performance on unseen data. 
**Note**: `stratify=y` ensures that the class distribution is maintained in both sets, which is important for imbalanced datasets.

3. Scale features using `StandardScaler`. This puts all features on a comparable scale and improves convergence speed for many algorithms. 
**Note**: Tree-based models such as Random Forests and Gradient Boosting are largely insensitive to feature scale, so we apply scaling here only to see whether it changes performance compared to the earlier Random Forest models.

4. Handle class imbalance with SMOTE. This helps prevent the model from being biased towards the majority class, a problem that has affected all the other algorithms so far. 
**Note**: SMOTE is applied only to the training data to prevent data leakage.
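
To illustrate the note on `stratify` in step 2, here is a minimal sketch on toy labels (not the FluPRINT data; `X_toy` and `y_toy` are made up for illustration). It shows that a stratified split keeps the same share of the minority class in the training and test sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 70% class 0 and 30% class 1 (illustrative only)
y_toy = np.array([0] * 700 + [1] * 300)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# Same split settings as in the notebook, applied to the toy data
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With 0/1 labels the mean is the share of class 1
train_pos_rate = y_tr.mean()
test_pos_rate = y_te.mean()
print(f"Share of class 1 in train: {train_pos_rate:.3f}")
print(f"Share of class 1 in test:  {test_pos_rate:.3f}")
```

Without `stratify`, the two shares can drift apart by chance, which makes test metrics harder to compare with training metrics.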

In [5]:
X = pd.get_dummies(fluprint_data[["name_formatted", "data"]], columns=["name_formatted"], drop_first=True)
y = fluprint_data["vaccine_response"]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

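# Impute missing values (the median strategy is an assumption; choose whichever suits the data)
imputer = SimpleImputer(strategy="median")
X_train_imputed = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)
X_test_imputed = pd.DataFrame(imputer.transform(X_test), columns=X_test.columns)

# Scale features (fit on the training set only, then apply the same scaling to the test set)
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train_imputed), columns=X_train_imputed.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test_imputed), columns=X_test_imputed.columns)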

# Apply SMOTE to handle class imbalance
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)

print("Final X_train_resampled shape:", X_train_resampled.shape)
print("Final y_train_resampled shape:", y_train_resampled.shape)

Final X_train_resampled shape: (125104, 3283)
Final y_train_resampled shape: (125104,)


## Step 4 - Gradient Boosting Classifier (GBC) Implementation and Results
Gradient Boosting builds trees sequentially, where each tree corrects errors made by the previous ones. Given that our previous Random Forest models showed signs of overfitting, this approach may help mitigate the issue. To validate model robustness we will focus on precision, recall, and F1-scores (as in the previous models) and then on ROC AUC, which helps us analyse the model's ability to distinguish between classes given the class imbalance in our vaccine response data.
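
As a rough illustration of the sequential idea, the sketch below fits a small model on synthetic data (an assumption, generated with `make_classification`, not the FluPRINT data) and reads the training loss recorded after each added tree from the `train_score_` attribute.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

gbc_demo = GradientBoostingClassifier(n_estimators=50, random_state=42)
gbc_demo.fit(X_demo, y_demo)

# train_score_[i] holds the training loss at boosting iteration i
first_loss = gbc_demo.train_score_[0]
last_loss = gbc_demo.train_score_[-1]
print(f"Training loss after the first tree: {first_loss:.3f}")
print(f"Training loss after the last tree:  {last_loss:.3f}")
```

A falling training loss only shows that each tree reduces the remaining error on the training data; it says nothing about performance on unseen data, which is why the test set metrics and cross-validation are still needed.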

In [None]:
# Initialise and train the model
gbc = GradientBoostingClassifier(random_state=42)
gbc.fit(X_train_resampled, y_train_resampled)

# Make predictions on the test set
y_pred = gbc.predict(X_test_scaled)
y_pred_proba = gbc.predict_proba(X_test_scaled)[:, 1]

# Print the classification report
print(classification_report(y_test, y_pred))

# Print ROC AUC score
print(f"ROC AUC Score: {roc_auc_score(y_test, y_pred_proba):.3f}")

# Perform cross-validation
cv_scores = cross_val_score(gbc, X_train_resampled, y_train_resampled, cv=5, scoring='roc_auc')
print(f"Cross-validation ROC AUC Scores: {cv_scores}")
print(f"Mean CV ROC AUC: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")

# Print feature importances
feature_importance = pd.DataFrame({'feature': X_train_resampled.columns, 'importance': gbc.feature_importances_})
print("\nTop 10 Most Important Features:")
print(feature_importance.sort_values('importance', ascending=False).head(10))

              precision    recall  f1-score   support

         0.0       0.73      0.58      0.64     15639
         1.0       0.34      0.50      0.41      6768

    accuracy                           0.55     22407
   macro avg       0.53      0.54      0.52     22407
weighted avg       0.61      0.55      0.57     22407

ROC AUC Score: 0.552
Cross-validation ROC AUC Scores: [0.56234051 0.56066809 0.63604964 0.65347685 0.64947041]
Mean CV ROC AUC: 0.612 (+/- 0.084)

Top 10 Most Important Features:
                                 feature  importance
0                                   data    0.904841
32    name_formatted_CD3_neg_lymphocytes    0.006186
2873                 name_formatted_TNFA    0.005222
31     name_formatted_CD33_pos_monocytes    0.004006
2092            name_formatted_L50_CD40L    0.002979
541                  name_formatted_IL31    0.002791
84                   name_formatted_IFNA    0.002689
509                  name_formatted_IL1B    0.002444
2123           na

### Key insights:

#### 1. **Classification Report**:
- **Class 0 (Low Responders)**: The model performs better on the majority class, with higher precision (0.73) and F1-score (0.64). However, recall is only 0.58, meaning that 42% of true low responders are misclassified.
- **Class 1 (High Responders)**: Performance on the minority class remains poor, with low precision (0.34) and F1-score (0.41). Recall is somewhat higher at 0.50, so only half of the true high responders are correctly identified.

The model struggles on the minority class (high responders), as the low precision and recall for Class 1 show.
Overall accuracy (0.55) and the macro-averaged F1-score (0.52) remain modest at best.

#### 2. **ROC AUC Score**:
- **Test Set ROC AUC Score**: 0.552  
- **Cross-Validation ROC AUC Scores**: [0.562, 0.561, 0.636, 0.653, 0.649]  
- **Mean CV ROC AUC**: **0.612 (+/- 0.084)**  
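    - The mean cross-validation ROC AUC of 0.612 indicates only modest class separation on the training data, and the test set ROC AUC of 0.552 is weaker still, barely above the 0.5 expected from random guessing. Because cross-validation was run on the SMOTE-resampled training set, synthetic samples can end up in the validation folds, so the cross-validation scores are likely optimistic.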

#### Observations:
- The high importance of the `data` feature (90.5%) suggests that the model is primarily using the measured values for its predictions.
- Immune cell types (`name_formatted_*`) have much smaller contributions individually (each below 1%), suggesting that their impact on vaccine response may be subtle or overshadowed by other factors.
- This is more than likely the reason for the overfitting. If the model relies too heavily on a single feature, it might not generalise well to new data.

#### Next steps:
- Create new features that combine the cell type name with its percentage

---

In [13]:
fluprint_data['combined_feature'] = fluprint_data['name_formatted'] + '_' + fluprint_data['units'] + '_' + fluprint_data['data'].astype(str)
X = pd.get_dummies(fluprint_data[['combined_feature']], columns=['combined_feature'], drop_first=True)
y = fluprint_data["vaccine_response"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'learning_rate': [0.01, 0.1, 0.2]
}

# Initialise the model and run a randomised hyperparameter search
gbc = GradientBoostingClassifier(random_state=42)
random_search = RandomizedSearchCV(estimator=gbc, param_distributions=param_grid, 
                                   n_iter=20, cv=5, scoring='roc_auc', n_jobs=-1, random_state=42)
random_search.fit(X_train, y_train)

# Make predictions
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test)
y_pred_proba = best_model.predict_proba(X_test)[:, 1]

# Print results
print("Best Parameters:", random_search.best_params_)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print(f"\nROC AUC Score: {roc_auc_score(y_test, y_pred_proba):.3f}")

# Perform cross-validation
cv_scores = cross_val_score(best_model, X_train, y_train, cv=5, scoring='roc_auc')
print(f"\nCross-validation ROC AUC Scores: {cv_scores}")
print(f"Mean CV ROC AUC: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")

# Print feature importances
feature_importance = pd.DataFrame({'feature': X.columns, 'importance': best_model.feature_importances_})
print("\nTop 10 Most Important Features:")
print(feature_importance.sort_values('importance', ascending=False).head(10))

MemoryError: Unable to allocate 5.83 GiB for an array with shape (87291, 71700) and data type bool

#### Challenge:
By combining `name_formatted`, `units` and `data` into a single string, most distinct measurements become their own category, so one-hot encoding produces tens of thousands of columns (71,700 here) and we quickly run out of memory. To overcome this issue we can try `LabelEncoder`, which maps each category to a single integer and is therefore faster and more memory-efficient. However, the integer codes impose an arbitrary ordering on the categories, which can affect performance for certain types of models. Since we are using a gradient boosting machine, the impact might not be too large (tree-based models can still split on label-encoded values), but it would definitely be an issue for the other algorithm we will be using (K-clustering), since clustering relies on distances between values. For that we would need a different approach.
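
The size in the error message can be reproduced with a short calculation. This sketch only uses the shape reported in the `MemoryError` and the fact that NumPy stores each boolean in one byte.

```python
# Shape reported in the MemoryError above
n_rows, n_cols = 87291, 71700

# NumPy stores one bool per byte, so a dense array needs rows * columns bytes
bytes_per_value = 1
dense_gib = n_rows * n_cols * bytes_per_value / 2**30

print(f"Dense one-hot matrix: {dense_gib:.2f} GiB")  # prints 5.83 GiB
```

Almost all of those bytes are zeros, because each row has at most one non-zero entry among the one-hot columns of a single encoded variable.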

In [None]:
# Combine features
fluprint_data['combined_feature'] = fluprint_data['name_formatted'] + '_' + fluprint_data['units'] + '_' + fluprint_data['data'].astype(str)

# Use LabelEncoder
le = LabelEncoder()
X = pd.DataFrame(le.fit_transform(fluprint_data['combined_feature']), columns=['combined_feature_encoded'])
y = fluprint_data["vaccine_response"]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Apply SMOTE to the training data
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Initialise and train the model
gbc = GradientBoostingClassifier(random_state=42)
gbc.fit(X_train_resampled, y_train_resampled)

# Make predictions on the test set
y_pred = gbc.predict(X_test)
y_pred_proba = gbc.predict_proba(X_test)[:, 1]

# Print the classification report
print(classification_report(y_test, y_pred))

# Print ROC AUC score
print(f"ROC AUC Score: {roc_auc_score(y_test, y_pred_proba):.3f}")

# Perform cross-validation
cv_scores = cross_val_score(gbc, X_train_resampled, y_train_resampled, cv=5, scoring='roc_auc')
print(f"Cross-validation ROC AUC Scores: {cv_scores}")
print(f"Mean CV ROC AUC: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")

# Print feature importance
feature_importance = pd.DataFrame({'feature': X.columns, 'importance': gbc.feature_importances_})
print("\nFeature Importance:")
print(feature_importance)

              precision    recall  f1-score   support

         0.0       0.72      0.53      0.61     15639
         1.0       0.32      0.52      0.40      6768

    accuracy                           0.53     22407
   macro avg       0.52      0.52      0.50     22407
weighted avg       0.60      0.53      0.55     22407

ROC AUC Score: 0.540
Cross-validation ROC AUC Scores: [0.56078002 0.55875436 0.56186111 0.55237823 0.56169757]
Mean CV ROC AUC: 0.559 (+/- 0.007)

Feature Importance:
                    feature  importance
0  combined_feature_encoded         1.0


### Why Did the First Approach Perform Better?

1. **Granularity of Features:**
   - In the first approach, each immune cell type (e.g., `name_formatted_CD3_neg_lymphocytes`) and its associated data (e.g., percentage) are treated as separate features.
   - This allows the model to identify which specific immune cell types or cytokines contribute most to vaccine response.

2. **Preservation of Relationships:**
   - By keeping features like `data` separate, the model can directly use quantitative measurements instead of relying on a single encoded value that combines all information.

3. **Interpretability:**
   - The first approach provides interpretable feature importance scores, helping you understand which immune components are most predictive of vaccine response.
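
The loss of quantitative information can be seen in a small sketch. The strings below are hypothetical examples in the same `name_units_value` style as `combined_feature`; `LabelEncoder` assigns codes by sorting the strings, so the codes do not follow the size of the measurement.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical combined strings (illustrative, not taken from the dataset)
combined = ["CD4_%_9.0", "CD4_%_10.5", "TNFA_pg/ml_0.7"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(combined)

# Codes follow the sorted string order: "CD4_%_10.5" sorts before "CD4_%_9.0"
for text, code in zip(combined, codes):
    print(f"{text!r} -> {code}")
```

Here the larger measurement (10.5) receives the smaller code, so a split threshold on the encoded column has no consistent relationship to the underlying values.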

---

### Conclusion:

The first approach slightly outperformed the second (test set ROC AUC of 0.552 versus 0.540), most likely because it provided more detailed and meaningful features for the model to learn from. While Label Encoding is fast and memory-efficient, it sacrifices granularity and interpretability in this case.

To achieve both good performance and manageable memory usage:
- Stick with separate features (`name_formatted`, `units`, `data`) instead of compressing them into a single label-encoded column.
- Use techniques like feature selection or dimensionality reduction if memory becomes an issue with one-hot encoding.
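
One further option, not used in this notebook, is to keep the one-hot matrix sparse so that only the non-zero entries are stored. The sketch below is an assumption-laden illustration on a toy frame with the same column names: it one-hot encodes `name_formatted` and `units` with `OneHotEncoder` (which returns a sparse matrix by default) and passes the numeric `data` column through unchanged.

```python
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy rows with the same column names as the dataset (values are made up)
toy = pd.DataFrame({
    "name_formatted": ["TNFA", "IL1B", "CD3_neg_lymphocytes", "TNFA"],
    "units": ["pg/ml", "pg/ml", "%", "pg/ml"],
    "data": [0.7, 3.2, 10.5, 1.1],
})

encoder = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["name_formatted", "units"]),
    ],
    remainder="passthrough",  # keep the numeric `data` column as it is
    sparse_threshold=1.0,     # keep the stacked result sparse
)

X_sparse = encoder.fit_transform(toy)
print("Sparse output:", sparse.issparse(X_sparse))
print("Shape:", X_sparse.shape)
```

`GradientBoostingClassifier.fit` accepts sparse input, so a matrix like this could be passed to the model directly. Note that SMOTE and `StandardScaler` with centring may need adjusting for sparse data, so this is a starting point rather than a drop-in replacement.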