# 3D Protein Influenza Classifier

This notebook trains and evaluates three different classification models (Logistic Regression, Random Forest, and SVM) to find the best one for our 3D protein data.

**Process:**
1.  **Setup:** Import libraries.
2.  **Data Load & Split:** Load the single data file and split it into a **train set** (for model tuning) and a **test set** (for final, unbiased evaluation). *This is the most important step for getting a reliable result.*
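3.  **Preprocessing Pipelines:** Create pipelines that handle feature transformation and scaling. Because the transformers sit inside the pipeline, they are fit only on the training portion of each split, which prevents data leakage.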
4.  **Model Training & Tuning:** Use `GridSearchCV` to find the best hyperparameters for each model using only the **train set**.
5.  **Final Evaluation:** Use the **test set** (which the models have never seen) to get a final, realistic performance score for each tuned model.
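6.  **Confusion Matrices:** Visualize each tuned model's predictions on the test set.
7.  **Conclusion:** Compare the models and select the best one.
8.  **Export:** Save the chosen model with `joblib`.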
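
As a minimal sketch of why step 3 matters, the example below uses synthetic data (the arrays, shapes, and `_demo` names are assumptions for illustration, not this notebook's dataset). When the scaler lives inside a `Pipeline`, calling `fit` learns its statistics from the training rows only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 85 samples, 4 features, binary labels (assumed shapes)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(85, 4))
y_demo = (X_demo[:, 0] > 5.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
])
pipe.fit(X_tr, y_tr)

# The scaler's learned mean comes from the training rows alone,
# so no test-set statistics leak into preprocessing.
scaler_mean = pipe.named_steps['scaler'].mean_
print("matches train-only mean:", np.allclose(scaler_mean, X_tr.mean(axis=0)))
print("matches full-data mean: ", np.allclose(scaler_mean, X_demo.mean(axis=0)))
```

The same holds inside `GridSearchCV`: each fold refits the whole pipeline, so the scaler never sees that fold's validation rows.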

In [None]:
# 1. Setup and Imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Preprocessing
from sklearn.model_selection import train_test_split, GridSearchCV, RepeatedStratifiedKFold
from sklearn.preprocessing import StandardScaler, PowerTransformer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Metrics
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, accuracy_score

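# Ignore convergence warnings only, so other warnings (e.g. deprecations) stay visible
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings('ignore', category=ConvergenceWarning)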

## 2. Data Load & Split

### **Instructions for Google Colab:**
1.  Click the 'Files' (folder) icon on the left sidebar.
2.  Click the 'Upload' (page with arrow) icon and select your `training_data.csv` file.
3.  Files uploaded this way land in `/content/`, so the `file_path` variable below works as-is. If your file has a different name, change the variable to match.
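
The cell below splits the data with `stratify=y`. As a rough sketch of what that option guarantees, the example here uses made-up labels (the 60/25 class counts are an assumption, not the real dataset) and compares the class proportions before and after the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 of class 0, 25 of class 1 (85 samples total)
y_demo = np.array([0] * 60 + [1] * 25)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of class 1 in the full data, the train split, and the test split
print(f"full:  {y_demo.mean():.3f} ({len(y_demo)} samples)")
print(f"train: {y_tr.mean():.3f} ({len(y_tr)} samples)")
print(f"test:  {y_te.mean():.3f} ({len(y_te)} samples)")
```

Without `stratify`, a test set this small could end up with very few samples of the minority class purely by chance.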

In [None]:
# Define the path to your data file
file_path = '/content/training_data.csv'

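try:
    df = pd.read_csv(file_path)
except FileNotFoundError as err:
    # Stop here: continuing would fail later or silently reuse a stale `df` from an earlier run
    raise FileNotFoundError(
        f"'{file_path}' not found. Upload 'training_data.csv' to Colab or update file_path."
    ) from err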

if 'df' in locals():
    print("Dataset loaded successfully.")
    print(f"Original dataset shape: {df.shape}")

    # Separate features (X) and target (y)
    X = df.drop('class_label', axis=1)
    y = df['class_label']

    # CRITICAL STEP: Split data into training and a hold-out test set
    # We use test_size=0.2 (17 samples) which is small, but necessary.
    # stratify=y ensures both train and test sets have the same proportion of classes as the original data.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, 
        test_size=0.2, 
        random_state=42, 
        stratify=y
    )

    print(f"Training set: {X_train.shape[0]} samples, {X_train.shape[1]} features")
    print(f"Test set: {X_test.shape[0]} samples, {X_test.shape[1]} features")

## 3. Preprocessing Pipelines

We will create a preprocessor for the models that require scaled data (Logistic Regression & SVM).

1.  **Skewed Features:** Features like `num_points`, `surface_area`, and `bbox_volume` are highly skewed. We will apply a `PowerTransformer` (Yeo-Johnson, which behaves much like a log-transform on positive data) to bring their distributions closer to Gaussian. This usually helps linear and kernel-based models.
2.  **All Features:** After transforming, we will apply `StandardScaler` to all features so that each has zero mean and unit variance, which is essential for SVM and regularized Logistic Regression.

**Random Forest does not need this** because tree splits depend only on the ordering of feature values, so we will create a separate, simpler pipeline for it.
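
To illustrate the effect of the power transform, here is a small sketch on a synthetic right-skewed feature. The log-normal sample and the `skewness` helper are assumptions for illustration; they stand in for a column such as `surface_area`.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

def skewness(values):
    """Sample skewness: the third standardized moment."""
    centered = values - values.mean()
    return float((centered ** 3).mean() / (centered ** 2).mean() ** 1.5)

# Synthetic right-skewed feature (log-normal), one column of 200 samples
rng = np.random.default_rng(0)
skewed_demo = rng.lognormal(mean=3.0, sigma=1.0, size=(200, 1))

# standardize=True is the default, so the output is also zero-mean, unit-variance
transformer = PowerTransformer(method='yeo-johnson')
transformed = transformer.fit_transform(skewed_demo)

skew_before = skewness(skewed_demo[:, 0])
skew_after = skewness(transformed[:, 0])
print(f"skew before: {skew_before:.2f}")
print(f"skew after:  {skew_after:.2f}")
print(f"mean / std after: {transformed.mean():.2f} / {transformed.std():.2f}")
```

A skewness near zero after the transform indicates a roughly symmetric distribution.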

In [None]:
# Based on the EDA, these features are highly skewed
skewed_features = ['num_points', 'surface_area', 'bbox_volume', 'max_dist', 'mean_dist', 'std_dist', 'min_dist']

# Get a list of the remaining features
numeric_features = [col for col in X_train.columns if col not in skewed_features]

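# Create a pipeline for skewed features: Transform, then Scale
# PowerTransformer standardizes its output by default; standardize=False leaves
# the scaling to the explicit StandardScaler step so it happens only once.
skewed_transformer = Pipeline(steps=[
    ('power', PowerTransformer(method='yeo-johnson', standardize=False)),
    ('scaler', StandardScaler())
])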

# Create a pipeline for other numeric features: Just Scale
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

# Use ColumnTransformer to apply different transformers to different columns
preprocessor = ColumnTransformer(
    transformers=[
        ('skewed', skewed_transformer, skewed_features),
        ('numeric', numeric_transformer, numeric_features)
    ],
    remainder='passthrough'
)

# Define the cross-validation strategy for tuning
# We use 5 splits, repeated 3 times. This is more stable than a single 5-fold CV.
cv_strategy = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)

## 4. Model Training & Hyperparameter Tuning

We will now find the best settings for each model using `GridSearchCV`. This will **only use the `X_train` and `y_train` data**.
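
As a quick sketch of how much work each grid search does, the snippet below counts the model fits for the Logistic Regression grid, using the same grid values and cross-validation settings as this notebook (the `_demo` variable names are only for illustration).

```python
from sklearn.model_selection import ParameterGrid, RepeatedStratifiedKFold

# Same CV settings and Logistic Regression grid as in this notebook
cv_demo = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
grid_demo = {
    'classifier__C': [0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l1', 'l2'],
}

n_candidates = len(ParameterGrid(grid_demo))  # 5 values of C x 2 penalties
n_folds = cv_demo.get_n_splits()              # 5 splits x 3 repeats
total_fits = n_candidates * n_folds

print(f"{n_candidates} candidates x {n_folds} folds = {total_fits} fits")
# With the default refit=True, GridSearchCV then refits the best
# candidate once more on all of X_train.
```

Each candidate's `best_score_` is therefore an average over 15 validation folds, which smooths out some of the noise from the small training set.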

In [None]:
# --- Model 1: Logistic Regression --- 
print("Tuning Logistic Regression...")
# Create the full pipeline: Preprocess -> Classify
pipe_lr = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42, max_iter=1000, solver='liblinear'))
])

# Parameters to test: C (inverse regularization strength; smaller C = stronger
# regularization) and penalty type (L1 or L2)
param_grid_lr = {
    'classifier__C': [0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l1', 'l2']
}

# Create and run the grid search
grid_lr = GridSearchCV(pipe_lr, param_grid_lr, cv=cv_strategy, scoring='accuracy', n_jobs=-1)
grid_lr.fit(X_train, y_train)

print(f"Best LR parameters: {grid_lr.best_params_}")
print(f"Best LR cross-validation accuracy: {grid_lr.best_score_:.4f}")
print("-"*30)

In [None]:
# --- Model 2: Random Forest --- 
print("Tuning Random Forest...")

# Random Forest does not need preprocessing, so the pipeline is simpler
pipe_rf = Pipeline(steps=[
    ('classifier', RandomForestClassifier(random_state=42))
])

# Parameters to test: Focus on params that prevent overfitting on small data
param_grid_rf = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [3, 5, 7],
    'classifier__min_samples_leaf': [3, 5, 10]
}

# Create and run the grid search
grid_rf = GridSearchCV(pipe_rf, param_grid_rf, cv=cv_strategy, scoring='accuracy', n_jobs=-1)
grid_rf.fit(X_train, y_train)

print(f"Best RF parameters: {grid_rf.best_params_}")
print(f"Best RF cross-validation accuracy: {grid_rf.best_score_:.4f}")
print("-"*30)

In [None]:
# --- Model 3: Support Vector Machine (SVC) --- 
print("Tuning SVM...")

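# SVM needs the preprocessed data.
# probability=True enables predict_proba() at the cost of an extra internal cross-validation per fit.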
pipe_svc = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', SVC(random_state=42, probability=True))
])

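# Parameters to test: C (inverse regularization strength), kernel type, and gamma.
# gamma only affects the RBF kernel, so it is searched in its own sub-grid
# to avoid refitting identical linear models.
param_grid_svc = [
    {'classifier__kernel': ['linear'], 'classifier__C': [0.1, 1, 10, 100]},
    {'classifier__kernel': ['rbf'], 'classifier__C': [0.1, 1, 10, 100],
     'classifier__gamma': ['scale', 0.01, 0.1]}
]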

# Create and run the grid search
grid_svc = GridSearchCV(pipe_svc, param_grid_svc, cv=cv_strategy, scoring='accuracy', n_jobs=-1)
grid_svc.fit(X_train, y_train)

print(f"Best SVM parameters: {grid_svc.best_params_}")
print(f"Best SVM cross-validation accuracy: {grid_svc.best_score_:.4f}")
print("-"*30)

## 5. Final Evaluation on Test Set

Now we take the *best* version of each model (found by `GridSearchCV`) and evaluate it on the **hold-out test set** (`X_test`, `y_test`).

This is our least biased estimate of how the models will perform on new, unseen data, although with a test set this small it is a noisy one.

In [None]:
# Store the best models from the grid search
best_lr = grid_lr.best_estimator_
best_rf = grid_rf.best_estimator_
best_svc = grid_svc.best_estimator_

models = {
    'Logistic Regression': best_lr,
    'Random Forest': best_rf,
    'SVM': best_svc
}

print("--- FINAL MODEL EVALUATION ON TEST SET ---\n")

for name, model in models.items():
    # Get predictions on the test set
    y_pred = model.predict(X_test)
    
    print(f"--- {name} Final Report ---")
    print(f"Test Set Accuracy: {accuracy_score(y_test, y_pred):.4f}\n")
    print(classification_report(y_test, y_pred))
    print("="*40)
    print("\n")

## 6. Confusion Matrix Visualization

Let's visualize the performance. We want to see high numbers on the diagonal (top-left to bottom-right), which represent correct predictions.
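
As a small sketch of how to read a confusion matrix, the example below uses made-up labels (illustration only, not results from this notebook) and shows that the diagonal sum divided by the total equals the accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration only
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# Rows are the true class, columns are the predicted class
cm_demo = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
print(cm_demo)

# Diagonal entries are correct predictions; every other cell is an error
correct = np.trace(cm_demo)
accuracy_from_cm = correct / cm_demo.sum()
print(f"{correct} correct out of {cm_demo.sum()} -> accuracy {accuracy_from_cm:.2f}")
print("agrees with accuracy_score:",
      np.isclose(accuracy_from_cm, accuracy_score(y_true_demo, y_pred_demo)))
```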

In [None]:
# Set up the figure for 3 subplots
fig, axes = plt.subplots(1, 3, figsize=(22, 6))
fig.suptitle('Confusion Matrices on Test Set', fontsize=16)

for i, (name, model) in enumerate(models.items()):
    # Get predictions and create confusion matrix
    y_pred = model.predict(X_test)
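    # Pass labels explicitly so rows/columns always line up with display_labels,
    # even if a class is missing from the small test set
    cm = confusion_matrix(y_test, y_pred, labels=model.classes_)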
    
    # Display the confusion matrix
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
    disp.plot(ax=axes[i], cmap='Blues', colorbar=False)
    axes[i].set_title(name)

plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()

## 7. Conclusion & Next Steps

Look at the **"Final Model Evaluation on Test Set"** and the **Confusion Matrices** above.

**Which model is best?**

* **Accuracy:** Look at the `Test Set Accuracy` for each model. Higher is better.
* **F1-Score:** Look at the `f1-score` (especially the `macro avg` or `weighted avg`). This is a good, balanced metric, especially if you have a slight class imbalance. Higher is better.
* **Confusion Matrix:** Which matrix has the lowest numbers in the off-diagonal (top-right and bottom-left) cells? Those are the incorrect predictions.
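
To make the difference between the `macro avg` and `weighted avg` rows concrete, here is a minimal sketch on made-up, imbalanced labels (illustration only, not results from this notebook).

```python
from sklearn.metrics import f1_score

# Made-up labels: four samples of class 0, two of class 1
y_true_demo = [0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 1]

per_class = f1_score(y_true_demo, y_pred_demo, average=None)
macro = f1_score(y_true_demo, y_pred_demo, average='macro')
weighted = f1_score(y_true_demo, y_pred_demo, average='weighted')

print(f"per-class F1: {per_class.round(3)}")
print(f"macro avg:    {macro:.3f}")     # unweighted mean of the per-class F1 scores
print(f"weighted avg: {weighted:.3f}")  # mean weighted by each class's true count
```

The macro average treats every class equally, so it drops more sharply than the weighted average when a minority class is predicted poorly.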

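Based on these results, you can choose your final model. The model that performed best on the **test set** is your champion, provided its cross-validation accuracy from Section 4 tells the same story. If two models are within a prediction or two of each other, treat them as tied and prefer the one with the higher cross-validation accuracy.

**Note:** Because your test set is very small (17 samples), a single prediction changes the accuracy by about 6% (1/17 ≈ 5.9%). This is expected. `GridSearchCV` tuned each model on the training set alone, so this final test is the most honest evaluation available, but once it is used to pick the champion, the winner's test score is slightly optimistic.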
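
To put a number on that uncertainty, the sketch below computes an approximate 95% Wilson score interval for a test accuracy. The `wilson_interval` helper and the 15-of-17 outcome are hypothetical, chosen only to illustrate the arithmetic.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (z = 1.96)."""
    p_hat = correct / total
    denom = 1 + z ** 2 / total
    center = (p_hat + z ** 2 / (2 * total)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / total + z ** 2 / (4 * total ** 2)
    )
    return center - half_width, center + half_width

n_test = 17  # test-set size stated in this notebook
print(f"one prediction = {1 / n_test:.1%} of accuracy")

# Hypothetical outcome: 15 of 17 test samples classified correctly
low, high = wilson_interval(correct=15, total=n_test)
print(f"accuracy {15 / n_test:.3f}, 95% interval roughly [{low:.2f}, {high:.2f}]")
```

An interval this wide is the reason small gaps between models on the test set should not be over-interpreted.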

In [None]:
# 8. Exporting the Champion Model

import joblib

# Determine the champion model based on test set accuracy
# (You should manually verify this by checking the printed reports!)

# Example: Assuming Random Forest was the best based on your final evaluation
# If Logistic Regression was best, change 'best_rf' to 'best_lr', etc.

champion_model = best_rf # Change this to best_lr or best_svc if they performed better
model_filename = 'influenza_protein_classifier.pkl'

# Save the entire fitted pipeline (preprocessor included, for the models that have one)
joblib.dump(champion_model, model_filename)

print(f"Successfully exported the model to: {model_filename}")
print("You can now download this file from the Colab file browser.")

# --- Example of how to load the model later ---
# X_new_data must contain the same feature columns as the training data.
# loaded_model = joblib.load(model_filename)
# new_prediction = loaded_model.predict(X_new_data)
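
The save-and-load round trip can be sketched end to end as follows. The pipeline, the data, and the `demo_classifier.pkl` filename are synthetic stand-ins (assumptions for illustration), not the champion model itself.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the champion pipeline and its training data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

demo_model = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
]).fit(X_demo, y_demo)

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_classifier.pkl')  # hypothetical filename
    joblib.dump(demo_model, demo_path)
    reloaded = joblib.load(demo_path)

# The reloaded pipeline should reproduce the original predictions
same_predictions = np.array_equal(
    demo_model.predict(X_demo), reloaded.predict(X_demo)
)
print("predictions identical after reload:", same_predictions)
```

Pickled scikit-learn models are generally only safe to load with the same scikit-learn version that saved them, so it is worth recording the version alongside the file.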