# 03 - Model Training and Evaluation

This is the final notebook in our Titanic Survival Prediction project. Here, we will use the preprocessed data to:

1.  **Split the data** into training and testing sets.
2.  **Select and train** two different classification models: a `LogisticRegression` model and a `RandomForestClassifier`.
3.  **Evaluate** the performance of both models using a variety of metrics and visualizations, including Confusion Matrices, Accuracy, Precision, Recall, F1-Score, and ROC curves.

## 1. Load Preprocessed Data

We'll load the `processed_data.csv` file, which contains our cleaned, engineered, and scaled features from the previous notebook.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_curve, auc

# Set a random state for reproducibility
RANDOM_STATE = 42

# Load the preprocessed dataset
try:
    processed_df = pd.read_csv('../data/processed_data.csv')
    print("Preprocessed dataset loaded successfully.")
except FileNotFoundError:
    print("Error: processed_data.csv not found. Please run '02_Feature_Engineering_and_Preprocessing.ipynb' first.")
    processed_df = pd.DataFrame() # Create an empty DataFrame to avoid errors later


## 2. Data Splitting

It's crucial to split our data into training and testing sets. The training set will be used to train our models, and the unseen testing set will be used to evaluate how well our models generalize to new data.

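We'll use an **80/20 train-test split** with `stratify=y`, so that the proportion of `Survived` (our target variable) is approximately the same in both sets. This matters when the classes are imbalanced, as they are in the Titanic data (fewer survivors than non-survivors): without stratification, a random split can leave the test set with a noticeably different survival rate than the training set.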
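
To see what `stratify` changes, here is a minimal sketch on synthetic labels (not the Titanic data). The names `X_demo` and `y_demo` are made up for illustration, and the 30% positive rate is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, illustrative data: 100 samples, 30% positive (NOT the Titanic data)
y_demo = np.array([1] * 30 + [0] * 70)
X_demo = np.arange(100).reshape(-1, 1)

# Plain random split: the positive rate in the test set depends on the shuffle
_, _, _, y_test_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Stratified split: class proportions are preserved in both subsets
_, _, _, y_test_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print("Positive rate, plain split:     ", y_test_plain.mean())
print("Positive rate, stratified split:", y_test_strat.mean())
```

With 20 test samples and a 30% positive rate, the stratified test set keeps 6 positives, while the plain split may drift from that depending on the random seed.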

In [None]:
if not processed_df.empty:
    # Define features (X) and target (y)
    X = processed_df.drop('Survived', axis=1)
    y = processed_df['Survived']

    # Perform Train-Test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y)

    print("Data split into training and testing sets.")
    print(f"X_train shape: {X_train.shape}")
    print(f"X_test shape: {X_test.shape}")
    print(f"y_train shape: {y_train.shape}")
    print(f"y_test shape: {y_test.shape}")

    print("\nSurvival distribution in training set:")
    print(y_train.value_counts(normalize=True))
    print("\nSurvival distribution in test set:")
    print(y_test.value_counts(normalize=True))


## 3. Model Selection & Training

We'll train two different classification models to compare their performance:

1.  **Logistic Regression**: A simple yet powerful linear model often used as a baseline.
2.  **Random Forest Classifier**: A robust ensemble tree-based model that can capture complex non-linear relationships.

### 3.1. Logistic Regression

In [None]:
if 'X_train' in locals() and not X_train.empty:
    lr_model = LogisticRegression(random_state=RANDOM_STATE, solver='liblinear')
    lr_model.fit(X_train, y_train)
    print("Logistic Regression model trained successfully.")


### 3.2. Random Forest Classifier

In [None]:
if 'X_train' in locals() and not X_train.empty:
    rf_model = RandomForestClassifier(random_state=RANDOM_STATE, n_estimators=100)
    rf_model.fit(X_train, y_train)
    print("Random Forest Classifier model trained successfully.")


## 4. Model Evaluation

Now, let's evaluate how well our trained models perform on the unseen test data. We'll use several key metrics and visualizations to get a comprehensive understanding.

### 4.1. Evaluation Function

To avoid code repetition, let's define a function to evaluate and print metrics for a given model.

In [None]:
def evaluate_model(model, X_test, y_test, model_name):
    """
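    Evaluate a trained binary classifier on the test set.

    Prints and plots the confusion matrix, prints Accuracy, Precision,
    Recall, F1-Score, and AUC, and returns (fpr, tpr, roc_auc) so the
    ROC curve can be plotted later.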
    """
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]

    print(f"\n--- {model_name} Evaluation ---")

    # Confusion Matrix
    cm = confusion_matrix(y_test, y_pred)
    print("\nConfusion Matrix:")
    print(cm)

    plt.figure(figsize=(6, 5))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False,
                xticklabels=['Did Not Survive', 'Survived'],
                yticklabels=['Did Not Survive', 'Survived'])
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.title(f'Confusion Matrix for {model_name}')
    plt.show()

    # Metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    
    fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
    roc_auc = auc(fpr, tpr)

    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1-Score: {f1:.4f}")
    print(f"AUC: {roc_auc:.4f}")
    
    return fpr, tpr, roc_auc # Return for ROC curve plotting


### 4.2. Evaluate Logistic Regression

In [None]:
if 'lr_model' in locals():
    fpr_lr, tpr_lr, auc_lr = evaluate_model(lr_model, X_test, y_test, 'Logistic Regression')


**Interpretation for Logistic Regression:**
-   **Confusion Matrix:**
    -   True Positives (TP): Correctly predicted survivors.
    -   True Negatives (TN): Correctly predicted non-survivors.
    -   False Positives (FP): Predicted survivors, but they did not survive (Type I error).
    -   False Negatives (FN): Predicted non-survivors, but they did survive (Type II error).
-   **Accuracy:** Overall proportion of correct predictions. Useful when classes are balanced.
-   **Precision:** Out of all predicted positives, what fraction were actually positive. Important when the cost of False Positives is high (e.g., predicting someone survived when they didn't).
-   **Recall (Sensitivity):** Out of all actual positives, what fraction were correctly predicted positive. Important when the cost of False Negatives is high (e.g., failing to predict someone survived when they did).
-   **F1-Score:** The harmonic mean of Precision and Recall. A good balance between the two, especially useful for imbalanced datasets.
-   **AUC (Area Under the ROC Curve):** Measures the model's ability to distinguish between positive and negative classes. A higher AUC means the model ranks positives above negatives more consistently, aggregated over all classification thresholds.
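
The definitions above can be checked by hand. The following is a minimal sketch on toy labels (illustrative only, not Titanic results) that derives each metric from the four confusion-matrix counts and compares it with scikit-learn's functions. The arrays `y_true_toy` and `y_hat_toy` are invented for this example.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Toy labels for illustration only: 4 actual positives, 6 actual negatives
y_true_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat_toy = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_hat_toy).ravel()

accuracy_manual = (tp + tn) / (tp + tn + fp + fn)
precision_manual = tp / (tp + fp)   # of predicted positives, how many are real
recall_manual = tp / (tp + fn)      # of real positives, how many were found
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
print(f"Accuracy:  {accuracy_manual:.4f} (sklearn: {accuracy_score(y_true_toy, y_hat_toy):.4f})")
print(f"Precision: {precision_manual:.4f} (sklearn: {precision_score(y_true_toy, y_hat_toy):.4f})")
print(f"Recall:    {recall_manual:.4f} (sklearn: {recall_score(y_true_toy, y_hat_toy):.4f})")
print(f"F1-Score:  {f1_manual:.4f} (sklearn: {f1_score(y_true_toy, y_hat_toy):.4f})")
```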

### 4.3. Evaluate Random Forest Classifier

In [None]:
if 'rf_model' in locals():
    fpr_rf, tpr_rf, auc_rf = evaluate_model(rf_model, X_test, y_test, 'Random Forest Classifier')


**Interpretation for Random Forest Classifier:**
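Read the same metrics as for Logistic Regression. A Random Forest can outperform Logistic Regression because it models non-linear relationships and feature interactions, but this is not guaranteed: on a small dataset it may also overfit, so compare the test-set numbers rather than assuming a winner.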

### 4.4. ROC Curve and AUC Comparison

The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate against the False Positive Rate as the classification threshold is varied. The Area Under the Curve (AUC) summarizes the curve in a single number: 0.5 corresponds to random guessing and 1.0 to perfect separation of the two classes, so a higher AUC indicates a better model.
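
To make the threshold sweep concrete, here is a minimal sketch that computes individual ROC points by hand on four toy scores and then checks the area with `roc_curve` and `auc`. The helper `roc_point` and the `*_demo` arrays are hypothetical names used only for illustration.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Toy labels and predicted probabilities (illustrative, not model output)
y_true_demo = np.array([0, 0, 1, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8])

def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when predicting positive for scores >= threshold."""
    y_hat = (scores >= threshold).astype(int)
    tp = np.sum((y_hat == 1) & (y_true == 1))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    tn = np.sum((y_hat == 0) & (y_true == 0))
    return float(fp / (fp + tn)), float(tp / (tp + fn))

# Lowering the threshold moves the point up and to the right on the ROC plot
for threshold in [0.8, 0.4, 0.35, 0.1]:
    fpr_t, tpr_t = roc_point(y_true_demo, scores_demo, threshold)
    print(f"threshold={threshold:.2f} -> FPR={fpr_t:.2f}, TPR={tpr_t:.2f}")

# scikit-learn performs the same sweep over all distinct scores
fpr_demo, tpr_demo, _ = roc_curve(y_true_demo, scores_demo)
auc_demo = auc(fpr_demo, tpr_demo)
print(f"AUC = {auc_demo:.2f}")
```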

In [None]:
if 'fpr_lr' in locals() and 'fpr_rf' in locals():
    plt.figure(figsize=(8, 6))
    plt.plot(fpr_lr, tpr_lr, color='darkorange', lw=2, label=f'Logistic Regression (AUC = {auc_lr:.2f})')
    plt.plot(fpr_rf, tpr_rf, color='green', lw=2, label=f'Random Forest (AUC = {auc_rf:.2f})')
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Classifier')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate (1 - Specificity)')
    plt.ylabel('True Positive Rate (Recall)')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend(loc='lower right')
    plt.grid(True)
    plt.show()
else:
    print("Models not trained or evaluated yet. Please run previous cells.")


## 5. Which Model Performed Better?

Compare the metrics obtained for both Logistic Regression and Random Forest Classifier.

-   **Accuracy:** Often a good starting point, but can be misleading with imbalanced classes.
-   **Precision and Recall:** Depending on the problem, one might be more critical than the other. For Titanic survival, perhaps correctly identifying survivors (high Recall) is important, but not at the cost of too many false alarms (low Precision).
-   **F1-Score:** Provides a balanced view of Precision and Recall.
-   **AUC:** A threshold-independent metric for comparing classifiers. The model with the higher AUC is generally considered better at ranking survivors above non-survivors. Under strong class imbalance, ROC AUC can look optimistic, so read it alongside Precision and Recall.
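
One way to compare the metrics listed above is to collect them in a single table. The sketch below assumes a hypothetical helper, `compare_models`, and runs it on synthetic data from `make_classification` so that it is self-contained; the numbers it prints say nothing about the Titanic models.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

def compare_models(models, X_test, y_test):
    """Return a DataFrame with one row of test-set metrics per fitted model."""
    rows = {}
    for name, model in models.items():
        y_pred = model.predict(X_test)
        y_score = model.predict_proba(X_test)[:, 1]
        rows[name] = {
            "Accuracy": accuracy_score(y_test, y_pred),
            "Precision": precision_score(y_test, y_pred, zero_division=0),
            "Recall": recall_score(y_test, y_pred, zero_division=0),
            "F1-Score": f1_score(y_test, y_pred, zero_division=0),
            "AUC": roc_auc_score(y_test, y_score),
        }
    return pd.DataFrame.from_dict(rows, orient="index")

# Synthetic stand-in data (assumption: any binary classification set works here)
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0, stratify=y_syn
)

demo_models = {
    "Logistic Regression": LogisticRegression(solver="liblinear").fit(Xs_train, ys_train),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0).fit(Xs_train, ys_train),
}

comparison = compare_models(demo_models, Xs_test, ys_test)
print(comparison.round(3))
```

In this notebook, the same helper could be called as `compare_models({'Logistic Regression': lr_model, 'Random Forest': rf_model}, X_test, y_test)` once both models are trained.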

Based on the results, discuss which model is more suitable for this problem and why. A more complex model like Random Forest may capture non-linear patterns that a linear model like Logistic Regression misses, but it does not always win, so let the test-set metrics decide.