## Instructions




Perform classification using the Titanic dataset. Compare the performance of different classifier models, including Logistic Regression, Random Forest Classifier, and XGBoost Classifier.





Incorporate the following aspects:

Part 1 - Data Processing

+ **Load the data**.
+ **Explore the data**, including distributions, correlation, etc. Make plots.
+ Check for null values. **Handle null values** by dropping or imputing. How will you handle the "age" variable?
+ **Choose target and feature(s)**.
+ **Encode features** as necessary (ordinal vs one-hot).
+ **Scale / normalize features** as necessary.
+ **Split into train and test sets** (specifically 80/20 split). Remember to use a random seed to ensure your results are reproducible.

Part 2 - Benchmark Models

A) Logistic Regression:

+ Use a [`LogisticRegression` model](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) from `sklearn` (using default params for now).
+ **Train the model** using the training data.
+ Inspect artifacts from the training process:
    + **Print the model's coefficients**. Inspect the coefficients by wrapping them in a pandas Series (or DataFrame for multiclass) and labeling them with their corresponding feature names, then sort them in descending order.
    + **Interpret the coefficients** - which features contribute most to our model's predictive ability?
+ Make **predictions** for the test set.
+ Evaluate the results using sklearn **classification metrics**, specifically the classification_report function. Optionally also report on the ROC AUC score as an additional metric. Interpret the results - how well did the model do?
+ Make a **confusion matrix**, using the `confusion_matrix` function from `sklearn.metrics`. Display your confusion matrix as a heatmap, for example using plotly or matplotlib.

B) Random Forest:

+ Repeat the steps from Part 2A, but this time using a [`RandomForestClassifier` model](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) from `sklearn` (using default params for now).

C) XGBoost:
  + Repeat the steps from Part 2A, but this time using an [`XGBClassifier` model](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBClassifier) from `xgboost` (using default params for now).

Part 3 - Best Model

  + Compare the results from Parts 2A, 2B, and 2C. Which model gives the best benchmark performance?
  + Try researching and using different classification models to improve performance, if possible.
  + Try engineering different features to improve the model's performance, if possible.
  + Try tuning the model(s) hyperparameters to improve the model's performance, if possible.
  + Optionally use a [`GridSearchCV`](https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.GridSearchCV.html) from `sklearn` to explore the hyperparameter space to find the hyperparameters that yield the best performance for each model. Specifically use ROC AUC score ("roc_auc" for binary classification or "roc_auc_ovr" for multi-class classification) as the metric / scoring function to optimize on.

  + What is the best model you found? Report on its performance. What are the hyperparameters of the best model?



## Part 1 - Data Processing


### Data Loading

We are loading a combined version of the Titanic dataset that has been hosted on GitHub.

For more information about this dataset, consult: https://www.kaggle.com/competitions/titanic/data

In [None]:
from pandas import read_csv

df = read_csv("https://raw.githubusercontent.com/prof-rossetti/intro-to-python/main/data/titanic-original-full.csv")
df["embarked"] = df["embarked"].str.upper()
df.head()

Unnamed: 0,passenger_id,survived,pclass,name,gender,age,sibsp,parch,ticket,fare,cabin,embarked
0,1,False,3,"Braund, Mr. Owen Harris",MALE,22.0,1,0,A/5 21171,7.25,,SOUTHAMPTON
1,2,True,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",FEMALE,38.0,1,0,PC 17599,71.2833,C85,CHERBOURG
2,3,True,3,"Heikkinen, Miss. Laina",FEMALE,26.0,0,0,STON/O2. 3101282,7.925,,SOUTHAMPTON
3,4,True,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",FEMALE,35.0,1,0,113803,53.1,C123,SOUTHAMPTON
4,5,False,3,"Allen, Mr. William Henry",MALE,35.0,0,0,373450,8.05,,SOUTHAMPTON


In [None]:
print(len(df))

1309


### Data Exploration

In [None]:
print(df.shape)
df.info()
df.isna().sum()

(1309, 12)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   passenger_id  1309 non-null   int64  
 1   survived      1309 non-null   bool   
 2   pclass        1309 non-null   int64  
 3   name          1309 non-null   object 
 4   gender        1309 non-null   object 
 5   age           1046 non-null   float64
 6   sibsp         1309 non-null   int64  
 7   parch         1309 non-null   int64  
 8   ticket        1309 non-null   object 
 9   fare          1308 non-null   float64
 10  cabin         295 non-null    object 
 11  embarked      1307 non-null   object 
dtypes: bool(1), float64(2), int64(4), object(5)
memory usage: 113.9+ KB


Unnamed: 0,0
passenger_id,0
survived,0
pclass,0
name,0
gender,0
age,263
sibsp,0
parch,0
ticket,0
fare,1


### Handling Missing Values
1. cabin has 1014 missing values (only 295 of 1309 are non-null) - I'll drop this column.
2. age has 263 missing values - I'll impute these with the median age.
3. embarked has 2 missing values - I'll impute these with the mode (most frequent value).
4. fare has 1 missing value - I'll impute it with the median fare.






In [None]:
df["age"] = df["age"].fillna(df["age"].median())
df["fare"] = df["fare"].fillna(df["fare"].median())
df["embarked"] = df["embarked"].fillna(df["embarked"].mode()[0])
df.drop(columns=["cabin"], inplace=True)
df.isna().sum()

Unnamed: 0,0
passenger_id,0
survived,0
pclass,0
name,0
gender,0
age,0
sibsp,0
parch,0
ticket,0
fare,0


In [None]:
target = "survived"
features = ["pclass", "gender", "age", "sibsp", "parch", "fare", "embarked"]

df = df[features + [target]]
df.head()

Unnamed: 0,pclass,gender,age,sibsp,parch,fare,embarked,survived
0,3,MALE,22.0,1,0,7.25,SOUTHAMPTON,False
1,1,FEMALE,38.0,1,0,71.2833,CHERBOURG,True
2,3,FEMALE,26.0,0,0,7.925,SOUTHAMPTON,True
3,1,FEMALE,35.0,1,0,53.1,SOUTHAMPTON,True
4,3,MALE,35.0,0,0,8.05,SOUTHAMPTON,False


### One-hot Encoding

In [None]:
from pandas import get_dummies

df = get_dummies(df, columns=["gender", "embarked"], drop_first=True)
df.head()

Unnamed: 0,pclass,age,sibsp,parch,fare,survived,gender_MALE,embarked_QUEENSTOWN,embarked_SOUTHAMPTON
0,3,22.0,1,0,7.25,False,True,False,True
1,1,38.0,1,0,71.2833,True,False,False,False
2,3,26.0,0,0,7.925,True,False,False,True
3,1,35.0,1,0,53.1,True,False,False,True
4,3,35.0,0,0,8.05,False,True,False,True


### Feature Correlation

In [None]:
import plotly.express as px

def plot_correlation_matrix(df, method="pearson", height=450):
    """Plot a heatmap of pairwise correlations between numeric columns.

    Params:
        method (str): "spearman" or "pearson".
        height (int): figure height in pixels.
    """
    cor_mat = df.corr(method=method, numeric_only=True)
    title = f"{method.title()} Correlation"
    fig = px.imshow(cor_mat,
                    text_auto=".2f", # round to two decimal places
                    color_continuous_scale="Blues",
                    color_continuous_midpoint=0)
    # apply the requested height (previously accepted but never used)
    fig.update_layout(height=height, title={'text': title, 'x':0.485, 'xanchor': 'center'})
    fig.show()

plot_correlation_matrix(df, method="spearman", height=450)

In [None]:
# Pearson (the pandas default) correlation of each column with the target
corr_target = df.corr(numeric_only=True)[target].sort_values(ascending=False)
corr_target

Unnamed: 0,survived
survived,1.0
fare,0.18344
parch,0.054357
embarked_QUEENSTOWN,0.024336
age,-0.034592
sibsp,-0.044795
embarked_SOUTHAMPTON,-0.144647
pclass,-0.240505
gender_MALE,-0.370021


### Interpretation of Correlation Matrix

1. **fare (+0.18):**
Passengers who paid higher fares were more likely to survive, which is consistent with higher fares corresponding to higher passenger classes.
2. **parch (+0.05):**
Very weak positive link - having parents/children aboard may have helped slightly.
3. **embarked_QUEENSTOWN (+0.02):**
Negligible correlation.
4. **age (–0.03):**
Very weak - younger people may have survived slightly more, but not strongly.
5. **sibsp (–0.04):**
Weak - number of siblings/spouses doesn’t strongly affect survival.
6. **embarked_SOUTHAMPTON (–0.14):**
Those who embarked from Southampton were somewhat less likely to survive.
7. **pclass (–0.24):**
Stronger negative correlation - lower classes had worse survival chances.
8. **gender_MALE (–0.37):**
Strongest correlation - males were much less likely to survive (consistent with “women and children first”).


### The most predictive features (based on correlation) are:
1. gender_MALE
2. pclass
3. fare


### Data Preparation


Split X/Y

In [None]:
X = df.drop(columns=[target])
y = df[target]

In [None]:
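# Standardize each feature to zero mean and unit variance (z-scores).
# Note: the mean and std are computed on the full dataset, before the train/test split.
X = (X - X.mean(axis=0)) / X.std(axis=0)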

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=99
)

print("TRAIN:", X_train.shape, y_train.shape)
print("TEST:", X_test.shape, y_test.shape)

TRAIN: (1047, 8) (1047,)
TEST: (262, 8) (262,)
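
The standardization in this notebook uses the mean and standard deviation of all rows, so the test rows influence the scaling applied to the training rows. A common alternative is to split first and fit the scaler on the training set only. The sketch below shows that ordering on hypothetical toy data (`X_demo`, `y_demo` and the column values are made up for illustration, not taken from the Titanic file), using scikit-learn's `StandardScaler`:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features standing in for the Titanic columns.
rng = np.random.default_rng(99)
X_demo = pd.DataFrame({
    "age": rng.normal(30, 12, size=200),
    "fare": rng.exponential(30, size=200),
})
y_demo = pd.Series(rng.integers(0, 2, size=200))

# Split first, so the test rows never influence the scaling statistics.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=99
)

# Fit on the training rows only, then reuse those statistics for the test rows.
scaler = StandardScaler()
X_tr_scaled = pd.DataFrame(scaler.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index)
X_te_scaled = pd.DataFrame(scaler.transform(X_te), columns=X_te.columns, index=X_te.index)

print(X_tr_scaled.mean().round(3))  # training means are zero by construction
print(X_te_scaled.mean().round(3))  # test means are close to, but not exactly, zero
```

`StandardScaler` divides by the population standard deviation (`ddof=0`), while `DataFrame.std` defaults to `ddof=1`, so the scaled values differ very slightly from the manual formula used above.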


## Part 2 - Benchmark Models

### A) Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from pandas import Series

model_lr = LogisticRegression(random_state=99, max_iter=1000)
model_lr.fit(X_train, y_train)

coef = Series(model_lr.coef_[0], index=X_train.columns).sort_values(ascending=False)
print(coef)

fare                    0.112380
parch                   0.066032
embarked_QUEENSTOWN     0.010121
embarked_SOUTHAMPTON   -0.263135
sibsp                  -0.272619
age                    -0.330719
pclass                 -0.528037
gender_MALE            -0.771062
dtype: float64




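*   gender_MALE (-0.77): largest coefficient by magnitude -> males less likely to survive
*   pclass (-0.53) and age (-0.33): negative -> passengers in lower classes (higher pclass number) and older passengers less likely to survive
*   fare (+0.11): positive -> higher fare means higher survival, though the effect is comparatively small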
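
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which can be easier to read. A minimal sketch, with the coefficients copied by hand from the output above (the name `odds_ratios` is introduced here for illustration). Because the features were standardized, each ratio describes the change in odds per one standard deviation of the feature, not per raw unit:

```python
import numpy as np
from pandas import Series

# Coefficients copied from the printed output above (log-odds scale).
coef = Series({
    "fare": 0.112380,
    "parch": 0.066032,
    "embarked_QUEENSTOWN": 0.010121,
    "embarked_SOUTHAMPTON": -0.263135,
    "sibsp": -0.272619,
    "age": -0.330719,
    "pclass": -0.528037,
    "gender_MALE": -0.771062,
})

# exp(coefficient) = multiplicative change in the odds of survival
# for a one-standard-deviation increase in that feature.
odds_ratios = np.exp(coef).sort_values(ascending=False)
print(odds_ratios.round(3))
```

Values below 1 reduce the odds of survival and values above 1 increase them.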

### Evaluation

In [None]:
from sklearn.metrics import classification_report, roc_auc_score

y_pred_lr = model_lr.predict(X_test)

print(classification_report(y_test, y_pred_lr))
print("ROC-AUC:", roc_auc_score(y_test, y_pred_lr).round(3))

              precision    recall  f1-score   support

       False       0.68      0.68      0.68       114
        True       0.75      0.75      0.75       148

    accuracy                           0.72       262
   macro avg       0.71      0.71      0.71       262
weighted avg       0.72      0.72      0.72       262

ROC-AUC: 0.713


### Interpretation
*   Accuracy: 0.72 -> The model correctly predicts survival about 72% of the time.
*   F1-scores: True (survived) = 0.75, False (did not survive) = 0.68 -> Slightly better at predicting survivors than non-survivors.
*   ROC-AUC: 0.713 -> Moderate discriminatory power; better than random (0.5) but room for improvement. This value is computed from hard class predictions rather than predicted probabilities.
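
The ROC-AUC above is computed from the output of `model_lr.predict`, which only contains True/False labels. `roc_auc_score` is usually given a continuous score, such as the positive-class probability from `predict_proba`, so that every threshold is considered. The sketch below contrasts the two calls on synthetic data from `make_classification` (the data and the names `clf`, `auc_labels`, `auc_proba` are illustrative assumptions, not part of this notebook):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, used only for illustration.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=99)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=99
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Hard labels: the ROC curve has a single operating point.
auc_labels = roc_auc_score(y_te, clf.predict(X_te))

# Probabilities of the positive class: the full ROC curve is used.
auc_proba = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print("ROC-AUC from labels:       ", round(auc_labels, 3))
print("ROC-AUC from probabilities:", round(auc_proba, 3))
```

The final comparison cell later in this notebook uses the probability-based form, which is why its ROC-AUC values differ from the ones reported in the benchmark sections.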

In [None]:
from sklearn.metrics import confusion_matrix
import plotly.express as px

cm = confusion_matrix(y_test, y_pred_lr)

px.imshow(cm, text_auto=True, color_continuous_scale="Blues",
          labels={"x": "Predicted", "y": "Actual"},
          title="Confusion Matrix - Logistic Regression")

### Interpretation

*   True Negatives (TN): 77 -> Correctly predicted as not survived.
*   False Positives (FP): 37 -> Predicted survived, actually did not.
*   False Negatives (FN): 37 -> Predicted did not survive, actually survived.
*   True Positives (TP): 111 -> Correctly predicted as survived.
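
The four counts above are enough to reproduce the headline numbers in the classification report. A short worked sketch, with the matrix typed in by hand from those counts (rows are actual classes, columns are predicted classes, in the order False, True):

```python
import numpy as np

# Confusion matrix from the counts listed above: [[TN, FP], [FN, TP]].
cm = np.array([[77, 37],
               [37, 111]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all passengers classified correctly
precision = tp / (tp + fp)        # of predicted survivors, how many survived
recall = tp / (tp + fn)           # of actual survivors, how many were found
specificity = tn / (tn + fp)      # recall for the False (did not survive) class

print(f"accuracy={accuracy:.3f} precision={precision:.2f} "
      f"recall={recall:.2f} specificity={specificity:.2f}")
```

These match the report: accuracy of about 0.72, precision and recall of 0.75 for the True class, and recall of 0.68 for the False class.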



### B) Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

model_rf = RandomForestClassifier(random_state=99)
model_rf.fit(X_train, y_train)

y_pred_rf = model_rf.predict(X_test)

print(classification_report(y_test, y_pred_rf))
print("ROC-AUC:", roc_auc_score(y_test, y_pred_rf).round(3))

              precision    recall  f1-score   support

       False       0.64      0.56      0.60       114
        True       0.69      0.76      0.72       148

    accuracy                           0.67       262
   macro avg       0.67      0.66      0.66       262
weighted avg       0.67      0.67      0.67       262

ROC-AUC: 0.659


In [None]:
importances = Series(model_rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
print(importances)

age                     0.363672
fare                    0.329853
gender_MALE             0.123283
sibsp                   0.054764
pclass                  0.048591
parch                   0.040722
embarked_SOUTHAMPTON    0.027239
embarked_QUEENSTOWN     0.011877
dtype: float64


### Interpretation

Performance
* Accuracy: 0.67 → lower than Logistic Regression (0.72).
* ROC-AUC: 0.659 → also lower than LR (0.713).
* Precision & Recall:
    * For True (survived): Recall is 0.76, so RF is slightly better at capturing survivors than LR’s 0.75.
    * For False (did not survive): Recall is 0.56, lower than LR’s 0.68 → RF misses more non-survivors.


### C) XGBoost

In [None]:
%%capture
!pip install xgboost

In [None]:
from xgboost import XGBClassifier

# use_label_encoder is not used by recent xgboost versions, so it is omitted here
model_xgb = XGBClassifier(eval_metric="logloss", random_state=99)
model_xgb.fit(X_train, y_train)

y_pred_xgb = model_xgb.predict(X_test)

print(classification_report(y_test, y_pred_xgb))
print("ROC-AUC:", roc_auc_score(y_test, y_pred_xgb).round(3))

              precision    recall  f1-score   support

       False       0.67      0.57      0.62       114
        True       0.70      0.78      0.74       148

    accuracy                           0.69       262
   macro avg       0.69      0.68      0.68       262
weighted avg       0.69      0.69      0.69       262

ROC-AUC: 0.677


In [None]:
importances_xgb = Series(model_xgb.feature_importances_, index=X_train.columns).sort_values(ascending=False)
print(importances_xgb)

gender_MALE             0.516322
pclass                  0.168292
embarked_SOUTHAMPTON    0.070778
sibsp                   0.062570
fare                    0.056759
age                     0.052344
embarked_QUEENSTOWN     0.037817
parch                   0.035119
dtype: float32


### Observations

* XGBoost has higher recall for survivors (True) than LR (0.78 vs 0.75), which can be useful if missing survivors is costly.
* However, recall for non-survivors (False) is lower (0.57), so it misclassifies more of them.
* Overall accuracy and ROC-AUC are slightly worse than Logistic Regression, so LR remains the best overall benchmark.

## Part 3 - Best Model

### Comparing the Benchmark Models:

**1. Logistic Regression:**
* Accuracy: 0.72
* ROC-AUC: 0.713
* F1-score for survivors (True): 0.75

##### Logistic Regression performs best overall in terms of accuracy and ROC-AUC. The model is interpretable, and the largest coefficients by magnitude belong to gender_MALE (-0.77) and pclass (-0.53).

**2. Random Forest:**
* Accuracy: 0.67
* ROC-AUC: 0.659
* F1-score for survivors (True): 0.72

##### Random Forest is less accurate than Logistic Regression. Age and fare are the most important features. Random Forest can capture non-linear relationships, but its benchmark performance is lower.

**3. XGBoost:**
* Accuracy: 0.69
* ROC-AUC: 0.677
* F1-score for survivors (True): 0.74

##### XGBoost improves recall for survivors (True) but misclassifies non-survivors more often. Gender is the most important feature in this model.


#### Observation:

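Among the benchmark models, Logistic Regression has the highest accuracy (0.72) and is the most interpretable. Its ROC-AUC lead (0.713) should be read with care: the benchmark ROC-AUC values were computed from hard class predictions, and the probability-based values in the final comparison table below rank default XGBoost (0.789) above default Logistic Regression (0.774). XGBoost slightly improves recall for survivors, while Random Forest has the lowest accuracy of the three.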


### Hyperparameter Tuning - Logistic Regression

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

print("=" * 60)
print("HYPERPARAMETER TUNING")
print("=" * 60)

# Define parameter grid for Logistic Regression
lr_param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'saga'],
    'max_iter': [1000]
}

# Grid search with cross-validation
print("\nTuning Logistic Regression...")
lr_grid = GridSearchCV(
    LogisticRegression(random_state=99),
    lr_param_grid,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    verbose=1
)

lr_grid.fit(X_train, y_train)

print(f"\nBest LR Parameters: {lr_grid.best_params_}")
print(f"Best LR CV ROC-AUC Score: {lr_grid.best_score_:.3f}")

HYPERPARAMETER TUNING

Tuning Logistic Regression...
Fitting 5 folds for each of 20 candidates, totalling 100 fits

Best LR Parameters: {'C': 0.1, 'max_iter': 1000, 'penalty': 'l1', 'solver': 'liblinear'}
Best LR CV ROC-AUC Score: 0.745
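
Beyond `best_params_` and `best_score_`, a fitted `GridSearchCV` keeps the cross-validated score of every candidate in `cv_results_`, which helps show how close the runners-up were. A minimal sketch on synthetic data (the data, the reduced grid over `C` only, and the name `cv_table` are assumptions made for illustration; they are not the grid used above):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data and a deliberately small grid, for illustration only.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=99)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="roc_auc",
)
grid.fit(X_demo, y_demo)

# One row per candidate, ordered from best to worst mean CV ROC-AUC.
cv_table = (
    pd.DataFrame(grid.cv_results_)[
        ["params", "mean_test_score", "std_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(cv_table)
```

Comparing `mean_test_score` differences against `std_test_score` gives a rough sense of whether the winning setting is meaningfully better than its neighbours.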


### Hyperparameter Tuning - Random Forest

In [None]:
# Parameter grid for Random Forest
rf_param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

print("\nTuning Random Forest...")
rf_grid = GridSearchCV(
    RandomForestClassifier(random_state=99),
    rf_param_grid,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    verbose=1
)

rf_grid.fit(X_train, y_train)

print(f"\nBest RF Parameters: {rf_grid.best_params_}")
print(f"Best RF CV ROC-AUC Score: {rf_grid.best_score_:.3f}")


Tuning Random Forest...
Fitting 5 folds for each of 81 candidates, totalling 405 fits

Best RF Parameters: {'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 200}
Best RF CV ROC-AUC Score: 0.750


### Hyperparameter Tuning - XGBoost

In [None]:
# Parameter grid for XGBoost
xgb_param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.3],
    'subsample': [0.8, 1.0]
}

print("\nTuning XGBoost...")
xgb_grid = GridSearchCV(
    # use_label_encoder is not used by recent xgboost versions, so it is omitted here
    XGBClassifier(eval_metric='logloss', random_state=99),
    xgb_param_grid,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    verbose=1
)

xgb_grid.fit(X_train, y_train)

print(f"\nBest XGB Parameters: {xgb_grid.best_params_}")
print(f"Best XGB CV ROC-AUC Score: {xgb_grid.best_score_:.3f}")


Tuning XGBoost...
Fitting 5 folds for each of 54 candidates, totalling 270 fits

Best XGB Parameters: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 300, 'subsample': 0.8}
Best XGB CV ROC-AUC Score: 0.752


### Comparing all the models

In [None]:
print("\n" + "=" * 60)
print("FINAL MODEL COMPARISON")
print("=" * 60)

# Create dictionary of all models
models = {
    'Logistic Regression (Default)': model_lr,
    'Random Forest (Default)': model_rf,
    'XGBoost (Default)': model_xgb,
    'Logistic Regression (Tuned)': lr_grid.best_estimator_,
    'Random Forest (Tuned)': rf_grid.best_estimator_,
    'XGBoost (Tuned)': xgb_grid.best_estimator_
}

# Evaluate each model
results = []
for name, model in models.items():
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]

    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred_proba)

    results.append({
        'Model': name,
        'Accuracy': f'{acc:.3f}',
        'Precision': f'{precision:.3f}',
        'Recall': f'{recall:.3f}',
        'F1-Score': f'{f1:.3f}',
        'ROC-AUC': f'{roc_auc:.3f}'
    })

# The metric functions were already imported in earlier cells; only pandas is needed here.
import pandas as pd

# Display results
results_df = pd.DataFrame(results)
print("\n")
print(results_df.to_string(index=False))


FINAL MODEL COMPARISON


                        Model Accuracy Precision Recall F1-Score ROC-AUC
Logistic Regression (Default)    0.718     0.750  0.750    0.750   0.774
      Random Forest (Default)    0.672     0.691  0.757    0.723   0.748
            XGBoost (Default)    0.691     0.703  0.784    0.741   0.789
  Logistic Regression (Tuned)    0.729     0.773  0.736    0.754   0.777
        Random Forest (Tuned)    0.710     0.737  0.757    0.747   0.791
              XGBoost (Tuned)    0.714     0.748  0.743    0.746   0.791


### Best Model Analysis

In [None]:
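# Find the best model based on ROC-AUC
# Note: the scores were stored as strings rounded to 3 decimals, so models that
# tie after rounding are resolved in favor of the first row (idxmax returns the first maximum).
results_df['ROC-AUC'] = results_df['ROC-AUC'].astype(float)
best_idx = results_df['ROC-AUC'].idxmax()
best_model_name = results_df.loc[best_idx, 'Model']
best_model = models[best_model_name]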

print("\n" + "=" * 60)
print("BEST MODEL REPORT")
print("=" * 60)

print(f"\nBest Model: {best_model_name}")
print(f"\nTest Set Performance:")
print(results_df.iloc[best_idx].to_string())

# Get best hyperparameters
if 'Tuned' in best_model_name:
    print(f"\nBest Hyperparameters:")
    if 'Logistic' in best_model_name:
        print(f"  - C: {lr_grid.best_params_['C']}")
        print(f"  - penalty: {lr_grid.best_params_['penalty']}")
        print(f"  - solver: {lr_grid.best_params_['solver']}")
    elif 'Forest' in best_model_name:
        for param, value in rf_grid.best_params_.items():
            print(f"  - {param}: {value}")
    elif 'XGB' in best_model_name:
        for param, value in xgb_grid.best_params_.items():
            print(f"  - {param}: {value}")

# Compare with baseline
baseline_idx = 0  # Logistic Regression (Default)
improvement = (float(results_df.loc[best_idx, 'ROC-AUC']) -
               float(results_df.loc[baseline_idx, 'ROC-AUC'])) * 100

print(f"\nImprovement from Baseline Logistic Regression:")
# The value is an absolute difference in ROC-AUC (x100), i.e. percentage points
print(f"  - ROC-AUC improved by: {improvement:.2f} percentage points")


BEST MODEL REPORT

Best Model: Random Forest (Tuned)

Test Set Performance:
Model        Random Forest (Tuned)
Accuracy                     0.710
Precision                    0.737
Recall                       0.757
F1-Score                     0.747
ROC-AUC                      0.791

Best Hyperparameters:
  - max_depth: 10
  - min_samples_leaf: 1
  - min_samples_split: 5
  - n_estimators: 200

Improvement from Baseline Logistic Regression:
  - ROC-AUC improved by: 1.70 percentage points


### Confusion Matrix for Best Model

In [None]:
# Create confusion matrix for best model
y_pred_best = best_model.predict(X_test)
cm_best = confusion_matrix(y_test, y_pred_best)

fig = px.imshow(cm_best,
                text_auto=True,
                color_continuous_scale="Blues",
                labels={"x": "Predicted", "y": "Actual"},
                title=f"Confusion Matrix - {best_model_name}")
fig.show()

---
References: Used Claude for answering 2B) Random Forest, 2C) XGBoost, and 3) Best Model.

---