<a href="https://colab.research.google.com/github/walson6/Evaluation-and-Comparison-of-Classification-Models/blob/main/Jim_GitHub_ML_Metrics_And_Model_Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Project Title: Evaluation and Comparison of Classification Models

**Objective:**  
Evaluate and compare classification models on standard evaluation metrics, using the scikit-learn documentation as a reference.

**Implementation:**

1. **Data Preparation:**
   - Loaded the breast cancer dataset from scikit-learn.
   - Split the data into training (50%), validation (25%), and test (25%) sets using `train_test_split` with stratification.

2. **Model Training and Evaluation:**
   - **Logistic Regression:**
     - Trained a Logistic Regression model with `solver='lbfgs'` and `max_iter=3000`.
     - Evaluated the model on the validation set using accuracy, recall, precision, AUC, and F1 score.
     - Validation results:
       - Accuracy: 0.944
       - Recall: 0.966
       - Precision: 0.945
       - AUC: 0.988
       - F1 Score: 0.956

   - **Support Vector Machine (SVM):**
     - Trained an SVM model (`SVC`) with `probability=True` and the default `C` of 1.
     - Evaluated the model on the validation set using the same metrics.
     - Results:
       - Accuracy: 0.894
       - Recall: 0.978
       - Precision: 0.870
       - AUC: 0.967
       - F1 Score: 0.921 (0.967 with the optimized `C` value of 100000)

3. **Hyperparameter Tuning:**
   - Optimized the SVM model's hyperparameter `C` for the best F1 score on the validation set.
   - Tested nine values of `C` from 0.01 to 10000000; the best was 100000, with an F1 score of 0.967.

4. **Model Selection:**
   - Compared the performance of Logistic Regression and SVM.
   - Chose Logistic Regression for its balance of performance, speed, and ease of understanding, despite the tuned SVM's higher F1 score.

**Skills Demonstrated:**
- Proficiency in Python and scikit-learn.
- Understanding of classification models and evaluation metrics.
- Experience with data splitting and stratification.
- Knowledge of hyperparameter tuning and model optimization.
- Ability to interpret and compare model performance.

**Tools and Libraries Used:**
- Python
- scikit-learn
- NumPy

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, precision_score, roc_auc_score, f1_score
from sklearn.svm import SVC
import numpy as np

In [2]:
from sklearn.datasets import load_breast_cancer # Load cancer dataset from sklearn
data_cancer = load_breast_cancer()

In [3]:
# X=feature/input y=target/output
X = data_cancer.data
y = data_cancer.target
print(len(X))

# With 569 samples, I consider the dataset large enough for a 50:25:25 train/validation/test split

# Split the data into training 50% and temp (validation, test) 50%
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.5, random_state=1024, stratify=y)

# Split the temp into validation 25% and test 25%
# Temp data is 50%, split into half for each, making it 25% each
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=1024, stratify=y_temp)


569
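
To check that the two-step split behaves as intended, the sketch below repeats it and reports each subset's size and share of the positive class. The helper name `describe_split` is hypothetical and not part of the original notebook. With stratification, the positive-class share should be nearly identical across the three subsets.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Same two-step stratified split as above: 50% train, then 25% / 25%
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.5, random_state=1024, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=1024, stratify=y_temp)


def describe_split(name, labels):
    """Hypothetical helper: print and return (size, positive-class share)."""
    size, share = len(labels), labels.mean()
    print(f"{name}: n = {size}, positive share = {share:.3f}")
    return size, share


split_stats = {
    name: describe_split(name, labels)
    for name, labels in [("train", y_train), ("val", y_val), ("test", y_test)]
}
```

Because 569 is odd, the subsets cannot be exactly 50%, 25%, and 25% of the data; the sizes differ by a sample or so.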


In [4]:
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
# Fit a logistic regression classifier to the training data
log_reg = LogisticRegression(random_state=1024, solver='lbfgs', max_iter=3000)
log_reg.fit(X_train, y_train)


In [5]:
# Predicted class labels for the validation set
y_val_pred = log_reg.predict(X_val)

# Predicted probabilities for the positive class '1'
y_val_prob = log_reg.predict_proba(X_val)[:, 1]

# Compute each metric with scikit-learn's scoring functions
accuracy = accuracy_score(y_val, y_val_pred)
recall = recall_score(y_val, y_val_pred)
precision = precision_score(y_val, y_val_pred)
auc = roc_auc_score(y_val, y_val_prob)
print('Logistic Regression:')
print(f'Accuracy: {accuracy: .3f}')
print(f'Recall: {recall: .3f}')
print(f'Precision: {precision: .3f}')
print(f'AUC: {auc: .3f}')


Logistic Regression:
Accuracy:  0.944
Recall:  0.966
Precision:  0.945
AUC:  0.988
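
The label-based metrics above all come from the four confusion-matrix counts. The sketch below recomputes them by hand and compares the results with scikit-learn's functions. The label arrays are made up for illustration and are not taken from the notebook's validation set.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Illustrative labels only (not the notebook's validation data)
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

manual = {
    "accuracy": (tp + tn) / (tp + tn + fp + fn),
    "recall": tp / (tp + fn),        # share of actual positives that were found
    "precision": tp / (tp + fp),     # share of predicted positives that were right
}
# F1 is the harmonic mean of precision and recall
manual["f1"] = (2 * manual["precision"] * manual["recall"]
                / (manual["precision"] + manual["recall"]))

library = {
    "accuracy": accuracy_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}

for name in manual:
    print(f"{name}: manual = {manual[name]:.3f}, sklearn = {library[name]:.3f}")
```

AUC is not included because it is computed from the ranking of predicted scores rather than from a single confusion matrix.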


In [6]:
# https://scikit-learn.org/stable/api/sklearn.svm.html#module-sklearn.svm
# Fit an SVM model to the same training data
svm_model = SVC(random_state=1024, probability=True)
svm_model.fit(X_train, y_train)

In [7]:
# Evaluate the SVM on the validation set with the same metrics used for Logistic Regression

# Predicted class labels for the validation set
y_val_pred = svm_model.predict(X_val)
# Predicted probabilities for the positive class '1'
y_val_prob = svm_model.predict_proba(X_val)[:, 1]
# Compute each metric with scikit-learn's scoring functions
svm_accuracy = accuracy_score(y_val, y_val_pred)
svm_recall = recall_score(y_val, y_val_pred)
svm_precision = precision_score(y_val, y_val_pred)
svm_auc = roc_auc_score(y_val, y_val_prob)
print('SVM:')
print(f'Accuracy: {svm_accuracy: .3f}')
print(f'Recall: {svm_recall: .3f}')
print(f'Precision: {svm_precision: .3f}')
print(f'AUC: {svm_auc: .3f}')

SVM:
Accuracy:  0.894
Recall:  0.978
Precision:  0.870
AUC:  0.967
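
AUC depends only on how the scores rank the samples, so `roc_auc_score` also accepts the signed distances returned by `decision_function`. The sketch below is an alternative to `probability=True`, not what the notebook does. It assumes the same split and default `SVC` settings, and the variable names are illustrative. The resulting AUC may differ slightly from the probability-based value printed above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.5, random_state=1024, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=1024, stratify=y_temp)

# Without probability=True, SVC skips the internal cross-validation
# it would otherwise run to calibrate probability estimates
svm_scores_model = SVC()
svm_scores_model.fit(X_train, y_train)

# Signed distance to the decision boundary; larger values favor class '1'
val_scores = svm_scores_model.decision_function(X_val)
auc_from_scores = roc_auc_score(y_val, val_scores)
print(f"AUC from decision_function: {auc_from_scores:.3f}")
```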


In [8]:
# Optimize the SVM model hyperparameter, C, to get the best F1 performance

# List of regularization hyperparameter C values to test
# Small C = stronger regularization: wider margin that tolerates more misclassified training points
# Large C = weaker regularization: narrower margin that fits the training data more closely
C_values = [0.01, 0.1, 1, 10, 100, 1000, 10000, 100000, 10000000]

results = {}

for C_value in C_values:
    # Train the SVM model with the current C value
    # Only class labels are needed for F1, so probability estimates are not requested
    svm_model = SVC(C=C_value, random_state=1024)
    svm_model.fit(X_train, y_train)

    # Predict class labels for the validation set and score them with F1
    y_val_pred = svm_model.predict(X_val)
    f1 = f1_score(y_val, y_val_pred)

    results[C_value] = {"F1": f1}
    print(f"C = {C_value}, F1 = {f1:.3f}")

# Find the best C based on F1 scores
best_C = max(results, key=lambda k: results[k]["F1"])
best_f1 = results[best_C]["F1"]
print(f"Best C value for SVM: {best_C:.0f} | Best F1 score: {best_f1:.3f}")

# F1 score for Logistic Regression on the same validation set, for comparison
y_val_pred = log_reg.predict(X_val)
f1 = f1_score(y_val, y_val_pred)
print(f'F1 score for Logistic Regression: {f1:.3f}')

C = 0.01, F1 = 0.771
C = 0.1, F1 = 0.917
C = 1, F1 = 0.921
C = 10, F1 = 0.924
C = 100, F1 = 0.934
C = 1000, F1 = 0.950
C = 10000, F1 = 0.961
C = 100000, F1 = 0.967
C = 10000000, F1 = 0.950
Best C value for SVM: 100000 | Best F1 score: 0.967
F1 score for Logistic Regression: 0.956
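
The loop above selects `C` on a single validation split. A common alternative is `GridSearchCV`, which scores each candidate by cross-validation on the training data. The sketch below is that swapped-in technique, not the notebook's method: the grid is a trimmed subset of the values above (an assumption made to keep the search quick), and because the procedure differs, the selected `C` may not match the 100000 found above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.5, random_state=1024, stratify=y)

# Assumption: a subset of the C values tested above
param_grid = {"C": [0.01, 0.1, 1, 10, 100, 1000, 10000, 100000]}

# 5-fold stratified cross-validation on the training split, scored by F1
search = GridSearchCV(SVC(), param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)

print(f"Best C by cross-validation: {search.best_params_['C']}")
print(f"Mean cross-validated F1: {search.best_score_:.3f}")
```

With the default `refit=True`, `search.best_estimator_` is retrained on the full training split using the selected `C`, so it can be evaluated on the validation set directly.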


### Which model I would choose and why

The tuned SVM achieves the higher validation F1 score (0.967 versus 0.956 for Logistic Regression), but only with a very large `C` of 100000, and the model becomes more susceptible to overfitting as `C` grows. Logistic Regression is well balanced: it is fast and needed no fine-tuning here. For most use cases, I would choose Logistic Regression because it is easier to understand, faster to train, and less prone to overfitting.
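
The comparison above uses only the validation set; the held-out test set is there for a final, one-time estimate of the chosen model. A minimal sketch of that last step, assuming the same split and Logistic Regression settings as the notebook (the names `final_model`, `test_f1`, and `test_auc` are illustrative, and no test-set scores are quoted because the original does not report any):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.5, random_state=1024, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=1024, stratify=y_temp)

# Refit the chosen model with the notebook's settings
final_model = LogisticRegression(random_state=1024, solver="lbfgs", max_iter=3000)
final_model.fit(X_train, y_train)

# Score once on data that played no part in training or model selection
test_f1 = f1_score(y_test, final_model.predict(X_test))
test_auc = roc_auc_score(y_test, final_model.predict_proba(X_test)[:, 1])
print(f"Test F1: {test_f1:.3f}")
print(f"Test AUC: {test_auc:.3f}")
```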