### Classification

#### Confusion Matrix

True Positive (TP): The model correctly predicted a positive outcome (the actual outcome was positive).

True Negative (TN): The model correctly predicted a negative outcome (the actual outcome was negative).

False Positive (FP): The model incorrectly predicted a positive outcome (the actual outcome was negative). Also known as a Type I error.

False Negative (FN): The model incorrectly predicted a negative outcome (the actual outcome was positive). Also known as a Type II error.
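
As a minimal sketch of how the four counts are obtained, the snippet below tallies them from two small label lists. The lists `y_true` and `y_pred` and the helper `confusion_counts` are made up for illustration (1 = positive, 0 = negative); they are not part of the datasets used later.

```python
# Hypothetical labels for illustration: 1 = positive, 0 = negative
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]


def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels encoded as 0/1."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1  # predicted positive, actually positive
        elif actual == 0 and predicted == 0:
            tn += 1  # predicted negative, actually negative
        elif actual == 0 and predicted == 1:
            fp += 1  # predicted positive, actually negative (Type I error)
        else:
            fn += 1  # predicted negative, actually positive (Type II error)
    return tp, tn, fp, fn


tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=3, TN=3, FP=1, FN=1
```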

#### Metrics

Accuracy = $\frac{TP+TN}{TP+FP+TN+FN}$. This is the percentage of correct predictions overall.

Precision = $\frac{TP}{TP+FP}$. This is the percentage of all positive predictions that are actually positive.

Recall = $\frac{TP}{TP+FN}$. This is the percentage of actual positives that the model correctly identified.

When predicting diseases, recall is more important, because a false negative (a missed case) is usually costlier than a false positive. For spam emails, both precision and recall are important.

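Accuracy does not work well when the dataset is imbalanced: a model that always predicts the majority class can reach high accuracy while missing every example of the minority class.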
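
To make the imbalance problem concrete, here is a small sketch with invented counts: 100 samples, of which only 5 are positive, scored against a model that predicts "negative" every time. The helper names `accuracy` and `recall` are illustrative only.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def recall(tp, fn):
    """Fraction of actual positives that were found (0.0 if there are none)."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


# Hypothetical imbalanced dataset: 95 negatives, 5 positives.
# A model that always predicts "negative" gets every negative right
# and every positive wrong.
tp, tn, fp, fn = 0, 95, 0, 5

acc = accuracy(tp, tn, fp, fn)
rec = recall(tp, fn)
print(f"accuracy = {acc:.2f}, recall = {rec:.2f}")  # accuracy = 0.95, recall = 0.00
```

Accuracy looks excellent even though the model never detects a positive case, which is why precision, recall, and F1 are preferred on imbalanced data.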

F1 Score = Harmonic mean of Precision and Recall = $2\cdot\frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$

The harmonic mean of $x_1, x_2, \dots, x_n$ is $\left(\frac{\frac{1}{x_1} + \frac{1}{x_2} + \dots + \frac{1}{x_n}}{n}\right)^{-1}$.

If $n=2$, $\left(\frac{\frac{1}{x_1} + \frac{1}{x_2}}{2}\right)^{-1} = 2\cdot\frac{x_1x_2}{x_1 + x_2}$.

Model 1: precision = 0.5, recall = 1

Model 2: precision = 0.6, recall = 0.9

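We want a good balance of precision and recall. Both models have the same arithmetic mean (0.75), so a plain average cannot separate them. The harmonic mean is pulled toward the smaller of the two values, so it rewards models whose precision and recall are both high: Model 1 has F1 ≈ 0.67, while Model 2 has F1 = 0.72.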
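
The comparison of the two models can be checked numerically. The sketch below uses the F1 formula from this section and cross-checks it against `statistics.harmonic_mean` from the Python standard library; the helper name `f1_from` is an illustrative choice.

```python
from statistics import harmonic_mean


def f1_from(precision, recall):
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall of the two models described above
models = {"Model 1": (0.5, 1.0), "Model 2": (0.6, 0.9)}

results = {}
for name, (p, r) in models.items():
    arithmetic = (p + r) / 2
    f1 = f1_from(p, r)
    results[name] = (arithmetic, f1)
    # The general harmonic mean agrees with the n = 2 closed form
    assert abs(f1 - harmonic_mean([p, r])) < 1e-9
    print(f"{name}: arithmetic mean = {arithmetic:.3f}, F1 = {f1:.3f}")

# Model 1: arithmetic mean = 0.750, F1 = 0.667
# Model 2: arithmetic mean = 0.750, F1 = 0.720
```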

In [58]:
import numpy as np
import pandas as pd

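Logistic regression models the probability of the positive class as $P(y=1 \mid x) = \frac{1}{1+e^{-w^{T}x}}$, where $w$ is the weight vector and $x$ is the feature vector.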
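
A minimal sketch of this formula in plain Python follows. The weights (0.05 for each feature) and the centering values (50 and 600) mirror the ones used to simulate the loan data in this notebook; the function names `sigmoid` and `approval_probability` and the sample applicants are assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def approval_probability(income, credit_score, w=(0.05, 0.05)):
    """P(y = 1 | x) for centered features x = (income - 50, credit_score - 600)."""
    x = (income - 50, credit_score - 600)
    z = sum(w_i * x_i for w_i, x_i in zip(w, x))  # w^T x
    return sigmoid(z)


# An applicant exactly at the mean of both features gets probability 0.5
p_average = approval_probability(50, 600)
# Above-average income and credit score push the probability toward 1
p_strong = approval_probability(60, 620)
print(p_average, p_strong)
```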

In [97]:
np.random.seed(42)
n_samples = 200

# Generate income (in thousands) and credit scores
income = np.random.normal(50, 10, n_samples)  # mean = 50k, sd = 10k
credit_score = np.random.normal(600, 50, n_samples)  # mean = 600, sd = 50

X = np.c_[income, credit_score]

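# Generate loan approval (approved or rejected):
# the approval probability follows a logistic model of the centered features,
# and each label is drawn as a Bernoulli sample with that probability
prob_approval = 1 / (1 + np.exp(-0.05 * (income - 50) - 0.05 * (credit_score - 600)))
y = (np.random.rand(n_samples) < prob_approval).astype(int)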

approved_ratio = np.sum(y) / n_samples

print(approved_ratio)

# Turn the binary classes into approved and rejected

y = np.where(y == 1, "APPROVED", "REJECTED")
y

0.53


array(['APPROVED', 'APPROVED', 'APPROVED', 'APPROVED', 'REJECTED',
       'REJECTED', 'REJECTED', 'APPROVED', 'REJECTED', 'APPROVED',
       'REJECTED', 'APPROVED', 'APPROVED', 'APPROVED', 'REJECTED',
       'APPROVED', 'REJECTED', 'REJECTED', 'REJECTED', 'APPROVED',
       'APPROVED', 'REJECTED', 'APPROVED', 'REJECTED', 'REJECTED',
       'APPROVED', 'REJECTED', 'REJECTED', 'REJECTED', 'APPROVED',
       'REJECTED', 'APPROVED', 'REJECTED', 'REJECTED', 'APPROVED',
       'APPROVED', 'REJECTED', 'APPROVED', 'REJECTED', 'APPROVED',
       'REJECTED', 'APPROVED', 'REJECTED', 'APPROVED', 'REJECTED',
       'REJECTED', 'REJECTED', 'REJECTED', 'APPROVED', 'REJECTED',
       'REJECTED', 'REJECTED', 'APPROVED', 'APPROVED', 'REJECTED',
       'REJECTED', 'APPROVED', 'REJECTED', 'APPROVED', 'APPROVED',
       'REJECTED', 'APPROVED', 'REJECTED', 'REJECTED', 'APPROVED',
       'REJECTED', 'APPROVED', 'REJECTED', 'REJECTED', 'APPROVED',
       'APPROVED', 'REJECTED', 'APPROVED', 'APPROVED', 'REJECT

In [99]:
two_class_df = pd.DataFrame(np.c_[income.round(2), credit_score.astype(int), y], columns = ["INCOME", "CREDIT_SCORE", "STATUS"])

In [101]:
two_class_df.index = [f"ID{str(i).zfill(3)}" for i in range(1, n_samples + 1)]

two_class_df

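|       | INCOME | CREDIT_SCORE | STATUS   |
|-------|--------|--------------|----------|
| ID001 | 54.97  | 617          | APPROVED |
| ID002 | 48.62  | 628          | APPROVED |
| ID003 | 56.48  | 654          | APPROVED |
| ID004 | 65.23  | 652          | APPROVED |
| ID005 | 47.66  | 531          | REJECTED |
| ...   | ...    | ...          | ...      |
| ID196 | 53.85  | 576          | APPROVED |
| ID197 | 41.16  | 514          | REJECTED |
| ID198 | 51.54  | 667          | APPROVED |
| ID199 | 50.58  | 594          | REJECTED |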


In [103]:
two_class_df.to_csv("two_classes.csv")

In [105]:
n_samples = 300

#Generate income and credit scores
income = np.random.normal(50, 10, n_samples) #Mean = 50k, sd = 10k
credit_score = np.random.normal(600, 50, n_samples)

X = np.c_[income, credit_score]

# Generate loan status in three classes by binning the approval probability:
# below 0.33 -> 0, from 0.33 up to 0.67 -> 1, 0.67 and above -> 2
prob_approval = 1 / (1 + np.exp(-0.05 * (income - 50) - 0.05 * (credit_score - 600)))
y = np.digitize(prob_approval, [0.33, 0.67])

y_labels = np.where(y == 0, "REJECTED", np.where(y == 1, "PENDING", "APPROVED"))

In [107]:
three_class_df = pd.DataFrame(np.c_[income.round(2), credit_score.astype(int), y_labels], columns = ["INCOME", "CREDIT_SCORE", "STATUS"])
three_class_df.tail(10)

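|     | INCOME | CREDIT_SCORE | STATUS   |
|-----|--------|--------------|----------|
| 290 | 51.66  | 591          | PENDING  |
| 291 | 54.92  | 600          | PENDING  |
| 292 | 52.89  | 545          | REJECTED |
| 293 | 74.55  | 527          | REJECTED |
| 294 | 43.62  | 679          | APPROVED |
| 295 | 44.69  | 557          | REJECTED |
| 296 | 43.77  | 550          | REJECTED |
| 297 | 44.45  | 492          | REJECTED |
| 298 | 43.63  | 568          | REJECTED |
| 299 | 61.89  | 533          | REJECTED |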


In [109]:
three_class_df.index = [f"ID{str(i).zfill(3)}" for i in range(1, n_samples + 1)]
three_class_df.to_csv("three_classes.csv")

### Logistic Regression in `sklearn`

Exercise: Two-class dataset
- Implement Logistic Regression (`train_test_split`, `test_size = 0.3`)
- Create a confusion matrix
- Calculate metrics (accuracy, precision, recall, F1 score)

In [119]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

In [137]:
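two_classes = pd.read_csv("two_classes.csv", index_col=0)  # first column holds the IDs

X = two_classes[["INCOME", "CREDIT_SCORE"]]
y = two_classes["STATUS"]  # 1-D Series; a one-column DataFrame triggers a DataConversionWarning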

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)

Confusion Matrix:
[[27  5]
 [ 3 25]]
Classification Report:
              precision    recall  f1-score   support

    APPROVED       0.90      0.84      0.87        32
    REJECTED       0.83      0.89      0.86        28

    accuracy                           0.87        60
   macro avg       0.87      0.87      0.87        60
weighted avg       0.87      0.87      0.87        60
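
The individual metrics requested in the exercise can be recovered directly from the confusion matrix printed above. The sketch below assumes `APPROVED` is treated as the positive class and relies on the `sklearn` convention that rows are actual labels and columns are predicted labels, in sorted label order (`APPROVED`, `REJECTED`).

```python
# Confusion matrix from the output above:
# rows = actual (APPROVED, REJECTED), columns = predicted (APPROVED, REJECTED)
conf_matrix = [[27, 5],
               [3, 25]]

# Treat APPROVED as the positive class (an assumption for this sketch)
tp, fn = conf_matrix[0]  # actual APPROVED: predicted APPROVED / REJECTED
fp, tn = conf_matrix[1]  # actual REJECTED: predicted APPROVED / REJECTED

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy:.2f}")   # 0.87
print(f"precision = {precision:.2f}")  # 0.90
print(f"recall    = {recall:.2f}")     # 0.84
print(f"f1        = {f1:.2f}")         # 0.87
```

These values match the `APPROVED` row of the classification report. With string labels, the imported `precision_score`, `recall_score`, and `f1_score` functions should give the same numbers when called with `pos_label="APPROVED"`.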



