## Lab 11. Hyperparameter Tuning

In this lab we will use three different datasets:

1. Insurance
2. Loan Eligibility
3. Titanic

For __each__ of these datasets please perform the following steps:

#### Part 1
1. Import and inspect raw data as a pandas data frame
2. Perform feature transformation as appropriate and create feature and label arrays X and y
#### Part 2
- Read about ROC (receiver operating characteristic) and AUC (area under the curve)
3. Split your data into a training and test set
4. Fit a min-max scaler and transform the training and test feature sets
5. Configure grid search for a regularized logistic regression model. Include the following hyperparameters in your grids: 
    - Regularization constant C (C is inversely proportional to alpha)
    - Penalty type: 'l1' or 'l2' 
    - Solver: 'newton-cg', 'lbfgs', 'liblinear', 'sag' or 'saga' (don't worry about understanding how the solvers work)
    - Number of iterations 'max_iter'
    - Review https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html for more details
6. Fit your grid search on X_train, y_train with n-fold cross-validation (n = 5 is a reasonable choice)
7. Find the best parameters and substitute them into an instance of a logistic regression model LogisticRegression(). Train and test your model on the training and test sets, respectively.
8. Train a baseline LogisticRegression() model with its default hyperparameters. Do you see an improvement in the model scores after hyperparameter tuning?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
import warnings
warnings.filterwarnings("ignore")

## Lab 11 - Partial Solutions

#### Part 1

Steps 1 and 2 for each of the datasets

#### 1. Insurance

In [2]:
df = pd.read_csv('../data/insurance2.csv')
df['region'] = df['region'].astype('str')
y = df['insuranceclaim']
df=df.drop(columns = ['insuranceclaim'])
df = pd.get_dummies(data = df, drop_first = True)
X = df.to_numpy()

#### 2. Loan Eligibility

In [3]:
df = pd.read_csv('../data/loan_eligibility.csv')
# Fill missing values: mode for categorical columns, median for numeric ones
for col in ['Gender', 'Married', 'Dependents', 'Self_Employed', 'Credit_History']:
    df[col] = df[col].fillna(df[col].mode()[0])
for col in ['LoanAmount', 'Loan_Amount_Term']:
    df[col] = df[col].fillna(df[col].median())
# Use .loc instead of chained indexing so the assignment modifies df itself
df.loc[df['Dependents'] == '3+', 'Dependents'] = '3'
df['Dependents'] = df['Dependents'].astype("int")
Label = df["Loan_Status"].to_numpy()
le = LabelEncoder()
y = le.fit_transform(Label)
X_df = df.drop(["Loan_Status",'Loan_ID'], axis=1)
X_df = pd.get_dummies(X_df, drop_first=True)
X = X_df.to_numpy()
features = X_df.columns

#### 3. Titanic

In [4]:
df = pd.read_csv("../data/titanic.csv")
df = df.drop(['PassengerId','Name','Ticket','Cabin'], axis=1)
# Impute missing ages with a fixed value per passenger class (1st: 37, 2nd: 29, otherwise: 24)
age_by_class = df['Pclass'].map({1: 37, 2: 29}).fillna(24)
df['Age'] = df['Age'].fillna(age_by_class)

df = pd.get_dummies(df, drop_first=True)
X = df.iloc[:, 1:].values
y = df.iloc[:, 0].values
features = list(df.iloc[:, 1:].columns)


### ROC (Receiver Operating Characteristic) Curve

The ROC curve is a graphical representation of a binary classifier's performance across different threshold values. It plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings. Here's how these rates are defined:

- **True Positive Rate (TPR)**, also known as Sensitivity or Recall:
  $$TPR = \frac{\text{True Positives (TP)}}{\text{True Positives (TP) + False Negatives (FN)}}$$
  - TPR measures the proportion of actual positives that are correctly identified.

- **False Positive Rate (FPR)**:
  $$FPR = \frac{\text{False Positives (FP)}}{\text{False Positives (FP) + True Negatives (TN)}}$$
  - FPR measures the proportion of actual negatives that are incorrectly identified as positives.

Each point on the ROC curve corresponds to a specific threshold. The curve runs from the bottom-left (0,0), where all predictions are negative, to the top-right (1,1), where all predictions are positive.
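
The definitions above can be checked by hand. The sketch below uses a small set of made-up labels and scores (not taken from any of the lab datasets) and a hypothetical helper `tpr_fpr` that counts TP, FN, FP and TN at a given threshold:

```python
# Toy labels and predicted scores (made up for illustration).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.10, 0.30, 0.45, 0.70, 0.40, 0.60, 0.80, 0.90]


def tpr_fpr(y_true, y_score, threshold):
    """Return (TPR, FPR) when scores >= threshold are predicted positive."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)


# Each threshold gives one (FPR, TPR) point on the ROC curve.
for threshold in (1.01, 0.75, 0.50, 0.25, 0.00):
    tpr, fpr = tpr_fpr(y_true, y_score, threshold)
    print(f"threshold={threshold:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

A threshold above every score predicts all negatives and lands on (0,0); a threshold of 0 predicts all positives and lands on (1,1).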

### AUC (Area Under the Curve)

The AUC measures the two-dimensional area underneath the entire ROC curve from (0,0) to (1,1). It summarizes the model's performance in a single scalar value between 0 and 1, where:

- **AUC = 1**: Perfect model. Every positive instance is scored above every negative instance, so the classes can be separated without error.
- **AUC = 0.5**: Random guessing. The model has no capacity to distinguish between positive and negative instances.
- **AUC < 0.5**: Worse than random guessing. The model's scores are inversely related to the actual classes, so reversing its predictions would give an AUC above 0.5.
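
The three cases can be reproduced with `roc_auc_score` from scikit-learn. The labels and scores below are invented for illustration only:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 0, 1, 1, 1]

perfect = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]    # every positive outranks every negative
constant = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]   # no ranking information at all
inverted = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]   # every negative outranks every positive

auc_perfect = roc_auc_score(y_true, perfect)
auc_constant = roc_auc_score(y_true, constant)
auc_inverted = roc_auc_score(y_true, inverted)

# Reversing the scores of the inverted model recovers a perfect ranking.
auc_flipped = roc_auc_score(y_true, [1 - s for s in inverted])

print(auc_perfect, auc_constant, auc_inverted, auc_flipped)
```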

### Interpretation

- **High AUC**: Indicates good separability: the model distinguishes between positive and negative classes well.
- **Low AUC**: Indicates poor separability: the model struggles to distinguish between positive and negative classes.

### Example

Consider a binary classifier predicting whether an email is spam or not. By varying the threshold, we can get different TPR and FPR values, which can be plotted to form the ROC curve. The AUC of this curve provides an aggregated measure of the model's performance across all classification thresholds.

### Practical Use

- **Model Comparison**: ROC and AUC are particularly useful for comparing different classification models.
- **Threshold Selection**: The ROC curve helps in selecting an optimal threshold by analyzing the trade-off between TPR and FPR.
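
One possible way to pick a threshold from the ROC curve is to maximize TPR minus FPR (often called Youden's J statistic). This criterion is an assumption made for this sketch, not something the lab prescribes, and the data are again made up:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy labels and scores (illustrative only).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.10, 0.30, 0.45, 0.55, 0.40, 0.60, 0.80, 0.90])

# roc_curve returns one (FPR, TPR) pair per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Youden's J: vertical distance between the ROC curve and the diagonal.
j = tpr - fpr
best = int(np.argmax(j))
best_threshold = thresholds[best]

print(f"threshold={best_threshold:.2f}  TPR={tpr[best]:.2f}  FPR={fpr[best]:.2f}")
```

Other criteria are equally valid; for example, a spam filter might cap the FPR at a fixed value and take the highest TPR available under that cap.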


<img src="../images/roc_curve.png" alt="ROC curve" width="400" height="400">

### Part 2

- For Step 7, don't forget to fill in the best set of hyperparameters

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)
sc = MinMaxScaler()
sc.fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)
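
An alternative worth knowing about: when the scaler is fitted on all of `X_train` before cross-validation, each validation fold has already influenced the scaling. Wrapping the scaler and the model in a scikit-learn `Pipeline` refits the scaler inside every fold. The sketch below uses synthetic data from `make_classification` as a stand-in for `X` and `y`; the names `X_demo`, `pipe` and `search` are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for X, y (assumption: not one of the lab datasets).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=42
)

pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
grid = {"clf__C": [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(pipe, grid, cv=5, scoring="roc_auc")
search.fit(X_tr, y_tr)  # the scaler is refit on the training part of each fold

print(search.best_params_)
print(search.score(X_te, y_te))  # ROC AUC on the held-out test set
```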

In [6]:
lr = LogisticRegression()

parameters = {'penalty':['l1', 'l2'], 
              'solver' : ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
              'C':[0.01, 0.10, 1, 10, 100],
              'max_iter': [10, 100, 1000]}


hyper_tune = GridSearchCV(lr, parameters, cv = 5, scoring = 'roc_auc', return_train_score = True, verbose = 3)
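
Not every solver supports every penalty: among the five solvers listed, only 'liblinear' and 'saga' accept 'l1', which is why the other 'l1' combinations score `nan` in the output below. `GridSearchCV` also accepts a list of dictionaries, so the invalid combinations can be left out. A minimal sketch, where `parameters_valid` is a hypothetical name and `ParameterGrid` is used only to count candidates:

```python
from sklearn.model_selection import ParameterGrid

C_values = [0.01, 0.10, 1, 10, 100]
iterations = [10, 100, 1000]
all_solvers = ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]

# One dictionary per penalty, each listing only the solvers that support it.
parameters_valid = [
    {"penalty": ["l1"], "solver": ["liblinear", "saga"],
     "C": C_values, "max_iter": iterations},
    {"penalty": ["l2"], "solver": all_solvers,
     "C": C_values, "max_iter": iterations},
]

# The full Cartesian grid, for comparison.
parameters_full = {"penalty": ["l1", "l2"], "solver": all_solvers,
                   "C": C_values, "max_iter": iterations}

n_valid = len(ParameterGrid(parameters_valid))
n_full = len(ParameterGrid(parameters_full))
print(n_valid, n_full)
```

Passing `parameters_valid` to `GridSearchCV` in place of the single dictionary would skip the fits that can only fail.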

In [7]:
hyper_tune.fit(X_train, y_train)

Fitting 5 folds for each of 150 candidates, totalling 750 fits
[CV 1/5] END C=0.01, max_iter=10, penalty=l1, solver=newton-cg;, score=(train=nan, test=nan) total time=   0.0s
[CV 2/5] END C=0.01, max_iter=10, penalty=l1, solver=newton-cg;, score=(train=nan, test=nan) total time=   0.0s
[CV 3/5] END C=0.01, max_iter=10, penalty=l1, solver=newton-cg;, score=(train=nan, test=nan) total time=   0.0s
[CV 4/5] END C=0.01, max_iter=10, penalty=l1, solver=newton-cg;, score=(train=nan, test=nan) total time=   0.0s
[CV 5/5] END C=0.01, max_iter=10, penalty=l1, solver=newton-cg;, score=(train=nan, test=nan) total time=   0.0s
[CV 1/5] END C=0.01, max_iter=10, penalty=l1, solver=lbfgs;, score=(train=nan, test=nan) total time=   0.0s
[CV 2/5] END C=0.01, max_iter=10, penalty=l1, solver=lbfgs;, score=(train=nan, test=nan) total time=   0.0s
[CV 3/5] END C=0.01, max_iter=10, penalty=l1, solver=lbfgs;, score=(train=nan, test=nan) total time=   0.0s
[CV 4/5] END C=0.01, max_iter=10, penalty=l1, solver=

[CV 5/5] END C=0.01, max_iter=100, penalty=l2, solver=sag;, score=(train=0.827, test=0.891) total time=   0.0s
[CV 1/5] END C=0.01, max_iter=100, penalty=l2, solver=saga;, score=(train=0.849, test=0.768) total time=   0.0s
[CV 2/5] END C=0.01, max_iter=100, penalty=l2, solver=saga;, score=(train=0.818, test=0.882) total time=   0.0s
[CV 3/5] END C=0.01, max_iter=100, penalty=l2, solver=saga;, score=(train=0.846, test=0.769) total time=   0.0s
[CV 4/5] END C=0.01, max_iter=100, penalty=l2, solver=saga;, score=(train=0.840, test=0.831) total time=   0.0s
[CV 5/5] END C=0.01, max_iter=100, penalty=l2, solver=saga;, score=(train=0.827, test=0.891) total time=   0.0s
[CV 1/5] END C=0.01, max_iter=1000, penalty=l1, solver=newton-cg;, score=(train=nan, test=nan) total time=   0.0s
[CV 2/5] END C=0.01, max_iter=1000, penalty=l1, solver=newton-cg;, score=(train=nan, test=nan) total time=   0.0s
[CV 3/5] END C=0.01, max_iter=1000, penalty=l1, solver=newton-cg;, score=(train=nan, test=nan) total 

[CV 1/5] END C=0.1, max_iter=100, penalty=l1, solver=saga;, score=(train=0.843, test=0.769) total time=   0.0s
[CV 2/5] END C=0.1, max_iter=100, penalty=l1, solver=saga;, score=(train=0.811, test=0.883) total time=   0.0s
[CV 3/5] END C=0.1, max_iter=100, penalty=l1, solver=saga;, score=(train=0.840, test=0.778) total time=   0.0s
[CV 4/5] END C=0.1, max_iter=100, penalty=l1, solver=saga;, score=(train=0.833, test=0.809) total time=   0.0s
[CV 5/5] END C=0.1, max_iter=100, penalty=l1, solver=saga;, score=(train=0.813, test=0.887) total time=   0.0s
[CV 1/5] END C=0.1, max_iter=100, penalty=l2, solver=newton-cg;, score=(train=0.856, test=0.776) total time=   0.0s
[CV 2/5] END C=0.1, max_iter=100, penalty=l2, solver=newton-cg;, score=(train=0.825, test=0.888) total time=   0.0s
[CV 3/5] END C=0.1, max_iter=100, penalty=l2, solver=newton-cg;, score=(train=0.850, test=0.783) total time=   0.0s
[CV 4/5] END C=0.1, max_iter=100, penalty=l2, solver=newton-cg;, score=(train=0.844, test=0.828) 

[CV 1/5] END C=1, max_iter=10, penalty=l1, solver=saga;, score=(train=0.858, test=0.776) total time=   0.0s
[CV 2/5] END C=1, max_iter=10, penalty=l1, solver=saga;, score=(train=0.828, test=0.891) total time=   0.0s
[CV 3/5] END C=1, max_iter=10, penalty=l1, solver=saga;, score=(train=0.855, test=0.785) total time=   0.0s
[CV 4/5] END C=1, max_iter=10, penalty=l1, solver=saga;, score=(train=0.847, test=0.825) total time=   0.0s
[CV 5/5] END C=1, max_iter=10, penalty=l1, solver=saga;, score=(train=0.832, test=0.890) total time=   0.0s
[CV 1/5] END C=1, max_iter=10, penalty=l2, solver=newton-cg;, score=(train=0.858, test=0.778) total time=   0.0s
[CV 2/5] END C=1, max_iter=10, penalty=l2, solver=newton-cg;, score=(train=0.828, test=0.890) total time=   0.0s
[CV 3/5] END C=1, max_iter=10, penalty=l2, solver=newton-cg;, score=(train=0.854, test=0.792) total time=   0.0s
[CV 4/5] END C=1, max_iter=10, penalty=l2, solver=newton-cg;, score=(train=0.847, test=0.824) total time=   0.0s
[CV 5/5]

[CV 4/5] END C=1, max_iter=1000, penalty=l1, solver=sag;, score=(train=nan, test=nan) total time=   0.0s
[CV 5/5] END C=1, max_iter=1000, penalty=l1, solver=sag;, score=(train=nan, test=nan) total time=   0.0s
[CV 1/5] END C=1, max_iter=1000, penalty=l1, solver=saga;, score=(train=0.858, test=0.777) total time=   0.0s
[CV 2/5] END C=1, max_iter=1000, penalty=l1, solver=saga;, score=(train=0.828, test=0.891) total time=   0.0s
[CV 3/5] END C=1, max_iter=1000, penalty=l1, solver=saga;, score=(train=0.855, test=0.786) total time=   0.0s
[CV 4/5] END C=1, max_iter=1000, penalty=l1, solver=saga;, score=(train=0.847, test=0.824) total time=   0.0s
[CV 5/5] END C=1, max_iter=1000, penalty=l1, solver=saga;, score=(train=0.833, test=0.890) total time=   0.0s
[CV 1/5] END C=1, max_iter=1000, penalty=l2, solver=newton-cg;, score=(train=0.858, test=0.778) total time=   0.0s
[CV 2/5] END C=1, max_iter=1000, penalty=l2, solver=newton-cg;, score=(train=0.828, test=0.890) total time=   0.0s
[CV 3/5] E

[CV 2/5] END C=10, max_iter=100, penalty=l1, solver=liblinear;, score=(train=0.829, test=0.895) total time=   0.0s
[CV 3/5] END C=10, max_iter=100, penalty=l1, solver=liblinear;, score=(train=0.857, test=0.792) total time=   0.0s
[CV 4/5] END C=10, max_iter=100, penalty=l1, solver=liblinear;, score=(train=0.848, test=0.829) total time=   0.0s
[CV 5/5] END C=10, max_iter=100, penalty=l1, solver=liblinear;, score=(train=0.836, test=0.885) total time=   0.0s
[CV 1/5] END C=10, max_iter=100, penalty=l1, solver=sag;, score=(train=nan, test=nan) total time=   0.0s
[CV 2/5] END C=10, max_iter=100, penalty=l1, solver=sag;, score=(train=nan, test=nan) total time=   0.0s
[CV 3/5] END C=10, max_iter=100, penalty=l1, solver=sag;, score=(train=nan, test=nan) total time=   0.0s
[CV 4/5] END C=10, max_iter=100, penalty=l1, solver=sag;, score=(train=nan, test=nan) total time=   0.0s
[CV 5/5] END C=10, max_iter=100, penalty=l1, solver=sag;, score=(train=nan, test=nan) total time=   0.0s
[CV 1/5] END C=

[CV 1/5] END C=100, max_iter=100, penalty=l2, solver=lbfgs;, score=(train=0.860, test=0.782) total time=   0.0s
[CV 2/5] END C=100, max_iter=100, penalty=l2, solver=lbfgs;, score=(train=0.830, test=0.895) total time=   0.0s
[CV 3/5] END C=100, max_iter=100, penalty=l2, solver=lbfgs;, score=(train=0.857, test=0.793) total time=   0.0s
[CV 4/5] END C=100, max_iter=100, penalty=l2, solver=lbfgs;, score=(train=0.848, test=0.828) total time=   0.0s
[CV 5/5] END C=100, max_iter=100, penalty=l2, solver=lbfgs;, score=(train=0.836, test=0.883) total time=   0.0s
[CV 1/5] END C=100, max_iter=100, penalty=l2, solver=liblinear;, score=(train=0.860, test=0.782) total time=   0.0s
[CV 2/5] END C=100, max_iter=100, penalty=l2, solver=liblinear;, score=(train=0.830, test=0.895) total time=   0.0s
[CV 3/5] END C=100, max_iter=100, penalty=l2, solver=liblinear;, score=(train=0.857, test=0.793) total time=   0.0s
[CV 4/5] END C=100, max_iter=100, penalty=l2, solver=liblinear;, score=(train=0.848, test=0.

GridSearchCV(cv=5, estimator=LogisticRegression(),
             param_grid={'C': [0.01, 0.1, 1, 10, 100],
                         'max_iter': [10, 100, 1000], 'penalty': ['l1', 'l2'],
                         'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag',
                                    'saga']},
             return_train_score=True, scoring='roc_auc', verbose=3)

In [8]:
hyper_tune.best_params_

{'C': 10, 'max_iter': 100, 'penalty': 'l2', 'solver': 'sag'}
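
Besides `best_params_`, a fitted grid search exposes `cv_results_`, `best_score_` and (with the default `refit=True`) `best_estimator_`. The sketch below shows one way to inspect them on synthetic data; `X_demo`, `search` and `manual_model` are hypothetical names, and the small grid is chosen only to keep the example fast:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not one of the lab datasets).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 1, 100]},
    cv=5,
    scoring="roc_auc",
)
search.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays; a DataFrame makes it easy to rank candidates.
results = pd.DataFrame(search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results.sort_values("rank_test_score")[columns])

# best_estimator_ has already been refit on all data passed to fit().
best_model = search.best_estimator_

# Manual route: unpack the dictionary instead of retyping each value.
manual_model = LogisticRegression(max_iter=1000, **search.best_params_)
manual_model.fit(X_demo, y_demo)
```

Unpacking `best_params_` avoids transcription slips when the values are copied into a new model by hand.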

In [9]:
lr0 = LogisticRegression()
lr0.fit(X_train, y_train)
print(classification_report(y_test, lr0.predict(X_test))[:220])
confusion_matrix(y_test, lr0.predict(X_test))

              precision    recall  f1-score   support

           0       0.81      0.86      0.84       157
           1       0.78      0.72      0.75       111

    accuracy                           0.80       268
  


array([[135,  22],
       [ 31,  80]], dtype=int64)

In [10]:
lr1 = LogisticRegression(C = 10, max_iter = 100, penalty = 'l2', solver = 'sag')  # values from hyper_tune.best_params_
lr1.fit(X_train, y_train)
print(classification_report(y_test, lr1.predict(X_test))[:220])
confusion_matrix(y_test, lr1.predict(X_test))

              precision    recall  f1-score   support

           0       0.83      0.87      0.85       157
           1       0.80      0.74      0.77       111

    accuracy                           0.82       268
  


array([[137,  20],
       [ 29,  82]], dtype=int64)

In [11]:
# Compute and compare areas under the curve
fpr, tpr, thresholds = roc_curve(y_test, lr0.predict(X_test))
auc(fpr, tpr)

0.7902966660928444

In [12]:
fpr, tpr, thresholds = roc_curve(y_test, lr1.predict(X_test))
auc(fpr, tpr)

0.8056751018534458
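
A note on the two AUC cells: `roc_curve` is given the hard 0/1 output of `predict`, so each curve has a single interior point and the area reduces to (1 + TPR - FPR) / 2. The first part of the sketch below checks this against the baseline confusion matrix printed earlier. The second part, on synthetic data (an assumption, not a lab dataset), shows the more common practice of passing the positive-class probabilities from `predict_proba`, which lets the curve sweep over all thresholds:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Part A: baseline confusion matrix from above, laid out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = 135, 22, 31, 80
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)
auc_hard_baseline = (1 + tpr - fpr) / 2
print(auc_hard_baseline)  # agrees with the baseline AUC reported above

# Part B: labels versus probabilities on synthetic data.
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=1
)
model = LogisticRegression().fit(X_tr, y_tr)

auc_from_labels = roc_auc_score(y_te, model.predict(X_te))
auc_from_scores = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(auc_from_labels, auc_from_scores)
```

The two numbers in Part B generally differ, because the label-based value reflects only the default 0.5 threshold while the score-based value reflects the full ranking.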