# Logistic Regression Model Tuning

In [2]:
import numpy as np
import pandas as pd

Using the [Heart Attack Analysis & Prediction Dataset](https://www.kaggle.com/rashikrahmanpritom/heart-attack-analysis-prediction-dataset)

In [3]:
df = pd.read_csv("../data/heart.csv")

df.describe()

| | age | sex | cp | trtbps | chol | fbs | restecg | thalachh | exng | oldpeak | slp | caa | thall | output |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 |
| mean | 54.366337 | 0.683168 | 0.966997 | 131.623762 | 246.264026 | 0.148515 | 0.528053 | 149.646865 | 0.326733 | 1.039604 | 1.39934 | 0.729373 | 2.313531 | 0.544554 |
| std | 9.082101 | 0.466011 | 1.032052 | 17.538143 | 51.830751 | 0.356198 | 0.52586 | 22.905161 | 0.469794 | 1.161075 | 0.616226 | 1.022606 | 0.612277 | 0.498835 |
| min | 29.0 | 0.0 | 0.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 47.5 | 0.0 | 0.0 | 120.0 | 211.0 | 0.0 | 0.0 | 133.5 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 50% | 55.0 | 1.0 | 1.0 | 130.0 | 240.0 | 0.0 | 1.0 | 153.0 | 0.0 | 0.8 | 1.0 | 0.0 | 2.0 | 1.0 |
| 75% | 61.0 | 1.0 | 2.0 | 140.0 | 274.5 | 0.0 | 1.0 | 166.0 | 1.0 | 1.6 | 2.0 | 1.0 | 3.0 | 1.0 |
| max | 77.0 | 1.0 | 3.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 2.0 | 4.0 | 3.0 | 1.0 |


Let's check the row count and the class balance of the target column, `output`.

In [45]:
df.shape[0] 

303

In [5]:
df['output'].value_counts()

1    165
0    138
Name: output, dtype: int64

In [6]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

train_x, train_y = train[df.columns.difference(["output"])], train["output"]
test_x, test_y = test[df.columns.difference(["output"])], test["output"]
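
The split above is unseeded and unstratified, so the scores reported below will shift from run to run. A minimal sketch of a repeatable, stratified split follows. The `random_state` value and the synthetic stand-in frame (which only copies the row and class counts shown above) are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same row count and class counts as heart.csv
# (165 positives, 138 negatives); the feature values are random placeholders.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "age": rng.integers(29, 78, size=303),
    "chol": rng.integers(126, 565, size=303),
    "output": np.array([1] * 165 + [0] * 138),
})

feature_cols = demo_df.columns.difference(["output"])

# stratify keeps the class ratio the same in both parts;
# random_state (an arbitrary choice here) makes the split repeatable.
demo_train, demo_test = train_test_split(
    demo_df, test_size=0.2, stratify=demo_df["output"], random_state=42
)

demo_train_x, demo_train_y = demo_train[feature_cols], demo_train["output"]
demo_test_x, demo_test_y = demo_test[feature_cols], demo_test["output"]

# Share of positive rows in each part; the two values should be close.
print(demo_train_y.mean(), demo_test_y.mean())
```

With the real data, the same two keyword arguments could be added to the existing `train_test_split(df, test_size=0.2)` call.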

## Basic Logistic Regression, with no added parameters

In [27]:
from sklearn.linear_model import LogisticRegression

logistic_regression = LogisticRegression()
logistic_regression.fit(train_x, train_y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()

With the default `max_iter` of 100, the default `lbfgs` solver does not converge on the unscaled features. Let's see if raising the limit helps.

In [28]:
logistic_regression = LogisticRegression(max_iter=1000)
logistic_regression.fit(train_x, train_y)

LogisticRegression(max_iter=1000)
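
The warning also suggests scaling the data as an alternative to raising `max_iter`. A minimal sketch of that route, using a `StandardScaler` inside a pipeline, is shown here. The synthetic data with deliberately mixed feature scales is a stand-in assumption, not the heart dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; columns are stretched to very different ranges
# to mimic raw, unscaled measurements.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X = X * np.array([1.0, 10.0, 50.0, 100.0, 0.5])

# Standardize each feature, then fit with the default solver and max_iter.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X, y)

# n_iter_ reports how many iterations the solver actually used.
iterations_used = scaled_model.named_steps["logisticregression"].n_iter_[0]
print(f"Iterations used: {iterations_used}")
```

Keeping the scaler in the pipeline means it is fitted on the training rows only and then reapplied to whatever is passed to `predict`.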

So, it converges once the iteration limit is raised. Let's quantify its performance.

In [36]:
from sklearn import metrics

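def print_metrics(model, test_x, test_y):
    """Print standard binary-classification metrics for `model` on the test set."""
    test_pred = model.predict(test_x)  # hard 0/1 class labels

    print(f"Accuracy: {metrics.accuracy_score(test_y, test_pred)}")
    print(f"F1: {metrics.f1_score(test_y, test_pred)}")
    print(f"Precision: {metrics.precision_score(test_y, test_pred)}")
    print(f"Recall: {metrics.recall_score(test_y, test_pred)}")
    # Note: AUC is computed from the hard labels here, not from predicted probabilities.
    print(f"AUC: {metrics.roc_auc_score(test_y, test_pred)}")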

In [37]:
print_metrics(logistic_regression, test_x, test_y)

Accuracy: 0.8360655737704918
F1: 0.8571428571428571
Precision: 0.8108108108108109
Recall: 0.9090909090909091
AUC: 0.8295454545454546
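
`print_metrics` passes hard class predictions to `roc_auc_score`. For a binary problem that reduces to balanced accuracy, the mean of recall and specificity. The 0.8295 above appears consistent with this: it is the mean of the 0.9091 recall and the 0.75 specificity implied by the other scores. A sketch of the more usual approach, scoring the predicted probability of the positive class, follows. The synthetic data and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the heart dataset.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard 0/1 labels: a single operating point on the ROC curve.
label_auc = metrics.roc_auc_score(y_te, model.predict(X_te))

# AUC from probabilities of class 1: uses the full ranking of the test rows.
proba_auc = metrics.roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

# Balanced accuracy, for comparison with the label-based AUC.
balanced_acc = metrics.balanced_accuracy_score(y_te, model.predict(X_te))

print(f"label AUC={label_auc:.3f}  balanced acc={balanced_acc:.3f}  proba AUC={proba_auc:.3f}")
```

The probability-based value can change when the model changes even if no predicted label flips, which makes it a finer-grained signal than the label-based one.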


So the model already performs reasonably well, with about 84% accuracy on the 61 test rows. Let's see if it can be improved.

## Tuning Logistic Regression

LogisticRegression by default uses an *l2* penalty. How do other penalty types affect it?

In [38]:
l1_regression = LogisticRegression(max_iter=10000, penalty="l1", solver="saga")
l1_regression.fit(train_x, train_y)

print_metrics(l1_regression, test_x, test_y)

Accuracy: 0.8524590163934426
F1: 0.8732394366197183
Precision: 0.8157894736842105
Recall: 0.9393939393939394
AUC: 0.8446969696969697


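So the l1-regularized model scores slightly higher than the default l2-regularized one: accuracy rises from 0.836 to 0.852, which on 61 test rows is a single additional correct prediction. But how much of that is due to the fact that I have also changed the solver from the default `lbfgs` to `saga`?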

In [39]:
l2_saga_regression = LogisticRegression(max_iter=10000, penalty="l2", solver="saga")
l2_saga_regression.fit(train_x, train_y)

print_metrics(l2_saga_regression, test_x, test_y)

Accuracy: 0.8524590163934426
F1: 0.8732394366197183
Precision: 0.8157894736842105
Recall: 0.9393939393939394
AUC: 0.8446969696969697
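
Because the l2-penalized objective is convex, `lbfgs` and `saga` should arrive at nearly the same coefficients once both have converged, so a score gap between them may point to incomplete convergence rather than a genuinely better model. A sketch of that check on standardized synthetic data follows. The data, the tight `tol`, and the seed are all illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, standardized so both solvers converge quickly.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X = StandardScaler().fit_transform(X)

# Same default l2 penalty and C for both fits; only the solver differs.
lbfgs_model = LogisticRegression(solver="lbfgs", tol=1e-6, max_iter=10000).fit(X, y)
saga_model = LogisticRegression(solver="saga", tol=1e-6, max_iter=10000,
                                random_state=0).fit(X, y)

# Largest absolute difference between the two coefficient vectors.
max_coef_gap = np.max(np.abs(lbfgs_model.coef_ - saga_model.coef_))

print(f"lbfgs iterations: {lbfgs_model.n_iter_[0]}")
print(f"saga iterations:  {saga_model.n_iter_[0]}")
print(f"max coefficient gap: {max_coef_gap:.6f}")
```

A large gap on the real, unscaled features would suggest that at least one of the two fits stopped before reaching the optimum.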


What about the elasticnet penalty?

In [40]:
elasticnet_regression = LogisticRegression(max_iter=10000, penalty="elasticnet", solver="saga", l1_ratio=0.5)
elasticnet_regression.fit(train_x, train_y)

print_metrics(elasticnet_regression, test_x, test_y)

Accuracy: 0.8524590163934426
F1: 0.8732394366197183
Precision: 0.8157894736842105
Recall: 0.9393939393939394
AUC: 0.8446969696969697


So with the `saga` solver, the regularization type (l1, l2, or elasticnet) makes no difference to any of the test metrics. What about stronger and weaker regularization?

In [41]:
l2_stronger_regression = LogisticRegression(max_iter=10000, penalty="l2", solver="saga", C=0.1)
l2_stronger_regression.fit(train_x, train_y)

print("Stronger Regularization:")
print_metrics(l2_stronger_regression, test_x, test_y)

Stronger Regularization:
Accuracy: 0.8524590163934426
F1: 0.8732394366197183
Precision: 0.8157894736842105
Recall: 0.9393939393939394
AUC: 0.8446969696969697


In [42]:
l2_weaker_regression = LogisticRegression(max_iter=10000, penalty="l2", solver="saga", C=100)
l2_weaker_regression.fit(train_x, train_y)

print("Weaker Regularization:")
print_metrics(l2_weaker_regression, test_x, test_y)

Weaker Regularization:
Accuracy: 0.8524590163934426
F1: 0.8732394366197183
Precision: 0.8157894736842105
Recall: 0.9393939393939394
AUC: 0.8446969696969697
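
With only 61 test rows, a single prediction moves accuracy by about 1.6 percentage points, so one train/test split is a coarse yardstick for comparing settings. A sketch of comparing `C` values with cross-validation instead follows. The grid, the fold count, the scaling step, and the synthetic stand-in data are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the heart dataset.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

# Scaling lives inside the pipeline so each fold is scaled on its own training part.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Smaller C means stronger regularization.
c_grid = [0.01, 0.1, 1, 10, 100]
search = GridSearchCV(pipeline, {"logisticregression__C": c_grid},
                      cv=5, scoring="accuracy")
search.fit(X, y)

for c_value, mean_score in zip(search.cv_results_["param_logisticregression__C"],
                               search.cv_results_["mean_test_score"]):
    print(f"C={c_value}: mean CV accuracy {mean_score:.3f}")
print("Best:", search.best_params_)
```

Averaging over five folds uses every row for evaluation once, which should make small differences between settings easier to see than on a single 61-row test set.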


None of these changes (with a fixed `max_iter` and solver) has had any effect on the scores. Why does that happen?
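
One way to start answering this is to look past the scores at the fitted models themselves: do the coefficients differ, and how many test predictions actually change? A sketch on standardized synthetic data follows. The helper `compare_models` is hypothetical, and the data is a stand-in rather than the heart dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def compare_models(models, reference_name, X_test):
    """Return {name: (coefficient L2 norm, test predictions differing from the reference)}."""
    reference_pred = models[reference_name].predict(X_test)
    summary = {}
    for name, model in models.items():
        coef_norm = float(np.linalg.norm(model.coef_))
        n_changed = int(np.sum(model.predict(X_test) != reference_pred))
        summary[name] = (coef_norm, n_changed)
    return summary


# Synthetic stand-in data, split and standardized using training statistics only.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

# Same model family, three regularization strengths (smaller C is stronger).
models = {f"C={c}": LogisticRegression(C=c, max_iter=10000).fit(X_tr, y_tr)
          for c in (0.1, 1, 100)}

summary = compare_models(models, "C=1", X_te)
for name, (coef_norm, n_changed) in summary.items():
    print(f"{name}: coefficient norm {coef_norm:.3f}, predictions changed vs C=1: {n_changed}")
```

If the norms differ but the changed-prediction counts are all zero, the models are different yet agree on every test row, which would leave every label-based metric in `print_metrics` identical.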