# Logistic Regression

**Basic Description**

Logistic regression is one of the simplest and most popular classification algorithms. It passes a linear combination of the features through the sigmoid function, which converts the linear output into a probability between 0 and 1. Logistic regression is highly interpretable and easier to explain than most other models.
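The conversion from a linear output to a probability can be sketched in a few lines. This is a minimal illustration, not the model fitted below: the weights `w`, intercept `b`, and observation `x` are made-up values, and the 0.5 cutoff is the usual default threshold.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical weights, intercept, and a single observation (illustration only)
w = np.array([0.8, -1.2])
b = 0.5
x = np.array([2.0, 1.0])

z = w @ x + b          # linear output: 0.8*2 - 1.2*1 + 0.5
p = sigmoid(z)         # probability of the positive class
label = int(p >= 0.5)  # classify as positive when p is at least 0.5
print(z, p, label)
```

A positive linear output gives a probability above 0.5, a negative one gives a probability below 0.5, and an output of exactly 0 gives 0.5.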

**Bias-Variance Tradeoff**

Because it fits a linear decision boundary, logistic regression tends toward high bias and low variance.

**Upsides**

It's quick to implement and can serve as a good baseline for performance.

**Downsides**

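It generally performs poorly when the true decision boundary is non-linear. When interpretability is desired, it's important that features are not strongly correlated, since correlated features make the individual coefficients unstable and hard to interpret.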
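One way to check for correlated features before reading coefficients is to scan the pairwise correlation matrix. The sketch below uses synthetic data, and the 0.9 threshold is an arbitrary assumption rather than a rule from this notebook.

```python
import numpy as np
import pandas as pd

# synthetic data for illustration: "b" is nearly a rescaled copy of "a"
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

corr = demo.corr().abs()   # absolute pairwise Pearson correlations
threshold = 0.9            # arbitrary cutoff, chosen for this example

# collect each pair of distinct columns whose correlation exceeds the cutoff
high_pairs = [
    (c1, c2)
    for i, c1 in enumerate(corr.columns)
    for c2 in corr.columns[i + 1:]
    if corr.loc[c1, c2] > threshold
]
print(high_pairs)
```

On real data the same scan could be run on `X_train`, assuming it is a DataFrame of numeric columns.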

**Other Notes**

## Load Packages and Prep Data

In [1]:
# custom utils
import utils
print(utils.__file__)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_selection import RFECV
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Lasso, Ridge, LogisticRegression



In [2]:
LogisticRegression?

Init signature:
LogisticRegression(
    penalty='l2',
    *,
    dual=False,
    tol=0.0001,
    C=1.0,
    fit_intercept=True,
    intercept_scaling=1,
    class_weight=None,
    random_state=None,
    solver='lbfgs',
    max_iter=100,
    multi_class=

In [3]:
# load data
X_train, y_train, X_test, y_test = utils.load_data()

X_train (62889, 42)
y_train (62889,)
X_test (15723, 42)
y_test (15723,)


## Model 1
- Default hyperparameters
- Notable
    - penalty='l2'
    - C=1.0
    - max_iter=100

### Fit Model

In [4]:
# fit logistic regression model
log_1 = LogisticRegression(n_jobs=-1)
x = log_1.fit(X_train, y_train)

### Cross validation

In [5]:
# cross validation with f1 scoring
score = utils.f1_cv(log_1, X_train, y_train)

[0.5198 0.5028 0.4952 0.4939 0.4959]
0.5015
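`utils.f1_cv` is a custom helper whose source is not shown here. Judging by its output (five fold scores followed by their mean), it probably wraps scikit-learn's `cross_val_score` with F1 scoring. The following is a hypothetical stand-in under that assumption, run on synthetic data. The name `f1_cv_sketch` and the default of 5 folds are guesses, not the actual implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def f1_cv_sketch(model, X, y, cv=5):
    """Hypothetical stand-in for utils.f1_cv: print per-fold F1 and the mean."""
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    print(np.round(scores, 4))
    print(round(scores.mean(), 4))
    return scores

# synthetic, imbalanced data for illustration only
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)
demo_scores = f1_cv_sketch(LogisticRegression(max_iter=200), X_demo, y_demo)
```

`cross_val_score` clones and refits the estimator on each training fold, so the scores depend on the `X` passed in, not on the data the model was previously fitted on.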


## Model X

In [None]:
log_x = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')

## Model 2
- Regularize by selecting important features from recursive feature elimination ranking

### Feature Selection

In [6]:
# recursive feature elimination to determine feature importance
log = LogisticRegression(max_iter = 200)
model_rfe = RFECV(log, cv = 5)
x = model_rfe.fit(X_train, y_train)

In [7]:
# feature ranking
rfe = model_rfe.ranking_
features = X_train.columns
rfe_df = pd.DataFrame({'features': features, 'rfe_rank': rfe})
rfe_df.sort_values(by = 'rfe_rank', ascending = True)

NameError: name 'pd' is not defined

In [None]:
# features ranked 1 are the most important
# select only the more important features as a means of regularization
selected_features = rfe_df[rfe_df['rfe_rank'] == 1]['features'].values
X_train_new = X_train[selected_features]
X_test_new = X_test[selected_features]

In [None]:
# 16 important features
X_train_new.shape

(62889, 16)
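Filtering on `ranking_ == 1` is equivalent to using the boolean mask that `RFECV` exposes as `support_`. The sketch below shows both on a small synthetic dataset. The column names and dataset sizes are made up for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# small synthetic dataset with hypothetical column names f0..f7
X_arr, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

selector = RFECV(LogisticRegression(max_iter=200), cv=5).fit(X_demo, y_demo)

# two equivalent ways to get the selected columns
by_rank = X_demo.columns[selector.ranking_ == 1]  # rank 1 means "kept"
by_mask = X_demo.columns[selector.support_]       # boolean mask of kept features

print(selector.n_features_, list(by_mask))
```

`selector.transform(X_demo)` would return the same columns as a NumPy array, but indexing the DataFrame keeps the column names.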

### Fit Model

In [None]:
# fit logistic regression model
# using max_iter 400 to avoid error message
# using X_train_new to select only the 16 features that were selected
log_2 = LogisticRegression(max_iter = 400)
x = log_2.fit(X_train_new, y_train)

### Cross validation

In [None]:
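# cross validation with f1 scoring
# note: this passes X_train (all 42 features), not X_train_new, so the scores
# below are for the full feature set (they match Model 1);
# pass X_train_new to evaluate the 16 selected features
score = utils.f1_cv(log_2, X_train, y_train)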

[0.5198 0.5028 0.4952 0.4939 0.4959]
0.5015


## Model 3
- Adjust hyperparameters

### Fit Model

In [None]:
# tried a variety of hyperparameter combinations; this one scored best
# fit on the same selected features as Model 2
log_3 = LogisticRegression(max_iter = 600, penalty="l1", tol=0.001, solver="saga")
x = log_3.fit(X_train_new, y_train)

### Cross-Validation
- Slight improvement

In [None]:
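# cross validation with f1 scoring
# note: this passes X_train (all 42 features), not X_train_new, so the scores
# below are for the full feature set; pass X_train_new to evaluate the selected features
score = utils.f1_cv(log_3, X_train, y_train)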

[0.5588 0.5344 0.5327 0.5215 0.535 ]
0.5365


## Test Performance
Final test performance of the model with the best cross-validation score. These test scores will be used to compare logistic regression against other model types.

In [None]:
# test the performance of the selected model
y_pred = log_3.predict(X_test_new)

utils.pred_metrics(y_test,y_pred)

# confusion matrix
utils.cm_plot(y_test,y_pred)

Accuracy:	0.9452394581186796
Precision:	0.82
Recall:		0.2645161290322581
F1:		0.4
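The reported F1 can be checked by hand, since F1 is the harmonic mean of precision and recall. High accuracy combined with low recall suggests the positive class is rare, which is why F1 is more informative than accuracy here. The function `pred_metrics_sketch` below is a hypothetical stand-in for `utils.pred_metrics`, whose source is not shown. It is built from standard `sklearn.metrics` calls and demonstrated on a tiny made-up example.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def pred_metrics_sketch(y_true, y_pred):
    """Hypothetical stand-in for utils.pred_metrics: return the four scores."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }

# recompute F1 from the precision and recall reported above
precision, recall = 0.82, 0.2645161290322581
f1_from_report = 2 * precision * recall / (precision + recall)
print(round(f1_from_report, 4))

# tiny made-up example: one of three positives is found, with no false positives
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_demo = [1, 0, 0, 0, 0, 0, 0, 0]
demo_metrics = pred_metrics_sketch(y_true_demo, y_pred_demo)
print(demo_metrics)
```

In the made-up example, accuracy is 0.75 even though two of the three positives are missed, which mirrors the gap between accuracy and recall in the test results above.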
