# Supervised ML : Classification : Logistic Regression

**Linear regression:** a continuous response is modeled as a linear combination of the features:

$$y = \beta_0 + \beta_1x$$
 
**Logistic regression:** the log-odds of a binary response being "true" (1) is modeled as a linear combination of the features:
 
$$\log \left({p\over 1-p}\right) = \beta_0 + \beta_1x$$
 
The left-hand side, the log-odds of $p$, is called the **logit function**.
 
The probability $p$ is sometimes written as $\pi$:
 
$$\log \left({\pi\over 1-\pi}\right) = \beta_0 + \beta_1x$$
 
The equation can be rearranged into the **logistic function**:
 
$$\pi = \frac{e^{\beta_0 + \beta_1x}} {1 + e^{\beta_0 + \beta_1x}}$$

In other words:
 
 - Logistic regression outputs the **probabilities of a specific class**
 - Those probabilities can be converted into **class predictions**
 
The **logistic function** has some nice properties:

 - Takes on an "s" shape
 - Output is bounded by 0 and 1
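The relationship between the logit and the logistic function, and the step from probabilities to class predictions, can be sketched in a few lines of NumPy. The coefficient values and the 0.5 threshold below are assumptions chosen only for illustration; the form $1 / (1 + e^{-z})$ used in the code is algebraically the same as $e^{z} / (1 + e^{z})$.

```python
import numpy as np

def logistic(z):
    """Map log-odds z = beta_0 + beta_1 * x to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Inverse of the logistic function: probability -> log-odds."""
    return np.log(p / (1.0 - p))

# Hypothetical coefficients, not fitted to any data
beta_0, beta_1 = -1.0, 2.0
x = np.linspace(-5, 5, 11)
z = beta_0 + beta_1 * x

probs = logistic(z)                    # s-shaped in x, bounded by 0 and 1
labels = (probs >= 0.5).astype(int)    # probabilities -> class predictions

print(np.round(probs, 3))
print(labels)
```

Applying `logit` to `probs` recovers `z`, which is one way to check that the two functions are inverses of each other.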

## Imports

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

## The Breast Cancer Dataset

In [None]:
# load data
data = load_breast_cancer()
X = data.data
y = data.target
y_labels = np.array(['malignant' if item == 0 else 'benign' for item in y])

### Details

In [None]:
print(data.DESCR)

### Prepare the Dataset for Consumption

In [None]:
df = pd.concat([pd.DataFrame(X, columns=data.feature_names), 
                pd.DataFrame(y_labels.reshape(-1,1), columns=['has cancer'])], axis=1)

In [None]:
df.sample(10)

### Exploratory Analysis

### Question 1

Import the data and do the following:

* Examine the data types. There are many columns, so it might be wise to use value counts on the dtypes
* Determine the distribution of each type of record
* Encode the target label (`has cancer`) as an integer

In [None]:
df.info()

### Classification Targets/Output

In [None]:
data.target_names

### Class Distribution

In [None]:
df['has cancer'].value_counts()

## Train-Test Split

### Question 2

* Split the data into train and test sets. This can be done using any method, but consider using Scikit-learn's `StratifiedShuffleSplit` to maintain the same ratio of target classes in both sets.
* Regardless of the method used to split the data, compare the ratio of classes in the train and test splits.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y_labels, test_size=0.3, random_state=42)
print('Training dataset shape:', X_train.shape, '\tTest dataset shape:', X_test.shape)
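The class-ratio comparison asked for in Question 2 could look like the sketch below. It reloads the dataset so that it runs on its own, and it passes `stratify=` to `train_test_split` as one possible alternative to `StratifiedShuffleSplit`; the variable names (`cancer`, `X_tr`, `ratios`, and so on) are illustrative and do not appear in the cells above.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_all, y_all = cancer.data, cancer.target

# stratify=y_all keeps the class proportions nearly identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=42, stratify=y_all)

# Fraction of each class (0 = malignant, 1 = benign) in each split
ratios = pd.DataFrame({
    'train': pd.Series(y_tr).value_counts(normalize=True),
    'test': pd.Series(y_te).value_counts(normalize=True),
}).sort_index()
print(ratios)
```

Running the same comparison on an unstratified split shows how far the two ratios drift apart by chance.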

## Modeling

### Question 3

* Fit a logistic regression model without any regularization using all of the features.
* Using cross-validation to determine the hyperparameters, fit models using L1 and L2 regularization. Store each of these models as well. Note the limitations on regularization: not every solver supports every penalty (the L1 model needs a solver such as `liblinear`). The regularized models, in particular the L1 model, will probably take a while to fit.

In [None]:
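from sklearn.linear_model import LogisticRegression

# LogisticRegression applies an L2 penalty by default. C is the inverse of the
# regularization strength, so a very large C approximates an unregularized fit.
# max_iter is raised to help the solver converge on the unscaled features.
lr = LogisticRegression(C=1e10, max_iter=10000, random_state=42)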

In [None]:
lr.fit(X_train, y_train)

In [None]:
from sklearn.linear_model import LogisticRegressionCV

# L1 regularized logistic regression
lr_l1 = LogisticRegressionCV(Cs=10, cv=4, penalty='l1', solver='liblinear').fit(X_train, y_train)

In [None]:
# L2 regularized logistic regression
lr_l2 = LogisticRegressionCV(Cs=10, cv=4, penalty='l2').fit(X_train, y_train)

### Understand Coefficients

### Question 4

Compare the magnitudes of the coefficients for each of the models. 

In [None]:
# Combine all the coefficients into a dataframe
coefficients = list()

coeff_labels = ['lr', 'l1', 'l2']
coeff_models = [lr, lr_l1, lr_l2]

for lab,mod in zip(coeff_labels, coeff_models):
    coeffs = mod.coef_
    # Two-level column label: (model name, 0)
    coeff_label = pd.MultiIndex(levels=[[lab], [0]],
                                codes=[[0], [0]])
    coefficients.append(pd.DataFrame(coeffs.T, columns=coeff_label))

coefficients = pd.concat(coefficients, axis=1)

coefficients.sample(10)
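One way to compare coefficient magnitudes numerically, rather than by eye, is to summarize the absolute values per model. The sketch below is self-contained and therefore fits its own stand-in models with a fixed `C=1.0` instead of reusing `lr`, `lr_l1` and `lr_l2`; the summary columns (`sum_abs`, `max_abs`, `n_zero`) are names chosen here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=42)

# Stand-in models: same penalties as the notebook, but with a fixed C
models = {
    'l1': LogisticRegression(penalty='l1', C=1.0, solver='liblinear'),
    'l2': LogisticRegression(penalty='l2', C=1.0, solver='liblinear'),
}

rows = {}
for name, model in models.items():
    coefs = model.fit(X_tr, y_tr).coef_.ravel()
    rows[name] = {
        'sum_abs': np.abs(coefs).sum(),      # overall size of the coefficient vector
        'max_abs': np.abs(coefs).max(),      # largest single coefficient
        'n_zero': int((coefs == 0).sum()),   # coefficients shrunk exactly to zero
    }

summary = pd.DataFrame(rows).T
print(summary)
```

An L1 penalty tends to drive some coefficients exactly to zero, so `n_zero` is a quick indicator of how sparse each model is.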

In [None]:
from matplotlib import pyplot as plt
%matplotlib inline

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=1)
fig.set_size_inches(10,10)

# Use a separate name so the `data` object returned by load_breast_cancer()
# stays available for data.target_names in later cells
coef_data = coefficients.xs(0, level=1, axis=1)
coef_data.plot(marker='o', ls='', ms=5.0, ax=ax, legend=False)

ax.legend(loc=4)

ax.set(title='Coefficient Set')

plt.tight_layout()

### Predict on Test Set

### Question 5

* Predict and store the class for each model.
* Also store the probability for the predicted class for each model. 

In [None]:
test_predictions = lr.predict(X_test)

In [None]:
test_predictions[:10]
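Question 5 asks for predictions and probabilities from every model, not only `lr`. A minimal sketch of that pattern follows; it is self-contained, so it fits two stand-in models rather than reusing the ones stored earlier, and the names `models`, `y_pred` and `y_prob` are assumptions made for this example.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=42)

# Stand-ins for the fitted models from the Modeling section
models = {
    'l1': LogisticRegression(penalty='l1', solver='liblinear').fit(X_tr, y_tr),
    'l2': LogisticRegression(penalty='l2', solver='liblinear').fit(X_tr, y_tr),
}

# One column of predicted classes per model
y_pred = pd.DataFrame({name: m.predict(X_te) for name, m in models.items()})

# predict_proba returns one column per class; the row-wise maximum is the
# probability of the class that predict() selected
y_prob = pd.DataFrame(
    {name: m.predict_proba(X_te).max(axis=1) for name, m in models.items()})

print(y_pred.head())
print(y_prob.head())
```

With the notebook's own models, the dictionary would presumably map `'lr'`, `'l1'` and `'l2'` to `lr`, `lr_l1` and `lr_l2`, with `X_test` in place of `X_te`.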

## Model Evaluation

### Question 6
For each model, calculate the following error metrics: 

* accuracy
* precision
* recall
* fscore
* confusion matrix

In [None]:
import model_evaluation_utils as meu

### Evaluation Stats

In [None]:
meu.get_metrics(true_labels=y_test, predicted_labels=test_predictions)

### Confusion Matrix

In [None]:
meu.display_confusion_matrix(true_labels=y_test, predicted_labels=test_predictions,
                             classes=data.target_names)

In [None]:
meu.display_classification_report(true_labels=y_test, predicted_labels=test_predictions,
                                  classes=data.target_names)
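`model_evaluation_utils` is a local helper module rather than an installable package, so the same metrics can also be computed directly with `sklearn.metrics`. The sketch below is an approximation of what the helper reports, not its actual implementation; the weighted averaging and the stand-in model are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=42)

# Stand-in model; any fitted classifier with a predict() method works here
model = LogisticRegression(solver='liblinear').fit(X_tr, y_tr)
y_hat = model.predict(X_te)

accuracy = accuracy_score(y_te, y_hat)
# average='weighted' weights each class's score by its number of true samples
precision, recall, fscore, _ = precision_recall_fscore_support(
    y_te, y_hat, average='weighted')
cm = confusion_matrix(y_te, y_hat)  # rows: true class, columns: predicted class

print('Accuracy:', round(accuracy, 4))
print('Precision:', round(precision, 4))
print('Recall:', round(recall, 4))
print('F1 score:', round(fscore, 4))
print(cm)
```

Wrapping these calls in a loop over the fitted models gives one row of metrics per model, which makes the comparison in Question 6 easier to read.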

## Feature Selection

### Question 7
Identify highly correlated columns and drop them before building models.

In [None]:
# Shape of the feature matrix (all columns except the target)
df.loc[:,~df.columns.isin(['has cancer'])].shape

In [None]:
from sklearn.feature_selection import VarianceThreshold

# Drop features whose variance is below 0.7 * (1 - 0.7) = 0.21.
# The p * (1 - p) form is the variance of a Bernoulli feature; for the
# continuous features here it simply acts as a variance cutoff.
sel = VarianceThreshold(threshold=(.7 * (1 - .7)))

data2 = df.copy()
data_new = pd.DataFrame(sel.fit_transform(data2.loc[:,~data2.columns.isin(['has cancer'])]))

# train-test split
X_new,X_test_new, Y_new,Y_test_new = train_test_split(data_new, data2['has cancer'].tolist(), test_size=0.3, random_state=42)
print('Training dataset shape:', X_new.shape, '\tTest dataset shape:', X_test_new.shape)
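`VarianceThreshold` filters on variance, not on correlation. A correlation-based filter, which is what Question 7 describes, might look like the sketch below. The 0.95 cutoff is an arbitrary assumption, as is the rule of dropping the later column of each correlated pair; the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
features = pd.DataFrame(cancer.data, columns=cancer.feature_names)

cutoff = 0.95  # assumed threshold for "highly correlated"

# Absolute pairwise Pearson correlations between all feature columns
corr = features.corr().abs()

# Keep only the strict upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop a column if it is highly correlated with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > cutoff).any()]
reduced = features.drop(columns=to_drop)

print(len(to_drop), 'columns dropped;', reduced.shape[1], 'remain')
```

The reduced frame can then go through the same train-test split and modeling steps as the full feature set.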

### Question 8
* Predict and store the class for each model.
* Also store the probability for the predicted class for each model.

### Question 9

For each model, calculate the following error metrics: 

* accuracy
* precision
* recall
* fscore
* confusion matrix