<p style="text-align:center">
PSY 394U <b>Data Analytics with Python</b>, Spring 2018


<img style="width: 400px; padding: 0px;" src="https://github.com/sathayas/JupyterAnalyticsSpring2018/blob/master/images/Title_pics.png?raw=true" alt="title pics"/>

</p>

<p style="text-align:center; font-size:40px; margin-bottom: 30px;"><b> Logistic regression </b></p>

<p style="text-align:center; font-size:18px; margin-bottom: 32px;"><b>February 13, 2018</b></p>

<hr style="height:5px;border:none" />

# 1. What is logistic regression?
<hr style="height:1px;border:none" />

In a typical linear regression, the dependent variable $Y$ is expressed as a linear combination of independent variables $X_1$, $X_2$, ..., $X_p$:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon$$

Here, the $\beta$s are unknown regression coefficients to be estimated, and $\epsilon$ is the error term, assumed to follow a normal distribution. This model works well if $Y$ is a continuous variable. But suppose $Y$ is a binary variable, taking only the values 0 or 1. In that case, the linear model above does not work well: its predictions are continuous, so they are generally not exactly 0 or 1, and they can even fall below 0 or above 1.
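To make the problem concrete, here is a minimal sketch that fits an ordinary least-squares line to a binary outcome. The ten data points are made up for illustration and are not part of the lecture data.

```python
import numpy as np

# Toy data (made up for illustration): a binary outcome that
# switches from 0 to 1 as x increases
x = np.arange(10)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Ordinary least-squares line y = b0 + b1 * x
# (np.polyfit returns the highest-degree coefficient first)
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

# The fitted values are continuous, and the ones at the two ends
# fall outside the [0, 1] range
print(y_hat.round(2))
```

The fitted values at the smallest and largest $x$ are below 0 and above 1, so they cannot be read as probabilities.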

The solution to modeling a binary outcome variable is to use a logistic regression. Rather than modeling $Y$ directly, it models the probability that $Y$ is 1, or $Pr(Y=1)$. A logistic regression model has the following form:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p$$

Or

$$p = \frac{1}{1+\exp\left(-\left(\beta_0 + \beta_1 X_1 + \beta_2 X_2 
                             + \cdots + \beta_p X_p\right)\right)}$$

where $p=Pr(Y=1)$. The function $\log\left(\frac{p}{1-p}\right)$ is referred to as the **logit** function. It converts a probability (between 0 and 1) into a real number (between $-\infty$ and $\infty$). With this model, the probability $p=Pr(Y=1)$ can be modeled as a function of the independent variables $X_1, \ldots, X_p$. Here is an example of the probability $p=Pr(Y=1)$ as it relates to a single independent variable $X$ in a logistic regression.

<img style="width: 500px; padding: 0px;" src="https://github.com/sathayas/JupyterAnalyticsSpring2018/blob/master/images/Logistic_LogitFunc.png?raw=true" alt="Logistic function"/>
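The two forms of the model are inverses of each other, which can be checked numerically. The sketch below defines the logit function and its inverse using NumPy. The helper names `logit` and `inv_logit` are chosen here for illustration and are not part of the lecture code.

```python
import numpy as np

def logit(p):
    """Map a probability in (0, 1) to a real number (the log odds)."""
    return np.log(p / (1 - p))

def inv_logit(z):
    """Map a real number back to a probability in (0, 1)."""
    return 1 / (1 + np.exp(-z))

p = np.array([0.1, 0.5, 0.9])
z = logit(p)

print(z)             # negative for p < 0.5, zero at p = 0.5, positive for p > 0.5
print(inv_logit(z))  # recovers the original probabilities
```

Note that $p=0.5$ corresponds to a logit of exactly 0, so the sign of the linear predictor tells us on which side of 0.5 the probability falls.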

Once a logistic regression model is learned from the data (i.e., all parameters are estimated), we can use the model as a classifier. For example, if the predicted probability $p>0.5$ then one can predict *success* (i.e., $Y=1$), whereas one can predict *failure* (i.e., $Y=0$) when $p<0.5$.
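As a rough sketch of this classification rule, suppose a model with a single independent variable has already been fitted. The coefficient values below are hypothetical, picked only to illustrate the thresholding step.

```python
import numpy as np

# Hypothetical coefficients (NOT estimated from any real data)
beta0, beta1 = -3.0, 1.0

# A few values of the independent variable X
x = np.array([1.0, 2.0, 4.0, 5.0])

# Predicted probability p = Pr(Y=1) from the logistic model
p = 1 / (1 + np.exp(-(beta0 + beta1 * x)))

# Classify as success (1) when p > 0.5, failure (0) otherwise
y_pred = (p > 0.5).astype(int)

print(p.round(3))
print(y_pred)
```

With these coefficients, the first two observations are classified as failures and the last two as successes, because the linear predictor $\beta_0 + \beta_1 X$ changes sign at $X=3$.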

As a side note, the quantity $\frac{p}{1-p}$ is often referred to as the *odds*. This is different from the *probability*. The probability of rolling a 6 on a balanced die is 1/6, but the odds of rolling a 6 are 1/5. This is because the odds are the ratio of the probability of getting a 6 (1/6) to the probability of not getting a 6 (5/6).
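The die example can be verified with exact fractions. This is a small sketch in which the helper `odds` is a name introduced here for illustration.

```python
from fractions import Fraction

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1 - p)

p_six = Fraction(1, 6)   # probability of rolling a 6 on a balanced die
print(odds(p_six))       # prints 1/5
```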

# Example: breast cancer data
<hr style="height:1px;border:none" />

You may recall the breast cancer data from the [LDA lecture](https://github.com/sathayas/JupyterAnalyticsSpring2018/blob/master/LinDisc.ipynb). The data set (see **`WiscBrCa_clean.csv`**) includes the diagnosis information (benign or malignant) on N=683 breast tumors along with some pathology features on the tumors. The goal here is to build a logistic regression classifier to predict whether a tumor is malignant based on the features. 

`<WiscBrCaLogistic.py>`

In [1]:
%matplotlib inline

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report


# loading the data, creating binary target
BrCaData = pd.read_csv('WiscBrCa_clean.csv')
BrCaData['Malignant'] = BrCaData.Class // 2 - 1  # Binary variable of malignancy
                                                 # 0: benign
                                                 # 1: malignant
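The integer-division step relies on `Class` being coded as 2 for benign and 4 for malignant tumors. The following sketch checks the mapping on a small made-up data frame (not the actual data file), and compares it with a more explicit alternative under the same coding assumption.

```python
import pandas as pd

# Made-up rows for illustration; assumes Class is coded 2 (benign) / 4 (malignant)
toy = pd.DataFrame({'Class': [2, 4, 4, 2]})

# Same transformation as in the lecture code: 2 -> 0, 4 -> 1
toy['Malignant'] = toy.Class // 2 - 1

# A more explicit alternative that gives the same result under this coding
toy['Malignant_alt'] = (toy.Class == 4).astype(int)

print(toy)
```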

Now let's see if there is any difference between benign and malignant tumors, by calculating feature means for these classes.

In [3]:
# means according to malignancy
print(BrCaData.groupby('Malignant').mean())

                     ID  ClumpThick  UniCellSize  UniCellShape  Adhesion  \
Malignant                                                                  
0          1.115261e+06    2.963964     1.306306      1.414414  1.346847   
1          1.005121e+06    7.188285     6.577406      6.560669  5.585774   

           EpiCellSize   BareNuc  Chromatin  Nucleoli   Mitoses  Class  
Malignant                                                               
0             2.108108  1.346847   2.083333  1.261261  1.065315    2.0  
1             5.326360  7.627615   5.974895  5.857741  2.602510    4.0  


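As you can see, the malignant tumors have higher means on every feature. The gap is largest for `BareNuc` (1.35 vs. 7.63) and smallest for `Mitoses` (1.07 vs. 2.60). The `ID` and `Class` columns are not features and can be ignored here.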