# Logistic Regression

Logistic regression is closely related to linear regression, but it is used to predict the outcome of a categorical variable rather than a continuous one. The probabilities of the possible outcomes of a single trial are modeled as a function of the predictor variables using a logistic function. Rather than modeling the probabilities directly, we model a function of the probabilities that is not restricted to lie between 0 and 1: the log of the odds. Recall that the odds of an event are the probability that the event occurs divided by the probability that it does not occur.
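As a small illustration of why the log odds is a convenient quantity to model, the sketch below converts a few probabilities to odds and log odds. The helper names `odds` and `log_odds` are made up for this example and are not part of any library.

```python
import math

def odds(p):
    """Odds of an event with probability p (requires 0 <= p < 1)."""
    return p / (1.0 - p)

def log_odds(p):
    """Natural log of the odds (requires 0 < p < 1)."""
    return math.log(odds(p))

# Probabilities stay inside (0, 1); log odds can be any real number
for p in (0.1, 0.5, 0.9):
    print(f"p={p:.1f}  odds={odds(p):.3f}  log-odds={log_odds(p):+.3f}")
```

Note that probabilities below 0.5 give negative log odds and probabilities above 0.5 give positive log odds, so the scale is symmetric around even odds.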




## Simple Logistic Regression

The logistic regression model expresses the population proportion p of individuals with a given attribute (called the probability of success) as a function of a single predictor variable X. The model assumes that p is related to X through 

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X$$

or its inverse

$$p = \frac{\exp(\beta_0 + \beta_1 X)}{1 + \exp(\beta_0 + \beta_1 X)}$$

The odds of success are p/(1 − p). The logistic model assumes that the log-odds of success is linearly related
to X. Graphs of the logistic model relating p to X are given below. 


![image](https://ars.els-cdn.com/content/image/3-s2.0-B9780124071971000041-gr004.jpg?_)
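The two forms of the model can be checked against each other numerically. The sketch below uses made-up coefficients (`beta0` and `beta1` are hypothetical, not estimates from any data set) and confirms that applying the logit to the modeled probability recovers the linear predictor.

```python
import math

def inv_logit(eta):
    """Map a log-odds value to a probability.

    Algebraically equal to exp(eta) / (1 + exp(eta)).
    """
    return 1.0 / (1.0 + math.exp(-eta))

def logit(p):
    """Map a probability in (0, 1) to its log odds."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients chosen only for illustration
beta0, beta1 = -2.0, 0.5

xs = [0, 2, 4, 6, 8]
ps = [inv_logit(beta0 + beta1 * x) for x in xs]

for x, p in zip(xs, ps):
    # logit(p) should reproduce beta0 + beta1 * x
    print(f"X={x}  p={p:.3f}  logit(p)={logit(p):+.3f}")
```

With a positive `beta1` the probabilities increase with X along an S-shaped curve; a negative slope would give a decreasing curve.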


## Example Problem

Survival times are given for 30 patients who died from acute myelogenous leukaemia. The patient's white blood cell count (WBC) at the time of diagnosis was also measured. The patients were divided into two groups according to the presence or absence of a morphologic characteristic of white blood cells: patients termed AG positive were identified by the presence of Auer rods and/or significant granulation of the leukaemic cells in the bone marrow at the time of diagnosis. We are interested in modelling the probability p of surviving at least one year as a function of WBC and AG.


In [None]:
# Needed Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
import seaborn as sns

import os
#Building path to repo, *only need to run once*
! git clone https://github.com/bwsimedlytics/Week1Public
os.chdir("./Week1Public")

#### Import Data - leukdata.csv can be found in the Google Drive or on GitHub

In [None]:
#importing data
data = pd.read_csv("leukdata.csv") 

Leukemia Data set

Variables 

nres: number surviving at least one year  
ag: 1 for AG+, 0 for AG-  
wbc: white blood cell count   

#### Summarize and visualize the data (univariate analysis first and then bivariate)

In [None]:
#Write your code here 

#### Which plot best portrays the need for logistic regression over linear regression? 


#### Assess if variables need to be transformed or normalized. If so, change the variable(s).

Notice from the univariate analysis above that the WBC distribution is skewed. Find a transformation of the WBC variable that makes its distribution approximately normal. Power transforms are a family of parametric, monotonic transformations that aim to map data from any distribution to as close to a Gaussian (normal) distribution as possible, in order to stabilize variance and minimize skewness. See https://scikit-learn.org/stable/modules/preprocessing.html
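To see what such a transformation does, here is a minimal sketch on synthetic right-skewed data (simulated values, NOT the leukaemia data). It uses a plain log transform, which is the transform the model below applies to WBC; the `skewness` helper is written out by hand for the example.

```python
import math
import random
import statistics

def skewness(values):
    """Skewness as the mean cubed z-score (population form)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

random.seed(0)

# Synthetic, strongly right-skewed values standing in for a count variable
raw = [random.lognormvariate(9.5, 1.0) for _ in range(2000)]

# The log is a monotonic transform, so the ordering of observations is kept
logged = [math.log(v) for v in raw]

skew_raw = skewness(raw)
skew_log = skewness(logged)
print(f"skewness before: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

In the notebook, the analogous step would be something like `np.log(data["wbc"])`, assuming the column is named as in the variable list above. The scikit-learn page linked above describes `PowerTransformer` as a more general alternative.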

In [None]:
#Write your code here

Now that we have preprocessed the data, we can implement the model.

The model we are considering for this analysis is



$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1\log(WBC) + \beta_2 AG$$
 


The model is best understood by separating the AG+ and AG− cases. For AG− individuals, AG=0 so the model reduces to

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1\log(WBC) + \beta_2 \cdot 0 = \beta_0 + \beta_1\log(WBC)$$

##### What is the model for AG+ individuals (AG = 1)?

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1\log(WBC) + \beta_2 \cdot 1 = (\beta_0 + \beta_2) + \beta_1\log(WBC)$$

The reduced, or simple, logistic model omits AG, so the log odds of surviving at least one year depends linearly on log(WBC) alone. Setting $\beta_2 = 0$ implies that AG level has no effect on the survival probability once log(WBC) has been accounted for. Including AG implies that the log odds of surviving at least one year is linearly related to log(WBC), with a different intercept for each AG level: $\beta_0$ and $\beta_0 + \beta_2$ are the y-intercepts of the population logistic regression lines for AG− and AG+, respectively. Both lines have a common slope, $\beta_1$. The coefficient $\beta_2$ for AG is the difference between the intercepts of the AG+ and AG− regression lines. The image below shows the assumed relationship for a given $\beta_1 < 0$. The population regression lines are parallel on the logit scale only, but the ordering of the AG groups is preserved on the probability scale.

<img src="https://github.com/bwsimedlytics/Week1Public/blob/master/logscale.png?raw=true">
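The parallel-lines behaviour can be checked numerically. The sketch below uses hypothetical coefficients (`b0`, `b1`, `b2` are invented for illustration, with `b1 < 0`) and compares the AG+ and AG− curves on both scales.

```python
import math

def inv_logit(eta):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients, not fitted values
b0, b1, b2 = 2.0, -1.0, 2.5

logit_gaps = []
prob_gaps = []
for lwbc in (0.0, 1.0, 2.0, 3.0, 4.0):
    eta_neg = b0 + b1 * lwbc   # AG- line on the logit scale
    eta_pos = eta_neg + b2     # AG+ line: same slope, shifted intercept
    p_neg, p_pos = inv_logit(eta_neg), inv_logit(eta_pos)
    logit_gaps.append(eta_pos - eta_neg)
    prob_gaps.append(p_pos - p_neg)
    print(f"log(WBC)={lwbc:.0f}  AG-: {p_neg:.3f}  AG+: {p_pos:.3f}")

# The gap is constant on the logit scale but varies on the probability scale
print("logit gaps:", [round(g, 3) for g in logit_gaps])
print("probability gaps:", [round(g, 3) for g in prob_gaps])
```

The gap between the groups is always `b2` on the logit scale, while on the probability scale it changes with log(WBC). With a positive `b2`, the AG+ curve stays above the AG− curve throughout.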

#### Implementing model using statsmodels (http://www.statsmodels.org/stable/index.html)

In [None]:
# Use statsmodels to fit the model - YOU WILL EDIT THIS CODE.
#
# Specify the column containing the variable you are trying to predict (datay), followed by the columns
# that the model should use to make the prediction (datax). All other columns must be dropped.

import statsmodels.api as sm


datax =
datay =


datax['intercept'] = 1.0 #manually add the intercept

logit_model = sm.Logit(datay, datax)

result = logit_model.fit()

print(result.summary())

Your model should be $$\log\left(\frac{p}{1-p}\right) = 4.7909 - 0.9887\log(WBC) + 2.5067AG$$
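Once the coefficients are known, the fitted equation can be turned into predicted probabilities by applying the inverse logit. The sketch below hard-codes the coefficients reported above; the function name `predict_survival` and the log(WBC) values in the loop are hypothetical and chosen only for illustration.

```python
import math

# Coefficients as reported in the fitted equation above
B0, B_LWBC, B_AG = 4.7909, -0.9887, 2.5067

def predict_survival(lwbc, ag):
    """Estimated probability of surviving at least one year.

    lwbc: log-transformed white blood cell count
    ag:   1 for AG+, 0 for AG-
    """
    eta = B0 + B_LWBC * lwbc + B_AG * ag  # linear predictor (log odds)
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical log(WBC) values, not taken from the data set
for lwbc in (1.0, 3.0, 5.0):
    p_neg = predict_survival(lwbc, 0)
    p_pos = predict_survival(lwbc, 1)
    print(f"log(WBC)={lwbc:.1f}  AG-: {p_neg:.3f}  AG+: {p_pos:.3f}")
```

Evaluating a function like this over a grid of log(WBC) values for each AG group is one way to approach the plotting exercise that follows.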

#### How do you assess what features are significant in the model? What are those features?


#### Replicate the plots in the figure above using the fitted values from results of the model you implemented above.

In [None]:
# Write your code here

## Odds Ratio

In statistics, the odds ratio (OR) is one of the primary ways to quantify the association between two binary variables. The odds of an event are a non-negative number that can be calculated from the probability $p$ of the event using the following formula: $Odds = \frac{p}{1-p}$.

We can calculate the odds ratio to measure the relative odds of an outcome for patients who received a treatment compared with patients who did not. Fortunately, ORs are easy to calculate from the logistic regression estimates: exponentiating a coefficient gives the corresponding OR.

#### Calculate the odds ratio and the corresponding 95% confidence interval using the output from the model above. Interpret the OR for AG and log(WBC). 
You should obtain the following:

            
ag:    OR=12.263847  95% CI (1.516927, 99.149123)   

lwbc:   OR=0.372042  95% CI (0.144299, 0.959227) 

Holding log(WBC) fixed, the odds that an AG+ patient lives at least one year are 12.26 times the odds that an AG− patient lives at least one year.

Holding AG fixed, the odds of surviving at least one year are multiplied by 0.37 (that is, reduced by a factor of about 2.7) for each unit increase in log(WBC).
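The arithmetic behind these numbers can be sketched as follows. The coefficients are the rounded values from the fitted equation. The standard errors are NOT read from the model summary: they are back-calculated from the intervals reported above, under the assumption that those are Wald intervals (symmetric on the log-odds scale). All function names are hypothetical.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval

def odds_ratio(coef):
    """Odds ratio for a one-unit increase in the predictor."""
    return math.exp(coef)

def odds_ratio_ci(coef, se, z=Z_95):
    """Wald interval built on the log-odds scale, then exponentiated."""
    return math.exp(coef - z * se), math.exp(coef + z * se)

def se_from_or_ci(lower, upper, z=Z_95):
    """Back out the log-odds standard error from a reported OR interval."""
    return (math.log(upper) - math.log(lower)) / (2.0 * z)

# Rounded coefficients from the fitted equation
b_ag, b_lwbc = 2.5067, -0.9887

# Assumption: standard errors recovered from the reported intervals
se_ag = se_from_or_ci(1.516927, 99.149123)
se_lwbc = se_from_or_ci(0.144299, 0.959227)

or_ag, ci_ag = odds_ratio(b_ag), odds_ratio_ci(b_ag, se_ag)
or_lwbc, ci_lwbc = odds_ratio(b_lwbc), odds_ratio_ci(b_lwbc, se_lwbc)

print(f"ag:   OR={or_ag:.3f}  95% CI ({ci_ag[0]:.3f}, {ci_ag[1]:.3f})")
print(f"lwbc: OR={or_lwbc:.3f}  95% CI ({ci_lwbc[0]:.3f}, {ci_lwbc[1]:.3f})")
```

In the notebook itself, the same quantities should be obtainable directly from the fitted statsmodels object with `np.exp(result.params)` and `np.exp(result.conf_int())`. Because the coefficients here are rounded to four decimals, the printed values agree with the reported ones only to about three significant figures.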

In [None]:
# Write your code here

Although we are not covering methods for assessing model fit, you should be familiar with them. You do not have to assess fit for this simple problem, but you will have to in your challenge problem.
A good reference is https://statisticalhorizons.com/wp-content/uploads/GOFForLogisticRegression-Paper.pdf (ignore the SAS code).

Source: Feigl, P. & Zelen, M. (1965) Estimation of exponential survival probabilities with concomitant information. Biometrics 21, 826–838.