# Logistic Regression

## What is it?
Logistic regression is a classification method: out of the infinite number of curves that can be drawn through a given set of observations, it finds the best-fitting curve for dividing the observations into separate classes.

The equation for Logistic regression is
$ P = \frac{1}{1+e^{-(\beta_0+\beta_1X_1+...+\beta_nX_n)}} $

Like linear regression, the model is built on a linear combination of the independent variables. However, the dependent variable is either 0 or 1, so the linear relationship holds between the independent variables and the log odds of the outcome rather than the outcome itself. The linear term $(\beta_0+\beta_1X_1+...+\beta_nX_n)$ can take any value between $-\infty$ and $\infty$, and the logistic (sigmoid) function maps that value to a probability between 0 and 1. The logit function is its inverse: it maps a probability back to the log odds.
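
The mapping from the linear term to a probability can be sketched in a few lines of Python. This is a minimal illustration, not a fitted model: the function names `sigmoid` and `predict_proba` and the coefficient values are assumptions made up for the example.

```python
import math

def sigmoid(z):
    """Map a real-valued linear score z to a probability in (0, 1)."""
    # Two branches avoid overflow in exp() for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(x, beta0, betas):
    """P(y = 1) for one observation x, given an intercept and coefficients."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return sigmoid(z)

# Hypothetical coefficients, chosen only for illustration.
p = predict_proba([2.0, 1.0], beta0=-1.0, betas=[0.5, 0.0])
print(p)
```

With these made-up coefficients the linear term is exactly 0, so the predicted probability sits on the 0.5 boundary between the two classes.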

## Odds and Log Odds

$ P = \frac{1}{1+e^{-(\beta_0+\beta_1X_1+...+\beta_nX_n)}} $

$ 1 - P = \frac{e^{-(\beta_0+\beta_1X_1+...+\beta_nX_n)}}{1+e^{-(\beta_0+\beta_1X_1+...+\beta_nX_n)}} $

$ \frac{P}{1-P} = e^{\beta_0+\beta_1X_1+...+\beta_nX_n} $

$ \ln\left(\frac{P}{1-P}\right) = \beta_0+\beta_1X_1+...+\beta_nX_n $

The term $\frac{P}{1-P}$ is called the odds, i.e. the ratio of the probability that an event occurs to the probability that it does not occur. Its natural log is called the log odds.

For every unit increase in an independent variable $X_i$, the log odds increase by $\beta_i$ (a linear change), while the odds are multiplied by $e^{\beta_i}$ (a multiplicative change). Thus the independent variables and the log odds share a linear relationship.

The odds ratio is the ratio of the odds between two groups.
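
A small numeric check makes the linear versus multiplicative behaviour concrete. The sketch below assumes a one-variable model with made-up coefficients `beta0` and `beta1`; the helper names are illustrative only.

```python
import math

def odds(p):
    """Odds of an event with probability p."""
    return p / (1.0 - p)

def log_odds(p):
    """Natural log of the odds."""
    return math.log(odds(p))

# Hypothetical one-variable model: log odds = beta0 + beta1 * x
beta0, beta1 = -2.0, 0.7

def prob(x):
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

# Moving x from 2 to 3 adds beta1 to the log odds ...
step_in_log_odds = log_odds(prob(3.0)) - log_odds(prob(2.0))
# ... and multiplies the odds by exp(beta1).
odds_multiplier = odds(prob(3.0)) / odds(prob(2.0))

print(step_in_log_odds, odds_multiplier, math.exp(beta1))
```

The multiplier `exp(beta1)` is the odds ratio between two groups that differ by one unit of x, under the assumptions of this toy model.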

## Maximum Likelihood function
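The y value for a given data point is either 0 or 1. For logistic regression to classify the data, i.e. fit a curve through the data points, we want the predicted probability P to be as close to 0 as possible for points whose actual value is 0 and as close to 1 as possible for points whose actual value is 1. In other words, we want to maximize (1-P) for the class 0 points and maximize P for the class 1 points. Thus, if points 1 to n belong to class 0 and points n+1 to n+m belong to class 1, we want to maximize:

  $(1-P_1)(1-P_2)\cdots(1-P_n)\,P_{n+1}P_{n+2}\cdots P_{n+m}$

This product is the likelihood function. Logistic regression is fitted by maximizing it, which is equivalent to minimizing the negative log-likelihood (log loss), the quantity usually quoted as the cost function for logistic regression.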
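
The product above can be evaluated directly for a handful of points. The following sketch uses made-up labels and predicted probabilities (`good_fit` and `poor_fit` are hypothetical) to show that predictions closer to the true labels give a larger likelihood, and that taking logs turns the product into a sum.

```python
import math

def likelihood(y_true, p_pred):
    """Product of P for class 1 points and (1 - P) for class 0 points."""
    result = 1.0
    for y, p in zip(y_true, p_pred):
        result *= p if y == 1 else (1.0 - p)
    return result

def log_likelihood(y_true, p_pred):
    """Log of the likelihood, computed as a sum for numerical convenience."""
    return sum(math.log(p) if y == 1 else math.log(1.0 - p)
               for y, p in zip(y_true, p_pred))

# Hypothetical labels and predicted probabilities.
y = [0, 0, 1, 1]
good_fit = [0.1, 0.2, 0.8, 0.9]
poor_fit = [0.6, 0.5, 0.4, 0.5]

print(likelihood(y, good_fit), likelihood(y, poor_fit))
print(log_likelihood(y, good_fit), log_likelihood(y, poor_fit))
```

In practice the log form is the one that gets optimized, because a product of many probabilities quickly underflows.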


## Maximum Likelihood estimator

## Confusion Matrix

|Actual/Predicted |   No   |  Yes   |
|-----------------|--------|--------|
|   No            | True Negative | False Positive|
|   Yes           | False Negative | True Positive| 
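
The four cells of the table can be counted from paired actual and predicted labels. This is a minimal sketch with made-up labels; `confusion_counts` is an illustrative helper, not a library function.

```python
def confusion_counts(y_true, y_pred):
    """Count the four cells of a binary confusion matrix (1 = Yes, 0 = No)."""
    counts = {"TN": 0, "FP": 0, "FN": 0, "TP": 0}
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1:
            counts["TP" if predicted == 1 else "FN"] += 1
        else:
            counts["FP" if predicted == 1 else "TN"] += 1
    return counts

# Hypothetical labels for eight observations.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)
```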

### Accuracy
Accuracy is the fraction of all labels, both Yes and No, that the model predicted correctly. The formula is as follows:

$ \text{Accuracy} = \frac{\text{Correctly Predicted Labels}}{\text{Total Number of Labels}} = \frac{TP+TN}{TP+TN+FP+FN} $

It is often used to judge overall model effectiveness, but it does not give any information about how accurate the predictions are for each class.
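
A toy example shows why accuracy alone can mislead when one class is rare. The data below is made up: 5 positives out of 100, and a hypothetical model that always predicts No.

```python
# Hypothetical imbalanced data: 5 actual Yes cases, 95 actual No cases.
y_true = [1] * 5 + [0] * 95
# A "model" that predicts No for everything.
y_pred = [0] * 100

accuracy = sum(a == p for a, p in zip(y_true, y_pred)) / len(y_true)
positives_found = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)

print(accuracy, positives_found)
```

The accuracy looks high even though not a single positive case is identified, which is why the per-class measures in the following sections matter.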

### Sensitivity
The percentage of actual positives that the model predicts correctly.

$ \text{True Positive Rate (TPR)} = \text{Sensitivity} = \frac{\text{Correctly Predicted Positives}}{\text{Actual Positives}} = \frac{\text{True Positives}}{\text{True Positives}+\text{False Negatives}} = \frac{TP}{TP+FN} $

- True Positives: positive values predicted as positive by the model
- False Negatives: positive values predicted as negative by the model

### Specificity
The percentage of actual negatives that the model predicts correctly.

$ \text{True Negative Rate (TNR)} = \text{Specificity} = \frac{\text{Correctly Predicted Negatives}}{\text{Actual Negatives}} = \frac{\text{True Negatives}}{\text{True Negatives}+\text{False Positives}} = \frac{TN}{TN+FP} $

In classification, one class is often given more weight than the other, so Sensitivity and Specificity are important from a model evaluation perspective.

### False Positive Rate (FPR)
The percentage of actual negatives that the model incorrectly classifies as positives.

$ FPR = \frac{\text{Negatives Predicted as Positive}}{\text{Total Number of Actual Negatives}} = \frac{FP}{TN+FP} = 1 - \text{Specificity} $

**Note: FPR is easy to confuse with Specificity because both deal with the negative class and share the same denominator. The numerators differ: FPR counts the negatives we _incorrectly_ classified as positives, whereas Specificity counts the negatives we _correctly_ classified as negatives.**
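
The three rates can be computed side by side from the confusion matrix counts, which also confirms that FPR and Specificity add up to 1. The counts below are made up for illustration, and `rates` is a hypothetical helper name.

```python
def rates(tp, fn, tn, fp):
    """Return (sensitivity, specificity, false positive rate)."""
    sensitivity = tp / (tp + fn)   # share of actual positives caught
    specificity = tn / (tn + fp)   # share of actual negatives caught
    fpr = fp / (tn + fp)           # share of actual negatives flagged as positive
    return sensitivity, specificity, fpr

# Hypothetical counts: 50 actual positives, 50 actual negatives.
sens, spec, fpr = rates(tp=40, fn=10, tn=45, fp=5)
print(sens, spec, fpr)
```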

### Precision
Probability that a predicted 'Yes' is actually a 'Yes'.
![image-2.png](attachment:image-2.png)

The formula for precision can be given as:

$ \text{Precision} = \frac{TP}{TP+FP} $

Remember that 'Precision' is the same as the **'Positive Predictive Value'**

### Recall
Probability that an actual 'Yes' case is predicted correctly.
![image-3.png](attachment:image-3.png)

The formula for recall can be given as:

$ \text{Recall} = \frac{TP}{TP+FN} $

Remember that 'Recall' is exactly the same as **'Sensitivity'.**

### Negative Predictive Value 
The probability that a predicted 'No' is actually a 'No'.

$ \text{Negative Predictive Value} = \frac{TN}{TN+FN} $

In other words, it is the precision for the negative class.
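
Precision, Recall and Negative Predictive Value follow the same pattern: each divides a count of correct predictions by a row or column total of the confusion matrix. The sketch below uses made-up counts, and `predictive_values` is an illustrative name.

```python
def predictive_values(tp, fn, tn, fp):
    """Return (precision, recall, negative predictive value)."""
    precision = tp / (tp + fp)   # of everything predicted Yes, how much is Yes
    recall = tp / (tp + fn)      # of everything actually Yes, how much was found
    npv = tn / (tn + fn)         # of everything predicted No, how much is No
    return precision, recall, npv

# Hypothetical counts.
precision, recall, npv = predictive_values(tp=40, fn=10, tn=45, fp=5)
print(precision, recall, npv)
```

Precision and NPV divide by predicted totals (columns of the confusion matrix), while Recall divides by an actual total (a row).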

## ROC Curve
The ROC (Receiver Operating Characteristic) curve plots signal (True Positive Rate, or Sensitivity) against noise (False Positive Rate, or 1 - Specificity) as the cut-off varies. It shows the tradeoff between Sensitivity and Specificity. The area under the curve (AUC) summarizes model performance: the best possible value is 1, and a value of 0.5 is the practical worst case, i.e. the model does no better than random guessing.

**Interpretation**
1. Spikes on ROC curve - model is not stable 
2. variations at lower end of curve - model is misclassifying values at X=0
3. variations at higher end of curve - model is misclassifying values at X=1

**Further reading:** https://derangedphysiology.com/main/cicm-primary-exam/required-reading/research-methods-and-statistics/Chapter%203.0.5/receiver-operating-characteristic-roc-curve
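
The ROC curve can be traced by hand: lower the cut-off through every observed score and record the (FPR, TPR) pair at each step. The following is a minimal sketch on four made-up points, with the area computed by the trapezoidal rule; `roc_points` and `auc` are illustrative helpers and assume both classes are present.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by lowering the cut-off through every score."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # cut-off above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the piecewise-linear curve through the points (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical labels and model scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
area = auc(points)
print(points, area)
```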

## Finding optimal cut-off
A plot of Accuracy, Sensitivity and Specificity against the cut-off shows the tradeoff between these values. The point where the three curves intersect can be a good cut-off for our model.
![image.png](attachment:image.png)

Another way to find the cut-off is to look at a plot of Precision (green line) and Recall (red line) against the cut-off. As Precision increases, Recall generally decreases.
![image-4.png](attachment:image-4.png)

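The plot shows the Precision and Recall tradeoff. The curve for precision is quite jumpy towards the end. This is because the denominator of precision, i.e. (TP+FP), is not constant: it is the number of predicted 1s, which changes with the cut-off. At high cut-offs very few points are predicted as 1, so a single prediction can swing the ratio and the curve becomes jumpy. The denominator of recall, (TP+FN), is the number of actual 1s and stays fixed.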
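
The cut-off search can also be done numerically instead of reading it off a plot. The sketch below sweeps a grid of cut-offs over made-up scores and picks the one where Sensitivity and Specificity are closest, which is one reasonable reading of "where the curves intersect". The helper names, the data and the grid are all assumptions, and both classes must be present.

```python
def metrics_at(y_true, scores, cutoff):
    """Accuracy, sensitivity and specificity when score >= cutoff is predicted as 1."""
    tp = fn = tn = fp = 0
    for y, s in zip(y_true, scores):
        predicted = 1 if s >= cutoff else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return (tp + tn) / len(y_true), tp / (tp + fn), tn / (tn + fp)

def balanced_cutoff(y_true, scores, cutoffs):
    """Cut-off with the smallest gap between sensitivity and specificity."""
    def gap(c):
        _, sens, spec = metrics_at(y_true, scores, c)
        return abs(sens - spec)
    return min(cutoffs, key=gap)

# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
scores = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
cutoffs = [i / 10 for i in range(1, 10)]

best = balanced_cutoff(y_true, scores, cutoffs)
print(best, metrics_at(y_true, scores, best))
```

Other criteria are possible, for example maximizing accuracy or weighting one class more heavily; the choice depends on which kind of error is more costly.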

## Nuances of Logistic Regression

### Sample Selection
1. Account for seasonal or cyclical fluctuations in the population
2. The sample should be representative of the population on which the model will be applied
3. Watch for rare-incidence populations (unbalanced data)

### Segmentation
1. Build model for each different segment of the data 
2. Combine the models to get overall model
3. The predictor variables must be different for segmented models
4. The overall predictive power generally increases using segmented models

### Variable Transformation - Dummy variable
- Converting a continuous variable into dummy variables helps make the model stable
- The disadvantage is that the data is compressed into very few categories, resulting in data clumping.

### Variable Transformation - Weight of evidence (WOE) 
- WOE can be calculated with the following formula:\
$ WOE = \ln\left(\frac{\text{Good in the bucket}}{\text{Total Good}}\right) - \ln\left(\frac{\text{Bad in the bucket}}{\text{Total Bad}}\right) $
- The WOE should follow a monotonically increasing or decreasing pattern across the bins
- If not, then club the bins to get an increasing or decreasing pattern
- WOE reflects the group's identity
- WOE helps in treating missing values in continuous or categorical variables
- The model becomes more stable
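
The WOE formula and the monotonicity check can be sketched as follows. The bucket counts are made up, `weight_of_evidence` is an illustrative helper, and the sketch assumes every bucket contains at least one good and one bad case (otherwise the logarithm is undefined).

```python
import math

def weight_of_evidence(good_counts, bad_counts):
    """WOE per bucket: ln(share of goods) - ln(share of bads)."""
    total_good = sum(good_counts)
    total_bad = sum(bad_counts)
    return [math.log(g / total_good) - math.log(b / total_bad)
            for g, b in zip(good_counts, bad_counts)]

# Hypothetical counts of good and bad cases in four ordered buckets.
good = [100, 200, 300, 400]
bad = [40, 30, 20, 10]

woe = weight_of_evidence(good, bad)
increasing = all(a < b for a, b in zip(woe, woe[1:]))
decreasing = all(a > b for a, b in zip(woe, woe[1:]))
is_monotonic = increasing or decreasing

print(woe, is_monotonic)
```

If `is_monotonic` came out false, adjacent buckets would be merged and the WOE recomputed until the pattern is monotonic.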

#### Information Value 
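- $ IV = \sum_{\text{buckets}} WOE \times \left(\frac{\text{Good in the bucket}}{\text{Total Good}} - \frac{\text{Bad in the bucket}}{\text{Total Bad}}\right) $
- Never negative, because WOE and the difference in shares have the same sign in every bucket
- Signifies the predictive power of a variable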
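
A short computation shows how the per-bucket terms combine. This sketch assumes the usual convention of summing the bucket contributions to get one Information Value per variable; the counts are made up, `information_value` is an illustrative name, and every bucket is assumed to hold at least one good and one bad case.

```python
import math

def information_value(good_counts, bad_counts):
    """Sum over buckets of WOE * (share of goods - share of bads)."""
    total_good = sum(good_counts)
    total_bad = sum(bad_counts)
    iv = 0.0
    for g, b in zip(good_counts, bad_counts):
        good_share = g / total_good
        bad_share = b / total_bad
        woe = math.log(good_share) - math.log(bad_share)
        # woe and (good_share - bad_share) share a sign, so each term is >= 0.
        iv += woe * (good_share - bad_share)
    return iv

# Hypothetical counts of good and bad cases in four buckets.
good = [100, 200, 300, 400]
bad = [40, 30, 20, 10]

iv = information_value(good, bad)
print(iv)
```

A variable whose good and bad shares are identical in every bucket gets an Information Value of 0, meaning it does not help separate the classes.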

### Variable Transformation - Continuous variables 
- Use the variable as is
- Impute missing values with the bin that has the nearest WOE

### Variable Transformation - Interaction variables
- Combine two or more variables to get a single variable
- Use business judgement to combine variables
- Another way is to build a decision tree to come up with interaction variables

### Variable Transformation - Splines
- Fit a regression model over the WOE values.
- This can be a complex polynomial model.
- Using a spline offers high predictive power, but it may result in unstable models.
- Usually not recommended.

### Variable Transformation - Mathematical transformation
- Use mathematical functions on variables.
- If a variable's relationship with the log odds is not linear, use a log transformation.
- Other mathematical functions include square, log etc.
- The problem is that they are not easy to explain to the business.

### Variable Transformation - Principal Component transformation 
- PCA takes all variables in the data and creates components 
- These components are used as variables
- The components are orthogonal, i.e. they are uncorrelated with each other
- Not easy to explain components to the business

### Challenges in Logistic Regression
1. Low event rate - the event is rare, so the data is imbalanced, e.g. fraud detection
2. Missing values 
  - imputation using WOE
  - imputation using median
  - imputation using mean
  - imputation using predictive patterns 
  - Markov Chain Monte Carlo 
  - Expectation Maximisation 
3. Truncated data - for certain data points the label, i.e. the y value, is not available, but the model will be applied to such data as well, e.g. people without a credit history 
  - reject inference 

### Model Performance measures
1. Discriminatory power 
  - KS Statistic
  - Gini - derived from the area under the ROC curve; a high value is good. $ \text{Gini} = 2 \times \text{Area Under ROC Curve} - 1 $
  - Rank Ordering 
  - Specificity
  - Sensitivity 
2. Accuracy 
  - Specificity
  - Sensitivity 
  - Compare actual vs predicted log odds 
3. Stability 
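
Two of the discriminatory power measures listed above can be sketched in plain Python. The example assumes the common definitions: the KS statistic as the largest gap between the True Positive Rate and the False Positive Rate over all cut-offs, and Gini as 2 × AUC − 1, with AUC computed as the share of (positive, negative) pairs that the model ranks correctly. The data and helper names are made up for illustration.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def ks_statistic(y_true, scores):
    """Largest gap between TPR and FPR over all observed cut-offs."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best = 0.0
    for threshold in set(scores):
        tpr = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1) / n_pos
        fpr = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0) / n_neg
        best = max(best, abs(tpr - fpr))
    return best

# Hypothetical labels and model scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc_value = auc_by_ranking(y_true, scores)
gini = 2 * auc_value - 1
ks = ks_statistic(y_true, scores)
print(auc_value, gini, ks)
```

A model that ranks no better than chance has an AUC of 0.5 and therefore a Gini of 0, while perfect ranking gives a Gini of 1.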