# 4 Logistic Regression

**Introduction to Logistic Regression**

Logistic Regression is a technique used for **binary classification problems, where the goal is to predict whether something is true or false**. In contrast, linear regression is used to predict a continuous value, such as the price of a house. 

Unlike linear regression, **logistic regression fits an S-shaped (sigmoid) function and is a powerful tool for predicting the likelihood of an event occurring**, such as whether a mouse in an experiment is obese or not. In a simple linear regression with weight on the x-axis and size on the y-axis, if we know the weight of a mouse, we can use the fitted regression line to predict its size. In a multiple linear regression we use several features, such as weight and blood pressure, to predict size, and we judge the fit with R² and a p-value. But whether a mouse is obese or not is a yes/no question, which makes it a binary classification problem.

![image.png](attachment:7b186b46-98b3-47b9-8c08-68621c23e583.png)

Instead of fitting a line to the data, logistic regression fits an S-shaped "logistic function". This curve goes from 0 to 1 and tells us the probability that a mouse is obese based on its weight. So in logistic regression we calculate a **probability**: the probability that a mouse is obese!

It's generally used for **classification**. If the probability is over 50%, we classify the mouse as "obese"; if the probability is below 50%, we classify it as "not obese".
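The 50% rule can be sketched in a few lines of Python. The intercept and slope below are made-up numbers chosen only to produce a readable curve (they are not fitted to any mouse data), and the function names are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, for illustration only (not fitted to real data).
INTERCEPT = -6.0   # log-odds of obesity at weight 0
SLOPE = 0.2        # change in log-odds per unit of weight

def predict_proba(weight):
    """Estimated probability that a mouse of this weight is obese."""
    return sigmoid(INTERCEPT + SLOPE * weight)

def classify(weight, threshold=0.5):
    """Label the mouse by comparing its probability to the threshold."""
    return "obese" if predict_proba(weight) > threshold else "not obese"

for w in (20, 35, 45):
    print(w, round(predict_proba(w), 3), classify(w))
```

With these assumed coefficients the curve crosses 50% at a weight of 30, so lighter mice are labeled "not obese" and heavier mice "obese".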

Just like linear regression, we can make simple models, such as predicting obesity from weight, or more complicated models, such as predicting obesity from multiple features like weight + genotype + age + astrological sign etc. In other words, just like linear regression, logistic regression can work with continuous data (such as weight and age) or with discrete data (like genotype and astrological sign). **We can also test to see if each variable is useful for predicting obesity**. Here we simply test whether a variable's effect on the prediction is significantly different from 0. If it is not, the variable is not useful for prediction. We use **Wald's test** to figure this out. In our obesity case, the astrological sign is **totes useless**, statistically speaking, so we can save time by leaving it out.

**A big difference between linear regression and logistic regression is how the line is fit to the data**. In linear regression, we use least squares to fit the line: we minimize the sum of the squared residuals, and we also use the residuals to calculate R². Logistic regression does not have the same concept of a residual, so instead of least squares or R² it uses **maximum likelihood**.



**Logistic Regression Theory- Linear to Logistic**

Logistic Regression is a type of model used to predict whether an event is likely to happen or not. It's mostly used in situations where there are two possible outcomes, such as "yes" or "no", "true" or "false", or "pass" or "fail". It is a binary classifier, meaning it only has two possible outputs.

Furthermore, it uses input variables, also known as features, to calculate the probability of the event happening. The output is a value between 0 and 1 that represents the likelihood of the event occurring. If the probability is higher than a certain threshold, usually 0.5, the event is classified as "likely to happen", otherwise, it's classified as "not likely to happen".

Some examples of problems that can be solved using Logistic Regression are: 
- determining if an email is spam or not, 
- deciding whether a person is sick or not, 
- or predicting if a student will pass or fail an exam.

Just like a Linear Regression model, a Logistic Regression model computes a weighted sum of the features plus a bias. However, instead of outputting this result directly as the Linear Regression model does, it outputs the logistic (sigmoid) of this result, as seen below.

![image.png](attachment:09830a82-ca0c-4bf1-842f-b8c36b601759.png)

**Key Terms for Logistic Regression**

**Logit**: The log of the odds, i.e. the log of the ratio of the probabilities p and 1 - p. It is the function that maps a class membership probability from the range 0 to 1 onto the range -∞ to +∞. Synonym: log odds. For example, with p = 0.625: log(p / (1 - p)) = log(0.625 / 0.375) = log(1.67) ≈ 0.51 (natural log).

**Odds**: The ratio of “success” (1) to “not success” (0). Odds are not probabilities. The odds are the ratio of "something happening" (e.g. my team winning) to "something not happening" (e.g. my team not winning). For example, with 5 wins and 3 losses the odds of winning are 5 / 3 ≈ 1.67, while the probability of winning is 5 / 8 = 0.625 (wins over total games), so the probability of losing is 0.375. We can calculate the odds either from counts (5 / 3) or from probabilities: the ratio of the two probabilities gives the same odds, p / (1 - p) = 0.625 / (1 - 0.625) = 0.625 / 0.375 ≈ 1.67.

**Log odds**: The response in the transformed model (now linear), which gets mapped back to a probability. Taking the log of the odds makes things symmetrical, easier to interpret and easier to use in fancy statistics.

![image.png](attachment:2d0c825b-6ebd-4356-ac94-5b38733a374f.png)

![image.png](attachment:802130fe-a3a1-4083-a2c7-003e2ebde4d8.png)
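As a quick check of the key terms, here is a small sketch that computes the odds and the log odds for the 5 wins / 3 losses example, both from counts and from probabilities. The variable names are only illustrative.

```python
import math

wins, losses = 5, 3

# Odds from counts: "something happening" divided by "something not happening".
odds_from_counts = wins / losses

# Probability uses the total number of games in the denominator.
p_win = wins / (wins + losses)
p_loss = 1 - p_win

# Odds from probabilities: p / (1 - p) gives the same number as the counts.
odds_from_prob = p_win / p_loss

# Logit (log odds): natural log of the odds.
log_odds_win = math.log(odds_from_prob)

# Symmetry: the log odds of losing is the negative of the log odds of winning.
log_odds_loss = math.log(p_loss / p_win)

print(round(p_win, 3), round(odds_from_counts, 2), round(log_odds_win, 2))
print(round(log_odds_loss, 2))
```

The odds of winning (about 1.67) and the odds of losing (0.6) sit at very different distances from 1, but their logs are the same distance from 0, which is the symmetry mentioned under **Log odds**.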


**Logistic Regression Theory- Maximum Likelihood**

The least-squares method is a common technique used in linear regression to estimate the parameters of a linear model. The performance of the fit is typically measured using metrics such as root mean squared error (RMSE) and R-squared. In contrast, **logistic regression is a type of generalized linear model that is used to model binary outcomes**. Unlike linear regression, there is no closed-form solution for the parameters of a logistic regression model, and the model must be fit using **maximum likelihood estimation (MLE)**. MLE is a method of estimating the parameters of a model that maximizes the likelihood of observing the data given the model.

In **logistic regression, the dependent variable is binary, and the independent variables are typically continuous or categorical**. The logistic regression model expresses the dependent variable as a logit, which is the logarithm of the odds. The output of the logistic regression equation is therefore not a binary outcome but an estimate of the log-odds of the outcome being 1. The MLE process seeks the parameters for which the predicted log-odds best match the observed outcomes.

The reason logistic regression models log(odds) rather than probability directly is a matter of range. A probability is confined to the interval from 0 to 1, while a linear combination of features can take any value. Odds remove the upper limit: as the probability goes from 0 to 1, the odds go from 0 towards infinity. Log-odds remove the lower limit as well: they range from negative infinity to positive infinity and are symmetric around 0 (which corresponds to a probability of 0.5), so they are easier to compare and interpret.

Why do we use odds or log(odds) instead of probability? This is a common question, and the following equations tell the story:

![image.png](attachment:6edb7462-2e71-430d-b477-31b100f7b2e5.png)

By converting probability to log(odds), we expand the range from [0, 1] to (-∞, +∞). If we fit the model to the probability directly, we face a limited-range problem; by applying the log-odds transform we absorb the nonlinearity and can still fit a linear combination of the variables.
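Written out for a single feature, the chain from probability to log(odds) and back looks like this. The symbols $\beta_0$, $\beta_1$ and $x$ are assumed here for illustration and may differ from the notation used in the figures.

$$
\text{odds} = \frac{p}{1-p},
\qquad
\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x
$$

Solving the second equation for $p$ gives back the S-shaped logistic curve:

$$
\frac{p}{1-p} = e^{\beta_0 + \beta_1 x}
\quad\Longrightarrow\quad
p = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}
$$

So the model is a straight line on the log(odds) scale and a sigmoid on the probability scale.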

**Important:** In logistic regression, we transform the y-axis from probability (0 to 1) to log odds. However, this transformation pushes the raw data (the observed 0s and 1s) to negative and positive infinity. This makes the residuals also equal to positive and negative infinity, so we can't use least squares to find the best-fitting line. That's why we use maximum likelihood.

![image.png](attachment:a3933fec-589b-4342-acfa-a7dcc9ccaf5d.png)
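To make the maximum likelihood idea concrete, here is a minimal sketch that fits a one-feature logistic regression by plain gradient ascent on the log-likelihood. The data are made up, the function names are hypothetical, and gradient ascent is just a simple stand-in: statistical packages typically use faster solvers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b0, b1, xs, ys):
    """Sum of log P(y | x) under the model p = sigmoid(b0 + b1 * x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def fit(xs, ys, lr=0.1, steps=5000):
    """Gradient ascent on the log-likelihood, starting from b0 = b1 = 0."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the log-likelihood: sum of (y - p) times each input.
        errors = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        g0 = sum(errors)
        g1 = sum(e * x for e, x in zip(errors, xs))
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Made-up, centered mouse weights (obese = 1). The two classes overlap on
# purpose so that the maximum likelihood estimate is finite.
weights = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, -0.5, 0.5]
obese = [0, 0, 0, 0, 1, 1, 1, 1, 0]

ll_start = log_likelihood(0.0, 0.0, weights, obese)
b0_hat, b1_hat = fit(weights, obese)
ll_fit = log_likelihood(b0_hat, b1_hat, weights, obese)

print("log-likelihood at start:", round(ll_start, 3))
print("log-likelihood after fit:", round(ll_fit, 3))
```

Note that the likelihood is evaluated on the probability scale, where every observation has a finite contribution, so the infinite residuals on the log(odds) scale never come into play.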



# Evaluating Performance-Classification Error Metrics

In a binary classification model, each prediction can have only two results: **correct or incorrect**. We use **classification error metrics to measure and evaluate the performance of our predictions**.

The major classification metrics are:

- Accuracy: The proportion of correct predictions (both true positives, TP, and true negatives, TN) among the total number of cases examined: (TP + TN) / (TP + TN + FP + FN), where FP and FN are the false positives and false negatives.
- Recall: Also known as sensitivity. It is the fraction of relevant (actually positive) instances that were retrieved: TP / (TP + FN).
- Precision: Also called positive predictive value. It is the fraction of relevant instances among the retrieved (predicted positive) instances: TP / (TP + FP).
- F1-Score: The harmonic mean of precision and recall: 2 × Precision × Recall / (Precision + Recall).

**Confusion Matrix:**

The classification metrics mentioned above (accuracy, precision, etc.) are generated from the confusion matrix. A confusion matrix is a frequently used table that describes the performance of a classification model on a set of test data for which the true values are known.
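As a rough sketch of how the four metrics fall out of the confusion matrix, the snippet below counts the four cells for a small made-up set of labels and predictions. The helper names are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here.

```python
def confusion_counts(y_true, y_pred):
    """Count the four cells of a binary confusion matrix (1 = positive)."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from the confusion matrix cells."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Assumed convention: a metric is 0.0 when its denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up test labels and predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
scores = classification_metrics(tp, fp, fn, tn)

print("TP, FP, FN, TN:", tp, fp, fn, tn)
print(scores)
```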


**Receiver operating characteristic (ROC) curve:**

This is a graphical plot that illustrates the performance of a binary classifier as its discrimination threshold is varied. The curve is created by plotting the **true positive rate (TPR) against the false positive rate (FPR) at various threshold values**.
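The points of a ROC curve can be sketched by sweeping the threshold over the predicted scores and recomputing TPR and FPR each time. The labels and scores below are made up, `roc_points` is a hypothetical helper, and the sketch assumes the data contain at least one positive and one negative example.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, from the strictest threshold to the most lenient."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Start above every score (nothing is predicted positive), then lower the bar.
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for thr in thresholds:
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Made-up labels and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Plotting these pairs with FPR on the x-axis and TPR on the y-axis gives the ROC curve, which always runs from (0, 0) to (1, 1).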


PS: Why is accuracy not a good measure for classification problems?

Accuracy is not always a good measure for classification problems because it gives equal importance to false positives and false negatives. That is often not appropriate in real business problems. For example, in cancer prediction, declaring a malignant tumor benign (a false negative) is more serious than wrongly telling a healthy patient that they have cancer (a false positive). Accuracy weights both errors equally and cannot differentiate between them.
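A small made-up example illustrates the point: two hypothetical cancer classifiers evaluated on the same 100 patients (10 of whom have cancer) make the same number of mistakes, so their accuracy is identical, but one of them misses half of the cancer cases.

```python
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def recall(tp, fn):
    return tp / (tp + fn)

# Hypothetical confusion matrix counts for 100 patients, 10 with cancer.
model_a = {"tp": 5, "fp": 0, "fn": 5, "tn": 90}   # 5 missed cancers
model_b = {"tp": 10, "fp": 5, "fn": 0, "tn": 85}  # 5 false alarms

acc_a = accuracy(**model_a)
acc_b = accuracy(**model_b)
rec_a = recall(model_a["tp"], model_a["fn"])
rec_b = recall(model_b["tp"], model_b["fn"])

print("accuracy:", acc_a, acc_b)
print("recall:  ", rec_a, rec_b)
```

Accuracy cannot tell the two models apart, while recall shows that the first one lets half of the sick patients go undetected.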

 
