# Solution Seekers Group

Lead of the Study Group Discussion: **Youssef Laouina**

Author: **Youssef Laouina**


# Logistic Regression: Introduction

## Definition:

Logistic regression is a statistical method used for **binary classification** tasks, where the goal is to predict the probability that an instance belongs to a particular class.

In ***binary classification***, we have a target variable that can take on two possible outcomes, often denoted as 0 and 1 (e.g., pass/fail, yes/no, positive/negative).

An example of a binary classification task would be a dataset with two columns: one for the number of hours a student studied for an exam and another column indicating whether they passed (1) or failed (0) the exam.

| Hours Studied | Exam Outcome |
|---:|----:|
|3|0|
|4|0|
|5|1|
|6|1|
|7|1|

# Odds

Let's talk a little bit about the concept of **Odds**.
In the example above we say that the odds in favor of the student passing the exam are 3 to 2.

Visually, we have 5 exams in total: 3 that the student passes, shown in green, and 2 that the student fails, shown in red.

<center><img src="../images/3_pass_2_fail.png" alt="Student passes 3 and fails 2 exams" style="width: 400px;"/></center> 

Alternatively, we can write this as the fraction $\frac{3}{2}$.

**Note:** 
Odds Are Not Probabilities!

The **odds** are the ratio of something happening to something not happening:

$$ \text{Odds} = \frac{\text{something happening (i.e. passing the exam)}}{\text{something not happening (i.e. failing the exam)}} $$


On the other hand, **Probability** is the ratio of something happening to everything that could happen:
$$ \text{Probability} = \frac{\text{something happening (i.e. passing the exam)}}{\text{everything that could happen (i.e. passing and failing the exam)}} $$

In the case of probability, we can write this as the fraction $\frac{3}{3 + 2} = \frac{3}{5}$.
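As a quick check of the difference, here is a minimal sketch that computes both quantities from the "Exam Outcome" column of the table above. The variable names are illustrative and not part of the original notebook.

```python
# Exam outcomes copied from the table above (1 = passed, 0 = failed).
outcomes = [0, 0, 1, 1, 1]

passes = sum(outcomes)            # number of passed exams
fails = len(outcomes) - passes    # number of failed exams

# Odds compare passes to fails; probability compares passes to all exams.
odds = passes / fails
probability = passes / (passes + fails)

print(f"odds = {odds}")                # odds = 1.5
print(f"probability = {probability}")  # probability = 0.6
```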

## Log Odds

Now that we know that odds are different from probabilities, let's talk about how odds can be calculated from probabilities.

We can calculate the probability of passing the exam $(p)$ as follows:

$$p = \frac{3}{5}$$

Then, by extension, the probability of failing is:
$$1 - p = \frac{2}{5} $$

If we calculate the ratio of the probability of passing to the probability of failing (that is, $1 - p$), we get the following:

$$\frac{p}{1 - p} = \frac{\frac{3}{5}}{\frac{2}{5}} = \frac{3}{2} = 1.5 $$
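The conversion can be sketched in code as follows. The function names `prob_to_odds` and `odds_to_prob` are hypothetical helpers, and the reverse conversion, odds / (1 + odds), is not stated in the text; it is added here only as the algebraic inverse of the formula above.

```python
def prob_to_odds(p):
    """Convert a probability p (0 <= p < 1) to odds p / (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)


def odds_to_prob(odds):
    """Convert non-negative odds back to a probability: odds / (1 + odds)."""
    if odds < 0:
        raise ValueError("odds must be non-negative")
    return odds / (1 + odds)


p_pass = 3 / 5
odds_pass = prob_to_odds(p_pass)

# Rounded for display because floating-point division is not exact.
print(round(odds_pass, 3))                # 1.5
print(round(odds_to_prob(odds_pass), 3))  # 0.6
```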

Now that we know what odds are, let's talk about the log of the odds.

We'll try calculating the log of the odds using the expression above:

$$ \log_e\left(\frac{p}{1 - p}\right) = \log_e\left(\frac{3}{2}\right) = \log_e(1.5) \approx 0.405 $$


Suppose that the odds in favor of student A passing are 6 to 1, and the odds in favor of student B passing are 1 to 6 (that is, 6 to 1 against).

$$ \text{Student A} \rightarrow \log_e\left(\frac{6}{1}\right) = \log_e(6) \approx 1.79 $$

$$ \text{Student B} \rightarrow \log_e\left(\frac{1}{6}\right) = \log_e(0.167) \approx -1.79 $$


Taking the **log()** makes the values symmetrical around zero: odds of 6 to 1 and odds of 1 to 6 sit at the same distance from zero, on opposite sides. This makes the log odds of different students easy to interpret and compare.
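The symmetry can be verified numerically. This is a small sketch using the two students from the example; the variable names are illustrative.

```python
import math

odds_a = 6 / 1   # student A: 6 to 1 in favor
odds_b = 1 / 6   # student B: 1 to 6 in favor (6 to 1 against)

log_odds_a = math.log(odds_a)   # natural log, i.e. log base e
log_odds_b = math.log(odds_b)

# On the raw scale the two odds are not symmetric (6 vs. about 0.167),
# but their logs are mirror images around zero.
print(round(log_odds_a, 2), round(log_odds_b, 2))  # 1.79 -1.79
```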

<center><img src="../images/log_scale_representation_of_ratios.png" alt="Log odds represented on a line" style="width: 600px;"/></center> 

**NOTE:**

The log of the odds, that is, the log of the ratio $\frac{p}{1 - p}$, is called the **logit** function and forms the basis of logistic regression.

$$ \text{logit}(p) = \log\left(\frac{p}{1 - p}\right) $$

### Applications of Log Odds

To show you what the big deal is all about, if I pick pairs of random numbers that add up to 100 (for example) and use them to calculate the log(odds) and draw a histogram... this is what we'll get.

In [None]:
import random
import seaborn as sns
import matplotlib.pyplot as plt 
import numpy as np
import pandas as pd 
import warnings

# `category` must be a Warning subclass, not a list.
warnings.simplefilter(action='ignore', category=FutureWarning)

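ratios = []

# Draw pairs (numerator, denominator) that add up to 100 and keep their ratio.
# Sampling from 1..99 keeps both parts nonzero, so the ratio and its log are defined.
for _ in range(1000):
    numerator = random.randint(1, 99)
    denominator = 100 - numerator
    ratios.append(numerator / denominator)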

In [None]:
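# Take the natural log of each ratio to get the log(odds).
data = pd.DataFrame({'LogOdds': np.log(ratios)})

In [None]:
plt.hist(data.LogOdds, edgecolor='white', bins=20)
plt.title('Distribution of log(odds)')
plt.xlabel('log(odds)')
plt.ylabel('Count')
plt.show()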

>  The histogram is roughly bell-shaped and symmetric around zero, so it can be approximated with a normal distribution.

This makes the **log(odds)** useful for solving certain statistics problems - specifically ones where we are trying to determine probabilities about win/lose, or yes/no, or true/false types of situations.

# Odds Ratios

An odds ratio is just a ratio of two odds.

For instance, suppose the odds in favor of a patient not having cancer are 5 to 8, and the odds in favor of him not having the mutated gene responsible for cancer are 6 to 13.

If we calculate the ratio of these two odds, we get:

$$ \text{Ratio of Odds} = \frac{\frac{5}{8}}{\frac{6}{13}} = \frac{5}{8} \times \frac{13}{6} = \frac{65}{48} = 1.354 $$

Now what can we do with Odds Ratios?


Imagine we have 356 patients, and we have some information about who has cancer and who has the mutated gene.

We then organize the data in the table below:

| | Has Cancer | Has not Cancer |
|---:|----:| ----: |
|Has Mutated gene | 23 | 117 |
|Has not Mutated gene| 6| 210 |

We can use an **"odds ratio"** to determine if there is a relationship between the mutated gene and cancer.

If someone has the mutated gene, are the odds higher that they will get cancer?

Given that a person has the mutated gene, the odds that they have cancer are $ \frac{23}{117} $

And given that a person does not have the mutated gene, the odds that they have cancer are $ \frac{6}{210} $

Let's calculate our Odds Ratio and see what we'll get:

$$ \text{Odds Ratio} = \frac{\frac{23}{117}}{\frac{6}{210}} = 6.88 $$

So the odds of having cancer are **6.88** times higher for a person with the mutated gene than for a person without it, and **$ \text{log}(6.88) = 1.93 $**
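The calculation can be reproduced with a short sketch. It assumes that 23 patients have both the mutated gene and cancer, which is the count consistent with the 356-patient total and with the 6.88 result; the dictionary layout and names are illustrative only.

```python
import math

# 2x2 table of patient counts: rows are gene status, columns are cancer status.
# Assumption: 23 patients have both the mutated gene and cancer
# (23 + 117 + 6 + 210 = 356 patients in total).
table = {
    "mutated": {"cancer": 23, "no_cancer": 117},
    "not_mutated": {"cancer": 6, "no_cancer": 210},
}

# Odds of cancer within each group.
odds_with_gene = table["mutated"]["cancer"] / table["mutated"]["no_cancer"]
odds_without_gene = table["not_mutated"]["cancer"] / table["not_mutated"]["no_cancer"]

odds_ratio = odds_with_gene / odds_without_gene
log_odds_ratio = math.log(odds_ratio)

print(round(odds_ratio, 2), round(log_odds_ratio, 2))  # 6.88 1.93
```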

What does this mean?

The odds ratio and the log(odds ratio) are like **R-squared**; they indicate a relationship between two things (in this case, a relationship between the mutated gene and cancer)


So, an odds ratio far above 1 (a log(odds ratio) far above 0) means that the mutated gene is a **good predictor** of cancer. An odds ratio close to 1 (a log(odds ratio) close to 0) means that the mutated gene is not a good predictor of cancer.

# Sigmoid Functions

A sigmoid function is any mathematical function whose graph has a characteristic S-shaped or sigmoid curve.
<center><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/6/6f/Gjl-t%28x%29.svg/1920px-Gjl-t%28x%29.svg.png" alt="Some sigmoid functions compared. In the drawing all functions are normalized in such a way that their slope at the origin is 1." style="width:5000px;"/></center> 


In Linear Regression we saw that the target variable can in theory take any value. In Logistic Regression, the model's output can only take values between 0 and 1: it is the probability that an instance belongs to class 1 (supposing we have two classes, 1 and 0).



## Logistic Function

The logistic function is a sigmoid function, which takes any real input $t$, and outputs a value between zero and one.
The standard logistic function $ f :\mathbb {R} \rightarrow (0,1) $ is defined as follows:

$$ f(t) = {\frac {1}{1+e^{-t}}}={\frac {e^{t}}{1+e^{t}}}$$

<center><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/1280px-Logistic-curve.svg.png" alt="The logistic curve" style="width: 400px;"/></center> 
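A minimal sketch of the standard logistic function is shown below. It uses both algebraic forms from the definition above, picking whichever one avoids overflow in `math.exp` for large inputs. The `logit` helper is added only to illustrate that the logistic function undoes the log odds; both function names are illustrative choices, not part of the original notebook.

```python
import math


def logistic(t):
    """Standard logistic function f(t) = 1 / (1 + e^(-t))."""
    if t >= 0:
        # First form: e^(-t) is at most 1 here, so it cannot overflow.
        return 1.0 / (1.0 + math.exp(-t))
    # Second form: e^t / (1 + e^t), safe when t is negative.
    e_t = math.exp(t)
    return e_t / (1.0 + e_t)


def logit(p):
    """Log odds of a probability p with 0 < p < 1."""
    return math.log(p / (1.0 - p))


print(logistic(0))  # 0.5

# Applying the logistic function to the log odds recovers the probability.
print(round(logistic(logit(0.6)), 6))  # 0.6
```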

Let us assume that $t$ is a linear function of a single explanatory variable $x$ (the case where $t$ is a linear combination of multiple explanatory variables is treated similarly). We can then express $t$ as follows:

$ t=\beta _{0}+\beta _{1}x $

And the general logistic function $ p:\mathbb {R} \rightarrow (0,1) $ can now be written as:

$$ p(x)=f(t)=\frac{1}{1+e^{-(\beta _{0}+\beta _{1}x)}} $$

In the logistic model, $p(x)$ is interpreted as the probability of the dependent variable $Y$ equaling a success/case (i.e. reference class) rather than a failure/non-case.
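To make this concrete, here is a sketch that evaluates $p(x)$ for the hours-studied example. The coefficient values `BETA_0 = -4.5` and `BETA_1 = 1.0` are hypothetical, chosen only for illustration; they are not estimated from the data.

```python
import math

# Hypothetical coefficients, picked by hand for illustration (not fitted).
BETA_0 = -4.5
BETA_1 = 1.0


def predict_proba(x, beta_0=BETA_0, beta_1=BETA_1):
    """Return p(x) = 1 / (1 + e^-(beta_0 + beta_1 * x))."""
    t = beta_0 + beta_1 * x
    return 1.0 / (1.0 + math.exp(-t))


# Probability of passing for each value of "Hours Studied" in the table.
for hours in (3, 4, 5, 6, 7):
    print(hours, round(predict_proba(hours), 3))
```

With these assumed values, $p(x) = 0.5$ exactly where $\beta_0 + \beta_1 x = 0$, that is, at $x = 4.5$ hours.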

**BIG NOTE!**


$\beta _{0} $ and $ \beta _{1}$ are estimated using **Maximum Likelihood Estimation (MLE)**, in contrast to the Linear Regression setup, where we use ***Least Squares Estimation***.
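As a rough illustration of what MLE optimizes, the sketch below computes the log-likelihood of the hours-studied data for two candidate pairs of coefficients. The candidate values and the function name `log_likelihood` are assumptions made for this example. It only compares candidates and does not perform the actual optimization.

```python
import math

# Data from the hours-studied table (1 = passed, 0 = failed).
hours = [3, 4, 5, 6, 7]
passed = [0, 0, 1, 1, 1]


def log_likelihood(beta_0, beta_1, xs, ys):
    """Sum of log-probabilities the model assigns to the observed outcomes."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(beta_0 + beta_1 * x)))
        # Observed pass contributes log(p); observed fail contributes log(1 - p).
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


# Candidate 1: no dependence on hours studied (p = 0.5 for everyone).
ll_flat = log_likelihood(0.0, 0.0, hours, passed)
# Candidate 2: hypothetical coefficients under which p rises with hours studied.
ll_sloped = log_likelihood(-4.5, 1.0, hours, passed)

print(ll_sloped > ll_flat)  # True
```

MLE searches for the coefficients with the highest log-likelihood. Note that in this tiny table the passes and fails are perfectly separated by hours studied, so the likelihood keeps improving as $\beta_1$ grows; a real fit would need more data or some form of regularization.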

**Attend the online meeting for details**

~Y.L