# Logistic Regression

In this module, you will learn logistic regression, which is a classification model, i.e. it will help you make predictions in cases where the output is a categorical variable.



Since logistic regression is the most easily interpretable of all classification models, it is very commonly used in various industries such as banking, healthcare, etc.


Please find the Diabetes data [here](https://ml-course2-upgrad.s3.amazonaws.com/Logistic+Regression/Univariate+Logistic+Regression/Diabetes+Example+Data.csv) and the telecom Churn data [here](https://ml-course2-upgrad.s3.amazonaws.com/Logistic+Regression/Univariate+Logistic+Regression/Telecom_Churn.zip).

**Prerequisites**

You’ll need to brush up on basic maths related to exponentials and logarithms before you begin this session. You can revise these topics using the links given below.

1. [Mathplanet - Exponentials](https://www.mathplanet.com/education/algebra-1/exponents-and-exponential-functions/properties-of-exponents)
2. [Mathplanet - Logarithms](https://www.mathplanet.com/education/algebra-2/exponential-and-logarithmic-functions/logarithm-property)

## Univariate Logistic Regression

In this session, you will learn a few basic concepts related to logistic regression. 


Broadly speaking, the topics that will be covered in this session are:

* **Binary classification**
* **Sigmoid function**
* **Likelihood function**
* **Building a logistic regression model in Python**
* **Odds and log odds**

You will learn about all these concepts through a univariate logistic regression example. Also, if these terms sound a little alien to you right now, you don’t need to worry. By the time you are done with this module, you will be well-versed in the terms of odds and log odds!



### Binary classification

A binary classification problem is a classification problem in which the target variable has only two possible values:

![77.png](attachment:3d3df6b8-7cc1-4846-a25a-732594727124.png)

So, now you know what a binary classification problem is.

Please find the Diabetes dataset [here](https://ml-course2-upgrad.s3.amazonaws.com/Logistic+Regression/Univariate+Logistic+Regression/Diabetes+Example+Data.csv).

![79.png](attachment:4a8ffea9-7a0c-4137-b57f-854537252cd9.png)

Now, recall the graph of the diabetes example. Suppose there is another person, with a blood sugar level of 195, and you do not know whether that person has diabetes or not. What would you do then? Would you classify him/her as a diabetic or as a non-diabetic?

![78.png](attachment:8cc6edad-df15-48a0-aa53-f8e7549a1e83.png)

Now, based on the boundary, you may be tempted to declare this person a diabetic, but can you really do that? This person’s sugar level (195 mg/dL) is very close to the threshold (200 mg/dL), below which people are declared as non-diabetic. It is, therefore, quite possible that this person was just a non-diabetic with a slightly high blood sugar level. After all, the data does have people with slightly high sugar levels (220 mg/dL), who are not diabetics.




### Sigmoid Curve 

In the last section, you saw what a binary classification problem is, and then you saw an example of a **binary classification problem**, where a model is trying to predict whether a person has diabetes or not based on his/her blood sugar level. You saw how using a simple boundary decision method would not work in this case.

The primitive binary classification model you saw earlier can be modified to make it more useful.

![81.png](attachment:8de501a4-3694-40c4-a96d-6c54a7d06382.png)

So, to recap, since the sigmoid curve has all the properties you would want (extremely low values at the start, extremely high values at the end, and intermediate values in the middle), it’s a good choice for modelling the value of the probability of diabetes.

So now we have verified, with actual values, that the sigmoid curve actually has the properties we discussed earlier, i.e. extremely low values in the start, extremely high values in the end, and intermediate values in the middle.
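These properties can also be checked numerically. The following is a minimal sketch; the values of `beta0` and `beta1` are illustrative assumptions chosen so that the curve rises around 200 mg/dL, and they are not fitted to any data.

```python
import numpy as np

def sigmoid(x, beta0, beta1):
    """Sigmoid curve P = 1 / (1 + exp(-(beta0 + beta1 * x)))."""
    return 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))

# Illustrative (assumed) parameters; the curve crosses 0.5 at x = 200
beta0, beta1 = -13.0, 0.065

sugar_levels = np.array([100, 150, 180, 200, 220, 250, 300])
probabilities = sigmoid(sugar_levels, beta0, beta1)

# Low values at the start, intermediate values in the middle, high values at the end
for level, p in zip(sugar_levels, probabilities):
    print(f"blood sugar {level:>3} mg/dL -> P(diabetes) = {p:.3f}")
```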



However, you may be wondering — why can’t you just fit a straight line here? This would also have the same properties — low values in the start, high ones towards the end, and intermediate ones in the middle.

![82.png](attachment:6372c14a-e7af-4705-9cbb-0a9df5f77d52.png)

The main problem with a straight line is that it is not steep enough. In the sigmoid curve, as you can see, you have low values for a lot of points, then the values rise all of a sudden, after which you have a lot of high values. In a straight line though, the values rise from low to high very uniformly, and hence, the “boundary” region, the one where the probabilities transition from low to high, is not present.

#### Finding the Best Fit Sigmoid Curve

The goal is to find the combination of ![β0](https://latex.upgrad.com/render?formula=%CE%B2_0) and ![β1](https://latex.upgrad.com/render?formula=%CE%B2_1) that fits the data best.

So, by varying the values of ![82.png](https://latex.upgrad.com/render?formula=%CE%B2_0) and ![82.png](https://latex.upgrad.com/render?formula=%CE%B2_1), you get different sigmoid curves. Now, based on some function that you have to minimise or maximise, you will get the best fit sigmoid curve.



#### Interactive app

You can use the interactive app linked below to see for yourself how the **curve changes** when the values of ![β0](https://latex.upgrad.com/render?formula=%CE%B2_0) and ![β1](https://latex.upgrad.com/render?formula=%CE%B2_1) are changed.

[https://da-upgrad.shinyapps.io/sigmoid/](https://da-upgrad.shinyapps.io/sigmoid/)




So, the best fitting combination of ![82.png](https://latex.upgrad.com/render?formula=%CE%B2_0) and ![82.png](https://latex.upgrad.com/render?formula=%CE%B2_1), will be the one which maximises the product:

![82.png](https://latex.upgrad.com/render?formula=%281-P_1%29%281-P_2%29%281-P_3%29%281-P_4%29%281-P_6%29%28P_5%29%28P_7%29%28P_8%29%28P_9%29%28P_%7B10%7D%29)

### Likelihood Function

This product is called the **likelihood function**. It is the product of:

$$\text{Likelihood} = \underbrace{\prod (1-P_i)}_{\text{for all non-diabetics}} \times \underbrace{\prod P_i}_{\text{for all diabetics}}$$

So, say that for the ten points in our example, the labels are a little different, somewhat like this:
<table align="center" border="1" cellpadding="1" cellspacing="1"><tbody><tr><td style="text-align: center;">Point no.</td><td style="text-align: center;">1</td><td style="text-align: center;">2</td><td style="text-align: center;">3</td><td style="text-align: center;">4</td><td style="text-align: center;">5</td><td style="text-align: center;">6</td><td style="text-align: center;">7</td><td style="text-align: center;">8</td><td style="text-align: center;">9</td><td style="text-align: center;">10</td></tr><tr><td style="text-align: center;">Diabetes</td><td style="text-align: center;">no</td><td style="text-align: center;">no</td><td style="text-align: center;">no</td><td style="text-align: center;">yes</td><td style="text-align: center;">no</td><td style="text-align: center;">yes</td><td style="text-align: center;">no</td><td style="text-align: center;">yes</td><td style="text-align: center;">yes</td><td style="text-align: center;">yes</td></tr></tbody></table>


In this case, the likelihood would be equal to ![likelihood](https://latex.upgrad.com/render?formula=%281-P_1%29%281-P_2%29%281-P_3%29%281-P_5%29%281-P_7%29%28P_4%29%28P_6%29%28P_8%29%28P_9%29%28P_%7B10%7D%29), i.e. one factor of (1 - P) for each non-diabetic (points 1, 2, 3, 5 and 7) and one factor of P for each diabetic (points 4, 6, 8, 9 and 10). The best fitting sigmoid curve would be the one which maximises the value of this product.
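The product can be computed directly for any candidate curve. The sketch below uses the labels from the table above; the blood sugar levels and the two pairs of beta values are hypothetical, chosen only to show that a curve that rises with blood sugar scores a higher likelihood than the same curve flipped upside down.

```python
import numpy as np

def sigmoid(x, beta0, beta1):
    return 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))

def likelihood(beta0, beta1, x, y):
    """Product of P_i for diabetics (y = 1) and (1 - P_i) for non-diabetics (y = 0)."""
    p = sigmoid(x, beta0, beta1)
    return np.prod(np.where(y == 1, p, 1.0 - p))

# Hypothetical blood sugar levels for points 1..10 (not taken from the course data)
x = np.array([150, 160, 170, 180, 190, 200, 210, 220, 230, 240])
# Labels from the table: no, no, no, yes, no, yes, no, yes, yes, yes
y = np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 1])

lik_rising = likelihood(-13.5, 0.065, x, y)    # assumed curve that rises with blood sugar
lik_falling = likelihood(13.5, -0.065, x, y)   # the same curve flipped upside down

print(f"rising curve:  {lik_rising:.6f}")
print(f"falling curve: {lik_falling:.6f}")
```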

If you had to find β0 and β1 for the best fitting sigmoid curve, you would have to try a lot of combinations until you arrive at the one which maximises the likelihood. This is similar to linear regression, where you vary β0 and β1 until you find the combination that minimises the cost function.

![83.png](attachment:38dc9fe1-6f86-486b-b1e1-7e1c0ac24b03.png)


In the interactive app given below, you can try a few combinations yourself and see how the likelihood varies with betas.

https://da-upgrad.shinyapps.io/likelihood/



So, just by looking at the curve here, you can get a general idea of the curve’s fit. Just look at the yellow bars for each of the 10 points. A curve that has a lot of big yellow bars is a good curve. For example, this curve is not a good fit:

![84.png](attachment:a534d253-4728-4ae2-aa57-8dc989a0352d.png)

This curve, though, is a better fit -

![85.png](attachment:1c821ded-4db9-49e7-b4cb-ab7d3b97cb15.png)

Clearly, this curve is a **better fit**. It has **many big yellow bars**, and even the **small ones are reasonably large**. Just by looking at this curve, you can tell that it will have a high likelihood value.

You saw that by **trying different values of β0 and β1**, you can manipulate the shape of the sigmoid curve. **At some combination** of β0 and β1, the **likelihood (the product of the lengths of the yellow bars) will be maximised.**
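The idea of trying many combinations can be sketched as a coarse grid search. The data, the grid ranges and the grid resolution below are all assumptions made for illustration; real libraries use far more efficient optimisers than this brute-force search.

```python
import numpy as np

# Hypothetical data: blood sugar levels and labels (1 = diabetic, 0 = non-diabetic)
x = np.array([150, 160, 170, 180, 190, 200, 210, 220, 230, 240])
y = np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 1])

beta0_grid = np.linspace(-30.0, 0.0, 61)   # candidate intercepts (assumed range)
beta1_grid = np.linspace(0.0, 0.2, 81)     # candidate slopes (assumed range)

# Evaluate the sigmoid for every (beta0, beta1) pair at every data point
B0, B1 = np.meshgrid(beta0_grid, beta1_grid, indexing="ij")
P = 1.0 / (1.0 + np.exp(-(B0[..., None] + B1[..., None] * x)))

# Likelihood of each pair: P for diabetics, (1 - P) for non-diabetics, multiplied together
lik = np.prod(np.where(y == 1, P, 1.0 - P), axis=-1)

# Pick the pair with the highest likelihood
i, j = np.unravel_index(np.argmax(lik), lik.shape)
best_beta0, best_beta1, best_lik = beta0_grid[i], beta1_grid[j], lik[i, j]
print(f"best beta0 = {best_beta0:.2f}, best beta1 = {best_beta1:.4f}, likelihood = {best_lik:.5f}")
```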

### Logistic Regression in Python

In Python, logistic regression can be implemented using libraries such as scikit-learn and statsmodels, though looking at the coefficients and the model summary is easier using statsmodels.


You can find the optimum values of β0 and β1 using the Python code given below. The code is run here first so that, with the fitted values in hand, we can proceed to the very important concept of **Odds and Log Odds**.


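```python
import pandas as pd
import numpy as np

# Load the diabetes example data and look at the first few rows
dib = pd.read_csv('Diabetes+Example+Data.csv')
dib.head()
```

|   | Blood Sugar Level | Diabetes |
|---|---|---|
| 0 | 190 | No |
| 1 | 240 | Yes |
| 2 | 300 | Yes |
| 3 | 160 | No |
| 4 | 200 | Yes |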


```python
# Converting Yes to 1 and No to 0
dib['Diabetes'] = dib['Diabetes'].map({'Yes': 1, 'No': 0})

dib.head()
```

|   | Blood Sugar Level | Diabetes |
|---|---|---|
| 0 | 190 | 0 |
| 1 | 240 | 1 |
| 2 | 300 | 1 |
| 3 | 160 | 0 |
| 4 | 200 | 1 |


```python
# Putting feature variable to X
X = dib['Blood Sugar Level']

# Putting response variable to y
y = dib['Diabetes']
```

```python
import statsmodels.api as sm

# add_constant adds the intercept column; the Binomial family gives logistic regression
logm1 = sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial())
logm1.fit().summary()
```

|   |   |   |   |
|---|---|---|---|
| Dep. Variable: | Diabetes | No. Observations: | 10 |
| Model: | GLM | Df Residuals: | 8 |
| Model Family: | Binomial | Df Model: | 1 |
| Link Function: | Logit | Scale: | 1.0 |
| Method: | IRLS | Log-Likelihood: | -2.5838 |
| Date: | Sun, 03 Aug 2025 | Deviance: | 5.1676 |
| Time: | 12:26:21 | Pearson chi2: | 4.32 |
| No. Iterations: | 7 | Pseudo R-squ. (CS): | 0.5809 |
| Covariance Type: | nonrobust | | |

|   | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -13.5243 | 9.358 | -1.445 | 0.148 | -31.866 | 4.817 |
| Blood Sugar Level | 0.0637 | 0.044 | 1.439 | 0.150 | -0.023 | 0.150 |

In the summary shown above, 'const' corresponds to β0 and 'Blood Sugar Level', i.e. the feature x, corresponds to β1. So, β0 ≈ -13.5 and β1 ≈ 0.06 (more precisely, -13.5243 and 0.0637).
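With these coefficients, the earlier question about the person with a blood sugar level of 195 can be answered in terms of probability. The sketch below simply plugs the coefficients from the summary into the sigmoid equation; the helper name `predict_probability` is made up for this illustration.

```python
import math

# Coefficients copied from the model summary
beta0 = -13.5243
beta1 = 0.0637

def predict_probability(blood_sugar):
    """Probability of diabetes from the fitted sigmoid curve."""
    log_odds = beta0 + beta1 * blood_sugar
    return 1.0 / (1.0 + math.exp(-log_odds))

p_195 = predict_probability(195)
print(f"P(diabetes | blood sugar = 195) = {p_195:.2f}")
```

With these coefficients the predicted probability comes out at roughly one in four, which is more informative than a hard yes or no from a cutoff.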

### Odds and Log Odds

So far, you’ve seen this equation for logistic regression:
![eq](https://latex.upgrad.com/render?formula=P%3D%5Cfrac%7B1%7D%7B1%2Be%5E%7B-(%5Cbeta_%7B0%7D%2B%5Cbeta_%7B1%7Dx)%7D%7D)

Recall that this equation gives the relationship between P, the probability of diabetes, and x, the patient’s blood sugar level. Here, **P** is the probability that the patient is diabetic, and **1 - P** is the probability that the patient is non-diabetic.

While the equation is correct, it is not very intuitive. In other words, the **relationship between P and x** is so complex that it **is difficult to understand** what kind of **trend** exists between the two. If you increase x by regular intervals of, say, 11.5, how will that affect the probability? Will it also increase by some regular interval? If not, what will happen?

 

So, clearly, the relationship between P and x is too complex to see any apparent trends. However, if you convert the equation to a slightly different form, you can achieve a much more intuitive relationship. We convert the sigmoid equation ![eq](https://latex.upgrad.com/render?formula=P%3D%5Cfrac%7B1%7D%7B1%2Be%5E%7B-(%5Cbeta_%7B0%7D%2B%5Cbeta_%7B1%7Dx)%7D%7D) to a more linear form:

![87.png](attachment:c0d716db-2ae9-42b0-85c6-5d67bc1386c5.png)

So, now, instead of probability, you have **odds** and **log odds**. Clearly, the **relationship** between them and x is much more **intuitive** and easy to understand.
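The conversion from the sigmoid equation to the log odds form can be written out step by step, using the same symbols as above:

$$
\begin{aligned}
P &= \frac{1}{1+e^{-(\beta_0+\beta_1 x)}} \\
1-P &= \frac{e^{-(\beta_0+\beta_1 x)}}{1+e^{-(\beta_0+\beta_1 x)}} \\
\frac{P}{1-P} &= e^{\beta_0+\beta_1 x} \\
\ln\left(\frac{P}{1-P}\right) &= \beta_0+\beta_1 x
\end{aligned}
$$

The third line gives the odds and the fourth line gives the log odds, which is a straight line in x.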

So, the relationship between x and probability is not intuitive, while that between x and **odds/log odds** is. This has important implications. Suppose you are discussing sugar levels and the probability they correspond to. While talking about 4 patients with sugar levels of 180, 200, 220 and 240, you will not be able to intuitively understand the relationship between their probabilities (10%, 28%, 58%, 83%). However, if you are talking about the log odds of these 4 patients, you know that their log odds are in a **linearly increasing pattern** (-2.18, -0.92, 0.34, 1.60, increasing by 1.26 at each step) and that the odds are in a **multiplicatively increasing pattern** (0.11, 0.40, 1.40, 4.95, increasing by a factor of about 3.5 at each step).
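The two patterns can be reproduced with a few lines of Python. This sketch uses the coefficients from the statsmodels summary, so the printed values may differ slightly from the figures quoted above, which appear to be based on more coarsely rounded coefficients; the pattern itself (a constant step in log odds, a constant ratio in odds) is the point.

```python
import math

beta0, beta1 = -13.5243, 0.0637  # coefficients from the statsmodels summary

rows = []
for x in (180, 200, 220, 240):
    log_odds = beta0 + beta1 * x        # linear in x
    odds = math.exp(log_odds)           # P / (1 - P)
    prob = odds / (1.0 + odds)          # back to probability
    rows.append((x, log_odds, odds, prob))
    print(f"x = {x}  log odds = {log_odds:6.2f}  odds = {odds:5.2f}  P = {prob:.2f}")

# Differences between consecutive log odds, and ratios between consecutive odds
log_odds_steps = [b[1] - a[1] for a, b in zip(rows, rows[1:])]
odds_ratios = [b[2] / a[2] for a, b in zip(rows, rows[1:])]
print("log odds steps:", [round(s, 3) for s in log_odds_steps])
print("odds ratios:   ", [round(r, 3) for r in odds_ratios])
```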

Hence, many times, it makes more sense to present a logistic regression model’s results in terms of log odds or odds than to talk in terms of probability. This happens especially a lot in industries like finance, banking, etc.

That's the end of this session on univariate logistic regression. You studied logistic regression, specifically the sigmoid function, the likelihood function, and odds and log odds.

## Summary

You first learnt what a **binary classification** is. Basically, it is a classification problem in which the target variable has only 2 possible values.

You then went through the **diabetes example** in detail, wherein you tried to predict whether a person has diabetes or not based on that person’s blood sugar level.

You saw why a **simple boundary decision approach** does not work very well for this example. It would be too risky to decide the class purely on the basis of the cutoff because, especially in the middle, the patients could belong to either class, diabetic or non-diabetic.

![image.png](https://d35ev2v1xsdze0.cloudfront.net/8d2e0365-c17f-4026-b44d-9774c06b4906-pqif8t9n.png)

Hence, you learnt that it is better to talk in terms of probability. One such curve which can model the probability of diabetes very well is the sigmoid curve.

![image.png](https://d35ev2v1xsdze0.cloudfront.net/f0e57359-f0cb-4c7b-9a0c-01d47e4a8240-pq5xz2di.png)


Its equation is given by the following:

![ghj](https://latex.upgrad.com/render?formula=P%20%5Cleft%28%5Cright.%20D%20i%20a%20b%20e%20t%20e%20s%20%5Cleft.%5Cright%29%20%3D%20%5Cfrac%7B1%7D%7B1%20%2B%20e%5E%7B-%20%5Cleft%28%5Cright.%20%5Cbeta_%7B0%7D%20%2B%20%5Cbeta_%7B1%7D%20x%20%5Cleft.%5Cright%29%7D%7D)

Then, you learnt that in order to find the **best-fit sigmoid curve**, you need to vary ![β0](https://latex.upgrad.com/render?formula=%CE%B2_0) and ![β1](https://latex.upgrad.com/render?formula=%CE%B2_1) until you get the combination of beta values that maximises the likelihood. For the diabetes example, the likelihood is given by the expression:

![jhk](https://d35ev2v1xsdze0.cloudfront.net/ad3aa242-39d6-4bee-904d-96cf257244fa-w8o5ljfn.png)

![kj](https://latex.upgrad.com/render?formula=L%20i%20k%20e%20l%20i%20h%20o%20o%20d%20%3D%20%5Cleft%281%20-%20P_%7B1%7D%5Cright%29%20%5Cleft%281%20-%20P_%7B2%7D%5Cright%29%20%5Cleft%281%20-%20P_%7B3%7D%5Cright%29%20%5Cleft%281%20-%20P_%7B4%7D%5Cright%29%20%5Cleft%28P_%7B5%7D%5Cright%29%20%5Cleft%281%20-%20P_%7B6%7D%5Cright%29%20%5Cleft%28P_%7B7%7D%5Cright%29%20%5Cleft%28P_%7B8%7D%5Cright%29%20%5Cleft%28P_%7B9%7D%5Cright%29%20%5Cleft%28P_%7B10%7D%5Cright%29)

It is the product of:

$$\text{Likelihood} = \underbrace{\prod (1-P_i)}_{\text{for all non-diabetics}} \times \underbrace{\prod P_i}_{\text{for all diabetics}}$$

This process, where you vary the betas until you find the best fit curve for the probability of diabetes, is called **logistic regression.**

After this, you saw a simpler way of interpreting the equation for logistic regression. You saw that the following linearised equation is much easier to interpret:

![](https://latex.upgrad.com/render?formula=l%20n%20%5Cleft%28%5Cfrac%7Bp%7D%7B1%20-%20p%7D%5Cright%29%20%3D%20%5Cbeta_%7B0%7D%20%2B%20%5Cbeta_%7B1%7D%20x)


The left-hand side of this equation is what is called **log odds**. Basically, the odds of having diabetes, P/(1-P), indicate how much likelier a person is to have diabetes than to not have it. For example, a person for whom the odds of having diabetes are equal to 3 is 3 times as likely to have diabetes as to not have it. In other words, P(Diabetes) = 3 * P(No diabetes).

You also saw how odds vary with variation in x. Basically, with every **linear increase** in x, the increase in odds is **multiplicative**. For example, in the diabetes case, after every increase of 11.5 in the value of x, the odds are approximately doubled, i.e. they increase by a multiplicative factor of about 2.

# Logistic Regression - Optimisation Methods (Optional)

The question is: how do you find the optimal values of β0 and β1 such that the likelihood function is maximised? The statsmodels summary shown earlier lists the fitting method as IRLS (iteratively reweighted least squares). You can find more details on this from the following [article](https://www.nucleusbox.com/cost-function-in-logistic-regression/).
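As one illustration of such an optimisation method, the sketch below maximises the log-likelihood with Newton-Raphson updates. It is a minimal sketch under stated assumptions, not the exact routine any particular library runs: the ten data points are hypothetical, and the feature is standardised first purely to keep the updates numerically well behaved.

```python
import numpy as np

# Hypothetical data: blood sugar levels and diabetes labels (1 = yes, 0 = no)
x = np.array([150, 160, 170, 180, 190, 200, 210, 220, 230, 240], dtype=float)
y = np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 1], dtype=float)

# Standardise x, then add a column of ones for the intercept
z = (x - x.mean()) / x.std()
X = np.column_stack([np.ones_like(z), z])

def log_likelihood(beta):
    """Log of the likelihood: sum of log(P) for y = 1 and log(1 - P) for y = 0."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

beta = np.zeros(2)
start_ll = log_likelihood(beta)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    gradient = X.T @ (y - p)                  # slope of the log-likelihood
    hessian = -(X.T * (p * (1 - p))) @ X      # curvature of the log-likelihood
    beta = beta - np.linalg.solve(hessian, gradient)  # Newton-Raphson update

final_gradient = X.T @ (y - 1.0 / (1.0 + np.exp(-X @ beta)))

# Convert the coefficients back to the original blood sugar scale
beta1 = beta[1] / x.std()
beta0 = beta[0] - beta1 * x.mean()
print(f"beta0 = {beta0:.3f}, beta1 = {beta1:.4f}, log-likelihood = {log_likelihood(beta):.4f}")
```

At the maximum the gradient is (numerically) zero, which is what the loop drives towards.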

# Multivariate Logistic Regression (Model Building)

Just like when you’re building a model using linear regression, one independent variable might not be enough to capture all the uncertainties of the target variable in logistic regression as well. So in order to make good and accurate predictions, you need multiple variables.

“Do you need any extensions while moving from univariate to multivariate logistic regression?” 

Recall the equation used in the case of univariate logistic regression was:
![eq](https://latex.upgrad.com/render?formula=P%3D%5Cfrac%7B1%7D%7B1%2Be%5E%7B-(%5Cbeta_%7B0%7D%2B%5Cbeta_%7B1%7Dx)%7D%7D)


The above equation has only one feature variable X, for which the coefficient is β1. Now, if you have multiple features, say n, you can simply extend this equation with ‘n’ feature variables and ‘n’ corresponding coefficients such that the equation now becomes:

![](https://latex.upgrad.com/render?formula=P%3D%5Cfrac%7B1%7D%7B1%2Be%5E%7B-(%5Cbeta_%7B0%7D%2B%5Cbeta_%7B1%7DX_%7B1%7D%2B%5Cbeta_%7B2%7DX_%7B2%7D%2B%5Cbeta_%7B3%7DX_%7B3%7D%2B...%2B%5Cbeta_%7Bn%7DX_%7Bn%7D)%7D%7D)
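In code, the extension amounts to replacing the single product β1·x with a sum over all features. The following is a minimal sketch with made-up feature values and coefficients for n = 3; the function name `predict_proba` is chosen for this illustration only.

```python
import numpy as np

def predict_proba(X, beta0, betas):
    """P = 1 / (1 + exp(-(beta0 + beta1*X1 + ... + betan*Xn))) for each row of X."""
    log_odds = beta0 + X @ betas
    return 1.0 / (1.0 + np.exp(-log_odds))

# Hypothetical example: 2 observations, n = 3 features, made-up coefficients
X = np.array([[0.5, 1.0, -1.2],
              [2.0, -0.3, 0.4]])
beta0 = -0.5
betas = np.array([0.8, -1.1, 0.3])

probs = predict_proba(X, beta0, betas)
print(probs)
```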




## Steps:

* Build a multivariate logistic regression model in Python
* Conduct feature selection for logistic regression using:
    * Automated methods: RFE (Recursive Feature Elimination)
    * Manual methods: VIF and p-value check

We will use the ‘Telecom Churn’ dataset in this session to build a model using multivariate logistic regression. This will involve all the familiar steps such as:

* Data cleaning and preparation
* Preprocessing steps
* Test-train split
* Feature scaling
* Model Building using RFE, p-values and VIFs


Apart from the familiar old steps, you’ll also be introduced to something known as a confusion matrix and you’ll also learn how the accuracy is measured for a logistic regression model.
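The overall flow of these steps can be sketched with scikit-learn as follows. This is only an outline under stated assumptions: it uses a synthetic dataset from `make_classification` as a stand-in for prepared data (not the telecom churn data), keeps 3 features arbitrarily, and leaves out the manual p-value and VIF checks.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a cleaned, prepared dataset (NOT the telecom churn data)
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           n_redundant=2, random_state=42)

# Test-train split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42)

# Feature scaling: fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Automated feature selection with RFE (3 features is an arbitrary choice here)
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=3)
rfe.fit(X_train_scaled, y_train)
selected = rfe.support_

# Fit the final model on the selected features and predict on the test set
model = LogisticRegression()
model.fit(X_train_scaled[:, selected], y_train)
y_pred = model.predict(X_test_scaled[:, selected])

# Confusion matrix: rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
print(cm)
print(f"accuracy = {accuracy:.3f}")
```

Accuracy is the share of correct predictions, i.e. the sum of the diagonal of the confusion matrix divided by the total number of test observations.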

## Telecom Churn Prediction

**Problem Statement**

You have a telecom firm which has collected data of all its customers. The main types of attributes are:

* Demographics (age, gender etc.)
* Services availed (internet packs purchased, special offers taken etc.)
* Expenses (amount of recharge done per month etc.)

Based on all this past information, you want to build a model which will **predict** whether a particular customer will **churn** or not, i.e. whether they will switch to a different service provider or not. So the variable of interest, i.e. the target variable here is ‘Churn’ which will tell us whether or not a particular customer has churned. It is a binary variable - **1 means** that the customer has **churned** and **0 means** the customer has **not churned**.


* Please find the churn dataset [here](https://ml-course2-upgrad.s3.amazonaws.com/Multivariate+Logistic+Regression+-+Model+Building/churn_data.csv).

* Please find the internet_data dataset [here](https://ml-course2-upgrad.s3.amazonaws.com/Multivariate+Logistic+Regression+-+Model+Building/internet_data.csv).

* Please find the customer_data dataset [here](https://ml-course2-upgrad.s3.amazonaws.com/Multivariate+Logistic+Regression+-+Model+Building/customer_data.csv).

* Please find the data dictionary [here](https://ml-course2-upgrad.s3.amazonaws.com/Multivariate+Logistic+Regression+-+Model+Building/Telecom+Churn+Data+Dictionary.csv) 



Please find the Logistic Regression code file [here](https://github.com/ContentUpgrad/Logistic-Regression/blob/main/Multivariate%20Logistic%20Regression%20-%20Model%20Building/Logistic%2BRegression%2B-%2BTelecom%2BChurn%2BCase%2BStudy%20(1)%20(1).ipynb) 

So, here’s what the data frame churn_data looks like:

![](https://d35ev2v1xsdze0.cloudfront.net/f42ebec6-bf67-40f2-af44-fe80797c2954-62ha7lq5.png)

Also, here’s the data frame customer_data:

![](https://d35ev2v1xsdze0.cloudfront.net/ff18ac43-7c78-4a11-98ef-b4914e7d676c-o8uzn8o5.png)

Lastly, here’s the data frame internet_data:

![](https://d35ev2v1xsdze0.cloudfront.net/9128d33a-7b56-4712-9ced-cd2afef2fb26-jnw4zjru.png)


Now, as you can clearly see, the first 5 customer IDs are exactly the same for each of these data frames. Hence, using the column customer ID, you can collate or merge the data into a single data frame. We'll start with that in the next segment.
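A merge on the customer ID column can be sketched as below. The three small data frames are made up for illustration, and the key column name `customerID` as well as the other column names are assumptions; with the real files you would load each CSV with `pd.read_csv` and check the actual column names first.

```python
import pandas as pd

# Tiny made-up frames standing in for churn_data, customer_data and internet_data.
# The key column name 'customerID' and all values are assumptions for illustration.
churn_data = pd.DataFrame({"customerID": ["A1", "B2", "C3"],
                           "Churn": ["No", "Yes", "No"]})
customer_data = pd.DataFrame({"customerID": ["A1", "B2", "C3"],
                              "gender": ["Female", "Male", "Female"]})
internet_data = pd.DataFrame({"customerID": ["A1", "B2", "C3"],
                              "InternetService": ["DSL", "Fiber optic", "No"]})

# Merge two frames at a time on the shared key
df_1 = pd.merge(churn_data, customer_data, how="inner", on="customerID")
telecom = pd.merge(df_1, internet_data, how="inner", on="customerID")
print(telecom)
```

An inner merge keeps only the customers present in every frame, which is a reasonable default when the IDs are expected to match exactly.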
