# LOGISTIC REGRESSION

Logistic regression is a supervised learning method for classification. In its basic form the model predicts a binary outcome with two possible values: yes/no, 0/1, or true/false. It is used to describe data and to explain the relationship between one binary dependent variable and one or more nominal, ordinal, interval, or ratio-level independent variables.

## Types of Logistic Regression

### Binary logistic regression
  Binary logistic regression is used to predict the probability of a binary outcome, such as yes or no, true or false, or 0 or 1. For example, it could be used to predict whether a customer will churn or not, whether a patient has a disease or not, or whether a loan will be repaid or not.

### Multinomial logistic regression
  Multinomial logistic regression is used to predict the probability of one of three or more possible outcomes, such as the type of product a customer will buy, the rating a customer will give a product, or the political party a person will vote for.
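
A minimal sketch of how multinomial logistic regression turns one linear score per class into probabilities, assuming the usual softmax formulation. The function name `softmax` and the example scores are illustrative and are not taken from this notebook. With two classes and one score fixed at 0, softmax reduces to the sigmoid used in the binary case.

```python
import numpy as np

def softmax(scores):
    """Convert a vector of class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)

# Hypothetical linear scores for three classes
three_class_probs = softmax(np.array([2.0, 1.0, 0.1]))
print(three_class_probs, three_class_probs.sum())

# Two-class special case: softmax([z, 0])[0] equals sigmoid(z)
z = 0.8
two_class_prob = softmax(np.array([z, 0.0]))[0]
sigmoid_prob = 1 / (1 + np.exp(-z))
print(two_class_prob, sigmoid_prob)
```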

### Ordinal logistic regression
  Ordinal logistic regression is used to predict the probability of an outcome whose categories have a natural order, such as the level of customer satisfaction, the severity of a disease, or the stage of cancer.


## Difference between Linear Regression and Logistic Regression

The primary difference between linear regression and logistic regression is that the output of logistic regression is bounded between 0 and 1. In addition, unlike linear regression, logistic regression does not require a linear relationship between the input variables and the output variable; it assumes instead that the log-odds of the outcome are linear in the inputs.

Recall that the equation of the best-fit line in linear regression is:

y = β<sub>0</sub> +  β<sub>1</sub>x

Let’s say that instead of y we model a probability P:

P = β<sub>0</sub> +  β<sub>1</sub>x

But there is an issue here: the right-hand side can exceed 1 or go below 0, and we know that a probability must lie between 0 and 1. To overcome this issue we take the “odds” of P:

P/(1 - P) = β<sub>0</sub> +  β<sub>1</sub>x

Do you think we are done here? No, we are not. Odds are nothing but the ratio of the probability of success to the probability of failure. They can never be negative, which means their range is always [0, +∞). Now the question comes: out of so many possible transformations, why did we choose ‘odds’? Because odds are probably the easiest way to do this, that’s it.

The problem here is that this range is still restricted, while the linear expression β<sub>0</sub> + β<sub>1</sub>x can take any real value. It is difficult to model a variable that has a restricted range, and restricting the range also tends to weaken the observed correlation between the variables. To remove the restriction we take the log of the odds, which has a range of (-∞, +∞).
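
To make the ranges concrete, here is a small sketch that computes the odds and the log of the odds for a few probabilities. The probability values are arbitrary examples chosen for illustration.

```python
import numpy as np

# Probabilities strictly between 0 and 1
p = np.array([0.01, 0.25, 0.5, 0.75, 0.99])

odds = p / (1 - p)        # always positive, unbounded above
log_odds = np.log(odds)   # can be any real number, symmetric around p = 0.5

for prob, o, lo in zip(p, odds, log_odds):
    print(f"P = {prob:.2f}  odds = {o:8.4f}  log-odds = {lo:8.4f}")
```

The odds stay positive however small P becomes, while the log of the odds is negative below P = 0.5, zero at P = 0.5, and positive above it.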

![logistic_img2.png](attachment:logistic_img2.png)

If you understood what I did here then you have done 80% of the maths. Now we just want a function of P, because we want to predict a probability, not the log of odds. To get it, we exponentiate both sides and then solve for P.

![logistic_img3.png](attachment:logistic_img3.png)


![logistic_img4.png](attachment:logistic_img4.png)

![logistic_img5.png](attachment:logistic_img5.png)
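
For reference, the algebra described above can be written out as follows, using the same symbols P, β<sub>0</sub> and β<sub>1</sub> as before. This is the standard derivation and may differ cosmetically from the images.

```latex
\begin{aligned}
\log\left(\frac{P}{1-P}\right) &= \beta_0 + \beta_1 x \\
\frac{P}{1-P} &= e^{\beta_0 + \beta_1 x} \\
P &= (1 - P)\, e^{\beta_0 + \beta_1 x} \\
P \left(1 + e^{\beta_0 + \beta_1 x}\right) &= e^{\beta_0 + \beta_1 x} \\
P &= \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}
\end{aligned}
```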

Now we have our logistic function, also called a sigmoid function. The graph of a sigmoid function is as shown below. It squeezes a straight line into an S-curve.

![logistic_img6.png](attachment:logistic_img6.png)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('./dataset/Social_Network_Ads.csv')

In [3]:
X = df.iloc[:,:-1].values
Y = df.iloc[:, -1].values

In [4]:
print(Y)

[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0
 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 0 0 0 1 0 0 0 1 0 1
 1 1 0 0 1 1 0 1 1 0 1 1 0 1 0 0 0 1 1 0 1 1 0 1 0 1 0 1 0 0 1 1 0 1 0 0 1
 1 0 1 1 0 1 1 0 0 1 0 0 1 1 1 1 1 0 1 1 1 1 0 1 1 0 1 0 1 0 1 1 1 1 0 0 0
 1 1 0 1 1 1 1 1 0 0 0 1 1 0 0 1 0 1 0 1 1 0 1 0 1 1 0 1 1 0 0 0 1 1 0 1 0
 0 1 0 1 0 0 1 1 0 0 1 1 0 1 1 0 0 1 0 1 0 1 1 1 0 1 0 1 1 1 0 1 1 1 1 0 1
 1 1 0 1 0 1 0 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 0 1]


In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, random_state= 0)

In [6]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [None]:
# Add a bias term to x
# X = np.c_[np.ones((X.shape[0], 1)), X]
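
If an intercept is wanted, one possible approach (a sketch, not what this notebook runs) is to append a column of ones to the already scaled train and test matrices rather than to the raw `X`. Adding it before scaling would let `StandardScaler` turn the constant column into zeros. The helper name `add_bias_column` and the toy arrays are assumptions for illustration.

```python
import numpy as np

def add_bias_column(features):
    """Return a copy of `features` with a leading column of ones."""
    ones = np.ones((features.shape[0], 1))
    return np.c_[ones, features]

# Toy stand-ins for the scaled X_train / X_test from the cells above
X_train_demo = np.array([[0.5, -1.2], [-0.3, 0.8], [1.1, 0.4]])
X_test_demo = np.array([[0.0, 0.2]])

X_train_b = add_bias_column(X_train_demo)
X_test_b = add_bias_column(X_test_demo)

# theta now needs one extra entry for the intercept
theta_demo = np.zeros(X_train_b.shape[1])
print(X_train_b.shape, X_test_b.shape, theta_demo.shape)
```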

In [7]:
# Initialize weights (no separate bias term, since the bias column above is commented out)
theta = np.zeros(X.shape[1])

# Sigmoid function

In a linear regression model, the hypothesis function is a linear combination of the parameters, given as y = ax + b for data with a single feature. This allows us to predict continuous values effectively, but in logistic regression the response variable is binary, either ‘yes’ or ‘no’. So it makes little sense to use a function whose output can fall outside the range from 0 to 1. A standard way to squeeze the result of a linear equation into the interval (0, 1) is the sigmoid, or logistic, function.

![sigmoid_img.png](attachment:sigmoid_img.png)

As you can see, the sigmoid function intersects the y-axis at 0.5. In most cases, we use this point as the threshold for classification: any value above it is classified as 1, and any value below it as 0. This is not a hard rule, though. We can also use a threshold other than 0.5, depending on the requirements. The sigmoid function:

s(x) = 1/(1 + e<sup>-x</sup>)

We plug the linear equation ax + b in place of x:

s(x) = 1/(1 + e<sup>-(ax + b)</sup>)

In [8]:
def sigmoid(z):
    return 1/(1 + np.exp(-z))
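
A short usage sketch for the function above, repeated here so the snippet runs on its own. The input values and the stricter threshold of 0.8 are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probs = sigmoid(z)
print(probs)  # values rise from near 0 to near 1, with sigmoid(0) = 0.5

# Default threshold of 0.5 versus a stricter, hypothetical threshold of 0.8
labels_default = (probs >= 0.5).astype(int)
labels_strict = (probs >= 0.8).astype(int)
print(labels_default)
print(labels_strict)
```

Raising the threshold makes the model more reluctant to predict the positive class. Note that `np.round`, which this notebook uses later to make predictions, rounds a value of exactly 0.5 to 0, so it differs from `>= 0.5` only at that boundary.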

# Cost Function

A cost function, or loss function, describes how much the predicted value deviates from the actual value. Linear regression uses the mean squared error as its cost function. But when the squared error is combined with the sigmoid function, the resulting cost for logistic regression is non-convex, so gradient descent is more likely to get stuck in a local minimum. So instead, we use log loss as the cost function.

![logistic_img7.png](attachment:logistic_img7.png)

The cost function for logistic regression is given by the following formula.

![logistic_img8.png](attachment:logistic_img8.png)

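Where h(x) is the sigmoid function we used earlier, applied to the linear combination of the inputs.

Python code:

        def cost(X, y, theta):
            # Linear combination of inputs and weights
            z = np.dot(X, theta)
            # Contribution of the positive (y = 1) examples
            cost0 = y.T.dot(np.log(sigmoid(z)))
            # Contribution of the negative (y = 0) examples
            cost1 = (1 - y).T.dot(np.log(1 - sigmoid(z)))
            # Average negative log-likelihood
            return -(cost1 + cost0) / len(y)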

In [9]:
def cost_function(X, Y, theta):
    m = len(Y)                     # number of training examples
    h = sigmoid(np.dot(X, theta))  # predicted probabilities
    cost = -1 / m * np.sum(Y * np.log(h) + (1 - Y) * np.log(1 - h))
    return cost
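
To see how log loss behaves, the sketch below evaluates it for three sets of made-up predicted probabilities. The helper `log_loss` and its clipping constant `eps` are assumptions added here to avoid taking the log of 0; they are not part of the notebook's `cost_function`.

```python
import numpy as np

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean log loss; probabilities are clipped to avoid log(0)."""
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1, 0, 1, 0])

confident_right = np.array([0.95, 0.05, 0.90, 0.10])
unsure = np.array([0.50, 0.50, 0.50, 0.50])
confident_wrong = np.array([0.05, 0.95, 0.10, 0.90])

loss_right = log_loss(y_true, confident_right)
loss_unsure = log_loss(y_true, unsure)
loss_wrong = log_loss(y_true, confident_wrong)
print(loss_right, loss_unsure, loss_wrong)
```

Confident correct predictions give a small loss, predicting 0.5 everywhere gives log(2), and confident wrong predictions are penalized most heavily.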

# Gradient Descent

Gradient descent adjusts the weights step by step so as to reduce the loss, aiming to find the weights that minimize the loss function of our model. It is an iterative method that looks for the minimum of a function by computing the slope at the current point and then moving in the opposite direction.

Gradient descent starts from an initial value of the parameters (in this notebook, all zeros; a random value is another common choice). Now we need a rule that tells us whether at the next iteration we should move left or right to reach the minimum point. The algorithm computes the slope (gradient) of the loss function at the current point and, in the next iteration, moves in the opposite direction, with the step size controlled by the learning rate alpha.

In [10]:
# Gradient descent
def gradient_descent(X, Y, theta, alpha, iterations):
    m = len(Y)
    costs = []

    for _ in range(iterations):
        h = sigmoid(np.dot(X, theta))
        gradient = np.dot(X.T, (h - Y)) / m
        theta -= alpha * gradient
        costs.append(cost_function(X, Y, theta))

    return theta, costs
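
As a sanity check on the gradient expression used above, the sketch below compares it with a numerical gradient obtained by central finite differences. The synthetic data, the starting `theta_demo`, and the helper names `analytic_gradient` and `numerical_gradient` are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost_function(X, Y, theta):
    h = sigmoid(np.dot(X, theta))
    return -np.mean(Y * np.log(h) + (1 - Y) * np.log(1 - h))

def analytic_gradient(X, Y, theta):
    h = sigmoid(np.dot(X, theta))
    return np.dot(X.T, (h - Y)) / len(Y)

def numerical_gradient(X, Y, theta, step=1e-6):
    """Central finite differences, one parameter at a time."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        bump = np.zeros_like(theta)
        bump[j] = step
        grad[j] = (cost_function(X, Y, theta + bump)
                   - cost_function(X, Y, theta - bump)) / (2 * step)
    return grad

# Small synthetic data set (made up for this check)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 2))
Y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(float)
theta_demo = np.array([0.3, -0.2])

max_diff = np.max(np.abs(analytic_gradient(X_demo, Y_demo, theta_demo)
                         - numerical_gradient(X_demo, Y_demo, theta_demo)))
print(max_diff)
```

If the two gradients agree to within a very small difference, the update rule is moving along the true slope of the log loss.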

In [11]:
# Set hyperparameters
alpha = 0.01
iterations = 1000

In [12]:
# Perform gradient descent
theta, costs = gradient_descent(X_train, Y_train, theta, alpha, iterations)

In [None]:
# Make predictions
predictions = np.round(sigmoid(np.dot(X_test, theta)))

# Confusion Matrix

A confusion matrix is a table that compares the actual values with the predicted values. It is used to measure the performance of a machine learning classification model.

![logistic_img9.png](attachment:logistic_img9.png)


## Elements of Confusion Matrix

It represents the different combinations of Actual VS Predicted values. Let’s define them one by one.

TP: True Positive: The values which were actually positive and were predicted positive.

FP: False Positive: The values which were actually negative but falsely predicted as positive. Also known as Type I Error.

FN: False Negative: The values which were actually positive but falsely predicted as negative. Also known as Type II Error.

TN: True Negative: The values which were actually negative and were predicted negative.
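
The four counts can be computed directly from arrays of actual and predicted labels. The labels below are made up for illustration, with 1 as the positive class.

```python
import numpy as np

# Made-up labels for illustration (1 = positive, 0 = negative)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # Type I error
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # Type II error
tn = int(np.sum((y_true == 0) & (y_pred == 0)))

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
```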

# Accuracy

It is calculated by dividing the total number of correct predictions by all the predictions.

![logistic_img10.png](attachment:logistic_img10.png)

# Precision

Precision measures how many of the outcomes predicted as positive are actually positive.

![logistic_img11.png](attachment:logistic_img11.png)

# Recall / Sensitivity

Recall measures how many of the actual positive outcomes were correctly predicted as positive.

![logistic_img14.png](attachment:logistic_img14.png)

# F beta score

The F beta score is the weighted harmonic mean of Precision and Recall, and it captures the contribution of both of them. The contribution of each depends on the beta value in the formula below.

![logistic_img12.png](attachment:logistic_img12.png)

The default beta value is 1, which gives us the formula of the F1 score, where Precision and Recall contribute equally. The higher the F1 score, the better the model.

![logistic_img13.png](attachment:logistic_img13.png)

A beta value < 1 gives more weight to Precision than to Recall, and a beta value > 1 gives more weight to Recall.
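
The metrics above can be worked through with the counts from the confusion matrix printed at the end of this notebook, reading its rows as actual classes and its columns as predicted classes. The helper `f_beta` is a sketch of the standard weighted harmonic mean and is an assumption, not code from the notebook.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Counts read off the confusion matrix printed at the end of this notebook
# (rows = actual class, columns = predicted class)
tn, fp, fn, tp = 54, 14, 2, 30

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

f1 = f_beta(precision, recall)
f_half = f_beta(precision, recall, beta=0.5)  # leans towards precision
f2 = f_beta(precision, recall, beta=2.0)      # leans towards recall

print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
print(f"F0.5 = {f_half:.4f}  F1 = {f1:.4f}  F2 = {f2:.4f}")
```

Because recall is higher than precision for this model, the F2 score comes out above F1 and the F0.5 score below it.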

In [14]:
# Calculate accuracy
accuracy = np.mean(predictions == Y_test)
print("Accuracy:", accuracy)

# Confusion matrix
confusion_matrix = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        confusion_matrix[i, j] = np.sum((Y_test == i) & (predictions == j))

print("Confusion Matrix:")
print(confusion_matrix)

Accuracy: 0.84
Confusion Matrix:
[[54. 14.]
 [ 2. 30.]]


In [15]:
# Print the learned parameters
print("Learned Parameters:")
print(theta)

Learned Parameters:
[1.20205057 0.66194762]
