# 1. Logistic Regression

Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary). Like all regression analyses, logistic regression is a predictive analysis. Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.

Dichotomous simply means a variable that has only 2 possible outcomes, for example: a person will survive this accident or not, the student will pass this exam or not. The outcome can either be yes or no. This regression technique is similar to linear regression and can be used to predict probabilities for classification problems.

Here are a few things to know about logistic regression:

- Logistic regression is a Machine Learning method used for classification tasks.
- It is a predictive analytic technique based on the probability idea.
- The dependent variable in logistic regression is binary (coded as 1 or 0).
- The goal is to discover a link between characteristics and the likelihood of a specific outcome.
- Instead of a plain linear function, logistic regression passes its linear score through the “Sigmoid function”, also called the “logistic function”.
- The logistic regression hypothesis limits the output to a value between 0 and 1, which is why a linear function on its own is unsuitable for this task.
- Logistic regression is used in many fields, such as finance, marketing, healthcare, and social sciences, to model and predict binary outcomes.

![image.png](attachment:image.png)

Logistic Regression is also considered a regression model: it builds a regression model to predict the probability that a given data entry belongs to the category labeled “1.” Logistic regression models the data using the sigmoid function, much as linear regression assumes that the data follows a linear relationship.

# 2. Why the Name Logistic Regression?

It’s called ‘Logistic Regression’ because the technique behind it is quite similar to Linear Regression. The name “Logistic” comes from the logistic function (the inverse of the Logit function), which is used in this classification approach.

# 3. Why do we use Logistic Regression rather than Linear Regression?

From the definition of logistic regression we know that it is used when our dependent variable is binary, whereas in linear regression the dependent variable is continuous.

A further problem is that if we add an outlier to our dataset, the best fit line in linear regression shifts to fit that point.

Suppose we use linear regression to find the best fit line, which aims at minimizing the distance between the predicted value and the actual value. The line will look like this:

![image.png](attachment:image.png)

Here the threshold value is 0.5, which means if the value of h(x) is greater than 0.5 then we predict malignant tumor (1) and if it is less than 0.5 then we predict benign tumor (0). Everything seems okay here, but now let’s change it a bit: we add some outliers to our dataset, and the best fit line shifts toward those points. The line will now look somewhat like this:

![image-2.png](attachment:image-2.png)

Do you see any problem here? The blue line represents the old threshold and the yellow line represents the new threshold, which is maybe 0.2 here. To keep our predictions right we had to lower our threshold value. Hence we can say that linear regression is sensitive to outliers: only with the threshold lowered to about 0.2 does this regression give correct outputs.

Another problem with linear regression is that the predicted values may be out of range. We know that probability can be between 0 and 1, but if we use linear regression this probability may exceed 1 or go below 0.

To overcome these problems we use Logistic Regression, which converts this straight best fit line in linear regression to an S-curve using the sigmoid function, which will always give values between 0 and 1.
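Both problems are easy to reproduce numerically. The sketch below uses made-up tumor sizes and labels (purely illustrative, not from the dataset used later) and fits a straight line with NumPy's `polyfit`. Passing the same linear score through a sigmoid is not a fitted logistic regression; it only shows how the squashing keeps every output inside (0, 1).

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 1-D data: tumor size and label (1 = malignant, 0 = benign),
# with one extreme outlier at size 20.
sizes = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 20.0])
labels = np.array([0, 0, 0, 1, 1, 1, 1])

# Ordinary least-squares line: label ~ slope * size + intercept.
slope, intercept = np.polyfit(sizes, labels, deg=1)
linear_pred = slope * sizes + intercept

# Squashing the same linear score keeps every value strictly between 0 and 1.
squashed = sigmoid(linear_pred)

print("linear :", np.round(linear_pred, 2))
print("sigmoid:", np.round(squashed, 2))
```

The linear prediction for the outlier lands above 1, which cannot be read as a probability, while the squashed values always can.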

**Note:** Refer to this [playlist](https://www.youtube.com/playlist?list=PLE-8p-CwnFPsg-iYUrsQgBJ-li9pIVv0A) to know more.

# 4. Logistic Function

You must be wondering how logistic regression squeezes the output of linear regression between 0 and 1.

Well, there’s a little bit of math included behind this and it is pretty interesting trust me.

Let’s start by mentioning the formula of logistic function:

![image.png](attachment:image.png)

How similar is it to linear regression?

We all know the equation of the best fit line in linear regression is:

![image-2.png](attachment:image-2.png)

Let’s say instead of y we are taking probabilities (P). But there is an issue here: the value of (P) can exceed 1 or go below 0, and we know that the range of a probability is [0, 1]. To overcome this issue we take the “odds” of P:

![image-3.png](attachment:image-3.png)

Do you think we are done here? No, we are not. Odds are nothing but the ratio of the probability of success to the probability of failure. Odds are never negative, which means their range is [0, +∞). Now the question comes: out of so many other options to transform this, why did we only take ‘odds’? Because odds are probably the easiest way to do this, that’s it.

The problem here is that the range is still restricted, and it is difficult to model a variable that has a restricted range with a straight line, which can take any value. To remove the restriction we take the log of odds, which has a range of (-∞, +∞).

![image-4.png](attachment:image-4.png)

If you understood what I did here then you have done 80% of the maths. Now we just want a function of P, because we want to predict the probability, right? Not the log of odds. To do so we exponentiate both sides and then solve for P.

![image-5.png](attachment:image-5.png)
![image-6.png](attachment:image-6.png)
![image-7.png](attachment:image-7.png)
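Written out as text, the algebra goes roughly as follows. Here $z$ is shorthand (an assumption of this sketch) for the linear part, that is, the right-hand side of the best fit line equation.

$$
\begin{aligned}
\log\frac{P}{1-P} &= z \\
\frac{P}{1-P} &= e^{z} \\
P &= e^{z}\,(1-P) \\
P\,(1+e^{z}) &= e^{z} \\
P &= \frac{e^{z}}{1+e^{z}} = \frac{1}{1+e^{-z}}
\end{aligned}
$$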

Now we have our logistic function, also called a sigmoid function. The graph of a sigmoid function is as shown below. It squeezes a straight line into an S-curve.

![image-8.png](attachment:image-8.png)
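As a quick sanity check, here is a minimal sketch of the two functions in NumPy (the names `sigmoid` and `logit` are our own). It shows that the log of odds and the sigmoid undo each other, and that the sigmoid maps 0 to exactly 0.5.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: turns a log-odds value into a probability."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log of odds: turns a probability into a value on the whole real line."""
    return np.log(p / (1.0 - p))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
p = sigmoid(z)

print(p)          # increasing values between 0 and 1
print(logit(p))   # recovers z up to floating-point error
```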

# 5. Types of Logistic Regression

In general, it can be classified into:

- **Binary Logistic Regression:** two or binary outcomes like yes or no
- **Multinomial Logistic Regression:** three or more outcomes like first, second, and third class or no class degree
- **Ordinal Logistic Regression:** three or more like multinomial logistic regression but here with the order like customer rating in the supermarket from 1 to 5

# 6. Requirements for Logistic Regression

This model will run on almost any dataset, but if you need good performance, there are some assumptions to consider:

- The dependent variable in binary logistic regression must be binary.
- Only the variables that are relevant should be included.
- The independent variables should be independent of one another. That is, there should be minimal or no multicollinearity in the model.
- The log odds are linearly related to the independent variables.
- Logistic regression typically requires a large sample size.

# 7. Decision Boundary – Logistic Regression

A threshold can be set to decide which class a data point belongs to. The estimated probability is turned into a class based on this threshold.

For example, with class 1 meaning “pass”: if the predicted probability is at least 0.5, categorize the particular student as a pass; otherwise, label them as a fail.

There are two types of decision boundaries: linear and non-linear. To provide a complicated decision boundary, the polynomial order can be raised.
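Turning probabilities into classes is a one-liner. The helper below is a small sketch (the name `classify` is ours) that labels a probability as 1 when it is at or above the threshold and 0 otherwise.

```python
import numpy as np

def classify(probabilities, threshold=0.5):
    """Return 1 where the probability is at or above the threshold, else 0."""
    return (np.asarray(probabilities) >= threshold).astype(int)

probs = np.array([0.10, 0.45, 0.50, 0.80])

print(classify(probs))        # default threshold of 0.5
print(classify(probs, 0.7))   # a stricter threshold labels fewer points as 1
```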

# 8. Cost Function in Logistic Regression

In linear regression, we use the mean squared error, which is the average of the squared differences between y_predicted and y_actual and is derived from the maximum likelihood estimator. The graph of the cost function in linear regression is like this:

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)

In logistic regression, $\hat{Y}$ is a non-linear function, $\hat{Y} = \frac{1}{1 + e^{-z}}$. If we use this in the above MSE equation then it will give a non-convex graph with many local minima as shown:

![image-3.png](attachment:image-3.png)

The problem here is that gradient descent can get stuck in one of these local minima, which is a big problem because then we’ll miss out on our global minimum and our error will be higher.

In order to solve this problem, we derive a different cost function for logistic regression called log loss which is also derived from the maximum likelihood estimation method.

![image-4.png](attachment:image-4.png)

In the next section, we’ll talk a little bit about the maximum likelihood estimator and what it is used for. We’ll also try to see the math behind this log loss function.

# 9. What is the use of Maximum Likelihood Estimator?

The main aim of MLE is to find the value of our parameters for which the likelihood function is maximized. The likelihood function is nothing but the joint pdf of our sample observations, and for independent observations the joint distribution is the product of the conditional probabilities of observing each example given the distribution parameters. In other words, we try to find θ such that plugging these estimates into the model for P(x) yields a number close to 1 for people who had a malignant tumor and close to 0 for people who had a benign tumor.

Let’s start by defining our likelihood function. We now know that the labels are binary which means they can be either yes/no or pass/fail etc. We can also say we have two outcomes success and failure. This means we can interpret each label as Bernoulli random variable.

**Note:** A random experiment whose outcomes are of two types, success S and failure F, occurring with probabilities p and q respectively is called a Bernoulli trial. If for this experiment a random variable X is defined such that it takes value 1 when S occurs and 0 if F occurs, then X follows a Bernoulli Distribution.

![image.png](attachment:image.png)

Where P is our sigmoid function

![image-2.png](attachment:image-2.png)

where σ(θ^T*x^i) is the sigmoid function. Now for n observations,

![image-3.png](attachment:image-3.png)

We need a value for theta which will maximize this likelihood function. To make our calculations easier we take the log of both sides. The function we get is called the log-likelihood function, or the sum of the log conditional probabilities.

![image-4.png](attachment:image-4.png)

In machine learning, it is conventional to minimize a loss (error) function via gradient descent, rather than maximize an objective function via gradient ascent. If we maximized the function above we would have to deal with gradient ascent; to avoid this we take the negative of the log-likelihood so that we can use gradient descent. We’ll talk more about gradient descent in a later section and then you’ll have more clarity. Also, remember that maximizing log(x) is the same as minimizing -log(x):

max[log(x)] = -min[-log(x)]

The negative of this function is our cost function, and what do we want from our cost function? That it should be as small as possible. This quantity is known as the negative log-likelihood (NLL). So in logistic regression, our cost function is:

![image-5.png](attachment:image-5.png)

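Here y represents the actual class and σ(θ^T*x^i) is the predicted probability that the example belongs to class 1; the cost takes the log of the probability assigned to the actual class.

- p(y) is the predicted probability of 1, that is σ(θ^T*x^i).

- 1-p(y) is the predicted probability of 0.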
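For reference, the chain from likelihood to cost can be written in text form as follows, using the same symbols as above. This is a sketch: many references also divide by the number of observations $n$ to get an average, which does not change where the minimum is.

$$
\begin{aligned}
L(\theta) &= \prod_{i=1}^{n} \sigma(\theta^T x^i)^{\,y^i}\,\bigl(1-\sigma(\theta^T x^i)\bigr)^{\,1-y^i} \\
\log L(\theta) &= \sum_{i=1}^{n} \Bigl[\, y^i \log \sigma(\theta^T x^i) + (1-y^i)\,\log\bigl(1-\sigma(\theta^T x^i)\bigr) \Bigr] \\
J(\theta) &= -\log L(\theta)
\end{aligned}
$$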

Let’s see what will be the graph of cost function when y=1 and y=0

![image-6.png](attachment:image-6.png)

If we combine both graphs, we get a convex graph with a single minimum, and now it’ll be easy to use gradient descent here.

![image-7.png](attachment:image-7.png)

The red line here represents class 1 (y=1), for which the right term of the cost function vanishes. If the predicted probability is close to 1 then our loss is small, and as the probability approaches 0, our loss approaches infinity.

The black line represents class 0 (y=0), for which the left term of the cost function vanishes. If the predicted probability is close to 0 then our loss is small, but as the probability approaches 1, our loss approaches infinity.

![image-8.png](attachment:image-8.png)

This cost function is also called log loss. It also ensures that as the probability of the correct answer is maximized, the probability of the incorrect answer is minimized. In general, the lower the value of this cost function, the better the model’s predictions.
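To make the behaviour concrete, here is a minimal NumPy sketch of the average log loss. The function name and the clipping constant `eps` (used only to avoid taking log(0)) are our own choices, and the labels and probabilities are made up.

```python
import numpy as np

def log_loss(y_true, p_pred, eps=1e-15):
    """Average negative log-likelihood of binary labels under predicted probabilities."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

y_true = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.8, 0.2]   # high probability on the true class
confident_wrong = [0.1, 0.9, 0.2, 0.8]   # high probability on the wrong class

print(log_loss(y_true, confident_right))
print(log_loss(y_true, confident_wrong))
```

Confident and correct predictions give a small loss, while confident and wrong predictions are punished heavily.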

**Note:** In short, for linear Regression, the Cost function is:

![image-9.png](attachment:image-9.png)

But for Logistic Regression,

![image-10.png](attachment:image-10.png)

It will result in a non-convex cost function as shown above. So, for Logistic Regression the cost function we use is also known as the cross entropy or the log loss.

![image-11.png](attachment:image-11.png)

**Case 1:** If y = 1, that is the true label of the class is 1. Cost = 0 if the predicted value of the label is 1 as well. But as hθ(x) deviates from 1 and approaches 0, the cost function increases steeply and tends to infinity, which can be seen in the graph below as well.

![image-12.png](attachment:image-12.png)

**Case 2:** If y = 0, that is the true label of the class is 0. Cost = 0 if the predicted value of the label is 0 as well. But as hθ(x) deviates from 0 and approaches 1, the cost function increases steeply and tends to infinity, which can be seen in the graph below as well.

![image-13.png](attachment:image-13.png)

With the modification of the cost function, we have achieved a loss function that penalizes the model more and more as the predicted value of the label deviates more and more from the actual label.

# 10. Gradient Descent Optimization

In this section, we will try to understand how we can utilize Gradient Descent to compute the minimum cost.

Gradient descent changes the value of our weights step by step so that they move toward the minimum point; in other words, it aims at finding the optimal weights which minimize the loss function of our model. It is an iterative method that finds the minimum of a function by figuring out the slope at a starting point (often chosen at random) and then moving in the opposite direction.

![image.png](attachment:image.png)

**Note:** The intuition is that if you are hiking in a canyon and trying to descend most quickly down to the river at the bottom, you might look around yourself 360 degrees, find the direction where the ground is sloping the steepest, and walk downhill in that direction.

At first gradient descent takes a random value for the parameters of our function. Now we need an algorithm that will tell us whether at the next iteration we should move left or right to reach the minimum point. The gradient descent algorithm finds the slope of the loss function at that particular point and then, in the next iteration, moves in the opposite direction to reach the minimum. Since we have a convex graph now we don’t need to worry about local minima: on a convex curve, every local minimum is the global minimum.

We can summarize the gradient descent algorithm as:

![image-2.png](attachment:image-2.png)

Here alpha is known as the learning rate. It determines the step size at each iteration while moving towards the minimum point. Usually, a lower value of “alpha” is preferred, because if the learning rate is a big number then we may miss the minimum point and keep on oscillating in the convex curve.

![image-3.png](attachment:image-3.png)
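The effect of the learning rate is easy to see on a toy problem. The sketch below is an illustration only: it runs gradient descent on the convex function J(w) = w² (not the logistic cost), once with a small alpha and once with an alpha that is too large.

```python
def gradient_descent(grad, start, alpha, steps):
    """Repeatedly step against the gradient and return the final point."""
    w = start
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w

# Toy convex cost J(w) = w**2 with gradient 2*w; its minimum is at w = 0.
grad = lambda w: 2.0 * w

small_step = gradient_descent(grad, start=5.0, alpha=0.1, steps=50)
large_step = gradient_descent(grad, start=5.0, alpha=1.1, steps=50)

print(small_step)   # very close to the minimum at 0
print(large_step)   # overshoots on every step and moves further away
```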

Now the question is: what is this derivative of the cost function? How do we compute it? Don’t worry, in the next section we’ll see how to differentiate this cost function w.r.t. our parameters.

# 11. Derivation of Cost Function

Before we differentiate our cost function we’ll first find the derivative of our sigmoid function, because it will be used when differentiating the cost function.

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)
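The standard result for the sigmoid is σ'(z) = σ(z)(1 - σ(z)). As a hedged sanity check, the snippet below compares that closed form against a numerical (central difference) derivative; the step size `h` is an arbitrary small number chosen for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Closed-form derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-4.0, 4.0, 9)
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)   # central difference

print(np.max(np.abs(numeric - sigmoid_grad(z))))          # tiny difference
```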

Now, we will differentiate the cost function with the help of the chain rule, as it allows us to calculate complex partial derivatives by breaking them down.

**Step-1:** Use chain rule and break the partial derivative of log-likelihood.

![image-3.png](attachment:image-3.png)

**Step-2:** Find derivative of log-likelihood w.r.t p

![image-4.png](attachment:image-4.png)

**Step-3:** Find derivative of ‘p’ w.r.t ‘z’

![image-5.png](attachment:image-5.png)

**Step-4:** Find derivative of z w.r.t θ

![image-6.png](attachment:image-6.png)

**Step-5:** Put all the derivatives in equation 1

![image-7.png](attachment:image-7.png)
![image-8.png](attachment:image-8.png)

Now since we have our derivative of the cost function, we can write our gradient descent algorithm as:

![image-9.png](attachment:image-9.png)

If the slope is negative (downward slope) then gradient descent will add some value to the parameter, directing it towards the minimum point of the convex curve. Whereas if the slope is positive (upward slope), gradient descent will subtract some value to direct it towards the minimum point.
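Putting the pieces together, here is a minimal sketch of batch gradient descent for logistic regression in NumPy. It assumes the usual gradient of the average log loss, X^T(p - y)/n, and uses a tiny made-up dataset; the function name, learning rate and iteration count are our own choices, not a tuned recipe.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.5, n_iters=5000):
    """Batch gradient descent on the average log loss; returns theta (bias first)."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a bias column
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        p = sigmoid(Xb @ theta)                     # predicted probabilities
        gradient = Xb.T @ (p - y) / len(y)          # derivative of the cost w.r.t. theta
        theta -= alpha * gradient                   # step against the gradient
    return theta

# Hypothetical toy data: one feature, with the two classes overlapping slightly.
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

theta = fit_logistic(X, y)
probs = sigmoid(np.hstack([np.ones((len(X), 1)), X]) @ theta)

print("theta:", theta)
print("predicted classes:", (probs >= 0.5).astype(int))
```

Because the two classes overlap, no straight boundary classifies every point correctly, so a couple of points near the boundary end up on the wrong side.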

# 12. Regularization

Let’s also quickly discuss regularization, which keeps the model from fitting the training data too closely. L1 (Lasso) and L2 (Ridge) are the two most common regularization types. Instead of simply minimizing the aforementioned cost function, regularization adds a penalty on the size of the coefficients in order to avoid overfitting. L1 and L2 penalize coefficients differently: L1 can perform feature selection by setting the coefficients of less relevant features to exactly 0, which also helps with multicollinearity, whereas L2 shrinks extremely large coefficients but does not set any to 0. There’s also a parameter, λ, that controls the weight of the penalty; it has to be chosen with care, because penalizing the coefficients too harshly results in underfitting.

It’s a fascinating topic to investigate why L1 and L2 have different capacities owing to the ‘squared’ and ‘absolute’ values, and how λ affects the weight of regularized and original fit terms. We won’t go into everything here, but it’s well worth your time and effort to learn about. The steps below demonstrate how to convert an original cost function to a regularized cost function.

![image.png](attachment:image.png)
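In code, the regularized cost is just the log loss plus a penalty term. The sketch below is one possible convention, not necessarily the exact scaling shown in the image: it assumes X already has a leading column of ones and leaves the bias theta[0] unpenalized. All names and numbers are illustrative.

```python
import numpy as np

def regularized_log_loss(theta, X, y, lam=0.0, penalty="l2"):
    """Average log loss plus an L1 or L2 penalty on the non-bias weights."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    p = np.clip(p, 1e-15, 1.0 - 1e-15)              # avoid log(0)
    data_term = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    weights = theta[1:]                             # skip the bias term
    if penalty == "l1":
        reg_term = lam * np.sum(np.abs(weights))    # sum of absolute values
    else:
        reg_term = lam * np.sum(weights ** 2)       # sum of squares
    return float(data_term + reg_term)

# Hypothetical example: two rows, a bias column plus two features.
X_demo = np.array([[1.0, 0.5, -1.0], [1.0, -0.3, 0.8]])
y_demo = np.array([1, 0])
theta_demo = np.array([0.0, 2.0, -3.0])

base = regularized_log_loss(theta_demo, X_demo, y_demo, lam=0.0)
print(regularized_log_loss(theta_demo, X_demo, y_demo, lam=0.1, penalty="l2") - base)
print(regularized_log_loss(theta_demo, X_demo, y_demo, lam=0.1, penalty="l1") - base)
```

The two printed differences are the penalties themselves: λ times the sum of squared weights for L2 and λ times the sum of absolute weights for L1.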

# 13. How Does Logistic Regression Link to Neural Networks?

We all know that Neural Networks are the foundation for Deep Learning. The best part is that Logistic Regression is intimately linked to Neural networks. Each neuron in the network may be thought of as a Logistic Regression; it contains inputs, weights, and a bias, and you take a dot product of the inputs and weights, add the bias, and then apply a non-linear function. Furthermore, a neural network’s last layer is a basic linear model (most of the time). That can be understood by visualization as shown below:

![image-2.png](attachment:image-2.png)

Take a deeper look at the “output layer,” and you’ll notice that it’s a basic linear (or logistic) regression: we have the input (hidden layer 2), the weights, a dot product, and finally a non-linear function that depends on the task. A helpful approach to thinking about neural networks is to divide them into two parts: representation and classification/regression. The first part (on the left) aims to develop a decent data representation that helps the second part (on the right) do a linear classification/regression.

![image-3.png](attachment:image-3.png)
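The same idea in a few lines of NumPy: a hidden layer builds a representation, and the output is a logistic regression applied to that representation. This is only a forward pass with random, untrained weights; the layer sizes and variable names are arbitrary choices for the sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)                 # one input example with 3 features

# Representation part: a hidden layer of 4 sigmoid neurons (hypothetical sizes).
W_hidden = rng.normal(size=(4, 3))
b_hidden = np.zeros(4)
hidden = sigmoid(W_hidden @ x + b_hidden)

# Classification part: logistic regression applied to the hidden features.
w_out = rng.normal(size=4)
b_out = 0.0
prob = sigmoid(w_out @ hidden + b_out)

print(hidden)   # each entry is the output of one "small logistic regression"
print(prob)     # a single probability between 0 and 1
```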

# 14. Hyperparameter Fine-tuning – Logistic Regression

There are no essential hyperparameters to adjust in logistic regression. Even though it has many parameters, the following three parameters might be helpful in fine-tuning for some better results.

**Regularization (penalty) might be beneficial at times.**

Penalty – {‘l1’, ‘l2’, ‘elasticnet’, ‘none’}, default=’l2’

**The penalty strength is controlled by the C parameter, which might be useful. C is the inverse of the regularization strength, so smaller values mean stronger regularization.**

**C – float, default=1.0**

With different solvers, you might sometimes observe useful variations in performance or convergence.

**Solver – {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}, default=’lbfgs’**

**Note:** Which solver you can use depends on the penalty. Penalties supported by each solver:

- ‘newton-cg’ – [‘l2’, ‘none’]
- ‘lbfgs’ – [‘l2’, ‘none’]
- ‘liblinear’ – [‘l1’, ‘l2’]
- ‘sag’ – [‘l2’, ‘none’]
- ‘saga’ – [‘elasticnet’, ‘l1’, ‘l2’, ‘none’]
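As a small illustration of pairing a penalty with a solver that supports it, the sketch below fits an L1 model with `liblinear` and an L2 model with `lbfgs` on synthetic data from `make_classification`. The data and the value C=0.1 are arbitrary choices for the example, and how the `penalty` argument is spelled may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, only a few of them informative.
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=0)

# 'liblinear' supports the 'l1' penalty; 'lbfgs' supports 'l2'.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
l2_model = LogisticRegression(penalty="l2", solver="lbfgs", C=0.1).fit(X, y)

l1_zeros = int(np.sum(l1_model.coef_ == 0))
l2_zeros = int(np.sum(l2_model.coef_ == 0))

print("coefficients set to 0 by L1:", l1_zeros)
print("coefficients set to 0 by L2:", l2_zeros)
```

L1 tends to drive some coefficients to exactly 0, whereas L2 only shrinks them.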

# 15. Python Implementation

Dataset: https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data?resource=downloadlink

Whenever we start writing a program, the first step is always to import the libraries. The same cell also loads the data, here from a local file (a URL works too):

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
dataset = pd.read_csv('data.csv')
print(dataset.head())

Before getting into modeling, we look at the summary statistics to get a better understanding of the data:

In [None]:
dataset.info()
print(dataset.shape)
print(dataset.describe().T)
print('Any missing data or NaN in the dataset:', dataset.isnull().values.any())

Understanding the correlation between the features makes it easier to decide which ones to keep for modeling and which ones to remove.

In [None]:
corr_var = dataset.corr(numeric_only=True)  # skip non-numeric columns such as text labels
print(corr_var)
plt.figure(figsize=(10,7.5))
sns.heatmap(corr_var, annot=True, cmap='BuPu')

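We need to separate the dependent and independent features before modeling. The code below assumes the target is the last column of the CSV and every other column is a feature; adjust the slices if your file is laid out differently (the Kaggle file, for instance, keeps the label in a `diagnosis` column).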

In [None]:
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:,-1].values

We need to split the data into training and testing sets (commonly 70:30 or 80:20) so that the model can be evaluated on data it has not seen during training.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split (X, y, test_size=0.2, random_state=0)
print('Total no. of samples: Training and Testing dataset separately!')
print('X_train:', np.shape(X_train))
print('y_train:', np.shape(y_train))
print('X_test:', np.shape(X_test))
print('y_test:', np.shape(y_test))

Because the features have different scales and ranges, we standardize them. The scaler is fitted on the training data only and then applied to the test data (and to any new data).

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

Importing Logistic Regression from scikit learn

In [None]:
from sklearn.linear_model import LogisticRegression
classifier7 = LogisticRegression()
classifier7.fit(X_train,y_train)

Predicting the end result from the test data set

In [None]:
y_pred7 = classifier7.predict(X_test)
print(np.concatenate((y_pred7.reshape(len(y_pred7),1), y_test.reshape(len(y_test),1)),1))

Finally, we need to evaluate it through classification metrics like confusion matrix, accuracy, and roc-auc score.

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score
cm7 = confusion_matrix(y_test, y_pred7)
print(cm7)

Visualizing confusion matrix for a better view.

In [None]:
from mlxtend.plotting import plot_confusion_matrix
fig, ax = plot_confusion_matrix(conf_mat=cm7, figsize=(6, 6), cmap=plt.cm.Greens)
plt.xlabel('Predictions', fontsize=18)
plt.ylabel('Actuals', fontsize=18)
plt.title('Confusion Matrix', fontsize=18)
plt.show()

Accuracy of our model:

In [None]:
logreg=accuracy_score(y_test,y_pred7)
logreg

![image.png](attachment:image.png)

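Then finally, the ROC AUC score; the closer it is to 1, the better the model separates the two classes. It is computed here from the predicted probabilities rather than the hard 0/1 labels, so that it reflects every threshold:

In [None]:
roc_auc_score(y_test, classifier7.predict_proba(X_test)[:, 1])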

![image.png](attachment:image.png)

The overall metrics report for the logistic regression (Precision, Recall, F1 Score) gives a more detailed picture of how well our model predicts each class.

In [None]:
import sklearn.metrics as metrics
print(metrics.classification_report(y_test, y_pred7))

![image.png](attachment:image.png)

Hyperparameter tuning lets us search for better settings automatically; we can also adjust the parameters manually to see how much each one matters.

In [None]:
from sklearn.model_selection import GridSearchCV
# 'liblinear' supports both 'l1' and 'l2'; the default 'lbfgs' solver does not support 'l1'
parameters_lr = [{'penalty': ['l1', 'l2'],
                  'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
                  'solver': ['liblinear']}]
grid_search_lr = GridSearchCV(estimator=classifier7,
                              param_grid=parameters_lr,
                              scoring='accuracy',
                              cv=10,
                              n_jobs=-1)
grid_search_lr.fit(X_train, y_train)
best_accuracy_lr = grid_search_lr.best_score_   # mean cross-validated accuracy of the best setting
best_parameter_lr = grid_search_lr.best_params_
print("Best Accuracy of LR: {:.2f} %".format(best_accuracy_lr * 100))
print("Best Parameter of LR:", best_parameter_lr)

![image.png](attachment:image.png)

# 16. Classification Metrics

Classification is about predicting a label and then identifying which category an object belongs to based on different parameters. 

In order to measure how well our classification model is doing at making these predictions, we use classification metrics. They measure the performance of our machine learning model, giving us confidence that its outputs can be used in decision-making processes.

The performance is normally presented in a range from 0 to 1, where a score of 1 represents perfection. 

## 16.1 Problems with the Threshold 

A classifier like logistic regression outputs a probability between 0 and 1, so what happens when the value is around 0.5? The usual rule is a threshold: if the probability is greater than 0.5, we round it up to 1 (positive); if not, it is 0 (negative).

That sounds okay, but when classification models help decide real-life cases, we need to be as sure as possible that the output has been classified correctly.

For example, logistic regression is used to detect spam emails. If an email is sent to the spam folder simply because its spam probability is above 0.5, this can be risky, as we could potentially divert an important email. The need for highly accurate predictions becomes even more pressing for health-related and financial tasks.

Therefore, the simple rule that values above the threshold become 1 and values below it become 0 can cause challenges.

Although there is the option to adjust the threshold value, the risk of misclassifying remains. For example, a low threshold will catch most of the actual positives, but the predicted positives will also contain more actual negatives; a high threshold does the reverse.

So let’s get into how these classification metrics can help us with measuring the performance of our logistic regression model.

## 16.2 Accuracy
 
We will start off with accuracy because it’s the one that’s typically used the most, especially for beginners. 

Accuracy is defined as the number of correct predictions over the total predictions:

accuracy = correct_predictions / total_predictions

However, we can further expand on this using these:

- True Positive (TP) - you predicted positive and it’s actually positive 
- True Negative (TN) - you predicted negative and it’s actually negative
- False Positive (FP) - you predicted positive and it’s actually negative
- False Negative (FN) - you predicted negative and it’s actually positive 

So we can say the true predictions are TN+TP, while the false prediction is FP+FN. 

The equation can now be redefined as:

![image.png](attachment:image.png)
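Here is a minimal pure-Python sketch of that calculation, using made-up labels and predictions. The helper name `confusion_counts` is ours.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

# Hypothetical labels and predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(tp, tn, fp, fn)   # 3 3 1 1
print(accuracy)         # 0.75
```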

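In order to find the accuracy of your model, call `score` on the fitted classifier (here `classifier7` from the implementation above):

In [None]:
score = classifier7.score(X_test, y_test)
print('Test Accuracy Score', score)

Or you can use `accuracy_score` from scikit-learn’s metrics module, passing the test labels and the predictions for the test set:

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred7)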

However, using the accuracy metric to measure the performance of your model is usually not enough. This is where we need other metrics.

## 16.3 Precision and Recall
 
If we want to check that when the model predicts positive, it is in fact a true positive, we use precision. This is also called the Positive Predictive Value and can be defined as:

![image.png](attachment:image.png)

In [None]:
from sklearn.metrics import precision_score
precision_score(y_test, y_pred7)

If we want to check how many of the actual positives the model manages to find, we use recall. Recall uses the same formula as sensitivity and can be defined as:

![image.png](attachment:image.png)

In [None]:
from sklearn.metrics import recall_score
recall_score(y_test, y_pred7)

Precision and recall are especially useful when there is an imbalance in the observations between the two classes, for example when there are many examples of one class (1) and only a few of the other class (0) in the dataset.

In order to increase the precision of your model, you need fewer FP and do not have to worry about the FN. Whereas, if you want to increase recall, you need fewer FN and do not have to worry about the FP.

Raising the classification threshold typically reduces false positives, which increases precision. It also reduces true positives or keeps them the same, while increasing false negatives or keeping them the same, so recall decreases or stays constant.

Unfortunately, it is usually not possible to push both precision and recall to their maximum at once. Increasing precision tends to reduce recall, and vice versa. This is known as the precision/recall tradeoff.

![image.png](attachment:image.png)
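The tradeoff shows up clearly on a small example. The sketch below uses made-up labels and scores and our own helper `precision_recall`; it computes both metrics at three thresholds.

```python
import numpy as np

def precision_recall(y_true, scores, threshold):
    """Precision and recall for binary labels at a given probability threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(scores) >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) else 1.0   # convention when nothing is predicted positive
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return float(precision), float(recall)

# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.75, 0.9]

for threshold in (0.3, 0.5, 0.8):
    p, r = precision_recall(y_true, scores, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this data, precision rises from 0.57 to 0.75 to 1.00 as the threshold goes up, while recall falls from 1.00 to 0.75 to 0.25.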

## 16.4 ROC Curve
 
When it comes to precision we care about lowering the FP, and for recall we care about lowering the FN. To see how a model trades the two off across thresholds, we can use the Receiver Operating Characteristic curve, or ROC curve.

It plots the false positive rate (x-axis) against the true positive rate (y-axis).

- True Positive Rate = TP / (TP + FN)
- False Positive Rate = FP / (FP + TN)

The true positive rate is also known as sensitivity, and the false positive rate is equal to 1 - specificity.

- Specificity = TN / (TN + FP)

Smaller values on the x-axis indicate lower FP and higher TN. Larger values on the y-axis indicate higher TP and lower FN.

The ROC presents the performance of a classification model at all classification thresholds, like this:

![image.png](attachment:image.png)

Example:

![image-2.png](attachment:image-2.png)
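To see where the points of the curve come from, here is a small NumPy sketch that sweeps the threshold over made-up scores and records one (FPR, TPR) pair per threshold. The helper `roc_points` is our own; scikit-learn’s `roc_curve` returns similar information.

```python
import numpy as np

def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per distinct threshold, strictest first."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    points = [(0.0, 0.0)]                        # threshold above every score
    for t in sorted(set(scores.tolist()), reverse=True):
        y_pred = scores >= t
        tpr = np.sum(y_pred & (y_true == 1)) / n_pos
        fpr = np.sum(y_pred & (y_true == 0)) / n_neg
        points.append((float(fpr), float(tpr)))
    return points

# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.75, 0.9]

points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold can only add predicted positives, so the points move from (0, 0) up and to the right until they reach (1, 1).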

## 16.5 AUC

When it comes to the ROC curve, you may have also heard of Area Under the Curve (AUC). It’s exactly what it says it is - the area under the curve. If you want to know how good your curve is, you calculate the ROC AUC score. AUC measures the performance across all possible classification thresholds.

The more area under the curve you have, the better - the higher the ROC AUC score. The best possible score, 1.0, is reached when the FN and FP are both at zero - or if we refer to the graph above, it’s when the true positive rate is 1 and the false positive rate is 0.

In [None]:
from sklearn.metrics import roc_auc_score

The image below shows logistic regression predictions in ascending order. If the AUC value is 0.0, we can say that the predictions are completely wrong. If the AUC value is 1.0, we can say that the predictions are fully correct. A value of 0.5 is what random guessing would give.

![image.png](attachment:image.png)
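One way to build intuition for the number: the ROC AUC equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes it that way on made-up data (ties count as half); `auc_by_ranking` is our own helper, and scikit-learn’s `roc_auc_score` computes the same quantity from labels and scores.

```python
import numpy as np

def auc_by_ranking(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive is scored higher."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]            # every positive vs every negative
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (len(pos) * len(neg)))

# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.75, 0.9]

print(auc_by_ranking(y_true, scores))                  # 0.75
print(auc_by_ranking([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))   # 1.0, perfect ranking
print(auc_by_ranking([0, 0, 1, 1], [0.9, 0.8, 0.2, 0.1]))   # 0.0, completely reversed
```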

## 16.6 Wrapping it Up

To recap, we have gone over what Logistic Regression is, what Classification Metrics are, the problems with the threshold, and the metrics that help deal with them, such as Accuracy, Precision, Recall, and the ROC Curve.

There are so many more classification metrics out there, such as confusion matrix, F1 score, F2 score, and more. These are all available to help you better understand the performance of your model.

# 17. Advantages of Logistic Regression

- Overfitting is less likely with logistic regression, although it can happen in high-dimensional datasets. In these circumstances, regularization (L1 and L2) techniques may be used to minimize over-fitting.

- It works well when the dataset is linearly separable and has good accuracy for many basic data sets.

- It is straightforward to implement, interpret, and train compared with more complex models.

- The estimated parameters (trained weights) indicate how relevant each feature is, and their sign gives the direction of the association, positive or negative. As a result, logistic regression may be used to examine the relationship between the features and the outcome.

- Unlike decision trees or support vector machines, a logistic regression model can be readily updated to incorporate new data. Stochastic gradient descent can be used to update the model.

- It is less prone to over-fitting in a low-dimensional dataset with enough training instances.

- Logistic Regression proves to be highly efficient when the classes in the dataset are linearly separable.

- It has a strong resemblance to neural networks. A neural network representation may be thought of as a collection of small logistic regression classifiers stacked together.

- The training time of logistic regression is considerably shorter than that of more sophisticated algorithms, such as an Artificial Neural Network, because the model has a simple form with few parameters.

- It can easily be extended to multi-class classification using a softmax classifier; this extension is called Multinomial Logistic Regression.
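To illustrate the softmax extension mentioned in the last point, here is a minimal NumPy sketch. The weights `W`, bias `b`, and input `x` are hypothetical values picked for the example, not parameters of a trained model.

```python
import numpy as np

def softmax(z):
    """Turn a vector of class scores into probabilities that sum to 1."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical parameters: 3 classes, 2 features (one row of W per class).
W = np.array([[1.0, -0.5],
              [0.2, 0.3],
              [-1.0, 0.8]])
b = np.array([0.0, 0.1, -0.1])
x = np.array([2.0, 1.0])

probs = softmax(W @ x + b)              # one linear score per class, then softmax
predicted_class = int(np.argmax(probs))

print("class probabilities:", probs.round(3))
print("predicted class:", predicted_class)
```

With only two classes and scores `[z, 0]`, softmax gives the same probability as the sigmoid of `z`, which is why binary logistic regression is the two-class special case.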

# 18. Disadvantages of Logistic Regression

- Logistic Regression should not be used if the number of observations is fewer than the number of features; otherwise, it may result in overfitting.

- Because it creates linear decision boundaries, it performs poorly on complex or non-linear data.

- It can only predict discrete outcomes. As a result, the Logistic Regression dependent variable is restricted to a discrete set of values.

- Logistic regression requires little or no multicollinearity between independent variables.

- Logistic regression needs a big dataset and enough training samples to identify all of the categories.

- Because this method is sensitive to outliers, data values that differ from the anticipated range may cause erroneous results.

- Only significant and relevant features should be utilized to construct a model; otherwise, the model’s probabilistic predictions may be inaccurate, and its predictive value may suffer.

- Complex connections are difficult to represent with logistic regression. This technique is readily outperformed by more powerful and sophisticated algorithms such as Neural Networks.

- Because logistic regression has a linear decision surface, it cannot address nonlinear issues. In real-world settings, linearly separable data is uncommon. As a result, non-linear features must be transformed, which may be done by increasing the number of features such that the data can be separated linearly in higher dimensions.

- Logistic regression seeks to predict accurate probabilities from the independent variables. On high-dimensional datasets, this may cause the model to over-fit the training set, overstating the accuracy of predictions on the training set and preventing the model from accurately predicting outcomes on the test set. This is most common when the model is trained on a small amount of training data with many features. Regularization strategies should be explored on high-dimensional datasets to minimize over-fitting (although this adds complexity to the modeling process). The model may under-fit the training data if the regularization is too strong.
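The effect of regularization strength described in the last point can be sketched with scikit-learn. In `LogisticRegression`, the parameter `C` is the inverse of the regularization strength, so a smaller `C` means stronger L2 regularization. The synthetic dataset below (few samples, many features) is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic high-dimensional data: 60 samples, 40 features (illustrative only).
X, y = make_classification(
    n_samples=60, n_features=40, n_informative=5, random_state=0
)

norms = {}
for C in (0.01, 1.0, 100.0):
    # Default penalty is L2; smaller C means stronger regularization.
    model = LogisticRegression(C=C, max_iter=5000).fit(X, y)
    norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C}: weight norm = {norms[C]:.3f}, "
          f"training accuracy = {model.score(X, y):.2f}")
```

Strong regularization (small `C`) shrinks the weights toward zero and can under-fit, while weak regularization (large `C`) lets the weights grow and can fit the training set too closely. In practice `C` is usually chosen by cross-validation rather than by training accuracy.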

# 19. Applications of Logistic Regression

Logistic Regression can be applied to any use case where data must be categorized into groups. Consider the following examples:

- Credit card fraud detection
- Email classification (spam or ham)
- Sentiment analysis of Twitter data
- Image segmentation, recognition, and classification (X-rays, scans)
- Object detection in video
- Handwriting recognition
- Disease prediction (diabetes, cancer, Parkinson's, etc.)

# 20. Conclusion

The Logistic Regression model is a powerful technique used for binary classification tasks. It is widely used in various fields, such as finance, marketing, healthcare, and social sciences, to model and predict binary outcomes.