# Logistic regression

Logistic regression is one of the most studied and widely used classification algorithms, probably due to its popularity in regulated industries and financial settings. Although more modern classifiers may produce models with higher accuracy, logistic regression models are great baselines due to their high interpretability and parametric nature. This module walks you through extending a linear regression example into a logistic regression, and covers the most common error metrics you can use to compare several classifiers and select the one that best suits your business problem.

## Learning Objectives
- Identify, use, and interpret error metrics to evaluate classification models.
- Identify common supervised machine learning algorithms.
- Build logistic regression classification models with sklearn.
- Use logistic regression models for classification.

## About this course
This course introduces you to one of the main types of modelling families of supervised Machine Learning: Classification. You will learn how to train predictive models to classify categorical outcomes and how to use error metrics to compare across different models. The hands-on section of this course focuses on using best practices for classification, including train and test splits, and handling data sets with unbalanced classes. By the end of this course you should be able to:

- Differentiate uses and applications of classification and classification ensembles

- Describe and use logistic regression models

- Describe and use decision tree and tree-ensemble models

- Describe and use other ensemble methods for classification

- Use a variety of error metrics to compare and select the classification model that best suits your data

- Use oversampling and undersampling as techniques to handle unbalanced classes in a data set

### Who should take this course?
This course targets aspiring data scientists interested in acquiring hands-on experience with Supervised Machine Learning Classification techniques in a business setting.

### What skills should you have?
To make the most of this course, you should be familiar with programming in a Python development environment and have a fundamental understanding of Data Cleaning, Exploratory Data Analysis, Calculus, Linear Algebra, Probability, and Statistics.





## Introduction: What is Classification?

Classification: Outcome is a category

Prediction examples:
- Detecting fraudulent transactions
- Customer churn
- Event attendance
- Network load
- Loan default

### What is Needed for Classification?

Model data with:
- Features that can be quantified
- Labels that are known
- Method to measure similarity

### Modeling Classification:

Examples of models used for supervised classification:
- Logistic Regression
- K-Nearest Neighbors
- Support Vector Machines
- Neural Networks
- Decision Tree
- Random Forests
- Boosting
- Ensemble Models

With the exception of logistic regression, each of these models can be used for both regression and classification.


## Introduction to Logistic Regression

### Linear Regression vs. Logistic Regression

![](./images/01_LogisticRegression.png)

![](./images/02_LR.png)

### Using Logistic Regression for Classification

#### Sigmoid Function
![](./images/03_SigmoidFunction.png)

![](./images/04_sf2.png)

#### Logistic Regression

$$
P(x) = \dfrac{1}{1+e^{-(\beta_0+\beta_1x)}}
$$
=>
$$
P(x) = \dfrac{e^{(\beta_0+\beta_1x)}}{1+e^{(\beta_0+\beta_1x)}}
$$
=>

$$
\dfrac{P(x)}{1-P(x)}= e^{(\beta_0+\beta_1x)}
$$

=>

$$
\log\left(\dfrac{P(x)}{1-P(x)}\right)= \beta_0+\beta_1x
$$
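
The derivation above can be checked numerically. The following is a minimal sketch using only the standard library; the coefficient values `beta_0` and `beta_1` and the helper names `sigmoid` and `log_odds` are illustrative assumptions, not part of the course material. It shows that taking the log-odds of the sigmoid output recovers the linear term $\beta_0+\beta_1x$.

```python
import math


def sigmoid(z):
    """Map any real number z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_odds(p):
    """Inverse of the sigmoid: the log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients, chosen only for illustration
beta_0, beta_1 = -1.0, 2.0

x_values = [-2.0, 0.0, 0.5, 3.0]
probabilities = [sigmoid(beta_0 + beta_1 * x) for x in x_values]
recovered = [log_odds(p) for p in probabilities]

for x, p, lo in zip(x_values, probabilities, recovered):
    # The recovered log-odds should match beta_0 + beta_1 * x
    print(f"x={x:+.1f}  P(x)={p:.4f}  log-odds={lo:+.4f}")
```

Every probability stays strictly between 0 and 1, while the log-odds remain linear in `x`, which is what makes the coefficients interpretable.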


![](./images/05_classification_with_lr.png)


### Logistic Regression with Multi-Classes

![](./images/06_multiclass.png)

Steps:
- One vs all
- Assign region


### Implementing Logistic Regression Models

```python
# Logistic Regression: The Syntax

# Import the class containing the classification method
from sklearn.linear_model import LogisticRegression

# Create an instance of the class
LR = LogisticRegression(penalty='l2', C=10.0)  # regularization parameters

# Fit the instance on the data and then predict the expected value
# (X_train, y_train, and X_test are assumed to come from a train/test split)
LR = LR.fit(X_train, y_train)
y_predict = LR.predict(X_test)

# Can now view the output fitted coefficients
LR.coef_

# Tune regularization parameters with cross-validation
from sklearn.linear_model import LogisticRegressionCV
```
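
For a version that runs end to end, here is a minimal sketch on synthetic data. The data set from `make_classification`, the split sizes, and the random seeds are assumptions made only so the example is self-contained; with real data you would substitute your own features and labels.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary data, used here only so the example runs on its own
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# L2 regularization is the default; a larger C means weaker regularization
LR = LogisticRegression(C=10.0)
LR.fit(X_train, y_train)

y_predict = LR.predict(X_test)        # hard class labels
y_proba = LR.predict_proba(X_test)    # one probability column per class

print(LR.coef_.shape)   # one row of coefficients, one column per feature
print(y_proba[:3])
```

`predict` applies a 0.5 threshold to the positive-class probability; `predict_proba` exposes the probabilities themselves, which is what threshold-based tools such as the ROC curve work with.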

### Classification with Logistic Regression

Applications for Logistic Regression:
- Customer spending: how likely a customer is to be a top 5% spender, using previous purchase data
- Customer engagement: which customers are most likely to engage in the next 6 months
- E-commerce: which transactions are fraudulent, using customer characteristics, location, IP address, etc.
- Finance/risk: predicting whether a loan will default

Interpretation vs. prediction:
- In addition to prediction, we may want to evaluate the importance of each factor in influencing outcomes
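
One common way to read factor importance from a fitted logistic regression is through odds ratios: because the log-odds are linear in the features, exponentiating a coefficient gives the multiplicative change in the odds for a one-unit increase in that feature, holding the others fixed. The sketch below assumes synthetic data and hypothetical feature names (`f0` to `f3`), and standardizes the features so the coefficient magnitudes are roughly comparable. Regularization shrinks the coefficients, so treat the values as indicative rather than exact.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data and hypothetical feature names, for illustration only
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]

# Standardize so that one unit means one standard deviation for every feature
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression().fit(X_scaled, y)

# exp(coefficient) = factor by which the odds change per one-unit increase
odds_ratios = np.exp(model.coef_[0])

for name, coef, ratio in zip(feature_names, model.coef_[0], odds_ratios):
    print(f"{name}: coefficient={coef:+.3f}  odds ratio={ratio:.3f}")
```

An odds ratio above 1 means the feature pushes predictions toward the positive class; below 1, toward the negative class.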



## Confusion Matrix, Accuracy, Specificity, Precision, and Recall

### Choosing the Right Error Measurement

You are asked to build a classifier to predict whether individuals have leukemia.

Training data:
- 1% of patients have leukemia, 99% are healthy.

Measure accuracy:
- total % of predictions that are correct.

Build a simple model that always predicts "healthy".

Accuracy will be ... 99%, even though the model never identifies a single patient with leukemia.
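
The leukemia example can be reproduced in a few lines of plain Python. The cohort size of 1,000 patients is an assumption chosen to match the 1% / 99% split; the point of the sketch is that accuracy looks excellent while the share of sick patients detected is zero.

```python
# 1,000 hypothetical patients: 1% have leukemia (label 1), 99% are healthy (label 0)
y_true = [1] * 10 + [0] * 990

# The "simple model": always predict healthy
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that are correct
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Share of actual leukemia patients that the model detects
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(y_true)
detected_share = true_positives / actual_positives

print(f"accuracy       = {accuracy:.2%}")        # 99.00%
print(f"detected share = {detected_share:.2%}")  # 0.00%
```

This is why accuracy alone is a poor choice when classes are heavily imbalanced.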

### Confusion matrix

![](./images/07_ConfusionMatrix.png)

#### Accuracy: Predicting correctly
$$
\text{Accuracy} = \dfrac{TP+TN}{TP+FN+FP+TN}
$$

#### Recall: Identifying all positive instances
$$
\text{Recall or Sensitivity} = \dfrac{TP}{TP+FN}
$$

#### Precision: Identifying only positive instances
$$
\text{Precision} = \dfrac{TP}{TP+FP}
$$

#### Specificity: Avoiding false alarms

$$
\text{Specificity} = \dfrac{TN}{FP+TN}
$$

#### F1: Combining precision and recall

$$
F1 = 2 \, \dfrac{\text{Precision} \times \text{Recall}}{\text{Precision}+\text{Recall}}
$$
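
As a worked example of the formulas above, the following sketch computes each metric from a set of confusion-matrix counts. The counts (`TP`, `FN`, `FP`, `TN`) are hypothetical numbers chosen for illustration, not results from any real model.

```python
# Hypothetical confusion-matrix counts for 1,000 observations
TP, FN = 40, 10    # actual positives: detected vs. missed
FP, TN = 20, 930   # actual negatives: false alarms vs. correctly rejected

accuracy = (TP + TN) / (TP + FN + FP + TN)
recall = TP / (TP + FN)          # also called sensitivity
precision = TP / (TP + FP)
specificity = TN / (FP + TN)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy    = {accuracy:.3f}")     # 0.970
print(f"recall      = {recall:.3f}")       # 0.800
print(f"precision   = {precision:.3f}")    # 0.667
print(f"specificity = {specificity:.3f}")  # 0.979
print(f"F1          = {f1:.3f}")           # 0.727
```

Accuracy and specificity are both high here because negatives dominate, while precision and F1 reveal that a third of the positive predictions are false alarms.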

### Receiver Operating Characteristic (ROC)

![](./images/08_ReceiverOperatingCharacterisitc.png)

![](./images/08_ROC.png)

![](./images/09_roc2.png)

![](./images/10_precisionREcallcurve.png)



### Choosing the Right Approach

Which approach works best for choosing a classifier?

ROC Curve:
- Generally better for data with balanced classes.

Precision-Recall Curve:
- Generally better for data with imbalanced classes.

The right curve depends on tying results (true positives, true negatives, etc.) to
outcomes (relative cost of false positive or false negative).

The curves compare classifiers generally (across possible decision thresholds),
which may be less relevant to business objectives.
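
Both curves can be computed from the same predicted probabilities. The sketch below assumes a synthetic data set with roughly 5% positives (generated with `make_classification`) and a default logistic regression. It shows the mechanics of producing the curves and their summary scores, not a recommendation for any particular threshold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: about 95% negatives, 5% positives (an assumption)
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]   # probability of the positive class

# Each curve sweeps over every possible decision threshold
fpr, tpr, roc_thresholds = roc_curve(y_test, scores)
precision, recall, pr_thresholds = precision_recall_curve(y_test, scores)

# Single-number summaries of each curve
roc_auc = roc_auc_score(y_test, scores)
avg_precision = average_precision_score(y_test, scores)

print(f"ROC AUC           = {roc_auc:.3f}")
print(f"average precision = {avg_precision:.3f}")
```

On imbalanced data the two summaries can differ noticeably, because the false positive rate is computed relative to the large pool of negatives while precision is computed relative to the positive predictions only.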

### Multi Class Error Metrics

![](./images/11_MultiClassErrorMetrics.jpg)
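
With more than two classes, the same metrics are typically computed per class from a larger confusion matrix. The sketch below does not reproduce the figure; it uses three hypothetical classes and hand-written labels to show how per-class recall and precision fall out of the rows and columns of the matrix.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical classes and labels, for illustration only
labels = ["bird", "cat", "dog"]
y_true = ["bird"] * 4 + ["cat"] * 3 + ["dog"] * 3
y_pred = ["bird", "bird", "bird", "cat",
          "cat", "cat", "dog",
          "dog", "dog", "bird"]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=labels)

per_class_recall = cm.diagonal() / cm.sum(axis=1)     # correct / actual count
per_class_precision = cm.diagonal() / cm.sum(axis=0)  # correct / predicted count
accuracy = cm.diagonal().sum() / cm.sum()

# The same per-class values via scikit-learn, plus an unweighted (macro) average
sk_precision = precision_score(y_true, y_pred, labels=labels, average=None)
macro_recall = recall_score(y_true, y_pred, labels=labels, average="macro")

print(cm)
print(per_class_recall, per_class_precision, accuracy, macro_recall)
```

Macro averaging weights every class equally, which is one option among several; weighted or micro averages may be more appropriate depending on the class balance.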

### Classification Error Metrics: Syntax
```python
# Import the desired error function
from sklearn.metrics import accuracy_score

# Calculate the error on the test and predicted data sets
# (y_test and y_pred are assumed to hold the true and predicted labels)
accuracy_value = accuracy_score(y_test, y_pred)

# Lots of other error metrics and diagnostic tools:
from sklearn.metrics import (precision_score, recall_score,
                             f1_score, roc_auc_score,
                             confusion_matrix, roc_curve,
                             precision_recall_curve)
```

# Summary/Review

## Classification Problems
The two main types of supervised learning models are:

- Regression models, which predict a continuous outcome.
- Classification models, which predict a categorical outcome.

The most common models used in supervised learning are:

- Logistic Regression
- K-Nearest Neighbors
- Support Vector Machines
- Decision Tree
- Neural Networks
- Random Forests
- Boosting
- Ensemble Models

With the exception of logistic regression, these models are commonly used for both regression and classification. Logistic regression is most common for dichotomous and nominal dependent variables.

## Logistic Regression
Logistic regression is a type of regression that models the probability of a certain class occurring given other independent variables. It uses a logistic or logit function to model a dependent variable. It is a very common predictive model because of its high interpretability.

## Classification Error Metrics
A confusion matrix tabulates true positives, false negatives, false positives, and true negatives. Remember that a false positive is also known as a type I error, and a false negative is also known as a type II error.

Accuracy is defined as the sum of true positives and true negatives divided by the total number of observations. It measures how often the model correctly predicts both positive and negative instances.

Recall, or sensitivity, is the ratio of true positives to the total number of actual positives. It quantifies the percentage of positive instances correctly identified.

Precision is the ratio of true positives to the total number of predicted positives. The closer this value is to 1.0, the better the model is at identifying only positive instances.

Specificity is the ratio of true negatives divided by the total number of actual negatives. The closer this value is to 1.0, the better job this model does at avoiding false alarms.

The receiver operating characteristic (ROC) plots the true positive rate (sensitivity) of a model vs. its false positive rate (1 - specificity).

The area under the curve of a ROC plot is a very common criterion for selecting a classification method.

The precision-recall curve measures the trade-off between precision and recall.

The ROC curve generally works better for data with balanced classes, while the precision-recall curve generally works better for data with unbalanced classes.  