# <u>Logistic Regression</u>
## Part 1 - Principles
#### 1. Definition
Logistic Regression is a classification model that uses input variables (features) to predict a categorical outcome variable that can take on one of a limited set of class values. A binomial logistic regression is limited to two output categories, while a multinomial logistic regression allows for more than two classes. Typical uses include classifying a binary condition as 'true'/'false', or an image as 'bicycle'/'train'/'car'. Linear Regression, by contrast, predicts the continuous value of a dependent variable, which makes it a regression algorithm; despite its name, Logistic Regression is generally used as a classification algorithm.


#### 2. Logistic Function
An explanation of Logistic Regression can begin with an explanation of the standard logistic function. The logistic function is a sigmoid function, which takes any real input ***t***, and outputs a value between zero and one. 

The standard logistic function **σ** : R -> (0, 1) is defined as $σ(t) = \frac{1}{1+e^{-t}}$
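
As a quick illustration, the standard logistic function can be written in a few lines of Python. This is a minimal sketch using only the standard library; the function name `sigmoid` is our own choice and is not part of the lab notebook.

```python
import math

def sigmoid(t: float) -> float:
    """Standard logistic function: 1 / (1 + e^(-t))."""
    # Two algebraically equivalent forms are used so that math.exp
    # is only ever called with a non-positive argument (no overflow).
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)
    return z / (1.0 + z)

# Large negative inputs approach 0, large positive inputs approach 1,
# and t = 0 sits exactly in the middle.
for t in (-5.0, 0.0, 5.0):
    print(f"sigmoid({t:+.1f}) = {sigmoid(t):.4f}")
```

The function is symmetric around the point (0, 0.5): for any ***t***, `sigmoid(t) + sigmoid(-t)` equals 1.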

In fact, Logistic Regression is the combination of linear regression and the logistic function. Let us assume that ***t*** is a linear function of a single explanatory variable ***x***, so that $t = \beta_0+\beta_1x$. The general logistic function **p** can then be written as follows:

$$p(x) = σ(t) = \frac{1}{1+e^{-(\beta_0+\beta_1x)}}$$
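
A short derivation may help connect this formula to the linear part. Taking the exponent to be $-t = -(\beta_0+\beta_1x)$ and reading $p(x)$ as the probability of the positive class, the odds and the log-odds work out as:

$$\frac{p(x)}{1-p(x)} = e^{\beta_0+\beta_1x} \quad\Longrightarrow\quad \ln\frac{p(x)}{1-p(x)} = \beta_0+\beta_1x$$

In other words, the model is linear in the log-odds. It also follows that $p(x) \ge 0.5$ exactly when $\beta_0+\beta_1x \ge 0$, which is where the usual 0.5 decision threshold comes from.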


#### 3. Maximum Likelihood Estimation
Linear Regression uses the Ordinary Least Squares (OLS) method to select its best-fit line. Logistic Regression instead uses Maximum Likelihood Estimation (MLE): the regression coefficients ***β<sub>0</sub>*** and ***β<sub>1</sub>*** are chosen to maximize the likelihood of the observed class labels under the model. We won't explore MLE further here; more details are easy to find on the Web.
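
For a concrete, if simplified, picture of what MLE does, the sketch below fits ***β<sub>0</sub>*** and ***β<sub>1</sub>*** by plain gradient ascent on the log-likelihood. Everything here is an assumption made for illustration: the toy data is invented, the helper names (`log_likelihood`, `fit_mle`) are hypothetical, and the learning rate and step count are arbitrary. scikit-learn uses more sophisticated solvers and adds regularization by default, so this is not how `LogisticRegression` is implemented.

```python
import math

def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)
    return z / (1.0 + z)

def log_likelihood(b0, b1, xs, ys):
    # Sum over samples of y*log(p) + (1 - y)*log(1 - p), with p = sigmoid(b0 + b1*x).
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def fit_mle(xs, ys, lr=0.1, steps=5000):
    # Gradient ascent: the gradient of the log-likelihood is
    # sum(y - p) for b0 and sum((y - p) * x) for b1 (averaged here over n).
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        residuals = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        g0 = sum(residuals) / n
        g1 = sum(r * x for r, x in zip(residuals, xs)) / n
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1

# Invented toy data: one feature x and a binary label y.
# The classes overlap, so a finite maximum-likelihood solution exists.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_mle(xs, ys)
print("beta_0 =", round(b0, 3), "beta_1 =", round(b1, 3))
print("log-likelihood at start:", round(log_likelihood(0.0, 0.0, xs, ys), 3))
print("log-likelihood after fit:", round(log_likelihood(b0, b1, xs, ys), 3))
```

The log-likelihood after fitting should be higher than at the starting point (0, 0), which is the sense in which the fitted coefficients explain the observed labels better.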



## Part 2 - Code Implementation
> NOTE: The complete code can be found in *code.ipynb*
#### 1. Import the Relevant Libraries

In [None]:
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

#### 2. Load the Dataset

In [None]:
breast_cancer = load_breast_cancer()

# return_X_y=True: return the (data, target) pair instead of a Bunch object
# as_frame=True: return X as a pandas DataFrame and y as a pandas Series
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

#### 3. Explore the Dataset
The breast cancer dataset is a classic, easy binary classification dataset. The task is to classify each tumor as malignant (cancerous) or benign (non-cancerous). In scikit-learn's encoding, target '0' means malignant and '1' means benign.
##### 3.1 Display the first 5 rows of X

In [None]:
X.head()

##### 3.2 Display summary information about the columns of X

In [None]:
X.info()

#### 4. Split the Dataset into Training and Test Data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
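
To see what the split produces, the sketch below repeats it and prints the resulting sizes. It also shows an optional variant with `stratify=y`, which the lab does not use; it is included only to illustrate how to keep the class proportions of the test set close to those of the full dataset. The variable names ending in `_s` are our own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Same split as in the lab: 20% of the rows go to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)
print("train rows:", len(X_train), "test rows:", len(X_test))

# Optional variant (not used in the lab): stratify on y so that both
# parts keep roughly the same share of class '1' as the full dataset.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y
)
print("share of class 1, full data:      ", round(y.mean(), 3))
print("share of class 1, plain test set: ", round(y_test.mean(), 3))
print("share of class 1, stratified test:", round(y_test_s.mean(), 3))
```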

#### 5. Model Initialization and Training

In [None]:
LR = LogisticRegression(max_iter = 2000)
LR.fit(X_train, y_train)
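
The fitted model can be linked back to Part 1. The sketch below is an illustration rather than part of the lab: it reads the learned intercept and coefficients from the `intercept_` and `coef_` attributes and recomputes the linear score ***t*** by hand, assuming the same data split and settings as above. With 30 features, ***t*** is the multi-feature version of $\beta_0+\beta_1x$.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

LR = LogisticRegression(max_iter=2000)
LR.fit(X_train, y_train)

# One intercept (beta_0) and one coefficient per feature.
print("intercept shape:  ", LR.intercept_.shape)
print("coefficient shape:", LR.coef_.shape)

# Linear score t = beta_0 + sum_j beta_j * x_j, computed by hand ...
t_manual = X_test.to_numpy() @ LR.coef_.ravel() + LR.intercept_[0]
# ... and the same quantity as reported by scikit-learn.
t_sklearn = LR.decision_function(X_test)
print("scores match:", np.allclose(t_manual, t_sklearn))
```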

#### 6. Predictions

In [None]:
y_pred = LR.predict(X_test)
print('The Predicted values are: ', y_pred)
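
The predicted labels come from probabilities. As a sketch under the same setup as the lab, the code below uses `predict_proba` to obtain the estimated probability of each class and checks that thresholding the probability of class '1' at 0.5 gives the same labels as `predict`. The name `y_from_proba` is our own.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)
LR = LogisticRegression(max_iter=2000)
LR.fit(X_train, y_train)

y_pred = LR.predict(X_test)

# One row per test sample, one column per class in the order of LR.classes_.
proba = LR.predict_proba(X_test)
print("classes:", LR.classes_)
print("first five probabilities of class 1:", np.round(proba[:5, 1], 3))

# Thresholding the probability of class 1 at 0.5 reproduces predict().
y_from_proba = (proba[:, 1] > 0.5).astype(int)
print("same labels as predict():", np.array_equal(y_pred, y_from_proba))
```

Working with probabilities rather than labels makes it possible to pick a threshold other than 0.5 when one type of error is more costly than the other.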

#### 7. Model Performance Evaluation
##### 7.1 Confusion Matrix
A confusion matrix is a summary of prediction results on a classification problem. The numbers of correct and incorrect predictions are counted and broken down by class. It gives you insight not only into how many errors your classifier makes but, more importantly, into which types of errors it makes.

For this lab:
- if class '1' is treated as positive: True Negatives (TN) = 46, True Positives (TP) = 62, False Negatives (FN) = 5, False Positives (FP) = 1
- if class '0' is treated as positive: True Negatives (TN) = 62, True Positives (TP) = 46, False Negatives (FN) = 1, False Positives (FP) = 5

In [None]:
CM = confusion_matrix(y_test, y_pred)
# visualization
plt.figure(figsize = (8,6))
sns.heatmap(CM, annot=True,cmap='Oranges')
plt.xlabel('Predicted label')
plt.ylabel('Actual label')
plt.show()  
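
The four counts can also be read directly from the matrix instead of the heatmap. In scikit-learn's `confusion_matrix`, rows are actual labels and columns are predicted labels, both in sorted label order, so for a binary problem with class '1' as positive the layout is `[[TN, FP], [FN, TP]]`. The sketch below assumes the same split and model settings as the lab; exact counts may differ slightly between library versions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)
LR = LogisticRegression(max_iter=2000)
LR.fit(X_train, y_train)
y_pred = LR.predict(X_test)

CM = confusion_matrix(y_test, y_pred)

# Flatten row by row: [[TN, FP], [FN, TP]] -> TN, FP, FN, TP
tn, fp, fn, tp = CM.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```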

##### 7.2 Classification Report
We can obtain metrics such as accuracy, precision, recall and F1-score using `classification_report`.
- ***accuracy***: out of all items, the fraction that were labeled correctly
$$\text{accuracy} = \frac{TP+TN}{TP+TN+FP+FN}$$
- ***precision***: out of all the items labeled as positive, the fraction that truly belong to the positive class
$$\text{precision} = \frac{TP}{TP+FP}$$
- ***recall***: out of all the items that are truly positive, the fraction that were correctly classified as positive
$$\text{recall} = \frac{TP}{TP+FN}$$
- ***f1-score***: the harmonic mean of precision and recall
$$\text{F1-score} = \frac{TP}{TP + 0.5 \times (FP + FN)}$$
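
As a worked example, the formulas above can be applied to the counts quoted in section 7.1 with class '1' as positive (TP = 62, TN = 46, FP = 1, FN = 5). This is plain arithmetic and needs no libraries; the variable names are our own.

```python
# Counts from section 7.1, with class '1' treated as positive.
tp, tn, fp, fn = 62, 46, 1, 5

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = tp / (tp + 0.5 * (fp + fn))

# The F1 formula above is the harmonic mean of precision and recall
# written directly in terms of the counts; both forms agree.
f1_harmonic = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy:.3f}")
print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
print(f"f1-score  = {f1:.3f} (harmonic-mean form: {f1_harmonic:.3f})")
```

These values should correspond to the row for class '1' in the classification report, up to rounding.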

[Here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) you can find the complete documentation for the classification_report() function.

In [None]:
CR = classification_report(y_test, y_pred)
print('Classification Report is: \n', CR)

## Part 3 - Practice
Use the Logistic Regression algorithm to analyze another dataset in sklearn: *iris*. It contains three iris species with 50 samples each, along with four measurements for each flower. Use 70% of the dataset as the training set and 30% as the test set, then answer the following questions.
1. Draw the confusion matrix and analyze the TN, TP, FN and FP of classification '2'.
2. Explain how the precision, recall and f1-score of classification '1' are calculated.

**HINT**: You can make use of the file *"code.ipynb"* and properly modify it to complete the practice.