# Topic 2 -- Logistic Regression

Welcome to topic 2 of the Beginner AI Course! In this notebook, we are going to explore the **classification** problem in supervised learning, see why **logistic regression** outperforms **linear regression** on it, and watch logistic regression in action predicting the types of Pokemon!

## Table of Contents
1. [Installing Dependencies](#installing)


2. [Linear Regression vs Classification](#linregfail)


3. [Logistic Regression](#logreg)
    - [Sigmoid Activation Function](#sigmoid)
    - [Cost Function](#cost)
    - [Putting it all together](#putting)
   
   
4. [Binary Classification](#binary)

## Installing Dependencies <a name="installing">

Let's first load the dependencies needed for this demo. Here is a list of the dependencies and what each one is for:

- **NumPy**: Powerful linear algebra library
- **Pandas**: Used for organizing and analyzing tabular data
- **scikit-learn**: Machine learning library containing the `LogisticRegression` class
- **Bokeh** and **Matplotlib**: Plots and graphs for visualizing the data
- **disp_utils**: A custom module for displaying the demo visuals

In [1]:
import numpy as np
import pandas as pd
from bokeh.plotting import figure, output_notebook, show
import matplotlib.pyplot as plt
from disp_utils import *

pd.set_option('display.max_rows', None)

## Linear Regression vs Classification <a name="linregfail">

Let's set the scene: you are trying to train a machine learning model that predicts whether you get admitted into the NBA based on your height. Notice that this is a classification problem, as there are only two discrete values that the model is allowed to output: either **yes**, you get admitted, or **no**, you are rejected. We will denote the output as $\hat{y}$, and it can only take the values **1 or 0**, 1 being yes and 0 being no. Let's display a plot of what a hypothetical dataset would look like:

In [2]:
display_height_dataset()

Here, we can see that from a height of around 1.8 metres upward, you will be admitted into the NBA. After running linear regression, we are able to fit a linear hypothesis function to the data:

In [3]:
display_height_line1()

One thing to note is that the linear hypothesis function is not discrete. That is okay, because we can interpret the output of the function as the **probability** that you will be admitted given your height. For example, if you are 1.6 metres tall, there is roughly a 10% probability that you will be admitted, whereas if you are 2 metres tall, there is a 90% probability. We will use this probability to make a cut-off: **for all probabilities greater than or equal to 50% we predict $\hat{y} = 1$, and for all probabilities less than 50% we predict $\hat{y} = 0$.**

However, there are a few problems to discuss: if your height is less than 1.55 metres, your predicted probability of getting admitted is less than 0%, and if your height is over 2.05 metres, it is greater than 100%. Neither is a valid probability. Furthermore, let's say we add some **outliers**; let's add a height of 1.0 metres to our dataset:

In [4]:
display_height_line2()

Here, we can see that adding an outlier has significantly shifted the **decision boundary**. Before, a height of 1.8 metres and above had a high likelihood of getting admitted, whereas now, with the introduction of a single data point, the cut-off height has decreased to 1.67 metres.
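
To make both problems concrete, here is a minimal sketch that fits a least-squares line with `np.polyfit` and finds where it crosses the 50% cut-off. The heights and labels are hypothetical stand-ins, not the dataset behind the plots above, so the exact numbers will differ from the ones quoted in the text.

```python
import numpy as np

# Hypothetical heights (metres) and admission labels, assumed for illustration
heights = np.array([1.5, 1.6, 1.7, 1.75, 1.85, 1.9, 2.0, 2.1])
admitted = np.array([0, 0, 0, 0, 1, 1, 1, 1])

def fit_line(x, y):
    """Fit the least-squares line y = w*x + b and return (w, b)."""
    w, b = np.polyfit(x, y, 1)
    return w, b

def decision_boundary(w, b):
    """Height at which the line crosses the 50% cut-off."""
    return (0.5 - b) / w

w, b = fit_line(heights, admitted)
boundary_before = decision_boundary(w, b)
low_output = w * 1.5 + b    # "probability" at 1.5 m, falls below 0
high_output = w * 2.1 + b   # "probability" at 2.1 m, rises above 1

# Add a single outlier: a 1.0 m tall person who was not admitted
heights_outlier = np.append(heights, 1.0)
admitted_outlier = np.append(admitted, 0)
w2, b2 = fit_line(heights_outlier, admitted_outlier)
boundary_after = decision_boundary(w2, b2)

print("outputs at 1.5 m and 2.1 m:", round(low_output, 3), round(high_output, 3))
print("boundary before outlier:", round(boundary_before, 3))
print("boundary after outlier: ", round(boundary_after, 3))
```

On this toy data the line produces outputs outside the 0 to 1 range, and the single extra point pulls the boundary towards shorter heights, which mirrors the behaviour in the plots.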

**Therefore, linear regression is a poor choice for classification problems.**

## Logistic Regression <a name="logreg">

Despite the name "regression", logistic regression is an algorithm used for classification problems. The base concept is **very similar to linear regression**; you still have a linear feed forward step. This time I will use the letter $z$ to denote the linear output:

<div style="text-align: center">
    <div>
    &nbsp;
    </div>
 $z = wx + b$
</div>

### Sigmoid Activation Function <a name="sigmoid">
Now, however, we feed the linear result into the **sigmoid function**. The sigmoid function is one type of **activation function**. It bounds the output between 0 and 1, which eliminates probabilities greater than 1 or less than 0. Because the output flattens out for very large or very small $z$, a far-away **outlier** also has much less pull on the **decision boundary** than it does with a straight line.
    
The sigmoid function is given by the formula below:

<div style="text-align: center">
    <div>
    &nbsp;
    </div>
 $\begin{align}\sigma(z) = {1\over 1 + e^{-z}}\end{align}$
</div>
    
To visualize the sigmoid function, let's define it in Python below:

In [5]:
def sigmoid(z):
    return 1/(1 + np.exp(-z))

In [6]:
display_func(sigmoid)
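
As a quick numerical check alongside the plot, the sketch below evaluates the sigmoid at a few hand-picked values of $z$ and applies the 50% cut-off from earlier. It repeats the `sigmoid` definition so that it runs on its own; the sample values are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Arbitrary sample inputs, from strongly negative to strongly positive
z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
a = sigmoid(z)

# Every output lies strictly between 0 and 1, and z = 0 maps to exactly 0.5
print(a)

# Apply the cut-off: predict 1 when the probability is at least 50%
y_hat = (a >= 0.5).astype(int)
print(y_hat)
```

Since $\sigma(0) = 0.5$, the cut-off at 50% corresponds to the point where $z = wx + b = 0$.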

### Cost Function <a name="cost">
    
Previously, we learned that linear regression uses the **Mean Squared Error (MSE)** cost function to measure the error in the fit. While MSE works great for linear regression, it does not work well for logistic regression: once the prediction passes through the **sigmoid function**, the MSE cost is no longer convex in the parameters. The figures below show the **non-convex** nature of MSE used with the sigmoid function:

In [8]:
disp_convex()

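Non-convex functions can be problematic for **gradient descent**, because it may settle in a local minimum or stall on a flat region instead of reaching the global minimum. To make the cost convex, we need a different cost function: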

<div style="text-align: center">
    <div>
    &nbsp;
    </div>
 $\begin{align}C(w, b) = {1\over m} \sum\limits_{i=1}^m -y^{(i)}\log(a^{(i)})-(1-y^{(i)})\log(1-a^{(i)})\end{align}$
</div>

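Here, $m$ is the number of training examples, $y^{(i)}$ is the label of example $i$, and $a^{(i)} = \sigma(z^{(i)})$ is the predicted probability for that example. This cost function is called **Binary Cross Entropy (BCE)**, and it is the cost function of choice for logistic regression. It is guaranteed to be convex in $w$ and $b$ when used with the **sigmoid activation function**.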
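
The difference between the two costs can be checked numerically. The sketch below is only an illustration under simplifying assumptions: a hypothetical one-example dataset ($x = 1$, $y = 1$), the bias fixed at 0, and a small `eps` clip added to keep `log` away from 0. It traces each cost as a function of $w$ and uses second differences as a rough stand-in for curvature.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def mse_cost(a, y):
    return np.mean((a - y) ** 2)

def bce_cost(a, y, eps=1e-12):
    # eps is an assumed safeguard so that log() never receives exactly 0
    a = np.clip(a, eps, 1 - eps)
    return np.mean(-y * np.log(a) - (1 - y) * np.log(1 - a))

# Hypothetical one-example dataset: x = 1, y = 1, bias fixed at 0
x, y = np.array([1.0]), np.array([1.0])
w_grid = np.linspace(-6, 6, 241)

mse_curve = np.array([mse_cost(sigmoid(w * x), y) for w in w_grid])
bce_curve = np.array([bce_cost(sigmoid(w * x), y) for w in w_grid])

# Second differences approximate curvature along w:
# negative values mean the curve bends downward there, so it is not convex
mse_curvature = np.diff(mse_curve, 2)
bce_curvature = np.diff(bce_curve, 2)

print("MSE curvature changes sign:", bool(mse_curvature.min() < 0 < mse_curvature.max()))
print("BCE curvature never negative:", bool(np.all(bce_curvature >= -1e-12)))
```

A single example is not a proof, but it shows the pattern: the MSE curve has a region that bends downward, while the BCE curve bends upward everywhere on the grid.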

### Putting it all together <a name="putting">

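All in all, logistic regression learns in much the same way as linear regression; only the activation function and the cost function differ. Here's a side-by-side summary, written as a small NumPy sketch of gradient descent. The toy dataset, learning rate, and iteration count are assumptions for illustration:

```python
import numpy as np

# Toy data, assumed for illustration: heights (m) and 0/1 admission labels
x = np.array([1.5, 1.6, 1.7, 1.9, 2.0, 2.1])
label = np.array([0, 0, 0, 1, 1, 1])
iterations, learning_rate = 1000, 0.1

# Linear Regression:
w, b = 0.0, 0.0
for i in range(iterations):
    prediction = w * x + b

    cost = np.mean((prediction - label) ** 2)        # MSE cost
    slope_w = np.mean(2 * (prediction - label) * x)  # d(cost)/dw
    slope_b = np.mean(2 * (prediction - label))      # d(cost)/db
    w, b = w - learning_rate * slope_w, b - learning_rate * slope_b

# Logistic Regression:
w, b = 0.0, 0.0
for i in range(iterations):
    z = w * x + b
    prediction = 1 / (1 + np.exp(-z))                # sigmoid(z)

    cost = np.mean(-label * np.log(prediction) - (1 - label) * np.log(1 - prediction))  # BCE cost
    slope_w = np.mean((prediction - label) * x)      # d(cost)/dw
    slope_b = np.mean(prediction - label)            # d(cost)/db
    w, b = w - learning_rate * slope_w, b - learning_rate * slope_b
```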

## Binary Classification <a name="binary">
    
Binary classification, as the name suggests, predicts one of **two** possible states. One example is the NBA admission problem above, where the two states are **yes** and **no**. This is the most basic form of logistic regression, and it is the basis of its more advanced forms.
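
As a minimal sketch of binary classification in practice, the example below fits scikit-learn's `LogisticRegression` class to a hypothetical height dataset (the heights and labels are made up for illustration and are not the notebook's data). It uses the estimator's default settings, which include L2 regularization, so the predicted probabilities may be less extreme than those of an unregularized fit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical heights (metres) and admission labels, assumed for illustration.
# scikit-learn expects a 2-D feature array, hence the reshape to one column.
heights = np.array([1.5, 1.6, 1.7, 1.75, 1.85, 1.9, 2.0, 2.1]).reshape(-1, 1)
admitted = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(heights, admitted)

# Predict for two new heights
new_heights = np.array([[1.6], [2.0]])
probabilities = model.predict_proba(new_heights)[:, 1]  # probability of class 1 (admitted)
predictions = model.predict(new_heights)                # class labels after the 50% cut-off

print("P(admitted):", np.round(probabilities, 3))
print("predicted labels:", predictions)
```

`predict_proba` returns one column per class, so column 1 holds the probability of the **yes** state, and `predict` applies the 50% cut-off to it.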