## Logistic Regression

- Can be used for both binary classification and multi-class classification
- It also gives the probability that a sample belongs to a class

### When to use?
- If the target is binary (0/1, Yes/No, etc.)
- If you need probabilistic results
- If your data is "linearly separable", meaning a linear boundary between the classes would work
- If you need to understand the impact of a feature (e.g. age's impact on voting Democrat)

### Linear Regression vs Logistic Regression
- Linear regression does not give us a probability: its output is an unbounded number, so all we can do is threshold it, predicting 0 if y < 0.5 and 1 if y >= 0.5
- Instead of (Theta_transposeX) = (Theta0 + Theta1 * x1 + ... + Theta_n * x_n)
    - We should use Sigmoid(Theta_transposeX) = Sigmoid(Theta0 + Theta1 * x1 + ... + Theta_n * x_n)

### The Sigmoid Function

- Sigmoid(Theta_transposeX) = 1 / (1 + e^(-Theta_transposeX))
- As Theta_transposeX increases, Y approaches 1.
- As Theta_transposeX decreases, Y approaches 0.
- If you're trying to estimate churn by income and age:
    - P(Y=1|X) -- Probability of Y = 1, given X
    - P(Churn=1|income, age) -- Probability of Churn = 1, given age and income
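
The sigmoid and the resulting probability can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: `predict_proba`, the weights in `theta`, and the scaled `customer` features are all hypothetical values chosen for the example.

```python
import math

def sigmoid(z):
    """Map any real number z to the open interval (0, 1)."""
    # Branch on the sign so math.exp never receives a large positive argument
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(theta, x):
    """Return P(Y=1 | x) for weights theta = [theta0, theta1, ...] and features x."""
    # theta[0] is the intercept; the remaining weights pair up with the features
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return sigmoid(z)

# Hypothetical weights and one hypothetical customer (income, age), both scaled
theta = [0.5, -1.2, 0.8]
customer = [0.3, 0.6]
p_churn = predict_proba(theta, customer)
print(f"P(Churn=1 | income, age) = {p_churn:.3f}")
```

Because the output always stays between 0 and 1, it can be read directly as P(Churn=1|income, age).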

### The Training Process
- Initialize Theta vector with random values, as with most ML algorithms. -> Theta = [-1, 2]
- Calculate Y = Sigmoid(Theta_transposeX) for a customer
    - Y = Sigmoid([-1, 2] * [2, 5]) = Sigmoid(-1 * 2 + 2 * 5) = Sigmoid(8) ≈ 0.9997
- Compare the output with the actual label. If the label is 1, then the error is 1 - 0.9997 ≈ 0.0003
- Calculate the error for all customers, then add them up. **This is the cost.**
- Change the Theta values to reduce the cost, then repeat until the cost is low enough.
- **We can use gradient descent to estimate the most accurate Theta values.**
- **Stopping the iteration is up to us: we can stop when we reach a satisfactory cost.**
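
The prediction and cost steps above can be sketched as follows. This is only an illustration under assumptions: the first row reuses the Theta and feature vector from the steps above, the other two customers are hypothetical, and the error is taken as the absolute difference between label and prediction.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

theta = [-1.0, 2.0]  # initial weights from the steps above

# (feature vector, actual label); only the first vector comes from the text,
# the other rows are hypothetical
customers = [
    ([2.0, 5.0], 1),
    ([3.0, 1.0], 0),
    ([1.0, 0.5], 1),
]

total_cost = 0.0
for x, label in customers:
    y_hat = sigmoid(dot(theta, x))   # predicted probability of class 1
    error = abs(label - y_hat)       # distance from the actual label
    total_cost += error              # summing the errors gives the cost
    print(f"x={x} label={label} y_hat={y_hat:.4f} error={error:.4f}")

print(f"cost={total_cost:.4f}")
```

A training loop would then adjust `theta` and recompute `total_cost` until it stops improving or is low enough.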

### General Cost Function

- Y = Sigmoid(Theta_transposeX)
- Cost(Y, y) = (Y - y)^2 / 2
- **J(Theta) = Avg(Cost(Y, y))**
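- We use gradient descent to find the Theta that minimizes this function; that minimum gives the best parameters
- With the sigmoid inside, the squared error is not convex in Theta, so gradient descent can get stuck in a local minimum. A cost that is easier to minimize:
    - If the desired y = 1, cost: **-log(Y)**
    - If the desired y = 0, cost: **-log(1 - Y)**
    - So the cost function is: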
    - ![img](coostt.png)
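
One common way to write the two cases as a single cost is to take -log(Y) when y = 1 and -log(1 - Y) when y = 0, then average over all samples. The sketch below assumes that form; the function name `log_loss`, the clipping constant `eps`, and the sample probabilities are illustrative choices, not taken from the notes.

```python
import math

def log_loss(y_true, y_prob, eps=1e-12):
    """Average log cost over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip so math.log never receives exactly 0
        p = min(max(p, eps), 1.0 - eps)
        total += -math.log(p) if y == 1 else -math.log(1.0 - p)
    return total / len(y_true)

labels = [1, 0, 1]
mostly_right = [0.9, 0.1, 0.8]   # hypothetical predictions close to the labels
mostly_wrong = [0.1, 0.9, 0.2]   # hypothetical predictions far from the labels

cost_good = log_loss(labels, mostly_right)
cost_bad = log_loss(labels, mostly_wrong)
print(f"good: {cost_good:.4f}  bad: {cost_bad:.4f}")
```

Confident wrong predictions are penalized heavily, because -log(Y) grows without bound as Y approaches 0.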

### Using Gradient Descent to minimize the Cost

- Gradient: the slope of the cost function at a specific point
- If the cost is decreasing at every step, we're moving in the right direction
- ![gradient descent](jkjk.png)
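- The partial derivative of J with respect to a parameter gives the slope along that parameter at a point
- If the derivative of J with respect to Theta1 is positive, J increases as Theta1 increases, so we move in the opposite direction and decrease Theta1. If it is negative, we increase Theta1.
- The step is proportional to the slope: a steep slope means a larger step. As we approach the minimum the slope flattens, so the step size shrinks from one iteration to the next.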
- ![j derivative](jder.png)
- The gradient vector ∇J is formed by the partial derivatives of J with respect to each Theta value
    - **NewTheta = OldTheta - mu * ∇J**
    - **mu** is the learning rate, which scales the length of the step we take
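
The update rule can be sketched as a small training loop. This is a minimal example under several assumptions: the cost is the average log cost, whose partial derivative with respect to Theta_j is the average of (Y - y) * x_j; Theta starts at zeros rather than random values so the run is reproducible; and the toy data, `mu=0.5`, and `n_iters=1000` are hypothetical choices.

```python
import math

def sigmoid(z):
    # Branch on the sign so math.exp never receives a large positive argument
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def gradient(theta, X, y):
    """Partial derivatives of the average log cost with respect to each theta_j."""
    m = len(X)
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        y_hat = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        for j, v in enumerate(xi):
            grad[j] += (y_hat - yi) * v / m
    return grad

def gradient_descent(X, y, mu=0.5, n_iters=1000):
    """Repeat NewTheta = OldTheta - mu * gradient for a fixed number of steps."""
    theta = [0.0] * len(X[0])  # assumption: zeros instead of random values
    for _ in range(n_iters):
        grad = gradient(theta, X, y)
        theta = [t - mu * g for t, g in zip(theta, grad)]
    return theta

# Hypothetical toy data; the leading 1.0 in each row stands for the intercept term
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]

theta = gradient_descent(X, y)
print("theta:", [round(t, 3) for t in theta])
```

A fixed iteration count is only one stopping rule; the loop could instead stop once the cost or the gradient falls below a chosen threshold.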