# Logistic Regression

$$\renewcommand{\v}[1]{\mathbf #1}$$

The basic idea is to model the probability of $y=1$ at a given $x$ as a sigmoid:

$$p(y=1 \mid x) = \frac{1}{1+e^{-\v{w} \cdot \v{x}}}$$

![](images/1dclasslogisticprobs.png)
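As a minimal sketch of the formula above, the snippet below evaluates the sigmoid for a single input. The weights `w` and the input `x` are made-up values chosen only for illustration, and plain Python is used in place of a numerical library.

```python
import math

def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(a_k * b_k for a_k, b_k in zip(a, b))

# Hypothetical weights and input, chosen only for illustration.
w = [0.5, -1.0]
x = [2.0, 1.0]

score = dot(w, x)      # w . x = 0.5*2.0 + (-1.0)*1.0 = 0.0
p_y1 = sigmoid(score)  # a score of 0 sits exactly on the decision boundary
print(p_y1)            # 0.5
```

Large positive scores push the probability towards 1 and large negative scores push it towards 0.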

One can write the process schematically as follows:

![inline](images/layershorstd.png)

One can switch the sigmoid formulation to the so-called **Softmax** formulation. This helps generalize to more than one class.

## Softmax formulation

Identify $p_i$ and $1-p_i$ as two separate probabilities constrained to add to 1.

That is, $p_{1i} = p_i$ and $p_{2i} = 1 - p_i$.

$$p_{1i} = \frac{e^{\v{w_1} \cdot \v{x}}}{e^{\v{w_1} \cdot \v{x}} + e^{\v{w_2} \cdot \v{x}}}$$

$$p_{2i} = \frac{e^{\v{w_2} \cdot \v{x}}}{e^{\v{w_1} \cdot \v{x}} + e^{\v{w_2} \cdot \v{x}}}$$

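Notice that one can translate the coefficients by a fixed amount $\v{\psi}$ without any change: if you change both $\v{w_1}$ and $\v{w_2}$ to $\v{w_1} + \v{\psi}$ and $\v{w_2} + \v{\psi}$, the common factor $e^{\v{\psi} \cdot \v{x}}$ cancels between numerator and denominator, so the probabilities stay the same.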

Thus, by setting $\v{\psi}$ to $-\v{w_1}$ we regain the sigmoid with $\v{w} = \v{w_1} -\v{w_2}$.
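The invariance and the reduction to the sigmoid can be checked numerically. This is a small sketch in plain Python; the vectors `w1`, `w2` and `x` are arbitrary illustrative values, not taken from any dataset.

```python
import math

def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(a_k * b_k for a_k, b_k in zip(a, b))

def softmax(scores):
    """Softmax over a list of scores (the max is subtracted for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Arbitrary illustrative values.
x = [1.0, 2.0]
w1 = [0.3, -0.2]
w2 = [-0.5, 0.4]

# Two-class softmax probabilities.
p = softmax([dot(w1, x), dot(w2, x)])

# Shift both weight vectors by psi = -w1: the probabilities are unchanged.
psi = [-w for w in w1]
w1_shifted = [a + s for a, s in zip(w1, psi)]  # becomes the zero vector
w2_shifted = [a + s for a, s in zip(w2, psi)]  # becomes w2 - w1
p_shifted = softmax([dot(w1_shifted, x), dot(w2_shifted, x)])

# The first-class probability equals a sigmoid with w = w1 - w2.
w = [a - b for a, b in zip(w1, w2)]
p_sigmoid = sigmoid(dot(w, x))

print(p[0], p_shifted[0], p_sigmoid)  # all three agree up to rounding
```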


The likelihood is

$${\cal L} = \prod_i p_{1i}^{\mathbb1_1(y_i)} p_{2i}^{\mathbb1_2(y_i)}$$

where ${\mathbb1_1(y_i)}$ is an indicator function which is 1 if $y_i$ belongs to the first class and 0 otherwise (and similarly for ${\mathbb1_2(y_i)}$). In other words, if $y_i$ is written as a one-hot encoded vector, the place corresponding to its class has a 1 while the others have 0.

Taking the negative logarithm of this likelihood, we can derive the cross-entropy loss, or negative log likelihood:

$$\mathrm{NLL} = -\sum_i \left( \mathbb1_1(y_i) \log(p_{1i}) + \mathbb1_2(y_i) \log(p_{2i}) \right)$$

![inline](images/layershorsm.png)
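The negative log likelihood can be sketched in a few lines of plain Python. The predicted probabilities and one-hot labels below are made up for illustration; the function name `nll` is our own, not a library call.

```python
import math

def nll(probs, one_hot_labels):
    """Negative log likelihood for predicted class probabilities.

    probs[i][k] is the predicted probability of class k for sample i,
    one_hot_labels[i][k] is 1 if sample i belongs to class k, else 0.
    """
    total = 0.0
    for p_i, y_i in zip(probs, one_hot_labels):
        for p_ik, y_ik in zip(p_i, y_i):
            if y_ik:  # the indicator picks out only the true class
                total -= math.log(p_ik)
    return total

# Made-up predictions for two samples and their one-hot labels.
probs = [[0.9, 0.1],
         [0.2, 0.8]]
labels = [[1, 0],
          [0, 1]]

loss = nll(probs, labels)  # equals -(log 0.9 + log 0.8)
print(loss)
```

Only the probability assigned to the true class of each sample enters the sum, so confident wrong predictions are penalized heavily.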


## The MNIST dataset

This is a dataset of handwritten digits, derived from samples collected by the US National Institute of Standards and Technology (NIST). Automatic mail-sorting machines need to read digits like these to route mail. There are 10 classes, one for each of the digits 0 through 9.

In [None]:
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.utils import to_categorical

In [None]:
# load data
(X_train, y_train), (X_test, y_test) = mnist.load_data()
img_width = X_train.shape[1]
img_height = X_train.shape[2]

# one hot encode outputs
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
labels = range(10)

num_classes = y_train.shape[1]


X_train[0].shape, y_train[0]

We will create 784 features, one from each pixel, and then apply a linear function to these 784 features. Thus there are 784 slopes and a bias. We will do this once for each of the 10 classes as in the diagram above. Hence we have 785 times 10, or 7850 parameters.
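The count can be verified with a line of arithmetic. This sketch assumes the standard 28 by 28 MNIST image size; the total should match the number of parameters a single dense softmax layer on flattened images reports in its summary.

```python
# Assumes 28 x 28 pixel images, the standard MNIST size.
n_pixels = 28 * 28                      # 784 input features
n_classes = 10                          # one linear function per digit

n_weights = n_pixels * n_classes        # 784 slopes for each class
n_biases = n_classes                    # one bias for each class
n_params = n_weights + n_biases

print(n_pixels, n_weights, n_biases, n_params)  # 784 7840 10 7850
```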

In [None]:
# create model
model = Sequential()
model.add(Flatten(input_shape=(img_width, img_height)))
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
model.summary()

You will notice that we are fitting the model via an algorithm called `Adam`, a variant of an algorithm called stochastic gradient descent (SGD). Let's take a bit of a detour into our fitting method.

## Descending fast from Mount Everest


![right, fit](images/lfrome.jpg)

Wherever you are on the mountain, plunge down the steepest direction you possibly can.

You will make it down to a valley, but perhaps not the one you want to be in.

Your step size is called the **learning rate**.

Too small a step size and you will freeze.

But if you are a giant with a large step size you might just step into another valley.

Where do you land up going?

![right, fit](images/e2d.jpg)

To find the steepest descent, find the direction with the highest "slope", or highest derivative.

To go downwards, go in the opposite direction to this highest derivative.

A step to the right from the top of Everest will make you go down into the Western Cwm. If you are an ant, your step is too small: you run out of oxygen. But if you are a giant roaming the earth you could step past Nuptse!


## Back to Logistic Regression: What is this loss geography?

For the linear regression loss function:

$${\cal L} =  \frac{1}{N}\sum_{i = 1}^{N} (y_i - (m\,x_i + b))^2 $$

or for the cross-entropy, what does our loss function look like? (Let us assume only one $\v{w}$ with 2 parameters $m$ and $b$ for now.)

The loss here is a function of these **parameters** of the model. They are the equivalents of latitude and longitude in our loss landscape.

The contours are those of constant loss, just as the contours on a map are those of constant altitude. The gradient is perpendicular to the contours.
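To make the idea of a loss landscape concrete, here is a small sketch that evaluates the linear regression loss at a few points in the $(m, b)$ plane. The toy data lie exactly on the line $y = 2x + 1$ and are invented for illustration.

```python
def mse_loss(m, b, xs, ys):
    """Mean squared error of the line y = m*x + b on the data (xs, ys)."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data generated from y = 2x + 1 (an assumption made for this sketch).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

# Each (m, b) pair is one location on the map; the loss is its altitude.
for m, b in [(2.0, 1.0), (2.0, 0.0), (1.0, 1.0), (3.0, 2.0)]:
    print(m, b, mse_loss(m, b, xs, ys))
```

The loss is zero at $(m, b) = (2, 1)$, the bottom of the bowl, and grows as you move away from it in any direction.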

For both linear regression and logistic regression these contours look like this:

![right, fit](images/linregcontour.png)

Both of these are **convex** losses. A bowl. You always find the bottom.

![inline](images/3danim010.png)



## More details: Gradient Descent in 1-D

**Go opposite the direction of the derivative.**

Consider the objective function: $J(x) = x^2-6x+5$

Its derivative or **gradient** is $2x - 6$. This is positive for $x>3$ and negative for $x<3$.

Now do:

```python
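def fprime(x):
    return 2 * x - 6  # derivative of J(x) = x^2 - 6x + 5

old_x, step = 5.0, 0.1  # illustrative starting point and step size
gradient = fprime(old_x)
move = gradient * step
current_x = old_x - move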
```

![right, fit](images/optimcalc_4_0.png)



- For $x > 3$, $step \times (2x-6)$ is positive, and so you go to smaller (leftward) $x$ by $step \times (2x-6)$.
- The moment $x < 3$, you are on the other side of the parabola, the derivative is negative, and so you go rightward by $step \times |2x-6|$.
- How far do you go? That depends on the step size.


```python
move = (2 * old_x - 6) * step
current_x = old_x - move
```

What is the impact of step size?
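One way to get a feel for this is to run the update in a loop for a few step sizes. The sketch below uses $J(x) = x^2 - 6x + 5$ from above; the starting point, the number of steps and the three step sizes are illustrative choices and are not taken from the animations that follow.

```python
def fprime(x):
    """Derivative of J(x) = x^2 - 6x + 5."""
    return 2 * x - 6

def gradient_descent(x0, step, n_steps):
    """Repeat the update x <- x - step * J'(x) and return the final x."""
    x = x0
    for _ in range(n_steps):
        x = x - step * fprime(x)
    return x

# The minimum of J is at x = 3. Start at x = 5 and take 20 steps.
results = {}
for step in (1.1, 0.01, 0.3):
    results[step] = gradient_descent(x0=5.0, step=step, n_steps=20)
    print(step, results[step])
```

With `step = 1.1` every update overshoots the minimum by more than the previous distance, so the iterates run away. With `step = 0.01` the iterates move in the right direction but are still far from 3 after 20 steps. With `step = 0.3` they settle very close to 3.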

#### too big step size

![inline](images/1dgd-bigstep_AdobeExpress.gif)



#### too small step size

![inline](images/1dgd-smallstep_AdobeExpress.gif)



#### good step size

![inline](images/1dgd_AdobeExpress.gif)




### Gradient Descent more formally

$$\theta_{t+1} = \theta_t - \eta \nabla_{\theta} {\cal L}(\theta_t)$$

where $\eta$ is the learning rate and $\theta$ represents the parameters. The symbol $\nabla$ represents the gradient.

Note that each update needs the gradient computed over the **entire dataset**:

```python
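# evaluate_gradient, loss_function, data and params are placeholders
for i in range(n_epochs):
    # the gradient is computed over the whole dataset at every step
    params_grad = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * params_grad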
```

So this is neither SGD nor Adam, which we will come to later. But gradient descent is all we need for linear and logistic regression.
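As a hedged illustration, here is the loop above filled in for a one-feature logistic regression with parameters $m$ and $b$. The toy data, the learning rate and the number of epochs are assumptions made for this sketch, and the helper names are our own. The gradient used is that of the average negative log likelihood.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(m, b, xs, ys):
    """Average negative log likelihood for p(y=1 | x) = sigmoid(m*x + b)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(m * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)

def evaluate_gradient(m, b, xs, ys):
    """Gradient of the average NLL, accumulated over the entire dataset."""
    grad_m = grad_b = 0.0
    for x, y in zip(xs, ys):
        error = sigmoid(m * x + b) - y  # predicted probability minus label
        grad_m += error * x
        grad_b += error
    n = len(xs)
    return grad_m / n, grad_b / n

# Toy, non-separable data invented for this sketch.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

m, b = 0.0, 0.0
learning_rate = 0.5
n_epochs = 500

initial_loss = nll(m, b, xs, ys)
for _ in range(n_epochs):
    grad_m, grad_b = evaluate_gradient(m, b, xs, ys)
    m -= learning_rate * grad_m
    b -= learning_rate * grad_b
final_loss = nll(m, b, xs, ys)

print(initial_loss, final_loss, m, b)
```

Every pass through the loop touches all six data points, which is what distinguishes plain gradient descent from the stochastic variants.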

## Back to the fit

In [None]:
# Fit the model
history = model.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test))

In [None]:
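import matplotlib.pyplot as plt
# training and validation loss per epoch
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.xlabel('epoch')
plt.legend()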

In [None]:
# training and validation accuracy per epoch
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='validation')
plt.legend()