# Calculation example for Stochastic Gradient Descent and Logistic Regression

In [None]:
# You can do the exercise on paper! Read through the exercise carefully. The calculations start at section 2.2.

## 1) Prepare a dataset for the logistic regression

### 1.1) Example dataset

In [None]:
# Example dataset (x values)
X = [
    [0.1, 0.9],
    [0.2, 0.91],
    [0.7, 0.3],
    [0.05, 0.81],
    [0.95, 0.45],
    [0.78, 0.22]
]

# Target values (labels) for X, which the model should learn to predict
# The first row in X corresponds to the first entry in y, and so forth
y = [
    1,
    1,
    0,
    1,
    0,
    0
]

### 1.2) Based on the dataset, we assume that a logistic regression model fits our data best

In [None]:
# Formula for the first step: a linear combination of the inputs

$y_{hat}= w_1x_1 + w_2x_2 + b$

In [None]:
# Sigmoid activation to scale the output into the range between 0 and 1

$S(x) = \frac{1}{1+e^{-x}}$

In [None]:
# Combined, the model output is as follows:

$S(y_{hat}) = \frac{1}{1+e^{-y_{hat}}} = \frac{1}{1+e^{-(w_1x_1 + w_2x_2 + b)}}$
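
The combined formula can be sketched in Python as follows. This is only a minimal illustration using the standard library; the function names `sigmoid` and `predict` are chosen here for the example and are not part of the exercise.

```python
import math

def sigmoid(x):
    """Sigmoid activation S(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def predict(x1, x2, w1, w2, b):
    """Model output S(y_hat) with y_hat = w1*x1 + w2*x2 + b."""
    y_hat = w1 * x1 + w2 * x2 + b
    return sigmoid(y_hat)

# With all parameters at zero, y_hat is 0 and the output is exactly 0.5
print(predict(0.1, 0.9, 0.0, 0.0, 0.0))
```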

### 1.3) We need a function to evaluate the model outcome - Loss function

In [None]:
# For this example we define our loss function to be the 'Cross Entropy' function.
# Keep in mind that we are not bound to this method. Other loss functions are possible,
# as long as they can be differentiated with respect to the model parameters.
# The function should be representative of the model's performance (How good is my model?)

$L = \text{Cross Entropy} = -y\log(S(y_{hat}))-(1-y)\log(1-S(y_{hat}))$
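
As a rough sketch, the cross-entropy loss for a single data row could be written like this. The name `cross_entropy` and the argument `s` (standing for the sigmoid output of the model, strictly between 0 and 1) are assumptions made for this illustration.

```python
import math

def cross_entropy(y, s):
    """Cross-entropy loss for one sample.

    y is the target (0 or 1), s is the sigmoid output of the model in (0, 1).
    """
    return -y * math.log(s) - (1 - y) * math.log(1 - s)

# A confident correct prediction gives a small loss,
# a confident wrong prediction gives a large loss.
print(cross_entropy(1, 0.9))
print(cross_entropy(1, 0.1))
```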

### 1.4) How can we minimize the loss of our model? - SGD

In [None]:
# This is where Stochastic Gradient Descent comes into play!
# We need to calculate the gradients of our model parameters (w_1, w_2, b) and update their values!

# Calculation follows!

## 2) How to use all the information from above?

### 2.1) Initialize random values for

$w_1, w_2, b$

e.g.

$w_1 = -0.5$, $w_2 = 0.02$, $b = 0$

In [None]:
# The parameter value initialization is only done once (at the beginning)
# We need these random values because we don't know anything about them at the beginning

### 2.2) Use the parameter values and one data row of the dataset and calculate the outputs

In [None]:
# Use for example the first row of X as x_1 and x_2 values and
# calculate the output with the initially set parameter values w_1, w_2, b

In [None]:
# TODO !
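
One possible way to check a paper calculation for this step is the short sketch below. It assumes the first data row and the initial parameter values from section 2.1; the variable names `s` and `loss` are chosen for this illustration.

```python
import math

x1, x2 = 0.1, 0.9            # first row of X
y = 1                        # first entry of y
w1, w2, b = -0.5, 0.02, 0.0  # initial parameter values from section 2.1

# Linear step: -0.05 + 0.018 + 0 = -0.032
y_hat = w1 * x1 + w2 * x2 + b

# Sigmoid output, roughly 0.492
s = 1.0 / (1.0 + math.exp(-y_hat))

# Cross-entropy loss for this row, roughly 0.709
loss = -y * math.log(s) - (1 - y) * math.log(1 - s)

print(y_hat, s, loss)
```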

### 2.3) Calculate the gradients of the learning parameters (w_1, w_2 and b)

In [None]:
# The gradients can be calculated using the partial derivatives with respect to the parameters

In [None]:
# We start with the derivative of our loss function because this is our evaluation function.
# The loss function tells us how well our model performs.
# We want to update our parameters based on this quality measure (the loss function).

In [None]:
# Take the partial derivative of the loss function with respect to the sigmoid output (you can write the end result in code)

$$\frac{\partial L}{\partial S(x)} = \frac{S(x)-y}{S(x)(1-S(x))}$$

In [None]:
# Take the partial derivative of the sigmoid activation function with respect to its input (you can write the end result in code)

$$\frac{\partial S(x)}{\partial x} = S(x)(1-S(x))$$
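
The sigmoid derivative can be sanity-checked numerically. The sketch below compares the closed form with a central finite difference; the step size `h`, the test point and the function names are assumptions chosen for this illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    """Closed form S(x) * (1 - S(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = -0.032  # arbitrary test point
h = 1e-6    # small step for the finite difference

# Central difference approximation of dS/dx
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

# Both values should agree to several decimal places
print(sigmoid_derivative(x), numeric)
```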

In [None]:
# Take the partial derivative of the model function with respect to every parameter...

$y_{hat}= w_1x_1 + w_2x_2 + b$

In [None]:
# For example for w_1, where x_1 is the first entry (column) of one data row

$$\frac{\partial y_{hat}}{\partial w_1} = x_1$$

In [None]:
# Do this for w_1, w_2 and b, using for example the first data row X[0]

In [None]:
# TODO !
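
As a possible solution sketch, assuming the first data row X[0] with $x_1 = 0.1$ and $x_2 = 0.9$, the three partial derivatives of the model function would be:

$$\frac{\partial y_{hat}}{\partial w_1} = x_1 = 0.1, \quad \frac{\partial y_{hat}}{\partial w_2} = x_2 = 0.9, \quad \frac{\partial y_{hat}}{\partial b} = 1$$

The derivative with respect to $b$ is 1 because $b$ enters the model function as a plain additive term.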

In [None]:
# We need to combine the derivative of the loss function with the parameter derivatives to calculate the final gradient.
# This can be done with the chain rule!
# That means tracing back from the output and finding the impact of every single parameter.
# The chain rule multiplies the partial derivatives from the end of the model (loss function) back to the parameter we are looking for.
# The chain rule for the parameter w_1 is as follows:

$$\frac{\partial L}{\partial w_{1}} = \frac{\partial L}{\partial S} * \frac{\partial S}{\partial y_{hat}} * \frac{\partial y_{hat}}{\partial w_1}$$

$$\frac{\partial L}{\partial w_{1}} = \frac{S(y_{hat})-y}{S(y_{hat})(1-S(y_{hat}))} * S(y_{hat})(1-S(y_{hat})) * x_1$$


$$\frac{\partial L}{\partial w_{1}} = (S(y_{hat}) - y) * x_1$$
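
The simplified result can be verified numerically. The sketch below, which assumes the first data row and the initial parameters from section 2.1, compares $(S(y_{hat}) - y) * x_1$ with a finite-difference estimate of the loss gradient. The helper `loss` and the step size `h` are introduced only for this check.

```python
import math

def loss(w1, w2, b, x1, x2, y):
    """Cross-entropy loss of the logistic model for one data row."""
    s = 1.0 / (1.0 + math.exp(-(w1 * x1 + w2 * x2 + b)))
    return -y * math.log(s) - (1 - y) * math.log(1 - s)

x1, x2, y = 0.1, 0.9, 1      # first data row
w1, w2, b = -0.5, 0.02, 0.0  # initial parameters

# Analytic gradient from the chain rule: (S(y_hat) - y) * x_1
s = 1.0 / (1.0 + math.exp(-(w1 * x1 + w2 * x2 + b)))
analytic = (s - y) * x1

# Central finite difference with respect to w1
h = 1e-6
numeric = (loss(w1 + h, w2, b, x1, x2, y) - loss(w1 - h, w2, b, x1, x2, y)) / (2 * h)

print(analytic, numeric)
```

If the chain rule was applied correctly, both printed values agree to several decimal places.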

In [None]:
# calculate the gradient for all parameters w_1, w_2 and b

In [None]:
# TODO !
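
A possible way to compute all three gradients for the first data row is sketched below. It assumes the initial parameters from section 2.1 and uses the simplified chain-rule result, where the common factor is the difference between the sigmoid output and the target; the names `error`, `grad_w1`, `grad_w2` and `grad_b` are chosen for this illustration.

```python
import math

x1, x2, y = 0.1, 0.9, 1      # first data row
w1, w2, b = -0.5, 0.02, 0.0  # initial parameters

y_hat = w1 * x1 + w2 * x2 + b
s = 1.0 / (1.0 + math.exp(-y_hat))

# Common factor S(y_hat) - y, roughly -0.508
error = s - y

grad_w1 = error * x1  # roughly -0.0508
grad_w2 = error * x2  # roughly -0.4572
grad_b = error * 1.0  # roughly -0.5080

print(grad_w1, grad_w2, grad_b)
```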

### 2.4) Apply the parameter update rule

$$\theta := \theta - \eta\nabla_{\theta}L(\theta)$$

In [None]:
# For w_1 this looks as follows

$$w_1 := w_1 - \eta \space(\frac{\partial L}{\partial S} * \frac{\partial S}{\partial y_{hat}} * \frac{\partial y_{hat}}{\partial w_1})$$

In [None]:
# Let's set the learning rate eta to 0.05

$$\eta = 0.05$$

In [None]:
# calculate the new value for all parameters w_1, w_2, b

In [None]:
# TODO !
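
One update step could look like the following sketch. It assumes the first data row, the initial parameters from section 2.1 and a learning rate of 0.05; the names with the suffix `_new` are chosen for this illustration.

```python
import math

x1, x2, y = 0.1, 0.9, 1      # first data row
w1, w2, b = -0.5, 0.02, 0.0  # initial parameters
eta = 0.05                   # learning rate

# Gradient factor S(y_hat) - y for this row
error = 1.0 / (1.0 + math.exp(-(w1 * x1 + w2 * x2 + b))) - y

# Update rule: parameter := parameter - eta * gradient
w1_new = w1 - eta * error * x1  # roughly -0.49746
w2_new = w2 - eta * error * x2  # roughly 0.04286
b_new = b - eta * error         # roughly 0.02540

print(w1_new, w2_new, b_new)
```

Because the target is 1 and the output is below 1, all three parameters move in the direction that increases the model output for this row.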

## 3) Iterate this procedure with every data row from X and y to update your model

In [None]:
# One pass over all data rows, using each row once, is called an epoch
# Several epochs can be defined to iterate over all data rows multiple times
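
The whole procedure could be sketched as a small training loop. This is an illustration under stated assumptions, not a reference implementation: the rows are visited in their given order (stochastic gradient descent usually shuffles them each epoch), and the function names `mean_loss` and `train` as well as the default of 100 epochs are chosen for this example.

```python
import math

X = [[0.1, 0.9], [0.2, 0.91], [0.7, 0.3], [0.05, 0.81], [0.95, 0.45], [0.78, 0.22]]
y = [1, 1, 0, 1, 0, 0]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_loss(w1, w2, b):
    """Average cross-entropy loss over all data rows."""
    total = 0.0
    for (x1, x2), target in zip(X, y):
        s = sigmoid(w1 * x1 + w2 * x2 + b)
        total += -target * math.log(s) - (1 - target) * math.log(1 - s)
    return total / len(X)

def train(w1, w2, b, eta=0.05, epochs=100):
    """Update the parameters once per data row, for the given number of epochs."""
    for _ in range(epochs):
        for (x1, x2), target in zip(X, y):
            error = sigmoid(w1 * x1 + w2 * x2 + b) - target
            w1 -= eta * error * x1
            w2 -= eta * error * x2
            b -= eta * error
    return w1, w2, b

start = (-0.5, 0.02, 0.0)  # initial parameters from section 2.1
trained = train(*start)

# The average loss after training should be lower than before
print(mean_loss(*start), mean_loss(*trained))
```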