# Calculation example for Stochastic Gradient Descent and Linear Regression

In [None]:
# You can do the exercise on paper! Read through the exercise carefully. The calculations start at section 2.2.

## 1) Prepare a dataset for the linear regression

### 1.1) Example dataset

In [33]:
# Example dataset (x values)
X = [
    [8, 2],
    [2, 8],
    [1, 4],
    [3, 6],
    [9, 4],
    [6, 5]
]

# Dependent variable on X (output of the model)
# first row in X corresponds to first row in y and so forth
y = [
    4,
    -14,
    -7,
    -9,
    1,
    -4
]

### 1.2) Based on the dataset, we assume that a Linear Regression model fits our data best

In [3]:
# The formula for linear regression with two independent input variables is as follows

$y_{hat}= w_1x_1 + w_2x_2 + b$
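The model formula can be sketched as a small helper. The name `predict` and its argument layout are assumptions made for this illustration; the values are the first data row and the example start values used in section 2.1.

```python
def predict(x, w, b):
    """Return w_1*x_1 + w_2*x_2 + ... + b for one data row x."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b


# First row of the example dataset, example start values from section 2.1
y_hat = predict([8, 2], [-0.5, 0.02], 0)
print(round(y_hat, 2))  # -3.96
```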

### 1.3) We need a function to evaluate the model outcome - Loss function

In [None]:
# For this example we define our loss function to be the 'Mean Squared Error' function.
# Keep in mind that we are not bound to this choice; other loss functions are possible.
# The loss function should be representative of the model's performance (How good is my model?)

$$L = MSE = \frac{1}{n}\sum_{i=1}^n (y_{i_{hat}}-y_{i})^2$$

In [None]:
# In general for classical approaches: The lower the loss the better the model!
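As a minimal sketch, the MSE formula above could be written in plain Python as follows. The function name `mse` and the toy inputs are assumptions chosen for illustration and are not part of the exercise data.

```python
def mse(y_hat, y):
    """Mean squared error between predictions y_hat and targets y."""
    if len(y_hat) != len(y):
        raise ValueError("y_hat and y must have the same length")
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)


# Toy values: errors are 0 and -2, so the mean of the squares is (0 + 4) / 2
print(mse([1.0, 2.0], [1.0, 4.0]))  # 2.0
```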

### 1.4) How can we minimize the loss of our model? - SGD

In [None]:
# This is where Stochastic Gradient Descent comes into play!
# We need to calculate the gradients of our model parameters (w_1, w_2, b) and update their values!

# Calculation follows!

## 2) How to use all the information from above?

### 2.1) Initialize random values for

$w_1, w_2, b$

e. g.

$w_1 = -0.5$, $w_2 = 0.02$, $b = 0$

In [1]:
# The parameter value initialization is only done once (at the beginning)
# We need these random values because we don't know anything about them at the beginning

### 2.2) Use the parameter values and one data row of the dataset and calculate the outputs

In [None]:
# Use for example the first row of X as x_1 and x_2 values and
# calculate the output with the initially set parameter values w_1, w_2, b

In [34]:
# TODO !
x_1 = X[0][0]
x_2 = X[0][1]

w_1 = -0.5
w_2 = 0.02
b = 0

y_0_hat = w_1 * (x_1) + w_2 * (x_2) + b

y_0_hat

-3.96
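Done on paper with the first data row ($x_1 = 8$, $x_2 = 2$) and the start values from section 2.1, the same calculation reads:

$$y_{0_{hat}} = (-0.5) \cdot 8 + 0.02 \cdot 2 + 0 = -4 + 0.04 = -3.96$$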

### 2.3) Calculate the loss of the model - How good is our model

In [None]:
# Now we want to find out how good our model is with regard to our defined loss function
# Therefore, calculate the loss of our model based on the loss function defined above
# Use the matching y from the data (e. g. if you use the first row X[0] then you have to use y[0])

In [35]:
# TODO !

(y_0_hat - y[0]) ** 2

63.3616
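For a single data row the mean runs over one term only ($n = 1$), so on paper the loss for the first row is simply the squared error:

$$L = (y_{0_{hat}} - y_0)^2 = (-3.96 - 4)^2 = (-7.96)^2 = 63.3616$$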

### 2.4) Calculate the gradients of the learning parameters (w_1, w_2 and b)

In [None]:
# The gradients can be calculated using the partial derivatives with respect to the parameters

In [None]:
# We start with the derivative of our loss function because this is our evaluation function.
# The loss function gives us information about how good our model performs
# We want to update our parameters based on our quality measurement function (loss function)

In [36]:
# Take the partial derivative of the loss function with respect to y_hat (you can write the end result in code)

# TODO !
m_L_wrt_yhat = 2 * (y_0_hat - y[0])

m_L_wrt_yhat

-15.92

$$\frac{\partial L}{\partial y_{hat}} = 2(y_{hat}-y)$$

In [None]:
# Take the partial derivative of the model function with respect to every parameter...

$y_{hat}= w_1x_1 + w_2x_2 + b$

In [None]:
# for example for w_1, where x_1 is the first entry (column) of one data row

$$\frac{\partial y_{hat}}{\partial w_1} = x_1$$

In [None]:
# do this for w_1, w_2 and b, using for example the first data row X[0]

In [37]:
# TODO !
m_yhat_wrt_w1 = x_1
m_yhat_wrt_w2 = x_2
m_yhat_wrt_b = 1

In [None]:
# we need to combine the derivative of the loss function with the derivatives for each parameter to calculate the final gradient
# this can be done with the chain rule!
# that means tracing back from the output and finding the impact of every single parameter
# the chain rule just multiplies the partial derivatives from the end of the model (loss fn) back to the parameter we are looking for
# the chain rule for the parameter w_1 is as follows:

$$\frac{\partial L}{\partial w_{1}} = \frac{\partial L}{\partial y_{hat}} * \frac{\partial y_{hat}}{\partial w_1}$$
$$\frac{\partial L}{\partial w_{1}} = 2(y_{hat}-y) * x_1$$
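One way to sanity-check the chain rule result is to compare it with a numerical estimate of the slope. The sketch below does this for $w_1$ using a central finite difference; the helper `squared_loss` and the step size `h` are assumptions made for this illustration, and the data row and start values are the ones used in the exercise.

```python
def squared_loss(w_1, w_2, b, x_1, x_2, y):
    """Squared error of the linear model for a single data row."""
    y_hat = w_1 * x_1 + w_2 * x_2 + b
    return (y_hat - y) ** 2


# First data row and the example start values
x_1, x_2, y_true = 8, 2, 4
w_1, w_2, b = -0.5, 0.02, 0.0

# Analytic gradient from the chain rule: 2 * (y_hat - y) * x_1
y_hat = w_1 * x_1 + w_2 * x_2 + b
analytic = 2 * (y_hat - y_true) * x_1

# Numerical gradient: nudge w_1 slightly in both directions
h = 1e-6
numeric = (
    squared_loss(w_1 + h, w_2, b, x_1, x_2, y_true)
    - squared_loss(w_1 - h, w_2, b, x_1, x_2, y_true)
) / (2 * h)

print(analytic, numeric)  # both values should be close to each other
```

The same check works for $w_2$ and $b$ by nudging the respective parameter instead of $w_1$.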

In [None]:
# calculate the gradient for all parameters w_1, w_2 and b

In [39]:
# TODO !

m_L_wrt_w1 = m_L_wrt_yhat * m_yhat_wrt_w1
m_L_wrt_w2 = m_L_wrt_yhat * m_yhat_wrt_w2
m_L_wrt_b = m_L_wrt_yhat * m_yhat_wrt_b
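On paper, with $\frac{\partial L}{\partial y_{hat}} = -15.92$ from above and the first data row ($x_1 = 8$, $x_2 = 2$), the three gradients work out as:

$$\frac{\partial L}{\partial w_{1}} = -15.92 \cdot 8 = -127.36$$
$$\frac{\partial L}{\partial w_{2}} = -15.92 \cdot 2 = -31.84$$
$$\frac{\partial L}{\partial b} = -15.92 \cdot 1 = -15.92$$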

### 2.5) Apply the parameter update rule

$$\theta := \theta - \eta\nabla_{\theta}L(\theta)$$

In [None]:
# For w_1 this looks as follows

$$w_1 := w_1 - \eta \space(\frac{\partial L}{\partial y_{hat}} * \frac{\partial y_{hat}}{\partial w_1})$$

In [None]:
# Let's set the learning rate eta to 0.05

$$\eta = 0.05$$

In [None]:
# calculate the new value for all parameters w_1, w_2, b

In [40]:
# TODO !

eta = 0.05

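# Use the learning rate eta defined above for every parameter update
𝜃_for_w1 = w_1 - eta * (m_L_wrt_yhat * m_yhat_wrt_w1)
𝜃_for_w2 = w_2 - eta * (m_L_wrt_w2)
𝜃_for_b = b - eta * (m_L_wrt_b)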

𝜃_for_w2, 𝜃_for_b

(1.612, 0.796)
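The forward pass, the gradients and the update rule can be bundled into one step. The following is a sketch under the assumption that a single helper is wanted; the name `sgd_step` and its signature are hypothetical and not part of the exercise.

```python
def sgd_step(w_1, w_2, b, x_1, x_2, y, eta):
    """One SGD update for a single data row, using the squared error loss."""
    y_hat = w_1 * x_1 + w_2 * x_2 + b
    dL_dyhat = 2 * (y_hat - y)  # derivative of the loss w.r.t. y_hat
    # Chain rule: dL/dw_1 = dL/dy_hat * x_1, dL/dw_2 = dL/dy_hat * x_2, dL/db = dL/dy_hat * 1
    w_1_new = w_1 - eta * dL_dyhat * x_1
    w_2_new = w_2 - eta * dL_dyhat * x_2
    b_new = b - eta * dL_dyhat
    return w_1_new, w_2_new, b_new


# First data row, example start values, learning rate 0.05
new_w_1, new_w_2, new_b = sgd_step(-0.5, 0.02, 0.0, 8, 2, 4, eta=0.05)
print(round(new_w_1, 3), round(new_w_2, 3), round(new_b, 3))  # 5.868 1.612 0.796
```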

## 3) Iterate this procedure with every data row from X and y to update your model

In [45]:
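# One epoch uses every data row exactly once
# Multiple epochs iterate over all data rows multiple times
# Note: the loop continues from the current values of w_1, w_2 and b,
# so re-running this cell keeps updating them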

sum_of_squared_errors = 0
eta = 0.00001

for row in range(len(X)):
    x_1 = X[row][0]
    x_2 = X[row][1]
    y_i = y[row]
    y_i_hat = w_1 * (x_1) + w_2 * (x_2) + b
    
    squared_loss = (y_i_hat - y_i) ** 2
    
    sum_of_squared_errors += squared_loss
    
    m_L_wrt_yhat = 2 * (y_i_hat - y_i)
    
    m_yhat_wrt_w1 = x_1
    m_yhat_wrt_w2 = x_2
    m_yhat_wrt_b = 1
    
    m_L_wrt_w1 = m_L_wrt_yhat * m_yhat_wrt_w1
    m_L_wrt_w2 = m_L_wrt_yhat * m_yhat_wrt_w2
    m_L_wrt_b = m_L_wrt_yhat * m_yhat_wrt_b
    
    w_1 = w_1 - eta * m_L_wrt_w1
    w_2 = w_2 - eta * m_L_wrt_w2
    b = b - eta * m_L_wrt_b
    
    print(f'After row {row + 1}: w_1={w_1}, w_2={w_2}, b={b}, loss={squared_loss}')
    
mean_sum_of_squared_errors = sum_of_squared_errors / (len(X))

After row 1: w_1=516.0675768844525, w_2=-429.9580182616165, b=54.89301461607278, loss=11049673.10613038
After row 2: w_1=516.1611223235608, w_2=-429.5838365051832, b=54.93978733562694, loss=5469218.236230039
After row 3: w_1=516.183927012288, w_2=-429.4926177502743, b=54.96259202435417, loss=1300134.5698647308
After row 4: w_1=516.2417934922944, w_2=-429.37688479026144, b=54.98188085102298, loss=930147.0856483497
After row 5: w_1=515.7049164053327, w_2=-429.6154968289111, b=54.92222784136057, loss=8896203.904459707
After row 6: w_1=515.5843074962772, w_2=-429.71600425312397, b=54.90212635651799, loss=1010174.2321907841
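Repeating the loop for several epochs, starting again from the initial values of section 2.1, could look like the sketch below. The function name `train`, the learning rate `eta=0.005` and the choice of 200 epochs are assumptions picked for illustration; they do not come from the exercise.

```python
X = [[8, 2], [2, 8], [1, 4], [3, 6], [9, 4], [6, 5]]
y = [4, -14, -7, -9, 1, -4]


def train(X, y, w_1, w_2, b, eta, epochs):
    """Run per-row SGD for several epochs and record the mean squared error per epoch."""
    epoch_losses = []
    for _ in range(epochs):
        total = 0.0
        for (x_1, x_2), y_i in zip(X, y):
            y_i_hat = w_1 * x_1 + w_2 * x_2 + b
            error = y_i_hat - y_i
            total += error ** 2  # loss is measured before the update
            # Gradient of the squared error times the learning rate
            w_1 -= eta * 2 * error * x_1
            w_2 -= eta * 2 * error * x_2
            b -= eta * 2 * error
        epoch_losses.append(total / len(X))
    return w_1, w_2, b, epoch_losses


w_1, w_2, b, losses = train(X, y, w_1=-0.5, w_2=0.02, b=0.0, eta=0.005, epochs=200)
print(f'first epoch MSE={losses[0]}, last epoch MSE={losses[-1]}')
print(f'w_1={w_1}, w_2={w_2}, b={b}')
```

The example data satisfies $y = x_1 - 2x_2$ exactly, so with a sufficiently small learning rate the parameters should drift towards $w_1 = 1$, $w_2 = -2$, $b = 0$ and the per-epoch loss should shrink.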
