## Introduction

This is the first notebook in a series in which I build basic machine learning algorithms from scratch, with explanations and additional materials.

This way I will work through the details myself, and later it may help someone else.

In the gradient descent notebooks, my goal is to rely as little as possible on ready-made mathematical libraries such as numpy.

Gradient descent notebooks series:
1. Gradient descent for formulas like y = a * x + b (current)
2. [Multivariate gradient descent](https://www.kaggle.com/konstantinsuspitsyn/multivariate-gradient-descent)
3. More algorithms in progress

**Current notebook task:**<br>
Create a set of functions that fit the data as closely as possible with a formula of the form $y=a*x+b$

1. We will build synthetic data that obeys a known law, for example, $ y = 5 * x + 6 $, without outliers
2. We will go through the theory of gradient descent
3. We will write the gradient descent function
4. We will check the function on real data: the dependence of weight on height

## Creating synthetic data 

In [None]:
import matplotlib.pyplot as plt

In [None]:
# First, fill in x: from -10 to 9.9 with a step of 0.1
# Then build y from it using the formula 5 * x + 6

# X_true will be fed into the model to get y_pred

X_true = []
y_true = []

# range works only with integers, so iterate over x * 10 and divide by 10 when appending
for i in range(-100, 100, 1):
  X_true.append(i/10)
  y_true.append(5*i/10+6)

Let's look at the data we have just created

In [None]:
X_true[:10]

In [None]:
y_true[:10]

In [None]:
plt.scatter(X_true, y_true, color='#003f5c')
plt.show()

We are solving a regression problem<br>
We assume the data follows a linear law of the form y = a * x + b

Let's pick arbitrary values for a and b and calculate a new y from the actual x

In [None]:
# I'll take a=1 and b=1

a = 1
b = 1

y_pred = []
for f in X_true:
  y_pred.append(f*a + b)

In [None]:
plt.scatter(X_true, y_true, color='#003f5c')
plt.scatter(X_true, y_pred, color='#ff6361')
plt.show()

Obviously, our guess misses the real data. To measure how close the prediction is, we need to create and use a cost function

## Theory

The standard optimization path looks like this: <br>


1. Measure the difference between the predicted and the real data
2. If we are not satisfied with the result, change the weight in front of $ x_i $ and return to item 1. If the result from item 1 suits us, we start using the coefficients we found
<br>
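As a minimal sketch of this check-and-adjust loop, consider a toy problem that is not part of this notebook's data: minimizing $f(w) = (w - 3)^2$. The names `toy_loss` and `toy_gradient`, the starting weight, and the tolerance are assumptions made only for this illustration.

```python
def toy_loss(w: float) -> float:
  # Item 1: measure how far we are from the optimum
  return (w - 3) ** 2

def toy_gradient(w: float) -> float:
  # Derivative of (w - 3)^2 with respect to w
  return 2 * (w - 3)

w = 0.0               # arbitrary starting weight
learning_rate = 0.1   # illustrative value
steps_taken = 0

# Item 2: while the result is not good enough, adjust the weight and check again
while toy_loss(w) > 1e-8 and steps_taken < 1000:
  w -= learning_rate * toy_gradient(w)
  steps_taken += 1

print(w, steps_taken)
```

The loop stops either when the loss is small enough or when the step budget runs out, which is the same pattern the code later in this notebook follows.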
Now, let's analyze the algorithm from a mathematical point of view.
<br>

There are many functions that describe the accuracy of a model, but we need one that we can differentiate. <br> I chose [MSE](https://en.wikipedia.org/wiki/Mean_squared_error) <br> <br> $ MSE = \frac {1} {n} \sum_ {i = 1} ^ n (y_i - \hat {y_i}) ^ 2 $ <br>
$ y_i $ - y_true <br>
$ \hat {y_i} $ - y_pred <br>
Since $ \hat {y_i} = a * x_i + b $, it follows that $ MSE = \frac {1} {n} \sum_ {i = 1} ^ n (y_i - (a * x_i + b)) ^ 2 $
<br> Now we know how we will measure the quality of predictions <br> <br>
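As a small worked example of this formula, take the guess a = 1, b = 1 against data that follows y = 5 * x + 6. The three points and the `_demo` variable names are chosen here purely for illustration.

```python
# Three illustrative points that obey y = 5 * x + 6
x_demo = [0, 1, 2]
y_demo = [5 * x + 6 for x in x_demo]                 # [6, 11, 16]

# Predictions of the guess a = 1, b = 1
a_demo, b_demo = 1, 1
y_hat_demo = [a_demo * x + b_demo for x in x_demo]   # [1, 2, 3]

# Squared differences between real and predicted values: 5**2, 9**2, 13**2
squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(y_demo, y_hat_demo)]

# Mean of the squared differences: 275 / 3
mse_demo = sum(squared_errors) / len(squared_errors)
print(mse_demo)
```

The further the guess is from the real line, the faster the value grows, because every difference is squared.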
Let's deal with changing the weight of $ x_i $ and the free term $ b $. <br> <br> It has long been known that, to minimize the cost function, we must move in the direction opposite to the gradient, because the gradient always points in the direction in which the function increases. The size of each move is scaled by a coefficient called the learning rate. <br> <br>
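Written out, one step of this rule might look as follows. Here $lr$ stands for the learning rate (the name is borrowed from the code later in this notebook), and the subscripts $old$ and $new$ are illustrative notation:

$$ a_{new} = a_{old} - lr * \frac {\partial MSE} {\partial a}, \qquad b_{new} = b_{old} - lr * \frac {\partial MSE} {\partial b} $$

The minus sign is what moves the coefficients against the gradient.
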
What you need to know to calculate the gradient and the size of the move:


1. How to find a derivative
2. The derivative of the sum is equal to the sum of the derivatives
3. Derivative of a composite function (the chain rule)
4. Gradient and partial derivatives
<br> <br>
We are ready to calculate the partial derivatives of MSE with respect to $ a $ and $ b $ <br> <br>
The partial derivative with respect to $ a $ is equal to $ \frac {\partial MSE} {\partial a} = \frac {1} {n} \sum_ {i = 1} ^ n \left( -2 * (y_i - (a * x_i + b)) * x_i \right) $ <br> <br>
The partial derivative with respect to $ b $ is equal to $ \frac {\partial MSE} {\partial b} = \frac {1} {n} \sum_ {i = 1} ^ n \left( -2 * (y_i - (a * x_i + b)) \right) $
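
One way to gain confidence in these two formulas is to compare them with a numerical estimate of the slope of MSE. The sketch below does that with central finite differences. The helper names (`mse_of`, `analytic_gradients`) and the step `eps` are assumptions made for this check, not part of the notebook's code.

```python
def mse_of(a, b, xs, ys):
  # MSE of the line y = a*x + b on the given points
  return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def analytic_gradients(a, b, xs, ys):
  # The two partial derivatives, written exactly as in the formulas above
  n = len(xs)
  grad_a = sum(-2 * (y - (a * x + b)) * x for x, y in zip(xs, ys)) / n
  grad_b = sum(-2 * (y - (a * x + b)) for x, y in zip(xs, ys)) / n
  return grad_a, grad_b

# Synthetic data that follows y = 5 * x + 6
xs = [i / 10 for i in range(-100, 100)]
ys = [5 * x + 6 for x in xs]

a, b = 1.0, 1.0   # the same arbitrary guess as earlier
eps = 1e-6        # small shift used for the numerical estimate

# Central differences: (f(p + eps) - f(p - eps)) / (2 * eps)
numeric_a = (mse_of(a + eps, b, xs, ys) - mse_of(a - eps, b, xs, ys)) / (2 * eps)
numeric_b = (mse_of(a, b + eps, xs, ys) - mse_of(a, b - eps, xs, ys)) / (2 * eps)

grad_a, grad_b = analytic_gradients(a, b, xs, ys)
print(grad_a, numeric_a)
print(grad_b, numeric_b)
```

If the formulas are right, each pair of printed numbers should agree up to a small rounding error. Both gradients come out negative here, which means that increasing a and b reduces the error, as expected for a guess that lies below the real line.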

## The code

In [None]:
# Creating MSE cost-function
def mse_function(y_true: list, y_pred: list) -> float:
  '''
  Function that calculates MSE

  :param y_true: values of y, that we know from training data
  :param y_pred: values of y, created by our model

  :return mse: MSE value by formula
  '''
  # Number of values
  n = len(y_true)
  # Starting from 0
  pre_mse = 0
  for index, value in enumerate(y_true):
    pre_mse += (value - y_pred[index])**2
  mse = pre_mse/n
  return mse

Let's calculate the MSE. We could have gotten lucky and guessed a and b correctly from the start

In [None]:
mse_function(y_true, y_pred)

Well, we did not. Let's write an algorithm that will change a and b

In [None]:
# learning rate
lr = 0.003
# The maximum number of steps, so that we don't wait forever
# if we never reach the optimum
max_steps = 30000
# Starting mse
mse = 999
# Starting coefficients
a=2
b=-1
# Steps counter
step = 0

# Learning tracker
mse_list = []
a_list = []
b_list = []
y_preds = []

y_pred = []
# Number of elements in real data
n=len(y_true)

# The model keeps training while the step counter does not exceed max_steps
# and the current MSE is still at least 1e-10
while (step <= max_steps) and (mse >= 1e-10):
  # Reset the gradients at the start of every step
  grad_a=0
  grad_b=0
  # Accumulate the partial derivatives over all points (just like in the theory)
  for i, x in enumerate(X_true):
    grad_a += -2*(y_true[i] - (a*x + b))* x
    grad_b += -2*(y_true[i] - (a*x + b))
  grad_a = grad_a/n
  grad_b = grad_b/n
  # Make a move, according to lr
  a -= lr*grad_a
  b -= lr*grad_b
  # New forecast
  y_pred = [a*x+b for x in X_true]
  # Check MSE
  mse = mse_function(y_true, y_pred)
  # Writing results
  mse_list.append(mse)
  a_list.append(a)
  b_list.append(b)
  y_preds.append(y_pred)
  step += 1

In [None]:
# Step numbers, starting from 1, for the x axis of the MSE plot
steps_x = list(range(1, len(mse_list) + 1))

In [None]:
fig = plt.figure(figsize=(15, 5))
ax = fig.add_subplot()
ax.scatter(steps_x, mse_list, color='#ff6361')
plt.title('MSE change')
plt.show()

In [None]:
fig = plt.figure(figsize=(15, 5))
ax = fig.add_subplot()
alpha = 0.5
for j in range(0, len(y_preds), 20):
  alpha = min(alpha + 0.5/(len(y_preds)/20),1)
  ax.plot(X_true, y_preds[j], color='#ff6361', alpha=alpha)
ax.scatter(X_true, y_true, color='#003f5c')
plt.title('Training')
plt.show()

In [None]:
print('Our final formula: {:.4f}*x+{:.4f}'.format(a, b))
print('MSE Loss: {:.4f}'.format(mse))

The test is successful. Let's write a class
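
As an extra sanity check (not part of the original run), the result can also be compared with the closed-form least squares solution for a line, $a = \frac{cov(x, y)}{var(x)}$ and $b = \bar{y} - a * \bar{x}$. The sketch below rebuilds the synthetic data so that it runs on its own. The names `a_exact` and `b_exact` are illustrative.

```python
# Synthetic data, rebuilt here so the snippet is self-contained
xs = [i / 10 for i in range(-100, 100)]
ys = [5 * x + 6 for x in xs]
n = len(xs)

# Means of x and y
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Covariance of x and y, and variance of x
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
var_x = sum((x - mean_x) ** 2 for x in xs) / n

# Closed-form least squares coefficients for y = a*x + b
a_exact = cov_xy / var_x
b_exact = mean_y - a_exact * mean_x

print('Closed-form formula: {:.4f}*x+{:.4f}'.format(a_exact, b_exact))
```

On this noise-free data the closed-form coefficients should land very close to a = 5 and b = 6, so they can be compared directly with the `a` and `b` printed by the gradient descent loop.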

In [None]:
class GradientDescents:
  
  '''
  Gradient Descents implementation from scratch
  '''

  def progress_tracker(self, step: int, cost_function: float) -> None:
    '''
    Printing current progress

    :param step: current step
    :param cost_function: Loss

    '''
    from IPython.display import clear_output
    clear_output(wait=True)
    print('Step: {}'.format(step))
    print('Loss: {:.2f}'.format(cost_function))

  def mse_function(self, y_true: list, y_pred: list) -> float:
    '''
    MSE calculation

    :param y_true: y from data, that we know
    :param y_pred: predicted ys

    :return mse: MSE Loss
    '''
    # Number of ys
    n = len(y_true)
    # Starting from 0
    pre_mse = 0
    for index, value in enumerate(y_true):
      pre_mse += (value - y_pred[index])**2
    mse = pre_mse/n
    return mse
  
  def gradient_descent(self, X_true, y_true, \
                       start_a=1.0, start_b=1.0, \
                       learning_rate=0.003, max_steps=30000, \
                       save_steps=0):
    '''
    Simple gradient descent for formulas like y=a*x+b

    :param X_true: x values from the data
    :param y_true: y values from the data
    :param start_a: first a value in y=a*x+b
    :param start_b: first b value in y=a*x+b
    :param learning_rate: learning rate
    :param max_steps: maximum number of steps
    :param save_steps: if 0, only the result will be saved and returned
                       if > 0, then every save_steps-th step will be saved
   
    :return return_dict: { 

            :return a: a value
            :return b: b value
            :return steps: total number of steps made
            :return mse: MSE value
            :return mse_list: list of MSE values if save_steps > 0
            :return a_list: list of a values if save_steps > 0
            :return b_list: list of b values if save_steps > 0
    
                        }
    '''
    # Initialize first step
    step = 0
    a = start_a
    b = start_b
    mse = 9999999
    mse_prev = 0

    # Let's make learning tracking
    mse_list = []
    a_list = []
    b_list = []

    # Predicted ys
    y_pred = []
    # Number of y elements in dataset
    n=len(y_true)

    # The model keeps training while the current step does not exceed max_steps,
    # MSE is at least 1e-5
    # and the change in MSE between two steps is at least 1e-5
    while (step <= max_steps) and (mse >= 1e-5) \
           and (abs(mse - mse_prev) >= 1e-5):
      
      mse_prev = mse
      # Reset the gradients so that they do not accumulate between steps
      grad_a=0
      grad_b=0
      # Calculating moving steps for weights (just like in theory)
      for i, x in enumerate(X_true):
        grad_a += -2*(y_true[i] - (a*x + b))* x
        grad_b += -2*(y_true[i] - (a*x + b))
      grad_a = grad_a/n
      grad_b = grad_b/n
      # Make a move, according to lr (-= because we need the opposite direction from the gradient)
      a -= learning_rate*grad_a
      b -= learning_rate*grad_b
      # New forecast
      y_pred = [a*x+b for x in X_true]
      # Check MSE loss
      mse = self.mse_function(y_true, y_pred)

      step += 1

      # Writing progress
      if save_steps > 0:
        if step % save_steps == 0:
          mse_list.append(mse)
          a_list.append(a)
          b_list.append(b)
      
      self.progress_tracker(step-1, mse)

    if save_steps > 0:
      return_dict = {'a': a, 'b': b, 'mse':mse, 'steps': step-1, \
            'mse_list': mse_list, 'a_list': a_list, 'b_list': b_list}
    else:
      return_dict = {'a': a, 'b': b, 'mse':mse, 'steps': step-1}

    return return_dict

I have added weight-height data from https://www.kaggle.com/mustafaali96/weight-height?select=weight-height.csv

In [None]:
import pandas as pd

In [None]:
df_hw = pd.read_csv('../input/weight-height/weight-height.csv')

In [None]:
df_hw.head()

In [None]:
X_true = df_hw['Height'].to_list()
y_true = df_hw['Weight'].to_list()

In [None]:
fig = plt.figure(figsize=(15, 5))
ax = fig.add_subplot()
ax.scatter(X_true, y_true, color='#003f5c', label='Real data')
ax.plot(X_true, [f*6 + -256 for f in X_true], color='#ff6361', label='Starting regression')
plt.title('Regression step #0')
ax.legend()
plt.show()

In [None]:
grad_test = GradientDescents()
gd = grad_test.gradient_descent(X_true, y_true, \
                                start_a=6, start_b=-256, \
                                learning_rate=0.0002, max_steps=200000, \
                                save_steps=0)

In [None]:
fig = plt.figure(figsize=(15, 5))
ax = fig.add_subplot()
ax.scatter(X_true, y_true, color='#003f5c', label='Real data')
ax.plot(X_true, [f*6 + -256 for f in X_true], color='#ff6361',
        label='Starting regression')
ax.plot(X_true, [f*gd['a'] + gd['b'] for f in X_true], color='#ffa600',
        label='Regression after few steps')
plt.title('Learning progress')
ax.legend()
plt.show()

In [None]:
print('Final formula: {:.4f}*x+{:.4f}'.format(gd['a'], gd['b']))
print('MSE loss: {:.4f}'.format(gd['mse']))