In [None]:
from statistics import stdev, mean

import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
from IPython.display import Image

## Linear Regression explanation

In linear regression we look for the line that best fits a set of data. However well the line is chosen, many data points will still lie some distance away from it.

Equation for the line is $y = mx + b$.

### Loss

So we need a measure of error (called *Loss*) that tells us how bad our line is. We want the distance between a data point and the corresponding point on the line to count the same way whether the data point lies above or below the line, so the errors need to be squared. As we are interested in the total over all distances, the squared errors are summed.

The formula is Loss = $\frac{1}{N}\sum\limits_{i=1}^{N}(y_i-(mx_i+b))^2$, where $(mx_i+b)$ is the predicted $y_i$.

Let's consider one line and two points.

In [None]:
line_x = [i for i in range(1, 11)]
line_y = [i*2 for i in range(1, 11)]
plt.plot(line_x, line_y)
plt.plot(5,18, "o", color="red") 
plt.plot(8,12, "o", color="green") 


x_values = [5, 5]
y_values = [18, 5*2]
plt.plot(x_values, y_values, color="red", linestyle='dashed')


x_values = [8, 8]
y_values = [12, 8*2]
plt.plot(x_values, y_values, color="green", linestyle='dashed')

In [None]:
error_red = 18-(5*2)
error_green = 12-(8*2)
print(f'Error for red dot is {error_red} and for green one: {error_green}. Total error is {error_red + error_green}.')

In [None]:
line_x = [i for i in range(1, 11)]
line_y = [i*2 for i in range(1, 11)]
plt.plot(line_x, line_y)
plt.plot(5,18, "o", color="red") 
plt.plot(8,12, "o", color="green") 


x_values = [5, 5]
y_values = [18, 5*2]
plt.plot(x_values, y_values, color="red", linestyle='dashed')
plt.annotate(f'dist = {error_red}', xy=(5, 19))


x_values = [8, 8]
y_values = [12, 8*2]
plt.plot(x_values, y_values, color="green", linestyle='dashed')
plt.annotate(f'dist = {error_green}', xy=(8, 10))

We see that the negative value of the second error reduces the total error, even though both points are far from the line. To avoid this cancellation, it's better to use squared values.

In [None]:
sq_error_red = (18-(5*2))**2
sq_error_green = (12-(8*2))**2
print(f'Squared error for red dot is {sq_error_red} and for green one: {sq_error_green}. Total error is {sq_error_red + sq_error_green}.')
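
To connect these two squared errors with the Loss formula, here is a minimal sketch that averages them for the same line ($m=2$, $b=0$) and the same two points. The variable names `points`, `squared_errors` and `mean_loss` are introduced only for this illustration.

```python
# The red and green points from the plots above, and the line y = 2x + 0
points = [(5, 18), (8, 12)]
m, b = 2, 0

# Squared vertical distance between each point and the line
squared_errors = [(y_i - (m * x_i + b)) ** 2 for x_i, y_i in points]

# Loss from the formula: the mean of the squared errors
mean_loss = sum(squared_errors) / len(points)
print(f'Squared errors: {squared_errors}, Loss: {mean_loss}')
```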

### Gradient descent

Gradient descent is an algorithm for finding a local minimum of a continuous and differentiable function. In the case of linear regression this function is Loss, since we want the errors to be as low as possible. To use the gradient descent algorithm:
1. We choose initial guesses for b and m.
2. We find the partial derivatives of Loss with respect to the slope and the intercept (the derivatives tell us in which direction to move along the Loss function).
3. Depending on whether a derivative is negative or positive, we increase or decrease the corresponding parameter, so that we always move towards lower Loss.
4. The update looks like: new b = current b - (learning rate * gradient at b), and likewise for m.
5. We stop once the Loss stops changing or changes very slowly.
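
The steps above can be sketched on a much simpler problem than regression. The example below is only an illustration under assumed values: the function $f(w) = (w-3)^2$, the starting point, the learning rate and the stopping tolerance are all hypothetical choices, not part of the regression task.

```python
def f(w):
    # A simple function with its minimum at w = 3
    return (w - 3) ** 2

def df_dw(w):
    # Derivative of f with respect to w
    return 2 * (w - 3)

w = 0.0                # step 1: initial guess
learning_rate = 0.1
previous_value = f(w)

for step in range(1000):
    # steps 2-4: move against the sign of the derivative
    w = w - learning_rate * df_dw(w)
    current_value = f(w)
    # step 5: stop once the function value barely changes
    if abs(previous_value - current_value) < 1e-12:
        break
    previous_value = current_value

print(f'Stopped after {step + 1} steps at w = {w}')
```

A negative derivative increases `w` and a positive one decreases it, so every update moves towards the minimum.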

In [None]:
Image("../input/gradient-chart/gradient.png")

Calculating partial derivatives:

$F=\frac{1}{N}\sum\limits_{i=1}^{N}(y_i-(mx_i+b))^2$

gradient at b = $\frac{\partial F}{\partial b}=\frac{\partial}{\partial b}\left(\frac{1}{N}\sum\limits_{i=1}^{N}(y_i-(mx_i+b))^2\right)=\frac{1}{N}\sum\limits_{i=1}^{N}2(y_i-(mx_i+b))(-1)=-\frac{2}{N}\sum\limits_{i=1}^{N}(y_i-(mx_i+b))$

gradient at m = $\frac{\partial F}{\partial m}=\frac{\partial}{\partial m}\left(\frac{1}{N}\sum\limits_{i=1}^{N}(y_i-(mx_i+b))^2\right)=\frac{1}{N}\sum\limits_{i=1}^{N}2(y_i-(mx_i+b))(-x_i)=-\frac{2}{N}\sum\limits_{i=1}^{N}x_i(y_i-(mx_i+b))$
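
One way to gain confidence in these two formulas is to compare them with a numerical derivative. The sketch below does this with central differences; the helper names (`mse`, `analytic_gradients`) and the four data points are made up for this check.

```python
import numpy as np

def mse(x, y, m, b):
    # F = mean of squared residuals
    return np.mean((y - (m * x + b)) ** 2)

def analytic_gradients(x, y, m, b):
    # The two formulas derived above
    residual = y - (m * x + b)
    grad_b = -2 * np.mean(residual)
    grad_m = -2 * np.mean(x * residual)
    return grad_b, grad_m

# Hypothetical data and parameters, used only for the comparison
x_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_demo = np.array([2.5, 3.5, 6.0, 8.5])
m_demo, b_demo, eps = 0.5, 1.0, 1e-6

grad_b, grad_m = analytic_gradients(x_demo, y_demo, m_demo, b_demo)

# Central differences: (F(p + eps) - F(p - eps)) / (2 * eps)
num_b = (mse(x_demo, y_demo, m_demo, b_demo + eps)
         - mse(x_demo, y_demo, m_demo, b_demo - eps)) / (2 * eps)
num_m = (mse(x_demo, y_demo, m_demo + eps, b_demo)
         - mse(x_demo, y_demo, m_demo - eps, b_demo)) / (2 * eps)

print(f'gradient at b: analytic {grad_b}, numerical {num_b}')
print(f'gradient at m: analytic {grad_m}, numerical {num_m}')
```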

### Learning rate

The learning rate determines how slowly or quickly the minimum of the Loss function is found. It can be very small, like 0.000001. Then we need a lot of iterations (steps along the Loss function towards its minimum) and execution takes more time, but the result is more accurate. A high learning rate has the opposite effect: fewer iterations and less time, but lower accuracy.
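
The effect can be seen on a toy function. This sketch minimizes the assumed function $f(w) = w^2$ (gradient $2w$, minimum at $w=0$) with a fixed number of steps; `run_descent` and the chosen rates are hypothetical and serve only as an illustration.

```python
def run_descent(learning_rate, steps=100, start=1.0):
    """Minimize f(w) = w**2 with gradient descent and return the final w."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * 2 * w  # gradient of w**2 is 2*w
    return w

for rate in (0.000001, 0.01, 0.4, 1.1):
    print(f'learning rate {rate}: w after 100 steps = {run_descent(rate)}')
```

With the smallest rate `w` has hardly moved after 100 steps, moderate rates get close to the minimum, and a rate that is too large makes each step overshoot so that `w` moves away from the minimum instead of towards it.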

## Choose a dataset

In [None]:
df = pd.read_csv('../input/google-stock-prices/googl_prices.csv', engine='python', sep=r'\s*,\s*') # sep=r'\s*,\s*' to remove spaces
df = df.iloc[::-1] # as we want to start from the earliest dates
df.index = df.index[::-1]
df = df.iloc[524:] # from 2018 to 2020
df = df.reset_index(drop=True) # renumber rows from 0 and discard the old index
df

In [None]:
df.isnull().sum()

As none of the variables has missing values or significant outliers, I may choose any of them. Let it be Close/Last.

In [None]:
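df["Close/Last"] = df["Close/Last"].str.replace('$', '', regex=False) # literal '$', not the regex end-of-string anchor
df["Close/Last"] = df["Close/Last"].str.replace(' ', '', regex=False) # strip spaces so the str can be converted into float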
df["Close/Last"] = df["Close/Last"].astype(float)
df

In [None]:
df.Date = pd.to_datetime(df.Date) # convert to datetime to have dates visible on plots

In [None]:
date = df.Date
price = df["Close/Last"]
plt.figure(figsize=(10,8)) # custom size to improve visibility
plt.scatter(date, price, s=5) # size of dots = 5 to have them smaller 

## Write algorithms from scratch

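Based on the explanation above, we will write functions for:
* loss function --> get_total_loss(x, y, m, b, is_standarized)
* gradients for m and b --> get_gradient_at_b(x, y, m, b, is_standarized) and get_gradient_at_m(x, y, m, b, is_standarized)
* b and m update --> update_b_and_m(b, m, x, y, learn_rate, is_standarized)
* returning the best b and m at minimal loss --> gradient_descent(x, y, learn_rate, num_of_iterations, is_standarized)

The is_standarized flag tells each function whether x is already numeric and rescaled; when it is false, x is replaced by the integers 1..len(x).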

Let's create some linear function that could approximate (poorly) Google stock prices.

In [None]:
index = df.index.tolist()
m=0.8
b=800
y_predicted = [m*x+b for x in index] # y=mx+b

In [None]:
price = df["Close/Last"]
plt.figure(figsize=(10,8)) 
plt.scatter(index, price, s=5)  
plt.plot(index, y_predicted, 'g')
plt.xlabel('2018-2020')
plt.ylabel('Google prices in USD')

### Calculating Loss

In [None]:
def get_total_loss(x, y, m, b, is_standarized):
    
    # since we don't want to multiply by datetimes,
    # we convert the X datapoints into integers 1..len(x) and call them x_set
    if not is_standarized:
        x_set = list(range(1, len(x) + 1))
    else:
        x_set = x
    y_predicted = [m*x_i+b for x_i in x_set]
    loss = 0
    
    for i in range(len(x_set)):
        # we iterate through each price and calculate distance 
        # between real price and predicted one determined by the straight line
        loss += (y[i] - y_predicted[i])**2 
        
    return loss

In [None]:
x = date
y = price
m=0.8
b=800

print(f'Total loss is: {get_total_loss(x, y, m, b,0)}')

### Calculating both gradients

In [None]:
def get_gradient_at_b(x, y, m, b, is_standarized):
    
    if not is_standarized:
        x_set = list(range(1, len(x) + 1)) # integers 1..len(x) instead of datetimes
    else:
        x_set = x
        
    y_set = y
    gradient = 0
    
    for i in range(len(x_set)):
        x = x_set[i]
        y = y_set[i]
        y_predicted = m*x+b
        gradient += y-y_predicted
        
    return gradient*(-2/len(x_set))

In [None]:
def get_gradient_at_m(x, y, m, b, is_standarized):
    
    if not is_standarized:
        x_set = list(range(1, len(x) + 1)) # integers 1..len(x) instead of datetimes
    else:
        x_set = x
        
    y_set = y
    gradient = 0
    
    for i in range(len(x_set)):
        x = x_set[i]
        y = y_set[i]
        y_predicted = m*x+b
        gradient += x*(y-y_predicted)
        
    return gradient*(-2/len(x_set))

In [None]:
x = date
y = price
m=0.8
b=800

print(f'gradient at b : {get_gradient_at_b(x, y, m, b, 0)}')
print(f'gradient at m : {get_gradient_at_m(x, y, m, b, 0)}')

### Updating b and m

In [None]:
def update_b_and_m(b, m, x, y, learn_rate, is_standarized):
    
    gradient_b = get_gradient_at_b(x, y, m, b, is_standarized)
    gradient_m = get_gradient_at_m(x, y, m, b, is_standarized)
    
    b = b-(learn_rate*gradient_b)
    m = m-(learn_rate*gradient_m)
    
    return (b, m)

In [None]:
x = date
y = price
m=0.8
b=800
learn_rate = 0.01
is_standarized = 0
b, m = update_b_and_m(b, m, x, y, learn_rate, is_standarized)
print(f'b: {b}, m: {m}')

### Combining all and getting the most optimal b and m 

In [None]:
def gradient_descent(x, y, learn_rate, num_of_iterations, is_standarized):
    
    
    rows = [] # one record of b, m, and loss per iteration
    m = 0 # initial value
    b = 0
    
    # update b and m, then record them together with the resulting loss
    for i in range(num_of_iterations):
        b, m = update_b_and_m(b, m, x, y, learn_rate, is_standarized)
        loss = get_total_loss(x, y, m, b, is_standarized)
        rows.append({'loss_set': loss, 'b': b, 'm': m})
    
    # DataFrame.append was removed in pandas 2.0, so build the frame once
    outcome = pd.DataFrame(rows)
    
    loss = min(outcome.loss_set)
    
    b = outcome.b[outcome['loss_set'] == loss].values[0]
    m = outcome.m[outcome['loss_set'] == loss].values[0]
    
    return [b, m, loss]

In [None]:
x = date
y = price
learn_rate = 0.01
num_of_iterations = 100
is_standarized =  0 
b, m, loss = gradient_descent(x, y, learn_rate, num_of_iterations, is_standarized)
print(f'The most optimal: b: {b}, m: {m}, loss: {loss}.')

We can see that the algorithm above doesn't produce sensible results for this dataset.

Let's run the same function on x and y values of more comparable magnitude.

In [None]:
x = [1,2,3]
y = [i*2 for i in x]
learn_rate = 0.01
num_of_iterations = 1000
is_standarized = 0

b, m, loss = gradient_descent(x, y, learn_rate, num_of_iterations, is_standarized)
print(f'The most optimal: b: {b}, m: {m}, loss: {loss}.')

Let's see how it looks on the graph.

In [None]:
y_pred = [m*i+b for i in x]
plt.plot(x, y_pred)
plt.scatter(x, y, s=20, c='r')
plt.show()

It's easy to see that the b and m returned by the gradient_descent() function define a line y=mx+b that fits the data points very well.

## Fix irrelevant output for Googl stock prices

In [None]:
print(f'Min Googl price is {min(price)} and max: {max(price)} of dollars.')
print(f'Length of X dataset: {len(date)}.')

Let's consider why the gradient_descent() output looks fine for our second, dummy dataset but quite strange for Googl stock prices.

Our dummy dataset contains proportional data: each y value is exactly 2 times the corresponding x value.
In the case of Googl, prices fall into the range [681.14, 1795.36] dollars, while the X dataset used during gradient_descent() execution takes values from 1 to 1259.
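
One possible way to quantify this, offered here as a sketch rather than as part of the original analysis: for a quadratic loss such as the mean squared error, plain gradient descent only converges when the learning rate is below 2 divided by the largest eigenvalue of the Hessian of the loss. The helper `max_stable_learning_rate` is a hypothetical name, and the rescaled x values are simply a min-max rescaling of 1..1259 to [0, 1].

```python
import numpy as np

def max_stable_learning_rate(x):
    """Upper bound on the learning rate for the mean squared error of y = m*x + b."""
    x = np.asarray(x, dtype=float)
    # Hessian of the loss with respect to (m, b); it does not depend on y
    hessian = 2 * np.array([[np.mean(x ** 2), np.mean(x)],
                            [np.mean(x), 1.0]])
    largest_eigenvalue = np.linalg.eigvalsh(hessian).max()
    return 2 / largest_eigenvalue

raw_x = np.arange(1, 1260)  # 1..1259, as used for the stock prices
scaled_x = (raw_x - raw_x.min()) / (raw_x.max() - raw_x.min())  # rescaled to [0, 1]
dummy_x = [1, 2, 3]

print(f'raw x:    learning rate must be below {max_stable_learning_rate(raw_x)}')
print(f'scaled x: learning rate must be below {max_stable_learning_rate(scaled_x)}')
print(f'dummy x:  learning rate must be below {max_stable_learning_rate(dummy_x)}')
```

Under this bound a learning rate of 0.01 is far too large for x values up to 1259, but small enough for the dummy dataset and for x values rescaled to [0, 1].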

### Normalization

The solution for reducing the spread is **normalization**. Let's transform our data so that they fall into the range [0,1].
There are many normalization algorithms, each suited to specific tasks.

I will choose the **min-max** algorithm, since we don't have any outliers (if we had them, we could use the Z-score algorithm) and we want to obtain exactly the same scale for both variables.
* min-max normalization = $\frac{value - min}{max - min}$, where *value* is each value from the dataset to be normalized, *min* is the minimum value of the dataset, and *max* is its maximum.
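
For comparison, the Z-score alternative mentioned above could look roughly like the sketch below. It is not used in the rest of this notebook; the function name `z_score` and the sample values are assumptions made for illustration.

```python
from statistics import mean, stdev

def z_score(data):
    """Rescale data to zero mean and unit (sample) standard deviation."""
    center = mean(data)
    spread = stdev(data)
    return [(value - center) / spread for value in data]

sample = [10, 12, 11, 13, 100]  # hypothetical values with one outlier
scaled_sample = z_score(sample)
print(scaled_sample)
```

Unlike min-max, the result is not confined to [0,1]; the values are centered on 0 instead.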

#### Min-max normalization

In [None]:
def min_max(data):
    
    minimum = min(data)
    maximum = max(data)
    
    normalized = []
    
    for value in data:
        normalized.append((value-minimum)/(maximum-minimum))
        
    return normalized

In [None]:
x = date
x_set = list(range(1, len(x) + 1)) # integers 1..len(x)

min_max_df = pd.DataFrame(data={
    'Price before': price, 
    'Price after': min_max(price),
    'X before': x_set,
    'X after': min_max(x_set),
})

min_max_df

Let's run gradient_descent() again.

In [None]:
x = min_max_df['X after']
y = min_max_df['Price after']
learn_rate = 0.01
num_of_iterations = 100
is_standarized = 1
b, m, loss = gradient_descent(x, y, learn_rate, num_of_iterations, is_standarized)
print(f'The most optimal: b: {b}, m: {m}, loss: {loss}.')

In [None]:
y_pred = [m*i+b for i in x]
plt.plot(x, y_pred)
plt.scatter(x, y, s=5) 
plt.show()

The last step is to transform *m*, *b* and *loss* back to the original scale.
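
A possible way to do this, sketched under the assumption that both x and y were rescaled with the min-max formula above: substituting $x' = \frac{x - x_{min}}{x_{max} - x_{min}}$ and $y' = \frac{y - y_{min}}{y_{max} - y_{min}}$ into $y' = m'x' + b'$ and solving for $y$ gives the slope and intercept on the original scale. The function names `denormalize_line` and `denormalize_loss` are hypothetical.

```python
def denormalize_line(m_norm, b_norm, x_min, x_max, y_min, y_max):
    """Convert slope and intercept fitted on min-max scaled data back to the original scale."""
    x_range = x_max - x_min
    y_range = y_max - y_min
    m = m_norm * y_range / x_range
    b = y_min + b_norm * y_range - m * x_min
    return m, b

def denormalize_loss(loss_norm, y_min, y_max):
    """Squared errors scale with the square of the y range."""
    return loss_norm * (y_max - y_min) ** 2

# Check on a known line y = 2x + 1 with x in [1, 5] and y in [3, 11]:
# on the scaled data this line is y' = 1 * x' + 0
print(denormalize_line(1.0, 0.0, x_min=1, x_max=5, y_min=3, y_max=11))
```

For the stock data, `x_min` and `x_max` would be the smallest and largest values of `x_set`, and `y_min` and `y_max` the smallest and largest prices.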