## Gradient Descent
It is like a superhero that helps our Linear Regression model find the best-fitting line for the data, that is, the line with the minimal cost (error).

$$f(x) = wx + b$$

Cost function 
$$J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} (f(x^i) - y^i)^2$$

Update rules
$$w = w - \alpha \frac{\partial}{\partial w} J(w,b)$$
$$b = b - \alpha \frac{\partial}{\partial b} J(w,b)$$

After differentiating, we get these update expressions for w and b
$$w = w - \alpha \frac{1}{m} \sum_{i=1}^{m} (f(x^i) - y^i)x^i$$
$$b = b - \alpha \frac{1}{m} \sum_{i=1}^{m} (f(x^i) - y^i)$$
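
The differentiation step can be sketched with the chain rule, using the same notation as above and $f(x^i) = wx^i + b$, so that $\frac{\partial f(x^i)}{\partial w} = x^i$ and $\frac{\partial f(x^i)}{\partial b} = 1$. The factor 2 from the square cancels the 2 in $\frac{1}{2m}$:

$$\frac{\partial}{\partial w} J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} 2(f(x^i) - y^i) \frac{\partial f(x^i)}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} (f(x^i) - y^i)x^i$$
$$\frac{\partial}{\partial b} J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} 2(f(x^i) - y^i) \frac{\partial f(x^i)}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (f(x^i) - y^i)$$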




In [65]:
import numpy as np
import matplotlib.pyplot as plt
import math
from sklearn.linear_model import LinearRegression
import pandas as pd

In [95]:
def gradient_descent(x, y):
    # we will start with some judicious values for w, b and alpha
    w = 2 # slope
    b = 1 # intercept
    a = 0.04 # alpha - learning rate
    
    m = len(x) # number of training instances
    
    # fig, ax = plt.subplots()
    
    steps = 210000  # maximum number of iterations
    
    w_calc = []
    j_calc = []
    
    j_prev = 0
    for i in range(steps):
        y_predicted = w * x + b
        J = 1/(2 * m) * sum(val**2 for val in y_predicted - y) # cost function calculation
        # new w and b - calculated simultaneously
        w = w - a * (1/m * sum((y_predicted - y) * x))
        b = b - a * (1/m * sum(y_predicted - y))
        
        w_calc.append(w)
        j_calc.append(J)
        
        # print(f"Values for w : {w} and b : {b} and J : {J} in iteration {i}")
        
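        # stop early once the cost no longer changes between iterations
        # (math.isclose uses a relative tolerance of 1e-09 by default)
        if i > 0 and math.isclose(J, j_prev):
            # print("Hola we found a good value:", J)
            break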
        j_prev = J

    # ax.plot(w_calc, j_calc)
    return w, b

In [96]:
# use gradient_descent
x = np.array([2, 4, 6, 8, 10])
y = np.array([10, 20, 40, 60, 80])

w, b = gradient_descent(x, y)
print("gradient descent result", w, b)

lin_reg = LinearRegression()
lin_reg.fit(pd.DataFrame(x), y)
print("result from sklearn", lin_reg.coef_, lin_reg.intercept_)

gradient descent result 8.999763283006654 -11.998271117149091
result from sklearn [9.] -12.000000000000021


Towards the minimum the slope (the gradient of J) gets smaller, so each update moves w and b only a little. We started from hand-picked values for w, b and alpha (alpha in particular needs to be chosen carefully). After increasing the number of iterations and tweaking alpha, gradient descent found a good fit that closely matches the sklearn result.
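
To make both points concrete, here is a minimal sketch that records the cost and the size of the gradient at every step. The helper `run_gd` and its parameters are hypothetical names introduced only for this illustration; it uses the same update rule and the same starting values (w = 2, b = 1) as `gradient_descent` above, but runs a fixed number of steps without the early stop. On this particular data set, alpha = 0.04 should make the cost and the gradient shrink, while a slightly larger alpha such as 0.05 should make the cost grow instead.

```python
import numpy as np

def run_gd(x, y, alpha, steps, w=2.0, b=1.0):
    """Hypothetical helper: run a fixed number of gradient descent steps
    and return the cost and gradient-norm history."""
    m = len(x)
    costs, grad_norms = [], []
    for _ in range(steps):
        error = w * x + b - y                      # f(x^i) - y^i for every sample
        dw = np.sum(error * x) / m                 # dJ/dw
        db = np.sum(error) / m                     # dJ/db
        costs.append(np.sum(error ** 2) / (2 * m)) # cost before this update
        grad_norms.append(np.hypot(dw, db))        # size of the gradient
        w, b = w - alpha * dw, b - alpha * db      # simultaneous update
    return costs, grad_norms

x = np.array([2, 4, 6, 8, 10], dtype=float)
y = np.array([10, 20, 40, 60, 80], dtype=float)

costs_ok, grads_ok = run_gd(x, y, alpha=0.04, steps=50)
costs_bad, _ = run_gd(x, y, alpha=0.05, steps=50)

print("alpha=0.04: first/last cost", costs_ok[0], costs_ok[-1])
print("alpha=0.04: first/last gradient norm", grads_ok[0], grads_ok[-1])
print("alpha=0.05: first/last cost", costs_bad[0], costs_bad[-1])
```

Plotting `costs_ok` against the iteration number with matplotlib is an easy way to see the curve flatten as the steps get smaller.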