# Gradient:

-> Gradient means the change in the value of a quantity per unit change in a given variable.
For a straight line, the gradient is its slope, that is, how steeply the line is inclined.

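slope = tan(theta) = (y2-y1)/(x2-x1) for a straight line, where theta is the angle the line makes with the positive x-axis. For a curve y = f(x), the slope at a point is the derivative d(f(x))/dx.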

The gradient of a curve at any point is equal to the gradient of its tangent at that point on the curve.

The gradients of two parallel lines are equal: m1 = m2

The product of the gradients of two perpendicular lines (neither of them vertical) is -1 => m1*m2 = -1
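The slope rules above can be checked with a few lines of Python. This is only a minimal sketch: the helper `slope` and the sample points are made up for illustration.

```python
import math

def slope(p1, p2):
    """Slope of the (non-vertical) line through points p1 and p2."""
    (x1, y1), (x2, y2) = p1, p2
    return (y2 - y1) / (x2 - x1)

m1 = slope((0, 0), (2, 4))    # line A
m2 = slope((1, 1), (3, 5))    # line B, parallel to A
m3 = slope((0, 0), (4, -2))   # line C, perpendicular to A

# Angle of line A with the positive x-axis, from slope = tan(theta)
theta_deg = math.degrees(math.atan(m1))

print(m1 == m2)                     # parallel lines: equal slopes
print(math.isclose(m1 * m3, -1.0))  # perpendicular lines: product is -1
```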

-> The gradient of a scalar-valued multivariable function f(x,y,..), denoted del(f) (also written ∇f), is the collection of all its partial derivatives arranged as a vector. This means it is a vector-valued function. Gradient vector:

$$\nabla f = \begin{bmatrix} \dfrac{\partial f}{\partial x} \\ \dfrac{\partial f}{\partial y} \\ \vdots \end{bmatrix}$$

If you imagine standing at a point (x0,y0,..) in the input space of f, the vector del(f(x0,y0,..)) tells you which direction you should travel to increase the value of f most rapidly.
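To make the gradient vector concrete, here is a small sketch that approximates it with central finite differences. The function `f(x, y) = x^2 + 3y^2`, the helper name `numerical_gradient` and the step size `h` are assumptions chosen for illustration only.

```python
import numpy as np

def f(v):
    """Example scalar-valued function of two variables."""
    x, y = v
    return x**2 + 3 * y**2

def numerical_gradient(func, v, h=1e-6):
    """Approximate each partial derivative with a central difference."""
    v = np.asarray(v, dtype=float)
    grad = np.zeros_like(v)
    for i in range(v.size):
        step = np.zeros_like(v)
        step[i] = h
        grad[i] = (func(v + step) - func(v - step)) / (2 * h)
    return grad

point = np.array([1.0, 2.0])
grad = numerical_gradient(f, point)

# The analytic gradient is (2x, 6y), so at (1, 2) it should be close to (2, 12).
print(grad)
```

Moving a small step from `point` along `grad` increases `f`; moving against it decreases `f`, which is the idea gradient descent relies on.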

## Gradient Descent

-> Optimization algorithm. Basic idea is to tweak parameters iteratively in order to minimize a cost function.

-> It starts by filling theta with random values (random initialization). It computes the gradient of the cost function with regard to each model parameter theta_j and takes a step in the direction of descending gradient. Once the gradient is zero, we have reached a minimum.

-> We need to calculate how much the cost function will change if we change theta_j just a little bit. This is called the partial derivative.

-> The size of the steps is determined by the learning rate hyperparameter. Each step is the learning rate times the gradient, so the steps gradually get smaller as the cost approaches the minimum. If the learning rate is too small, the algorithm has to go through many iterations to converge, which takes a long time. If the learning rate is too high, the algorithm can diverge, with larger and larger values, failing to find a good solution.
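The effect of the learning rate can be seen on a toy problem. The sketch below assumes a one-parameter cost J(theta) = theta^2 (gradient 2*theta); the helper `run_gd`, the starting point and the three learning rates are illustrative choices.

```python
def run_gd(eta, n_steps=50, theta0=5.0):
    """Run plain gradient descent on J(theta) = theta**2."""
    theta = theta0
    for _ in range(n_steps):
        gradient = 2 * theta            # dJ/dtheta
        theta = theta - eta * gradient  # step against the gradient
    return theta

theta_small = run_gd(eta=0.001)  # too small: barely moves in 50 steps
theta_good = run_gd(eta=0.1)     # reasonable: ends very close to the minimum at 0
theta_large = run_gd(eta=1.1)    # too high: every step overshoots, values blow up

print(theta_small, theta_good, theta_large)
```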

-> Convergence becomes difficult if the cost function has holes, ridges or plateaus (basically irregular terrain). Sometimes it will converge to a local minimum. Sometimes it takes a very long time to cross a plateau, and if we stop too early we will never reach the global minimum.

-> The MSE cost function for linear regression is a convex function. That means if we pick any 2 points on the curve, the line segment joining them never goes below the curve. That means there are no local minima, just one global minimum. It is also a continuous function with a slope that never changes abruptly. So gradient descent is guaranteed to approach the global minimum arbitrarily closely (if we wait long enough and the learning rate is not too high).

-> When using gradient descent all features should have similar scale. Otherwise it will take much longer to converge.
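One common way to bring features to a similar scale is standardization. The sketch below uses Scikit-Learn's `StandardScaler` on made-up data with two features of very different ranges; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two features on very different scales: roughly [0, 1) and [0, 1000)
X_raw = np.column_stack([rng.uniform(0, 1, 200), rng.uniform(0, 1000, 200)])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_raw)  # each column: zero mean, unit variance

print(X_raw.std(axis=0))     # very different spreads
print(X_scaled.std(axis=0))  # both close to 1
```

The scaler should be fitted on the training set only and then reused (with `transform`) on validation and test data.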

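-> Training a model means searching for a combination of model parameters that minimizes the cost function. It is a search in the model's parameter space.

Training a linear regression model with the normal equation or the SVD approach is linear with regard to the number of training instances, but both get slow as the number of features grows (inverting X^T X costs roughly O(n^2.4) to O(n^3) in the number of features n, and the SVD approach about O(n^2)). Once the model is trained, making predictions is linear with regard to both the number of instances we want to make predictions on and the number of features. Gradient descent scales well with the number of features.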
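For comparison with gradient descent, the two closed-form approaches can be sketched as follows. This is a minimal sketch on synthetic data; the names `theta_normal` and `theta_svd` are illustrative, and `np.linalg.pinv` is used here as one SVD-based way to solve the least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))  # y = 4 + 3x + noise
X_b = np.c_[np.ones((m, 1)), X]              # add the bias column x0 = 1

# Normal equation: theta = (X^T X)^-1 X^T y
theta_normal = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Pseudoinverse (computed internally via SVD): theta = X^+ y
theta_svd = np.linalg.pinv(X_b) @ y

print(theta_normal.ravel(), theta_svd.ravel())
```

Both should give essentially the same parameters, close to the values 4 and 3 used to generate the data.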

theta(next step)=theta-eta*del(MSE(theta))
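Written out for linear regression, the update rule above uses the gradient of the MSE. The block below is a sketch under the usual assumptions: X is the m x (n+1) matrix of training instances including the bias column, y is the vector of targets, m is the number of instances and eta is the learning rate.

```latex
\mathrm{MSE}(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(\theta^{\top}x^{(i)} - y^{(i)}\right)^{2}

\nabla_{\theta}\,\mathrm{MSE}(\theta) = \frac{2}{m}\,X^{\top}\left(X\theta - y\right)

\theta^{(\text{next step})} = \theta - \eta\,\nabla_{\theta}\,\mathrm{MSE}(\theta)
```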

## 1] Batch gradient descent

Uses the entire training set to compute the gradient at every iteration, so it is terribly slow on large training sets. Here the cost function decreases gently at each iteration until it reaches the minimum. For a convex cost function such as MSE, batch GD with a fixed (and not too high) learning rate will eventually converge to the optimal solution.

## 2] Stochastic gradient descent

It picks one random instance at every step and computes the gradient based only on that instance. Here the cost function bounces up and down, decreasing only on average. Over time it ends up close to the minimum, but even then it keeps bouncing around, never settling down. When the algorithm stops, the final parameter values will be good but not optimal.

When the cost function is irregular, this randomness helps to jump out of local minima, so SGD has a better chance of finding the global minimum than batch GD. But the algorithm never settles at the minimum. Solution - start with a large learning rate (quick progress and escape from local minima), then make it smaller and smaller so the algorithm can settle at the global minimum. The function that determines the learning rate at each iteration is called the learning schedule.

Learning rate reduced too quickly - may get stuck in a local minimum, or even end up frozen halfway to the minimum.

Learning rate reduced too slowly - may jump around the minimum for a long time and end up with a suboptimal solution if we halt training too early.
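A simple learning schedule just divides a constant by a growing step counter. The sketch below shows how the learning rate decays over the iterations; the hyperparameter values `t0 = 5` and `t1 = 50` are illustrative assumptions, not recommended settings.

```python
t0, t1 = 5, 50  # learning schedule hyperparameters (illustrative values)

def learning_schedule(t):
    """Learning rate at iteration t: starts at t0/t1 and decays towards 0."""
    return t0 / (t + t1)

# Learning rate at a few iterations: it shrinks as training progresses
etas = [learning_schedule(t) for t in (0, 50, 450, 4950)]
print(etas)  # [0.1, 0.05, 0.01, 0.001]
```

A larger `t1` makes the rate decay more slowly; a smaller one makes it drop quickly.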

## 3] Mini-Batch gradient descent

Computes gradients on small random sets of instances called mini batches. Advantage over SGD - performance boost from hardware optimization of matrix operations.

Batch GD actually stops at the minimum, while the other two continue to walk around it (unless a good learning schedule is used).
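Mini-batch GD can be sketched in the same style as the other two variants. This is a minimal sketch on synthetic data; the batch size of 20, the fixed learning rate of 0.1 and the number of epochs are assumptions picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))  # y = 4 + 3x + noise
X_b = np.c_[np.ones((m, 1)), X]              # add the bias column x0 = 1

n_epochs = 50
batch_size = 20
eta = 0.1
theta = rng.standard_normal((2, 1))          # random initialization

for epoch in range(n_epochs):
    shuffled = rng.permutation(m)            # reshuffle instances every epoch
    for start in range(0, m, batch_size):
        idx = shuffled[start:start + batch_size]
        xb, yb = X_b[idx], y[idx]
        # Gradient of the MSE computed on the mini-batch only
        gradients = 2 / len(idx) * xb.T @ (xb @ theta - yb)
        theta = theta - eta * gradients

print(theta.ravel())
```

With a fixed learning rate the parameters keep wandering slightly around the minimum, which matches the behaviour described above.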

In [1]:
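## Generating some linear-looking data
import numpy as np

np.random.seed(42)                      ## make the example reproducible
m = 100                                 ## number of instances
X = 2 * np.random.rand(m, 1)            ## column vector of inputs in [0, 2)
y = 4 + 3 * X + np.random.randn(m, 1)   ## last term refers to noise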

In [6]:
from sklearn.preprocessing import add_dummy_feature

X_b = add_dummy_feature(X)   ## add x0 = 1 to each instance (bias term)

In [3]:
## Batch GD
eta = 0.1                        ## learning rate
n_epochs = 1000
m = len(X_b)                     ## number of instances
np.random.seed(42)
theta = np.random.randn(2, 1)    ## random initialization of the model parameters
theta

array([[ 0.49671415],
       [-0.1382643 ]])

In [7]:
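for epoch in range(n_epochs):
    ## gradient of the MSE over the whole training set
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    ## step in the direction of descending gradient
    theta = theta - eta * gradients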

-> We can use grid search to find a good learning rate.
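As one possible way to do this, the sketch below runs Scikit-Learn's `GridSearchCV` over the `eta0` parameter of `SGDRegressor`. The synthetic data, the candidate learning rates and the 5-fold cross-validation are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.standard_normal(m)   # 1D targets: y = 4 + 3x + noise

param_grid = {"eta0": [0.0001, 0.001, 0.01, 0.1]}  # candidate learning rates
search = GridSearchCV(
    SGDRegressor(max_iter=1000, tol=1e-5, random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",  # higher (closer to 0) is better
    cv=5,
)
search.fit(X, y)

best_eta0 = search.best_params_["eta0"]
print(best_eta0, search.best_score_)
```

Limiting `max_iter` during the search also helps to rule out learning rates that take too long to converge.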

-> How to set the number of epochs?

Too low => We will be far away from the optimal solution when the algorithm stops.

Too high => Waste of time when the model parameters do not change anymore

Solution => Set a very large number of epochs, but interrupt the algorithm when the gradient vector becomes tiny, that is, when its norm becomes smaller than a tiny number (the tolerance). This happens when gradient descent has (almost) reached the minimum.
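The stopping rule above can be sketched by adding a tolerance check to batch GD. This is a minimal sketch on synthetic data; the tolerance of 1e-6 and the epoch limit are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))  # y = 4 + 3x + noise
X_b = np.c_[np.ones((m, 1)), X]              # add the bias column x0 = 1

eta = 0.1
max_epochs = 100_000   # deliberately large upper limit
tolerance = 1e-6       # stop once the gradient norm falls below this
theta = rng.standard_normal((2, 1))

for epoch in range(max_epochs):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    if np.linalg.norm(gradients) < tolerance:
        break          # gradient is tiny: (almost) at the minimum
    theta = theta - eta * gradients

epochs_used = epoch    # number of update steps taken before stopping
print(epochs_used, theta.ravel())
```

The loop should stop long before `max_epochs`, so a generous limit costs nothing when the tolerance check is in place.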

In [9]:
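## SGD
m = len(X_b)
n_epochs = 50
t0, t1 = 5, 50   ## learning schedule hyperparameters

def learning_schedule(t):
    ## learning rate shrinks as the iteration count t grows
    return t0 / (t + t1)

np.random.seed(42)
theta = np.random.randn(2, 1)   ## random initialization

for epoch in range(n_epochs):
    for iteration in range(m):
        random_index = np.random.randint(m)       ## pick one instance at random
        xi = X_b[random_index:random_index + 1]   ## slicing keeps xi 2D
        yi = y[random_index:random_index + 1]
        gradients = 2 * xi.T @ (xi @ theta - yi)  ## For SGD do not divide by m
        eta = learning_schedule(epoch * m + iteration)
        theta = theta - eta * gradients
theta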

array([[4.21076011],
       [2.74856079]])

In [10]:
## Linear regression using stochastic GD with Scikit-Learn
## SGDRegressor class defaults to optimizing MSE cost function
from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(max_iter=1000, tol=1e-5, penalty=None, eta0=0.01,
                       n_iter_no_change=100, random_state=42)
sgd_reg.fit(X, y.ravel())   ## y.ravel() because fit() expects 1D targets

In [11]:
sgd_reg.intercept_,sgd_reg.coef_

(array([4.21278812]), array([2.77270267]))