# Gradient Descent

In [1]:
# import necessary libraries and specify that graphs should be plotted inline
%matplotlib inline 
import numpy as np
import pandas as pd
import sklearn
# import matplotlib.pyplot as plt


In this short notebook, we apply the gradient descent algorithm to linear models. When applying gradient descent, it is important to make sure the features are on a similar scale; otherwise, the algorithm can take a long time to converge or even fail to converge. For now, we create a hypothetical dataset whose features all share the same scale.
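
As a rough illustration of the scaling point, the sketch below standardizes two features that live on very different scales. The data and the names `X_raw`, `col_mean`, `col_std`, and `X_scaled` are made up for this example and are not used elsewhere in the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical raw features on very different scales (illustrative only):
# column 0 is in the thousands, column 1 is in the hundredths.
X_raw = np.column_stack([
    rng.normal(loc=5000.0, scale=1500.0, size=200),
    rng.normal(loc=0.05, scale=0.01, size=200),
])

# Standardize each column: subtract its mean, divide by its standard deviation.
col_mean = X_raw.mean(axis=0)
col_std = X_raw.std(axis=0)
X_scaled = (X_raw - col_mean) / col_std

print(X_scaled.mean(axis=0))  # both entries are approximately 0
print(X_scaled.std(axis=0))   # both entries are approximately 1
```

With its default settings, `sklearn.preprocessing.StandardScaler` applies the same transformation, which is usually more convenient when the scaling has to be reused on new data.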

Consider a dataset with 1000 observations and 500 variables. All the X values are random draws from a standard normal distribution. The relationship between Y and X is $Y = 3 + X\theta + e$, where $\theta$ is a vector of 500 coefficients and $e$ is standard normal noise.

The cell below generates the hypothetical dataset.

In [2]:
## Create a hypothetical dataset with 1000 rows, 500 variables.

n_samples, n_features = 1000, 500  # define dimensions
np.random.seed(0)                  # set seed for reproduction


X = np.random.randn(n_samples, n_features) # a 1000 by 500 matrix, each entry is a random draw from standard normal distribution

theta = np.random.rand(n_features)  # 500 features, so we need 500 parameters

y = 3 + X.dot(theta) + np.random.randn(n_samples)  # define a hypothetical relationship

print(X.shape)

(1000, 500)


We introduced three gradient descent variations: (1) batch gradient descent, (2) stochastic gradient descent (SGD), and (3) mini-batch gradient descent. Among these, the sklearn package offers a ready-made regressor only for the second variation, SGD. The syntax is:
**sklearn.linear_model.SGDRegressor()**
- eta0: the initial learning rate
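
To make the role of `eta0` more concrete, here is a minimal sketch comparing two learning-rate schedules of `SGDRegressor`. It assumes a small made-up dataset (`X_demo`, `y_demo`) with a true intercept of 3; the coefficient values and `random_state=0` are arbitrary choices for the example. With the default `learning_rate="invscaling"`, the step size starts at `eta0` and shrinks as training proceeds; with `learning_rate="constant"`, it stays at `eta0`.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)

# Small hypothetical dataset: y = 3 + X w + small noise (values chosen for illustration).
X_demo = rng.standard_normal((500, 5))
w_demo = np.array([0.5, -1.0, 0.2, 0.8, 0.1])
y_demo = 3 + X_demo @ w_demo + 0.1 * rng.standard_normal(500)

# Default schedule ("invscaling"): the step size starts at eta0 and decays over updates.
sgd_decay = SGDRegressor(eta0=0.01, random_state=0)
sgd_decay.fit(X_demo, y_demo)

# Constant schedule: every update uses a step size of exactly eta0.
sgd_const = SGDRegressor(eta0=0.01, learning_rate="constant", random_state=0)
sgd_const.fit(X_demo, y_demo)

# intercept_ is a length-1 array; n_iter_ is the number of epochs actually run.
print(sgd_decay.intercept_, sgd_decay.n_iter_)
print(sgd_const.intercept_, sgd_const.n_iter_)
```

Both fits should land near the true intercept on data this clean; the number of epochs each one needs may differ.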

The third variation is the one most frequently used in deep learning and can be applied in keras. Installing keras takes some effort, so we will save this part for the deep learning sections.
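
In the meantime, the idea behind mini-batch gradient descent can be sketched in plain NumPy. This is not the keras workflow, only an illustration on a small made-up dataset with the same structure as the one above. The batch size, number of epochs, and all variable names (`X_demo`, `Xb`, `theta_hat`, ...) are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small hypothetical dataset: y = 3 + X theta + noise.
n_samples, n_features = 1000, 5
X_demo = rng.standard_normal((n_samples, n_features))
true_theta = rng.random(n_features)
y_demo = 3 + X_demo @ true_theta + rng.standard_normal(n_samples)

# Append a column of ones so the last coefficient plays the role of the intercept.
Xb = np.hstack([X_demo, np.ones((n_samples, 1))])

eta = 0.01        # learning rate (assumed value)
batch_size = 32   # mini-batch size (assumed value)
n_epochs = 50     # passes over the data (assumed value)
theta_hat = rng.standard_normal(Xb.shape[1])

for epoch in range(n_epochs):
    order = rng.permutation(n_samples)  # reshuffle the rows every epoch
    for start in range(0, n_samples, batch_size):
        idx = order[start:start + batch_size]
        X_batch, y_batch = Xb[idx], y_demo[idx]
        # Gradient of the mean squared error computed on this mini-batch only.
        grad = 2 / len(idx) * X_batch.T @ (X_batch @ theta_hat - y_batch)
        theta_hat -= eta * grad

print(theta_hat[-1])  # estimated intercept
```

Each update uses only 32 rows, so a single step is cheaper than a full-batch step but noisier; the estimate hovers around the least-squares solution rather than settling on it exactly.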

Batch gradient descent is rarely seen in Python packages. Here I provide a template for coding this variation manually.

Now let's apply the first two gradient descent algorithms to the linear model and see whether we obtain the same result as the LinearRegression() function.


**Practice**
- Obtain the estimated intercept using LinearRegression().
- Obtain the estimated intercept using SGDRegressor(). Set the initial learning rate to 0.01.
- Obtain the estimated intercept by running batch gradient descent manually. Set the learning rate to 0.01. Assume convergence after 1000 iterations (in practice, you should check the change in the cost function).
- Check whether the reported intercepts are close to 3 (i.e., the true value).
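
The exercise fixes the number of iterations, but the convergence check mentioned above can be sketched as follows. This is one possible stopping rule, not the only one: it stops when the mean squared error changes by less than a tolerance between consecutive iterations. The dataset is a small made-up one, and `tol`, `max_iter`, and the variable names are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small hypothetical dataset: y = 3 + X theta + noise.
n_samples, n_features = 1000, 5
X_demo = rng.standard_normal((n_samples, n_features))
y_demo = 3 + X_demo @ rng.random(n_features) + rng.standard_normal(n_samples)
Xb = np.hstack([X_demo, np.ones((n_samples, 1))])  # last column -> intercept

eta = 0.01         # learning rate, as in the exercise
tol = 1e-8         # assumed tolerance on the change in cost
max_iter = 10_000  # assumed safety cap on the number of iterations

theta_hat = np.zeros(Xb.shape[1])
prev_cost = np.inf
for iteration in range(max_iter):
    residual = Xb @ theta_hat - y_demo
    cost = np.mean(residual ** 2)      # mean squared error at the current theta_hat
    if abs(prev_cost - cost) < tol:    # cost has essentially stopped changing
        break
    prev_cost = cost
    theta_hat -= eta * (2 / n_samples) * Xb.T @ residual

print(iteration, cost, theta_hat[-1])
```

The loop reports how many iterations were needed, which gives a sense of whether a fixed budget such as 1000 iterations is generous or tight for a given learning rate.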

In [3]:
# Linear Regression
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X,y)
print(model.intercept_)

3.0338873093107876


In [4]:
# SGD Regressor
from sklearn.linear_model import SGDRegressor

sgd_model = SGDRegressor(eta0 = 0.01)

sgd_model.fit(X, y)

print(sgd_model.intercept_)

[3.03799072]


In [5]:
# Now, let's do a naive batch gradient descent manually

# S1. Append a column of 1s to the original X matrix.
# The coefficient on this new column is the intercept.

b = np.append(X, np.ones([len(X), 1]), 1)
# b is our new X matrix; its last column, b[:, 500], is the column of 1s, so theta[500] estimates the intercept


# S2. Initialization, set iteration number, learning rate, and a set of initial thetas.
n_iterations = 1000
eta = 0.01

theta = np.random.randn(b.shape[1]) # initial thetas, random draws from a standard normal distribution (this overwrites the true theta from In [2])
m = X.shape[0]                      # number of observations

# S3. Use a for-loop to accommodate the iterations

for iteration in range(n_iterations):
    # gradient of the mean squared error: (2/m) * b'(b * theta - y)
    gradients = 2/m * b.T.dot(b.dot(theta) - y)
    theta = theta - eta * gradients

# S4. Print results    
print(theta[-1])

3.0407622760646515
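
For reference, the gradient used in the loop can be derived from the mean squared error cost. The block below is a short sketch using the notebook's own symbols (`b` for the design matrix with the appended column of 1s, `theta`, `y`, `m`, and `eta`); the label MSE for the cost function is an assumption, since the notebook does not name it explicitly.

```latex
\begin{aligned}
\mathrm{MSE}(\theta) &= \frac{1}{m}\,\lVert b\theta - y \rVert^{2}
                      = \frac{1}{m}\,(b\theta - y)^{\top}(b\theta - y), \\
\nabla_{\theta}\,\mathrm{MSE}(\theta) &= \frac{2}{m}\, b^{\top}(b\theta - y), \\
\theta^{(t+1)} &= \theta^{(t)} - \eta\,\nabla_{\theta}\,\mathrm{MSE}\bigl(\theta^{(t)}\bigr).
\end{aligned}
```

The second line corresponds to `2/m * b.T.dot(b.dot(theta) - y)` and the third to `theta = theta - eta * gradients`.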
