# CSCI-P556
# Assignment 1
# Due date: Friday, February 15, 11:59PM

In [1]:
import numpy as np
import pandas as pd
import sklearn as sk
import time

diab = pd.read_csv("diabetes.txt", sep=r"\s+")  # whitespace-delimited file

## Question 1 (10 points)

Implement linear regression using ordinary least squares (closed-form solution)



In [2]:
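def linear_regression_ols(X, y):
    # Make sure that you return the weights in a np.array,
    # other data types will cause our grading script to crash
    X = np.asarray(X, dtype=float)  # also converts np.matrix input to a plain array
    y = np.asarray(y, dtype=float)
    # Normal equation: theta = (X^T X)^-1 X^T y
    X_transpose = X.T
    temp = np.dot(X_transpose, X)
    temp = np.linalg.inv(temp)
    temp = np.dot(temp, X_transpose)
    ols = np.dot(temp, y)
    return ols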
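
Inverting `X.T @ X` explicitly works for a small problem like this one, but it can be numerically fragile when columns are nearly collinear. A minimal sketch of an alternative, assuming the same layout as above (bias column first, `y` as a column vector): `np.linalg.lstsq` solves the same least-squares problem without forming the inverse. The function name `linear_regression_lstsq` and the synthetic data are illustrative only and are not part of the assignment.

```python
import numpy as np

def linear_regression_lstsq(X, y):
    """Least-squares weights without an explicit inverse (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # lstsq returns (solution, residuals, rank, singular values); keep the solution
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

# Synthetic check: y is an exact linear function of X, so the weights are recovered.
rng = np.random.default_rng(0)
X_demo = np.insert(rng.normal(size=(50, 3)), 0, 1.0, axis=1)  # bias column first
true_theta = np.array([[2.0], [-1.0], [0.5], [3.0]])
y_demo = X_demo @ true_theta
theta_demo = linear_regression_lstsq(X_demo, y_demo)
print(theta_demo)
```

The result is a plain `np.ndarray` with one row per column of `X`, the same shape the OLS function returns.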

## Question 2 (40 points)

Implement linear regression using gradient descent. A boolean parameter named *regularization* has been included in the function definition. If regularization=True, then the linear regression will be computed using L2 regularization.

![L2 regularization](https://cdn-images-1.medium.com/max/1200/1*jgWOhDiGjVp-NCSPa5abmg.png).
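
The update rule that an L2 penalty leads to can be written out explicitly. This is a sketch under stated assumptions: the cost uses the common $\frac{1}{2m}$ scaling for both the squared error and the penalty (the linked image may use a different scaling), every component of $\theta$ including the bias is penalized, $\alpha$ denotes the learning rate, $\lambda$ the regularization strength, and $m$ the number of training examples.

$$
\begin{aligned}
J(\theta) &= \frac{1}{2m}\,\lVert X\theta - y\rVert^2 + \frac{\lambda}{2m}\,\lVert\theta\rVert^2 \\
\nabla J(\theta) &= \frac{1}{m}\,X^\top (X\theta - y) + \frac{\lambda}{m}\,\theta \\
\theta &\leftarrow \theta - \alpha\,\nabla J(\theta)
       = \theta\left(1 - \frac{\alpha\lambda}{m}\right) - \frac{\alpha}{m}\,X^\top (X\theta - y)
\end{aligned}
$$

Under these assumptions, regularization only changes the update by shrinking $\theta$ by the factor $1 - \alpha\lambda/m$ before the usual gradient step; with $\lambda = 0$ the rule reduces to plain gradient descent.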

In [3]:
def linear_regression_gd(X, y, theta,
                         learning_rate,
                         num_iters,
                         regularization=False):
    # Make sure that you return the weights in a np.array,
    # other data types will cause our grading script to crash
    lamda = 3  # L2 regularization strength (fixed value)
    m = X.shape[0]  # number of training examples
    for _ in range(num_iters):
        h_theta = np.dot(X, theta)  # predictions, shape (m, 1)
        # Gradient of the mean squared error with respect to theta, shape (1, n)
        gradient = np.dot((h_theta - y).T, X) / m
        if regularization:
            # L2 penalty: shrink every weight (bias included) before the gradient step
            reg_factor = 1 - (learning_rate * lamda) / m
            theta = (theta * reg_factor) - (learning_rate * gradient.T)
        else:
            theta = theta - (learning_rate * gradient.T)
    return np.asarray(theta)  # plain np.array even if X was passed as np.matrix
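
One way to sanity-check a gradient descent implementation is to compare it with a closed-form least-squares solution on data where convergence is easy. The sketch below is self-contained and uses a hypothetical helper `gradient_descent` (not the graded function) with the same update rule, synthetic data, and an illustrative `lam` parameter in place of the fixed value above.

```python
import numpy as np

def gradient_descent(X, y, learning_rate, num_iters, lam=0.0):
    """Batch gradient descent for linear regression with an optional L2 penalty."""
    m, n = X.shape
    theta = np.zeros((n, 1))
    for _ in range(num_iters):
        gradient = X.T @ (X @ theta - y) / m  # gradient of the mean squared error
        # lam = 0 gives the plain update; lam > 0 shrinks theta before the step
        theta = theta * (1 - learning_rate * lam / m) - learning_rate * gradient
    return theta

rng = np.random.default_rng(1)
X_demo = np.insert(rng.normal(size=(200, 2)), 0, 1.0, axis=1)  # bias column first
y_demo = X_demo @ np.array([[1.0], [2.0], [-3.0]]) + 0.1 * rng.normal(size=(200, 1))

theta_gd = gradient_descent(X_demo, y_demo, learning_rate=0.1, num_iters=2000)
theta_l2 = gradient_descent(X_demo, y_demo, learning_rate=0.1, num_iters=2000, lam=3.0)
theta_ols, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print("max |GD - OLS|:", np.abs(theta_gd - theta_ols).max())
print("norm without / with L2:", np.linalg.norm(theta_gd), np.linalg.norm(theta_l2))
```

On well-scaled data like this, the unregularized weights should agree with least squares, and the L2 run should end with a slightly smaller weight norm.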
    

## Question 3 (20 points)

- Apply your linear regression OLS, gradient descent without regularization, and gradient descent with regularization functions to [Sci-Kit Learn's diabetes dataset](https://www.programcreek.com/python/example/85913/sklearn.datasets.load_diabetes). Additionally, apply [Sci-Kit Learn's Linear Regression model](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
- Calculate the amount of time it took for each of the functions to execute with the code that we have included.

In [4]:
from sklearn.datasets import load_diabetes

In [5]:
# load variables in this area
y = np.array(diab['Y'], dtype=float).reshape(-1, 1)  # target as a column vector
X = diab.drop(columns='Y').to_numpy(dtype=float)  # features only; plain array, not np.matrix
X = np.insert(X, 0, 1.0, axis=1)  # insert bias term as the first column
original_x = X
theta = np.zeros((X.shape[1], 1))  # initial weights, one per column of X

start_time = time.process_time()
# Do OLS here
print("Linear Regression using OLS: ",linear_regression_ols(X,y))
end_time = time.process_time()
print("Total execution time for OLS: " + str(end_time-start_time))


start_time = time.process_time()
# Do GD without regularization here
print("Linear Regression using Gradient Descent without regularization: ",linear_regression_gd(X, y,theta, learning_rate = 0.000026, num_iters = 500000, regularization = False))
end_time = time.process_time()
print("Total execution time for gradient descent without regularization: " + str(end_time-start_time))


start_time = time.process_time()
# Do GD with regularization here
print("Linear Regression using Gradient Descent with regularization: ",linear_regression_gd(X, y,theta, learning_rate = 0.000026, num_iters = 500000, regularization = True))
end_time = time.process_time()
print("Total execution time for gradient descent with regularization: " + str(end_time-start_time))

start_time = time.process_time()
# Do SK's linear regression here
from sklearn import linear_model
l_regr = linear_model.LinearRegression()
l_regr.fit(X,y)
print("Linear Regression using sklearn: ",l_regr.coef_)
end_time = time.process_time()
print("Total execution time for Sci-Kit Learn's Linear Regression: " + str(end_time-start_time))

Linear Regression using OLS:  [[-3.34567139e+02]
 [-3.63612242e-02]
 [-2.28596481e+01]
 [ 5.60296209e+00]
 [ 1.11680799e+00]
 [-1.08999633e+00]
 [ 7.46450456e-01]
 [ 3.72004715e-01]
 [ 6.53383194e+00]
 [ 6.84831250e+01]
 [ 2.80116989e-01]]
Total execution time for OLS: 0.015625
Linear Regression using Gradient Descent without regularization:  [[-6.24569101e+00]
 [ 1.71076672e-02]
 [-2.37808723e+01]
 [ 5.49302209e+00]
 [ 1.02567573e+00]
 [ 1.33546739e+00]
 [-1.36701744e+00]
 [-3.02589875e+00]
 [-4.71242711e+00]
 [ 2.52093522e+00]
 [ 1.56491118e-01]]
Total execution time for gradient descent without regularization: 71.3125
Linear Regression using Gradient Descent with regularization:  [[-6.00137266e+00]
 [ 1.50867474e-02]
 [-2.31815840e+01]
 [ 5.49706904e+00]
 [ 1.02049071e+00]
 [ 1.34227351e+00]
 [-1.37505862e+00]
 [-3.02479745e+00]
 [-4.67409381e+00]
 [ 2.35663528e+00]
 [ 1.51620552e-01]]
Total execution time for gradient descent with regularization: 79.15625
Linear Regression using sk
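
The timings above include the cost of formatting and printing the weights, because each `print` wraps the function call. Values such as 0.015625 and 0.0 also suggest that the `time.process_time()` clock is fairly coarse on the machine used. A minimal sketch of one way to time only the computation, using a hypothetical helper `timed` and a stand-in workload rather than the regression functions:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed CPU seconds). Illustrative helper."""
    start = time.process_time()
    result = fn(*args, **kwargs)
    elapsed = time.process_time() - start
    return result, elapsed

# Stand-in workload; in the notebook this could be one of the regression calls.
total, seconds = timed(sum, range(1_000_000))
print("Result:", total)
print("Total execution time: " + str(seconds))
```

`time.perf_counter()` could be swapped in if wall-clock time with a finer resolution is preferred over CPU time.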

## Question 4 (20 points)

Normalize the appropriate variables in the dataset and re-do Question 3 using this dataset.

In [6]:
# Normalize variables here
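features = X[:, 1:]  # skip the bias column: its std is 0, so scaling it would divide by zero
mu = np.mean(features, axis = 0)
sigma = np.std(features, axis = 0)
X = (features - mu)/sigma  # standardize each feature to zero mean and unit variance
X = np.insert(X, 0, np.array([1]).T, axis = 1)  # re-insert the bias term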
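
To see why this rescaling matters for gradient descent, it can help to look at the condition number of `X.T @ X`, which grows when features live on very different scales. The sketch below uses synthetic data and two hypothetical helpers, `standardize` and `design_condition_number`; it is an illustration under those assumptions, not a measurement on the diabetes data.

```python
import numpy as np

def standardize(features):
    """Z-score each column; constant columns are only centered (illustrative)."""
    features = np.asarray(features, dtype=float)
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # guard against division by zero
    return (features - mu) / sigma, mu, sigma

def design_condition_number(features):
    """Condition number of X^T X after adding a bias column."""
    X = np.insert(features, 0, 1.0, axis=1)
    return np.linalg.cond(X.T @ X)

rng = np.random.default_rng(2)
# Two synthetic features on very different scales
raw = np.column_stack([rng.normal(50, 10, size=300),
                       rng.normal(0.5, 0.01, size=300)])
scaled, mu, sigma = standardize(raw)

cond_raw = design_condition_number(raw)
cond_scaled = design_condition_number(scaled)
print("condition number, raw features:   ", cond_raw)
print("condition number, scaled features:", cond_scaled)
```

A smaller condition number means the cost surface is less elongated, which is what allows a larger learning rate after normalization.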

  


In [7]:
# Make sure to use the normalized variables here, not the original ones.

start_time = time.process_time()
# Do OLS here
print("Linear Regression using OLS: ",linear_regression_ols(X,y))
end_time = time.process_time()
print("Total execution time for OLS: " + str(end_time-start_time))


start_time = time.process_time()
# Do GD without regularization here
print("Linear Regression using Gradient Descent without regularization: ",linear_regression_gd(X, y,theta, learning_rate = 0.1, num_iters = 100000, regularization = False))
end_time = time.process_time()
print("Total execution time for gradient descent without regularization: " + str(end_time-start_time))


start_time = time.process_time()
# Do GD with regularization here
print("Linear Regression using Gradient Descent with regularization: ",linear_regression_gd(X, y,theta, learning_rate = 0.1, num_iters = 100000, regularization = True))
end_time = time.process_time()
print("Total execution time for gradient descent with regularization: " + str(end_time-start_time))


start_time = time.process_time()
# Do SK's linear regression here
l_regr = linear_model.LinearRegression()
l_regr.fit(X,y)
print("Linear Regression using sklearn: ",l_regr.coef_)
end_time = time.process_time()
print("Total execution time for Sci-Kit Learn's Linear Regression: " + str(end_time-start_time))

Linear Regression using OLS:  [[152.13348416]
 [ -0.47612079]
 [-11.40686692]
 [ 24.72654886]
 [ 15.42940413]
 [-37.67995261]
 [ 22.67616277]
 [  4.80613814]
 [  8.42203936]
 [ 35.73444577]
 [  3.21667372]]
Total execution time for OLS: 0.0
Linear Regression using Gradient Descent without regularization:  [[152.13348416]
 [ -0.47612079]
 [-11.40686692]
 [ 24.72654886]
 [ 15.42940413]
 [-37.67995261]
 [ 22.67616277]
 [  4.80613814]
 [  8.42203936]
 [ 35.73444577]
 [  3.21667372]]
Total execution time for gradient descent without regularization: 15.90625
Linear Regression using Gradient Descent with regularization:  [[151.10786517]
 [ -0.37238047]
 [-11.22224526]
 [ 24.78272579]
 [ 15.29197839]
 [-21.61503683]
 [  9.93607156]
 [ -2.23138687]
 [  6.5650626 ]
 [ 29.56618188]
 [  3.34055771]]
Total execution time for gradient descent with regularization: 15.90625
Linear Regression using sklearn:  [[  0.          -0.47612079 -11.40686692  24.72654886  15.42940413
  -37.67995261  22.67616277 

## Question 5 (10 points, 2 points per question)

1. Did you notice any difference between the normalized and non-normalized versions in questions 3 and 4? Explain your answer.
2. Which of the linear regressions is faster? Why is it faster?
3. Why don't we train all the machine learning models using that technique?
4. Describe in your own words at least two regularization methods.
5. What would happen if you use a regularization parameter value that is too low or too high?

Write your answers here:

1. I normalized the dataset by standardizing each feature (subtracting its mean and dividing by its standard deviation), so all the features are on the same scale. With the features on the same scale, the contours of the cost function around the minimum are closer to circles, and gradient descent converged much sooner than on the non-normalized version. Another difference I found was that in this specific dataset the ranges of the features vary widely, so on the non-normalized data the features with large values dominated the gradient and the weights blew up unless the step was tiny. To stop the weights from diverging, I had to use a very low learning rate (0.000026) and a high number of iterations (500,000), and even then the weights had not reached the OLS solution. With normalized data, gradient descent converged with a much higher learning rate (0.1) and fewer iterations (100,000), and the weights without regularization matched the OLS weights.

2. Sklearn's implementation is faster than all three of my implementations. Leaving sklearn aside, OLS (the normal equation) is faster than gradient descent, both with and without regularization: on the non-normalized data OLS took about 0.016 seconds, while gradient descent took more than 70 seconds. OLS is faster because it computes the weights in a single step from a few matrix operations (a transpose, matrix products, and one inverse) and does not iterate the way gradient descent does.

3. The diabetes dataset has only ten features, so OLS was faster than the other techniques. As the number of features gets significantly higher, OLS becomes very slow because it inverts an n × n matrix, which costs roughly O(n^3) in the number of features n; in that case gradient descent is the better option. OLS also cannot be used as written if np.dot(X.T, X) is non-invertible (a singular or degenerate matrix), because the method relies on that inverse.

4. Two well-known regularization methods are L1 and L2 regularization (used in Lasso and Ridge regression, respectively). Regularization is the process of penalizing large weights so that the model does not overfit the data. Penalizing the weights reduces the influence of individual terms, such as higher-order polynomial terms, in the prediction equation. In Lasso regression the penalty is the sum of the absolute values of the weights, and in Ridge regression the penalty is the sum of the squares of the weights.

5. If the lambda value is too high, it may smooth out the function too much and cause underfitting; if the lambda value is too small or close to zero, regularization will have almost no effect and overfitting may remain.
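
The effect of the regularization strength described in answers 4 and 5 can be illustrated with the closed-form Ridge solution. This is a minimal sketch on synthetic data, assuming the penalty is applied to every weight including the bias; the helper `ridge_closed_form` and the chosen lambda values are illustrative only.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) theta = X^T y; every weight is penalized."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

rng = np.random.default_rng(3)
X_demo = np.insert(rng.normal(size=(100, 4)), 0, 1.0, axis=1)  # bias column first
true_theta = np.array([[5.0], [1.0], [-2.0], [0.0], [3.0]])
y_demo = X_demo @ true_theta + rng.normal(size=(100, 1))

lambdas = [0.0, 1.0, 100.0, 10000.0]
norms = [float(np.linalg.norm(ridge_closed_form(X_demo, y_demo, lam)))
         for lam in lambdas]
for lam, norm in zip(lambdas, norms):
    print(f"lambda={lam:>8}: ||theta|| = {norm:.4f}")
```

With lambda at zero the result is the ordinary least-squares fit; as lambda grows, the weights are pulled toward zero, which is the underfitting regime described in answer 5.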