# Preliminary work

- I found one of the datasets used in the paper
- I chose the one feature that seemed best suited to a polynomial fit (essentially what the paper did)
- Below is some preliminary work: standard linear regression, then polynomial linear regression, both using the existing `sklearn` package
- I report the mean squared error (MSE) for both fits


- We will be doing linear regression in this notebook: $y = X\beta + \epsilon$
- We will be choosing only a single regressor to predict the dependent variable
- We have chosen a dataset which requires polynomial linear regression
- As a result, the curve we're trying to fit is $y=\beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \epsilon$
- We therefore need to transform the $(N \times 1)$ matrix $X$ into an $(N \times 4)$ matrix whose $i$th row is $\begin{bmatrix} 1 & x_i & x_i^2 & x_i^3 \end{bmatrix}$
- After this transformation, we will first use the closed form solution for the Least Squares Estimator to fit the data: $\hat{\beta} = (X^T X)^{-1}X^T y$
    - Here $\hat{\beta} = \begin{bmatrix} \hat{\beta_0} \\ \hat{\beta_1} \\ \hat{\beta_2} \\ \hat{\beta_3} \end{bmatrix}$
- Then we move on to iterative gradient descent for least squares estimation of the same model. SGD is one of these algorithms!
    - Batch Gradient Descent, where we use the entire dataset: $\hat{\beta}_{k+1} = \hat{\beta}_{k} - \alpha X^T(\hat{y} - y)$
        - $\hat{\beta}_{k} = \begin{bmatrix} \hat{\beta}_{0,k} \\ \hat{\beta}_{1,k} \\ \hat{\beta}_{2,k} \\ \hat{\beta}_{3,k} \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}, \quad X = \begin{bmatrix} 1 & x_1 & x_1^2 & x_1^3 \\ 1 & x_2 & x_2^2 & x_2^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_N & x_N^2 & x_N^3 \end{bmatrix}$, and $\hat{y} = X\hat{\beta}_{k}$
    - Stochastic Gradient Descent, where we use just a single randomly selected sample: $\hat{\beta}_{k+1} = \hat{\beta}_{k} - \alpha(\hat{y}_i - y_i) x_i^T$
        - Now $\hat{y}_i$ and $y_i$ are both scalars, and $x_i$ is not a matrix but a vector: a single randomly selected row of $X$
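
As a short derivation of where these two updates come from (assuming the usual least squares loss with a factor of $\frac{1}{2}$, which the notes above do not state explicitly, and writing $x_i$ for the $i$th row of the transformed $X$):

$$
L(\beta) = \frac{1}{2}\lVert X\beta - y\rVert^2 = \frac{1}{2}\sum_{i=1}^{N}(x_i\beta - y_i)^2, \qquad \nabla L(\beta) = X^T(X\beta - y) = \sum_{i=1}^{N}(x_i\beta - y_i)\,x_i^T
$$

With $\hat{y} = X\hat{\beta}_k$, a step of size $\alpha$ against the full gradient gives the batch update, and keeping only one randomly chosen term of the sum gives the SGD update. Setting $\nabla L(\beta) = 0$ gives $X^TX\beta = X^Ty$, which is the closed form solution whenever $X^TX$ is invertible.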

In [1]:
import os
import pandas as pd
import openpyxl
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
import random
from multiprocessing import Process, Pool
import time
import workers
import SGD_Zinkevich
from datetime import datetime
import math
import numpy as np

In [None]:
dataframe = pd.read_excel('dataset1/dataset1.xlsx')
print(dataframe.shape)
dataframe.describe()

In [3]:
X = dataframe['V'].sort_values()
X = (X-X.mean()) / X.std()
y = dataframe['AT'][X.index].values
y = (y-y.mean()) / y.std()

X = np.reshape(X.values, (-1,1))

In [None]:
plt.scatter(X, y)
plt.xlabel("Exhaust Vacuum")
plt.ylabel("Average Temperature")
plt.show()

In [None]:
lr = LinearRegression()
lr.fit(X, y)
y_hat_sklearn = lr.predict(X)

pr = PolynomialFeatures(degree=3)
X_poly = pr.fit_transform(X)
lr_poly = LinearRegression()
lr_poly.fit(X_poly, y)
y_hat_poly_sklearn = lr_poly.predict(X_poly)


In [None]:
plt.scatter(X, y, color = 'blue')
plt.plot(X, y_hat_sklearn, color = 'firebrick')
plt.plot(X, y_hat_poly_sklearn, color = 'green')
plt.show()

In [None]:
y_hat_sklearn = lr.predict(X)
y_hat_poly_sklearn = lr_poly.predict(X_poly)

print("mean squared error for standard linear:", mean_squared_error(y, y_hat_sklearn))
print("mean squared error for linear polynomial:", mean_squared_error(y, y_hat_poly_sklearn))

# Going from built-in package to implementing it ourselves

- Now, using the same dataset, I will conduct linear regression again, but this time using matrix multiplication and numpy.
- I will implement a closed-form algorithm before moving on to a gradient-descent-based algorithm.
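
As a cross-check for the closed-form approach, here is a minimal sketch that builds the same design matrix with `np.vander` and solves the least squares problem with `np.linalg.lstsq`, which avoids forming $(X^TX)^{-1}$ explicitly. The helper name `fit_cubic_lstsq` and the synthetic data are assumptions made for illustration; they are not part of this notebook or its dataset.

```python
import numpy as np

def fit_cubic_lstsq(x, y):
    """Fit y ~ b0 + b1*x + b2*x^2 + b3*x^3 by least squares (hypothetical helper)."""
    # Columns are 1, x, x^2, x^3 (increasing powers), one row per sample
    X_poly = np.vander(x, N=4, increasing=True)
    # lstsq minimizes ||X_poly @ beta - y|| without inverting X^T X
    beta_hat, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
    return beta_hat

# Synthetic, noise-free data generated from known coefficients
x = np.linspace(-2.0, 2.0, 50)
true_beta = np.array([0.5, -1.0, 0.25, 2.0])
y = np.vander(x, N=4, increasing=True) @ true_beta

beta_hat = fit_cubic_lstsq(x, y)
print(np.round(beta_hat, 6))
```

On noise-free data the recovered weights should match `true_beta` up to floating-point error, which makes this a convenient sanity check for a hand-written solver.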

In [6]:
def polynomial_basis_function_transformation(X, h):
    '''
    Converts an (N * 1) matrix into an (N * h) matrix, where h is the number of
    basis functions (1, x, x^2, ..., x^(h-1))
    The degree of the polynomial is (h-1)
    '''
    # Broadcasting (N, 1) against (h,) raises every x_i to the powers 0..h-1
    powers = np.arange(h)
    X_poly = np.power(X, powers)
    return X_poly

def lin_reg_poly_closed_form(X, y, h):
    '''
    Conducts linear regression after transforming the data with polynomial basis functions
    Takes in an (N * 1) matrix and converts it into an (N * h) matrix
    Performs linear regression on the (N * h) matrix, resulting in h weights (betas)
    Returns the predictions only
    '''
    X_poly = polynomial_basis_function_transformation(X, h)
    # Least squares estimate (X^T X)^{-1} X^T y; pinv also copes with a singular X^T X
    beta_hat_poly = np.linalg.pinv(X_poly.T @ X_poly) @ X_poly.T @ y
    y_hat_poly = X_poly @ beta_hat_poly
    return y_hat_poly


In [None]:
y_hat_poly = lin_reg_poly_closed_form(X, y, 4)

In [None]:
plt.scatter(X, y, color = 'blue')
plt.plot(X, y_hat_sklearn, color = 'firebrick')
plt.plot(X, y_hat_poly, color = 'green')
plt.show()

print("mean squared error for linear polynomial through numpy (closed form):", mean_squared_error(y, y_hat_poly))

# Implementing the Batch Gradient Descent algorithm for linear regression

- We have implemented the closed form solution for polynomial linear regression ourselves above, moving away from sklearn
- We now implement an iterative algorithm, which is useful when the closed form solution is computationally prohibitive: for example, when $X^TX$ is $10{,}000 \times 10{,}000$, inverting it takes extremely long (in the case above it is only $4 \times 4$)
- We will first implement Batch Gradient Descent and parallelize it, then move on to Stochastic Gradient Descent and parallelize that as well
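
One practical point for the iterative approach is the step size $\alpha$. For a least squares objective, a standard result is that the batch update converges only when $0 < \alpha < 2/\lambda_{\max}(X^TX)$. The sketch below computes that bound on synthetic data and runs batch gradient descent on either side of it. The helper names `max_stable_step` and `batch_gd`, the zero initialization, and the data are assumptions for illustration only.

```python
import numpy as np

def max_stable_step(X_poly):
    """Upper bound 2 / lambda_max(X^T X) on the step size (hypothetical helper)."""
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order
    lam_max = np.linalg.eigvalsh(X_poly.T @ X_poly)[-1]
    return 2.0 / lam_max

def batch_gd(X_poly, y, alpha, n_iter):
    """Plain batch gradient descent, starting from zero weights."""
    beta = np.zeros(X_poly.shape[1])
    for _ in range(n_iter):
        beta = beta - alpha * (X_poly.T @ (X_poly @ beta - y))
    return beta

# Synthetic, noise-free cubic data
rng = np.random.default_rng(0)
x = rng.standard_normal(200)
X_poly = np.vander(x, N=4, increasing=True)
true_beta = np.array([0.5, -1.0, 0.25, 2.0])
y = X_poly @ true_beta

bound = max_stable_step(X_poly)
beta_ok = batch_gd(X_poly, y, alpha=0.5 * bound, n_iter=20000)   # below the bound
beta_bad = batch_gd(X_poly, y, alpha=1.1 * bound, n_iter=200)    # above the bound
print("error below bound:", np.linalg.norm(beta_ok - true_beta))
print("error above bound:", np.linalg.norm(beta_bad - true_beta))
```

The first run should settle on the true weights, while the second should blow up. If a run on the real dataset diverges, checking $\alpha$ against this bound is a reasonable first step.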

## Non Parallelized version

In [11]:
def lin_reg_poly_bgd(X, y, h, alpha, n):
    '''
    Conducts linear regression after transforming the data with polynomial basis functions
    Takes in an (N * 1) matrix and converts it into an (N * h) matrix
    Performs linear regression on the (N * h) matrix, resulting in h weights (betas)
    This time linear regression is conducted through iterative batch gradient descent
    The MSE is printed every 100,000 iterations
    Returns the predictions only
    '''
    X_poly = polynomial_basis_function_transformation(X, h)
    beta_hat_poly = np.random.rand(h)  # random initial weights in [0, 1)
    for i in range(n):
        y_hat_poly = X_poly @ beta_hat_poly
        if i % 100000 == 0:
            print("MSE in iteration", i, ": ", mean_squared_error(y, y_hat_poly))
        # Full-batch step: beta <- beta - alpha * X^T (y_hat - y)
        beta_hat_poly = beta_hat_poly - alpha * (X_poly.T @ (y_hat_poly - y))
    # Predict with the final weights (the loop's y_hat_poly is one update behind)
    return X_poly @ beta_hat_poly
    

In [None]:
y_hat_poly_bgd = lin_reg_poly_bgd(X, y, 4, 0.00001, 500000)

In [None]:
plt.scatter(X, y, color = 'blue')
plt.plot(X, y_hat_sklearn, color = 'firebrick')
plt.plot(X, y_hat_poly_bgd, color = 'green')
plt.show()

print("mean squared error for linear polynomial through numpy (gradient descent):", mean_squared_error(y, y_hat_poly_bgd))

## Parallelized version
- We now implement the parallelized version of Batch Gradient Descent
- We expect Batch Gradient Descent to benefit from parallelization, because the gradient $X^T(\hat{y} - y)$ is a sum over samples that can be split across workers
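
A minimal sketch of why the batch gradient parallelizes: splitting the rows into chunks and summing the per-chunk gradients reproduces the full gradient. The chunks are processed one after another here, standing in for workers. The helper `chunk_gradient`, the worker count, and the random data are assumptions for illustration, not this notebook's implementation.

```python
import numpy as np

def chunk_gradient(X_chunk, y_chunk, beta):
    """Gradient contribution X_c^T (X_c beta - y_c) from one chunk of rows."""
    return X_chunk.T @ (X_chunk @ beta - y_chunk)

# Random stand-in data with the same (N * 4) shape as the cubic design matrix
rng = np.random.default_rng(0)
X_poly = np.vander(rng.standard_normal(1000), N=4, increasing=True)
y = rng.standard_normal(1000)
beta = rng.standard_normal(4)

n_workers = 4  # hypothetical worker count
X_chunks = np.array_split(X_poly, n_workers)
y_chunks = np.array_split(y, n_workers)

# Each term could be computed by a separate worker; here they run sequentially
partial = [chunk_gradient(Xc, yc, beta) for Xc, yc in zip(X_chunks, y_chunks)]
grad_from_chunks = np.sum(partial, axis=0)
grad_full = X_poly.T @ (X_poly @ beta - y)
print(np.allclose(grad_from_chunks, grad_full))
```

With real worker processes, the chunks would need to be sent to the workers on every iteration (or kept resident there), so for a matrix with only 4 columns the communication cost may well outweigh the gain.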

# Stochastic Gradient Descent
- We will now implement the non-parallelized version of SGD

## Non-Parallelized Version

In [None]:
def lin_reg_poly_sgd(X, y, h, alpha, n):
    '''
    Conducts linear regression after transforming the data with polynomial basis functions
    Takes in an (N * 1) matrix and converts it into an (N * h) matrix
    Performs linear regression on the (N * h) matrix, resulting in h weights (betas)
    This time linear regression is conducted through iterative gradient descent
    Specifically stochastic gradient descent, where each step uses a single sample from the dataset
    The MSE is printed every 100,000 iterations
    Returns the predictions only
    '''
    X_poly = polynomial_basis_function_transformation(X, h)
    beta_hat_poly = np.random.rand(h)
    for i in range(n):
        # Pick one row uniformly at random
        idx = np.random.randint(0, X_poly.shape[0])
        X_sample = X_poly[idx, :]
        y_sample = y[idx]
        y_hat_sample_poly = X_sample @ beta_hat_poly
        # Single-sample step: beta <- beta - alpha * (y_hat_i - y_i) * x_i
        beta_hat_poly = beta_hat_poly - alpha * (X_sample * (y_hat_sample_poly - y_sample))

        # Full-dataset predictions are costly, so compute them only when reporting
        if i % 100000 == 0:
            y_hat_poly = X_poly @ beta_hat_poly
            print("MSE in iteration", i, ": ", mean_squared_error(y, y_hat_poly))
    return X_poly @ beta_hat_poly

In [None]:
start = datetime.now()
lin_reg_poly_sgd(X, y, 4, 0.00001, 50000000)
diff = datetime.now() - start
print(diff)

## Example of the Multiprocessing library
- Here is a very simple example of how to use multiprocessing
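- Note that, when launching from a notebook, the function which the worker processes run in parallel HAS TO BE IN ANOTHER FILE: with the `spawn` start method (the default on Windows and macOS), the child process must import the target function, and functions defined in a notebook cell typically cannot be imported
- This is why workers.py and SGD_Zinkevich.py exist
- Try changing workers.f and workers.f2 to the f and f2 defined in this notebook, and you will see what I mean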
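
Whether the separate file is strictly required depends on how child processes are started on your platform. As a small check, using only the standard library, the sketch below reports the start method in use. The variable names are assumptions for illustration.

```python
import multiprocessing as mp

# Reports how child processes are created on this platform
start_method = mp.get_start_method()

# With "spawn" or "forkserver" the child has to import the target function by name,
# so it must live in an importable module rather than a notebook cell
needs_importable_target = start_method in ("spawn", "forkserver")
print(start_method, needs_importable_target)
```

Keeping the worker functions in their own module, as this notebook does, works under every start method, so it is the safer default.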

In [None]:
def info(title):
    print(title)
    print('module name:', __name__)
    print('parent process:', os.getppid())
    print('process id:', os.getpid())
    
    
def f(name):
    print("hello bob1")
    info('function f')
    print()
    
def f2(name):
    time.sleep(5)
    print("hello bob2")
    info('function f2')
    print()

workers.info('main line')
p1 = Process(target=workers.f, args=('bob1',))
p2 = Process(target=workers.f2, args=('bob2',))

p2.start()
p1.start()
p2.join()
p1.join()

## Parallelization 1
- Implementing SGD in parallel according to the paper "Parallelized Stochastic Gradient Descent" published by Zinkevich et al.
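
A minimal sketch of the averaging idea as this notebook applies it: several independent SGD runs over the full dataset, followed by a mean over the resulting weight vectors. The runs execute one after another here, standing in for worker processes. The function `sgd_run`, the seeds, the step size, and the synthetic data are all assumptions for illustration and do not reproduce the contents of SGD_Zinkevich.py. Each run gets its own seed, since forked workers that inherit the same global NumPy random state can end up drawing identical sample sequences.

```python
import numpy as np

def sgd_run(X_poly, y, alpha, n_iter, seed):
    """One single-sample SGD run that returns the weights (hypothetical worker)."""
    rng = np.random.default_rng(seed)  # independent random stream per run
    beta = np.zeros(X_poly.shape[1])
    for _ in range(n_iter):
        i = rng.integers(X_poly.shape[0])  # one row chosen uniformly at random
        beta = beta - alpha * (X_poly[i] @ beta - y[i]) * X_poly[i]
    return beta

# Synthetic cubic data with a little noise
data_rng = np.random.default_rng(0)
x = data_rng.uniform(-1.0, 1.0, size=500)
X_poly = np.vander(x, N=4, increasing=True)
true_beta = np.array([0.5, -1.0, 0.25, 2.0])
y = X_poly @ true_beta + 0.05 * data_rng.standard_normal(500)

t = 4  # number of stand-in workers
betas = np.array([sgd_run(X_poly, y, alpha=0.05, n_iter=20000, seed=s) for s in range(t)])
beta_avg = betas.mean(axis=0)  # aggregate the runs by averaging their weights
print("averaged weights:", np.round(beta_avg, 3))
```

Swapping the list comprehension for `Pool.starmap` would make the runs parallel, provided `sgd_run` lives in an importable module.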


In [4]:
def lin_reg_poly_sgd_parallel(X, y, h, alpha, n, t):
    '''
    Conducts linear regression after transforming the data with polynomial basis functions
    Takes in an (N * 1) matrix and converts it into an (N * h) matrix
    Performs linear regression on the (N * h) matrix, resulting in h weights (betas)
    This time linear regression is conducted through iterative gradient descent
    Specifically stochastic gradient descent, where each step uses a single sample from the dataset

    This time t worker processes conduct SGD in parallel on the entire dataset.
    Each process does it for n iterations.

    After each process has returned its estimate beta_hat, we average them to get the final beta_hat

    Returns the predictions only
    '''
    start = datetime.now()
    X_poly = polynomial_basis_function_transformation(X, h)
    
    curr1 = datetime.now()
    diff1 = curr1 - start
    
    with Pool(processes=t) as p:
        
        curr2 = datetime.now()
        diff2 = curr2 - curr1
        
        outputs = p.starmap(SGD_Zinkevich.lin_reg_poly_sgd, [[X_poly, y, h, alpha, n]] * t)
        outputs = np.array(outputs)
        
        curr3 = datetime.now()
        diff3 = curr3 - curr2
        
    time.sleep(10)
    
    print(outputs.shape)
    print(outputs)
    print(diff1)
    print(diff2)
    print(diff3)
    
    beta_hat_poly = np.sum(outputs, axis=0) / t
    print(beta_hat_poly.shape)
    y_hat_poly = X_poly @ beta_hat_poly
    print("mean squared error for linear polynomial through SGD (parallel):", mean_squared_error(y, y_hat_poly))
    return y_hat_poly

In [None]:
lin_reg_poly_sgd_parallel(X, y, 4, 0.00001, 500000, 6)

## Future work
- Comment regularly
- Branch out in git while doing own work and merge back in
- Regularly explain linear algebra/parallelisation algorithm in markdown above cells as he will run code