# Lab 03 - Regression
## Tasks
- Explore linear regression and overfitting
- Experiment with stochastic gradient descent

# Set up environment

In [None]:
!pip install git+https://github.com/uspas/optimization_and_ml --quiet

In [None]:
%reset -f

import os
import numpy as np
import matplotlib.pyplot as plt

#matplotlib graphs will be included in your notebook, next to the code:
%matplotlib inline

#import toy accelerator package
from uspas_ml.accelerator_toy_models import high_dimensional_data_generator

#import sklearn modules
from sklearn import linear_model
from sklearn.model_selection import train_test_split

#import pytorch
import torch

## Linear regression
We start with simple least squares linear regression. Our model is defined as follows:

$
f(\mathbf{x}) = \mathbf{x}^T\mathbf{w} + \mathbf{b}
$

Our objective is to determine the weights $\mathbf{w}$ and the biases $\mathbf{b}$ for a set of input variables (also referred to as "features") to best model the observed values $y = f(\mathbf{x}) + \epsilon$ where $\epsilon$ represents noise (usually Gaussian noise).

We start by constructing a training set and a test set. We will train the model on the training set and then evaluate the model's ability to correctly predict values from the test set. This (hopefully) allows us to evaluate whether the model generalizes well to data outside the training set.

One problem we face is overfitting (see https://en.wikipedia.org/wiki/Overfitting for more details). Here we will try to demonstrate methods for avoiding this problem.
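
To make the symptom concrete, here is a minimal sketch on synthetic data (not the lab dataset; the sizes, weights, and variable names are assumptions chosen for illustration). With more features than training samples, ordinary least squares can fit the training set almost perfectly while scoring much worse on held-out data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): 60 samples, 50 features,
# but only the first 3 features actually influence the target.
n_samples, n_features = 60, 50
X_demo = rng.normal(size=(n_samples, n_features))
w_demo = np.zeros(n_features)
w_demo[:3] = [2.0, -1.0, 0.5]
y_demo = X_demo @ w_demo + rng.normal(scale=1.0, size=n_samples)

# 42 training samples, 18 test samples
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Unregularized least squares has enough freedom to fit the training noise
demo_reg = LinearRegression().fit(Xtr, ytr)
r2_train = demo_reg.score(Xtr, ytr)
r2_test = demo_reg.score(Xte, yte)
print(f"train R^2: {r2_train:.3f}, test R^2: {r2_test:.3f}")
```

A large gap between the training and test scores is the usual warning sign that a model is overfitting.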

In [None]:
#load dataset - last column is f
from uspas_ml.accelerator_toy_models import data_dir
data = np.load(os.path.join(data_dir, 'complex_dataset.npy'))
true_weights = np.load(os.path.join(data_dir, 'complex_dataset_weights.npy'))

print(data.shape)
x = data[:,:-1]
f = data[:,-1]

#split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(x, f, test_size=0.3, random_state=3)
print(X_train.shape)

In [None]:
#fit data with a linear model
reg = linear_model.LinearRegression()
reg.fit(X_train,y_train)
reg.coef_

#get training and test scores - score() returns the coefficient of determination (R^2)
train_score = reg.score(X_train, y_train)
test_score = reg.score(X_test, y_test)
print(train_score)
print(test_score)

fig,ax = plt.subplots()
ax.hist(reg.coef_, range = [-6,6]);

<div class="alert alert-block alert-info">
    
**Task:** 
    Refit the data using Ridge and Lasso regression with alpha = 0.01 and alpha = 100, and plot a histogram of the weights for each (hint: use scikit-learn). Also plot a histogram of `true_weights` for comparison, and calculate the training and test scores for each fit. Which method is best for this data set?
    
</div>

In [None]:
#Ridge regression

In [None]:
#Lasso regression

# Using SGD

## Create some data

In [None]:
# Vector of 100 random numbers between zero and one; multiply all by 2.
X = 2*np.random.rand(100,1)

# Y = 4 + 3*X plus unit-variance Gaussian noise
Y = 4 + 3*X + np.random.randn(100,1)

plt.scatter(X, Y, label='gaussian noise abt. y=3x+4')
print(X.shape, Y.shape)
x = torch.from_numpy(X).float()
f = torch.from_numpy(Y).float()


#add data to dataset objects, which allows us to easily split data into train/test sets and shuffle the data as needed
train_fraction = 0.7
n_train = int(train_fraction*x.shape[0])
n_test = x.shape[0] - n_train
train_data, test_data = torch.utils.data.random_split(torch.utils.data.TensorDataset(x, f), [n_train, n_test])

## Define a model to train

In [None]:
# Define a model class that we will train - represents a linear combination
class LinearModel(torch.nn.Module):
    def __init__(self, n_features):
        super(LinearModel, self).__init__()
        self.linear = torch.nn.Linear(n_features, 1)
        
    def forward(self, x):
        return self.linear(x)

# list trainable parameters for a linear model with 2 inputs
lmodel = LinearModel(2)
for name, item in lmodel.named_parameters():
    print(f'{name}:{item}')


### Modifications to Linear Regression
The analytical least squares solution for the weights involves a matrix inversion, which becomes impractical when the problem is large. The least squares loss function is defined as

$
    L(\mathbf{w}) = ||X\mathbf{w} - Y||^2
$

where $X$, $Y$ are matrices with our input and output data respectively. To minimize the loss we compute the gradient

$
    \frac{dL}{d\mathbf{w}} = -2X^TY + 2X^TX\mathbf{w} 
$

and set it to zero, which yields the weights that minimize the loss:

$
    \mathbf{w} = (X^TX)^{-1}X^TY
$
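
For a small problem, the closed-form solution is easy to check numerically. The sketch below uses NumPy on hypothetical synthetic data (the sizes, weights, and variable names are assumptions, and the bias is omitted for brevity). It solves the linear system $(X^TX)\mathbf{w} = X^TY$ rather than forming the inverse explicitly, then confirms that the gradient vanishes at the result.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical small problem: 200 samples, 3 features, known weights
X_small = rng.normal(size=(200, 3))
w_true = np.array([1.5, -2.0, 0.7])
Y_small = X_small @ w_true + rng.normal(scale=0.1, size=200)

# Solve (X^T X) w = X^T Y; np.linalg.solve avoids computing the inverse directly
w_hat = np.linalg.solve(X_small.T @ X_small, X_small.T @ Y_small)

# The gradient -2 X^T Y + 2 X^T X w should be (numerically) zero at the minimizer
grad_at_w_hat = -2 * X_small.T @ Y_small + 2 * X_small.T @ X_small @ w_hat

print(w_hat)
print(np.abs(grad_at_w_hat).max())
```

The recovered weights should sit close to `w_true`, with the residual difference set by the noise level and the number of samples.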

If $X$, $Y$ are large (maybe 10k x 10k) then the matrix inversion here becomes computationally expensive.

Instead, we will optimize the weights numerically. However, computing the gradient over the full dataset is also expensive in this case, so we will approximate the gradient using smaller "batches" of training points.

We divide the entire dataset into batches and use the gradient with respect to these batches as a stand-in for the real gradient. Depending on the batch size, the method goes by different names:
- "Batch gradient descent / Gradient descent" -> the batch size is equal to the dataset size
- "Mini-batch gradient descent" -> the batch size is > 1 but less than the dataset size
- "Stochastic gradient descent" -> the batch size is 1

Choosing the correct batch size has a significant effect on optimization speed, depending on the problem at hand. Reducing the batch size makes each optimization step cheaper to compute, but the optimizer takes a noisier path.
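
The noise can be quantified directly. The following NumPy sketch (synthetic data and helper names are assumptions for illustration, not part of the lab code) compares gradients computed on random batches of different sizes against the full-dataset gradient of the mean squared error, all evaluated at the same fixed weights.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical 1D data resembling the toy problem above: y = 4 + 3x + noise
n = 1000
x_demo = 2 * rng.random(n)
y_demo = 4 + 3 * x_demo + rng.normal(size=n)

# Design matrix with a column of ones so the bias is part of w
A = np.column_stack([x_demo, np.ones(n)])
w = np.zeros(2)  # arbitrary fixed point at which gradients are evaluated

def mse_gradient(A_batch, y_batch, w):
    """Gradient of the mean squared error over one batch."""
    residual = A_batch @ w - y_batch
    return 2 * A_batch.T @ residual / len(y_batch)

full_grad = mse_gradient(A, y_demo, w)

def gradient_spread(batch_size, n_draws=500):
    """Average distance between a random batch gradient and the full gradient."""
    errors = []
    for _ in range(n_draws):
        idx = rng.choice(n, size=batch_size, replace=False)
        batch_grad = mse_gradient(A[idx], y_demo[idx], w)
        errors.append(np.linalg.norm(batch_grad - full_grad))
    return np.mean(errors)

spread = {b: gradient_spread(b) for b in (1, 10, 100)}
print(spread)
```

Smaller batches give gradient estimates that scatter more widely around the full gradient, which is the source of the noisier optimization path.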

In this context, one "epoch" is a single pass of the optimizer through the entire dataset; training generally runs for more than one epoch.
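
As a quick illustration of how batch size and epochs relate, the sketch below counts the optimizer steps in one epoch for several batch sizes using a `DataLoader`. The dataset here is a hypothetical placeholder of 70 points filled with zeros; only its length matters.

```python
import math
import torch

# Hypothetical dataset of 70 points (e.g. a 70/30 split of 100 samples)
n_points = 70
demo_set = torch.utils.data.TensorDataset(
    torch.zeros(n_points, 1), torch.zeros(n_points, 1)
)

steps_per_epoch = {}
for batch_size in (1, 7, n_points):
    loader = torch.utils.data.DataLoader(demo_set, batch_size=batch_size, shuffle=True)
    # len(loader) is the number of batches, i.e. optimizer steps in one epoch
    steps_per_epoch[batch_size] = len(loader)
    assert len(loader) == math.ceil(n_points / batch_size)

print(steps_per_epoch)  # {1: 70, 7: 10, 70: 1}
```

For a fixed number of epochs, a smaller batch size therefore means many more (cheaper, noisier) parameter updates.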

We will start by trying batch gradient descent using `torch.optim.SGD`.
#### NOTE: even though the optimizer is named SGD, the variant of gradient descent it performs depends on how much data you pass it per step. Since we pass it the entire training set at once, it is effectively batch gradient descent.

In [None]:
epochs = 200
gd_model = LinearModel(x.shape[1])

#use Mean Squared Error loss function
loss_fn = torch.nn.MSELoss()
optim = torch.optim.SGD(gd_model.parameters(), lr = 0.01)

batch_size = len(train_data)

train_loader = torch.utils.data.DataLoader(train_data, batch_size = batch_size, shuffle = True)

#track loss during epochs
train_loss = []
test_loss = []
for i in range(epochs):

    #iterate over the batches (in this case we only have one batch)
    for batch_idx, (data, target) in enumerate(train_loader):
        # zero gradients
        optim.zero_grad()
        
        # calculate model prediction -> calculate loss
        fval = gd_model(data)
        loss = loss_fn(fval, target.reshape(-1,1))
        
        # calculate derivatives of loss based on model parameters
        loss.backward()
    
        # step the optimizer based on the gradients
        optim.step()
        
        # keep track of loss function
        train_loss += [loss.detach().numpy()]
        if i % 5 == 0:
            print(f'epoch: {i}, loss: {loss}')   
            
        #calculate test loss
        test_loss += [loss_fn(gd_model(test_data[:][0]),test_data[:][1].unsqueeze(1)).detach().numpy()]


<div class="alert alert-block alert-info">
    
**Task:** 
    Refit the data using stochastic gradient descent (batch size of 1) and mini-batch gradient descent with 10 batches. You will have to change the number of epochs so that it takes a reasonable amount of time (50 for mini-batch and 10 for SGD). Compare how changing the batch size affects optimization by plotting test_loss as a function of epoch for each batch size. Compare how mini-batch training changes with varying learning rates (0.001, 0.01, 0.1).
    
</div>

<div class="alert alert-block alert-success">
    
**Homework:**
    Provide an explanation for why the MSE performance of gradient-based methods varies so dramatically based on batch size.
</div>

<div class="alert alert-block alert-success">
    
**Homework:**
    How could we extend this method towards modeling more complex functions? Use a PyTorch model to perform polynomial
    regression of the dataset generated below. Your answer should closely match the analytical function used to generate the data.
    Hint: You can access the weight elements of an `nn.Linear` layer using `.weight` (compare the `named_parameters()` output above).
</div>

In [None]:
x = torch.linspace(0,1, 100).unsqueeze(dim=1)
y = 0.75 + 0.1 * x - 0.5 * x ** 2 + 0.5 * x ** 3 + torch.randn(x.shape)*0.005
plt.plot(x,y)