# Thinking in Tensors, writing in PyTorch

A hands-on course by [Piotr Migdał](https://p.migdal.pl) (2019).
This notebook prepared by [Weronika Ormaniec](https://github.com/werkaaa).

## Notebook 4: Multiple Linear Regression


<a href="https://colab.research.google.com/github/stared/thinking-in-tensors-writing-in-pytorch/blob/master/extra/Using%20an%20ImageNet-pretrained%20model.ipynb">
    <img src="https://colab.research.google.com/assets/colab-badge.svg"/>
</a>

Simple linear regression is a useful tool for predicting an output from a single predictor. In practice, however, we often come across problems that are described by more than one predictor. In that case we use Multiple Linear Regression.

Instead of fitting a separate linear equation for each predictor, we will create one equation of the form:
$$ Y = \alpha_0 + \alpha_1 \cdot X_1 + \alpha_2 \cdot X_2 + \dots + \alpha_n \cdot X_n $$
where $X_i$ is the $i$-th predictor, $\alpha_i$ is the coefficient we want to learn for it, $\alpha_0$ is the intercept, and $n$ is the number of predictors.
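In tensor terms, the equation is a matrix-vector product plus an intercept. The sketch below uses made-up numbers; the names `X`, `alpha` and `alpha_0` are illustrative and are not the variables used later in this notebook.

```python
import torch

# Three samples with two predictors each (made-up values for illustration).
X = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])
alpha = torch.tensor([0.5, -1.0])   # one coefficient per predictor
alpha_0 = torch.tensor(2.0)         # intercept

# Y_i = alpha_0 + sum_j alpha_j * X_ij, computed for all samples at once.
Y = alpha_0 + X @ alpha

# The same values, written out term by term as in the formula.
Y_explicit = alpha_0 + alpha[0] * X[:, 0] + alpha[1] * X[:, 1]
```

Both expressions give the same predictions; the matrix form simply scales to any number of predictors without changing the code.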

The learning process in Multiple Linear Regression is the same as the one in Simple Linear Regression. 
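As a reminder of what that learning process looks like, here is a minimal sketch of gradient descent on the mean squared error, written by hand. It assumes synthetic data with known coefficients (`true_alpha`, `true_alpha_0`) and a hand-picked learning rate; none of these names or values come from the dataset used below.

```python
import torch

torch.manual_seed(0)

# Synthetic data: 200 samples, 3 predictors, known coefficients plus small noise.
true_alpha = torch.tensor([2.0, -1.0, 0.5])
true_alpha_0 = 3.0
X_demo = torch.rand(200, 3)
Y_demo = true_alpha_0 + X_demo @ true_alpha + 0.01 * torch.randn(200)

# Parameters to learn, initialised at zero.
alpha = torch.zeros(3, requires_grad=True)
alpha_0 = torch.zeros(1, requires_grad=True)

learning_rate = 0.1  # hand-picked for this toy problem
for step in range(2000):
    Y_pred = alpha_0 + X_demo @ alpha
    loss = ((Y_pred - Y_demo) ** 2).mean()   # mean squared error
    loss.backward()                          # gradients w.r.t. alpha and alpha_0
    with torch.no_grad():
        alpha -= learning_rate * alpha.grad      # gradient descent update
        alpha_0 -= learning_rate * alpha_0.grad
        alpha.grad.zero_()                       # reset gradients for the next step
        alpha_0.grad.zero_()
```

The only difference from the single-predictor case is that `alpha` is now a vector, so one update step adjusts all coefficients at once.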

In [None]:
import numpy as np
import pandas as pd
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older version.
from sklearn.datasets import load_boston
import matplotlib.pyplot as plt
from livelossplot import PlotLosses

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import transforms

### Data

In this notebook we will analyze the Boston Housing Dataset. It contains information about 506 houses in Boston. There are 13 features of the houses, some with a great and some with little impact on the price. Using PyTorch, we will implement a model that predicts the price of a house, and then we will try to answer the question: which parameters have the biggest impact on the price?

We will take the dataset from the scikit-learn datasets.

In [None]:
boston = load_boston()
boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)
boston_df[:5]

Parameters description:
    
* CRIM: Per capita crime rate by town
* ZN: Proportion of residential land zoned for lots over 25,000 sq. ft
* INDUS: Proportion of non-retail business acres per town
* CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
* NOX: Nitric oxide concentration (parts per 10 million)
* RM: Average number of rooms per dwelling
* AGE: Proportion of owner-occupied units built prior to 1940
* DIS: Weighted distances to five Boston employment centers
* RAD: Index of accessibility to radial highways
* TAX: Full-value property tax rate per $10,000
* PTRATIO: Pupil-teacher ratio by town
* B: 1000(Bk - 0.63)², where Bk is the proportion of [people of African American descent] by town
* LSTAT: Percentage of lower status of the population

Target: Median value of owner-occupied homes in $1000s

First of all, let's check which parameters are most linearly correlated with the price of the houses.

In [None]:
# Scatter plots of four selected features against the target (median house value).
fig, (ax1, ax2, ax3, ax4) = plt.subplots(1, 4, figsize=(15, 7))
ax1.scatter(boston_df['RM'], boston.target)
ax1.set_xlabel('RM')
ax1.set_ylabel('Target')

ax2.scatter(boston_df['CRIM'], boston.target)
ax2.set_xlabel('CRIM')
ax2.set_ylabel('Target')

ax3.scatter(boston_df['PTRATIO'], boston.target)
ax3.set_xlabel('PTRATIO')
ax3.set_ylabel('Target')

ax4.scatter(boston_df['LSTAT'], boston.target)
ax4.set_xlabel('LSTAT')
ax4.set_ylabel('Target')


We can say that the relationship between RM (number of rooms per dwelling) and the price may be linear. The same goes for LSTAT (percentage of lower status of the population) and the price. What about CRIM (per capita crime rate by town) and PTRATIO (pupil-teacher ratio by town)? Those relationships are clearly not linear. Let's check how a linear model copes with them!
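One way to put a number on "how linear" a relationship looks is the Pearson correlation coefficient, which is close to $\pm 1$ for nearly linear relationships and smaller in magnitude otherwise. The sketch below uses synthetic data rather than the Boston columns, and `pearson_r` is a hypothetical helper written for this illustration.

```python
import torch

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D tensors."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return (x_centered * y_centered).sum() / (x_centered.norm() * y_centered.norm())

torch.manual_seed(0)
x = torch.linspace(0.1, 10.0, 500)
y_linear = 3.0 * x + torch.randn(500)          # roughly linear relationship
y_curved = 1.0 / x + 0.1 * torch.randn(500)    # clearly non-linear relationship

r_linear = pearson_r(x, y_linear)
r_curved = pearson_r(x, y_curved)
```

Applied to the real data, the same helper could be called with a feature column and the target, for example tensors built from `boston_df['RM']` and `boston.target`.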

Looking at the data, we can see that the predictors differ by orders of magnitude. That can be an obstacle during model training. That is why we will normalize the data, so that every feature falls within the range $[-1,1]$.

In [None]:
X = torch.tensor(boston.data, dtype=torch.float32)
Y = torch.tensor(boston.target, dtype=torch.float32)

In [None]:
def Normalize(data):
    # Mean normalization, column by column: subtract the mean, divide by the range.
    data_mean = torch.mean(data, dim=0)
    data_max = torch.max(data, dim=0)[0]
    data_min = torch.min(data, dim=0)[0]
    data = (data-data_mean)/(data_max-data_min)
    return data

In [None]:
X_normalized = Normalize(X)

In [None]:
boston_df = pd.DataFrame(np.array(X_normalized), columns=boston.feature_names)
boston_df[:5]
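To see why this kind of normalization keeps every column inside $[-1,1]$, here is a small sketch with made-up numbers (the tensor `data` is illustrative only). After subtracting the column mean and dividing by the column range, each column has zero mean and spans an interval of width exactly 1.

```python
import torch

# Made-up feature matrix: two columns on very different scales.
data = torch.tensor([[0.001, 100.0],
                     [0.002, 300.0],
                     [0.006, 800.0]])

# Mean normalization, per column: subtract the mean, divide by the range.
col_mean = data.mean(dim=0)
col_range = data.max(dim=0).values - data.min(dim=0).values
normalized = (data - col_mean) / col_range

# Each column now has zero mean and (max - min) equal to 1.
col_means_after = normalized.mean(dim=0)
col_widths_after = normalized.max(dim=0).values - normalized.min(dim=0).values
```

Because the mean always lies between the minimum and the maximum, no normalized value can be further than 1 from zero.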

This time we will divide the data into training and test sets, so that we can measure how well the model does in general, on examples it has not seen during the training process.

In [None]:
X_train = X_normalized[:400]
Y_train = Y[:400]
# Start at index 400 so that no sample is left out between the two sets.
X_test = X_normalized[400:]
Y_test = Y[400:]
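The split above takes the first 400 rows for training, in the order they appear in the dataset. If the rows happen to be ordered in some meaningful way, a shuffled split may be more representative. The following is a sketch of that alternative, using stand-in random data with the same shapes; the names with the `_shuffled` suffix are illustrative and are not used elsewhere in this notebook.

```python
import torch

torch.manual_seed(0)

# Stand-in data with the same shapes as in this notebook (506 samples, 13 features).
X_all = torch.rand(506, 13)
Y_all = torch.rand(506)

# Shuffle the row indices once, then cut them into two disjoint parts.
n_train = 400
permutation = torch.randperm(len(X_all))
train_idx, test_idx = permutation[:n_train], permutation[n_train:]

X_train_shuffled, Y_train_shuffled = X_all[train_idx], Y_all[train_idx]
X_test_shuffled, Y_test_shuffled = X_all[test_idx], Y_all[test_idx]
```

Using the same index tensors for features and targets keeps each row paired with its own target value.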

### Model

In [None]:
class Linear(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(in_features=13, out_features=1)
        
    def forward(self, x):
        return self.linear(x).squeeze(1)

In [None]:
linear_model = Linear()
print(linear_model.linear.weight)
print(linear_model.linear.bias)
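The `weight` and `bias` printed above are exactly the coefficients $\alpha_1, \dots, \alpha_n$ and the intercept $\alpha_0$ from the equation at the top. A short sketch on made-up inputs, checking that `nn.Linear` computes `x @ weight.T + bias` (the names `layer` and `x` are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(in_features=13, out_features=1)
x = torch.rand(5, 13)                      # five made-up samples

# weight has shape (1, 13) and bias has shape (1,).
manual = x @ layer.weight.T + layer.bias   # shape (5, 1)
output = layer(x)                          # shape (5, 1)

# squeeze(1) drops the trailing dimension, so predictions have shape (5,)
# and line up with a target vector of the same length.
predictions = output.squeeze(1)
```

This is why the `forward` method of the model calls `squeeze(1)`: without it, predictions of shape `(N, 1)` would be compared against targets of shape `(N,)`.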

In [None]:
y_predict_train = linear_model(X_train)
rmse_train = torch.sqrt(F.mse_loss(Y_train, y_predict_train))

y_predict_test = linear_model(X_test)
rmse_test = torch.sqrt(F.mse_loss(Y_test, y_predict_test))

print("The PyTorch model performance:")
print('RMSE_train is {}'.format(rmse_train))
print('RMSE_test is {}'.format(rmse_test))

In [None]:
#optim = torch.optim.SGD(linear_model.parameters(), lr=0.1)
optim = torch.optim.Adam(linear_model.parameters(), lr=1.)
loss_function = F.mse_loss
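# Initial loss on the normalized training data (the raw X is not what the model is trained on).
loss = loss_function(linear_model(X_train), Y_train)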
print(loss)  

In [None]:
def train(X, Y, model, loss_function, optim, num_epochs):
    liveloss = PlotLosses()

    for epoch in range(num_epochs):
        # Full-batch step: predict, measure the error, backpropagate, update.
        Y_pred = model(X)
        loss = loss_function(Y_pred, Y)

        loss.backward()
        optim.step()
        optim.zero_grad()

        # F.mse_loss already averages over the samples, so no further division is needed.
        liveloss.update({
            'loss': loss.item(),
        })
        liveloss.draw()

train(X_train, Y_train, linear_model, loss_function, optim, num_epochs=80)

In [None]:
y_predict_train = linear_model(X_train)
rmse_train = torch.sqrt(F.mse_loss(Y_train, y_predict_train))

y_predict_test = linear_model(X_test)
rmse_test = torch.sqrt(F.mse_loss(Y_test, y_predict_test))

print("The PyTorch model performance:")
print('RMSE train is {:.3f}'.format(rmse_train))
print('RMSE test is {:.3f}'.format(rmse_test))

As we can see, our model fits the data better after training.

We can now compare it with the scikit-learn linear regression model.

In [None]:
lin_model = LinearRegression()
lin_model.fit(X_train.numpy(), Y_train.numpy())

In [None]:
# Convert tensors to NumPy arrays before passing them to scikit-learn.
y_ptrain = lin_model.predict(X_train.numpy())
rmse_train = np.sqrt(mean_squared_error(Y_train.numpy(), y_ptrain))

y_ptest = lin_model.predict(X_test.numpy())
rmse_test = np.sqrt(mean_squared_error(Y_test.numpy(), y_ptest))

print("The scikit-learn model performance:")
print('RMSE train is {:.3f}'.format(rmse_train))
print('RMSE test is {:.3f}'.format(rmse_test))

Our model is not perfect, but it has learned some intuition about the data and is able to make predictions even on data it has not seen during the learning process.
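For context, scikit-learn's `LinearRegression` fits ordinary least squares, which can be solved directly rather than by iterating over epochs. A minimal sketch of the same idea in PyTorch using `torch.linalg.lstsq`, on synthetic data with known coefficients (`true_weights` and `true_bias` are assumptions made for this example, not values from the Boston data):

```python
import torch

torch.manual_seed(0)

# Synthetic data with known coefficients (illustrative only).
true_weights = torch.tensor([1.5, -2.0, 0.7])
true_bias = 4.0
X_demo = torch.rand(100, 3)
Y_demo = true_bias + X_demo @ true_weights + 0.01 * torch.randn(100)

# Append a column of ones so the intercept is learned as one more coefficient.
X_with_ones = torch.cat([X_demo, torch.ones(100, 1)], dim=1)

# Solve min ||X_with_ones @ w - Y_demo||^2 directly; the target must be 2-D here.
solution = torch.linalg.lstsq(X_with_ones, Y_demo.unsqueeze(1)).solution.squeeze(1)
weights, bias = solution[:3], solution[3]
```

A gradient-based model trained long enough on the same loss should approach this solution, which is why the two sets of coefficients are worth comparing.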

Let's compare the coefficients of both models.

In [None]:
n_groups = 13

fig, ax = plt.subplots(figsize=(15, 7))

index = np.arange(n_groups)
bar_width = 0.35
opacity = 0.8

rects1 = plt.bar(index, linear_model.linear.weight.detach().squeeze().numpy(), bar_width,
                 alpha=opacity,
                 color='b',
                 label='our model')

rects2 = plt.bar(index + bar_width, lin_model.coef_, bar_width,
                 alpha=opacity,
                 color='g',
                 label='scikit learn')

plt.xlabel('Variable')
plt.ylabel('Value')
plt.title('Coefficients')
# Centre each tick label between the two bars of its group.
plt.xticks(index + bar_width / 2, ('CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
                                   'TAX', 'PTRATIO', 'B', 'LSTAT'))
plt.legend()

plt.tight_layout()
plt.show()

## To do
* plots of correlation
* more intro to dataset
* bar plots of coefficients (PyTorch and ScikitLearn)