<a href="https://colab.research.google.com/github/sael17/Project_2_Notebooks_AISP23/blob/master/Sebastian_Estrada_Regression_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Student Name: Sebastian A. Estrada Lopez
# Student Number: 802-19-3839

### Libraries to be used and set up

This notebook uses common libraries such as NumPy, pandas, Matplotlib, seaborn, and PyTorch 2.0.

In [None]:
import torch
import torchvision
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
import matplotlib.pyplot as plt
import torchvision.transforms as transforms

from torch.utils.data import TensorDataset, DataLoader


# Make NumPy printouts easier to read.
np.set_printoptions(precision=3, suppress=True)


### Obtaining the dataset

The dataset used for this experiment is the Auto MPG dataset, which is available from the UCI Machine Learning Repository. It describes the fuel efficiency of automobiles from the late 1970s and early 1980s, and the models to be built need to predict fuel efficiency in miles per gallon (MPG). To do this, the models are given a description of each automobile, including attributes like cylinders, displacement, horsepower, and weight. The dataset contains the following attributes:

1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)

In the raw dataset, the car name column is not useful for the experiment, since the name of a car does not affect its MPG. That column is therefore left out of the dataset used here.




In [None]:
# Origin/source of the dataset
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
# List that contains the names for the columns for the dataset 
columns_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight', 'Acceleration', 'Model Year', 'Origin']
# load the data using pandas
raw_dataset = pd.read_csv(url,names=columns_names,na_values='?', comment='\t',
                          sep=' ', skipinitialspace=True)

# make a copy of the original dataset in order to avoid risk of accidental manipulation
raw_dataset_copy = raw_dataset.copy()
# get the last 5 rows of the dataset
raw_dataset_copy.tail()
# display(raw_dataset[:50])

## Understanding the Data

Before running the experiment, we must understand what we are predicting and what our data looks like. We want to predict the fuel efficiency (MPG) from the other features in the dataset. In the following sections, we analyze the data to understand it better.

### Data Analysis

Before starting the experiment, the data has to be cleaned as much as possible to reduce the risk of errors that can affect the analysis.

The data is sorted by model year. It is important that each batch is a random sample of the data, and we do not want the model to make predictions based on the ordering by model year. Therefore, the data is shuffled to remove this ordering.

In [None]:
# We want to shuffle the order of the rows without touching the columns.
# First, we get a list of indices corresponding to the rows.
indices = np.arange(raw_dataset_copy.shape[0])
print('Indices',indices, '\n')

# Next, we shuffle the indices using np.random.permutation but set a random seed
# so that results can be replicated
np.random.seed(0)
shuffled_indices = np.random.permutation(indices)
print('shuffled indices:', shuffled_indices, '\n')

# Finally, we use dataframe.reindex to change the ordering of the original
# dataframe.
raw_dataset_copy = raw_dataset_copy.reindex(shuffled_indices)
display(raw_dataset_copy)

We also want to clean the data by removing every row that contains a missing (NaN) value.

In [None]:
# Count the missing (NaN) values in each column. Rows with missing values are dropped in order to have clean numeric data
print(raw_dataset_copy.isna().sum())
# The dropna function drops rows with missing value(s) by default.
raw_dataset_copy = raw_dataset_copy.dropna()

The "Origin" column is categorical, not numeric, so the next step is to one-hot encode its values with pd.get_dummies.

In [None]:
# Associate each Origin ID with a specific country
raw_dataset_copy['Origin'] = raw_dataset_copy['Origin'].map({1: 'USA', 2: 'Europe', 3: 'Japan'})
# Associate each Origin ID now with a one hot vector
raw_dataset_copy = pd.get_dummies(raw_dataset_copy, columns=['Origin'], prefix='', prefix_sep='')
# Display the last 5 rows to verify the dataset
raw_dataset_copy.tail()                                        

In [None]:
# Now that the dataset has been cleaned, it will officially be the dataset for the experiment
dataset = raw_dataset_copy
# Free the space in memory from the old variable
del raw_dataset_copy
display(dataset)

### Train/Test Split
The data has to be split into training and test portions: the model learns from the training data and is then evaluated on the test data to verify its performance. In this experiment the split is 80/20, meaning that 80% of the rows are used for training and the other 20% for testing.

In [None]:
# The input features to be used
features = ['Cylinders', 'Displacement', 'Horsepower', 'Weight', 'Acceleration', 'Model Year', 'Europe', 'Japan', 'USA']
# Use a ~80/20 data split
X_train = dataset[:314]
X_test = dataset[314:]

## Create separate variables for features (inputs) and labels (outputs).
# We will be using these in the cells below.
dataset_train_features = X_train[features]
dataset_test_features = X_test[features]
dataset_train_labels = X_train['MPG']
dataset_test_labels = X_test['MPG']


# Confirm the data shapes are as expected.
print('train data shape:', dataset_train_features.shape)
print('train labels shape:', dataset_train_labels.shape)
print('test data shape:', dataset_test_features.shape)
print('test labels shape:', dataset_test_labels.shape)
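
The hard-coded cutoff of 314 can also be derived from the desired training fraction. The sketch below is a minimal illustration, assuming 392 rows remain after dropping missing values; `split_index` is a hypothetical helper, not part of the notebook.

```python
def split_index(n_rows, train_fraction=0.8):
    """Return the row index that separates training rows from test rows."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    return round(n_rows * train_fraction)

# Assumption: 392 rows are left after dropna().
n_rows = 392
cut = split_index(n_rows)
print("train rows:", cut, "test rows:", n_rows - cut)
```

Deriving the cutoff this way keeps the split at roughly 80/20 if the number of cleaned rows ever changes.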

### Baseline

Before implementing the models for the experiment, we need a basic baseline to compare them against. Since we want to predict the fuel efficiency of a car, the baseline predicts the average MPG of the training cars regardless of the input. The baseline is evaluated with the Root Mean Square Error (RMSE), which is the square root of the Mean Square Error (MSE).

In [None]:
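def baseline(Y):
  # predict the average of the labels, regardless of the input
  return np.mean(Y)

def RMSE(true_values, predicted_values):
  # mean of the squared differences between true and predicted values
  MSE = (1/len(true_values)) * np.sum(np.power(np.subtract(true_values,predicted_values),2))
  # the RMSE is the square root of the MSE
  root_mean_squared_error = np.sqrt(MSE)
  return root_mean_squared_error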

evaluated_baseline = baseline(dataset_train_labels)
print("Evaluated Baseline:",evaluated_baseline)

print('RMSE Evaluated in the Training Data', RMSE(dataset_train_labels,evaluated_baseline))
print('RMSE Evaluated in the Test Data', RMSE(dataset_test_labels,evaluated_baseline))
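
As a small worked example of what the baseline measures: when the prediction is the mean of the same labels it is scored on, the RMSE equals the population standard deviation of those labels. The sketch below uses made-up values and a hypothetical `rmse` helper, so it illustrates the idea rather than reproducing the notebook's numbers.

```python
import numpy as np

def rmse(true_values, predicted_values):
    """Root mean square error between an array and an array or scalar."""
    diff = np.asarray(true_values, dtype=float) - predicted_values
    return float(np.sqrt(np.mean(diff ** 2)))

# Made-up MPG-like values, for illustration only.
toy_labels = np.array([10.0, 20.0, 30.0])
toy_baseline = toy_labels.mean()            # constant prediction: 20.0

baseline_rmse = rmse(toy_labels, toy_baseline)
population_std = float(np.std(toy_labels))  # ddof=0, same formula as the RMSE

print(baseline_rmse, population_std)
```

Any model worth keeping should therefore reach an RMSE clearly below the spread of the labels themselves.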


### Feature Histograms
Histograms can be used to interpret the data in a more visual way. Plotting histograms is a good way to start building intuition about the data. They give us a sense of the distribution of each feature, but not of how the features relate to each other.

In [None]:
plt.figure(figsize=(25,3))
for i in range(len(features)):
  plt.subplot(1,9,i+1)
  plt.hist(np.array(X_train[features[i]]))
  plt.title(features[i])
plt.show()

display(X_train.describe())

### Feature Correlations
Let's check whether there are correlations between the features in order to understand the data better. To determine whether two features are correlated, we can look at the value in the corresponding cell of the correlation matrix. The value in each cell is the correlation coefficient between the two variables, which ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no linear correlation.

If the value is close to -1 or 1, it suggests a strong correlation between the two features. 
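
To make the coefficient concrete, here is a minimal sketch that computes the Pearson correlation by hand for two made-up columns. The values and the `pearson` helper are assumptions for illustration; they are not taken from the dataset.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return float(np.sum(x_centered * y_centered)
                 / np.sqrt(np.sum(x_centered ** 2) * np.sum(y_centered ** 2)))

# Made-up values: heavier cars with lower MPG, so the coefficient is negative.
toy_weight = [2000.0, 2500.0, 3000.0, 3500.0]
toy_mpg = [35.0, 30.0, 22.0, 18.0]

r = pearson(toy_weight, toy_mpg)
print(r)
```

The result should agree with `np.corrcoef(toy_weight, toy_mpg)[0, 1]`, which computes the same quantity.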

A heatmap is used to visualize the correlations between features. With the "YlGnBu" colormap used here, the color scale runs from light yellow for the lowest values (strong negative correlations), through green for intermediate values, to dark blue for the highest values (strong positive correlations).

In [None]:
features_correlations = pd.DataFrame(X_train)
display(features_correlations.corr())

sns.heatmap(features_correlations.corr(), cmap="YlGnBu")

The diagonal of the heatmap shows the perfect correlation between each feature and itself. The dark blue square in the top left of the heatmap shows a strong correlation between the following features, for example: cylinders and displacement, cylinders and horsepower, and cylinders and weight.

However, we want to focus on our goal, which is what we want to predict: MPG.

From the heatmap and the table above, the following can be observed:



*   Negative Correlation between MPG and Cylinders. 
*   Negative Correlation between MPG and Displacement.
*   Negative Correlation between MPG and Horsepower. 
*   Negative Correlation between MPG and Weight.
*   Positive Correlation between MPG and Acceleration.
*   Positive Correlation between MPG and Model Year.
*   Positive Correlation between MPG and Europe.
*   Positive Correlation between MPG and Japan.
*   Negative Correlation between MPG and USA.

## Pytorch Models

### Normalization

For the models to train stably, the data has to be normalized so that each feature is on the same scale. The mean and standard deviation of each feature are used to normalize the data. Important: we can't normalize the test data with a mean and standard deviation computed on the test data, as this would violate our willful blindness of the test data. The training statistics are applied to both sets instead.

In [None]:
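def feature_normalization(data_frame_features,data_frame):
  # The statistics always come from the first argument (the training features)
  features_mean = data_frame_features.mean()
  standard_deviation = data_frame_features.std()
  # Center the second argument on the training mean, then scale by the training std
  centered = data_frame-features_mean
  return centered/standard_deviation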


train_features_norm = feature_normalization(dataset_train_features,dataset_train_features)
test_features_norm = feature_normalization(dataset_train_features,dataset_test_features)


print("Training Normalization")
display(train_features_norm.describe())
print("Test Normalization")
display(test_features_norm.describe())
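
The same idea can be sketched with plain NumPy arrays. The example below uses made-up numbers and a hypothetical `normalize_with_train_stats` helper; it shows that the training set ends up with mean 0 and standard deviation 1 per column, while the test set is only approximately centered because it reuses the training statistics.

```python
import numpy as np

def normalize_with_train_stats(train, other):
    """Scale `other` using the column mean and std of `train` only."""
    train = np.asarray(train, dtype=float)
    mean = train.mean(axis=0)
    std = train.std(axis=0, ddof=1)  # ddof=1 matches the pandas default for .std()
    return (np.asarray(other, dtype=float) - mean) / std

# Made-up two-feature data (e.g. weight, horsepower), for illustration only.
toy_train = np.array([[2000.0, 80.0], [3000.0, 120.0], [4000.0, 160.0]])
toy_test = np.array([[3500.0, 100.0]])

toy_train_norm = normalize_with_train_stats(toy_train, toy_train)
toy_test_norm = normalize_with_train_stats(toy_train, toy_test)

print(toy_train_norm.mean(axis=0))  # close to 0 for every column
print(toy_test_norm)                # scaled with the training statistics
```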

### Converting Pandas DataFrames into Tensors.

Before training the Models, we have to convert the Dataframes into Tensors so that we can pass the data to the models with PyTorch

In [None]:
# convert the dataframes to tensors
tensor_data_train_features = torch.tensor(train_features_norm.values,dtype=torch.float32)
tensor_data_train_labels = torch.tensor(dataset_train_labels.values,dtype=torch.float32).view(-1,1)
tensor_data_test_features = torch.tensor(test_features_norm.values,dtype=torch.float32)
tensor_data_test_labels = torch.tensor(dataset_test_labels.values,dtype=torch.float32).view(-1,1)

## Regression Models 

All networks use ReLU activation in the intermediate layers, but not in the final layer. Each network is trained for 20 epochs with a learning rate of 0.01.
The question posed for this experiment is answered in a cell at the end of the notebook.

### Regression Model #1

For this experiment, we build several regression models, because the task is to predict a number (MPG) from other numbers (the features). Model #1 is a basic linear regression with 1 layer and 1 neuron.

In [None]:
class RegressionModel(torch.nn.Module):
  def __init__(self, input_dim):
    super(RegressionModel,self).__init__()
    # One Layer, One Neuron 
    self.layer1 = torch.nn.Linear(input_dim,1,bias=True)


  def forward(self,x):
    y_pred = self.layer1(x)
    return y_pred
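
Written out, the single linear layer above computes a weighted sum of the input features plus a bias, which is ordinary linear regression. The symbols below (the weight vector $\mathbf{w}$, the bias $b$, and the number of examples $N$) are notation introduced here for illustration, together with the mean squared error that serves as the training criterion later in the notebook:

$$
\hat{y} = \mathbf{w}^\top \mathbf{x} + b,
\qquad
\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2
$$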

### Regression Model #2

Model #2 will be a neural network with the following characteristics:

4-layer network


1.   Layer 1 – 10 neurons
2.   Layer 2 – 20 neurons
3.   Layer 3 – 10 neurons
4.   Layer 4 – output neuron

In [None]:
class RegressionModel_2(torch.nn.Module):
  def __init__(self, input_dim):
    super(RegressionModel_2,self).__init__()
    # first layer receives the features as their inputs. This layer contains 10 neurons.
    self.layer1 = torch.nn.Linear(input_dim,10,bias=True)
    # second layer receives the 10 outputs from the prev layer. This layer contains 20 neurons.
    self.layer2 = torch.nn.Linear(10,20,bias=True)
    # Third layer receives the 20 outputs from the prev layer. This layer contains 10 neurons.
    self.layer3 = torch.nn.Linear(20,10,bias=True)
    # Fourth layer receives the 10 outputs from the prev layer. This layer contains 1 neuron (the output).
    self.layer4 = torch.nn.Linear(10,1,bias=True)
    # activation function for the hidden layers
    self.activation = torch.nn.ReLU()
    

  def forward(self,x):
      y_pred = self.activation(self.layer1(x))
      y_pred = self.activation(self.layer2(y_pred))
      y_pred = self.activation(self.layer3(y_pred))
      y_pred = self.layer4(y_pred)
      return y_pred


### Regression Model #3

Model #3 will be a neural network with the following characteristics:

5-layer network


1.   Layer 1 – 10 neurons
2.   Layer 2 – 20 neurons
3.   Layer 3 – 30 neurons
4.   Layer 4 – 20 neurons
5.   Layer 5 - output neuron

In [None]:
class RegressionModel_3(torch.nn.Module):
  def __init__(self,input_dim):
    super(RegressionModel_3,self).__init__()
    # first layer receives the features as their inputs. This layer contains 10 neurons.
    self.layer1 = torch.nn.Linear(input_dim,10,bias=True)
    # second layer receives the 10 outputs from the prev layer. This layer contains 20 neurons.
    self.layer2 = torch.nn.Linear(10,20,bias=True)
    # Third layer receives the 20 outputs from the prev layer. This layer contains 30 neurons.
    self.layer3 = torch.nn.Linear(20,30,bias=True)
    # Fourth layer receives the 30 outputs from the prev layer. This layer contains 20 neurons.
    self.layer4 = torch.nn.Linear(30,20,bias=True)
    # Fifth layer receives the 20 outputs from the prev layer. This layer contains 1 neuron (the output).
    self.layer5 = torch.nn.Linear(20,1,bias=True)
    # activation function for the hidden layers
    self.activation = torch.nn.ReLU()
    

  def forward(self,x):
      y_pred = self.activation(self.layer1(x))
      y_pred = self.activation(self.layer2(y_pred))
      y_pred = self.activation(self.layer3(y_pred))
      y_pred = self.activation(self.layer4(y_pred))
      y_pred = self.layer5(y_pred)
      return y_pred
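
To get a feel for how the three architectures differ in size, the following sketch counts their trainable parameters by hand. It assumes fully connected layers with a bias on each neuron, as in the classes above; `count_linear_parameters` is a hypothetical helper written for this illustration.

```python
def count_linear_parameters(layer_sizes):
    """Count weights and biases of a stack of fully connected layers.

    layer_sizes lists the width of every layer, starting with the input size.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per neuron
    return total

n_features = 9  # the nine input features listed in `features`

params_model_1 = count_linear_parameters([n_features, 1])
params_model_2 = count_linear_parameters([n_features, 10, 20, 10, 1])
params_model_3 = count_linear_parameters([n_features, 10, 20, 30, 20, 1])

print(params_model_1, params_model_2, params_model_3)  # 10 541 1591
```

With only a few hundred training rows, the larger networks have more parameters than there are training examples, which is worth keeping in mind when comparing training and test loss.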

## Train models

Now we have to train each model we have created. These are our distinct experiments. The models are trained with the training data separated previously.

In [None]:
# The features will be the initial input for the models
input_dim = len(features)
# we just want to predict a number (MPG)
output_dim = 1
# Learning rate to be used
lr = 0.01
# number of passes over the training data
epochs = 20
# number of samples passed to the model in each training step
batch_size = 32
# how will the model evaluate its performance
criterion = torch.nn.MSELoss()

In [None]:
# Define and create a DataLoader object
train_data = TensorDataset(tensor_data_train_features, tensor_data_train_labels)
train_loader = DataLoader(train_data, batch_size=batch_size)

test_data = TensorDataset(tensor_data_test_features, tensor_data_test_labels)
test_loader = DataLoader(test_data, batch_size=batch_size)
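
A quick bit of arithmetic shows how many parameter updates these settings produce. The sketch assumes 314 training rows and relies on the `DataLoader` default of keeping the final, smaller batch (`drop_last=False`).

```python
import math

n_train_rows = 314   # rows in the training split
batch_size = 32
epochs = 20

batches_per_epoch = math.ceil(n_train_rows / batch_size)
last_batch_size = n_train_rows - (batches_per_epoch - 1) * batch_size
total_updates = batches_per_epoch * epochs

print(batches_per_epoch, last_batch_size, total_updates)  # 10 26 200
```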

In [None]:
def train_model(neural_network_model_version):
  '''
     This function creates a PyTorch neural network, trains it and returns the loss produced
     by the model
     @params: integer selecting which model will be created (1 or 2; any other value selects model #3)
     @returns: the loss list produced during training (the loss of the last batch of each epoch) and the trained model
     
  '''

  # These conditionals check which Regression Model version will be created
  if neural_network_model_version == 1:
    model = RegressionModel(input_dim)
  elif neural_network_model_version == 2:
    model = RegressionModel_2(input_dim)
  else:
    model = RegressionModel_3(input_dim)
  # The optimizer depends on the model: SGD for model #1, Adam for the others
  if neural_network_model_version==1:
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
  else:
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

  # Save the losses being produced so that they can be printed and plotted later
  loss_list = []
  for epoch in range(epochs):
    for i, data in enumerate(train_loader, 0):
        inputs, labels = data
        # Zero the parameter gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(inputs)

        # Compute loss
        loss = criterion(outputs, labels)
        # Backward pass and optimize
        loss.backward()
        optimizer.step()
        # Print statistics
        print('{}, \t{}, \t{}'.format(epoch, loss.item(), [param.data for param in model.parameters()]))
    # Append the loss of the last batch of this epoch to the loss list
    loss_list.append(loss.item())
  print('Finished Training')
  return loss_list,model
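
To make the steps inside the loop concrete, the following is a minimal NumPy sketch of mini-batch gradient descent on a one-neuron linear model, which is roughly what `loss.backward()` and `optimizer.step()` do for model #1 with SGD. It is not the notebook's PyTorch code: the data is synthetic, and names such as `sgd_lr` and `epoch_losses` are hypothetical. Unlike the loop above, it records the mean loss over all batches of an epoch rather than the loss of the last batch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, noise-free data: y = X @ true_w + true_b (not the Auto MPG data).
true_w = np.array([2.0, -1.0, 0.5])
true_b = 4.0
X = rng.normal(size=(64, 3))
y = X @ true_w + true_b

w = np.zeros(3)
b = 0.0
sgd_lr = 0.1
sgd_epochs = 50
sgd_batch_size = 16

epoch_losses = []
for epoch in range(sgd_epochs):
    batch_losses = []
    for start in range(0, len(X), sgd_batch_size):
        xb = X[start:start + sgd_batch_size]
        yb = y[start:start + sgd_batch_size]

        # Forward pass and MSE loss for this batch
        error = xb @ w + b - yb
        batch_losses.append(np.mean(error ** 2))

        # Gradients of the batch MSE with respect to w and b
        grad_w = 2.0 * (xb.T @ error) / len(xb)
        grad_b = 2.0 * np.mean(error)

        # Gradient descent update
        w -= sgd_lr * grad_w
        b -= sgd_lr * grad_b

    # Record the mean loss over all batches of the epoch
    epoch_losses.append(float(np.mean(batch_losses)))

print(epoch_losses[0], epoch_losses[-1])
print(w, b)
```

Averaging over the batches gives a smoother curve than the last-batch loss, at the cost of mixing losses computed with slightly different parameters.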

### Training Model #1
Having defined the training settings, we now train each of the model classes defined above.
Passing 1 to the training function tells it to build and train Model #1. We save the loss list produced and the trained model.

In [None]:
model_1_loss_list,model_1 = train_model(neural_network_model_version=1)

### Training Model #2
Passing 2 to the training function tells it to build and train Model #2. Again, we save the loss list produced and the trained model.

In [None]:
print("===================")
model_2_loss_list,model_2 = train_model(neural_network_model_version=2)

### Training Model #3
Passing 3 to the training function tells it to build and train Model #3. Again, we save the loss list produced and the trained model.

In [None]:
print("===================")
model_3_loss_list,model_3 = train_model(neural_network_model_version=3)

In [None]:
print(model_3_loss_list)

## Display Loss vs. Epochs for each of the models being evaluated
For observation purposes, the training loss of all models is plotted so that their behavior can be analyzed more easily.

In [None]:
plt.figure()
plt.plot(model_1_loss_list, 'r',label='Model #1')
plt.plot(model_2_loss_list, 'g',label='Model #2')
plt.plot(model_3_loss_list, 'b',label='Model #3')
plt.tight_layout()
plt.grid(True, color='y')
plt.xlabel("Epochs/Iterations")
plt.ylabel("Loss")
plt.legend()
plt.show()

## Test Models
Now, we want to see how our models behave with new data after being trained. This is the purpose of our test data.

In [None]:
def test_model(model):
  '''
    Evaluate a model already trained with new/test data
    @params: Model - This is the model that will be used to test the data on
    @returns: A list containing the loss for the test evaluation
  '''

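  # list to keep track of the test loss
  loss_list = []
  # The model is not updated here, so every pass yields the same loss;
  # repeating it gives a flat line that can be plotted against the training curve.
  for epoch in range(epochs):
    with torch.no_grad():
        for data in test_loader:
          inputs,labels = data
          outputs = model(inputs)
          loss = criterion(outputs,labels)
          # print diagnostic data
          print('{}, \t{}, \t{}'.format(epoch, loss.item(), [param.data for param in model.parameters()]))      
    # keep the loss of the last test batch for this pass
    loss_list.append(loss.item())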
  return loss_list
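
The function above keeps the loss of the last test batch only. A dataset-level figure can be obtained by weighting each batch loss by its batch size, as the following sketch shows with made-up prediction errors. The `mse` helper and the `toy_*` names are hypothetical and exist only for this illustration.

```python
def mse(errors):
    """Mean of squared errors."""
    return sum(e ** 2 for e in errors) / len(errors)

# Made-up prediction errors for five test rows, split into batches of two.
toy_errors = [1.0, -2.0, 3.0, 0.5, -0.5]
toy_batch_size = 2
toy_batches = [toy_errors[i:i + toy_batch_size]
               for i in range(0, len(toy_errors), toy_batch_size)]

batch_mses = [mse(batch) for batch in toy_batches]

last_batch_mse = batch_mses[-1]
weighted_mse = (sum(m * len(batch) for m, batch in zip(batch_mses, toy_batches))
                / len(toy_errors))
full_mse = mse(toy_errors)

print(last_batch_mse, weighted_mse, full_mse)  # 0.25 2.9 2.9
```

The size-weighted average matches the MSE over the whole set, and its square root is on the same scale as the baseline RMSE, so the two can be compared directly.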


### Test Model #1 
After training Model #1, we evaluate its performance on data that it has not seen before in order to analyze how good it is at predicting MPG.

In [None]:
model_1_test = test_model(model_1)

In [None]:
print(model_1_test)

### Test Model #2 
After training Model #2, we evaluate its performance on data that it has not seen before in order to analyze how good it is at predicting MPG.

In [None]:
model_2_test = test_model(model_2)

### Test Model #3 
After training Model #3, we evaluate its performance on data that it has not seen before in order to analyze how good it is at predicting MPG.

In [None]:
model_3_test = test_model(model_3)

## Plot Training and Test Loss 

In this section, we plot the training and test loss of each model across the epochs. This lets us analyze the behavior of the loss on both the training and the test data.

In [None]:
plt.figure()
plt.plot(model_1_loss_list, 'r',label='Model #1')
plt.plot(model_2_loss_list, 'g',label='Model #2')
plt.plot(model_3_loss_list, 'b',label='Model #3')
plt.plot(model_1_test, linestyle='--', color='r',label='Model #1 Test')
plt.plot(model_2_test, linestyle='--', color='g',label='Model #2 Test')
plt.plot(model_3_test, linestyle='--', color='b',label='Model #3 Test')
plt.tight_layout()
plt.grid(True, color='y')
plt.xlabel("Epochs/Iterations")
plt.ylabel("Loss")
plt.legend()
plt.show()

# Which of the three models had the least amount of validation error?

Model #3 (blue) has the least amount of error in validation. In the training loss, it is the model that converges fastest. Its validation loss is also the lowest of the three models. This can be seen by comparing the loss values in the training and test lists of each model, printed below.


In [None]:
# print the training loss for the 3 models
print("Model #1 Training Loss",model_1_loss_list)
print("===================")
print("Model #2 Training Loss",model_2_loss_list)
print("===================")
print("Model #3 Training Loss",model_3_loss_list)

In [None]:
# print the test loss for the 3 models
print("Model #1 Test Loss", model_1_test)
print("===================")
print("Model #2 Test Loss", model_2_test)
print("===================")
print("Model #3 Test Loss", model_3_test)