<a href="https://colab.research.google.com/github/qcbegin/DSME6635-S24/blob/main/problem_sets/PS2_Neural_Nets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Problem Set 2 - Implementing Neural Nets

### DSME 6635: Artificial Intelligence for Business Research (Spring 2024)

### Due at 12:30PM, Tuesday, January 30, 2024

Please first copy the CoLab file onto your own Google Drive. Finish the questions below and submit the **CoLab link** of your solutions in [this Google Sheet](https://docs.google.com/spreadsheets/d/1nOE-saTptG73WMCONDB1Z3pt-jHhmDA_1OHpQVHqQ1M/edit#gid=434132169). This problem set is worth a total of 8 points. Please name your solution as follows:

- `Member1LastName_Member1FirstName-Member2LastName_Member2FirstName_PS2.ipynb` (e.g., `Cao_Leo-Zhang_Renyu_PS2.ipynb`)

## Pre-requisites

For building neural networks, there are two fundamental computational frameworks specialized for deep learning:
1. [TensorFlow](https://www.tensorflow.org/tutorials) implemented by **Google**.
2. [PyTorch](https://pytorch.org/tutorials/) implemented by **Facebook**.

**You should carefully review the documentation of both frameworks to understand what they do.**

## California Housing Price Prediction

In this problem, you are asked to build a three-layer multilayer perceptron (MLP) to predict housing prices using the California housing data. The data is described below:

The California Housing dataset, used in this exercise, is a popular dataset for regression tasks in machine learning. It consists of data collected from the 1990 California census and contains information on the median house values for various census blocks in the state of California. The dataset includes 20,640 samples with 8 features, and the goal is to predict the median house value (in units of 100,000 USD) for each block.

Features included in the dataset are:

1. `MedInc`: Median income in the block
2. `HouseAge`: Median age of houses in the block
3. `AveRooms`: Average number of rooms per household in the block
4. `AveBedrms`: Average number of bedrooms per household in the block
5. `Population`: Total population in the block
6. `AveOccup`: Average number of occupants per household in the block
7. `Latitude`: Latitude of the block
8. `Longitude`: Longitude of the block

This data set is available in the `sklearn.datasets` module of the `scikit-learn` package. You can find more information about the dataset and its usage in the `scikit-learn` documentation: [California Housing Dataset Description](https://scikit-learn.org/stable/datasets/toy_dataset.html#california-housing-dataset) and
[fetch_california_housing function documentation](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html).

The original dataset can be found at the following source: Pace, R. Kelley, and Ronald Barry. "Sparse spatial autoregressions." Statistics & Probability Letters 33.3 (1997): 291-297. You may also directly download the data set from Kaggle: https://www.kaggle.com/datasets/camnugent/california-housing-prices

The problem is divided into several sub-problems. In each sub-problem, follow the instructions to complete the code. Unit tests at the end of each coding block check your code for that block.

## 1. Loading Packages and Data.

Use `TensorDataset` to wrap the training and testing data as PyTorch tensor datasets, and `DataLoader` to iterate over them in batches. Set `shuffle=True` for the training data and `shuffle=False` for the testing data. The default batch size should be 64. See [this tutorial](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) for details.
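
As a minimal sketch of how `TensorDataset` and `DataLoader` fit together, the example below wraps a small synthetic array rather than the housing data. The names `X_toy`, `y_toy`, `toy_data` and `toy_loader` are hypothetical and chosen only for illustration; the cast to `float32` is an assumption made so that the tensors match the default dtype of `nn.Linear` weights.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for scaled features and targets (hypothetical data, not the housing set)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 8))
y_toy = rng.normal(size=200)

# NumPy arrays are float64 by default; cast to float32 to match nn.Linear weights
X_tensor = torch.tensor(X_toy, dtype=torch.float32)
y_tensor = torch.tensor(y_toy, dtype=torch.float32)

# TensorDataset pairs each feature row with its target
toy_data = TensorDataset(X_tensor, y_tensor)

# DataLoader yields mini-batches; shuffle=True reorders the samples every epoch
toy_loader = DataLoader(toy_data, batch_size=64, shuffle=True)

features, targets = next(iter(toy_loader))
print(features.shape, targets.shape)  # torch.Size([64, 8]) torch.Size([64])
print(len(toy_loader))                # 4 batches, since ceil(200 / 64) = 4
```

The last batch holds the remaining 8 samples, which is why the number of batches is the ceiling of the dataset size divided by the batch size.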



In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from torch.utils.data import DataLoader, TensorDataset

data = fetch_california_housing()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Use TensorDataset and DataLoader to wrap the training and testing data (features and targets), and name the results train_data, test_data, train_loader and test_loader
### BEGIN YOUR CODE HERE
train_data = test_data = train_loader = test_loader = None

### END YOUR CODE HERE


In [None]:
# Assert that the lengths of train_loader and test_loader are as expected
assert len(train_loader) == np.ceil(len(train_data) / train_loader.batch_size), "Incorrect train_loader length"
assert len(test_loader) == np.ceil(len(test_data) / test_loader.batch_size), "Incorrect test_loader length"

# Assert that the first batch of training data has the correct size
first_batch_features, first_batch_targets = next(iter(train_loader))
assert first_batch_features.size(0) == train_loader.batch_size, "Incorrect batch size for train_loader features"
assert first_batch_targets.size(0) == train_loader.batch_size, "Incorrect batch size for train_loader targets"


## 2. Create a Multilayer Perceptron (MLP) Class.

The MLP will have 3 linear layers (input->hidden->hidden->output). Each hidden layer should have `hidden_size` nodes and use ReLU as the activation function. You are asked to implement both the initialization function (how the neural network structure is built) and the forward function (how the output is calculated from the input).
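
To illustrate the general pattern of an `nn.Module` subclass, here is a sketch of a smaller network with only one hidden layer (input->hidden->output). `TinyNet` is a hypothetical class, not the required `MLP`; it only shows how layers declared in `__init__` are chained in `forward`.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Hypothetical one-hidden-layer network (input -> hidden -> output)."""

    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        # Layers are stored as attributes so PyTorch registers their parameters
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # Apply the layers in order; no activation on the regression output
        hidden = self.relu1(self.fc1(x))
        return self.fc2(hidden)

net = TinyNet(input_size=8, hidden_size=16, output_size=1)
batch = torch.randn(4, 8)  # 4 samples with 8 features each
out = net(batch)
print(out.shape)           # torch.Size([4, 1])
```

Adding a further hidden layer follows the same pattern: one more `nn.Linear` and one more `nn.ReLU` between the first and last layers.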




In [None]:
class MLP(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        """
        This function initializes the neural network
        Input:
            input_size: the dimension of input data
            hidden_size: number of nodes in each hidden layer
            output_size: the dimension of output data
        Output:
            None
        """
        super(MLP, self).__init__()
        ### BEGIN YOUR CODE
        # Name layers and activation functions as fc1, relu1, fc2, relu2, etc., which will be tested in the next code block

        ### END YOUR CODE


    def forward(self, input):
        """
        This function calculates the output from the input
        Input:
            input: input data
        Output:
            output: output data derived from the model
        """
        ### BEGIN YOUR CODE

        ### END YOUR CODE


        return output

In [None]:
test_input_size = 8
test_hidden_size = 16
test_output_size = 1

test_model = MLP(test_input_size, test_hidden_size, test_output_size)

assert isinstance(test_model.fc1, nn.Linear), "First layer should be an instance of nn.Linear"
assert isinstance(test_model.relu1, nn.ReLU), "ReLU activation function is not properly set"
assert isinstance(test_model.fc2, nn.Linear), "Second layer should be an instance of nn.Linear"
assert isinstance(test_model.relu2, nn.ReLU), "ReLU activation function is not properly set"


## 3. Set the model's (Hyper)-Parameters and Train the Model.

The model has 2 hidden layers with 64 nodes per layer. Train it for 100 epochs with a learning rate of 0.001, using the Adam optimizer on mini-batches from `train_loader`. You are going to implement the body of the training loop: make a forward pass, compute the loss, reset the gradients, backpropagate, and update the parameters (step the optimizer).
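
The usual order of operations in a PyTorch training step can be sketched on a toy problem. The example below fits a single linear layer to noise-free synthetic data using full-batch updates; `X_toy`, `y_toy`, `toy_model` and the learning rate of 0.05 are assumptions for illustration only and are not the settings required above.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Synthetic regression problem: y = X @ true_w, with no noise
X_toy = torch.randn(256, 3)
true_w = torch.tensor([[1.0], [-2.0], [0.5]])
y_toy = X_toy @ true_w

toy_model = nn.Linear(3, 1)
criterion = nn.MSELoss()
optimizer = optim.Adam(toy_model.parameters(), lr=0.05)

losses = []
for step in range(200):
    optimizer.zero_grad()                 # clear gradients from the previous step
    predictions = toy_model(X_toy)        # forward pass
    loss = criterion(predictions, y_toy)  # compute the loss
    loss.backward()                       # backward pass: compute gradients
    optimizer.step()                      # update the parameters
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

Calling `optimizer.zero_grad()` matters because PyTorch accumulates gradients across `backward()` calls by default.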

In [None]:
input_size = X_train.shape[1]
hidden_size = 64
output_size = 1
num_epochs = 100
learning_rate = 0.001

model = MLP(input_size, hidden_size, output_size)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

for epoch in range(num_epochs):
    for i, (features, targets) in enumerate(train_loader):
        targets = targets.view(-1, 1)

        ### BEGIN YOUR CODE
        # Forward pass

        # Backward pass and optimization

        ### END YOUR CODE


    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item():.4f}')

## 4. Use `sklearn.linear_model` to write a simple linear regression where the features are X and the outcome is y. Then train the linear regression model on the California Housing data.
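
As a brief sketch of the `LinearRegression` workflow, the example below fits a model to noise-free synthetic data, so the fitted coefficients can be checked against the known ones. The names `X_toy`, `y_toy` and `toy_reg` are hypothetical and used only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from y = 3*x0 - 1*x1 + 2 (no noise)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
y_toy = 3.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1] + 2.0

# fit() estimates the coefficients and intercept by ordinary least squares
toy_reg = LinearRegression()
toy_reg.fit(X_toy, y_toy)

print(toy_reg.coef_.shape)            # (2,): one coefficient per feature
print(np.round(toy_reg.coef_, 3))     # close to [ 3. -1.]
print(round(toy_reg.intercept_, 3))   # close to 2.0

# predict() returns one value per input row
toy_pred = toy_reg.predict(X_toy[:5])
```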

In [None]:
from sklearn.linear_model import LinearRegression

### BEGIN YOUR CODE
# Name your linear regression model as linear_reg, and train linear_reg
linear_reg = None

### END YOUR CODE

In [None]:
assert linear_reg.coef_ is not None, "Model coefficients have not been updated"
assert linear_reg.coef_.shape == (X_train.shape[1],), "Incorrect number of coefficients"

## 5. Compute the Mean Absolute Percentage Error (MAPE) given the predicted and true variables. Then compute the MAPE for both the linear regression model and the MLP model.
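
For reference, MAPE is commonly defined as the mean of `|y_true - y_pred| / |y_true|`, multiplied by 100 to express it as a percentage. The sketch below applies that definition to two hand-checkable values; `mape_sketch` is a hypothetical helper name, and it assumes that no true value is zero.

```python
import numpy as np

def mape_sketch(y_true, y_pred):
    """Hypothetical helper: mean absolute percentage error, in percent.

    Assumes y_true contains no zeros, since each error is divided by y_true.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Errors are |2 - 1| / 2 = 0.5 and |4 - 5| / 4 = 0.25, so the mean is 0.375
toy_mape = mape_sketch([2.0, 4.0], [1.0, 5.0])
print(f"{toy_mape:.1f}%")  # 37.5%
```

Both arguments should have the same shape; a column vector of shape `(n, 1)` compared against a flat array of shape `(n,)` would broadcast to an `(n, n)` matrix and give a misleading result.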

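Note that you should switch the model to evaluation mode with `model.eval()` before predicting, so that layers such as dropout and batch normalization (if present) behave as they do at inference time. You should also wrap the prediction in `torch.no_grad()` so that no gradients are tracked, which saves memory and computation.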
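
A minimal sketch of prediction in evaluation mode without gradient tracking is shown below. It uses a small untrained `nn.Sequential` network and random inputs; `toy_model` and `toy_inputs` are hypothetical names used only to show the pattern.

```python
import torch
import torch.nn as nn

# Small untrained network and random inputs, for illustration only
toy_model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 1))
toy_inputs = torch.randn(5, 8)

toy_model.eval()           # switch layers such as dropout to inference behavior
with torch.no_grad():      # do not build the computation graph
    toy_pred = toy_model(toy_inputs)

# Convert to a flat NumPy array so it can be compared with a 1-D target array
toy_pred_np = toy_pred.numpy().flatten()

print(toy_pred.requires_grad)  # False
print(toy_pred_np.shape)       # (5,)
```

Flattening the `(5, 1)` output to shape `(5,)` avoids unintended broadcasting when the predictions are compared element-wise with a one-dimensional array of true values.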


In [None]:
def mean_absolute_percentage_error(y_true, y_pred):
    """
    The function calculates mean absolute percentage error
    Input:
        y_true: true data
        y_pred: prediction data
    Output:
        mape: mean absolute percentage error * 100
    """
    ### BEGIN YOUR CODE

    ### END YOUR CODE



### BEGIN YOUR CODE
# Get predictions by linear regression model (y_pred_lr)
y_pred_lr = None

# Get predictions by MLP model (y_pred)
y_pred = None

### END YOUR CODE


mape_mlp = mean_absolute_percentage_error(y_test, y_pred)
mape_lr = mean_absolute_percentage_error(y_test, y_pred_lr)

print(f'MAPE for MLP Model: {mape_mlp:.2f}%')
print(f'MAPE for Linear Regression Model: {mape_lr:.2f}%')

## End of Problem Set 2.