<a target="_blank" href="https://colab.research.google.com/github/retowuest/uio-dl-2024/blob/main/Notebooks/nb-2.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Deep Learning for Social Scientists

### University of Oslo, November 27-28, 2024

### **Demo 3:**<br>Fully Connected Deep Neural Networks

### Table of Contents
* [Introduction](#section_1)
* [Loading the Data](#section_2)
* [Preprocessing the Data](#section_3)
* [Training a DNN Regression Model](#section_4)

### Introduction <a class="anchor" id="section_1"></a>

In this notebook, we will use [PyTorch](https://pytorch.org/) to build a feed-forward fully connected deep neural network to predict the fuel efficiency of a car in miles per gallon (MPG). Since MPG is a continuous variable, this is a regression task.

We will use the **Auto MPG** data set, which is a common machine learning benchmark data set for predicting the fuel efficiency of a car in MPG. The data set and its description are available from the UC Irvine Machine Learning Repository [here](https://archive.ics.uci.edu/ml/datasets/auto+mpg).

The data set includes 398 examples and nine variables. We will use the following eight variables *(data type in parentheses)*, with MPG as the output and the other seven as inputs:

- MPG *(continuous)*
- Cylinders *(continuous)*
- Displacement *(continuous)*
- Horsepower *(continuous)*
- Weight *(continuous)*
- Acceleration *(continuous)*
- Model Year *(ordered categorical)*
- Origin *(unordered categorical)*


### Loading the Data <a class="anchor" id="section_2"></a>

Let's first load the data set.

In [None]:
# Import pandas library
import pandas as pd

# Specify URL for downloading data set
url = "http://archive.ics.uci.edu/ml/" \
      "machine-learning-databases/auto-mpg/auto-mpg.data"

# Create list with feature names
col_names = [
      "MPG", "Cylinders", "Displacement", "Horsepower",
      "Weight", "Acceleration", "Model Year", "Origin"
]

# Load data set
auto_mpg_df = pd.read_csv(
      url, sep=" ", names=col_names,
      skipinitialspace=True,  # skip spaces after delimiter
      na_values="?",  # string to be recognized as NA
      comment="\t"  # character indicating that remainder of line should not be parsed
)

# Convert column Cylinders from integer to float
auto_mpg_df["Cylinders"] = auto_mpg_df["Cylinders"].astype(float)
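
To see what the `read_csv()` arguments above do, here is a minimal sketch that parses two made-up rows laid out like the raw file (values separated by runs of spaces, a tab before the quoted car name, and `?` for a missing value). The rows and the name `demo_df` are illustrative assumptions, not part of the data set.

```python
import io
import pandas as pd

# Two made-up rows in the layout of auto-mpg.data (illustrative values only)
sample = (
    '18.0   8   307.0      130.0      3504.      12.0   70  1\t"car a"\n'
    '25.0   4   98.00      ?          2046.      19.0   71  1\t"car b"\n'
)

col_names = [
    "MPG", "Cylinders", "Displacement", "Horsepower",
    "Weight", "Acceleration", "Model Year", "Origin"
]

demo_df = pd.read_csv(
    io.StringIO(sample), sep=" ", names=col_names,
    skipinitialspace=True,  # collapse the runs of spaces between values
    na_values="?",  # "?" becomes NaN
    comment="\t"  # drop everything from the tab onward (the car name)
)

print(demo_df)
print(demo_df["Horsepower"].isna().sum())  # number of missing Horsepower values
```

The car name never reaches the data frame because the tab is treated as the start of a comment.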

### Preprocessing the Data <a class="anchor" id="section_3"></a>

As a first preprocessing step, we drop from the data set any rows containing NAs.

In [None]:
# Count number of NAs by column
print(auto_mpg_df.isnull().sum(axis=0))

# Drop rows with NAs
auto_mpg_df = auto_mpg_df.dropna()
auto_mpg_df = auto_mpg_df.reset_index(drop=True)

MPG             0
Cylinders       0
Displacement    0
Horsepower      6
Weight          0
Acceleration    0
Model Year      0
Origin          0
dtype: int64


Next, we partition the data set into a training set and a test set, using the `train_test_split()` function from the `model_selection` module in the `scikit-learn` library.

In [None]:
# Import scikit-learn library and the train_test_split function
# from model_selection module in the scikit-learn library
import sklearn
from sklearn.model_selection import train_test_split

# Create training and test sets
df_train, df_test = train_test_split(
    auto_mpg_df,
    train_size=0.8,  # indicates proportion of data set to include in training set
    random_state=1
)
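
As a quick check on the resulting set sizes, the sketch below applies the same call to a placeholder array with 392 entries, which is the number of rows left after dropping NAs. The array `rows` is a hypothetical stand-in for the data frame.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder for the 392 rows that remain after dropping NAs
rows = np.arange(392)

rows_train, rows_test = train_test_split(
    rows,
    train_size=0.8,
    random_state=1
)

# The training size is rounded down; the rest goes to the test set
print(len(rows_train), len(rows_test))
```

Fixing `random_state` makes the shuffling reproducible, so rerunning the cell yields the same partition.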

We standardize the continuous inputs (which are `Cylinders`, `Displacement`, `Horsepower`, `Weight`, and `Acceleration`).

In [None]:
# Create list with names of numeric features
numeric_col_names = [
    "Cylinders", "Displacement",
    "Horsepower", "Weight",
    "Acceleration"
]

# Use describe method to obtain descriptive statistics for features
# (we will use the mean and sd to standardize the continuous features)
train_stats = df_train.describe().transpose()

# Create copies of the training and test sets
df_train_std, df_test_std = df_train.copy(), df_test.copy()

# Iterate over numeric inputs and standardize them
for col_name in numeric_col_names:
    mean = train_stats.loc[col_name, "mean"]  # we use the loc method to access rows and columns by label
    sd = train_stats.loc[col_name, "std"]
    df_train_std.loc[:, col_name] = (df_train_std.loc[:, col_name] - mean) / sd
    df_test_std.loc[:, col_name] = (df_test_std.loc[:, col_name] - mean) / sd

# Print last few rows of standardized training set
df_train_std.tail()

| | MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model Year | Origin |
|---|---|---|---|---|---|---|---|---|
| 203 | 28.0 | -0.824303 | -0.90102 | -0.736562 | -0.950031 | 0.255202 | 76 | 3 |
| 255 | 19.4 | 0.351127 | 0.4138 | -0.340982 | 0.29319 | 0.548737 | 78 | 1 |
| 72 | 13.0 | 1.526556 | 1.144256 | 0.713897 | 1.339617 | -0.625403 | 72 | 1 |
| 235 | 30.5 | -0.824303 | -0.89128 | -1.053025 | -1.072585 | 0.475353 | 77 | 1 |
| 37 | 14.0 | 1.526556 | 1.563051 | 1.636916 | 1.47042 | -1.35924 | 71 | 1 |
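
Note that the test set is standardized with the mean and standard deviation of the training set, so no information from the test set enters preprocessing. Here is a minimal sketch of that idea on a toy feature; the numbers and the names `toy_train` and `toy_test` are made up for illustration.

```python
import pandas as pd

# Toy training and test values for a single feature (made-up numbers)
toy_train = pd.Series([2000.0, 3000.0, 4000.0])
toy_test = pd.Series([3500.0])

# Statistics come from the training data only
mean = toy_train.mean()  # 3000.0
sd = toy_train.std()  # sample standard deviation (ddof=1): 1000.0

# Apply the same transformation to both sets
toy_train_std = (toy_train - mean) / sd
toy_test_std = (toy_test - mean) / sd

print(toy_train_std.tolist())  # [-1.0, 0.0, 1.0]
print(toy_test_std.tolist())  # [0.5]
```

The pandas `std()` method, like `describe()`, reports the sample standard deviation, which divides by n - 1 rather than n.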


To simplify the learning task, let's collapse the input `Model Year` into four roughly equal-sized buckets: (-∞, 73), [73, 76), [76, 79), [79, ∞). Note that the cut-off values 73, 76, and 79 are the first (Q1), second (Q2), and third (Q3) quartiles of the distribution of `Model Year` (see below).

In [None]:
# Descriptive statistics for column Model Year
auto_mpg_df["Model Year"].describe()

count    392.000000
mean      75.979592
std        3.683737
min       70.000000
25%       73.000000
50%       76.000000
75%       79.000000
max       82.000000
Name: Model Year, dtype: float64

We use the `torch.bucketize()` [function](https://pytorch.org/docs/stable/generated/torch.bucketize.html) from PyTorch to generate indices of the buckets.

In [None]:
# Import torch library
import torch

# Specify cut-off values (boundaries)
boundaries = torch.tensor([73, 76, 79])

# Extract values of feature Model Year and create tensor
# for training and test sets
v_train = torch.tensor(df_train_std["Model Year"].values)

v_test = torch.tensor(df_test_std["Model Year"].values)

# Use bucketize function to generate indices of buckets
# and add as new feature to training and test sets
df_train_std["Model Year Bucketed"] = torch.bucketize(
    v_train, boundaries, right=True  # if right is True, then right boundary is open (as defined in above text cell)
)

df_test_std["Model Year Bucketed"] = torch.bucketize(
    v_test, boundaries, right=True
)

# Add new feature to list with names of numeric features
numeric_col_names.append("Model Year Bucketed")
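
To make the role of the `right` argument concrete, the sketch below buckets a handful of hand-picked model years with both settings. The tensor `years` is illustrative and not taken from the data set.

```python
import torch

boundaries = torch.tensor([73, 76, 79])

# Hand-picked years, including values that sit exactly on a boundary
years = torch.tensor([70, 72, 73, 75, 76, 78, 79, 82])

# right=True: buckets are [left, right), so 73 falls into bucket 1
idx_right_true = torch.bucketize(years, boundaries, right=True)

# right=False (the default): buckets are (left, right], so 73 falls into bucket 0
idx_right_false = torch.bucketize(years, boundaries, right=False)

print(idx_right_true.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3]
print(idx_right_false.tolist())  # [0, 0, 0, 1, 1, 2, 2, 3]
```

Only the values that coincide with a boundary (73, 76, 79) change bucket between the two settings.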

Recall that the feature `Origin` is of type unordered categorical. We will one-hot-encode this feature (i.e., create a set of indicator variables representing the feature categories), using the `one_hot()` [function](https://pytorch.org/docs/stable/generated/torch.nn.functional.one_hot.html) from PyTorch.

In [None]:
# Import one_hot function
from torch.nn.functional import one_hot

# Create a set containing the values in feature Origin,
# return the number of items in the set, and store this
# number in an object
total_origin = len(set(df_train_std["Origin"]))

# One-hot-encode feature Origin for training and test sets
origin_encoded_train = one_hot(
    torch.from_numpy(df_train_std["Origin"].values) % total_origin,  # modulo operation maps class values 1, 2, 3 to indices 1, 2, 0
    num_classes=total_origin  # fix number of indicator columns so that training and test encodings always match
)

origin_encoded_test = one_hot(
    torch.from_numpy(df_test_std["Origin"].values) % total_origin,
    num_classes=total_origin
)
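
The following sketch shows on a toy tensor which indicator column each class value ends up in after the modulo operation. The tensor `toy_origin` is made up for illustration.

```python
import torch
from torch.nn.functional import one_hot

# Toy class values as they appear in feature Origin
toy_origin = torch.tensor([1, 2, 3, 1])

# Modulo 3 maps 1 -> 1, 2 -> 2, 3 -> 0
toy_index = toy_origin % 3

# Each row has a single 1 in the column given by its index
toy_encoded = one_hot(toy_index, num_classes=3)

print(toy_index.tolist())  # [1, 2, 0, 1]
print(toy_encoded.tolist())  # [[0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
```

The column order does not matter for the network, as long as the training and test sets use the same mapping. Subtracting 1 instead of taking the modulo would be an alternative that keeps the original order of the classes.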

We now concatenate the one-hot-encoded feature `Origin` with the numeric features for both the training and test set.

In [None]:
# Extract numeric features in training set and create tensor
x_train_numeric = torch.tensor(
    df_train_std[numeric_col_names].values
)

# Concatenate numeric features and encoded origin for training set
x_train = torch.cat([x_train_numeric, origin_encoded_train], 1).float()  # dim=1 concatenates along the column dimension, i.e., the three indicator columns are appended to the numeric columns

# Extract numeric features in test set and create tensor
x_test_numeric = torch.tensor(
    df_test_std[numeric_col_names].values
)

# Concatenate numeric features and encoded origin for test set
x_test = torch.cat([x_test_numeric, origin_encoded_test], 1).float()
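
A small shape check can help here. The sketch below uses placeholder tensors with four rows, six numeric columns, and three indicator columns, mirroring the column counts in this notebook; the tensors themselves are made up.

```python
import torch

# Placeholders: 4 examples, 6 numeric columns, 3 one-hot columns
toy_numeric = torch.zeros(4, 6, dtype=torch.float64)
toy_indicators = torch.ones(4, 3, dtype=torch.int64)

# Concatenate along dim=1 (columns) and convert to 32-bit floats
toy_x = torch.cat([toy_numeric, toy_indicators], 1).float()

print(toy_x.shape)  # torch.Size([4, 9])
print(toy_x.dtype)  # torch.float32
```

The call to `.float()` matters because the layers in `torch.nn` use 32-bit floats by default, whereas tensors created from pandas columns are typically 64-bit.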

Finally, we create tensors from the output values for both the training and test set.

In [None]:
# Create output tensors for training and test sets
y_train = torch.tensor(df_train_std["MPG"].values).float()
y_test = torch.tensor(df_test_std["MPG"].values).float()

### Training a DNN Regression Model <a class="anchor" id="section_4"></a>

We first create a data loader that uses a batch size of 8 for the training data.

In [None]:
# Import TensorDataset and DataLoader
from torch.utils.data import TensorDataset, DataLoader

# Combine input and output tensors in a joint training data set
train_ds = TensorDataset(x_train, y_train)

# Specify batch size
batch_size = 8

# Create data loader
torch.manual_seed(1)
train_dl = DataLoader(
    train_ds,  # data set from which to load data
    batch_size,  # how many examples per batch to load
    shuffle=True  # if True, then data are reshuffled at every epoch
)
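
To see how a data loader splits a data set into batches, consider the following sketch with ten made-up examples and a batch size of 4. The names `toy_ds` and `toy_dl` are illustrative.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Ten made-up examples with two inputs each
toy_x = torch.arange(20, dtype=torch.float32).reshape(10, 2)
toy_y = torch.arange(10, dtype=torch.float32)
toy_ds = TensorDataset(toy_x, toy_y)

torch.manual_seed(1)
toy_dl = DataLoader(toy_ds, batch_size=4, shuffle=True)

# Collect the number of examples in each batch of one epoch
batch_sizes = [len(y_batch) for x_batch, y_batch in toy_dl]

print(len(toy_dl))  # 3
print(batch_sizes)  # [4, 4, 2]
```

When the number of examples is not a multiple of the batch size, the last batch is smaller than the others.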

Next, we will build a deep neural network with two hidden layers using the `torch.nn` [module](https://pytorch.org/docs/stable/nn.html) from PyTorch. The first hidden layer has 8 hidden units and the second hidden layer has 4 hidden units.

In [None]:
# Import torch.nn module
import torch.nn as nn

# Create list specifying number of hidden units per hidden layer
hidden_units = [8, 4] # first layer 8 units, second 4

# Initialize input size
# (for the first hidden layer, the input size is the number of input features)
input_size = x_train.shape[1]

# Create container for the layers of the network
all_layers = []

# Create the layers of the network
# (the loop creates the two hidden layers; the output layer is added after the loop)
for hidden_unit in hidden_units:
    layer = nn.Linear(input_size, hidden_unit)  # nn.Linear applies a linear transformation to the incoming data; the first argument specifies input size, and the second argument specifies output size
    all_layers.append(layer)  # add layer to container
    all_layers.append(nn.ReLU())  # applies ReLU activation function
    input_size = hidden_unit  # set input size of next layer equal to output size of current layer
all_layers.append(nn.Linear(hidden_units[-1], 1))  # create the output layer; its input size is the output size of the last hidden layer, and its output size is 1 because we predict a single value (MPG) per example

# Create the model based on the layers defined above
model = nn.Sequential(*all_layers)  # Sequential chains the layers so that the output of each layer is the input to the next (all_layers is a list, and the * operator unpacks it into separate arguments)

# Print the model
model

Sequential(
  (0): Linear(in_features=9, out_features=8, bias=True)
  (1): ReLU()
  (2): Linear(in_features=8, out_features=4, bias=True)
  (3): ReLU()
  (4): Linear(in_features=4, out_features=1, bias=True)
)
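
The printed architecture also tells us how many parameters the network has. As a sketch, we can rebuild the same architecture by hand (assuming nine input features, as in the output above) and count them; the name `toy_model` is illustrative.

```python
import torch.nn as nn

# Same architecture as printed above, written out explicitly
toy_model = nn.Sequential(
    nn.Linear(9, 8), nn.ReLU(),
    nn.Linear(8, 4), nn.ReLU(),
    nn.Linear(4, 1),
)

# Each linear layer has (inputs * outputs) weights plus one bias per output:
# (9*8 + 8) + (8*4 + 4) + (4*1 + 1) = 80 + 36 + 5 = 121
n_params = sum(p.numel() for p in toy_model.parameters())

print(n_params)  # 121
```

The ReLU activations have no parameters of their own.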

After defining the model, we choose a loss function and an optimization algorithm. Here, we use the mean squared error (MSE) as the loss function and stochastic gradient descent (SGD) as the optimization algorithm.

In [None]:
# Specify MSE as loss function
loss_fn = nn.MSELoss()

# Specify SGD as optimization algorithm
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)  # the first argument specifies parameters to optimize, the second argument specifies the learning rate
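
To make these two choices concrete, here is a minimal sketch on made-up numbers. It first compares `nn.MSELoss()` with the mean of the squared differences computed by hand, and then performs a single SGD step on a one-parameter toy problem. All tensors and names in the sketch are illustrative.

```python
import torch
import torch.nn as nn

# MSE: mean of squared differences between predictions and targets
toy_pred = torch.tensor([20.0, 25.0, 30.0])
toy_target = torch.tensor([22.0, 25.0, 27.0])
mse_builtin = nn.MSELoss()(toy_pred, toy_target)
mse_manual = ((toy_pred - toy_target) ** 2).mean()  # (4 + 0 + 9) / 3

# SGD: parameter <- parameter - learning rate * gradient
w = torch.tensor([1.0], requires_grad=True)
toy_optimizer = torch.optim.SGD([w], lr=0.001)
toy_loss = ((2 * w - 4) ** 2).sum()  # gradient at w = 1 is 2 * (2 - 4) * 2 = -8
toy_loss.backward()
toy_optimizer.step()  # w becomes 1 - 0.001 * (-8) = 1.008

print(mse_builtin.item(), mse_manual.item())
print(w.item())
```

Because the gradient is negative, the step increases `w`, which moves it toward the minimizer of the toy loss at `w = 2`.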

We can now train the model. We use 200 epochs (passes over the full training data set) and display the training loss every 20 epochs.

In [None]:
# Set seed
torch.manual_seed(1)

# Specify number of epochs
num_epochs = 200

# Specify after how many epochs loss is to be displayed
log_epochs = 20

# Train model
for epoch in range(num_epochs):  # loop over epochs
    loss_hist_train = 0  # initialize running sum of batch losses for the current epoch
    for x_batch, y_batch in train_dl:  # loop over batches of training data
        pred = model(x_batch)[:, 0]  # prediction of model for current batch
        loss = loss_fn(pred, y_batch)  # loss for current batch
        loss.backward()  # compute gradients with backpropagation
        optimizer.step()  # update parameters based on computed gradients
        optimizer.zero_grad()  # reset gradients to 0 (so that we do not add up the gradients)
        loss_hist_train += loss.item()  # add current loss to loss history
    if epoch % log_epochs == 0:
        print(f"Epoch {epoch}:  Loss "
              f"{loss_hist_train/len(train_dl):.4f}")  # calculate average loss per batch for the epoch

Epoch 0:  Loss 536.1047
Epoch 20:  Loss 8.4361
Epoch 40:  Loss 7.8695
Epoch 60:  Loss 7.1891
Epoch 80:  Loss 6.7064
Epoch 100:  Loss 6.7603
Epoch 120:  Loss 6.3107
Epoch 140:  Loss 6.6884
Epoch 160:  Loss 6.7549
Epoch 180:  Loss 6.2029
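
One detail in the training loop deserves a closer look: the indexing `[:, 0]` applied to the model output. The sketch below uses a single made-up linear layer and random tensors to show why it is needed; the names `toy_layer`, `out`, and `toy_y` are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
toy_layer = nn.Linear(9, 1)
toy_x = torch.randn(8, 9)  # batch of 8 examples with 9 inputs each
toy_y = torch.randn(8)  # one target per example

out = toy_layer(toy_x)  # shape (8, 1): one column of predictions
pred = out[:, 0]  # shape (8,): matches the shape of the targets

# Without the indexing, broadcasting pairs every prediction with every target
diff_mismatched = out - toy_y  # shape (8, 8)
diff_matched = pred - toy_y  # shape (8,)

print(out.shape, pred.shape)
print(diff_mismatched.shape, diff_matched.shape)
```

With mismatched shapes, the loss would average over 64 prediction-target pairs instead of 8, so the model would be trained on the wrong objective.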


We can now evaluate the performance of the trained model on the test data set. To predict the outputs of new test examples, we feed their inputs to the model.

In [None]:
# Evaluate the trained model on the test set
with torch.no_grad():  # disables gradient tracking, which we do not need for evaluation and which saves memory and computation
    pred = model(x_test)[:, 0]
    loss = loss_fn(pred, y_test)
    print(f"Test MSE: {loss.item():.4f}")

Test MSE: 9.5907
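
The effect of `torch.no_grad()` can be checked directly. In the sketch below, a made-up linear layer is applied to the same input once with and once without gradient tracking; the names are illustrative.

```python
import torch
import torch.nn as nn

toy_layer = nn.Linear(3, 1)
toy_x = torch.ones(2, 3)

# Default: the output is attached to the computation graph
out_tracked = toy_layer(toy_x)

# Inside no_grad: no graph is built for the output
with torch.no_grad():
    out_untracked = toy_layer(toy_x)

print(out_tracked.requires_grad)  # True
print(out_untracked.requires_grad)  # False
```

The predicted values are identical in both cases; only the bookkeeping needed for backpropagation is skipped.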


We now have an estimate of the expected test error of the model.
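
Since the MSE is measured in squared MPG, it can be easier to interpret its square root, the root mean squared error (RMSE), which is on the same scale as the output. A minimal sketch, using the test MSE reported above:

```python
import math

# Test MSE as reported in the output above
test_mse = 9.5907

# RMSE is on the same scale as the output (MPG)
test_rmse = math.sqrt(test_mse)

print(f"Test RMSE: {test_rmse:.2f} MPG")  # Test RMSE: 3.10 MPG
```

Roughly speaking, the model's predictions on the test set are off by about 3 MPG.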