# Chapter 3 -- Neural Networks with Multiple Inputs and Outputs

In this tutorial, we will use [PyTorch](https://pytorch.org) with [Lightning](https://www.lightning.ai) to create and optimize a simple neural network with multiple inputs and outputs, like the one shown in the picture below:

<img src="./images/final_nn.png" alt="a neural network with multiple inputs and outputs" style="width: 1000px;">

In this tutorial, we will:

- **Import and format data and then build a DataLoader from scratch**
- **Build a Neural Network with multiple inputs and outputs**
- **Train a Neural Network with multiple inputs and outputs**
- **Make predictions with new data**

In [323]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam

import lightning as L
from torch.utils.data import TensorDataset, DataLoader

import numpy as np
import pandas as pd  # Read in the data and normalize it
from sklearn.model_selection import train_test_split  # Create training and testing datasets

## Build a DataLoader

### The Iris Flower dataset

Once we have the Python modules imported, we need to import the data that we will use to train and test our neural network. Specifically, we're going to use the **[Iris flower dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set)**, which we will import from a comma-separated (CSV) text file so that we can learn how to build a DataLoader from scratch.

It is a classic dataset originally made famous by Ronald Fisher in 1936, and it has since been used countless times to demonstrate the effectiveness of various classification algorithms. The dataset consists of 150 samples in total, 50 for each of 3 species of Iris: *Setosa*, *Versicolor*, and *Virginica*. Each row in the dataset contains measurements for 4 variables: **[petal](https://en.wikipedia.org/wiki/Petal)** width and length and **[sepal](https://en.wikipedia.org/wiki/Sepal)** width and length.

_The data file we are going to import, `iris.txt`, was originally downloaded from the **[UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/)**, which has a lot of great datasets that we can practice building Neural Networks (or any other machine learning algorithm) with. When we download the datasets from UCI, we get one file that has the data and another file that describes the data, including the names of each variable, or column, in the dataset. If we'd like to see the original data files, we can find them **[here](https://archive.ics.uci.edu/dataset/53/iris)**._

### Import data

In [254]:
# We'll read in the dataset with the pandas function read_table()
# read_table() can read in various text files including, comma-separated and tab-delimited
# url = "https://raw.githubusercontent.com/StatQuest/signa/main/chapter_03/iris.txt"
# df = pd.read_table(url, sep=",", header=None)

# Or we can directly use the `read_csv` method
df = pd.read_csv("iris.txt", header=None)

Now, in theory, we have loaded the data into a DataFrame called `df`, but it's always a good idea to make sure this worked as expected. So, we'll print out the first handful of rows in the dataset with the `head()` method.

In [255]:
# print out the first handful of rows using the head() method
df.head()

Unnamed: 0,0,1,2,3,4
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


When we print out the first few rows of our new DataFrame, `df`, the first thing we see is that the columns are not named. In theory, it's fine to have unnamed columns (and just have numbers), but it makes the data hard to look at, so let's add the column names to `df`. To name each column, we simply assign a list of column names to `columns`.

In [256]:
# To name each column, we assign a list of column names to `columns`
df.columns = [
    "sepal_length",
    "sepal_width",
    "petal_length",
    "petal_width",
    "class"]

# To verify we did that correctly, let's print out the first few rows
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


Hooray! Now that we can look at our DataFrame without getting a headache, let's see how big this dataset is and figure out how many different iris species we will have to train our neural network to predict. First, let's see how many rows and columns are in the dataset with `.shape`.

In [257]:
df.shape  # shape returns the number of rows and columns

(150, 5)

In [258]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   class         150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


So, our dataset has 150 rows and 5 columns. Now let's see how many different types of iris are in it. We'll do this by counting the unique values in the column called `class` with `.nunique()`.

In [259]:
# To determine the number of iris species in the dataset,
# we'll count the number of unique values in the column called `class`
df['class'].nunique()

3

And we get the number we expected, 3. So that's good! Now let's print out the names of the 3 species with `.unique()`.

In [260]:
# We can print out the unique values in a dataframe's 
# column with the 'unique()' method
df['class'].unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

So, just as we expected, we see that we have 3 different species of iris in our dataset: *Setosa*, *Versicolor*, *Virginica*.

Now let's verify that our dataset is balanced, meaning we have roughly the same number of entries (rows) in our data for each of the 3 iris species that we want our neural network to classify. We can do this with a fancy `for` loop that prints out the number of rows per class, regardless of the number of classes we have in our dataset.

In [261]:
for class_name in df['class'].unique(): # for each unique class name...

    # ...print out the number of rows associated with it
    print(class_name, ": ", sum(df['class'] == class_name), sep="")

Iris-setosa: 50
Iris-versicolor: 50
Iris-virginica: 50


In [262]:
df.groupby('class').count()

class,sepal_length,sepal_width,petal_length,petal_width
Iris-setosa,50,50,50,50
Iris-versicolor,50,50,50,50
Iris-virginica,50,50,50,50


In this case, our dataset isn't just relatively well balanced, it is exactly balanced, and each class has exactly 50 rows of data associated with it. However, if things were really skewed, for example, if we had 100 rows of data for *Setosa*, 100 rows for *Versicolor*, and only 10 rows of data for *Virginica*, then we might need to find some way to make the data more balanced. Balancing datasets is way out of the scope of this tutorial, but if you'd like to learn more, this simple [Google search](https://www.google.com/search?q=how+to+balance+datasets) is a good place to start.
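As a side note, the per-class row counts can also be obtained in a single call with the pandas method `value_counts()`. The sketch below uses a small made-up Series rather than the real dataset; the name `toy_labels` is hypothetical and only stands in for `df['class']`.

```python
import pandas as pd

# A tiny, made-up column of labels (a stand-in for df['class'])
toy_labels = pd.Series(
    ["Iris-setosa", "Iris-versicolor", "Iris-setosa",
     "Iris-virginica", "Iris-versicolor", "Iris-virginica"],
    name="class",
)

# Count how many rows belong to each class
counts = toy_labels.value_counts()
print(counts)

# normalize=True reports the fraction of rows per class instead
fractions = toy_labels.value_counts(normalize=True)
print(fractions)
```

With a skewed dataset, the fractions make the imbalance easy to spot at a glance.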

### Prepare the training data

#### Split

Now, let's split the data into **training** and **testing** datasets. The first step is to separate the columns into input values and labels.

In this example, to keep the neural network simple, we'll just use `petal_width` and `sepal_width` values for the inputs. So the first thing we'll do is make sure we can correctly isolate the columns we want from the columns we don't want. We do this by passing `df` a list of column names we want to get values for, `['petal_width', 'sepal_width']`.

In [263]:
# Print out the first few rows of just the `petal_width` and `sepal_width` columns
df[['petal_width', 'sepal_width']].head()

Unnamed: 0,petal_width,sepal_width
0,0.2,3.5
1,0.2,3.0
2,0.2,3.2
3,0.2,3.1
4,0.2,3.6


Now that we have confirmed that we can correctly isolate the values for `petal_width` and `sepal_width`, let's use the original DataFrame, `df`, to create two new objects. The first, a DataFrame called `input_values`, will hold the petal and sepal widths, the values we will use to make predictions.

In [264]:
input_values = df[['petal_width', 'sepal_width']]
input_values.head()

Unnamed: 0,petal_width,sepal_width
0,0.2,3.5
1,0.2,3.0
2,0.2,3.2
3,0.2,3.1
4,0.2,3.6


The other object will hold the species, the values we will use to determine how good those predictions are. Because we select a single column, `df['class']` is a pandas Series rather than a DataFrame, and we'll call it `label_values`.

In [265]:
label_values = df['class']
print(label_values.head())

0    Iris-setosa
1    Iris-setosa
2    Iris-setosa
3    Iris-setosa
4    Iris-setosa
Name: class, dtype: object


Now, because neural networks expect the input and output values to be numbers, we need to convert the strings in `label_values` into numbers, and we'll do this with [`factorize()`](https://pandas.pydata.org/docs/reference/api/pandas.factorize.html).

In [266]:
label_values.factorize()

(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]),
 Index(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype='object'))

It returns a tuple with two elements: an array of integer codes and an Index of the unique values. Since we only need the codes, we index the output of `factorize()` with `[0]`.

In [267]:
# Convert the strings in the 'class' column into numbers with factorize()
classes_as_numbers = label_values.factorize()[0]

As we can see in the output of `factorize()` above, the strings were converted into numbers. The first 50 values are 0, which represents *Setosa*. The following 50 values are 1, for *Versicolor*, and the last 50 values are 2, for *Virginica*.
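To make the two return values of `factorize()` more concrete, here is a minimal sketch on a made-up Series (`toy_species` is a hypothetical name, not part of the tutorial's data). It illustrates that codes are assigned in order of first appearance and that the unique values can be used to map codes back to strings.

```python
import pandas as pd

# A small, made-up Series with the species in a shuffled order
toy_species = pd.Series(
    ["Iris-setosa", "Iris-virginica", "Iris-setosa", "Iris-versicolor"]
)

# factorize() returns (codes, uniques)
codes, uniques = toy_species.factorize()
print(codes)          # one integer code per row
print(list(uniques))  # the string each code stands for

# Codes follow the order of first appearance, so indexing
# `uniques` with `codes` recovers the original strings
decoded = list(uniques[codes])
print(decoded)
```

In `iris.txt` the rows happen to be ordered by species, which is why *Setosa*, *Versicolor*, and *Virginica* end up as 0, 1, and 2. With a shuffled file, the mapping would depend on which species appears first.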

Now, we need to split `input_values` and `classes_as_numbers` into **training** and **testing datasets**. And we do this with the **[sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)** function `train_test_split()`.

_In practice, people usually use anywhere from 25-33% of the data for testing how well the model was trained. In this case, we'll use 25%, which is the default, but any percentage can be specified by setting the `test_size` parameter to a value between 0 and 1. Also, because we want to ensure that our test dataset has data for all three species of iris, we'll set `stratify=classes_as_numbers`._

In [268]:
input_train, input_test, label_train, label_test = train_test_split(
    input_values,
    classes_as_numbers,
    test_size=0.25,
    stratify=classes_as_numbers,
    random_state=42)

Now we can verify that `train_test_split()` correctly put 75% of the data into `input_train` and `label_train` by printing out their shapes. Remember, 75% of 150 = 112.5, and since we can't have half a row, we would expect both `input_train` and `label_train` to have 112 rows, with the remaining 38 rows going to the test set.

In [269]:
print(
    "Size of the training data set:",
    input_train.shape,
    "\nSize of the training label set:",
    label_train.shape
)

Size of the training data set: (112, 2) 
Size of the training label set: (112,)


Both `input_train` and `label_train` have 112 rows, which is what we expect. Now, let's verify that the remaining 38 rows of data went into `input_test` and `label_test` by printing out their shapes.

In [270]:
print(
    "Size of the test data set:",
    input_test.shape,
    "\nSize of the test label set:",
    label_test.shape
)

Size of the test data set: (38, 2) 
Size of the test label set: (38,)
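Checking the shapes tells us how many rows ended up in each split, but not whether `stratify` kept the classes balanced. The sketch below illustrates one way to check that, using made-up stand-in arrays (`toy_inputs` and `toy_labels` are hypothetical names) with the same size and class balance as the Iris data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up stand-ins: 150 rows with 2 features, and 50 labels per class
toy_inputs = np.zeros((150, 2))
toy_labels = np.repeat([0, 1, 2], 50)

_, _, toy_label_train, toy_label_test = train_test_split(
    toy_inputs,
    toy_labels,
    test_size=0.25,
    stratify=toy_labels,
    random_state=42,
)

# Count the rows per class in each split
train_counts = np.bincount(toy_label_train)
test_counts = np.bincount(toy_label_test)
print("train:", train_counts)
print("test: ", test_counts)
```

Because 38 test rows cannot be divided evenly among 3 classes, the per-class counts differ by at most one row, which is as balanced as the split can be.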


#### One-hot encoding

Now, because our neural network will have 3 outputs, one for each species (see the drawing of the neural network above), we need to convert the numbers in `label_train` into 3 element arrays, where each element in an array corresponds to a specific output in the neural network. Specifically, we'll use:

- `[1.0, 0.0, 0.0]` to correspond to *Setosa*
- `[0.0, 1.0, 0.0]` for *Versicolor*
- `[0.0, 0.0, 1.0]` for *Virginica*

The good news is that we can easily do the **[one-hot encoding](https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.one_hot.html)** with `F.one_hot()`. We also tack on `.type(torch.float32)` to ensure the numbers are saved in the correct format for the neural network to process efficiently.
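One detail worth knowing is that, by default, `F.one_hot()` infers the number of classes from the largest label it sees. The sketch below, which uses a made-up tensor called `toy_labels`, shows how passing `num_classes` explicitly guarantees 3 columns even when a set of labels happens to contain no *Virginica* (class 2).

```python
import torch
import torch.nn.functional as F

# Made-up integer labels that happen to contain no class 2
toy_labels = torch.tensor([0, 1, 1, 0])

# Without num_classes, the width is inferred from the largest label
inferred = F.one_hot(toy_labels)
print(inferred.shape)  # torch.Size([4, 2])

# With num_classes=3 we always get one column per species
explicit = F.one_hot(toy_labels, num_classes=3).type(torch.float32)
print(explicit.shape)  # torch.Size([4, 3])
print(explicit)
```

Our stratified training labels contain all three classes, so the default works in this tutorial, but setting `num_classes` is a cheap safeguard.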

In [271]:
print(label_train)

[2 2 1 1 1 2 0 2 0 2 0 2 1 0 0 1 2 0 0 1 1 1 0 1 2 0 2 1 2 0 0 1 0 2 0 0 1
 0 1 0 0 1 2 2 0 2 1 0 2 0 2 2 0 1 2 2 1 1 0 1 1 2 1 2 0 1 0 2 1 2 1 2 2 0
 2 1 0 2 0 2 1 1 0 2 2 0 0 2 2 1 2 0 2 1 2 2 0 1 1 1 1 1 0 2 1 1 0 0 0 0 1
 0]


In [272]:
# Create a new tensor with one-hot encoded rows for each row in the original dataset.
one_hot_label_train = F.one_hot(torch.tensor(label_train)).type(torch.float32)

If we printed out the entire contents of `one_hot_label_train`, we'd get a matrix with 112 rows (one per training sample), which would take up a lot of space. So, instead, let's print out the first 10 rows.

In [273]:
# Print out a few rows of the one-hot encoded data.
one_hot_label_train[:10]

tensor([[0., 0., 1.],
        [0., 0., 1.],
        [0., 1., 0.],
        [0., 1., 0.],
        [0., 1., 0.],
        [0., 0., 1.],
        [1., 0., 0.],
        [0., 0., 1.],
        [1., 0., 0.],
        [0., 0., 1.]])

So, as we can see in the output above, `label_train` was correctly one-hot encoded and saved in `one_hot_label_train`.

#### Normalization

Now, let's normalize the input variables so that their values range from 0 to 1. Normalizing data, so that it's all on the same scale, often makes it easier to train machine learning methods. In this case, since we have two datasets, `input_train` and `input_test`, we'll start by determining the maximum and minimum values in `input_train`. Then we will use those values to normalize both `input_train` and `input_test`. Using the maximum and minimum values from `input_train` to normalize both datasets avoids something called **Data Leakage**.

In [274]:
# First, determine the maximum values in input_train
max_vals_in_input_train = input_train.max()

print(max_vals_in_input_train)

petal_width    2.5
sepal_width    4.4
dtype: float64


In [275]:
# Second, determine the minimum values in input_train
min_vals_in_input_train = input_train.min()

print(min_vals_in_input_train)

petal_width    0.1
sepal_width    2.0
dtype: float64


In [276]:
input_train.describe()

Unnamed: 0,petal_width,sepal_width
count,112.0,112.0
mean,1.192857,3.061607
std,0.771533,0.44077
min,0.1,2.0
25%,0.3,2.8
50%,1.3,3.0
75%,1.825,3.3
max,2.5,4.4


In [277]:
# Now normalize input_train with the maximum and minimum values from input_train
input_train = (
    (input_train - min_vals_in_input_train)
    /
    (max_vals_in_input_train - min_vals_in_input_train)
)
input_train.describe()

Unnamed: 0,petal_width,sepal_width
count,112.0,112.0
mean,0.455357,0.442336
std,0.321472,0.183654
min,0.0,0.0
25%,0.083333,0.333333
50%,0.5,0.416667
75%,0.71875,0.541667
max,1.0,1.0


In [278]:
# Now normalize input_test with the maximum and minimum values from input_train
input_test = (input_test - min_vals_in_input_train) / (max_vals_in_input_train - min_vals_in_input_train)
input_test.describe()

Unnamed: 0,petal_width,sepal_width
count,38.0,38.0
mean,0.464912,0.429825
std,0.311583,0.173591
min,0.041667,0.125
25%,0.083333,0.34375
50%,0.541667,0.416667
75%,0.708333,0.541667
max,0.958333,0.833333


In general, the test set may contain values more extreme than anything in the training set, so after scaling we may see values lower than 0 or higher than 1. (That doesn't happen here, where the scaled test values range from about 0.04 to 0.96, but it easily could with a different split.) This is the expected behavior. The goal of normalization is not to force every single data point into the $[0, 1]$ range. The goal is to apply a consistent scaling transformation based on the knowledge gained from the training data alone. Values outside the $[0, 1]$ range in the test set are a normal and informative result of doing this correctly.

The test set is supposed to be a completely unseen, pristine dataset that simulates how your model will perform on new, real-world data. If your model's training process has been influenced by any information from the test set—even something as simple as its minimum or maximum value—then the test set is no longer truly "unseen."
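To illustrate, min-max normalization rescales each value as $x' = (x - x_{min}) / (x_{max} - x_{min})$, where $x_{min}$ and $x_{max}$ come from the training data only. The sketch below uses two tiny made-up DataFrames (`toy_train` and `toy_test` are hypothetical names, and the numbers are invented) to show how test values outside the training range land outside $[0, 1]$.

```python
import pandas as pd

# Made-up training and test values
toy_train = pd.DataFrame({"petal_width": [0.1, 1.3, 2.5],
                          "sepal_width": [2.0, 3.0, 4.4]})
toy_test = pd.DataFrame({"petal_width": [0.05, 2.6],
                         "sepal_width": [2.2, 4.6]})

# The minimum and maximum come from the training data only
train_min = toy_train.min()
train_max = toy_train.max()

# Apply the same transformation to both datasets
train_scaled = (toy_train - train_min) / (train_max - train_min)
test_scaled = (toy_test - train_min) / (train_max - train_min)

print(train_scaled)  # every column spans exactly 0 to 1
print(test_scaled)   # some values fall below 0 or above 1
```

Here the test petal width of 0.05 is smaller than the training minimum of 0.1, so it scales to a negative number, and 2.6 is larger than the training maximum of 2.5, so it scales to a value above 1.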

#### DataLoader

Now, let's put our training data into a **DataLoader**, which we can use to train the neural network. The DataLoader is a PyTorch utility that takes our final, fully prepared data and makes it easy to iterate over. It handles:

- Batching: giving the model, say, 64 samples at a time instead of the whole dataset.
- Shuffling: randomizing the order of the data each epoch to prevent the model from learning the sequence.
- Parallelism: using multiple CPU cores to load data in the background so the GPU doesn't have to wait.

DataLoaders are especially useful for large datasets, and they also make it easy to use a relatively small fraction of the data if we want to do a quick and dirty training run for debugging our code.

To put our training data into a DataLoader, we'll start by converting `input_train` into **tensors** with `torch.tensor()`. We'll then combine `input_train_tensors` with `one_hot_label_train` to create a **TensorDataset**.

In [279]:
# Convert the DataFrame input_train into tensors
input_train_tensors = torch.tensor(input_train.values).type(torch.float32)

# now print out the first 5 rows to make sure they are what we expect.
input_train_tensors[:5]

tensor([[0.7500, 0.3333],
        [0.7917, 0.3333],
        [0.3750, 0.1667],
        [0.5000, 0.3333],
        [0.5000, 0.2083]])

Because we'll also need to run `input_test` through the neural network, we'll need to convert it to tensors as well, and we might as well do it now.

In [280]:
## Convert the DataFrame input_test into tensors
input_test_tensors = torch.tensor(input_test.values).type(torch.float32)

## now print out the first 5 rows to make sure they are what we expect.
input_test_tensors[:5]

tensor([[0.0417, 0.5000],
        [0.6250, 0.5417],
        [0.5000, 0.3333],
        [0.5000, 0.1250],
        [0.0417, 0.4167]])

Now that we have tensors for `input_train`, named `input_train_tensors`, and we have the one-hot encoded `class` values stored in a tensor called `one_hot_label_train`, we can combine them into a **TensorDataset**, which is, in turn, wrapped in a **DataLoader**.

In [281]:
train_dataset = TensorDataset(input_train_tensors, one_hot_label_train)
train_dataloader = DataLoader(train_dataset)
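Note that `DataLoader(train_dataset)` uses the default settings, `batch_size=1` and `shuffle=False`, so each batch holds a single sample and the order never changes. The sketch below uses a made-up dataset of 10 random samples (all `toy_` names are hypothetical) to show how the batching and shuffling described above could be turned on; the batch size of 4 is an arbitrary choice for illustration.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)

# A made-up dataset: 10 samples, 2 inputs each, with one-hot labels
toy_inputs = torch.rand(10, 2)
toy_labels = torch.eye(3)[torch.randint(0, 3, (10,))]
toy_dataset = TensorDataset(toy_inputs, toy_labels)

# Default settings: one sample per batch, fixed order
default_loader = DataLoader(toy_dataset)
print(len(default_loader))  # number of batches per epoch

# Batches of 4, reshuffled at the start of every epoch
batched_loader = DataLoader(toy_dataset, batch_size=4, shuffle=True)
batch_sizes = [inputs.shape[0] for inputs, labels in batched_loader]
print(batch_sizes)  # the last batch holds the leftover samples
```

With 10 samples and `batch_size=4`, the last batch contains only the 2 leftover samples.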

### Refactoring

Now that we saw how to execute all these steps, let's write a more robust and reusable pipeline, using scaler and label encoder tools from `sklearn`.

In [290]:
# --- Import the scaler and label encoder from sklearn ---
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder

# --- Load and Define X and y ---
path = "./iris.txt"
# Assign column names since the file has no header
column_names = [
    'sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']

df = pd.read_csv(
    path, sep=",", names=column_names)

# --- Define X and y ---
input_values = df[['petal_width', 'sepal_width']]
label_values = df['class']

# --- SPLIT THE DATA FIRST ---
# (This is best practice)
input_train, input_test, label_train_str, label_test_str = train_test_split(
    input_values,
    label_values,           # split the original label strings
    test_size=0.25,
    stratify=label_values,  # stratify on the label strings
    random_state=42
)

# --- SKLEARN PREPROCESSING ---
# Initialize and fit the LabelEncoder
le = LabelEncoder()
le.fit(label_train_str)  # Fit *only* on the training labels

# Transform *both* sets
label_train_num = le.transform(label_train_str)
label_test_num = le.transform(label_test_str)

# (We can also use le.fit_transform(label_train_str) for the first one)

# Initialize and fit the MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(input_train)
input_train_scaled = scaler.transform(input_train)
input_test_scaled = scaler.transform(input_test)

# --- PYTORCH CONVERSION ---
# Convert scaled data to tensors
input_train_tensors = torch.tensor(input_train_scaled, dtype=torch.float32)
input_test_tensors = torch.tensor(input_test_scaled, dtype=torch.float32)

# Convert encoded labels to tensors
label_train_tensors = torch.tensor(label_train_num)
label_test_tensors = torch.tensor(label_test_num)

# One-hot encode the numeric labels
# The model's output layer will produce float32 predictions. 
# To calculate the loss, PyTorch requires the predictions 
# and the labels to be the same data type.
one_hot_label_train = F.one_hot(label_train_tensors).type(torch.float32)
one_hot_label_test = F.one_hot(label_test_tensors).type(torch.float32)

# Create TensorDataset and DataLoader as before
train_dataset = TensorDataset(input_train_tensors, one_hot_label_train)
train_dataloader = DataLoader(train_dataset)
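As a sanity check on the refactored pipeline, the sketch below uses made-up arrays (`toy_train` and `toy_test` are hypothetical) to confirm that `MinMaxScaler` applies the same formula we coded by hand earlier. It also shows that `LabelEncoder` assigns codes in sorted order, unlike `factorize()`, which uses the order of first appearance.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Made-up training and test inputs
toy_train = np.array([[0.1, 2.0], [1.3, 3.0], [2.5, 4.4]])
toy_test = np.array([[0.2, 3.5], [1.8, 2.5]])

# Scale the test data with a scaler fit on the training data...
scaler = MinMaxScaler().fit(toy_train)
scaled_by_sklearn = scaler.transform(toy_test)

# ...and with the manual min-max formula
train_min = toy_train.min(axis=0)
train_max = toy_train.max(axis=0)
scaled_by_hand = (toy_test - train_min) / (train_max - train_min)
print(np.allclose(scaled_by_sklearn, scaled_by_hand))

# LabelEncoder sorts the class names before assigning codes
le = LabelEncoder().fit(["Iris-virginica", "Iris-setosa", "Iris-versicolor"])
print(list(le.classes_))
print(le.inverse_transform([0, 1, 2]))  # map codes back to names
```

For the Iris data the two encodings agree, because the alphabetical order of the species names matches the order in which they appear in `iris.txt`.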

## Build a neural network

Building a neural network with PyTorch means creating a new class. And to make it easy to train the neural network, this class will inherit from `LightningModule`.

Our new class will have the following methods:

- `__init__()` to initialize the Weights and Biases and keep track of a few other housekeeping things.
- `forward()` to make a forward pass through the neural network.
- `configure_optimizers()` to configure the optimizer. There are lots of optimizers to choose from, but in this tutorial, we'll change things up and use `Adam`.
- `training_step()` to pass the training data to `forward()`, calculate the loss and keep track of the loss values in a log file.

Also, for reference, here is a picture of the neural network we want to create:

<img src="./images/final_nn.png" alt="a neural network with multiple inputs and outputs" style="width: 1000px;">

As we can see, our neural network has **2 inputs**, one for `Petal Width` and one for `Sepal Width`, a single _hidden layer_ with two **ReLU** activation functions, and **3 outputs**, one for each species of iris.

So, given this specification for this neural network, let's code it in a new class called `MultipleInsOuts`.

In [240]:
class MultipleInsOuts(L.LightningModule):

    def __init__(self):
        super().__init__()

        # Set the seed for the random number generator.
        L.seed_everything(seed=42)
        
        # When an nn.Linear() layer is created, PyTorch automatically
        # initializes its weight and bias tensors with random values
        # drawn from a specific distribution (see below).

        ## We don't have to specify each and every Weight and Bias value!

        ############################################################################
        ##
        ## Here is where we initialize the Weights and Biases for the neural network
        ##
        ############################################################################

        # If we look at the drawing of the network we want to build (above),
        # we see that we have 2 inputs that lead to 2 activation functions.
        # We create these connections and **initialize their Weights and Biases**
        # with the nn.Linear() function by setting in_features=2 and out_features=2
        self.input_to_hidden = nn.Linear(in_features=2, out_features=2, bias=True)

        # Next, we see that the 2 activation functions are connected to 3 outputs.
        # We create these connections and initialize their Weights and Biases
        # with the nn.Linear() function by setting in_features=2 and out_features=3.
        self.hidden_to_output = nn.Linear(in_features=2, out_features=3, bias=True)

        self.loss = nn.MSELoss(reduction='sum')


    def forward(self, input):
        # First, we run the input values to the activation functions 
        # in the hidden layer
        hidden = self.input_to_hidden(input)

        # Then we run the values through a ReLU activation function 
        # and then run those values to the output
        output_values = self.hidden_to_output(torch.relu(hidden))

        return output_values
    
        # We could also have defined the entire net in __init__() such as 
        # self.net = nn.Sequential(
        #     nn.Linear(in_features=2, out_features=2),  # Linear
        #     nn.ReLU(),                                 # ReLU
        #     nn.Linear(in_features=2, out_features=3),  # Linear
        # )
        # and then in forward() method:
        # return self.net(input)


    def configure_optimizers(self):
        # In this example, configuring the optimizer
        # consists of passing it the weights and biases we want
        # to optimize, which are all in self.parameters(),
        # and setting the learning rate with lr=0.001.

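        # Adam (Adaptive Moment Estimation) smooths out the noisy steps
        # of SGD by keeping exponentially weighted moving averages of
        # the gradients and of their squares, which it uses to adapt
        # the step size for each parameter.
        # https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Adam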

        return Adam(self.parameters(), lr=0.001)


    def training_step(self, batch, batch_idx):
        # The first thing we do is split 'batch' into 
        # the input and label values
        inputs, labels = batch

        # Then we run the input through the neural network
        outputs = self.forward(inputs)

        # Then we calculate the loss.
        loss = self.loss(outputs, labels)

        # Lastly, we could add the loss to a log file so that 
        # we can graph it later. This would help us decide 
        # if we have done enough training. Ideally, if we
        # do enough training, the loss should be small 
        # and not getting any smaller.
        #self.log("loss", loss)

        return loss
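The class above computes the loss with `nn.MSELoss(reduction='sum')`, which adds up the squared differences between the outputs and the one-hot labels. As a minimal sketch with made-up numbers (`toy_output` and `toy_label` are hypothetical), here is how that compares with the default `reduction='mean'`:

```python
import torch
import torch.nn as nn

# A made-up prediction for one sample whose true class is Setosa
toy_output = torch.tensor([[0.7, 0.2, 0.1]])
toy_label = torch.tensor([[1.0, 0.0, 0.0]])

# reduction='sum' adds up the squared errors:
# (0.7 - 1)^2 + (0.2 - 0)^2 + (0.1 - 0)^2 = 0.09 + 0.04 + 0.01 = 0.14
sum_loss = nn.MSELoss(reduction='sum')(toy_output, toy_label)
print(sum_loss)

# reduction='mean' (the default) divides that sum by the
# number of elements, here 3
mean_loss = nn.MSELoss(reduction='mean')(toy_output, toy_label)
print(mean_loss)
```

The two reductions differ only by a constant factor for a fixed batch shape, so the choice mostly affects the effective size of each gradient step.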

In [291]:
model = MultipleInsOuts()  # First, make model from the class

Seed set to 42


In [292]:
# Now print out the name and value for each named parameter
# in the model, like Weights and Biases, that we can train.
for name, param in model.named_parameters():
    print(name, '\n', param.data)

input_to_hidden.weight 
 tensor([[ 0.5406,  0.5869],
        [-0.1657,  0.6496]])
input_to_hidden.bias 
 tensor([-0.1549,  0.1427])
hidden_to_output.weight 
 tensor([[-0.3443,  0.4153],
        [ 0.6233, -0.5188],
        [ 0.6146,  0.1323]])
hidden_to_output.bias 
 tensor([0.5224, 0.0958, 0.3410])


In [283]:
# Or we can print the parameters directly as a dictionary
model.state_dict()

OrderedDict([('input_to_hidden.weight',
              tensor([[ 0.5406,  0.5869],
                      [-0.1657,  0.6496]])),
             ('input_to_hidden.bias', tensor([-0.1549,  0.1427])),
             ('hidden_to_output.weight',
              tensor([[-0.3443,  0.4153],
                      [ 0.6233, -0.5188],
                      [ 0.6146,  0.1323]])),
             ('hidden_to_output.bias', tensor([0.5224, 0.0958, 0.3410]))])

The `nn.Linear` weight matrix is structured as `[out_features, in_features]`. Therefore, the first row of the `input_to_hidden.weight` tensor contains the weights leading to H₀, and the second row contains the weights leading to H₁. This corresponds to the following state of the neural network:

```mermaid
%%{init:{'theme': 'neutral'}}%%
graph LR
    %% Style Definitions
    classDef biasStyle fill:none,stroke:none,color:#7f8c8d;

    %% Input Layer (2 nodes)
    subgraph Input Layer [Inputs]
        direction TB
        I0(("I₀"))
        I1(("I₁"))
    end

    %% Hidden Layer (2 nodes)
    subgraph Hidden Layer [Hidden]
        direction TB
        H0(("H₀"))
        H1(("H₁"))
    end

    %% Output Layer (3 nodes)
    subgraph Output Layer [Outputs]
        direction TB
        O0(("O₀"))
        O1(("O₁"))
        O2(("O₂"))
    end

    %% Bias Nodes for Hidden Layer
    BH0["Bias: -0.15"]:::biasStyle
    BH1["Bias: 0.14"]:::biasStyle

    %% Bias Nodes for Output Layer
    BO0["Bias: 0.52"]:::biasStyle
    BO1["Bias: 0.10"]:::biasStyle
    BO2["Bias: 0.34"]:::biasStyle

    %% Connections from Input to Hidden Layer Weights
    I0 -->|"0.54"| H0
    I1 -->|"0.59"| H0
    
    I0 -->|"-0.17"| H1
    I1 -->|"0.65"| H1

    %% Connections from Hidden Layer Biases
    BH0 -.-> H0
    BH1 -.-> H1

    %% Connections from Hidden to Output Layer Weights
    H0 -->|"-0.34"| O0
    H1 -->|"0.42"| O0
    
    H0 -->|"0.62"| O1
    H1 -->|"-0.52"| O1
    
    H0 -->|"0.61"| O2
    H1 -->|"0.13"| O2

    %% Connections from Output Layer Biases
    BO0 -.-> O0
    BO1 -.-> O1
    BO2 -.-> O2
```
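To double-check the `[out_features, in_features]` layout, the sketch below creates two standalone layers with the same sizes as the ones in our model and prints their shapes and parameter counts. The layers here are freshly initialized stand-ins, so their values differ from the ones shown above.

```python
import torch.nn as nn

# Stand-alone layers with the same sizes as the ones in the model
input_to_hidden = nn.Linear(in_features=2, out_features=2)
hidden_to_output = nn.Linear(in_features=2, out_features=3)

for name, layer in [("input_to_hidden", input_to_hidden),
                    ("hidden_to_output", hidden_to_output)]:
    # Count every trainable value (weights plus biases) in the layer
    n_params = sum(p.numel() for p in layer.parameters())
    print(name,
          "weight:", tuple(layer.weight.shape),
          "bias:", tuple(layer.bias.shape),
          "params:", n_params)
```

The first layer has 2 × 2 weights plus 2 biases, for 6 parameters, and the second has 3 × 2 weights plus 3 biases, for 9 parameters, which gives 15 trainable parameters in total.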

In [133]:
print("Model:", model)

Model: MultipleInsOuts(
  (input_to_hidden): Linear(in_features=2, out_features=2, bias=True)
  (hidden_to_output): Linear(in_features=2, out_features=3, bias=True)
  (loss): MSELoss()
)


In [293]:
# Run a set of values through the current (not optimized)
# neural network with `forward`. Note that the model will learn to
# expect inputs in the [0, 1] range, so the values in the input
# tensor represent the scaled values for Petal and Sepal Widths.
output_values = model(torch.tensor([0.5, .35]))

print("Output from the initialized NN:")
print(torch.round(output_values))

Output from the initialized NN:
tensor([1., 0., 1.], grad_fn=<RoundBackward0>)
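To see where these numbers come from, we can reproduce the forward pass by hand. The sketch below copies the initial weights and biases printed earlier (rounded to 4 decimal places, so the results are approximate) and applies the same steps as `forward()`: a linear layer, a ReLU, and a second linear layer.

```python
import torch

# Initial Weights and Biases, copied from the printed parameters
w_hidden = torch.tensor([[0.5406, 0.5869],
                         [-0.1657, 0.6496]])
b_hidden = torch.tensor([-0.1549, 0.1427])
w_output = torch.tensor([[-0.3443, 0.4153],
                         [0.6233, -0.5188],
                         [0.6146, 0.1323]])
b_output = torch.tensor([0.5224, 0.0958, 0.3410])

# The same scaled Petal and Sepal Widths used above
x = torch.tensor([0.5, 0.35])

# Input -> hidden layer, followed by the ReLU activation functions
hidden = torch.relu(w_hidden @ x + b_hidden)

# Hidden layer -> outputs
output = w_output @ hidden + b_output

print(output)               # roughly 0.53, 0.15, 0.58
print(torch.round(output))  # tensor([1., 0., 1.])
```

The first and third outputs are only slightly above 0.5, which is why they round to 1. An untrained network is not expected to produce a meaningful prediction.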


## Train the Neural Network

Now that we've created a class for our neural network, let's train it.

Training our new neural network means we create a **Lightning Trainer**, `L.Trainer`, and use it to optimize the parameters.

We will start with 10 epochs, where each epoch is one complete run through our training data. This may be enough to successfully optimize all of the parameters, but it might not. We'll find out later in the tutorial when we make a graph of how the loss values change during training.

In [294]:
trainer = L.Trainer(max_epochs=10)
trainer.fit(model, train_dataloaders=train_dataloader)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name             | Type    | Params | Mode 
-----------------------------------------------------
0 | input_to_hidden  | Linear  | 6      | train
1 | hidden_to_output | Linear  | 9      | train
2 | loss             | MSELoss | 0      | train
-----------------------------------------------------
15        Trainable params
0         Non-trainable params
15        Total params
0.000     Total estimated model params size (MB)
3         Modules in train mode
0         Modules in eval mode

`Trainer.fit` stopped: `max_epochs=10` reached.


In [295]:
model.state_dict()

OrderedDict([('input_to_hidden.weight',
              tensor([[ 0.9860,  0.3223],
                      [-0.3675,  0.7899]])),
             ('input_to_hidden.bias', tensor([-0.1776,  0.0590])),
             ('hidden_to_output.weight',
              tensor([[-0.7209,  0.7427],
                      [ 0.4759, -0.4373],
                      [ 0.7126, -0.2330]])),
             ('hidden_to_output.bias', tensor([0.4505, 0.2084, 0.1178]))])

We've trained the model for 10 epochs! Now, let's see if the predictions are any good. We can do this by seeing how well it predicts the testing data. We'll start by running `input_test_tensors` through the neural network and saving the output in `predictions`.

In [296]:
# Run the input_test_tensors through the neural network
predictions = model(input_test_tensors)

Now, because our neural network has three outputs, one for each Iris species, we should get 3 values for each row in `input_test_tensors`. We can verify that by looking at the first few rows of `predictions`.

In [297]:
predictions[:5,]

tensor([[0.7585, 0.0282, 0.0331],
        [0.1994, 0.3877, 0.4948],
        [0.2486, 0.3490, 0.3868],
        [0.1941, 0.3776, 0.3712],
        [0.7274, 0.0453, 0.0309]], grad_fn=<SliceBackward0>)

We can determine which species was predicted in `predictions` by selecting, in each row, the index that corresponds to the largest value. We do that with `torch.argmax()`, which returns a tensor containing the index of the largest value in each row.

In [298]:
# Select the output with highest value
predicted_labels = torch.argmax(predictions, dim=1) # dim=1 finds the index of the largest value within each row
predicted_labels[0:5]

tensor([0, 2, 2, 1, 0])

In the first and last rows, index 0 had the largest value, so these predictions correspond to *Setosa*. The second and third rows predicted index 2, which corresponds to *Virginica*. The fourth row predicted index 1, which corresponds to *Versicolor*.
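To make the role of `dim` concrete, here is a minimal sketch that reuses the five rows of `predictions` printed above. The `species_names` list is an assumption added for illustration: it relies on the index order (0 = Setosa, 1 = Versicolor, 2 = Virginica) described in the text.

```python
import torch

# The first five rows of `predictions`, copied from the output above
predictions = torch.tensor([
    [0.7585, 0.0282, 0.0331],
    [0.1994, 0.3877, 0.4948],
    [0.2486, 0.3490, 0.3868],
    [0.1941, 0.3776, 0.3712],
    [0.7274, 0.0453, 0.0309],
])

# dim=1 compares the values within each row, giving one index per row
predicted_labels = torch.argmax(predictions, dim=1)
print(predicted_labels)  # tensor([0, 2, 2, 1, 0])

# dim=0 would instead compare the values within each column
# (one index per column), which is not what we want here
per_column = torch.argmax(predictions, dim=0)

# Hypothetical lookup list, assuming the index order described in the text
species_names = ["Setosa", "Versicolor", "Virginica"]
predicted_species = [species_names[i] for i in predicted_labels.tolist()]
print(predicted_species)
```

Keeping the lookup list next to the prediction code makes it harder to mix up which index belongs to which species.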

Now, let's compare what the neural network predicted in `predicted_labels` to the known values in `label_test` and calculate the percentage of correct predictions. We do this by adding up the number of times an element in `predicted_labels` equals the corresponding element in `label_test` and dividing by the number of elements in `predicted_labels`.

In [299]:
# Now compare predicted_labels with label_test to calculate accuracy.
# `torch.eq()` computes element-wise equality between two tensors.
# label_test, however, is just an array, so we convert it to a tensor
# before passing it in. `torch.sum()` then adds up all of the "True"
# output values to get the number of correct predictions.
# We then divide the number of correct predictions by the number of predicted values,
# obtained with len(predicted_labels), to get the fraction of correct predictions.

torch.sum(torch.eq(torch.tensor(label_test), predicted_labels)) / len(predicted_labels)

tensor(0.7368)

And we see that our neural network only correctly predicts 73.7% of the testing data. This isn't very good. So, will training our model for more epochs improve the model's predictions?

One way to answer that question is to just train for longer and see what happens.

The good news is that because we're using **Lightning**, we can pick up where we left off training without starting over from scratch. This is because training with **Lightning** creates _checkpoint_ files that keep track of the Weights and Biases as they change. As a result, all we have to do to pick up where we left off is tell the `Trainer` where the checkpoint files are. This is awesome and will save us a lot of time since we don't have to retrain the first 10 epochs. So, let's add an additional 90 epochs to the training.

To add additional epochs to the training, we first identify where the checkpoint file is with the following command.

In [300]:
path_to_checkpoint = trainer.checkpoint_callback.best_model_path  # By default, "best" = "most recent"

Passing the checkpoint path explicitly makes resuming training a deliberate, manual step. Keep in mind that the checkpoint files stay on disk after the Jupyter kernel is restarted (by default under `./lightning_logs/`, with one `version_N` folder per run), so the model we trained last time can still be picked up later. It's therefore a good idea to remove old checkpoint folders when we don't need them anymore.
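To make the idea of a checkpoint concrete, the sketch below saves and restores a model's parameters with plain PyTorch. It is a simplification under stated assumptions: the `nn.Sequential` model is a stand-in for the tutorial's network, the dictionary keys are chosen for illustration, and an in-memory buffer replaces the file on disk. Lightning's real checkpoint files hold more than this, such as optimizer state and step counters.

```python
import io

import torch
import torch.nn as nn

# Stand-in for the tutorial's model: 2 inputs -> 2 hidden nodes (ReLU) -> 3 outputs
model = nn.Sequential(nn.Linear(2, 2), nn.ReLU(), nn.Linear(2, 3))

# A checkpoint is essentially a dictionary of training state.
# The keys here are a simplified illustration.
checkpoint = {"epoch": 9, "state_dict": model.state_dict()}

# Save to an in-memory buffer (a stand-in for a .ckpt file on disk)
buffer = io.BytesIO()
torch.save(checkpoint, buffer)

# Later: load the checkpoint and copy the parameters into a fresh model
buffer.seek(0)
restored = torch.load(buffer)
fresh_model = nn.Sequential(nn.Linear(2, 2), nn.ReLU(), nn.Linear(2, 3))
fresh_model.load_state_dict(restored["state_dict"])

print("Resuming after epoch index", restored["epoch"])
```

Because the restored parameters are identical to the saved ones, training can continue from where it stopped instead of starting over with random values.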

Let's create a new Lightning Trainer, just like before, but we set the number of epochs to 100. Given that we already trained for 10 epochs, this means we'll do 90 more. We can observe that the training process actually starts at epoch number 11.

In [301]:
# First, create a new Lightning Trainer
trainer = L.Trainer(max_epochs=100)  # Before, max_epochs=10, so, by setting it to 100, we're adding 90 more.

# Then call trainer.fit() using the path to the most recent checkpoint files
# so that we can pick up where we left off.
trainer.fit(model, train_dataloaders=train_dataloader, ckpt_path=path_to_checkpoint)

💡 Tip: For seamless cloud uploads and versioning, try installing [litmodels](https://pypi.org/project/litmodels/) to enable LitModelCheckpoint, which syncs automatically with the Lightning model registry.
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
Restoring states from the checkpoint path at c:\Users\Sébastien\Documents\data_science\machine_learning\statsquest_neural_networks\chapter_03\lightning_logs\version_0\checkpoints\epoch=9-step=1120.ckpt
c:\Users\Sébastien\Documents\data_science\machine_learning\statsquest_neural_networks\.env\Lib\site-packages\lightning\pytorch\callbacks\model_checkpoint.py:445: The dirpath has changed from 'c:\\Users\\Sébastien\\Documents\\data_science\\machine_learning\\statsquest_neural_networks\\chapter_03\\lightning_logs\\version_0\\checkpoints' to 'c:\\Users\\Sébastien\\Documents\\data_science\\machine_learning\\statsquest_neural_networks\\chapter_03\\lightning_logs\\version_1\\checkpoin

Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=100` reached.


Now, let's run the testing data through the network and calculate the accuracy. We'll do this just like we did before.

In [304]:
# Run the input_test_tensors through the neural network
predictions = model(input_test_tensors)

# Select the output with highest value...
predicted_labels = torch.argmax(predictions, dim=1) # dim=1 finds the index of the largest value within each row

# Now compare predicted_labels with label_test to calculate accuracy
torch.sum(torch.eq(torch.tensor(label_test), predicted_labels)) / len(predicted_labels)

tensor(0.9211)

After 100 training epochs, we correctly classified 92.1% of the testing data. This means adding more training was helpful.
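Since we computed the accuracy twice with the same expression, it can be convenient to wrap it in a small helper. The sketch below is one possible way to do that; the function name `accuracy` and the example labels are made up for illustration and are not the tutorial's test data.

```python
import torch

def accuracy(predicted_labels, true_labels):
    """Return the fraction of predictions that match the known labels."""
    true_labels = torch.as_tensor(true_labels)
    # Comparing gives a tensor of True/False values; the mean of those
    # values (as floats) is the fraction of correct predictions
    return (predicted_labels == true_labels).float().mean().item()

# Made-up labels for illustration only
example_predicted = torch.tensor([0, 2, 2, 1, 0])
example_true = [0, 2, 1, 1, 0]

example_accuracy = accuracy(example_predicted, example_true)
print(f"{example_accuracy:.1%}")  # 80.0%
```

Taking the mean of the matches is equivalent to summing them and dividing by the number of predictions, as done in the cells above.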

In [226]:
model.state_dict()

OrderedDict([('input_to_hidden.weight',
              tensor([[ 1.4299,  0.0198],
                      [-1.1063,  0.8932]])),
             ('input_to_hidden.bias', tensor([-0.4824,  0.1904])),
             ('hidden_to_output.weight',
              tensor([[-0.2433,  1.2659],
                      [-0.7735, -1.1830],
                      [ 1.4803,  0.2863]])),
             ('hidden_to_output.bias', tensor([ 0.1418,  0.8101, -0.1898]))])

These weights and biases, rounded to two decimal places, correspond to the following neural network configuration.

```mermaid
%%{init:{'theme': 'neutral'}}%%
graph LR
    %% Style Definitions
    classDef biasStyle fill:none,stroke:none,color:#7f8c8d;

    %% Input Layer (2 nodes)
    subgraph Input Layer [Inputs]
        direction TB
        I0(("I₀"))
        I1(("I₁"))
    end

    %% Hidden Layer (2 nodes)
    subgraph Hidden Layer ["Hidden (ReLU)"]
        direction TB
        H0(("H₀"))
        H1(("H₁"))
    end

    %% Output Layer (3 nodes)
    subgraph Output Layer [Outputs]
        direction TB
        O0(("O₀"))
        O1(("O₁"))
        O2(("O₂"))
    end

    %% Bias Nodes for Hidden Layer
    BH0["Bias: -0.48"]:::biasStyle
    BH1["Bias: 0.19"]:::biasStyle

    %% Bias Nodes for Output Layer
    BO0["Bias: 0.14"]:::biasStyle
    BO1["Bias: 0.81"]:::biasStyle
    BO2["Bias: -0.19"]:::biasStyle

    %% Connections from Input to Hidden Layer Weights
    I0 -->|"1.43"| H0
    I1 -->|"0.02"| H0
    
    I0 -->|"-1.11"| H1
    I1 -->|"0.89"| H1

    %% Connections from Hidden Layer Biases
    BH0 -.-> H0
    BH1 -.-> H1

    %% Connections from Hidden to Output Layer Weights
    H0 -->|"-0.24"| O0
    H1 -->|"1.27"| O0
    
    H0 -->|"-0.77"| O1
    H1 -->|"-1.18"| O1
    
    H0 -->|"1.48"| O2
    H1 -->|"0.29"| O2

    %% Connections from Output Layer Biases
    BO0 -.-> O0
    BO1 -.-> O1
    BO2 -.-> O2
```

If we now pass the trained model an input tensor with the scaled Petal and Sepal Width values used in the book, the second output (index 1) has the largest value, so the model predicts *Versicolor*, as expected.

In [305]:
model(torch.tensor([0.5, .37]))

tensor([0.0834, 0.6245, 0.1653], grad_fn=<ViewBackward0>)

In [229]:
# Manual calculation of the second output (index 1, Versicolor),
# using the rounded weights and biases from the diagram
(
    (  # H0
        max(0, 
            (.5 * 1.43)
            +
            (.37 * .02)
            +
            -.48
        ) * -.77
    )
    +
    (  # H1
        max(0, 
            (.5 * -1.11)
            +
            (.37 * .89)
            +
            .19
        ) * -1.18
    )
    +
    .81
)

0.6233520000000001
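The same arithmetic extends to all three outputs. The sketch below repeats the calculation in plain Python, using the four-decimal weights and biases printed by `model.state_dict()` above. The helper name `forward` is ours, and the structure (linear layer, ReLU, linear layer, no final activation) is assumed from the diagram and the manual calculation.

```python
# Weights and biases copied from the model.state_dict() output above
input_to_hidden_weight = [[1.4299, 0.0198], [-1.1063, 0.8932]]
input_to_hidden_bias = [-0.4824, 0.1904]
hidden_to_output_weight = [[-0.2433, 1.2659], [-0.7735, -1.1830], [1.4803, 0.2863]]
hidden_to_output_bias = [0.1418, 0.8101, -0.1898]

def forward(inputs):
    """Plain-Python forward pass: linear layer, ReLU, linear layer."""
    # Each hidden node: weighted sum of the inputs plus a bias, then ReLU
    hidden = [
        max(0.0, sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(input_to_hidden_weight, input_to_hidden_bias)
    ]
    # Each output node: weighted sum of the hidden values plus a bias
    return [
        sum(w * h for w, h in zip(weights, hidden)) + bias
        for weights, bias in zip(hidden_to_output_weight, hidden_to_output_bias)
    ]

outputs = forward([0.5, 0.37])
print([round(value, 3) for value in outputs])

# Index of the largest output (0 = Setosa, 1 = Versicolor, 2 = Virginica)
predicted_index = outputs.index(max(outputs))
print(predicted_index)
```

The results should be close to the model's own output; small differences come from using weights rounded to four decimal places.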

## Make predictions

Now that our model is trained, we can use it to make predictions from new data. This is done by passing the model a tensor that contains normalized petal and sepal widths.

In [313]:
df[['petal_width', 'sepal_width', 'class']].sample(n=5, random_state=42)

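|     | petal_width | sepal_width | class           |
|-----|-------------|-------------|-----------------|
| 73  | 1.2         | 2.8         | Iris-versicolor |
| 18  | 0.3         | 3.8         | Iris-setosa     |
| 118 | 2.3         | 2.6         | Iris-virginica  |
| 78  | 1.5         | 2.9         | Iris-versicolor |
| 76  | 1.4         | 2.8         | Iris-versicolor |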


For example, if the raw Petal and Sepal width measurements were 0.3 and 3.8, like the values from the *Setosa* individual in row 18, we would first normalize them using the maximum and minimum values we calculated with the training data.

We have to think of the `MinMaxScaler` as **part of the model itself**, as it has learned the properties of the training data. Therefore, for consistency, we must treat the new, unseen data exactly the same way we treated the training data.

In [321]:
df.loc[18, ['petal_width', 'sepal_width']].values

array([np.float64(0.3), np.float64(3.8)], dtype=object)

In [None]:
#petal_sepal_widths = df.loc[18, ['petal_width', 'sepal_width']].values.reshape(1, -1)
#normalized_values = (petal_sepal_widths - min_vals_in_input_train) / (max_vals_in_input_train - min_vals_in_input_train)
petal_sepal_widths = np.array([.3, 3.8]).reshape(1, -1)
print("Petal and sepal widths = ", petal_sepal_widths)

normalized_petal_sepal_widths = scaler.transform(petal_sepal_widths)
print("Normalized petal and sepal widths = ", normalized_petal_sepal_widths)

Petal and sepal widths =  [[0.3 3.8]]
Normalized petal and sepal widths =  [[0.08333333 0.75      ]]
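To see what `scaler.transform()` does with these two values, here is a minimal sketch of min-max scaling written by hand. The minimum and maximum values below are assumptions: they are not printed anywhere in this section, but they are consistent with the normalized values shown above. The helper name `min_max_scale` is also ours.

```python
def min_max_scale(value, train_min, train_max):
    """Rescale a value so the training minimum maps to 0 and the maximum to 1."""
    return (value - train_min) / (train_max - train_min)

# Assumed training minima and maxima (not shown in this section);
# chosen because they reproduce the normalized values printed above
petal_width_min, petal_width_max = 0.1, 2.5
sepal_width_min, sepal_width_max = 2.0, 4.4

normalized_petal_width = min_max_scale(0.3, petal_width_min, petal_width_max)
normalized_sepal_width = min_max_scale(3.8, sepal_width_min, sepal_width_max)

print(round(normalized_petal_width, 4), round(normalized_sepal_width, 4))  # 0.0833 0.75
```

Note that the minimum and maximum always come from the training data, even when the value being scaled is new.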




Then we convert `normalized_petal_sepal_widths` into a tensor and pass it to the model to see what it predicts.

In [None]:
model(torch.tensor(normalized_petal_sepal_widths, dtype=torch.float32))

tensor([[ 1.1142, -0.0986,  0.0302]], grad_fn=<AddmmBackward0>)

The first output has the largest value, meaning that the neural network predicts that the measurements come from *Setosa*, as expected.
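When we only want predictions, we don't need PyTorch to keep track of gradients (the `grad_fn` in the output above). The sketch below shows one way to wrap prediction in a helper that uses `torch.no_grad()` and returns a species name. It rests on several assumptions: `stand_in_model` is rebuilt from the four-decimal `state_dict` values printed earlier rather than being the tutorial's trained model, and `predict_species` and `species_names` are hypothetical names.

```python
import torch
import torch.nn as nn

# Stand-in for the trained model, rebuilt from the printed state_dict values
stand_in_model = nn.Sequential(nn.Linear(2, 2), nn.ReLU(), nn.Linear(2, 3))
with torch.no_grad():
    stand_in_model[0].weight.copy_(torch.tensor([[1.4299, 0.0198], [-1.1063, 0.8932]]))
    stand_in_model[0].bias.copy_(torch.tensor([-0.4824, 0.1904]))
    stand_in_model[2].weight.copy_(
        torch.tensor([[-0.2433, 1.2659], [-0.7735, -1.1830], [1.4803, 0.2863]])
    )
    stand_in_model[2].bias.copy_(torch.tensor([0.1418, 0.8101, -0.1898]))

# Hypothetical lookup list, assuming the index order described in the text
species_names = ["Setosa", "Versicolor", "Virginica"]

def predict_species(model, normalized_widths):
    """Return the species name for one set of normalized petal and sepal widths."""
    inputs = torch.tensor(normalized_widths, dtype=torch.float32)
    with torch.no_grad():  # no gradient tracking needed for predictions
        outputs = model(inputs)
    # The index of the largest output selects the species
    return species_names[torch.argmax(outputs, dim=-1).item()]

predicted = predict_species(stand_in_model, [[0.08333333, 0.75]])
print(predicted)  # Setosa
```

With the tutorial's own objects, the same helper could be called with `model` and the values in `normalized_petal_sepal_widths`.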