
## Exercise: Banknote Authentication

In this exercise, we apply logistic regression to a banknote authentication dataset to distinguish between genuine and forged banknotes.


**The dataset consists of 1372 examples and 4 features for binary classification.** The features are 

1. variance of a wavelet-transformed image (continuous) 
2. skewness of a wavelet-transformed image (continuous) 
3. kurtosis of a wavelet-transformed image (continuous) 
4. entropy of the image (continuous) 

(You can find more details about this dataset at [https://archive.ics.uci.edu/ml/datasets/banknote+authentication](https://archive.ics.uci.edu/ml/datasets/banknote+authentication).)


In essence, these four features were manually extracted from image data. Note that you do not need to understand the details of these features for this exercise.

However, you are encouraged to explore the dataset further, e.g., by plotting the features, looking at the value ranges, and so forth. (We will skip these steps for brevity in this exercise.)

Most of the code should look familiar to you since it is based on the logistic regression code from Unit 3.6.

## 1) Setup

In [1]:
import pandas as pd
import numpy as np

import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F

In [2]:
!wget https://github.com/Lightning-AI/dl-fundamentals/raw/main/unit03-pytorch-training/exercises/1_banknotes/data_banknote_authentication.txt

--2023-02-14 12:44:46--  https://github.com/Lightning-AI/dl-fundamentals/raw/main/unit03-pytorch-training/exercises/1_banknotes/data_banknote_authentication.txt
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/Lightning-AI/dl-fundamentals/main/unit03-pytorch-training/exercises/1_banknotes/data_banknote_authentication.txt [following]
--2023-02-14 12:44:46--  https://raw.githubusercontent.com/Lightning-AI/dl-fundamentals/main/unit03-pytorch-training/exercises/1_banknotes/data_banknote_authentication.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 46400 (45K) [text/plain]
Saving to: ‘data_bankno

## 2) Loading the Dataset

We are using the familiar `read_csv` function from pandas to load the dataset:

In [3]:
df = pd.read_csv("data_banknote_authentication.txt", header=None)
df.head()

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 3.6216 | 8.6661 | -2.8073 | -0.44699 | 0 |
| 1 | 4.5459 | 8.1674 | -2.4586 | -1.4621 | 0 |
| 2 | 3.866 | -2.6383 | 1.9242 | 0.10645 | 0 |
| 3 | 3.4566 | 9.5228 | -4.0112 | -3.5944 | 0 |
| 4 | 0.32924 | -4.4552 | 4.5718 | -0.9888 | 0 |
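As a starting point for the value-range exploration encouraged earlier, here is a minimal sketch using pandas' `describe`. It rebuilds only the five preview rows shown above in a hypothetical variable `preview`, so the numbers it prints are not statistics of the full dataset; on the real data, the same call on `df[[0, 1, 2, 3]]` should give the per-feature ranges.

```python
import pandas as pd

# Stand-in data: only the five rows shown by df.head() above,
# not the full 1372-row dataset.
preview = pd.DataFrame(
    [
        [3.6216, 8.6661, -2.8073, -0.44699, 0],
        [4.5459, 8.1674, -2.4586, -1.4621, 0],
        [3.866, -2.6383, 1.9242, 0.10645, 0],
        [3.4566, 9.5228, -4.0112, -3.5944, 0],
        [0.32924, -4.4552, 4.5718, -0.9888, 0],
    ]
)

# Per-feature summary statistics (count, mean, std, min, quartiles, max).
summary = preview[[0, 1, 2, 3]].describe()
print(summary.loc[["min", "max"]])
```

Even in this small preview, the features span noticeably different ranges, which is one motivation for the standardization step later in the exercise.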


In [4]:
X_features = df[[0, 1, 2, 3]].values
y_labels = df[4].values

Number of examples and features:

In [5]:
X_features.shape

(1372, 4)

It is usually a good idea to look at the label distribution:

In [6]:
np.bincount(y_labels)

array([762, 610])
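One way to use these counts is to derive a majority-class baseline, i.e., the accuracy of a classifier that always predicts the more frequent label. The sketch below hard-codes the two counts from the output above; the variable names are illustrative.

```python
# Class counts copied from the np.bincount output above.
class_counts = [762, 610]

total = sum(class_counts)
majority_baseline = max(class_counts) / total

print(f"Total examples: {total}")
print(f"Majority-class baseline accuracy: {majority_baseline * 100:.1f}%")
```

A trained model should clearly exceed this baseline of roughly 55.5% before its accuracy is considered meaningful.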

## 3) Defining a DataLoader

The dataset class that feeds the `DataLoader` is the same code we used in Unit 3.6:

In [7]:
class MyDataset(Dataset):

  def __init__(self, X, y):
    self.features = torch.tensor(X, dtype=torch.float32)
    self.labels = torch.tensor(y, dtype=torch.float32)

  def __getitem__(self, index):
    x = self.features[index]
    y = self.labels[index]        
    return x, y

  def __len__(self):
    return self.labels.shape[0]

We will be using 80% of the data for training and 20% of the data for validation.

In a real project, we would also have a separate dataset for the final test set (in this case, we do not have an explicit test set).

In [8]:
train_size = int(X_features.shape[0]*0.80)
train_size

1097

In [9]:
val_size = X_features.shape[0] - train_size
val_size

275
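The split sizes, and the number of mini-batches they imply, can be checked with plain arithmetic. This sketch assumes the batch size of 10 that the data loaders in this notebook use and relies on the `DataLoader` default of keeping the last, smaller batch.

```python
import math

num_examples = 1372
batch_size = 10  # same batch size as the data loaders in this notebook

train_size = int(num_examples * 0.80)  # int() truncates 1097.6 down to 1097
val_size = num_examples - train_size   # the remainder goes to validation

# DataLoader keeps the last, smaller batch by default (drop_last=False),
# so the number of batches is the ceiling of size / batch_size.
train_batches = math.ceil(train_size / batch_size)
val_batches = math.ceil(val_size / batch_size)

print(train_size, val_size, train_batches, val_batches)
```

The 110 training batches per epoch agree with the `Batch 000/110` counter in the training log later in this notebook.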

Using `torch.utils.data.random_split`, we generate the training and validation sets along with the respective data loaders:

In [10]:
dataset = MyDataset(X_features, y_labels)

torch.manual_seed(1)
train_set, val_set = torch.utils.data.random_split(dataset, [train_size, val_size])

train_loader = DataLoader(
  dataset=train_set,
  batch_size=10,
  shuffle=True,
)

val_loader = DataLoader(
  dataset=val_set,
  batch_size=10,
  shuffle=False,
)
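Before moving on, it can help to pull a single batch from a loader and check its shapes. The following is a self-contained sketch that uses `TensorDataset` with random stand-in data of the same shape as the banknote data, rather than `MyDataset` and the real features; the names `fake_X`, `fake_y`, and `loader` are hypothetical.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

torch.manual_seed(1)

# Synthetic stand-in with the same shape as the banknote data
# (1372 examples, 4 features); the values are random, not real.
fake_X = torch.randn(1372, 4)
fake_y = torch.randint(0, 2, (1372,)).float()
fake_dataset = TensorDataset(fake_X, fake_y)

train_part, val_part = random_split(fake_dataset, [1097, 275])
loader = DataLoader(train_part, batch_size=10, shuffle=True)

# Pull a single batch and inspect it.
features, labels = next(iter(loader))
print(features.shape, labels.shape)  # torch.Size([10, 4]) torch.Size([10])
print(len(loader))                   # 110
```

The same check on `train_loader` should show feature batches of shape `(10, 4)` and label batches of shape `(10,)`.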

## 4) Standardization

There are multiple ways to implement standardization. For this exercise, we are going to implement a procedure that standardizes the features after the data loaders have been created.

Since this dataset has 4 features, there should be 4 means and 4 standard deviations we compute from the training set. We can do this as follows:

In [11]:
train_mean = torch.zeros(X_features.shape[1])

for x, y in train_loader:
  train_mean += x.sum(dim=0)
train_mean /= len(train_set)

train_std = torch.zeros(X_features.shape[1])
for x, y in train_loader:
  train_std += ((x - train_mean) ** 2).sum(dim=0)
train_std = torch.sqrt(train_std / len(train_set))

In [12]:
print("Feature means:", train_mean)
print("Feature std. devs:", train_std)

Feature means: tensor([ 0.3854,  1.8680,  1.4923, -1.1999])
Feature std. devs: tensor([2.8562, 5.9189, 4.3849, 2.1031])


We compute the means and standard deviations by iterating over the training loader. This approach works even for large datasets that do not fit into memory.

A simpler approach, which only works for smaller datasets that fit into memory, is as follows:

In [13]:
all_x = []

for x, y in train_loader:
  all_x.append(x)

train_std = torch.concat(all_x).std(dim=0)
train_mean = torch.concat(all_x).mean(dim=0)

print("Feature means:", train_mean)
print("Feature std. devs:", train_std)

Feature means: tensor([ 0.3854,  1.8680,  1.4923, -1.1999])
Feature std. devs: tensor([2.8575, 5.9216, 4.3869, 2.1041])
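The two approaches report the same means but slightly different standard deviations. The likely reason is that the loader-based loop divides by the number of examples (population standard deviation), whereas `Tensor.std` applies Bessel's correction by default and divides by one less. The sketch below illustrates the relationship on random stand-in data, not on the banknote features.

```python
import math
import torch

torch.manual_seed(1)
x = torch.randn(1097, 4)  # random stand-in for the training features
n = x.shape[0]

# Population std: divide the summed squared deviations by n,
# as in the loader-based loop.
population_std = torch.sqrt(((x - x.mean(dim=0)) ** 2).sum(dim=0) / n)

# Tensor.std() applies Bessel's correction by default (divides by n - 1).
sample_std = x.std(dim=0)

# The two differ only by the factor sqrt((n - 1) / n).
rescaled = sample_std * math.sqrt((n - 1) / n)
print(torch.allclose(population_std, rescaled, atol=1e-6))
```

Applied to the outputs above, 2.8575 × sqrt(1096 / 1097) ≈ 2.8562, which matches the first feature's value from the loop. With 1097 training examples the difference is negligible for standardization purposes.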


<font color='red'>YOUR TASK is now to implement a standardization function based on these training set parameters above:</font>

In [14]:
def standardize(features, mean, std):
  # Subtract the per-feature mean and divide by the per-feature standard deviation
  return (features - mean) / std
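A quick sanity check for a standardization function is that the training features end up with a mean close to 0 and a standard deviation close to 1 for every feature. The sketch below repeats the function so that it runs on its own and uses random stand-in data with deliberately different scales; the scale and shift values are arbitrary choices for illustration.

```python
import torch

def standardize(features, mean, std):
    # Same formula as the standardize function above.
    return (features - mean) / std

torch.manual_seed(1)
# Random stand-in features with a deliberately shifted and scaled distribution.
x = torch.randn(1000, 4) * torch.tensor([3.0, 6.0, 4.0, 2.0]) + 1.5

mean, std = x.mean(dim=0), x.std(dim=0)
z = standardize(x, mean, std)

print(z.mean(dim=0))  # each entry close to 0
print(z.std(dim=0))   # each entry close to 1
```

Because `mean` and `std` have shape `(4,)`, broadcasting applies them per feature across every row of the batch.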

## 5) Implementing the model

Here, we are reusing the same model code we used in Unit 3.6:

In [15]:
class LogisticRegression(torch.nn.Module):
    
  def __init__(self, num_features):
    super().__init__()
    self.linear = torch.nn.Linear(num_features, 1)
  
  def forward(self, x):
    logits = self.linear(x)
    probas = torch.sigmoid(logits)
    return probas
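One detail worth checking is the output shape. The model returns one probability per example with shape `(batch_size, 1)`, while the labels coming from the dataset have shape `(batch_size,)`, which is why the training loop reshapes the labels with `.view(probas.shape)`. A minimal sketch with a bare linear layer and random inputs as stand-ins:

```python
import torch

torch.manual_seed(1)

# A bare linear layer plus sigmoid, mirroring the forward pass above.
linear = torch.nn.Linear(4, 1)
features = torch.randn(10, 4)  # one random batch of 10 examples
probas = torch.sigmoid(linear(features))

labels = torch.randint(0, 2, (10,)).float()

print(probas.shape)                     # torch.Size([10, 1])
print(labels.shape)                     # torch.Size([10])
print(labels.view(probas.shape).shape)  # torch.Size([10, 1])
```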

## 6) The training loop

In this section, we are using the training loop from Unit 3.6. It's the exact same code except for one small modification: we added the line `if not batch_idx % 20` to print the loss only for every 20th batch (to reduce the number of output lines).

<font color='red'>YOUR TASK is to use the standardization code correctly in the for loop. Then, find a good learning rate and epoch number so that you achieve a training and validation performance of at least 98%.</font>

In [16]:
torch.manual_seed(1)

model = LogisticRegression(num_features=4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01) ## FILL IN VALUE

num_epochs = 100  ## FILL IN VALUE

for epoch in range(num_epochs):
  model = model.train()
  for batch_idx, (features, class_labels) in enumerate(train_loader):
    features = standardize(features, train_mean, train_std) ## SOLUTION
    probas = model(features)
    
    loss = F.binary_cross_entropy(probas, class_labels.view(probas.shape))
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    ### LOGGING
    if not batch_idx % 20: # log every 20th batch
        print(f'Epoch: {epoch+1:03d}/{num_epochs:03d}'
                f' | Batch {batch_idx:03d}/{len(train_loader):03d}'
                f' | Loss: {loss:.2f}')

Epoch: 001/100 | Batch 000/110 | Loss: 0.93
Epoch: 001/100 | Batch 020/110 | Loss: 0.85
Epoch: 001/100 | Batch 040/110 | Loss: 0.76
Epoch: 001/100 | Batch 060/110 | Loss: 0.63
Epoch: 001/100 | Batch 080/110 | Loss: 0.63
Epoch: 001/100 | Batch 100/110 | Loss: 0.65
Epoch: 002/100 | Batch 000/110 | Loss: 0.68
Epoch: 002/100 | Batch 020/110 | Loss: 0.53
Epoch: 002/100 | Batch 040/110 | Loss: 0.56
Epoch: 002/100 | Batch 060/110 | Loss: 0.55
Epoch: 002/100 | Batch 080/110 | Loss: 0.44
Epoch: 002/100 | Batch 100/110 | Loss: 0.42
Epoch: 003/100 | Batch 000/110 | Loss: 0.43
Epoch: 003/100 | Batch 020/110 | Loss: 0.52
Epoch: 003/100 | Batch 040/110 | Loss: 0.53
Epoch: 003/100 | Batch 060/110 | Loss: 0.57
Epoch: 003/100 | Batch 080/110 | Loss: 0.38
Epoch: 003/100 | Batch 100/110 | Loss: 0.48
Epoch: 004/100 | Batch 000/110 | Loss: 0.38
Epoch: 004/100 | Batch 020/110 | Loss: 0.29
Epoch: 004/100 | Batch 040/110 | Loss: 0.48
Epoch: 004/100 | Batch 060/110 | Loss: 0.41
Epoch: 004/100 | Batch 080/110 |
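To search for a good learning rate more systematically, one option is to wrap training in a function and compare several values. The following is only a sketch under simplifying assumptions: it uses synthetic, linearly separable data instead of the banknote features, slices mini-batches manually instead of using a `DataLoader`, and scores on the training data. The helper `train_and_score` is a hypothetical name, not part of the Unit 3.6 code.

```python
import torch
import torch.nn.functional as F

def train_and_score(lr, num_epochs, X, y):
    """Train a fresh logistic regression model and return its accuracy on X, y."""
    torch.manual_seed(1)  # same initial weights for every setting
    linear = torch.nn.Linear(X.shape[1], 1)
    optimizer = torch.optim.SGD(linear.parameters(), lr=lr)

    for _ in range(num_epochs):
        for start in range(0, X.shape[0], 10):  # mini-batches of 10
            xb, yb = X[start:start + 10], y[start:start + 10]
            probas = torch.sigmoid(linear(xb))
            loss = F.binary_cross_entropy(probas, yb.view(probas.shape))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    with torch.no_grad():
        pred = (torch.sigmoid(linear(X)) > 0.5).float().view(-1)
    return (pred == y).float().mean().item()

# Synthetic, linearly separable stand-in data (not the banknote dataset).
torch.manual_seed(0)
X = torch.randn(200, 4)
y = (X[:, 0] + X[:, 1] > 0).float()

results = {lr: train_and_score(lr, num_epochs=5, X=X, y=y) for lr in (0.001, 0.01, 0.1)}
for lr, acc in results.items():
    print(f"lr={lr}: accuracy={acc * 100:.1f}%")
```

For the exercise itself, the same idea would apply to the standardized training batches, with the validation accuracy deciding between settings.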

## 7) Evaluating the results

Again, reusing the code from Unit 3.6, we will calculate the training and validation set accuracy.

In [17]:
def compute_accuracy(model, dataloader):

  model = model.eval()
  
  correct = 0.0
  total_examples = 0
  
  for idx, (features, class_labels) in enumerate(dataloader):
    with torch.no_grad():
      probas = model(features)
    
    pred = torch.where(probas > 0.5, 1, 0)
    lab = class_labels.view(pred.shape).to(pred.dtype)

    compare = lab == pred
    correct += torch.sum(compare)
    total_examples += len(compare)

  return correct / total_examples

In [18]:
train_acc = compute_accuracy(model, train_loader)
print(f"Accuracy: {train_acc*100:.2f}%")

Accuracy: 86.69%


<font color='red'>Notice that the code for the validation accuracy is not shown? It's part of the exercise to implement it :)</font>

In [19]:
val_acc = compute_accuracy(model, val_loader)
print(f"Accuracy: {val_acc*100:.2f}%")

Accuracy: 82.18%


<font color='red'>Now, add the standardization to the `compute_accuracy` function above and recompute the training and validation accuracy. What do you observe?</font>

In [20]:
def compute_accuracy(model, dataloader):

  model = model.eval()
  
  correct = 0.0
  total_examples = 0
  
  for idx, (features, class_labels) in enumerate(dataloader):
    with torch.no_grad():
      features = standardize(features, train_mean, train_std) ## SOLUTION
      probas = model(features)
    
    pred = torch.where(probas > 0.5, 1, 0)
    lab = class_labels.view(pred.shape).to(pred.dtype)

    compare = lab == pred
    correct += torch.sum(compare)
    total_examples += len(compare)

  return correct / total_examples

In [21]:
train_acc = compute_accuracy(model, train_loader)
print(f"Accuracy: {train_acc*100:.2f}%")

Accuracy: 97.63%


In [22]:
val_acc = compute_accuracy(model, val_loader)
print(f"Accuracy: {val_acc*100:.2f}%")

Accuracy: 98.18%
