# Neural Network to Distinguish Power Density Graphs

This repository contains the code to train and export a neural network that takes a Power Density Function (PDF) file as input and outputs a value between 0 and 1. A value closer to 1 indicates that the function contains inaccuracies that may need to be reviewed, while a value closer to 0 means that there are no apparent inaccuracies.
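As a minimal sketch of how the output score might be turned into a decision, the snippet below applies a cutoff. The `needs_review` helper and the 0.5 threshold are assumptions for illustration and are not part of this repository.

```python
def needs_review(score, threshold=0.5):
    """Return True when the score suggests the file should be reviewed.

    `threshold` is a hypothetical cutoff; it should be tuned on validation data.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    return score > threshold


# Hypothetical scores, not real model output.
for score in (0.03, 0.48, 0.91):
    print(score, "review" if needs_review(score) else "looks fine")
```

A lower threshold flags more files for review at the cost of more false alarms.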

### Prerequisites

The following code relies heavily on the PyTorch library. If you are unfamiliar with PyTorch or fundamental machine learning concepts, please review the following links before attempting to modify the code:

[3Blue1Brown's Introduction to Deep Learning](https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi)

[PyTorch Tutorial - What is torch.nn really?](https://pytorch.org/tutorials/beginner/nn_tutorial.html)

[PyTorch Tutorial - Beginner Basics](https://pytorch.org/tutorials/beginner/basics/intro.html)

In [1]:
import pandas as pd
import numpy as np

import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torch import nn

import os

from tqdm import tqdm

import sys

### Loading Data

In [2]:
# For debugging purposes - prints the full values of numpy arrays and pytorch tensors

# np.set_printoptions(threshold=sys.maxsize)
# torch.set_printoptions(threshold=sys.maxsize)

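The following code creates a [custom dataset](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) that loads the data for both the training and validation sets. As written, it reads from two folders, one holding correct PDF files and one holding incorrect ones, whose paths are hard-coded in `__init__` and must be changed to match your environment. The sorted file listings of the two folders are interleaved into (correct, incorrect) pairs, and the pairs are assigned alternately to the training and validation sets.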

Each PDF file in both the training and validation data must follow the format of a PDF analysis generated by data center stations. An example called `PDFanalysis.2023.001.pdf` is attached for reference. Each file is read with `np.fromfile` as whitespace-separated numbers, and every third value, starting from the third, is kept as a probability value.
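The parsing step can be sketched on a tiny stand-in file. The sample rows below are hypothetical values, and the assumption that each row holds three numbers with the probability last is inferred from the `[2::3]` slice used in the dataset code.

```python
import os
import tempfile

import numpy as np

# Hypothetical three-value rows; the last value in each row is the probability.
sample = "0.1 -150 0.00\n0.1 -149 0.25\n0.2 -150 0.75\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.pdf")
    with open(path, "w") as f:
        f.write(sample)

    # sep=" " makes np.fromfile read the file as text. Any whitespace,
    # including newlines, separates values, so the result is one flat array.
    flat = np.fromfile(path, sep=" ")

# Start at index 2 and take every third value: the third number of each row.
probabilities = flat[2::3]
print(probabilities)
```

Because the array is flat, the slice only lines up with the third column when every row has exactly three values.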

For simplicity in labeling, the files are interleaved so that every correct PDF file sits at an even index and every incorrect PDF file at an odd index. Taking the index mod 2 therefore gives the label: 0 for correct and 1 for incorrect.
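The even/odd labeling can be illustrated without any real data. The file names below are hypothetical placeholders for the sorted listings of the two folders.

```python
# Hypothetical file names standing in for the two folders' sorted listings.
correct_files = ["c0.pdf", "c1.pdf"]
incorrect_files = ["i0.pdf", "i1.pdf"]

# Interleave: correct files land on even indices, incorrect ones on odd indices.
files = [name for pair in zip(correct_files, incorrect_files) for name in pair]

# The label is the index mod 2: 0 for correct, 1 for incorrect.
labels = [index % 2 for index in range(len(files))]

for name, label in zip(files, labels):
    print(name, label)
```

Note that `zip` stops at the shorter list, so any files without a partner in the other folder are left out.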

In [20]:
class PDFDataset(Dataset):
    def __init__(self, is_training):
        # Hard-coded data locations; change these to match your environment
        correct_path = "/home/gcl/TT/sylvesterseo/bsl/BSL_TOOLKIT/strongmotion_ml/data/correct"
        incorrect_path = "/home/gcl/TT/sylvesterseo/bsl/BSL_TOOLKIT/strongmotion_ml/data/incorrect"

        self.data = {}

        # Interleave the two sorted listings:
        # Correct data: 0, 2, 4, ...
        # Incorrect data: 1, 3, 5, ...
        # Note: sorted() compares names as strings, so "10" sorts before "2"
        self.files = [item for pair in zip(sorted(os.listdir(correct_path)), sorted(os.listdir(incorrect_path))) for item in pair]
        print(self.files)  # debug output

        # A step of 4 skips every other (correct, incorrect) pair, so the
        # training and validation sets take alternating pairs
        initial_index = 0 if is_training else 2
        counter = 0
        for i in tqdm(range(initial_index, len(self.files), 4)):
            print(i)  # debug output
            # Read whitespace-separated numbers and keep every third value
            correct_arr = np.fromfile(os.path.join(correct_path, self.files[i]), sep=" ")[2::3]
            incorrect_arr = np.fromfile(os.path.join(incorrect_path, self.files[i + 1]), sep=" ")[2::3]
            self.data[counter] = torch.from_numpy(correct_arr).float()
            self.data[counter + 1] = torch.from_numpy(incorrect_arr).float()
            counter += 2

        print(self.data)  # debug output

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        # Even index -> label 0 (correct), odd index -> label 1 (incorrect)
        return self.data[i], torch.tensor(i % 2).reshape((1, )).float()

In [21]:
training_data = PDFDataset(True)
validation_data = PDFDataset(False)

['0', 'BK.BRIB.01.HNN.2024.049.pdf', '1', 'BK.BRIB.01.HNN.2024.056.pdf', '2', 'BK.BRIB.01.HNN.2024.063.pdf']


100%|██████████| 2/2 [00:00<00:00, 57.68it/s]


0
4
{0: tensor([]), 1: tensor([0., 0., 0.,  ..., 0., 0., 0.]), 2: tensor([]), 3: tensor([0., 0., 0.,  ..., 0., 0., 0.])}
['0', 'BK.BRIB.01.HNN.2024.049.pdf', '1', 'BK.BRIB.01.HNN.2024.056.pdf', '2', 'BK.BRIB.01.HNN.2024.063.pdf']


100%|██████████| 1/1 [00:00<00:00, 56.64it/s]

2
{0: tensor([]), 1: tensor([0., 0., 0.,  ..., 0., 0., 0.])}





In [None]:
train_dataloader = DataLoader(training_data, batch_size=2, shuffle=True)
test_dataloader = DataLoader(validation_data, batch_size=2)

### Neural Network

The following [neural network](https://pytorch.org/tutorials/beginner/basics/buildmodel_tutorial.html) is a fully connected network built from linear layers with ReLU activations, ending in a sigmoid that maps the output to a value between 0 and 1. A possible future improvement is a convolutional neural network that could detect patterns common to correct PDF files.

The input is the full set of probability values in a PDF analysis: a 122 by 151 grid, stored as a flat tensor of 122 × 151 = 18,422 values.
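The shape arithmetic can be checked with a short sketch. The random tensor is only a stand-in for real probability values, and the variable names are illustrative.

```python
import torch
from torch import nn

# Stand-in for one file's probability values (random data, used for shape only).
grid = torch.rand(122, 151)

# nn.Linear acts on the last dimension, so the grid is flattened to 18,422 values.
flat = grid.reshape(-1)

layer = nn.Linear(122 * 151, 512)
batch = torch.stack([flat, flat])  # a batch of two samples
print(flat.shape, layer(batch).shape)
```

A file that yields a different number of values would not match the first layer and would fail at this point.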

In [None]:
class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(122 * 151, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 1),
            nn.Sigmoid()
        )
    
    def forward(self, x):
        return self.linear_relu_stack(x)

In [None]:
model = NeuralNetwork()

### Training

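The functions for training and validating the model are defined below. The training loop runs a forward pass, computes the loss, backpropagates the gradients, and updates the weights; the test loop only evaluates the model, without updating it.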

In [None]:
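def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)

    # Put the model in training mode
    model.train()

    for batch, (X, y) in enumerate(dataloader):
        # Forward pass: compute the prediction and the loss
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backward propagation: compute gradients, update weights, reset gradients
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if batch % 100 == 0:
            # Use the dataloader's own batch size instead of a global variable
            loss, current = loss.item(), batch * dataloader.batch_size + len(X)
            print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]")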

In [None]:
def test_loop(dataloader, model, loss_fn):
    model.eval()
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    test_loss, correct = 0, 0

    with torch.no_grad():
        for X, y in dataloader:
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.round() == y).type(torch.float).sum().item()

    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

The learning rate controls how far the weights move on each update: a high learning rate changes the neural network more per iteration, while a low learning rate changes it less.
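To make the effect of the learning rate concrete, here is a toy sketch that runs plain gradient descent on f(w) = w². The function, the `descend` helper, and the rates are chosen for illustration only and have no connection to the model in this repository.

```python
def descend(learning_rate, steps=5, w=1.0):
    """Run plain gradient descent on f(w) = w**2 and return every value of w."""
    history = [w]
    for _ in range(steps):
        gradient = 2 * w  # derivative of w**2
        w = w - learning_rate * gradient
        history.append(w)
    return history


# A small rate creeps toward the minimum at 0; a larger one gets there faster.
print("lr=0.01:", descend(0.01))
print("lr=0.4: ", descend(0.4))
```

For this toy function, a rate above 1.0 overshoots the minimum by more than it started with on every step, so the values grow instead of shrinking.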

Batch size is the number of samples (in this case, PDF files) used for each weight update. It is currently set to 2 because little training data is available.

In one epoch, the model iterates through all of the available training data once. Ten epochs means the model passes over the training data ten times; because the training dataloader uses `shuffle=True`, the data is reshuffled before each pass.
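The relationship between dataset size, batch size, and epochs comes down to simple arithmetic. The sketch below assumes a hypothetical training set of four samples and that the last partial batch is kept, which is the `DataLoader` default.

```python
import math


def updates_per_epoch(num_samples, batch_size):
    """Number of batches (weight updates) needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


num_samples = 4  # hypothetical training set size
batch_size = 2
epochs = 10

per_epoch = updates_per_epoch(num_samples, batch_size)
print(per_epoch, per_epoch * epochs)
```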

In [None]:
learning_rate = 1e-2
batch_size = 2
epochs = 10

In [None]:
loss_fn = nn.BCELoss()
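
For reference, binary cross-entropy averages -(y·log(p) + (1 - y)·log(1 - p)) over the samples, which is what `nn.BCELoss` computes with its default mean reduction. The plain-Python sketch below uses made-up predictions to show that confident wrong answers are penalized far more than confident right ones; the helper name is illustrative.

```python
import math


def binary_cross_entropy(predictions, targets):
    """Mean binary cross-entropy for probabilities in (0, 1) and 0/1 targets."""
    total = 0.0
    for p, y in zip(predictions, targets):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(predictions)


# Made-up predictions for two samples with targets 1 and 0.
confident_right = binary_cross_entropy([0.9, 0.1], [1, 0])
confident_wrong = binary_cross_entropy([0.1, 0.9], [1, 0])
print(confident_right, confident_wrong)
```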

In [None]:
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

In [None]:
for t in range(epochs):
    print(f"Epoch {t + 1}\n----------------------------------")
    train_loop(train_dataloader, model, loss_fn, optimizer)
    test_loop(test_dataloader, model, loss_fn)
print("Done!")

### Exporting Data

After the model is trained, its weights can be exported for use by other programs. For an example of how to use the exported model weights, see [the BSL analysis toolkit](https://github.com/sylvster/BSL-ML-ANALYZE), a program that uses this model's parameters to separate batches of PDF data into correct and incorrect bins.

In [None]:
torch.save(model.state_dict(), 'model_weights.pth')
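
A minimal sketch of loading the weights back, assuming the consuming program defines the same architecture. The temporary path and the random input are stand-ins so the example runs on its own; this is not necessarily how the BSL analysis toolkit does it.

```python
import os
import tempfile

import torch
from torch import nn


class NeuralNetwork(nn.Module):
    """Same architecture as above; it must match for the weights to load."""

    def __init__(self):
        super().__init__()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(122 * 151, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.linear_relu_stack(x)


with tempfile.TemporaryDirectory() as tmp:
    # A temporary file is used so the sketch does not touch real weights.
    path = os.path.join(tmp, "model_weights.pth")
    torch.save(NeuralNetwork().state_dict(), path)

    restored = NeuralNetwork()
    restored.load_state_dict(torch.load(path))

restored.eval()  # switch to inference mode
with torch.no_grad():
    score = restored(torch.rand(1, 122 * 151)).item()  # random stand-in input
print(score)
```

Saving only the `state_dict` keeps the file independent of the notebook, but the loading side has to recreate the model class itself.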