## Demo of DUNES data
A demonstration of how to use the DUNES_data Python module to stream DUNES data for neural networks in PyTorch. 
This assumes that you are using the data on INL's HPC systems, or that your directory structure matches the HPC's: `/projects/dunes/[code, large_data]`.

* The LargeIterDataset defines the dataset, much like a PyTorch Dataset. It is not necessary to even import this class. Everything can be defined by just the BatchIterator.
* The BatchIterator is a simple way to properly sample and iterate on a LargeIterDataset. It handles all the randomized shuffling of the data.
* The function `make_data_dicts()` takes an SNR and a preamble size value and creates two dictionaries of LargeIterDataset objects, one for train data and one for test data. Each dictionary contains a LargeIterDataset for **each** protocol.
* The default `train_ratio` is 0.8.
* It is not hard to modify the `make_data_dicts()` function to create train, validation, and test dictionaries. You would then be able to define three iterators. Alternatively, you could split the test iterator.
* Each iteration of the BatchIterator returns one **batch** of samples over three domains (Time, Frequency, Discrete Cosine Transform), plus the label vector for those samples.
* You can also use the MacroIterator class directly to get a single sample of data as a quadruple (time, freq, DCT, label). It works essentially like a BatchIterator with a batch size of one.
* **Note:** The premade neural networks all assume a batch dimension, so if you use only the MacroIterator, you will need something like `<data_tensor>.unsqueeze(dim=0)` to get the dimensionality correct.
* For demonstration purposes, we can also import a simple set of autoencoders (written in PyTorch). Importing these autoencoders is not generally necessary.
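
The note about the batch dimension can be illustrated with a random tensor standing in for a single MacroIterator sample. The shape `(2, 128)` (real and imaginary channels by preamble size) is an assumption inferred from the `Conv1d(2, 4, ...)` input layer of the premade convolutional autoencoder; the DUNES_data module may lay out its samples differently.

```python
import torch

# Stand-in for one time-domain sample from a MacroIterator.
# Assumed shape: (channels=2 for real/imag parts, preamble size=128).
single_sample = torch.randn(2, 128)

# Add a leading batch dimension so the tensor looks like a batch of one.
batched = single_sample.unsqueeze(dim=0)

print(single_sample.shape)  # torch.Size([2, 128])
print(batched.shape)        # torch.Size([1, 2, 128])
```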

The data labels are as follows (`make_data_dicts()` also prints this out as a reminder):
* Bluetooth: 0
* Bluetooth LE: 1
* WLAN: 2
* WLAN HE: 3
* Zigbee: 4

At some point, we could add functionality for holding out Zigbee in order to create an "open" class, as opposed to the closed-set classification used here.
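
One way such a hold-out might look is sketched below. Everything here is hypothetical: the helpers `hold_out_protocol` and `to_open_set_label`, the "unknown" label value `-1`, and the assumption that the data dictionaries are keyed by protocol name, which the DUNES_data module may not do.

```python
# Label mapping as printed by make_data_dicts().
LABELS = {"Bluetooth": 0, "Bluetooth LE": 1, "WLAN": 2, "WLAN HE": 3, "Zigbee": 4}
UNKNOWN = -1  # hypothetical label for the "open" class


def hold_out_protocol(data_dict, held_out):
    """Split a protocol-keyed dict into (known, unseen) dicts.

    Hypothetical helper: assumes the keys are protocol names.
    """
    known = {name: ds for name, ds in data_dict.items() if name != held_out}
    unseen = {name: ds for name, ds in data_dict.items() if name == held_out}
    return known, unseen


def to_open_set_label(label, held_out_label):
    """Map the held-out class to UNKNOWN and leave other labels unchanged."""
    return UNKNOWN if label == held_out_label else label


# Strings stand in for the LargeIterDataset objects in this sketch.
fake_train_dict = {name: f"<dataset for {name}>" for name in LABELS}
known_train, unseen_train = hold_out_protocol(fake_train_dict, "Zigbee")
open_labels = [to_open_set_label(l, LABELS["Zigbee"]) for l in range(5)]

print(sorted(known_train))  # ['Bluetooth', 'Bluetooth LE', 'WLAN', 'WLAN HE']
print(open_labels)          # [0, 1, 2, 3, -1]
```

Training would then use only the known protocols, while the held-out protocol appears at test time under the unknown label.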


In [1]:
from DUNES_data import BatchIterator, make_data_dicts, SimpleConvAE, SimpleLinearAE

import torch
import torch.nn as nn
from torch.optim import Adam

Define the signal-to-noise ratio (SNR: 5, 10, 15, 20, or 25) and then the packet preamble size (usually 128 or 256). For simple printout demonstrations, you may want a preamble size of 4 or 8. For actual model training, you will want it to be much larger. 

In [2]:
snr=25
preamb_sz = 128
train_ratio = 0.7
total_samples = 10000
batch_sz=64

To stop `make_data_dicts()` from printing the labels, pass `explain=False`. Note that the data dictionaries are created only once for the training and test sets. This is important because it means that we can create as many iterators on the train and test data as we want.

In [3]:
train_data_dict, test_data_dict = make_data_dicts(snr=snr, chunk_sz=preamb_sz, train_ratio = train_ratio)

The data labels are as follows:

	0: Bluetooth
	1: Bluetooth LE
	2: WLAN
	3: WLAN HE
	4: Zigbee
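
For reference, the sample and batch counts implied by the settings in cell [2] can be worked out with plain arithmetic. Whether `BatchIterator` keeps or drops a final partial batch is not stated in this notebook, so both counts are shown. Using `round()` and deriving the test count by subtraction are choices made in this sketch to avoid floating-point rounding surprises; the notebook itself uses `int(...)`.

```python
import math

total_samples = 10000
train_ratio = 0.7
batch_sz = 64

# round() rather than int() so float error cannot shave off a sample.
n_train = round(total_samples * train_ratio)
# Subtracting keeps the two counts summing exactly to total_samples.
n_test = total_samples - n_train

# Batches per training epoch, without and with a final partial batch.
full_batches = n_train // batch_sz
batches_with_partial = math.ceil(n_train / batch_sz)

print(n_train, n_test)                     # 7000 3000
print(full_batches, batches_with_partial)  # 109 110
```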


--------------
## Using the BatchIterator
We can use a BatchIterator to iterate through the data set.

#### Only run this block with a small batch size and a small preamble size.
Otherwise the printout will be a mile long. 

In [None]:
train_batchIter = BatchIterator(train_data_dict, batch_sz=batch_sz, num_samp=int(total_samples*train_ratio))
test_batchIter = BatchIterator(test_data_dict, batch_sz=batch_sz, num_samp=int(total_samples*(1-train_ratio)))

for t,f,d,l in train_batchIter:
    print("Time domain:\n", t)
    print("Freq domain:\n", f)
    print("DCT domain:\n", d)
    print("Label:", l)
    print("\n-----------------------------------\n")

-------------

## BatchIterator and Model
We can use the BatchIterator to feed data into a simple PyTorch model. Throughout this notebook, the quadruple `t,f,d,l` refers to the time, frequency, and DCT data plus the label. 

There is nothing too magical about the learning rates or layer sizes. Anecdotally, the feature extractor for the frequency domain seems to need a smaller learning rate. 

In [4]:
t_ae = SimpleConvAE()
t_optimizer = Adam(t_ae.parameters(), lr=0.01)
f_ae = SimpleConvAE()
f_optimizer = Adam(f_ae.parameters(), lr=0.001)
d_ae = SimpleConvAE()
d_optimizer = Adam(d_ae.parameters(), lr=0.01)

criterion = nn.MSELoss()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"We are using {device} as our device.")
t_ae.to(device)
f_ae.to(device)
d_ae.to(device)

We are using cuda as our device.


SimpleConvAE(
  (encoder): Sequential(
    (0): Conv1d(2, 4, kernel_size=(3,), stride=(2,), padding=(1,))
    (1): LeakyReLU(negative_slope=0.2, inplace=True)
    (2): Conv1d(4, 8, kernel_size=(3,), stride=(2,), padding=(1,))
    (3): LeakyReLU(negative_slope=0.05, inplace=True)
  )
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (decoder): Sequential(
    (0): ConvTranspose1d(8, 4, kernel_size=(3,), stride=(2,), padding=(1,), output_padding=(1,))
    (1): LeakyReLU(negative_slope=0.05, inplace=True)
    (2): ConvTranspose1d(4, 2, kernel_size=(3,), stride=(2,), padding=(1,), output_padding=(1,))
    (3): Sigmoid()
  )
)
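
The decoder printed above ends in `Sigmoid()`, so every reconstructed value lies between 0 and 1. If a domain's inputs fall outside that range (an assumption worth checking against the actual data, not something this notebook confirms), the MSE loss has a floor it cannot go below. The sketch below uses only the standard library to show that floor, plus a hypothetical `min_max_scale` helper as one possible remedy.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function; its output always lies in [0, 1] in floating point."""
    return 1.0 / (1.0 + math.exp(-x))


def min_max_scale(values):
    """Hypothetical helper: rescale a list of numbers into [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# However large the pre-activation is, the output never exceeds 1.
outputs = [sigmoid(x) for x in (-50.0, 0.0, 50.0)]

# So a target of 4.0 can never be matched more closely than (4 - 1)^2.
target = 4.0
best_possible_error = (target - 1.0) ** 2

# After scaling, every target is reachable by a sigmoid output.
scaled = min_max_scale([-3.0, 0.0, 4.0, 1.0])

print(best_possible_error)       # 9.0
print(min(scaled), max(scaled))  # 0.0 1.0
```

If this turns out to apply, scaling the inputs into [0, 1] before training or swapping the final activation are two options to try.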

Note that you can train all three autoencoders (t, f, d) in a single loop, as the next cell does. You just have to be careful to keep everything well annotated. We could also implement early stopping, though it may not be necessary at this point.  
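
Early stopping could be sketched roughly as follows. The `EarlyStopper` class, its `patience` and `min_delta` parameters, and the loss values are all illustrative and are not part of DUNES_data; one such object per autoencoder would be needed when training the three models together.

```python
class EarlyStopper:
    """Signal a stop when the loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 3, min_delta: float = 1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, loss: float) -> bool:
        # An epoch counts as an improvement only if it beats the best
        # loss so far by more than min_delta.
        if loss < self.best - self.min_delta:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up per-epoch average losses, for illustration only.
losses = [1.0, 0.5, 0.49, 0.495, 0.492, 0.2]
stopper = EarlyStopper(patience=3, min_delta=0.05)

stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break

print(stopped_at)  # 4 (the fifth epoch, counting from zero)
```

In the training loop, the check would go right after the average loss for the epoch is computed.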

In [5]:
## Number of training epochs
num_epochs = 20

## Training loop for PyTorch
for epoch in range(num_epochs):
    total_t_loss = 0.0
    total_f_loss = 0.0
    total_d_loss = 0.0
    num_batches = 0
    train_batchIter = BatchIterator(train_data_dict, batch_sz=batch_sz, num_samp=int(total_samples*train_ratio))
    
    # Iterate over batches in the training dataset.
    for  t,f,d,l in train_batchIter:
        ## Zero the gradients for all three models
        t_optimizer.zero_grad()
        f_optimizer.zero_grad()
        d_optimizer.zero_grad()
        
        t,f,d,l = t.to(device), f.to(device), d.to(device), l.to(device)
        
        # Forward pass all models
        t_out = t_ae(t.float())
        f_out = f_ae(f.float())
        d_out = d_ae(d.float())
        
        # Compute the loss of each model separately. 
        t_loss = criterion(t_out, t.float())
        f_loss = criterion(f_out, f.float())
        d_loss = criterion(d_out, d.float())
        
        # Backward pass
        t_loss.backward()
        f_loss.backward()
        d_loss.backward()
        
        # Update the parameters
        t_optimizer.step()
        f_optimizer.step()
        d_optimizer.step()
        
        # Accumulate the total loss
        total_t_loss += t_loss.item()
        total_f_loss += f_loss.item()
        total_d_loss += d_loss.item()
        num_batches += 1
    
    # Compute the average loss for the epoch
    ## print(f"This is the number of batches so far: {num_batches}")
    avg_t_loss = total_t_loss / num_batches
    avg_f_loss = total_f_loss / num_batches
    avg_d_loss = total_d_loss / num_batches
    
    # Print the average loss for the epoch
    if (epoch+1) % 2 ==0:
        print(f"\nEpoch [{epoch+1}/{num_epochs}],\nAvg. time domain loss: {avg_t_loss:.6f}")
        print(f"Avg. frequency domain loss: {avg_f_loss:.6f}\nAvg. DCT domain loss: {avg_d_loss:.6f}")
        print("-----------------------------------------------\n")


Epoch [2/20],
Avg. time domain loss: 0.050349
Avg. frequency domain loss: 10.004286
Avg. DCT domain loss: 0.058740
-----------------------------------------------


Epoch [4/20],
Avg. time domain loss: 0.046727
Avg. frequency domain loss: 9.868808
Avg. DCT domain loss: 0.056868
-----------------------------------------------


Epoch [6/20],
Avg. time domain loss: 0.045294
Avg. frequency domain loss: 9.853449
Avg. DCT domain loss: 0.055346
-----------------------------------------------


Epoch [8/20],
Avg. time domain loss: 0.043657
Avg. frequency domain loss: 9.846097
Avg. DCT domain loss: 0.054403
-----------------------------------------------


Epoch [10/20],
Avg. time domain loss: 0.043363
Avg. frequency domain loss: 9.840994
Avg. DCT domain loss: 0.053265
-----------------------------------------------


Epoch [12/20],
Avg. time domain loss: 0.043018
Avg. frequency domain loss: 9.838065
Avg. DCT domain loss: 0.052748
-----------------------------------------------


Epoch [14/20

------------------------
## Feature Extraction
Once the autoencoder (feature extractor) is trained, we can use it to extract features. 
* Use the `<SimpleConvAE>.features(domain_data)` function to get these features. 
* If the model was trained on a GPU (cuda), the data also needs to be sent to that device. 
* Use `detach()` to detach the output from the autograd graph, and `.cpu()` if you need to move it off the GPU. 
* Because PyTorch is picky about data types, we need to cast the data to float using `.float()`.
* Keep in mind that whatever your batch size is will also be the first dimension of the output of the `features()` function.
* Thus, if we have a batch size of 16, we will get *16* feature vectors for each input batch (also of size 16, of course).
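
The bullets above can be combined into a short sketch. `TinyConvEncoder` is a stand-in written for this example: it copies the encoder layers from the printed `SimpleConvAE` summary, and its `features()` method is assumed to behave like the real one (encoder followed by flatten). Wrapping the call in `torch.no_grad()` is an addition here; it skips gradient tracking, which is not needed for feature extraction.

```python
import torch
import torch.nn as nn


class TinyConvEncoder(nn.Module):
    """Stand-in for SimpleConvAE, built from its printed encoder layers."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(2, 4, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv1d(4, 8, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.05),
        )
        self.flatten = nn.Flatten()

    def features(self, x):
        # Assumed behavior: encode, then flatten to one vector per sample.
        return self.flatten(self.encoder(x))


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TinyConvEncoder().to(device)
model.eval()

# Fake batch of 16 samples, 2 channels, preamble size 128, in double
# precision to show why the .float() cast is needed.
batch = torch.randn(16, 2, 128, dtype=torch.float64)

with torch.no_grad():
    feats = model.features(batch.to(device).float())

# Detach from the graph and bring the result back to host memory.
feats_cpu = feats.detach().cpu()
print(feats_cpu.shape)  # torch.Size([16, 256])
```

Two stride-2 convolutions reduce the length from 128 to 32, and 8 channels times 32 positions gives the 256 features per sample.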

In [8]:
test_batchIter = BatchIterator(test_data_dict, batch_sz=batch_sz, num_samp=int(total_samples*(1-train_ratio)))

In [9]:
t,f,d,l = next(test_batchIter)
t,f,d,l = t.to(device), f.to(device), d.to(device), l.to(device)
##print(f"The label is {l}")
t_ae.features(t.float()).shape

torch.Size([64, 256])

-------------

In [21]:
## Fully connected autoencoder. Does not seem to be as effective as the CNN. 

In [15]:
lae = SimpleLinearAE(preamb_sz)
criterion = nn.MSELoss()
optimizer = Adam(lae.parameters(), lr=0.005)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
lae.to(device)


-----------------
It is assumed that input data has separated real and imaginary parts.

Data will be flattened to an input size of 256 unless a value for factor arg has been set.
-----------------


SimpleLinearAE(
  (encoder): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=256, out_features=128, bias=True)
    (2): ReLU()
    (3): Linear(in_features=128, out_features=64, bias=True)
    (4): ReLU()
    (5): Linear(in_features=64, out_features=16, bias=True)
  )
  (decoder): Sequential(
    (0): Linear(in_features=16, out_features=64, bias=True)
    (1): ReLU()
    (2): Linear(in_features=64, out_features=128, bias=True)
    (3): ReLU()
    (4): Linear(in_features=128, out_features=256, bias=True)
    (5): Sigmoid()
  )
)

In [16]:
## Number of training epochs
num_epochs = 20

## Training loop for PyTorch
for epoch in range(num_epochs):
    total_loss = 0.0
    num_batches = 0
    train_batchIter = BatchIterator(train_data_dict, batch_sz=batch_sz, num_samp=int(total_samples*train_ratio))
    
    # Iterate over batches in the training dataset
    for t,f,d,l in train_batchIter:
        # Zero the gradients
        optimizer.zero_grad()
        
        t,f,d,l = t.to(device), f.to(device), d.to(device), l.to(device)
        
        # Forward pass
        output_data = lae(d.float())
        #print(d)
        #print(output_data)
        # Compute the loss
        loss = criterion(output_data, d.float())
        
        # Backward pass
        loss.backward()
        
        # Update the parameters
        optimizer.step()
        
        # Accumulate the total loss
        total_loss += loss.item()
        num_batches += 1
    
    # Compute the average loss for the epoch
    avg_loss = total_loss / num_batches
    
    # Print the average loss for the epoch
    if (epoch+1) % 1 ==0:
        print(f"Epoch [{epoch+1}/{num_epochs}], Avg. Loss: {avg_loss:.4f}")

Epoch [1/20], Avg. Loss: 0.0876
Epoch [2/20], Avg. Loss: 0.0818
Epoch [3/20], Avg. Loss: 0.0818
Epoch [4/20], Avg. Loss: 0.0818
Epoch [5/20], Avg. Loss: 0.0817
Epoch [6/20], Avg. Loss: 0.0817
Epoch [7/20], Avg. Loss: 0.0817
Epoch [8/20], Avg. Loss: 0.0817
Epoch [9/20], Avg. Loss: 0.0817
Epoch [10/20], Avg. Loss: 0.0819
Epoch [11/20], Avg. Loss: 0.0818
Epoch [12/20], Avg. Loss: 0.0817
Epoch [13/20], Avg. Loss: 0.0817
Epoch [14/20], Avg. Loss: 0.0818
Epoch [15/20], Avg. Loss: 0.0818
Epoch [16/20], Avg. Loss: 0.0818
Epoch [17/20], Avg. Loss: 0.0818
Epoch [18/20], Avg. Loss: 0.0817
Epoch [19/20], Avg. Loss: 0.0817
Epoch [20/20], Avg. Loss: 0.0817

