# RNN

I used an example from the book 'Machine Learning with PyTorch and Scikit-Learn' by Sebastian Raschka (Chapter 15, Project two: character-level language modeling in PyTorch) to create and train an RNN model.

To speed up training I used the Google Colab environment, with the runtime type set to *Python 3* and the hardware accelerator set to *T4 GPU*. The *.to(device)* method moves the relevant tensors to the GPU.

This notebook is adapted to run in the Google Colab environment.\
Section **6.** (Training the model) is included to show how I trained the model. You can either perform the training yourself or load the results of my training by choosing between the **perform_model_training** and **load_pretrained_model** modes below:


In [2]:
# training_mode = 'perform_model_training'
training_mode = 'load_pretrained_model'

## 1. Required imports

In [3]:
import time
import torch
import numpy as np
import pandas as pd
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torch.distributions.categorical import Categorical

Here I check the PyTorch installation in Google Colab and whether the hardware accelerator was set correctly:

In [4]:
print('torch version: ',torch.__version__)

print("GPU Available:", torch.cuda.is_available())

if torch.cuda.is_available():
  device = torch.device("cuda:0")
else:
  device = torch.device("cpu")
print('device: ', device)

torch version:  2.5.0+cpu
GPU Available: False
device:  cpu


## 2. Data preparation
The time series are loaded here:

In [7]:
param = 'saturacja_prc'
df = pd.read_csv('data/df_augmented_' + param + '.csv', parse_dates=['date'], date_format='%Y-%m-%d %H:%M:%S')
df.set_index('date', inplace=True)

In [8]:
df.values.flatten()

array([100, 100,  99, ...,  96,  94,  95])

In [9]:
all_values = df.values.flatten()
values_set = sorted(set(pd.unique(all_values)))
values_set

[np.int64(83),
 np.int64(84),
 np.int64(85),
 np.int64(86),
 np.int64(87),
 np.int64(88),
 np.int64(89),
 np.int64(90),
 np.int64(91),
 np.int64(92),
 np.int64(93),
 np.int64(94),
 np.int64(95),
 np.int64(96),
 np.int64(97),
 np.int64(98),
 np.int64(99),
 np.int64(100),
 np.int64(101),
 np.int64(102),
 np.int64(103),
 np.int64(104),
 np.int64(105)]

In [10]:
print('All series include:')
print(f'total values: {len(all_values)}\nunique values: {len(values_set)}')

All series include:
total values: 3590000
unique values: 23


The first 3000 series are used for training (and validation); how many of the remaining series to use for testing is still to be decided.

Below, *values_array* is created, containing all the unique values present in the series, together with a dictionary *encoding_dict* that assigns a unique integer to each value. *encoding_dict* is used to build *df_encoded*, an encoded copy of *df* in which every value is replaced with its integer code.

In [11]:
encoding_dict = {val: i for i, val in enumerate(values_set)}
values_array = np.array(values_set)
df_encoded = df.map(lambda val: encoding_dict[val])

In [12]:
print('unique values array:\n', values_array)

unique values array:
 [ 83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100
 101 102 103 104 105]


In [13]:
encoding_dict

{np.int64(83): 0,
 np.int64(84): 1,
 np.int64(85): 2,
 np.int64(86): 3,
 np.int64(87): 4,
 np.int64(88): 5,
 np.int64(89): 6,
 np.int64(90): 7,
 np.int64(91): 8,
 np.int64(92): 9,
 np.int64(93): 10,
 np.int64(94): 11,
 np.int64(95): 12,
 np.int64(96): 13,
 np.int64(97): 14,
 np.int64(98): 15,
 np.int64(99): 16,
 np.int64(100): 17,
 np.int64(101): 18,
 np.int64(102): 19,
 np.int64(103): 20,
 np.int64(104): 21,
 np.int64(105): 22}

In [14]:
df_encoded.head(3)

date,aug_series_00001,aug_series_00002,aug_series_00003,aug_series_00004,aug_series_00005,aug_series_00006,aug_series_00007,aug_series_00008,aug_series_00009,aug_series_00010,...,aug_series_09991,aug_series_09992,aug_series_09993,aug_series_09994,aug_series_09995,aug_series_09996,aug_series_09997,aug_series_09998,aug_series_09999,aug_series_10000
2023-11-13 14:00:00,17,17,16,19,17,15,15,16,18,14,...,18,17,17,17,16,16,18,17,16,14
2023-11-13 15:00:00,18,16,17,18,19,15,17,16,18,18,...,16,17,18,17,16,18,18,18,16,17
2023-11-13 16:00:00,15,16,15,14,15,17,15,14,17,16,...,15,15,13,14,15,15,15,14,15,15
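
Because the codes are simply positions in the sorted list of unique values, decoding is a plain array lookup. A minimal sketch on toy data (the `toy_values` array is made up for illustration; `encoding_dict` and `values_array` are built the same way as above):

```python
import numpy as np

# Toy stand-in for one series; the notebook itself uses df.values.flatten().
toy_values = np.array([100, 100, 99, 96, 94, 95])

values_set = sorted(set(toy_values.tolist()))
encoding_dict = {val: i for i, val in enumerate(values_set)}
values_array = np.array(values_set)

# Encode: value -> integer code.
encoded = np.array([encoding_dict[v] for v in toy_values.tolist()])

# Decode: integer code -> value, by indexing the lookup array.
decoded = values_array[encoded]

print(encoded)
print(decoded)
```

The same lookup, `values_array[seq]`, is what turns model inputs and targets back into readable values later in the notebook.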


## 3. ML Dataset construction
The encoded values are divided into overlapping chunks to be fed into the model:


In [15]:
seq_length = 40
chunk_size = seq_length + 1
series_chunks_list = []
for col in df_encoded.columns:
    col_values = df_encoded[col].values
    # sliding windows of length chunk_size, shifted by one step
    series_chunks_list.append([col_values[start: start + chunk_size] for start in range(len(col_values) - chunk_size)])

In [16]:
len(series_chunks_list)

10000

In [17]:
len(series_chunks_list[0])

318
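
A quick check of the window count: a series of n values has n - chunk_size + 1 full windows of length chunk_size, whereas `range(n - chunk_size)` stops one short. With 359 values per series (3,590,000 values over 10,000 series) and chunk_size = 41, that gives the 318 chunks reported above, and the final window is not used. The sketch below reproduces this on a dummy series and, as an alternative that is not part of the original notebook, builds every full window with NumPy's `sliding_window_view`:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

seq_length = 40
chunk_size = seq_length + 1
series = np.arange(359)  # dummy series with the same length as one column

# Same loop bounds as the notebook: the last full window is skipped.
chunks = [series[i: i + chunk_size] for i in range(len(series) - chunk_size)]
print(len(chunks))  # 318

# Alternative (an assumption, not the author's code): every full window.
all_windows = sliding_window_view(series, chunk_size)
print(all_windows.shape)  # (319, 41)
```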

In [18]:
type(series_chunks_list[0])

list

In [37]:
series_chunks_tensor = torch.tensor(np.array(series_chunks_list))

In [53]:
series_chunks_tensor[0:8000].shape

torch.Size([8000, 41])

In [38]:
series_chunks_training = torch.flatten(series_chunks_tensor[0:3000], start_dim=0, end_dim=1)
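
Since consecutive chunks of one series overlap heavily, one option for the training/validation split mentioned earlier is to split by series before flattening, so that no window from a validation series ends up in the training set. A sketch under that assumption, on a small dummy tensor; the split sizes and the names `train_chunks` and `val_chunks` are hypothetical:

```python
import torch

# Dummy tensor with the same layout as series_chunks_tensor:
# (number of series, chunks per series, chunk length).
n_series, n_chunks, chunk_size = 10, 5, 4
dummy_chunks = torch.arange(n_series * n_chunks * chunk_size).reshape(
    n_series, n_chunks, chunk_size
)

n_train = 8  # hypothetical: first 8 series for training, the rest for validation

# Slice along the series axis first, then merge the series and chunk axes.
train_chunks = torch.flatten(dummy_chunks[:n_train], start_dim=0, end_dim=1)
val_chunks = torch.flatten(dummy_chunks[n_train:], start_dim=0, end_dim=1)

print(train_chunks.shape)  # torch.Size([40, 4])
print(val_chunks.shape)    # torch.Size([10, 4])
```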

The valuesDataset class is created. Its instance, *seq_dataset*, stores the sequence samples and their corresponding targets:

In [39]:
class valuesDataset(Dataset):
    def __init__(self, chunks_tensor):
        self.chunks_tensor = chunks_tensor

    def __len__(self):
        return len(self.chunks_tensor)

    def __getitem__(self, idx):
        chunk = self.chunks_tensor[idx]
        return chunk[:-1].long(), chunk[1:].long()

In [40]:
seq_dataset = valuesDataset(series_chunks_training)

In [48]:
for i, (seq, target) in enumerate(seq_dataset):
    print('Model input (x): ', repr(values_array[seq]))
    print('Model target (y):', repr(values_array[target]))
    if i==3:
        break

Model input (x):  array([100, 101,  98,  97,  98,  96,  95,  97,  96,  97,  96,  95,  96,
        95,  95,  96,  95,  96,  96,  97,  94,  91,  91,  89,  92,  93,
        91,  92,  95,  94,  95,  93,  94,  95,  93,  95,  95,  94,  94,
        98])
Model target (y): array([101,  98,  97,  98,  96,  95,  97,  96,  97,  96,  95,  96,  95,
        95,  96,  95,  96,  96,  97,  94,  91,  91,  89,  92,  93,  91,
        92,  95,  94,  95,  93,  94,  95,  93,  95,  95,  94,  94,  98,
        95])
Model input (x):  array([101,  98,  97,  98,  96,  95,  97,  96,  97,  96,  95,  96,  95,
        95,  96,  95,  96,  96,  97,  94,  91,  91,  89,  92,  93,  91,
        92,  95,  94,  95,  93,  94,  95,  93,  95,  95,  94,  94,  98,
        95])
Model target (y): array([98, 97, 98, 96, 95, 97, 96, 97, 96, 95, 96, 95, 95, 96, 95, 96, 96,
       97, 94, 91, 91, 89, 92, 93, 91, 92, 95, 94, 95, 93, 94, 95, 93, 95,
       95, 94, 94, 98, 95, 94])
Model input (x):  array([98, 97, 98, 96, 95, 97, 96, 97, 96

Next, a dataloader is created: an object that passes data to the model in 'batches' (groups of a specified size).

In [49]:
torch.manual_seed(1)
batch_size = 64
seq_dataloader = DataLoader(seq_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
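
With `drop_last=True` the loader discards the final incomplete batch, so every batch has exactly `batch_size` rows and matches the hidden state that the training loop creates with `init_hidden(batch_size)`. A small sketch with a dummy `TensorDataset` (the sizes are chosen only for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(1)

# 10 dummy chunks of length 5; inputs are chunk[:-1], targets are chunk[1:].
chunks = torch.arange(50).reshape(10, 5)
dataset = TensorDataset(chunks[:, :-1], chunks[:, 1:])

loader = DataLoader(dataset, batch_size=4, shuffle=True, drop_last=True)

# 10 samples with batch_size=4: 2 full batches, the remaining 2 samples are dropped.
print(len(loader))  # 2

seq_batch, target_batch = next(iter(loader))
print(seq_batch.shape, target_batch.shape)
```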

## 4. Defining an RNN model

The *RNN* class defines the RNN model's architecture:

#### __init__ method:

**self.embedding** is a table that stores one embedding vector per unique value (the counterpart of a character in the original example); the number of vectors is set by vocab_size=len(values_array) below. Each embedding vector has length embed_dim.

**self.rnn** is the recurrent network itself, defined with *torch.nn.LSTM*. LSTM stands for 'Long Short-Term Memory' and indicates that the hidden layer is built from LSTM cells. The input to the hidden layer is a single value represented by its embedding vector, so the expected input size is the embedding vector length (embed_dim).\
Calling self.rnn returns:\
*output_features,\
(final hidden state (for each sequence in the batch),\
final cell state (for each sequence in the batch))*\
These are used in the *forward* method.

**self.fc** is a fully connected (linear) layer that maps the hidden layer output to vocab_size scores (logits), one per unique value.

**self.rnn_hidden_size** is the number of features in the hidden state of the RNN.

#### forward method:
Defines the computation performed at every model call (calling the model invokes this method because our RNN class inherits from Module, the PyTorch base class for all neural network modules).

#### init_hidden method:
Initializes the hidden state and the LSTM cell state with zeros; these are then passed to the forward method.

In [50]:
class RNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, rnn_hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(input_size=embed_dim,
                           hidden_size=rnn_hidden_size,
                           batch_first=True)
        self.fc = nn.Linear(in_features=rnn_hidden_size,
                            out_features=vocab_size)

        self.rnn_hidden_size = rnn_hidden_size

    def forward(self, x, hidden, cell):
        out = self.embedding(x).unsqueeze(1)
        out, (hidden, cell) = self.rnn(out, (hidden, cell))
        out = self.fc(out).reshape(out.size(0), -1)
        return out, hidden, cell

    def init_hidden(self, batch_size):
        hidden = torch.zeros(1, batch_size, self.rnn_hidden_size).to(device)
        cell = torch.zeros(1, batch_size, self.rnn_hidden_size).to(device)
        return hidden, cell
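
To make the tensor shapes in `forward` concrete, the sketch below runs a single time step through the same three layers. The small sizes (embed_dim=8, rnn_hidden_size=16, batch of 4) are assumptions chosen for readability; the notebook itself uses 256 and 512.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, rnn_hidden_size, batch_size = 23, 8, 16, 4

embedding = nn.Embedding(vocab_size, embed_dim)
rnn = nn.LSTM(input_size=embed_dim, hidden_size=rnn_hidden_size, batch_first=True)
fc = nn.Linear(rnn_hidden_size, vocab_size)

x = torch.randint(0, vocab_size, (batch_size,))       # one encoded value per sequence
hidden = torch.zeros(1, batch_size, rnn_hidden_size)  # (num_layers, batch, hidden)
cell = torch.zeros(1, batch_size, rnn_hidden_size)

emb = embedding(x).unsqueeze(1)                  # (batch, 1, embed_dim): sequence length 1
out, (hidden, cell) = rnn(emb, (hidden, cell))   # out: (batch, 1, hidden)
logits = fc(out).reshape(out.size(0), -1)        # (batch, vocab_size)

print(emb.shape, out.shape, logits.shape)
```

The `unsqueeze(1)` adds the sequence dimension that `batch_first=True` expects, and the final `reshape` removes it again so the logits can go straight into the loss function.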

## 5. Creating an instance of the defined model

In [51]:
vocab_size = len(values_array)
embed_dim = 256
rnn_hidden_size = 512
model = RNN(vocab_size, embed_dim, rnn_hidden_size).to(device)
model

RNN(
  (embedding): Embedding(23, 256)
  (rnn): LSTM(256, 512, batch_first=True)
  (fc): Linear(in_features=512, out_features=23, bias=True)
)

Define the loss function and optimizer:

In [52]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
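
`nn.CrossEntropyLoss` takes raw logits and integer class indices. A useful reference point when reading the training log: a model that assigns equal probability to all 23 classes has a loss of ln(23), roughly 3.14, so a trained model should end up clearly below that value. A quick check:

```python
import math
import torch
import torch.nn as nn

vocab_size = 23
loss_fn = nn.CrossEntropyLoss()

# All-zero logits correspond to a uniform distribution over the classes.
logits = torch.zeros(4, vocab_size)
targets = torch.tensor([0, 5, 17, 22])  # arbitrary class indices

uniform_loss = loss_fn(logits, targets).item()
print(round(uniform_loss, 4), round(math.log(vocab_size), 4))
```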

## 6. Training the model

The model is trained for multiple epochs. In each epoch only one batch is used.\
For each epoch we perform the following:
1. Initialize the hidden and cell states.
2. Create an iterator from seq_dataloader and load one batch (64 input sequences and their target sequences).
3. Reset the gradients of all optimized tensors.
4. Initialize the loss (between the predicted and target sequences of the loaded batch) to 0.
5. Loop over the sequence positions to:\
   I. Predict the next value for the value at the current position. This is done simultaneously for all input sequences in the batch.\
   II. Add the loss at that position to the running total.
6. Compute the loss gradients after iterating through all positions.
7. Perform an optimization step to update the model parameters (.step() can be called once the gradients have been computed with .backward()).
8. Compute the final batch loss by dividing by the number of values in each sequence.
9. Print the current loss every 500 epochs.

In [None]:
if training_mode == 'perform_model_training':

    num_epochs = 10000
    
    start_time = time.time()
    
    for epoch in range(num_epochs):
        hidden, cell = model.init_hidden(batch_size)
        seq_batch, target_batch = next(iter(seq_dataloader))
        seq_batch = seq_batch.to(device)
        target_batch = target_batch.to(device)
        optimizer.zero_grad()
        loss = 0
        for c in range(seq_length):
            pred, hidden, cell = model(seq_batch[:, c], hidden, cell)
            loss += loss_fn(pred, target_batch[:, c])
        loss.backward()
        optimizer.step()
        loss = loss.item()/seq_length
        if epoch % 500 == 0:
            print(f'Epoch {epoch} loss: {loss:.4f}')
    print('time passed: ', time.time() - start_time)

## 7. Model saving and loading

### I. saving and loading after training:

There are 2 ways to save model training results:\
**method 1.** : saving the whole model object (i.e. model object of a specified architecture and its learned parameters)\
**method 2.** : saving just parameters (to reuse the parameters one needs to create a model object of the same architecture as the trained one and load them into the model)

I use **method 1.**; commented-out code for **method 2.** is included below.

model.eval() is called to put the model into evaluation mode (i.e. for inference). This is needed because some layers behave differently during training and inference, so the mode has to be set in advance.

In [None]:
saving_method = 'method_1'

**method 1.**

In [None]:
if training_mode == 'perform_model_training' and saving_method == 'method_1':
    torch.save(model, 'data/models/self_trained_model.pt')

    trained_model = torch.load('data/models/self_trained_model.pt', weights_only=False, map_location=device)
    trained_model.eval()

**method 2.**

In [None]:
# if training_mode == 'perform_model_training' and saving_method == 'method_2':
#     # save the parameters of the model that was just trained
#     torch.save(model.state_dict(), 'data/models/self_trained_model_state_dict.pt')

#     trained_model = RNN(vocab_size, embed_dim, rnn_hidden_size).to(device)
#     trained_model.load_state_dict(torch.load('data/models/self_trained_model_state_dict.pt', map_location=device, weights_only=True))
#     trained_model.eval()
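
The state-dict route can be tried without touching the disk. The sketch below uses a small `nn.Linear` as a stand-in for the RNN and an in-memory buffer instead of a file path (both are assumptions made to keep the example short):

```python
import io
import torch
import torch.nn as nn

torch.manual_seed(0)
source_model = nn.Linear(4, 2)  # stand-in for the trained model

# Save only the parameters (method 2) into an in-memory buffer.
buffer = io.BytesIO()
torch.save(source_model.state_dict(), buffer)
buffer.seek(0)

# Re-create the same architecture, then load the parameters into it.
restored_model = nn.Linear(4, 2)
restored_model.load_state_dict(torch.load(buffer, weights_only=True))
restored_model.eval()

print(torch.equal(source_model.weight, restored_model.weight))  # True
```

Because a state dict contains only tensors, it can be loaded with `weights_only=True`, whereas loading a whole pickled model object (method 1) requires `weights_only=False`.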

### II. loading a pretrained model:

In [None]:
if training_mode == 'load_pretrained_model':
    trained_model = torch.load('data/models/einstein_pretrained_model.pt', map_location=device, weights_only=False)
    trained_model.eval()

## 8. Value-generating function

The function below uses the model to generate new values that continue a given starting sequence.

How closely the generated values follow the patterns learned by the model can be adjusted with **predictability_factor**: the larger it is, the more predictable (and likely more meaningful) the generated values will be.

Values are appended to the starting sequence one at a time. Randomness comes from the Categorical() class and its sample() function: the appended value is not always the one with the highest probability.
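
Multiplying the logits by `predictability_factor` before sampling acts like an inverse temperature: factors above 1 concentrate probability on the most likely class, factors below 1 flatten the distribution. A sketch with made-up logits (the name `probs_by_factor` is only for this illustration):

```python
import torch
from torch.distributions.categorical import Categorical

logits = torch.tensor([1.0, 2.0, 3.0])  # made-up scores for three classes

# Class probabilities after scaling the logits by each factor.
probs_by_factor = {
    factor: Categorical(logits=logits * factor).probs
    for factor in (0.5, 1.0, 2.0)
}

for factor, probs in probs_by_factor.items():
    print(factor, [round(p, 3) for p in probs.tolist()])
```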

In [None]:
def generate_values(model, starting_str, len_generated_values=500, predictability_factor=2.0):
    # starting_str is the seed sequence; every element must be a key of encoding_dict
    encoded_input = torch.tensor(
        [encoding_dict[s] for s in starting_str]
    )
    encoded_input = torch.reshape(encoded_input, (1, -1)).to(device)
    generated_values = list(starting_str)

    model.eval()
    hidden, cell = model.init_hidden(1)
    # warm up the hidden and cell states on all seed elements except the last one
    for c in range(len(starting_str)-1):
        _, hidden, cell = model(
            encoded_input[:, c].view(1), hidden, cell
        )

    last_char = encoded_input[:, -1]
    for i in range(len_generated_values):
        logits, hidden, cell = model(
            last_char.view(1), hidden, cell
        )
        logits = torch.squeeze(logits, 0)
        # scaling the logits sharpens (factor > 1) or flattens (factor < 1) the distribution
        scaled_logits = logits * predictability_factor
        m = Categorical(logits=scaled_logits)
        last_char = m.sample()
        # decode the sampled index back to the original value
        generated_values.append(values_array[last_char.item()])

    return generated_values

## 9. Final use

In [None]:
print(generate_values(trained_model, starting_str='Time and Space'))

In [None]:
print(generate_values(trained_model, starting_str='Time and Space'))

In [None]:
print(generate_values(trained_model, starting_str='Time and Space'))

In [None]:
# the embedding is problematic
# the supervisor's model uses only this one series

# Not this:

# remove anomalies from the series
# augment the series with the anomalies removed
# forecast without anomalies and assess forecast quality from the distribution of residuals
# if that works, add anomalies (random anomalies to different series)

# trend detector - but we probably won't have time for it
#--------------------------------------------------
# Do this:
# one parameter
# for the original and one augmented series (2 series), try the supervisor's models 1 and 2
# compare the residual distributions of both as a function of seq_length [5, 10, 15, 20] as plots (one plot per seq_length, both models on each plot)
# plus, the standard deviation and the mean can be reported in the text
# assume the first 100 points are for training; check forecasting on the points that follow