# Final Project

I used an example from the book 'Machine Learning with PyTorch and Scikit-Learn' by Sebastian Raschka (Chapter 15, Project two - character-level language modeling in PyTorch) to create an RNN model and train it to generate short statement-like texts based on Albert Einstein's book 'Relativity: The Special and General Theory'.\
The initial idea was to build something that resembles asking a geeky friend who has read the book to share their views on a chosen topic.\
The generated text proved to be more unpredictable than a friend's statements would be, which nevertheless made it a nice playground for creating hilarious physics statements.

To accelerate model training I used the Google Colab environment, where I set the Runtime type to *Python 3* and the Hardware accelerator to *T4 GPU*. To make use of the GPU, I call the *.to(device)* method to move the model and the relevant tensors to it.

This notebook is adapted to run in the Google Colab environment.\
Section **6.** (Training the model) is included to show how I trained the model. You can either run the training yourself or load the results of my training by choosing between the **perform_model_training** and **load_pretrained_model** modes below:


In [1]:
# training_mode = 'perform_model_training'
training_mode = 'load_pretrained_model'

## 1. Required imports

In [2]:
import numpy as np
import time
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torch.distributions.categorical import Categorical

Here I check the PyTorch version installed in the Google Colab environment and whether the hardware accelerator was set correctly:

In [3]:
print('torch version: ',torch.__version__)

print("GPU Available:", torch.cuda.is_available())

if torch.cuda.is_available():
  device = torch.device("cuda:0")
else:
  device = torch.device("cpu")  # same type as in the GPU branch
print('device: ', device)

torch version:  2.5.0+cpu
GPU Available: False
device:  cpu


## 2. Data preparation
I obtained the book text in .txt format from https://www.gutenberg.org/files/5001/old/2004-5001.txt and saved it as 'data/books/Einstein_relativity_book.txt' using the command:

In [4]:
# !curl -o data/books/Einstein_relativity_book.txt https://www.gutenberg.org/files/5001/old/2004-5001.txt

The file is included in my GitHub repo so you don't have to download it from the Project Gutenberg page directly.

The code below clones the GitHub repo using Git (if you use Google Colab, the repo is cloned into the Colab environment, not onto your local system). If you have already cloned the repo, you can skip this step by setting the *clone_repo* variable to *False*:

In [5]:
clone_repo = False

In [6]:
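if clone_repo:
    !git clone https://github.com/zuzka-szczelina/python_calc_project_sem_7.git
    # `!cd` runs in a temporary subshell and does not persist;
    # the `%cd` magic changes the notebook's working directory
    %cd python_calc_project_sem_7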

Below, the downloaded text is trimmed to the book content only, and its total and unique characters are counted:

In [7]:
with open('./data/books/Einstein_relativity_book.txt', 'r', encoding='cp1252') as f:
    text = f.read()
    start_idx = text.find('CONTENTS')
    end_idx = text.find('END OF THE PROJECT GUTENBERG')
    text = text[start_idx: end_idx]
    char_set = set(text)

In [8]:
print('Book text includes:')
print(f'total characters: {len(text)}\nunique characters: {len(char_set)}')

Book text includes:
total characters: 186305
unique characters: 87


Below, *char_array* is created, containing all unique characters present in the book, along with a dictionary *char2int_encoding* that assigns a unique integer to each character. *char2int_encoding* is used to generate an encoded version of *text*, called *text_encoded*, in which every character is replaced by its integer code.

In [9]:
chars_sorted = sorted(char_set)
char2int_encoding = {ch: i for i, ch in enumerate(chars_sorted)}
char_array = np.array(chars_sorted)
text_encoded = np.array(
    [char2int_encoding[ch] for ch in text],
    dtype=np.int32
)

In [10]:
print('unique characters array:\n', char_array)

unique characters array:
 ['\n' ' ' '!' '"' '&' "'" '(' ')' '*' '+' ',' '-' '.' '/' '0' '1' '2' '3'
 '4' '5' '6' '7' '8' '9' ':' ';' '=' '?' 'A' 'B' 'C' 'D' 'E' 'F' 'G' 'H'
 'I' 'J' 'K' 'L' 'M' 'N' 'O' 'P' 'Q' 'R' 'S' 'T' 'U' 'V' 'W' 'X' 'Y' 'Z'
 '[' ']' '^' 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o'
 'p' 'q' 'r' 's' 't' 'u' 'v' 'w' 'x' 'y' 'z' 'µ' 'æ' 'é' 'ü']


In [11]:
print(f'numerical code: {text_encoded[0:17]}\n stands for {char_array[text_encoded[0:17]]}')

numerical code: [30 42 41 47 32 41 47 46  0  0 43 74 61 62 57 59 61]
 stands for ['C' 'O' 'N' 'T' 'E' 'N' 'T' 'S' '\n' '\n' 'P' 'r' 'e' 'f' 'a' 'c' 'e']
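
The round trip between characters and integers can be checked on a toy string. The sketch below is only illustrative: `toy_text`, `toy_char2int` and `toy_int2char` are hypothetical names that do not appear in the notebook, and plain Python lists stand in for the NumPy arrays used above.

```python
# Toy illustration of the character <-> integer mapping.
# All names are hypothetical; the notebook uses char2int_encoding and char_array.
toy_text = "space and time"

toy_chars_sorted = sorted(set(toy_text))             # unique characters in a fixed order
toy_char2int = {ch: i for i, ch in enumerate(toy_chars_sorted)}
toy_int2char = toy_chars_sorted                      # position i holds the character with code i

encoded = [toy_char2int[ch] for ch in toy_text]      # text -> list of integers
decoded = "".join(toy_int2char[i] for i in encoded)  # integers -> text

print(encoded)
print(decoded)
```

Because the characters are sorted before being numbered, the mapping is the same on every run, which matters when a saved model is loaded later.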


## 3. ML Dataset construction
The encoded text is divided into overlapping chunks of *seq_length* + 1 characters to be fed into the model:

In [12]:
seq_length = 40
chunk_size = seq_length + 1
text_chunks = [text_encoded[i: i + chunk_size] for i in range(len(text_encoded) - chunk_size)]

A TextDataset class is created. Its instance, *seq_dataset*, stores the sample sequences and their corresponding targets:

In [13]:
class TextDataset(Dataset):
    def __init__(self, text_chunks):
        self.text_chunks = text_chunks

    def __len__(self):
        return len(self.text_chunks)

    def __getitem__(self, idx):
        text_chunk = self.text_chunks[idx]
        return text_chunk[:-1].long(), text_chunk[1:].long()

In [14]:
seq_dataset = TextDataset(torch.tensor(np.array(text_chunks)))

In [15]:
for i, (seq, target) in enumerate(seq_dataset):
    print('Model input (x): ', repr(''.join(char_array[seq])))
    print('Model target (y):', repr(''.join(char_array[target])))
    if i==5:
        break

Model input (x):  'CONTENTS\n\nPreface\n\nPart I: The Special T'
Model target (y): 'ONTENTS\n\nPreface\n\nPart I: The Special Th'
Model input (x):  'ONTENTS\n\nPreface\n\nPart I: The Special Th'
Model target (y): 'NTENTS\n\nPreface\n\nPart I: The Special The'
Model input (x):  'NTENTS\n\nPreface\n\nPart I: The Special The'
Model target (y): 'TENTS\n\nPreface\n\nPart I: The Special Theo'
Model input (x):  'TENTS\n\nPreface\n\nPart I: The Special Theo'
Model target (y): 'ENTS\n\nPreface\n\nPart I: The Special Theor'
Model input (x):  'ENTS\n\nPreface\n\nPart I: The Special Theor'
Model target (y): 'NTS\n\nPreface\n\nPart I: The Special Theory'
Model input (x):  'NTS\n\nPreface\n\nPart I: The Special Theory'
Model target (y): 'TS\n\nPreface\n\nPart I: The Special Theory '
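
How the sliding window produces input/target pairs can also be seen on a toy string. This is a sketch under assumptions: `toy_text`, `toy_seq_length` and `make_chunks` are hypothetical names, and a short string replaces the encoded book. It also counts the windows: a text of length n contains n - chunk_size + 1 full windows, so a loop over `range(n - chunk_size)` appears to leave out the very last one, which is negligible for a text of this size.

```python
# Hypothetical toy example; the notebook uses seq_length = 40 on the encoded book.
toy_text = "Relativity"
toy_seq_length = 4
toy_chunk_size = toy_seq_length + 1

def make_chunks(text, chunk_size, include_last=True):
    """Return windows of chunk_size consecutive characters."""
    n_windows = len(text) - chunk_size + (1 if include_last else 0)
    return [text[i: i + chunk_size] for i in range(n_windows)]

chunks = make_chunks(toy_text, toy_chunk_size)

# Each chunk yields an input (all but the last character)
# and a target (the same characters shifted by one position).
pairs = [(chunk[:-1], chunk[1:]) for chunk in chunks]

for x, y in pairs[:3]:
    print(repr(x), "->", repr(y))
```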


Next, a dataloader is created: an object used to pass data to the model in 'batches' (groups of a specified size):

In [16]:
torch.manual_seed(1)
batch_size = 64
seq_dataloader = DataLoader(seq_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
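
What `batch_size` and `drop_last=True` do can be illustrated on a small stand-in dataset. The sketch assumes a hypothetical `toy_dataset` of 10 random sequences built with `TensorDataset`; it is not the notebook's `seq_dataset`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in data: 10 samples, each an input and a target of length 5.
toy_inputs = torch.randint(0, 87, (10, 5))
toy_targets = torch.randint(0, 87, (10, 5))
toy_dataset = TensorDataset(toy_inputs, toy_targets)

# With batch_size=4 and drop_last=True the incomplete final batch (2 samples) is skipped.
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=True, drop_last=True)

batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in toy_loader]
print(len(toy_loader))   # number of full batches
print(batch_shapes[0])   # shapes of one (input, target) batch
```

Dropping the incomplete batch keeps every batch the same size, which fits the way the hidden and cell states are later created for a fixed *batch_size*.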

## 4. Defining an RNN model

The *RNN* class defines the RNN model's architecture:

#### \_\_init\_\_ method:

**self.embedding** is a lookup table that stores one embedding vector for each unique character (hence vocab_size=len(char_array) below). Each embedding vector has length embed_dim.

**self.rnn** is the recurrent layer. It is created with the *torch.nn.LSTM* class. LSTM stands for 'Long Short-Term Memory' and indicates that the hidden layer of the RNN is built from LSTM cells. The input to the hidden layer is a single character represented as an embedding vector, so the expected input size is the embedding vector length (embed_dim).\
Calling self.rnn returns:\
*output features (for each element in the sequence),\
(final hidden state (for each sequence in the batch),\
final cell state (for each sequence in the batch))*\
It is used in the *forward* method.

**self.fc** is a fully connected (linear) layer that maps the output of the hidden layer to one score (logit) per character in the vocabulary.

**self.rnn_hidden_size** is the number of features in the hidden state of the RNN.

#### forward method:
Defines the computation performed at every model call (the model can be called like a function because our RNN class inherits from Module, the PyTorch base class for all neural network modules).

#### init_hidden method:
Initializes the hidden state and the LSTM cell state with zeros (they are then passed to the forward method).

In [17]:
class RNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, rnn_hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(input_size=embed_dim,
                           hidden_size=rnn_hidden_size,
                           batch_first=True)
        self.fc = nn.Linear(in_features=rnn_hidden_size,
                            out_features=vocab_size)

        self.rnn_hidden_size = rnn_hidden_size

    def forward(self, x, hidden, cell):
        out = self.embedding(x).unsqueeze(1)
        out, (hidden, cell) = self.rnn(out, (hidden, cell))
        out = self.fc(out).reshape(out.size(0), -1)
        return out, hidden, cell

    def init_hidden(self, batch_size):
        hidden = torch.zeros(1, batch_size, self.rnn_hidden_size).to(device)
        cell = torch.zeros(1, batch_size, self.rnn_hidden_size).to(device)
        return hidden, cell
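
The tensor shapes involved in a single `forward` step can be checked with a standalone sketch. It assumes small, hypothetical sizes (`toy_vocab`, `toy_embed`, `toy_hidden`, `toy_batch`) rather than the values used in the notebook, and it rebuilds the three layers directly instead of using the `RNN` class.

```python
import torch
import torch.nn as nn

# Hypothetical small sizes chosen only for illustration.
toy_vocab, toy_embed, toy_hidden, toy_batch = 10, 8, 16, 4

embedding = nn.Embedding(toy_vocab, toy_embed)
lstm = nn.LSTM(input_size=toy_embed, hidden_size=toy_hidden, batch_first=True)
fc = nn.Linear(toy_hidden, toy_vocab)

x = torch.randint(0, toy_vocab, (toy_batch,))     # one character index per sequence
hidden = torch.zeros(1, toy_batch, toy_hidden)    # (num_layers, batch, hidden)
cell = torch.zeros(1, toy_batch, toy_hidden)

emb = embedding(x).unsqueeze(1)                   # (batch, 1, embed): a sequence of length 1
out, (hidden, cell) = lstm(emb, (hidden, cell))   # out: (batch, 1, hidden)
logits = fc(out).reshape(out.size(0), -1)         # (batch, vocab)

print(emb.shape, out.shape, hidden.shape, logits.shape)
```

Since the model receives one character at a time, the sequence dimension is always 1 here, and the hidden and cell states carry the context from one character to the next.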

## 5. Creating an instance of the defined model

In [18]:
vocab_size = len(char_array)
embed_dim = 256
rnn_hidden_size = 512
model = RNN(vocab_size, embed_dim, rnn_hidden_size).to(device)
model

RNN(
  (embedding): Embedding(87, 256)
  (rnn): LSTM(256, 512, batch_first=True)
  (fc): Linear(in_features=512, out_features=87, bias=True)
)

Define the loss function and optimizer:

In [19]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

## 6. Training the model

The model is trained for multiple epochs. In each epoch only one batch is used (so an 'epoch' here is a single training iteration rather than a full pass over the dataset).\
In each epoch we perform the following:
1. Initialize the hidden and cell states.
2. Use seq_dataloader as an iterator and load one batch (a set of 64 input sequences and their corresponding target sequences).
3. Reset the gradients of all optimized tensors.
4. Initialize the loss (between the predicted and target sequences of the loaded batch) to 0.
5. Use a for loop over the character positions to:\
   I. Predict the next character for the current position of the input sequence. This is done simultaneously for all input sequences in the batch.\
   II. Add the loss at this position to the running total.
6. Compute the loss gradients after iterating through all positions.
7. Perform an optimization step to update the model parameters (.step() can be called once the gradients have been computed with .backward()).
8. Compute the final loss for the batch (dividing by the number of characters in each sequence).
9. Print the current loss every 500 epochs.

In [20]:
if training_mode == 'perform_model_training':

    num_epochs = 10000
    
    start_time = time.time()
    
    for epoch in range(num_epochs):
        hidden, cell = model.init_hidden(batch_size)
        seq_batch, target_batch = next(iter(seq_dataloader))
        seq_batch = seq_batch.to(device)
        target_batch = target_batch.to(device)
        optimizer.zero_grad()
        loss = 0
        for c in range(seq_length):
            pred, hidden, cell = model(seq_batch[:, c], hidden, cell)
            loss += loss_fn(pred, target_batch[:, c])
        loss.backward()
        optimizer.step()
        loss = loss.item()/seq_length
        if epoch % 500 == 0:
            print(f'Epoch {epoch} loss: {loss:.4f}')
    print('time passed: ', time.time() - start_time)
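
A useful reference point when reading the printed losses is the loss of a model that knows nothing, i.e. one that assigns equal probability to every character. The sketch below computes this baseline in plain Python; `uniform_cross_entropy` is a hypothetical helper, and 87 is the number of unique characters reported above. A freshly initialized network is not exactly uniform, so its first loss should only be roughly in this range.

```python
import math

def uniform_cross_entropy(vocab_size):
    """Cross-entropy (in nats) of a uniform prediction over vocab_size classes."""
    # Every class gets probability 1 / vocab_size, so the loss is -log(1 / vocab_size).
    return -math.log(1.0 / vocab_size)

baseline = uniform_cross_entropy(87)
print(f"baseline loss for 87 characters: {baseline:.4f}")
```

A per-character loss well below this value indicates that the model has picked up regularities of the text.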

## 7. Model saving and loading

### I. saving and loading after training:

There are 2 ways to save model training results:\
**method 1.** : saving the whole model object (i.e. model object of a specified architecture and its learned parameters)\
**method 2.** : saving just parameters (to reuse the parameters one needs to create a model object of the same architecture as the trained one and load them into the model)

I'm using **method 1.**; commented-out code for **method 2.** is included below.

The model.eval() method is called to put the model in evaluation mode (i.e. for inference). This is needed because some layers behave differently during training and inference, so their mode has to be set in advance.
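
The difference between the two modes can be seen on a layer that does behave differently, such as `nn.Dropout`. This is a standalone sketch, not part of the project code. The `RNN` class above uses only an embedding, a single-layer LSTM without dropout, and a linear layer, so for this particular model calling `eval()` is mainly a good habit.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(8)

dropout.train()   # training mode: elements are randomly zeroed, the rest scaled by 1 / (1 - p)
train_out = dropout(x)

dropout.eval()    # evaluation mode: dropout passes the input through unchanged
eval_out = dropout(x)

print(train_out)
print(eval_out)
```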

In [21]:
saving_method = 'method_1'

**method 1.**

In [22]:
if training_mode == 'perform_model_training' and saving_method == 'method_1':
    torch.save(model, 'data/models/self_trained_model.pt')

    trained_model = torch.load('data/models/self_trained_model.pt', weights_only=False, map_location=device)
    trained_model.eval()

**method 2.**

In [23]:
# if training_mode == 'perform_model_training' and saving_method == 'method_2':
#     # save only the parameters of the model that was just trained
#     torch.save(model.state_dict(), 'data/models/self_trained_model_state_dict.pt')

#     # rebuild the same architecture, then load the saved parameters into it
#     trained_model = RNN(vocab_size, embed_dim, rnn_hidden_size).to(device)
#     trained_model.load_state_dict(torch.load('data/models/self_trained_model_state_dict.pt', map_location=device, weights_only=True))
#     trained_model.eval()

### II. loading a pretrained model:

In [24]:
if training_mode == 'load_pretrained_model':
    trained_model = torch.load('data/models/einstein_pretrained_model.pt', map_location=device, weights_only=False)
    trained_model.eval()

## 8. Text generating function

The function below uses the model to generate text based on the book.

How predictable (and therefore how likely to be meaningful) the generated text is can be adjusted with **predictability_factor**: the larger its value, the more predictable the text will be.

Characters are appended to the starting string one at a time. Randomness comes from the Categorical() class and its sample() method, so the appended character is not always the one with the highest probability.

In [25]:
@torch.no_grad()  # gradients are not needed for text generation
def generate_text(model, starting_str, len_generated_text=500, predictability_factor=2.0):
    # encode the starting string; every character must occur in the book
    encoded_input = torch.tensor(
        [char2int_encoding[s] for s in starting_str]
    )
    encoded_input = torch.reshape(encoded_input, (1, -1)).to(device)
    generated_str = starting_str

    model.eval()
    hidden, cell = model.init_hidden(1)
    # feed all but the last starting character to build up the hidden and cell states
    for c in range(len(starting_str)-1):
        _, hidden, cell = model(
            encoded_input[:, c].view(1), hidden, cell
        )

    last_char = encoded_input[:, -1]
    for i in range(len_generated_text):
        logits, hidden, cell = model(
            last_char.view(1), hidden, cell
        )
        logits = torch.squeeze(logits, 0)
        # scaling the logits sharpens (>1) or flattens (<1) the sampling distribution
        scaled_logits = logits * predictability_factor
        m = Categorical(logits=scaled_logits)
        last_char = m.sample()
        generated_str += str(char_array[last_char.item()])

    return generated_str
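
The effect of `predictability_factor` can be illustrated without the model. Multiplying the logits by a factor greater than 1 before the softmax concentrates probability on the most likely character, while a factor below 1 flattens the distribution. The sketch uses a hypothetical three-character example (`toy_logits`) and a hand-written `softmax` helper in plain Python.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [1.0, 2.0, 3.0]   # hypothetical scores for three characters

for factor in (0.5, 1.0, 2.0):
    probs = softmax([v * factor for v in toy_logits])
    print(factor, [round(p, 3) for p in probs])
```

With a large factor, sampling approaches always picking the most likely character; with a small one, it approaches picking characters uniformly at random.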

## 9. Final use

In [26]:
print(generate_text(trained_model, starting_str='Time and Space'))

Time and Space According to the statement of the
time of a planet in its orbit should take place, and if the whole " world-sphere."

Perhaps the reader will
doubtless admit that in reality such encounters constitute the only
actual evidence to the
clock is moving with the velocity v, which absorbs * an amount of energy E[0], then its inertial mass of a system of
bodies can even be regarded as a Euclidean one, but that
the latter theory has hitherto evinced.



GENERALITY OF A "FINITE" AND YET "UNBOUNDE" UNIVE


In [27]:
print(generate_text(trained_model, starting_str='Time and Space'))

Time and Space According to the statement mayned on the laws of classical
mechanics, we can satisfy this requirement for our illustration in the
following form:

                                                        x = ct

or

                                                                                                                                   x1 = wt1                                                                                                                                                 


In [28]:
print(generate_text(trained_model, starting_str='Time and Space'))

Time and Space according to the special principle of relativity has been
justified, every intellect which strives after generalisation must
feel the temptation that
there exists for this surface by describing the latter theory has the law of the constancy of the velocity of the
case in which the field equations of gravitation, if one was
ready to drop hypothesis (1) without introducing the less natural
cosmological term into the field equations of the Lorentz transformation is the cause of the accelerated ref


## 10. Project benefits

1. I used Google Colab for the first time
2. I learned how to accelerate computations using Google-provisioned runtimes (and how to adapt the code to use them)
3. I used Git for project version control and a GitHub repository to share the project files and access them remotely
4. For the first time I trained a neural network and saved and loaded the results
5. I gained knowledge of data preprocessing and of neural network construction: its elements, working principle and underlying mechanisms
6. I used the PyTorch library
7. The model succeeded in producing statements made of meaningful words (even though letters were added one at a time) based on a book about physics. The physics-related character of the generated statements is clearly visible.