# Final Project

I'm using an example from the book 'Machine Learning with PyTorch and Scikit-Learn' by Sebastian Raschka (Chapter 15, Project two - character-level language modeling in PyTorch) to create an RNN model and train it to generate short statement-like texts based on Albert Einstein's book 'Relativity: The Special and General Theory'. My goal is to create functionality that resembles asking a geeky friend who has read the book to share their views on a chosen topic.

To accelerate model training I used the Google Colab environment, with Runtime type set to *Python 3* and Hardware accelerator set to *T4 GPU*.

This notebook is adapted to run in the Google Colab environment.\
The model training section is included to show how I trained the model. You can either perform the training yourself or load the results of the training I did by choosing between the **perform_model_training** and **load_pretrained_model** modes below:


In [80]:
# training_usecase = 'perform_model_training'
training_usecase = 'load_pretrained_model'


## 1. Required imports

In [81]:
import numpy as np
import requests
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torch.distributions.categorical import Categorical

Here I'm checking the module installation in Google Colab and whether the hardware accelerator was set correctly:

In [82]:
print('torch version: ',torch.__version__)

print("GPU Available:", torch.cuda.is_available())

if torch.cuda.is_available():
  device = torch.device("cuda:0")
else:
  device = torch.device("cpu")
print('device: ', device)

torch version:  2.5.1+cu121
GPU Available: False
device:  cpu


## 2. Data preparation
I obtained the book text in .txt format from https://www.gutenberg.org/files/5001/old/2004-5001.txt and saved it as 'Einstein_relativity_book.txt'.

Below are 2 code versions:\
version_1: avoids cloning a GitHub repo (files are loaded directly from my GitHub repo)\
version_2: clones the GitHub repo (if you use Google Colab, the repo is cloned to Google Colab, not to your local system)

In [83]:
data_loading_mode = 'clone_github_repo'
# data_loading_mode = 'load_from_github'

### version_1:

In [84]:
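if data_loading_mode == 'load_from_github':
    book_url = 'https://github.com/zuzka-szczelina/python_calc_project_sem_7/raw/refs/heads/master/data/Einstein_relativity_book.txt'
    response = requests.get(book_url)
    response.raise_for_status()  # stop early if the download failed
    response.encoding = 'cp1252'  # decode the same way as version_2 does
    text = response.text

    # keep only the book itself, without the Project Gutenberg header and footer
    start_idx = text.find('CONTENTS')
    end_idx = text.find('END OF THE PROJECT GUTENBERG')
    text = text[start_idx: end_idx]
    char_set = set(text)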

### version_2:

In [85]:
if data_loading_mode == 'clone_github_repo':
    !git clone https://github.com/zuzka-szczelina/python_calc_project_sem_7.git
    %cd python_calc_project_sem_7

    with open('./data/Einstein_relativity_book.txt', 'r', encoding='cp1252') as f:
        text = f.read()
        start_idx = text.find('CONTENTS')
        end_idx = text.find('END OF THE PROJECT GUTENBERG')
        text = text[start_idx: end_idx]
        char_set = set(text)

Cloning into 'python_calc_project_sem_7'...
remote: Enumerating objects: 28, done.
remote: Counting objects: 100% (28/28), done.
remote: Compressing objects: 100% (24/24), done.
remote: Total 28 (delta 12), reused 10 (delta 1), pack-reused 0 (from 0)
Receiving objects: 100% (28/28), 5.90 MiB | 10.55 MiB/s, done.
Resolving deltas: 100% (12/12), done.
/content/python_calc_project_sem_7/python_calc_project_sem_7


In [86]:
print('Book text includes:')
print(f'total characters: {len(text)}\nunique characters {len(char_set)}')

Book text includes:
total characters: 186305
unique characters 87


In [87]:
chars_sorted = sorted(char_set)
char2int_encoding = {ch: i for i, ch in enumerate(chars_sorted)}
char_array = np.array(chars_sorted)
text_encoded = np.array(
    [char2int_encoding[ch] for ch in text],
    dtype=np.int32
)

In [88]:
print(f'numerical code: {text_encoded[0:17]}\n stands for {char_array[text_encoded[0:17]]}')

numerical code: [30 42 41 47 32 41 47 46  0  0 43 74 61 62 57 59 61]
 stands for ['C' 'O' 'N' 'T' 'E' 'N' 'T' 'S' '\n' '\n' 'P' 'r' 'e' 'f' 'a' 'c' 'e']
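As a quick check of the encoding idea, here is a minimal sketch of a round trip (characters to integer codes and back) on a short toy string. The string and all `toy_` names are assumptions made for illustration; they are not part of the notebook's data.

```python
import numpy as np

# Toy text standing in for the book (assumption: any string is handled the same way)
toy_text = "space and time"

# Same construction as above: sorted unique characters -> integer codes
toy_chars = sorted(set(toy_text))
toy_char2int = {ch: i for i, ch in enumerate(toy_chars)}
toy_char_array = np.array(toy_chars)

# Encode: characters -> integer codes
toy_encoded = np.array([toy_char2int[ch] for ch in toy_text], dtype=np.int32)

# Decode: integer codes -> characters, by indexing the sorted character array
toy_decoded = "".join(toy_char_array[toy_encoded])

print(toy_encoded)
print(repr(toy_decoded))
```

If the decoded string equals the original one, the two lookup structures are consistent with each other.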


## 3. ML Dataset construction
Divide text into chunks to feed into the model.

In [89]:
seq_length = 40
chunk_size = seq_length + 1
text_chunks = [text_encoded[i: i + chunk_size] for i in range(len(text_encoded) - chunk_size)]

In [90]:
class TextDataset(Dataset):
    def __init__(self, text_chunks):
        self.text_chunks = text_chunks

    def __len__(self):
        return len(self.text_chunks)

    def __getitem__(self, idx):
        text_chunk = self.text_chunks[idx]
        return text_chunk[:-1].long(), text_chunk[1:].long()

In [91]:
seq_dataset = TextDataset(torch.tensor(np.array(text_chunks)))
# seq_dataset = TextDataset(torch.tensor(text_chunks))

In [92]:
for i, (seq, target) in enumerate(seq_dataset):
    print('Model input (x): ', repr(''.join(char_array[seq])))
    print('Model target (y):', repr(''.join(char_array[target])))
    if i==5:
        break

Model input (x):  'CONTENTS\n\nPreface\n\nPart I: The Special T'
Model target (y): 'ONTENTS\n\nPreface\n\nPart I: The Special Th'
Model input (x):  'ONTENTS\n\nPreface\n\nPart I: The Special Th'
Model target (y): 'NTENTS\n\nPreface\n\nPart I: The Special The'
Model input (x):  'NTENTS\n\nPreface\n\nPart I: The Special The'
Model target (y): 'TENTS\n\nPreface\n\nPart I: The Special Theo'
Model input (x):  'TENTS\n\nPreface\n\nPart I: The Special Theo'
Model target (y): 'ENTS\n\nPreface\n\nPart I: The Special Theor'
Model input (x):  'ENTS\n\nPreface\n\nPart I: The Special Theor'
Model target (y): 'NTS\n\nPreface\n\nPart I: The Special Theory'
Model input (x):  'NTS\n\nPreface\n\nPart I: The Special Theory'
Model target (y): 'TS\n\nPreface\n\nPart I: The Special Theory '


Create a dataloader: an object used to pass data into the model in 'batches' (groups of a specified size).

In [93]:
torch.manual_seed(1)
batch_size = 64
seq_dataloader = DataLoader(seq_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
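To see what a single batch looks like, here is a small sketch on made-up data. It uses `TensorDataset` in place of the `TextDataset` class above, and the sizes (sequence length 5, batch size 8) and all `toy_` names are assumptions chosen only to keep the example short.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

toy_seq_length = 5
toy_chunk_size = toy_seq_length + 1
toy_encoded = np.arange(50, dtype=np.int64)  # stand-in for text_encoded

# Overlapping chunks, built the same way as text_chunks above
toy_chunks = torch.tensor(np.array(
    [toy_encoded[i: i + toy_chunk_size]
     for i in range(len(toy_encoded) - toy_chunk_size)]
))

# Input is every element but the last; target is the input shifted by one
toy_dataset = TensorDataset(toy_chunks[:, :-1], toy_chunks[:, 1:])
toy_loader = DataLoader(toy_dataset, batch_size=8, shuffle=True, drop_last=True)

toy_x, toy_y = next(iter(toy_loader))
print(toy_x.shape, toy_y.shape)  # both are (batch_size, seq_length)
print(len(toy_loader))           # number of full batches per pass over the data
```

With `drop_last=True` the final incomplete batch is discarded, so every batch has exactly `batch_size` rows.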

## 4. Defining a RNN model

#### __init__ method:

**self.embedding** is a table that stores an embedding vector for each unique character (we get one embedding vector per character by specifying vocab_size=len(char_array) below). Each embedding vector has length embed_dim.

**self.rnn** is the RNN we will use. We specify its properties using the *torch.nn.LSTM* class. LSTM stands for 'Long Short-Term Memory' and indicates that we will use an RNN with LSTM cells as hidden layers. The input to the hidden layer will be a specific character represented as an embedding vector. Therefore, we specify that the expected input size is the embedding vector length (embed_dim).\
The output of self.rnn is:\
*output_features (for each element in the sequence),\
(final hidden state (for each sequence in the batch),\
final cell state (for each sequence in the batch))*

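**self.fc** is a fully connected (linear) layer applied to the output of the hidden layer. It maps each output vector of length rnn_hidden_size to vocab_size values (logits), one for each character that could come next.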

**self.rnn_hidden_size** is the number of features in the hidden state of RNN

#### forward method:
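Takes a batch of single characters (one per sequence) together with the current hidden and cell states. It looks up the embedding of each character, passes it through the LSTM to get the updated states, and applies self.fc to get the logits for the next character. It returns the logits and the updated hidden and cell states.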

#### init_hidden method:
This is where we initialise the hidden state and the LSTM cell state with zeros (these are then used in the forward method).

In [94]:
class RNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, rnn_hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(input_size=embed_dim,
                           hidden_size=rnn_hidden_size,
                           batch_first=True)
        self.fc = nn.Linear(in_features=rnn_hidden_size,
                            out_features=vocab_size)

        self.rnn_hidden_size = rnn_hidden_size

    def forward(self, x, hidden, cell):
        out = self.embedding(x).unsqueeze(1)
        out, (hidden, cell) = self.rnn(out, (hidden, cell))
        out = self.fc(out).reshape(out.size(0), -1)
        return out, hidden, cell

    def init_hidden(self, batch_size):
        hidden = torch.zeros(1, batch_size, self.rnn_hidden_size).to(device)
        cell = torch.zeros(1, batch_size, self.rnn_hidden_size).to(device)
        return hidden, cell
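To make the tensor shapes in the forward method easier to follow, here is a sketch that repeats its three steps on toy sizes. The sizes (vocabulary 10, embedding 4, hidden 6, batch 3) and all `toy_` and `step_` names are assumptions chosen for readability, not the values used in this notebook.

```python
import torch
import torch.nn as nn

toy_vocab, toy_embed, toy_hidden, toy_batch = 10, 4, 6, 3

toy_embedding = nn.Embedding(toy_vocab, toy_embed)
toy_lstm = nn.LSTM(input_size=toy_embed, hidden_size=toy_hidden, batch_first=True)
toy_fc = nn.Linear(toy_hidden, toy_vocab)

# One character index per sequence in the batch
step_x = torch.randint(0, toy_vocab, (toy_batch,))
step_hidden = torch.zeros(1, toy_batch, toy_hidden)
step_cell = torch.zeros(1, toy_batch, toy_hidden)

# Embedding lookup, then add a sequence dimension of length 1
step_emb = toy_embedding(step_x).unsqueeze(1)           # (batch, 1, embed)
step_out, (step_hidden, step_cell) = toy_lstm(step_emb, (step_hidden, step_cell))
step_logits = toy_fc(step_out).reshape(step_out.size(0), -1)  # (batch, vocab)

print(step_emb.shape, step_out.shape, step_hidden.shape, step_logits.shape)
```

Each row of `step_logits` holds one score per character in the vocabulary, which is the form `nn.CrossEntropyLoss` expects.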

## 5. Creating an instance of the defined model

In [95]:
vocab_size = len(char_array)
embed_dim = 256
rnn_hidden_size = 512
model = RNN(vocab_size, embed_dim, rnn_hidden_size).to(device)
model

RNN(
  (embedding): Embedding(87, 256)
  (rnn): LSTM(256, 512, batch_first=True)
  (fc): Linear(in_features=512, out_features=87, bias=True)
)

Define the loss function and optimizer:

In [96]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
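As a small illustration of what this loss function computes, the sketch below compares `nn.CrossEntropyLoss` with a manual calculation (the mean negative log-softmax of the target class). The logits and targets are made-up numbers, and the `toy_` names are assumptions for this example only.

```python
import torch
import torch.nn as nn

# Raw scores (logits) for 2 samples and 3 classes, plus the correct class indices
toy_logits = torch.tensor([[2.0, 0.5, -1.0],
                           [0.1, 0.2, 3.0]])
toy_targets = torch.tensor([0, 2])

toy_loss_fn = nn.CrossEntropyLoss()
toy_loss = toy_loss_fn(toy_logits, toy_targets)

# Manual version: log-softmax over classes, pick the target entries, average
toy_log_probs = torch.log_softmax(toy_logits, dim=1)
toy_manual = -toy_log_probs[torch.arange(2), toy_targets].mean()

print(toy_loss.item(), toy_manual.item())
```

This is why the model's last layer returns raw logits without a softmax: the loss applies the log-softmax itself.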

## 6. Training the model

In [97]:
import time

if training_usecase == 'perform_model_training':

    num_epochs = 10000
    # num_epochs = 30

    start_time = time.time()
    for epoch in range(num_epochs):
        hidden, cell = model.init_hidden(batch_size)
        seq_batch, target_batch = next(iter(seq_dataloader))
        seq_batch = seq_batch.to(device)
        target_batch = target_batch.to(device)
        optimizer.zero_grad()
        loss = 0
        for c in range(seq_length):
            pred, hidden, cell = model(seq_batch[:, c], hidden, cell)
            loss += loss_fn(pred, target_batch[:, c])
        loss.backward()
        optimizer.step()
        loss = loss.item()/seq_length
        if epoch % 500 == 0:
            print(f'Epoch {epoch} loss: {loss:.4f}')
    print('time passed: ', time.time() - start_time)

In the above cell, for each epoch we do the following (note that one 'epoch' here processes a single batch, not the whole dataset):
1. Initialise the hidden and cell states.
2. Use seq_dataloader as an iterator object and load one batch (a set of 64 input sequences and the corresponding target sequences).
3. Reset the gradients of all optimised tensors.
4. Initialise the loss (between the predicted sequences and the target sequences of the loaded batch) as 0.
5. Use a for loop to:\
   I. Predict the next character for each character in the input sequence. This is done simultaneously for all input sequences in the batch.\
   II. Accumulate the loss as the sum of the losses for all characters.
6. Compute the loss gradients after iterating through all characters.
7. Perform an optimization step to update the model parameters (.step() can be called once the gradients have been computed using .backward()).
8. Compute the final loss for the batch (dividing by the number of characters in each sequence).
9. Print the current loss every 500 epochs.
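The steps above can be sketched as a single function and tried on random toy data. `TinyCharRNN`, `train_step` and every `toy_`/`tiny_` name are hypothetical stand-ins introduced for this example; the model is a much smaller copy of the architecture defined earlier and runs on the CPU.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class TinyCharRNN(nn.Module):
    """Hypothetical small stand-in for the RNN class defined earlier."""

    def __init__(self, vocab_size=12, embed_dim=8, hidden_size=16):
        super().__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden, cell):
        out = self.embedding(x).unsqueeze(1)
        out, (hidden, cell) = self.rnn(out, (hidden, cell))
        return self.fc(out).reshape(out.size(0), -1), hidden, cell

    def init_hidden(self, batch_size):
        zeros = torch.zeros(1, batch_size, self.hidden_size)
        return zeros, zeros.clone()


def train_step(net, opt, criterion, seq_batch, target_batch):
    """Steps 1 and 3-8 of the list, for one batch that is already loaded."""
    n_chars = seq_batch.size(1)
    hidden, cell = net.init_hidden(seq_batch.size(0))   # step 1
    opt.zero_grad()                                     # step 3
    loss = 0                                            # step 4
    for c in range(n_chars):                            # step 5
        pred, hidden, cell = net(seq_batch[:, c], hidden, cell)
        loss = loss + criterion(pred, target_batch[:, c])
    loss.backward()                                     # step 6
    opt.step()                                          # step 7
    return loss.item() / n_chars                        # step 8


tiny_net = TinyCharRNN()
tiny_opt = torch.optim.Adam(tiny_net.parameters(), lr=0.01)
tiny_loss_fn = nn.CrossEntropyLoss()

# Random toy batch: 4 chunks of 7 character indices; targets are inputs shifted by one
toy_batch_chunks = torch.randint(0, 12, (4, 7))
toy_seq, toy_target = toy_batch_chunks[:, :-1], toy_batch_chunks[:, 1:]

# Repeating the step on the same batch should drive the loss down (overfitting on purpose)
toy_losses = [train_step(tiny_net, tiny_opt, tiny_loss_fn, toy_seq, toy_target)
              for _ in range(50)]
print(f'first loss: {toy_losses[0]:.4f}, last loss: {toy_losses[-1]:.4f}')
```

Reusing one fixed batch is only a sanity check that the loop learns something; the real training loop loads a different shuffled batch each time.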

## 7. Model saving and loading

There are 2 ways to save model training results:\
**method 1.** : saving the whole model object (i.e. model object of a specified architecture and its learned parameters)\
**method 2.** : saving just parameters (to reuse the parameters one needs to create a model object of the same architecture as the trained one and load them into the model)

model.eval() is called to indicate that the model will now be used in evaluation mode (i.e. for inference). This is because some layers behave differently during training and inference and need their mode to be set in advance.
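Dropout is a common example of a layer that behaves differently in the two modes. The sketch below uses a standalone `nn.Dropout` layer, which is an assumption made purely for illustration: the RNN in this notebook contains no dropout, so calling eval() on it is mainly good practice.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

toy_dropout = nn.Dropout(p=0.5)
toy_input = torch.ones(1000)

# Training mode: elements are randomly zeroed and the rest are rescaled
toy_dropout.train()
train_output = toy_dropout(toy_input)

# Evaluation mode: the layer passes its input through unchanged
toy_dropout.eval()
eval_output = toy_dropout(toy_input)

print('zeros in training mode:  ', int((train_output == 0).sum()))
print('zeros in evaluation mode:', int((eval_output == 0).sum()))
```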

In [98]:
saving_method = 'method_1'

### method 1.

In [99]:
if training_usecase == 'perform_model_training' and saving_method == 'method_1':
    torch.save(model, 'data/self_trained_model.pt')

    trained_model = torch.load('data/self_trained_model.pt', weights_only=False)
    trained_model.eval()

### method 2.

In [100]:
if training_usecase == 'perform_model_training' and saving_method == 'method_2':
    torch.save(model.state_dict(), 'data/self_trained_model_state_dict.pt')

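    # recreate the architecture on the current device, then load the saved parameters onto it
    trained_model = RNN(vocab_size, embed_dim, rnn_hidden_size).to(device)
    trained_model.load_state_dict(torch.load('data/self_trained_model_state_dict.pt', map_location=device))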
    trained_model.eval()
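Method 2 can be tried without touching the disk. The sketch below saves the parameters of a small layer to an in-memory buffer and loads them into a second layer of the same architecture. The `nn.Linear` layers, the buffer and the `toy_` names are assumptions standing in for the RNN and the .pt file.

```python
import io

import torch
import torch.nn as nn

toy_source = nn.Linear(4, 2)

# Save only the parameters (state_dict) into an in-memory buffer
toy_buffer = io.BytesIO()
torch.save(toy_source.state_dict(), toy_buffer)
toy_buffer.seek(0)

# A new object of the same architecture starts with different random parameters
toy_restored = nn.Linear(4, 2)
toy_restored.load_state_dict(
    torch.load(toy_buffer, map_location='cpu', weights_only=True)
)
toy_restored.eval()

# After loading, both layers should give the same output for the same input
toy_x = torch.randn(3, 4)
outputs_match = torch.allclose(toy_source(toy_x), toy_restored(toy_x))
print(outputs_match)
```

Because only tensors are stored, this file format does not depend on the class definition being importable when loading, unlike saving the whole model object.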

In [104]:
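if training_usecase == 'load_pretrained_model':
    if data_loading_mode == 'clone_github_repo':
        trained_model = torch.load('data/pretrained_model.pt', map_location=device, weights_only=False)
    if data_loading_mode == 'load_from_github':
        # torch.load cannot read from a URL, so download the file into memory first
        import io
        model_url = 'https://github.com/zuzka-szczelina/python_calc_project_sem_7/raw/refs/heads/master/data/pretrained_model.pt'
        response = requests.get(model_url)
        response.raise_for_status()
        trained_model = torch.load(io.BytesIO(response.content), map_location=device, weights_only=False)
    trained_model.eval()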

## 8. Text generating function

The function below uses the model to generate text based on the book.

How meaningful the generated text is can be tuned by changing the **predictability_factor**: the bigger it is, the more predictable (and likely more meaningful) the generated text will be.
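The effect of this factor can be seen on a toy example: multiplying the logits by a larger number before sampling concentrates the probability on the highest-scoring character. The three logits and the `toy_` names below are made-up assumptions, not values produced by the model.

```python
import torch
from torch.distributions.categorical import Categorical

# Made-up scores for three possible next characters
toy_logits = torch.tensor([1.0, 2.0, 3.0])

# Probabilities of each character for several scaling factors
probs_by_factor = {
    factor: Categorical(logits=toy_logits * factor).probs
    for factor in (0.5, 1.0, 2.0)
}

for factor, probs in probs_by_factor.items():
    print(factor, [round(p, 3) for p in probs.tolist()])
```

A factor below 1 flattens the distribution (more random text), while a factor above 1 sharpens it (more predictable text).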

In [113]:
def generate_text(model, starting_str, len_generated_text=500, predictability_factor=2.0):
    # Encode the starting string and keep it on the same device as the model
    encoded_input = torch.tensor(
        [char2int_encoding[s] for s in starting_str]
    ).to(device)
    encoded_input = torch.reshape(encoded_input, (1, -1))
    generated_str = starting_str

    model.eval()
    hidden, cell = model.init_hidden(1)
    with torch.no_grad():  # no gradients are needed for inference
        # Feed all but the last starting character to build up the hidden and cell states
        for c in range(len(starting_str)-1):
            _, hidden, cell = model(
                encoded_input[:, c].view(1), hidden, cell
            )

        # Generate new characters one at a time, feeding each sample back in
        last_char = encoded_input[:, -1]
        for i in range(len_generated_text):
            logits, hidden, cell = model(
                last_char.view(1), hidden, cell
            )
            logits = torch.squeeze(logits, 0)
            scaled_logits = logits * predictability_factor
            m = Categorical(logits=scaled_logits)
            last_char = m.sample()
            generated_str += str(char_array[last_char.item()])

    return generated_str

## 9. Final use

In [114]:
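# well, it doesn't work xD (this call passes the freshly initialised `model` rather than the loaded `trained_model`, which likely explains the random output)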
print(generate_text(model, starting_str='Space and time'))

Space and timeo?3iVOHz:75?o]Y/ hnBkw8/r+;CZA-D5H'&+QY*'LP8OWr!PqovAs4xü6WpAeM(6c:z2RTé&8gX VGS,]3;TnH?J]k,µtrcvXMB]+ Y0HHTq4&'yGtt1A!Eµl3tSBa7LG]cF'L9?/)g8'+9F'm6h3hyZ41éJB/*8f tCXµ[qO4TQygRZ-leüTI[PJPLoH)y-zRoF.J^e&7wµWwEP=.a(µ!LZ.hF"/ZAqyugWéæPulMsD+W:^mXDµ
am3?SæSH Hc4*v/xWe"R.5ecu*[eHnDS]4( Xvv&.1UéOih''"o,X)Z8"Re[üKü^9F.[RCHV3WvnRµiyAeRiN.my5x]sxi,a4-ACbæ*kJé7:b]'éL1U9nYjn?!"eVoS
[7hFy2;kiINjd64ontEZX=^-]kgH=4FjxG;O&Nµ!y+h=6B&&QJ7*o/Ynxz(üCéImluWAy!g3gV67EcSfzUIµ+wPDY/j7Y,^b8*tR
fm7b=ajiid**,6zCrSKü*:?s9


In [None]:
# add example text generation
# final read N comments completion
# end