# Final Project

I'm using an example from the book 'Machine Learning with PyTorch and Scikit-Learn' by Sebastian Raschka (Chapter 15, Project two - character-level language modeling in PyTorch) to create an RNN model and train it to generate short, statement-like texts based on Albert Einstein's book 'Relativity: The Special and General Theory'. My goal is to create functionality that resembles asking a geeky friend who has read the book to share their views on a chosen topic.

To accelerate model training I used the Google Colab environment, with the runtime type set to *Python 3* and the hardware accelerator set to *T4 GPU*.

This notebook is adapted to run in the Google Colab environment.\
The model training section is included to show how I trained the model. You can either perform the training yourself or load the results of the training I did by choosing between the **perform_model_training** and **load_pretrained_model** modes below:


In [111]:
training_usecase = 'perform_model_training'
# training_usecase = 'load_pretrained_model'


## 1. Required imports

In [85]:
import numpy as np
import requests
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torch.distributions.categorical import Categorical

Here I check the PyTorch installation in Google Colab and whether the hardware accelerator was set correctly:

In [86]:
print('torch version: ',torch.__version__)

print("GPU Available:", torch.cuda.is_available())

if torch.cuda.is_available():
  device = torch.device("cuda:0")
else:
  device = torch.device("cpu")
print('device: ', device)

torch version:  2.5.1+cu121
GPU Available: True
device:  cuda:0


## 2. Data preparation
I obtained the book text in .txt format from https://www.gutenberg.org/files/5001/old/2004-5001.txt and saved it as 'Einstein_relativity_book.txt'.

Below there are 2 code versions:\
version_1: avoids cloning a GitHub repo (files are loaded directly from my GitHub repo)\
version_2: includes GitHub repo cloning (if you use Google Colab, the repo is cloned to Google Colab, not to your local system)

In [87]:
data_loading_mode = 'clone_github_repo'
# data_loading_mode = 'load_from_github'

### version_1:

In [88]:
# if data_loading_mode == 'load_from_github':
#     book_url = 'https://github.com/zuzka-szczelina/python_calc_project_sem_7/raw/refs/heads/master/data/Einstein_relativity_book.txt'
#     response = requests.get(book_url)
#     text = response.text
#     print(text[:1000])

#     # start_idx = text.find('CONTENTS')
#     start_idx = text.find('THE MYSTERIOUS ISLAND')
#     end_idx = text.find('END OF THE PROJECT GUTENBERG')
#     text = text[start_idx: end_idx]
#     char_set = set(text)

### version_2:

In [89]:
if data_loading_mode == 'clone_github_repo':
    !git clone https://github.com/zuzka-szczelina/python_calc_project_sem_7.git
    %cd python_calc_project_sem_7
    !curl -O https://www.gutenberg.org/files/1268/1268-0.txt

    # with open('./data/Einstein_relativity_book.txt', 'r', encoding='cp1252') as f:
    with open('1268-0.txt', 'r') as f:
        text = f.read()
        start_idx = text.find('THE MYSTERIOUS ISLAND')
        end_idx = text.find('END OF THE PROJECT GUTENBERG')
        text = text[start_idx: end_idx]
        char_set = set(text)

Cloning into 'python_calc_project_sem_7'...
remote: Enumerating objects: 31, done.
remote: Counting objects: 100% (31/31), done.
remote: Compressing objects: 100% (27/27), done.
remote: Total 31 (delta 14), reused 9 (delta 1), pack-reused 0 (from 0)
Receiving objects: 100% (31/31), 5.90 MiB | 34.16 MiB/s, done.
Resolving deltas: 100% (14/14), done.
/content/python_calc_project_sem_7
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1124k  100 1124k    0     0  1314k      0 --:--:-- --:--:-- --:--:-- 1314k


In [90]:
print('Book text includes:')
print(f'total characters: {len(text)}\nunique characters {len(char_set)}')

Book text includes:
total characters: 1112267
unique characters 80


In [91]:
chars_sorted = sorted(char_set)
char2int_encoding = {ch: i for i, ch in enumerate(chars_sorted)}
char_array = np.array(chars_sorted)
text_encoded = np.array(
    [char2int_encoding[ch] for ch in text],
    dtype=np.int32
)

In [92]:
char_array

array(['\n', ' ', '!', '&', '(', ')', '*', ',', '-', '.', '/', '0', '1',
       '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '=', '?', 'A',
       'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N',
       'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'Y', 'Z', 'a', 'b',
       'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o',
       'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '‘', '’',
       '“', '”'], dtype='<U1')

In [93]:
print(f'numerical code: {text_encoded[0:17]}\n stands for {char_array[text_encoded[0:17]]}')

numerical code: [44 32 29  1 37 48 43 44 29 42 33 39 45 43  1 33 43]
 stands for ['T' 'H' 'E' ' ' 'M' 'Y' 'S' 'T' 'E' 'R' 'I' 'O' 'U' 'S' ' ' 'I' 'S']


In [94]:
print(char_array)

['\n' ' ' '!' '&' '(' ')' '*' ',' '-' '.' '/' '0' '1' '2' '3' '4' '5' '6'
 '7' '8' '9' ':' ';' '=' '?' 'A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'K'
 'L' 'M' 'N' 'O' 'P' 'Q' 'R' 'S' 'T' 'U' 'V' 'W' 'Y' 'Z' 'a' 'b' 'c' 'd'
 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o' 'p' 'q' 'r' 's' 't' 'u' 'v'
 'w' 'x' 'y' 'z' '‘' '’' '“' '”']


## 3. ML Dataset construction
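Divide the encoded text into overlapping chunks of seq_length + 1 characters. From each chunk, the first seq_length characters serve as the model input and the last seq_length characters (the same window shifted by one) serve as the target.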

In [95]:
seq_length = 40
chunk_size = seq_length + 1
text_chunks = [text_encoded[i: i + chunk_size] for i in range(len(text_encoded) - chunk_size)]
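
As a rough illustration of this chunking, here is the same sliding window applied to a short toy string. The string `toy_text` and the small `toy_seq_length` are made up for this sketch and are not part of the project data. Each chunk holds `toy_seq_length + 1` characters, so that the input and the target (shifted by one character) can both be cut from it.

```python
# Toy illustration of sliding-window chunking (hypothetical data, not the book text).
toy_text = "relativity"
toy_seq_length = 4
toy_chunk_size = toy_seq_length + 1

# One chunk per starting position; consecutive chunks overlap by all but one character.
# The "+ 1" in range() makes sure the last full window is included.
toy_chunks = [
    toy_text[i: i + toy_chunk_size]
    for i in range(len(toy_text) - toy_chunk_size + 1)
]

for chunk in toy_chunks[:3]:
    model_input = chunk[:-1]   # first toy_seq_length characters
    model_target = chunk[1:]   # the same window shifted by one character
    print(repr(model_input), '->', repr(model_target))
```

Because neighbouring chunks differ by a single character, the number of training examples is almost equal to the number of characters in the text.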

In [96]:
class TextDataset(Dataset):
    def __init__(self, text_chunks):
        self.text_chunks = text_chunks

    def __len__(self):
        return len(self.text_chunks)

    def __getitem__(self, idx):
        text_chunk = self.text_chunks[idx]
        return text_chunk[:-1].long(), text_chunk[1:].long()

In [97]:
seq_dataset = TextDataset(torch.tensor(np.array(text_chunks)))
# seq_dataset = TextDataset(torch.tensor(text_chunks))

In [98]:
for i, (seq, target) in enumerate(seq_dataset):
    print('Model input (x): ', repr(''.join(char_array[seq])))
    print('Model target (y):', repr(''.join(char_array[target])))
    if i==5:
        break

Model input (x):  'THE MYSTERIOUS ISLAND\n\nby Jules Verne\n\n1'
Model target (y): 'HE MYSTERIOUS ISLAND\n\nby Jules Verne\n\n18'
Model input (x):  'HE MYSTERIOUS ISLAND\n\nby Jules Verne\n\n18'
Model target (y): 'E MYSTERIOUS ISLAND\n\nby Jules Verne\n\n187'
Model input (x):  'E MYSTERIOUS ISLAND\n\nby Jules Verne\n\n187'
Model target (y): ' MYSTERIOUS ISLAND\n\nby Jules Verne\n\n1874'
Model input (x):  ' MYSTERIOUS ISLAND\n\nby Jules Verne\n\n1874'
Model target (y): 'MYSTERIOUS ISLAND\n\nby Jules Verne\n\n1874\n'
Model input (x):  'MYSTERIOUS ISLAND\n\nby Jules Verne\n\n1874\n'
Model target (y): 'YSTERIOUS ISLAND\n\nby Jules Verne\n\n1874\n\n'
Model input (x):  'YSTERIOUS ISLAND\n\nby Jules Verne\n\n1874\n\n'
Model target (y): 'STERIOUS ISLAND\n\nby Jules Verne\n\n1874\n\n\n'


Create a dataloader, an object used to pass data into the model in the form of 'batches' (groups of a specified size).

In [99]:
torch.manual_seed(1)
batch_size = 64
seq_dataloader = DataLoader(seq_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
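
To get a feeling for what a dataloader yields, the sketch below builds one over a tiny made-up dataset. The tensors `toy_inputs` and `toy_targets` and the batch size of 4 are assumptions for illustration only; the notebook itself uses `TextDataset` and a batch size of 64.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data: 10 sequences of 5 integer-encoded characters each.
toy_inputs = torch.arange(50).reshape(10, 5)
toy_targets = toy_inputs + 1  # stand-in for the shifted target sequences

toy_loader = DataLoader(
    TensorDataset(toy_inputs, toy_targets),
    batch_size=4,
    shuffle=True,
    drop_last=True,   # discard the final, incomplete batch of 2 sequences
)

for x_batch, y_batch in toy_loader:
    print(x_batch.shape, y_batch.shape)   # every batch has shape (4, 5)
```

With `drop_last=True` every batch has exactly `batch_size` sequences, which matters here because the hidden state is created for a fixed batch size.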

## 4. Defining an RNN model

#### \_\_init\_\_ method:

**self.embedding** is a table that stores an embedding vector for each unique character (we get one embedding vector per character by specifying vocab_size=len(char_array) below). Each embedding vector has a length of embed_dim.

**self.rnn** is the RNN we will use. We specify its properties using the *torch.nn.LSTM* class. LSTM stands for 'Long Short-Term Memory' and indicates that we will use an RNN with LSTM cells as the hidden layer. The input to the hidden layer will be a specific character represented as an embedding vector. Therefore, we specify that the expected input size equals the embedding vector length (embed_dim).\
The output of the LSTM layer is:\
*output_features,\
(final hidden state (for each sequence in the batch),\
final cell state (for each sequence in the batch))*

**self.fc** is a fully connected (linear) layer applied to the output of the hidden layer. It maps each hidden state of size rnn_hidden_size to vocab_size values (logits), one score per character.

**self.rnn_hidden_size** is the number of features in the hidden state of the RNN.
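
The tensor shapes involved can be checked with a small stand-alone sketch. The sizes below (`toy_vocab_size`, `toy_embed_dim`, `toy_hidden_size`, a batch of 3) are arbitrary assumptions chosen to keep the example tiny; the layers are created directly rather than through the notebook's RNN class.

```python
import torch
import torch.nn as nn

# Sizes chosen only for illustration (much smaller than in the notebook).
toy_vocab_size, toy_embed_dim, toy_hidden_size = 10, 4, 6
batch = 3

embedding = nn.Embedding(toy_vocab_size, toy_embed_dim)
lstm = nn.LSTM(input_size=toy_embed_dim, hidden_size=toy_hidden_size, batch_first=True)
fc = nn.Linear(toy_hidden_size, toy_vocab_size)

chars = torch.randint(0, toy_vocab_size, (batch,))   # one character index per sequence
embedded = embedding(chars).unsqueeze(1)             # (batch, 1, embed_dim)

h0 = torch.zeros(1, batch, toy_hidden_size)          # initial hidden state
c0 = torch.zeros(1, batch, toy_hidden_size)          # initial cell state
output, (h_n, c_n) = lstm(embedded, (h0, c0))        # output: (batch, 1, hidden_size)

logits = fc(output).reshape(batch, -1)               # (batch, vocab_size)
print(embedded.shape, output.shape, h_n.shape, logits.shape)
```

The `unsqueeze(1)` adds a sequence dimension of length 1, since `batch_first=True` makes the LSTM expect input of shape (batch, sequence, features).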

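#### forward method:
Processes one character per sequence at a time: the character indices are converted to embedding vectors, passed through the LSTM together with the current hidden and cell states, and projected by self.fc to logits over the vocabulary. It returns the logits together with the updated hidden and cell states.

#### init_hidden method:
This is where we initialise the hidden state and the LSTM cell state with zeros (they are then passed to the forward method).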

In [100]:
class RNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, rnn_hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(input_size=embed_dim,
                           hidden_size=rnn_hidden_size,
                           batch_first=True)
        self.fc = nn.Linear(in_features=rnn_hidden_size,
                            out_features=vocab_size)

        self.rnn_hidden_size = rnn_hidden_size

    def forward(self, x, hidden, cell):
        out = self.embedding(x).unsqueeze(1)
        out, (hidden, cell) = self.rnn(out, (hidden, cell))
        out = self.fc(out).reshape(out.size(0), -1)
        return out, hidden, cell

    def init_hidden(self, batch_size):
        hidden = torch.zeros(1, batch_size, self.rnn_hidden_size).to(device)
        cell = torch.zeros(1, batch_size, self.rnn_hidden_size).to(device)
        return hidden, cell

## 5. Creating an instance of the defined model

In [113]:
vocab_size = len(char_array)
embed_dim = 256
rnn_hidden_size = 512
model = RNN(vocab_size, embed_dim, rnn_hidden_size).to(device)
model

RNN(
  (embedding): Embedding(80, 256)
  (rnn): LSTM(256, 512, batch_first=True)
  (fc): Linear(in_features=512, out_features=80, bias=True)
)

Define the loss function and the optimizer:

In [114]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
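
`nn.CrossEntropyLoss` takes raw logits and integer class indices as targets. As a sanity check, a model that assigns the same score to all 80 characters should have a loss of ln(80), which is close to the value printed at epoch 0 in the training output (4.3774). The sketch below verifies this with made-up logits and arbitrary target indices.

```python
import math
import torch
import torch.nn as nn

toy_loss_fn = nn.CrossEntropyLoss()
vocab = 80   # number of unique characters reported earlier in the notebook

# All-zero logits correspond to a uniform distribution over the vocabulary.
uniform_logits = torch.zeros(4, vocab)
targets = torch.tensor([0, 5, 17, 79])   # arbitrary class indices

loss = toy_loss_fn(uniform_logits, targets)
print(loss.item(), math.log(vocab))   # both are approximately 4.382
```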

## 6. Training the model

In [115]:
import time

if training_usecase == 'perform_model_training':

    num_epochs = 10000
    # num_epochs = 30

    start_time = time.time()
    for epoch in range(num_epochs):
        hidden, cell = model.init_hidden(batch_size)
        seq_batch, target_batch = next(iter(seq_dataloader))
        seq_batch = seq_batch.to(device)
        target_batch = target_batch.to(device)
        optimizer.zero_grad()
        loss = 0
        for c in range(seq_length):
            pred, hidden, cell = model(seq_batch[:, c], hidden, cell)
            loss += loss_fn(pred, target_batch[:, c])
        loss.backward()
        optimizer.step()
        loss = loss.item()/seq_length
        if epoch % 500 == 0:
            print(f'Epoch {epoch} loss: {loss:.4f}')
    print('time passed: ', time.time() - start_time)

Epoch 0 loss: 4.3774
Epoch 500 loss: 1.5064
Epoch 1000 loss: 1.3778
Epoch 1500 loss: 1.2517
Epoch 2000 loss: 1.2846
Epoch 2500 loss: 1.2107
Epoch 3000 loss: 1.1854
Epoch 3500 loss: 1.1260
Epoch 4000 loss: 1.1220
Epoch 4500 loss: 1.1006
Epoch 5000 loss: 1.0819
Epoch 5500 loss: 1.1021
Epoch 6000 loss: 1.1286
Epoch 6500 loss: 1.0719
Epoch 7000 loss: 1.0910
Epoch 7500 loss: 1.1265
Epoch 8000 loss: 0.9980
Epoch 8500 loss: 1.0876
Epoch 9000 loss: 1.0266
Epoch 9500 loss: 1.0443
time passed:  1121.9397263526917


In the above cell, for each epoch (here, a single training step on one batch) we:
1. Initialise the hidden and cell states.
2. Use seq_dataloader as an iterator and load one batch (a set of 64 input sequences and the corresponding target sequences).
3. Reset the gradients of all optimised tensors.
4. Initialise the loss (between the predicted sequences and the target sequences of the loaded batch) as 0.
5. Use a for loop to:\
   I. Predict the next character for each character in the input sequence. This is done simultaneously for all input sequences in the batch.\
   II. Accumulate the loss as the sum of the losses over all character positions.
6. Compute the loss gradients after iterating through all characters.
7. Perform an optimization step to update the model parameters (.step() can be called once the gradients have been computed using .backward()).
8. Compute the final loss for the batch (dividing by the number of characters in each sequence).
9. Print the current loss every 500 epochs.
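
The steps listed above can be sketched as a single training step on random data. Everything prefixed with `toy_` and the `TinyCharModel` class are hypothetical stand-ins, chosen so that the example runs on a CPU in well under a second; the random integers replace a real batch from the dataloader.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_vocab, toy_embed, toy_hidden, toy_batch, toy_len = 12, 8, 16, 4, 5

class TinyCharModel(nn.Module):   # hypothetical stand-in for the notebook's RNN class
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(toy_vocab, toy_embed)
        self.rnn = nn.LSTM(toy_embed, toy_hidden, batch_first=True)
        self.fc = nn.Linear(toy_hidden, toy_vocab)

    def forward(self, x, hidden, cell):
        out = self.embedding(x).unsqueeze(1)
        out, (hidden, cell) = self.rnn(out, (hidden, cell))
        return self.fc(out).reshape(out.size(0), -1), hidden, cell

toy_model = TinyCharModel()
toy_loss_fn = nn.CrossEntropyLoss()
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=0.001)

# Step 2: random integer sequences stand in for one batch from the dataloader.
data = torch.randint(0, toy_vocab, (toy_batch, toy_len + 1))
seq_batch, target_batch = data[:, :-1], data[:, 1:]

hidden = torch.zeros(1, toy_batch, toy_hidden)   # step 1: initialise states
cell = torch.zeros(1, toy_batch, toy_hidden)
toy_optimizer.zero_grad()                        # step 3: reset gradients
loss = 0                                         # step 4: start from zero loss
for c in range(toy_len):                         # step 5: one character position at a time
    pred, hidden, cell = toy_model(seq_batch[:, c], hidden, cell)
    loss = loss + toy_loss_fn(pred, target_batch[:, c])
loss.backward()                                  # step 6: compute gradients
toy_optimizer.step()                             # step 7: update parameters
mean_loss = loss.item() / toy_len                # step 8: average over positions
print(f'mean loss per character: {mean_loss:.4f}')
```

Summing the per-position losses before calling `.backward()` lets the gradient flow back through all time steps of the sequence in one pass.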

## 7. Model saving and loading

There are 2 ways to save model training results:\
**method 1.** : saving the whole model object (i.e. a model object of a specified architecture together with its learned parameters)\
**method 2.** : saving just the parameters (to reuse them, one needs to create a model object with the same architecture as the trained one and load the parameters into it)

The model.eval() function is called to indicate that the model will now be used in evaluation mode (i.e. for inference). This is because some layers behave differently during training and inference and need their mode to be set in advance.
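
A minimal sketch of the parameters-only approach and of the effect of `.eval()` is shown below. The tiny `make_toy_model` helper is hypothetical, and an in-memory buffer is used instead of a file so that the example leaves nothing on disk. A Dropout layer is included only because it is one of the layers that behave differently in training and inference.

```python
import io
import torch
import torch.nn as nn

# A tiny hypothetical model; Dropout is included because it behaves differently in eval mode.
def make_toy_model():
    return nn.Sequential(nn.Linear(4, 3), nn.Dropout(p=0.5))

source_model = make_toy_model()

# Save only the parameters; an in-memory buffer stands in for a file on disk.
buffer = io.BytesIO()
torch.save(source_model.state_dict(), buffer)
buffer.seek(0)

# Re-create the same architecture, then load the saved parameters into it.
restored_model = make_toy_model()
restored_model.load_state_dict(torch.load(buffer, weights_only=True))
restored_model.eval()   # switches Dropout (and similar layers) to inference behaviour

x = torch.ones(2, 4)
print(restored_model.training)                             # False after .eval()
print(torch.equal(restored_model(x), restored_model(x)))   # repeated calls agree in eval mode
```

Saving only the state dict keeps the file independent of how the class is pickled, at the cost of having to rebuild the architecture before loading.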

In [116]:
saving_method = 'method_1'

### method 1.

In [121]:
if training_usecase == 'perform_model_training' and saving_method == 'method_1':
    torch.save(model, 'data/self_trained_model.pt')

    trained_model = torch.load('data/self_trained_model.pt', weights_only=False, map_location=device)
    trained_model.eval()

### method 2.

In [106]:
if training_usecase == 'perform_model_training' and saving_method == 'method_2':
    torch.save(model.state_dict(), 'data/self_trained_model_state_dict.pt')

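    # Move the new model to the same device that init_hidden() uses for its states.
    trained_model = RNN(vocab_size, embed_dim, rnn_hidden_size).to(device)
    trained_model.load_state_dict(
        torch.load('data/self_trained_model_state_dict.pt', map_location=device)
    )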
    trained_model.eval()

In [107]:
if training_usecase == 'load_pretrained_model':
    if  data_loading_mode == 'clone_github_repo':
        trained_model = torch.load('data/pretrained_model.pt', map_location=device, weights_only=False)
    # if data_loading_mode == 'load_from_github':
    #     !curl -O https://github.com/zuzka-szczelina/python_calc_project_sem_7/raw/refs/heads/master/data/pretrained_model.pt
    #     trained_model = torch.load('pretrained_model.pt', map_location=device)
    trained_model.eval()

## 8. Text generating function

The function defined below uses the model to generate text based on the book.

How meaningful the generated text is can be adjusted by changing the **predictability_factor**: the larger it is, the more predictable (and likely more meaningful) the generated text will be.
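
The effect of multiplying the logits by a factor can be seen on a small made-up example. The four `toy_logits` values below are arbitrary and stand for a hypothetical 4-character vocabulary; the same `Categorical` distribution is used as in the generation function.

```python
import torch
from torch.distributions.categorical import Categorical

# Hypothetical logits for a 4-character vocabulary.
toy_logits = torch.tensor([2.0, 1.0, 0.5, 0.0])

for factor in (0.5, 1.0, 2.0, 5.0):
    probs = Categorical(logits=toy_logits * factor).probs
    # The larger the factor, the more probability mass sits on the top-scoring character.
    print(factor, [round(p, 3) for p in probs.tolist()])

top_prob_low = Categorical(logits=toy_logits * 0.5).probs.max().item()
top_prob_high = Categorical(logits=toy_logits * 5.0).probs.max().item()
```

A factor below 1 flattens the distribution and makes sampling more random, while a large factor approaches always picking the most likely character.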

In [125]:
def generate_text(model, starting_str, len_generated_text=500, predictability_factor=2.0):
    encoded_input = torch.tensor(
        [char2int_encoding[s] for s in starting_str]
    )
    encoded_input = torch.reshape(encoded_input, (1, -1)).to(device)
    generated_str = starting_str

    model.eval()
    hidden, cell = model.init_hidden(1)
    for c in range(len(starting_str)-1):
        _, hidden, cell = model(
            encoded_input[:, c].view(1), hidden, cell
        )

    last_char = encoded_input[:, -1]
    for i in range(len_generated_text):
        logits, hidden, cell = model(
            last_char.view(1), hidden, cell
        )
        logits = torch.squeeze(logits, 0)
        scaled_logits = logits * predictability_factor
        m = Categorical(logits=scaled_logits)
        last_char = m.sample()
        generated_str += str(char_array[last_char])

    return generated_str

## 9. Final use

In [128]:
# well, it doesn't work xd
print(generate_text(trained_model, starting_str='Space and time'))

Space and time in the open air. Herbert
remained thrown to the coast of the balloon. The first time the colonists was below the other, which was of a spring which to establish ourselves with all his strong band of the sea and the sea.

The place was found at the forest of the south, and some day the ape was then left the sand and some storm which separated the plants.

“We shall see that an instant between the thirty-seventh parallel in the midst of the lake, or soon reached, and which was struck the
place of


In [None]:
# add example text generation
# final read N comments completion
# end