# Building A Language Model using Harry Potter Book Dataset

In [33]:
pip install torch --upgrade

Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 23.2.1 -> 24.0
[notice] To update, run: python.exe -m pip install --upgrade pip


In [34]:
pip install torchtext --upgrade


Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 23.2.1 -> 24.0
[notice] To update, run: python.exe -m pip install --upgrade pip


In [35]:
pip install datasets

Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 23.2.1 -> 24.0
[notice] To update, run: python.exe -m pip install --upgrade pip


In [36]:
pip install ipywidgets --upgrade

Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 23.2.1 -> 24.0
[notice] To update, run: python.exe -m pip install --upgrade pip


In [37]:
pip install jupyter jupyterlab --upgrade

Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 23.2.1 -> 24.0
[notice] To update, run: python.exe -m pip install --upgrade pip


In [38]:
import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd

import torchtext, datasets, math
from tqdm import tqdm

In [39]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cpu


In [40]:
SEED = 1234
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

## Task 1. Load data


## Dataset Description

### Content
This custom dataset comprises text extracted from all seven Harry Potter books, authored by J.K. Rowling. The collection spans from "Harry Potter and the Sorcerer's Stone" to "Harry Potter and the Deathly Hallows." The dataset is meticulously organized into rows, each signifying a chapter across the series, and includes columns for the text content, chapter number, and book number. This structured format allows for in-depth analysis and processing.

### Use Case
The Harry Potter dataset is a treasure trove for Natural Language Processing (NLP) projects and analyses. It provides a rich basis for a variety of applications, including:
- Text Analysis
- Sentiment Analysis
- Character Network Analysis
- Narrative Structure Understanding
- Style Analysis
- Text Generation

Given the dataset's diverse applications, it serves as an excellent resource for data scientists, researchers, and enthusiasts looking to explore the intersections of literature and machine learning.

### Source
The dataset is derived from a dedicated effort to compile the texts for NLP and text mining purposes, available on GitHub at [ErikaJacobs/Harry-Potter-Text-Mining](https://github.com/ErikaJacobs/Harry-Potter-Text-Mining/). This project provides the groundwork for accessing the rich narrative of the Harry Potter series in a structured data format, enabling various analytical and modeling endeavors.

## Acknowledgments
Special thanks to the creators and contributors of the [Harry Potter Text Mining project on GitHub](https://github.com/ErikaJacobs/Harry-Potter-Text-Mining/) for compiling and making this dataset accessible. Their work lays a valuable foundation for academic and hobbyist exploration within the realm of text analysis and NLP.

## Legal Notice
This dataset is intended for educational, research, and non-commercial use. Users of the dataset should ensure to comply with copyright laws and use the data responsibly, respecting the original work of J.K. Rowling. Any commercial use of the data should proceed only with appropriate permissions and adherence to copyright regulations.


In [41]:
import pandas as pd 

file_paths = ["hp/HPBook{}.txt".format(i) for i in range(1, 8)]

# Read all files into a list of DataFrames
dfs = [pd.read_csv(file, sep="@") for file in file_paths]

# Concatenate all DataFrames into a single DataFrame
df = pd.concat(dfs, ignore_index=True)

# Resetting the index
df.reset_index(drop=True, inplace=True)

In [42]:
df

Unnamed: 0,Text,Chapter,Book
0,"THE BOY WHO LIVED Mr. and Mrs. Dursley, of nu...",1,1
1,THE VANISHING GLASS Nearly ten years had pass...,2,1
2,THE LETTERS FROM NO ONE The escape of the Bra...,3,1
3,THE KEEPER OF THE KEYS BOOM. They knocked aga...,4,1
4,DIAGON ALLEY Harry woke early the next mornin...,5,1
...,...,...,...
195,"Harry remained kneeling at Snape's side, simpl...",33,7
196,"Finally, the truth. Lying with his face presse...",34,7
197,"He lay facedown, listening to the silence. He ...",35,7
198,He was flying facedown on the ground again. Th...,36,7


In [43]:
# Splitting the DataFrame into training, validation, and test datasets

# Training dataset: Contains the first 100 rows of the DataFrame
train_dataset = df[0:100]

# Validation dataset: Contains rows 100 to 149 (inclusive) of the DataFrame
val_dataset = df[100:150]

# Test dataset: Contains rows 150 onwards to the end of the DataFrame
test_dataset = df[150:]


In [44]:
from datasets import Dataset

# Convert the original DataFrame df to a Dataset
dataset = Dataset.from_pandas(df)

# Convert the train_dataset, val_dataset, and test_dataset DataFrames to Datasets
train_dataset = Dataset.from_pandas(train_dataset)
val_dataset = Dataset.from_pandas(val_dataset)
test_dataset = Dataset.from_pandas(test_dataset)


In [45]:
dataset

Dataset({
    features: ['Text', 'Chapter', 'Book'],
    num_rows: 200
})

## 2. Preprocessing

### Tokenizing

Simply tokenize the given text to tokens.

In [46]:
tokenizer = torchtext.data.utils.get_tokenizer('basic_english')

tokenize_data = lambda example, tokenizer: {'tokens': tokenizer(example['Text'])}  

tokenized_train_dataset = train_dataset.map(tokenize_data, remove_columns=['Text'], fn_kwargs={'tokenizer': tokenizer})
tokenized_val_dataset = val_dataset.map(tokenize_data, remove_columns=['Text'], fn_kwargs={'tokenizer': tokenizer})
tokenized_test_dataset = test_dataset.map(tokenize_data, remove_columns=['Text'], fn_kwargs={'tokenizer': tokenizer})


Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Map:   0%|          | 0/50 [00:00<?, ? examples/s]

Map:   0%|          | 0/50 [00:00<?, ? examples/s]

In [47]:
# Print the tokens of a specific example in the training dataset
print(tokenized_train_dataset)


Dataset({
    features: ['Chapter', 'Book', 'tokens'],
    num_rows: 100
})


### Numericalizing

We will tell torchtext to add any word that has occurred at least three times in the dataset to the vocabulary because otherwise it would be too big.  Also we shall make sure to add `unk` and `eos`.

In [48]:
import torchtext

# Build vocabulary from the tokenized training dataset
vocab = torchtext.vocab.build_vocab_from_iterator(tokenized_train_dataset['tokens'], min_freq=3)

# Insert special tokens '<unk>' and '<eos>'
vocab.insert_token('<unk>', 0)
vocab.insert_token('<eos>', 1)

# Set the default index for unknown tokens
vocab.set_default_index(vocab['<unk>'])


In [49]:
print(len(vocab))

8042


In [50]:
print(vocab.get_itos()[:10])

['<unk>', '<eos>', '.', ',', 'the', "'", '\\', 'and', 'to', 'a']


## 3. Prepare the batch loader

### Prepare data

Given "Chaky loves eating at AIT", and "I really love deep learning", and given batch size = 3, we will get three batches of data "Chaky loves eating at", "AIT `<eos>` I really", "love deep learning `<eos>`".  

In [51]:
def get_data(dataset, vocab, batch_size):
    data = []
    for example in dataset:
        if example['tokens']:
            tokens = example['tokens'].append('<eos>')
            tokens = [vocab[token] for token in example['tokens']]
            data.extend(tokens)
    data = torch.LongTensor(data)
    num_batches = data.shape[0] // batch_size
    data = data[:num_batches * batch_size]
    data = data.view(batch_size, num_batches) #view vs. reshape (whether data is contiguous)
    return data #[batch size, seq len]


In [52]:
batch_size = 128

train_data = get_data(tokenized_train_dataset, vocab, batch_size)
valid_data = get_data(tokenized_val_dataset, vocab, batch_size)
test_data  = get_data(tokenized_train_dataset, vocab, batch_size)

In [53]:
train_data.shape

torch.Size([128, 4894])

## 4. Modeling 

In [54]:
class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim, num_layers, dropout_rate):
        super().__init__()
        self.num_layers = num_layers
        self.hid_dim    = hid_dim
        self.emb_dim    = emb_dim
        
        self.embedding  = nn.Embedding(vocab_size, emb_dim)
        self.lstm       = nn.LSTM(emb_dim, hid_dim, num_layers=num_layers, dropout=dropout_rate, batch_first=True)
        self.dropout    = nn.Dropout(dropout_rate)
        self.fc         = nn.Linear(hid_dim, vocab_size)
        
        self.init_weights()
    
    def init_weights(self):
        init_range_emb = 0.1
        init_range_other = 1/math.sqrt(self.hid_dim)
        self.embedding.weight.data.uniform_(-init_range_emb, init_range_other)
        self.fc.weight.data.uniform_(-init_range_other, init_range_other)
        self.fc.bias.data.zero_()
        for i in range(self.num_layers):
            self.lstm.all_weights[i][0] = torch.FloatTensor(self.emb_dim,
                self.hid_dim).uniform_(-init_range_other, init_range_other) #We
            self.lstm.all_weights[i][1] = torch.FloatTensor(self.hid_dim,   
                self.hid_dim).uniform_(-init_range_other, init_range_other) #Wh
    
    def init_hidden(self, batch_size, device):
        hidden = torch.zeros(self.num_layers, batch_size, self.hid_dim).to(device)
        cell   = torch.zeros(self.num_layers, batch_size, self.hid_dim).to(device)
        return hidden, cell
        
    def detach_hidden(self, hidden):
        hidden, cell = hidden
        hidden = hidden.detach() #not to be used for gradient computation
        cell   = cell.detach()
        return hidden, cell
        
    def forward(self, src, hidden):
        #src: [batch_size, seq len]
        embedding = self.dropout(self.embedding(src)) #harry potter is
        #embedding: [batch-size, seq len, emb dim]
        output, hidden = self.lstm(embedding, hidden)
        #ouput: [batch size, seq len, hid dim]
        #hidden: [num_layers * direction, seq len, hid_dim]
        output = self.dropout(output)
        prediction =self.fc(output)
        #prediction: [batch_size, seq_len, vocab_size]
        return prediction, hidden

## 5. Training 

Follows very basic procedure.  One note is that some of the sequences that will be fed to the model may involve parts from different sequences in the original dataset or be a subset of one (depending on the decoding length). For this reason we will reset the hidden state every epoch, this is like assuming that the next batch of sequences is probably always a follow up on the previous in the original dataset.

In [55]:
vocab_size = len(vocab)
emb_dim = 1024                # 400 in the paper
hid_dim = 1024                # 1150 in the paper
num_layers = 2                # 3 in the paper
dropout_rate = 0.65              
lr = 1e-3                     

In [56]:
model      = LSTMLanguageModel(vocab_size, emb_dim, hid_dim, num_layers, dropout_rate).to(device)
optimizer  = optim.Adam(model.parameters(), lr=lr)
criterion  = nn.CrossEntropyLoss()
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'The model has {num_params:,} trainable parameters')

The model has 33,271,658 trainable parameters


In [57]:
def get_batch(data, seq_len, idx):
    #data #[batch size, bunch of tokens]
    src    = data[:, idx:idx+seq_len]                   
    target = data[:, idx+1:idx+seq_len+1]  #target simply is ahead of src by 1            
    return src, target

In [58]:
def train(model, data, optimizer, criterion, batch_size, seq_len, clip, device):
    
    epoch_loss = 0
    model.train()
    # drop all batches that are not a multiple of seq_len
    # data #[batch size, seq len]
    num_batches = data.shape[-1]
    data = data[:, :num_batches - (num_batches -1) % seq_len]  #we need to -1 because we start at 0
    num_batches = data.shape[-1]
    
    #reset the hidden every epoch
    hidden = model.init_hidden(batch_size, device)
    
    for idx in tqdm(range(0, num_batches - 1, seq_len), desc='Training: ',leave=False):
        optimizer.zero_grad()
        
        #hidden does not need to be in the computational graph for efficiency
        hidden = model.detach_hidden(hidden)

        src, target = get_batch(data, seq_len, idx) #src, target: [batch size, seq len]
        src, target = src.to(device), target.to(device)
        batch_size = src.shape[0]
        prediction, hidden = model(src, hidden)               

        #need to reshape because criterion expects pred to be 2d and target to be 1d
        prediction = prediction.reshape(batch_size * seq_len, -1)  #prediction: [batch size * seq len, vocab size]  
        target = target.reshape(-1)
        loss = criterion(prediction, target)
        
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item() * seq_len
    return epoch_loss / num_batches

In [59]:
def evaluate(model, data, criterion, batch_size, seq_len, device):

    epoch_loss = 0
    model.eval()
    num_batches = data.shape[-1]
    data = data[:, :num_batches - (num_batches -1) % seq_len]
    num_batches = data.shape[-1]

    hidden = model.init_hidden(batch_size, device)

    with torch.no_grad():
        for idx in range(0, num_batches - 1, seq_len):
            hidden = model.detach_hidden(hidden)
            src, target = get_batch(data, seq_len, idx)
            src, target = src.to(device), target.to(device)
            batch_size= src.shape[0]

            prediction, hidden = model(src, hidden)
            prediction = prediction.reshape(batch_size * seq_len, -1)
            target = target.reshape(-1)

            loss = criterion(prediction, target)
            epoch_loss += loss.item() * seq_len
    return epoch_loss / num_batches

In [60]:
import time
n_epochs = 10
seq_len = 50  # Decoding length
clip = 0.25

lr_scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=0)

best_valid_loss = float('inf')

total_start_time = time.time()

for epoch in range(n_epochs):
    start_time = time.time()  

    train_loss = train(model, train_data, optimizer, criterion, batch_size, seq_len, clip, device)

    valid_loss = evaluate(model, valid_data, criterion, batch_size, seq_len, device)

    lr_scheduler.step(valid_loss)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'best-val-lstm_lm.pt')

    end_time = time.time()
    epoch_mins, epoch_secs = divmod(end_time - start_time, 60)
    print(f'Epoch: {epoch+1:02} | Epoch Time: {int(epoch_mins)}m {int(epoch_secs)}s')
    print(f'\tTrain Perplexity: {math.exp(train_loss):.3f}')
    print(f'\tValid Perplexity: {math.exp(valid_loss):.3f}')

total_end_time = time.time()
total_mins, total_secs = divmod(total_end_time - total_start_time, 60)
print(f'Total Time: {int(total_mins)}m {int(total_secs)}s')

                                                         

Epoch: 01 | Epoch Time: 10m 5s
	Train Perplexity: 547.609
	Valid Perplexity: 343.123


                                                         

Epoch: 02 | Epoch Time: 9m 53s
	Train Perplexity: 285.497
	Valid Perplexity: 181.658


                                                         

Epoch: 03 | Epoch Time: 9m 48s
	Train Perplexity: 161.361
	Valid Perplexity: 126.969


                                                         

Epoch: 04 | Epoch Time: 9m 58s
	Train Perplexity: 119.602
	Valid Perplexity: 105.831


                                                         

Epoch: 05 | Epoch Time: 10m 33s
	Train Perplexity: 100.266
	Valid Perplexity: 95.497


                                                         

Epoch: 06 | Epoch Time: 10m 37s
	Train Perplexity: 88.961
	Valid Perplexity: 89.322


                                                         

Epoch: 07 | Epoch Time: 10m 32s
	Train Perplexity: 80.887
	Valid Perplexity: 83.748


                                                         

Epoch: 08 | Epoch Time: 10m 29s
	Train Perplexity: 74.716
	Valid Perplexity: 79.738


                                                         

Epoch: 09 | Epoch Time: 10m 7s
	Train Perplexity: 70.043
	Valid Perplexity: 75.657


                                                         

Epoch: 10 | Epoch Time: 10m 1s
	Train Perplexity: 65.866
	Valid Perplexity: 73.283
Total Time: 102m 9s


## Task 2: Model Architecture 


## Data Preprocessing Steps

- **Data Splitting**: The dataset was divided into training, validation, and test sets to ensure the model is evaluated on unseen data effectively.
- **Dataset Conversion**: Pandas DataFrames were converted into `Dataset` objects for more efficient data manipulation, leveraging the capabilities of the `datasets` library.
- **Tokenization**: A tokenizer was employed to break down the text into tokens, which serve as the basic units for model training. This process is crucial for understanding and processing the raw text data.
- **Vocabulary Building**: A vocabulary was created from the tokenized training data, mapping each unique token to an index. Special tokens for unknown words (`<unk>`) and end-of-sequence markers (`<eos>`) were included to handle out-of-vocabulary words and sequence endings, respectively.
- **Numericalization and Batching**: Text tokens were transformed into numerical indices based on the vocabulary, and the data was organized into batches. This step is essential for preparing the data for input into the machine learning model.

## Model Architecture and Training

### Architecture
- The model architecture is based on LSTM (Long Short-Term Memory) networks, known for their effectiveness in handling sequences and long-term dependencies in text data.
- It includes an **embedding layer** to transform token indices into dense vector representations, an **LSTM layer** for processing sequences, a **dropout layer** for regularization, and a **fully connected layer** to output predictions across the vocabulary.

### Training
- The model was trained on batches of data, using cross-entropy loss to calculate the difference between predicted and actual next tokens in sequences.
- **Gradient clipping** was applied to prevent exploding gradients, a common issue in training deep neural networks.
- **Learning rate adjustments** were made based on the performance on the validation set, optimizing the training process and model's ability to generalize.

### Evaluation
- The model's performance was evaluated using perplexity on both the validation and test datasets. Perplexity measures the model's uncertainty in predicting the next token, with lower values indicating better performance.
- This metric ensures the model generalizes well to new, unseen data, an essential aspect of effective language modeling.


## 6. Testing

In [61]:
model.load_state_dict(torch.load('best-val-lstm_lm.pt',  map_location=device))
test_loss = evaluate(model, test_data, criterion, batch_size, seq_len, device)
print(f'Test Perplexity: {math.exp(test_loss):.3f}')

Test Perplexity: 49.730


## 7. Real-world inference

Here we take the prompt, tokenize, encode and feed it into the model to get the predictions.  We then apply softmax while specifying that we want the output due to the last word in the sequence which represents the prediction for the next word.  We divide the logits by a temperature value to alter the model’s confidence by adjusting the softmax probability distribution.

Once we have the Softmax distribution, we randomly sample it to make our prediction on the next word. If we get <unk> then we give that another try.  Once we get <eos> we stop predicting.
    
We decode the prediction back to strings last lines.

In [62]:
def generate(prompt, max_seq_len, temperature, model, tokenizer, vocab, device, seed=None):
    if seed is not None:
        torch.manual_seed(seed)
    model.eval()
    tokens = tokenizer(prompt)
    indices = [vocab[t] for t in tokens]
    batch_size = 1
    hidden = model.init_hidden(batch_size, device)
    with torch.no_grad():
        for i in range(max_seq_len):
            src = torch.LongTensor([indices]).to(device)
            prediction, hidden = model(src, hidden)
            
            #prediction: [batch size, seq len, vocab size]
            #prediction[:, -1]: [batch size, vocab size] #probability of last vocab
            
            probs = torch.softmax(prediction[:, -1] / temperature, dim=-1)  
            prediction = torch.multinomial(probs, num_samples=1).item()    
            
            while prediction == vocab['<unk>']: #if it is unk, we sample again
                prediction = torch.multinomial(probs, num_samples=1).item()

            if prediction == vocab['<eos>']:    #if it is eos, we stop
                break

            indices.append(prediction) #autoregressive, thus output becomes input

    itos = vocab.get_itos()
    tokens = [itos[i] for i in indices]
    return tokens

In [63]:
prompt = 'Harry Potter is'
max_seq_len = 30
seed = 0

#smaller the temperature, more diverse tokens but comes 
#with a tradeoff of less-make-sense sentence
temperatures = [0.5, 0.7, 0.75, 0.8, 1.0]
for temperature in temperatures:
    generation = generate(prompt, max_seq_len, temperature, model, tokenizer, 
                          vocab, device, seed)
    print(str(temperature)+'\n'+' '.join(generation)+'\n')

0.5
harry potter is going to be a lot of magic , and he was a lot of magic in the dark arts . he was looking over for the map and saw the

0.7
harry potter is going to be caught in his face , and he had never forgotten that even he was going to be bewitched . \you don ' t want to see that

0.75
harry potter is going to be caught in his face , and he had never forgotten that even he was going to be bewitched . \you don ' t want to see that

0.8
harry potter is going to be caught in his face , wondering , \ said harry , even surprised that harry could come back . \you don ' t want to see that

1.0
harry potter is going to send him in dangerous face , wondering , \ said harry , even surprised his bill looked flooding . \you took him from each day , he was



In [64]:
import pickle

# Assuming lstm dictionary is correctly defined as shown previously
lstm = {
    'vocab_size': vocab_size,
    'emb_dim': emb_dim,
    'hid_dim': hid_dim,
    'num_layers': num_layers,
    'dropout_rate': dropout_rate,
    'tokenizer': tokenizer,
    'vocab': vocab
}

# Correctly save the lstm dictionary to a pickle file
with open('model.pkl', 'wb') as f:
    pickle.dump(lstm, f)
