# Building an LLM from scratch
## Code Description
This notebook implements a small text generation model using PyTorch, a popular machine learning library. The model is trained on a corpus of text and learns to predict the next token in a sequence given the tokens before it. Next-token models of this kind can be used for a variety of natural language processing tasks, such as text completion and translation.

Let's break down the code into its main components:

## Loading dataset
The notebook starts with a function that loads abstracts from Semantic Scholar, a free, AI-powered research tool for scientific literature, to gather a corpus of text for training the model. The function searches for papers on a given topic published between 2018 and 2023 and concatenates their abstracts into a single string, wrapping each abstract in start and end markers. It also keeps a list of the individual abstracts. Once it has processed the specified number of papers, it returns both the text and the list.

In [1]:
!pip install semanticscholar



In [2]:
from semanticscholar import SemanticScholar
from functools import lru_cache

MAX_PAPER = 600

@lru_cache
def load_abstracts(topic="generative ai", number_paper=MAX_PAPER):
    sch = SemanticScholar()
    papers = sch.search_paper(query=topic, year="2018-2023")
    big_text = ""
    abstract_list = []
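    for i, paper in enumerate(papers):
        # Stop once the requested number of papers has been processed
        if i >= number_paper:
            break
        abstract = paper['abstract']
        if abstract is not None:
            big_text += f"\n<START-ABSTRACT {i}>: \n{abstract}\n</END-ABSTRACT {i}>\n"
            abstract_list.append(abstract)
    # Always return both values, even if the search runs out of results early
    return big_text, abstract_list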

## Importing Libraries

Next, the code imports the libraries it needs: PyTorch, its neural network (nn) module, and its data utilities (Dataset and DataLoader). It also imports a tokenizer and a vocabulary builder from torchtext, a library for text processing, and the Adam optimizer from torch.optim.

In [3]:
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torch.optim import Adam

## Setting up the Device

Next, the code checks whether CUDA is available. CUDA is a parallel computing platform and programming model created by NVIDIA that allows the GPU to be used for general-purpose processing. If CUDA is available, PyTorch uses the GPU for computations; otherwise, it falls back to the CPU.

In [4]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [5]:
device

device(type='cuda')

## Text Processing

The code then loads a large corpus of text, converts it to lowercase, and tokenizes it using a basic English tokenizer from torchtext. Tokenization is the process of splitting the text into individual words or tokens. After tokenization, a vocabulary is built from the tokens, and the text is numericalized, i.e., each token is replaced by its index in the vocabulary.

In [11]:
big_text, abstract_list_data = load_abstracts("LLMs Generative AI", number_paper=600)



In [9]:
abstract_list_data

["Generative AI, the most popular current approach to AI, consists of large language models (LLMs) that are trained to produce outputs that are plausible, but not necessarily correct. Although their abilities are often uncanny, they are lacking in aspects of reasoning, leading LLMs to be less than completely trustworthy. Furthermore, their results tend to be both unpredictable and uninterpretable. We lay out 16 desiderata for future AI, and discuss an alternative approach to AI which could theoretically address many of the limitations associated with current approaches: AI educated with curated pieces of explicit knowledge and rules of thumb, enabling an inference engine to automatically deduce the logical entailments of all that knowledge. Even long arguments produced this way can be both trustworthy and interpretable, since the full step-by-step line of reasoning is always available, and for each step the provenance of the knowledge used can be documented and audited. There is howeve

Cache the data in a file

In [12]:

abstract_text = ' '.join(abstract_list_data)
with open('genAIScolarData600.txt', 'w',encoding='utf-8') as output:
    output.write(abstract_text)

Perform Data preprocessing

In [12]:
# Use a name that does not shadow the built-in input()
with open('genAIScolarData600.txt', 'r', encoding='utf-8') as infile:
    abstract_text = infile.read()

In [18]:
# Lowercase the text
text = abstract_text.lower()

# Define the tokenizer
tokenizer = get_tokenizer('basic_english')

# Tokenize the text
tokenized_text = [list(tokenizer(text))]

# Build the vocabulary from the tokenized text
vocab = build_vocab_from_iterator(tokenized_text)

# Numericalize the text
numericalized_text = [vocab[token] for token in tokenized_text[0]]

In [19]:
len(vocab)

9482
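
To make the three steps concrete (tokenize, build a vocabulary, numericalize), here is a minimal sketch in plain Python. It does not use torchtext: `simple_tokenize` and `build_vocab` are hypothetical helpers, and the regular expression is only a rough stand-in for the `basic_english` tokenizer. The index order (most frequent token first, ties broken alphabetically) is an assumption made for illustration.

```python
import re
from collections import Counter

def simple_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower())

def build_vocab(tokens):
    """Map each distinct token to an index, most frequent first."""
    counts = Counter(tokens)
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate(ordered)}

sample = "Generative AI is popular. Generative AI is new."
tokens = simple_tokenize(sample)
toy_vocab = build_vocab(tokens)

# Numericalize: replace every token by its index in the vocabulary
ids = [toy_vocab[tok] for tok in tokens]

print(tokens)
print(toy_vocab)
print(ids)
```

Repeated words map to the same index, which is why the vocabulary is much smaller than the number of tokens in the corpus.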

## Dataset Creation

The code defines a custom PyTorch Dataset for the text data. In PyTorch, Dataset is an abstract class representing a dataset, and a map-style subclass implements two main methods: `__len__` and `__getitem__`. Here, `__len__` returns the number of windows in the dataset (the text length minus sequence_length), and `__getitem__` returns an input window of sequence_length tokens together with its target, the same window shifted one token to the right, so that every position is labeled with the token that follows it.

A DataLoader is then created for the dataset. DataLoader is the PyTorch utility that groups items into batches (here, 128 windows per batch) and can optionally shuffle them or load them with parallel worker processes.

In [30]:
# Define the dataset
class LlamaDataset(Dataset):
    def __init__(self, text, sequence_length):
        self.text = text
        self.sequence_length = sequence_length

    def __len__(self):
        return len(self.text) - self.sequence_length

    def __getitem__(self, idx):
        return (
            torch.tensor(self.text[idx:idx+self.sequence_length]),
            torch.tensor(self.text[idx+1:idx+self.sequence_length+1]),
        )

# Create the dataset and dataloader
sequence_length = 8
dataset = LlamaDataset(numericalized_text, sequence_length)
dataloader = DataLoader(dataset, batch_size=128)
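
The windowing done by the dataset can be sketched without PyTorch. The helper `make_windows` below is hypothetical and only mirrors the slicing in `__getitem__` and the count in `__len__`; the token ids are made up.

```python
def make_windows(token_ids, sequence_length):
    """Return (input, target) pairs where the target is the input shifted by one token."""
    pairs = []
    for idx in range(len(token_ids) - sequence_length):
        x = token_ids[idx:idx + sequence_length]
        y = token_ids[idx + 1:idx + sequence_length + 1]
        pairs.append((x, y))
    return pairs

toy_ids = [10, 11, 12, 13, 14, 15]
windows = make_windows(toy_ids, sequence_length=3)

for x, y in windows:
    print(x, "->", y)
```

With six tokens and a window of three, there are three pairs, and consecutive windows overlap in all but one token.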

## Model Definition

The code defines a custom PyTorch Module for the text generation model. The model consists of an embedding layer, a transformer, and a linear layer. The embedding layer converts the input tokens into vectors of a fixed size. The transformer, an nn.Transformer encoder-decoder stack, is the main part of the model and learns the relationships between the tokens in the text; the forward pass feeds it the same embedded sequence as both source and target, without a causal mask or positional encoding. The linear layer converts the transformer output into a score for every vocabulary entry at each position, from which the next token is predicted.

In [39]:
class LlamaModel(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers, num_heads, dropout):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.transformer = nn.Transformer(
            d_model=embed_size,
            nhead=num_heads,
            num_encoder_layers=num_layers,
            num_decoder_layers=num_layers,
            dim_feedforward=hidden_size,
            dropout=dropout,
            batch_first=True,
        )
        self.fc = nn.Linear(embed_size, vocab_size)

    def forward(self, x):
        embedded = self.embedding(x)
        output = self.transformer(embedded, embedded)
        output = self.fc(output)
        return output
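
Since the forward pass above applies no causal mask, each position can attend to the tokens that come after it, including the one it is supposed to predict. One possible variant, shown here only as a sketch and not as the notebook's model, uses nn.TransformerEncoder with an additive upper-triangular mask so that position t sees only positions up to t. The class name `CausalLM` and the small default sizes are assumptions for illustration; like the original, the sketch still omits positional encodings.

```python
import torch
from torch import nn

class CausalLM(nn.Module):
    """Hypothetical masked variant: embedding -> masked self-attention stack -> linear."""

    def __init__(self, vocab_size, embed_size=32, hidden_size=64,
                 num_layers=1, num_heads=4, dropout=0.0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_size,
            nhead=num_heads,
            dim_feedforward=hidden_size,
            dropout=dropout,
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.fc = nn.Linear(embed_size, vocab_size)

    def forward(self, x):
        seq_len = x.size(1)
        # -inf above the diagonal blocks attention to later positions
        mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=x.device),
            diagonal=1,
        )
        hidden = self.encoder(self.embedding(x), mask=mask)
        return self.fc(hidden)

torch.manual_seed(0)
toy_model = CausalLM(vocab_size=50).eval()
a = torch.randint(0, 50, (1, 6))
b = a.clone()
b[0, -1] = (b[0, -1] + 1) % 50  # change only the last token

with torch.no_grad():
    out_a, out_b = toy_model(a), toy_model(b)

# Outputs at earlier positions are unaffected by the changed last token
print(torch.allclose(out_a[:, :-1], out_b[:, :-1], atol=1e-5))
```

The final check is a quick way to confirm that the mask works: changing a later token must not alter the predictions made at earlier positions.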

Simplified version of the model using GRU instead of Transformer

In [None]:
# Define the model
class LlamaModel(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
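        # batch_first=True because the DataLoader yields tensors of shape (batch, sequence)
        self.rnn = nn.GRU(embed_size, hidden_size, num_layers, batch_first=True)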
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x):
        embedded = self.embedding(x)
        output, _ = self.rnn(embedded)
        output = self.fc(output)
        return output

In [32]:
for batch in dataloader:
    x, y = batch
    print(x.shape)
    print(y.shape)
    break

torch.Size([128, 8])
torch.Size([128, 8])


## Model Initialization and Training

The model is then initialized with the size of the vocabulary, the embedding size, the hidden size, the number of layers, the number of heads for the multi-head attention mechanism in the transformer, and the dropout rate. The model is moved to the GPU if available.

If multiple GPUs are available, the model is wrapped with nn.DataParallel, which allows parallelizing the computations over the GPUs.

The Adam optimizer is initialized with the model parameters and a learning rate of 0.001.

The model is then trained for up to 10 epochs. In each epoch, the model goes through all the batches in the dataloader. For each batch, it predicts the next token at every position, computes the cross-entropy loss between the predictions and the actual next tokens, and updates the model parameters to reduce the loss. After each epoch the loss of the last batch is printed, and training stops early if that loss falls below 0.06.

In [40]:
# Initialize the model and the optimizer
model = LlamaModel(len(vocab), embed_size=128, hidden_size=256, num_layers=2, num_heads=8, dropout=0.1).to(device)

# If there are multiple GPUs, wrap the model with nn.DataParallel
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    model = nn.DataParallel(model)
model = model.to(device)

optimizer = Adam(model.parameters(), lr=0.001)

# Train the model
for epoch in range(10):
    for batch in dataloader:
        x, y = batch
        x = x.to(device)
        y = y.to(device)
        optimizer.zero_grad()
        y_pred = model(x)
        loss = nn.functional.cross_entropy(y_pred.view(-1, len(vocab)), y.view(-1))
        loss.backward()
        optimizer.step()
    print(f'Epoch {epoch}, Loss {loss.item()}')
    if float(loss.item()) < 0.06:
        break

Epoch 0, Loss 5.322413921356201
Epoch 1, Loss 3.9353713989257812
Epoch 2, Loss 2.745954990386963
Epoch 3, Loss 1.8482836484909058
Epoch 4, Loss 1.3274807929992676
Epoch 5, Loss 1.0970139503479004
Epoch 6, Loss 0.944744884967804
Epoch 7, Loss 0.7786787748336792
Epoch 8, Loss 0.7251054048538208
Epoch 9, Loss 0.6376631855964661
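
To put these loss values in context, the sketch below computes cross-entropy by hand for a single prediction. The function `cross_entropy` here is a hypothetical plain-Python helper, not the PyTorch one. A model that assigns equal scores to all 9482 vocabulary entries has a loss of ln(9482), roughly 9.16, so the value printed after the first epoch is already well below that uniform baseline.

```python
import math

def cross_entropy(logits, target_index):
    """Negative log-probability of the target under softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(v - max_logit) for v in logits))
    return log_sum_exp - logits[target_index]

vocab_size = 9482

# Equal scores for every token: the loss equals ln(vocab_size)
uniform_loss = cross_entropy([0.0] * vocab_size, target_index=0)

# A confident, correct prediction gives a loss close to zero
confident_logits = [0.0] * vocab_size
confident_logits[0] = 20.0
confident_loss = cross_entropy(confident_logits, target_index=0)

print(uniform_loss, confident_loss)
```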


# Result
## Text Generation
Finally, the trained model is used to generate new text. A seed text is provided as a starting point, and the model generates a specified number of tokens following the seed text.

In [35]:
# Use the trained model to generate new text
def generate_text(model, human_input, num_tokens):
    model.eval()  # Set the model to evaluation mode
    with torch.no_grad():  # No need to track the gradients
        tokens = [vocab[token] for token in tokenizer(human_input)]
        tokens = torch.tensor(tokens).unsqueeze(0).to(device)
        for _ in range(num_tokens):
            output = model(tokens)
            probabilities = nn.functional.softmax(output[0, -1], dim=0)
            next_token = torch.multinomial(probabilities, 1).item()
            tokens = torch.cat([tokens, torch.tensor([[next_token]]).to(device)], dim=1)
        itos = vocab.get_itos()  # build the index-to-token list once instead of per token
        generated_text = ' '.join(itos[token] for token in tokens[0].tolist())
        return generated_text
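
The sampling step inside the loop (softmax over the scores of the last position, then one random draw) can be illustrated in plain Python. This is a sketch with made-up scores; `softmax` and `sample_index` are hypothetical helpers that play the roles of nn.functional.softmax and torch.multinomial.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    max_logit = max(logits)
    exps = [math.exp(v - max_logit) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(probabilities, rng):
    """Draw one index with probability proportional to its weight."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]

logits = [2.0, 1.0, 0.1]  # made-up scores for a three-token vocabulary
probs = softmax(logits)

rng = random.Random(0)
draws = [sample_index(probs, rng) for _ in range(1000)]
counts = [draws.count(i) for i in range(len(logits))]

print(probs)
print(counts)
```

Because the next token is drawn at random rather than taken as the highest-scoring one, two calls with the same seed text can produce different continuations.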

Example 1

In [41]:
result = generate_text(model, human_input="Generative AI is ", num_tokens=100)
print(result)

generative ai is generative ai is generative ai is generative ai is generative ai is generative ai is generative ai is is generative ai is part is generative ai is generative ai is generative ai is generative ai is generative ai is is generative ai is generative ai is released generative ai is generative ai is open generative ai is is released generative ai is generative ai is is generative ai is released generative ai is generative ai is generative ai is generative ai is is released generative ai is released generative ai is released is released generative ai is released generative ai


Example 2

In [42]:
result = generate_text(model, human_input="Intelligence is ", num_tokens=100)
print(result)

intelligence is ai is is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that is that


Example 3

In [38]:
result = generate_text(model, human_input="Question answering system can ", num_tokens=100)
print(result)

question answering system can can speed to to to to to the the the the the the the the the the the it detection that that that that from from from text-based the the the the the to only had in in found dt users innovation months the the the the to only context engagement in prompt interest a similar generative generative generative design drivers enabling the in only concerns interesting communities action generative generative network ( ) ) ) ) ) ) ) ( ) ) ( ) ) ( ) ( ) , , , , , , , complexity . essays


## Model summary and overview

In [24]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')
print(f'The model has {len(vocab)} tokens')

The model has 3,099,914 trainable parameters
The model has 9482 tokens
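
The reported count can be reproduced by hand from the hyperparameters used above (vocabulary of 9482, embed_size=128, hidden_size=256, two encoder and two decoder layers). The breakdown below is a sketch that assumes the default nn.Transformer layout: biases on every linear layer, two layer norms per encoder layer, three per decoder layer, and one final norm each for the encoder and the decoder.

```python
vocab_size, d_model, ff = 9482, 128, 256
num_encoder_layers = num_decoder_layers = 2

embedding = vocab_size * d_model
output_layer = d_model * vocab_size + vocab_size       # weights + bias
attention = 4 * (d_model * d_model + d_model)           # q, k, v and output projections
feed_forward = (d_model * ff + ff) + (ff * d_model + d_model)
layer_norm = 2 * d_model                                # scale + shift

encoder_layer = attention + feed_forward + 2 * layer_norm
decoder_layer = 2 * attention + feed_forward + 3 * layer_norm  # self- and cross-attention
transformer = (num_encoder_layers * encoder_layer
               + num_decoder_layers * decoder_layer
               + 2 * layer_norm)                        # final encoder and decoder norms

total = embedding + output_layer + transformer
print(f"{total:,}")  # 3,099,914
```

Roughly 2.4 million of the 3.1 million parameters sit in the embedding and output layers, so the vocabulary size dominates the model size here.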


In [25]:
!pip install torchviz


In [26]:
# visualize the model
import torchviz

# Create a dummy batch of token ids with the shape the model expects
x = torch.randint(high=len(vocab), size=(1, 30), dtype=torch.long).to(device)

# Generate a diagram for a specific model
y = model(x)
torchviz.make_dot(y.mean(), params=dict(model.named_parameters()))

ModuleNotFoundError: No module named 'distutils'

In [23]:
for batch in dataloader:
    x, y = batch
    print(x.shape)
    print(y.shape)
    break

torch.Size([128, 30])
torch.Size([128, 30])
