# Text Generation

<a target="_blank" href="https://colab.research.google.com/github/raghavbali/llm_workshop/blob/main/module_02/solutions/02_simple_text_generator.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In [2]:
import torch

In [25]:
# if you are on apple silicon execute the below command before starting jupyter
#export PYTORCH_ENABLE_MPS_FALLBACK=1
if torch.cuda.is_available():
    DEVICE = 'cuda'
    Tensor = torch.cuda.FloatTensor
    LongTensor = torch.cuda.LongTensor
    DEVICE_ID = 0
# Some Causal Modeling Ops are not available on MPS yet 
# elif torch.backends.mps.is_available():
#     DEVICE = 'mps'
#     Tensor = torch.FloatTensor
#     LongTensor = torch.LongTensor
#     DEVICE_ID = 0
else:
    DEVICE = 'cpu'
    Tensor = torch.FloatTensor
    LongTensor = torch.LongTensor
    DEVICE_ID = -1
print(f"Backend Accelerator Device={DEVICE}")

Backend Accelerator Device=cpu


In [6]:
import time
import datetime

In [7]:
import pandas as pd
import numpy as np
import transformers
from numpy import random
from transformers import GPT2Tokenizer, GPT2LMHeadModel, GPT2Config
from transformers import get_linear_schedule_with_warmup
from torch.optim import AdamW

In [8]:
from torch.utils.data import Dataset, DataLoader
from torch.utils.data import random_split, RandomSampler, SequentialSampler
torch.manual_seed(42)

<torch._C.Generator at 0x1177d40f0>

In [9]:
print(transformers.__version__)

4.42.3


In [27]:
# colab/gpu systems
!nvidia-smi
# without an NVIDIA GPU, use htop (Linux) or Activity Monitor (macOS) instead

zsh:1: command not found: nvidia-smi


## Get Data
We will fine-tune a pre-trained GPT-2 model on the same dataset we used earlier. But wait, what does pre-trained mean?

In [11]:
!wget -O sherlock_homes.txt http://www.gutenberg.org/files/1661/1661-0.txt

--2024-07-28 00:53:52--  http://www.gutenberg.org/files/1661/1661-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 2610:28:3090:3000:0:bad:cafe:47, 152.19.134.47
Connecting to www.gutenberg.org (www.gutenberg.org)|2610:28:3090:3000:0:bad:cafe:47|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.gutenberg.org/files/1661/1661-0.txt [following]
--2024-07-28 00:53:52--  https://www.gutenberg.org/files/1661/1661-0.txt
Connecting to www.gutenberg.org (www.gutenberg.org)|2610:28:3090:3000:0:bad:cafe:47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607504 (593K) [text/plain]
Saving to: ‘sherlock_homes.txt’


2024-07-28 00:53:53 (1.21 MB/s) - ‘sherlock_homes.txt’ saved [607504/607504]



In [12]:
filename = "sherlock_homes.txt"
# read the whole file, closing it afterwards, then keep only a slice of the text
with open(filename, 'r', encoding='utf-8') as f:
    raw_text = f.read()
text = raw_text[1450:100000]

## Foundation & Pre-trained Models

**Foundation models** are models trained from scratch on a large corpus of data. In the context of NLP, they are designed to learn the fundamental patterns, structures, and representations of natural language. Foundation models are typically trained with unsupervised (self-supervised) objectives, such as language modeling, where the model predicts the next word in a sentence, or autoencoding, where it reconstructs the original sentence from a corrupted or masked version.
Models such as GPT, BERT, and T5 are typical examples of foundation models.


Instances of foundation models that have been further trained on specific downstream tasks or datasets are termed **Pre-Trained Models** in this notebook (elsewhere they are often called fine-tuned models). They leverage the knowledge learned by the foundation model and are fine-tuned on task-specific data to perform well on specific NLP tasks, such as text classification, named entity recognition, machine translation, and sentiment analysis.
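
To make the language-modeling objective concrete, here is a minimal sketch of how the training pairs for next-word prediction line up. It is only an illustration: it splits on whitespace instead of using a real tokenizer, and `next_token_pairs` is a hypothetical helper, not part of any library.

```python
def next_token_pairs(tokens):
    """Return (context, target) pairs for next-token prediction."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Toy "tokenization": whitespace split (an assumption for illustration only).
tokens = "to sherlock holmes she is always the woman".split()
pairs = next_token_pairs(tokens)

# Each context is everything seen so far; the target is the word that follows.
for context, target in pairs[:3]:
    print(context, "->", target)
```

A causal model such as GPT-2 learns from all of these pairs at once: the labels are simply the input sequence shifted by one position.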

In [13]:
BOS_TOKEN = '<|sot|>'
EOS_TOKEN = '<|eot|>'
PAD_TOKEN = '<|pad|>'
MODEL_NAME = "raghavbali/gpt2_ft_sherlock_holmes"

In [14]:
# first, let us get the tokenizer object
tokenizer = GPT2Tokenizer.from_pretrained(MODEL_NAME,
                                          bos_token=BOS_TOKEN,
                                          eos_token=EOS_TOKEN,
                                          pad_token=PAD_TOKEN
                                          )

## Prepare Dataset

In [15]:
class GPT2Dataset(Dataset):

  def __init__(self, txt_list, tokenizer, max_length=768):

    self.tokenizer = tokenizer
    self.input_ids = []
    self.attn_masks = []

    for txt in txt_list:

      encodings_dict = tokenizer(
          BOS_TOKEN + txt + EOS_TOKEN,  # wrap each sample in start/end markers
          truncation=True,
          max_length=max_length,
          padding="max_length"
          )

      self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
      self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

  def __len__(self):
    return len(self.input_ids)  # number of encoded samples

  def __getitem__(self, idx):
    return self.input_ids[idx], self.attn_masks[idx]
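
The `padding="max_length"` and `attention_mask` pieces of the dataset class can be pictured with a small sketch. It assumes right-side padding and uses made-up token ids; `pad_and_mask` is a hypothetical helper written for this illustration, not the tokenizer's actual implementation.

```python
def pad_and_mask(token_ids, max_length, pad_id):
    """Truncate or right-pad ids to max_length and build the attention mask."""
    ids = token_ids[:max_length]            # truncation=True
    mask = [1] * len(ids)                   # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding


# Made-up ids; pad_id=0 is an assumption for this toy example.
ids, mask = pad_and_mask([11, 12, 13], max_length=5, pad_id=0)
print(ids)
print(mask)
```

The mask tells the model which positions hold real tokens, so that attention ignores the padded tail of short lines.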

In [16]:
# keep the batch size small so that training fits in Colab memory
BATCH_SIZE = 3

In [26]:
dataset = GPT2Dataset(text.split('\n'),
                      tokenizer, max_length=768)

# Split into training and validation sets
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size

train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

print('{:>5,} training samples'.format(train_size))
print('{:>5,} validation samples'.format(val_size))

1,949 training samples
  217 validation samples


In [27]:
# Create the DataLoaders for our training and validation datasets.
train_dataloader = DataLoader(
            train_dataset,
            sampler = RandomSampler(train_dataset),  # shuffle training samples
            batch_size = BATCH_SIZE
        )

validation_dataloader = DataLoader(
            val_dataset,
            sampler = SequentialSampler(val_dataset),  # keep validation order fixed
            batch_size = BATCH_SIZE
        )

## Setup Model Object

In [28]:
# Training Params
epochs = 1  # 3 seems good if you train from the gpt2 checkpoint
learning_rate = 5e-4
# number of initial steps over which the learning rate ramps up linearly
warmup_steps = 100
epsilon = 1e-8

# generate output after N steps
sample_every = 100

In [29]:
# Set Config
configuration = GPT2Config.from_pretrained(MODEL_NAME,
                                           output_hidden_states=False)

# instantiate the model
model = GPT2LMHeadModel.from_pretrained(MODEL_NAME, config=configuration,)

# NOTE: the tokenizer defines custom BOS, EOS and PAD tokens, so the embedding matrix must be resized to match its vocabulary
model.resize_token_embeddings(len(tokenizer))
model = model.to(DEVICE)

In [30]:
optimizer = AdamW(model.parameters(),
                  lr = learning_rate,
                  eps = epsilon
                )

In [31]:
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps = warmup_steps,
                                            num_training_steps = total_steps)
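
The shape of a linear warmup schedule can be sketched in a few lines. This is an approximation of the behavior, written from the usual definition (linear ramp up, then linear decay to zero) rather than copied from the library; `linear_warmup_multiplier` is a hypothetical name. The values 100 and 650 are assumptions that match the run above (1,949 training samples at batch size 3 for one epoch).

```python
def linear_warmup_multiplier(step, warmup_steps, total_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # ramp up linearly from 0 to 1 during warmup
        return step / max(1, warmup_steps)
    # then decay linearly from 1 to 0 over the remaining steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 5e-4
warmup_steps, total_steps = 100, 650

for step in (0, 50, 100, 375, 650):
    lr = base_lr * linear_warmup_multiplier(step, warmup_steps, total_steps)
    print(f"step {step:>3}: lr = {lr:.2e}")
```

Starting with a small learning rate keeps the first updates from disturbing the pre-trained weights too much, which is the usual motivation for warmup when fine-tuning.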

In [32]:
def format_time(elapsed):
    return str(datetime.timedelta(seconds=int(round((elapsed)))))

In [None]:
total_t0 = time.time()
training_stats = []


for epoch_i in range(0, epochs):

    # Training
    print("*"*25)
    print('>> Epoch {:} / {:} '.format(epoch_i + 1, epochs))
    print("*"*25)

    t0 = time.time()
    total_train_loss = 0

    # switch the model to training mode (enables dropout)
    model.train()
    for step, batch in enumerate(train_dataloader):

        b_input_ids = batch[0].to(DEVICE)
        b_labels = batch[0].to(DEVICE)
        b_masks = batch[1].to(DEVICE)

        model.zero_grad()

        outputs = model(  b_input_ids,
                          labels=b_labels,
                          attention_mask = b_masks,
                          token_type_ids=None
                        )

        loss = outputs[0]

        batch_loss = loss.item()
        total_train_loss += batch_loss

        # Get sample every x batches.
        if step % sample_every == 0 and not step == 0:

            elapsed = format_time(time.time() - t0)
            print('  Batch {:>5,}  of  {:>5,}. Training Loss: {:>5,}.   Time Taken: {:}.'.format(step,
                                                                                     len(train_dataloader),
                                                                                     batch_loss,
                                                                                     elapsed))

            model.eval()

            sample_outputs = model.generate(
                                    do_sample=True,
                                    top_k=50,
                                    max_length = 200,
                                    top_p=0.95,
                                    num_return_sequences=1,
                                    pad_token_id=tokenizer.eos_token_id
                                )
            for i, sample_output in enumerate(sample_outputs):
                  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

            model.train()

        loss.backward()
        optimizer.step()

        scheduler.step()

    # Average Loss
    avg_train_loss = total_train_loss / len(train_dataloader)

    # training time
    training_time = format_time(time.time() - t0)

    print("Average training loss: {0:.2f}".format(avg_train_loss))
    print("Training epoch time: {:}".format(training_time))

    # Validation
    t0 = time.time()

    model.eval()
    total_eval_loss = 0
    nb_eval_steps = 0

    for batch in validation_dataloader:

        b_input_ids = batch[0].to(DEVICE)
        b_labels = batch[0].to(DEVICE)
        b_masks = batch[1].to(DEVICE)

        with torch.no_grad():

            outputs = model(b_input_ids,
                            attention_mask=b_masks,
                            labels=b_labels)

            loss = outputs[0]

        batch_loss = loss.item()
        total_eval_loss += batch_loss

    avg_val_loss = total_eval_loss / len(validation_dataloader)

    validation_time = format_time(time.time() - t0)

    print("  Validation Loss: {0:.2f}".format(avg_val_loss))
    print("  Validation time: {:}".format(validation_time))

    # Record all statistics from this epoch.
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'train_loss': avg_train_loss,
            'val_loss': avg_val_loss,
            'train_time': training_time,
            'val_time': validation_time
        }
    )

print("Training Completed")
print("Total training time {:} (h:mm:ss)".format(format_time(time.time()-total_t0)))

*************************
>> Epoch 1 / 1 
*************************


In [None]:
df_stats = pd.DataFrame(data=training_stats)
df_stats

## Save the Model

In [None]:
import os

In [None]:
output_dir = './model_save/'

if not os.path.exists(output_dir):
    os.makedirs(output_dir)

model_to_save = model.module if hasattr(model, 'module') else model
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

In [51]:
model.eval()

prompt = "the King of England"

generated = torch.tensor(tokenizer.encode(BOS_TOKEN+prompt)).unsqueeze(0)
generated = generated.to(DEVICE)

sample_outputs = model.generate(
                                generated,
                                do_sample=True,
                                top_k=50,
                                max_length = generated.shape[1] + 50,  # prompt length + 50; len(generated) would be the batch size (1)
                                top_p=0.92,
                                num_return_sequences=3,
                                pad_token_id=tokenizer.eos_token_id,
                                temperature=0.8,
                                )

for i, sample_output in enumerate(sample_outputs):
  print("{}: {}\n\n".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

0: the King of England, and he was as good a queen, as if she had


1: the King of England, with a face of a royal-fancy one, and a thick,


2: the King of England.”




In [48]:
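# compare output to the model loaded with its default tokenizer settings
# NOTE: MODEL_NAME is the fine-tuned checkpoint; pass "gpt2" instead to compare against the base GPT-2 model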
pre_trainedtokenizer = GPT2Tokenizer.from_pretrained(MODEL_NAME)
pre_trainedmodel = GPT2LMHeadModel.from_pretrained(MODEL_NAME)

In [52]:
input_ids = pre_trainedtokenizer.encode(prompt, return_tensors="pt")

# Generate text
output = pre_trainedmodel.generate(
    input_ids,
    bos_token_id=random.randint(1,30000),
    max_length=len(input_ids[0]) + 50,
    num_return_sequences=1,
    pad_token_id=pre_trainedtokenizer.eos_token_id,
    do_sample=True,
    top_p=0.92,  # Adjust the sampling parameters as needed
    temperature=0.8,
)

In [53]:
pre_trainedtokenizer.decode(output[0], skip_special_tokens=True)

'the King of England was a good fellow, for I was the better'

## Decoding Strategies

The ``generate()`` utility we used above is autoregressive: every output prediction is fed back as input for the next time step. The simplest way to choose that prediction is to always pick the highest-probability token, which is called __Greedy Decoding__. (The calls above set ``do_sample=True``, so they sample from the distribution instead.) Greedy decoding is fast and simple, but it has its issues.

Focusing only on the highest-probability output narrows the model's focus to just the next step, which in turn may result in inconsistent output or non-dictionary words.
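
A minimal sketch of greedy decoding on a toy model shows this short-sightedness. `TOY_MODEL` is a made-up table of next-token probabilities (an assumption for illustration, not output from GPT-2), and `greedy_decode` is a hypothetical helper.

```python
# Hypothetical toy "model": next-token probabilities given the previous token.
TOY_MODEL = {
    "<s>": {"the": 0.55, "a": 0.45},
    "the": {"king": 0.40, "queen": 0.35, "end": 0.25},
    "a": {"queen": 0.90, "king": 0.05, "end": 0.05},
    "king": {"end": 1.0},
    "queen": {"end": 1.0},
}


def greedy_decode(model, start="<s>", stop="end", max_steps=10):
    """Always append the single most likely next token."""
    sequence = [start]
    for _ in range(max_steps):
        probs = model[sequence[-1]]
        next_token = max(probs, key=probs.get)
        sequence.append(next_token)
        if next_token == stop:
            break
    return sequence


print(greedy_decode(TOY_MODEL))
```

Greedy decoding commits to "the" because it is the best first step, and ends with a sequence of probability 0.55 × 0.40 = 0.22. The path through "a" and "queen" has probability 0.45 × 0.90 = 0.405, but greedy decoding never looks at it.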

### Beam Search
Beam search is the obvious next step to improve the output predictions from the model. Instead of being greedy, beam search keeps track of n candidate paths (beams) at any given time and finally selects the path with the highest overall probability.

<img src="../assets/beamsearch_nb_2.png">
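
The idea can be sketched with a small beam search over a toy model. `TOY_MODEL` is a made-up table of next-token probabilities and `beam_search` is a hypothetical helper. It is a simplified version: it scores beams by summed log-probability and leaves out refinements such as length normalization.

```python
import math

# Hypothetical toy "model": next-token probabilities given the previous token.
TOY_MODEL = {
    "<s>": {"the": 0.55, "a": 0.45},
    "the": {"king": 0.40, "queen": 0.35, "end": 0.25},
    "a": {"queen": 0.90, "king": 0.05, "end": 0.05},
    "king": {"end": 1.0},
    "queen": {"end": 1.0},
}


def beam_search(model, start="<s>", stop="end", num_beams=2, max_steps=10):
    """Keep the num_beams best partial sequences at every step."""
    beams = [([start], 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(max_steps):
        candidates = []
        for seq, score in beams:
            if seq[-1] == stop:
                candidates.append((seq, score))  # finished beams carry over
                continue
            for token, p in model[seq[-1]].items():
                candidates.append((seq + [token], score + math.log(p)))
        # keep only the best num_beams candidates
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
        if all(seq[-1] == stop for seq, _ in beams):
            break
    return beams


for seq, score in beam_search(TOY_MODEL):
    print(" ".join(seq), f"(p = {math.exp(score):.3f})")
```

With two beams, the search keeps both "the" and "a" alive after the first step, so it can discover that the path through "a" has the higher overall probability even though "the" looked better at first.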

### Other Key Decoding Strategies:
- Sampling
- Top-k Sampling
- Nucleus Sampling
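
The three strategies above can be sketched on a toy next-token distribution. The probabilities are made up, and `top_k_filter`, `top_p_filter`, and `sample` are hypothetical helpers written for this illustration. They show the idea, not how `generate()` implements it internally.

```python
import random


def top_k_filter(probs, k):
    """Top-k: keep the k most likely tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def top_p_filter(probs, top_p):
    """Nucleus: keep the smallest set of top tokens whose mass reaches top_p."""
    kept, cumulative = [], 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def sample(probs, rng):
    """Plain sampling: draw one token according to its probability."""
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]


# Made-up next-token distribution for illustration.
probs = {"king": 0.5, "queen": 0.3, "prince": 0.15, "zebra": 0.05}
rng = random.Random(42)

print(top_k_filter(probs, k=2))
print(top_p_filter(probs, top_p=0.9))
print(sample(top_p_filter(probs, top_p=0.9), rng))
```

Both filters cut off the unlikely tail ("zebra") before sampling. Top-k keeps a fixed number of tokens, while nucleus sampling keeps however many are needed to cover the requested probability mass. The `generate()` calls earlier in this notebook pass both `top_k` and `top_p`.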

### Temperature
Though sampling helps bring in the required amount of randomness, it is not free from issues. Random sampling leads to gibberish and incoherence at times. To control the amount of randomness, we introduce __temperature__. A low temperature (below 1) increases the likelihood of high-probability terms and reduces the likelihood of low-probability ones, which leads to sharper distributions; a high temperature flattens the distribution instead.

> High temperature leads to more randomness while lower temperature brings in predictability.
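
As a minimal sketch of the effect, temperature can be applied by dividing the logits by it before the softmax. The logits below are made up, and `softmax_with_temperature` is a hypothetical helper.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for three candidate tokens.
logits = [2.0, 1.0, 0.1]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

At temperature 0.5 the top token takes a larger share of the probability mass than at 1.0, and at 2.0 the three probabilities move closer together.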


In [None]:
prompt = "the King of England"

generated = tokenizer.encode(BOS_TOKEN+prompt,return_tensors='pt')
generated = generated.to(DEVICE)

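# beam search: encode() returns a tensor of ids, so pass it positionally (not with **)
beam_output = model.generate(
    generated,
    max_new_tokens=40,
    num_beams=5,
    num_return_sequences=5,
    early_stopping=True,
    pad_token_id=tokenizer.eos_token_id
)

for i, beam in enumerate(beam_output):
    print("{}: {}".format(i, tokenizer.decode(beam, skip_special_tokens=True)))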

## Limitations and What Next?
- Long-Range Context
- Scalability
- Instruction-led generation
- Benchmarking
- Hallucination / Dreaming
