<a target="_blank" href="https://colab.research.google.com/github/raghavbali/mastering_llms_workshop/blob/main/docs/module_02_llm_building_blocks/02_transformers_pipelines.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Transformer Task Pipelines 

## BERT-ology
- BERT, or __[Bidirectional Encoder Representations from Transformers](https://arxiv.org/abs/1810.04805)__, was presented in 2018 by Devlin et al., a team at Google AI.
- Multi-task Learning: BERT also helped push the transfer-learning envelope in the NLP domain by showcasing how a pre-trained model can be fine-tuned for various tasks to provide state-of-the-art performance.
- BERT replaced the usual language-model objective, which predicts the next token from past context only, with objectives that build context from both directions: predicting masked words, along with next sentence prediction.


<img src="../assets/02_bert_models_layout_notebook_3.jpeg">

> source [PLM Papers](https://github.com/thunlp/PLMpapers)

In [None]:
import torch
import transformers
from transformers import pipeline

In [None]:
# Let us define some configs/constants
DISTILBET_BASE_UNCASED_CHECKPOINT = "distilbert/distilbert-base-uncased"
DISTILBET_QA_CHECKPOINT = "distilbert/distilbert-base-uncased-distilled-squad"
DISTILBET_CLASSIFICATION_CHECKPOINT = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"

In [None]:
if torch.cuda.is_available():
    DEVICE = 'cuda'
    Tensor = torch.cuda.FloatTensor
    LongTensor = torch.cuda.LongTensor
    DEVICE_ID = 0
elif torch.backends.mps.is_available():
    DEVICE = 'mps'
    Tensor = torch.FloatTensor
    LongTensor = torch.LongTensor
    DEVICE_ID = 0
else:
    DEVICE = 'cpu'
    Tensor = torch.FloatTensor
    LongTensor = torch.LongTensor
    DEVICE_ID = -1
print(f"Backend Accelerator Device={DEVICE}")

### Predicting the Masked Token
This was a unique objective when BERT was originally introduced as compared to usual NLP tasks such as classification. The objective requires us to prepare a dataset where we mask a certain percentage of input tokens and train the model to learn to predict those tokens. This objective turns out to be very effective in helping the model learn the nuances of language. 
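
The data preparation step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the actual BERT preprocessing code: the function name `mask_tokens`, the whitespace tokenization and the fixed seed are assumptions made for the example, and the 15% rate is the value reported in the BERT paper. The original recipe also sometimes keeps or randomly replaces the selected tokens, which is left out here for brevity.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels holds the original token at masked positions and None elsewhere,
    so the model is only scored on the positions it has to reconstruct.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # always mask at least one token so short inputs still yield a training signal
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = [MASK_TOKEN if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels

# hypothetical whitespace-tokenized input, for illustration only
tokens = "bangalore is the it capital of india".split()
masked, labels = mask_tokens(tokens, mask_prob=0.15)
print(masked)
print(labels)
```

With seven tokens and a 15% rate, exactly one position is masked; which one depends on the seed.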

In this first task we will test the pre-trained model against this objective itself. For each candidate, the pipeline returns the predicted token (`token_str`), its encoded vocabulary index (`token`), the completed sentence (`sequence`) and a `score` which indicates the model's confidence.

In [None]:
mlm_pipeline = pipeline(
    'fill-mask',
    model=DISTILBET_BASE_UNCASED_CHECKPOINT,
    device=DEVICE_ID
)
mlm_pipeline("Bangalore is the IT [MASK] of India.")

### Question Answering
This is an interesting NLP task and quite a complex one as well. For this task, the model is given an input consisting of a context and a question, and it predicts the answer by selecting a span of text from the context. The training setup for this task is a bit involved; the following is an overview:
- The training input is a triplet of context, question and answer
- This is transformed into a combined input of the form ``[CLS]question[SEP]context[SEP]`` or ``[CLS]context[SEP]question[SEP]``, with the answer acting as the label
- The model is trained to predict the start and end indices of the corresponding answer for each input.
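
The steps above can be sketched as follows. This is a simplified illustration under stated assumptions: `build_qa_example` is a hypothetical helper, tokens are produced by lowercasing and splitting on whitespace (a real setup would use the model's subword tokenizer), and the answer is assumed to appear verbatim in the context.

```python
def build_qa_example(question, context, answer):
    """Build a [CLS] question [SEP] context [SEP] token list plus answer span labels."""
    q_tokens = question.lower().split()
    c_tokens = context.lower().split()
    a_tokens = answer.lower().split()

    tokens = ["[CLS]"] + q_tokens + ["[SEP]"] + c_tokens + ["[SEP]"]
    # index of the first context token inside the combined input
    offset = len(q_tokens) + 2

    # locate the first occurrence of the answer tokens inside the context
    for i in range(len(c_tokens) - len(a_tokens) + 1):
        if c_tokens[i:i + len(a_tokens)] == a_tokens:
            start = offset + i
            end = start + len(a_tokens) - 1  # inclusive end index
            return tokens, start, end
    raise ValueError("answer not found in context")

tokens, start, end = build_qa_example(
    question="Who presented BERT?",
    context="BERT was presented by Devlin et al. in 2018",
    answer="Devlin et al.",
)
print(tokens)
print(start, end, tokens[start:end + 1])
```

The `start` and `end` values are the labels the model learns to predict; at inference time the answer is read back out of the context using the predicted indices.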


For our current setting, we will leverage both _pretrained_ and _fine-tuned_ versions of **DistilBERT** via the _question-answering_ pipeline and understand the performance difference.

In [None]:
qa_ft_pipeline = pipeline(
    'question-answering',
    model=DISTILBET_QA_CHECKPOINT,
    device=DEVICE_ID
)
qa_pt_pipeline = pipeline(
    'question-answering',
    model=DISTILBET_BASE_UNCASED_CHECKPOINT,  # pretrained-only checkpoint (no QA fine-tuning)
    device=DEVICE_ID
)

In [None]:
# we use a snippet about BERT like models from the module itself
context = """The key contribution from this set of models is the masked language modeling objective during the pre-training phase, where some tokens in the input are masked, and the model is trained to predict them (we will cover these in the upcoming section). Key works in this group of architectures are BERT, RoBERTa (or optimized BERT), DistilBERT (lighter and more efficient BERT), ELECTRA and ALBERT.
In this notebook we will work through the task of Question Answering where our language model will learn to answer questions based on the context provided."""
question = "What are the key works in this set of models?"

In [None]:
ft_qa_result = qa_ft_pipeline(
    question=question,
    context=context
)

pt_qa_result = qa_pt_pipeline(
    question=question,
    context=context
)

In [None]:
print("*"*55)
print(f"Context:{context}")
print("*"*55)
print(f"Question:{question}")
print("-"*55)
print(f"Response from Fine-Tuned Model:\n{ft_qa_result}")
print()
print(f"Response from Pretrained Model:\n{pt_qa_result}")

# Generative Pretraining

## Behold, it's GPT (Generative Pre-Training)

The first model in this series is called GPT, or Generative Pre-Training. It was released in [2018](https://openai.com/blog/language-unsupervised/), about the same time as the BERT model. The paper presents a task-agnostic architecture based on the ideas of transformers and unsupervised learning.

- GPT is essentially a language model based on the __transformer-decoder__ 
- Introduction of large training datasets: __BookCorpus__ dataset contains over 7,000 unique, unpublished books across different genres
- The GPT architecture makes use of 12 decoder blocks (as opposed to 6 in the original transformer) with 768-dimensional states and 12 self-attention heads each.
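
The figures in the last bullet allow a quick back-of-the-envelope calculation. The sketch below uses only the three numbers from the text plus one labeled assumption (a feed-forward layer four times wider than the hidden state, as in the original transformer). It counts weight matrices in the decoder blocks only and ignores biases, layer normalization and embeddings, so it is a rough estimate rather than the model's official parameter count.

```python
n_layers = 12   # decoder blocks (from the text)
d_model = 768   # hidden state size (from the text)
n_heads = 12    # self-attention heads per block (from the text)
ffn_mult = 4    # ASSUMPTION: feed-forward inner size is 4 x d_model

# each head works on an equal slice of the hidden state
head_dim = d_model // n_heads

# query, key, value and output projections: four d_model x d_model matrices
attn_weights = 4 * d_model * d_model
# two linear layers: d_model -> ffn_mult * d_model -> d_model
ffn_weights = 2 * d_model * (ffn_mult * d_model)

block_weights = attn_weights + ffn_weights
total_block_weights = n_layers * block_weights

print(f"per-head dimension: {head_dim}")
print(f"weights per block:  {block_weights:,}")
print(f"weights in blocks:  {total_block_weights:,}")
```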


### GPT-2
- Radford et al. presented the GPT-2 model in 2019 as part of their work titled [Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
- The model achieves state-of-the-art performance on several language modeling benchmarks in a zero-shot setting, i.e. without any task-specific fine-tuning
- Similar to GPT, the secret sauce for GPT-2 is its dataset. The authors prepared a massive 40 GB dataset by crawling 45 million outbound links from a social networking site called Reddit.
- The vocabulary was also expanded to 50,257 byte-level BPE tokens and the context window was expanded to 1,024 tokens (as compared to 512 for GPT).


### GPT-3
- OpenAI published a paper titled [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165) in May 2020.
- This paper introduces the mammoth __175 billion-parameter GPT-3 model__.
- Apart from more layers and parameters, this model made use of sparse attention.
- The dataset again played a key role: the model was trained on 300 billion tokens drawn from existing datasets like Common Crawl (filtered for better content), WebText2 (a larger version of WebText used for GPT-2), Books1 and Books2, and Wikipedia.

## Language Modeling
By far the most widely used application from the NLP world is language modeling. We use it daily on our phone keyboards, email applications and a ton of other places.

In simple words, a language model takes some text as input context and generates the next set of words as output. This is interesting because a language model tries to understand the input context and the language structure (though in a very naive way) to predict the next word(s). We use it all the time in the form of text completion utilities on search engines, chat platforms, emails, etc. Language models are a perfect real-life application of NLP and showcase the power of sequence models such as RNNs and transformers.

Language models can be trained in different ways. The most common and widely used method is the sliding window approach. The model takes a small window of text as input and tries to predict the next word as the output. The following figure illustrates the same visually.

<img src="../assets/02_lm_training_notebook_3.png">
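
The sliding window idea can also be sketched in code. This is a minimal illustration, assuming whitespace-separated words and a hypothetical helper named `sliding_window_pairs`; real training pipelines work on subword token ids and much longer windows.

```python
def sliding_window_pairs(text, window_size=3):
    """Return (context_words, next_word) training pairs from whitespace-tokenized text."""
    words = text.split()
    pairs = []
    # slide the window one word at a time; the word right after it is the target
    for i in range(len(words) - window_size):
        pairs.append((words[i:i + window_size], words[i + window_size]))
    return pairs

pairs = sliding_window_pairs("the model tries to predict the next word", window_size=3)
for context_words, next_word in pairs:
    print(context_words, "->", next_word)
```

Each pair is one training example: the window is the input and the following word is the label.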

### Pretrained GPT-2 for Text Generation

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

In [None]:
# generative tasks are not available through MPS/Apple Silicon
DEVICE = 'cpu'
Tensor = torch.FloatTensor
LongTensor = torch.LongTensor
DEVICE_ID = -1
print(f"Backend Accelerator Device={DEVICE}")

In [None]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# add the EOS token as PAD token to avoid warnings
model = AutoModelForCausalLM.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id).to(DEVICE)

In [None]:
# encode context the generation is conditioned on
model_inputs = tokenizer('The king of England is', return_tensors='pt').to(DEVICE)

# generate 40 new tokens
greedy_output = model.generate(**model_inputs, max_new_tokens=40)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))
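
The variable name `greedy_output` suggests greedy decoding: at every step the most probable next token is appended to the context and the process repeats. The sketch below shows that loop on a toy lookup table. The table `TOY_NEXT_TOKEN_PROBS`, its probabilities and the function `greedy_generate` are all hypothetical and made up for illustration; they do not come from GPT-2, which conditions on the whole context rather than only the previous token.

```python
# Hypothetical toy "model": next-token probabilities keyed by the previous token.
TOY_NEXT_TOKEN_PROBS = {
    "the": {"king": 0.6, "queen": 0.4},
    "king": {"of": 0.9, "is": 0.1},
    "of": {"england": 0.7, "spain": 0.3},
    "england": {"is": 0.8, "<eos>": 0.2},
    "is": {"<eos>": 1.0},
}

def greedy_generate(prompt_tokens, max_new_tokens=5):
    """Extend prompt_tokens by repeatedly picking the highest-probability next token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = TOY_NEXT_TOKEN_PROBS.get(tokens[-1])
        if not probs:
            break  # the toy table has no entry for this token
        next_token = max(probs, key=probs.get)
        if next_token == "<eos>":
            break  # stop at the end-of-sequence marker
        tokens.append(next_token)
    return tokens

print(" ".join(greedy_generate(["the"])))
```

Because the choice at each step is deterministic, the same prompt always yields the same continuation.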

---
## Recap
- **BERT and DistilBERT**: The notebook introduces BERT (Bidirectional Encoder Representations from Transformers) and its variants, explaining their unique masked language modeling objective and transfer learning capabilities. It demonstrates using pipelines to predict masked tokens and perform question answering tasks with both pre-trained and fine-tuned versions of DistilBERT.
- **GPT Series**: The notebook covers the evolution of GPT models (Generative Pre-Training), including GPT, GPT-2, and GPT-3. It highlights their architectures, datasets, and achievements in language modeling, emphasizing their role as unsupervised multi-task learners.
- **Language Modeling**: The notebook discusses the concept of language modeling, its applications in text completion, and the use of sliding window approaches for training models. It also provides an example using a pre-trained GPT2 model to generate text based on given input context.