# Large Language Models LLMs-101

Author: Varuni Sastry, Data Science Group, ALCF

# Overview

*   What are LLMs ?
*   LLM Pipeline
*   Models and Tokenizers
*   Batch inference
*   Save and load models
*   ModelHub



# Large Language Models

Language models form the foundation of Natural Language Processing (NLP). A language model is a machine learning model that, once trained on a large text corpus, predicts the most likely next word given the context of the preceding text. LLMs are used for a variety of tasks, including sentence classification, text generation, language translation, question answering, and text summarization.
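
To make "predicting the next word" concrete, here is a minimal sketch of a bigram model: it only counts which word follows which in a tiny corpus. The corpus and the function names are made up for illustration, and this is far simpler than an LLM, which conditions on the whole context rather than on the previous word alone.

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count, for every word, which words follow it in the corpus."""
    follower_counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            follower_counts[current_word][next_word] += 1
    return follower_counts

def predict_next_word(follower_counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = follower_counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Toy corpus (hypothetical data, for illustration only)
toy_corpus = [
    "the ocean view was breathtaking",
    "the ocean was calm",
    "the movie was boring",
]
bigram_model = train_bigram_model(toy_corpus)
print(predict_next_word(bigram_model, "the"))
print(predict_next_word(bigram_model, "movie"))
```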

<img src="https://github.com/argonne-lcf/llm-workshop/blob/main/tutorials/01-llm-101/images/transformer.png?raw=1" width="400">

## Architecture
Large Language Models use the transformer architecture introduced by Vaswani et al. in the paper "**Attention Is All You Need**". The transformer architecture has revolutionized NLP due to its parallelizability, scalability, and ability to capture long-range dependencies in text.

Key components of the transformer architecture include:

*   Input Embeddings: Word embeddings (word vectors) represent words or text as numeric vectors, where words with similar meanings have similar representations.
*   Positional Encoding: Injects information about the position of words in a sequence, helping the model understand word order.
*   Self-Attention Mechanism: Allows the model to weigh the importance of different words in a sentence, enabling it to effectively capture contextual information.
*   Feedforward Neural Networks: Process information from self-attention layers to generate output for each word/token.
*   Layer Normalization and Residual Connections: Aid in stabilizing training and mitigating the vanishing gradient problem.
*   Transformer Blocks: Composed of multiple layers of self-attention and feedforward neural networks, stacked together to form the model.

Though the original transformer architecture used both encoder and decoder stacks, many recent models do away with the encoder stack and use only the decoder.
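
The self-attention component listed above can be sketched in plain Python. This is a simplified illustration, not the full mechanism: real transformers first project the inputs with learned query, key, and value matrices and use several attention heads, both of which are omitted here. The `causal` flag is an assumption added to show the masking used by decoder-only models, where a position may not look at later positions.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values, causal=False):
    """Scaled dot-product attention over lists of vectors.

    With causal=True, position i only attends to positions <= i.
    Returns the output vectors and the attention weights.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if causal and j > i:
                scores.append(float("-inf"))  # hide future positions
            else:
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(d_k))
        weights = softmax(scores)
        all_weights.append(weights)
        # Each output is the attention-weighted average of the value vectors
        outputs.append([
            sum(w * v[dim] for w, v in zip(weights, values))
            for dim in range(len(values[0]))
        ])
    return outputs, all_weights

# Three toy 2-dimensional token embeddings (hypothetical values),
# used directly as queries, keys, and values.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
_, full_weights = self_attention(embeddings, embeddings, embeddings)
_, causal_weights = self_attention(embeddings, embeddings, embeddings, causal=True)
print(full_weights)
print(causal_weights)
```

In the causal case the first token can only attend to itself, so its weight row puts all of its mass on position 0.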

<img src="https://github.com/argonne-lcf/llm-workshop/blob/main/tutorials/01-llm-101/images/TransformerArch.png?raw=1" width="400">

Source: [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf)

## Models
A wide range of models is available today, each with a different architecture, a different number of parameters, and a different training corpus. Well-known examples include GPT-3.5, GPT-4, BLOOM, Llama 7B, and Llama 70B.

<img src="https://github.com/argonne-lcf/llm-workshop/blob/main/tutorials/01-llm-101/images/models.png?raw=1" width="600">

Source: [AILab](https://s10251.pcdn.co/wp-content/uploads/2024/02/2024-Alan-D-Thompson-AI-Bubbles-Planets-Rev-1.png)

## What is Hugging Face and the transformers library?
Several tools and libraries are available for working with Large Language Models. In this tutorial we will look at **"transformers"**, a popular library for natural language understanding and generation tasks, built on top of PyTorch and TensorFlow.

<img src="https://github.com/argonne-lcf/llm-workshop/blob/main/tutorials/01-llm-101/images/HF.png?raw=1" width="200">

Source: [HF](https://huggingface.co/)

Hugging Face is a platform and community that provides open-source libraries, tools, and resources such as pre-trained models and datasets.

Refer to the following links for more information :

*   https://huggingface.co/docs/hub/index
*   https://huggingface.co/docs/transformers/en/index


# LLM Pipeline

Hugging Face's "transformers" library provides pre-built pipelines that users can easily deploy and customize for their specific use cases. These pipelines abstract away the complexities of model integration and allow users to focus on their NLP tasks.

There are three main stages in an NLP pipeline:

*   Preprocessing the data
*   Applying the model
*   Post-processing the outputs

Let's look at an example of such a pipeline.





### Example of pipeline for a classification task

In [1]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
res = classifier("The panoramic view of the ocean was breathtaking")
print(res)

res = classifier(["The movie was boring and too long", "This restaurant is awesome"])
print(res)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'label': 'POSITIVE', 'score': 0.9998416900634766}]
[{'label': 'NEGATIVE', 'score': 0.9997920393943787}, {'label': 'POSITIVE', 'score': 0.9998743534088135}]


### Example of pipeline for a generation task

This is an example of a simple pipeline for a text generation task. The pipeline can be instructed to use a specific model, instead of the task's default model (for example, 'distilbert-base-uncased-finetuned-sst-2-english' for sentiment analysis), by passing the "model" argument.

In [8]:
prompt = "The goal of the Large Language model workshop is to "

generator = pipeline("text-generation", model='gpt2')
res = generator(prompt, max_length=25, num_return_sequences=5)

for each in res:
    print(each)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{'generated_text': 'The goal of the Large Language model workshop is to \xa0create as many new words or nouns as possible and to identify'}
{'generated_text': 'The goal of the Large Language model workshop is to \xa0encourage collaboration among linguists from across the country and Canada and'}
{'generated_text': 'The goal of the Large Language model workshop is to \xa0help educators and students understand language from a wide perspective, creating a'}
{'generated_text': 'The goal of the Large Language model workshop is to \xa0increase the number of participants and learn how the model can be'}
{'generated_text': 'The goal of the Large Language model workshop is to ʻ*\x8f\x9bʻʦ\x8f�'}


Refer to https://huggingface.co/docs/transformers/en/main_classes/pipelines#transformers.pipeline.task for the list of tasks that the pipeline supports.

## Pipeline Components



In [9]:
# STEP 1 : Installations and imports
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig
import torch
import torch.nn.functional as F

#### **How to set up a prompt?**

A "prompt" is a specific input or query provided to a language model. Prompts guide text processing and generation by providing the context the model needs to generate coherent and relevant text.

The choice and structure of the prompt depend on the specific task, the context, and the desired output. Prompts can be "discrete" or "instructive", where they are explicit instructions or questions directed to the language model. They can also be more nuanced, providing suggestions, directions, and context to the model.

We will use very simple prompts in this tutorial section, but we will learn more about prompt engineering and how it helps in optimizing the performance of the model for a given use case in the following tutorials.
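
As a minimal sketch of the two prompt styles described above, the snippet below fills in an instructive template and a more contextual one. The template wording and the `build_prompt` helper are hypothetical and exist only for illustration; they are not part of any library.

```python
# Hypothetical prompt templates, for illustration only.
INSTRUCTIVE_TEMPLATE = (
    "Classify the sentiment of the following review as POSITIVE or NEGATIVE.\n"
    "Review: {review}\n"
    "Sentiment:"
)

CONTEXTUAL_TEMPLATE = (
    "Context: {context}\n"
    "Task: {task}\n"
    "Answer:"
)

def build_prompt(template, **fields):
    """Fill the named placeholders of a template with the given values."""
    return template.format(**fields)

instructive_prompt = build_prompt(
    INSTRUCTIVE_TEMPLATE,
    review="The panoramic view of the ocean was breathtaking.",
)
contextual_prompt = build_prompt(
    CONTEXTUAL_TEMPLATE,
    context="A travel magazine is collecting short reader reviews.",
    task="Summarize the review below in one sentence.",
)
print(instructive_prompt)
print(contextual_prompt)
```

The resulting strings could be passed to a text generation pipeline in the same way as the plain `prompt` string used earlier.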


In [10]:
# STEP 2 : Set up the prompt
input_text = "The panoramic view of the ocean was breathtaking."

#### **Pretrained Models**

The AutoModelForSequenceClassification.from_pretrained() method instantiates a sequence classification model. Refer to https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#automodels for the list of model classes supported.

The "from_pretrained" method downloads the pre-trained weights from the Hugging Face Model Hub or the specified URL if the model is not already cached locally. It then loads the weights into the instantiated model, initializing the model parameters with the pre-trained values.

The model cache contains:
*   model configuration (config.json)
*   pretrained model weights (model.safetensors)
*   tokenizer information (tokenizer.json, vocab.json, merges.txt, tokenizer.model)


<img src="https://github.com/argonne-lcf/llm-workshop/blob/main/tutorials/01-llm-101/images/GPT2_cache.png?raw=1" width="700">


In [11]:
# STEP 3 : Load the pretrained model.
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)
print(config)

DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased-finetuned-sst-2-english",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "finetuning_task": "sst-2",
  "hidden_dim": 3072,
  "id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
  "initializer_range": 0.02,
  "label2id": {
    "NEGATIVE": 0,
    "POSITIVE": 1
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.42.4",
  "vocab_size": 30522
}



#### **Tokenization**

Tokenization is a data preprocessing step which transforms the raw text data into a format suitable for machine learning models. Tokenizers break down raw text into smaller units called **tokens**. These tokens are what is fed into the language models. Based on the type and configuration of the tokenizer, these tokens can be words, subwords, or characters.

**Types of tokenizers:**
*   Character Tokenizers: Split text into individual characters.
*   Word Tokenizers: Split text into words based on whitespace or punctuation.
*   Subword Tokenizers: Split text into subword units, such as morphemes or character n-grams. Common subword tokenization algorithms include Byte-Pair Encoding (BPE), SentencePiece, and WordPiece.
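
The three tokenizer types can be sketched in plain Python. The subword function below is a simplified greedy longest-match split in the style of WordPiece, where a `##` prefix marks a piece that continues a word. The toy vocabulary is an assumption for illustration; a real tokenizer ships with a vocabulary of tens of thousands of entries and handles normalization in more detail.

```python
import re

def char_tokenize(text):
    """Character tokenizer: every character is a token."""
    return list(text)

def word_tokenize(text):
    """Word tokenizer: lowercase, split on whitespace, keep punctuation separate."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched, the word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary (hypothetical, for illustration only)
toy_vocab = {"i", "am", "working", "on", "a", "tutor", "##ial"}
sentence = "I am working on a tutorial"

print(char_tokenize("tutor"))
print(word_tokenize(sentence))
subwords = [
    piece
    for word in word_tokenize(sentence)
    for piece in wordpiece_tokenize(word, toy_vocab)
]
print(subwords)
```

Because "tutorial" is not in the toy vocabulary as a whole word, it is split into the pieces `tutor` and `##ial`.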

<img src="https://github.com/argonne-lcf/llm-workshop/blob/main/tutorials/01-llm-101/images/tokenization.png?raw=1" width="600">

Source: [nlpiation](https://nlpiation.medium.com/how-to-use-huggingfaces-transformers-pre-trained-tokenizers-e029e8d6d1fa)


**Vocabulary**: The "vocabulary" of a model is the set of tokens (words or subwords) that the model has been trained to understand and use. Each token has a one-to-one mapping to a numerical id.

**Special Tokens**: Tokenizers may also include special tokens such as [CLS] (classification token), [SEP] (separator token), [UNK] (unknown token), and [PAD] (padding token). These tokens serve specific purposes in certain NLP tasks and help the model understand the structure of the input.
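
The vocabulary lookup and the role of special tokens can be sketched with a small dictionary. The vocabulary below is a toy assumption with only a handful of entries; the ids are chosen to resemble those of the BERT uncased vocabulary, and the `[UNK]` id of 100 in particular is an assumption. The `encode` and `decode` helpers are hypothetical stand-ins, not the library's methods.

```python
# Toy vocabulary (assumed ids, for illustration only)
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "i": 1045, "am": 2572, "working": 2551, "on": 2006,
    "a": 1037, "tutor": 14924, "##ial": 4818,
}
SPECIAL_TOKENS = {"[PAD]", "[CLS]", "[SEP]"}

def encode(tokens, vocab=TOY_VOCAB):
    """Wrap tokens in [CLS] ... [SEP] and map each one to its id."""
    wrapped = ["[CLS]"] + list(tokens) + ["[SEP]"]
    # Tokens missing from the vocabulary fall back to the [UNK] id
    return [vocab.get(token, vocab["[UNK]"]) for token in wrapped]

def decode(ids, vocab=TOY_VOCAB, skip_special_tokens=True):
    """Map ids back to tokens and merge '##' continuation pieces."""
    id_to_token = {index: token for token, index in vocab.items()}
    tokens = [id_to_token[index] for index in ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL_TOKENS]
    return " ".join(tokens).replace(" ##", "")

ids = encode(["i", "am", "working", "on", "a", "tutor", "##ial"])
print(ids)
print(decode(ids))
print(encode(["i", "am", "sleepy"]))  # "sleepy" is not in the toy vocabulary
```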

**Tokenization Libraries:** Well-known tokenization libraries include Hugging Face Tokenizers, NLTK (Natural Language Toolkit), and spaCy.

**AutoTokenizer.from_pretrained** is a method provided by the Hugging Face Transformers library that loads the tokenizer matching a pretrained model. The tokenizer provides the methods needed to preprocess text for that model and outputs the **input_ids**, the numerical representation of the raw text. For some models the tokenizer returns additional information, such as the **attention mask**, a binary mask indicating which tokens in an input sequence the model should attend to and which it should ignore. Passing return_tensors="pt" returns the output as PyTorch tensors.

Let's look at some of the methods used to tokenize the data.

In [12]:
#STEP 4 : Load the tokenizer and tokenize the input text
tokenizer  =  AutoTokenizer.from_pretrained(model_name)
input_ids = tokenizer(input_text, return_tensors="pt")["input_ids"]
print(input_ids)


tensor([[  101,  1996,  6090,  6525,  7712,  3193,  1997,  1996,  4153,  2001,
          3052, 17904,  1012,   102]])


In [13]:
sequence = "I am working on a tutorial"

# get the vocabulary
vocab = tokenizer.vocab
# Number of entries to print
n = 10

# Print subset of the vocabulary
print("Subset of tokenizer.vocab:")
for i, (token, index) in enumerate(tokenizer.vocab.items()):
    print(f"{token}: {index}")
    if i >= n - 1:
        break

print("Vocab size of the tokenizer = ", len(vocab))
print("------------------------------------------")

# .tokenize chunks the existing sequence into different tokens based on the rules and vocab of the tokenizer.
tokens = tokenizer.tokenize(sequence)
print("Tokens : ", tokens)
print("------------------------------------------")

# .convert_tokens_to_ids, .encode, and calling the tokenizer directly all return numerical ids; .tokenize only returns string tokens.
#  .convert_tokens_to_ids has a 1-1 mapping between tokens and numerical representation
ids = tokenizer.convert_tokens_to_ids(tokens)
print("encoded Ids: ", ids)

# .encode also adds the special tokens that mark the start ([CLS], id 101) and end ([SEP], id 102) of the sequence
print("tokenized sequence : ", tokenizer.encode(sequence))

# Calling the tokenizer directly also returns the attention_mask.
encode = tokenizer(sequence)
print("Encode sequence : ", encode)
print("------------------------------------------")

# .decode decodes the ids to raw text
decode = tokenizer.decode(ids)
print("Decode sequence : ", decode)

Subset of tokenizer.vocab:
##plane: 11751
##dorff: 26559
##‒: 30050
quicker: 19059
mad: 5506
victims: 5694
shout: 11245
cecilia: 18459
##mberg: 29084
gmina: 7061
Vocab size of the tokenizer =  30522
------------------------------------------
Tokens :  ['i', 'am', 'working', 'on', 'a', 'tutor', '##ial']
------------------------------------------
encoded Ids:  [1045, 2572, 2551, 2006, 1037, 14924, 4818]
tokenized sequence :  [101, 1045, 2572, 2551, 2006, 1037, 14924, 4818, 102]
Encode sequence :  {'input_ids': [101, 1045, 2572, 2551, 2006, 1037, 14924, 4818, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
------------------------------------------
Decode sequence :  i am working on a tutorial


In [14]:
# Tokenization with truncation
sequence = "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder."

print(len(sequence))
encoder = (tokenizer(sequence))
print(len(encoder["input_ids"]))
encoder = (tokenizer(sequence, max_length=30,truncation=True))
print(len(encoder["input_ids"]))

144
33
30
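
A rough sketch of what truncation does to an encoded sequence is shown below: the content tokens are cut from the right, and the `[CLS]` and `[SEP]` ids are kept, so `max_length` counts the special tokens too. The `truncate_ids` helper is hypothetical and covers only the single-sequence case; the library's truncation strategies handle more situations.

```python
def truncate_ids(input_ids, max_length, cls_id=101, sep_id=102):
    """Truncate an encoded sequence to max_length, keeping [CLS] and [SEP]."""
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    if len(input_ids) <= max_length:
        return list(input_ids)
    content = input_ids[1:-1]          # drop the original [CLS] and [SEP]
    kept = content[: max_length - 2]   # reserve two slots for the special tokens
    return [cls_id] + kept + [sep_id]

# Ids of "I am working on a tutorial" as printed earlier in this section
encoded = [101, 1045, 2572, 2551, 2006, 1037, 14924, 4818, 102]
truncated = truncate_ids(encoded, max_length=5)
print(len(encoded), len(truncated))
print(truncated)
```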


#### **Perform Inference**

In [15]:
# STEP 5 : Perform inference
outputs = model(input_ids)
result = outputs.logits
print(result)

# STEP 6 :  Interpret the output.
probabilities = F.softmax(result, dim=-1)
print(probabilities)
predicted_class = torch.argmax(probabilities, dim=-1).item()
labels = ["NEGATIVE", "POSITIVE"]
out_string = "[{'label': '" + str(labels[predicted_class]) + "', 'score': " + str(probabilities[0][predicted_class].tolist()) + "}]"
print(out_string)

tensor([[-4.2767,  4.5486]], grad_fn=<AddmmBackward0>)
tensor([[1.4695e-04, 9.9985e-01]], grad_fn=<SoftmaxBackward0>)
[{'label': 'POSITIVE', 'score': 0.9998530149459839}]
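
To see how the logits become the reported score, the softmax step can be reproduced in plain Python. This sketch hard-codes the logits printed above and assumes the same label order as the `labels` list in the code; it illustrates the arithmetic and does not replace `F.softmax`.

```python
import math

def softmax(logits):
    """Numerically stable softmax: exponentiate and normalize to sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-4.2767, 4.5486]          # values printed by the inference step above
labels = ["NEGATIVE", "POSITIVE"]   # same order as in the tutorial code

probabilities = softmax(logits)
predicted_class = max(range(len(probabilities)), key=probabilities.__getitem__)
print(probabilities)
print(labels[predicted_class], probabilities[predicted_class])
```

The larger logit dominates after exponentiation, which is why a gap of roughly 8.8 between the two logits yields a score very close to 1 for the POSITIVE class.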


#### Put it all together !!

In [16]:
# STEP 1 : Installations and imports
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig
import torch
import torch.nn.functional as F

# STEP 2 : Set up the prompt
input_text = "The panoramic view of the ocean was breathtaking."

# STEP 3 : Load the pretrained model.
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)

#STEP 4 : Load the tokenizer and tokenize the input text
tokenizer  =  AutoTokenizer.from_pretrained(model_name)
input_ids = tokenizer(input_text, return_tensors="pt")["input_ids"]

# STEP 5 : Perform inference
outputs = model(input_ids)
result = outputs.logits

# STEP 6 :  Interpret the output.
probabilities = F.softmax(result, dim=-1)
print(probabilities)
predicted_class = torch.argmax(probabilities, dim=-1).item()
labels = ["NEGATIVE", "POSITIVE"]
out_string = "[{'label': '" + str(labels[predicted_class]) + "', 'score': " + str(probabilities[0][predicted_class].tolist()) + "}]"
print(out_string)


tensor([[1.4695e-04, 9.9985e-01]], grad_fn=<SoftmaxBackward0>)
[{'label': 'POSITIVE', 'score': 0.9998530149459839}]


## Batch Inference

Here is an example of batched inference, where multiple samples are fed to the classifier pipeline, which classifies each sentence as positive or negative. The inputs in the batch have different lengths, so `padding=True` pads the shorter inputs with the padding token (id 0 for this tokenizer) up to the length of the longest input, and the `attention_mask` indicates which tokens to attend to and which ones to ignore.

Note: `max_length` is ignored when `padding=True` and there is no truncation strategy. To pad to the maximum length, use `padding='max_length'`.
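
The padding behaviour described above can be sketched in plain Python: shorter sequences are right-padded to the longest one, and the attention mask marks real tokens with 1 and padding with 0. The `pad_batch` helper and the sample ids are assumptions for illustration; the tokenizer does this internally when `padding=True`.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every sequence to the longest one and build attention masks."""
    longest = max(len(ids) for ids in batch_ids)
    padded, masks = [], []
    for ids in batch_ids:
        n_pad = longest - len(ids)
        padded.append(list(ids) + [pad_id] * n_pad)
        masks.append([1] * len(ids) + [0] * n_pad)  # 1 = attend, 0 = ignore
    return {"input_ids": padded, "attention_mask": masks}

# Two already-encoded sequences of different lengths (hypothetical ids)
batch_ids = [
    [101, 2023, 2003, 102],
    [101, 2023, 2003, 1996, 2034, 7099, 102],
]
batch = pad_batch(batch_ids)
for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(ids, mask)
```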





In [17]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer  =  AutoTokenizer.from_pretrained(model_name)

classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

X_train = ['This is the first sample',
           'This is the second sample but I am longest in the batch',
           'This is the last sample but short']

batch = tokenizer(X_train, padding=True, truncation=True, max_length=512, return_tensors='pt')
print(batch)

result = classifier(X_train)
print(result)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


{'input_ids': tensor([[  101,  2023,  2003,  1996,  2034,  7099,   102,     0,     0,     0,
             0,     0,     0,     0],
        [  101,  2023,  2003,  1996,  2117,  7099,  2021,  1045,  2572,  6493,
          1999,  1996, 14108,   102],
        [  101,  2023,  2003,  1996,  2197,  7099,  2021,  2460,   102,     0,
             0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]])}
[{'label': 'POSITIVE', 'score': 0.9916760921478271}, {'label': 'NEGATIVE', 'score': 0.9906133413314819}, {'label': 'NEGATIVE', 'score': 0.9917491674423218}]


In [18]:
# Load the pretrained model.
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)

print("Batched inference with padding ")
# Load the tokenizer and tokenize the input text with padding = True
tokenizer  =  AutoTokenizer.from_pretrained(model_name)
batch = tokenizer(X_train, padding=True, truncation=True, max_length=512, return_tensors='pt')
input_ids = batch["input_ids"]
attention_mask = batch["attention_mask"]
print(batch)

# Perform inference on padded inputs
outputs = model(input_ids, attention_mask=attention_mask)
result = outputs.logits
print(result)

print("________________________________________________________________")

print("Batched inference with Truncation ")

batch = tokenizer(X_train, padding=False, truncation=True, max_length=5, return_tensors='pt')
input_ids = batch["input_ids"]
attention_mask = batch["attention_mask"]
print(batch)
outputs = model(input_ids, attention_mask=attention_mask)
result = outputs.logits
print(result)

Batched inference with padding 
{'input_ids': tensor([[  101,  2023,  2003,  1996,  2034,  7099,   102,     0,     0,     0,
             0,     0,     0,     0],
        [  101,  2023,  2003,  1996,  2117,  7099,  2021,  1045,  2572,  6493,
          1999,  1996, 14108,   102],
        [  101,  2023,  2003,  1996,  2197,  7099,  2021,  2460,   102,     0,
             0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]])}
tensor([[-2.4335,  2.3467],
        [ 2.5392, -2.1198],
        [ 2.5990, -2.1902]], grad_fn=<AddmmBackward0>)
________________________________________________________________
Batched inference with Truncation 
{'input_ids': tensor([[ 101, 2023, 2003, 1996,  102],
        [ 101, 2023, 2003, 1996,  102],
        [ 101, 2023, 2003, 1996,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [1

## Save and Load models and tokenizer

A model can be saved to a local directory and loaded back from it.

In [19]:
from transformers import AutoModel, AutoModelForCausalLM

# Instantiate and train or fine-tune a model
model = AutoModelForCausalLM.from_pretrained("bert-base-uncased")

# Train or fine-tune the model...

# Save the model to a local directory
directory = "my_local_model"
model.save_pretrained(directory)

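# Load the saved model from the local directory.
# AutoModel loads only the base model, so the language-modeling head is dropped
# and the pooler weights, which are not in the checkpoint, are newly initialized.
loaded_model = AutoModel.from_pretrained(directory)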

If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
Some weights of BertModel were not initialized from the model checkpoint at my_local_model and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Model Hub

The Model Hub is a place to:
*   Host model checkpoints, as members of the Hugging Face community do, for simple storage, discovery, and sharing.
*   Download pre-trained models with the huggingface_hub client library, or with Transformers for fine-tuning.
*   Make use of the Inference API to use models in production settings.

You can filter models by task, framework, dataset, and more. Selecting a model opens its model card, which contains information about the model, including its description, usage, and limitations. Some models also have inference APIs that can be used directly.

Model Hub Link : https://huggingface.co/docs/hub/en/models-the-hub

Example of a model card : https://huggingface.co/bert-base-uncased/tree/main
