### Practice: Large Language Models and Their Implications (TO BE UPDATED)

![img](https://i.imgur.com/QGYa2J8.jpeg)

In this notebook, you're gonna play with some of the largest language models on the Internet.

_Based on works of: Tim Dettmers, Artem Chumachenko, Younes Belkada, Felix Marty, Yulian Gilyazev, Gosha Zolotov, Andrey Ishutin, Elena Volf, Artemiy Vishnyakov, Svetlana Shirokovskih._

### Part 1: prompt engineering (3 points total)

In this assignment, we'll use public APIs that host 100B+ models for inference. Your task is to prompt-engineer the model into solving a few tasks for you.


__Which API?__ You are free to use any publicly available API. Here are a few options:

- small LLaMA-2 API - [HF space](https://huggingface.co/spaces/huggingface-projects/llama-2-7b-chat)
- BLOOM API - [bigscience/bloom](https://huggingface.co/bigscience/bloom) (on the right; recommended)
- OpenAI API (via VPN) - [openai.com/api](https://openai.com/api/)
- AI21 Jurassic API - [ai21.com](https://www.ai21.com/blog/announcing-ai21-studio-and-jurassic-1)

These APIs may require you to create a (free) account on their platform. Please note that some APIs also have paid subscriptions. __You do not need to pay them:__ this assignment was designed to be solved using free-tier subscriptions. If no APIs work for you, you can also solve these tasks with the 6.7B model that you will find later in this notebook, but this will make the tasks somewhat harder.

__Quests:__ you will need to solve 4 problems. For each one, please attach a short __description__ of your solution and a __screenshot__ from the API you use. _[If you use Python APIs, show your Python code with outputs]_

__Example:__ Tony is talking to Darth Vader ([BLOOM API](https://huggingface.co/bigscience/bloom)). Black text is written manually, blue text is generated.
<hr>

![img](https://i.imgur.com/a1QhKF7.png)
<hr>

__It is fine to roll back a few times,__ e.g. in the example above, the model first generated Vader lines twice in a row, and we rolled that back. However, if you need more than 1-2 rollbacks per session, you should probably try a different prompt.

__Task 1 (1 pt):__ arrange a conversation between any two of the following:

- a celebrity or politician of your choice
- any fictional character (except Darth Vader)
- yourself

Compare two setups: a) you prompt with character names only; b) you supply additional information (see example).

In [None]:
# <your code OR writeup with screenshots>

__Please choose task 2a or 2b (1pt)__ depending on your model (you can do both, but you will be awarded points for one of these two tasks).

__Task 2a: (for BLOOM or other multilingual model)__ zero-shot translation. Take the first verse of [Edgar Allan Poe's "Raven"](https://www.poetryfoundation.org/poems/48860/the-raven) and __translate it into French.__ (You are free to use any other text of at least the same size)

Original text:

```
Once upon a midnight dreary, while I pondered, weak and weary,
Over many a quaint and curious volume of forgotten lore—
    While I nodded, nearly napping, suddenly there came a tapping,
As of some one gently rapping, rapping at my chamber door.
“’Tis some visitor,” I muttered, “tapping at my chamber door—
            Only this and nothing more.”
```

Verify your translation by converting the French text back into English using a public machine translation service.

__Task 2b: (non-BLOOM):__ toxicity classification for [SetFit/toxic_conversations](https://huggingface.co/datasets/SetFit/toxic_conversations). Make the model solve binary classification (toxic vs not toxic) in few-shot mode. For few-shot examples, use 2-3 toxic and 2-3 non-toxic examples. Measure accuracy on at least 25 samples. You may need to try several different prompts before you find one that works.
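
One possible way to organise the few-shot experiment in Python is sketched below. The helper names (`build_few_shot_prompt`, `parse_label`, `accuracy`), the instruction wording and the `Comment:` / `Label:` layout are illustrative assumptions, not a prescribed format. `generate_fn` stands for whatever API call or local model you use; the demo replaces it with a trivial keyword stub so that the snippet runs offline.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (comment text, gold label)


def build_few_shot_prompt(examples: List[Example], query: str) -> str:
    """Format labelled examples followed by the unlabelled query."""
    lines = ["Classify each comment as toxic or not toxic.", ""]
    for text, label in examples:
        lines += [f"Comment: {text}", f"Label: {label}", ""]
    lines += [f"Comment: {query}", "Label:"]
    return "\n".join(lines)


def parse_label(completion: str) -> str:
    """Map a raw continuation to a label; anything not starting with not/non counts as toxic."""
    first_line = completion.strip().lower().split("\n")[0]
    return "not toxic" if first_line.startswith(("not", "non")) else "toxic"


def accuracy(generate_fn: Callable[[str], str],
             examples: List[Example],
             test_set: List[Example]) -> float:
    """Fraction of test comments whose parsed prediction matches the gold label."""
    hits = sum(
        parse_label(generate_fn(build_few_shot_prompt(examples, text))) == gold
        for text, gold in test_set
    )
    return hits / len(test_set)


def stub_generate(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM call: inspects only the final comment.
    last_comment = prompt.rsplit("Comment: ", 1)[1]
    return " toxic" if "idiot" in last_comment.lower() else " not toxic"


few_shot = [("You are an idiot.", "toxic"),
            ("Thanks for the helpful answer!", "not toxic")]
test_set = [("What an idiot move.", "toxic"),
            ("I enjoyed this article.", "not toxic"),
            ("Nice weather today.", "not toxic"),
            ("Shut up, idiot.", "toxic")]

print(build_few_shot_prompt(few_shot, test_set[0][0]))
print("accuracy:", accuracy(stub_generate, few_shot, test_set))
```

With a real model, swap `stub_generate` for your API call and check how `parse_label` handles the continuations you actually get, since models often keep generating past the label.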

In [None]:
# <your code OR writeup with screenshots>


__Task 3 (1pt):__ create a prompt and few-shot examples that make the model __change the gender pronouns__ of the main actor in a given sentence in any direction of your choice. E.g. the doctor took off _his_ mask <-> the doctor took off _her_ mask.


In [None]:
# <your code OR writeup with screenshots>

__Task 4 (1pt):__ write a prompt and supply examples such that the model would __convert imperial units to metric units__ (miles -> kilometers; mph -> kph). More specifically, the model should rewrite a given sentence and replace all imperial units with their metric equivalents. After it works with basic distances and speed, try to find complicated examples where it does *not* work.

Please note that 1 mile is not equal to 1 km :)
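
To judge the model's rewritten sentences, it helps to have reference values computed exactly. The snippet below is a small sketch of such a checker. The helper names (`to_metric`, `fahrenheit_to_celsius`) and the set of supported units are assumptions made for illustration; the conversion factors themselves are exact by definition (1 mile = 1.609344 km, 1 foot = 0.3048 m, 1 pound = 0.45359237 kg).

```python
# Exact conversion factors from the international unit definitions.
FACTORS = {
    "miles": (1.609344, "km"),
    "mph": (1.609344, "kph"),
    "feet": (0.3048, "m"),
    "pounds": (0.45359237, "kg"),
}


def to_metric(value: float, unit: str, ndigits: int = 1) -> str:
    """Convert a purely multiplicative imperial quantity to a rounded metric string."""
    factor, metric_unit = FACTORS[unit]
    return f"{round(value * factor, ndigits)} {metric_unit}"


def fahrenheit_to_celsius(deg_f: float) -> float:
    """Temperature needs an offset as well as a scale, unlike the units above."""
    return (deg_f - 32) * 5 / 9


print(to_metric(60, "mph"))
print(to_metric(26.2, "miles"))
print(fahrenheit_to_celsius(212))
```

Conversions that are not a single multiplication, such as temperature, are plausible candidates when you look for inputs where the prompt stops working.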

In [None]:
# <your code OR writeup with screenshots>

### Part 2: local inference

Now, let's try and load the strongest model that can fit a typical Colab GPU (T4 with 16 GB as of spring 2023). 

Our best candidates are the smaller versions of the best performing open source models: 
- 7 Bn parameters version of [LLaMA](https://arxiv.org/pdf/2302.13971.pdf) - best for spring 2023, released by Facebook
- 7 Bn parameters version of [Falcon](https://falconllm.tii.ae) - close competitor to Llama, released in May 2023 by [Technology Innovation Institute of UAE](https://www.tii.ae).
- 6.7 Bn parameters version of [OPT](https://arxiv.org/abs/2205.01068) - top choice in this nomination in 2022, released by Facebook.

Beware: while these models are smaller than the ones in API, they're still over 60x larger than the BERT we played with last time. The code below will *just barely* fit into memory, so make sure you don't have anything else loaded. Sometimes you may need to restart runtime for the code to work.
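
A back-of-the-envelope estimate shows why the loading precision matters here. The sketch below counts weight storage only and assumes roughly 7 billion parameters; it ignores activations, the CUDA context and any quantization overhead, so real usage will be higher. The helper name `weight_memory_gib` is made up for this example.

```python
def weight_memory_gib(n_params: float, bytes_per_param: float) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


N_PARAMS = 7e9  # assumption: a "7B" model has about 7 billion parameters

for name, nbytes in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"{name}: {weight_memory_gib(N_PARAMS, nbytes):.1f} GiB")
```

Under these assumptions, fp16 weights alone come to about 13 GiB, which leaves very little headroom on a 16 GB T4, while 8-bit weights take about 6.5 GiB.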

It's a good time to restart your kernel and switch to GPU! (Runtime -> Change runtime type)
<center><img src="https://i.imgur.com/OOfDYzJ.png" width=240px></center>

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') 

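if not torch.cuda.is_available():
    # get_device_capability() fails with an unhelpful error on CPU-only runtimes
    raise RuntimeError("No CUDA GPU found: switch the runtime to GPU (Runtime -> Change runtime type)")
if torch.cuda.get_device_capability() < (7, 5):
    raise ValueError(f"You got a GPU with capability {torch.cuda.get_device_capability()}, need at least (7, 5)")
else: 
    print(f"Pytorch version {torch.__version__} imported OK. \ndevice = {device}")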

# Note: this code requires a Turing GPU or newer. Good: T4, RTX 20xx/30xx, A100/Axx; Bad: K80, P100, V100
# Colab gives you T4. If you get older GPUs, please wait or switch to a new account (don't use both at the same time)

Pytorch version 2.0.1 imported OK. 
device = cuda


In [3]:
%pip install --quiet bitsandbytes==0.39.0 transformers==4.29.2 datasets==2.12.0 accelerate==0.19.0 sentencepiece==0.1.99 einops==0.6.1
!pip list | grep -e "bitsandbytes" -e "transformers" -e datasets -e torch -e accelerate -e sentencepiece -e einops

accelerate                        0.19.0
bitsandbytes                      0.39.0
datasets                          2.12.0
einops                            0.6.1
sentencepiece                     0.1.99
torch                             2.0.1
torchaudio                        2.0.2
torchvision                       0.15.2
transformers                      4.29.2


In [4]:
import random
from typing import List, Tuple  # List is needed by the nucleus_sampling() signature below
import transformers
from datasets import load_dataset
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, LlamaTokenizer, GenerationConfig
import bitsandbytes as bnb
from tqdm.auto import tqdm, trange


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /home/optimus/conda/envs/py9/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /home/optimus/conda/envs/py9/lib/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/optimus/conda/envs/py9/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...


In [None]:
model_name = 'decapoda-research/llama-7b-hf'  # LLaMA-7B, published in March 2023; best in class among openly available models at the time

model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    load_in_8bit=True, 
    device_map='auto',
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True, 
    offload_state_dict=True,
    trust_remote_code=True
    )
# note: the flags `torch_dtype`, `low_cpu_mem_usage`, `offload_state_dict`, `load_in_8bit` slow down the code to save RAM; remove them if you have >32GB RAM

In [None]:
#  loading Llama tokenizer...
tokenizer = LlamaTokenizer.from_pretrained(model_name)  # tokenizers run on CPU, so no device argument is needed
tokenizer.pad_token_id = tokenizer.eos_token_id

In [79]:
for module in model.modules():
    if isinstance(module, bnb.nn.Linear8bitLt):
        module.state.memory_efficient_backward = True

for param in model.parameters():
    param.requires_grad = False  # freeze the model - train adapters later
    if param.ndim == 1:
        # cast the small parameters (e.g. layernorm) to fp32 for stability
        param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # reduce number of stored activations, comment out if you have >32GB RAM

In [80]:
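# Workaround for "RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn":
# with all weights frozen, the embedding output must require grad so that gradients can reach adapters added later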

class AddInputGrad(nn.Sequential):
    def forward(self, x): 
        return super().forward(x).requires_grad_(True)
    
model.model.embed_tokens = AddInputGrad(model.model.embed_tokens)

## Text generation

**Comparison of strategies for language model text generation:**

| Strategy | Description | Pros & Cons |
| --- | --- | --- |
| Greedy Search | Chooses the word with the highest probability as the next word in the sequence. | **Pros:** Simple and fast. <br> **Cons:** Can lead to repetitive and incoherent text. |
| Sampling with Temperature | Introduces randomness in the word selection. A higher temperature leads to more randomness. | **Pros:** Allows exploration and diverse output. <br> **Cons:** Higher temperatures can lead to nonsensical outputs. |
| Nucleus Sampling (Top-p Sampling) | Samples the next word from a truncated vocabulary: the "nucleus", i.e. the smallest set of most probable words whose cumulative probability exceeds a pre-specified threshold (p). | **Pros:** Balances diversity and quality. <br> **Cons:** Setting an optimal 'p' can be tricky. |
| Beam Search | Explores multiple hypotheses (sequences of words) at each step, and keeps the 'k' most likely, where 'k' is the beam width. | **Pros:** Produces more reliable results than greedy search. <br> **Cons:** Can lack diversity and lead to generic responses. |
| Top-k Sampling | Randomly selects the next word from the top 'k' words with the highest probabilities. | **Pros:** Introduces randomness, increasing output diversity. <br> **Cons:** Random selection can sometimes lead to less coherent outputs. |
| Length Normalization | Prevents the model from favoring shorter sequences by dividing the log probabilities by the sequence length raised to some power. | **Pros:** Makes longer and potentially more informative sequences more likely. <br> **Cons:** Tuning the normalization factor can be difficult. |
| Stochastic Beam Search | Introduces randomness into the selection process of the 'k' hypotheses in beam search. | **Pros:** Increases diversity in the generated text. <br> **Cons:** The trade-off between diversity and quality can be tricky to manage. |
| Decoding with Minimum Bayes Risk (MBR) | Chooses the hypothesis (out of many) that minimizes expected loss under a loss function. | **Pros:** Optimizes the output according to a specific loss function. <br> **Cons:** Computationally more complex and requires a good loss function. |
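
To make the temperature and top-k rows of the table concrete, here is a minimal pure-Python sketch on a toy distribution. The vocabulary and logits are invented for illustration, and the helper names (`softmax_with_temperature`, `top_k_sample`) are not part of any library; the real notebook code works on model logits with `torch` instead.

```python
import math
import random
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_sample(probs: List[float], k: int, rng: random.Random) -> int:
    """Sample an index from the k most probable entries, weighted by their probabilities."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in top]  # random.choices renormalizes the weights itself
    return rng.choices(top, weights=weights, k=1)[0]


vocab = ["Russia", "the", "Russian", "and", "Moscow"]  # toy vocabulary
logits = [4.0, 1.5, 0.0, -0.2, -0.8]                   # made-up scores

for t in (0.3, 1.0, 3.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}:", [round(p, 3) for p in probs])

rng = random.Random(0)
probs = softmax_with_temperature(logits)
print([vocab[top_k_sample(probs, 2, rng)] for _ in range(5)])
```

Low temperatures concentrate the probability mass on the top token (approaching greedy search), while high temperatures flatten the distribution. With `k=2` only the two most likely tokens can ever be sampled.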

Documentation references:
- [reference for `AutoModelForCausalLM.generate()`](https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/text_generation#transformers.GenerationMixin.generate)
- [reference for `AutoTokenizer.decode()`](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.decode)
- Huggingface [docs on generation strategies](https://huggingface.co/docs/transformers/generation_strategies)

In [16]:
prompt = 'The first discovered martian lifeform looks like'
max_new_tokens = 64
batch = tokenizer(prompt, return_tensors='pt', return_token_type_ids=False).to(device)
batch

{'input_ids': tensor([[    0,   450,   937, 10943, 14436,   713,  2834,   689,  3430,   763]],
       device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}

#### greedy generation

In [17]:
%%time
with torch.cuda.amp.autocast():
    output_tokens = model.generate(**batch, max_new_tokens=max_new_tokens, do_sample=False)

print(tokenizer.decode(output_tokens[0].cpu()))

<unk>The first discovered martian lifeform looks like a bacteria, but it is not. It is a virus. It is a virus that infects bacteria. It is a virus that infects bacteria and makes them produce a lot of methane. It is a virus that infects bacteria and makes them produce
CPU times: user 11.5 s, sys: 319 ms, total: 11.8 s
Wall time: 11.8 s


#### detailed code for text generation

In [11]:
prompt1 = "Moscow is the capital of"
# prompt1 = "Skippy, a young android, likes to dream about electric"

print(prompt1, '\n')

voc = tokenizer.get_vocab()
voc_rev = {v:k for k, v in voc.items()}  # reverse vocab for decode
past_key_values = None 
n = 5
temperature = 1.0  # change in range 0.1 .. 10 to test effects in sampling generation mode

for i in range(10):
    with torch.cuda.amp.autocast():
        batch1 = tokenizer(prompt1, return_tensors='pt', return_token_type_ids=False).to(device)
        outputs = model.forward(**batch1, use_cache=True,)
        logits = outputs.logits[0, -1, :] / temperature

        probs = torch.nn.functional.softmax(logits, dim=-1)  
        
        # GREEDY GENERATION
        next_token_id = logits.argmax(-1)
        
        # SAMPLING GENERATION
        # next_token_id = torch.multinomial(probs, num_samples=1)  # uncomment for sampling generation

        next_token = tokenizer.decode(next_token_id)
        prompt1 += next_token        
        
        # next_token = voc_rev[next_token_id.item()]  # uncomment to show raw tokens
        
        sorted_probs, sorted_indices = torch.sort(probs, descending=True)
        # cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
        
        top_tokens = sorted_indices[:n]
        for t, p in zip(top_tokens, sorted_probs[:n]):
            t = voc_rev[t.item()]
            # t = tokenizer.decode([t])
            print(f"{t:<10}: {p:.4f} ")
            
        print('\t\t\t', f"{next_token}", end='\n', flush=True)

Moscow is the capital of 

▁Russia   : 0.8382 
▁the      : 0.0732 
▁Russian  : 0.0121 
▁and      : 0.0099 
▁Moscow   : 0.0054 
			 Russia
.         : 0.6013 
▁and      : 0.1331 
,         : 0.0789 
and       : 0.0342 
▁Moscow   : 0.0177 
			 .
▁It       : 0.1925 
<0x0A>    : 0.1844 
▁Moscow   : 0.1163 
▁The      : 0.0547 
▁Russia   : 0.0440 
			 It
▁is       : 0.5878 
is        : 0.1054 
▁was      : 0.0808 
▁has      : 0.0713 
’         : 0.0237 
			 is
located   : 0.1528 
▁the      : 0.1002 
▁located  : 0.0632 
one       : 0.0589 
also      : 0.0575 
			 located
in        : 0.2383 
▁in       : 0.1539 
on        : 0.1491 
▁on       : 0.0970 
▁int      : 0.0774 
			 in
Europe    : 0.5065 
▁the      : 0.1005 
West      : 0.0614 
western   : 0.0426 
S         : 0.0416 
			 Europe
.         : 0.6016 
and       : 0.1175 
on        : 0.0741 
,         : 0.0479 
▁and      : 0.0201 
			 .
It        : 0.1520 
<0x0A>    : 0.1143 
▁Moscow   : 0.0915 
M         : 0.0870 
▁It       : 0.0776 
			 It

#### sampling generation

In [None]:
with torch.cuda.amp.autocast():
    output_tokens = model.generate(**batch, max_new_tokens=max_new_tokens, do_sample=True)

print(tokenizer.decode(output_tokens[0].detach().cpu().numpy()))

#### sampling generation with temperature

In [None]:
# moderate temperature makes generated text more diverse
with torch.cuda.amp.autocast():
    output_tokens = model.generate(**batch, max_new_tokens=max_new_tokens, do_sample=True, temperature = 2.0)

print(tokenizer.decode(output_tokens[0].detach().cpu().numpy()))

In [None]:
# too high a temperature makes the model generate unrelated text
with torch.cuda.amp.autocast():
    output_tokens = model.generate(**batch, max_new_tokens=max_new_tokens, do_sample=True, temperature = 8.0)

print(tokenizer.decode(output_tokens[0].detach().cpu().numpy()))

In [None]:
# Low temperature brings results closer to greedy search output
with torch.cuda.amp.autocast():
    output_tokens = model.generate(**batch, max_new_tokens=max_new_tokens, do_sample=True, temperature = 0.3)

print(tokenizer.decode(output_tokens[0].detach().cpu().numpy()))

#### beam search

In [None]:
# beam search is best for high quality longer generation
with torch.cuda.amp.autocast():
    output_tokens = model.generate(**batch, max_new_tokens=max_new_tokens * 2, num_beams=4)

print(tokenizer.decode(output_tokens[0].detach().cpu().numpy()))
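
Beam search ranks hypotheses by their total log-probability, which tends to favour shorter sequences because every extra token adds a negative term. The toy sketch below illustrates the length normalization described in the strategy table: the score is divided by the sequence length raised to a power. The token probabilities are invented and `sequence_score` is a hypothetical helper; `generate()` exposes a `length_penalty` argument for beam-based generation, but this snippet is not meant to reproduce the library's internals.

```python
import math
from typing import List


def sequence_score(token_logprobs: List[float], length_penalty: float = 0.0) -> float:
    """Sum of token log-probs divided by length ** length_penalty (0.0 means no normalization)."""
    return sum(token_logprobs) / (len(token_logprobs) ** length_penalty)


short_hyp = [math.log(0.5)] * 2  # 2 tokens, each with probability 0.5
long_hyp = [math.log(0.7)] * 5   # 5 tokens, each with probability 0.7

for alpha in (0.0, 1.0):
    s = sequence_score(short_hyp, alpha)
    l = sequence_score(long_hyp, alpha)
    winner = "short" if s > l else "long"
    print(f"length_penalty={alpha}: short={s:.3f} long={l:.3f} -> {winner} wins")
```

Without normalization the two-token hypothesis scores higher even though each of its tokens is less likely; dividing by the length reverses the ranking.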

**Task 5: write code for nucleus sampling generation (2 points)**:

Use the `nucleus_sampling()` template below. Look at the detailed generation code above for inspiration.

**Bonus task: write code for beam search (extra 2 points)**

In [None]:
def nucleus_sampling(model, tokenizer, prompt: str, prob: float = 0.5) -> Tuple[str, List[str]]:
    """generates the next token from the nucleus of tokens with cumulative probability up to param:prob"""
    
    <YOUR CODE HERE>
        
    return sampled_token, possible_tokens

In [37]:
import random
from typing import Tuple

# Tests for nucleus sampling
torch.manual_seed(42)
random.seed(42)

test_prompt = "Elbrus is the highest"
next_token, possible_tokens = nucleus_sampling(model, tokenizer, test_prompt, prob=0.9)
print(test_prompt, next_token, possible_tokens)
assert next_token == 'mountain'
assert {'mountain', 'peak'}.difference(set(possible_tokens)) == set()
assert 3 <= len(possible_tokens) <= 4

test_prompt = "Large language models can learn to"
next_token, possible_tokens = nucleus_sampling(model, tokenizer, test_prompt, prob=0.4)
print(test_prompt, next_token, possible_tokens)
assert next_token == 'generate'
assert {'generate', 'perform', 'translate'}.difference(set(possible_tokens)) == set()
assert 3 <= len(possible_tokens) <= 4

Elbrus is the highest mountain ['mountain', 'peak', 'vol', 'of']
Large language models can learn to generate ['generate', 'perform', 'translate']
