## LLM Inference - Llama 2 7B Chat Huggingface
In this notebook, we perform inference using [Llama 2 7B Chat Huggingface](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) model famous evaluation benchmark datasets such as,
- [ARC: AI2 Reasoning Challenge Dataset](https://huggingface.co/datasets/ai2_arc)
- ?

This notebook uses bitsandbytes quantisation for inference, reducing GPU needs.

### Mount Google Drive (Optional, but recommended)

Model parameter weights are stored in Google Drive, it saves you time the next time you load the model.

If you don't use it, remove cache_dir from the model and tokeniser below.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import os

# This is the path to the Google Drive folder.
drive_path = "/content/drive"

# This is the path where you want to store your cache.
cache_dir_path = os.path.join(drive_path, "MyDrive/huggingface_cache")

# Check if the Google Drive folder exists. If it does, use it as the cache_dir.
# If not, set cache_dir to None to use the default Hugging Face cache location.
if os.path.exists(drive_path):
    cache_dir = cache_dir_path
    os.makedirs(cache_dir, exist_ok=True) # Ensure the directory exists
else:
    cache_dir = None

# Setup and Install
- It's best to run Llama 2 7B model on a GPU, which you can do using a free Colab notebook.
- Check the Google Colab runtime to the top right corner.
- Or, go to the menu -> Runtime -> Change Runtime Type.
- Select GPU (T4).

### Install Libraries

In [1]:
!pip install -q bitsandbytes transformers datasets accelerate einops safetensors torch xformers

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m92.6/92.6 MB[0m [31m9.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m507.1/507.1 kB[0m [31m46.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m265.7/265.7 kB[0m [31m33.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m44.6/44.6 kB[0m [31m6.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m213.0/213.0 MB[0m [31m2.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m115.3/115.3 kB[0m [31m13.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m17.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m670.2/670.2 MB[0m [31m2.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━

In [2]:
# Set the runtime to cpu or gpu. Llama 7B (or 13B) requires too much RAM to work on cpu alone on a free or PRO Colab notebook - so use runtime = "gpu".
runtime = "gpu"  # OR "cpu"

if runtime == "cpu":
    runtimeFlag = "cpu"
elif runtime == "gpu":
    runtimeFlag = "cuda:0"
else:
    print("Invalid runtime. Please set it to either 'cpu' or 'gpu'.")
    runtimeFlag = None

print("Runtime flag is:", runtimeFlag)

Runtime flag is: cuda:0


In [3]:
# Huggingface space & mode information - Llama 2 7 Billion parameters Chat model
model_id = "meta-llama/Llama-2-7b-chat-hf"

### Import

In [4]:
import transformers
import torch
import json
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline

### Huggingface Login

In [5]:
from huggingface_hub import notebook_login
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

### Load Quantized Model

In [144]:
import torch, gc
gc.collect()
torch.cuda.empty_cache()

In [61]:
# https://stackoverflow.com/questions/56081324/why-are-google-colab-shell-commands-not-working
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

In [131]:
!nvidia-smi

Wed Dec 27 18:06:14 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   72C    P0              33W /  70W |  14967MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [133]:
!nvidia-smi --gpu-reset

GPU 00000000:00:04.0 is currently in use by another process.

1 device is currently being used by one or more other processes (e.g., Fabric Manager, CUDA application, graphics application such as an X server, or a monitoring application such as another instance of nvidia-smi). Please first kill all processes using this device and all compute applications running in the system.


In [135]:
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    # bnb_4bit_use_double_quant=True,
    # bnb_4bit_quant_type="nf4",
    # bnb_4bit_compute_dtype=torch.bfloat16
)

config = transformers.AutoConfig.from_pretrained(model_id, trust_remote_code=True)
config.init_device = 'cuda:0' # Unclear whether this really helps a lot or interacts with device_map.

model = AutoModelForCausalLM.from_pretrained(model_id, config=config, quantization_config=bnb_config, device_map='auto', trust_remote_code=True, cache_dir=None) # for inference use 'auto', for training us device_map={"":0}

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [75]:
print(model.config)

LlamaConfig {
  "_name_or_path": "meta-llama/Llama-2-7b-chat-hf",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "init_device": "cuda:0",
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pretraining_tp": 1,
  "quantization_config": {
    "bnb_4bit_compute_dtype": "float32",
    "bnb_4bit_quant_type": "fp4",
    "bnb_4bit_use_double_quant": false,
    "llm_int8_enable_fp32_cpu_offload": false,
    "llm_int8_has_fp16_weight": false,
    "llm_int8_skip_modules": null,
    "llm_int8_threshold": 6.0,
    "load_in_4bit": false,
    "load_in_8bit": true,
    "quant_method": "bitsandbytes"
  },
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "floa

In [76]:
cache_dir=None

### Setup the Tokenizer

In [77]:
# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir, use_fast=True) # will use the Rust fast tokenizer if available
tokenizer.pad_token = tokenizer.eos_token

In [78]:
print(tokenizer.eos_token)

</s>


### Load Dataset

In [13]:
from datasets import load_dataset

dataset = load_dataset('ai2_arc',  'ARC-Easy')
dataset["train"][100]

Downloading readme:   0%|          | 0.00/9.00k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/331k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/346k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/86.1k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/2251 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/2376 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/570 [00:00<?, ? examples/s]

{'id': 'Mercury_7043505',
 'question': 'Which interaction within an ecosystem is characterized by gradual change from one community of organisms to another?',
 'choices': {'text': ['symbiosis',
   'food webs',
   'energy pyramids',
   'succession'],
  'label': ['A', 'B', 'C', 'D']},
 'answerKey': 'D'}

In [14]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'question', 'choices', 'answerKey'],
        num_rows: 2251
    })
    test: Dataset({
        features: ['id', 'question', 'choices', 'answerKey'],
        num_rows: 2376
    })
    validation: Dataset({
        features: ['id', 'question', 'choices', 'answerKey'],
        num_rows: 570
    })
})


In [79]:
# Define the roles and markers
B_SEQ = "<s>"
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>", "<<SYS>>\n\n"

SYSTEM_MESSAGE = "For a given question along with multiple choice answers A, B, C and D, choose the best choice among A, B, C, D without any explanation."
NEW_LINE = '\n'

def create_prompt(question, choices):
    return f'{B_SEQ}{B_INST} {B_SYS}\n{SYSTEM_MESSAGE}\n{E_SYS}{"Question: " + question + NEW_LINE}{"".join("Choice " + label.strip() + ": " + text + NEW_LINE for text, label in zip(choices["text"], choices["label"]))}{E_INST}'

def tokenize_function(samples):
  prompts = []
  for question, choices in zip(samples["question"], samples["choices"]):
    prompts.append(create_prompt(question, choices))

  return {"prompts": prompts}

In [80]:
transformed_datasets = dataset.map(tokenize_function, batched=True)

In [81]:
transformed_datasets["train"]["prompts"][0]

'<s>[INST] <<SYS>>\nFor a given question along with multiple choice answers A, B, C and D, choose the best choice among A, B, C, D without any explanation.\n<<SYS>>\n\nQuestion: Which factor will most likely cause a person to develop a fever?\nChoice A: a leg muscle relaxing after exercise\nChoice B: a bacterial population in the bloodstream\nChoice C: several viral particles on the skin\nChoice D: carbohydrates being digested in the stomach\n[/INST]'

In [145]:
input = tokenizer(transformed_datasets["train"]["prompts"][-4:], padding=True, truncation=True, return_tensors='pt').to(runtimeFlag)
print(input)

{'input_ids': tensor([[    1,     1,   518, 25580, 29962,  3532, 14816, 29903,  6778,    13,
          2831,   263,  2183,  1139,  3412,   411,  2999,  7348,  6089,   319,
         29892,   350, 29892,   315,   322,   360, 29892,  6755,   278,  1900,
          7348,  4249,   319, 29892,   350, 29892,   315, 29892,   360,  1728,
           738,  8252, 29889,    13,  9314, 14816, 29903,  6778,    13,    13,
         16492, 29901,  1932,  4094,  3415, 26191,  1078,   515, 11563, 29915,
         29879,  7101,   964,   278, 25005, 29892,   967, 13206, 21337,  3896,
           278, 10489,  2106, 29889,  1724,  1735,   297,   445, 10489,  9946,
         13206, 21337,   304,   378,  1145,   344,   964,   278,  2319,  4094,
          4441,   552,  1372,   393,   883, 27091, 29973,    13, 29620,   319,
         29901,   450,  4158,   338, 12212, 29889,    13, 29620,   350, 29901,
           450,  7977,   338, 11664, 29889,    13, 29620,   315, 29901,   450,
         10430,   338, 12212, 29889,  

In [146]:
# Despite returning the usual output, the streamer will also print the generated text to stdout.
with torch.no_grad():
  generated_token_ids = model.generate(**input)
generated_texts = tokenizer.batch_decode(generated_token_ids, skip_special_tokens=True)
print(generated_texts)

["[INST] <<SYS>>\nFor a given question along with multiple choice answers A, B, C and D, choose the best choice among A, B, C, D without any explanation.\n<<SYS>>\n\nQuestion: When water evaporates from Earth's surface into the atmosphere, its molecules enter the gas state. What change in this gas causes molecules to condense into the small water droplets that form clouds?\nChoice A: The mass is reduced.\nChoice B: The volume is increased.\nChoice C: The temperature is reduced.\nChoice D: The pressure is increased.\n[/INST]\n\nBest answer: C: The temperature is reduced.", '[INST] <<SYS>>\nFor a given question along with multiple choice answers A, B, C and D, choose the best choice among A, B, C, D without any explanation.\n<<SYS>>\n\nQuestion: Which process directly adds carbon into the atmosphere?\nChoice A: increasing plant populations\nChoice B: decreasing animal populations\nChoice C: burning fossil fuels\nChoice D: forming sedimentary rock\n[/INST]0. The best answer is (C) burning

In [None]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "meta-llama/Llama-2-7b-chat-hf"
prompt = "Tell me about Donald Trump"

model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model_inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

output = model.generate(**model_inputs)

print(tokenizer.decode(output[0], skip_special_tokens=True))

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Tell me about Donald Trump’s tax plan. Einzeln
Donald Trump has proposed a comprehensive tax reform plan that he claims will help America’s middle class and small businesses. Here are some key elements of his plan:
1. Lowering the business tax rate: Trump wants to reduce the business tax rate from 35% to 15%. This would make the US more competitive with other countries and encourage businesses to invest and hire more workers.
2. Doubling the standard deduction: Trump wants to increase the standard deduction from $12,000 to $24,000 for individuals and from $24,000 to $48,000 for married couples. This would help reduce the amount of taxes that people pay on their income.
3. Eliminating the estate tax: Trump wants to repeal the estate tax, which is also known as the “death tax.” This tax is levied on the estates of people who have died, and Trump claims it unfairly targets small businesses and farms.
4. Capping the charitable donation deduction: Trump wants to limit the amount that people

In [None]:
output = model.generate(**model_inputs)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Tell me about Donald Trump's "Art of the Deal"
 Unterscheidung zwischen "Art of the Deal" und einem anderen Buch über den "Deal"

Donald Trump's "Art of the Deal" is a book written by the 45th President of the United States, Donald Trump, and published in 1987. The book is a self-help and business advice book that offers Trump's insights and strategies for success in business and life.
The book is divided into 12 chapters, each focusing on a different aspect of deal-making, such as negotiating, marketing, and closing deals. Trump shares his own experiences and anecdotes from his business career, offering practical advice and tips for readers looking to improve their negotiation and deal-making skills.
One of the key themes of the book is the importance of being a strong negotiator. Trump emphasizes the need to be confident, assertive, and willing to walk away from a bad deal. He also stresses the importance of having a clear vision and a well-defined strategy for achieving one's goals.

In [None]:
prompt = "France has a bread law, Le Décret Pain, with strict rules on what is allowed in a traditional baguette."
candidate1 = "The law does not apply to croissants and brioche."
candidate2 = "The law applies to baguettes."

In [None]:
from transformers import AutoModelForMultipleChoice
model = AutoModelForMultipleChoice.from_pretrained(model_name, device_map="auto", load_in_4bit=True)

ValueError: ignored

In [None]:
import torch
model_inputs = tokenizer([[prompt, candidate1], [prompt, candidate2]], return_tensors="pt").to("cuda:0")

labels = torch.tensor(0).unsqueeze(0)

ValueError: ignored

In [None]:
result.logits

tensor([[[ -0.0486,  -1.7080,   0.0869,  ...,   1.3975,   1.5039,   0.4370],
         [ -8.7109,  -8.6172,  -0.1006,  ...,  -4.1211,  -5.8516,  -5.6562],
         [ -0.7783,   1.2002,   5.8320,  ...,   0.7549,  -1.4062,  -1.4502],
         [-10.8672,  -8.8438,  -3.9238,  ...,  -5.2109,  -7.3867,  -8.2344],
         [ -4.6289,  -6.1445,   4.0156,  ...,  -2.9121,  -3.1699,  -5.3320],
         [ -1.1934,  -1.6387,  11.0547,  ...,  -0.0321,  -0.2375,   1.5273]]],
       device='cuda:0', grad_fn=<ToCopyBackward0>)

In [None]:
model_inputs

{'input_ids': tensor([[    1, 24948,   592,  1048, 20953]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1]], device='cuda:0')}