## LLM Inference - Llama 2 7B Chat Huggingface
In this notebook, we run inference with the [Llama 2 7B Chat Huggingface](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) model on well-known evaluation benchmark datasets such as:
- [ARC: AI2 Reasoning Challenge Dataset](https://huggingface.co/datasets/ai2_arc)
- ?

This notebook uses bitsandbytes quantisation for inference, which reduces the GPU memory needed to load the model.
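
As a rough illustration of why quantisation matters here, the sketch below estimates the memory taken by the weights alone at different precisions. The parameter count of 7e9 is a nominal assumption (the "7B" in the model name), and the helper `weight_memory_gib` is hypothetical. The estimate ignores activations, the KV cache, and the small overhead of quantisation constants.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3

# Assumption: a nominal 7 billion parameters, taken from the "7B" in the model name.
n_params = 7e9

for name, bits in [("float32", 32), ("float16 / bfloat16", 16), ("4-bit", 4)]:
    print(f"{name:>20}: {weight_memory_gib(n_params, bits):.1f} GiB")
```

Under these assumptions the 16-bit weights alone come close to the memory of a 16 GB T4, while the 4-bit weights leave room for activations and generation.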

### Mount Google Drive (Optional, but recommended)

Model weights are cached in Google Drive, which saves time the next time you load the model.

If you don't use Google Drive, remove `cache_dir` from the model and tokeniser calls below.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import os

# This is the path to the Google Drive folder.
drive_path = "/content/drive"

# This is the path where you want to store your cache.
cache_dir_path = os.path.join(drive_path, "MyDrive/huggingface_cache")

# Check if the Google Drive folder exists. If it does, use it as the cache_dir.
# If not, set cache_dir to None to use the default Hugging Face cache location.
if os.path.exists(drive_path):
    cache_dir = cache_dir_path
    os.makedirs(cache_dir, exist_ok=True) # Ensure the directory exists
else:
    cache_dir = None

# Setup and Install
- It's best to run the Llama 2 7B model on a GPU, which a free Colab notebook provides.
- Check the Google Colab runtime in the top-right corner.
- Or go to the menu: Runtime -> Change Runtime Type.
- Select GPU (T4).

### Install Libraries

In [None]:
!pip install -q bitsandbytes transformers datasets accelerate einops safetensors torch xformers

In [None]:
# Set the runtime to cpu or gpu. Llama 7B (or 13B) requires too much RAM to work on cpu alone on a free or PRO Colab notebook - so use runtime = "gpu".
runtime = "gpu"  # OR "cpu"

if runtime == "cpu":
    runtimeFlag = "cpu"
elif runtime == "gpu":
    runtimeFlag = "cuda:0"
else:
    print("Invalid runtime. Please set it to either 'cpu' or 'gpu'.")
    runtimeFlag = None

print("Runtime flag is:", runtimeFlag)
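
The flag above is set by hand. As an alternative, it could be derived from what the runtime actually offers. The sketch below is one way to do that; the helper name `pick_runtime_flag` is hypothetical, and the fallback to CPU when `torch` is missing is an assumption made so that the snippet runs anywhere.

```python
def pick_runtime_flag(cuda_available: bool) -> str:
    """Map GPU availability to the device string used in this notebook."""
    return "cuda:0" if cuda_available else "cpu"

try:
    import torch
    cuda_available = torch.cuda.is_available()
except ImportError:
    # Assumption: without torch installed, fall back to CPU.
    cuda_available = False

runtimeFlag = pick_runtime_flag(cuda_available)
print("Runtime flag is:", runtimeFlag)
```

If this prints `cpu` on Colab, the runtime type is probably not set to GPU yet.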

In [None]:
# Hugging Face model ID: Llama 2 chat model with 7 billion parameters
model_id = "meta-llama/Llama-2-7b-chat-hf"

### Import

In [None]:
import transformers
import torch
import json
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline

### Huggingface Login

In [None]:
from huggingface_hub import notebook_login
notebook_login()

### Load Quantized Model

In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

config = transformers.AutoConfig.from_pretrained(model_id, trust_remote_code=True)
config.init_device = 'cuda:0' # Unclear whether this really helps a lot or interacts with device_map.

# For inference use device_map='auto'; for training use device_map={"": 0}.
model = AutoModelForCausalLM.from_pretrained(model_id, config=config, quantization_config=bnb_config, device_map='auto', trust_remote_code=True, cache_dir=cache_dir)

### Setup the Tokenizer

In [None]:
# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir, use_fast=True) # will use the Rust fast tokenizer if available
tokenizer.pad_token = tokenizer.eos_token

In [None]:
print(tokenizer.eos_token)

### Load Dataset

In [None]:
from datasets import load_dataset

dataset = load_dataset('ai2_arc', 'ARC-Easy')
dataset["train"][100]

In [None]:
# Define the Llama 2 chat prompt markers
B_SEQ = "<s>"
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>", "<</SYS>>\n\n"  # the closing marker is <</SYS>>

SYSTEM_MESSAGE = "For a given question along with multiple choice answers A, B, C and D, choose the best answer without any explanation."
NEW_LINE = '\n'

def tokenize_function(sample):
    # Format one ARC sample (not a batch) as a Llama 2 chat prompt.
    choices = "".join(
        label + ": " + text + NEW_LINE
        for text, label in zip(sample["choices"]["text"], sample["choices"]["label"])
    )
    prompt = (
        f"{B_SEQ}{B_INST} {B_SYS}\n{SYSTEM_MESSAGE}\n{E_SYS}"
        f"Question: {sample['question'].strip()}{NEW_LINE}"
        f"{choices}{E_INST}"
    )
    print(prompt)
    # The prompt already starts with <s>, so the tokenizer must not add a second BOS token.
    return tokenizer(prompt, add_special_tokens=False)

In [None]:
tokenize_function(dataset["train"][100])

In [None]:
# tokenize_function handles one sample at a time, so map without batched=True.
tokenized_datasets = dataset.map(tokenize_function)

In [None]:
print(model.config)

In [None]:
tokenized_datasets

In [None]:
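# `inputs` and `streamer` are not defined earlier; one ARC question is used as an example. Besides returning the usual output, the streamer also prints the generated text to stdout.
from transformers import TextStreamer
inputs = tokenizer(dataset["train"][100]["question"], return_tensors="pt").to(model.device)
output = model.generate(**inputs, streamer=TextStreamer(tokenizer), max_new_tokens=500)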


In [None]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")
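
The accuracy metric is loaded here but not yet used. A minimal sketch of how generated answers might be scored against the `answerKey` field of ARC follows. The helpers `extract_choice` and `accuracy` are hypothetical, written in plain Python, and the extraction rule (first standalone capital letter among the labels) is an assumption that will misfire on text such as "A plant needs...".

```python
import re

def extract_choice(generated, valid_labels=("A", "B", "C", "D")):
    """Return the first standalone choice label in the text, or None."""
    pattern = r"\b(" + "|".join(re.escape(label) for label in valid_labels) + r")\b"
    match = re.search(pattern, generated)
    return match.group(1) if match else None

def accuracy(predictions, references):
    """Fraction of predictions equal to the reference label."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Hypothetical model outputs and reference answers, for illustration only.
generated_texts = ["B", "The answer is C.", "no idea"]
answer_keys = ["B", "D", "A"]

predictions = [extract_choice(text) for text in generated_texts]
print(predictions, accuracy(predictions, answer_keys))
```

Unanswered items (`None`) count as wrong here, which seems the safer default for a benchmark score.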

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "meta-llama/Llama-2-7b-chat-hf"
prompt = "Tell me about Donald Trump"

model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model_inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

output = model.generate(**model_inputs)

print(tokenizer.decode(output[0], skip_special_tokens=True))

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Tell me about Donald Trump’s tax plan. Einzeln
Donald Trump has proposed a comprehensive tax reform plan that he claims will help America’s middle class and small businesses. Here are some key elements of his plan:
1. Lowering the business tax rate: Trump wants to reduce the business tax rate from 35% to 15%. This would make the US more competitive with other countries and encourage businesses to invest and hire more workers.
2. Doubling the standard deduction: Trump wants to increase the standard deduction from $12,000 to $24,000 for individuals and from $24,000 to $48,000 for married couples. This would help reduce the amount of taxes that people pay on their income.
3. Eliminating the estate tax: Trump wants to repeal the estate tax, which is also known as the “death tax.” This tax is levied on the estates of people who have died, and Trump claims it unfairly targets small businesses and farms.
4. Capping the charitable donation deduction: Trump wants to limit the amount that people

In [None]:
output = model.generate(**model_inputs)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Tell me about Donald Trump's "Art of the Deal"
 Unterscheidung zwischen "Art of the Deal" und einem anderen Buch über den "Deal"

Donald Trump's "Art of the Deal" is a book written by the 45th President of the United States, Donald Trump, and published in 1987. The book is a self-help and business advice book that offers Trump's insights and strategies for success in business and life.
The book is divided into 12 chapters, each focusing on a different aspect of deal-making, such as negotiating, marketing, and closing deals. Trump shares his own experiences and anecdotes from his business career, offering practical advice and tips for readers looking to improve their negotiation and deal-making skills.
One of the key themes of the book is the importance of being a strong negotiator. Trump emphasizes the need to be confident, assertive, and willing to walk away from a bad deal. He also stresses the importance of having a clear vision and a well-defined strategy for achieving one's goals.
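
Both outputs above read like continuations of the raw prompt rather than answers to it, probably because the prompt was passed without the chat markers the model was tuned on. A minimal sketch of wrapping a message in the `[INST]` format follows. The helper `format_chat_prompt` is hypothetical, and the `<s>` token is left out on the assumption that the tokenizer adds it by default.

```python
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def format_chat_prompt(user_message, system_message=None):
    """Wrap a single user turn (and optional system message) in Llama 2 chat markers."""
    if system_message:
        user_message = f"{B_SYS}{system_message}{E_SYS}{user_message}"
    return f"{B_INST} {user_message.strip()} {E_INST}"

print(format_chat_prompt("Tell me about Donald Trump"))
```

The result could then replace `prompt` in the tokenizer call, for example `tokenizer(format_chat_prompt(prompt), return_tensors="pt")`.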

In [None]:
prompt = "France has a bread law, Le Décret Pain, with strict rules on what is allowed in a traditional baguette."
candidate1 = "The law does not apply to croissants and brioche."
candidate2 = "The law applies to baguettes."

In [None]:
from transformers import AutoModelForMultipleChoice
model = AutoModelForMultipleChoice.from_pretrained(model_name, device_map="auto", load_in_4bit=True)

ValueError: ignored

In [None]:
import torch
model_inputs = tokenizer([[prompt, candidate1], [prompt, candidate2]], return_tensors="pt").to("cuda:0")

labels = torch.tensor(0).unsqueeze(0)

ValueError: ignored

In [None]:
result.logits

tensor([[[ -0.0486,  -1.7080,   0.0869,  ...,   1.3975,   1.5039,   0.4370],
         [ -8.7109,  -8.6172,  -0.1006,  ...,  -4.1211,  -5.8516,  -5.6562],
         [ -0.7783,   1.2002,   5.8320,  ...,   0.7549,  -1.4062,  -1.4502],
         [-10.8672,  -8.8438,  -3.9238,  ...,  -5.2109,  -7.3867,  -8.2344],
         [ -4.6289,  -6.1445,   4.0156,  ...,  -2.9121,  -3.1699,  -5.3320],
         [ -1.1934,  -1.6387,  11.0547,  ...,  -0.0321,  -0.2375,   1.5273]]],
       device='cuda:0', grad_fn=<ToCopyBackward0>)
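
Since `AutoModelForMultipleChoice` failed to load above, one common alternative with a causal language model is to score each candidate by the log-probability the model assigns to its tokens and pick the highest. This is a sketch of that idea, not what the notebook does. It uses plain Python lists in place of the logits tensor, and the toy logits, token ids, and helper names are all hypothetical.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def continuation_logprob(logits, token_ids, n_prompt_tokens):
    """Sum log P(token_t | tokens before t) over the continuation tokens only.

    logits[t] holds the scores for the token at position t + 1,
    which is how causal language models lay out their output.
    """
    total = 0.0
    for t in range(n_prompt_tokens, len(token_ids)):
        total += log_softmax(logits[t - 1])[token_ids[t]]
    return total

# Toy example: vocabulary of 3 tokens, 1 prompt token, 2 continuation tokens.
toy_logits = [[0.0, 0.0, 2.0], [0.0, 3.0, 0.0], [0.0, 0.0, 0.0]]
score_a = continuation_logprob(toy_logits, [0, 2, 1], n_prompt_tokens=1)
score_b = continuation_logprob(toy_logits, [0, 1, 0], n_prompt_tokens=1)
print("candidate A" if score_a > score_b else "candidate B")
```

With a real model, each candidate would need its own forward pass, and longer candidates may need length normalisation so they are not penalised for having more tokens.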

In [None]:
model_inputs

{'input_ids': tensor([[    1, 24948,   592,  1048, 20953]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1]], device='cuda:0')}