##### This is a sample showing how to load the LLaMA model with Hugging Face.

##### For this sample to work, you must register on Meta's Llama page as well as on Hugging Face (HF).

##### Chat sample:

In [1]:
from transformers import AutoTokenizer
import transformers
import torch



This downloads the model files from the HF repo directly to my local machine. I find this more convenient than running Llama locally from Meta's git repo.

In [2]:
# This only works after Meta has granted your account access to the Llama models

# Define the Llama 2 chat model with 7B parameters
model = "meta-llama/Llama-2-7b-chat-hf"

# Load the tokenizer that matches the pre-trained model from HF
tokenizer = AutoTokenizer.from_pretrained(model)

# Build the HF text-generation pipeline
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

Downloading shards: 100%|██████████| 2/2 [00:00<00:00,  3.21it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00,  2.12s/it]
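The `torch_dtype=torch.float16` choice above matters mainly for memory. As a rough sketch, the weight footprint can be estimated from the parameter count and the bytes per parameter. The nominal count of 7e9 is an assumption (the exact figure for this checkpoint differs slightly), and the estimate ignores activations and the KV cache:

```python
# Rough weight-memory estimate for a model loaded in a given dtype.
# Only the weights are counted; activations and caches are ignored.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Return the approximate size of the weights in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


nominal_params = 7e9  # assumption: nominal "7B", not the exact parameter count

fp16_gib = weight_memory_gib(nominal_params, "float16")
fp32_gib = weight_memory_gib(nominal_params, "float32")
print(f"float16: ~{fp16_gib:.1f} GiB, float32: ~{fp32_gib:.1f} GiB")
```

Under these assumptions, half precision needs roughly 13 GiB for the weights, about half of what full precision would need.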


Inspect the tokenizer to see which side it pads and truncates on and which special tokens (bos/eos/unk) it uses:

In [3]:
tokenizer

LlamaTokenizerFast(name_or_path='meta-llama/Llama-2-7b-chat-hf', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
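The very large `model_max_length` in the output above is not a real context size. It equals `int(1e30)`, the placeholder that `transformers` uses when the tokenizer configuration does not set a limit. The sketch below checks this and shows one possible way to fall back to an explicit limit. The helper `effective_max_length` is hypothetical, and the fallback of 4096 assumes the Llama 2 context window:

```python
# The value printed by the tokenizer matches int(1e30), which acts as a
# "no limit configured" placeholder rather than a usable length.
SENTINEL = int(1e30)
printed_value = 1000000000000000019884624838656
print(printed_value == SENTINEL)

LLAMA2_CONTEXT = 4096  # assumption: context window of Llama 2 models


def effective_max_length(model_max_length: int, fallback: int = LLAMA2_CONTEXT) -> int:
    """Hypothetical helper: use the fallback when no real limit is set."""
    return fallback if model_max_length >= SENTINEL else model_max_length


print(effective_max_length(printed_value))
print(effective_max_length(2048))
```

In practice this means that truncation needs an explicit `max_length` to have any effect with this tokenizer.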

This is the first example, saying "hello" to LLaMA:

In [5]:
with open('prompts/prompt1.txt') as f:
    prompt = f.read()

prompt

'Say hello. Use only 20 words:\n'

In [6]:
answer1 = pipeline(
    prompt,
    do_sample=True,
    top_k=10,
    num_return_sequences=1, # number of answers to generate
    eos_token_id=tokenizer.eos_token_id, # id of the end-of-sequence token
    max_length=50, # max total length in tokens (prompt + generated text)
)

In [7]:
print(answer1[0]['generated_text'])

Say hello. Use only 20 words:

"Hello, how are you?"
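The `do_sample=True, top_k=10` arguments used above ask the model to sample each next token from its 10 most likely candidates instead of always picking the single best one. A minimal pure-Python sketch of that idea follows. The toy logits and the function `top_k_sample` are illustrative assumptions, not the library's implementation:

```python
import math
import random


def top_k_sample(logits: dict[str, float], k: int, rng: random.Random) -> str:
    """Sample one token from the k highest-scoring candidates."""
    # Keep only the k largest logits.
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:k]
    # Softmax weights over the survivors (shifted by the max for stability).
    max_logit = top[0][1]
    weights = [math.exp(score - max_logit) for _, score in top]
    tokens = [tok for tok, _ in top]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token scores, for illustration only.
toy_logits = {"Hello": 3.0, "Hi": 2.5, "Hey": 1.0, "Greetings": 0.2, "Yo": -1.0}

rng = random.Random(0)
samples = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(1000)]
print(set(samples))  # only the two highest-scoring tokens can appear
```

A smaller `k` makes the output more predictable, and `k=1` reduces to greedy decoding.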


This is the second example, which takes a long time to generate:

In [9]:
with open('prompts/prompt2.txt') as f:
    prompt2 = f.read()

prompt2

'Tell me a joke about cars. Say the joke in bullet points\n'

In [10]:
answer2 = pipeline(
    prompt2,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=100,
)

In [11]:
print(answer2[0]['generated_text'])

Tell me a joke about cars. Say the joke in bullet points

•	Why did the car go to the doctor?
•	Because it had a flat tire-iatrist!
•	What did the car say when it got a flat tire?
•	Tire-d!
•	Why did the car go to the gym?
•	To get a little lift!
•	What do you call a car with no
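The answer above stops mid-sentence, which is consistent with generation hitting `max_length=100`. Because `max_length` counts the prompt tokens as well as the generated ones, the room left for the answer shrinks as the prompt grows. The sketch below makes that budget explicit. The prompt token count of 16 is a made-up figure, and `new_token_budget` is a hypothetical helper:

```python
def new_token_budget(prompt_tokens: int, max_length: int) -> int:
    """Tokens left for the answer when max_length includes the prompt."""
    return max(0, max_length - prompt_tokens)


prompt_tokens = 16  # hypothetical count; the real one depends on the tokenizer

print(new_token_budget(prompt_tokens, max_length=100))
print(new_token_budget(120, max_length=100))  # prompt already exceeds the limit
```

To bound only the generated part, the generation argument `max_new_tokens` can be passed instead of `max_length`.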
