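# Running Llama 3 with Hugging Face
- Inference with the transformers pipeline
- A pipeline is a function that abstracts the sequence of steps involved in inference (tokenizing the input, running the model, decoding the output).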

In [1]:
import transformers
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

  from .autonotebook import tqdm as notebook_tqdm
Loading checkpoint shards: 100%|██████████| 4/4 [00:05<00:00,  1.27s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [2]:
print(pipeline.tokenizer.eos_token)
print(pipeline.tokenizer.eos_token_id)
print(pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"))
print(pipeline.tokenizer.convert_tokens_to_ids(pipeline.tokenizer.eos_token))

<|eot_id|>
128009
128009
128009
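
Since `eos_token` is already `<|eot_id|>`, both lookups return 128009, so a terminator list built from these two values holds the same id twice. As a rough illustration of how a list of end-of-sequence ids stops generation, here is a toy sketch. `generate_until` and `fake_stream` are hypothetical names made up for this example and are not part of transformers; the token ids other than 128009 are arbitrary.

```python
def generate_until(token_stream, terminators, max_new_tokens):
    """Collect token ids until a terminator appears or the token budget is used up."""
    stop_ids = set(terminators)  # duplicate ids collapse into one entry
    generated = []
    for token_id in token_stream:
        if len(generated) >= max_new_tokens:
            break
        generated.append(token_id)
        if token_id in stop_ids:
            break  # the terminator itself is kept, later tokens are never produced
    return generated


# Arbitrary made-up ids standing in for model output; 128009 is <|eot_id|>.
fake_stream = [40, 2846, 3815, 128009, 9999]
result = generate_until(fake_stream, [128009, 128009], max_new_tokens=256)
print(result)  # [40, 2846, 3815, 128009]
```

Passing the same id twice is harmless, which is why the duplicated terminator list still behaves as expected.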


In [3]:


messages = [
   {"role": "user", "content": "Hello, how are you?"},
   {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
   {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

prompt = pipeline.tokenizer.apply_chat_template(
        messages, 
        tokenize=False, 
        add_generation_prompt=True
)
# print(prompt)

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = pipeline(
    prompt,
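    max_new_tokens=256,  # maximum number of tokens in the answer, excluding the prompt
    eos_token_id=terminators,
    do_sample=True,  # sample the answer; the default is greedy search, which gives the same answer to the same question
    temperature=0.6,  # controls randomness; higher values give more creative but possibly less accurate answers
    top_p=0.9,  # if set to a float < 1, keep only the smallest set of most probable tokens whose cumulative probability reaches top_p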
)
print(outputs[0]["generated_text"][len(prompt):])


Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


That sounds like a lot of fun! I'd be happy to play along. Go ahead and show me how chat templating works!
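
To make the `temperature` and `top_p` comments above more concrete, here is a minimal pure-Python sketch on a toy four-token distribution. It is only an illustration of the idea, not the implementation used inside transformers; the logits and the helper names `softmax_with_temperature` and `top_p_filter` are assumptions made up for this example.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize them into probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Indices of the smallest set of most probable tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for four candidate tokens
probs_default = softmax_with_temperature(logits, 1.0)
probs_sharp = softmax_with_temperature(logits, 0.6)  # lower temperature favors the top token
kept = top_p_filter(probs_sharp, 0.9)
print(kept)  # [0, 1]

# Sample the next token only from the kept set, weighted by probability.
random.seed(0)
next_token = random.choices(kept, weights=[probs_sharp[i] for i in kept], k=1)[0]
```

With a temperature below 1 the distribution sharpens, so fewer tokens are needed to reach the `top_p` threshold and the output becomes less varied.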


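# Running Llama 3 with Hugging Face, part 2
- This time, load the model and tokenizer directly, without the pipeline.
- Passing a model id registered on Hugging Face is enough to get back the corresponding model and the tokenizer best suited to it.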

In [5]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|██████████| 4/4 [00:04<00:00,  1.14s/it]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Arrrr, shiver me timbers! Me name be Captain Chat, the scurviest pirate chatbot to ever sail the Seven Seas o' the internet! Me be here to swab the decks with ye, answerin' yer questions and tellin' tales o' adventure and booty! So hoist the colors, me hearty, and let's set sail fer a swashbucklin' good time!
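
Both cells rely on `apply_chat_template` to turn the `messages` list into the prompt the model actually sees. The sketch below approximates that rendering by hand, assuming the published Llama 3 Instruct prompt format; `render_llama3_prompt` is a hypothetical helper written for this example, and the tokenizer's own template remains the authoritative version.

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Approximate the Llama 3 Instruct chat format as a plain string."""
    prompt = "<|begin_of_text|>"
    for message in messages:
        # Each turn: role header, blank line, trimmed content, end-of-turn token.
        prompt += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]
prompt = render_llama3_prompt(messages)
print(prompt)
```

Every turn ends with `<|eot_id|>`, which is why that token works as a terminator during generation: the model emits it when its own turn is finished.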
