# 03.1 - Transformer Models - Query

How to load and query a Hugging Face Transformer model.

In [None]:
# !pip install --quiet --upgrade transformers
!pip install --quiet --upgrade git+https://github.com/huggingface/transformers

In [1]:
model_id = "HuggingFaceTB/SmolLM2-1.7B-Instruct"

model_kwargs = {
    "torch_dtype": "auto",
    "device_map": "cuda"
}

In [2]:
from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    **model_kwargs
)

In [3]:
import utils
utils.print_model_info(model)

model : size : 3264 MB
model : is quantized : False
model : on GPU : True
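
The `utils` helper is not shown in this notebook, so the following is only a rough sketch of the arithmetic behind a size figure like the one above. It assumes the reported size counts weight memory only and that `"torch_dtype": "auto"` loaded 2-byte (16-bit) weights; both are assumptions, and `model_size_mb` and `params_from_size` are hypothetical names.

```python
MB = 1024 ** 2  # bytes per mebibyte

def model_size_mb(num_params: int, bytes_per_param: int) -> float:
    """Memory taken by the weights alone, in MB."""
    return num_params * bytes_per_param / MB

def params_from_size(size_mb: float, bytes_per_param: int) -> int:
    """Invert the estimate: how many parameters fit in size_mb."""
    return int(size_mb * MB / bytes_per_param)

# Working backwards from the 3264 MB reported above, assuming 2-byte weights.
print(params_from_size(3264, 2))  # 1711276032, roughly 1.71 billion
```

Under those assumptions the figure is consistent with the "1.7B" in the model name.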


In [4]:
generate_kwargs = {
    "max_new_tokens": 1024,
    "do_sample": True,
    "temperature": 0.2,
    "top_k": 50,
    "top_p": 0.9,
    "bos_token_id": tokenizer.bos_token_id,
    "pad_token_id": tokenizer.eos_token_id,
    "eos_token_id": tokenizer.eos_token_id
}
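
To make the sampling settings above less abstract, here is a minimal pure-Python sketch of what `temperature`, `top_k` and `top_p` conventionally do to a next-token distribution. It follows the usual textbook definitions and is not the library's implementation; `filter_probs` and the toy logits are made up for illustration.

```python
import math

def filter_probs(logits, temperature=0.2, top_k=50, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    # Temperature: dividing by a value below 1 sharpens the distribution.
    scaled = [x / temperature for x in logits]

    # Top-k: keep only the k highest-scoring tokens.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    keep = order[:top_k]

    # Softmax over the surviving tokens (shifted by the max for stability).
    peak = max(scaled[i] for i in keep)
    exps = {i: math.exp(scaled[i] - peak) for i in keep}
    total = sum(exps.values())
    probs = {i: e / total for i, e in exps.items()}

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = {}, 0.0
    for i in keep:
        kept[i] = probs[i]
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    norm = sum(kept.values())
    return {i: p / norm for i, p in kept.items()}

toy_logits = [2.0, 1.0, 0.0]
print(sorted(filter_probs(toy_logits, temperature=0.2)))  # [0]
print(sorted(filter_probs(toy_logits, temperature=1.0)))  # [0, 1]
```

With a temperature of 0.2 almost all probability mass lands on the top token, which is why the answers in this notebook are close to deterministic even though `do_sample` is `True`.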

In [5]:
%%time

prompt = """
What is the capital of France?
"""

messages = [
    {"role": "user", "content": prompt}
]

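# Render the chat template and tokenize it; the result is a tensor of token ids.
input_text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_text,
    **generate_kwargs
)

# generate() returns prompt + completion; keep only the newly generated tokens.
response = outputs[0][input_text.shape[-1]:]

print(tokenizer.decode(response, skip_special_tokens=True) + '\n')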

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


The capital of France is Paris.

CPU times: user 210 ms, sys: 234 ms, total: 444 ms
Wall time: 438 ms
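
The attention-mask warning in the output above appears because `pad_token_id` is set to the EOS id, so the library cannot tell padding from a genuine end-of-turn token by looking at the ids alone. The toy sketch below illustrates that ambiguity; the token ids and both helper functions are hypothetical.

```python
PAD_ID = 2  # hypothetical id, shared by the pad and end-of-turn tokens
EOS_ID = 2

def naive_mask(ids, pad_id):
    """Guess the mask by treating every pad_id as padding."""
    return [0 if t == pad_id else 1 for t in ids]

def explicit_mask(real_length, total_length):
    """Left-padded mask built from the known prompt length."""
    return [0] * (total_length - real_length) + [1] * real_length

# One left-padding token, then a prompt containing a real end-of-turn token.
ids = [PAD_ID, 11, 12, EOS_ID, 13, 14]

print(naive_mask(ids, PAD_ID))     # [0, 1, 1, 0, 1, 1]: the real EOS is masked by mistake
print(explicit_mask(5, len(ids)))  # [0, 1, 1, 1, 1, 1]: only the padding is masked
```

For a single unpadded prompt, as in the cell above, the correct mask is all ones, so the answer is unaffected. One way to silence the warning, to the best of my knowledge, is to call `apply_chat_template` with `return_dict=True` so that it returns an `attention_mask` alongside `input_ids`, and then pass both to `generate`.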


In [6]:
%%time

system_prompt = """
Provide a concise summary of the provided text.
"""

prompt = """
There once was a man from Nantucket,
who wore on his head a large bucket.
He was thinking one day,
that there's no rain today,
so he took his bucket then chucked it.
"""

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": prompt}
]

input_text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_text,
    **generate_kwargs
)

response = outputs[0][input_text.shape[-1]:]

print(tokenizer.decode(response, skip_special_tokens=True) + '\n')

The text is a humorous and nonsensical poem that describes a man from Nantucket who wears a large bucket on his head. The man is frustrated that it's not raining, so he decides to throw the bucket away.

CPU times: user 404 ms, sys: 786 μs, total: 404 ms
Wall time: 403 ms
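
`apply_chat_template` turns the list of role/content dictionaries into a single prompt before tokenizing. A minimal sketch of a ChatML-style layout follows; the `<|im_start|>` and `<|im_end|>` markers are an assumption about this model's template, and `render_chatml` is a hypothetical stand-in, not the tokenizer's own code.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Join role/content messages into one ChatML-style prompt string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues as the assistant.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "Provide a concise summary of the provided text."},
    {"role": "user", "content": "There once was a man from Nantucket, ..."},
]
print(render_chatml(messages))
```

The real template also appears to insert a default system message when none is supplied, as the streamed output later in this notebook shows; the sketch omits that.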


## Using Pipelines

In [7]:
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model_id,
    **model_kwargs,
)

Device set to use cuda


In [8]:
prompt = """
What is the capital of France?
"""

messages = [
    {"role": "user", "content": prompt}
]

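# Rendered prompt string, kept for inspection only; pipe() below takes
# `messages` directly and applies the chat template itself.
input_text = tokenizer.apply_chat_template(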
    messages,
    tokenize=False,
    add_generation_prompt=True
)

response = pipe(messages, max_new_tokens=512)
messages = response[0]['generated_text']

print(messages[-1]['content'])

The capital of France is Paris.
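
When the pipeline is given a list of chat messages, it returns the conversation with the assistant's reply appended, which is why the cell above reads `messages[-1]['content']`. The sketch below mimics that shape with a hand-written mock so it runs without a model; `last_reply` and `mock_output` are illustrative names.

```python
def last_reply(pipeline_output):
    """Pull the assistant's text out of a chat-style pipeline result."""
    conversation = pipeline_output[0]["generated_text"]
    return conversation[-1]["content"]

# Hand-written stand-in for what pipe(messages) returned above.
mock_output = [{
    "generated_text": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ]
}]

print(last_reply(mock_output))  # The capital of France is Paris.
```

Because the returned list is a full conversation, appending another user message to it and calling the pipeline again is a simple way to continue the chat.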


### Streaming output

In [11]:
from transformers import TextStreamer

tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    **model_kwargs
)

streamer = TextStreamer(tokenizer, skip_special_tokens=True)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    streamer=streamer
)

Device set to use cuda
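
A streamer is an object that `generate` feeds as tokens are produced: it receives the prompt ids first, then each new token, and finally an end signal. The toy class below sketches that contract with a made-up vocabulary; `ToyStreamer` is hypothetical and much simpler than `TextStreamer`.

```python
class ToyStreamer:
    """Toy stand-in for a streamer: receives token ids in chunks, prints text."""

    def __init__(self, vocab, skip_prompt=False):
        self.vocab = vocab              # token id -> text piece
        self.skip_prompt = skip_prompt
        self.first_call = True
        self.chunks = []                # everything that was printed

    def put(self, token_ids):
        # The first call carries the prompt; optionally drop it.
        if self.first_call:
            self.first_call = False
            if self.skip_prompt:
                return
        text = "".join(self.vocab[t] for t in token_ids)
        self.chunks.append(text)
        print(text, end="", flush=True)

    def end(self):
        print()  # finish the line once generation stops

vocab = {0: "user: ", 1: "Recite", 2: "Four", 3: " score"}
streamer = ToyStreamer(vocab, skip_prompt=True)
streamer.put([0, 1])  # prompt tokens, skipped
streamer.put([2])     # first generated token
streamer.put([3])     # second generated token
streamer.end()
```

This also explains why the streamed output in the next cell begins with the system and user turns: the `TextStreamer` above is created without `skip_prompt=True`, so the prompt is echoed before the answer.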


In [12]:
prompt = """
Recite the Gettysburg Address
"""

messages = [
    {"role": "user", "content": prompt}
]

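# Rendered prompt string, kept for inspection only; pipe() below takes
# `messages` directly and applies the chat template itself.
input_text = tokenizer.apply_chat_template(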
    messages,
    tokenize=False,
    add_generation_prompt=True
)

response = pipe(messages, max_new_tokens=2048)

system
You are a helpful AI assistant named SmolLM, trained by Hugging Face
user

Recite the Gettysburg Address

assistant
"Four score and seven years ago, our fathers brought forth on this continent a new nation, conceived in liberty and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation or any nation so conceived and so dedicated can long endure. We are met on a great battlefield of that war. We have come to dedicate a portion of that field as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. But, in a larger sense, we cannot dedicate, we cannot consecrate, we cannot hallow this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It 