meta-llama/Llama-3.2-1B-Instruct
The Llama 3.2 collection of multilingual large language models (LLMs) comprises pretrained and instruction-tuned generative models in 1B and 3B sizes (text in/text out). The instruction-tuned, text-only models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks. They outperform many of the available open-source and closed chat models on common industry benchmarks.

Model Developer: Meta

Model Architecture: Llama 3.2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to align with human preferences for helpfulness and safety.



In [38]:
import torch
from transformers import pipeline

model_id = "meta-llama/Llama-3.2-1B-Instruct"
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Demo: run the pipeline with a system message and a user message
messages = [
    {
        "role": "system",
        "content": "You are a pirate chatbot who always responds in pirate speak!",
    },
    {"role": "user", "content": "Who are you?"},
]
outputs = pipe(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])

Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


{'role': 'assistant', 'content': "Arrrr, ye be askin' who I be? Well, matey, I be the swashbucklin' chatbot, here to swab the seven seas o' knowledge and answer yer questions like a proper pirate! Me name be Captain Knowledge, the greatest chatbot to ever sail the digital seas. Me be a master o' language, with a treasure trove o' info at me disposal. So hoist the sails and set course fer a chat with ol' Captain Knowledge, savvy?"}


In [43]:
SYSTEM_PROMPT = """
You are an assistant with access to external tools. You must respond to the user by calling one tool using the exact JSON format shown below.
Available tool:
- get_weather: Get the current weather in a given location.
  Arguments: {
    "location": string
  }

You must respond ONLY with a valid JSON code block like this:

```json
{
  "tool": "get_weather",
  "tool_args": {
    "location": "New York"
  }
}
```

You must not include any other text or explanation in your response.
"""

In [44]:
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "What's the weather in London ?"},
]

In [45]:
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

op = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(op)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 22 May 2025

You are an assistant with access to external tools. You must respond to the user by calling one tool using the exact JSON format shown below.
Available tool:
- get_weather: Get the current weather in a given location.
  Arguments: {
    "location": string
  }

You must respond ONLY with a valid JSON code block like this:

```json
{
  "tool": "get_weather",
  "tool_args": {
    "location": "New York"
  }
}
```

You must not include any other text or explanation in your response.<|eot_id|><|start_header_id|>user<|end_header_id|>

What's the weather in London ?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
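
The printed prompt follows the Llama 3 header layout: each turn is wrapped in `<|start_header_id|>role<|end_header_id|>`, terminated by `<|eot_id|>`, and `add_generation_prompt=True` leaves an open assistant header at the end for the model to continue. As a rough illustration only, that layout can be reproduced in plain Python. The helper `render_llama3_prompt` and the `demo_messages` list are hypothetical, and the sketch deliberately omits the date preamble that the real template injects into the system turn, so `tokenizer.apply_chat_template` remains the right call in practice.

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Simplified, hypothetical re-creation of the prompt layout printed above."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn: role header, blank line, trimmed content, end-of-turn token
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        parts.append(message["content"].strip() + "<|eot_id|>")
    if add_generation_prompt:
        # Open assistant header: generation continues from this point
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


demo_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the weather in London ?"},
]
prompt = render_llama3_prompt(demo_messages)
print(prompt)
```

Seeing the layout spelled out makes it easier to check where a system prompt ends and where the model's turn begins when debugging a tool-calling prompt.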




In [46]:
# # load the model
# from transformers import AutoModelForCausalLM

# model = AutoModelForCausalLM.from_pretrained(
#     "meta-llama/Llama-3.2-1B-Instruct",
#     torch_dtype=torch.bfloat16,
#     device_map="auto",
#     low_cpu_mem_usage=True,
# )

In [47]:
# messages_token = tokenizer.apply_chat_template(
#     messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
# )
# print(messages_token.shape[-1])  # prompt length in tokens; len() would return the batch size (1)

In [48]:
# model_op = model.generate(
#     messages_token,
#     max_new_tokens=256,
#     do_sample=False,
#     num_return_sequences=1,
#     eos_token_id=tokenizer.eos_token_id,
#     pad_token_id=tokenizer.pad_token_id,
#     return_dict_in_generate=True,
# )

In [49]:
# Slice along the token dimension of the first sequence to drop the prompt tokens
# final_result = tokenizer.decode(model_op.sequences[0][messages_token.shape[-1]:], skip_special_tokens=True)

In [50]:
# final_result

In [51]:
outputs = pipe(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


{'role': 'assistant', 'content': '```json\n{\n  "tool": "get_weather",\n  "tool_args": {\n    "location": "London"\n  }\n}\n```'}


In [68]:
temp_op = outputs[0]["generated_text"][-1]["content"]

In [69]:
import re
import json


def parse_json_from_markdown(s):
    # Use regex to extract the JSON block between ```json ... ```
    match = re.search(r"```json\s*(\{.*?\})\s*```", s, re.DOTALL)
    if not match:
        raise ValueError("No JSON code block found")
    json_str = match.group(1)
    return json.loads(json_str)

In [70]:

parsed_op = parse_json_from_markdown(temp_op)
print(parsed_op)

{'tool': 'get_weather', 'tool_args': {'location': 'London'}}
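
A small model will not always honor the "JSON code block only" instruction, so a more tolerant parser can be useful. The following is a minimal sketch, not a definitive implementation: it tries a fenced block first, falls back to the outermost braces in the reply, and then checks the tool name and required arguments. The names `extract_tool_call` and `ALLOWED_TOOLS` are hypothetical and not part of the notebook above.

```python
import json
import re

# Hypothetical allow-list: tool name -> set of required argument names
ALLOWED_TOOLS = {"get_weather": {"location"}}


def extract_tool_call(text):
    """Return (tool, tool_args) parsed from a model reply, or raise ValueError."""
    fenced = re.search(r"```(?:json)?\s*(\{.*\})\s*```", text, re.DOTALL)
    if fenced:
        candidate = fenced.group(1)
    else:
        # Fallback: take everything between the first "{" and the last "}"
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise ValueError("No JSON object found in model reply")
        candidate = text[start : end + 1]
    try:
        payload = json.loads(candidate)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Reply is not valid JSON: {exc}") from exc

    tool = payload.get("tool")
    args = payload.get("tool_args", {})
    if tool not in ALLOWED_TOOLS:
        raise ValueError(f"Unknown tool: {tool!r}")
    if not isinstance(args, dict):
        raise ValueError("tool_args must be a JSON object")
    missing = ALLOWED_TOOLS[tool] - set(args)
    if missing:
        raise ValueError(f"Missing arguments for {tool}: {sorted(missing)}")
    return tool, args


reply = '```json\n{\n  "tool": "get_weather",\n  "tool_args": {\n    "location": "London"\n  }\n}\n```'
print(extract_tool_call(reply))  # ('get_weather', {'location': 'London'})
```

Validating against an allow-list before dispatch keeps a malformed or hallucinated tool name from reaching the code that executes tools.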


In [71]:
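def get_weather(location):
    # Stub tool: returns a canned report instead of calling a real weather API
    return f"the weather in {location} is sunny with low temperatures in the range of 20-25 degrees Celsius."


def call_tool(tool_name, tool_args):
    if tool_name == "get_weather":
        location = tool_args.get("location")
        if location:
            return get_weather(location)
        else:
            raise ValueError("Location is required for get_weather tool.")
    # Fail loudly instead of silently returning None for an unrecognized tool
    raise ValueError(f"Unknown tool: {tool_name}")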

In [72]:
tool_op = call_tool(parsed_op["tool"], parsed_op["tool_args"])
print(f"Tool output: {tool_op}")

Tool output: the weather in London is sunny with low temperatures in the range of 20-25 degrees Celsius.
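
With more than one tool, an `if` chain per tool name grows quickly. One common alternative, sketched here under assumptions, is a registry that maps tool names to callables and passes `tool_args` as keyword arguments. `TOOL_REGISTRY`, `dispatch_tool`, and `get_weather_stub` are hypothetical names introduced only for this illustration.

```python
def get_weather_stub(location):
    # Stub standing in for a real weather lookup
    return f"the weather in {location} is sunny"


# Hypothetical registry: tool name -> Python callable
TOOL_REGISTRY = {"get_weather": get_weather_stub}


def dispatch_tool(tool_name, tool_args):
    """Look up a tool by name and call it with keyword arguments."""
    try:
        tool_fn = TOOL_REGISTRY[tool_name]
    except KeyError:
        raise ValueError(f"Unknown tool: {tool_name}") from None
    try:
        return tool_fn(**tool_args)
    except TypeError as exc:
        # Typically raised when arguments are missing or unexpected
        raise ValueError(f"Bad arguments for {tool_name}: {exc}") from exc


print(dispatch_tool("get_weather", {"location": "London"}))
```

Adding a tool then means adding one entry to the dictionary and one line to the system prompt, with no change to the dispatch logic.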


In [74]:
new_prompt = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "What's the weather in London ?"},
    {"role": "assistant", "content": temp_op},
    {"role": "assistant", "content": tool_op},
]

In [75]:
new_outputs = pipe(
    new_prompt,
    max_new_tokens=256,
)
print(new_outputs[0]["generated_text"][-1])

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


{'role': 'assistant', 'content': 'the weather in London is sunny with low temperatures in the range of 20-25 degrees Celsius.'}
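
The reply above repeats the tool output word for word. One plausible reason is that the tool result was appended as a second assistant turn while the system prompt still demands a JSON-only tool call. A possible variant, sketched below, swaps in an answering system prompt and sends the tool result under its own role. Llama 3.1 and later describe an `ipython` role for tool output; whether the installed chat template accepts it is an assumption here, and a plain `user` turn carrying the result is a fallback. `ANSWER_SYSTEM_PROMPT` and `build_followup_messages` are hypothetical names, and the sketch only builds the message list without calling the model.

```python
# Hypothetical wording for the second, answer-producing turn
ANSWER_SYSTEM_PROMPT = (
    "You are a helpful assistant. Use the tool result to answer "
    "the user's question in one or two sentences."
)


def build_followup_messages(user_question, tool_call_text, tool_result, tool_role="ipython"):
    """Assemble the second-round conversation with the tool result in its own turn."""
    return [
        {"role": "system", "content": ANSWER_SYSTEM_PROMPT},
        {"role": "user", "content": user_question},
        # The model's earlier JSON tool call, kept as its own assistant turn
        {"role": "assistant", "content": tool_call_text},
        # Tool output under a separate role (assumed to be supported by the template)
        {"role": tool_role, "content": tool_result},
    ]


followup = build_followup_messages(
    "What's the weather in London ?",
    '{"tool": "get_weather", "tool_args": {"location": "London"}}',
    "the weather in London is sunny with low temperatures in the range of 20-25 degrees Celsius.",
)
for message in followup:
    print(message["role"], "->", message["content"][:60])
```

Passing `followup` to `pipe(...)` in place of `new_prompt` would exercise this variant.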
