##### What is this notebook about?
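- This notebook shows how to run inference with an instruction-tuned causal LLM from the Hugging Face Hub using the `transformers` library, first through the high-level `pipeline` API and then through the `AutoModelForCausalLM` and `AutoTokenizer` classes.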


In [1]:
import os
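# Expose only physical GPU 5 to this process. This must run before CUDA is
# initialized, so keep this cell above any code that touches the GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "5"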
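
The hard-coded index above is specific to the machine the notebook was written on. A minimal sketch of a more portable variant follows, assuming a hypothetical helper named `pin_gpu` (not part of the original notebook): it keeps any `CUDA_VISIBLE_DEVICES` value already exported in the shell and only falls back to a default otherwise. As with the original cell, it has to run before CUDA is initialized.

```python
import os


def pin_gpu(default_index: str = "0") -> str:
    """Hypothetical helper: restrict this process to one GPU unless the caller
    already chose one by exporting CUDA_VISIBLE_DEVICES in the shell."""
    # setdefault only writes the value when the variable is not set yet,
    # and returns whichever value ends up in the environment.
    return os.environ.setdefault("CUDA_VISIBLE_DEVICES", default_index)


visible = pin_gpu("5")
print(f"CUDA_VISIBLE_DEVICES={visible}")
```

With this variant, launching Jupyter as `CUDA_VISIBLE_DEVICES=2 jupyter lab` would take precedence over the default in the notebook.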


In [None]:
# Checkpoint to load from the Hugging Face Hub; uncomment one line to switch models.
# model_name = "Qwen/Qwen2.5-7B-Instruct-1M"
# model_name = "Qwen/Qwen2.5-1.5B-Instruct"
# model_name = "Qwen/Qwen-Audio"
model_name = "meta-llama/Llama-3.2-3B-Instruct"




In [None]:
# Using the high-level pipeline API
import torch
import transformers

pipeline = transformers.pipeline(
    "text-generation",
    model=model_name,
    model_kwargs={"torch_dtype": torch.bfloat16},  # load the weights in bfloat16
    device_map="auto",  # place the model on the visible device(s) automatically
)
pipeline("Hey how are you doing today?")
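
The text-generation pipeline returns a list with one dict per generated sequence, and the text sits under the `generated_text` key. For a plain string prompt that text includes the prompt by default; when the input is a list of chat messages, `generated_text` is the message list with the assistant reply appended. The sketch below assumes a hypothetical helper named `extract_reply` that handles both shapes. The sample outputs are made-up stand-ins, not results from the model above.

```python
def extract_reply(outputs, prompt=None):
    """Hypothetical helper: pull the newly generated text out of a
    text-generation pipeline result (a list of {"generated_text": ...} dicts)."""
    generated = outputs[0]["generated_text"]
    if isinstance(generated, list):
        # Chat-style input: the last message is the assistant's reply.
        return generated[-1]["content"]
    if prompt is not None and generated.startswith(prompt):
        # Plain-string input: drop the echoed prompt.
        return generated[len(prompt):].lstrip()
    return generated


# Made-up stand-ins for pipeline results, used only to exercise the helper.
plain_out = [{"generated_text": "Hey how are you doing today? I'm doing well."}]
chat_out = [{"generated_text": [
    {"role": "user", "content": "Hey how are you doing today?"},
    {"role": "assistant", "content": "I'm doing well."},
]}]

plain_reply = extract_reply(plain_out, prompt="Hey how are you doing today?")
chat_reply = extract_reply(chat_out)
print(plain_reply)
print(chat_reply)
```

In the notebook this would be called as `extract_reply(pipeline(prompt), prompt=prompt)`.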

In [None]:
# Using the Auto classes

from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,  # only needed for checkpoints that ship custom modeling code
    torch_dtype="auto",      # use the dtype recorded in the checkpoint config
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
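
Instruction-tuned checkpoints expect prompts in a specific chat format, which the tokenizer applies through `apply_chat_template`. The sketch below is a simplified, hand-written imitation of a Llama 3 style layout, written only to illustrate the idea; `render_llama3_style` and `strip_special_tokens` are hypothetical helpers, and the real template is defined by the checkpoint and may add further content.

```python
import re


def render_llama3_style(messages, add_generation_prompt=True):
    """Simplified imitation of a Llama 3 style chat template (an assumption,
    not the template shipped with the checkpoint)."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues as the assistant.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


def strip_special_tokens(text):
    """Remove <|...|> markers, roughly mimicking decode(skip_special_tokens=True)."""
    return re.sub(r"<\|[^|]*\|>", "", text)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
prompt_text = render_llama3_style(messages)
visible_text = strip_special_tokens(prompt_text)
print(visible_text)
```

Because the role names are plain text between special tokens, decoding a full sequence with `skip_special_tokens=True` leaves them glued to the neighboring content, for example `You are a helpful assistant.user`.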


In [3]:
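generation_config = GenerationConfig(
    # max_length=256,     # alternative cap that also counts the prompt tokens
    max_new_tokens=256,   # cap on newly generated tokens (the prompt is not counted)
    temperature=0.05,     # very low temperature: sampling is close to greedy decoding
    do_sample=True,
    # do_sample=False,    # alternative: deterministic greedy decoding
    use_cache=True,
    # skip_special_tokens is a decoding option, not a generation option,
    # so it is passed to tokenizer.batch_decode() below instead.
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,  # reuse EOS as the padding token
)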

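def tokenize_generate_response(tokenizer, model, generation_config, messages):
    """Apply the chat template to `messages`, generate, and print the output."""
    # Render the conversation as a single prompt string that ends with the
    # assistant header, so the model continues as the assistant.
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    # The rendered template already contains the special tokens, so do not
    # let the tokenizer add them a second time.
    model_inputs = tokenizer(
        [text], return_tensors="pt", add_special_tokens=False
    ).to(model.device)

    generated_ids = model.generate(
        **model_inputs,
        generation_config=generation_config,
    )

    full_response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print("Full response (including prompt):")
    print(full_response)

    # generate() returns prompt + continuation; slice off the prompt tokens.
    generated_ids = [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print("Generated response:")
    print(response)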
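
For a decoder-only model, `generate` returns each prompt followed by its continuation, which is why the function slices off the first `len(input_ids)` tokens before decoding the reply. The slicing step can be sketched in isolation with plain Python lists; `strip_prompt_tokens` is a hypothetical name and the token ids are made up.

```python
def strip_prompt_tokens(input_ids_batch, output_ids_batch):
    """Keep only the tokens that were generated after each prompt."""
    return [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(input_ids_batch, output_ids_batch)
    ]


# Made-up token ids: a 3-token prompt followed by 2 generated tokens.
input_ids_batch = [[101, 7, 8]]
output_ids_batch = [[101, 7, 8, 42, 43]]
new_tokens = strip_prompt_tokens(input_ids_batch, output_ids_batch)
print(new_tokens)
```

The same slicing works on tensors, since indexing a row with `[len(input_ids):]` drops the leading prompt positions.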


In [4]:
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
tokenize_generate_response(tokenizer, model, generation_config, messages)

Full response (including prompt):
system

Cutting Knowledge Date: December 2023
Today Date: 27 Mar 2025

You are a helpful assistant.user

Give me a short introduction to large language model.assistant

**Introduction to Large Language Models**

A large language model (LLM) is a type of artificial intelligence (AI) designed to process and understand human language. These models are trained on vast amounts of text data, enabling them to learn patterns, relationships, and structures of language. The goal of LLMs is to generate human-like text, answer questions, and complete tasks that require language understanding.

**Key Characteristics:**

1. **Training Data**: LLMs are trained on massive amounts of text data, often sourced from the internet, books, and other digital sources.
2. **Neural Network Architecture**: LLMs are built using neural networks, which are composed of multiple layers of interconnected nodes (neurons) that process and transform input data.
3. **Self-Supervised Learni