# Inference with pipeline

In [1]:
from transformers import pipeline



In [2]:

pipe = pipeline(
    'text-generation',          # task: generate text from a prompt
    model='openai/gpt-oss-20b', # model checkpoint to load from the Hugging Face Hub
    torch_dtype='auto',         # use the data type stored in the checkpoint
    device_map='auto')          # place the weights on the available device(s) automatically

messages = [{'role':'user', 'content':'Explain what mixture-of-experts is'}]

out = pipe(messages,
 max_new_tokens=3000, # upper bound on the number of newly generated tokens (output length)
 temperature=1.0) # 1.0 leaves the token probabilities unscaled; lower values reduce randomness

Fetching 40 files: 100%|██████████| 40/40 [00:00<00:00, 225197.53it/s]
Loading checkpoint shards: 100%|██████████| 3/3 [01:06<00:00, 22.33s/it]
Device set to use cuda:0
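
The `temperature` argument above rescales the model's token scores before sampling. The following is a minimal, library-free sketch of that idea; the function name and the toy scores are illustrative assumptions, not the code `pipeline` runs internally.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
baseline = softmax_with_temperature(toy_logits, temperature=1.0)
sharper = softmax_with_temperature(toy_logits, temperature=0.5)
flatter = softmax_with_temperature(toy_logits, temperature=2.0)

for name, probs in [("T=1.0", baseline), ("T=0.5", sharper), ("T=2.0", flatter)]:
    print(name, [round(p, 3) for p in probs])
```

Values below 1.0 concentrate probability on the top-scoring token, and values above 1.0 spread it out.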


In [3]:
out

[{'generated_text': [{'role': 'user',
    'content': 'Explain what mixture-of-experts is'},
   {'role': 'assistant',
    'content': 'analysisThe user asks: "Explain what mixture-of-experts is". Need to provide a clear explanation of the concept: mixture of experts (MoE) architecture in machine learning, mixture models, gating networks, etc. Provide context, how it works, benefits, types, common applications (language modeling, vision), implementations (DeepMind\'s Switch Transformer). Provide a definition, explain components (experts, gate, sparsity), training objectives, and talk about inference and scaling. Provide example architecture: a feedforward network with multiple expert sub-networks and a gating network that assigns a probability vector over experts per input. The system uses mixture-of-experts to combine multiple specialized models. Provide references. Provide overview of training: cross-entropy loss, load balancing, regularization (entropy). Also talk about parameter effic

In [4]:
out[0].keys()

dict_keys(['generated_text'])

In [5]:
out[0]['generated_text']

[{'role': 'user', 'content': 'Explain what mixture-of-experts is'},
 {'role': 'assistant',
  'content': 'analysisThe user asks: "Explain what mixture-of-experts is". Need to provide a clear explanation of the concept: mixture of experts (MoE) architecture in machine learning, mixture models, gating networks, etc. Provide context, how it works, benefits, types, common applications (language modeling, vision), implementations (DeepMind\'s Switch Transformer). Provide a definition, explain components (experts, gate, sparsity), training objectives, and talk about inference and scaling. Provide example architecture: a feedforward network with multiple expert sub-networks and a gating network that assigns a probability vector over experts per input. The system uses mixture-of-experts to combine multiple specialized models. Provide references. Provide overview of training: cross-entropy loss, load balancing, regularization (entropy). Also talk about parameter efficiency, sparse activation. Pr

In [6]:
out[0]['generated_text'][0].keys()

dict_keys(['role', 'content'])

In [7]:
print(out[0]['generated_text'][0]['role'])
print(out[0]['generated_text'][0]['content'])


user
Explain what mixture-of-experts is


In [8]:
print(out[0]['generated_text'][1]['role'])
print(out[0]['generated_text'][1]['content'])


assistant
analysisThe user asks: "Explain what mixture-of-experts is". Need to provide a clear explanation of the concept: mixture of experts (MoE) architecture in machine learning, mixture models, gating networks, etc. Provide context, how it works, benefits, types, common applications (language modeling, vision), implementations (DeepMind's Switch Transformer). Provide a definition, explain components (experts, gate, sparsity), training objectives, and talk about inference and scaling. Provide example architecture: a feedforward network with multiple expert sub-networks and a gating network that assigns a probability vector over experts per input. The system uses mixture-of-experts to combine multiple specialized models. Provide references. Provide overview of training: cross-entropy loss, load balancing, regularization (entropy). Also talk about parameter efficiency, sparse activation. Provide examples like "Mixture of Softmaxes" for language modeling, "Switch Transformer". Maybe me
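
The cells above step through the nesting of `out` one level at a time: a list, then a dict with a `generated_text` key, then a list of role/content messages. As a small sketch, the helper below (a hypothetical name) pulls out the last assistant message from a structure of that shape. It runs on a stand-in object rather than the real `out`.

```python
def last_assistant_reply(pipeline_output):
    """Return the content of the final assistant message in a chat pipeline result."""
    conversation = pipeline_output[0]["generated_text"]
    for message in reversed(conversation):
        if message["role"] == "assistant":
            return message["content"]
    raise ValueError("no assistant message found")

# Hypothetical stand-in with the same nesting as `out` above.
mock_out = [{"generated_text": [
    {"role": "user", "content": "Explain what mixture-of-experts is"},
    {"role": "assistant", "content": "placeholder reply"},
]}]

print(last_assistant_reply(mock_out))
```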

# Advanced inference with .generate()

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer



In [2]:
model_name = "openai/gpt-oss-20b"
 
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
 
messages = [
    {"role": "user", "content": "Explain what MXFP4 quantization is."},
]
 
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
 
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.7
)
 
print(tokenizer.decode(outputs[0]))

Fetching 40 files: 100%|██████████| 40/40 [00:00<00:00, 208671.84it/s]
Loading checkpoint shards: 100%|██████████| 3/3 [01:43<00:00, 34.64s/it]


<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-15

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>user<|message|>Explain what MXFP4 quantization is.<|end|><|start|>assistant<|channel|>analysis<|message|>The user asks: "Explain what MXFP4 quantization is." They likely refer to some technology, maybe in audio or signal processing? "MXFP4" seems unfamiliar. Could be a specific standard or algorithm. Might refer to "Mixed-Precision Fixed-Point 4" or "Maximum eXponential Fixed Precision 4"? Or "Matrix Factorization Pair 4"? Or maybe "Milton's eXperimental Fixed Pattern 4"? I need to interpret. Search memory: "MXFP4" appears in the context of "MXFP4 quantization" maybe in deep learning or neural networks? There's "MXFP" stands for "Mixed-Precision Floating Point"? Actually "MXFP" could be "Microsoft eXpress Forward Propagation"? N
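
The decoded text above repeats the whole prompt because `generate` returns the prompt tokens followed by the new tokens. A common way to keep only the completion is to slice off the prompt length before decoding. The sketch below shows that slicing with plain lists standing in for tensors; the token ids are made up.

```python
def completion_only(sequence_ids, prompt_length):
    """Drop the first `prompt_length` ids, leaving only the newly generated ones."""
    return sequence_ids[prompt_length:]

prompt_ids = [101, 102, 103]          # made-up ids standing in for the tokenized prompt
generated = prompt_ids + [7, 8, 9]    # generate() output: prompt followed by new tokens
new_ids = completion_only(generated, len(prompt_ids))
print(new_ids)

# With the objects from the cell above, the equivalent slice would likely be:
#   new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
#   print(tokenizer.decode(new_tokens))
```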

# Fine-Tuning

In [8]:
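# Uncomment to log in to the Hugging Face Hub. The training configuration below sets
# push_to_hub=True, which fails with a 401 error without valid credentials.
# from huggingface_hub import notebook_login
# notebook_login()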

## Multilingual-Thinking Dataset

In [1]:
# Harmony Response Format

# developer: Custom instructions for the model (what we usually call the system role).

# user: The input to the model.

# assistant: Output by the model, which can be either a tool call or a message. The output may also be associated with a particular "channel" that identifies the intent of the message.

# analysis: Messages the model uses for its chain-of-thought.

# final: Messages in the final channel are intended to be shown to the end user and represent the model's response.

# messages: The list of messages that combines the content above into a full conversation. This is the input to the model.

In [10]:
from datasets import load_dataset

dataset = load_dataset('HuggingFaceH4/Multilingual-Thinking', split='train')

dataset[0]

{'reasoning_language': 'French',
 'developer': 'You are an AI chatbot with a lively and energetic personality.',
 'user': 'Can you show me the latest trends on Twitter right now?',
 'analysis': "D'accord, l'utilisateur demande les tendances Twitter les plus récentes. Tout d'abord, je dois vérifier si j'ai accès à des données en temps réel. Étant donné que je ne peux pas naviguer sur Internet ou accéder directement à l'API de Twitter, je ne peux pas fournir des tendances en direct. Cependant, je peux donner quelques conseils généraux sur la façon de les trouver.\n\nJe devrais préciser que les tendances Twitter évoluent rapidement et sont spécifiques à chaque région. Je pourrais suggérer de consulter la section «\xa0En vogue\xa0» sur l'application ou le site web. Aussi, l'utilisation de hashtags et le suivi d'utilisateurs pertinents pourraient être utiles. Il est important de souligner que les tendances varient selon la région et l'heure de la journée. Je devrais garder un ton amical et 

In [3]:
# To fine-tune the model, we need to convert these messages into a format the model understands. In practice, each conversation is formatted with the model's chat template and the resulting text is then tokenized.

# Note: the TRL library does this automatically.

### Manual input formatting


In [3]:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('openai/gpt-oss-20b')

In [5]:
dataset[0].keys()

dict_keys(['reasoning_language', 'developer', 'user', 'analysis', 'final', 'messages'])

In [6]:
messages = dataset[0]['messages']
print(messages)

[{'content': 'reasoning language: French\n\nYou are an AI chatbot with a lively and energetic personality.', 'role': 'system', 'thinking': None}, {'content': 'Can you show me the latest trends on Twitter right now?', 'role': 'user', 'thinking': None}, {'content': 'Hey there!  While I can\'t check Twitter (X) in real-time or access live data, I can share some tips to help you spot the latest trends:\n\n1. **Open the "Trending" tab** on the Twitter app or website – it updates constantly!  \n2. **Search for hashtags** like #Trending or #Viral to see what’s blowing up.  \n3. **Follow accounts** that curate trends (e.g., @TrendingNow, @ViralThreads).  \n4. **Check regional trends** – they often differ by location!  \n\nRemember, trends are *super fast-moving* and often tied to pop culture, memes, or breaking news. For example, recent trends have included viral challenges (like the "Distracted Boyfriend" meme revival), celebrity drama, or unexpected events (hello, weather disasters!).  \n\nW
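
The `messages` column appears to be assembled from the other columns of the same row. The sketch below rebuilds it under that reading: the system and user entries mirror what is printed above, while placing `analysis` in the assistant's `thinking` field is an assumption, since the printed output is cut off before that entry ends. `build_messages` is a hypothetical helper, not part of `datasets`.

```python
def build_messages(row):
    """Assemble a chat-style message list from one dataset row."""
    system_content = (
        f"reasoning language: {row['reasoning_language']}\n\n{row['developer']}"
    )
    return [
        {"role": "system", "content": system_content, "thinking": None},
        {"role": "user", "content": row["user"], "thinking": None},
        # Assumption: the chain-of-thought travels in the assistant's `thinking` field.
        {"role": "assistant", "content": row["final"], "thinking": row["analysis"]},
    ]

# Shortened stand-in for dataset[0]; the analysis and final texts are placeholders.
sample_row = {
    "reasoning_language": "French",
    "developer": "You are an AI chatbot with a lively and energetic personality.",
    "user": "Can you show me the latest trends on Twitter right now?",
    "analysis": "placeholder analysis",
    "final": "placeholder final answer",
}

rebuilt = build_messages(sample_row)
print(rebuilt[0]["content"])
```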

In [7]:
conversation = tokenizer.apply_chat_template(messages,tokenize=False)
print(conversation)

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-16

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>developer<|message|># Instructions

reasoning language: French

You are an AI chatbot with a lively and energetic personality.

<|end|><|start|>user<|message|>Can you show me the latest trends on Twitter right now?<|end|><|start|>assistant<|channel|>analysis<|message|>D'accord, l'utilisateur demande les tendances Twitter les plus récentes. Tout d'abord, je dois vérifier si j'ai accès à des données en temps réel. Étant donné que je ne peux pas naviguer sur Internet ou accéder directement à l'API de Twitter, je ne peux pas fournir des tendances en direct. Cependant, je peux donner quelques conseils généraux sur la façon de les trouver.

Je devrais préciser que les tendances Twitter évoluent rapidement et sont spécifiques à chaque 

## Prepare the model

In [1]:
# 1. Download the weights

import torch
from transformers import AutoModelForCausalLM, Mxfp4Config


# dequantize=True converts the MXFP4-quantized checkpoint weights to a higher-precision dtype
# (bfloat16 here) at load time so the model can be fine-tuned.
quantization_config = Mxfp4Config(dequantize=True)

# model_kwargs = {
#     'attn_implementation':'eager',
#     'torch_dtype':torch.bfloat16,
#     'quantization_config':quantization_config,
#     'use_cache':False,
#     'device_map':'auto'
#     }

model_kwargs = dict(
    attn_implementation='eager', # eager attention, chosen here for better performance
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
    use_cache=False, # no KV cache during training, since gradient checkpointing is used (selectively recomputing intermediate activations during backpropagation instead of storing them all)
    device_map='auto') 


model = AutoModelForCausalLM.from_pretrained('openai/gpt-oss-20b',**model_kwargs)


In [4]:
# tokenize sample prompt

messages = [
    {"role": "user", "content": "¿Cuál es el capital de Australia?"},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)


# Test the model on the sample prompt before fine-tuning

output_ids = model.generate(input_ids, max_new_tokens=512)
response = tokenizer.batch_decode(output_ids)[0]
print(response)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-17

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>user<|message|>¿Cuál es el capital de Australia?<|end|><|start|>assistant<|channel|>analysis<|message|>We need to answer in Spanish presumably. The question: "¿Cuál es el capital de Australia?" We answer: Canberra. Provide short answer.<|end|><|start|>assistant<|channel|>final<|message|>La capital de Australia es Canberra.<|return|>
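
The raw response above interleaves roles, channels, and text using special tokens such as `<|start|>`, `<|channel|>`, `<|message|>`, `<|end|>`, and `<|return|>`. As a rough sketch, a regular expression can split such a string back into messages. This is a simplified reading of the format as it appears in the output above and does not cover tool calls or other message types; `parse_harmony` is a hypothetical helper.

```python
import re

HARMONY_MESSAGE = re.compile(
    r"<\|start\|>(?P<role>[^<]+?)"
    r"(?:<\|channel\|>(?P<channel>[^<]+?))?"
    r"<\|message\|>(?P<content>.*?)"
    r"(?:<\|end\|>|<\|return\|>)",
    re.DOTALL,
)

def parse_harmony(text):
    """Split a decoded string into dicts with role, channel, and content."""
    return [
        {"role": m.group("role"), "channel": m.group("channel"), "content": m.group("content")}
        for m in HARMONY_MESSAGE.finditer(text)
    ]

# Shortened excerpt of the response printed above.
sample = (
    "<|start|>user<|message|>¿Cuál es el capital de Australia?<|end|>"
    "<|start|>assistant<|channel|>analysis<|message|>We need to answer in Spanish presumably.<|end|>"
    "<|start|>assistant<|channel|>final<|message|>La capital de Australia es Canberra.<|return|>"
)

parsed = parse_harmony(sample)
final_answers = [m["content"] for m in parsed if m["channel"] == "final"]
print(final_answers)
```

Filtering on the `final` channel leaves only the text meant for the end user, with the `analysis` reasoning removed.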


In [5]:
## Use LoRA (Low-Rank Adaptation) to fine-tune the model (train small adapter matrices on selected layers instead of all weights)

## Parameter-Efficient Fine-Tuning
# Wrap the model as a PeftModel and define the LoRA config

from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
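    r=8,                   # LoRA rank; a higher rank enables more trainable parameters
    lora_alpha=16,         # scaling factor for the LoRA update
    target_modules='all-linear', # apply LoRA to all linear layers, e.g. the attention projections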
    target_parameters=[
        "7.mlp.experts.gate_up_proj",
        "7.mlp.experts.down_proj",
        "15.mlp.experts.gate_up_proj",
        "15.mlp.experts.down_proj",
        "23.mlp.experts.gate_up_proj",
        "23.mlp.experts.down_proj"
    ]
)

In [6]:
peft_model = get_peft_model(model,peft_config)

peft_model.print_trainable_parameters()

trainable params: 15,040,512 || all params: 20,929,797,696 || trainable%: 0.0719
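
To see where a figure like 0.0719% comes from, recall that LoRA adds two small matrices per adapted weight: one of shape rank x in_features and one of shape out_features x rank. The sketch below counts those parameters for a single layer with made-up dimensions and recomputes the percentage from the two totals printed above. It does not reproduce PEFT's per-layer accounting for this model.

```python
def lora_param_count(in_features, out_features, rank):
    """Parameters added by one LoRA adapter: A (rank x in) plus B (out x rank)."""
    return rank * (in_features + out_features)

def trainable_percent(trainable, total):
    """Share of trainable parameters, in percent."""
    return 100.0 * trainable / total

# Toy layer size, chosen only for illustration.
toy_adapter = lora_param_count(in_features=1024, out_features=1024, rank=8)
print(toy_adapter)

# Totals copied from the print_trainable_parameters() output above.
reported = trainable_percent(15_040_512, 20_929_797_696)
print(f"{reported:.4f}")
```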




In [8]:
## Fine-tuning with Transformer Reinforcement Learning (TRL)

## Define hyperparameters

from trl import SFTConfig

training_args = SFTConfig(
    learning_rate=2e-4,
    gradient_checkpointing=True,
    num_train_epochs=1,
    logging_steps=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4, # accumulate gradients over multiple batches before weight update 
    max_length=2048,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine_with_min_lr",
    lr_scheduler_kwargs={"min_lr_rate":0.1},
    output_dir="gpt-oss-20b-multilingual-reasoner",
    report_to="trackio",
    push_to_hub=True
)
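
Two of the hyperparameters above interact in ways that are easy to miss: the batch size and accumulation steps multiply into an effective batch size, and `cosine_with_min_lr` with `min_lr_rate=0.1` decays the learning rate to 10% of its peak rather than to zero. The sketch below illustrates both with plain Python. It paraphrases the schedule's shape and is not the library's implementation; the total step count is a made-up value.

```python
import math

def effective_batch_size(per_device, accumulation_steps, num_devices=1):
    """Examples contributing to each optimizer step."""
    return per_device * accumulation_steps * num_devices

def cosine_with_min_lr(step, total_steps, peak_lr=2e-4, warmup_ratio=0.03, min_lr_rate=0.1):
    """Linear warmup to peak_lr, then cosine decay down to min_lr_rate * peak_lr."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_lr_rate + (1.0 - min_lr_rate) * cosine)

batch = effective_batch_size(per_device=4, accumulation_steps=4)  # single device assumed
print(batch)

total_steps = 100  # hypothetical; the real value depends on dataset size and hardware
schedule = [cosine_with_min_lr(s, total_steps) for s in range(total_steps + 1)]
print(schedule[0], max(schedule), schedule[-1])
```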

In [12]:
## Train the model

from trl import SFTTrainer

## Define the trainer

trainer = SFTTrainer(
    model=peft_model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer
)

HfHubHTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/api/repos/create (Request ID: Root=1-68a1abb1-3ee1703b7d902f4f0bfa8c86;74392627-baaa-47c7-98f7-073a59835f31)

Invalid username or password.
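
The 401 error above is raised while the trainer tries to create the Hub repository, which `push_to_hub=True` requires and which needs valid credentials. One possible guard, sketched below, is to enable pushing only when a token is available in the `HF_TOKEN` environment variable. The helper name is hypothetical, and the check would not detect a token stored by `notebook_login()`.

```python
import os

def can_push_to_hub(environ=None):
    """Return True when an HF_TOKEN value is present in the given environment mapping."""
    env = os.environ if environ is None else environ
    return bool(env.get("HF_TOKEN"))

# The result could be passed as SFTConfig(push_to_hub=...) instead of a hard-coded True.
push_to_hub = can_push_to_hub()
print(push_to_hub)
```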

In [None]:
## TRAIN

trainer.train()