# Inference with pipeline

In [1]:
from transformers import pipeline



In [2]:

pipe = pipeline(
    'text-generation',           # task: generate text from a prompt or chat history
    model='openai/gpt-oss-20b',  # model ID on the Hugging Face Hub
    torch_dtype='auto',          # take the dtype from the model's config/checkpoint
    device_map='auto',           # place the weights on the available device(s) automatically
)

messages = [{'role': 'user', 'content': 'Explain what mixture-of-experts is'}]

out = pipe(
    messages,
    max_new_tokens=3000,  # upper bound on the number of newly generated tokens
    temperature=1.0,      # 1.0 leaves the model's token distribution unscaled
)

Fetching 40 files: 100%|██████████| 40/40 [00:00<00:00, 225197.53it/s]
Loading checkpoint shards: 100%|██████████| 3/3 [01:06<00:00, 22.33s/it]
Device set to use cuda:0


In [3]:
out

[{'generated_text': [{'role': 'user',
    'content': 'Explain what mixture-of-experts is'},
   {'role': 'assistant',
    'content': 'analysisThe user asks: "Explain what mixture-of-experts is". Need to provide a clear explanation of the concept: mixture of experts (MoE) architecture in machine learning, mixture models, gating networks, etc. Provide context, how it works, benefits, types, common applications (language modeling, vision), implementations (DeepMind\'s Switch Transformer). Provide a definition, explain components (experts, gate, sparsity), training objectives, and talk about inference and scaling. Provide example architecture: a feedforward network with multiple expert sub-networks and a gating network that assigns a probability vector over experts per input. The system uses mixture-of-experts to combine multiple specialized models. Provide references. Provide overview of training: cross-entropy loss, load balancing, regularization (entropy). Also talk about parameter effic

In [4]:
out[0].keys()

dict_keys(['generated_text'])

In [5]:
out[0]['generated_text']

[{'role': 'user', 'content': 'Explain what mixture-of-experts is'},
 {'role': 'assistant',
  'content': 'analysisThe user asks: "Explain what mixture-of-experts is". Need to provide a clear explanation of the concept: mixture of experts (MoE) architecture in machine learning, mixture models, gating networks, etc. Provide context, how it works, benefits, types, common applications (language modeling, vision), implementations (DeepMind\'s Switch Transformer). Provide a definition, explain components (experts, gate, sparsity), training objectives, and talk about inference and scaling. Provide example architecture: a feedforward network with multiple expert sub-networks and a gating network that assigns a probability vector over experts per input. The system uses mixture-of-experts to combine multiple specialized models. Provide references. Provide overview of training: cross-entropy loss, load balancing, regularization (entropy). Also talk about parameter efficiency, sparse activation. Pr

In [6]:
out[0]['generated_text'][0].keys()

dict_keys(['role', 'content'])

In [7]:
print(out[0]['generated_text'][0]['role'])
print(out[0]['generated_text'][0]['content'])


user
Explain what mixture-of-experts is


In [8]:
print(out[0]['generated_text'][1]['role'])
print(out[0]['generated_text'][1]['content'])


assistant
analysisThe user asks: "Explain what mixture-of-experts is". Need to provide a clear explanation of the concept: mixture of experts (MoE) architecture in machine learning, mixture models, gating networks, etc. Provide context, how it works, benefits, types, common applications (language modeling, vision), implementations (DeepMind's Switch Transformer). Provide a definition, explain components (experts, gate, sparsity), training objectives, and talk about inference and scaling. Provide example architecture: a feedforward network with multiple expert sub-networks and a gating network that assigns a probability vector over experts per input. The system uses mixture-of-experts to combine multiple specialized models. Provide references. Provide overview of training: cross-entropy loss, load balancing, regularization (entropy). Also talk about parameter efficiency, sparse activation. Provide examples like "Mixture of Softmaxes" for language modeling, "Switch Transformer". Maybe me
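
The indexing in cells 4 through 8 can be wrapped in a small helper. The sketch below is plain Python and assumes only the output shape shown above: a list holding one dict whose `'generated_text'` entry is a list of role/content messages. `last_message` and `sample_out` are hypothetical names, and the sample reply is a placeholder rather than real model output.

```python
def last_message(pipe_output, role='assistant'):
    """Return the content of the last message with the given role, or None."""
    chat = pipe_output[0]['generated_text']
    for message in reversed(chat):
        if message['role'] == role:
            return message['content']
    return None


# Hand-written stand-in with the same shape as `out` (not real model output).
sample_out = [{'generated_text': [
    {'role': 'user', 'content': 'Explain what mixture-of-experts is'},
    {'role': 'assistant', 'content': '<assistant reply goes here>'},
]}]

print(last_message(sample_out, role='user'))
print(last_message(sample_out))
```

With the real pipeline result, `last_message(out)` should return the same string that cell 8 prints as the assistant content.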

# Advanced inference with .generate()

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer



In [2]:
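model_name = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",  # take the dtype from the model's config/checkpoint
    device_map="auto",   # place the weights on the available device(s)
)

messages = [
    {"role": "user", "content": "Explain what MXFP4 quantization is."},
]

# Render the chat into the model's prompt format and tokenize it
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # end the prompt where the assistant turn begins
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=200,  # upper bound on the number of newly generated tokens
    temperature=0.7,     # below 1.0 sharpens the token distribution when sampling
)

# outputs[0] holds the prompt tokens followed by the generated tokens
print(tokenizer.decode(outputs[0]))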

Fetching 40 files: 100%|██████████| 40/40 [00:00<00:00, 208671.84it/s]
Loading checkpoint shards: 100%|██████████| 3/3 [01:43<00:00, 34.64s/it]


<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-15

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>user<|message|>Explain what MXFP4 quantization is.<|end|><|start|>assistant<|channel|>analysis<|message|>The user asks: "Explain what MXFP4 quantization is." They likely refer to some technology, maybe in audio or signal processing? "MXFP4" seems unfamiliar. Could be a specific standard or algorithm. Might refer to "Mixed-Precision Fixed-Point 4" or "Maximum eXponential Fixed Precision 4"? Or "Matrix Factorization Pair 4"? Or maybe "Milton's eXperimental Fixed Pattern 4"? I need to interpret. Search memory: "MXFP4" appears in the context of "MXFP4 quantization" maybe in deep learning or neural networks? There's "MXFP" stands for "Mixed-Precision Floating Point"? Actually "MXFP" could be "Microsoft eXpress Forward Propagation"? N
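
The decoded string above interleaves special tokens (`<|start|>`, `<|channel|>`, `<|message|>`, `<|end|>`) with the message text. As a rough sketch, a regular expression can split such a string back into role/channel/content records. It relies only on the four tokens visible in the printed output; `parse_messages` and `sample` are hypothetical names, `sample` is a shortened stand-in for the real string, and a final message without a closing `<|end|>` (as in the truncated generation above) is read up to the end of the string.

```python
import re

# One message: <|start|>ROLE[<|channel|>CHANNEL]<|message|>CONTENT(<|end|> or end of string)
MESSAGE_RE = re.compile(
    r'<\|start\|>(?P<role>[^<]*)'
    r'(?:<\|channel\|>(?P<channel>[^<]*))?'
    r'<\|message\|>(?P<content>.*?)'
    r'(?:<\|end\|>|\Z)',
    re.DOTALL,
)


def parse_messages(decoded):
    """Split a decoded chat string into dicts with role, channel and content."""
    return [
        {
            'role': m.group('role'),
            'channel': m.group('channel'),  # None when no channel token is present
            'content': m.group('content'),
        }
        for m in MESSAGE_RE.finditer(decoded)
    ]


# Shortened stand-in for the decoded output (not the full string).
sample = (
    '<|start|>system<|message|>Reasoning: medium<|end|>'
    '<|start|>user<|message|>Explain what MXFP4 quantization is.<|end|>'
    '<|start|>assistant<|channel|>analysis<|message|>The user asks: '
)

for message in parse_messages(sample):
    print(message['role'], message['channel'], repr(message['content']))
```

If only the newly generated text is wanted, a common alternative is to decode the slice `outputs[0][inputs["input_ids"].shape[-1]:]`, which drops the prompt tokens before decoding.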

# Fine-Tuning

In [1]:
from huggingface_hub import notebook_login

In [3]:
notebook_login()


## Multilingual-Thinking Dataset

In [8]:
from datasets import load_dataset

dataset = load_dataset('HuggingFaceH4/Multilingual-Thinking', split='train')

dataset[0]

README.md: 0.00B [00:00, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/5.29M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000 [00:00<?, ? examples/s]

{'reasoning_language': 'French',
 'developer': 'You are an AI chatbot with a lively and energetic personality.',
 'user': 'Can you show me the latest trends on Twitter right now?',
 'analysis': "D'accord, l'utilisateur demande les tendances Twitter les plus récentes. Tout d'abord, je dois vérifier si j'ai accès à des données en temps réel. Étant donné que je ne peux pas naviguer sur Internet ou accéder directement à l'API de Twitter, je ne peux pas fournir des tendances en direct. Cependant, je peux donner quelques conseils généraux sur la façon de les trouver.\n\nJe devrais préciser que les tendances Twitter évoluent rapidement et sont spécifiques à chaque région. Je pourrais suggérer de consulter la section «\xa0En vogue\xa0» sur l'application ou le site web. Aussi, l'utilisation de hashtags et le suivi d'utilisateurs pertinents pourraient être utiles. Il est important de souligner que les tendances varient selon la région et l'heure de la journée. Je devrais garder un ton amical et 