<a href="https://colab.research.google.com/github/weedge/doraemon-nb/blob/main/gpt_oss_mxfp4_quant_infer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Run OpenAI gpt-oss 20B in a FREE Google Colab

OpenAI released `gpt-oss` in two sizes: [120B](https://hf.co/openai/gpt-oss-120b) and [20B](https://hf.co/openai/gpt-oss-20b). Both models are Apache 2.0 licensed.

Specifically, `gpt-oss-20b` was made for lower latency and local or specialized use cases (21B parameters with 3.6B active parameters).

Because the models were trained natively in MXFP4 quantization, the 20B model is easy to run even in resource-constrained environments such as Google Colab.
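
To get a feel for why MXFP4 matters here, the following is a rough back-of-the-envelope sketch of weight memory. It assumes the standard MXFP4 layout (4-bit elements plus one shared 8-bit scale per block of 32 values) and, as a deliberate simplification, treats all 21B parameters as MXFP4. Activations, the KV cache, and any layers kept in higher precision are ignored, so the real footprint will differ.

```python
# Rough weight-memory estimate; an illustration, not a measurement.
MXFP4_BLOCK_SIZE = 32   # values sharing one scale (assumed standard MXFP4 layout)
ELEMENT_BITS = 4        # bits per quantized value
SCALE_BITS = 8          # bits per shared block scale


def bits_per_param_mxfp4() -> float:
    """Effective bits per parameter, amortizing the block scale."""
    return ELEMENT_BITS + SCALE_BITS / MXFP4_BLOCK_SIZE


def weight_gigabytes(num_params: float, bits_per_param: float) -> float:
    """Weight storage in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


total_params = 21e9  # parameter count quoted above for gpt-oss-20b
mxfp4_gb = weight_gigabytes(total_params, bits_per_param_mxfp4())
bf16_gb = weight_gigabytes(total_params, 16)

print(f"MXFP4 (simplified): {mxfp4_gb:.1f} GB")
print(f"bf16 for comparison: {bf16_gb:.1f} GB")
```

Under these assumptions the quantized weights come to roughly a quarter of the bf16 size, which is what brings the model within reach of a single Colab GPU.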

Authored by: [Pedro](https://huggingface.co/pcuenq), [VB](https://huggingface.co/reach-vb), and [weedge](https://github.com/weedge)

## Setup environment

Support for MXFP4 in transformers is bleeding edge, so we need recent versions of PyTorch and CUDA to be able to install the `mxfp4` triton kernels.

We also need to install transformers from source, and we uninstall `torchvision` and `torchaudio` to remove dependency conflicts.

In [1]:
!pip install -q --upgrade torch accelerate kernels


In [2]:
!pip install -q git+https://github.com/huggingface/transformers triton==3.4 git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for transformers (pyproject.toml) ... done
  Building wheel for triton_kernels (pyproject.toml) ... done


In [2]:
!pip uninstall -q torchvision torchaudio -y


In [2]:
!pip list | grep -E "transformers|triton|torch|accelerate|kernels"

accelerate                            1.10.0
kernels                               0.9.0
sentence-transformers                 4.1.0
torch                                 2.8.0
torchao                               0.10.0
torchdata                             0.11.0
torchsummary                          1.5.1
torchtune                             0.6.1
transformers                          4.56.0.dev0
triton                                3.4.0
triton_kernels                        1.0.0
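
If you prefer to check versions from Python rather than `pip list`, a minimal sketch using only the standard library might look like this. The helper name `installed_versions` is hypothetical, and the package list simply mirrors the names grepped above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Package names taken from the pip list filter above.
report = installed_versions(
    ["torch", "transformers", "triton", "triton_kernels", "accelerate", "kernels"]
)
for name, ver in report.items():
    print(f"{name:<16} {ver or 'NOT INSTALLED'}")
```

A `None` entry is a quick signal that one of the install cells above needs to be re-run before loading the model.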


## Load the model from Hugging Face in Google Colab

We load the model from here: [openai/gpt-oss-20b](https://hf.co/openai/gpt-oss-20b)

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig, Mxfp4Config

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
config = AutoConfig.from_pretrained(model_id)
print(config)

quantization_config = Mxfp4Config.from_dict(config.quantization_config)
print(quantization_config)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    torch_dtype="auto",
    device_map="cuda",
)

GptOssConfig {
  "architectures": [
    "GptOssForCausalLM"
  ],
  "attention_bias": true,
  "attention_dropout": 0.0,
  "eos_token_id": 200002,
  "experts_per_token": 4,
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 2880,
  "initial_context_length": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 2880,
  "layer_types": [
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "full_attention"
  ],
  "max_position_embeddings": 131072,
  "model_type": "gpt_oss",
  "num_attention_head


In [3]:
import torch


def print_model_params(model: torch.nn.Module, extra_info="", f=None):
    """Print the module tree, then the parameter count in millions."""
    # Only tensors registered as parameters are counted here.
    model_million_params = sum(p.numel() for p in model.parameters()) / 1e6
    print(model, file=f)
    print(f"{extra_info} {model_million_params} M parameters", file=f)

In [4]:
print_model_params(model, model_id)

GptOssForCausalLM(
  (model): GptOssModel(
    (embed_tokens): Embedding(201088, 2880, padding_idx=199999)
    (layers): ModuleList(
      (0-23): 24 x GptOssDecoderLayer(
        (self_attn): GptOssAttention(
          (q_proj): Linear(in_features=2880, out_features=4096, bias=True)
          (k_proj): Linear(in_features=2880, out_features=512, bias=True)
          (v_proj): Linear(in_features=2880, out_features=512, bias=True)
          (o_proj): Linear(in_features=4096, out_features=2880, bias=True)
        )
        (mlp): GptOssMLP(
          (router): GptOssTopKRouter()
          (experts): Mxfp4GptOssExperts()
        )
        (input_layernorm): GptOssRMSNorm((2880,), eps=1e-05)
        (post_attention_layernorm): GptOssRMSNorm((2880,), eps=1e-05)
      )
    )
    (norm): GptOssRMSNorm((2880,), eps=1e-05)
    (rotary_emb): GptOssRotaryEmbedding()
  )
  (lm_head): Linear(in_features=2880, out_features=201088, bias=False)
)
openai/gpt-oss-20b 1804.459584 M parameters
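
The reported 1804.46 M is far below the 21B quoted earlier. The arithmetic sketch below suggests one explanation: the count appears to cover everything except the packed MXFP4 expert weight matrices. The shapes come from the printout above; the number of experts per layer (32) and one learned sink scalar per attention head are assumptions, since neither is visible in the truncated config.

```python
# Shapes read from the module printout above.
hidden, vocab, layers = 2880, 201_088, 24
q_out, kv_out, head_dim = 4096, 512, 64
intermediate = 2880          # "intermediate_size" in the config
num_experts = 32             # ASSUMPTION: not visible in the truncated config


def linear(n_in, n_out, bias=True):
    """Parameter count of a dense layer."""
    return n_in * n_out + (n_out if bias else 0)


embeddings = vocab * hidden + linear(hidden, vocab, bias=False)  # embed + lm_head
attention = (
    linear(hidden, q_out) + 2 * linear(hidden, kv_out) + linear(q_out, hidden)
    + q_out // head_dim      # ASSUMPTION: one sink scalar per attention head
)
router = linear(hidden, num_experts)
expert_biases = num_experts * (2 * intermediate + hidden)  # gate/up + down biases
norms = 2 * hidden

counted = embeddings + layers * (attention + router + expert_biases + norms) + hidden
# Expert weight matrices: fused gate/up projection plus down projection.
expert_weights = layers * num_experts * (hidden * 2 * intermediate + intermediate * hidden)

print(f"counted without expert weights: {counted / 1e6} M")
print(f"expert weights:                 {expert_weights / 1e9:.2f} B")
print(f"total:                          {(counted + expert_weights) / 1e9:.2f} B")
```

Under these assumptions the first figure reproduces the printed count, and adding the expert weights brings the total to about 21B, in line with the model card.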


## Set up messages / chat

You can provide an optional system prompt, or pass the user input directly.

In [2]:
messages = [
    {"role": "system", "content": "Always respond in riddles"},
    {"role": "user", "content": "What is the weather like in Madrid?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=500)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))

<|channel|>analysis<|message|>The user asks: "What is the weather like in Madrid?" The system instruction says: "Always respond in riddles". There's no other overriding instruction. So we must provide a riddle-like answer. Riddle should convey the weather in Madrid. But the current weather depends on time. We don't have real-time data. But maybe we could give a generic answer: "Now is (some info) Teacoff". But given context, perhaps we can answer "it is sunny, warm" etc. The instruction demands riddle form. It's ambiguous: Should we incorporate a riddle that asks to guess? Likely a simple riddle: "When the sun paints the city bright and the breeze is gentle, the answer is...". Or we can respond: "Behold the golden hour, the whispering wind, etc." We must keep it like a riddle: "In this place, the horizon glows, the skies are clear..." Could be ambiguous. Without real-time data, maybe we can give a general phrase like "It is a day of sunshine and mild breeze."

But the instruction: Alwa

## Specify Reasoning Effort

In [5]:
messages = [
    {"role": "system", "content": "Always respond in riddles"},
    {"role": "user", "content": "Explain why the meaning of life is 42"},
]

# reasoning_effort is a chat-template argument, not a field of the message.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    reasoning_effort="high",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=500)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))

<|channel|>analysis<|message|>User: "Explain why the meaning of life is 42." They want explanation. The developer says: "Always respond in riddles." So we must answer as a riddle.

So we need to produce a riddle that explains why meaning of life is 42. So the answer should be a riddle that hints that 42 is the answer of life, maybe referencing Hitchhiker's Guide to the Galaxy, etc. So respond with a riddle, not straightforward explanation. Perhaps a riddle that leads to the answer 42. Should be poetic. E.g., "I'm nine less than seven times ten, I hide in a joke that you never quite see." That answer is 42. So we answer accordingly. We'll follow developer instruction: always respond in riddles. So answer in riddle form. Probably also mention that 42 is answer because it's a random number chosen for comedic effect. But we must keep it riddle. So produce a riddle that leads to 42, referencing Hitchhiker. Provide one riddle. Should satisfy. Let's do that.<|end|><|start|>assistant<|channel|
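
The decoded text mixes the model's reasoning (the `analysis` channel) with special tokens. A minimal sketch for splitting such output into channels is shown below. The helper `split_channels` is hypothetical, and the `final` channel name and `<|return|>` token in the sample are assumptions, since the outputs above are truncated before that point.

```python
import re

# A channel body runs until the next special token or the end of the text.
_CHANNEL_RE = re.compile(
    r"<\|channel\|>(\w+)<\|message\|>(.*?)"
    r"(?=<\|end\|>|<\|return\|>|<\|start\|>|\Z)",
    re.DOTALL,
)


def split_channels(decoded: str) -> dict[str, str]:
    """Group message text by channel name (e.g. 'analysis')."""
    channels: dict[str, str] = {}
    for name, body in _CHANNEL_RE.findall(decoded):
        channels[name] = channels.get(name, "") + body.strip()
    return channels


# Hand-written sample in the same shape as the output above.
sample = (
    "<|channel|>analysis<|message|>Need a riddle.<|end|>"
    "<|start|>assistant<|channel|>final<|message|>I am six times seven.<|return|>"
)
parts = split_channels(sample)
print(parts["analysis"])
print(parts["final"])
```

Because both generations above stop at `max_new_tokens=500` before the answer completes, this helper would return only the `analysis` channel for them; raising the token budget should let the answer channel appear.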

## Try out other prompts and ideas!

Check out our blog post for other ideas: [hf.co/blog/welcome-openai-gpt-oss](https://huggingface.co/blog/welcome-openai-gpt-oss)