<a href="https://colab.research.google.com/github/samratkar/samratkar.github.io/blob/main/GPT_OSS_(1)_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## (1) API access through HF Inference Providers

In [None]:
import os
from openai import OpenAI

HF_TOKEN = os.environ.get("HF_TOKEN", "")   # read the token from the environment instead of hard-coding it

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=HF_TOKEN,          # ← use the variable here
)

completion = client.chat.completions.create(
    model="openai/gpt-oss-120b:cerebras",
    messages=[{"role": "user",
               "content": "Give me a short code to find the smallest prime number less than 100"}],
)

#print(completion.choices[0].message)

from IPython.display import Markdown, display

reply = completion.choices[0].message.content          # the text you really want
display(Markdown(reply))


Here’s a **tiny Python snippet** that scans the numbers 1‑99, picks the first one that’s prime, and prints it (which of course will be 2).

```python
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True

# Find the smallest prime < 100
for n in range(2, 100):
    if is_prime(n):
        print(n)        # → 2
        break
```

**How it works**

1. `is_prime` checks divisibility only up to √n – fast enough for this tiny range.
2. The `for` loop starts at 2 (the first possible prime) and stops as soon as it finds a prime, printing it and exiting with `break`.

Running the script prints:

```
2
```

*If you need the result as a value instead of printing, simply replace the `print(n)` line with `smallest = n` and use `smallest` later.*

In [None]:
import os
from openai import OpenAI

HF_TOKEN = os.environ.get("HF_TOKEN", "")   # read the token from the environment instead of hard-coding it

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=HF_TOKEN,
)

response = client.responses.create(
    model="openai/gpt-oss-20b:fireworks-ai",
    input="How many rs are in the word 'strawberry'?",
)

print(response)



Response(id='resp_5e540d347da0b6c41a09e71956a2372af766f274cf8a7761', created_at=1754454904.0, error=None, incomplete_details=None, instructions=None, metadata=None, model='openai/gpt-oss-20b:fireworks-ai', object='response', output=[ResponseReasoningItem(id='rs_b31f75e7807150882ab77e6ee719c6de380c32b3efd1b35a', summary=[], type='reasoning', encrypted_content=None, status='completed', content=[{'type': 'reasoning_text', 'text': 'We need to interpret the question. "How many rs are in the word \'strawberry\'?" Counting letter \'r\' occurrences. Let\'s examine \'strawberry\'. letters: s t r a w b e r r y. Count letter \'r\': at position3 = r, at pos? Let\'s write: s(1) t(2) r(3) a(4) w(5) b(6) e(7) r(8) r(9) y(10). So there are r at positions 3, 8, 9: three r\'s. So answer: 3. Also sometimes they ask "How many r\'s" or "rs" meaning letter r. The answer: 3. Ensure output: just answer: 3.'}]), ResponseOutputMessage(id='msg_b66f03bfbe97b5b15e5c0f0c7cae1a54e419743eae0ff3cb', content=[ResponseO
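
The printed `Response` mixes a reasoning item with the answer message, so `print(response)` is hard to read. Below is a minimal sketch of pulling out only the answer text. It assumes the message items carry `type='message'` and their text parts carry `type='output_text'` (the usual Responses API shape; the output above is cut off before that part is visible). `final_text` and the mock object are hypothetical names used only for illustration.

```python
from types import SimpleNamespace


def _get(obj, key, default=None):
    """Read a field from either a dict or an attribute-style object."""
    if isinstance(obj, dict):
        return obj.get(key, default)
    return getattr(obj, key, default)


def final_text(response) -> str:
    """Join the text parts of every 'message' item, skipping reasoning items."""
    parts = []
    for item in _get(response, "output", []) or []:
        if _get(item, "type") != "message":
            continue
        for part in _get(item, "content", []) or []:
            if _get(part, "type") == "output_text":
                parts.append(_get(part, "text", ""))
    return "".join(parts)


# Hypothetical stand-in for a real Response, shaped like the output above.
mock_response = SimpleNamespace(
    output=[
        SimpleNamespace(
            type="reasoning",
            content=[{"type": "reasoning_text", "text": "count the r's"}],
        ),
        SimpleNamespace(
            type="message",
            content=[SimpleNamespace(type="output_text", text="3")],
        ),
    ]
)
print(final_text(mock_response))
```

With a real response object, recent versions of the `openai` SDK also expose `response.output_text`, which serves the same purpose.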

## (2) Local Inference

### Part 1: Using Hugging Face Transformers

In [None]:
!pip install -q --upgrade accelerate transformers kernels


[notice] A new release of pip is available: 24.2 -> 25.2
[notice] To update, run: python -m pip install --upgrade pip


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))


MXFP4 quantization requires triton >= 3.4.0 and triton_kernels installed, we will default to dequantizing the model to bf16
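
The warning above means the MXFP4 kernels were not available, so the weights were dequantized to bf16, which needs considerably more memory than the 4-bit format. Below is a small pre-flight check. It is a sketch that assumes the packages are installed under the distribution names `triton` and `triton_kernels` (taken from the warning text, not verified here). `version_tuple`, `installed_version`, and `mxfp4_ready` are hypothetical helper names.

```python
import re
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn '3.4.0+cu121' into (3, 4, 0), ignoring any non-numeric suffix."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    nums = [int(p) for p in match.group(0).split(".")] if match else []
    # Pad short versions such as '3.4' to three components before comparing.
    return tuple(nums + [0] * (3 - len(nums)))


def installed_version(dist_name: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def mxfp4_ready(min_triton: str = "3.4.0") -> bool:
    """True only if triton meets the minimum and triton_kernels is installed."""
    triton = installed_version("triton")          # assumed distribution name
    kernels = installed_version("triton_kernels")  # assumed distribution name
    if triton is None or kernels is None:
        return False
    return version_tuple(triton) >= version_tuple(min_triton)


print("MXFP4 kernels usable:", mxfp4_ready())
```

Running this before `from_pretrained` makes it clear up front whether the model will load in MXFP4 or fall back to bf16.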



<|channel|>analysis<|message|>The user asks: "How many rs are in the word 'strawberry'?" They likely refer to the letter 'r' count in the word "strawberry". The word "strawberry" contains letters: s, t, r, a, w, b, e, r, r, y. Actually "strawberry": s t r a w b e r r y. Count r's: positions 3, 8, 9:
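
The decoded text starts with `<|channel|>analysis<|message|>`: the model writes its reasoning to an `analysis` channel first, and with `max_new_tokens=100` generation can stop before any answer appears, as seems to have happened here. Below is a minimal sketch of separating the channels in raw decoded text. It assumes the answer arrives in a channel named `final` and that segments end at special tokens such as `<|end|>` or `<|return|>` (neither is visible in the truncated output above). `split_channels` and the sample string are hypothetical.

```python
import re

# Matches "<|channel|>NAME<|message|>BODY" up to the next special token or end of text.
_SEGMENT = re.compile(
    r"<\|channel\|>(\w+)<\|message\|>(.*?)(?=<\|[a-z_]+\|>|$)",
    re.DOTALL,
)


def split_channels(raw: str) -> dict:
    """Map each channel name to its concatenated message text."""
    channels = {}
    for name, body in _SEGMENT.findall(raw):
        channels[name] = channels.get(name, "") + body.strip()
    return channels


# Hypothetical decoded output; real text depends on the model and token budget.
sample = (
    "<|channel|>analysis<|message|>s t r a w b e r r y has r at 3, 8, 9."
    "<|end|><|start|>assistant<|channel|>final<|message|>3<|return|>"
)
parsed = split_channels(sample)
print(parsed.get("final", "(no final answer yet; raise max_new_tokens)"))
```

Applied to the truncated output above, the result would contain only an `analysis` entry, which is a quick signal that the token budget was too small.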


### Part 2: Using vLLM (currently fails with the error below; tracked at https://github.com/oumi-ai/oumi/issues/1906)

In [None]:
!pip install -q vllm

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




In [None]:
from vllm import LLM

llm = LLM(
    "openai/gpt-oss-120b",
    tensor_parallel_size=1
)


INFO 08-06 04:36:40 [__init__.py:235] Automatically detected platform cuda.



INFO 08-06 04:36:48 [config.py:1604] Using max model len 131072


ValidationError: 1 validation error for ModelConfig
  Value error, Unknown quantization method: mxfp4. Must be one of ['aqlm', 'awq', 'deepspeedfp', 'tpu_int8', 'fp8', 'ptpc_fp8', 'fbgemm_fp8', 'modelopt', 'modelopt_fp4', 'marlin', 'bitblas', 'gguf', 'gptq_marlin_24', 'gptq_marlin', 'gptq_bitblas', 'awq_marlin', 'gptq', 'compressed-tensors', 'bitsandbytes', 'qqq', 'hqq', 'experts_int8', 'neuron_quant', 'ipex', 'quark', 'moe_wna16', 'torchao', 'auto-round', 'rtn', 'inc']. [type=value_error, input_value=ArgsKwargs((), {'model': ...attention_dtype': None}), input_type=ArgsKwargs]
    For further information visit https://errors.pydantic.dev/2.11/v/value_error

## (3) Applications

### Part 1: Text Classification

In [None]:
!pip install -q datasets
!pip install -q scikit-learn



In [None]:
# sentiment_gpt_oss20b.py
# -------------------------------------------------------------
#  Sentiment classification for Rotten-Tomatoes reviews
#  using openai/gpt-oss-20b
#  – no spurious warnings
#  – batched inference for speed
# -------------------------------------------------------------

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    pipeline,
    set_seed,
)
from transformers.utils import logging as hf_logging
from datasets import load_dataset
from transformers.pipelines.pt_utils import KeyDataset
from sklearn.metrics import classification_report
from tqdm import tqdm

# ------------------------------------------------------------------
# 0.  Silence most Transformers log noise (including the flag notice)
# ------------------------------------------------------------------
hf_logging.set_verbosity_error()   # show only real errors

# ------------------------------------------------------------------
# 1.  Model & tokenizer
# ------------------------------------------------------------------
model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:          # many GPT-style tokenizers lack pad
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"          # decoder-only models need left padding for batched generation

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # place layers across available GPUs, offloading to CPU if needed
    torch_dtype="auto",         # use the dtype recorded in the checkpoint config
)

set_seed(42)                    # deterministic (affects sampling only)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

# ------------------------------------------------------------------
# 2.  Dataset & prompt construction
# ------------------------------------------------------------------
data = load_dataset("rotten_tomatoes")

PROMPT_TEMPLATE = (
    "You are a sentiment-analysis assistant.\n"
    "Question: Is the following movie review positive or negative?\n"
    "Review: {text}\n"
    "Answer (one word: 'positive' or 'negative'):"
)

def build_prompt(example):
    return {"prompt": PROMPT_TEMPLATE.format(text=example["text"])}

data = data.map(build_prompt, desc="Building prompts")

# ------------------------------------------------------------------
# 3.  Batched inference
# ------------------------------------------------------------------
def classify(split: str, batch_size: int = 8):
    """
    Returns a list of 0/1 predictions (0=negative, 1=positive).
    """
    y_pred = []

    gen_kwargs = {
        "max_new_tokens": 2,     # 1-2 tokens is enough for "positive"/"negative"
        "do_sample": False,      # greedy → deterministic, no 'temperature' needed
    }

    ds_iter = KeyDataset(data[split], "prompt")
    for output in tqdm(
        pipe(ds_iter, batch_size=batch_size, **gen_kwargs),
        total=len(data[split]),
        desc=f"Inference on '{split}' split (batch={batch_size})"
    ):
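        # generated_text is prompt + completion (return_full_text defaults to True),
        # so the last whitespace-separated token is treated as the model's answer.
        answer = output[0]["generated_text"].strip().split()[-1].lower()
        # Anything that does not start with "positive" is counted as negative.
        y_pred.append(1 if answer.startswith("positive") else 0)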

    return y_pred

print("Running inference … this is CPU/GPU-heavy for a 20B-parameter model.")
y_pred_test = classify("test", batch_size=8)   # adjust batch_size to fit VRAM

# ------------------------------------------------------------------
# 4.  Evaluation
# ------------------------------------------------------------------
def evaluate_performance(y_true, y_pred):
    print(
        classification_report(
            y_true,
            y_pred,
            target_names=["Negative Review", "Positive Review"],
            digits=4,
        )
    )

evaluate_performance(data["test"]["label"], y_pred_test)

# --------------------------------------------------------------
# 5.  Show some sample I/O
# --------------------------------------------------------------
import random
from textwrap import shorten

SAMPLES_TO_SHOW = 10
idxs = random.sample(range(len(data["test"])), SAMPLES_TO_SHOW)

label_map = {0: "negative", 1: "positive"}

print("\n─── SAMPLE PREDICTIONS ────────────────────────────────────────")
for idx in idxs:
    review = data["test"][idx]["text"]
    gold   = label_map[data["test"][idx]["label"]]
    pred   = label_map[y_pred_test[idx]]

    # tighten long reviews to ~120 chars for console
    view = shorten(review, width=120, placeholder=" …")

    print(f"\nREVIEW  : {view}")
    print(f"PREDICT : {pred}")
    print(f"GROUND  : {gold}")
print("───────────────────────────────────────────────────────────────")




Some parameters are on the meta device because they were offloaded to the cpu.


Running inference … this is CPU/GPU-heavy for a 20B-parameter model.


Inference on 'test' split (batch=8): 100%|██████████| 1066/1066 [02:59<00:00,  5.94it/s]

                 precision    recall  f1-score   support

Negative Review     0.5086    0.9981    0.6738       533
Positive Review     0.9500    0.0356    0.0687       533

       accuracy                         0.5169      1066
      macro avg     0.7293    0.5169    0.3713      1066
   weighted avg     0.7293    0.5169    0.3713      1066
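
The report is easier to interpret as raw counts. The sketch below uses a confusion matrix that is consistent with the printed metrics: 532 of 533 negatives and 19 of 533 positives classified correctly. These counts are inferred by back-solving the rounded report; the notebook does not print them. The code recomputes precision, recall, F1, and accuracy in plain Python, and shows that the classifier labels almost everything as negative, which explains the near-chance accuracy.

```python
# Counts inferred from the report (assumption: back-solved from the rounded metrics).
tn, fp = 532, 1     # negatives predicted negative / negatives predicted positive
fn, tp = 514, 19    # positives predicted negative / positives predicted positive


def safe_div(num: float, den: float) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return safe_div(2 * p * r, p + r)


total = tn + fp + fn + tp
neg_precision = safe_div(tn, tn + fn)
neg_recall = safe_div(tn, tn + fp)
pos_precision = safe_div(tp, tp + fp)
pos_recall = safe_div(tp, tp + fn)
accuracy = safe_div(tn + tp, total)
share_predicted_negative = safe_div(tn + fn, total)

print(f"negative: P={neg_precision:.4f} R={neg_recall:.4f} F1={f1(neg_precision, neg_recall):.4f}")
print(f"positive: P={pos_precision:.4f} R={pos_recall:.4f} F1={f1(pos_precision, pos_recall):.4f}")
print(f"accuracy: {accuracy:.4f}")
print(f"share predicted negative: {share_predicted_negative:.4f}")
```

A prediction pattern this lopsided usually points at the answer-parsing step rather than at the model's judgment, so inspecting a few raw `generated_text` values is a sensible next check.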


─── SAMPLE PREDICTIONS ────────────────────────────────────────

REVIEW  : noyce creates a film of near-hypnotic physical beauty even as he tells a story as horrifying as any in the heart- …
PREDICT : negative
GROUND  : positive

REVIEW  : [scherfig] has made a movie that will leave you wondering about the characters' lives after the clever credits roll .
PREDICT : negative
GROUND  : positive

REVIEW  : the dramatic scenes are frequently unintentionally funny , and the action sequences -- clearly the main event -- are …
PREDICT : negative
GROUND  : negative

REVIEW  : this is popcorn movie fun with equal doses of action , cheese , ham and cheek (


