# GPT-OSS Mercury Age Query

This notebook uses OpenAI's gpt-oss-20b model to answer "How old is the planet Mercury?" and returns a structured JSON output with the age in years, months, weeks, and centuries.

## Key Features:
- Uses MXFP4 quantization for efficient inference on H100 GPU
- Implements Harmony response format parsing
- Strips reasoning/analysis tokens to show only final answer
- Outputs structured JSON


## Step 1: Install Dependencies

Installing required packages with specific versions for MXFP4 compatibility on H100 GPU.


In [None]:
# Install required packages
# Note: triton==3.4 is required for MXFP4 kernel compatibility on H100
# The "kernels" package is Hugging Face's kernel loader on PyPI; it is installed
# alongside triton for the MXFP4 path. Without it, transformers may fall back to
# dequantizing the weights to bf16, which needs far more GPU memory
%pip install -U transformers accelerate torch triton==3.4 kernels python-dotenv


## Step 2: Load Hugging Face Authentication Token

Loading authentication token from .env.local file for accessing Hugging Face models.


In [None]:
import os
from dotenv import load_dotenv

# Load environment variables from .env.local (if present)
load_dotenv('.env.local')

# Get Hugging Face token from either variable name
hf_token = os.getenv('HUGGING_FACE_HUB_TOKEN') or os.getenv('HF_TOKEN')

if hf_token:
    # Mirror into both common env var names to maximize compatibility
    os.environ['HUGGING_FACE_HUB_TOKEN'] = hf_token
    os.environ['HF_TOKEN'] = hf_token

    print("✓ Hugging Face token loaded from .env.local")
    # Authenticate with Hugging Face (writes to local cache)
    try:
        from huggingface_hub import login
        login(token=hf_token, add_to_git_credential=False)
        print("✓ Authenticated with Hugging Face")
    except Exception as e:
        print(f"WARNING: Could not perform huggingface_hub login: {e}")
        print("Proceeding with environment variable authentication only.")
else:
    print("⚠️  WARNING: No Hugging Face token found in .env.local")
    print("   You may encounter rate limits or be unable to access gated models.")
    print("   Get your token from: https://huggingface.co/settings/tokens")
    print("   Then create a .env.local file with: HUGGING_FACE_HUB_TOKEN=your_token_here")


## Step 3: Verify GPU Availability

Checking that H100 GPU is available and properly configured.


In [None]:
import torch

# Verify GPU availability
print(f"GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")
else:
    print("WARNING: No GPU detected. Model will run on CPU (very slow).")


## Step 4: Load Model and Tokenizer

Loading gpt-oss-20b with MXFP4 quantization (automatic on H100). This will download ~16GB of model weights on first run.
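
To see why quantization matters for the download and memory figures, here is a back-of-envelope sketch of weight size at two precisions. The nominal 20e9 parameter count (taken from the model name) and the helper `weight_size_gb` are assumptions for illustration; the 4.25 bits per weight follows from MXFP4 storing 4-bit values plus one 8-bit scale shared by each block of 32 values.

```python
# Rough weight-size estimate; ignores activations, KV cache, and framework overhead.
# N_PARAMS is a nominal figure assumed from the model name, not a measured count.

def weight_size_gb(n_params, bits_per_param):
    """Convert a parameter count and per-parameter bit width to gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 20e9
MXFP4_BITS = 4 + 8 / 32  # 4-bit value plus an 8-bit scale shared by 32 values

for label, bits in [("bf16", 16), ("MXFP4", MXFP4_BITS)]:
    print(f"{label}: {weight_size_gb(N_PARAMS, bits):.1f} GB")
```

A real checkpoint typically keeps some tensors in higher precision, so its actual size should fall between these two estimates.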


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openai/gpt-oss-20b"

print(f"Loading tokenizer from {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name)

print(f"Loading model from {model_name}...")
print("This may take several minutes on first run (downloading ~16GB)...")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",  # Use the checkpoint's own dtype; MXFP4 weights stay quantized on supported GPUs
    device_map="auto"    # Automatically places model on available GPU
)

print("✓ Model loaded successfully!")
print(f"Model device: {model.device}")
if torch.cuda.is_available():
    print(f"Memory allocated: {torch.cuda.memory_allocated(0) / 1024**3:.2f} GB")


## Step 5: Construct Prompt with Structured Output Schema

Building the prompt using the Harmony response format with:
- System message: Defines reasoning level, channels, and model identity
- Developer message: Includes instructions and JSON schema for structured output
- User message: The question about Mercury's age
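
Putting a schema in the developer message asks the model to follow it but does not enforce it, so it can help to check the parsed object afterwards. The sketch below is a hand-rolled check covering only the required keys and number types of this notebook's schema; `find_schema_problems` is a hypothetical helper, not a full JSON Schema validator.

```python
# Minimal check of the "required" and "type: number" parts of the schema.
# Hypothetical helper for illustration; a library validator would cover more.
REQUIRED_KEYS = ["years", "months", "weeks", "centuries"]

def find_schema_problems(candidate):
    """Return a list of problems; an empty list means the object passes."""
    if not isinstance(candidate, dict):
        return ["top-level value is not a JSON object"]
    problems = []
    for key in REQUIRED_KEYS:
        if key not in candidate:
            problems.append(f"missing required key: {key}")
            continue
        value = candidate[key]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{key} is not a number: {value!r}")
    return problems

# Placeholder values, not Mercury's age
print(find_schema_problems({"years": 1, "months": 12, "weeks": "52", "centuries": 0.01}))
print(find_schema_problems({"years": 1}))
```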


In [None]:
import json

# Define the JSON schema for structured output
json_schema = {
    "type": "object",
    "properties": {
        "years": {
            "type": "number",
            "description": "Age of Mercury in years"
        },
        "months": {
            "type": "number",
            "description": "Age of Mercury in months"
        },
        "weeks": {
            "type": "number",
            "description": "Age of Mercury in weeks"
        },
        "centuries": {
            "type": "number",
            "description": "Age of Mercury in centuries"
        }
    },
    "required": ["years", "months", "weeks", "centuries"]
}

# Construct the system message
system_message = """You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-11-08
Reasoning: medium
# Valid channels: analysis, commentary, final. Channel must be included for every message."""

# Construct the developer message with structured output format
developer_message = f"""# Instructions
Provide accurate scientific information. Output the response as valid JSON only.

# Response Formats
## mercury_age
{json.dumps(json_schema)}"""

# User question
user_message = "How old is the planet Mercury? Provide the age in years, months, weeks, and centuries as a JSON object."

# Build the messages list
messages = [
    {"role": "system", "content": system_message},
    {"role": "developer", "content": developer_message},
    {"role": "user", "content": user_message}
]

print("Prompt constructed successfully!")
print(f"\nSystem message length: {len(system_message)} chars")
print(f"Developer message length: {len(developer_message)} chars")
print(f"User message: {user_message}")


## Step 6: Generate Response

Using the model's `.generate()` method with proper stop tokens to control output.
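
Stop-token IDs can also be looked up by name rather than hard-coded. The following is a minimal sketch: `resolve_stop_ids` is a hypothetical helper built on the standard `convert_tokens_to_ids` tokenizer method, and `StubTokenizer` is a stand-in so the example runs without downloading the model. With the real tokenizer, the result could be passed as `eos_token_id`.

```python
# Sketch: resolve stop-token IDs by token string instead of hard-coding them.

def resolve_stop_ids(tokenizer, stop_tokens=("<|return|>", "<|call|>")):
    """Look up each stop token and fail loudly if the tokenizer does not know it."""
    ids = []
    for token in stop_tokens:
        token_id = tokenizer.convert_tokens_to_ids(token)
        if token_id is None or token_id == getattr(tokenizer, "unk_token_id", None):
            raise ValueError(f"tokenizer has no token named {token!r}")
        ids.append(token_id)
    return ids

class StubTokenizer:
    """Stand-in for a real tokenizer; IDs mirror the ones used in this notebook."""
    unk_token_id = None
    _vocab = {"<|return|>": 200002, "<|call|>": 200012}

    def convert_tokens_to_ids(self, token):
        return self._vocab.get(token)

print(resolve_stop_ids(StubTokenizer()))
```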


In [None]:
# Apply chat template and prepare inputs
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True
).to(model.device)

print("Generating response...")
print("This may take 30-60 seconds depending on GPU...")

# Generate with stop tokens
# Stop tokens: <|return|> (200002) and <|call|> (200012)
outputs = model.generate(
    **inputs,
    max_new_tokens=500,  # Limit response length
    do_sample=True,      # Sampling must be enabled for temperature to take effect
    temperature=0.7,
    eos_token_id=[200002, 200012],  # Stop at <|return|> or <|call|>
    pad_token_id=tokenizer.eos_token_id
)

print("✓ Generation complete!")
print(f"Generated {len(outputs[0]) - len(inputs['input_ids'][0])} tokens")


## Step 7: Parse Harmony Format

Extracting only the 'final' channel content while stripping reasoning/analysis tokens.
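
To illustrate the channel structure, the sketch below splits a hand-written sample string into channels. The sample only imitates the shape of a Harmony completion and is not real model output; `split_channels` and its regular expression are assumptions for illustration, and they do not handle channel headers that carry extra fields such as tool recipients.

```python
import re

# Hand-written sample that imitates a Harmony completion (placeholder content).
SAMPLE = (
    "<|channel|>analysis<|message|>Work out the age, then convert units."
    "<|end|><|start|>assistant<|channel|>final<|message|>"
    '{"years": 1}<|return|>'
)

# A channel name, then the message body up to the next terminator token.
CHANNEL_PATTERN = re.compile(
    r"<\|channel\|>(\w+)<\|message\|>(.*?)(?=<\|end\|>|<\|return\|>|<\|call\|>|\Z)",
    re.DOTALL,
)

def split_channels(text):
    """Map each channel name to the list of message bodies found for it."""
    channels = {}
    for name, body in CHANNEL_PATTERN.findall(text):
        channels.setdefault(name, []).append(body.strip())
    return channels

channels = split_channels(SAMPLE)
print(sorted(channels))
print(channels["final"][-1])
```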


In [None]:
import re

# Decode the full output to see the harmony structure
full_output = tokenizer.decode(outputs[0], skip_special_tokens=False)

print("="*80)
print("FULL OUTPUT (with special tokens):")
print("="*80)
print(full_output)
print("="*80)

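# Extract only the generated portion (after the input)
generated_only = tokenizer.decode(outputs[0][len(inputs['input_ids'][0]):], skip_special_tokens=False)

print("\nGENERATED PORTION ONLY:")
print("="*80)
print(generated_only)
print("="*80)

# Keep only the 'final' channel; the analysis channel holds the model's reasoning
final_match = re.search(
    r"<\|channel\|>final<\|message\|>(.*?)(?:<\|return\|>|<\|end\|>|$)",
    generated_only,
    re.DOTALL,
)
final_content = final_match.group(1).strip() if final_match else None

print("\nFINAL CHANNEL CONTENT:")
print(final_content if final_content else "(no final channel found; try raising max_new_tokens)")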


## Step 8: Parse and Display JSON Output

Converting the final content to JSON and displaying it in a structured format.
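
Because the four fields describe one quantity in different units, a parsed result can be cross-checked with simple arithmetic. This sketch assumes a Gregorian mean year of 365.2425 days, 12 months per year, and 100 years per century; `expected_units` and `inconsistent_fields` are hypothetical helpers, and the check assumes all values are numeric.

```python
import math

# Conversion factors are assumptions of this sketch.
DAYS_PER_YEAR = 365.2425

def expected_units(years):
    """Derive the other three fields from an age in years."""
    return {
        "years": years,
        "months": years * 12,
        "weeks": years * DAYS_PER_YEAR / 7,
        "centuries": years / 100,
    }

def inconsistent_fields(result, rel_tol=0.01):
    """Return keys whose value disagrees with the value implied by result['years']."""
    expected = expected_units(result["years"])
    return [
        key for key, value in expected.items()
        if not math.isclose(result.get(key, float("nan")), value, rel_tol=rel_tol)
    ]

# Illustrative input built from the commonly cited figure of about 4.5 billion years
sample = expected_units(4.5e9)
sample["centuries"] = 4.5e9  # deliberately wrong: not divided by 100
print(inconsistent_fields(sample))
```

The 1% tolerance is loose enough to accept answers that use 52 weeks per year, while still catching order-of-magnitude slips.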


In [None]:
def parse_json_from_text(text):
    """
    Extract and parse JSON from text content.
    Handles cases where JSON might be embedded in markdown or surrounded by text.
    """
    if not text:
        return None
    
    # Try to parse the entire text as JSON first
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    
    # Look for JSON block in markdown code blocks
    json_block_pattern = r'```(?:json)?\s*(\{.*?\})\s*```'
    matches = re.findall(json_block_pattern, text, re.DOTALL)
    if matches:
        try:
            return json.loads(matches[0])
        except json.JSONDecodeError:
            pass
    
    # Look for JSON object pattern
    json_pattern = r'\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\}'
    matches = re.findall(json_pattern, text, re.DOTALL)
    for match in matches:
        try:
            parsed = json.loads(match)
            # Verify it has our expected keys
            if all(key in parsed for key in ['years', 'months', 'weeks', 'centuries']):
                return parsed
        except json.JSONDecodeError:
            continue
    
    return None

# Parse the JSON from final content
if final_content:
    result_json = parse_json_from_text(final_content)
    
    if result_json:
        print("\n" + "="*80)
        print("FINAL JSON OUTPUT - Mercury's Age:")
        print("="*80)
        print(json.dumps(result_json, indent=2))
        print("="*80)
        
        # Display in a more readable format
        print("\n📊 Mercury's Age Breakdown:")
        for label, key in [("Years:", "years"), ("Months:", "months"), ("Weeks:", "weeks"), ("Centuries:", "centuries")]:
            value = result_json.get(key)
            # Apply the numeric format only to numbers; anything else prints as N/A
            shown = f"{value:,.0f}" if isinstance(value, (int, float)) else "N/A"
            print(f"   • {label:<10} {shown}")
    else:
        print("\nWARNING: Could not parse JSON from final content")
        print("Final content was:")
        print(final_content)
else:
    print("\nERROR: No final content to parse")


## Summary

This notebook demonstrated:
1. ✅ Loading OpenAI's gpt-oss-20b model with MXFP4 quantization on H100 GPU
2. ✅ Constructing prompts using the Harmony response format with structured output schema
3. ✅ Generating responses with proper stop token handling
4. ✅ Parsing the multi-channel harmony format to extract only the final (user-facing) content
5. ✅ Stripping reasoning/analysis tokens for clean JSON output
6. ✅ Displaying Mercury's age in multiple time units

### Key Takeaways:
- **MXFP4 quantization** keeps memory usage at ~16GB on H100
- **Harmony format** separates reasoning (analysis channel) from final output (final channel)
- **Structured output** requires defining JSON schema in the developer message
- **Stop tokens** (<|return|>, <|call|>) control when generation should end
