# Subtext Codec Demo

This notebook demonstrates **subtext-codec**, a steganographic codec that hides arbitrary binary data
inside LLM-generated text by steering token selection via logit rank.

## How It Works

1. **Encoding**: Binary data is converted to a sequence of "digits" in a variable base. Each digit
   determines which token rank (from the model's top-k predictions) to select at each generation step.
   This produces natural-looking text that secretly encodes your data.

2. **Decoding**: Given the encoded text and a key, we re-run the model to determine what the top-k
   tokens were at each step. By observing which token was actually chosen, we recover the original
   digit sequence and convert it back to bytes.

The process is **fully reversible** and **deterministic** when using the same model and parameters.
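
The "variable base" idea in the steps above can be sketched without any model. This is a toy illustration, not the library's implementation: the function names `bytes_to_digits` and `digits_to_bytes` are hypothetical, and the `bases` list stands in for the number of candidate tokens the model would offer at each step.

```python
def bytes_to_digits(data: bytes, bases: list[int]) -> list[int]:
    """Split the payload into one digit per step, where step i has base bases[i]."""
    value = int.from_bytes(data, "big")
    digits = []
    for base in bases:
        value, digit = divmod(value, base)  # digit is always in range(base)
        digits.append(digit)
    if value:
        raise ValueError("not enough capacity in the given bases")
    return digits


def digits_to_bytes(digits: list[int], bases: list[int], length: int) -> bytes:
    """Inverse of bytes_to_digits; `length` restores any leading zero bytes."""
    value = 0
    for digit, base in zip(reversed(digits), reversed(bases)):
        value = value * base + digit
    return value.to_bytes(length, "big")


# Hypothetical per-step candidate counts; a base of 1 carries no data.
bases = [16, 3, 1, 16, 5, 16, 4]
digits = bytes_to_digits(b"Hi", bases)
print(digits)  # [9, 0, 0, 2, 4, 4, 0]
assert digits_to_bytes(digits, bases, length=2) == b"Hi"
```

In this sketch each digit would be read as the rank of the token to emit at that step. The explicit `length` argument shows why a payload length is worth storing alongside the other parameters: the integer form alone cannot tell how many leading zero bytes the payload had.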

## 1. Environment Setup

First, we configure CUDA for deterministic behavior. This environment variable ensures
reproducible results when using GPU acceleration.

In [None]:
# Set CUDA workspace configuration for deterministic cuBLAS operations.
# This is required BEFORE importing torch to ensure reproducible GPU computations.
%env CUBLAS_WORKSPACE_CONFIG=:4096:8
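
The `%env` magic only exists in IPython and Jupyter. In a plain Python script, a roughly equivalent sketch is to set the variable through `os.environ` at the very top of the file, before anything imports torch. Whether subtext-codec sets further determinism flags internally is not shown in this notebook.

```python
import os

# Must run before the first `import torch` (or any module that imports it).
# setdefault keeps a value that was already exported in the shell.
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

print(os.environ["CUBLAS_WORKSPACE_CONFIG"])
```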

In [None]:
# Import the subtext_codec library
import subtext_codec

# Also import json for loading/viewing the key file
import json
from pathlib import Path

print(f"subtext-codec version: {subtext_codec.__version__}")

## 2. Explore the Sample Data

The `samples/` directory contains pre-generated example files:

- **`secret.txt`**: The original secret payload (Lorem ipsum text)
- **`key.json`**: The codec key containing parameters needed for decoding
- **`message.txt`**: The encoded text (looks like a steganography tutorial!)
- **`decoded.txt`**: The decoded payload (should match secret.txt exactly)

In [None]:
# Define paths to sample files
SAMPLES_DIR = Path("samples")

SECRET_FILE = SAMPLES_DIR / "secret.txt"      # Original secret data to encode
KEY_FILE = SAMPLES_DIR / "key.json"           # Codec key for encoding/decoding
MESSAGE_FILE = SAMPLES_DIR / "message.txt"    # Encoded steganographic text
DECODED_FILE = SAMPLES_DIR / "decoded.txt"    # Decoded output (should match secret)

# Verify all sample files exist
for f in [SECRET_FILE, KEY_FILE, MESSAGE_FILE, DECODED_FILE]:
    assert f.exists(), f"Missing sample file: {f}"
print("All sample files found!")

In [None]:
# Load and display the secret payload
# This is the data we want to hide inside LLM-generated text
secret_data = SECRET_FILE.read_bytes()

print(f"Secret payload size: {len(secret_data)} bytes")
print(f"\n--- Secret Content ---\n")
print(secret_data.decode('utf-8'))

In [None]:
# Load and examine the codec key
# The key contains all parameters needed to decode the message
with open(KEY_FILE) as f:
    key_data = json.load(f)

print("--- Codec Key Contents ---\n")
for k, v in key_data.items():
    print(f"  {k}: {v}")

print("\n--- Key Field Explanations ---")
print("""
  version:           Codec version (v2 = dynamic mixed-radix encoding)
  top_k:             Maximum number of token candidates at each step
  top_p:             Nucleus sampling threshold (cumulative probability cutoff)
  prompt_prefix:     The text prompt that precedes the encoded content
  model_name_or_path: HuggingFace model identifier used for encoding
  device:            Computation device (cuda/cpu)
  torch_dtype:       Model precision (bf16 for bfloat16)
  payload_length:    Exact byte count of original payload (for reconstruction)
""")
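
Because loading the model is expensive, it can be worth sanity-checking a key before going further. The sketch below is a hypothetical helper, not part of subtext-codec: the field names come from the explanation above, while the choice of which fields are mandatory and the accepted value ranges are assumptions.

```python
def check_key(key: dict) -> list[str]:
    """Return a list of problems found in a key dict; an empty list means it looks sane."""
    problems = []
    # Assumed to be mandatory; the model fields may be absent if they were not stored.
    for name in ("version", "top_k", "top_p", "prompt_prefix", "payload_length"):
        if name not in key:
            problems.append(f"missing field: {name}")

    top_k = key.get("top_k")
    if isinstance(top_k, int) and top_k < 1:
        problems.append("top_k must be at least 1")

    top_p = key.get("top_p")
    if isinstance(top_p, (int, float)) and not 0.0 < top_p <= 1.0:
        problems.append("top_p must be in (0, 1]")

    length = key.get("payload_length")
    if isinstance(length, int) and length < 0:
        problems.append("payload_length must be non-negative")
    return problems


example_key = {
    "version": "v2",
    "top_k": 16,
    "top_p": 0.9,
    "prompt_prefix": "Here's an interesting fact about neural networks: ",
    "payload_length": 70,
}
print(check_key(example_key) or "key looks sane")
```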

In [None]:
# Load and display the encoded message
# Notice how it reads like natural text about steganography!
# The secret data is hidden in the specific word choices.
encoded_message = MESSAGE_FILE.read_text(encoding="utf-8")

print(f"Encoded message size: {len(encoded_message)} characters")
print(f"\n--- Encoded Message (first 1500 chars) ---\n")
print(encoded_message[:1500])
print("\n[... message continues ...]")
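
The sample listing says `decoded.txt` should match `secret.txt` exactly. A minimal way to confirm that, sketched here with hypothetical helper names and only the standard library, is to compare SHA-256 digests of the two files.

```python
import hashlib
from pathlib import Path


def sha256_of(path) -> str:
    """Hex SHA-256 digest of a file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def files_match(a, b) -> bool:
    """True when both files contain byte-for-byte identical data."""
    return sha256_of(a) == sha256_of(b)
```

With the paths defined earlier, `files_match(SECRET_FILE, DECODED_FILE)` would be expected to return `True` if the sample files are intact.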

## 3. Load the Language Model

Both encoding and decoding require the same language model. The model generates probability
distributions over tokens, which the codec uses to hide/recover data.

**Note**: This step downloads and loads a large model (about 16 GB of weights for Llama-3.1-8B in bf16).
Make sure you have sufficient GPU memory and have accepted the model's license on HuggingFace.

In [None]:
# Load the model and tokenizer
# Parameters:
#   - model_name_or_path: HuggingFace model identifier (must match key)
#   - device: "cuda" for GPU, "cpu" for CPU (much slower)
#   - torch_dtype: "bf16" for bfloat16 precision (saves memory, matches key)

MODEL_NAME = key_data["model_name_or_path"]  # Use the model specified in the key
DEVICE = "cuda"  # Change to "cpu" if no GPU available
DTYPE = key_data.get("torch_dtype", "bf16")

print(f"Loading model: {MODEL_NAME}")
print(f"Device: {DEVICE}, Dtype: {DTYPE}")
print("This may take a minute...\n")

tokenizer, model = subtext_codec.load_model_and_tokenizer(
    MODEL_NAME,
    DEVICE,
    DTYPE
)

print(f"\nModel loaded successfully!")
print(f"Vocabulary size: {len(tokenizer):,} tokens")
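
To make the role of the probability distribution concrete, here is a toy sketch of how `top_k` and `top_p` could narrow a distribution down to a ranked candidate list. It uses the common top-k then nucleus filtering scheme on plain Python floats. The function name `candidate_ranks` is hypothetical, and the library's exact filtering and tie-breaking rules are not shown in this notebook.

```python
def candidate_ranks(probs: list[float], top_k: int, top_p: float) -> list[int]:
    """Return token indices, most likely first, kept by a top-k then top-p filter."""
    # Rank all token indices by probability and keep at most top_k of them.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept, mass = [], 0.0
    for idx in ranked:
        kept.append(idx)
        mass += probs[idx]
        if mass >= top_p:  # smallest prefix whose cumulative mass reaches top_p
            break
    return kept


# A fairly flat distribution leaves several candidates, so a digit can be hidden.
print(candidate_ranks([0.1, 0.6, 0.05, 0.25], top_k=16, top_p=0.9))  # [1, 3, 0]
# A sharply peaked distribution leaves a single candidate, so no data fits here.
print(candidate_ranks([0.97, 0.02, 0.01], top_k=16, top_p=0.9))      # [0]
```

The number of surviving candidates is what varies from step to step, which is why the digit base is variable rather than fixed at `top_k`.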

## 4. Encode New Data

Now let's encode our own secret message. We'll create a new configuration and
encode a custom payload into fresh steganographic text.

In [12]:
# Create a codec configuration for encoding
# These parameters control how data is hidden in the generated text

encode_config = subtext_codec.CodecConfig(
    model_name_or_path=MODEL_NAME,
    device=DEVICE,
    
    # The prompt that starts the generated text
    # Choose something that gives the model context for natural generation
    prompt_prefix="Here's an interesting fact about neural networks: ",
    
    # top_k: Maximum number of candidate tokens at each step
    # Higher = more capacity per token, but potentially less natural text
    top_k=16,
    
    # top_p: Nucleus sampling threshold (0.0 to 1.0)
    # Only tokens within this cumulative probability mass are considered
    # Lower = more focused/natural text, less encoding capacity
    top_p=0.9,
    
    # Store model info in the key for easier decoding later
    store_model_in_key=True,
    
    torch_dtype=DTYPE,
)

print("Encoding Configuration:")
print(f"  Model: {encode_config.model_name_or_path}")
print(f"  Prompt: '{encode_config.prompt_prefix}'")
print(f"  top_k: {encode_config.top_k}")
print(f"  top_p: {encode_config.top_p}")

Encoding Configuration:
  Model: meta-llama/Llama-3.1-8B-Instruct
  Prompt: 'Here's an interesting fact about neural networks: '
  top_k: 16
  top_p: 0.9


In [22]:
# Define a custom secret payload to encode
my_secret = b"This is my secret message! It will be hidden inside AI-generated text."

print(f"Payload to encode: {len(my_secret)} bytes")
print(f"Content: {my_secret.decode('utf-8')}")
print("\nEncoding... (this generates text token by token)\n")

# Encode the data into steganographic text
# Returns:
#   - encoded_text: The generated text with hidden data
#   - new_key: CodecKey needed to decode the message later
encoded_text, new_key = subtext_codec.encode_data_to_text(
    data=my_secret,
    cfg=encode_config,
    model=model,
    tokenizer=tokenizer,
)

print(f"Generated {len(encoded_text)} characters of steganographic text")
print(f"\n--- Begin Encoded Text ---\n")
print(encoded_text)
print(f"\n--- End Encoded Text ---\n")

Payload to encode: 70 bytes
Content: This is my secret message! It will be hidden inside AI-generated text.

Encoding... (this generates text token by token)

Generated 993 characters of steganographic text

--- Begin Encoded Text ---

Here's an interesting fact about neural networks: 80% to a neural network can learn in the first 8-10 training cycles and then, in essence, they are stuck at about the 80% level, with little to know increase in learning, despite increasing training cycles.
The above is from a 1982 report from Stanford on the subject, where David A. Bayer was experimenting on this particular phenomenon. The full quote goes: 
``We've seen in all experiments to date that at around 80% or 90%, there will come a plateau where, for no good apparent reason, learning will suddenly cease." 

It appears the 80-20 principle is very much applicable to this field and other fields like finance and even software, where about 20% effort generates 80-80-80 of the results.

I would also l

In [15]:
# Examine the generated key
# This key is REQUIRED to decode the message - keep it safe!

print("--- Generated Codec Key ---\n")
print(f"  version: {new_key.version}")
print(f"  top_k: {new_key.top_k}")
print(f"  top_p: {new_key.top_p}")
print(f"  payload_length: {new_key.payload_length}")
print(f"  prompt_prefix: '{new_key.prompt_prefix}'")
print(f"  model_name_or_path: {new_key.model_name_or_path}")

print("\nIMPORTANT: Without this key, the message cannot be decoded!")

--- Generated Codec Key ---

  version: v2
  top_k: 16
  top_p: 0.9
  payload_length: 70
  prompt_prefix: 'Here's an interesting fact about neural networks: '
  model_name_or_path: meta-llama/Llama-3.1-8B-Instruct

IMPORTANT: Without this key, the message cannot be decoded!


In [17]:
# Verify round-trip: decode the text we just encoded
# This confirms encoding -> decoding recovers the exact original data

print("Performing round-trip verification...\n")

roundtrip_decoded = subtext_codec.decode_text_to_data(
    encoded_text=encoded_text,
    key=new_key,
    prompt_prefix=encode_config.prompt_prefix,
    model=model,
    tokenizer=tokenizer,
    device=DEVICE,
)

# Verify exact match
assert roundtrip_decoded == my_secret, "Round-trip verification failed!"

print("Round-trip verification PASSED!")
print(f"  Original:  {my_secret}")
print(f"  Decoded:   {roundtrip_decoded}")
print(f"  Match:     {roundtrip_decoded == my_secret}")

Performing round-trip verification...

Round-trip verification PASSED!
  Original:  b'This is my secret message! It will be hidden inside AI-generated text.'
  Decoded:   b'This is my secret message! It will be hidden inside AI-generated text.'
  Match:     True


## 5. Robustness: Handling Leading/Trailing Text

The decoder tolerates extra text before or after the encoded message. This is useful because:
- The encoded text might be copied with extra content
- Social media or messaging apps might append signatures or metadata

The decoder uses a **sentinel token** to know where the encoded data ends.

In [21]:
# Add some leading and trailing text that wasn't part of the original encoding
noisy_text = "\n\n[This leading text was added later and is not part of the encoded message.]" + encoded_text + "\n\n[This trailing text was added later and is not part of the encoded message.]"

print(f"Original encoded length: {len(encoded_text)} chars")
print(f"With leading/trailing noise:     {len(noisy_text)} chars")
print(f"\nDecoding noisy text...\n")

# Decoding should still work, ignoring the added leading and trailing content
decoded_from_noisy = subtext_codec.decode_text_to_data(
    encoded_text=noisy_text,
    key=new_key,
    prompt_prefix=encode_config.prompt_prefix,
    model=model,
    tokenizer=tokenizer,
    device=DEVICE,
)

assert decoded_from_noisy == my_secret, "Decoding with leading/trailing text failed!"

print("SUCCESS: Decoded correctly despite trailing noise!")
print(f"Recovered: {decoded_from_noisy.decode('utf-8')}")

Original encoded length: 993 chars
With leading/trailing noise:     1148 chars

Decoding noisy text...

SUCCESS: Decoded correctly despite trailing noise!
Recovered: This is my secret message! It will be hidden inside AI-generated text.


## 6. Saving and Loading Keys

In practice, you'll want to save the codec key to share with the intended recipient.
The key is essential for decoding - treat it like a password!

In [None]:
# Save the key to a file
output_key_path = Path("my_key.json")

subtext_codec.save_codec_key(new_key, output_key_path)

print(f"Key saved to: {output_key_path}")
print(f"\nKey file contents:")
print(output_key_path.read_text())

In [None]:
# Load the key back and verify it works
loaded_key = subtext_codec.load_codec_key(output_key_path)

# Decode using the loaded key
decoded_with_loaded_key = subtext_codec.decode_text_to_data(
    encoded_text=encoded_text,  # same keyword as in the earlier decode calls
    key=loaded_key,
    prompt_prefix=encode_config.prompt_prefix,
    model=model,
    tokenizer=tokenizer,
    device=DEVICE,
)

assert decoded_with_loaded_key == my_secret
print("Successfully decoded using loaded key!")

# Clean up the temporary key file
output_key_path.unlink()
print("(Temporary key file cleaned up)")
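
Since the key should be treated like a password, one option on POSIX systems is to write it with owner-only permissions. The helper below is a hypothetical sketch that works on any JSON-serializable dict (for example, the parsed contents of a key file). It is not part of subtext-codec, and permission bits have limited effect on Windows.

```python
import json
import os


def write_private_json(path, payload: dict) -> None:
    """Write payload as JSON to a file readable and writable by the owner only."""
    # The mode argument only applies when the file is newly created...
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(payload, fh, indent=2)
    # ...so tighten permissions explicitly in case the file already existed.
    os.chmod(path, 0o600)
```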

## Summary

This demo covered the core functionality of subtext-codec:

1. **Environment Setup**: Configure cuBLAS for deterministic GPU computation
2. **Sample Data**: Explore the provided samples (secret, key, message, decoded output)
3. **Model Loading**: Load the language model required for encoding and decoding
4. **Encoding**: Hide arbitrary binary data inside natural-looking LLM text, then verify the round trip by decoding it
5. **Robustness**: The decoder handles leading and trailing text gracefully
6. **Key Management**: Save and load codec keys for later decoding

### Key Points

- **Determinism is critical**: Both encoding and decoding must use the same model and
  parameters, and the GPU computations must be reproducible, for decoding to work correctly.
  
- **The key is essential**: Without the codec key, the message cannot be decoded.
  The key contains the prompt prefix, top_k/top_p settings, and payload length.
  
- **Capacity vs. naturalness**: Higher top_k allows more data per token but may
  produce less natural text. Adjust based on your needs.
  
- **Model consistency**: The exact same model (including version and precision)
  must be used for both encoding and decoding.
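
The capacity trade-off can be put into numbers with a simple upper bound: a step that offers `n` candidates can carry at most `log2(n)` bits. The sketch below (hypothetical function name) assumes the best case where every step offers all `top_k` candidates. In practice `top_p` prunes the candidate list, so real outputs need more tokens than this bound.

```python
import math


def min_tokens_needed(payload_bytes: int, top_k: int) -> int:
    """Lower bound on generated tokens if every step offered all top_k choices."""
    if top_k < 2:
        raise ValueError("top_k must be at least 2 to carry any data")
    bits = payload_bytes * 8
    return math.ceil(bits / math.log2(top_k))


# The 70-byte payload from this demo, at a few top_k settings.
print(min_tokens_needed(70, top_k=2))   # 560
print(min_tokens_needed(70, top_k=16))  # 140
print(min_tokens_needed(70, top_k=64))  # 94
```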