# LLM Encrypted Token Generation

This notebook shows how to configure a GPT2 model to generate text based on an encrypted prompt. The GPT2 model shown
here runs some layers in the clear on the client machine and the others on the server under fully homomorphic encryption (FHE):
- on the client-side: non-linear layers, such as attention, normalization and activation functions
- on the server-side: all linear layers that have trained weights

To generate one token, the model shown here requires around 11 seconds on a desktop GPU.

In [9]:
# Import necessary libraries
import os
import random

import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, Conv1D

from concrete.ml.torch.hybrid_model import HybridFHEModel

# Set random seed for reproducibility
SEED = 0
torch.manual_seed(SEED)

<torch._C.Generator at 0x73526e76a630>

In [10]:
def generate_and_print(prompt, model, tokenizer, seed=None, max_new_tokens=30):
    """
    Generates text based on the provided prompt and prints both the prompt and the generated text.

    Args:
        prompt (str): The input prompt to generate text from.
        model: The pre-trained language model.
        tokenizer: The tokenizer associated with the model.
        seed (int, optional): Seed for random number generators to ensure reproducibility.
        max_new_tokens (int, optional): Maximum number of tokens to generate. Defaults to 30.
    Returns:
        str: The generated text (response only, without the prompt).
    """
    # Set the environment variable for CuBLAS deterministic behavior
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

    # Set the random seed for reproducibility
    if seed is not None:
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)

    # Tokenize the input prompt (returns input_ids and attention_mask)
    inputs = tokenizer(prompt, return_tensors="pt")

    # Generate text
    with torch.no_grad():
        output = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_new_tokens=max_new_tokens,
            top_p=0.9,
            temperature=0.6,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )

    # Get only the newly generated tokens
    input_length = inputs["input_ids"].shape[1]
    generated_ids = output[0, input_length:]
    generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()

    # Print the prompt and generated text
    print(f"Prompt: {prompt}")
    print(f"Response: {generated_text}\n")

    return generated_text
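
The `generate` call above samples with `temperature=0.6` and `top_p=0.9`. As a rough illustration of what these two settings do to a single next-token distribution, here is a minimal pure-Python sketch. `top_p_sample` is a hypothetical helper written for this explanation and is not part of `transformers`; the library works on tensors and may differ in detail.

```python
import math
import random


def top_p_sample(logits, temperature=0.6, top_p=0.9, rng=random):
    """Sample one token index using temperature scaling and nucleus (top-p) filtering."""
    # Temperature < 1 sharpens the distribution, > 1 flattens it
    scaled = [logit / temperature for logit in logits]

    # Numerically stable softmax
    max_scaled = max(scaled)
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose cumulative probability reaches top_p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break

    # Sample among the kept tokens, proportionally to their probabilities
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


# Toy logits over a 4-token vocabulary (made-up values)
rng = random.Random(0)
token_index = top_p_sample([5.0, 1.0, 0.0, -3.0], rng=rng)
```

Because sampling is random, the seed passed to `generate_and_print` is what makes two runs comparable.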

## Load the model for inference

In [11]:
# Load pre-trained GPT-2 model and tokenizer
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Ensure tokenizer has a pad token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id

# Freeze model weights
for param in model.parameters():
    param.requires_grad = False

## Generate some tokens on a clear-text prompt

In [12]:
_ = generate_and_print(prompt="Programming is", model=model, tokenizer=tokenizer, seed=SEED)

Prompt: Programming is
Response: a skill you need to learn to master.

Learn to code

There are a lot of different ways to learn programming.

The



## Configure the layers that execute remotely

All linear layers in the GPT2 model can be detected by checking their Python type. Some of them
are implemented with the `Conv1D` class from `transformers` rather than `torch.nn.Linear`, but they are executed as
matrix multiplications. The Concrete ML Extensions backend provides functionality to execute such layers as
encrypted-clear matrix multiplications.
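
To see why a `Conv1D` layer can be handled like a linear layer, the sketch below compares the two conventions on a toy example in plain Python. It assumes the usual layouts: `torch.nn.Linear` stores its weight as `(out_features, in_features)` and computes `x @ W.T + b`, while the `transformers` `Conv1D` stores it as `(in_features, out_features)` and computes `x @ W + b`. The helper names are hypothetical and only serve the illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def linear_forward(x, weight_out_in, bias):
    """Linear convention: weight is (out_features, in_features), y = x @ W.T + b."""
    weight_t = [list(col) for col in zip(*weight_out_in)]
    return [[v + b for v, b in zip(row, bias)] for row in matmul(x, weight_t)]


def conv1d_forward(x, weight_in_out, bias):
    """Conv1D convention: weight is (in_features, out_features), y = x @ W + b."""
    return [[v + b for v, b in zip(row, bias)] for row in matmul(x, weight_in_out)]


# Two tokens with three features each, projected to two output features
x = [[1, 2, 3], [4, 5, 6]]
weight_in_out = [[1, 0], [0, 1], [1, 1]]
weight_out_in = [list(col) for col in zip(*weight_in_out)]  # same weights, transposed
bias = [10, 20]

conv_out = conv1d_forward(x, weight_in_out, bias)
lin_out = linear_forward(x, weight_out_in, bias)
print(conv_out == lin_out)
```

Both functions perform the same matrix multiplication; only the storage layout of the weight differs.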

In [13]:
remote_names = []
for name, module in model.named_modules():
    if isinstance(module, (torch.nn.Linear, Conv1D)):
        remote_names.append(name)
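
As a sanity check on the loop above, the expected module names can be reconstructed by hand. This sketch assumes the standard Hugging Face layout of the `gpt2` checkpoint (12 transformer blocks, each with four `Conv1D` projections, plus the `lm_head` linear layer); `expected_remote_names` is a hypothetical variable and is not used elsewhere in this notebook.

```python
# Assumption: standard Hugging Face GPT-2 small layout with 12 transformer blocks
N_BLOCKS = 12
# The four Conv1D projections inside each block
CONV1D_SUFFIXES = ["attn.c_attn", "attn.c_proj", "mlp.c_fc", "mlp.c_proj"]

expected_remote_names = [
    f"transformer.h.{i}.{suffix}" for i in range(N_BLOCKS) for suffix in CONV1D_SUFFIXES
]
# The output projection to the vocabulary is a torch.nn.Linear module
expected_remote_names.append("lm_head")

print(len(expected_remote_names))
```

Under these assumptions the list has 49 entries, which is the number of FHE layers reported by the compilation progress bar later in this notebook.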

In [14]:
# Create the HybridFHEModel with the specified remote modules
hybrid_model = HybridFHEModel(model, module_names=remote_names)

## Compile the model

This step determines the quantization parameters needed for inference. Dynamic quantization
is used, so the quantization parameters are computed separately for each token of the token sequence.
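
To make the idea of per-token quantization parameters concrete, here is a minimal pure-Python sketch of symmetric 8-bit quantization with one scale per token. It only illustrates the principle: the function names are hypothetical, and the scheme Concrete ML actually applies (for example its handling of zero points or rounding) may differ.

```python
def quantize_per_token(activations, n_bits=8):
    """Symmetric quantization with one scale per row, where each row is one token."""
    q_max = 2 ** (n_bits - 1) - 1  # 127 for 8 bits
    quantized, scales = [], []
    for row in activations:
        max_abs = max(abs(v) for v in row)
        # Guard against an all-zero row
        scale = max_abs / q_max if max_abs > 0 else 1.0
        quantized.append([round(v / scale) for v in row])
        scales.append(scale)
    return quantized, scales


def dequantize(quantized, scales):
    """Map the integers back to floats using each token's own scale."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


# Two "tokens" whose activations have very different ranges (made-up values)
activations = [
    [0.01, -0.02, 0.015],
    [5.0, -12.7, 8.3],
]
quantized, scales = quantize_per_token(activations)
restored = dequantize(quantized, scales)
```

With a single scale shared by the whole sequence, the small values of the first token would all round to zero; a per-token scale keeps the rounding error of each token proportional to that token's own range.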

In [15]:
BLOCK_SIZE = 32
# Prepare input data for calibration
input_tensor = torch.randint(0, tokenizer.vocab_size, (256, BLOCK_SIZE), dtype=torch.long)

# Calibrate and compile the model
hybrid_model.compile_model(input_tensor, n_bits=8, use_dynamic_quantization=True)

Compiling FHE layers:   0%|          | 0/49 [00:00<?, ?it/s]

## Generate a few encrypted tokens based on an encrypted prompt

In [16]:
# Set FHE mode to "execute" so that the remote layers run on encrypted data
hybrid_model.set_fhe_mode("execute")

_ = generate_and_print(
    prompt="Programming is", model=hybrid_model.model, tokenizer=tokenizer, seed=SEED
)

Prompt: Programming is
Response: one of those things that, the first time around, the development team, the full-time team, would, in theory, acknowledge a



## Conclusion

This notebook shows that, with a few lines of code, a GPT2 model can be converted
to execute on encrypted data.