# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [2]:
# launch the offline engine

import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Dr. Jeannette Gaudry Haynie, and I am a psychologist who has dedicated my professional life to helping individuals achieve greater self-awareness and personal growth.
As a licensed psychologist with over 20 years of experience, I have worked with individuals from all walks of life, including those struggling with anxiety, depression, trauma, relationship issues, and career dissatisfaction.
My approach is grounded in evidence-based practices, including cognitive-behavioral therapy (CBT), psychodynamic therapy, and mindfulness-based interventions. I also incorporate elements of existential and humanistic psychology, recognizing that each individual's unique experiences, values, and perspectives shape their
Prompt: The president of the United States is
Generated text:  not just the leader of the country, but also the symbolic head of the military. The president is responsible for the defense of the nation and is the commander-in-chief of the arme

### Streaming Synchronous Generation

In [12]:
def remove_overlap(existing_text, new_chunk):
    """
    Finds the largest suffix of 'existing_text' that is a prefix of 'new_chunk'
    and removes that overlap from the start of 'new_chunk'.
    """
    max_overlap = 0
    max_possible = min(len(existing_text), len(new_chunk))

    for i in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:i]):
            max_overlap = i
            break

    return new_chunk[max_overlap:]

def generate_text_no_repeats(llm, prompt, sampling_params):
    """
    Example function that:
    1) Streams the text,
    2) Removes chunk overlaps,
    3) Returns the merged text.
    """
    final_text = ""
    for chunk in llm.generate(prompt, sampling_params, stream=True):
        chunk_text = chunk["text"]

        cleaned_chunk = remove_overlap(final_text, chunk_text)

        final_text += cleaned_chunk

    return final_text


prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = generate_text_no_repeats(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaida. I'm a 17-year-old high school student. I live in a small town in the Pacific Northwest with my family. I enjoy hiking and reading in my free time. I'm a bit of a introvert, but I'm working on being more outgoing. That's me in a nutshell. What do you think? Is it a good self-introduction?
This is a good self-introduction because it:
* Introduces the character's name and age
* Provides some background information about the character's life
* Reveals some of the character's interests and personality traits
* Is concise and easy to read

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is located in the northern part of the country, near the Seine River. It is the largest city in France and is known for its rich history, art,
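
The overlap-removal helper used above can be exercised on its own, without loading a model. The following is a minimal sketch: `merge_chunks` and the two chunk lists are hypothetical names and data introduced for illustration, not real engine output. It shows that the same merge logic yields the same text whether each chunk repeats everything generated so far or carries only the new piece.

```python
def remove_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for i in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:i]):
            return new_chunk[i:]
    return new_chunk


def merge_chunks(chunks):
    """Merge a sequence of streamed text chunks into one string without repeats."""
    merged = ""
    for chunk in chunks:
        merged += remove_overlap(merged, chunk)
    return merged


# Hypothetical chunk sequences for illustration only.
cumulative = ["Paris", "Paris is", "Paris is the capital"]  # each chunk repeats earlier text
incremental = ["Paris", " is", " the capital"]              # each chunk holds only new text

print(merge_chunks(cumulative))
print(merge_chunks(incremental))
```

One limitation of this approach: an incremental chunk that legitimately begins with the characters the merged text already ends with is treated as overlap, so `merge_chunks(["ha", "ha"])` returns `"ha"` rather than `"haha"`.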

### Non-streaming Asynchronous Generation

In [None]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
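
The asynchronous pattern can also be tried without a GPU. The sketch below uses `FakeEngine`, a hypothetical stand-in that is not part of SGLang, to show how separate requests might be awaited together with `asyncio.gather`. Whether the real engine benefits from issuing one call per prompt rather than a single batched call is an assumption this sketch does not test.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for the engine; it is not part of SGLang."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(engine, prompts, sampling_params):
    """Issue one request per prompt and wait for all of them together."""
    pending = [engine.async_generate(p, sampling_params) for p in prompts]
    # asyncio.gather returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*pending)


prompts = ["Hello, my name is", "The capital of France is"]
results = asyncio.run(
    generate_concurrently(FakeEngine(), prompts, {"temperature": 0.8})
)
for prompt, result in zip(prompts, results):
    print(prompt, "->", result["text"])
```

Because `asyncio.gather` preserves input order, the results can be zipped back onto the prompts exactly as in the batch example above.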

### Streaming Asynchronous Generation

In [None]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())
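
To experiment with the `async for` consumption pattern in isolation, the following sketch replaces the engine with `fake_stream`, a hypothetical async generator that is not part of SGLang. It assumes each chunk carries only newly generated text. If the chunks were cumulative instead, the last chunk alone would hold the full output.

```python
import asyncio


async def fake_stream(prompt):
    """Hypothetical stand-in for a streaming response; yields incremental chunks."""
    for piece in [" Paris", ".", " It", " is", " large", "."]:
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield {"text": piece}


async def collect_stream(stream):
    """Consume an async stream of {"text": ...} chunks and return the joined text."""
    parts = []
    async for chunk in stream:
        parts.append(chunk["text"])
    return "".join(parts)


text = asyncio.run(collect_stream(fake_stream("The capital of France is")))
print(text)
```

Collecting the chunks into a list and joining once at the end avoids repeated string concatenation when a response is long.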

In [None]:
llm.shutdown()