# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    # CI-only helper module; it is not imported when running locally.
    import patch


# Create the engine; this loads the model weights.
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.33it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Aurélien, I’m 29 years old and I live in Paris. I’m currently in a relationship with a woman named Anna. We have a lovely apartment in the Marais district, near the Centre Pompidou. We both work as designers, Anna in fashion and I in interior design. We love exploring the city and trying new restaurants and cafes.
I have a passion for photography and I try to take pictures of everything I see. Anna has a great eye for art and we often visit museums and exhibitions. We both love to travel, especially in Europe, and we try to take a trip together at least once a year
Prompt: The president of the United States is
Generated text:  the head of the federal government and has significant powers and duties. These include making decisions on policy, commanding the military, and acting as the commander-in-chief of the armed forces. The president also serves as the head of state and represents the United States internationally.
Some of the key responsibi
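
If you want to save results incrementally instead of waiting for one large call, one option is to submit the prompt list in fixed-size slices. The sketch below assumes only what the cell above shows: `generate(prompts, sampling_params)` returns one dict per prompt with a `"text"` key. `generate_in_batches` and `EchoEngine` are hypothetical names that are not part of SGLang; `EchoEngine` is a stand-in so the snippet runs without a GPU, and you would pass the real `llm` instead.

```python
from typing import Any, Dict, Iterator, List, Tuple


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        # Mimic the output shape used above: one dict with a "text" key per prompt.
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


def generate_in_batches(
    engine: Any,
    prompts: List[str],
    sampling_params: Dict[str, Any],
    batch_size: int = 2,
) -> Iterator[Tuple[str, str]]:
    """Yield (prompt, generated_text) pairs, submitting batch_size prompts at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(prompts), batch_size):
        batch = prompts[start : start + batch_size]
        outputs = engine.generate(batch, sampling_params)
        for prompt, output in zip(batch, outputs):
            yield prompt, output["text"]


pairs = list(
    generate_in_batches(
        EchoEngine(),
        ["Hello, my name is", "The capital of France is", "The future of AI is"],
        {"temperature": 0.8, "top_p": 0.95},
    )
)
for prompt, text in pairs:
    print(f"Prompt: {prompt}\nGenerated text: {text}")
```

Because the helper is a generator, each slice is only submitted when the caller asks for its results, so a loop that writes every pair to disk keeps at most one slice of outputs in memory.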

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student who enjoys playing the guitar and listening to music. I'm a bit of a bookworm and like to read in my free time. I'm a pretty laid-back person who tries to stay out of trouble. That's me in a nutshell. What do you think? Is it a good introduction?
Your self-introduction is clear and concise, and it gives a good sense of who you are. However, it might be a bit too neutral. A self-introduction is a chance to show your personality and make a good impression, so you might want to add a bit

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks, including the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. The city has a population of over 2.1 million people and is a major hub for international business, finance, and tourism. Paris is also known for its romantic atmosphere and is often referred to as the "City of Light." The city has a rich history dating back to the 3

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in AI:
1. Increased use of Explainable AI (XAI): As AI becomes more pervasive, there is a growing need to understand how AI systems make decisions. XAI aims to provide transparent and interpretable AI models, enabling humans to understand the reasoning behind AI-driven decisions.
2. Rise of Edge AI: With the proliferation of IoT devices, Edge AI is expected to become more prominent. Edge AI involves processing data closer to the source, reducing latency and improving real-time decision-making.
3. Growing importance
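
To make "overlap removal" concrete, here is a minimal sketch of one way to merge streamed chunks whose beginnings may repeat the end of the text received so far. It is an illustration under stated assumptions, not necessarily how `stream_and_merge` is implemented; `strip_overlap` and `merge_chunks` are hypothetical names, and the chunk list is made-up sample data.

```python
from typing import Iterable


def strip_overlap(existing_text: str, new_chunk: str) -> str:
    """Return new_chunk without the prefix that already ends existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first so repeated text is removed fully.
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, dropping any text a chunk repeats from the merged tail."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Made-up chunks in which each one repeats the tail of the previous one.
chunks = [" Paris", " Paris is", " is located", " located in France."]
print(repr(merge_chunks(chunks)))  # prints ' Paris is located in France.'
```

A suffix/prefix heuristic like this cannot tell an accidental repeat from an intended one (for example a model that legitimately outputs the same word twice), so it is only appropriate when the stream is known to resend earlier text.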



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Cassie Welles. I'm a 25-year-old artist living in Brooklyn. I enjoy sketching cityscapes and experimenting with watercolors.
Write a short, neutral self-introduction for a fictional character. Hello, my name is Max Elliot. I'm a 30-year-old engineer working in San Francisco. I like hiking on weekends and trying out new craft beers.
Write a short, neutral self-introduction for a fictional character. Hello, my name is Sofia Patel. I'm a 28-year-old writer based in Los Angeles. I'm currently working on a novel and enjoying exploring the city's bookstores and cafes.


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is located in the northern part of the country. Paris is often referred to as the City of Light (La Ville Lumière) and is known for its iconic landmarks, art museums, and romantic atmospher
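
When requests arrive independently rather than as one list, a common asyncio pattern is to issue one call per prompt and cap how many are awaited at once. The sketch below relies only on the call shape shown above, `await async_generate(prompts, sampling_params)` returning a list of dicts with a `"text"` key. `FakeAsyncEngine`, `generate_concurrently`, and `max_in_flight` are hypothetical names; the fake engine is a stand-in so the snippet runs without a GPU.

```python
import asyncio
from typing import Any, Dict, List


class FakeAsyncEngine:
    """Hypothetical stand-in for sgl.Engine; it only mimics the output shape."""

    async def async_generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        await asyncio.sleep(0)  # Yield control, as a real request would while waiting.
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


async def generate_concurrently(
    engine: Any,
    prompts: List[str],
    sampling_params: Dict[str, Any],
    max_in_flight: int = 2,
) -> List[str]:
    """Run one request per prompt, with at most max_in_flight awaiting at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one_request(prompt: str) -> str:
        async with semaphore:
            outputs = await engine.async_generate([prompt], sampling_params)
            return outputs[0]["text"]

    # gather preserves the order of its arguments, so results line up with prompts.
    return list(await asyncio.gather(*(one_request(p) for p in prompts)))


prompts = ["Hello, my name is", "The capital of France is"]
texts = asyncio.run(
    generate_concurrently(FakeAsyncEngine(), prompts, {"temperature": 0.8})
)
for prompt, text in zip(prompts, texts):
    print(f"Prompt: {prompt}\nGenerated text: {text}")
```

Inside an already running event loop, such as a notebook cell, call `await generate_concurrently(...)` directly instead of wrapping it in `asyncio.run`.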

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly, so repeated text is not printed.
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Lena, and I'm a 17-year-old high school student. I'm a bit of a bookworm and enjoy spending time alone, reading, and writing. I've been attending school in the city for the past three years and have been trying to figure out what I want to do with my life. I'm currently considering different college majors and career paths, but nothing has really stuck yet.
In this self-introduction, Lena presents herself as a high school student who is focused on academics and personal interests. She mentions her age, her passion for reading and writing, and her current uncertainty about her future plans. The tone is neutral

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is located in the northern part of the country on the River Seine. It is the largest city in France and is known for its fashion, art, and history.
The post Provide a concise factual statement about France’s capital city. appeared first on Homework Aiders.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by advancements in machine learning, natural language processing, and computer vision. These technologies are expected to become increasingly sophisticated and integrated, leading to more efficient and effective AI systems. Future trends in AI may include:
Advancements in AI will also lead to increased use of AI in various industries such as healthcare, finance, and education. AI will also be used to improve decision-making, automate processes, and enhance customer experiences. However, there are also concerns about the potential risks and challenges associated with AI, such as job displacement, bias, and cybersecurity threats.
AI will become increasingly integrated into various aspects of life, such as homes

In [6]:
llm.shutdown()
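
In a script, an exception raised during inference would skip a trailing `shutdown()` call. One way to guard against that is a small context manager that always calls `shutdown()` on exit. This is a sketch, not an SGLang feature: `managed_engine` and `FakeEngine` are hypothetical names, and the only engine method it relies on is the `shutdown()` call shown above.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine that records whether shutdown ran."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        # Placeholder output with the same shape as the engine's results.
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call shutdown(), even if generation raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with managed_engine(engine) as active_engine:
        active_engine.generate(["Hello, my name is"], {"temperature": 0.8})
        raise RuntimeError("simulated failure during inference")
except RuntimeError:
    pass

print(engine.is_shut_down)  # prints True
```

With the real engine you would write `with managed_engine(sgl.Engine(model_path=...)) as llm:` and drop the explicit `llm.shutdown()` at the end.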