# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

**To launch the offline engine in a Python script, guard the entry point with `if __name__ == "__main__":`, because SGLang uses the `spawn` start method to create subprocesses. See this [simple example](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py) for details.**
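
As a rough illustration of the note above, a standalone script might be laid out as follows. This is a minimal sketch rather than the linked example: the helper name `run_batch` and the graceful fallback when `sglang` is not installed are assumptions made for illustration, while the `sgl.Engine`, `generate`, and `shutdown` calls mirror the cells later in this document.

```python
def run_batch(engine, prompts, sampling_params):
    """Generate completions for a list of prompts and return only the text."""
    outputs = engine.generate(prompts, sampling_params)
    return [output["text"] for output in outputs]


def main():
    # Imported lazily so this sketch exits cleanly when sglang is absent
    # (an illustrative choice, not a requirement of the library).
    try:
        import sglang as sgl
    except ImportError:
        print("sglang is not installed; nothing to run.")
        return

    llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
    try:
        prompts = ["The capital of France is"]
        sampling_params = {"temperature": 0.8, "top_p": 0.95}
        for text in run_batch(llm, prompts, sampling_params):
            print(text)
    finally:
        # Always release the engine's subprocesses, even on errors.
        llm.shutdown()


# With the `spawn` start method, child processes re-import this module;
# the guard keeps them from trying to create another engine.
if __name__ == "__main__":
    main()
```

Keeping engine creation inside `main()` means that importing the module has no side effects, which is the property the `spawn` start method relies on.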

You can also build a custom server on top of the SGLang offline engine. A detailed example in a Python script is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio
from sglang.test.test_utils import is_in_ci

if is_in_ci():
    import patch

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.31it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Danny and I am a USMC veteran. I've been retired from the Corps for about 4 years now, but I still try to stay connected with my brothers and sisters who are still serving. I've also been involved in various veterans' organizations and charities, as a way to give back to the community that has given me so much.
But I have to say, it's not always easy. There are times when I feel like I'm just going through the motions, and that I'm not really making a difference. That's why I'm excited to be a part of this community, and to have the opportunity to connect with
Prompt: The president of the United States is
Generated text:  often a symbol of American values and power. As such, there is a high level of scrutiny and attention given to every aspect of their life, including their mental health. Former President Donald Trump has faced numerous controversies related to his mental fitness, with some questioning whether he is mentally fit to be presiden

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor. I live in a small apartment in the city with my cat, Luna. I enjoy reading, hiking, and trying out new restaurants. I'm currently working on a novel and trying to learn more about the world of publishing. That's me in a nutshell.
This is a good start, but it's a bit too casual for a formal introduction. Here's a revised version: Hi, I'm Kaida. I'm a freelance writer and editor with a passion for storytelling. I reside in a small city apartment with my feline companion, Luna. In my

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about France’s capital city.
The capital of France is Paris. 
This statement is a concise factual statement about France’s capital city, providing a clear and direct answer to the question. It does not include any additional information or opinions, making it a suitable response for a factual question. 
Note: This response is a single sentence, which is a common format for concise factual statements. It is also free of grammatical errors and is written in a clear and concise manner.  The response does not include any unnecessary words or phrases, making it a good example of a concise factual statement. 
Let me

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Some experts predict that AI will become increasingly integrated into our daily lives, while others warn of the potential risks and challenges associated with its development. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, with the potential to revolutionize the way we diagnose and treat diseases.
2. Widespread adoption of AI in education: AI has the potential to transform the way we learn
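
The phrase "overlap removal" above refers to dropping text that a newly streamed chunk repeats from what has already been received. The following is a minimal pure-Python sketch of that idea, not the implementation of `stream_and_merge`; the function names `trim_overlap` and `merge_chunks` are illustrative assumptions.

```python
def trim_overlap(existing_text, new_chunk):
    """Return new_chunk without the prefix that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, skipping text repeated at the seams."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


print(merge_chunks(["Hello, my", " my name", " name is"]))  # Hello, my name is
```

A suffix/prefix heuristic like this can also drop genuinely repeated text (for example, `"ha"` followed by `"ha"` merges to `"ha"`), so it is best suited to streams that are known to resend overlapping text.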



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Elianore Quasar. I'm a 22-year-old astrophysicist currently residing in New York City. I have a degree in physics from Columbia University and am pursuing a Ph.D. in cosmology at CUNY Graduate Center. My research focuses on the analysis of gravitational waves generated by supermassive black holes. Outside of academics, I enjoy hiking and playing the piano. What makes this introduction neutral?
This introduction is neutral because it:
1.  **Avoids emotive language**: The language used is straightforward and free of emotional appeals, such as enthusiastic tone, dramatic descriptions, or overly personal details. This

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is located at the heart of the Île-de-France region in the northern part of France, on the river Seine. With a population of approximatel
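
One reason to prefer the asynchronous API is that generation can be awaited while other coroutines keep running on the same event loop. The sketch below illustrates this with a stand-in engine so it runs without a GPU; `FakeEngine` and `generate_with_heartbeat` are hypothetical names, and only the `await engine.async_generate(prompts, sampling_params)` call shape is taken from the cell above.

```python
import asyncio


class FakeEngine:
    """Stand-in exposing the same async_generate call shape as the cell above."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.05)  # simulate generation latency
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def generate_with_heartbeat(engine, prompts, sampling_params):
    """Await a batch while a second coroutine keeps running on the same loop."""
    ticks = 0

    async def heartbeat():
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    beat = asyncio.create_task(heartbeat())
    try:
        outputs = await engine.async_generate(prompts, sampling_params)
    finally:
        # Stop the background task and wait for the cancellation to finish.
        beat.cancel()
        await asyncio.gather(beat, return_exceptions=True)
    return outputs, ticks


outputs, ticks = asyncio.run(
    generate_with_heartbeat(
        FakeEngine(), ["The capital of France is"], {"temperature": 0.8}
    )
)
print(outputs[0]["text"], "| heartbeat ticks:", ticks)
```

Swapping `FakeEngine()` for a real engine instance should leave the rest of the pattern unchanged, assuming the engine's `async_generate` behaves as in the cell above.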

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream chunks through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Elianore Quasar. I'm an intergalactic explorer and cartographer. I've traveled to countless star systems, mapping the cosmos and documenting my findings. I'm currently based on the planet Xylophia-IV, where I'm working on a comprehensive guide to the local astronomical phenomena. I'm interested in learning more about the world and its inhabitants. What do you know about Xylophia-IV and its people?
This introduction does not provide any personal details or opinions about the character's past or interests, focusing instead on their professional background and current activities. It also invites the conversation partner to share their knowledge about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, with a population of over 2.1 million people. It is a major tourist destination known for its iconic landmarks, art museums, and fashion.
Paris, the capital of France, is the most visited city in the world, attracting over 23 million tourists each year. The city is famous for its stunning architecture, including the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. Visitors also come to experience the city's vibrant culture, fashion, and cuisine.
Paris is home to many world-renowned educational institutions, including the Sorbonne University and the École des Hautes Études en Sciences

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be shaped by various technological advancements, societal changes, and ethical considerations. Some possible trends in AI include:
Increased emphasis on explainability and transparency: As AI becomes more pervasive, there will be a growing need for AI systems to provide clear explanations for their decisions and actions.
Advancements in natural language processing: AI will continue to improve its ability to understand and generate human language, enabling more sophisticated interactions between humans and machines.
Rise of edge AI: As devices become increasingly connected and IoT-enabled, AI will shift from centralized cloud-based processing to edge computing, allowing for faster, more localized processing and analysis of data.
Growing importance of


In [6]:
llm.shutdown()
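
The asynchronous streaming loop in this section prints chunks as they arrive. When the full text is also needed afterwards, the chunks can be collected while they are displayed. The following is a minimal sketch of that pattern; `fake_stream` is a hypothetical stand-in for any async generator of text chunks, and `collect_stream` is an illustrative helper, not part of SGLang.

```python
import asyncio


async def fake_stream(chunks, delay=0.0):
    """Hypothetical stand-in for an async generator of text chunks."""
    for chunk in chunks:
        await asyncio.sleep(delay)
        yield chunk


async def collect_stream(stream, on_chunk=None):
    """Consume an async iterator of chunks and return the joined text."""
    parts = []
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk)  # e.g. print each chunk as it arrives
        parts.append(chunk)
    return "".join(parts)


text = asyncio.run(
    collect_stream(
        fake_stream([" Paris", ",", " with", " a", " population"]),
        on_chunk=lambda c: print(c, end="", flush=True),
    )
)
print()  # newline after the streamed output
```

Collecting into a list and joining once at the end avoids repeated string concatenation for long generations.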

### Return Hidden States

In [7]:
llm = sgl.Engine(
    model_path="meta-llama/Meta-Llama-3.1-8B-Instruct", return_hidden_states=True
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.41it/s]



In [8]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 10}

outputs = llm.generate(prompts, sampling_params=sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    # Collect the shape of each hidden-state entry for display
    hidden_shapes = [i.shape for i in output["meta_info"]["hidden_states"]]
    print(
        f"Prompt: {prompt}\n"
        f"Generated text: {output['text']}\n"
        f"Prompt_Tokens: {output['meta_info']['prompt_tokens']}\t"
        f"Completion_tokens: {output['meta_info']['completion_tokens']}\n"
        f"Hidden states: {hidden_shapes}"
    )
    print()

Prompt: Hello, my name is
Generated text:  Ms. Welsch and I am excited to
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The president of the United States is
Generated text:  not a monarch, nor is the president’s role
Prompt_Tokens: 8	Completion_tokens: 10
Hidden states: [torch.Size([8, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The capital of France is
Generated text:  Paris, and it is home to some of the
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]
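
Judging from the shapes printed above, the first entry appears to hold one hidden vector per prompt token, and each later entry a single vector for one decoding step. Under that assumption (inferred from this output, not from a documented guarantee), the entries can be flattened into one row per position. The sketch uses plain Python lists and a hypothetical helper `flatten_hidden_states` so that it runs without `torch`.

```python
def flatten_hidden_states(hidden_states):
    """Flatten [prefill_matrix, step_vector, ...] into a list of row vectors."""
    rows = []
    for entry in hidden_states:
        if entry and isinstance(entry[0], (list, tuple)):
            # 2-D entry: one row per prompt token
            rows.extend(list(vec) for vec in entry)
        else:
            # 1-D entry: a single decoding step
            rows.append(list(entry))
    return rows


# Toy data: 3 prompt tokens and 2 decode steps with hidden size 4
hidden_size = 4
prefill = [[0.0] * hidden_size for _ in range(3)]
decode_steps = [[1.0] * hidden_size for _ in range(2)]

rows = flatten_hidden_states([prefill] + decode_steps)
print(len(rows), len(rows[0]))  # 5 4
```

With real tensors, a rough equivalent would be to reshape each entry to two dimensions and concatenate them along the first axis, for example with `torch.cat`.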

In [9]:
llm.shutdown()