# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

**To launch the offline engine in a Python script, place the launch code under an `if __name__ == "__main__":` guard, because SGLang uses the `spawn` start method to create subprocesses. See this [simple example](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py) for details.**
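
A minimal sketch of such a script layout is shown here. The helper names `run` and `main` and the command-line handling are illustrative assumptions, not part of SGLang; only `sgl.Engine`, `generate`, and `shutdown` are taken from this document.

```python
import sys


def run(model_path, prompts):
    # Lazy import keeps this sketch importable without SGLang installed.
    import sglang as sgl

    llm = sgl.Engine(model_path=model_path)
    try:
        outputs = llm.generate(prompts, {"temperature": 0.8, "top_p": 0.95})
        return [output["text"] for output in outputs]
    finally:
        # Shut the engine down even if generation fails.
        llm.shutdown()


def main(argv):
    # Hypothetical CLI: the model path is the only argument.
    if len(argv) != 2:
        print(f"usage: {argv[0]} MODEL_PATH")
        return 1
    for text in run(argv[1], ["The capital of France is"]):
        print(text)
    return 0


# Subprocesses started with `spawn` import this module again; the guard
# keeps them from re-running the launch code below.
if __name__ == "__main__":
    main(sys.argv)
```

Keeping engine creation inside a function that is only reached from the guarded block means importing the module never starts an engine as a side effect.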

You can also build a custom server on top of the SGLang offline engine. A complete example as a Python script is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
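
Independent of the web framework, the core of such a server is a handler that maps a request payload to an engine call. The sketch below is not the linked example's code: the function name `handle_generate_request` and the JSON fields `prompt` and `sampling_params` are hypothetical, and it assumes that `generate` returns a dict with a `"text"` key when given a single prompt.

```python
import json


def handle_generate_request(engine, body):
    """Turn a JSON request body into a (status, JSON response) pair.

    `engine` is any object with a `generate(prompt, sampling_params)` method.
    The request and response field names are assumptions for illustration.
    """
    try:
        payload = json.loads(body)
        prompt = payload["prompt"]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, a non-object payload, or a missing "prompt" field.
        return 400, json.dumps({"error": "expected JSON with a 'prompt' field"})

    sampling_params = payload.get("sampling_params", {})
    output = engine.generate(prompt, sampling_params)
    return 200, json.dumps({"text": output["text"]})
```

A framework route would pass the raw request body to this function and write the returned status and body to the response.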

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

# Only runs in SGLang's CI environment; omit it in your own scripts.
if is_in_ci():
    import patch

# Launch the offline engine.
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Richard and I've been a dedicated Oracle DBA for many years. I've been tasked with implementing and configuring Oracle 12c for an upcoming project. While I've worked with various versions of Oracle, including 11g, I must admit that I'm not too familiar with the latest features and best practices of Oracle 12c.

I've been studying the Oracle documentation and various online resources, but I'm struggling to get started with configuring the database. Specifically, I'm having trouble understanding how to configure the Automatic Storage Management (ASM) component. I've heard it's a significant improvement over previous versions of Oracle, but I'm
Prompt: The president of the United States is
Generated text:  not only the chief executive of the government, but he is also the commander-in-chief of the armed forces. In this role, he is responsible for making decisions that affect the lives of millions of people, both in the military and civilians.
The
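
For batch jobs it is often convenient to store each prompt together with its generated text. The helper below is a minimal sketch: `to_jsonl` is a hypothetical name, and it assumes, as the `zip` loop above does, that outputs are returned in the same order as the prompts and that each output is a dict with a `"text"` key.

```python
import json


def to_jsonl(prompts, outputs):
    """Serialize prompt/output pairs as JSON Lines, one record per prompt."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "text": output["text"]}
        # ensure_ascii=False keeps non-ASCII characters readable in the file.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)
```

With the cell above, `to_jsonl(prompts, outputs)` would produce one line per prompt that can be written to a file.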

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and artist living in a small town in the Pacific Northwest. I enjoy hiking, reading, and trying out new recipes in my free time. I'm a bit of a introvert, but I'm always up for a good conversation. I'm currently working on a novel and a few art projects, and I'm excited to see where my creative endeavors take me. I'm looking forward to meeting new people and making connections in my community. That's me in a nutshell! How would you describe Kaida? What are some potential personality traits and characteristics that you can infer from this

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about the country of France. France is a country located in Western Europe.
Provide a concise factual statement about the country of France. France is the third most populous country in the European Union.
Provide a concise factual statement about the country of France. France is the most visited country in the world.
Provide a concise factual statement about the country of France. France is a founding member of the United Nations.
Provide a concise factual statement about the country of France. France is a member of the European Union.
Provide a concise factual statement about the country of France. France is a member of the G7 and G

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems may be able to analyze medical images, identify patterns in patient data, and provide personalized treatment recommendations.
2. Rise of explainable AI: As AI becomes more pervasive, there is a growing need for transparency and explainability in AI decision-making. Explainable AI (XAI) aims to provide insights into how AI models make decisions, which can
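
The "overlap removal" mentioned in the cell above can be pictured as follows: before a new chunk is appended, any prefix of it that the accumulated text already ends with is dropped. This is a simplified sketch of that idea, using the illustrative names `trim_overlap` and `merge_stream`; the actual `stream_and_merge` in `sglang.utils` may differ.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of `new_chunk` that `existing_text` ends with."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks):
    """Merge an iterable of text chunks into one string, skipping overlaps."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged
```

Because the trimming is purely textual, a chunk that legitimately repeats the end of the previous text (for example `"ha"` followed by `"ha"`) is also collapsed. The approach therefore fits streams whose chunks may re-send earlier text, not streams that only ever deliver new tokens.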



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Amaranth Vex. I'm a 25-year-old astrobiologist from New Eden, a colony planet on the outer rim of the galaxy. I'm currently on a research mission to survey the exoplanet, Xylophia-IV, for signs of life. I've always been fascinated by the mysteries of the cosmos and the possibility of discovering life beyond our own world. That's me in a nutshell. [Note: the tone should be neutral, without any emotional tone or personal biases.]
Amaranth Vex is a 25-year-old astrobiologist from New Eden, a colony planet on the outer

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
France, officially the French Republic, is a country located in Western Europe. The country is bordered by several countries including Belgium, Luxembourg, Germany, Switzerland, Italy, Spain, and Andorra.
The climate in France is temperate with
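
Since `async_generate` is awaitable, several batches can be submitted from the same event loop, for example when groups of prompts need different sampling parameters. The sketch below uses a hypothetical helper, `generate_groups`, and only assumes an `async_generate(prompts, sampling_params)` coroutine like the one called above. How the engine interleaves the batches is left to its scheduler.

```python
import asyncio


async def generate_groups(engine, groups):
    """Submit several (prompts, sampling_params) batches concurrently.

    Returns one list of outputs per group, in the order of `groups`.
    """
    calls = [
        engine.async_generate(prompts, sampling_params)
        for prompts, sampling_params in groups
    ]
    # gather() preserves the order of the awaitables it is given.
    return await asyncio.gather(*calls)
```

A call such as `await generate_groups(llm, [(factual_prompts, {"temperature": 0.2}), (creative_prompts, {"temperature": 0.8})])` would return two output lists; `factual_prompts` and `creative_prompts` are placeholder names.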

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaida Katsuragi. I'm a 17-year-old student at the prestigious Kuroba Academy, where I'm studying to become a skilled airship pilot. When I'm not in the skies, I enjoy tinkering with machines and exploring the city's hidden corners.
Kaida is a student at Kuroba Academy, where she's studying to become an airship pilot. She's 17 years old and enjoys tinkering with machines and exploring the city's hidden corners. This introduction provides a neutral view of Kaida, introducing her background and interests without giving away too much about her personality or motivations. It sets the

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. The city is known for its historical landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also a center of fashion and culture, and it is a popular tourist destination.
The city has a population of over 2.1 million people, but the metropolitan area has a population of over 12.2 million people. Paris is located in the northern part of France, along the Seine River.
Some of the city’s famous neighborhoods include the Latin Quarter, Montmartre, and Le Marais. The city has a rich history, dating back to the 3rd

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  difficult to predict with certainty, but many experts believe that AI will have a significant impact on various aspects of society in the coming years. Here are some possible future trends in AI:
1. Increased adoption of AI in various industries: AI is already being used in various industries such as healthcare, finance, transportation, and education. In the future, AI is likely to become more prevalent in these industries, and it may also be adopted in new industries such as agriculture, manufacturing, and cybersecurity.
2. Advancements in natural language processing: Natural language processing (NLP) is a subfield of AI that deals with the interaction between computers and

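When streamed text should be displayed as it arrives and also kept as a whole, the chunks can be accumulated while they are consumed. The helper below is a minimal sketch with a hypothetical name, `collect_stream`. It works with any async iterator of text chunks, such as the one iterated with `async for` in the cell above.

```python
import asyncio


async def collect_stream(chunks, on_chunk=None):
    """Consume an async iterator of text chunks and return the joined text."""
    parts = []
    async for chunk in chunks:
        if on_chunk is not None:
            # For example: lambda c: print(c, end="", flush=True)
            on_chunk(chunk)
        parts.append(chunk)
    return "".join(parts)
```

Inside a coroutine, `text = await collect_stream(async_stream_and_merge(llm, prompt, sampling_params))` would return the full generation for one prompt.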
In [6]:
llm.shutdown()

### Return Hidden States

In [7]:
llm = sgl.Engine(
    model_path="meta-llama/Meta-Llama-3.1-8B-Instruct", return_hidden_states=True
)

In [8]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 10}

outputs = llm.generate(prompts, sampling_params=sampling_params)
for prompt, output in zip(prompts, outputs):
    meta = output["meta_info"]
    print("===============================")
    print(
        f"Prompt: {prompt}\n"
        f"Generated text: {output['text']}\n"
        f"Prompt_Tokens: {meta['prompt_tokens']}\t"
        f"Completion_tokens: {meta['completion_tokens']}\n"
        f"Hidden states: {[i.shape for i in meta['hidden_states']]}"
    )
    print()

Prompt: Hello, my name is
Generated text:  Mayra Torres, I'm a writer and a
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The president of the United States is
Generated text:  the head of state and head of government of the
Prompt_Tokens: 8	Completion_tokens: 10
Hidden states: [torch.Size([8, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The capital of France is
Generated text:  home to the world’s most visited museum, the
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), to
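
In the output above, the first `hidden_states` entry has one row per prompt token and each later entry is a single vector. The sketch below counts the vectors in such a list. `count_hidden_vectors` is a hypothetical helper that only relies on each entry having a `.shape`, so it is an illustration of the layout seen in this output, not a statement about SGLang internals.

```python
def count_hidden_vectors(hidden_states):
    """Return (num_vectors, hidden_size) for a list of hidden-state entries.

    A 2-D entry of shape (n, hidden) counts as n vectors; a 1-D entry of
    shape (hidden,) counts as one vector.
    """
    num_vectors = 0
    hidden_size = None
    for state in hidden_states:
        shape = tuple(state.shape)
        if len(shape) not in (1, 2):
            raise ValueError(f"unexpected hidden-state shape: {shape}")
        if hidden_size is None:
            hidden_size = shape[-1]
        elif shape[-1] != hidden_size:
            raise ValueError("inconsistent hidden size across entries")
        num_vectors += shape[0] if len(shape) == 2 else 1
    return num_vectors, hidden_size
```

For the first prompt above (6 prompt tokens, 10 completion tokens) this gives 6 + 9 = 15 vectors of size 4096. If the entries are PyTorch tensors and there is more than one entry, `torch.cat([hs[0], torch.stack(hs[1:])])` would combine them into a single `[15, 4096]` tensor.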

In [9]:
llm.shutdown()