# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

**To launch the offline engine from a Python script, guard the entry point with `if __name__ == "__main__":`, because the engine uses `spawn` mode to create subprocesses. See this [simple example](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py) for details.**
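
The guard matters because `spawn` starts each subprocess by re-importing the launching module, so engine construction at module top level would run again in every child. A minimal script layout might look like the sketch below. It is an illustration, not the linked example: the file name `launch_sketch.py`, the helpers `generate_texts` and `main`, and passing the model path on the command line are all assumptions. Only `sgl.Engine`, `generate`, and `shutdown` are taken from this document.

```python
import sys


def generate_texts(engine, prompts, sampling_params):
    """Run one batched call and pair each prompt with its generated text."""
    outputs = engine.generate(prompts, sampling_params)
    return [(prompt, output["text"]) for prompt, output in zip(prompts, outputs)]


def main(model_path):
    # Imported lazily so that merely importing this module never touches the engine.
    import sglang as sgl

    llm = sgl.Engine(model_path=model_path)
    try:
        prompts = ["Hello, my name is", "The capital of France is"]
        sampling_params = {"temperature": 0.8, "top_p": 0.95}
        for prompt, text in generate_texts(llm, prompts, sampling_params):
            print(f"Prompt: {prompt}\nGenerated text: {text}")
    finally:
        # Release the engine's subprocesses even if generation fails.
        llm.shutdown()


# Subprocesses created with `spawn` re-import this file; the guard keeps them
# from launching an engine of their own.
if __name__ == "__main__":
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        print("usage: python launch_sketch.py <model_path>")
```

Keeping the generation logic in a plain function that receives the engine as an argument also makes it possible to exercise that logic with a stand-in object, without loading a model.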

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example that runs as a Python script is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
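
To give a rough idea of what such a server involves, the sketch below wraps an engine's `generate` call in a small JSON-over-HTTP handler. It is not the linked `custom_server` example: it uses Python's standard-library `http.server` for brevity, and the request shape (`prompt`, `sampling_params`), the response shape, and the names `handle_generate`, `make_handler`, and `serve` are assumptions made for illustration. The only engine behavior it relies on is the one shown in this document: a list of prompts goes in, and a list of dicts with a `text` field comes out.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle_generate(engine, raw_body):
    """Turn a JSON request body into (status_code, JSON response bytes)."""
    try:
        request = json.loads(raw_body)
        prompt = request["prompt"]
    except (ValueError, KeyError, TypeError):
        error = {"error": "expected a JSON object with a 'prompt' field"}
        return 400, json.dumps(error).encode()
    sampling_params = request.get("sampling_params", {})
    # Send a one-element batch and unwrap the single result.
    outputs = engine.generate([prompt], sampling_params)
    return 200, json.dumps({"text": outputs[0]["text"]}).encode()


def make_handler(engine):
    """Build a request handler class bound to one engine instance."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            status, body = handle_generate(engine, self.rfile.read(length))
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return GenerateHandler


def serve(engine, host="127.0.0.1", port=8000):
    """Block and answer POST requests one at a time until interrupted."""
    HTTPServer((host, port), make_handler(engine)).serve_forever()
```

In a real script, `serve(...)` would be called with an `sgl.Engine` instance from inside the `if __name__ == "__main__":` guard. `HTTPServer` handles one request at a time, so this layout is only suitable for experimentation.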

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio
from sglang.test.test_utils import is_in_ci

if is_in_ci():
    import patch

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Amelia, and I am a MSc by Research student in the School of Earth and Environmental Sciences. My research focuses on the application of statistical and machine learning techniques to the analysis of archaeological data.
The title of my research is "A computational approach to understanding the spatiotemporal patterns of past human migration and cultural exchange." This involves exploring how machine learning algorithms can be used to identify patterns in archaeological data, such as radiocarbon dates, that can provide insights into the movement of people and the exchange of ideas and material culture between ancient societies.
My research interests are in the intersection of archaeology, computer science, and statistics,
Prompt: The president of the United States is
Generated text:  planning to meet with Vladimir Putin to discuss several topics, including cybersecurity, space exploration and nuclear arms control. On Wednesday, President Donald

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in Tokyo. I enjoy reading, trying new foods, and practicing yoga. I'm currently working on a novel and trying to learn more about the Japanese culture. I'm a bit of a introvert, but I'm always up for a good conversation. I'm looking forward to meeting new people and making connections.
This is a good example of a neutral self-introduction because it doesn't reveal too much about the character's personality, background, or motivations. It simply provides a brief overview of who they are and what they're interested in. This can be helpful for a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
The Eiffel Tower is a famous landmark in Paris. It was built for the 1889 World’s Fair and was originally intended to be a temporary structure. The Eiffel Tower is 324 meters tall and is made of iron. It is one of the most recognizable landmarks in the world and is visited by millions of people each year.
The Louvre Museum is another famous landmark in Paris. It was originally a royal palace, but it was converted into a museum in the 18th century. The Louvre is home to some of the world’s most famous artworks, including the Mona Lisa. The museum is visited

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems can analyze medical images, identify patterns, and make predictions about patient outcomes.
2. Rise of explainable AI: As AI becomes more pervasive, there is a growing need for transparency and accountability. Explainable AI (XAI) aims to provide insights into how AI systems make decisions, enabling users to understand and trust AI-driven
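
The cell above relies on `stream_and_merge` from `sglang.utils`, whose internals are not shown in this document. As a rough illustration of what "overlap removal" can mean, the sketch below merges streamed chunks by dropping whatever prefix of a new chunk already appears at the end of the accumulated text. The helper names `strip_repeated_prefix` and `merge_stream` are hypothetical, and the chunk shape (a dict with a `text` field) is an assumption modeled on the outputs printed elsewhere on this page.

```python
def strip_repeated_prefix(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks):
    """Accumulate streamed text, skipping whatever each chunk repeats."""
    merged = ""
    for chunk in chunks:
        merged += strip_repeated_prefix(merged, chunk["text"])
    return merged


# Cumulative chunks: each one repeats everything generated so far.
cumulative = [{"text": " Paris"}, {"text": " Paris."}, {"text": " Paris. The"}]
print(repr(merge_stream(cumulative)))

# Partially overlapping chunks are merged the same way.
overlapping = [{"text": "The cap"}, {"text": "capital of"}, {"text": " of France"}]
print(repr(merge_stream(overlapping)))
```

One caveat of this heuristic: it cannot tell a repeated prefix from genuinely new text that happens to start the same way, so accumulated text `"aa"` followed by a fresh chunk `"a"` would be merged as `"aa"` rather than `"aaa"`.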



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Sofia Rodriguez. I'm a 25-year-old freelance writer and editor, living in a small apartment in a bustling city. I've been writing and editing for years, and I'm always looking for new projects and opportunities to collaborate with others.
This is a neutral self-introduction, which means it doesn't reveal too much about Sofia's personality, interests, or background. It provides some basic information about her, such as her name, age, profession, and location, but doesn't go into details. This is suitable for a formal or professional setting, or when you want to make a good impression without revealing too much about yourself.


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. The city is a major cultural and financial hub in Europe and is known for the Eiffel Tower, the Louvre Museum, and many other famou
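
The cell above awaits a single batched call. A common asyncio variation is to issue one awaitable per prompt and collect the results with `asyncio.gather`, which returns them in the order the awaitables were passed in, regardless of which finishes first. The sketch below shows only that pattern: `StandInEngine` is a hypothetical stand-in that mimics an awaitable `async_generate(prompt, sampling_params)` call, and `generate_concurrently` is an illustrative helper, not part of SGLang. Whether this is preferable to one batched call depends on the workload.

```python
import asyncio


class StandInEngine:
    """Hypothetical stand-in for an engine; it only mimics an awaitable call."""

    async def async_generate(self, prompt, sampling_params):
        # Short prompts are made to finish later, to show that completion
        # order does not change the order of the gathered results.
        delay = 0.05 if len(prompt) < 20 else 0.01
        await asyncio.sleep(delay)
        return {"text": f" <completion for {prompt!r}>"}


async def generate_concurrently(engine, prompts, sampling_params):
    """Issue one request per prompt and wait for all of them together."""
    pending = [engine.async_generate(p, sampling_params) for p in prompts]
    # gather preserves the order of its arguments in the returned list.
    return await asyncio.gather(*pending)


async def main():
    prompts = ["Hello, my name is", "The capital of France is"]
    outputs = await generate_concurrently(
        StandInEngine(), prompts, {"temperature": 0.8, "top_p": 0.95}
    )
    for prompt, output in zip(prompts, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")


asyncio.run(main())
```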

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Julian. I'm a 22-year-old student majoring in environmental science. I enjoy hiking, reading, and learning about local ecosystems. My goal is to contribute to environmental conservation efforts through research and policy-making. What are your goals and interests? 
1: What is the subject of the self-introduction?
Answer: The subject of the self-introduction is Julian, the fictional character.
2: What does Julian study?
Answer: Julian is a student majoring in environmental science.
3: What are some of Julian's hobbies?
Answer: Julian enjoys hiking, reading, and learning about local ecosystems.
4: What is Julian's

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is the capital and largest city of France, located in the Île-de-France region in the north-central part of the country. It is situated on the Seine River, approximately 300 kilometers (186 miles) from the Atlantic coast. The city is a major hub for art, fashion, cuisine, and culture, and is home to some of the world’s most famous landmarks, including the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. With a population of over 2.1 million people, Paris is a global center for business, finance, and tourism. The city is served

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  often the subject of debate and speculation. Based on current developments and trends, we can identify several possible future trends in artificial intelligence. These trends may shape the course of AI research and its applications in various industries.
Here are some possible future trends in AI:
1. Increased Use of Edge AI: As AI becomes more pervasive, edge computing will play a crucial role in enabling real-time processing and decision-making at the edge of the network. This will lead to more efficient and secure AI applications, especially in IoT and autonomous systems.
2. Rise of Explainable AI (XAI): As AI becomes more integrated into critical decision-making processes, there


In [6]:
llm.shutdown()

### Return Hidden States

In [7]:
llm = sgl.Engine(
    model_path="meta-llama/Meta-Llama-3.1-8B-Instruct", return_hidden_states=True
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
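
The warning above suggests setting `TOKENIZERS_PARALLELISM` explicitly. One way to do that from Python is sketched below, under the assumption that these lines run at the top of the script, before the engine or anything else that uses `tokenizers` is created. The value `"false"` is a choice made for this sketch; the warning accepts either `true` or `false`.

```python
import os

# Set the variable only if the caller has not already chosen a value.
# This has to happen before any tokenizer is used, so it belongs at the
# very top of the script, ahead of the engine construction.
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")

print("TOKENIZERS_PARALLELISM =", os.environ["TOKENIZERS_PARALLELISM"])
```

Exporting the variable in the shell before starting Python has the same effect and keeps the script itself unchanged.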





In [8]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 10}

outputs = llm.generate(prompts, sampling_params=sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    meta = output["meta_info"]
    # One entry per forward step: the prefill entry covers all prompt tokens.
    hidden_shapes = [h.shape for h in meta["hidden_states"]]
    print(
        f"Prompt: {prompt}\n"
        f"Generated text: {output['text']}\n"
        f"Prompt_Tokens: {meta['prompt_tokens']}\t"
        f"Completion_tokens: {meta['completion_tokens']}\n"
        f"Hidden states: {hidden_shapes}"
    )
    print()

Prompt: Hello, my name is
Generated text:  Sarah. I'm a 24-year-old graphic
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The president of the United States is
Generated text:  the head of state and head of government of the
Prompt_Tokens: 8	Completion_tokens: 10
Hidden states: [torch.Size([8, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The capital of France is
Generated text:  a city that never fails to amaze. From
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Si

In [9]:
llm.shutdown()