# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

**To launch the offline engine in a Python script, you must place the launch code under an `if __name__ == "__main__":` guard, because the engine uses `spawn` mode to create subprocesses. See this [simple example](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py) for details.**
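As a rough illustration of the `__main__` guard described above, a standalone script might be laid out as follows. This is a minimal sketch, not the linked example: the `--model-path` flag and the helper names `parse_args` and `run` are assumptions made for illustration, while `sgl.Engine`, `generate`, and `shutdown` are the calls used elsewhere in this document.

```python
import argparse

PROMPTS = ["Hello, my name is", "The capital of France is"]
SAMPLING_PARAMS = {"temperature": 0.8, "top_p": 0.95}


def parse_args(argv=None):
    """Parse the (hypothetical) --model-path flag; unknown flags are ignored."""
    parser = argparse.ArgumentParser(description="Offline engine launch sketch")
    parser.add_argument("--model-path", default=None)
    args, _unknown = parser.parse_known_args(argv)
    return args


def run(model_path):
    """Start the engine, generate once, and always shut the engine down."""
    # Imported here so that importing this module does not start anything.
    import sglang as sgl

    llm = sgl.Engine(model_path=model_path)
    try:
        outputs = llm.generate(PROMPTS, SAMPLING_PARAMS)
        return [output["text"] for output in outputs]
    finally:
        llm.shutdown()


# Subprocesses created in spawn mode re-import this file, so the engine
# is only created when the file is executed as the main script.
if __name__ == "__main__":
    args = parse_args()
    if args.model_path is None:
        print("Pass --model-path <model> to launch the engine.")
    else:
        for prompt, text in zip(PROMPTS, run(args.model_path)):
            print(f"Prompt: {prompt}\nGenerated text: {text}")
```

Keeping the engine creation inside a function that is only called under the guard means a re-import of the file by a spawned subprocess defines the helpers but does not try to start a second engine.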

You can also build a custom server on top of the SGLang offline engine. A detailed example that runs as a Python script is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
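To give a feel for what "a custom server on top of the engine" can mean, the sketch below shows only the request-handling core, independent of any web framework. It is not the linked `custom_server` example. The JSON field names (`prompt`, `sampling_params`), the `handle_generate` helper, and the `EchoEngine` stand-in are all assumptions for illustration, and the sketch assumes that calling `generate` with a single prompt returns a single dictionary with a `text` key.

```python
import json


def handle_generate(engine, body):
    """Turn a JSON request body into a JSON response via `engine.generate`."""
    request = json.loads(body)
    prompt = request["prompt"]
    sampling_params = request.get("sampling_params", {})
    output = engine.generate(prompt, sampling_params)
    return json.dumps({"text": output["text"]})


class EchoEngine:
    """Hypothetical stand-in so the sketch runs without loading a model."""

    def generate(self, prompt, sampling_params):
        # Reverses the prompt instead of running inference.
        return {"text": prompt[::-1]}
```

A web framework route would pass the raw request body to `handle_generate` together with a real engine instance created once at startup.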

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio
from sglang.test.test_utils import is_in_ci

if is_in_ci():
    import patch

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.31it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Elia and I'm a language student. I am trying to learn Japanese. I am really interested in this language and I want to practice with you. I have already learned the basics (Hiragana and Katakana, some basic vocabulary, grammar, etc.) and I want to improve my listening and speaking skills.
I'm happy to practice with you, Elia! I'm a native Japanese speaker, so I can help you with listening and speaking. How would you like to practice? Do you have a specific topic in mind, or would you like me to suggest something? For example, we could practice a conversation about
Prompt: The president of the United States is
Generated text:  forced to make an executive decision that could have far-reaching consequences for the country. The nation holds its breath as the president weighs the pros and cons of the decision and ultimately makes a choice that will be remembered for generations to come.
This is a common theme in presidential dramas, and it's one tha

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor living in Tokyo. I enjoy reading, hiking, and trying new foods. I'm currently working on a novel and experimenting with different writing styles. I'm looking forward to meeting new people and learning more about their experiences.
This self-introduction is neutral because it doesn't reveal any personal biases or opinions. It simply states the character's name, age, occupation, and interests. It also mentions a current project and a desire to learn more about others, which shows that the character is open-minded and interested in connecting with others.
Here are a few things to consider when

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is located in the northern part of the country and is situated on the Seine River. Paris is known for its rich history, art, fashion, and cuisine. It is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. Paris is a popular tourist destination and is considered one of the most romantic cities in the world. It is also a major hub for business, finance, and culture. The city has a population of over 2.1 million people and is a global center for fashion, cuisine, and art. Paris is a city that is steep

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by several factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, including the development of AI-powered robots that can assist with surgeries and other medical procedures.
2. Widespread adoption of AI in education: AI has the potential to revolutionize the education sector by providing personalized learning experiences, automating grading and assessment, and enabling real
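
The banner printed by the cell above mentions overlap removal. One plausible way to merge streamed chunks whose beginnings repeat text that was already received is sketched below. The names `strip_overlap` and `merge_chunks` are illustrative, and this is not claimed to be how `stream_and_merge` is implemented.

```python
def strip_overlap(existing_text, new_chunk):
    """Return the part of new_chunk that does not repeat the end of existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, dropping overlapping prefixes."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


print(merge_chunks(["Hello", "lo wor", "world"]))  # prints "Hello world"
```

A suffix/prefix match like this cannot tell a repeated chunk from text that legitimately repeats, so it only suits streams that are known to resend overlapping text.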



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Axel and I'm a 25-year-old freelance writer and occasional musician. I enjoy spending my free time reading, playing guitar, and trying out new recipes in the kitchen. What can I share with you? (This is a template for a social media bio or a profile introduction) ... (more)

## Step 1: Determine the key elements to include in the self-introduction
The key elements to include in the self-introduction are the person's name, age, occupation, and any relevant hobbies or interests.

## Step 2: Identify the tone and style of the introduction
The tone and style of the introduction should be

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a city with a rich history, cultural landmarks, and a blend of modern and traditional architecture.
Here is a short and concise factual statement about France's capital city:
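
Because `async_generate` is awaited, several calls can be scheduled together with `asyncio.gather`. The sketch below uses a hypothetical `StubEngine` that only mimics the awaitable shape of the call so that it runs without a model; `generate_groups` is likewise an illustrative helper. Whether submitting batches concurrently helps with the real engine is an assumption that depends on its scheduler.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in that mimics only the shape of `async_generate`."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # yield control to the event loop once
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def generate_groups(engine, prompt_groups, sampling_params):
    """Submit several prompt batches concurrently and keep results in order."""
    tasks = [engine.async_generate(group, sampling_params) for group in prompt_groups]
    results = await asyncio.gather(*tasks)  # results follow the order of tasks
    return [[output["text"] for output in outputs] for outputs in results]
```

`asyncio.gather` returns results in the order the awaitables were passed, so each inner list lines up with its prompt group.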

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware streaming helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Asher Blackwood. I'm a 28-year-old software developer living in Seattle. I enjoy hiking and playing guitar in my free time. I'm a bit of a introvert, but I'm working on being more outgoing. I'm currently looking for a new opportunity that will challenge me and allow me to grow professionally. I'm looking for a company culture that values teamwork, innovation, and creativity. That's me in a nutshell. Feel free to ask me any questions.
This text is an example of a:
Answer: A professional or networking introduction, such as a LinkedIn profile or a job interview.
The text is

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is located in the northern part of the country on the Seine River. It is known for its cultural and historical significance. It has many iconic landmarks such as the Eiffel Tower and the Louvre Museum.
Describe the Eiffel Tower. The Eiffel Tower is an iron lattice tower located in Paris, France. It was built in 1889 for the World's Fair and stands at a height of 324 meters (1,063 feet). It is one of the most recognizable landmarks in the world and is visited by millions of people each year. The tower has five distinct levels, with the top level

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be shaped by advancements in machine learning, natural language processing, and computer vision. As AI continues to improve, it is likely to have a significant impact on various industries and aspects of our lives. Some possible future trends in AI include:
Increased use of AI in healthcare: AI is expected to play a more significant role in healthcare, including diagnosis, treatment, and patient care. For example, AI-powered systems can analyze medical images, identify potential health risks, and suggest personalized treatment plans.
Rise of autonomous vehicles: Autonomous vehicles are expected to become more prevalent, with AI playing a key role in their development. AI can enable vehicles

In [6]:
llm.shutdown()
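
The streaming loop in this section prints each chunk as it arrives. When the complete text is also needed afterwards, the chunks can be accumulated while iterating. The sketch below shows that pattern with a hypothetical `fake_stream` generator standing in for the engine's stream; `collect_stream` is an illustrative helper, not part of SGLang.

```python
import asyncio


async def fake_stream(chunks):
    """Hypothetical stand-in for a streaming call: yields text chunks one by one."""
    for chunk in chunks:
        await asyncio.sleep(0)  # give other tasks a chance to run
        yield chunk


async def collect_stream(stream):
    """Consume an async stream of text chunks and return the joined text."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)


text = asyncio.run(collect_stream(fake_stream([" Paris", ".", " Paris", " is"])))
print(repr(text))  # prints ' Paris. Paris is'
```

Collecting into a list and joining once at the end avoids repeated string concatenation for long generations.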

### Return Hidden States

In [7]:
llm = sgl.Engine(
    model_path="meta-llama/Meta-Llama-3.1-8B-Instruct", return_hidden_states=True
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.32it/s]



In [8]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 10}

outputs = llm.generate(prompts, sampling_params=sampling_params)
for prompt, output in zip(prompts, outputs):
    meta = output["meta_info"]
    print("===============================")
    print(f"Prompt: {prompt}")
    print(f"Generated text: {output['text']}")
    print(
        f"Prompt_Tokens: {meta['prompt_tokens']}\t"
        f"Completion_tokens: {meta['completion_tokens']}"
    )
    print(f"Hidden states: {[i.shape for i in meta['hidden_states']]}")
    print()

Prompt: Hello, my name is
Generated text:  Tyler and I am a leader with the Boy Scouts
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The president of the United States is
Generated text:  a very powerful person. But being a president is
Prompt_Tokens: 8	Completion_tokens: 10
Hidden states: [torch.Size([8, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The capital of France is
Generated text:  getting a new mayor, and it's a woman
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]
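
In the complete outputs shown above, the first hidden-state entry has one row per prompt token (for example 6 × 4096 for a 6-token prompt), and each later entry is a single vector of width 4096; with 10 completion tokens there are 9 such vectors. The sketch below checks that layout from a list of shapes. `summarize_hidden_state_shapes` is a hypothetical helper that works on plain tuples, so it can be fed `tuple(t.shape)` values without importing `torch`.

```python
def summarize_hidden_state_shapes(shapes):
    """Summarize a list of hidden-state shapes such as [(6, 4096), (4096,), ...]."""
    first, *rest = [tuple(shape) for shape in shapes]
    prompt_tokens, hidden_size = first
    # Every later entry is expected to be one vector of width hidden_size.
    if any(shape != (hidden_size,) for shape in rest):
        raise ValueError("expected one vector of size hidden_size per later entry")
    return {
        "prompt_tokens": prompt_tokens,
        "later_entries": len(rest),
        "hidden_size": hidden_size,
        "total_positions": prompt_tokens + len(rest),
    }


shapes = [(6, 4096)] + [(4096,)] * 9  # shapes from the first output above
print(summarize_hidden_state_shapes(shapes))
```

For the first prompt this reports 6 prompt tokens, 9 later entries, and 15 positions in total, which matches the shapes printed by the cell.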

In [9]:
llm.shutdown()