# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()
```
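
To see why this is needed, the sketch below reproduces the underlying problem with the standard library only: calling `asyncio.run()` while an event loop is already running raises a `RuntimeError`. The function names `inner` and `outer` are illustrative, and `outer` merely stands in for a notebook cell that executes inside a running loop. `nest_asyncio.apply()` patches asyncio so that such nested calls are allowed.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # Inside this coroutine an event loop is already running,
    # which is the situation a notebook cell is in.
    coro = inner()
    try:
        asyncio.run(coro)  # starting a second loop is rejected
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```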

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Mia, and I'm a 13 year old who has been diagnosed with autism. What can I do to start to deal with it? I've noticed that my emotions have become more intense and my thoughts become very focused on what I'm doing right now and not on other things, such as the future. I'm not sure if I have any options at this point. Can you help me?

Mia

I'm so sorry to hear that you're going through this, Mia. Starting autism support groups can be very helpful for people with autism, as it can provide a sense of community and a safe space to talk about your
Prompt: The president of the United States is
Generated text:  a person, but the vice president is a position. (Judge true or false)

To determine whether the statement "The president of the United States is a person, but the vice president is a position" is true or false, let's analyze each part of the statement separately.

1. **The president of the United States is a person:**
   - The president of the 
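
For offline batch jobs it is often convenient to persist results. The sketch below pairs each prompt with its generated text and serializes the pairs as JSON Lines. It relies only on what the cell above shows, namely that each output is a dict with a `"text"` key. The helper name `to_jsonl` and the record layout are assumptions made for illustration, not part of SGLang.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text, one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "text": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Stand-in data shaped like the outputs above (a list of dicts with a "text" key).
example_prompts = ["The capital of France is"]
example_outputs = [{"text": " Paris."}]
print(to_jsonl(example_prompts, example_outputs))
```

With a live engine, `to_jsonl(prompts, llm.generate(prompts, sampling_params))` would produce one line per prompt.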

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French Academy of Sciences, and the French Parliament building. Paris is a bustling metropolis with a rich cultural heritage and is a major tourist destination. It is also home to the French Riviera, a popular tourist destination for its beaches and Mediterranean cuisine. The city is known for its fashion industry, with many famous designers and boutiques located in the city. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly. It is

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we interact with technology and the world around us. Here are some of the most likely trends in AI that are currently being explored and are likely to continue in the coming years:

1. Increased automation and robotics: As AI technology continues to improve, we are likely to see an increase in the automation and robotics of various industries. This could lead to the creation of more efficient and cost-effective solutions, as well as the creation of new jobs that are not currently being filled by humans.

2. AI-powered healthcare: AI is already being used in healthcare to improve patient
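
The "overlap removal" mentioned in the cell above can be pictured as follows. This is a minimal sketch of one way to merge streamed chunks when a new chunk may repeat the tail of the text received so far. The helpers `trim_overlap` and `merge_chunks` are written here for illustration, and the actual implementation of `stream_and_merge` in `sglang.utils` may differ.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, removing text each chunk repeats from the previous ones."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


print(merge_chunks(["The capital", " capital of", " of France"]))
# prints "The capital of France"
```

Note that this heuristic cannot distinguish a repeated tail from a genuine repetition in the generated text (for example `"ha"` followed by `"ha"`), so it is only appropriate when chunks are known to overlap.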



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [career]. [Career] [Name], a passionate and [job role], is known for [job description]. I am currently [current position] and [name] is proud to say that I am [self-introduction]! As a [career], [Name] is dedicated to [job role], excelling in [job role] and delivering exceptional results. I am a [career] with a passion for [job role] and am always ready to [job role] with my dedication, courage, and [job role] attitude. Joining [Name] at [name] is one

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a city renowned for its grand architecture, rich culture, and vibrant arts scene. It's the country's largest city and one of the most popular tourist destinations, drawing millions of visitors each year. Paris is known for its iconic landmarks like the Eiffel Tower, Louvre Museum, and Notr
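
One reason to prefer the asynchronous API is that several requests can be in flight at once. The sketch below submits several prompt groups concurrently while capping how many run at the same time. It assumes only the `async_generate(prompts, sampling_params)` call shown above and that the engine tolerates overlapping calls. The helper `generate_groups` and its `max_in_flight` parameter are hypothetical names introduced for this example.

```python
import asyncio


async def generate_groups(engine, prompt_groups, sampling_params, max_in_flight=2):
    """Run several prompt groups concurrently, at most max_in_flight at a time."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def run_group(group):
        async with semaphore:
            return await engine.async_generate(group, sampling_params)

    # gather returns results in the same order as prompt_groups
    return await asyncio.gather(*(run_group(group) for group in prompt_groups))
```

Inside a coroutine, `results = await generate_groups(llm, [prompts[:2], prompts[2:]], sampling_params)` would yield one list of outputs per group.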

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm [Age]. I've always been fascinated by [what is your profession or interest?]. I have a genuine passion for [the specific field or interest you'd like to share about]. I'm confident in my abilities and eager to learn and grow in this area. I'm always up for new challenges and enjoy using my creativity to solve problems. I'm a team player, and I thrive in fast-paced environments. I'm always looking for opportunities to learn and grow, and I'm always eager to share my knowledge with others. I'm excited to dive into the world of [describe your field of interest]



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

### Briefly Summarize the Location of Paris

**Paris, France** is located in the region of **Paris**, in the central-southern part of France. It is the capital city of France and is the largest city in Europe. Located in the center of the **Frappie**, Paris is the most populated city in France, and the capital of its 8 regions.

### Briefly Summarize the Population and Economy of Paris

**Paris** is a major city in France, with a population of over **7 million** people. The city is an international economic center that has become a center


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be characterized by a number of possible trends. Here are some of the most likely developments:

  1. Increased depth and complexity of AI systems: As AI systems become more complex and nuanced, it is possible that they will be able to perform tasks that were previously thought to be impossible, such as understanding natural language, generating creative art, or making decisions based on social and cultural norms.
  2. Integration of AI into human society: As AI becomes more integrated into everyday life, it is possible that it will be able to perform tasks that were previously thought to be impossible, and that it will become an integral part of



In [6]:
llm.shutdown()
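
In a script, an exception raised between launching the engine and reaching `llm.shutdown()` would skip the shutdown call. One way to guard against that is a small context manager, sketched below. It assumes only the `shutdown()` method shown above, and `engine_session` is a hypothetical helper, not part of SGLang.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(engine):
    """Yield the engine and always call shutdown() afterwards, even on errors."""
    try:
        yield engine
    finally:
        engine.shutdown()


# Intended usage (requires a real engine, so it is left as a comment):
# with engine_session(sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")) as llm:
#     outputs = llm.generate(prompts, sampling_params)
```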