# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
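
As a rough illustration of the custom-server idea, the sketch below wraps an engine-like object in a plain request handler. `StubEngine` and `handle_generate_request` are hypothetical names used only for this example. The stub mimics the output shape used elsewhere in this document (one dict with a `text` field per prompt); a real server would pass an `sgl.Engine` instance instead and sit behind the web framework of your choice.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompts, sampling_params):
        # Mimic the output shape used in this document: one dict with "text" per prompt.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def handle_generate_request(engine, body: str) -> str:
    """Parse a JSON request body, run batch generation, and return a JSON response."""
    request = json.loads(body)
    prompts = request["prompts"]
    sampling_params = request.get("sampling_params", {})
    outputs = engine.generate(prompts, sampling_params)
    return json.dumps({"texts": [out["text"] for out in outputs]})


request_body = json.dumps(
    {"prompts": ["The capital of France is"], "sampling_params": {"temperature": 0.0}}
)
response = handle_generate_request(StubEngine(), request_body)
print(response)
```

Keeping the handler independent of any web framework makes it easy to test with a stub before loading a real model.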



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, you need to add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
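
To see why this is needed, the standard-library sketch below reproduces the underlying failure: calling `asyncio.run` while an event loop is already running (as it is inside IPython or Jupyter) raises a `RuntimeError`. The function names `inner` and `outer` are made up for this illustration, and the exact error message may differ between Python versions.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # An event loop is already running here, just as it is inside IPython/Jupyter.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: not allowed by default
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Applying `nest_asyncio` patches the event loop so that nested calls of this kind are permitted.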

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Cynthia. I'm currently 17 and I was born in the United States. I'm a strong-willed girl who loves to go out and have fun. I have a lot of hobbies and interests, such as playing the guitar, swimming, and reading. I also like to travel a lot and take a lot of photos. I'm really excited to be learning to code and I'm really looking forward to coding my first game. I'm learning and improving all the time and I'm always looking for new things to learn.
Can you tell me about your favorite hobby or activity that you enjoy? Cynthia is a strong-willed girl
Prompt: The president of the United States is
Generated text:  traveling in Europe. At airports, he must go through customs, and customs in the United States and the European Union are different. In the United States, people are allowed to bring their own personal items into the country. However, in the European Union, people are not allowed to bring their own personal items, so customs officers need

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Age] year old [Occupation]. I'm a [Skill or Hobby] enthusiast. I love [What I Enjoy Doing]. I'm a [What I Do for a Living]. I'm a [What I Do for a Living]. I'm a [What I Do for a Living]. I'm a [What I Do for a Living]. I'm a [What I Do for a Living]. I'm a [What I Do for a Living]. I'm a [What I Do for a Living]. I'm a [What I Do for a Living]. I'm a [What I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French National Library, and the French Academy of Sciences. Paris is a cultural and economic center, with a rich history dating back to the Roman Empire and a modern city that has undergone significant development over the centuries. It is a popular tourist destination, attracting millions of visitors each year. The city is also known for its cuisine, including French cuisine, and its fashion industry, with many famous designers and boutiques. Paris is a city of contrasts, with

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some possible future trends in AI:

1. Increased automation and robotics: As AI technology continues to advance, we can expect to see more automation and robotics in various industries, including manufacturing, transportation, and healthcare. This will lead to increased efficiency, productivity, and cost savings for businesses and individuals.

2. Enhanced cognitive abilities: AI will continue to improve its ability to process and analyze information, leading to more sophisticated and nuanced decision-making. This will enable AI to better understand and respond to human emotions,
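
The "overlap removal" mentioned in the cell above can be pictured with a small, library-independent sketch. It assumes that a streamed chunk may repeat the tail of the text received so far, and trims that repeated prefix before appending. `trim_overlap` and `merge_chunks` are illustrative helpers written for this example, not necessarily how `stream_and_merge` is implemented in `sglang.utils`.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each one repeats the end of the previous one.
chunks = ["The capital", "capital of France", "France is Paris."]
merged_text = merge_chunks(chunks)
print(merged_text)
```

A heuristic like this can also trim text that is legitimately repeated at a chunk boundary, so treat it as a sketch of the idea rather than a general-purpose merge.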



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [job title] at [Company]. I enjoy [how I make time to relax and relax about life], and my favorite thing to do in my free time is [something specific]. How do you spend your free time? As an AI language model, my job is to provide information and answer questions to the best of my ability. While I can't engage in physical activities, I can assist with answering questions and providing helpful information. How can I help you today? Be sure to ask me any questions you have! I look forward to hearing from you! [Name] [Company] [Contact Information

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the capital city of France. It is located in the south of the country and is the largest city in the European Union. The city is known for its rich cultural heritage, beautiful architect
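
When requests arrive separately rather than as one list, one option is to issue several `async_generate` calls concurrently and await them together. The sketch below shows that pattern with `asyncio.gather`. `StubAsyncEngine` and `generate_groups` are hypothetical names so the example runs anywhere; the stub assumes only the call shape used above (a list of prompts in, a list of dicts with a `text` field out).

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in for the engine; it only mimics the call shape used above."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def generate_groups(engine, prompt_groups, sampling_params):
    """Run one async_generate call per group concurrently; results keep input order."""
    tasks = [engine.async_generate(group, sampling_params) for group in prompt_groups]
    return await asyncio.gather(*tasks)


groups = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(generate_groups(StubAsyncEngine(), groups, {"temperature": 0.8}))
for group, outputs in zip(groups, results):
    for prompt, output in zip(group, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```

`asyncio.gather` returns results in the order the awaitables were passed, so each result list lines up with its prompt group. Inside IPython, the `asyncio.run` call would additionally need the `nest_asyncio` patch described earlier.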

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name], and I am a [insert profession or role]. I come from [insert home or place of origin] and have been in [insert occupation] for [insert number of years] years. I enjoy [insert one or two hobbies], [insert interests], and [insert any skills or talents]. If you could give me a brief description of yourself, that would be great! [insert a brief description, ideally incorporating at least one hobby, one or two interests, or any other attributes that would make you stand out].
Hello, my name is [insert name], and I am a [insert profession or role]. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic Eiffel Tower and medieval cathedrals.

This statement encapsulates the key points about Paris, including its location, iconic features, and historical significance, while being concise and to the point. It also provides context by mentioning the Eiffel Tower and the medieval cathedrals. The statement is factual and appropriate for a general reader's understanding of the city.

For a more detailed statement, consider incorporating the following additional information:
- The population of Paris is around 2. 1 million.
- Paris is home to the French Parliament, where the president of France resides.
- Paris

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly speculative and relies on many factors such as technological advancements, economic factors, ethical considerations, and societal shifts. Some potential future trends in AI include:

1. Enhanced AI: AI will continue to get better at performing tasks that require reasoning, learning, and problem-solving. This could lead to the creation of more intelligent machines that can handle complex tasks in industries such as healthcare, finance, and transportation.

2. Autonomous machines: As AI technology improves, autonomous machines that can perform tasks without human intervention will become more common. This could lead to a more efficient and safer society, but it also raises concerns about the impact on jobs and privacy.

In [6]:
llm.shutdown()
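
In longer scripts it can be convenient to guarantee that `shutdown()` runs even when generation raises an error. The sketch below wraps an engine-like object in a small context manager. `managed_engine` and `StubEngine` are hypothetical helpers for this illustration; the only engine method assumed beyond `generate` is the `shutdown()` call shown above.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        raise RuntimeError("simulated failure during generation")

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call shutdown() afterwards, even on errors."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with managed_engine(engine) as active_engine:
        active_engine.generate(["Hello, my name is"], {"temperature": 0.8})
except RuntimeError as exc:
    print(f"generation failed: {exc}")

print("shut down:", engine.is_shut_down)
```

With a real engine, the same wrapper would be used as `with managed_engine(sgl.Engine(model_path=...)) as llm:`.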