# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
To use the **Offline Engine** in IPython, or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
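The likely reason is that the asynchronous examples in this document call `asyncio.run`, which refuses to start when an event loop is already running, as is the case inside an IPython or Jupyter kernel. The sketch below reproduces that error using only the standard library, without SGLang. The names `inner` and `outer` are illustrative, and `outer` merely stands in for a notebook cell.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # An event loop is already running here, much like inside a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: not allowed by default
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` patches the event loop so that such nested calls are permitted.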

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Kerstin Thiele, I'm a PhD student in my third year, I'm interested in computational and statistical biology and I'm working on the computational analysis of genomes. At the moment, I'm working on developing methods to deal with the "black-box nature" of genome analysis, and developing efficient methods to estimate the parameters of models that are only known for a limited number of parameters. For example, I'm interested in developing methods to deal with the "black-box nature" of genome analysis, and developing efficient methods to estimate the parameters of models that are only known for a limited number of parameters. My research interests have been in computational
Prompt: The president of the United States is
Generated text:  trying to decide how many military personnel to allocate to different branches of the military. The president can allocate anywhere from 0 to 1000 military personnel to each branch of the military. He will consider a

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can you tell me about yourself? I'm a [insert a short description of your personality or skills]. And what's your favorite hobby or activity? I'm always looking for new experiences and challenges, so I'm always up for trying new things. What's your favorite book or movie? I love reading and watching movies, and I'm always looking for new adventures to explore. What's your favorite place to go? I love to travel, so I'm always looking for new destinations to

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French Academy of Sciences, and the French National Library. Paris is a bustling metropolis with a rich cultural heritage and is a major tourist destination. It is also known for its cuisine, fashion, and art scene. The city is home to many international organizations and institutions, including the European Parliament and the United Nations. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. It is a city that is both

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be a greater emphasis on ethical considerations. This will include issues such as bias, transparency, and accountability.

2. More integration with other technologies: AI will continue to be integrated with other technologies, such as machine learning, natural language processing, and computer vision. This will lead to new applications and opportunities for AI to be used in new ways.

3. Greater use of AI in healthcare: AI is
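The "overlap removal" mentioned in the cell above can be pictured as trimming, from each new chunk, any prefix that repeats the end of the text accumulated so far. The following is a minimal sketch of that idea under stated assumptions: the helper names `trim_overlap` and `merge_chunks` and the sample chunks are illustrative, and the code is not presented as the exact implementation behind `stream_and_merge`.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while skipping text that repeats the current tail."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical stream in which consecutive chunks partially repeat earlier text.
chunks = [" Paris", " Paris, the city", " city of light"]
merged = merge_chunks(chunks)
print(merged)
```

One trade-off of this approach is that a chunk which legitimately begins with the same characters that end the previous text would also be trimmed, so it fits best when chunks are known to overlap.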



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [Job Title] at [Company Name]. Currently, I work at [Location], and I am passionate about [interest or hobby]. I am always up-to-date on the latest trends and technologies in my field, and I am driven by a desire to keep learning and improving. I have a strong work ethic and a positive attitude, and I am always ready to learn from new experiences and challenges. What's your profession and what do you do for a living? Feel free to add any additional information that you think would be helpful to know about yourself. I am excited to start our conversation and get to

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, an iconic city located on the Seine River, known for its rich history, stunning architecture, and vibrant cultural scene. The city is home to landmarks such as the Louvre Museu
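A practical benefit of the asynchronous interface is that other coroutines can make progress while generation is awaited. The sketch below illustrates this with `asyncio.gather`. Because a real engine needs a GPU and model weights, it uses `StandInEngine`, a hypothetical class that only mimics the list-of-dicts output shape shown above; it is not part of SGLang, and `heartbeat` is likewise an illustrative placeholder.

```python
import asyncio


class StandInEngine:
    """Hypothetical stand-in that mimics the output shape; not an SGLang class."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to wait for the model
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(ticks):
    # Unrelated work that keeps running while generation is awaited.
    for _ in range(3):
        ticks.append("tick")
        await asyncio.sleep(0)


async def main():
    engine = StandInEngine()
    ticks = []
    prompts = ["The capital of France is", "The future of AI is"]
    # Run generation and the heartbeat concurrently on one event loop.
    outputs, _ = await asyncio.gather(
        engine.async_generate(prompts, {"temperature": 0.8}),
        heartbeat(ticks),
    )
    return prompts, outputs, ticks


prompts, outputs, ticks = asyncio.run(main())
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```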

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], I'm a [Role]! I'm a [Skill or Expertise] expert in [Field or Area of Focus]. I've been around for [Number] years, and I always try to [Express a positive attitude or enthusiasm]. I'm here to [Provide a brief overview of your area of expertise or expertise, such as your experience, knowledge, or skills]. I'm always ready to help others, and I'm here to support you. What's your name, and what's your profession? If you have any questions or need more information, please don't hesitate to ask. [Contact Information] I look


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in Europe and is the seat of the French government, official residence of the President of France, and the country’s cultural, financial, and political capital. The city is known for its stunning architecture, rich history, and diverse culture. Paris is home to the Eiffel Tower, the Louvre Museum, the Notre-Dame Cathedral, and many other iconic landmarks that showcase the city's artistic and architectural heritage. The city also has a vibrant nightlife, cosmopolitan atmosphere, and a long-standing tradition of classical music, cinema, and theater. In addition, Paris is an important center of the world's


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to continue to evolve and develop rapidly. Some potential future trends include:

1. Increased integration with other technologies: AI will likely continue to integrate with other technologies like machine learning, computer vision, natural language processing, and robotics, further improving its capabilities in processing data and making decisions.

2. Greater emphasis on ethical considerations: As AI becomes more advanced, it will be important to address potential ethical concerns, such as bias and privacy, in its applications.

3. Increased focus on integration with human work: AI will likely continue to be integrated more deeply into the work of humans, with the goal of improving efficiency and productivity.

4. More



In [6]:
llm.shutdown()
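
In a script, as opposed to a notebook, it can be useful to guarantee that `shutdown()` runs even when generation raises an exception. One way to sketch this is a small context manager built on `try`/`finally`. The `managed` helper and the `StandInEngine` class below are hypothetical and exist only so the example runs without a GPU; the only engine method assumed is `shutdown()`, as used above.

```python
from contextlib import contextmanager


class StandInEngine:
    """Hypothetical stand-in exposing only shutdown(); not an SGLang class."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed(engine):
    """Yield the engine and always call shutdown() on exit."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StandInEngine()
try:
    with managed(engine) as llm:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(engine.is_shut_down)
```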