# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
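
To see why the patch is needed, the following stdlib-only sketch reproduces the underlying failure: `asyncio.run` refuses to start when an event loop is already running, which is the situation inside IPython or Jupyter. The coroutine names `inner` and `outer` are illustrative and not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here,
    # so a nested asyncio.run() call is rejected.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

With `nest_asyncio.apply()` in place, the running loop is patched so that nested calls of this kind are allowed.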

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# Pass the whole list at once; one output dict is returned per prompt
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Tanya. I am a teacher at the Chinese language school. I am very friendly to students. Every day, I like to take them to the nearby mountains. They like to ride bikes in the park. They often have fun in the park. I think it is a very nice place. I often teach them how to make snowmen. The students like to have fun with my snowmen. My class is very popular. I am very proud of my students. As I like to travel, I want to visit a big city in the future. Do you have any ideas about traveling? Come to visit me in China. I will
Prompt: The president of the United States is
Generated text:  a holder of what type of political office?
A. Executive
B. Legislative
C. Executive or Legislative
D. None of the above
Answer:

A

Which of the following is NOT a component of a computer network?
A. Central Processing Unit
B. Server
C. Terminal
D. Resource Pool
Answer:

D

Which of the following options is related to the project execution process?
A. Project Planni
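
In a batch job the generations are usually stored rather than printed. The sketch below pairs each prompt with its completion and serializes the pairs as JSON Lines. The helper name `to_jsonl` and the stand-in data are hypothetical; the only thing taken from the example above is that each output is a dict with a `"text"` key.

```python
import io
import json


def to_jsonl(prompts, outputs):
    """Serialize prompt/completion pairs as one JSON object per line."""
    buffer = io.StringIO()
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        buffer.write(json.dumps(record, ensure_ascii=False) + "\n")
    return buffer.getvalue()


# Stand-in data shaped like the result of llm.generate above
# (assumption: each item is a dict with a "text" key).
demo_prompts = ["The capital of France is"]
demo_outputs = [{"text": " Paris."}]

jsonl = to_jsonl(demo_prompts, demo_outputs)
print(jsonl, end="")
```

Writing the returned string to a file gives one record per line, which is convenient for later filtering or evaluation.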

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    # Stream the generation and merge the chunks into one string
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a cultural and economic hub, with a rich history dating back to the Roman Empire and being the birthplace of the French Revolution. Paris is a popular tourist destination and a major center for business and finance. The city is known for its cuisine, fashion, and art, and is home to many famous museums, theaters, and landmarks. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. Its status as the capital of France is a testament to its

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior and decision-making processes.

2. Greater emphasis on ethical considerations: As AI becomes more advanced, there will be a greater emphasis on ethical considerations, including issues such as bias, transparency, and accountability.

3. Increased use of AI in healthcare: AI is likely to play a more significant role in healthcare, with more advanced algorithms being used to diagnose and treat diseases, predict patient outcomes, and improve patient care.

4. Greater focus on AI in manufacturing
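
The "overlap removal" mentioned above can be pictured as follows: when a streamed chunk repeats the tail of the text received so far, only the new part is appended. This is a minimal sketch of that idea under the assumption that overlaps are exact character matches; `trim_overlap` and `merge_chunks` are hypothetical names and are not presented as the implementation behind `stream_and_merge`.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the tail of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink it.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while dropping text that overlaps the previous tail."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged = merge_chunks(["The capital", " capital of France", " is Paris"])
print(merged)
```

A purely textual check like this cannot tell a repeated chunk from new text that happens to start with the same characters, so it suits streams that are known to resend overlapping text.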



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    # Await the whole batch; one output dict is returned per prompt
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  John and I am a passionate student of science fiction. I am an experienced writer with a strong passion for exploring the endless possibilities of the universe. I enjoy writing science fiction and fantasy stories, and I love reading and discussing the themes of love, loss, and power. I am excited to meet you! How can I make a good first impression on you? Hello, I am John, and I am a science fiction writer with a passion for exploring the endless possibilities of the universe. I enjoy writing science fiction and fantasy stories, and I love reading and discussing the themes of love, loss, and power. I am excited to meet you

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 
Answer: Paris, the largest city in France, is the capital city of the country. Its history spans over 1,000 years and is home to nume
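
The benefit of the asynchronous interface is that the caller can await generation while other coroutines make progress. The sketch below illustrates the general `asyncio` pattern with a stand-in coroutine, so it runs without a model. `fake_async_generate` and `run_concurrently` are hypothetical and only mimic the `{"text": ...}` output shape used above; in real code the awaited call would be `llm.async_generate`.

```python
import asyncio


async def fake_async_generate(prompt, sampling_params):
    """Stand-in for an awaitable generation call (not an SGLang function)."""
    await asyncio.sleep(0.01)  # simulate waiting on the engine
    return {"text": f" <completion for: {prompt}>"}


async def run_concurrently(prompts, sampling_params):
    tasks = [fake_async_generate(p, sampling_params) for p in prompts]
    # gather returns results in the same order as its arguments
    return await asyncio.gather(*tasks)


demo_prompts = ["first prompt", "second prompt"]
results = asyncio.run(run_concurrently(demo_prompts, {"temperature": 0.8}))
for prompt, output in zip(demo_prompts, results):
    print(prompt, "->", output["text"])
```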

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name of character], and I'm a [insert profession or profession role] with a passion for [insert something related to your field of interest]. I'm always up for new experiences and always ready to learn and improve. I value empathy and kindness, and I strive to create positive impact in my community. I'm a curious person who loves to research new topics and learn new things. I enjoy sharing my knowledge and expertise with others and helping them grow. I'm a team player and always willing to lend a helping hand. I have a natural curiosity and love to learn, so I'm always looking for new ways to expand my

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is the largest city in France and the seat of the French government. It is a bustling metropolis with a rich history and cultural heritage. The city is known for its iconic landmarks such as the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. Paris is also a major center for business, culture, and entertainment, attracting millions of tourists annually.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be one of rapid and continuous innovation, with many interesting developments in the coming years. Some potential trends that are likely to shape the future of AI include:

1. Enhanced Data Privacy and Security: As more data is generated and analyzed, we are likely to see increased focus on data privacy and security. AI-powered tools and systems will be required to ensure that sensitive information is handled securely and ethically.

2. AI for Healthcare: AI will play a more significant role in healthcare in the future, particularly in personalized medicine. AI will be used to analyze medical data and provide insights into patient outcomes, allowing healthcare providers to make more informed
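
The `async for` loop used in this section works with any asynchronous generator. The following sketch shows the same consumption pattern with a stand-in stream, printing each piece as it arrives while also accumulating the full text. `fake_token_stream` and `collect` are hypothetical helpers and are not part of SGLang.

```python
import asyncio


async def fake_token_stream(text):
    """Stand-in async generator that yields one space-delimited piece at a time."""
    for piece in text.split(" "):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield piece + " "


async def collect(stream):
    """Print chunks as they arrive and return the concatenated text."""
    pieces = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        pieces.append(chunk)
    print()
    return "".join(pieces)


full_text = asyncio.run(collect(fake_token_stream("Paris is the capital")))
```

Keeping the chunks in a list and joining them once at the end avoids repeated string concatenation for long generations.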




```python
llm.shutdown()
```