# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a separate server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in any other code that already runs an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
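
The reason for this patch can be seen with the standard library alone. The sketch below does not use SGLang; `inner` and `call_run_inside_loop` are hypothetical names chosen for illustration. It shows that `asyncio.run()` refuses to start when an event loop is already running, which is the situation inside IPython or Jupyter.

```python
import asyncio


async def inner() -> str:
    return "done"


async def call_run_inside_loop() -> bool:
    """Return True if asyncio.run() refuses to start inside a running loop."""
    coro = inner()
    try:
        # A second event loop cannot be started while this one is running.
        asyncio.run(coro)
    except RuntimeError:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return True
    return False


nested_call_rejected = asyncio.run(call_run_inside_loop())
print(nested_call_rejected)  # True
```

With `nest_asyncio.apply()` in place, the nested call is allowed instead of raising.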

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    # In CI, a local `patch` module is imported instead of nest_asyncio.
    import patch
else:
    # Allow asyncio.run() inside an already running event loop (e.g. Jupyter).
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

W0808 18:50:42.093000 215996 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W0808 18:50:42.093000 215996 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.


[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.43it/s]

### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Tom, and I am a C++ programmer. I have always loved programming and building applications. I am particularly interested in machine learning, and I have taken courses on deep learning, and have learned a lot from them.
I am looking for a job that would allow me to utilize my programming skills and machine learning skills to make a difference in the world. What kind of job would be ideal for me, and how can I find one in my area of interest?
There are many jobs that would allow for programming and machine learning skills, depending on your specific skills and interests. Some possibilities include:
1. Data science: This involves collecting,
Prompt: The president of the United States is
Generated text:  attempting to foster social harmony through a series of speeches. He starts with a president's speech that is 40 minutes long. In his next speech, he doubles the duration of the previous speech. How long is the next speech in minutes?

To determine
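
The cell above pairs each prompt with its result by position and reads the generated text from the `"text"` key. A minimal sketch of that pattern follows, assuming only what the cell shows. `FakeEngine` and `generate_pairs` are hypothetical names; the fake engine stands in for `sgl.Engine` so the snippet runs without a GPU or model download.

```python
from typing import Any, Dict, List, Tuple


class FakeEngine:
    """Hypothetical stand-in for an engine; returns one dict per prompt."""

    def generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


def generate_pairs(
    engine: Any, prompts: List[str], sampling_params: Dict[str, Any]
) -> List[Tuple[str, str]]:
    """Run one batch and return (prompt, generated_text) pairs."""
    outputs = engine.generate(prompts, sampling_params)
    # zip() silently truncates, so check the lengths explicitly.
    if len(outputs) != len(prompts):
        raise ValueError("expected exactly one output per prompt")
    return [(prompt, output["text"]) for prompt, output in zip(prompts, outputs)]


pairs = generate_pairs(
    FakeEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)
for prompt, text in pairs:
    print(f"Prompt: {prompt}\nGenerated text: {text}")
```

The explicit length check is a design choice of this sketch: it turns a silent mismatch between prompts and outputs into an error.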

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] at [company name], and I've been with the company for [number of years] years. I'm a [job title] at [company name], and I've been with the company for [number of years] years. I'm a [job title] at [company name], and I've been with the company for [number of years] years. I'm a [job title] at [company name], and I've been with the company for [number of years] years. I'm a [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic Eiffel Tower, beautiful museums, and rich history. It is also a popular tourist destination and a major financial center. Paris is home to many world-renowned artists, writers, and musicians, and is known for its fashion industry. The city is also known for its cuisine, with its famous Parisian dishes like croissants, escargot, and escargot. Paris is a city of contrasts, with its modern architecture and historical landmarks blending together to create a unique and fascinating place. The city is also home to the Louvre Museum, which is one of the largest and most

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn and adapt to human behavior and preferences. This could lead to more personalized and context-aware AI systems that can better understand and respond to the needs of humans.

2. Greater emphasis on ethical considerations: As AI becomes more integrated with human intelligence, there will be a greater emphasis on ethical considerations. This could lead to more robust and transparent AI systems that are designed to minimize
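
The phrase "overlap removal" in the cell above refers to merging streamed chunks without repeating text that has already been emitted. The following is a sketch of one way to do this, not necessarily how `stream_and_merge` is implemented. `trim_overlap_sketch` and `merge_chunks` are hypothetical names, and the example chunks are made up.

```python
from typing import Iterable


def trim_overlap_sketch(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that is already a suffix of `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, skipping any text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


merged = merge_chunks([" Paris", " Paris is", " is the capital"])
print(repr(merged))  # ' Paris is the capital'
```

This heuristic would also trim a chunk that legitimately repeats the tail of the text so far, so it is only suitable when chunks are expected to overlap.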



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [Age] year old. I'm from [Country] and I have been working as a [Job Title] in [Industry] for [Number] years. I'm always looking for new challenges and opportunities to grow and learn. What's your experience with technology, and what projects do you work on? Hi, my name is [Name] and I'm a [Age] year old. I'm from [Country] and I have been working as a [Job Title] in [Industry] for [Number] years. I'm always looking for new challenges and opportunities to grow and learn.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

That's correct! Paris, officially known as the City of Paris, is the largest city in France by population and a major cultural and political center. It's a UNESCO World Heritage site and home to iconic landmarks such as Notre-Dame Cathedral and the Eiffel Tower. Paris is
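
Because `async_generate` is awaited, it can in principle be combined with other coroutines. The sketch below illustrates this with `asyncio.gather` and a hypothetical `FakeAsyncEngine` that simulates latency with `asyncio.sleep`. Whether concurrent calls help with a real engine is an assumption that this sketch does not verify.

```python
import asyncio
from typing import Any, Dict, List


class FakeAsyncEngine:
    """Hypothetical stand-in for an engine exposing `async_generate`."""

    async def async_generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


async def run_two_batches(engine: Any) -> List[Dict[str, str]]:
    params = {"temperature": 0.8, "top_p": 0.95}
    # Both requests are in flight at the same time; gather preserves order.
    first, second = await asyncio.gather(
        engine.async_generate(["The capital of France is"], params),
        engine.async_generate(["The future of AI is"], params),
    )
    return first + second


results = asyncio.run(run_two_batches(FakeAsyncEngine()))
print(len(results))  # 2
```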

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly.
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name], and I am a [insert occupation or profession]. I have always been fascinated by [insert something you enjoy doing], and I have decided to pursue it further. I am currently studying [insert a course of study], and I am currently enrolled at [insert school or institution]. I have a deep respect for [insert a historical figure or cultural person]. I believe that my passion for [insert something you are passionate about] will drive me to become a [insert a profession or person] who will continue to contribute to society. I am [insert a positive, enthusiastic, or confident] attitude, and I believe that I



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its vibrant culture, historic architecture, and annual Eiffel Tower celebrations. Is there a specific landmark or historical event in Paris that you would love to see again in person? As an AI language model, I don't have personal preferences or experiences, but I can provide you with some factual information about Paris.

Paris is a city in the north of France, known for its rich history, art, fashion, and food. It is home to several landmarks such as the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. Paris has a long history, dating back to the Roman Empire and



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and diverse, with many potential trends shaping its direction and impact. Here are some potential future trends:

1. Increased integration with human consciousness: AI could become more integrated with the human brain, allowing machines to mimic human decision-making and behavior.

2. Better understanding of human emotions: AI could become more accurate at interpreting human emotions and responding appropriately to them.

3. Enhanced sense of empathy and compassion: AI could become more capable of understanding and empathizing with other human emotions and experiences.

4. Greater autonomy for humans: AI could become more capable of making autonomous decisions and taking action on its own without human intervention.

5. Development of
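
The streaming cell prints each chunk as it arrives. If the full text is also needed afterwards, the chunks can be accumulated while printing. A minimal sketch follows, using a hypothetical `fake_token_stream` async generator in place of a real engine stream.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_token_stream(tokens: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for a stream of already de-duplicated chunks."""
    for token in tokens:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield token


async def print_and_collect(stream: AsyncIterator[str]) -> str:
    """Print chunks as they arrive and return the concatenated text."""
    parts: List[str] = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # newline after the stream ends
    return "".join(parts)


full_text = asyncio.run(
    print_and_collect(fake_token_stream([" Paris", ",", " known", " for"]))
)
```

Collecting the parts in a list and joining once at the end avoids repeated string concatenation for long outputs.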




In [6]:
llm.shutdown()
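
If generation code raises before reaching `shutdown()`, the engine may be left running. One possible pattern is to wrap the engine in a context manager so that `shutdown()` always runs. The sketch below uses a hypothetical `FakeEngine` and a hypothetical helper `managed_engine`; it assumes only that the engine has a `shutdown()` method, as the cell above shows.

```python
from contextlib import contextmanager
from typing import Any, Iterator


class FakeEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def shutdown(self) -> None:
        self.is_shut_down = True


@contextmanager
def managed_engine(engine: Any) -> Iterator[Any]:
    """Yield the engine and call its shutdown() on exit, even after an error."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with managed_engine(engine) as active:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(engine.is_shut_down)  # True
```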