# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
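
To illustrate why this is needed, the sketch below uses only the standard library (no SGLang involved) to reproduce the error that a nested `asyncio.run()` call raises when an event loop is already running, which is the situation inside IPython or Jupyter. The names `inner` and `outer` are illustrative and do not come from SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, so a nested
    # asyncio.run() call is rejected with a RuntimeError.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Applying `nest_asyncio` patches the event loop so that such nested calls are allowed.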

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Diana and I am 25 years old. I am a student at the age of 21 and I have been in a relationship since I was 15. I have been in a relationship for about 2 years now and I have not had any conception of any pregnancy. I did not have any intercourse with a male until now. Is it safe for me to stop using birth control pills?
I am 19 years old and I'm on my first pregnancy. It is very difficult for me. I have not conceived yet and I have been on the pill for 2 years. Can I stop using my pill?

Prompt: The president of the United States is
Generated text:  a title that has become well known. They are referred to as the "Chief of State" or "Chief of the State" in various contexts. However, the titles of "Ambassador" and "Foreign Service Officer" are also well known. What are the differences between these positions?
Ambassadors and Foreign Service Officers are two different positions with unique responsibilities. An ambassador is responsible for repres
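
Since each output is a dictionary with a `text` field, batch results are easy to post-process. The following is a minimal sketch of pairing prompts with their generated text and serializing them as JSON Lines. The helper `to_records` and the `fake_outputs` list are hypothetical; `fake_outputs` only stands in for the return value of `llm.generate(...)` so that the snippet runs without a model.

```python
import json


def to_records(prompts, outputs):
    """Pair each prompt with the generated text of its output dict."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    return [
        {"prompt": prompt, "text": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


# Hypothetical stand-in for the list returned by llm.generate(prompts, ...)
fake_outputs = [{"text": " Alice."}, {"text": " Paris."}]
records = to_records(
    ["Hello, my name is", "The capital of France is"], fake_outputs
)

# One JSON object per line, ready to be written to a .jsonl file
jsonl = "\n".join(json.dumps(record) for record in records)
print(jsonl)
```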

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can you tell me about yourself? I'm a [insert a short description of your personality or skills]. I enjoy [insert a short description of your hobbies or interests]. I'm always looking for new challenges and opportunities to grow and learn. What's your favorite hobby or activity? I love [insert a short description of your favorite hobby or activity]. I'm always looking for ways to expand my knowledge and skills. What's your favorite book or movie? I love [insert a short description

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is the largest city in France and the second-largest city in the European Union. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Arc de Triomphe. The city is also famous for its cuisine, fashion, and music, and is home to many world-renowned museums, theaters, and art galleries. Paris is a cultural and political center of France and a major tourist destination. It is also known for its annual Eiffel Tower Festival, which attracts millions of visitors each year. The city is

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more advanced, it is likely to become more integrated with human intelligence, allowing for more complex and nuanced interactions between humans and machines.

2. Greater emphasis on ethical considerations: As AI becomes more advanced, there will be a greater emphasis on ethical considerations, including issues such as bias, transparency, and accountability.

3. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes, but there is a growing trend towards using AI to assist in diagnosis, treatment, and patient care.

4. Greater use of AI
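
The cell above relies on a helper that merges streamed chunks "with overlap removal". The sketch below shows one way such merging could work: before appending a new chunk, drop the longest prefix of the chunk that already appears at the end of the accumulated text. This is an illustration under that assumption and not necessarily how `sglang.utils.stream_and_merge` is implemented; `trim_overlap` and `merge_chunks` are hypothetical names.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Return new_chunk without the prefix that duplicates the end of existing."""
    max_overlap = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged = merge_chunks(["Hello wor", "world", "!"])
print(merged)  # Hello world!
```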



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I am a [Age] year old [Gender] [Name]. I am a/an [Occupation]. My [Favorite] hobby is [List [Favorite Hobby]]. I have always loved [My Hobby], and it has become a major part of my [My Hobby's] life. I am a self-described [Describe Your Personality or Character Traits]. I love [Your Personality or Character Traits] to the [How It Makes You Feel]. I am a/an [Your Name]. I enjoy [Your Occupation]. I always strive to [Your Characteristic or Value] in my life. I believe that [Your Values or

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a beautiful city known for its rich history and vibrant culture. Paris is home to iconic landmarks such as Notre Dame Cathedral, the Louvre Museum, and the Eiffel Tower, as well as its famous neighborhoods such as Montmartre and the Parisian Riviera. The city is als
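
Because `async_generate` is awaited, it composes with other `asyncio` tools. The sketch below submits one request per prompt and collects the results with `asyncio.gather`, which returns them in submission order. It assumes that `async_generate` also accepts a single prompt string; `FakeEngine` is a hypothetical stand-in that echoes the prompt so the snippet runs without a model, and `generate_concurrently` is an illustrative helper name.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine; echoes the prompt, runs no model."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0)  # yield control, as a real awaited call would
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(engine, prompts, sampling_params):
    # Create one awaitable per prompt and wait for all of them together.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


prompts = ["The capital of France is", "The future of AI is"]
outputs = asyncio.run(
    generate_concurrently(FakeEngine(), prompts, {"temperature": 0.8})
)

for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Passing the whole prompt list in a single call, as in the cell above, is the simpler option. The per-prompt form is mainly useful when requests arrive at different times.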

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [occupation or profession]. I am a [age] year old, [gender] [race], and I have been in this field for [number of years] years. I started out as [current job] but I am now [current occupation or profession]. I love [what you like to do or what you enjoy doing]. I am passionate about [reason why I love what I do]. I am very [your preferred personality trait]. I am also [your preferred hobby or interest]. I am also [your favorite quote]. Lastly, I am [your most memorable achievement or accomplishment]. I look

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as "La République" or "La Roche Trevées" and is the third largest city in France and the largest in the European Union.

Paris, nicknamed "La Roche Trevées" or "La Roche de la Vigne", is the seat of the French government and is located in the north of the country on the right bank of the Seine River. The city is known for its historical buildings, art, music, and cuisine, and is famous for its cathedrals, Notre-Dame, Sainte-Chapelle, and Louvre Museum. It also has a vibrant

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be characterized by significant advances in several key areas, including:

1. Increased use of AI in healthcare: AI is already being used to diagnose and treat diseases, and the trend is expected to continue in the coming years. AI-powered diagnostic tools, such as machine learning algorithms that analyze medical images, may become more accurate and reliable, leading to better patient outcomes and improved healthcare outcomes.

2. Improved privacy and security: As AI systems become more complex and sophisticated, there is a risk that they may be used to track and collect personal data. To address this, there is a growing focus on improving privacy and security in AI systems,

In [6]:
llm.shutdown()
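
In a script, an exception raised during generation would skip the final `shutdown()` call. A minimal sketch of guarding against that with a small context manager follows. `managed_engine` is a hypothetical helper, not part of SGLang, and `FakeEngine` only stands in for `sgl.Engine` so that the snippet runs without a model.

```python
from contextlib import contextmanager


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call its shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine that records whether it was shut down."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


fake = FakeEngine()
try:
    with managed_engine(fake) as engine:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(fake.closed)  # True
```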