# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
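
The patch is needed because `asyncio.run()` refuses to start while another event loop is already running in the same thread, which is the situation inside IPython or Jupyter. The sketch below uses only the standard library to reproduce that error; the function names `inner` and `outer` are illustrative and not part of SGLang.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are now inside a running event loop, as in a notebook cell.
    coro = inner()
    try:
        # Starting a second loop from inside the first one is rejected.
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` beforehand patches `asyncio` so that such nested calls are allowed.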

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Lisa. I'm a 9-year-old girl from the United States. I'm a student at Greenfield High School. I like to visit friends and watch TV. I have some good friends like Tony and Alan. They are of the same age. We go to school together. We talk to each other every day. We learn from each other. I am not allowed to watch TV after 9:30. My parents are both teachers. My parents are strict with me. They are always busy. But I'm a good student. I want to be a doctor when I grow up. I am not very tall. But
Prompt: The president of the United States is
Generated text:  a branch of the federal government. Which of the following statements about the president is correct?
A. The president is the head of government of the United States.
B. The president can nominate judges to the federal court.
C. The president is the highest judge of the United States.
D. The president is a member of the Congress of the United States.
Answer:
A

The reason why high-altitude work
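
For downstream processing it is often more convenient to keep each prompt together with its completion than to print them. The following is a minimal sketch of that idea. `StubEngine` is a hypothetical stand-in so the snippet runs without a GPU; it only mimics the shape seen above, where `generate` returns one dict with a `"text"` key per prompt. `generate_records` is likewise an illustrative helper, not an SGLang function.

```python
from typing import Any, Dict, List


class StubEngine:
    """Hypothetical stand-in for the engine; it only echoes the prompt."""

    def generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def generate_records(engine, prompts, sampling_params):
    """Run one batched call and pair each prompt with its generated text."""
    outputs = engine.generate(prompts, sampling_params)
    if len(outputs) != len(prompts):
        raise ValueError("expected one output per prompt")
    return [
        {"prompt": p, "completion": o["text"].strip()}
        for p, o in zip(prompts, outputs)
    ]


records = generate_records(
    StubEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)
for record in records:
    print(record)
```

Under these assumptions, passing `llm` instead of `StubEngine()` should produce the same record structure with real completions.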

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your job or profession]. I'm always looking for new challenges and opportunities to grow and learn. What's your favorite hobby or activity? I enjoy [insert a short description of your favorite hobby or activity]. I'm always looking for ways to stay active and healthy, so I'm always on the lookout for new fitness programs or classes. What's your favorite book or movie? I'm a big fan of [insert

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich cultural heritage and is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also known for its vibrant nightlife and is a popular tourist destination. The city is known for its art, music, and cuisine, and is a major economic and cultural center in Europe. It is the largest city in France and has a population of over 2.5 million people. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into the urban landscape. The city

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased focus on ethical AI: As more people become aware of the potential risks of AI, there is a growing emphasis on developing AI that is designed to be ethical and responsible. This could mean developing AI that is designed to minimize harm to individuals, or that is designed to be transparent and accountable.

2. Greater integration with human decision-making: AI is likely to become more integrated with human decision-making in the future. This could mean that AI systems are designed to make decisions based on human input
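
To illustrate what "overlap removal" can mean when merging streamed chunks, here is a small sketch. It assumes that a new chunk may repeat the tail of the text received so far, and drops that repeated prefix before appending. The names `trim_overlap` and `merge_chunks` are illustrative; this is not presented as the actual implementation of `stream_and_merge`.

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that existing_text already ends with."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, removing text repeated at each chunk boundary."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


print(merge_chunks(["The capital of", " of France", " France is Paris"]))
# The capital of France is Paris
```

One limitation of this approach is that a genuine repetition at a chunk boundary (for example `"ha"` followed by `"ha"`) is indistinguishable from an overlap and would be trimmed as well.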



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [age] year old [gender] with [hobbies, interests, strengths], [a profession]. I love [something you enjoy doing]. And I also like [a hobby, interest, or skill you're passionate about]. I'm [a type of personality] who [what you do]. That's all I have to say, and I look forward to meeting you. [Tell a little bit more about your personality] I am [Type of Personality], and I enjoy doing [What I Do]. I am [Hobbies, Interests, Strengths] and have a passion for [My

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is the largest city in France, located on the left bank of the Seine river and is known for its historical architecture, fashion, and culinary traditions. The city is home to the Louvre Museum, Notre-Dame Cathedral, and many other iconic landmarks. It is also a major center fo
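
One reason to prefer the asynchronous API is that several independent requests can be awaited together. The sketch below shows that pattern with `asyncio.gather`. `StubAsyncEngine` is a hypothetical stand-in that only mimics an awaitable `async_generate` returning one dict per prompt; whether the real engine benefits from concurrent calls in a given setup is an assumption to verify.

```python
import asyncio
from typing import Any, Dict, List


class StubAsyncEngine:
    """Hypothetical stand-in exposing an awaitable async_generate method."""

    async def async_generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_groups(engine, groups, sampling_params):
    """Submit every prompt group at once and wait for all of them."""
    tasks = [engine.async_generate(g, sampling_params) for g in groups]
    # gather returns the results in the same order as the submitted tasks.
    return await asyncio.gather(*tasks)


groups = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(run_groups(StubAsyncEngine(), groups, {"temperature": 0.8}))
for group, outputs in zip(groups, results):
    for prompt, output in zip(group, outputs):
        print(prompt, "->", output["text"])
```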

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [Occupation] who has dedicated myself to [Your Main Activity]. In my free time, I enjoy [Additional Activity]. Whether I'm writing articles, solving puzzles, or playing games, I find joy in [Joyful Activity]. I also enjoy [Another Joyful Activity], and I believe in [My Personal Values and beliefs]. My goal is to [My Goal], and I strive to [My Goal]. I am always ready to learn and grow, and I am eager to [My Goal]. What's your name, and what brings you here today? As I stand before you, I have



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in France and the third-largest city in the European Union after Rome and London.

**Paris**: the largest city in France and the third-largest city in the European Union after Rome and London. With its impressive architecture, rich history, and vibrant culture, Paris is a city of art, science, fashion, and gastronomy. The city is also home to the Eiffel Tower and the Louvre Museum, among many other iconic landmarks. In addition to being a major economic and political center, Paris is also one of the world's most popular tourist destinations. It plays a central role in French culture and



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  set to be shaped by many factors, but here are some possible trends that are currently being considered:

1. Increased use of AI in healthcare: AI-powered tools could help improve the accuracy and speed of medical diagnoses, personalize treatment plans, and improve patient outcomes.

2. Advancements in natural language processing: As AI continues to become more sophisticated, it will become easier for machines to understand human language, leading to more natural and intuitive interactions between humans and machines.

3. Integration of AI with IoT devices: IoT devices can collect vast amounts of data, and AI can be used to analyze and interpret this data to provide insights into the behavior and
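
The streaming loop in this section prints each chunk as it arrives. When the complete text is also needed afterwards, the chunks can be accumulated while iterating. The sketch below shows this with a hypothetical `fake_stream` async generator standing in for the real streaming call; `collect_stream` is an illustrative helper, not an SGLang function.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(text: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming call: yields one word at a time."""
    for piece in text.split(" "):
        await asyncio.sleep(0)  # give control back to the event loop
        yield piece + " "


async def collect_stream(stream: AsyncIterator[str]) -> str:
    """Consume an async stream of text chunks and return the joined text."""
    parts: List[str] = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)


full_text = asyncio.run(collect_stream(fake_stream("Paris is the capital of France")))
print(full_text.strip())
# Paris is the capital of France
```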




```python
llm.shutdown()
```
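
If generation raises an exception, the `shutdown()` call may never be reached. One way to guard against that is a small context manager built on `try`/`finally`. The sketch below is an assumption-laden illustration: `managed_engine` is a hypothetical helper, and `StubEngine` stands in for the real engine so the snippet runs anywhere.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


@contextmanager
def managed_engine(factory):
    """Create an engine via factory() and always shut it down on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


stub = None
try:
    with managed_engine(StubEngine) as engine:
        stub = engine
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(stub.closed)  # True
```

With the real engine, the factory could be something like `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`, reusing the constructor call shown earlier in this document.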