# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
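
As a rough illustration of the custom-server idea, the sketch below wraps an engine-like object in a plain request handler. It is not the linked `custom_server.py`: `StubEngine`, `handle_generate`, and the JSON field names are hypothetical, and the stub only mimics the `generate(prompts, sampling_params)` return shape used later in this document (a list of dicts with a `text` key) so that the snippet runs without a GPU. In real use you would pass an `sgl.Engine` instance instead of the stub.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs anywhere."""

    def generate(self, prompts, sampling_params=None):
        # Mimic the output shape shown in this document: one dict per prompt.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def handle_generate(engine, request_body: str) -> str:
    """Turn a JSON request into a JSON response using an engine-like object."""
    payload = json.loads(request_body)
    prompts = payload["prompts"]
    sampling_params = payload.get("sampling_params", {})
    outputs = engine.generate(prompts, sampling_params)
    return json.dumps({"texts": [out["text"] for out in outputs]})


request = json.dumps(
    {"prompts": ["The capital of France is"], "sampling_params": {"temperature": 0.8}}
)
response = handle_generate(StubEngine(), request)
print(response)
```

Keeping the handler independent of any web framework means the same function could be called from whichever HTTP layer you choose.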



## Nest Asyncio
If you use the **Offline Engine** in IPython or in other code that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
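
To see why this patch is needed: IPython and Jupyter already run an event loop, and plain `asyncio` refuses to start a second one inside it. The standard-library-only sketch below reproduces that error without SGLang; the function names are illustrative.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # A loop is already running here, just as it is inside IPython/Jupyter.
    coro = inner()
    try:
        asyncio.run(coro)  # attempts to start a nested event loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Without `nest_asyncio`, the nested `asyncio.run` call raises a `RuntimeError`. Applying `nest_asyncio.apply()` beforehand is intended to make such nested calls permissible.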

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Lisa. I'm a student. I live in California. I like to read books and watch TV. I have a pet dog called Max. I love my pet dog. I don't like animals with backbones. Now I have a new friend. Her name is Lily. She is an English teacher. She is from America. She likes to swim and read books. She is friendly and kind. I don't like animals with backbones. But I like animals who are strong. I often play with Lily. Sometimes I play with my pet dog. Lily and I have many hobbies. It's nice to learn from each other.
Prompt: The president of the United States is
Generated text:  expected to be a US senator. How can the US senator be a US president?
How can a US senator be a US president?
Multi-choice problem: Are these two questions inquiring about the same information?
OPTIONS: +no; +yes;

+yes;
The two questions are inquiring about the same information. Both questions are asking about the relationship between a US senator and a US president. They both sp

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French National Library, and the French Parliament House. Paris is a bustling metropolis with a rich cultural heritage and is a major tourist destination. The city is known for its fashion, art, and cuisine, and is a popular destination for tourists and locals alike. It is also home to the French Parliament, the French National Library, and the French Parliament House. Paris is a bustling metropolis with a rich cultural heritage and is a major tourist destination.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn and adapt to human behavior and preferences. This could lead to more personalized and efficient solutions to complex problems.

2. Greater emphasis on ethical considerations: As AI becomes more advanced, there will be a greater emphasis on ethical considerations and responsible use of the technology. This could lead to more stringent regulations and guidelines for the development and deployment of AI systems.

3. Increased use



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a creative and dynamic thinker who thrives on exploring new ideas and pushing boundaries. From a young age, I've been fascinated by the human experience, and I'm always seeking to see things in new light and make a difference in the world. I'm a voracious reader and a passionate student, always seeking to learn and grow. I'm also a skilled communicator, and I'm confident in my ability to connect with others on a personal level. I believe that creativity and self-improvement are the keys to success, and I'm committed to honing my skills and continuing my education to stay ahead of

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the capital of France and is the largest city in the European Union. It is known for its beautiful architecture, world-class museums and monuments, and 
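
One reason to use the asynchronous API is that other coroutines can make progress while a generation call is awaited. The sketch below illustrates this with `asyncio.gather`. `StubAsyncEngine` is a hypothetical stand-in that only mimics the `async_generate(prompts, sampling_params)` call shape used above, so the snippet runs without a model.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in for an engine exposing async_generate."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0.01)  # pretend the model is busy
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def other_work(log):
    # Runs while the generation call is suspended on its await.
    log.append("other work done")


async def run_concurrently(engine, prompts):
    log = []
    outputs, _ = await asyncio.gather(
        engine.async_generate(prompts, {"temperature": 0.8}),
        other_work(log),
    )
    log.append("generation done")
    return outputs, log


outputs, log = asyncio.run(run_concurrently(StubAsyncEngine(), ["Hello, my name is"]))
print(log)
```

In a notebook, where an event loop is already running, the final `asyncio.run` call is subject to the nested-loop caveat described in the Nest Asyncio section.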

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm a [Age] year old, [Gender] who lives in [City/Country]. I'm currently studying [Major/Field of Study], [School Name] in [City]. I'm passionate about [Hobby/Interest/Enthusiasm]. Outside of school, I enjoy [Activity/Interests/Amour]. I'm currently working on [Project], [Project's Goal], and [Project's Outcome]. If you have any questions or need help with anything related to [Subject], feel free to ask. I look forward to having the opportunity to learn from you and share my experiences and thoughts with

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
(1 point)
A) True
B) False

A) True

Paris is the capital city of France. It is also known as the "City of Love" and is famous for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is a global city with a rich history, art, and culture, and it continues to be an important center for politics, business, and entertainment in France. The city is also known for its fashion industry, and Paris is considered the "Parisian" fashion capital. Paris has a population of around 2

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by increasing sophistication, autonomous decision-making, and application to many more industries. Here are some possible future trends in AI:

1. Improved predictive analytics: AI will continue to improve predictive analytics, allowing for more accurate and timely predictions of future trends and events. This will enable businesses to make data-driven decisions and improve their operations.

2. Increased AI ethics: The ethical implications of AI are becoming increasingly important, with concerns about bias, transparency, and accountability. As AI becomes more widespread, there will be a greater need for clear guidelines and standards to ensure its ethical use.

3. AI for healthcare: AI will be used


In [6]:
llm.shutdown()