# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
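
To see why the patch is needed, the standard-library sketch below reproduces the underlying error without SGLang: calling `asyncio.run` while another event loop is already running raises `RuntimeError`. The names `inner` and `outer` are illustrative only and are not part of SGLang. `nest_asyncio.apply()` patches asyncio so that this kind of nested call is permitted.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, similar to a
    # notebook cell in IPython/Jupyter.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: not allowed by plain asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return type(exc).__name__, str(exc)
    return None, "no error"


error_name, error_message = asyncio.run(outer())
print(error_name, "-", error_message)
```

Without the patch, the nested call fails immediately, which is why the import and `apply()` call should come before any engine code in such environments.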

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Matt. I'm 17 years old and I live in Boston, Massachusetts. I have a crush on a girl and my crush is really good. 

I'm a big fan of the Marvel Universe. I like Spider-Man, Captain America, Iron Man, and Captain Marvel. I also like the Avengers series and the Star Wars series. I love Star Wars and Spider-Man. I like to read and watch movies and I like music, especially rock and pop music. 

I'm a beginner at Minecraft. I have no idea how to play. I play Minecraft with my friends. I love how different the settings are. I don
Prompt: The president of the United States is
Generated text:  a powerful leader who has the power to make important decisions and policies. In many countries, the president is elected by the people, either directly or through a system of proportional representation. In the United States, the president is elected for a four-year term by the citizens of the United States, who vote in primary elections and general elections. 
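
In an offline batch job, the results are usually written to disk rather than printed. The sketch below pairs each prompt with its completion and serializes the pairs as JSON Lines. It relies only on what the cell above shows, namely that each output is a dict with a `"text"` key. The helper `to_jsonl` and the sample data are hypothetical stand-ins, so the snippet runs without a model.

```python
import json


def to_jsonl(prompts, outputs):
    """Serialize prompt/completion pairs as JSON Lines, one record per prompt."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = [
        json.dumps({"prompt": p, "completion": o["text"]}, ensure_ascii=False)
        for p, o in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Stand-in for the list returned by llm.generate(prompts, sampling_params).
example_prompts = ["The capital of France is"]
example_outputs = [{"text": " Paris."}]

jsonl = to_jsonl(example_prompts, example_outputs)
print(jsonl)
```

With a real engine, `example_outputs` would be replaced by the `outputs` list from the cell above.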

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [Age] year old [Occupation]. I'm currently [Current Location] and I enjoy [Favorite Activity]. I'm always looking for new experiences and challenges to try. What's your favorite hobby or activity to do? [Name] is a [Favorite Hobby/Activity]. I'm always looking for new adventures and experiences to try. What's your favorite hobby or activity to do? [Name] is a [Favorite Hobby/Activity]. I'm always looking for new adventures and experiences to try. What's your favorite hobby or activity to do? [Name] is a [Favorite Hobby/Activity

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also home to the French Parliament, the French Academy of Sciences, and the French Riviera. Paris is a bustling city with a rich cultural heritage and is a popular tourist destination. It is also known for its cuisine, including its famous croissants and its famous wine. The city is also home to many museums, including the Louvre and the Musée d'Orsay. Paris is a city that is a must-visit for anyone interested in French culture and history. It is also a popular

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI technology continues to improve, we can expect to see even more widespread use of AI in healthcare, with more sophisticated algorithms and machine learning techniques being developed to improve diagnosis, treatment, and patient care.

2. AI in finance: AI is already being used in finance to improve risk management, fraud detection, and trading algorithms. As AI technology continues to improve, we can expect
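
The "overlap removal" mentioned in the cell above can be illustrated without a model. The sketch below is one plausible way to merge streamed chunks whose beginnings may repeat the end of the text collected so far. It is not necessarily how `stream_and_merge` is implemented; the names `strip_overlap` and `merge_chunks` and the simulated chunks are assumptions made for illustration.

```python
def strip_overlap(merged: str, chunk: str) -> str:
    """Return the part of `chunk` that does not repeat the tail of `merged`."""
    max_len = min(len(merged), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if merged.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, dropping any prefix already present in the result."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Simulated stream: the second chunk repeats "capital" from the first.
simulated_chunks = ["The capital", "capital of France", " is Paris"]
merged_text = merge_chunks(simulated_chunks)
print(merged_text)
```

The longest-match-first loop keeps the helper simple. It is quadratic in the overlap length, which is acceptable for short chunks such as single tokens.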



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert character's name]. I'm an [insert character's age] year old fictional character, and I'm currently [insert current location, status, or occupation]. I've always been [insert a trait or quality that sets me apart from others, such as being kind, honest, intelligent, or adventurous]. I love to [insert a hobby or activity that I enjoy, such as reading, playing music, or cooking]. What makes you unique and interesting to readers? As an AI language model, I don't have a physical body or desires, but I'm designed to understand and respond to the language and questions that people have.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city known for its landmarks such as Notre-Dame Cathedral, the Eiffel Tower, and the Louvre Museum. It is also famous for its rich history, including the French Revol
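
An asynchronous API is mainly useful when generation shares an event loop with other work. The sketch below shows the general asyncio pattern of awaiting several requests concurrently with `asyncio.gather`. The coroutine `fake_async_generate` is a hypothetical stand-in that only sleeps; whether separate engine calls should be issued this way, rather than as one batched call as in the cell above, is an assumption to verify for your setup.

```python
import asyncio


async def fake_async_generate(prompt, delay):
    """Hypothetical stand-in for an awaitable generation call."""
    await asyncio.sleep(delay)
    return {"prompt": prompt, "text": f" <completion for: {prompt}>"}


async def run_concurrently(prompts):
    # Earlier prompts get longer delays, so they finish last.
    tasks = [
        fake_async_generate(p, 0.01 * (len(prompts) - i))
        for i, p in enumerate(prompts)
    ]
    # gather() returns results in the order the awaitables were passed in,
    # regardless of which one finishes first.
    return await asyncio.gather(*tasks)


results = asyncio.run(run_concurrently(["a", "b", "c"]))
for item in results:
    print(item["prompt"], "->", item["text"])
```

Because `asyncio.gather` preserves input order, results can still be zipped with the original prompt list, as the cell above does.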

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  __________ and I'm __________, a ___________.

I'm __________, a __________, and I've always been fascinated by __________. I'm always up for __________, and I'm eager to __________. My expertise lies in __________, and I look forward to __________. As a __________, I'm always ready to __________. I enjoy __________, and I'm a __________. I'm always looking forward to __________. I'm __________. I enjoy __________, and I'm a __________. I'm always up for __________. I'm __________. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is known as the “City of Light” and is a cultural and artistic center. The city is also home to many notable museums and art galleries. Paris is home to many important landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. It is also the location of the French Parliament and the French Parliament building. Paris is a thriving metropolis with a diverse population and thriving economy. It is a popular tourist destination for many visitors. The city is also known for its cuisine and has a rich history dating back to ancient times. It is a vibrant and dynamic city with a rich cultural heritage

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by rapid technological advancements and the development of new capabilities. Here are some possible trends that may emerge in the future:

1. Increased integration with human intelligence: AI may become more integrated with human intelligence to enhance the accuracy and efficiency of decision-making processes.

2. Enhanced natural language processing: AI may be able to understand and interpret human language more effectively, leading to more sophisticated language processing and natural language generation.

3. Improved predictive analytics: AI may become more sophisticated at predicting future events and trends, leading to better-informed decision-making and strategic planning.

4. Greater localization of AI: AI may become more localized to different

In [6]:
llm.shutdown()