# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
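
As a rough illustration of the "custom server" idea, the sketch below shows only the request-handling core: parse a JSON body, call the engine, and serialize the result. `StubEngine` and `handle_generate` are hypothetical names introduced here so the snippet runs without a GPU. It also assumes that calling `generate` with a single prompt returns a single dict with a `"text"` key; this document only shows the list form. The linked script remains the authoritative example.

```python
import json


class StubEngine:
    """Hypothetical stand-in for the engine so the sketch runs without a GPU."""

    def generate(self, prompt, sampling_params):
        # Assumption: one prompt in, one dict with a "text" key out.
        return {"text": f" (echo of: {prompt})"}


def handle_generate(engine, request_body: bytes) -> bytes:
    """Turn a JSON request body into a JSON response using the engine."""
    payload = json.loads(request_body)
    prompt = payload["prompt"]
    sampling_params = payload.get("sampling_params", {})
    output = engine.generate(prompt, sampling_params)
    return json.dumps({"prompt": prompt, "text": output["text"]}).encode("utf-8")


response = handle_generate(
    StubEngine(),
    json.dumps({"prompt": "The capital of France is"}).encode("utf-8"),
)
print(response.decode("utf-8"))
```

Keeping the handler independent of any web framework means the same function could be called from whichever server layer you choose.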



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
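
A likely reason this is needed: interactive environments such as IPython already run an asyncio event loop, and `asyncio.run` refuses to start a second one inside it. The sketch below reproduces that error using only the standard library (no SGLang involved); `inner` and `outer` are illustrative names.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call, rejected by asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that such nested calls are allowed.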

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Michael. I like traveling, music, and cooking. I also like to read a lot. I'm a fan of traveling by train because I think it's interesting and relaxing. When I travel by train, I'm never tired. It's like I'm getting a big hug from the people on the train. I love to travel by train because I like the experience of getting on and off the train. I like to try new foods while traveling and it's a great way to discover new places. I love to cook and have a good time cooking. What are some things that Michael enjoys doing in his free time? - reading -
Prompt: The president of the United States is
Generated text:  a high-ranking government official who is elected by the people to serve a four-year term. The president is the most powerful person in the world and serves as the head of the executive branch of the federal government. This political officeholder has a lot of power and influence and can make decisions that affect the lives of millions of A
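
The `sampling_params` dict above sets `temperature` and `top_p`. As a rough guide to what these two knobs conventionally mean, here is a textbook sketch of temperature scaling followed by nucleus (top-p) filtering. It is not SGLang's sampler; `top_p_filter` and the toy logits are made up for illustration.

```python
import math


def top_p_filter(logits, temperature=0.8, top_p=0.95):
    """Return {token_index: probability} kept after temperature + top-p filtering."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


candidates = top_p_filter([2.0, 1.0, 0.1, -3.0], temperature=0.8, top_p=0.95)
print(candidates)
```

With these toy logits the least likely token (index 3) falls outside the nucleus, so sampling would only ever choose among the other three.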

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a unique trait or characteristic that makes you stand out]. And what's your favorite hobby or activity? I love [insert a hobby or activity that you enjoy]. And what's your favorite book or movie? I love [insert a favorite book or movie that you've read or watched]. And what's your favorite color? I love [insert a favorite color that you enjoy]. And what's your favorite food? I love [insert a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is the largest city in France and the third largest in the world. It is also the seat of the French government and the country's cultural and political capital. Paris is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major tourist destination, with millions of visitors each year. The city is home to many famous museums, including the Musée d'Orsay and the Musée d'Orsay. Paris is a vibrant and diverse city with a rich history and culture that has influenced French art, literature, and cuisine. It is

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that could be expected in the future:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior and decision-making processes. This could lead to more sophisticated and adaptive AI systems that can better understand and respond to human needs and preferences.

2. Greater emphasis on ethical considerations: As AI becomes more integrated with human intelligence, there will be a greater emphasis on ethical considerations and guidelines for its development and use
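
The banner printed above mentions "overlap removal". One plausible way to merge streamed chunks whose beginnings repeat the end of the text received so far is sketched below. This is an illustration under that assumption, not necessarily how `stream_and_merge` is implemented; `drop_overlap` and `merge_chunks` are hypothetical helpers.

```python
def drop_overlap(existing_text: str, new_chunk: str) -> str:
    """Remove the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while skipping text that was already received."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


merged = merge_chunks([" Paris", " Paris is", " is the capital", " of France."])
print(merged)
```

A purely textual approach like this cannot tell an accidental overlap from a genuine repetition (for example two consecutive chunks that are both "ha"), so treat it as a sketch rather than a general-purpose solution.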



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name], I'm a friendly and kind-hearted character. I'm always ready to lend a helping hand and be a good listener. I'm not afraid to ask questions and offer advice to those in need. My goal is to make people happy and create a positive impact in the world. I'm excited to meet you and learn more about your life. What can I say? I'm [name], your friendly and helpful friend. How about you? How can I assist you? Let's have a chat and find out more. [name] [name] [name] [name] [name] How can I help you?

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its beautiful architecture, rich history, and vibrant cultural scene. It is the world's second-largest city and the third-largest metropolitan area in terms of population, with over 2 million residents. Paris is also home to UNESCO World Heritage site
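
Because `async_generate` is awaited like any other coroutine, it can be combined with standard asyncio tools. The sketch below uses `asyncio.gather` to wait on several batches at once. `StubAsyncEngine` is a hypothetical stand-in that only mimics the call shape shown above, and whether concurrent calls benefit a real engine is an assumption you would need to verify.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in; only mimics the call shape of async_generate."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to wait for the model
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Create one coroutine per batch and wait for all of them together.
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(run_batches(StubAsyncEngine(), batches, {"temperature": 0.8}))
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```

`asyncio.gather` returns results in the same order as the awaitables it was given, which is why the batches and results can be zipped back together.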

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a/an [Job Title] at [Company Name]. I have a passion for [What you do for a living] and I'm always striving to [What you want to achieve in your career]. I believe in [A belief you hold dear]. I'm a [Any notable qualities you possess, e.g. hardworking, organized, hard to get over]. I'm always ready to [Any future goal or dream you have]. What's your name, and what do you do? [Your name] [Your job title] [Company Name] (Optional: Contact information, email, etc.). [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is known for its historical significance, vibrant culture, and world-renowned landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. Paris is also considered the cultural center of Europe and a major tourist destination. It is known for its culinary, fashion, and entertainment scenes. The city is home to many cultural institutions and is considered one of the world's most cosmopolitan cities. Its status as a major center of science and innovation is also well known. According to the Paris 2024 Olympic and Paralympic Games, the city is expected to host the 202

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by continued innovation and advancement in areas such as deep learning, natural language processing, computer vision, robotics, and autonomous systems. Here are some potential trends to watch out for in the coming years:

1. Increased use of AI in healthcare: AI can be used to improve the accuracy of diagnoses, improve treatment outcomes, and personalize healthcare experiences for patients.

2. Better understanding of human emotions: AI algorithms can help to better understand human emotions and behavior, which can be used to develop more effective treatments for mental health issues.

3. More personalization of AI experiences: AI is becoming more personal, with more people interacting with
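
The streaming loop above relies on Python's `async for`, which pulls one chunk at a time from an asynchronous generator. The sketch below shows the same consumption pattern with a hypothetical stand-in stream (`fake_stream`) so that it runs without a model. It prints chunks as they arrive and also collects them into the full text.

```python
import asyncio


async def fake_stream(text, chunk_size=5):
    """Hypothetical stand-in for a streaming source: yields text in small pieces."""
    for start in range(0, len(text), chunk_size):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield text[start:start + chunk_size]


async def collect(stream):
    """Print chunks as they arrive and return the concatenated text."""
    pieces = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        pieces.append(chunk)
    print()  # finish the line once the stream is exhausted
    return "".join(pieces)


full_text = asyncio.run(collect(fake_stream(" Paris is the capital of France.")))
```

Collecting the chunks in a list and joining them once at the end avoids repeated string concatenation for long outputs.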




```python
llm.shutdown()
```
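
In a standalone script, it can be convenient to guarantee that `shutdown()` runs even if inference raises. The sketch below wraps an engine-like object in a context manager for that purpose. `running_engine` and `StubEngine` are hypothetical names, and the stub only mirrors the `generate` and `shutdown` calls used in this document.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in exposing the generate/shutdown calls used above."""

    def __init__(self):
        self.is_running = True

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.is_running = False


@contextmanager
def running_engine(engine):
    """Yield the engine and always shut it down afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with running_engine(engine) as llm:
        llm.generate(["Hello, my name is"], {"temperature": 0.8})
        raise ValueError("simulated failure during inference")
except ValueError:
    pass  # the failure is expected in this demonstration

print(engine.is_running)
```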