# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two common use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
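
As a rough illustration of the custom-server idea, the sketch below wraps an engine call in a plain request handler that maps a JSON body to a JSON response. It is not the linked example: `StubEngine`, `handle_generate`, and the `prompts`/`texts` field names are hypothetical, and the stub only mimics the `generate(prompts, sampling_params)` call shape and the `output["text"]` field used later in this document.

```python
import json


class StubEngine:
    """Hypothetical stand-in; mimics generate() returning dicts with 'text'."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def handle_generate(engine, body: bytes) -> bytes:
    """Turn a JSON request body into a JSON response body."""
    try:
        request = json.loads(body)
        prompts = request["prompts"]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON or a body without a 'prompts' field.
        return json.dumps({"error": "expected JSON with a 'prompts' list"}).encode()
    sampling_params = request.get("sampling_params", {})
    outputs = engine.generate(prompts, sampling_params)
    return json.dumps({"texts": [o["text"] for o in outputs]}).encode()


response = handle_generate(
    StubEngine(),
    json.dumps({"prompts": ["Hello"], "sampling_params": {"temperature": 0.2}}).encode(),
)
print(response.decode())
```

Keeping the handler independent of any web framework means the same function could be mounted behind whichever server library you prefer, with the stub swapped for a real engine instance.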



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
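
To see why the patch is needed, the following sketch reproduces the underlying restriction using only the standard library. It does not involve SGLang; `inner` and `outer` are illustrative names. Plain `asyncio` refuses to start a second event loop from inside a running one, which is the situation in IPython or Jupyter.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # Here a loop is already running, as it is inside IPython or Jupyter.
    coro = inner()
    try:
        # A nested asyncio.run() is rejected by the standard library.
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # silence the "never awaited" warning
        return f"RuntimeError: {exc}"


result = asyncio.run(outer())
print(result)
```

Calling `nest_asyncio.apply()` patches `asyncio` so that such nested calls are permitted.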

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Soyoung and I have a problem with my computer. Sometimes I get an error message like this:

```
Process finished with exit code 0
```

I have tried searching online for solutions, but I haven't been able to find a comprehensive solution. Any help would be appreciated. Thanks in advance!

---

```
Process finished with exit code 0
```

This error message is common when your computer is trying to shut down or restart but is unable to do so due to a hardware or software problem. The message itself doesn't tell you what the problem is, but rather that the process has finished normally and exited with a
Prompt: The president of the United States is
Generated text:  288 inches tall. If his wife is 14 inches shorter than him, what is their combined height in feet?

To find the combined height of the president and his wife in feet, we need to follow these steps:

1. **Determine the height of the wife:**
   - The president's height is 288 inches.
   - 

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Age] year old [Occupation]. I'm a [Type of Person] who enjoys [Favorite Activity/Interest]. I'm [Favorite Color] and I love [Favorite Food/Drink]. I'm [Favorite Book/TV Show/Video Game] and I'm always [Favorite Quote/Adjective]. I'm [Favorite Animal/Plant/Insect/Animal] and I love [Favorite Hobby/Activity]. I'm [Favorite Music/Art/Science/Technology] and I'm always [Favorite Thing to Do]. I'm [Favorite Place/Location/Activity/Person]. I'm

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history and a vibrant culture. The city is known for its beautiful architecture, world-renowned museums, and annual festivals such as the Eiffel Tower and the Louvre. Paris is also home to many famous landmarks such as Notre-Dame Cathedral, the Louvre Museum, and the Champs-Élysées. The city is a major economic and cultural hub in Europe and is a popular tourist destination. Paris is a city that is constantly evolving and changing, with new developments and attractions being added to the city's list of attractions. The city

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior and decision-making processes. This could lead to more sophisticated and personalized AI systems that can better understand and respond to human needs.

2. Greater emphasis on ethical considerations: As AI becomes more integrated with human intelligence, there will be a greater emphasis on ethical considerations. This could lead to more robust AI systems that are designed to be transparent, accountable, and responsible
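
The "overlap removal" mentioned in the cell above can be pictured with a small standalone sketch. This is an assumption about the general technique, not a copy of `stream_and_merge` from `sglang.utils`, which may be implemented differently. The function names and sample chunks here are illustrative: each new chunk is trimmed so that any prefix repeating the end of the text collected so far is dropped before appending.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the prefix of new_chunk that repeats the end of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first so the largest repeat is removed.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks whose boundaries overlap.
chunks = [" Paris, the", " the largest", "largest city"]
merged = merge_chunks(chunks)
print(merged)
```

One limitation of this approach is that it cannot tell an accidental overlap from a genuine repetition in the generated text, so a legitimately repeated word at a chunk boundary would also be collapsed.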



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name], and I am a software engineer who has been working in the field for [insert years] years. My current job involves [insert a relevant role] and I have a passion for [insert a personal passion or hobby]. I am always up-to-date with the latest technologies and I enjoy learning new things. I am a quick learner and a good communicator. I have a good work ethic and am always looking for ways to improve my skills. What can you tell me about yourself? [insert your self-introduction] [insert your self-introduction] [insert your self-introduction] [insert your self-introduction]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the cultural, political, and economic center of France and home to many of the country's most famous landmarks. The city is also the world's third-largest by populat
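
The practical benefit of the asynchronous call is that other coroutines can make progress while generation is awaited. The sketch below shows that pattern with `asyncio.gather`. `StubEngine` and `heartbeat` are hypothetical: the stub only mimics the `async_generate(prompts, sampling_params)` call shape used above and performs no inference.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the engine; it only mimics the call shape."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do inference
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(ticks):
    # Any other coroutine can run while generation is being awaited.
    for i in range(3):
        ticks.append(i)
        await asyncio.sleep(0)


async def main():
    engine = StubEngine()
    ticks = []
    # gather() schedules both coroutines on the same event loop.
    outputs, _ = await asyncio.gather(
        engine.async_generate(["a", "b"], {"temperature": 0.8}),
        heartbeat(ticks),
    )
    return outputs, ticks


outputs, ticks = asyncio.run(main())
print([o["text"] for o in outputs], ticks)
```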

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a young, aspiring writer with a passion for exploring the depths of storytelling. I've always been drawn to writing because I'm fascinated by how the human mind can hold complex ideas and emotions, and how those ideas can be conveyed through words on paper or screen. I have a natural ability to craft compelling narratives that grab people's attention and make them feel emotionally connected to the characters and the story. My hope is to become a published author, sharing my writing with readers around the world and sharing the joy of storytelling with others. Thank you for taking the time to meet me. Let's connect and see what our



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the largest and most populous city in the country.

To elaborate:
- Paris is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum.
- It is the seat of the French government and numerous other governmental agencies.
- The city is home to many museums, including the Louvre, the Centre Pompidou, and the Musée d'Orsay.
- The French language is the official language, and Paris is home to many universities and institutions of higher learning.
- The city is a major center for fashion, gastronomy, and culture, attracting visitors from


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and promising, with many potential trends shaping its direction. Here are some of the potential future trends:

1. Autonomous vehicles: AI is already transforming the transportation industry, with autonomous vehicles becoming more common and affordable. The trend will continue with self-driving cars becoming more widespread and accessible.

2. Speech and language processing: AI will continue to improve and expand its capabilities, with speech and language processing becoming more accurate and efficient. This will enable smarter and more intuitive interactions with computers and other devices.

3. Autonomous medical devices: AI will be used to develop medical devices that can diagnose and treat diseases more accurately and efficiently than human doctors. This
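
The consumption pattern used in the streaming cell, an `async for` loop that prints each piece as it arrives, can be tried without a model. In the sketch below, `fake_stream` is a hypothetical async generator standing in for the streaming call, and `collect` is an illustrative helper that prints pieces immediately while also keeping the full text.

```python
import asyncio


async def fake_stream(pieces):
    """Hypothetical stand-in for a streaming call; yields text pieces."""
    for piece in pieces:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield piece


async def collect(stream):
    """Print pieces as they arrive and return the joined text."""
    parts = []
    async for piece in stream:
        print(piece, end="", flush=True)  # show text as it arrives
        parts.append(piece)
    print()
    return "".join(parts)


text = asyncio.run(collect(fake_stream([" Paris", ",", " the", " capital"])))
```

Printing with `end=""` and `flush=True` keeps the pieces on one line and makes them visible immediately instead of waiting for the output buffer to fill.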




In [6]:
llm.shutdown()
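
In a script, as opposed to a notebook, it can be useful to make sure `shutdown()` runs even when generation raises. The sketch below shows one way to do that with a small context manager. `StubEngine` and `managed` are hypothetical names; the stub only records whether `shutdown()` was called and does not reflect how the real engine releases resources.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": ""} for _ in prompts]

    def shutdown(self):
        self.closed = True


@contextmanager
def managed(engine):
    try:
        yield engine
    finally:
        engine.shutdown()  # runs even if the body raised


engine = StubEngine()
try:
    with managed(engine) as llm:
        llm.generate([], {})  # raises ValueError in this stub
except ValueError:
    pass
print(engine.closed)
```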