# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
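
The need for this patch can be illustrated with the standard library alone. The sketch below is not SGLang code. It assumes the failure is the usual class of `RuntimeError` that asyncio raises when a blocking entry point such as `asyncio.run()` is called while a loop is already running, which is the situation inside IPython or Jupyter. The names `inner` and `outer` are hypothetical.

```python
import asyncio


async def inner():
    # Stand-in for any coroutine a library wants to drive to completion.
    return "done"


async def outer():
    # Inside this coroutine an event loop is already running,
    # much like inside an IPython/Jupyter cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Without a patch, the nested call is rejected. `nest_asyncio.apply()` patches asyncio so that such re-entrant calls are allowed.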

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Xan and I'm a computer programmer. I've been programming since I was 12 years old, and I've been working on my own projects since I was 15. Now, I'm seeking help with a problem. My colleague gave me a program that runs but doesn't work. Can you help me? Yes, I can help you. What's the problem you're having with the program? Is it running on a specific platform or is it running on the internet? Also, are you able to provide me with more details about the problem? For example, what are the errors that are occurring? What's the
Prompt: The president of the United States is
Generated text:  inaugurated on January 20, 2009. It is expected that the inauguration will occur on the next Thursday following the last Thursday of the previous year. How many weeks and how many days will it take from the inauguration to the next president's inauguration?

To determine the number of weeks and days between the inauguration of the president of the United States

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your character's personality or background]. I enjoy [insert a short description of your hobby or interest]. I'm always looking for new experiences and challenges to try. What's your favorite hobby or activity? I'm always looking for new adventures and experiences to try. What's your favorite book or movie? I'm always looking for new ideas and inspiration to try. What's your favorite color? I'm always looking for

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting numerous world-renowned museums, theaters, and art galleries. Paris is known for its rich history, including the influence of French Revolution and Napoleon Bonaparte, and its diverse population of over 2 million people. The city is also home to many famous French artists, writers, and musicians. Paris is a popular tourist destination, attracting millions of visitors each year. Its status as the capital of France is a testament to its importance as a cultural

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more sophisticated, it is likely to become more integrated with human intelligence, allowing it to perform tasks that are currently beyond the capabilities of humans. This could lead to more efficient and effective use of AI in various fields, such as healthcare, finance, and transportation.

2. Greater emphasis on ethical considerations: As AI becomes more advanced, there will be a greater emphasis on ethical considerations, such as privacy, bias, and accountability. This will require developers to create AI systems that are transparent, accountable, and responsible.

3. Increased use of AI
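
The "overlap removal" mentioned in the cell above can be sketched in plain Python. This is only an illustration of the idea, assuming that consecutive streamed chunks may repeat the tail of the text already received. The helpers `strip_overlap` and `merge_chunks` are hypothetical names and are not the implementation behind `stream_and_merge`.

```python
def strip_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that already ends `existing`."""
    max_len = min(len(existing), len(chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text that repeats the current tail."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Made-up chunks: the second one repeats "capital" from the first.
chunks = ["The capital", "capital of France", " is Paris"]
merged = merge_chunks(chunks)
print(merged)
```

One caveat of this heuristic: a chunk that legitimately repeats the preceding text (for example `"ha"` followed by `"ha"`) would also be trimmed, so it only makes sense when the stream is known to resend overlapping text.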



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  ____. I am a/an ____. I am currently ____. I enjoy ____. I have a passion for ____. I am an expert in ____. 

Your response should include an example of your work or expertise. For example, if you are a writer, you could mention your book "X" or "Y." If you are an athlete, you could mention your best time or your winning strategy. Your self-introduction should be neutral and informative, and should not contain any personal or emotional elements. The tone of your introduction should be professional and formal. Please format your response in Latex format, using the variables: name,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

A French person can speak two languages, French and English. 

As of 2021, the population of Paris is approximately 2.1 million people. 

Paris is a very multicultural city, ho
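
A practical reason to prefer the asynchronous call is that other coroutines can make progress while generation is awaited. The sketch below illustrates this with a stand-in coroutine instead of a real engine, so it runs without a GPU. `fake_async_generate` and `heartbeat` are hypothetical. The only thing taken from the cell above is the result shape, a list of dicts with a `"text"` key.

```python
import asyncio


async def fake_async_generate(prompts, sampling_params):
    # Hypothetical stand-in for an engine call: wait briefly, then return
    # one {"text": ...} dict per prompt, mirroring the output shape above.
    await asyncio.sleep(0.01)
    return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(ticks):
    # Unrelated work that keeps running while generation is awaited.
    count = 0
    for _ in range(ticks):
        await asyncio.sleep(0)
        count += 1
    return count


async def main():
    prompts = ["Hello, my name is", "The capital of France is"]
    sampling_params = {"temperature": 0.8, "top_p": 0.95}
    # gather() schedules both coroutines on the same event loop.
    outputs, ticks = await asyncio.gather(
        fake_async_generate(prompts, sampling_params),
        heartbeat(3),
    )
    return prompts, outputs, ticks


prompts, outputs, ticks = asyncio.run(main())
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

In a notebook, where an event loop is already running, the final `asyncio.run(main())` call needs the `nest_asyncio` patch described earlier.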

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm an AI language model. I can help you with a variety of tasks, including answering questions, providing information, and even generating writing prompts. My programming is constantly evolving to improve my abilities and expand my knowledge base, so please feel free to ask me any questions or let me know if I can assist with any tasks you have.

Welcome to my world, where technology is everywhere and I can help you with any task you have. My programming is constantly evolving to provide you with the most accurate and helpful responses possible, so please feel free to ask me any questions or let me know if I can assist with any



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as "La These," and it is the largest city in the country and the second most populous city in the world. It is located on the Seine River and is the cultural and economic center of France. Paris is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city also hosts the world-renowned Paris Marathon and the Parc Monceau. Paris is a popular tourist destination and is known for its romantic and historical attractions. It is a major financial center and the seat of the French government. The city is known for its rich culture



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be shaped by several key trends, including:

1. Increased use of AI in healthcare: AI is being used to improve the accuracy of medical diagnoses, predict disease outbreaks, and assist in personalized treatment plans.

2. Integration of AI in consumer electronics: AI is being integrated into consumer electronics such as smartphones, smart home devices, and virtual reality headsets, allowing for more advanced personalization and entertainment.

3. AI in the workplace: AI is being used to automate tasks and improve productivity, leading to more efficient work processes and potentially more job opportunities.

4. AI in transportation: AI is being used to improve traffic management, optimize




In [6]:
llm.shutdown()