# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython, Jupyter, or any other code that already runs inside an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()

```
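
To illustrate why this patch is needed, the following standard-library sketch reproduces the error raised when code tries to start an event loop inside one that is already running. It does not use SGLang: `inner` and `outer` are hypothetical names, and `outer` only stands in for a notebook cell that already executes inside a loop.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # attempt to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches `asyncio` so that such re-entrant calls are allowed instead of raising.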

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```
Prompt: Hello, my name is
Generated text:  Jakob. I am a German young computer programmer. I was born on 20th August 2001 in Lübeck, Germany and currently I'm living in the UK. I studied Computer Science at the University of California, Berkeley, and have held a part-time internship at Google. My programming language skills include C++, Java, C#, and JavaScript, and I have worked on various projects including open source projects like Node.js, Linux kernel, and Arduino. I am passionate about technology and have been building software for a living since the age of 14. I'm excited about the new possibilities that the technology
Prompt: The president of the United States is
Generated text:  24 years older than his youngest daughter. If the president is 41 years old, and the president decides to give his daughter a scholarship that is 25% of her age, how old will the daughter be when the scholarship is distributed?

To determine the age of the youngest daughter, we start by noting that the
```
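
For very long prompt lists it can be convenient to submit prompts in slices and collect the results as records, for example to save progress between slices. The sketch below shows one way to do that. The helper names `chunked` and `generate_in_chunks` are hypothetical, and `fake_generate` is a stand-in for `llm.generate` that only mimics the `{"text": ...}` output shape shown above, so the block runs without a model. The engine schedules batches on its own, so this is a bookkeeping aid rather than a performance technique.

```python
from typing import Callable, Dict, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start : start + size]


def generate_in_chunks(
    generate_fn: Callable[[List[str], Dict], List[Dict]],
    prompts: List[str],
    sampling_params: Dict,
    chunk_size: int,
) -> List[Dict]:
    """Call `generate_fn` once per slice and pair each prompt with its text."""
    records = []
    for batch in chunked(prompts, chunk_size):
        outputs = generate_fn(batch, sampling_params)
        for prompt, output in zip(batch, outputs):
            records.append({"prompt": prompt, "text": output["text"]})
    return records


def fake_generate(batch: List[str], sampling_params: Dict) -> List[Dict]:
    # Hypothetical stand-in for `llm.generate`; no model is involved.
    return [{"text": f" <completion for: {p}>"} for p in batch]


records = generate_in_chunks(
    fake_generate,
    ["Hello, my name is", "The capital of France is", "The future of AI is"],
    {"temperature": 0.8, "top_p": 0.95},
    chunk_size=2,
)
for record in records:
    print(record["prompt"], "->", record["text"])
```

With a real engine, `fake_generate` would be replaced by `llm.generate`, assuming it is called with a list of prompts and a sampling-parameter dict as in the cell above.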

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


```
=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] with [number of years] years of experience in [industry]. I'm passionate about [reason for being in the industry], and I'm always looking for ways to [something]. I'm [age] years old, and I'm [gender] [race]. I'm [occupation] and I'm [address]. I'm [address] and I'm [address]. I'm [address]. I'm [address]. I'm [address]. I'm [address]. I'm [address]. I'm [address].

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also the seat of the French government and the country's cultural and political center. Paris is a bustling metropolis with a rich history and a diverse population of over 2 million people. The city is known for its fashion, art, and cuisine, and is a popular tourist destination. It is also home to many famous landmarks, including the Arc de Triomphe and the Notre-Dame Cathedral. Paris is a city that has been a center of culture and politics for over 20

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased integration with other technologies: AI is already being integrated into a wide range of other technologies, such as smart homes, self-driving cars, and virtual assistants. As these technologies continue to evolve, we can expect to see even more integration between AI and other technologies.

2. Greater emphasis on ethical considerations: As AI becomes more integrated into our daily lives, there will be a greater emphasis on ethical considerations. This will include issues such as bias, transparency, and accountability.

3. Increased focus
```
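
The cell above mentions "overlap removal": when successive streamed chunks repeat text that was already received, the repeated part has to be dropped before the chunks are concatenated. The sketch below illustrates the idea with plain strings. It is an illustrative approximation, not the implementation behind `stream_and_merge`, and the names `strip_repeated_prefix` and `merge_chunks` are hypothetical.

```python
from typing import Iterable


def strip_repeated_prefix(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of `new_chunk` that repeats the end of `existing_text`."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks while removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_repeated_prefix(merged, chunk)
    return merged


# Each chunk repeats part of what was already received.
chunks = [" Paris", " Paris, the", " the city"]
print(repr(merge_chunks(chunks)))
```

Note that a purely textual heuristic like this one cannot tell an accidental overlap from a genuine repetition (for example `"ha"` followed by `"ha"`), so it is only suitable when chunks are known to overlap.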



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


```
=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [S or D] (Student / Doctor) at [School Name]. I recently graduated with a [degree] in [Field of Study]. I currently work [job title] at [company name]. I enjoy [insert what I enjoy doing]. My favorite hobby is [insert what I like to do]. I'm always looking for ways to [insert new skill or activity]. I'm [insert how you might describe yourself]. I'm [insert how old], [insert how many years of experience]. I'm [insert how excited you are to meet you], [Name]. I'm excited to meet

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the capital city of France and is known for its rich history, beautiful architecture, and famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city is also a hub for art, culture, and commerce, and is home to over 1
```
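
The asynchronous API becomes most useful when generation has to run alongside other coroutines. As a minimal sketch of that pattern, the block below issues one coroutine per prompt and awaits them together with `asyncio.gather`. The function `fake_async_generate` is a hypothetical stand-in that only mimics the `{"text": ...}` output shape. Whether `llm.async_generate` can be called this way with a single prompt is an assumption here, so check it against your SGLang version.

```python
import asyncio
from typing import Dict, List


async def fake_async_generate(prompt: str, sampling_params: Dict) -> Dict:
    # Hypothetical stand-in for an awaitable generation call; no model is involved.
    await asyncio.sleep(0.01)
    return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(prompts: List[str], sampling_params: Dict) -> List[Dict]:
    """Start one coroutine per prompt and wait for all of them."""
    tasks = [fake_async_generate(p, sampling_params) for p in prompts]
    # gather() returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*tasks)


prompts = ["The capital of France is", "The future of AI is"]
outputs = asyncio.run(generate_concurrently(prompts, {"temperature": 0.8, "top_p": 0.95}))
for prompt, output in zip(prompts, outputs):
    print(prompt, "->", output["text"])
```

Because `asyncio.gather` preserves input order, the results can be zipped with the prompts exactly as in the batch example above.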

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


```
=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Jane. I'm an English teacher. How can I assist you today? I'm a warm-hearted and humorous person. I like to think I'm funny, but I have a sense of humor that can sometimes make people laugh. I like to chat with people and help them with their questions and concerns. I have a wide range of interests, from film and music to travel and hobbies. What's your name? I'm Jane. I'm an English teacher. Can you tell me a little bit more about yourself? I'm a warm-hearted and humorous person. I like to think I'm funny, but I have a sense of humor

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a bustling metropolis known for its historical landmarks, vibrant arts scene, and cosmopolitan lifestyle. Paris is situated on the River Seine and is the seat of government, administration, and culture in France. It is also one of the most visited cities in the world, hosting millions of tourists annually. The city is a melting pot of different cultures and is celebrated for its art, architecture, and cuisine. Paris is a city of contrasts, with iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum, as well as trendy areas like the Marais district and the Left Bank. It's

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly uncertain, and it is impossible to predict with absolute certainty which trends will emerge or which trends will fade. However, we can make educated guesses about the likely trajectory of AI in the coming years based on current trends and emerging technologies.

One potential future trend in AI is the increasing use of AI in natural language processing and machine learning. As AI technology continues to advance, we can expect to see even more complex and sophisticated models that can interpret and understand human language in new and innovative ways. This could lead to a greater ability to interact with AI systems through natural language, as well as to more sophisticated forms of machine learning that can understand and
```
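
The streaming cell prints each cleaned chunk as soon as it arrives. If the full text is also needed afterwards, the chunks can be accumulated while iterating. The sketch below shows that consumption pattern with a hypothetical async generator, `fake_token_stream`, standing in for `async_stream_and_merge`; it yields whitespace-separated pieces of a fixed string and involves no model.

```python
import asyncio
from typing import AsyncIterator


async def fake_token_stream(text: str) -> AsyncIterator[str]:
    # Hypothetical stand-in for a chunk stream: yields one piece at a time.
    for piece in text.split(" "):
        await asyncio.sleep(0)  # hand control back to the event loop between pieces
        yield piece + " "


async def collect(stream: AsyncIterator[str]) -> str:
    """Consume an async stream of text chunks and return the joined result."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts).rstrip()


full_text = asyncio.run(collect(fake_token_stream("Paris is the capital of France.")))
print(full_text)
```

Collecting chunks in a list and joining once at the end avoids repeated string concatenation for long outputs.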




```python
llm.shutdown()
```
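
In a script, it can help to guarantee that `shutdown()` runs even when generation raises an exception. One possible way to do this is a small context manager, sketched below. The name `engine_session` is hypothetical, and `FakeEngine` is a stand-in exposing only a `shutdown()` method so that the block runs without a model.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(engine):
    """Yield `engine` and call its `shutdown()` on exit, even after an error."""
    try:
        yield engine
    finally:
        engine.shutdown()


class FakeEngine:
    # Hypothetical stand-in exposing only the `shutdown()` method used above.
    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


engine = FakeEngine()
try:
    with engine_session(engine):
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(engine.closed)
```

With a real engine, the same wrapper would be used as `with engine_session(sgl.Engine(model_path=...)) as llm:`, assuming the engine object provides `shutdown()` as in the cell above.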