# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
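
The underlying issue is that plain `asyncio` refuses to start a second event loop while one is already running, which is the situation inside IPython or Jupyter. The sketch below reproduces that failure with the standard library only; it does not touch SGLang, and `inner` and `outer` are illustrative names rather than part of any API.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # An event loop is already running at this point, as in IPython or Jupyter.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second loop inside the first
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Calling `nest_asyncio.apply()` patches `asyncio` so that this kind of re-entrant call is allowed.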

## Advanced Usage

The engine also supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) and [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine.
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Sarah and I'm a social studies teacher at Orinda High School. I teach high school social studies and social science classes. I have taught every school subject. I am also certified to teach English Language Arts and have been certified to teach elementary school.
I have been teaching since 2013. I am also the director of the School Psychologist Program. I am currently registered to teach Social Studies at the Secondary School level. I have a Bachelor of Science Degree in Education from the University of California, Santa Cruz. I currently work as a college board grade-level English teacher. I have also taught in the Gwinnett County Schools
Prompt: The president of the United States is
Generated text:  a person. A person is a person. Therefore, the president of the United States is a person.

Please examine the validity of this syllogism. The argument has:

a) valid
b) invalid
c) no conclusion
d) it is impossible to determine

To determine the 

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Age] year old [Gender] [Occupation]. I'm a [Skill] who has been [Number of Years] years in the industry. I'm passionate about [What I Love to Do], and I'm always looking for new opportunities to [What I Want to Learn/Do]. I'm [What I Do Best/What I'm Good At]. I'm excited to meet you and learn more about you. [Your Name] [Your Age] [Your Gender] [Your Occupation] [Your Skill] [Your Number of Years] [Your Passion] [Your Goal] [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting numerous museums, theaters, and restaurants. Paris is a popular tourist destination, known for its rich history, art, and cuisine. It is home to the French Parliament, the French Academy of Sciences, and the French National Library. The city is also home to the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. Paris is a vibrant and diverse city with a rich cultural heritage, and it is a popular destination

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more sophisticated, it is likely to become more integrated with human intelligence, allowing it to learn from and adapt to human behavior and decision-making processes.

2. Enhanced privacy and security: As AI becomes more advanced, there will be an increased focus on privacy and security, with efforts to ensure that AI systems are designed and implemented in a way that respects human rights and protects personal data.

3. Greater emphasis on ethical considerations: As AI becomes more prevalent in various industries, there will be a greater emphasis on ethical considerations, with efforts to ensure that
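
The "overlap removal" mentioned in the cell above can be pictured as follows. This is a minimal sketch, assuming that a streamed chunk may begin with text that was already emitted; `strip_overlap` and `merge_chunks` are hypothetical helpers written for illustration, and the actual `stream_and_merge` in `sglang.utils` may work differently.

```python
def strip_overlap(existing: str, chunk: str) -> str:
    """Return the part of `chunk` that does not repeat the tail of `existing`."""
    max_overlap = min(len(existing), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, dropping any prefix that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Illustrative chunks, not real engine output.
chunks = [" Paris", " Paris, which", " which is known"]
print(merge_chunks(chunks))
```

A suffix/prefix heuristic like this one can also drop text that the model repeats on purpose, so it is best suited to streams that are known to resend earlier text.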



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm a [insert profession or major], majoring in [insert degree program] at [insert university or college]. I've always been interested in the creative arts, which is why I'm passionate about [insert something related to your profession or major]. I also enjoy [insert hobbies or interests that are relevant to your profession or major], such as [insert hobby or interest]. As a [insert role], [insert position] at [insert employer or company name], I'm always looking for new ways to [insert something related to your profession or major], and I'm always eager to learn and grow as a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in the country and the second-largest urban area in the world by population, after the metropolis of Tokyo in Japan. Located in the center of
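
Awaiting a call hands control back to the event loop, so other coroutines can make progress while a request is in flight. The sketch below illustrates that pattern with `asyncio.gather` and a stand-in coroutine; `fake_generate` is hypothetical, only sleeps, and does not call SGLang.

```python
import asyncio
import time


async def fake_generate(prompt: str, delay: float = 0.1) -> dict:
    # Hypothetical stand-in for an awaited engine call; it only sleeps.
    await asyncio.sleep(delay)
    return {"text": f" <completion for {prompt!r}>"}


async def run_concurrently(prompts):
    # gather() schedules all coroutines at once and preserves input order.
    return await asyncio.gather(*(fake_generate(p) for p in prompts))


start = time.perf_counter()
results = asyncio.run(run_concurrently(["a", "b", "c"]))
elapsed = time.perf_counter() - start

for item in results:
    print(item["text"])
print(f"elapsed: {elapsed:.2f}s (the three waits overlap)")
```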

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text: 

...

 (

insert

 name

)

 and

 I

'm

 a

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an

/an



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the Île de la Cité region, which includes the city of Paris itself. Paris is known for its historic landmarks such as Notre-Dame Cathedral and the Eiffel Tower, and is also home to the French National Library, the Louvre Museum, and the Palace of Versailles. The city is also known for its vibrant culture, including art and music, and its cuisine, which is well-regarded for its flavors and techniques. Paris is a popular destination for tourists and locals alike, known for its elegant architecture, rich history, and cultural attractions. It is the largest city in France and the third-largest



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  promising and likely to continue to grow, evolve, and explore new dimensions of human creativity, control, and influence. Here are some possible future trends in AI:

1. Increased Personalization: AI will become more personalized, and people will be able to expect and demand more customized and targeted experiences. This could involve more human-like interactions, as well as more personalized recommendations and interactions.

2. Autonomous Agents: Autonomous agents, which are programs that can perform tasks without human intervention, will become more widespread. This could involve robots and drones that can perform tasks without human intervention, such as picking up trash or delivering groceries.

3. Machine Learning and
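
The `async for` consumption pattern used in the cell above can be tried without a model. The sketch below assumes a stream whose chunks carry the cumulative text generated so far, which may not match the engine's actual streaming format; `fake_stream`, `stream_deltas`, and `collect` are hypothetical names used only for illustration.

```python
import asyncio


async def fake_stream(pieces, delay: float = 0.0):
    # Hypothetical stand-in for a streaming call: yields cumulative text.
    text = ""
    for piece in pieces:
        await asyncio.sleep(delay)
        text += piece
        yield {"text": text}


async def stream_deltas(stream):
    # Yield only the text that is new since the previous chunk.
    seen = ""
    async for chunk in stream:
        full = chunk["text"]
        delta = full[len(seen):] if full.startswith(seen) else full
        seen = full
        yield delta


async def collect(pieces):
    # Gather every delta into a list so the result is easy to inspect.
    return [delta async for delta in stream_deltas(fake_stream(pieces))]


deltas = asyncio.run(collect([" Paris", ",", " located"]))
print(deltas)
```

Printing each delta with `end=""` and `flush=True`, as the notebook cell does, shows the text growing in place instead of one fragment per line.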




```python
llm.shutdown()
```