# Offline Engine API

SGLang provides a direct inference engine that does not require an HTTP server, which is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation
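
The four modes differ mainly in how the call is made and how the result is consumed. The sketch below mirrors those call shapes against a tiny stand-in class so that it runs without a GPU or a model download. `FakeEngine`, `split_chunks`, and the `run_*` helpers are hypothetical, the `stream=True` flag is an assumption about how streaming is requested, and the stub yields only newly generated text per chunk; the real `sgl.Engine` may differ in these details.

```python
import asyncio


def split_chunks(text):
    """Split text into word-sized chunks that join back to the original."""
    words = text.split(" ")
    return [words[0]] + [" " + w for w in words[1:]]


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine: echoes prompts, ignores sampling_params."""

    def generate(self, prompt, sampling_params=None, stream=False):
        if stream:
            # Streaming: an iterator of chunk dicts for a single prompt.
            return ({"text": c} for c in split_chunks("echo: " + prompt))
        # Non-streaming: a list of prompts gives a list of result dicts.
        return [{"text": "echo: " + p} for p in prompt]

    async def async_generate(self, prompt, sampling_params=None, stream=False):
        if stream:
            async def chunks():
                for c in split_chunks("echo: " + prompt):
                    yield {"text": c}
            return chunks()
        return [{"text": "echo: " + p} for p in prompt]


def run_sync(engine, prompts):
    return [o["text"] for o in engine.generate(prompts)]


def run_sync_stream(engine, prompt):
    return "".join(c["text"] for c in engine.generate(prompt, stream=True))


async def run_async(engine, prompts):
    outputs = await engine.async_generate(prompts)
    return [o["text"] for o in outputs]


async def run_async_stream(engine, prompt):
    stream = await engine.async_generate(prompt, stream=True)
    return "".join([c["text"] async for c in stream])


engine = FakeEngine()
print(run_sync(engine, ["Hello", "Bonjour"]))
print(run_sync_stream(engine, "Hello there"))
print(asyncio.run(run_async(engine, ["Hello"])))
print(asyncio.run(run_async_stream(engine, "Hello there")))
```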

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
To use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
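
The reason for this step is that the standard library refuses to start a second event loop from inside one that is already running. The following sketch reproduces that failure with `asyncio` alone; it does not use SGLang, and the function names are illustrative.

```python
import asyncio


async def inner():
    return "done"


async def call_run_inside_loop():
    """Try to start a second event loop from within a running one."""
    coro = inner()
    try:
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return f"RuntimeError: {exc}"


message = asyncio.run(call_run_inside_loop())
print(message)
```

After `nest_asyncio.apply()`, nested calls of this kind are permitted instead of raising.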

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Aniket. I'm a journalist and a music producer, passionate about the music industry. I've been covering music for several years now and have been a regular contributor to various music publications. As a creative force, I've been passionate about my craft and am always on the lookout for innovative ways to bring my audience closer to the music I love. My goal is to continue to produce high-quality music for my fans while also staying true to the values of the music industry. How can I best approach the music industry as a journalist and music producer? As an advocate for music, I aim to make music accessible and enjoyable for everyone.
Prompt: The president of the United States is
Generated text:  a person. Which of the following statements about the president is true? (　　)  
A: The president is the highest leader of the state, with the highest position in the state.  
B: The president is the person who exercises the highest administrative powe
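
Each element of `outputs` is a dictionary whose `text` field holds the completion, paired with `prompts` by position. For offline batch jobs it is often convenient to persist these pairs; the sketch below writes them as JSON Lines. The helper name `to_jsonl` and the sample data are hypothetical, and only the `text` key is taken from the cell above.

```python
import io
import json


def to_jsonl(prompts, outputs, fp):
    """Write one JSON object per prompt/completion pair to a file-like object."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")


# Hypothetical stand-ins shaped like the results of llm.generate(...).
sample_prompts = ["The capital of France is"]
sample_outputs = [{"text": " Paris."}]

buffer = io.StringIO()
to_jsonl(sample_prompts, sample_outputs, buffer)
print(buffer.getvalue(), end="")
```

Passing an open file instead of `io.StringIO` writes the same records to disk.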

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? [Name] is a [job title] at [company name]. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history dating back to the Roman Empire and the Middle Ages. Paris is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city is also famous for its fashion industry, with Paris Fashion Week being one of the largest in the world. Paris is a cultural and economic hub, and is home to many world-renowned museums, theaters, and restaurants. It is a popular tourist destination, and is a major center for business and finance in Europe. Paris is also known for its cuisine

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more sophisticated, it is likely to become more integrated with human intelligence, allowing it to learn from and adapt to human behavior and decision-making processes.

2. Greater emphasis on ethical considerations: As AI becomes more advanced, there will be a greater emphasis on ethical considerations, including issues such as bias, transparency, and accountability.

3. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes, reduce costs, and improve access to care. As AI becomes more advanced, it is likely to be used in even
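
The banner printed by the cell above mentions overlap removal: if successive streamed chunks repeat text that has already been received, the repeated prefix has to be dropped before appending. The following is a minimal sketch of that idea, not the actual implementation of `stream_and_merge`; `trim_overlap`, `merge_chunks`, and the sample chunks are hypothetical.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, removing text that overlaps with what was already merged."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


overlapping = ["The capital", "capital of France", " France is Paris."]
print(merge_chunks(overlapping))
```

One caveat of this sketch is that a genuine repetition at a chunk boundary (for example `"ha"` followed by `"ha"`) is indistinguishable from an overlap and would be dropped.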



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  John, and I am a software engineer with a passion for creating innovative solutions and technology-driven solutions. I have a strong technical background and love to work with teams of developers to bring my ideas to life. I enjoy learning new technologies and constantly improving my skills, and I am always looking for ways to stay up-to-date with the latest trends and developments in the field. I believe that technology is the key to unlocking new opportunities and making the world a better place, and I am excited to join a team where I can share my knowledge and passion with others. Thank you for considering me for a potential career match. Happy to meet you!

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is known for its iconic Eiffel Tower, vibrant culture, and rich history. Paris is also fa
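
Asynchronous generation is most useful when requests arrive independently rather than as one prepared list. The sketch below sends one request per prompt and awaits them together with `asyncio.gather`, capped by a semaphore. `StubEngine` is a hypothetical stand-in that only mimics the list-in, list-out `async_generate` call shape used above; `generate_all` and `max_concurrency` are likewise illustrative.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in: echoes prompts instead of running a model."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0)  # yield control, as a real request would while waiting
        return [{"text": f" (reply to {p!r})"} for p in prompts]


async def generate_all(engine, prompts, sampling_params=None, max_concurrency=2):
    """Send one request per prompt, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(prompt):
        async with semaphore:
            outputs = await engine.async_generate([prompt], sampling_params)
            return outputs[0]["text"]

    # gather returns results in the order of its arguments.
    return await asyncio.gather(*(one(p) for p in prompts))


texts = asyncio.run(generate_all(StubEngine(), ["a", "b", "c"]))
print(texts)
```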

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [background information about yourself]. What can you tell us about yourself? [Provide additional information about yourself, such as your hobbies, interests, or any relevant experiences or accomplishments]. How do you like to spend your free time? [Describe your preferred activities or hobbies, such as reading, playing music, or hanging out with friends]. What is your favorite color? [If you have a favorite color, please share it]. Lastly, what is your greatest passion in life? [Specify your passion in a clear and concise manner, such as writing, photography, or music].
[Name], my name is Sarah



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris, officially known as the Île-de-France region in the 2010 census, is the capital of France and the largest metropolitan area in the world by population. It is located in the north central region of France and includes the Île de la Cité, the suburb of the same name, and the 12th arrondissement, the city center. Paris is the sixth-largest city in the European Union and the 15th-largest city in the world by population. It is the seat of the French government, the head of state, and the chief administrative and financial centre of France.



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  rapidly evolving and has the potential to transform virtually every aspect of our lives, from healthcare to education to transportation. Here are some possible trends in AI that we can expect to see in the coming years:

1. Increased integration of AI into various industries: With the increasing adoption of AI in various industries, we can expect to see more companies and organizations integrating AI into their operations. For example, in the healthcare industry, AI can be used to improve diagnosis and treatment outcomes, while in the financial industry, AI can be used to improve fraud detection and risk management.

2. Improved accuracy and reliability of AI systems: As AI systems become more complex
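
The cell above streams the prompts one after another. Because each stream is consumed with `async for`, several streams can also be read concurrently. The sketch below shows that pattern with the standard library only; `fake_stream` is a hypothetical stand-in for a streaming response, and `collect` is an illustrative helper, not part of SGLang.

```python
import asyncio


async def fake_stream(text):
    """Hypothetical stand-in for a streaming response: yields one word per chunk."""
    for i, word in enumerate(text.split(" ")):
        await asyncio.sleep(0)  # give other tasks a chance to run between chunks
        yield word if i == 0 else " " + word


async def collect(name, stream, events):
    """Consume one stream, recording each chunk as it arrives."""
    parts = []
    async for chunk in stream:
        events.append((name, chunk))
        parts.append(chunk)
    return "".join(parts)


async def main():
    events = []
    results = await asyncio.gather(
        collect("A", fake_stream("Paris is the capital"), events),
        collect("B", fake_stream("AI keeps evolving"), events),
    )
    return results, events


results, events = asyncio.run(main())
print(results)
```

The `events` list records the arrival order across both streams, which is where a UI or logger would hook in to display partial output per request.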




```python
llm.shutdown()
```
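
In a script, as opposed to a notebook, it can be worth making sure `shutdown()` runs even when generation raises. A minimal sketch using `try`/`finally`; `StubEngine` is a hypothetical stand-in that only records whether `shutdown()` was called, and `run_batch` is an illustrative helper.

```python
class StubEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params=None):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": " ok"} for _ in prompts]

    def shutdown(self):
        self.closed = True


def run_batch(engine, prompts, sampling_params=None):
    """Generate completions, then always shut the engine down."""
    try:
        return [o["text"] for o in engine.generate(prompts, sampling_params)]
    finally:
        engine.shutdown()
```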