# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
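The reason a patch is needed can be seen with the standard library alone. The sketch below is illustrative and does not use SGLang; `inner` and `outer` are hypothetical names. It shows that starting a second event loop from inside a running one raises a `RuntimeError`, which is the situation a notebook cell is in when it calls `asyncio.run`.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second loop from within a running one
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

With `nest_asyncio.apply()` in place, nested calls of this kind are permitted, which is why the notebook cells in this document can call `asyncio.run(main())` directly.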

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```
Prompt: Hello, my name is
Generated text:  Yuriko and I am a 9th grade student at St. Peter’s School in Ontario, Canada. I am the president of the board of directors at Stony Brook University, and I have been working on a project to improve the sustainability and accessibility of public transportation in the city. I am also a passionate environmentalist and advocate for climate action.

What is your main focus in the project?

As the president of the board of directors at Stony Brook University, my main focus is on the sustainable development and accessibility of public transportation. I am working on developing a transportation plan that would prioritize the use of electric and sustainable vehicles and
Prompt: The president of the United States is
Generated text:  trying to decide how many armed guards should be deployed in each of the 200 towns in order to best protect the country. 1.5 times the number of guards in San Francisco is 200. 2.5 times the number of guards in San Francisco
```
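In an offline batch job the results are usually saved rather than printed. The following is a minimal sketch of one way to do that, assuming only what the cell above shows: that each output is a dictionary with a `"text"` key. The helper name `to_jsonl` and the stand-in data are hypothetical and are not part of SGLang.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text, one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = [
        json.dumps({"prompt": p, "text": o["text"]}, ensure_ascii=False)
        for p, o in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Stand-in data shaped like the list returned by llm.generate(prompts, sampling_params).
example_prompts = ["The capital of France is", "The future of AI is"]
example_outputs = [{"text": " Paris."}, {"text": " uncertain."}]

jsonl = to_jsonl(example_prompts, example_outputs)
print(jsonl)
```

In a real run, `example_outputs` would be replaced by the `outputs` list from the engine, and the string could be written to a file.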

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


```
=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French National Library, and the French National Opera. Paris is a bustling metropolis with a rich cultural heritage and is a major tourist destination. It is the largest city in France and the second-largest city in the European Union by population. The city is also known for its fashion industry, with Paris Fashion Week being one of the largest in the world. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing for more sophisticated and nuanced interactions between humans and machines.

2. Greater use of machine learning: Machine learning will continue to become more advanced, with more sophisticated algorithms and models that can learn from data and adapt to new situations.

3. Increased focus on ethical considerations: As AI becomes more integrated with human intelligence, there will be a greater focus on ethical considerations, including issues such as bias, transparency, and accountability.

4. Greater use of AI for creative and artistic purposes: AI is likely to
```
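The cell above mentions "overlap removal" when merging streamed chunks. The sketch below illustrates the general idea in plain Python and is not the implementation in `sglang.utils`, which may differ; the names `trim_overlap` and `merge_chunks` as written here, and the sample chunks, are assumptions made for illustration. Each new chunk is compared against the text accumulated so far, and the longest prefix of the chunk that repeats the end of that text is dropped.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Return new_chunk without the longest prefix that repeats the end of existing."""
    max_len = min(len(existing), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while skipping text that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each one repeats the tail of the previous text.
chunks = ["The capital", " capital of France", " France is Paris."]
merged = merge_chunks(chunks)
print(merged)
```

A purely textual check like this can also strip legitimate repetition (for example, a model that really does emit the same word twice), so it is best treated as a heuristic.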



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


```
=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I am a [role] at [Your company]. I've been an avid reader and podcast listener since childhood, and my love for storytelling and storytelling techniques has always fueled my passion for writing. I'm always looking for ways to make the world a better place through my writing and in my role as a public speaker. I'm enthusiastic about sharing my knowledge and experiences, and I'm eager to meet new people and share my passions with them. I'm excited to meet you! [Your Name] [Your Role] [Your Company] [Your Passion] [Your Background] [Your Interests] [Your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, often referred to as the "City of Light" due to its vibrant culture, architecture, and artistic heritage. It serves as the political, economic, and cultural center of France, hosting major e
```
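One reason to prefer the asynchronous interface is that a coroutine can be awaited alongside other work. The sketch below shows the pattern with `asyncio.gather` and a stub object; `StubEngine` and `run_batches` are hypothetical names, and the stub only mimics the call shape of `async_generate` seen above. Whether issuing several calls at once helps with a real engine is not something this document establishes, so treat it as an illustration of the asyncio pattern only.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in exposing an awaitable async_generate method."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Create one coroutine per batch and wait for all of them together.
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)  # results keep the order of `batches`


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(
    run_batches(StubEngine(), batches, {"temperature": 0.8, "top_p": 0.95})
)
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(prompt, "->", output["text"])
```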

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


```
=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  __________ and I am a/an ___________.
As an AI language model, I don't have a physical form and I'm not limited to a fixed personality. I'm here to assist you with your questions, answer your queries, and provide helpful responses to any problems you might have. Let me know if you need anything else. How can I assist you today?
As a language model, I don't have a physical form and I'm not limited to a fixed personality. I'm here to assist you with your questions, answer your queries, and provide helpful responses to any problems you might have. Let me know if you

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as Louvain, and is the largest city and the capital of the country, located on the banks of the River Seine.

A 1967 film noir directed by Robert Altman, "Crazy Town" is a 1970 Western film noir that features a group of young men who are attracted to a beautiful woman who lives on the outskirts of the city. Although they initially have no intention of pursuing her romantically, the men start to develop a strong romantic attraction to her after they begin to see her as a true love interest.

Given a list of categories: company, educational institution

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by several trends:

  1. Increased automation and automation of tasks: As AI technologies continue to advance, we can expect to see more automation of tasks in many industries. This could involve the automation of repetitive and mundane tasks, as well as the automation of tasks that were once done by humans.
  2. Improved AI ethics and accountability: As AI technology continues to evolve, we can expect to see more attention paid to the ethical and moral implications of AI systems. This could involve the development of guidelines and frameworks for the responsible use of AI, as well as the development of AI systems that are designed to be
```


```python
llm.shutdown()
```
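In a standalone script it can be useful to guarantee that `shutdown()` runs even when generation raises. The following is a minimal sketch of that pattern using a context manager; `engine_session` and `StubEngine` are hypothetical helpers written for this example, and the only assumption about the real engine is the `shutdown()` method shown above.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        return [{"text": ""} for _ in prompts]

    def shutdown(self):
        self.closed = True


@contextmanager
def engine_session(engine):
    """Yield the engine and always call shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


stub = StubEngine()
try:
    with engine_session(stub) as engine:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(stub.closed)
```

With a real engine, the object created by `sgl.Engine(...)` would be passed to `engine_session` in place of the stub.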