# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
To use the **Offline Engine** in IPython, Jupyter, or any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
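
The reason is that an environment such as IPython already runs an event loop, and the standard `asyncio.run()` refuses to start a second one inside it. The following sketch uses only the standard library (no SGLang) to reproduce that error; `inner` and `outer` are illustrative names. `nest_asyncio.apply()` patches asyncio so that such nested calls are allowed.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, so a nested
    # asyncio.run() call is rejected by the standard library.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```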

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    # Allow asyncio.run() inside an already-running event loop (e.g. a notebook)
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Sarah and I'm from the United States. I'm really good at English because I'm from the US. My hometown is Chicago, Illinois. There I learned English from my parents and my grandparents. There is a big difference between Chicago and my hometown. In Chicago, people are friendly and have a lot of fun. There are lots of places to see and do. There is a lot of traffic and traffic lights. In my hometown, there is no traffic and it's much safer. In Chicago, there are lots of restaurants. In my hometown, there are only a few. There is a big difference between Chicago and my hometown.
Prompt: The president of the United States is
Generated text:  two-thirds of the population of a city. The mayor of the city owns three-quarters of the city's public transportation. If the city's population is 100,000 and the city's public transportation is divided equally between two services, how many people are on the mayor's bus?
To determine the number of people on th
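
For offline batch jobs it is often convenient to persist the results. The sketch below assumes only what the cell above shows, namely that each output is a dict with a `"text"` key. The helper `save_results`, the file name, and the stand-in data are hypothetical and not part of SGLang.

```python
import json
import tempfile
from pathlib import Path


def save_results(prompts, outputs, path):
    """Write one JSON object per prompt/output pair (JSON Lines format)."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for prompt, output in zip(prompts, outputs):
            record = {"prompt": prompt, "text": output["text"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


# Stand-in for the list returned by llm.generate(prompts, sampling_params)
example_prompts = ["The capital of France is"]
example_outputs = [{"text": " Paris."}]

with tempfile.TemporaryDirectory() as tmp:
    out_path = save_results(
        example_prompts, example_outputs, Path(tmp) / "results.jsonl"
    )
    lines = out_path.read_text(encoding="utf-8").splitlines()

print(lines)
```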

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? [Name] is a [job title] at [company name]. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a cultural and economic center, hosting numerous museums, theaters, and festivals throughout the year. Paris is a popular tourist destination and a major hub for international business and diplomacy. The city is known for its rich history, art, and cuisine, and is a major transportation hub for Europe. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. The city is also home to many international organizations and institutions, including UNESCO and the European Union. Paris is a

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased automation and artificial intelligence: As AI becomes more advanced, it is likely to automate many of the tasks that are currently performed by humans. This could lead to increased efficiency and productivity, but it could also lead to job displacement for some workers.

2. Improved privacy and security: As AI becomes more sophisticated, it is likely to require more data to function effectively. This could lead to increased privacy concerns, as AI systems may be able to learn about individuals' personal information
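
The "overlap removal" mentioned in the cell above refers to dropping text that a streamed chunk repeats from what has already been received. The following is a minimal sketch of one way such merging can work. It is an assumption for illustration, not necessarily what `stream_and_merge` does internally, and `trim_overlap`, `merge_chunks`, and the sample chunks are illustrative.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Return new_chunk without the prefix that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


chunks = ["The capital", " capital of", " of France", " is Paris."]
merged_text = merge_chunks(chunks)
print(merged_text)  # The capital of France is Paris.
```

Note that this heuristic cannot tell a genuine repetition from an overlap: a chunk that legitimately starts with the same characters the text already ends with would be shortened as well.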



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [job title] with a passion for [specialization or interest]. I'm a dedicated [job title] who loves [related activities or hobbies]. I'm always looking for new experiences and things to learn, so I enjoy [advice or advice for others]. I'm excited to meet you and what you can offer.
Your response should be between 4-6 sentences, and should be written in an informative yet conversational tone. You may also include any personal anecdotes or experiences to add depth to your self-introduction. Avoid using profanity or negative language. Make sure to give a brief overview

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is the second-largest city in the country and one of the most visited cities in the world. It is also home to the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cat
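
When requests arrive independently, for example in a custom server, each one can be awaited as its own coroutine while a semaphore caps how many are in flight. The sketch below illustrates that pattern with a stand-in class. `FakeEngine` and `generate_bounded` are hypothetical and not part of SGLang, and the per-prompt call signature is an assumption made for the example.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for an engine; not part of SGLang."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_bounded(engine, prompts, sampling_params, max_concurrency=2):
    """Run one request per prompt, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(prompt):
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # asyncio.gather returns results in the order of its arguments.
    return await asyncio.gather(*(one(p) for p in prompts))


demo_prompts = ["a", "b", "c"]
results = asyncio.run(
    generate_bounded(FakeEngine(), demo_prompts, {"temperature": 0.8})
)
print([r["text"] for r in results])
```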

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream chunks through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert first name], and I'm here to meet you and help you get where you need to be. How can I assist you today?

A: Hello, my name is [insert first name], and I'm here to meet you and help you get where you need to be. How can I assist you today? [Insert brief description of character's abilities, personality, or background] I'm here to help anyone who needs it. How can I assist you today? [Insert brief description of character's abilities, personality, or background] I'm here to help anyone who needs it. How can I assist you today? [



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city and most populous city in France, with a population of approximately 2. 4 million. The city is located on the banks of the Seine River and is known for its art, music, cuisine, and fashion. Paris is also home to many historical and cultural landmarks, including the Eiffel Tower and Notre-Dame Cathedral. The city has a rich and diverse cultural and artistic heritage, and has been the birthplace of many renowned figures in French and global history. The city is also an important economic and cultural center, with a thriving economy and a vibrant nightlife. Overall, Paris is a



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly uncertain and will likely continue to evolve over the coming decades, driven by a combination of new technologies, advances in data and computing power, and shifts in societal values and norms. Here are some possible trends that could shape the future of AI:

1. Increasing automation and robotics: As AI becomes more integrated into our daily lives, we may see a rise in automation and robotics, as machines take on tasks that were previously performed by humans. This could lead to significant job losses in certain industries, but also create new opportunities for new roles and skills.

2. Enhanced cognitive abilities: AI is likely to continue to improve our ability to learn and




In [6]:
llm.shutdown()
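
In a script, an exception raised before the final cell would skip `shutdown()`. One way to guard against that is a small context manager, sketched below. `managed_engine` and `RecordingEngine` are hypothetical names introduced for this example. The only assumption about the real engine is that it exposes `shutdown()`, as used above.

```python
from contextlib import contextmanager


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


class RecordingEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.closed = True


fake = RecordingEngine()
try:
    with managed_engine(fake) as engine:
        engine.generate(["hi"], {})
        raise ValueError("simulated failure")
except ValueError:
    pass

print(fake.closed)  # True
```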