# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
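
The snippet below is a small standard-library sketch of the problem this section addresses. It does not use SGLang. It assumes standard CPython `asyncio` behavior: calling `asyncio.run()` from code that is already running inside an event loop (as notebook cells are) raises a `RuntimeError`. The function names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, which mirrors
    # the situation of a cell executed in IPython or Jupyter.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second loop in the same thread
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Applying `nest_asyncio` patches the event loop so that nested calls of this kind are allowed, which is why the notebook cells in this document apply it before calling `asyncio.run`.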

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  George Smith and I am from Manchester, England, and I am a Mechanical Engineer by Education. I have a bachelor's degree in Mechanical Engineering from the University of Manchester. I have held positions with John Deere and Waters Technologies in the UK and I am currently working in the Department of Mechanical Engineering at the University of Manchester. My research interests are in the areas of manufacturing systems, machine design and robotics.

Research Interests

• Manufacturing Systems
• Intelligent Manufacturing
• Manufacturing Scheduling

Education

• Bachelor of Mechanical Engineering, University of Manchester, Manchester, England, 2009

Professional Experience

• Mechanical Engineer, John
Prompt: The president of the United States is
Generated text:  assassinated. This is a significant event, making it extremely likely that the assassinated person was a well-known political figure. Therefore, it is necessary to distinguish the types o

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm [age] years old, [gender] and [country]. I have a [job title] at [company name], and I enjoy [job title] work. I'm [job title] at [company name], and I enjoy [job title] work. I'm [job title] at [company name], and I enjoy [job title] work. I'm [job title] at [company name], and I enjoy [job

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light, a historic and culturally rich city with a rich history dating back to the Roman Empire. It is the largest city in France and the second-largest city in the European Union, with a population of over 2. 5 million people. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Arc de Triomphe. It is also a major center for the arts, music, and fashion, and is home to many world-renowned museums, theaters, and art galleries. Paris is a popular tourist destination and a

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several trends that are expected to shape the development of this technology in the coming years. Here are some of the potential trends that are likely to shape the future of AI:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be a greater emphasis on ethical considerations. This will include issues such as bias, transparency, accountability, and privacy. AI developers will need to be more mindful of the potential impact of their technology on society and work to ensure that it is used in a responsible and ethical manner.

2. Greater integration with other technologies: AI is already being integrated into
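
The cell above prints its results "with overlap removal". As a rough illustration of that idea, here is a minimal sketch that merges text chunks while dropping any prefix of a new chunk that repeats the end of the text collected so far. The helper names `drop_overlap` and `merge_chunks` and the example chunks are hypothetical. This is not claimed to be the code behind `stream_and_merge`.

```python
def drop_overlap(existing_text: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the end of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first so the largest repeat is removed.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text that overlaps with what was already merged."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


# Hypothetical chunks whose boundaries overlap.
example_chunks = ["The capital", "capital of France", " France is Paris"]
merged_text = merge_chunks(example_chunks)
print(merged_text)
```

The search is quadratic in the chunk length in the worst case, which is usually acceptable for short streamed chunks.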



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a/an [Age] year-old [Occupation] [Job Title]. I'm a professional [Role] with a career spanning [Number of Years in Industry] and [Number of Years in Industry]. I work as a/an [Name] and [Name] at [Name]. I'm a/an [Type of Character] who [Describe What Character is]. I'm passionate about [Your Passion], and I enjoy [Your Passion in Detail]. I'm a/an [Type of Character] who [Describe What Character is]. I have a [Type of Character] character trait [Describe What Trait You Have

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the second largest city in the European Union and the second most populous city in the world after New York City. The city is known for its rich history, beautiful architecture, and diverse cultural scene. It has been the seat of government and the capital of Fran
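
One reason to prefer an awaitable call is that several requests can be scheduled on the same event loop. The sketch below illustrates the pattern with `asyncio.gather`. `StubEngine` is a hypothetical stand-in that only mimics the call shape used above (a list of prompts in, a list of dicts with a `text` key out). It is not the SGLang engine, and the sketch makes no claim about how the real engine behaves under concurrent calls.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for an engine; it only mimics the awaitable call shape."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do work without blocking the loop
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Create one coroutine per batch and wait for all of them together.
    pending = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*pending)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(
    run_batches(StubEngine(), batches, {"temperature": 0.8, "top_p": 0.95})
)
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```

`asyncio.gather` returns results in the order the awaitables were passed, so each result list lines up with its input batch.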

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm an AI language model designed to assist with a wide range of tasks, including but not limited to answering questions, generating text, and helping with other tasks. I'm a trustworthy and helpful assistant, always ready to assist you with any questions or tasks you may have. How can I assist you today? Let me know! #AIAssistant #TechGuru #TechSupport #GentleIntroduction #FriendlyStyle #PositiveAttitude #Knowledgeable #Friendly #Helpful #TechAware #TechSavvy #TechGuru #TechGuru #TechGuru #TechGuru #TechGuru



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light and the City of Light, the City of Music and the City of Science. It is located in the central region of France, on the banks of the Seine River, in the Provence-Alpes-Côte d'Azur region. It is France's largest city, with a population of over 2.7 million people and one of the world's most cosmopolitan cities, home to the world's largest museums, theaters, opera houses, and other cultural institutions. Paris is also known for its stunning architecture, including its iconic Eiffel Tower, and for its annual annual summer



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to involve a number of technological and societal changes that will shape the way we live and interact with technology. Here are some possible future trends in AI:

1. Increased automation and AI-driven automation: As AI continues to improve, we can expect to see a growing number of jobs automated, leading to a rise in automation and AI-driven automation. This could result in a greater emphasis on human skills and abilities and could lead to a more efficient use of resources.

2. AI-powered healthcare: With the increasing availability of AI-powered medical devices and algorithms, we can expect to see more personalized and efficient healthcare. AI-powered systems could be used to
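
The streaming cell above consumes an async generator chunk by chunk. As a hedged illustration of that consumption pattern, the sketch below wraps a stream in a second async generator that yields only newly generated text. It assumes, purely for illustration, that each streamed item carries all text produced so far. `fake_cumulative_stream` and `iter_deltas` are hypothetical names, and nothing here is claimed to match the behavior of `async_stream_and_merge`.

```python
import asyncio


async def fake_cumulative_stream():
    """Hypothetical stream: every item repeats all text generated so far."""
    for snapshot in [" Paris", " Paris is", " Paris is the capital"]:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield {"text": snapshot}


async def iter_deltas(stream):
    """Yield only the text that is new since the previous item."""
    seen = 0
    async for item in stream:
        text = item["text"]
        delta = text[seen:]
        seen = len(text)
        if delta:
            yield delta


async def collect():
    return [delta async for delta in iter_deltas(fake_cumulative_stream())]


deltas = asyncio.run(collect())
print(deltas)
```

Tracking the length already seen is cheaper than searching for overlaps, but it only works when each item is a strict extension of the previous one.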




In [6]:
llm.shutdown()