# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
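
To see why this is needed, the sketch below reproduces the general problem with the standard library only; it is not SGLang code. IPython and Jupyter already run an event loop, and starting a second loop from inside a running one raises a `RuntimeError`, which is the kind of failure `nest_asyncio.apply()` works around. The function names `inner` and `outer` are made up for illustration.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # A loop is already running here, as it is inside IPython or Jupyter.
    coro = inner()
    try:
        asyncio.run(coro)  # attempts to start a second, nested event loop
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Without patching, the nested call fails and `result` holds the error message instead of `"no error"`.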

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Kasturi Shetty. I am a fifth-year student at the University of Hawaii at Manoa. My major is in Environmental Studies, and my thesis is on the impact of green roofs and their functionality on the growing of grass.
I have some experience in environmental science. For instance, I also did a project on the impact of green roofs on the growing of grass in the US. I am also very enthusiastic about environmental issues and not just environmental science, but also policy and the whole of sustainability.
I'm very passionate about the issues around the environment, not just the research side of it, but also the wider application of the issue
Prompt: The president of the United States is
Generated text:  trying to choose between two jobs for the next administration. The first job is the head of the Department of Defense, which he believes will lead to significant changes in the military, especially in the development of new technology and defense strateg
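
The `temperature` and `top_p` entries in `sampling_params` control how each next token is sampled. As a rough illustration of the general technique (temperature scaling followed by nucleus filtering), and not of SGLang's actual sampler, the sketch below applies both to a toy distribution. The function `top_p_filter`, the token names, and the logits are all invented for this example.

```python
import math


def top_p_filter(
    logits: dict[str, float], temperature: float, top_p: float
) -> dict[str, float]:
    """Return renormalized probabilities of the tokens kept by top-p filtering."""
    # Temperature scaling: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}

    # Softmax, subtracting the maximum for numerical stability.
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # Keep the most likely tokens until their cumulative probability reaches top_p.
    kept: dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    kept_total = sum(kept.values())
    return {tok: p / kept_total for tok, p in kept.items()}


toy_logits = {"Paris": 4.0, "Lyon": 2.0, "London": 1.0, "banana": -2.0}
filtered = top_p_filter(toy_logits, temperature=0.8, top_p=0.95)
print(filtered)
```

With these toy numbers only the two most likely tokens survive the filter; a real sampler would then draw the next token from the surviving set.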

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French Academy of Sciences, and the French National Library. Paris is a bustling city with a rich cultural heritage and is a popular tourist destination. It is the capital of France and the largest city in the European Union. The city is known for its cuisine, fashion, and art, and is home to many famous landmarks and attractions. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. It is a city

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some of the most likely future trends in AI:

1. Increased automation and robotics: As AI continues to advance, we are likely to see an increase in automation and robotics in various industries. This could lead to the creation of more efficient and productive machines that can perform tasks that were previously done by humans.

2. AI-powered healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI technology continues to advance, we are likely to see even more widespread adoption of AI
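
The "overlap removal" named in this section refers to merging streamed chunks without repeating text that was already emitted. The sketch below shows one plausible way to do that with plain strings. `drop_overlap` and `merge_chunks` are hypothetical helpers written for this example, and they are not claimed to match how `stream_and_merge` is implemented.

```python
def drop_overlap(existing: str, chunk: str) -> str:
    """Strip from the start of `chunk` the longest prefix that `existing` already ends with."""
    max_len = min(len(existing), len(chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: list[str]) -> str:
    """Concatenate chunks, skipping any text a chunk repeats from the merged result."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


# Toy chunks in which some chunks resend the tail of the previous one.
streamed = [" Paris", " Paris, which", " which is known", " for its landmarks"]
merged_text = merge_chunks(streamed)
print(repr(merged_text))
```

A suffix/prefix match like this would also swallow a genuine repetition in the generated text, so it is only appropriate when chunks may resend earlier output.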



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert character's name], and I specialize in creating engaging and thought-provoking content for digital and print media. I'm always looking for fresh ideas and innovative approaches to storytelling, and I'm passionate about using storytelling to make a positive impact in the world. I'm currently freelancing for [insert company name] and have worked on a variety of projects, including creative writing, graphic design, and content creation. I enjoy helping others to develop their creative skills and spreading the word about the power of storytelling. I'm excited about the opportunity to share my knowledge and passion for storytelling with others. Thank you for considering me for your character

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Here is the factual statement:

Paris is the capital of Fran
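
One reason to prefer the asynchronous call is that other coroutines can make progress while generation is pending. The sketch below illustrates that pattern with a stand-in coroutine instead of a real engine. `fake_generate`, `heartbeat`, and `run_both` are hypothetical names, and `asyncio.sleep` merely simulates inference latency.

```python
import asyncio


async def fake_generate(prompts: list[str]) -> list[dict]:
    # Stand-in for an awaitable batch generation call; sleeps instead of running a model.
    await asyncio.sleep(0.05)
    return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(ticks: list[int]) -> None:
    # Unrelated work that keeps running while generation is pending.
    for i in range(3):
        ticks.append(i)
        await asyncio.sleep(0.01)


async def run_both() -> tuple[list[dict], list[int]]:
    ticks: list[int] = []
    # gather schedules both coroutines on the same event loop and waits for both.
    results, _ = await asyncio.gather(
        fake_generate(["Hello, my name is", "The capital of France is"]),
        heartbeat(ticks),
    )
    return results, ticks


demo_outputs, demo_ticks = asyncio.run(run_both())
print(len(demo_outputs), demo_ticks)
```

The heartbeat completes all of its iterations while the simulated generation is still waiting, which would not happen with a blocking call.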

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I am a [short, enthusiastic, or reserved] man who is passionate about [your profession or hobby], and who is always ready to [whatever it is that you are] when it comes to [what you are]. I am [your profession or hobby], and I have always been fascinated by [something that might interest you, such as the outdoors, music, science, or technology]. I am always looking for new ways to [what you are] and I am always eager to learn about [what you are] and try new things. I am very [your personality type or trait], and I always

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city that sits majestically on the Seine River and is the largest city in Europe.

Here's the relevant paragraph from the Wikipedia article on France:

The capital city of France is Paris, the largest city in Europe, the seat of the government, and the cultural, economic, and educational center of the country. The population of Paris is 2,142,781, of which 2,099,731 reside in the city proper and 43,048 outside of Paris. The city is the fifth-largest city in the world by population, and the

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to involve several key trends, including:

1. Increased integration with human-like qualities: AI systems are likely to become more natural and human-like, able to mimic human behavior and think like humans. This could lead to a more intuitive and efficient use of AI in our daily lives.

2. Personalization: AI systems will become increasingly capable of learning from data and personalization will be a key trend. This will enable AI systems to provide tailored and personalized experiences to users.

3. Enhanced privacy: AI systems will continue to pose privacy concerns. As AI systems become more integrated with our everyday lives, there will be a need to address concerns
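
The `async for` loop in this section consumes chunks as they arrive and prints each one immediately. The same consumption pattern can be sketched with a plain async generator. `fake_stream` and `collect` are hypothetical stand-ins that yield pre-split chunks and never call the engine.

```python
import asyncio
from typing import AsyncIterator


async def fake_stream(chunks: list[str]) -> AsyncIterator[str]:
    # Stand-in for a streaming generation call: yields one chunk at a time.
    for chunk in chunks:
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield chunk


async def collect() -> str:
    pieces = []
    async for chunk in fake_stream([" Paris", ",", " the", " city"]):
        print(chunk, end="", flush=True)  # show text incrementally
        pieces.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(pieces)


full_text = asyncio.run(collect())
```

Printing with `end=""` and `flush=True` keeps the chunks on one line and makes them visible as soon as they are produced.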




```python
llm.shutdown()
```