# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
To use the **Offline Engine** in IPython, or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
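
The patch is needed because a plain `asyncio.run()` refuses to start while another event loop is already running, which is the situation inside a notebook cell. The following minimal sketch reproduces that failure with the standard library only; `inner` and `outer` are hypothetical names used for illustration and are not part of SGLang.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # At this point an event loop is already running, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # a second, nested asyncio.run()
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Without the patch, the nested call ends in the `RuntimeError` branch. `nest_asyncio.apply()` patches `asyncio` so that such nested calls are permitted.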

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.85it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Alice. I'm an 11-year-old girl. I like playing games and I love to collect stamps. I have a collection of 50 stamps in my room. I often show my stamps to my parents. I think playing games is the most important. I like my parents and they always give me good things. But, I don't like going to school because I think it's boring. I often listen to music and watch TV. I get up early in the morning and go to bed at night. I do my homework every night. I also play table tennis with my friends, and we have a great time.
Prompt: The president of the United States is
Generated text:  in the White House and the United States is in the same position as in the 18th century. How is this possible? Explain in detail.

The president of the United States is in the White House and the United States is in the same position as in the 18th century.

This is possible because the United States has undergone a significant transformation in its political, economic, an
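
To make the `temperature` and `top_p` entries of `sampling_params` more concrete, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) sampling over a toy distribution. It illustrates the general technique only and is not SGLang's sampler; the `sample_top_p` helper and the toy logits are assumptions made for this example.

```python
import math
import random


def sample_top_p(logits, temperature=0.8, top_p=0.95, rng=None):
    """Sample one token from `logits`, a dict mapping token -> logit."""
    rng = rng or random.Random()

    # Temperature scaling followed by a numerically stable softmax.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    kept, mass = [], 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, p))
        mass += p
        if mass >= top_p:
            break

    tokens, weights = zip(*kept)
    # random.choices renormalizes the weights of the surviving tokens.
    return rng.choices(tokens, weights=weights, k=1)[0], list(tokens)


toy_logits = {" Paris": 3.0, " Lyon": 2.0, " Nice": 0.0, " banana": -3.0}
token, nucleus = sample_top_p(toy_logits, temperature=0.8, top_p=0.95, rng=random.Random(0))
print("nucleus:", nucleus)
print("sampled:", token)
```

A lower temperature sharpens the distribution and a lower `top_p` shrinks the candidate set, so both push generation towards more deterministic output.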

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I am a [occupation] with [number of years] years of experience in [field]. I am a [type of person] who is always [positive trait]. I am [ability to do something]. I am [personality type]. I am [role in the organization]. I am [role in the company]. I am [role in the community]. I am [role in the community]. I am [role in the community]. I am [role in the community]. I am [role in the community]. I am [role in the community]. I am [role in the community]. I am [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is the largest city in France and the second-largest city in the European Union. Paris is known for its rich history, beautiful architecture, and vibrant culture. The city is home to many famous landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. Paris is also known for its fashion industry, with many famous designers and boutiques. The city is a major hub for business and trade, with many international companies and organizations headquartered there. Paris is a popular tourist destination, with millions of visitors each year. It is a cultural and artistic center, with many museums, theaters, and galleries.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn and adapt to human behavior. This could lead to more personalized and efficient applications of AI, as well as more effective ways to interact with machines and humans.

2. Greater emphasis on ethical considerations: As AI becomes more integrated with human intelligence, there will be a greater emphasis on ethical considerations. This could lead to more stringent regulations and guidelines for AI development and deployment,
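
The cell above mentions overlap removal, but the internals of `stream_and_merge` are not shown in this document. One plausible way to merge streamed chunks whose beginnings repeat the end of the text received so far is sketched below. The helper names `drop_repeated_prefix` and `merge_chunks` are hypothetical, and this should not be read as SGLang's actual implementation.

```python
def drop_repeated_prefix(existing: str, new_chunk: str) -> str:
    """Return the part of `new_chunk` that does not repeat the tail of `existing`."""
    max_overlap = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text that was already received."""
    merged = ""
    for chunk in chunks:
        merged += drop_repeated_prefix(merged, chunk)
    return merged


chunks = [" Paris is", " is the capital", " capital of France."]
merged = merge_chunks(chunks)
print(merged)
```

Checking the longest overlap first avoids keeping a duplicated fragment when a shorter overlap would also match.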



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm a [job title] with [number of years of experience]. My strongest skills are [list skills]. What's your name and what's your job? [Name]: Hello, my name is [Name]. I'm a [job title] with [number of years of experience]. My strongest skills are [list skills]. What's your name and what's your job? (Repeat this sentence as many times as necessary to get the point across) Oh wow, I really like hearing that name! Do you happen to have a short biography or a short story that you could share with me? I'm always looking

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest and most populous city in the country, and is known as the "City of Light" due to its picturesque architecture, vibrant culture, and important historical significance. Paris is also a major financial center and a major
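
The asynchronous interface is useful when several requests should be in flight at the same time. The sketch below shows that pattern with `asyncio.gather`. `StubEngine` is a hypothetical stand-in so that the example runs without a model; it only mimics the call shape used above, under the assumption that `async_generate` is awaited once per prompt and returns a dict with a `text` field.

```python
import asyncio


class StubEngine:
    """Stand-in for the engine, only so this sketch runs without a model."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do inference
        return {"text": f" <completion for: {prompt!r}>"}


async def generate_all(engine, prompts, sampling_params):
    # Start one coroutine per prompt and wait for all of them together.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*tasks)


prompts = ["The capital of France is", "The future of AI is"]
sampling_params = {"temperature": 0.8, "top_p": 0.95}
outputs = asyncio.run(generate_all(StubEngine(), prompts, sampling_params))
for prompt, output in zip(prompts, outputs):
    print(prompt, "->", output["text"])
```

Because `gather` preserves input order, the results can be zipped with the prompts exactly as in the cell above.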

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name], and I'm a [job title] at [company name]. I've been with the company for [number of years] years, and I've been working hard to grow my career in [industry/field] and [job title]. I believe in [reason for success] and I'm always looking for ways to improve myself. What makes you a good fit for the job? Additionally, let me know if you'd like me to give you a bit more information about [company name]. I'm happy to share anything I can about the company, its history, mission, and any other relevant information. Thank you for

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Let me know if you have any other questions! Paris, officially called the City of Paris, is the capital and largest city of France. It is located in the western suburbs of the Paris region, bounded by the Seine River to the west, the Arc de Triomphe to the east, the River Marne to the north, and the Seine to the south. The city spans across 12 sq. miles (31 sq km) and is the oldest continuously inhabited city in the world, with its ancient Roman city walls dating back to the 5th century BC. Paris is known for its beautiful

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be driven by the development of new technologies and applications, as well as ongoing advancements in machine learning, deep learning, and natural language processing. Some possible future trends in AI include:

1. Increased AI ethics: As more AI systems become integrated into daily life, it is likely that ethical concerns will arise. This could lead to the development of new ethical guidelines and frameworks for AI development and deployment.

2. The integration of AI into human life: As AI becomes more integrated into daily life, it is likely that it will become more natural and ubiquitous. For example, AI-powered voice assistants like Siri and Google Assistant will become more prevalent
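
The streaming loop above consumes an asynchronous generator with `async for` and prints each piece as soon as it arrives. A minimal, model-free sketch of the same consumption pattern follows. `fake_stream` is a hypothetical stand-in for the streaming helper, and the assumption here is that each yielded piece contains only new text.

```python
import asyncio


async def fake_stream(text, chunk_size=6):
    """Stand-in async generator that yields small pieces of `text`."""
    for start in range(0, len(text), chunk_size):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield text[start:start + chunk_size]


async def consume(stream):
    pieces = []
    async for piece in stream:
        # end="" keeps the pieces on one line; flush shows them immediately.
        print(piece, end="", flush=True)
        pieces.append(piece)
    print()  # newline once the stream is exhausted
    return "".join(pieces)


full_text = asyncio.run(consume(fake_stream(" Paris is the capital of France.")))
```

Collecting the pieces while printing them lets the caller keep the complete text for later use.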




In [6]:
llm.shutdown()