# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed, working example written as a Python script is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython, or in any other code that already runs inside an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
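
To see why the patch is needed, the following minimal sketch uses only the standard library and does not touch SGLang. It assumes the usual situation in a notebook: an event loop is already running, and code inside it tries to start another one with `asyncio.run`. The function names `inner` and `outer` are made up for this illustration.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # starting a second loop is rejected
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Without patching, the nested `asyncio.run` call raises `RuntimeError`. `nest_asyncio.apply()` patches asyncio so that such nested calls are allowed.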

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.21it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Katy. I was born in Chicago, Illinois. I've been living in New York since 2011. I'm a 21 year old college student majoring in Biology. I like to study computers, math, and singing.
When I first went to college, I was working at an accounting firm. It was fun but I didn't enjoy working there. I had to go to different parts of the country to work for different companies. The work was challenging but I didn't enjoy it. I had to work hard to keep up with the demands of the work and the money was not enough to afford a car or
Prompt: The president of the United States is
Generated text:  a unique position because they are both the head of government and the head of the executive branch. Some people may confuse their roles, but to them, it is the same person. They are both the head of the executive branch of the government. Does it follow that does the president of the United States have to be a man? Options are:
+ yes
+ it is not possible to tell
+
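
The `temperature` and `top_p` values above control how the next token is sampled. As a rough illustration of the general technique (temperature scaling followed by nucleus filtering), and not of SGLang's internal implementation, the sketch below applies both steps to a made-up four-token logit vector. The helper names `apply_temperature` and `top_p_filter` and the logit values are assumptions for this example only.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize so the surviving probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up values
toy_probs = apply_temperature(toy_logits, temperature=0.8)
nucleus = top_p_filter(toy_probs, top_p=0.95)
print(sorted(nucleus))
```

With these toy numbers the least likely token falls outside the nucleus, so sampling would only ever choose among the remaining three. Lower temperatures sharpen the distribution, and lower `top_p` values shrink the candidate set.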

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] with [number of years] years of experience in [industry]. I'm passionate about [job title] and enjoy [reason for passion]. I'm always looking for new challenges and opportunities to grow and learn. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. 

A. True
B. False
A. True

Paris is the capital of France and is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also a major cultural and economic center in Europe. Paris is a popular tourist destination and is home to many world-renowned museums, art galleries, and restaurants. The city is known for its rich history, diverse culture, and beautiful architecture. Paris is a city that has been a center of politics, culture, and industry for centuries, and it continues to be a major hub for global affairs and commerce

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior and decision-making processes. This could lead to more sophisticated and adaptive AI systems that can better understand and respond to human emotions and preferences.

2. Greater emphasis on ethical considerations: As AI becomes more integrated with human intelligence, there will be a greater emphasis on ethical considerations and responsible use of AI. This could lead to more stringent regulations and standards for AI development and deployment, as well as greater transparency and accountability in AI systems.

3. Increased focus on
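
The "overlap removal" mentioned above can be pictured as follows. This is a minimal sketch of the idea, assuming that consecutive streamed chunks may repeat text already received. It is not the actual body of `stream_and_merge`, and the names `trim_overlap` and `merge_chunks` as well as the sample chunks are illustrative.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that existing_text already ends with."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while skipping repeated overlapping text."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


sample_chunks = ["Hello", "lo wor", "world!"]  # made-up stream with overlaps
merged_text = merge_chunks(sample_chunks)
print(merged_text)
```

Here the second chunk repeats `lo` and the third repeats `wor`, so only the new suffix of each chunk is appended.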



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [career level] [gender] [occupation]. I have a strong [personal trait or quality] and I am always [preference/attraction] with [people]. My [career goal] is [reason for goal]. As a [personal characteristic], I am [appearance/age/height]. What kind of person are you? [Personal trait or quality]: [insert trait] [Appearance/age/height]: [insert appearance/age/height] [Career goal]: [insert goal] [Personal characteristic]: [insert personal characteristic] [Appearance/age/height]: [insert appearance/

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light. 

This statement is factually accurate. Paris is the largest city in France and the third-largest city in the world, with a population of approximately 2.3 million. It is known for its historical architecture, vibr
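
The benefit of the asynchronous API is that the calling coroutine yields control while it waits, so other work can proceed in the same event loop. The sketch below shows that pattern with a stand-in coroutine instead of a real engine. `fake_async_generate` is hypothetical and only sleeps, and the artificial delays are chosen so that later prompts finish first.

```python
import asyncio


async def fake_async_generate(prompt, delay):
    # Hypothetical stand-in for an awaitable engine call; it only sleeps.
    await asyncio.sleep(delay)
    return {"text": f" <completion for: {prompt}>"}


async def run_batch(prompts):
    # Later prompts get shorter delays, so they complete first.
    tasks = [
        fake_async_generate(prompt, 0.01 * (len(prompts) - index))
        for index, prompt in enumerate(prompts)
    ]
    # gather returns results in submission order, not completion order.
    return await asyncio.gather(*tasks)


demo_prompts = ["Hello, my name is", "The capital of France is"]
demo_outputs = asyncio.run(run_batch(demo_prompts))
for prompt, output in zip(demo_prompts, demo_outputs):
    print(f"Prompt: {prompt}\nGenerated text:{output['text']}")
```

Because `asyncio.gather` preserves submission order, zipping prompts with outputs stays correct even when the requests finish out of order.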

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Career] [Role]. I'm also an [Adjective] person. I have a [Number] degree in [Field] and [Number] years of experience in [Field]. I enjoy [Adjective] work and am always looking for ways to [Adjective] things. What are some of the challenges that you're likely to face in your role, and how are you addressing them? Let me know! # [Name] - [Career] - [Role]
Hello, my name is [Name], and I'm a [Career] [Role]. I'm also an [Ad

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its rich history, beautiful architecture, and famous museums and landmarks such as the Eiffel Tower and Louvre Museum.

(Answer with 3-5 sentences about Paris's location, culture, and attractions, while also mentioning any notable aspects or events that have occurred in the city)

Please also explain the significance of the Eiffel Tower in Paris and how it became a symbol of France during World War I. Finally, describe any current events or cultural initiatives in Paris that reflect its importance to the French people.

(Ensure your answer includes specific details about the Eiffel Tower and its significance to Paris

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by continuous innovation and development, driven by advancements in computing power, machine learning algorithms, and neural networks. Here are some potential trends that may shape the future of AI:

1. Increased focus on ethical AI: As we become more aware of the potential risks and unintended consequences of AI systems, there will be a growing emphasis on ethical and responsible development of AI. This may lead to increased regulation of AI systems, as well as improvements in the safety and reliability of AI systems.
2. Advances in natural language processing: With the help of machine learning, AI systems are becoming increasingly capable of understanding and interpreting human language. This

In [6]:
llm.shutdown()
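
In a standalone script, it can be useful to make sure the shutdown call runs even when generation raises an exception. The following is a minimal sketch of that pattern using `try`/`finally`. `StandInEngine` is a hypothetical placeholder class written for this example, not the SGLang `Engine`; only the `generate` and `shutdown` method names mirror the calls used above.

```python
class StandInEngine:
    """Hypothetical placeholder that mimics an engine's lifecycle."""

    def __init__(self):
        self.is_running = True

    def generate(self, prompt):
        if not self.is_running:
            raise RuntimeError("engine has been shut down")
        return {"text": f" <completion for: {prompt}>"}

    def shutdown(self):
        self.is_running = False


engine = StandInEngine()
try:
    result = engine.generate("Hello, my name is")
    print(result["text"])
finally:
    # Runs whether or not generate() raised, so resources are always released.
    engine.shutdown()
```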