# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
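
A framework-independent way to start such a server is to keep request handling in a plain function that receives the engine as an argument. The sketch below is illustrative only: `EchoEngine` is a hypothetical stand-in for `sgl.Engine` that returns canned text, and the JSON field names `prompts`, `sampling_params`, and `texts` are assumptions made for this example, not an SGLang protocol.

```python
import json


class EchoEngine:
    """Hypothetical stand-in for the real engine so the sketch runs without a model."""

    def generate(self, prompts, sampling_params):
        # Mirrors the output shape used in this document: one dict with a "text" key per prompt.
        return [{"text": f" (echo) {p}"} for p in prompts]


def handle_generate(engine, body: bytes):
    """Turn a JSON request body into a (status_code, JSON response body) pair."""
    try:
        payload = json.loads(body)
        prompts = payload["prompts"]
    except (ValueError, KeyError, TypeError):
        error = {"error": "expected a JSON object with a 'prompts' list"}
        return 400, json.dumps(error).encode()
    sampling_params = payload.get("sampling_params", {})
    outputs = engine.generate(prompts, sampling_params)
    return 200, json.dumps({"texts": [o["text"] for o in outputs]}).encode()


status, response = handle_generate(
    EchoEngine(), json.dumps({"prompts": ["The capital of France is"]}).encode()
)
print(status, response.decode())
```

Because `handle_generate` does not depend on any web framework, it could be called from whichever HTTP layer you prefer, with the real engine passed in place of the stub.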



## Nest Asyncio
If you use the **Offline Engine** in IPython, Jupyter, or any other code that already runs an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()
```
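
The reason this is needed can be reproduced with the standard library alone: `asyncio.run()` refuses to start when an event loop is already running, which is the situation inside a notebook. The sketch below simulates that case without SGLang; `inner` and `outer` are illustrative names.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # While this coroutine runs, an event loop is already active,
    # much like code executed inside a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Applying `nest_asyncio` patches the loop so that such nested calls are permitted.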

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    # Imported only when running in CI.
    import patch
else:
    # Allow asyncio.run() inside an already running event loop (e.g. Jupyter).
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Mary, and I'm in Grade 6. I'm really shy, I don't like talking to people. One day, I lost my school bag and I cried. Then my mom said, "We don't need to worry about your school bag. It's not important to you. You can go out with friends and have fun." My mom's words made me feel so much. I felt happy and excited. But after a while, I realized that my mom was trying to make me feel better. I was scared to be seen by people because I didn't want to make friends, and I was upset that I had lost
Prompt: The president of the United States is
Generated text:  trying to decide whether to send his family on a trip to Hawaii. Hawaii is 3,000 miles away and it takes 5 hours to travel there. Once there, it takes 2 hours to explore the islands and eat at a restaurant. After dinner, the president wants to take a nap. He then wants to spend 20 minutes playing games. Finally, he wants to spend another 4 hours on the couch. How many hours would he have spent 
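
For offline batch jobs it is often convenient to persist the results. The following sketch, which is not part of SGLang, pairs each prompt with its output and serializes the pairs as JSON Lines. It assumes only what the cell above shows, namely that each output is a dict with a `text` key; `to_jsonl` and the sample data are hypothetical.

```python
import json


def to_jsonl(prompts, outputs):
    """Serialize prompt/completion pairs as JSON Lines, one record per prompt."""
    # zip() would silently drop unmatched items, so check the lengths first.
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    return "\n".join(
        json.dumps({"prompt": p, "completion": o["text"]}, ensure_ascii=False)
        for p, o in zip(prompts, outputs)
    )


# Hand-written stand-ins shaped like the results of llm.generate(...).
example_prompts = ["The capital of France is", "The future of AI is"]
example_outputs = [{"text": " Paris."}, {"text": " bright."}]

jsonl = to_jsonl(example_prompts, example_outputs)
print(jsonl)
```

With a real engine, `example_outputs` would be replaced by the list returned from `llm.generate(prompts, sampling_params)`.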

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I am a [job title] at [company name]. I am passionate about [job title] and have been working in this field for [number of years] years. I am always looking for new challenges and opportunities to grow and learn. I am a [job title] with a passion for [job title], and I am always eager to learn and improve. I am a [job title] who is always looking for ways to make a positive impact in the world. I am a [job title] who is always looking for ways to make a difference in the lives of others. I am a [job

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament and the French National Museum of Modern Art. Paris is a bustling metropolis with a rich cultural heritage and is a major tourist destination. It is also known for its fashion industry, with Paris Fashion Week being one of the world's largest. The city is also home to the French Parliament, the French National Museum of Modern Art, and the Eiffel Tower. Paris is a vibrant and dynamic city that is a must-visit for anyone interested in French culture

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies will continue to improve and become more integrated into our daily lives, from self-driving cars and robots to personalized healthcare and education. AI will also continue to play an increasingly important role in solving complex problems, from climate change and energy production to healthcare and security. As AI becomes more integrated into our daily lives, it is likely to have a significant impact on the way we work, live, and interact with each other. However, it is also important to consider the potential risks and ethical concerns associated with AI, and to work
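
The "overlap removal" mentioned in the cell above can be pictured as follows. This is a minimal sketch of one way to merge streamed chunks that may repeat already-emitted text; it illustrates the general technique and is not presented as the implementation of `stream_and_merge`. The names `trim_overlap` and `merge_chunks` are chosen for this example.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    for size in range(min(len(existing_text), len(new_chunk)), 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate streamed chunks, skipping text that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Cumulative chunks (each repeats the text so far) and incremental chunks
# both merge to the same string.
cumulative = ["Paris", "Paris is", "Paris is the capital"]
incremental = ["Paris", " is", " the capital"]
print(merge_chunks(cumulative))
print(merge_chunks(incremental))
```

One limitation of this heuristic is that a chunk which legitimately repeats the preceding text (for example `"ha"` followed by `"ha"`) is treated as overlap and dropped.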



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [First Name] and I'm a/an [Last Name]. I am a/an [x] [noun] with a background in [x] [noun]. I've been [x] for [x] years, and I'm passionate about [x]. I'm a/an [x] person who is [x] with [x]. I strive to [x] every day, and I'm always looking for ways to improve myself. I'm [x] because I want to [x]. I am [x] and I am [x]. I have [x] and I am [x]. I am [x

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is located on the River Seine in the north of the country and is the political, cultural, and economic center of the country. It is also known as the City of Light and the City of Love. Paris is the second most populous city in the European Union after Brussels and is home to numerous historical landmarks, such as the Eiffel Tower and Notre-Dame Cathedral. It has a rich culinary heritage and a diverse
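
An asynchronous call is most useful when several requests are in flight at once. The sketch below shows the general `asyncio.gather` pattern for submitting several batches concurrently. `StubAsyncEngine` is a hypothetical stand-in that only mimics the call shape of `llm.async_generate` used above; the sketch says nothing about how the real engine schedules concurrent requests.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in that mimics an awaitable batch call."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do some work
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    """Submit every batch concurrently and return results in batch order."""
    calls = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*calls)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(run_batches(StubAsyncEngine(), batches, {"temperature": 0.8}))
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```

`asyncio.gather` returns results in the order the awaitables were passed, so each result list lines up with its batch.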

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I am a/an [Job Title] in [Company Name]. I have been working in [Field] for [X years] and I have always been passionate about [Your Job Title] because I believe in making a positive impact on the world around us. I love spending my time in [Your Location] and I am always looking for new challenges to try. I am a team player and I am eager to learn and grow as a person. Thank you. Welcome, [Interviewer Name]. I am excited to meet you and discuss the opportunities and challenges ahead as a [Job Title] in [Company Name

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as “La Ville-Marie,” the birthplace of the French Republic and the nation’s largest city, the second most populous city, and the largest metropolitan area in the world, with an estimated population of over one million. It is located on the Île de France, situated on the right bank of the Seine River, and is the oldest city in the world. The city has a rich and diverse cultural scene, including the iconic Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. Paris is also known for its art, music, and food scenes, as well as its medieval

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  set to be a highly dynamic and rapidly evolving field with a wide range of potential outcomes. Here are some possible trends that are expected to shape the AI landscape in the coming years:

1. Increased specialization and specializationization: As AI becomes more advanced, it will become possible to create AI systems that specialize in a particular task or application. This will enable AI systems to become highly proficient and adaptable, as they can learn and improve on their own.

2. Personalized AI: As AI technology continues to improve, it is likely that we will see a trend towards personalized AI, where AI systems are designed to provide specific and tailored solutions to individual
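
When streaming asynchronously, it is common to display chunks as they arrive while also keeping the full text for later use. The sketch below shows that pattern with an `async for` loop. `fake_token_stream` is a hypothetical stand-in for a streaming source such as the helper used above; it needs no model and simply yields space-prefixed words.

```python
import asyncio


async def fake_token_stream(text):
    """Hypothetical stand-in for a streaming call; yields one word per chunk."""
    for index, word in enumerate(text.split(" ")):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield word if index == 0 else " " + word


async def collect(stream):
    """Print chunks as they arrive and return the concatenated text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


full_text = asyncio.run(collect(fake_token_stream("Paris is the capital of France")))
```

Collecting chunks in a list and joining them once at the end avoids repeated string concatenation for long generations.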




In [6]:
llm.shutdown()
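
In a script, as opposed to a notebook, it can be worth making sure that `shutdown()` runs even if generation raises. The sketch below shows the usual `try`/`finally` pattern. `StubEngine` is a hypothetical stand-in that exposes only the `generate` and `shutdown` methods used in this document, and `run_and_shutdown` is a name chosen for this example.

```python
class StubEngine:
    """Hypothetical stand-in exposing the two methods used in this document."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


def run_and_shutdown(engine, prompts, sampling_params):
    """Generate once and always release the engine, even on failure."""
    try:
        return engine.generate(prompts, sampling_params)
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    run_and_shutdown(engine, [], {"temperature": 0.8})
except ValueError as exc:
    print(f"generation failed: {exc}")
print("engine shut down:", engine.is_shut_down)
```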