# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
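
To illustrate the idea of a custom server layered on an engine, here is a minimal sketch of the request-handling core only. It is not the linked `custom_server.py`: `StubEngine` and `handle_generate` are hypothetical names, and the stub merely imitates the `{"text": ...}` result shape that appears later in this document. A real server would pass an actual engine object and attach this function to a web framework of its choice.

```python
import json


class StubEngine:
    """Hypothetical stand-in for an inference engine; not part of SGLang."""

    def generate(self, prompt, sampling_params=None):
        # Imitate the {"text": ...} result shape used in this document.
        return {"text": f" (stub completion for: {prompt})"}


def handle_generate(engine, raw_body: bytes) -> bytes:
    """Turn a JSON request body into a JSON response using the engine."""
    payload = json.loads(raw_body)
    prompt = payload["prompt"]
    sampling_params = payload.get("sampling_params", {})
    output = engine.generate(prompt, sampling_params)
    return json.dumps({"text": output["text"]}).encode("utf-8")


response = handle_generate(StubEngine(), b'{"prompt": "Hello, my name is"}')
print(response.decode("utf-8"))
```

Keeping the handler a plain function that receives the engine as an argument makes it easy to test without loading a model.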



## Nest Asyncio
To use the **Offline Engine** in IPython, Jupyter, or any other environment that already runs an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()

```
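
The reason for this step can be reproduced with the standard library alone. The sketch below does not involve SGLang; it only shows that plain `asyncio.run()` refuses to start while another event loop is already running, which is the situation inside a notebook. The function names are illustrative.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    coro = inner()
    try:
        # asyncio.run() refuses to start while another loop is running.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return None


nested_error = asyncio.run(outer())
print(nested_error)
```

`nest_asyncio.apply()` patches asyncio so that such nested calls are permitted.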

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    # In CI, a local `patch` module is imported instead of nest_asyncio.
    import patch
else:
    # Allow asyncio.run() inside an already-running notebook event loop.
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Fyodor Laskin. I'm an assistant professor at the University of New South Wales, Australia, and my research focuses on moduli spaces and their applications in physics and number theory. My work includes the construction of moduli spaces of Calabi-Yau manifolds, the study of their Betti numbers, and their application to problems in string theory, topology, and quantum field theory.

I am particularly interested in the moduli space of Abelian varieties over a given ring $R$, and its connection to the geometry of Calabi-Yau manifolds. I am also interested in the study of singularities, such
Prompt: The president of the United States is
Generated text:  250 cm tall. His office is 20% taller than he is. The mayor of the town is four times taller than the president. How tall is the mayor of the town in meters? To determine the height of the mayor of the town, we need to follow these steps:

1. Calculate the height of the president's office.
2. Determ
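
The `temperature` and `top_p` entries in `sampling_params` shape how the next token is chosen. The sketch below shows the textbook definitions of temperature scaling and nucleus (top-p) filtering on a toy list of logits. It is an illustration only; that SGLang's sampler follows exactly these steps is an assumption, and `top_p_filter` is a hypothetical helper.

```python
import math


def top_p_filter(logits, temperature=0.8, top_p=0.95):
    """Return {token_index: probability} kept by temperature + nucleus filtering."""
    # Lower temperatures sharpen the distribution, higher ones flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalize over the kept tokens so the probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}


filtered = top_p_filter([2.0, 1.0, 0.1, -3.0], temperature=0.8, top_p=0.95)
print(filtered)
```

With these toy logits, the least likely token falls outside the 0.95 nucleus and is never sampled.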

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. 

This statement is factually correct and provides a clear and concise overview of the capital city's location and significance in French culture and politics. However, it could be expanded to include additional information about Paris's historical and cultural importance, such as its role in the French Revolution and its status as a UNESCO World Heritage site. For example:

"Paris, the capital of France, is a bustling metropolis with a rich history and cultural heritage. The city is home to the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral, among other iconic landmarks. Paris has been a center of French politics, culture

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be a greater emphasis on ethical considerations. This will include issues such as bias, privacy, and transparency. AI developers will need to be more mindful of the potential impact of their technology on society.

2. Integration with other technologies: AI is likely to become more integrated with other technologies, such as machine learning, natural language processing, and computer vision. This will allow for more complex and sophisticated AI systems
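
The cell above prints "overlap removal", which suggests that consecutive streamed chunks may repeat text that was already received. A minimal sketch of that idea follows. It is not the implementation of `stream_and_merge`; `trim_overlap` and `merge_chunks` are hypothetical helpers, and the chunk list is made up for illustration.

```python
def trim_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that already ends `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text repeated at each chunk boundary."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged = merge_chunks([" Paris", " Paris is", " is the capital", " capital of France."])
print(merged)
```

One caveat of this simple approach: it cannot tell a repeated boundary apart from text that legitimately repeats, so a chunk that genuinely starts with the previous ending would be shortened.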



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [type] [skill] who has a lot to offer. If you have any questions or would like to discuss [something], feel free to ask. I'm excited to help you. [Name] (optional) Feel free to introduce yourself to anyone you meet while you're here. [Name] (optional) Goodbye! [Name] (optional) Have a great day! What's your name, [Name]? Let me know if there's anything specific you'd like to talk about. How can I assist you today? [Name] (optional) Keep it short, this is your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is known for its rich history, stunning architecture, and vibrant culture. It is located in the western region of France and is the country's largest city and the second most populous city. The city has a diverse population and is home to some of the world's most famous landma
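
Because the asynchronous call is awaitable, several batches can be submitted concurrently from one coroutine. The sketch below shows the pattern with `asyncio.gather` and a stand-in engine so that it runs without a model. `StubAsyncEngine` and `run_batches` are hypothetical; only the method name `async_generate` and the `{"text": ...}` result shape are taken from this document.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in; only mimics an awaitable batch call."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion {i}>"} for i, _ in enumerate(prompts)]


async def run_batches(engine, batches, sampling_params):
    # Submit every batch at once; gather preserves the order of `batches`.
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(run_batches(StubAsyncEngine(), batches, {"temperature": 0.8}))
print([len(outputs) for outputs in results])
```

Whether concurrent submission helps in practice depends on how the engine schedules requests internally, so treat this as a pattern to measure rather than a guaranteed speedup.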

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [profession] who has been working in the [industry] industry for [number] years. My expertise lies in [specific area of expertise]. I'm currently [current role], and I'm always eager to learn and improve my skills.
I'm always looking to expand my knowledge and contribute to the success of [company] in [field]. What is the most interesting project you have in mind for me? Let's discuss it soon! I look forward to our next meeting. How can I help you today? Let's get to know each other better and see where our careers take us. What is

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is known for its rich history, beautiful architecture, and vibrant culture. It was founded in the 6th century and is home to many famous landmarks such as Notre-Dame Cathedral, Eiffel Tower, and the Louvre Museum. Paris is a city that is constantly changing and developing, with many new neighborhoods and tourist attractions popping up every year. It is also a hub for the French language and culture, with French being the official language and being an integral part of the French identity. The city is famous for its festivals, such as the Spring Festival, and its annual Notre-Dame Festival, which celebrates the rich

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by rapid advancements and significant changes in the way we live and work. Here are some possible future trends in artificial intelligence:

1. Increased automation: AI is expected to automate many of the jobs that are currently performed by humans, such as data entry, customer service, and maintenance. This could lead to increased efficiency and productivity, but it may also cause job displacement.

2. Improved cognitive abilities: AI is expected to continue improving its ability to process and analyze information, which could lead to more accurate predictions and better decision-making.

3. Personalization: AI is expected to become even more personalized, allowing humans to receive tailored

In [6]:
llm.shutdown()
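
In a script, it can be useful to collect streamed chunks into a single string and to make sure the engine is shut down even if generation fails. The sketch below shows that pattern with a stand-in object so that it runs without a model. `StubStreamingEngine`, its `stream` method, and `collect` are hypothetical; only the `shutdown()` call mirrors the cell above.

```python
import asyncio


class StubStreamingEngine:
    """Hypothetical stand-in that yields text chunks and tracks shutdown."""

    def __init__(self):
        self.is_shut_down = False

    async def stream(self, prompt):
        for chunk in [" Paris", ".", " It", " is", " known"]:
            await asyncio.sleep(0)  # yield control, as a real stream would
            yield chunk

    def shutdown(self):
        self.is_shut_down = True


async def collect(engine, prompt):
    # Accumulate chunks instead of printing each one as it arrives.
    parts = []
    async for chunk in engine.stream(prompt):
        parts.append(chunk)
    return "".join(parts)


engine = StubStreamingEngine()
try:
    text = asyncio.run(collect(engine, "The capital of France is"))
finally:
    engine.shutdown()  # runs even if generation raises

print(text)
```

Placing the shutdown call in a `finally` block keeps resource cleanup independent of whether the generation step succeeds.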