# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
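
As a rough illustration of the custom-server idea, the sketch below shows only a framework-agnostic request handler that forwards a payload to an engine. It assumes the engine exposes `async_generate(prompt, sampling_params)` and returns a dict with a `text` key, as the examples in this document do. `FakeEngine`, `handle_generate`, and the payload field names are hypothetical and are not taken from the linked script.

```python
import asyncio
from typing import Any, Dict


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a model."""

    async def async_generate(self, prompt: str, sampling_params: Dict[str, Any]) -> Dict[str, Any]:
        await asyncio.sleep(0)  # yield control, as a real engine call would
        return {"text": f" (echo of: {prompt})"}


async def handle_generate(engine: Any, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Validate a request payload and forward it to the engine."""
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt:
        return {"status": 400, "error": "'prompt' must be a non-empty string"}
    # Assumed default; callers may pass their own sampling parameters.
    sampling_params = payload.get("sampling_params", {"temperature": 0.8, "top_p": 0.95})
    output = await engine.async_generate(prompt, sampling_params)
    return {"status": 200, "text": output["text"]}


ok_response = asyncio.run(handle_generate(FakeEngine(), {"prompt": "The capital of France is"}))
bad_response = asyncio.run(handle_generate(FakeEngine(), {}))
print(ok_response)
print(bad_response)
```

In a real server, a route of whichever HTTP framework you choose would call `handle_generate` with an engine created via `sgl.Engine(...)` instead of the stub.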



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
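
To see why this is needed, the stdlib-only sketch below reproduces the underlying failure: calling `asyncio.run()` while an event loop is already running raises `RuntimeError`. Notebook kernels typically execute cells inside such a loop, and `nest_asyncio.apply()` patches asyncio so the nested call is allowed. The function names here are illustrative only.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are already inside a running loop here, similar to a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: not allowed without nest_asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```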

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Kat. I'm the new student at a new school. There are about 200 students in the school and I don't know what I'm doing. The classroom is not big and I don't know where to sit. It's not far from our school bus stop, but I'm still not sure where it is. What should I do? The answer to this question is: I need to find a teacher. First of all, it's not a good idea to sit next to someone. If I sit next to a teacher, I might not know which class he or she teaches. We should try to sit close to
Prompt: The president of the United States is
Generated text:  34 years younger than the president of Brazil. The president of Brazil is 3/4 times younger than the president of China. If the president of China is 40 years old, how old is the president of the United States?
Let's start by identifying the ages given in the problem.

1. The president of China is 40 years old.
2. The president of Brazil is \(\frac{3}{4}\) times younger than the president of China. Si
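
When the prompt list is much longer than the four entries above, one option is to submit it in fixed-size chunks so that results can be written out incrementally. The sketch below assumes only what the cell above shows: `generate(prompts, sampling_params)` returns one dict with a `text` key per prompt, in order. `FakeEngine`, `generate_in_chunks`, and the chunk size are hypothetical; whether chunking helps at all depends on the workload, since the engine already schedules a batch internally.

```python
from typing import Any, Dict, Iterator, List, Tuple


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine; returns one dict per prompt."""

    def generate(self, prompts: List[str], sampling_params: Dict[str, Any]) -> List[Dict[str, Any]]:
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


def generate_in_chunks(
    engine: Any,
    prompts: List[str],
    sampling_params: Dict[str, Any],
    chunk_size: int,
) -> Iterator[Tuple[str, str]]:
    """Yield (prompt, generated_text) pairs, submitting chunk_size prompts at a time."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start : start + chunk_size]
        outputs = engine.generate(chunk, sampling_params)
        for prompt, output in zip(chunk, outputs):
            yield prompt, output["text"]


many_prompts = [f"Prompt number {i}:" for i in range(5)]
results = list(
    generate_in_chunks(FakeEngine(), many_prompts, {"temperature": 0.8, "top_p": 0.95}, chunk_size=2)
)
for prompt, text in results:
    print(f"Prompt: {prompt}\nGenerated text: {text}")
```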

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as "La Ville de Paris" and "La Ville de la Rose". It is the largest city in France and the third largest in the world, with a population of over 2 million people. Paris is famous for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Arc de Triomphe. The city is also known for its rich history, including the Roman Empire, French Revolution, and French Revolution. Paris is a cultural and artistic center, with many museums, theaters, and galleries. It is also a major transportation hub, with many major airports and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that could be expected in the future:

1. Increased integration of AI into everyday life: As AI becomes more integrated into our daily lives, we can expect to see more widespread adoption of AI technologies. This could include things like voice assistants, self-driving cars, and even more advanced forms of AI that can perform tasks that were previously thought to be beyond the capabilities of human intelligence.

2. Greater emphasis on ethical and responsible AI: As AI becomes more integrated into our daily lives, there
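
The "overlap removal" mentioned in the banner refers to merging streamed chunks whose text may repeat part of what was already received. The sketch below illustrates that general idea with a longest suffix/prefix match. It is not the code of `stream_and_merge`; the function names and sample chunks are made up for illustration.

```python
from typing import Iterable


def trim_overlap_sketch(existing: str, new_chunk: str) -> str:
    """Return new_chunk without the longest prefix that repeats the end of existing."""
    max_overlap = min(len(existing), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, dropping text that overlaps with what is already merged."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


sample_chunks = [" Paris", " Paris, also", " also known as", " the City of Light."]
merged = merge_chunks(sample_chunks)
print(repr(merged))
```

A heuristic like this cannot tell a repeated chunk from text that legitimately repeats (for example `"ha"` followed by `"ha"`), so it suits streams where overlap is expected rather than arbitrary text.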



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm a software engineer with a focus on developing highly effective and efficient software solutions. I'm constantly seeking out new ways to improve software design and development, and I'm passionate about using my skills to create systems that are both reliable and user-friendly. [Name] enjoys using my tools and services to design and develop high-quality software solutions. Thank you for considering my career and expertise. [Name] This is [Your Name]. How can I assist you today?
Hello, my name is [Name]. I'm a software engineer with a focus on developing highly effective and efficient software solutions. I'm constantly seeking out new ways

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

Explain the significance of Paris in French history and culture. 

Discuss the architectural, political,
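
The cell above awaits a single `async_generate` call for the whole prompt list. If each prompt has to be handled as its own request, for instance because results feed separate downstream tasks, a pattern such as the following can bound how many requests are in flight. It assumes `async_generate` also accepts a single prompt string, as in the streaming examples; `FakeEngine`, `generate_concurrently`, and `max_in_flight` are hypothetical names.

```python
import asyncio
from typing import Any, Dict, List, Tuple


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine with an awaitable generate call."""

    async def async_generate(self, prompt: str, sampling_params: Dict[str, Any]) -> Dict[str, Any]:
        await asyncio.sleep(0.01)  # simulate generation latency
        return {"text": f" <completion for {prompt!r}>"}


async def generate_concurrently(
    engine: Any,
    prompts: List[str],
    sampling_params: Dict[str, Any],
    max_in_flight: int = 2,
) -> List[Tuple[str, str]]:
    """Run one request per prompt, with at most max_in_flight running at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def run_one(prompt: str) -> Tuple[str, str]:
        async with semaphore:
            output = await engine.async_generate(prompt, sampling_params)
            return prompt, output["text"]

    # gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(run_one(p) for p in prompts))


demo_prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
pairs = asyncio.run(
    generate_concurrently(FakeEngine(), demo_prompts, {"temperature": 0.8, "top_p": 0.95})
)
for prompt, text in pairs:
    print(f"Prompt: {prompt}\nGenerated text: {text}")
```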

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name] and I'm an [occupation], [name] [job title]. I'm [job title] and I'm here to share my personal experiences and stories with anyone who cares to listen. What can you tell me about yourself?
As an [occupation], [name] [job title], I have a passion for [occupation], [name] [job title] and I'm constantly learning and growing in my field. I'm always seeking out new ideas, experiences, and opportunities to improve myself and the world around me. How do you feel about the idea of growing and learning continuously?
[Name] [Occupation]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the capital city of France and is located in the northern part of the country. It is the largest city in France, as well as one of the largest cities in Europe. It is also the world's most populous urban area, with over 20 million inhabitants. Paris is known for its rich history, culture, art, and cuisine, and is a major tourist destination worldwide. The city is also home to numerous museums, theaters, and other attractions, and has been a cultural hub for centuries. Its significance as the capital of France has been recognized by various levels of government and its political institutions, including the European

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by rapid advancements, innovation, and integration into various industries. Some possible future trends in AI include:

1. Improved privacy and security: As more AI systems are deployed, they may become more vulnerable to hacking and data breaches. Therefore, it's important for AI systems to be designed with security in mind, and to incorporate robust privacy protections.

2. Increased automation: As AI systems become more complex and sophisticated, they may become increasingly autonomous and able to perform tasks on their own. This could have a significant impact on jobs, especially in areas that are currently being automated.

3. Greater integration with human decision-making: AI
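
If the streamed text is also needed as a whole once the stream ends, the chunks can be collected while they are displayed. The sketch below assumes only that the stream is an async iterator of text chunks, as the `async for` loop in the cell above suggests. `fake_chunk_stream` is a hypothetical stand-in for `async_stream_and_merge(llm, prompt, sampling_params)`, and `collect_stream` is an illustrative helper.

```python
import asyncio
from typing import AsyncIterator, Callable, List, Optional


async def fake_chunk_stream(text: str) -> AsyncIterator[str]:
    """Hypothetical chunk source: yields the text one word at a time."""
    for word in text.split(" "):
        await asyncio.sleep(0)  # give other tasks a chance to run
        yield word + " "


async def collect_stream(
    chunk_stream: AsyncIterator[str],
    on_chunk: Optional[Callable[[str], None]] = None,
) -> str:
    """Pass each chunk to on_chunk as it arrives and return the full text."""
    parts: List[str] = []
    async for chunk in chunk_stream:
        if on_chunk is not None:
            on_chunk(chunk)
        parts.append(chunk)
    return "".join(parts)


seen: List[str] = []
full_text = asyncio.run(
    collect_stream(fake_chunk_stream("Paris is the capital"), on_chunk=seen.append)
)
print(repr(full_text))
```

Passing `lambda c: print(c, end="", flush=True)` as `on_chunk` would reproduce the incremental printing used in the cell above.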




In [6]:
llm.shutdown()
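
In a script, as opposed to a notebook, it can be useful to guarantee that `shutdown()` runs even when generation raises. A minimal sketch using a context manager is shown below; `managed_engine` and `FakeEngine` are hypothetical helpers, and only the `shutdown()` method name comes from the cell above.

```python
from contextlib import contextmanager
from typing import Any, Iterator


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine that records whether it was shut down."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def shutdown(self) -> None:
        self.is_shut_down = True


@contextmanager
def managed_engine(engine: Any) -> Iterator[Any]:
    """Yield the engine and always call shutdown() on exit, even after an error."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with managed_engine(engine) as active_engine:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(engine.is_shut_down)
```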