# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
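
To make the idea of a custom server concrete, here is a minimal sketch of the request-handling layer such a server might contain. It is not taken from the linked `custom_server.py`, which may be structured differently. `EchoEngine` and `handle_generate` are hypothetical names, and the stub only mirrors the `generate(prompt, sampling_params)` call and the `{"text": ...}` output shape used elsewhere in this document so that the sketch runs without a model.

```python
import json


class EchoEngine:
    """Hypothetical stand-in for the engine so the sketch runs without a model."""

    def generate(self, prompt, sampling_params=None):
        # Assumption: a single prompt yields a dict with a "text" field.
        return {"text": f" (echo of: {prompt})"}


def handle_generate(engine, raw_body: bytes):
    """Turn a JSON request body into a (status_code, response_bytes) pair."""
    try:
        payload = json.loads(raw_body)
        prompt = payload["prompt"]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, a missing field, or a non-object body.
        error = {"error": "expected a JSON object with a 'prompt' field"}
        return 400, json.dumps(error).encode()

    output = engine.generate(prompt, payload.get("sampling_params", {}))
    return 200, json.dumps({"text": output["text"]}).encode()


status, body = handle_generate(EchoEngine(), b'{"prompt": "Hello"}')
print(status, body.decode())
```

Keeping the handler as a plain function makes it easy to plug into whichever web framework the custom server uses, and to test it without starting a server.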



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in any other code that already runs an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
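
The reason is that environments such as IPython already run an event loop, and the standard library refuses to start a second one from inside it. The following stdlib-only sketch reproduces that failure without `nest_asyncio`; the function names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, which is the
    # situation a notebook cell is in.
    coro = inner()
    try:
        asyncio.run(coro)  # attempts to start a nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches `asyncio` so that this kind of nested call is allowed.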

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Lily and I'm from the United States. I study a very interesting subject called Geography. I'm in a middle school right now. But I'm not in this school right now. I'm in a school called the University of California, San Diego. This school is really good, but I'm not going there yet. I think it's a great place to study. Geography is a great subject for kids like me. It's not very hard and it helps us learn about places on the Earth. I'm in Geography class right now. I'm in a middle school. I'm a middle school student. We are learning about the
Prompt: The president of the United States is
Generated text:  a very important person in the government of the United States. He or she is the leader of the country. The president is also known as the chief executive officer, chief executive and chief executive officer. The president is elected by the citizens. The president usually serves for four-year terms. The president’s role is to make decisions for
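
In an offline batch job the generations are usually persisted rather than printed. The following is a minimal sketch of one way to do that, assuming each output is a dict with a `"text"` key as in the loop above. `to_jsonl` and the sample outputs are hypothetical and stand in for a real `llm.generate` result.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text, one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "text": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Stand-in data shaped like the outputs consumed in the loop above.
sample_prompts = ["Hello, my name is", "The capital of France is"]
sample_outputs = [{"text": " Lily."}, {"text": " Paris."}]

jsonl = to_jsonl(sample_prompts, sample_outputs)
print(jsonl)
```

Writing one record per line keeps the file appendable and easy to process in a streaming fashion later.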

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I am a [Age] year old [Gender] [Occupation]. I am a [Skill or特长] who has always been [Positive or Negative] about [Reason for being positive or negative]. I am [Positive or Negative] about [Reason for being positive or negative]. I am [Positive or Negative] about [Reason for being positive or negative]. I am [Positive or Negative] about [Reason for being positive or negative]. I am [Positive or Negative] about [Reason for being positive or negative]. I am [Positive or Negative] about [Reason for being positive or negative]. I am [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light, and is the largest city in Europe by population. It is located in the south of France and is the seat of government, administration, and culture for the country. Paris is known for its rich history, art, and cuisine, and is a major tourist destination. The city is also home to many famous landmarks, including the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is a vibrant and dynamic city with a rich cultural heritage that continues to inspire and influence the city and its people. The city is also home to many important organizations and institutions, including

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more sophisticated, it is likely to become more integrated with human intelligence, allowing it to learn and adapt in ways that are difficult for humans to do. This could lead to more efficient and effective decision-making, as well as more personalized and context-aware interactions.

2. Greater emphasis on ethical and social considerations: As AI becomes more integrated with human intelligence, there will be a greater emphasis on ethical and social considerations. This could lead to more responsible and accountable development of AI, as well as more effective regulation and oversight.

3. Increased focus on
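
The "overlap removal" mentioned in the banner above can be read as follows: successive streamed chunks may repeat text that has already been received, so only the new part of each chunk should be appended. The sketch below illustrates that idea on plain strings. It is an illustration under that assumption, not necessarily how `stream_and_merge` is implemented, and `trim_overlap_sketch` and `merge_chunks` are hypothetical names.

```python
def trim_overlap_sketch(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks while skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


merged = merge_chunks(["Hello", "lo wor", "world"])
print(merged)
```

Note that a suffix-based check like this cannot tell a true repeat from text that legitimately repeats at a chunk boundary, which is one reason to treat it as a sketch.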



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  John. I am a software engineer with a deep passion for AI and machine learning. I have a natural curiosity and a never-ending thirst for knowledge in the field. I am always learning and seeking to improve myself in order to stay ahead of the curve. I believe that AI is the future and I am excited to help shape it with my expertise. What is your name, and what brings you here? Your description could add depth and personality to my character. What would you like to be known as? As an AI assistant, I am a powerful tool with endless knowledge to assist you. What would you like to assist you with? As

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city is known for its rich history, stunning architecture, and lively culture.

That's correct! Paris is the capital city of France, and it's famous for its i
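
One benefit of an awaitable API is that generation can run alongside other coroutines. The sketch below shows a common pattern for issuing several requests concurrently with a bounded number in flight. `FakeAsyncEngine` is a hypothetical stub that only mimics the `async_generate(prompt, sampling_params)` call shape so the example runs without a model, and `generate_concurrently` and `max_concurrency` are illustrative names, not part of SGLang.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stub standing in for the engine's async API."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate generation latency
        return {"text": f" <completion for {prompt!r}>"}


async def generate_concurrently(engine, prompts, sampling_params, max_concurrency=2):
    """Run one request per prompt, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(prompt):
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # gather returns results in the same order as the prompts.
    return await asyncio.gather(*(one(p) for p in prompts))


outputs = asyncio.run(
    generate_concurrently(FakeAsyncEngine(), ["a", "b", "c"], {"temperature": 0.8})
)
for output in outputs:
    print(output["text"])
```

When all prompts are known up front, passing the whole list in a single call, as in the cell above, is simpler; the bounded pattern is mainly useful when requests arrive over time.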

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I am [Age], and I have always been fascinated by [Subject], which has been a driving force in my life. I'm a [career] and I always aim to [What I hope to achieve]. I love [occupation], and I'm always looking for [new experiences], [what else you like to do], and [what else you enjoy doing]. What's your name? What's your age? What subject do you like to learn? What career do you want to pursue? What do you enjoy doing for fun? What are your hobbies? How do you stay motivated to achieve your goals? What is

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

The capital of France is Paris. It is the largest city in the European Union and is the seat of government, culture, and diplomacy for France. It is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum. The city is also famous for its cuisine and fashion. Paris is a cultural melting pot of diverse influences, including French, German, Italian, and British, and is home to many of the world's major museums, galleries, and theaters. The city is also the birthplace of many famous figures such as Napoleon Bonaparte and Marie Antoinette.


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  diverse and evolving. Some possible trends in the field include:

1. Increased automation: AI will become even more automated and efficient, automating many tasks that were previously done by humans. This could lead to job displacement but also create new job opportunities.

2. Improved privacy and security: AI will continue to rely on data and algorithms, which may lead to more privacy and security issues. However, advancements in AI will also lead to new tools for ensuring the security and privacy of data.

3. Personalization: AI will become even more personalized, allowing for more accurate predictions and recommendations for users. This could lead to improved user experience and satisfaction
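
The streaming loop above prints each chunk as it arrives. If the full text is also needed afterwards, the chunks can be accumulated while they are displayed. The following is a minimal sketch of that pattern using a stand-in async generator; `fake_chunk_stream` and `collect_stream` are hypothetical names, and the stub replaces the real engine stream so the example runs without a model.

```python
import asyncio
from typing import AsyncIterator, Callable, List, Optional


async def fake_chunk_stream(chunks: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in that yields already-cleaned text chunks."""
    for chunk in chunks:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield chunk


async def collect_stream(
    stream: AsyncIterator[str],
    on_chunk: Optional[Callable[[str], None]] = None,
) -> str:
    """Consume an async stream, optionally reporting each chunk, and return the full text."""
    parts = []
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk)
        parts.append(chunk)
    return "".join(parts)


full_text = asyncio.run(
    collect_stream(
        fake_chunk_stream([" Paris", ".", " It", " is"]),
        on_chunk=lambda c: print(c, end="", flush=True),
    )
)
print()
```

Collecting the pieces in a list and joining once at the end avoids repeated string concatenation for long generations.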




```python
llm.shutdown()
```