# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython, Jupyter, or any other code that already runs an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()
```
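The underlying issue is that `asyncio.run()` refuses to start while another event loop is already running in the same thread, which is the situation inside IPython and Jupyter. The stdlib-only sketch below reproduces that error without SGLang; the coroutine names `inner` and `outer` are illustrative and not part of any library.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are inside a running event loop here, as in an IPython/Jupyter cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call, rejected by stock asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Calling `nest_asyncio.apply()` beforehand patches asyncio so that nested calls of this kind are permitted.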

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    # Allow nested event loops when running inside a notebook
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# Pass the whole list at once; the engine returns one output dict per prompt
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Alex, a software developer and I'm a big fan of Django, a great framework for building web applications. I enjoy helping people learn about web development and technology in general, and also enjoy taking on challenges and helping others with projects. I'm currently a student in the Bachelor of Science in Computer Science program at the University of the Sciences in Philadelphia.

I'm always open to sharing my knowledge and knowledge of Django with others, and would love to learn more about how I can improve my Django development skills. I'm also interested in exploring open source projects, and have a particular interest in using Django with PostgreSQL.

I'm excited to hear from
Prompt: The president of the United States is
Generated text:  a what type of government? A. democratic
B. democratic
C. republic
D. autocratic
E. constitutional government

To determine the correct answer, let's analyze each option step by step:

A. Democratic: This 
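Each element of `outputs` is a dictionary whose `text` field holds the completion, as the loop in this section shows. For offline batch jobs it is often convenient to persist prompt/completion pairs. The following is a minimal sketch that uses stand-in output dictionaries instead of a live engine; the helper `save_results` and the file name `results.jsonl` are hypothetical and not part of SGLang.

```python
import json
import os
import tempfile


def save_results(prompts, outputs, path):
    """Write one JSON object per line, pairing each prompt with its completion."""
    with open(path, "w", encoding="utf-8") as f:
        for prompt, output in zip(prompts, outputs):
            record = {"prompt": prompt, "completion": output["text"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Stand-in data shaped like the result of llm.generate(prompts, sampling_params).
prompts = ["The capital of France is", "The future of AI is"]
outputs = [{"text": " Paris."}, {"text": " bright."}]

path = os.path.join(tempfile.mkdtemp(), "results.jsonl")
save_results(prompts, outputs, path)

# Read the file back to confirm the round trip.
with open(path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
print(records[0]["completion"])
```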

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    # stream_and_merge consumes the stream and returns the merged text
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Job Title] at [Company Name]. I'm excited to meet you and learn more about your interests and what you're looking for in a job. Let's chat! [Name] [Job Title] at [Company Name] [Company Name] is a [Company Description]. I'm always looking for opportunities to grow and learn, and I'm eager to contribute to the success of our team. What can I do for you? [Name] [Job Title] at [Company Name] [Company Name] is a [Company Description]. I'm always looking for opportunities to grow and learn,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a bustling metropolis with a rich history and a diverse population of over 10 million people. Paris is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum, as well as its vibrant arts scene and food culture. The city is also home to many world-renowned museums, including the Louvre and the Musée d'Orsay, and is a major center for fashion, art, and music. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. Its status

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we interact with technology and the world around us. Here are some of the most likely trends that could shape the future of AI:

1. Increased automation and robotics: As AI technology continues to advance, we are likely to see an increase in automation and robotics in various industries. This could lead to the creation of more efficient and productive machines that can perform tasks that were previously done by humans.

2. Improved privacy and security: As AI technology becomes more advanced, there will be an increased need for privacy and security measures to protect the data that is collected and used
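The "overlap removal" mentioned in this section refers to merging streamed chunks whose beginnings may repeat text that has already been received. The sketch below illustrates the general idea only; it is not the implementation behind `stream_and_merge`, and the helper names `strip_overlap` and `merge_chunks` are hypothetical.

```python
def strip_overlap(existing: str, chunk: str) -> str:
    """Return the part of `chunk` that does not repeat the tail of `existing`."""
    max_overlap = min(len(existing), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate chunks while dropping repeated prefixes."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


merged = merge_chunks(["The capital", " capital of", " of France"])
print(merged)
```

A suffix/prefix match like this is simple but quadratic in the worst case, which is usually acceptable for short streamed chunks.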



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    # Await the whole batch; one output dict is returned per prompt
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a/an [X] of [Y] years old, I have [X] years of experience in [X] industry. I have a passion for [X] and [X] and [X] in my personal life. I enjoy [X] and [X] and [X] in my daily life. I am a [X] personality type. I like to [X] and [X] and [X]. I am [X] and [X] and [X]. I am looking forward to [X] with [X] and [X]! I look forward to

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the center of the country and is the largest city and the most populous city in Europe, with a population of over 10 million people.

To create a new paragraph about the capital city of France, you would need to provide some additional context. For example, you might mention its name, location, and significance in French culture. Here's an example of how to structure a new paragraph about the capital ci
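One reason to prefer the asynchronous call is that other coroutines can make progress while generation is awaited. The sketch below demonstrates this with the standard library only; `fake_async_generate` is a hypothetical stand-in for `llm.async_generate` that sleeps instead of running a model, and `heartbeat` represents unrelated application work.

```python
import asyncio


async def fake_async_generate(prompts):
    # Hypothetical stand-in for llm.async_generate: waits, then returns dicts.
    await asyncio.sleep(0.05)
    return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(log):
    # Unrelated work that keeps running while generation is pending.
    for i in range(3):
        log.append(f"tick {i}")
        await asyncio.sleep(0.01)


async def main():
    log = []
    outputs, _ = await asyncio.gather(
        fake_async_generate(["a", "b"]),
        heartbeat(log),
    )
    return outputs, log


outputs, log = asyncio.run(main())
print(log)
print([o["text"] for o in outputs])
```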

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [age] year old [occupation] [job title]. I have been in the field of [field of study] for [number] years, and I have always been a person who [mention an attribute or trait of yourself]. I have always been a [any adjective or term] and [any additional traits]. I am a [any additional abilities or qualities]. I am passionate about [any hobby or interest], and I try to make the world a better place. I have always been inspired by [any famous person or event]. I am always eager to learn and always striving to improve [any

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

The following historical facts about France's capital city, Paris:

1. Paris was founded in 787 by Charlemagne, who gave it the name "La Grâce" (The Grace) in memory of his son Louis, then Holy Roman Emperor.
2. The city was an important center of learning and culture, and was home to many famous scholars and writers, including Shakespeare.
3. Paris is home to the Eiffel Tower, which was designed by Gustave Eiffel in 1889 and is considered one of the world's most famous structures. 

Paris is known for its iconic

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  bright, with many potential areas of development to keep us entertained, informed, and connected. Here are a few trends to watch:

1. More advanced natural language processing: As AI becomes more capable of processing language and recognizing patterns, we can expect more sophisticated and accurate language understanding and generation.

2. Increased AI in healthcare: AI can be used to improve the accuracy and efficiency of diagnoses, assist in treatment planning, and improve patient care.
3. Integration of AI into all aspects of life: As AI becomes more integrated into our daily lives, we can expect to see more widespread adoption of AI across various industries, such as education, finance
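The `async for` loop in this section prints each chunk as soon as it arrives, using `end=""` and `flush=True` so the pieces appear as continuous text. The consumption pattern can be tried without a model; in the sketch below, `fake_token_stream` is a hypothetical async generator standing in for `async_stream_and_merge`.

```python
import asyncio


async def fake_token_stream(text):
    # Hypothetical stand-in for a streaming engine call: yields word by word.
    for token in text.split(" "):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield token + " "


async def main():
    pieces = []
    async for piece in fake_token_stream("Paris is the capital"):
        print(piece, end="", flush=True)  # show each chunk immediately
        pieces.append(piece)
    print()  # finish the line once the stream ends
    return "".join(pieces).strip()


full_text = asyncio.run(main())
```

Collecting the pieces alongside printing them keeps the full completion available once the stream is exhausted.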




```python
# Shut down the engine when finished
llm.shutdown()
```