# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), you need to add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
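To see why this matters, the following minimal sketch uses only the standard library to reproduce the error that occurs when `asyncio.run()` is called while an event loop is already running, as happens inside a notebook cell. The names `inner`, `outer`, and `nested_result` are illustrative and not part of SGLang.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # A loop is already running here, much like inside a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second, nested event loop
    except RuntimeError as exc:
        coro.close()  # tidy up the coroutine that never ran
        return f"RuntimeError: {exc}"
    return "no error"


nested_result = asyncio.run(outer())
print(nested_result)
```

Applying `nest_asyncio` before such a call patches asyncio so that the nested call is permitted, which is why the notebook cells in this document can call `asyncio.run(main())` directly.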

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    # Only imported when the notebook runs in CI.
    import patch
else:
    # Allow nested event loops so that asyncio.run() works inside a notebook.
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.98it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Sara and I'm 15 years old. My name is very important to me because I want to be respected and loved by all my friends and family. I think it's very important to show that I have value and worth, and my name is a way to do that. I'm a good reader, and I enjoy reading books about animals. My first book is called "Dogs Are Your Friends. " I'm really excited to share it with you because I'm a little nervous about this and want to be sure everything will be okay. Please let me know if there's anything I can do to make you feel comfortable.
Prompt: The president of the United States is
Generated text:  30 years older than the president of Central America. The president of Central America is half the age of the president of Asia. The president of Asia is twice the age of the president of Europe. If the president of Europe is 100 years old, how old is the president of Central America?
To determine the age of the president of Central America, we need t
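
The cell above pairs each prompt with its result through `zip` and reads the completion from the `"text"` key. For offline batch jobs it is often convenient to persist those pairs. Below is a minimal sketch, assuming only that each output is a dict with a `"text"` entry as shown above; `to_jsonl` is a hypothetical helper, and the sample data are placeholders, not model output.

```python
import json


def to_jsonl(prompts, outputs):
    """Serialize prompt/completion pairs as one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = [
        json.dumps({"prompt": p, "completion": o["text"]}, ensure_ascii=False)
        for p, o in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Placeholder data shaped like the result of llm.generate(...).
sample_prompts = ["The capital of France is"]
sample_outputs = [{"text": " <completion>"}]

jsonl = to_jsonl(sample_prompts, sample_outputs)
print(jsonl)
```

With a real engine, `to_jsonl(prompts, outputs)` could be called directly on the variables from the cell above and the result written to a file.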

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short, positive description of your personality or skills]. And what's your favorite hobby or activity? I'm always looking for new experiences and adventures, so I enjoy [insert a short, positive description of your hobby or activity]. And what's your favorite book or movie? I love reading and watching movies, and I'm always looking for new recommendations. And what's your favorite color? I love [insert a short, positive description

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting numerous museums, theaters, and other attractions. Paris is a popular tourist destination and a major hub for international business and diplomacy. The city is known for its rich history, including the influence of the French Revolution and the influence of the French Revolution on modern French culture. Paris is also home to many famous French artists, writers, and musicians. The city is a major center for the arts, with many museums, theaters, and other cultural institutions.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more advanced, it is likely to become more integrated with human intelligence, allowing it to learn from and adapt to human behavior and decision-making processes.

2. Greater emphasis on ethical considerations: As AI becomes more advanced, there will be a greater emphasis on ethical considerations, including issues such as bias, transparency, and accountability.

3. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI becomes more advanced, it is likely to be used in more complex and personalized ways, with
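
The cell's banner mentions "overlap removal": when consecutive streamed chunks repeat text that has already been emitted, the repeated part should be dropped before merging. How `stream_and_merge` does this internally is not shown in this document; the sketch below is one possible approach, with `drop_overlap` and `merge_chunks` as hypothetical helper names.

```python
def drop_overlap(existing_text: str, new_chunk: str) -> str:
    """Return new_chunk without the longest prefix that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first so the first match is the maximal one.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text that repeats the end of what came before."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


merged_text = merge_chunks(["Paris is", " is the", " the capital"])
print(merged_text)
```

Note that a suffix/prefix heuristic like this cannot tell accidental overlap from intentional repetition, so a chunk that legitimately repeats the preceding text would also be trimmed.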



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I am a [Occupation], and I have been in the [Field of Interest] for [Number of Years]. I have a passion for [Reason for Passion], and I am always seeking to [Opinion or Action]. I believe that [Reason for Passion] is [Why it Matters], and that passion has driven me to [How Passion Has Impacted Me]. I am a [Skill or Personality] who is always [Why I Am Important]. I am [Hobbies or Interests] and I enjoy [Why I Enjoy My Interests]. I am a [Personality] who has a [Why I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city and the heart of the nation. The city is known for its historical landmarks, vibrant cultural scene, and rich French culture. It is a major transportation hub and an important economic center. Paris is home to the Eiffel Tower and the Louvre Museum, among other 
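
Because `async_generate` is awaited, the event loop is free to run other coroutines while a batch is being processed. The sketch below illustrates that pattern with `asyncio.gather`, using a stand-in coroutine instead of the engine so that it runs anywhere. `fake_async_generate` and `run_batches` are hypothetical names, and whether issuing concurrent calls helps with a real engine is not covered here.

```python
import asyncio


async def fake_async_generate(prompts, delay=0.01):
    """Stand-in for an awaitable batch call; returns one dict per prompt."""
    await asyncio.sleep(delay)  # simulates time spent waiting for results
    return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def run_batches(batches):
    # gather() schedules all calls at once and returns results in input order.
    return await asyncio.gather(*(fake_async_generate(b) for b in batches))


batch_results = asyncio.run(run_batches([["a", "b"], ["c"]]))
for batch in batch_results:
    print([item["text"] for item in batch])
```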

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly.
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I'm a [Your Profession/Interest]. As a [Your Profession/Interest], I enjoy [Your Passion or Hobby]. I'm [Your Age], and I'm currently living and working in [Your Current Location]. I'm an [Your Education Level] graduate with a [Your Degree] in [Your Major], and I have [Your Relevant Experience] to my name. I'm dedicated to [Your Mission], and I believe in [Your Values]. I strive to [Your Virtue], and I'm always [Your Characteristic]. I enjoy [Your Interests/Activities] and I love to [Your


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light.
The statement is: Paris, also known as the City of Light, is the capital of France.

This statement encapsulates the key facts about Paris, including its location as the capital, its nickname "City of Light," and its cultural significance in French history and culture. The use of the term "City of Light" directly refers to its status as the center of French culture and urban life. While the official name for the city is Paris, this informal nickname is often used as a more personal or informal way to refer to the city. Paris is also notable for its significant contributions to


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be characterized by a combination of significant progress, rapid adoption, and a focus on ethical and social implications. Here are some key trends in AI that could shape the future:

1. Enhanced Understanding of Human Cognitive Processes: AI researchers are developing new algorithms and machine learning techniques that can better understand and simulate the human brain's cognitive processes. This could lead to more accurate and personalized recommendations and diagnoses, and could lead to breakthroughs in areas like emotional intelligence and social cognition.

2. Improved AI for Health Care: AI-powered tools are already being used in healthcare to improve diagnosis and treatment, but there is still a lot of room for improvement
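
The streaming cell prints each chunk as it arrives. If the complete text is also needed afterwards, the chunks can be accumulated while they are printed. Here is a minimal sketch of that consumer pattern, using a stand-in async generator in place of `async_stream_and_merge`; `fake_chunk_stream` and `collect` are hypothetical names.

```python
import asyncio


async def fake_chunk_stream(text, size=4):
    """Stand-in async generator that yields text in small, non-overlapping chunks."""
    for start in range(0, len(text), size):
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield text[start:start + size]


async def collect(stream):
    """Print chunks as they arrive and return the full concatenated text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # newline once the stream ends
    return "".join(parts)


full_text = asyncio.run(collect(fake_chunk_stream("The capital of France is Paris.")))
```

Collecting chunks in a list and joining once at the end avoids repeated string concatenation on long generations.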




In [6]:
llm.shutdown()