# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you use the **Offline Engine** in IPython, Jupyter, or any other code that already runs inside an event loop, add the following before creating the engine:
```python
import nest_asyncio

nest_asyncio.apply()

```
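
The background for this note is that plain `asyncio` refuses to start an event loop inside one that is already running, which is the situation in an IPython or Jupyter cell. The sketch below reproduces that restriction using only the standard library; `inner` and `outer` are illustrative names and no SGLang code is involved. That the engine hits this restriction internally is an inference from the note above, not something shown here.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # At this point an event loop is already running, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # attempts to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches `asyncio` so that such nested calls are allowed.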

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Emily and I am currently 22 years old. I am from Canada and I am a [Grade 12 Biology] student. I have always been an avid gardener and love to learn new things about plants and the environment. I have a strong interest in environmental science and I am always looking for new ways to make my garden more eco-friendly. I also enjoy spending time in nature and taking photos of the beauty of the outdoors. My favorite hobby is taking photos of the flora and fauna in my garden. I am very happy with my current life and am excited to see what the future holds. Is there anything I can help
Prompt: The president of the United States is
Generated text:  22 years older than the president of Brazil. The president of Brazil is 20 years younger than the president of the United States. How old are the three of them, in order, if the president of Brazil is currently 40 years old?
Let's denote the age of the president of the United States as \( U \), the preside
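
In an offline batch job the results usually need to be stored rather than printed. The following is a minimal sketch of pairing each prompt with its generated text and serializing the pairs as JSON Lines. It assumes only what the cell above shows, namely that each output is a dict with a `text` key and that outputs come back in prompt order. `to_jsonl`, `fake_prompts`, and `fake_outputs` are hypothetical names; the fake outputs stand in for the return value of `llm.generate` so the snippet runs without a model.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text and serialize as JSON Lines."""
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Stand-in for the list returned by llm.generate(prompts, sampling_params).
fake_prompts = ["The capital of France is", "The future of AI is"]
fake_outputs = [{"text": " Paris."}, {"text": " uncertain."}]

jsonl = to_jsonl(fake_prompts, fake_outputs)
print(jsonl)
```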

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is the largest city in France and the second-largest city in the European Union. Paris is known for its rich history, beautiful architecture, and vibrant culture. It is also a major financial center and a major tourist destination. Paris is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city is also known for its cuisine, including French cuisine, and its wine production. Paris is a city that has a rich history and a unique culture that is enjoyed by people from all over the world. It is a city that is constantly evolving and changing, but it

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased focus on ethical AI: As more people become aware of the potential risks of AI, there is a growing emphasis on developing AI that is more ethical and responsible. This could include developing AI that is designed to minimize harm to individuals and society as a whole.

2. Greater integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior. This could lead to more sophisticated forms of AI that are able to understand and respond
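
The "overlap removal" mentioned in the cell above means dropping any text that a newly received chunk repeats from what has already been accumulated. The sketch below illustrates that idea in isolation; it is not necessarily how `stream_and_merge` is implemented. `strip_overlap` and `merge_chunks` are helpers defined here for illustration, and the chunk list is made up.

```python
def strip_overlap(existing: str, chunk: str) -> str:
    """Return `chunk` without the longest prefix that `existing` already ends with."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text repeated at each chunk boundary."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Made-up chunks in which some pieces repeat the tail of the previous text.
chunks = [" Paris", " Paris is the", " the capital", " of France."]
merged = merge_chunks(chunks)
print(repr(merged))
```

The longest-match-first loop keeps the helper simple; it is quadratic in the chunk length, which is acceptable for short streamed chunks.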



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [Job Title] at [Company Name]. I have a knack for [adjective describing my abilities or skills]. I thrive on [reason why I enjoy this job or career]. What brings you to this position? I'm excited to dive into [reason for the job or career].
Hello, my name is [Name] and I'm a [Job Title] at [Company Name]. I have a knack for [adjective describing my abilities or skills]. I thrive on [reason why I enjoy this job or career]. What brings you to this position? I'm excited to dive into [reason for

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

Explanation for reasoning: Paris is the capital of France, the country's largest city, and the seat of government, culture, and religion. Its extensive streets, well-preserved historic areas, and unique architecture are all typical of the French cap
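
When prompts arrive one at a time rather than as a ready-made list, the asynchronous API can be combined with standard `asyncio` tools. The sketch below shows one possible pattern: issuing requests concurrently while bounding the number in flight with a semaphore. `fake_async_generate` is a hypothetical stand-in for `llm.async_generate` so that the snippet runs without a model, and it assumes a single prompt yields a single dict with a `text` key. `generate_bounded` and the limit of 2 are illustrative choices.

```python
import asyncio


async def fake_async_generate(prompt, sampling_params):
    # Hypothetical stand-in for `llm.async_generate`; returns a dict with a "text" key.
    await asyncio.sleep(0.01)
    return {"text": f" <completion for: {prompt}>"}


async def generate_bounded(prompts, sampling_params, limit=2):
    semaphore = asyncio.Semaphore(limit)

    async def one(prompt):
        # At most `limit` requests are awaited at the same time.
        async with semaphore:
            return await fake_async_generate(prompt, sampling_params)

    # asyncio.gather returns results in the same order as its arguments.
    return await asyncio.gather(*(one(p) for p in prompts))


prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
results = asyncio.run(generate_bounded(prompts, {"temperature": 0.8, "top_p": 0.95}))
for prompt, result in zip(prompts, results):
    print(prompt, "->", result["text"])
```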

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I am a [Your Age] year old [Your Occupation]. I have lived in the [Your City/Town] for [Your Duration]. I come from [Your Country], where I grew up and have always been fascinated by [Your Hobby/Interest]. How would you like to introduce yourself? Sure, here's a short, neutral self-introduction for a fictional character:
My name is [Your Name], and I am a [Your Age] year old [Your Occupation]. I have lived in the [Your City/Town] for [Your Duration], and I come from [Your Country], where I grew

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, and it is also the largest city in the country.
Paris is also the largest city in the world by population at 2,174,826 as of 2023. The city is located in the Seine-Maritime region in the south of France. It is known for its historic sites, iconic landmarks, and world-renowned museums. Paris is often referred to as the "City of Light" and is a major tourist attraction. The city is home to the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is a hub for culture, art, fashion

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by rapid development, innovation, and integration of AI technologies into various industries and domains. Here are some potential trends that are expected to shape the AI landscape in the coming years:

1. Increased automation and robotics: AI is already becoming increasingly prevalent in many industries, from manufacturing and healthcare to customer service and transportation. However, as AI technology continues to improve, we are likely to see more automation and robotics becoming more integrated into our daily lives, leading to increased efficiency, cost savings, and personal freedom.

2. Improved cognitive functions: AI technologies are likely to continue to advance in areas such as speech recognition, machine learning,

In [6]:
llm.shutdown()
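
To make sure `shutdown()` runs even when generation raises an exception, one option is to wrap the engine in a small context manager. This is a sketch of that pattern, not a documented SGLang feature. `engine_session` is a hypothetical helper, and `FakeEngine` is a stand-in for `sgl.Engine` that exists only so the snippet runs without a GPU.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine, used only so this sketch runs anywhere."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(engine):
    try:
        yield engine
    finally:
        engine.shutdown()  # runs on normal exit and on exceptions


engine = FakeEngine()
with engine_session(engine) as llm_like:
    outputs = llm_like.generate(["Hello, my name is"], {"temperature": 0.8})

print(engine.is_shut_down)
```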