# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
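
The reason a patch is needed can be reproduced with the standard library alone. The sketch below is only an illustration and does not use SGLang; `inner` and `outer` are hypothetical names. It shows that `asyncio.run()` refuses to start when an event loop is already running, which is the situation inside a notebook cell.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # outer() already runs inside an event loop, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # starting a second loop from here is rejected
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

After `nest_asyncio.apply()`, the same nested call is allowed, which is why the snippet above is recommended for IPython.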

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Karl and I am 28 years old. I have been living in the United States since I was 14 years old. I have been very active and healthy. I like to stay active because it keeps me healthy. I like to eat healthy and eat lots of fruits and vegetables. I like to go to the gym on weekends. I like to take long walks in the park in the morning and I like to take long walks at night. I like to read books and listen to music. I think I am very healthy because I have a good diet and exercise. 

What is the most likely relationship between Karl and his family
Prompt: The president of the United States is
Generated text:  now trying to make a statement on his administration's immigration policy. He speaks to the nation about the progress being made and the obstacles and challenges that still need to be overcome. He speaks of the efforts to address the issue of illegal immigration. He speaks about the effort to address the issue of refugees. He speaks about the 
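
The `sampling_params` dictionary above sets `temperature` and `top_p`. As a rough illustration of what these two knobs conventionally mean, here is a toy sketch in plain Python. It is not SGLang's sampler; `apply_temperature`, `top_p_filter`, and the toy logits are hypothetical names and values chosen for this example.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
probs = apply_temperature(toy_logits, temperature=0.8)
nucleus = top_p_filter(probs, top_p=0.95)
print(nucleus)
```

A lower temperature concentrates probability on the highest-scoring token, and a lower `top_p` removes more of the unlikely tail before sampling.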

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich cultural heritage and is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also known for its vibrant nightlife and is a popular tourist destination. The city is home to many international organizations and is a major economic and cultural center in Europe. It is also known for its cuisine, including its famous croissants and its traditional French cuisine. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. It is a city that is constantly evolving

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior and decision-making processes. This could lead to more sophisticated and adaptive AI systems that can learn from feedback and improve their performance over time.

2. Greater emphasis on ethical considerations: As AI becomes more integrated with human intelligence, there will be a greater emphasis on ethical considerations. This could lead to more rigorous testing and evaluation of AI systems, as well as greater transparency and accountability in their development and deployment.

3. Increased use of AI in healthcare: AI
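
The phrase "overlap removal" refers to merging streamed chunks whose beginnings may repeat text that was already received. The following is a minimal sketch of that idea, assuming chunks arrive as plain strings. `drop_overlap` and `merge_chunks` are hypothetical helpers written for this example and are not presented as the implementation behind `stream_and_merge`.

```python
def drop_overlap(existing_text, new_chunk):
    """Remove from new_chunk the longest prefix that already ends existing_text."""
    longest = min(len(existing_text), len(new_chunk))
    for size in range(longest, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # no overlap found, keep the chunk unchanged


def merge_chunks(chunks):
    """Concatenate chunks, skipping text repeated at each chunk boundary."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


merged_text = merge_chunks([" Paris is", " is the capital", " capital of France."])
print(merged_text)
```

One limitation of this sketch is that a legitimate repetition at a chunk boundary (for example `"ha"` followed by `"ha"`) is also trimmed, so it only suits streams where overlapping chunks are expected.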



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am [Age]. I am a self-made entrepreneur, a successful business owner, and a well-respected figure in my field. I am also an avid reader and I love to share my experiences and insights with others. My goal is to help others achieve their goals and make their lives better. I am always looking for new opportunities to learn and grow. Thank you for asking! Congratulations on your new book! I'm excited to meet you. Let's discuss the book together. [Name] [Age] [Name]: Hi, I'm excited to meet you! What can you tell me about your book?

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral.

Paris, the city of love, is the capital of France, renowned for its cultural, historical, and archit
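
Because the asynchronous call is awaited, it can be scheduled alongside other coroutines in the same event loop. The sketch below shows the general `asyncio.gather` pattern with a stand-in coroutine so that it runs without a model. `fake_async_generate` is hypothetical and only mimics the `{"text": ...}` shape of the outputs printed above; whether issuing several real engine calls concurrently helps is not something this example demonstrates.

```python
import asyncio


async def fake_async_generate(prompt, delay):
    """Stand-in for an awaitable generation call; sleeps instead of running a model."""
    await asyncio.sleep(delay)
    return {"text": f" <completion for {prompt!r}>"}


async def run_concurrently(prompts):
    # gather() runs the coroutines concurrently and returns results in input order.
    tasks = [fake_async_generate(p, delay=0.05) for p in prompts]
    return await asyncio.gather(*tasks)


demo_prompts = ["Hello, my name is", "The capital of France is"]
results = asyncio.run(run_concurrently(demo_prompts))
for prompt, result in zip(demo_prompts, results):
    print(prompt, "->", result["text"])
```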

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm a [Your Profession/Type] with a deep understanding of [Your Area of Expertise or Interest].

I enjoy [Your Passion/Fun Fact] and have always been interested in [Your Area of Interest or Hobby]. I am always looking for new challenges and opportunities to learn and grow.

I believe that my personality is [Your Personality Type] and I enjoy [Your Interests or Lifestyle]. I am always open to feedback and willing to learn from others.

I am passionate about [Your Area of Interest or Hobby] and I am always eager to share my knowledge with others. I am



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is a historic center that is located in the south of the country and is the largest city by population in France, with over 2 million inhabitants. The city is known for its rich cultural heritage, including the Eiffel Tower, the Louvre Museum, and the Notre Dame Cathedral. Paris is a major financial hub and is home to many of the country's major banks, insurance companies, and financial institutions. The city is also home to a vibrant arts scene and a number of museums, including the Musée d'Orsay and the Louvre. Paris is a city of contrasts, with a sense of humor and



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by several key trends that are currently being explored and tested. Here are some potential areas of focus for AI in the coming years:

1. Increased integration with human AI: As more AI systems are developed, they are likely to be integrated with human AI in order to enhance their capabilities and make them more effective. This could involve using machine learning algorithms to analyze human behavior and decision-making processes, or using natural language processing to automate and optimize administrative tasks.

2. Emphasis on ethical and societal considerations: As AI becomes more integrated into our daily lives, there is a growing need for ethical and societal considerations to be taken into




```python
llm.shutdown()
```
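
In a script, it can be convenient to guarantee that `shutdown()` runs even if generation raises an exception. One possible pattern, sketched here with a stub so that it runs without a model, wraps the engine in a context manager. `StubEngine` and `engine_session` are hypothetical names introduced for this example; the only assumption about the real engine is the `shutdown()` method used above.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in exposing only the methods this sketch needs."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompt):
        if prompt == "":
            raise ValueError("empty prompt")
        return {"text": " ..."}

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(engine):
    """Yield the engine and always call shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


stub = StubEngine()
try:
    with engine_session(stub) as session_engine:
        session_engine.generate("")  # raises inside the with-block
except ValueError:
    pass

print(stub.is_shut_down)
```

The same wrapper could be applied to an object created with `sgl.Engine(...)`, since it relies on nothing beyond `shutdown()`.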