# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a separate HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
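
The reason is that IPython and Jupyter already run an event loop, and plain `asyncio` refuses to start a second one inside it. The sketch below reproduces that failure using only the standard library, with no SGLang involved; `work` and `outer` are illustrative names, not part of any API. `nest_asyncio.apply()` patches `asyncio` so that such nested calls are allowed.

```python
import asyncio


async def work():
    # Placeholder coroutine standing in for any async call.
    return "done"


async def outer():
    # We are already inside a running loop here, like code in a notebook cell.
    coro = work()
    try:
        asyncio.run(coro)  # tries to start a second, nested event loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```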

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Zhang Xiang. I work in a factory for 4 years. I get paid $20 per hour. I have a job in the department of production. I had to work overtime once a week to get my pay raised by 50%. My manager is worried that I may lose my job if I don't raise my pay. I have a family. It takes about 10 hours to go to the hospital and I also have to look after my husband and daughter. What should I do to improve my situation?
Based on your situation, it sounds like you are facing a challenging situation. Here are some steps you can
Prompt: The president of the United States is
Generated text:  a different person every year and is usually from the left or right of the political spectrum. Do you think that the president of the United States is usually right wing?

  1. Yes
  2. No
  3. Sometimes
  4. No idea
  5. Yes
To determine whether the president of the United States is usually right-wing, we need to consider the characteristics and policies of the current pr
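
For offline batch jobs it is often convenient to keep each prompt together with its completion and serialize the pairs, for example as JSON Lines. The following is a minimal sketch under stated assumptions: `StubEngine` is a hypothetical stand-in that mimics only the call shape used above (`generate(prompts, sampling_params)` returning a list of dicts with a `"text"` key) so the code runs without a GPU, and `generate_records` and `to_jsonl` are illustrative helpers, not SGLang functions.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; returns canned completions."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def generate_records(engine, prompts, sampling_params):
    # One batched call; outputs come back in the same order as the prompts.
    outputs = engine.generate(prompts, sampling_params)
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


def to_jsonl(records):
    # One JSON object per line; ensure_ascii=False keeps non-ASCII text readable.
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


records = generate_records(
    StubEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)
print(to_jsonl(records))
```

With a real engine, passing `llm` instead of `StubEngine()` should work the same way as long as the output dicts expose `"text"` as in the cell above.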

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [Age] year old [Occupation]. I'm a [Skill/Ability] who have been [Number of Years] years in the [Field/Industry] industry. I'm passionate about [What I Love About My Profession]. I'm always looking for new challenges and opportunities to grow and learn. I'm a [Personality Trait] who is [What I Like to Do]. I'm always ready to learn and improve. I'm a [Personality Trait] who is [What I Like to Do]. I'm always ready to learn and improve. I'm a [Personality Trait]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is the largest city in Europe and the third-largest city in the world by population. It is known for its rich history, beautiful architecture, and vibrant culture. Paris is a popular tourist destination and a major economic center. The city is home to many famous landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also known for its cuisine, fashion, and music. Paris is a city that has a unique blend of old and new, and it continues to be a major cultural and economic center in Europe. The city is home to many museums, theaters, and other cultural institutions

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with human intelligence: AI systems will become more integrated with human intelligence, allowing them to learn and adapt to new situations. This will enable AI to perform tasks that are currently beyond the capabilities of humans, such as playing chess or playing musical instruments.

2. Enhanced decision-making: AI systems will become more capable of making decisions based on complex data and information, rather than simply following pre-programmed rules. This will enable AI to make more informed and ethical decisions, which will have a significant impact on society.

3. Improved privacy and security: As AI
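
To illustrate what "overlap removal" can mean when merging streamed chunks, here is a minimal sketch that drops the part of each new chunk that repeats the end of the text accumulated so far. The names `strip_overlap` and `merge_chunks` are hypothetical, and this is an assumption about the general technique, not necessarily how `stream_and_merge` in `sglang.utils` is implemented.

```python
def strip_overlap(existing: str, new_chunk: str) -> str:
    """Return new_chunk without the longest prefix that matches a suffix of existing."""
    max_overlap = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


merged = merge_chunks(["Hello, wor", "world", "!"])
print(merged)
```

A caveat of this approach is that a genuine repetition at a chunk boundary (for example a chunk ending in "a" followed by one starting with "a") is indistinguishable from an overlap and would be collapsed.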



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I'm an [short, descriptive phrase like "web developer", "software engineer", "data scientist", etc.] who's passionate about [mention an interesting interest or hobby], [provide details about your skills and experience, like "am proficient in programming languages like Python, Java, and SQL", "able to work collaboratively with other developers", "excellent at project management", "inspired by the idea of creating meaningful, impactful content for individuals and organizations"].
It's my pleasure to meet you, and I look forward to discussing our potential collaboration. What can you tell me about yourself and what drives you?

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light. 

Paris is the cultural, financial, and political center of France, and is home to many
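
One reason to prefer the asynchronous API is that the calling coroutine can overlap generation with other work. The sketch below shows the idea with `asyncio.gather`. `StubAsyncEngine` is a hypothetical stand-in that mimics only the `async_generate(prompts, sampling_params)` call shape used above, and `load_more_prompts` is an invented placeholder for unrelated I/O.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine; simulates latency with sleep."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def load_more_prompts():
    # Placeholder for other async work, e.g. reading the next batch from disk.
    await asyncio.sleep(0.01)
    return ["The future of AI is"]


async def run_overlapped(engine, prompts, sampling_params):
    # Both coroutines make progress concurrently on the same event loop.
    outputs, next_prompts = await asyncio.gather(
        engine.async_generate(prompts, sampling_params),
        load_more_prompts(),
    )
    return outputs, next_prompts


outputs, next_prompts = asyncio.run(
    run_overlapped(
        StubAsyncEngine(),
        ["Hello, my name is"],
        {"temperature": 0.8, "top_p": 0.95},
    )
)
print(outputs[0]["text"], next_prompts)
```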

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Jane and I am a computer programmer. I have been programming for over 10 years and have worked on a wide range of projects, from small websites to complex software systems. I have a passion for solving problems and using technology to improve the way we work and communicate. I am also a firm believer in continuous learning and am always looking for ways to enhance my skills and knowledge. I am excited to work with you and contribute to your project. Jane.

I hope this short self-introduction does not come across as too formal or preachy, but rather as a friendly and approachable greeting. Let me know if you would like

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as "La Grande ép�nce" and "La Joyeuse France", a historic and culturally significant city located in the Île-de-France region of France.
What is the capital city of France? Paris.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be dominated by a few key trends that are likely to shape the technology's direction in the coming years. Here are some potential trends that are expected to impact the field:

1. Increased integration and collaboration between AI and human experts: As AI becomes more integrated into human decision-making processes, it is likely to become even more powerful and capable. This could lead to increased collaboration between AI and human experts in various fields, such as healthcare, finance, and education.

2. Improved AI ethics and privacy: As AI systems become more complex and complex, there will likely be increased pressure to develop ethical and privacy-preserving algorithms that minimize bias

In [6]:
llm.shutdown()
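
In a script, it can help to guarantee that `shutdown()` runs even when generation raises an exception. The following is a minimal sketch using a small context manager. `engine_session` is a hypothetical helper and `StubEngine` is a stand-in so the code runs without a GPU; whether `sgl.Engine` itself supports the `with` statement is not assumed here.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(engine):
    """Yield the engine and always call its shutdown() on exit."""
    try:
        yield engine
    finally:
        engine.shutdown()


class StubEngine:
    """Hypothetical stand-in exposing only the shutdown() method used above."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


stub = StubEngine()
try:
    with engine_session(stub) as engine:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(stub.is_shut_down)
```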