# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an extra server layer would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example in the form of a Python script is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
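
To see why this matters, here is a minimal standard-library sketch of the failure that occurs when a second event loop is started from inside a running one, which is the situation in IPython and Jupyter. It does not use SGLang or `nest_asyncio`; the names `inner` and `outer` are made up for illustration.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in IPython/Jupyter.
    coro = inner()
    try:
        # Starting a second loop from inside a running one is rejected by asyncio.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

Applying `nest_asyncio` patches asyncio so that such nested calls are allowed, which is why it is needed before using the engine in those environments.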

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  James Brown, a former fictional character. I am known for my distinctive voice, smooth bass lines, and rich, soulful music. I started as a tough kid on a tough crew but quickly learned to play the blues and evolve into a talented musician. I've been a musical legend, collaborating with legendary artists and creating timeless classics. My music has influenced countless artists and continues to inspire people around the world with its soulful melodies and empowering lyrics. Can you tell me about your life and how you became famous in the music industry?
James Brown's journey to fame began in the 1950s when he was just 1
Prompt: The president of the United States is
Generated text:  represented by a 10-member Senate committee. This Senate committee includes two Republicans, three Democrats, and seven Independent Senators. In how many ways can we choose a President, a Vice-President, and a Vice-Chancellor if each of these positions is distinct and
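
The `sampling_params` above set `temperature` and `top_p`. As a rough illustration of what these two knobs conventionally mean, the sketch below applies temperature scaling followed by nucleus (top-p) filtering to a toy list of logits. It is not SGLang's sampler; `sample_top_p` and `toy_logits` are hypothetical names, and the sketch assumes `temperature > 0`.

```python
import math
import random


def sample_top_p(logits, temperature=0.8, top_p=0.95, rng=None):
    """Sample a token index using temperature scaling and top-p filtering."""
    rng = rng or random.Random()
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Keep the smallest set of most likely tokens whose cumulative
    # probability reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # random.choices renormalizes the weights of the kept tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


toy_logits = [2.0, 1.0, 0.1, -3.0]  # made-up scores for four tokens
picks = [sample_top_p(toy_logits, rng=random.Random(seed)) for seed in range(5)]
print(picks)
```

With these toy logits the least likely token falls outside the 0.95 nucleus, so it is never sampled. Lowering `top_p` or `temperature` narrows the choice further.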

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] with [number of years] years of experience in [industry]. I'm passionate about [reason for interest in the industry]. I'm always looking for new opportunities to [what you enjoy doing]. I'm [how you like to be treated] at [company name]. I'm [how you like to be treated] at [company name]. I'm [how you like to be treated] at [company name]. I'm [how you like to be treated] at [company name]. I'm [how you like

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city that serves as the political, cultural, and economic center of the country. It is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also famous for its rich history, including the French Revolution and the French Revolution Museum. The city is home to many famous French artists, writers, and musicians, and is a major tourist destination for visitors from around the world. Paris is a vibrant and diverse city with a rich cultural heritage that continues to inspire and influence the country's art, literature, and music. The city is also known for its food

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some of the most likely trends that are expected to shape the future of AI:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI technology continues to advance, we can expect to see even more widespread use of AI in healthcare, including in areas such as diagnosis, treatment planning, and patient care.

2. Increased use of AI in finance: AI is already being used in finance to improve fraud detection, risk management, and
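
The cell above relies on `stream_and_merge` to remove overlap between streamed chunks; its internals are not shown in this document. The following is a minimal sketch of one way such merging could work, assuming a chunk may repeat the tail of the text already received. The names `drop_overlap` and `merge_chunks` are hypothetical and are not part of the SGLang API.

```python
def drop_overlap(existing_text, new_chunk):
    """Return new_chunk without the prefix that already ends existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text that repeats the current tail."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


# Made-up chunks where each one repeats the end of the previous one.
chunks = [" Paris is", " is the capital", " capital of France."]
print(repr(merge_chunks(chunks)))
```

A suffix/prefix comparison like this is simple but quadratic in the worst case, which is usually acceptable for short streamed chunks.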



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm a [Type of Char] who enjoys [Your passion or hobby]. I'm [Your age] years old, [Your occupation] and [Your strength in the character's field] are [Your skill or expertise]. I've always loved [Your hobby or interest] because [Your motivation or why you enjoy it]. Whether it's [Your interests], [Your hobbies], [Your strengths], or [Your passions], [Your name] is a [type of character] who seeks [Your goal or motivation]. Please let me know what you would like to know about [Your character]. [Your Name

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is known for its iconic landmarks, such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. The city is also renowned for its rich history, including the famous Louvre Museum and its display of art and artifacts from around the
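
When building on top of asynchronous generation, it can be useful to issue several requests concurrently while capping how many are in flight. The sketch below shows that pattern with `asyncio.gather` and a semaphore. It runs against a stand-in class, `FakeEngine`, so that it works without a GPU; `FakeEngine`, `generate_all`, and `max_in_flight` are hypothetical, and the only assumption about the real engine is an awaitable `async_generate(prompt, sampling_params)` that returns a dict with a `"text"` key, as in the cell above.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for an engine with an async_generate coroutine."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do some work
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(engine, prompts, sampling_params, max_in_flight=2):
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt):
        # At most max_in_flight requests run at the same time.
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(one(p) for p in prompts))


results = asyncio.run(
    generate_all(FakeEngine(), ["a", "b", "c"], {"temperature": 0.8})
)
for result in results:
    print(result["text"])
```

Because the engine already schedules batches itself, passing a list of prompts to a single call, as shown above, is often enough; a pattern like this is mainly relevant when requests arrive independently, for example in a custom server.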

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a highly skilled and experienced [occupation or profession]. I am dedicated to [job title], and I have always been passionate about [something that interests me or challenges me]. I am eager to learn and grow, and I am always looking for ways to improve my skills and knowledge. I am a team player, and I enjoy collaborating with others to achieve our goals. I am open to new experiences and opportunities, and I am always willing to take on new challenges. I am confident in my abilities and have no regrets about pursuing my career in [job title]. I am dedicated to [job title] and I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

The statement is factual. Paris is the capital city of France.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be shaped by several key trends, including:

1. More autonomous robots: Robots will become more intelligent and able to learn on their own, freeing up more human workers to focus on more complex tasks. Autonomous robots could be used in manufacturing, transportation, and healthcare.

2. Increased use of AI in healthcare: AI can be used to analyze medical images and provide accurate diagnoses, and it can also help in predicting and preventing diseases. This could lead to more accurate and effective treatments, and improved patient outcomes.

3. AI in education: AI can be used to personalize learning experiences, provide feedback and assessments, and even to teach students

In [6]:
llm.shutdown()
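
In a script, an exception raised before the final cell would skip `shutdown()`. One way to guard against that is a small context manager that always calls `shutdown()` on exit. The sketch below is only illustrative: `engine_session` and `FakeEngine` are hypothetical names, and the stub stands in for a real engine so the example runs without a GPU.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(engine):
    """Yield the engine and call its shutdown() even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


class FakeEngine:
    """Hypothetical stub that only records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


fake = FakeEngine()
try:
    with engine_session(fake) as engine:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(fake.closed)
```

With a real engine, the same wrapper could be used around the object returned by `sgl.Engine(...)`.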