# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
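If the same code has to run both in a notebook and as a plain script, one option is to apply the patch only when an event loop is already running. The following is a minimal sketch of that idea; the helper names `loop_is_running` and `ensure_nested_asyncio` are hypothetical and not part of SGLang or `nest_asyncio`.

```python
import asyncio


def loop_is_running() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running.
        return False
    return True


def ensure_nested_asyncio() -> bool:
    """Apply nest_asyncio only when a loop is already running.

    Returns True if the patch was applied, False otherwise.
    (Hypothetical helper for illustration.)
    """
    if not loop_is_running():
        return False
    # Imported lazily so that plain scripts do not need the package.
    import nest_asyncio

    nest_asyncio.apply()
    return True


ensure_nested_asyncio()
```

In a plain Python process the call is a no-op, so `asyncio.run(...)` behaves as usual; in an environment that already runs a loop, such as a Jupyter kernel, the patch is applied before any nested `asyncio.run(...)` call.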

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  João and I am an amateur artist. I have learned through experience that the most accurate information on the Earth is obtained when using a compass and my map. If I were to travel to a location that is 220 miles north of my current location, would I need to first take the compass to a north point, or take the compass to the opposite of the north point? To be more specific, if my current location is point A, how would I determine the distance to point B, which is 220 miles north of A? Would I use a compass to a point that is 220 miles north of
Prompt: The president of the United States is
Generated text:  visiting a small country and decides to take a new library as a gift. The library will be located in a country where the population is exactly 100,000 people. The president wants to know how many times the librarian must walk around the circumference of the library to read all the books in it. Each book weighs 10 pounds and the librarian walks
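
When a prompt list is large, or results should be written out incrementally, the same `generate` call can be applied to slices of the list. The sketch below assumes only what the cell above shows: `generate(prompts, sampling_params)` returns one dict per prompt with a `"text"` key. The helper names `chunked` and `generate_in_chunks` and the default `chunk_size` are hypothetical and not part of SGLang.

```python
from typing import Any, Dict, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start : start + size]


def generate_in_chunks(
    engine: Any,
    prompts: Sequence[str],
    sampling_params: Dict[str, Any],
    chunk_size: int = 256,  # illustrative default, not an SGLang setting
) -> List[Dict[str, str]]:
    """Run `engine.generate` on slices of `prompts` and pair prompts with text.

    `engine` is assumed to expose generate(prompts, sampling_params) and to
    return a list of dicts with a "text" key, as in the cell above.
    """
    results: List[Dict[str, str]] = []
    for batch in chunked(prompts, chunk_size):
        outputs = engine.generate(list(batch), sampling_params)
        for prompt, output in zip(batch, outputs):
            results.append({"prompt": prompt, "text": output["text"]})
    return results
```

With the engine from this document, a call would look like `generate_in_chunks(llm, prompts, sampling_params, chunk_size=2)`. The engine already schedules requests within a batch, so the slicing here is meant for checkpointing or bounding the size of in-memory results rather than for throughput.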

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your profession or role]. I enjoy [insert a short description of your hobbies or interests]. What brings you to [company name] and what makes you a good fit for the position? I'm a [insert a short description of your personality or character trait]. I'm always looking for new challenges and opportunities to grow and learn. How do you stay motivated and focused on your work? I'm a [insert a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as "La Ville Blanche" or "The White City." It is the largest city in France and the second-largest city in the European Union, with a population of over 10 million people. Paris is known for its rich history, art, and culture, and is a major tourist destination. It is also home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. The city is also known for its cuisine, with many famous French dishes such as croissants, escargot, and boudin. Paris is a vibrant and dynamic

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased integration with human intelligence: As AI becomes more advanced, it is likely to become more integrated with human intelligence. This could lead to more sophisticated forms of AI that can learn from and adapt to human behavior and preferences.

2. Greater use of AI in healthcare: AI is already being used in healthcare to improve diagnosis, treatment, and patient care. As AI becomes more advanced, it is likely to be used in even more sophisticated ways, such as predicting patient outcomes,
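
The "overlap removal" mentioned in the cell above can be pictured as follows: when a new chunk starts with text that already ends the accumulated output, that shared part is dropped before appending. The code below is a minimal sketch of this idea, not the actual implementation of `stream_and_merge`; the names `strip_overlap` and `merge_chunks` are hypothetical.

```python
from typing import Iterable


def strip_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of `new_chunk` that is a suffix of `existing`."""
    max_len = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks while removing repeated text at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


print(merge_chunks(["The cap", "capital of", " France"]))
```

A known limitation of this kind of heuristic is that a chunk which legitimately repeats the end of the previous text (for example `"ha"` followed by `"ha"`) would also be trimmed.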



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [job title] with [number of years] years of experience in [industry or field]. I'm passionate about [what I enjoy doing], and I thrive on [why I love what I do]. I'm always looking for ways to improve my skills and stay ahead of the curve in [industry or field]. My strongest suit is [the thing that makes you stand out in the room]. What makes you unique and why do you like being in this field? [Personality traits or qualities that set you apart].
As an AI language model, I do not have a physical appearance or personal experiences,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

(A) The population of Paris is 2.4 million.
(B) Paris is the largest city in France.
(C) Paris is located in the center of France.
(D) Paris is the second largest city in Europe.
(E) Paris is the capital of Fr
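
Because `async_generate` is a coroutine, several requests can also be awaited together, for example when each prompt needs its own sampling parameters. The following is a sketch under the assumption that calling `async_generate` with a single prompt string returns a single dict with a `"text"` key; the helper name `generate_concurrently` is hypothetical.

```python
import asyncio
from typing import Any, Dict, List, Sequence, Tuple


async def generate_concurrently(
    engine: Any,
    requests: Sequence[Tuple[str, Dict[str, Any]]],
) -> List[str]:
    """Await one async_generate call per (prompt, sampling_params) pair.

    Assumption: engine.async_generate(prompt, sampling_params) with a single
    prompt string resolves to one dict containing a "text" key.
    """
    pending = [
        engine.async_generate(prompt, params) for prompt, params in requests
    ]
    # gather() returns results in the same order as the awaitables passed in.
    outputs = await asyncio.gather(*pending)
    return [output["text"] for output in outputs]
```

With the engine from this document, usage could look like `asyncio.run(generate_concurrently(llm, [("The capital of France is", {"temperature": 0.0}), ("The future of AI is", {"temperature": 0.8})]))`.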

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am [occupation] in [field/industry]. I'm passionate about [reason for interest] and [name of the project or task I'm most proud of]. I love [reason for persistence] and [reason for commitment]. What excites me the most is [exciting moment from my experience]. My work ethic is also very strong and I'm always striving to improve [strength or quality improvement]. I'm always open to feedback and willing to learn from others. What's the most important quality that drives me?

Tell me more about your background and how you became interested in the field you are in.



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

[Mark down]
In [URL] style, provide a brief, accurate statement about the capital city of France. It should include at least one relevant fact or reference to where Paris is located. For example, "Paris, the cultural capital of France, is located in the Loire Valley, a region renowned for its vineyards and architecture." [Mark down] In [URL] style, provide a brief, accurate statement about the capital city of France. It should include at least one relevant fact or reference to where Paris is located. For example, "Paris, the cultural capital of France, is located in the Lo



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to involve a wide range of different technologies and applications. Here are some possible trends that we can expect to see in the field of artificial intelligence:

1. Increased Use of AI in Healthcare: AI is already being used in healthcare to improve diagnosis, treatment, and patient care. For example, AI can analyze medical images to detect cancer, track patients' medical history, and even help doctors develop personalized treatment plans.

2. AI in Finance: AI is also becoming more and more prevalent in the finance industry, helping to automate trading, fraud detection, and customer service. In fact, some financial institutions are already using AI to identify fraudulent transactions
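
Streaming output arrives as many small chunks, so it is often convenient to collect them into one string while also recording how long the first chunk took. The sketch below works with any async iterator of strings, which is what the loop over `async_stream_and_merge(llm, prompt, sampling_params)` in the cell above consumes; the helper name `collect_stream` is hypothetical.

```python
import asyncio
import time
from typing import AsyncIterator, Optional, Tuple


async def collect_stream(
    chunks: AsyncIterator[str],
) -> Tuple[str, Optional[float]]:
    """Join streamed chunks and measure the delay until the first one.

    Returns (full_text, seconds_to_first_chunk); the second value is None
    when the stream yields nothing.
    """
    start = time.perf_counter()
    first_chunk_delay: Optional[float] = None
    parts = []
    async for chunk in chunks:
        if first_chunk_delay is None:
            first_chunk_delay = time.perf_counter() - start
        parts.append(chunk)
    return "".join(parts), first_chunk_delay
```

Inside a coroutine, this could be used as `text, delay = await collect_stream(async_stream_and_merge(llm, prompt, sampling_params))`, which keeps the printing logic separate from the streaming loop.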




```python
llm.shutdown()
```
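
In a script, an exception raised between engine creation and `shutdown()` would skip the cleanup. One way to guard against that is a small context manager built on `try`/`finally`. This is a sketch that relies only on the `shutdown()` method shown above; the helper name `managed_engine` and its `factory` parameter are hypothetical.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def managed_engine(factory: Callable[[], Any]) -> Iterator[Any]:
    """Create an engine via `factory` and always call shutdown() on exit.

    `factory` is any zero-argument callable returning an object with a
    shutdown() method, e.g. lambda: sgl.Engine(model_path="...").
    """
    engine = factory()
    try:
        yield engine
    finally:
        # Runs on normal exit and when the body raises.
        engine.shutdown()
```

Usage would then look like `with managed_engine(lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")) as llm: ...`, with all generation calls placed inside the `with` block.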