# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete working example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
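
To illustrate the idea of a custom server, the sketch below separates request handling from the transport layer. It is only an illustration: `EchoEngine` and `handle_generate` are hypothetical names introduced here, and `EchoEngine` is a stand-in that mimics a `generate` call returning a dict with a `text` key so the code runs without a GPU. The linked script remains the reference for a real server.

```python
import json


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompt, sampling_params=None):
        # Assumption: a single prompt yields a dict with a "text" field.
        return {"text": f" (echo of: {prompt})"}


def handle_generate(engine, raw_body: bytes) -> bytes:
    """Decode a JSON request, call the engine, and encode a JSON response."""
    request = json.loads(raw_body)
    output = engine.generate(request["prompt"], request.get("sampling_params", {}))
    return json.dumps({"text": output["text"]}).encode("utf-8")


response = handle_generate(EchoEngine(), b'{"prompt": "Hello"}')
print(response.decode("utf-8"))
```

Because the handler only depends on an object with a `generate` method, it can be wired into whichever web framework or RPC layer the custom server uses.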



## Nest Asyncio
If you want to use the **Offline Engine** in IPython, Jupyter, or any other code that already runs inside an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()

```
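
The reason for this step can be reproduced with the standard library alone. The minimal sketch below (the function names `inner` and `outer` are illustrative) calls `asyncio.run` from code that is already running inside an event loop, which is the situation in IPython and Jupyter, and captures the resulting error.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in IPython or Jupyter.
    coro = inner()
    try:
        asyncio.run(coro)  # a nested asyncio.run is rejected by default
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Patching the loop with `nest_asyncio.apply()` is intended to make this kind of nested call possible.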

## Advanced Usage

The engine also supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) and [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Kim. I'm 13 years old. I'm tall and I have long black hair. I like playing computer games. I don't like playing sports. I like to listen to music and watch TV. I like cartoons and comic books. Sometimes, I like to draw pictures. I like to listen to music and watch TV. How do you like playing computer games? Are you interested in sports? What do you like to do for fun? Please tell me. (Write an essay of no less than 60 words about your hobbies. The essay should include the following information. 1. What is your hobby? 
Prompt: The president of the United States is
Generated text:  trying to decide whether to deliver a speech or a written speech. The president has a small prison population to be apprehended, and he wants to know how likely it is that the prison population will be large enough to justify delivering a speech rather than a written speech. The president has gathered some data on the prison populations of different states.

The data 

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] with [number of years] years of experience in [industry]. I'm passionate about [reason for interest in the industry]. I'm always looking for new challenges and opportunities to grow and learn. I'm a [reason for interest in the industry] and I'm always eager to learn and improve. I'm a [reason for interest in the industry] and I'm always eager to learn and improve. I'm a [reason for interest in the industry] and I'm always eager to learn and improve. I'm a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light, a historic city with a rich history dating back to the Middle Ages. It is the largest city in France and the second-largest city in the European Union. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Louvre Museum. It is also a major center for art, culture, and fashion. Paris is a popular tourist destination and a major economic center in France. It is home to many famous museums, theaters, and restaurants. The city is also known for its cuisine, including its famous Parisian dishes such

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn and adapt to human behavior and preferences. This could lead to more natural and intuitive interactions between humans and machines.

2. Enhanced machine learning capabilities: AI is likely to become more powerful and capable of learning from large amounts of data, allowing machines to perform tasks that were previously impossible or difficult to achieve. This could lead to more efficient and effective applications of AI.

3. Increased focus on ethical considerations: As AI becomes more integrated with human intelligence, there will be increased focus on ethical
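
The "overlap removal" mentioned above can be pictured with a small standalone sketch. This is an assumption about the general technique, not a statement of how `stream_and_merge` is implemented: `trim_overlap` and `merge_chunks` are illustrative helpers that drop the part of each new chunk that repeats the end of the text collected so far.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Return new_chunk without the prefix that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Illustrative chunks whose boundaries repeat a few characters.
chunks = [" Paris is", " is the capital", " capital of France."]
merged_text = merge_chunks(chunks)
print(merged_text)
```

The same merging loop also works when chunks do not overlap at all, since `trim_overlap` then returns the chunk unchanged.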



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am an [age] year old [profession]. I started my journey in the world of digital marketing [year]. I believe that my experience in this field has taught me the importance of [specific skill or trait]. What is your background? I have a passion for [interest or hobby]. What brings you to this profession? Thank you for asking. I am a [career aspiration], and I am eager to explore this field. I believe that with my [skills or qualities], I can make a meaningful impact on the world. I hope to learn from you and expand my knowledge in the industry. What can you

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is known for its rich history, culture, art, cuisine, and nightlife. It is the largest city in France and has a population of around 2.1 million people. Paris is famous for its famous 
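
The cell above awaits one batched call. As a complementary sketch, independent requests can also be awaited concurrently with `asyncio.gather`. `FakeAsyncEngine` is a hypothetical stand-in introduced here so the code runs without a GPU, and passing a single prompt string to `async_generate` is an assumption about the engine rather than something shown in this document.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in for the engine; not part of SGLang."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for {prompt!r}>"}


async def generate_concurrently(engine, prompts, sampling_params):
    """Start one request per prompt and wait for all of them."""
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    # gather returns results in the same order as the awaitables it was given.
    return await asyncio.gather(*tasks)


prompts = ["The capital of France is", "The future of AI is"]
outputs = asyncio.run(
    generate_concurrently(FakeAsyncEngine(), prompts, {"temperature": 0.8})
)
for prompt, output in zip(prompts, outputs):
    print(prompt, "->", output["text"])
```

Because results come back in prompt order, the `zip(prompts, outputs)` pairing used throughout this document stays valid.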

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm a [Your Profession]. I love to [Your Hobby/Interest/Attitude]. What's your favorite hobby or interest? And what's your favorite color? As a language model AI, I don't have feelings or emotions, but I can talk about my profession, hobbies, and interests. What would you like to know about me?

[Your Name] is a [Your Profession] who enjoys [Your Hobby/Interest/Attitude]. They are passionate about [Their Hobby/Interest/Attitude] and have always been interested in [Other Hobby/Interest/Attitude]. They are always



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is also known as "The City of Light" and "The Eternal City." It is a sprawling and vibrant metropolis with a rich history dating back to the Roman Empire. The city is home to many historical and cultural landmarks, including the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. It is also known for its cuisine, particularly its famous French cuisine, which includes dishes like croissants, steak, and rich sauces. In addition, Paris is a popular tourist destination, drawing millions of visitors each year. It has a diverse population, with French and international residents living together in a city that



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  quite promising and has the potential to revolutionize nearly every aspect of our lives, from healthcare and education to transportation and entertainment. Here are some possible future trends in AI:

1. Increased Integration with Physical Objects: As AI becomes more advanced, it is likely to find new uses in physical objects, such as prosthetic limbs and prosthetic speech machines. This could lead to new ways of enhancing human capabilities and improving the quality of life.

2. Improved Understanding of Human Communication: AI is already becoming better at understanding human language and emotions. As it continues to improve, it is likely to become even more sophisticated and able to interpret human behavior and
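
The streaming loop above prints each cleaned piece as it arrives. The sketch below shows one way such pieces could be derived from an async stream. It rests on an assumption that is not confirmed by this document: that a chunk may carry the full text generated so far. `fake_stream` and `stream_deltas` are hypothetical names used only for illustration.

```python
import asyncio


async def fake_stream():
    """Hypothetical stand-in for a streaming call; yields cumulative text."""
    for text in [" Paris", " Paris is", " Paris is the capital"]:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield {"text": text}


async def stream_deltas(chunks):
    """Yield only the text that has not been emitted yet."""
    seen = ""
    async for chunk in chunks:
        text = chunk["text"]
        # Cumulative chunk: keep only the part after what was already emitted.
        # Otherwise treat the chunk as an increment and pass it through.
        delta = text[len(seen):] if text.startswith(seen) else text
        seen += delta
        if delta:
            yield delta


async def collect():
    return [piece async for piece in stream_deltas(fake_stream())]


pieces = asyncio.run(collect())
print("".join(pieces))
```

Printing each piece with `end=""` and `flush=True`, as in the cell above, then produces continuous text on screen.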




In [6]:
llm.shutdown()
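
In longer scripts it can be convenient to make sure `shutdown()` runs even when generation fails. The sketch below wraps the call in a context manager. `FakeEngine` and `engine_session` are hypothetical helpers introduced for illustration, and `FakeEngine` only records whether `shutdown()` was called so the example runs without a GPU.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine; only records whether shutdown ran."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(engine):
    """Yield the engine and always call shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with engine_session(engine):
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(engine.is_shut_down)
```

The `finally` clause is what guarantees the cleanup call, regardless of how the `with` block exits.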