# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
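As a minimal sketch of the problem this addresses, the snippet below uses only the standard library to show that `asyncio.run()` refuses to start when an event loop is already running, which is the situation inside IPython and Jupyter. The function names `inner` and `outer` are illustrative only; `nest_asyncio.apply()` patches `asyncio` so that such nested calls are allowed.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, similar to a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: rejected by stock asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Run as a plain script, this prints a message beginning with `RuntimeError:`.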

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    # CI-only import; not needed outside the CI environment
    import patch
else:
    # Allow nested event loops when running inside a notebook
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Evie and I'm a teacher from Sacramento, California. I teach seventh grade math. I'm just 27 years old and when I'm not teaching, I'm usually running. I'm also a Master's in Musicology and a certified music instructor. I've taught music theory and musical instruments at the middle and high school level for more than 20 years. I have a Bachelor's in Psychology from Sacramento State University, a Master's in Musicology from the University of Southern California, and a music education certificate from Sacramento State University. I have a passion for all things music. I'm a huge fan of jazz and
Prompt: The president of the United States is
Generated text:  52 years old today. The president is expected to retire in 10 years and will have 10 more years until he is 80 years old. What is the current age of the president? To determine the current age of the president, we need to follow a step-by-step approach:

1. **Identify the current age of the pres

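The `sampling_params` dictionary above sets `temperature` and `top_p`. As a rough conceptual illustration of what these two settings usually mean, the sketch below applies temperature scaling and nucleus (top-p) filtering to a toy set of logits. It is not SGLang's sampler; the helper names `apply_temperature` and `top_p_filter` and the toy logits are assumptions made for this example.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature (illustrative helper)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass  # renormalize the surviving tokens
    return filtered


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical values, not from a real model
probs = apply_temperature(toy_logits, temperature=0.8)
nucleus = top_p_filter(probs, top_p=0.95)
print(nucleus)
```

With these toy values the least likely token falls outside the nucleus and receives zero probability, while the remaining tokens are renormalized.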
### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a unique trait or skill] that I'm passionate about and enjoy sharing with others. What's your favorite hobby or activity? I love [insert a hobby or activity that you enjoy]. What's your favorite book or movie? I love [insert a favorite book or movie that you've read or watched]. What's your favorite color? I love [insert a favorite color]. What's your favorite food? I love [insert a favorite

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a bustling metropolis with a rich history and a diverse population of over 10 million people. The city is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. Paris is also known for its cuisine, fashion, and art scene, making it a popular tourist destination. The city is home to many cultural institutions and events throughout the year, including the annual Eiffel Tower Festival and the annual Louvre Museum Festival. Paris is a vibrant and dynamic city that is a must-visit for anyone interested in

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more sophisticated, it is likely to become more integrated with human intelligence, allowing it to learn from and adapt to human behavior and decision-making processes.

2. Enhanced ethical considerations: As AI becomes more advanced, there will be increased scrutiny of its ethical implications, including issues such as bias, transparency, and accountability.

3. Greater reliance on AI for decision-making: AI is likely to become more integrated into decision-making processes, with more reliance on AI for tasks such as fraud detection, customer service, and healthcare.

4. Increased use of AI


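The phrase "overlap removal" in the cell above refers to merging streamed chunks without repeating text that an earlier chunk already delivered. The following is a minimal, self-contained sketch of that idea. It is an illustration under assumptions, not necessarily how `stream_and_merge` is implemented, and the sample chunks are invented for the example.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # no overlap found


def merge_chunks(chunks):
    """Concatenate chunks while skipping any overlapping text."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each one repeats the tail of the previous one.
chunks = ["The capital", " capital of France", " of France is Paris."]
merged = merge_chunks(chunks)
print(merged)
```

The merged result reads as a single sentence with no repeated fragments.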

### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Field or Career] expert. I specialize in [Briefly Describe Your Expertise]. If you have any questions or need assistance with anything, don't hesitate to reach out. Looking forward to the opportunity to help you. [Contact Information] [Position] [Phone] [Email]

Hey there, fellow [Job Title] expert! I’m [Your Name], a [Your Field or Career] whiz! Let’s connect and help you out. [Contact Information]

---

**Your Name:** [Your Full Name]
**Your Field or Career:** [Your Field or Career]
**Contact Information

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its beautiful architecture, historical significance, and vibrant cultural scene. It is also a major transportation hub and home to numerous museums, theaters, and monuments. The city is home to numerous notable figures, incl

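The cell above awaits one batched call. When prompts arrive independently, a common asyncio pattern is to issue one request per prompt and bound the number in flight. The sketch below shows that pattern with a stand-in engine so it runs without a GPU. `StubEngine`, `generate_all`, and `max_concurrency` are hypothetical names introduced for illustration and are not part of SGLang; only the `async_generate(prompt, sampling_params)` call shape mirrors the usage shown in this document.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for an engine; NOT part of SGLang."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do some work
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(engine, prompts, sampling_params, max_concurrency=2):
    """Run one request per prompt, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def generate_one(prompt):
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # asyncio.gather returns results in the same order as the inputs.
    return await asyncio.gather(*(generate_one(p) for p in prompts))


demo_prompts = ["a", "b", "c"]
results = asyncio.run(
    generate_all(StubEngine(), demo_prompts, {"temperature": 0.8})
)
for prompt, output in zip(demo_prompts, results):
    print(prompt, "->", output["text"])
```

Because `asyncio.gather` preserves input order, the results can be zipped with the prompts just as in the batched cell.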
### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm a [Career or Occupation] with [Number] years of experience. I'm [Age] years old. I've [Number] years of service in the field of [Field of Interest]. I'm the [Role/Position] of [Your Role or Job Title]. What would you like to know about me?
I'm curious to know about your career, your work, and your skills. What do you do for a living? What do you love to do? What are your hobbies? What are your interests in life? What's the best way to describe you? How do you relax? How do

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

The statement is correct. Paris is the capital city of France, located on the Île de la Cité in the Seine River, and is known for its magnificent architecture, rich history, and vibrant culture. It is the seat of government, government, and administrative authority, and serves as the largest and most populous city in the European Union. The city has been the seat of government since the 14th century and continues to be a major international financial center and cultural hub. It is also the world's most-visited city, with millions of tourists visiting annually. Paris is known for its art, fashion, and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and varied, with many possibilities and potential applications. Here are some potential trends that may emerge in the coming years:

1. Increased integration with human cognition: As AI becomes more advanced, it is likely to integrate with human cognition to improve its accuracy and efficiency. This could involve algorithms that can learn from human behavior and adapt to new situations, or that can communicate with humans in a more human-like way.

2. Enhanced language processing: As AI continues to advance, it may be able to process more complex languages and speech more accurately, leading to more natural and intuitive interactions with humans.

3. Improved privacy and security: As AI becomes

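In the streaming cell, each chunk yielded by the async iterator is a small piece of text that is meant to be concatenated with the others. If you want the full string rather than incremental printing, you can collect the chunks inside the `async for` loop. The sketch below shows this with a stand-in async generator so that it runs on its own; `stub_stream` and `collect` are hypothetical names and are not part of SGLang.

```python
import asyncio


async def stub_stream(chunks):
    """Hypothetical stand-in for a streaming generator; NOT part of SGLang."""
    for chunk in chunks:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield chunk


async def collect(stream):
    """Join every streamed chunk into one string."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)


text = asyncio.run(
    collect(stub_stream([" Paris", ".", "\n\n", "The", " statement"]))
)
print(repr(text))
```

The same `collect` helper should work with any async iterator that yields strings, since it relies only on the `async for` protocol.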

```python
llm.shutdown()
```