# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()

```
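
The patch is needed because IPython and Jupyter already run an event loop, and the standard library refuses to start a second one inside it. The sketch below reproduces that refusal with plain `asyncio`, with no SGLang involved. `inner` and `outer` are illustrative names, and it is an assumption here that the engine's synchronous entry points run into the same restriction internally.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second loop inside the first
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` patches `asyncio` so that this kind of nested call is allowed.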

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Jane. I'm an English girl. I am from Sydney, Australia. I like the weather in summer because it is very hot. I like summer to go shopping, watch TV or listen to music. I like to fly kites in summer. I usually go shopping on Saturdays. I don't go to the beach because it is too hot. I like to have my lunch at home. I usually have lunch with my family. I also like to have ice cream when I eat lunch. I usually eat my dinner with my family, too. I like to watch TV on Sundays. When I have my dinner, I like to play
Prompt: The president of the United States is
Generated text:  a position that should be held by ______.
A. a famous person
B. a person who is in high position
C. an individual with a lot of talent
D. someone who has a good education
Answer:
B

Which of the following statements is true?
A. The large window on the second floor of the hotel is used for ventilation.
B. The first-floor bathroom is on the second floor of the hotel.
C. The hotel

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your profession or role]. I enjoy [insert a short description of your hobbies or interests]. I'm always looking for new challenges and opportunities to grow and learn. What's your favorite hobby or activity? I'm always up for a good challenge and love to explore new experiences. What's your favorite book or movie? I love to read and watch movies, and I'm always looking for new adventures to explore. What

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is the largest city in France and the second-largest city in the European Union. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Louvre Museum. It is also home to many famous museums, including the Musée d'Orsay, the Musée Rodin, and the Musée d'Orsay. Paris is a cultural and historical center with a rich history dating back to the Roman Empire and the French Revolution. It is a major transportation hub and a major economic center in Europe. Paris is

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some potential trends that are likely to shape the future of AI:

1. Increased integration with human intelligence: As AI becomes more advanced, it is likely to become more integrated with human intelligence. This means that AI systems will be able to learn from and adapt to human behavior and preferences, and will be able to make decisions based on human values and goals.

2. Greater emphasis on ethical considerations: As AI becomes more advanced, there will be a greater emphasis on ethical considerations. This means that AI systems
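
The "overlap removal" mentioned above can be illustrated without an engine. The following is a minimal sketch of the idea, assuming that consecutive streamed chunks may repeat the tail of the text already received. `strip_repeated_prefix` and `merge_chunks` are hypothetical helper names for illustration, not the implementation of `stream_and_merge` in `sglang.utils`.

```python
def strip_repeated_prefix(existing: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the tail of existing."""
    max_len = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while dropping text that overlaps what came before."""
    merged = ""
    for chunk in chunks:
        merged += strip_repeated_prefix(merged, chunk)
    return merged


# Hypothetical chunks in which each one repeats part of the previous text.
chunks = [" Paris", " Paris, the", ", the capital"]
print(merge_chunks(chunks))
```

Note that a purely textual match like this cannot tell a repeated chunk from a legitimate repetition in the output (for example `"ha"` followed by `"ha"`), so it is only appropriate when the stream is known to overlap.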



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I am a [Your Profession] [Your Role]. I enjoy solving complex problems and working towards making things better for the people I serve. I strive to be a positive influence in my community and strive to be someone people can count on when they need help. I am excited to learn more about you and see how you can help me achieve my goals. Happy to chat about it! How about you? What brings you to this role? What do you hope to achieve with this position? What kind of work experience do you have that would be helpful to me as I work towards my goal? What's your next

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic landmarks, rich history, and diverse culture, including iconic landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. Paris is also a 
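
Because the asynchronous call is awaited, it can share an event loop with other coroutines. The sketch below shows one common pattern: submit one request per prompt, cap how many are in flight, and collect the results in input order. `fake_async_generate` is a hypothetical stand-in so the example runs without a GPU; it is not an SGLang API, and the engine performs its own scheduling, so the cap is optional.

```python
import asyncio


async def fake_async_generate(prompt, sampling_params):
    # Hypothetical stand-in for an awaitable engine call; not an SGLang API.
    await asyncio.sleep(0.01)
    return {"text": f" <completion for {prompt!r}>"}


async def generate_bounded(prompts, sampling_params, max_in_flight=2):
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt):
        # At most max_in_flight requests run concurrently.
        async with semaphore:
            return await fake_async_generate(prompt, sampling_params)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(one(p) for p in prompts))


prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
outputs = asyncio.run(generate_bounded(prompts, {"temperature": 0.8, "top_p": 0.95}))
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```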

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name], and I'm a [insert occupation or profession] who has been following this character for [insert number of years or years] years. I've learned about your world and characters through various sources, and I'm here to share my insights with you. I'm here to ask and answer questions, and to provide you with interesting and relevant information. I'm here to assist you in your journey, whether it's to learn about the world, to find a job, or simply to have a good conversation. I'm here to provide you with the best possible service, and to help you in your quest for knowledge and understanding


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city where the French Revolution took place. It is a UNESCO World Heritage site. French cuisine, including its famous pastries, is a key aspect of Parisian culture and cuisine. Other notable landmarks include the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Palace of Versailles. Paris is a hub for fashion, art, and music, and it has been a major center of international trade for centuries. It is home to the University of Paris and has a long history of medieval, Renaissance, and Baroque architecture. The city is known for its rich history and lively culture, attracting visitors from


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by a number of factors, including advances in machine learning, developments in computer hardware and software, and the development of new technologies such as blockchain and quantum computing. Here are some possible future trends in AI:

1. Increased precision and accuracy: As AI becomes more sophisticated, it will be able to process and analyze vast amounts of data more accurately and quickly than ever before. This will allow AI systems to provide more precise and accurate predictions and recommendations.

2. Enhanced emotional intelligence: AI systems will be able to learn and adapt to human emotions and behaviors, enabling them to provide more personalized and empathetic responses.

3. Greater
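
The consumer side of asynchronous streaming can be sketched without an engine. The example below assumes a stream in which every chunk carries the full text generated so far (an assumption; check what your SGLang version emits) and prints only the newly added part of each chunk. `fake_stream` and `stream_deltas` are hypothetical names for illustration, not SGLang APIs.

```python
import asyncio


async def fake_stream(tokens):
    # Hypothetical stand-in for a streaming engine call; yields cumulative text.
    text = ""
    for token in tokens:
        await asyncio.sleep(0)  # hand control back to the event loop
        text += token
        yield {"text": text}


async def stream_deltas(stream):
    # Turn cumulative chunks into increments by remembering how much was seen.
    seen = 0
    async for chunk in stream:
        delta = chunk["text"][seen:]
        seen = len(chunk["text"])
        if delta:
            yield delta


async def main():
    pieces = []
    async for delta in stream_deltas(fake_stream([" Paris", ",", " the", " city"])):
        print(delta, end="", flush=True)
        pieces.append(delta)
    print()  # New line after the streamed text
    return pieces


pieces = asyncio.run(main())
```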




In [6]:
llm.shutdown()
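
In a script, it is worth making sure `shutdown()` still runs when a generation call raises. One way to do that is a small context manager, sketched below. `FakeEngine` is a hypothetical stand-in that exposes only the `shutdown()` method used above, and `managed` is an illustrative helper, not part of SGLang.

```python
from contextlib import contextmanager


class FakeEngine:
    # Hypothetical stand-in exposing only the shutdown() method used above.
    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


@contextmanager
def managed(engine):
    # Yield the engine and always shut it down, even if the body raises.
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with managed(engine) as llm:
        raise ValueError("generation failed")  # simulate a failing call
except ValueError:
    pass

print(engine.closed)
```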