# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you use the **Offline Engine** in IPython, Jupyter, or any other code that already runs inside an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()

```
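The patch is needed because of a standard `asyncio` restriction: `asyncio.run()` refuses to start while another event loop is already running in the same thread, which is the situation inside IPython and Jupyter. The standard-library-only sketch below reproduces that error. The names `inner` and `outer` are illustrative and are not part of SGLang.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are already inside a running event loop here, so a nested
    # asyncio.run() call is rejected by the standard library.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

With `nest_asyncio.apply()` in effect, nested calls of this kind are permitted, which is why the notebook cells in this document can call `asyncio.run(main())` directly.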

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

W0810 07:36:05.631000 3661088 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W0810 07:36:05.631000 3661088 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.




[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.05it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Xiao Ming. I’m a middle school student. I'm good at sports. I like playing basketball and football. I like playing football better. My favorite color is red. I also like watermelons. I like hamburgers and milk. My favorite color is red. I like watermelons. My favorite color is red. I like hamburgers and milk. My favorite color is red. What sports do you like? Do you like red?  I'm a middle school student. I'm good at sports. I like playing basketball and football. I like playing football better. My favorite color is red. I also like water
Prompt: The president of the United States is
Generated text:  running for a second term. He will be the 43rd president. If he wants to be the longest-serving president, how many years will he have to serve before he becomes the next president? To determine how many years the president of the United States will have to serve before he becomes the next president, we need to know the current terms of the presid
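Judging from the cell above, `llm.generate` returns one result per prompt, in prompt order, and each result is a dictionary with a `"text"` entry. The sketch below shows one way to collect such results into a prompt-to-completion mapping. `StubEngine` and `collect_completions` are hypothetical names introduced here so that the snippet runs without a GPU or a model download; with a real engine you would pass `llm` instead of the stub.

```python
from typing import Dict, List


class StubEngine:
    """Hypothetical stand-in that mimics the output shape seen above."""

    def generate(self, prompts: List[str], sampling_params: dict) -> List[dict]:
        # One dict per prompt, in the same order as the input list.
        return [{"text": f" <completion {i}>"} for i, _ in enumerate(prompts)]


def collect_completions(
    engine, prompts: List[str], sampling_params: dict
) -> Dict[str, str]:
    """Run one batched call and map each prompt to its generated text."""
    outputs = engine.generate(prompts, sampling_params)
    if len(outputs) != len(prompts):
        raise ValueError("expected one output per prompt")
    return {prompt: output["text"] for prompt, output in zip(prompts, outputs)}


results = collect_completions(
    StubEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)
for prompt, text in results.items():
    print(f"{prompt!r} -> {text!r}")
```

A dictionary keyed by prompt is convenient for lookups but drops duplicate prompts; keep the zipped list instead if the same prompt can appear more than once.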

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic Eiffel Tower and the annual Eiffel Tower Festival. It is the largest city in France and the third largest in the world, with a population of over 2.5 million people. Paris is also home to the Louvre Museum, the most famous art museum in the world, and the Notre-Dame Cathedral, a UNESCO World Heritage site. The city is also known for its rich history, including the Roman Empire, French Revolution, and the French Revolution. Paris is a popular tourist destination and a major economic center in France. It is also home to many famous French artists and writers

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that could be expected in the future:

1. Increased automation: As AI becomes more advanced, it is likely to become more capable of performing tasks that were previously done by humans. This could lead to the automation of many jobs, which could have a significant impact on employment and the economy.

2. AI ethics and privacy: As AI becomes more integrated into our daily lives, there will be a growing concern about the ethical implications of AI. This could lead to new regulations and standards being
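The "overlap removal" mentioned in this section can be pictured as follows: if a streamed chunk repeats the tail of the text already received, the repeated prefix is dropped before the chunk is appended. The sketch below is a minimal illustration of that idea under this assumption. `strip_overlap` and `merge_chunks` are illustrative names, and the actual `stream_and_merge` helper in `sglang.utils` may be implemented differently.

```python
from typing import Iterable


def strip_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is also a suffix of existing."""
    max_len = min(len(existing), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # no overlap found


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks while removing text repeated across chunk borders."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


merged = merge_chunks([" Paris is", " is the capital", " capital of France."])
print(merged)
```

Searching from the longest candidate overlap downward means the largest repeated span is removed; a shorter accidental match is only used when no longer one exists.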



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [character type] who loves [character's interest or hobby]. I'm passionate about [character's passion or hobby] and spend a lot of time [character's activity or activity that brings joy]. I believe in [character's core value or belief] and strive to [character's goal or aim]. I have a deep respect for [character's mentor or teacher] and I believe in [character's way of life or lifestyle]. I'm always ready to learn and grow, and I believe in [character's belief in living life to the fullest]. I love [character's work or profession] and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in Europe and the third-largest city in the world by population. Paris is famous for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and Montmartr
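The cell above awaits a single batched call. In general `asyncio` code, an awaitable API also lets a caller start several independent requests and wait for all of them together. The sketch below illustrates that pattern with `asyncio.gather`. `fake_async_generate` is a hypothetical stub that sleeps instead of calling a model, so the example says nothing about how a real engine schedules the requests.

```python
import asyncio
from typing import List


async def fake_async_generate(prompt: str, sampling_params: dict) -> dict:
    """Hypothetical stub: waits briefly, then returns a dict with a "text" key."""
    await asyncio.sleep(0.01)
    return {"text": f" <completion for {prompt!r}>"}


async def generate_concurrently(
    prompts: List[str], sampling_params: dict
) -> List[dict]:
    # gather() runs the coroutines concurrently and returns the results
    # in the same order as the inputs.
    tasks = [fake_async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


prompts = ["The capital of France is", "The future of AI is"]
outputs = asyncio.run(generate_concurrently(prompts, {"temperature": 0.8}))
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```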

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name] and I am a dedicated [insert job title or field] with over [insert number of years of experience]. I have a passion for [insert something that interests you, like music, cooking, or [insert hobby/interest] - the more detailed, the better]. I am always looking for new challenges and opportunities to learn and grow as a professional. I am always looking to improve my skills and stay up-to-date with the latest trends in my field. I am a team player and I enjoy working well with people from different backgrounds and cultures. I am excited to be part of this team and I look forward to



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its historic sites, landmarks, and rich cultural heritage. It is also the birthplace of many famous French figures, including Napoleon Bonaparte and the Eiffel Tower. Paris is a cosmopolitan city with a diverse range of cultures and cuisines, and is a popular tourist destination. Its vibrant art scene, including the Louvre and the Notre-Dame Cathedral, is another major draw for visitors. Overall, Paris is considered one of the most important and distinctive cities in the world. It's a charming city with a warm and welcoming atmosphere. Is there anything else you'd like to know about Paris or any of



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by a number of different trends, including the increasing integration of AI into all aspects of society, the continued advancement of machine learning and computer vision, the growing importance of ethical considerations, and the increasing reliance on AI in areas such as healthcare, transportation, and security. These trends could potentially lead to new and innovative applications of AI, as well as new challenges and ethical dilemmas that must be addressed in order to ensure the responsible and ethical use of AI in society. The potential impact of these trends could be far-reaching, from improved efficiencies and productivity in various industries, to a greater understanding and control of the world around us
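The streaming loop in this section relies on `async for`, which pulls one chunk at a time from an asynchronous generator, and on `print(..., end="", flush=True)`, which shows each chunk immediately without inserting line breaks. The sketch below isolates that consumption pattern. `fake_stream` is a hypothetical stub that yields fixed text pieces and ignores its prompt; it stands in for a real streaming call so the snippet runs without a model.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(prompt: str) -> AsyncIterator[str]:
    """Hypothetical stub that yields fixed text pieces one at a time."""
    for piece in [" Paris", ",", " known", " for", " its", " landmarks", "."]:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield piece


async def consume(prompt: str) -> str:
    pieces: List[str] = []
    async for piece in fake_stream(prompt):
        print(piece, end="", flush=True)  # show text as soon as it arrives
        pieces.append(piece)
    print()  # finish the line once the stream ends
    return "".join(pieces)


full_text = asyncio.run(consume("The capital of France is"))
```

Collecting the pieces in a list and joining once at the end keeps the full response available after the stream has been displayed.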




In [6]:
llm.shutdown()
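In a script, as opposed to a notebook, it can be useful to make sure `shutdown()` runs even if generation raises. The sketch below shows the usual `try`/`finally` pattern for that. `StubEngine` is a hypothetical stand-in whose `generate` always fails, used only to demonstrate that the cleanup still happens.

```python
class StubEngine:
    """Hypothetical stand-in exposing generate() and shutdown()."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        raise RuntimeError("simulated failure during generation")

    def shutdown(self) -> None:
        self.is_shut_down = True


engine = StubEngine()
try:
    engine.generate(["Hello, my name is"], {"temperature": 0.8})
except RuntimeError as exc:
    print(f"generation failed: {exc}")
finally:
    # Runs whether or not generate() raised.
    engine.shutdown()

print("shut down:", engine.is_shut_down)
```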