# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
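
As a rough illustration of the custom-server idea, the sketch below shows a framework-agnostic request handler that decodes a JSON body, awaits a generation call, and encodes the reply. `StandInEngine` and `handle_generate` are hypothetical names introduced here, and the stand-in returns canned text instead of running a model; only the `async_generate` call shape and the `output['text']` field are taken from the cells in this document. The linked `custom_server.py` remains the reference example.

```python
import asyncio
import json


class StandInEngine:
    """Hypothetical stand-in for the engine; echoes the prompt instead of running a model."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real awaitable call would
        return {"text": f" (echo of {prompt!r})"}


async def handle_generate(engine, raw_body: str) -> str:
    """Decode a JSON request, await generation, and return a JSON response."""
    request = json.loads(raw_body)
    output = await engine.async_generate(
        request["prompt"], request.get("sampling_params", {})
    )
    return json.dumps({"text": output["text"]})


# Drive the handler once without any web framework.
body = json.dumps({"prompt": "Hello", "sampling_params": {"temperature": 0.8}})
response = asyncio.run(handle_generate(StandInEngine(), body))
print(response)
```

In a real server, a web framework would call a handler like this per request and own the event loop; that wiring is deliberately left out of the sketch.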



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in any other code that already runs an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
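
The reason can be reproduced with the standard library alone. The sketch below, which does not use SGLang or `nest_asyncio`, calls `asyncio.run()` from inside a coroutine that is already running on an event loop and captures the resulting `RuntimeError`. The function names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # An event loop is already running here, as it is in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call into a running loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"nested call failed: {exc}"
    return "nested call succeeded"


result = asyncio.run(outer())
print(result)
```

Without patching, the nested call is rejected; `nest_asyncio.apply()` patches asyncio so that such nested calls are permitted.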

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    # In CI, a local `patch` module is imported instead of nest_asyncio.
    import patch
else:
    # Allow asyncio.run() inside an already running event loop (e.g. a notebook).
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Marc and I am a medical student. I want to start my internship at a medical facility in the fall. However, I am not sure how to begin my research for my internship. Can you help me with that?
Sure! Here are a few steps that can help you get started with your research for your internship:

1. Research the Medical Facility: Start by researching the medical facility you are interested in. Look at the facilities website, check out their social media pages, and talk to any medical professionals or staff members who may be willing to share information. You can also look up online reviews and ratings to see what other people have had
Prompt: The president of the United States is
Generated text:  3 feet tall. His granddaughter is 2/3 the height of the president. Additionally, there is a boy in the same family, who is 4 feet tall. How tall are all three combined?

To determine the total height of the three individuals, we need to calculate the height o
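
For readers unfamiliar with the `temperature` and `top_p` entries in `sampling_params`, the following is a minimal sketch of their conventional definitions: logits are divided by the temperature before the softmax, and only the smallest set of most likely tokens whose cumulative probability reaches `top_p` is kept. It operates on a toy list of logits and is not taken from SGLang's sampler; `nucleus_distribution` is a hypothetical helper name.

```python
import math


def nucleus_distribution(logits, temperature, top_p):
    """Return {token_index: probability} after temperature scaling and top-p filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the kept tokens so the result is a distribution.
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]
loose = nucleus_distribution(toy_logits, temperature=0.8, top_p=0.95)
tight = nucleus_distribution(toy_logits, temperature=0.2, top_p=0.9)
print(sorted(loose), sorted(tight))
```

With the toy logits, the higher temperature and larger `top_p` keep more candidate tokens than the lower settings, which is why the low-temperature cells in this document produce more repetitive, predictable text.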

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic Eiffel Tower and the Louvre Museum. It is also the seat of the French government and the largest city in the country. Paris is a cultural and historical center with a rich history dating back to the Roman Empire and the French Revolution. The city is known for its vibrant nightlife, art, and cuisine. It is also home to many famous landmarks such as Notre-Dame Cathedral, the Palace of Versailles, and the Arc de Triomphe. Paris is a popular tourist destination and a major economic hub in France. The city is known for its diverse population, including French, African,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that could be expected in the future:

1. Increased integration of AI into everyday life: As AI becomes more integrated into our daily lives, we are likely to see more widespread adoption of AI technologies. This could include things like voice assistants, self-driving cars, and smart home devices.

2. Greater emphasis on ethical and responsible AI: As AI becomes more advanced, there will be a greater emphasis on ensuring that it is used ethically and responsibly. This could involve developing guidelines and standards
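
The "overlap removal" mentioned in the cell above can be pictured with a small standalone sketch. It assumes that a streamed chunk may begin with text that has already been emitted, and it strips the longest such repeated prefix before appending. The helper names `drop_repeated_prefix` and `merge_chunks` and the sample chunks are illustrative assumptions, not the implementation of `stream_and_merge`.

```python
def drop_repeated_prefix(existing_text, new_chunk):
    """Remove the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # no overlap found


def merge_chunks(chunks):
    """Concatenate chunks while skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += drop_repeated_prefix(merged, chunk)
    return merged


# Hypothetical chunks, each repeating the tail of the previous one.
chunks = [" Paris", " Paris, the", ", the city", " city of light"]
merged = merge_chunks(chunks)
print(merged)
```

One design caveat of this suffix-prefix heuristic: if a new chunk legitimately starts with the same characters that end the existing text (for example `"ha"` followed by `"ha"`), the repeat is treated as overlap and dropped.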



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [Occupation]. I come from [Country] and have lived here [Number of Years]. I have always been passionate about [I Love to Do]. I am a [Job Title] who is always looking for new challenges and opportunities to grow. I believe that [Reason Why I Love What I Do]. I am a [Name] and I am [Number of Years] old. I speak English and I have lived in [Country] for [Number of Years]. I have always been passionate about [I Love to Do] and have always wanted to be a [Job Title]. I am a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the Loire Valley region, known for its historic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. 

The city is also home to numerous museums, theaters, and a vibrant cultural scene, including the Opéra and the Musée d'Orsay. Paris 
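
The cell above pairs each prompt with its output using `zip`, which relies on the results coming back in prompt order. The same ordering property holds when several awaitable calls are issued with `asyncio.gather`, as the sketch below shows with a hypothetical stand-in coroutine (`stand_in_generate`) in place of the engine. It does not establish whether separate concurrent calls are preferable to one batched call for the real engine.

```python
import asyncio


async def stand_in_generate(prompt, delay):
    """Hypothetical stand-in for an awaitable generation call."""
    await asyncio.sleep(delay)
    return {"text": f" <completion for: {prompt}>"}


async def run_concurrently(prompts):
    # Later prompts finish first here, to show that completion order does not matter.
    calls = [
        stand_in_generate(prompt, 0.01 * (len(prompts) - index))
        for index, prompt in enumerate(prompts)
    ]
    # gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(*calls)


prompts = ["first", "second", "third"]
outputs = asyncio.run(run_concurrently(prompts))
for prompt, output in zip(prompts, outputs):
    print(f"{prompt} ->{output['text']}")
```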

### Streaming Asynchronous Generation
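
Asynchronous streaming is consumed with `async for`, printing each chunk as it arrives. Before the engine-backed cell, here is a minimal standalone sketch of that consumption pattern, using a hypothetical async generator (`stand_in_stream`) that yields fixed text pieces in place of the engine.

```python
import asyncio


async def stand_in_stream(prompt):
    """Hypothetical stand-in: yields text chunks one at a time."""
    for piece in [" Paris", ",", " the", " capital"]:
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield {"text": piece}


async def collect(prompt):
    pieces = []
    async for chunk in stand_in_stream(prompt):
        # Print without a newline so the chunks read as continuous text.
        print(chunk["text"], end="", flush=True)
        pieces.append(chunk["text"])
    print()
    return "".join(pieces)


streamed = asyncio.run(collect("The capital of France is"))
```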

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a/an [Occupation] who loves to [Reason for love/interest]. My journey is a mix of [Emotional journey], learning about [Subject or Realm]. I believe that my [Experience or Skill] will help me achieve [My Goal or Vision]. Looking to grow, the best way I can do that is [How you plan to grow]. I am a/an [Age], [Height], [Weight], [Hair Color], [Eye Color], and [Physical Appearance]. My hobby is [Past Hobby], and I enjoy [How I like to spend time]. I believe that I am a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the south-central region of the country. It is the largest city in France, with an estimated population of over 7 million people. Paris is known for its rich history and culture, as well as its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre Dame Cathedral. The city is also known for its annual festivals, including the Spring Festival, Carnival, and the Eiffel Tower Festival. Paris has been described as a "city of love" and is considered one of the most beautiful cities in the world. It is home to over 300,000 foreign tourists

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by several key trends, including:

1. Increased complexity and sophistication of AI: AI is expected to become more complex and sophisticated, with the ability to learn from data, understand context, and reason about the world in ways that humans cannot. This includes the development of more advanced neural networks, symbolic AI, and more intelligent agents.

2. Greater emphasis on ethical AI: There is a growing concern about the ethical implications of AI, including issues such as privacy, bias, and transparency. The development of more ethical AI systems is likely to become more important, with efforts to ensure that AI systems are fair, transparent, and


In [6]:
llm.shutdown()