# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
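
To illustrate why the patch is needed, here is a minimal sketch that uses only the standard library (no SGLang). It reproduces the error plain `asyncio` raises when `asyncio.run` is called while a loop is already running, which is the situation inside IPython or Jupyter and which the `asyncio.run(main())` cells on this page would hit there. The function names `work` and `call_run_inside_running_loop` are illustrative only.

```python
import asyncio


async def work():
    return "done"


async def call_run_inside_running_loop():
    # We are already inside a running loop here, as in IPython or Jupyter.
    coro = work()
    try:
        asyncio.run(coro)  # nested call: rejected by plain asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(call_run_inside_running_loop())
print(message)
```

With `nest_asyncio.apply()` in place, such nested calls are allowed instead of raising.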

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Sonya. I'm a scientist and a trainee teacher at MIT. I have been studying physics for a long time. I have worked on problems like finding the force of gravity between two objects or measuring the mass of a neutron. I also have a personal interest in the science of mind and memory, as well as the science of pain. I am currently a member of the MIT Media Lab's Brain and Learning Project. I have been working with the Intelligent Systems Lab to develop the capabilities for human-like intelligences. Recently, I have been working on the Gaze Project at the MIT Media Lab. The Gaze Project is about improving
Prompt: The president of the United States is
Generated text:  `x' years older than the president of Brazil, `x' years younger than the president of France, and `x' years older than the president of Germany. If the president of Brazil is currently 22 years old, what is the president of France's age?
To find the president of France's age, we need t
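
As a small follow-up to the cell above, the sketch below shows one way to collect batch results into records. It does not call the engine: the `outputs` list is a hand-written stand-in for what `llm.generate(prompts, sampling_params)` returned, and `pair_results` is a hypothetical helper. It assumes outputs come back in the same order as the prompts, as the `zip` in the cell above implies.

```python
# Hand-written stand-in for the list returned by llm.generate(prompts, sampling_params).
prompts = ["Hello, my name is", "The capital of France is"]
outputs = [{"text": " Sonya."}, {"text": " Paris."}]


def pair_results(prompts, outputs):
    """Pair each prompt with its generated text, checking that the lengths match."""
    if len(prompts) != len(outputs):
        raise ValueError("expected one output per prompt")
    return [{"prompt": p, "completion": o["text"]} for p, o in zip(prompts, outputs)]


records = pair_results(prompts, outputs)
for record in records:
    print(record["prompt"] + record["completion"])
```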

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history dating back to the Middle Ages and is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also a major cultural and economic center, with a diverse population and a thriving arts scene. The city is home to many famous museums, including the Louvre, the Musée d'Orsay, and the Musée d'Art Moderne. Paris is a popular tourist destination and a major hub for international business and diplomacy. It is also known for its cuisine, with

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased automation: AI is likely to become more prevalent in various industries, including manufacturing, transportation, and healthcare. Automation will likely lead to increased efficiency and productivity, but it will also lead to the loss of jobs for humans.

2. AI ethics and privacy: As AI becomes more integrated into our daily lives, there will be increasing concerns about its ethical implications and potential privacy violations. There will be a need for regulations and guidelines to ensure that AI is used in a responsible
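
The "overlap removal" mentioned in the cell above can be pictured with a small sketch. This is an illustration of the idea under simple assumptions, not necessarily how `stream_and_merge` is implemented: each incoming chunk is compared with the end of the text accumulated so far, and any repeated prefix is dropped before appending. The helper names `trim_overlap` and `merge_chunks` and the sample chunks are made up for this example.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the end of existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text that a chunk repeats from the previous ones."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


chunks = ["The capital", "capital of France", " is Paris."]
print(merge_chunks(chunks))  # prints "The capital of France is Paris."
```

A purely textual check like this cannot tell a repeated chunk from text that legitimately repeats (for example "ha" followed by "ha"), so it is only suitable as an illustration.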



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name] and I am a [Your Profession] with a passion for [Your Interest/Opinion/Story]. I am excited to share my experiences and insights with anyone who wants to learn more about my journey. I'm here to inspire and challenge you to grow and evolve as a person. How can I get to know you better? You can reach me at [Your Phone Number/Email Address]. Welcome, [Your Name], I look forward to meeting you! Can you tell me more about your work or interests? As a [Your Profession], I have a particular interest in [Your Area of Expertise/Interest].

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is a historic and cultural center of France, known for its landmarks, festivals, and cuisine. Paris is often referred to as the "city of love" and is home to some of the world's most famous museums, such as the Lo
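
One reason to prefer the asynchronous call is that other coroutines can make progress while generation is pending. The sketch below illustrates this with the standard library only: `fake_async_generate` is a hypothetical stand-in for `await llm.async_generate(...)` that returns placeholder text, and `heartbeat` represents unrelated work running at the same time.

```python
import asyncio


async def fake_async_generate(prompts, sampling_params):
    # Hypothetical stand-in for `await llm.async_generate(...)`; returns placeholders.
    await asyncio.sleep(0.01)  # simulate waiting on the engine
    return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def heartbeat(ticks):
    # Other work that keeps running while generation is pending.
    for _ in range(3):
        ticks.append("tick")
        await asyncio.sleep(0)


async def main():
    ticks = []
    # Run generation and the unrelated work concurrently on one event loop.
    outputs, _ = await asyncio.gather(
        fake_async_generate(["The capital of France is"], {"temperature": 0.8}),
        heartbeat(ticks),
    )
    return outputs, ticks


outputs, ticks = asyncio.run(main())
print(outputs[0]["text"], len(ticks))
```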

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling llm.async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name] and I'm a/an [character's role/occupation]!

As a/an [character's role/occupation], I love [shortly describe something you like about your role]. My favorite [character's role/occupation] is the [insert your character's profession] because [insert why you like that role]. I enjoy [insert something you like about your role] because [insert why you like that role]. As a/an [character's role/occupation], I believe in [insert something that reflects your core values and beliefs about the role]. I believe in [insert something that reflects your core values and beliefs about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is a historical, cultural, and economic center that has played a significant role in shaping French identity and shaping the country’s worldliness. Its most famous landmark is the Eiffel Tower, a 324-meter-tall iron lattice tower that stands at the entrance to Paris. Paris is also the birthplace of the French Revolution, the home of the French language, and the birthplace of many famous French writers and artists. The city is also known for its rich culinary traditions, fashion industry, and nightlife scene. As of 2021, Paris has an estimated population of 18 million people,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly uncertain and variable. Some possible trends that are being predicted include:

1. Increased automation of mundane tasks: As automation becomes more advanced, more tasks that were previously done by humans will be automated. This could lead to increased efficiency and reduced human labor.

2. AI will become more intelligent: As AI technology improves, it will become more intelligent and able to understand and respond to complex problems. This could lead to more effective and efficient solutions to complex problems.

3. AI will become more ethical: As AI technology is developed, there will be a greater emphasis on ethical considerations. This could lead to the development of AI that is more transparent
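
The consumption pattern used in the cell above (an `async for` loop that prints each chunk as it arrives) can be tried without a model. In this sketch, `fake_stream` is a hypothetical async generator standing in for the engine's stream, and `collect` is an illustrative helper that both prints chunks immediately and keeps them so the full text is available at the end.

```python
import asyncio


async def fake_stream(chunks):
    # Hypothetical stand-in for an engine stream: yields one chunk at a time.
    for chunk in chunks:
        await asyncio.sleep(0)  # give other tasks a chance to run between chunks
        yield chunk


async def collect(stream):
    """Print chunks as they arrive and return the concatenated text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # newline once the stream is exhausted
    return "".join(parts)


full_text = asyncio.run(collect(fake_stream([" Paris", ".", " It", " is"])))
```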




In [6]:
llm.shutdown()
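
In a script, it can help to make sure `shutdown()` runs even if generation fails. The sketch below shows the `try`/`finally` pattern with a hypothetical `FakeEngine` stand-in that exposes only the two methods used on this page; its `is_shut_down` attribute exists only for this illustration and is not part of SGLang.

```python
class FakeEngine:
    """Hypothetical stand-in exposing the two engine methods used on this page."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        raise RuntimeError("simulated failure during generation")

    def shutdown(self):
        self.is_shut_down = True


engine = FakeEngine()
try:
    engine.generate(["Hello, my name is"], {"temperature": 0.8})
except RuntimeError as exc:
    print(f"generation failed: {exc}")
finally:
    # Runs whether or not generation succeeded.
    engine.shutdown()

print(engine.is_shut_down)  # prints True
```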