# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
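
To see why this is needed, the following minimal sketch reproduces the error that a nested event loop triggers. It uses only the standard library and does not involve SGLang. The coroutine names `inner` and `outer` are illustrative; `outer` stands in for a notebook cell that is already running inside an event loop.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # An event loop is already running here, as it is inside a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # starting a second loop is rejected by asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return None


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that nested calls of this kind are permitted.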

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Robin. I was born in Rio de Janeiro, Brazil and raised in the city. My family is from diverse backgrounds and my parents are both nurses. I was raised with the love and respect for my ancestors and my grandparents who became nurses, their love for the nursing profession and their respect for each other.
In 1995, I was diagnosed with multiple sclerosis (MS) and from that moment on I began a journey to find a cure. This journey is never over and I am not done yet. I have had 10 years of MS in the form of relapse. But my determination and determination will always be a
Prompt: The president of the United States is
Generated text:  very popular. He is also very hungry. As a result, he has a problem: he is always hungry. The president is in charge of the whole country, and he must have food. He can't just buy it from the store because he is not rich enough. He has to spend his own money and give it to the people who feed the people in charge of the
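
The pattern above (one batched call, then pairing each prompt with its output) can be wrapped in a small helper. The sketch below is illustrative only: `FakeEngine` and `collect_results` are hypothetical names, and `FakeEngine` merely imitates the output shape used above (a dict with a `"text"` key per prompt) so that the code runs without a GPU or a model.

```python
from typing import Any, Dict, List


class FakeEngine:
    """Hypothetical stand-in for the engine so the sketch runs without a GPU."""

    def generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, Any]]:
        # Mimic the shape used above: one dict with a "text" key per prompt.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def collect_results(engine, prompts, sampling_params):
    """Run one batched call and pair each prompt with its generated text."""
    outputs = engine.generate(prompts, sampling_params)
    if len(outputs) != len(prompts):
        raise ValueError("expected one output per prompt")
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


records = collect_results(
    FakeEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)
for record in records:
    print(record)
```

Passing the real `llm` object in place of `FakeEngine()` should work the same way, assuming its outputs keep the `"text"` key shown in the cell above.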

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a brief description of your profession or role]. I enjoy [insert a brief description of your hobbies or interests]. I'm always looking for new experiences and challenges, and I'm always eager to learn and grow. What's your favorite hobby or activity? I love [insert a hobby or activity you enjoy]. I'm always looking for new adventures and experiences, and I'm always eager to try new things. What's your favorite book or

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting numerous museums, theaters, and other attractions. Paris is a popular tourist destination and a major hub for internatio

Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn and adapt to human behavior and preferences. This could lead to more personalized and adaptive AI systems.

2. Greater emphasis on ethical considerations: As AI becomes more integrated with human intelligence, there will be a greater emphasis on ethical considerations. This could lead to more stringent regulations and guidelines for AI development and deployment.

3. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI becomes more integrated with human intelligence, there is
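
The "overlap removal" mentioned in the cell above refers to merging streamed chunks without repeating text that consecutive chunks share. The internals of `stream_and_merge` are not shown in this document, so the following is only a sketch of one possible approach, not the library's implementation. The helper `merge_chunks` is a hypothetical name, and the `trim_overlap` defined here is written for this sketch.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the tail of existing."""
    max_overlap = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, dropping any prefix already present in the result."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


chunks = ["The capital", "capital of France", " of France is Paris"]
print(merge_chunks(chunks))
```

One limitation of this suffix/prefix matching is that it cannot tell a genuine repetition in the generated text (for example a doubled letter split across two chunks) from a streaming overlap, so it is only suitable when chunks are known to overlap.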



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a 25-year-old software engineer with a degree in Computer Science from [University Name]. I'm a creative problem-solver and have always been drawn to technology and creating. I'm always looking for new ways to solve problems and innovate within the field of software development. I'm also a natural communicator, able to explain complex technical concepts in a clear and concise manner. I enjoy building and maintaining strong, reliable software applications. I'm a hardworking individual who thrives in a fast-paced, tech-driven environment. I'm always looking for ways to improve my skills and stay ahead of the curve in the

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

What is the capital of France? Paris.

Prompt: Explain possible future trends in artificial intelligence. The future of A
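
A benefit of the asynchronous interface is that generation can be awaited alongside other coroutines. As a rough illustration, the sketch below issues one request per prompt and waits for all of them with `asyncio.gather`. `FakeAsyncEngine` is a hypothetical stand-in whose `async_generate` takes a single prompt per call; that signature is an assumption made for this example, whereas the cell above passes the whole prompt list in one call.

```python
import asyncio
from typing import Any, Dict


class FakeAsyncEngine:
    """Hypothetical stand-in exposing an async_generate-like coroutine."""

    async def async_generate(
        self, prompt: str, sampling_params: Dict[str, Any]
    ) -> Dict[str, Any]:
        await asyncio.sleep(0.01)  # pretend to wait on the model
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(engine, prompts, sampling_params):
    # Create one coroutine per prompt; gather returns results in input order.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


prompts = ["The capital of France is", "The future of AI is"]
outputs = asyncio.run(
    generate_concurrently(FakeAsyncEngine(), prompts, {"temperature": 0.8})
)
for prompt, output in zip(prompts, outputs):
    print(prompt, "->", output["text"])
```

Inside a notebook, where an event loop is already running, `asyncio.run` requires the `nest_asyncio` patch described earlier.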

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm a 25-year-old software developer with a passion for creativity and problem-solving. I enjoy working on new projects and constantly learning new technologies and approaches. I've been coding for over 8 years and have honed my skills in languages like C++ and Python. I enjoy exploring new programming paradigms and using them to create engaging and efficient applications. I have a knack for coding in Ruby and love to learn new gems and tools. I'm passionate about using code to solve real-world problems and I'm always looking for ways to improve my skills and stay up-to-date with the latest programming trends. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the world-renowned French city with a rich history and diverse culture.

That's not entirely true. While Paris is a bustling metropolis, it's not the world-renowned French capital. The current capital is Paris, and it's located in the Île-de-France region of France. It's the largest city in France by population, with over 2 million inhabitants. However, the city is known for its distinctive architecture, rich history, and vibrant culture, making it a fascinating place to visit. In comparison to other cities in France, Paris is often considered the most important and important city in the country.

Paris

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  diverse and complex, and there is no single trend that can be confidently predicted. However, here are some possible trends that are likely to shape the future of AI:

1. Increased integration with other technologies: AI is becoming increasingly integrated with other technologies such as blockchain, IoT, and quantum computing. This integration may lead to new applications and opportunities, such as smart city management or autonomous vehicles.

2. More ethical AI: As AI becomes more integrated with other technologies, there will be an increasing focus on ethical considerations. This may include issues such as bias, fairness, and transparency. There will also be a push towards developing AI that is more
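
The streaming loop above prints each chunk as soon as it arrives. If the full text is also needed afterwards, the chunks can be accumulated while printing. The sketch below shows this with a hypothetical async generator, `fake_stream`, that stands in for the real stream; `consume` is likewise an illustrative name.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(tokens: List[str]) -> AsyncIterator[str]:
    """Hypothetical async generator that yields one text chunk at a time."""
    for token in tokens:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield token


async def consume(stream: AsyncIterator[str]) -> str:
    """Print chunks as they arrive and return the concatenated text."""
    pieces = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        pieces.append(chunk)
    print()  # newline after the stream ends
    return "".join(pieces)


full_text = asyncio.run(consume(fake_stream([" Paris", ",", " the", " capital"])))
```

Joining the pieces at the end avoids repeated string concatenation inside the loop.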




In [6]:
llm.shutdown()
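
In a standalone script, it can be helpful to make sure `shutdown()` runs even when generation raises an error. The following sketch shows the `try`/`finally` pattern with a hypothetical `FakeEngine` that only models the `generate` and `shutdown` calls used in this document and deliberately fails during generation.

```python
class FakeEngine:
    """Hypothetical stand-in; only models the calls used in this document."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        raise RuntimeError("simulated failure during generation")

    def shutdown(self):
        self.closed = True


engine = FakeEngine()
error_message = None
try:
    engine.generate(["Hello, my name is"], {"temperature": 0.8})
except RuntimeError as exc:
    error_message = str(exc)
finally:
    # Runs whether or not generation succeeded.
    engine.shutdown()

print(engine.closed, error_message)
```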