# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
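
The general asyncio behavior behind this requirement can be reproduced without SGLang. The sketch below uses only the standard library; `inner` and `outer` are hypothetical names, and `outer` plays the role of a notebook cell that is already running inside an event loop. It shows that a nested `asyncio.run()` call is rejected, which is the kind of failure `nest_asyncio.apply()` is meant to avoid.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are now inside a running event loop, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: asyncio refuses to start a second loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

In a plain Python script that does not already run an event loop, the nested call does not occur and the patch should not be necessary.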

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Baoxiang and I am a 4th grade student. I have good grades, but my parents don’t like me. I feel sad and lonely. What should I do?

I have learned that the parents of my friends often bully me, but I don’t care about them. I want to tell my parents, but they don’t understand my feelings, and they don’t want me to tell them. Can you give me some advice?

Based on the information you've provided, it seems that you are facing a challenging situation where you are dealing with feelings of loneliness and lack of communication with your parents. Here are some suggestions
Prompt: The president of the United States is
Generated text:  proposing a new policy that will directly impact the lives of 100,000 people. To determine the number of votes needed in a national election, the president is using a formula that involves a large, complex number. The formula is given by:
\[ \text{Total Votes Needed} = \frac{500,000,000}{n} \]
where \(n\) is the number of
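
The cell above relies on two properties of the result: there is one output per prompt, in input order, and each output is a dict with a `"text"` key. As a minimal sketch of turning that into structured records, the example below uses `FakeEngine`, a hypothetical stand-in that only mimics this output shape so the code runs without a GPU; `generate_records` is likewise an illustrative helper, not part of SGLang.

```python
from typing import Dict, List


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict]:
        # Mimics the shape used above: one dict with a "text" key per prompt.
        return [{"text": f" <completion {i}>"} for i, _ in enumerate(prompts)]


def generate_records(engine, prompts: List[str], sampling_params: Dict) -> List[Dict]:
    """Pair each prompt with its generated text, preserving input order."""
    outputs = engine.generate(prompts, sampling_params)
    if len(outputs) != len(prompts):
        raise ValueError("expected one output per prompt")
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


records = generate_records(
    FakeEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)
for record in records:
    print(record)
```

Assuming the real engine keeps the same output shape, passing `llm` instead of `FakeEngine()` should work unchanged.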

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic Eiffel Tower and the Louvre Museum. It is also a major cultural and economic center, hosting the French Parliament and the World Cup. Paris is a popular tourist destination, known for its rich history, art, and cuisine. The city is also home to the French Academy of Sciences and the French National Library. It is a major transportation hub, with the Eiffel Tower serving as a symbol of the city. Paris is a vibrant and diverse city with a rich history and culture, making it a popular destination for tourists and locals alike.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some of the potential future trends in AI:

1. Increased automation and robotics: As AI technology continues to advance, we can expect to see more automation and robotics in various industries, including manufacturing, transportation, and healthcare. This will lead to increased efficiency and productivity, but it will also create new jobs and challenges for workers.

2. Enhanced privacy and security: As AI systems become more sophisticated, we will need to ensure that they are used responsibly and ethically. This will require ongoing efforts to protect
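
The cell above describes the merged stream as using "overlap removal". One way to picture that idea is sketched below: when a new chunk starts with text that already ends the accumulated output, the repeated part is dropped before appending. This is an illustration of the concept only, not the code in `sglang.utils`; `trim_overlap` and `merge_chunks` are names chosen for this example.

```python
def trim_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that already ends `existing`."""
    max_len = min(len(existing), len(chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged = merge_chunks(["Paris is", " is the", " the capital"])
print(merged)  # Paris is the capital
```

A suffix/prefix heuristic like this cannot tell an accidental repeat from an intended one (for example, two consecutive identical chunks), so it is only suitable when chunks are known to overlap.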



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I am a [insert your profession or occupation here, if you haven't already, e.g., artist, writer, etc.]. I'm passionate about my work and love sharing my art with the world. I've always been fascinated by stories and how they can be told in various ways. I'm also an avid reader and love exploring new ideas and perspectives. I enjoy creating art that can provoke emotions and inspire others. I hope to be a helpful resource to those who are interested in art and creative expression. How can I help you today? Let me know if you'd like to chat about any specific pieces or

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city known for its stunning architecture, beautiful parks, and cultural attractions. It is a bustling metropolis with a rich history and a lively nightlife. The city is home t
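
Because `async_generate` is awaited, it can be combined with other coroutines. The sketch below shows one possible pattern, submitting several prompt groups concurrently with `asyncio.gather`. `FakeAsyncEngine` is a hypothetical stand-in that only mimics the call shape used above, and `run_groups` is an illustrative helper; whether the real engine benefits from this pattern is not established here.

```python
import asyncio
from typing import Dict, List


class FakeAsyncEngine:
    """Hypothetical stand-in for sgl.Engine; only mimics async_generate."""

    async def async_generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict]:
        await asyncio.sleep(0.01)  # pretend to do some work
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def run_groups(engine, groups: List[List[str]], sampling_params: Dict):
    """Submit several prompt groups concurrently; results keep group order."""
    tasks = [engine.async_generate(group, sampling_params) for group in groups]
    return await asyncio.gather(*tasks)


groups = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(run_groups(FakeAsyncEngine(), groups, {"temperature": 0.8}))
for group, outputs in zip(groups, results):
    for prompt, output in zip(group, outputs):
        print(prompt, "->", output["text"])
```

`asyncio.gather` returns results in the order the awaitables were passed, so each result list lines up with its prompt group.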

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert character's name]. I'm an [insert age] year old, [insert profession]. I have [insert one or two bullet points describing my hobbies and interests]. I enjoy [insert two or three things you do that bring you joy]. My favorite place to stay is [insert one or two details describing it]. My favorite hobby is [insert one or two things you enjoy doing]. I have a lot of [insert one or two positive words to describe how you can describe yourself]. I love [insert one or two positive words that describe the experience you want to share with others]. I'm [insert one or two positive words

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city known for its historical landmarks, vibrant arts scene, and rich cultural heritage. It has been the capital of France since 1871 and is home to many of the country's most famous landmarks and museums. Paris is also a major center for fashion, food, and entertainment, and is a popular tourist destination for millions of visitors annually. Despite its fast-paced pace, Paris has a rich and diverse culture that is hard to ignore. Its skyline is often a source of inspiration for artists and writers, and its numerous museums and galleries are a testament to its artistic legacy. With its large population and diverse population, Paris

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly uncertain and depends on a number of factors, including technological progress, regulatory changes, and public opinion. Here are some possible future trends in AI:

1. Advancements in machine learning and deep learning: As AI technology continues to advance, we can expect to see more complex algorithms and techniques that can learn from vast amounts of data and make more accurate predictions and decisions.

2. Increased use of AI for healthcare: AI is already being used in many healthcare applications, from diagnostic tools to personalized treatment plans. We can expect to see even more significant advances in this area in the future.

3. Emphasis on ethics and safety: As AI
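
The streaming cell prints each chunk as soon as it arrives. If the full text is also needed afterwards, the chunks can be collected while printing. The sketch below illustrates that pattern with `fake_stream`, a hypothetical async generator standing in for the streaming call, and `print_and_collect`, a helper name chosen for this example; neither is part of SGLang.

```python
import asyncio


async def fake_stream(text: str):
    """Hypothetical stand-in for a streaming call: yields word-sized chunks."""
    for index, word in enumerate(text.split(" ")):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield word if index == 0 else " " + word


async def print_and_collect(chunks) -> str:
    """Print chunks as they arrive and return the full text at the end."""
    parts = []
    async for chunk in chunks:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


full_text = asyncio.run(print_and_collect(fake_stream("Paris is the capital of France")))
```

Collecting the chunks in a list and joining once at the end avoids repeated string concatenation for long generations.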




```python
llm.shutdown()
```
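
If generation code can raise before `shutdown()` is reached, the engine may be left running. One possible way to guard against that is a small context manager, sketched below. `FakeEngine` is a hypothetical stand-in that only records whether `shutdown()` was called, and `managed_engine` is an illustrative helper, not an SGLang API.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine that only tracks shutdown."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(factory):
    """Create an engine and always call shutdown(), even if the body raises."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


try:
    with managed_engine(FakeEngine) as engine:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(engine.is_shut_down)  # True
```

With the real engine, the factory could be something like `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`, mirroring the launch cell at the top of this document.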