# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed, working Python script is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested event loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
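
For context, the restriction comes from `asyncio` itself: `asyncio.run()` refuses to start when an event loop is already running in the current thread, which is the situation inside IPython and Jupyter. The following standard-library sketch reproduces that error without SGLang; the coroutine names are illustrative only.

```python
import asyncio


async def answer():
    # Stand-in for any coroutine you might want to run.
    return 42


async def call_run_inside_running_loop():
    # We are already inside an event loop here, as in a notebook cell.
    coro = answer()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(call_run_inside_running_loop())
print(message)
```

`nest_asyncio.apply()` patches `asyncio` so that nested calls of this kind are allowed.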

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Sarah and I want to learn about the different types of metaphors and symbols in literature. Can you provide some examples of metaphors and symbols that you have learned? The more detailed and specific you are, the better. Additionally, please provide some tips on how to apply these concepts in real-life situations. I would love to gain a better understanding of the subject. Sure, I can definitely help you with that! Here are some examples of metaphors and symbols that you might come across in literature:

Metaphors:

1. "The rose is a rose is a rose" - this metaphor compares a rose to a flower, which is
Prompt: The president of the United States is
Generated text:  a position of great importance and the highest office in the government of the United States. It is currently held by Joe Biden, and it will be replaced by Donald Trump's successor, Joseph Biden, a day after his inauguration on January 20, 2021.

Based on that paragraph can we concl

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short, positive, enthusiastic statement about your personality or skills]. I'm always looking for new challenges and opportunities to grow and learn. Thank you for taking the time to meet me. [Name] [Company name] [Job title] [Company website] [LinkedIn profile] [Twitter handle] [Facebook page] [Email address] [Phone number] [Website URL] [LinkedIn URL] [Twitter URL] [Facebook URL

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting numerous museums, theaters, and other attractions. Paris is a popular tourist destination and is known for its rich history, art, and cuisine. The city is also home to many international organizations and institutions, including the French Academy of Sciences and the European Parliament. Paris is a vibrant and dynamic city with a rich cultural heritage that continues to attract visitors from around the world. The city is also known for its diverse population, with a mix of French, European,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that could be expected in the future:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes, reduce costs, and increase efficiency. As AI technology continues to improve, we can expect to see even more widespread adoption in healthcare, with more personalized and accurate diagnoses, treatment plans, and patient care.

2. AI in finance: AI is already being used in finance to improve risk management, fraud detection, and trading algorithms. As AI technology
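
The phrase "overlap removal" in the cell above refers to merging streamed chunks without emitting the same text twice. As a rough illustration of the idea (not necessarily how `stream_and_merge` is implemented), the sketch below drops the longest prefix of each new chunk that already appears at the end of the accumulated text. The helper names `drop_overlap` and `merge_chunks` are hypothetical.

```python
def drop_overlap(existing_text: str, new_chunk: str) -> str:
    """Return new_chunk without the prefix that repeats the end of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text that overlaps with what was already merged."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


# Hypothetical chunks where each one repeats part of the previous text.
chunks = ["The capital", " capital of France", " France is Paris."]
print(merge_chunks(chunks))
```

This heuristic cannot tell a genuine repetition from an accidental overlap, so it only suits streams whose chunks are known to overlap.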



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I am a [occupation], and I enjoy [occupation-related activities]. I have [number] years of experience in [occupation-related field]. I am passionate about [occupation-related topic or cause]. I am always up-to-date with the latest [occupation-related developments]. I am always looking for ways to [occupation-related improvement or growth]. I am always looking for new ways to [occupation-related challenges or obstacles]. I am always eager to learn and grow, and I am always determined to make a positive impact in the world. If you have a question or need assistance with a related topic or cause, please let me know,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is the capital city of France and is the largest and most populous city in the country. It is also the world's most populous city 
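
A common reason to use an asynchronous API is that the calling code can keep several requests in flight at once. The sketch below illustrates the pattern with `asyncio.gather`. Here `fake_generate` is a hypothetical stand-in for an awaitable generation call, so the example runs without a model; it returns a dictionary with a `text` key to mirror the outputs shown above.

```python
import asyncio


async def fake_generate(prompt: str) -> dict:
    # Hypothetical stand-in for an awaitable generation call.
    await asyncio.sleep(0.01)  # simulate inference latency
    return {"text": f" (completion for: {prompt})"}


async def run_concurrently(prompts):
    # Schedule all requests at once; gather returns results in input order.
    return await asyncio.gather(*(fake_generate(p) for p in prompts))


prompts = ["Hello, my name is", "The capital of France is"]
outputs = asyncio.run(run_concurrently(prompts))
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```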

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [Occupation]!
I'm currently [Age] years old and I currently live in [City or Country]. I have always been [Preferred Hobby/Interest] and I enjoy [Reason for Interest].
I'm [Physical Appearance] and I have a [Physical Trait] personality.
I value integrity, honesty, and fairness in everything I do. I believe that when people treat each other with respect and kindness, the world will be a better place.
I'm looking forward to [Reason for Coming to the Interview] and I'm excited to see how my skills and personality will be put to use in

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Does the fact that Paris is the capital of France mean that it is not a city? No, Paris is the capital of France, which means it is a city in France. While the capital city of France is Paris, it is not a city in itself. A city is a physical place that includes all of its buildings, streets, and infrastructure. Paris is a city, but it is not itself a city.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by a number of trends and changes that are expected to impact the development and deployment of AI technology. Some of the key trends that are likely to shape the future of AI include:

1. Increased use of AI in healthcare and medicine: As AI is increasingly used in medical diagnosis and treatment, it is likely that more and more medical professionals will be trained to use AI to improve patient outcomes. Additionally, there is increasing demand for AI-based technologies in diagnostics, drug discovery, and genetic research, which may lead to a significant expansion of AI in healthcare.

2. AI-driven automation: AI is already being used in a wide

In [6]:
llm.shutdown()
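
If generation code can raise, it may be worth guaranteeing that `shutdown()` still runs. One possible pattern, sketched here with a hypothetical `DummyEngine` in place of `sgl.Engine` so the snippet runs without a GPU, wraps the engine in a small context manager. The helper name `shutting_down` is an assumption made for illustration and is not part of SGLang.

```python
from contextlib import contextmanager


class DummyEngine:
    """Hypothetical stand-in exposing only the shutdown() method used above."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


@contextmanager
def shutting_down(engine):
    # Yield the engine and always call shutdown(), even if the body raises.
    try:
        yield engine
    finally:
        engine.shutdown()


engine = DummyEngine()
try:
    with shutting_down(engine) as llm:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(engine.closed)
```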