# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
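One way to apply the patch only when it is needed is to check whether an event loop is already running, which is typically the case in IPython or Jupyter. The sketch below is an illustration and not part of SGLang; the helper names `in_running_event_loop` and `apply_nest_asyncio_if_needed` are hypothetical.

```python
import asyncio


def in_running_event_loop() -> bool:
    """Return True if the caller is executing inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread (for example, a plain Python script).
        return False
    return True


def apply_nest_asyncio_if_needed() -> bool:
    """Patch asyncio for nested use only when a loop is already running.

    Returns True if the patch was applied, False otherwise.
    """
    if not in_running_event_loop():
        return False
    import nest_asyncio  # imported lazily so plain scripts do not require it

    nest_asyncio.apply()
    return True
```

In a plain script `apply_nest_asyncio_if_needed()` returns `False` and leaves asyncio untouched; in a notebook cell it would apply the patch before the engine is used.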

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Lisa and I'm a 16 year old girl. I want to apply for an internship. But I'm not sure how to approach it. So, I'm looking for advice on how to make a successful application. 

I want to be a good intern, but I don't want to apply just for the sake of it. I want to apply because I want to learn a lot. I want to have a good understanding of the industry. I want to be able to apply my knowledge on the job.

What is the best way to approach an internship? What should I do first? How do I go about studying the industry
Prompt: The president of the United States is
Generated text:  trying to decide how many military personnel he should keep on the home front, while still maintaining some standard of living. The president is considering two options: Option 1: Keep 50% of the military personnel, but have only half of them serve in combat duty. Option 2: Keep 75% of the military personnel, but have only 1/3 of them serve in combat duty. Calculate the di
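Batch results are often written to disk for later inspection. The sketch below pairs each prompt with its completion and serializes the pairs as JSON Lines. It relies only on what the loop above shows, namely that each output is a dict with a `"text"` key; the function name `to_jsonl`, the record field names, and the stand-in data are assumptions made for illustration.

```python
import json
from typing import Dict, List


def to_jsonl(prompts: List[str], outputs: List[Dict]) -> str:
    """Pair each prompt with its generated text, one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = [
        json.dumps({"prompt": prompt, "completion": output["text"]}, ensure_ascii=False)
        for prompt, output in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Stand-in data shaped like the result of llm.generate(prompts, sampling_params).
example_prompts = ["The capital of France is"]
example_outputs = [{"text": " Paris."}]
print(to_jsonl(example_prompts, example_outputs))
```

With a real engine, `example_prompts` and `example_outputs` would be replaced by the `prompts` list and the value returned by `llm.generate`.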

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you. What can you tell me about yourself? I'm a [job title] at

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French National Museum, and the French Parliament building. Paris is a cultural and historical center with a rich history dating back to the Roman Empire and the French Revolution. It is a popular tourist destination and a major economic hub in Europe. The city is known for its fashion industry, art scene, and cuisine. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly. It is a city of people, with a diverse population of

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more sophisticated, it is likely to become more integrated with human intelligence, allowing it to perform tasks that are currently only possible with human expertise.

2. Greater use of AI in healthcare: AI is already being used in healthcare to diagnose and treat diseases, and it has the potential to revolutionize the field. As AI becomes more advanced, it is likely to be used in even more complex and personalized ways.

3. Increased use of AI in finance: AI is already being used in finance to analyze large amounts of data and make predictions about market
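To make "overlap removal" concrete: when consecutive streamed chunks repeat some already-emitted text, the repeated prefix of each new chunk can be dropped before appending it. The following is a minimal sketch of that idea under stated assumptions; `drop_overlap` and `merge_chunks` are hypothetical names, and this is not necessarily how `stream_and_merge` is implemented.

```python
from typing import Iterable


def drop_overlap(existing_text: str, new_chunk: str) -> str:
    """Return new_chunk without the longest prefix that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first so the largest repeat is removed.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, trimming text that overlaps with what was already merged."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


print(merge_chunks(["Hello wor", "world", "ld!"]))  # prints "Hello world!"
```

A suffix/prefix heuristic like this cannot distinguish an overlapping chunk from a genuine repetition: `merge_chunks(["ha", "ha"])` yields `"ha"`. It is therefore only appropriate when chunks are known to overlap.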



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert character's name], and I'm [insert character's age and occupation]. I'm a [insert character's profession and role]. In my [insert number of years of service] years of experience in [insert relevant field or industry], I have [insert number of successful projects or achievements] that I've completed. I'm a [insert character's personality trait or characteristic] who is always ready to learn and make a positive impact on the world. I enjoy [insert character's hobbies or interests]. I'm [insert character's positive attitude or personality trait] who values teamwork and collaboration. I'm a [insert character's professional

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is located in the heart of the French heartland and is the country’s largest city. It is known for its iconic Eiffel Tower, as 
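When requests arrive independently, for example in a custom server, each one can be awaited as its own coroutine while a semaphore bounds how many are in flight. The sketch below illustrates that pattern with a stand-in engine so that it runs without a model. `StubAsyncEngine`, `generate_concurrently`, and `max_in_flight` are hypothetical names, and the sketch assumes the engine accepts a single prompt per `async_generate` call and returns a dict with a `"text"` key.

```python
import asyncio
from typing import Dict, List


class StubAsyncEngine:
    """Hypothetical stand-in exposing an async_generate coroutine."""

    async def async_generate(self, prompt: str, sampling_params: Dict) -> Dict:
        await asyncio.sleep(0.01)  # simulate generation latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(
    engine, prompts: List[str], sampling_params: Dict, max_in_flight: int = 2
) -> List[Dict]:
    """Run one request per prompt, with at most max_in_flight running at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def run_one(prompt: str) -> Dict:
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # asyncio.gather returns results in the same order as the inputs.
    return await asyncio.gather(*(run_one(prompt) for prompt in prompts))


results = asyncio.run(
    generate_concurrently(StubAsyncEngine(), ["a", "b", "c"], {"temperature": 0.8})
)
for result in results:
    print(result["text"])
```

For a fixed list of prompts known up front, passing the whole list in one call, as in the cell above, is simpler; the per-request pattern is mainly useful when prompts arrive at different times.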

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am an AI assistant designed to help people with a wide range of tasks. As a language model, my primary goal is to assist users in finding information, answering questions, and providing support to them. I am here to help people with a variety of needs, from general knowledge to specialized topics. Whether you need help with a problem, want to learn a new skill, or just want to chat, I am here to help you. My ultimate goal is to provide the best possible assistance to all users, regardless of their needs or abilities. Please feel free to ask me anything and I will do my best to assist

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is known as the "City of Love" and is a city of fine architecture, cultural life, and rich history. The city is located on the Seine River and is a bustling hub of commerce, finance, and entertainment. It is also known as the "City of Light" due to its iconic skyline and the many public squares, cafes, and theaters. Paris is a beloved city that has captivated countless travelers and artists throughout history. It is home to many world-renowned landmarks and museums, including Notre Dame Cathedral, the Louvre Museum, and the Eiffel Tower. Paris is a city of contrasts,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  full of possibilities and opportunities, and it is likely to continue to evolve and change rapidly. Here are some potential trends that AI is likely to experience in the coming years:

1. Increased Efficiency: AI has the potential to increase efficiency in various industries, including healthcare, finance, and manufacturing. The development of machine learning algorithms can automate repetitive and expensive tasks, freeing up human resources for more critical work.

2. Improved Personalization: AI can be used to personalize marketing campaigns and products based on individual preferences and behavior. This can lead to increased customer satisfaction and loyalty, as well as increased revenue for businesses.

3. Enhanced Safety: AI can
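A common variation on the streaming loop is to display chunks as they arrive while also keeping the full text for later use. The sketch below shows this with a stand-in async generator so that it runs without a model; `fake_chunk_stream` and `collect_stream` are hypothetical names, and the stand-in only imitates a stream of already-cleaned chunks.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_chunk_stream(chunks: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for an async generator of cleaned text chunks."""
    for chunk in chunks:
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield chunk


async def collect_stream(stream: AsyncIterator[str]) -> str:
    """Print each chunk as it arrives and return the concatenated text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


full_text = asyncio.run(collect_stream(fake_chunk_stream([" Paris", ".", " It", " is"])))
```

With the real engine, the stand-in would be replaced by the `async_stream_and_merge(llm, prompt, sampling_params)` call used in the cell above.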




```python
llm.shutdown()
```
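In a script, an exception raised during generation would skip a trailing `shutdown()` call. One way to guard against that is a small context manager that always shuts the engine down. This is a sketch, not an SGLang API: `engine_session` and `StubEngine` are hypothetical, and the only assumption about the engine is that it has the `shutdown()` method used above.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def engine_session(make_engine: Callable[[], object]) -> Iterator[object]:
    """Create an engine, yield it, and call shutdown() even if the body raises."""
    engine = make_engine()
    try:
        yield engine
    finally:
        engine.shutdown()


class StubEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def shutdown(self) -> None:
        self.is_shut_down = True


with engine_session(StubEngine) as engine:
    pass  # generation calls would go here

print(engine.is_shut_down)  # prints True
```

With the real engine, the factory could be `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`.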