# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation
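
To make the difference between the four modes concrete, here is a minimal sketch of the four calling patterns. It does not load a model: `FakeEngine` is a hypothetical stand-in that returns canned chunks, and its `generate` / `async_generate` methods with a `stream` flag are only assumed to mirror the shape of the real engine calls used later in this document.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for the engine; returns canned chunks, no model involved."""

    _tokens = [" Paris", ",", " of", " course", "."]

    def generate(self, prompt, sampling_params=None, stream=False):
        if stream:
            # Streaming: hand back an iterator that yields one dict per chunk.
            return ({"text": tok} for tok in self._tokens)
        # Non-streaming: hand back the complete result at once.
        return {"text": "".join(self._tokens)}

    async def async_generate(self, prompt, sampling_params=None, stream=False):
        if stream:
            async def chunks():
                for tok in self._tokens:
                    await asyncio.sleep(0)  # yield control, as real I/O would
                    yield {"text": tok}
            return chunks()
        await asyncio.sleep(0)
        return {"text": "".join(self._tokens)}


llm = FakeEngine()
prompt = "The capital of France is"

# 1. Non-streaming synchronous: one blocking call, one complete result.
sync_text = llm.generate(prompt)["text"]

# 2. Streaming synchronous: iterate over chunks as they arrive.
sync_stream_text = "".join(c["text"] for c in llm.generate(prompt, stream=True))


async def run_async():
    # 3. Non-streaming asynchronous: await the complete result.
    result = await llm.async_generate(prompt)
    # 4. Streaming asynchronous: await the generator, then consume it with `async for`.
    parts = []
    async for chunk in await llm.async_generate(prompt, stream=True):
        parts.append(chunk["text"])
    return result["text"], "".join(parts)


async_text, async_stream_text = asyncio.run(run_async())
print(sync_text, sync_stream_text, async_text, async_stream_text, sep="\n")
```

In this stub every chunk carries only new text. A real engine may structure its chunks differently, which is why the examples below use merge helpers instead of concatenating chunks directly.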

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code where an event loop is already running (a nested event loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
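
The patch is needed because of a standard-library restriction: `asyncio.run()` refuses to start when an event loop is already running in the current thread, which is the situation inside IPython or Jupyter. The sketch below reproduces that restriction using only `asyncio`. The coroutine names `inner` and `outer` are illustrative and not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, so a nested
    # asyncio.run() call is rejected by the standard library.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

`nest_asyncio.apply()` patches `asyncio` so that such nested calls are allowed, which is why it should run before any engine call in a notebook.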

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Mike. I am an American boy. I have a pen pal from England. His name is Nick. He is from a small city in England. I don't know Nick very well yet. Nick lives with his parents and his wife in a small flat in a village. He likes to play games and watch TV all the time. He has a big family. He has a sister and a brother. They all love music very much. They like to listen to music together. They often have a big party at their house. Nick's family is very different from mine. When I visit my home, I help my parents do some house
Prompt: The president of the United States is
Generated text:  a powerful man and has many ways to affect the development of the country. He is the leader of the country and he takes care of all the citizens. The president of the United States is a very important person in the country. In America, the president is the leader of the country and he is the commander-in-chief of the armed forces. He is the head of government an

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich cultural heritage, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also a major economic center and a major tourist destination, with a diverse population of over 10 million people. The city is home to many famous museums, including the Musée d'Orsay, the Musée Rodin, and the Musée d'Orsay. Paris is also known for its cuisine, including French cuisine, and its fashion industry. The city is a major center for science

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn and adapt to human behavior and preferences. This could lead to more sophisticated and personalized AI systems that can better understand and respond to human needs.

2. Enhanced privacy and security: As AI systems become more sophisticated, there will be an increased risk of privacy and security breaches. Governments and organizations will need to develop new technologies and practices to protect the privacy and security of AI systems.

3. Greater reliance on AI for decision-making: AI is likely to become more integrated into decision-making processes
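
The "overlap removal" mentioned above can be pictured as follows: before a new chunk is appended, any prefix of it that repeats the tail of the text collected so far is dropped. This is a minimal sketch of that idea, not the actual implementation of `stream_and_merge`. The helper names `strip_overlap` and `merge_chunks` and the sample chunks are hypothetical.

```python
def strip_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is also a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first so the largest repeat is removed.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, skipping text that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which consecutive pieces partially repeat each other.
chunks = [" Paris", " Paris, the", " the City", " of Light"]
merged = merge_chunks(chunks)
print(repr(merged))
```

A suffix/prefix heuristic like this cannot tell an accidental repeat from a genuine one: a chunk that legitimately starts with the text that just ended would also be trimmed.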



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name]. I'm [age] years old and currently residing in [city]. I enjoy [occupation or hobby] and have always had a strong sense of [value or importance]. I strive to [make a positive impact] and believe in the power of [how one can change the world]. I love [a unique trait or characteristic] and strive to make the world a better place by [how one can do so]. I am always looking to expand my knowledge and to learn from those around me. I am [a quote or proverb] to my heart, and it inspires me to always strive to be my best self. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the region of Nord-Pas-de-Calais, about halfway between the German-speaking regions of Nord and Picardy and the French-speaking regions of Loire and Brittany. The city is home to over 3.5 million people, and is a major 

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [type of character, such as a detective, actor, athlete, etc.]. I am passionate about [what you like to do, such as sports, music, writing, etc.] and enjoy being in the spotlight. I have a knack for [something specific, such as storytelling, problem-solving, etc.] and I love to show off my skills through my work. I'm always looking for new challenges and opportunities to grow and improve, and I thrive on the adrenaline rush of being on the stage or in front of a camera. So, if you need any help or advice, don't hesitate

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a city known for its magnificent architecture, rich history, and vibrant culture. It is a world-renowned city with a rich tapestry of cultures and a vibrant energy. Paris is also home to numerous museums, theaters, and landmarks, making it a must-visit destination for anyone interested in culture and history. The city is also known for its lively nightlife and delicious cuisine. Paris has a strong sense of civic pride and a commitment to education and arts, making it a popular destination for travelers and residents alike. The city's unique blend of historical and modern elements has made it a center of power and influence in France and beyond.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  very promising, and there are many possible trends and technologies that are expected to shape the field in the coming years. Some of the key trends in AI include:

1. Increased focus on ethical and social implications: As AI becomes more prevalent in our lives, there will be an increased focus on ethical and social implications of AI. This includes issues such as bias, privacy, and transparency.

2. More automation and AI in manufacturing and other industries: As AI becomes more prevalent in manufacturing and other industries, we will see more automation and AI-driven solutions. This will likely result in increased efficiency, reduced human errors, and increased productivity.

3.

In [6]:
llm.shutdown()
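
In a script, as opposed to a notebook, it can be useful to guarantee that `shutdown()` runs even if generation raises an exception. The sketch below shows one way to do that with a small context manager. `FakeEngine` and `managed_engine` are hypothetical names used for illustration; the only assumption about the real engine is that it exposes `shutdown()` as in the cell above.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompt, sampling_params=None):
        if prompt is None:
            raise ValueError("prompt must not be None")
        return {"text": " ok"}

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call shutdown() on exit, even after an error."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with managed_engine(engine) as llm:
        llm.generate(None)  # simulate a failure in the middle of a batch
except ValueError:
    pass

shutdown_called = engine.is_shut_down
print("shutdown called:", shutdown_called)
```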