# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Nest Asyncio
If you want to use the **Offline Engine** in IPython, Jupyter, or any other environment where an event loop is already running, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
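
The underlying problem can be reproduced with the standard library alone: `asyncio.run` refuses to start when an event loop is already running in the current thread, which is the situation inside IPython and Jupyter, and the cells below call `asyncio.run(main())`. The sketch below (the function names are illustrative) triggers and captures that error; `nest_asyncio.apply()` patches asyncio so that such nested calls are allowed.

```python
import asyncio


async def answer() -> int:
    return 42


def nested_run_error() -> str:
    """Call asyncio.run() from inside a running loop and return the error message."""

    async def outer() -> str:
        coro = answer()
        try:
            asyncio.run(coro)  # fails: a loop is already running in this thread
            return "no error"
        except RuntimeError as exc:
            return str(exc)
        finally:
            coro.close()  # the coroutine never ran; close it to avoid a warning

    # The outer asyncio.run() is fine because no loop is running yet.
    return asyncio.run(outer())


print(nested_run_error())
```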

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch  # CI-only module used by the docs test setup
else:
    # Allow asyncio.run() inside the notebook's already-running event loop
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```

```text
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.71it/s]
```

### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```text
Prompt: Hello, my name is
Generated text:  Jean Bardin and I’m a fully-fledged cook in the kitchen of a 5 star restaurant in Westport, CT. I have spent my whole life wanting to be a chef and it’s been a lifelong dream to be a chef. I’ve always loved cooking, and as a child I would go to the kitchen and make things from scratch. The love of cooking has continued to grow with my experiences working in the kitchen of restaurants, and now with my own restaurant, and my husband and I have taken our unique approach to the kitchen.
The kitchen of my restaurant is my most important space. It’s a space where I can
Prompt: The president of the United States is
Generated text:  a position that is held by an official who is appointed by the head of state. The president serves a term of two years, and can be re-elected. While some people may view a president as an elected official, others believe that the position is held by someone who is appointed by the head of state.
There are several ways to e
```

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


```text
=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm passionate about [job title] and [job title]. I'm always looking for new challenges and opportunities to grow and learn. What do you do for a living? I'm a [job title] at [company name], and I'm passionate about [job title] and [job title]. I'm always looking for new challenges and opportunities to grow and learn. What do you do

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting numerous museums, theaters, and other attractions. Paris is a popular tourist destination and a major hub for international business and diplomacy. The city is known for its rich history, diverse culture, and vibrant nightlife. It is a major transportation hub, with many major highways and rail lines connecting the city to other parts of France and the world. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into the urban landscape. The

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more advanced, it is likely to be integrated with human intelligence in new and innovative ways. This could include the use of AI to enhance human decision-making, improve the accuracy of medical diagnoses, and even enhance human creativity.

2. Greater emphasis on ethical considerations: As AI becomes more advanced, there will be a greater emphasis on ethical considerations. This could include the development of AI that is designed to be transparent, accountable, and responsible, and that is used in a way that is consistent with human values and principles.

3. Increased focus on
```

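
`stream_and_merge` is imported from `sglang.utils` and stitches the streamed chunks into a single string with overlap removal. The sketch below illustrates that idea in plain Python; it is not the library's source. `trim_overlap` drops the longest prefix of a new chunk that already appears as a suffix of the text received so far. It assumes each streamed chunk is a dict with a `text` key, like the non-streaming outputs above, and `fake_chunks` is a hypothetical stand-in for the engine's stream.

```python
from typing import Dict, Iterable


def trim_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that is already a suffix of `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_stream(chunks: Iterable[Dict[str, str]]) -> str:
    """Accumulate streamed chunks into one string, removing repeated overlaps."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk["text"])
    return merged


# Hypothetical stream: the second chunk repeats " Paris" from the first one.
fake_chunks = [{"text": " Paris"}, {"text": " Paris is"}, {"text": " the capital."}]
print(merge_stream(fake_chunks))
```

One caveat of this heuristic: a chunk whose first characters legitimately match the end of the text so far is trimmed as well, so it trades occasional lost characters for never printing a repeat.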
### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


```text
=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I am a [Level] in the field of [Your Profession]. I bring a unique blend of [Your Expertise] and [Your Strengths], and I'm always seeking to improve myself. I'm excited to be here and explore the world of [Your Profession]!

I'm [Your Age] years old, and I'm passionate about [Your Profession]. I'm a [Your Expertise] and I'm always learning new things and pushing myself to be the best. I'm committed to my goals and strive to achieve them, even when it's difficult or challenging.

I'm always looking for new

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light. It is a bustling city with a rich history and a vibrant culture, including the world-renowned Eiffel Tower. Paris is famous for its iconic landmarks such as Notre Dame Cathedral, Louvre Museum, and the Pa
```
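
Because `async_generate` is awaited, other coroutines can run while a batch is in flight. One way to use that, sketched below with a hypothetical stand-in (`fake_async_generate`) in place of the real engine, is to submit several batches and await them together with `asyncio.gather`, which returns the results in submission order. Whether the real engine actually overlaps such concurrent requests in its scheduler is an assumption worth measuring before relying on it.

```python
import asyncio
from typing import Dict, List


async def fake_async_generate(prompts: List[str], sampling_params: Dict) -> List[Dict[str, str]]:
    """Hypothetical stand-in for llm.async_generate: returns one dict per prompt."""
    await asyncio.sleep(0.01)  # simulate waiting on the engine
    return [{"text": f" <completion of {p!r}>"} for p in prompts]


async def run_batches(batches: List[List[str]]) -> List[List[Dict[str, str]]]:
    params = {"temperature": 0.8, "top_p": 0.95}
    # Start every batch, then wait for all of them; result order matches `batches`.
    return await asyncio.gather(*(fake_async_generate(b, params) for b in batches))


results = asyncio.run(run_batches([["Hello, my name is"], ["The capital of France is"]]))
for batch in results:
    for output in batch:
        print(output["text"])
```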

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly,
        # so text repeated across chunks is printed only once
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


```text
=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name] and I am a/an [insert occupation or professional title] with a/an [insert professional title or role] in [insert industry or specialty] and I am [insert nationality]. I have always had a love for [insert hobby, interest, or passion] and I am constantly striving to learn more about my field and improve my skills. I am [insert age] years old, and I have always been passionate about [insert personal passion or hobby]. I have always been respectful of others and always strive to be a good listener. I am excited to learn more about [insert profession or role] and to share my

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. The city, known as the City of Light, is home to one of the world's most famous museums, the Louvre, and the Eiffel Tower. It is also famous for its culinary traditions, particularly its boulangeries, and for its vibrant nightlife. Paris is a cosmopolitan city with a rich history and a diverse array of cultures. Its iconic landmarks, including the Eiffel Tower and Notre-Dame Cathedral, are celebrated worldwide. The city is also home to several of the world's top universities and is considered the birthplace of the French language. Paris is a major transportation hub with several airports and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  poised to be a fascinating and rapidly evolving field. Here are some potential trends that could shape the direction of AI in the years to come:

1. Increased Integration with Human Intelligence: One of the most significant trends in AI is the increased integration of AI with human intelligence. This could result in more accurate, nuanced, and human-like AI that can adapt to complex human environments and emotions.

2. Autonomous Systems: AI is already being used in autonomous vehicles, which are designed to navigate the roads and make decisions for the driver. We may see more widespread implementation of autonomous systems in other sectors, such as healthcare, transportation, and manufacturing.

3
```

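
`async_stream_and_merge` is also imported from `sglang.utils`; in the cell above it is consumed with `async for`, and each yielded piece is printed as soon as it arrives. A plain-Python sketch of that pattern follows, assuming the helper removes text that a new chunk repeats from what was already received (as the "no repeats" label suggests); it is not the library's source, and `fake_stream` is a hypothetical async generator standing in for the engine's stream.

```python
import asyncio
from typing import AsyncIterator, Dict, List


async def fake_stream() -> AsyncIterator[Dict[str, str]]:
    """Hypothetical engine stream; the second chunk repeats text from the first."""
    for text in [" The future", " future of AI", " is bright."]:
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield {"text": text}


async def stream_without_repeats(stream: AsyncIterator[Dict[str, str]]) -> AsyncIterator[str]:
    """Yield only the part of each chunk that is not already in the merged text."""
    merged = ""
    async for chunk in stream:
        text = chunk["text"]
        # Longest prefix of this chunk that already ends the merged text (0 if none).
        overlap = next(
            (n for n in range(min(len(merged), len(text)), 0, -1) if merged.endswith(text[:n])),
            0,
        )
        merged += text[overlap:]
        yield text[overlap:]


async def main() -> List[str]:
    pieces: List[str] = []
    async for piece in stream_without_repeats(fake_stream()):
        print(piece, end="", flush=True)  # print each piece as it arrives
        pieces.append(piece)
    print()
    return pieces


pieces = asyncio.run(main())
```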
```python
# Shut down the engine and release its resources
llm.shutdown()
```
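
If a script can raise before it reaches `llm.shutdown()`, the engine is never shut down. A small sketch using `contextlib.contextmanager` that always calls `shutdown()` in a `finally` block; `engine_session` and `FakeEngine` are hypothetical names, and with the real engine the factory would presumably be something like `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, TypeVar

E = TypeVar("E")


@contextmanager
def engine_session(factory: Callable[[], E]) -> Iterator[E]:
    """Create an engine, hand it to the with-block, and always shut it down."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()  # runs on normal exit and when the block raises


class FakeEngine:
    """Hypothetical stand-in exposing only the shutdown() method used above."""

    def __init__(self) -> None:
        self.closed = False

    def shutdown(self) -> None:
        self.closed = True


with engine_session(FakeEngine) as engine:
    assert not engine.closed
print(engine.closed)
```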