# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
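
To make the idea of a custom server more concrete, here is a minimal sketch using only the Python standard library. It is not the linked `custom_server` example. `EchoEngine`, `handle_generate`, and `make_server` are hypothetical names introduced for illustration, and the stub only mimics the `{"text": ...}` output shape used later in this document.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoEngine:
    """Hypothetical stand-in for an engine; it only mimics the output shape."""

    def generate(self, prompt, sampling_params=None):
        return {"text": f" (echo) {prompt}"}


def handle_generate(engine, raw_body: bytes):
    """Turn a JSON request body into a (status_code, JSON response bytes) pair."""
    try:
        payload = json.loads(raw_body)
        prompt = payload["prompt"]
    except (ValueError, KeyError, TypeError):
        error = {"error": "expected a JSON object with a 'prompt' field"}
        return 400, json.dumps(error).encode()
    output = engine.generate(prompt, payload.get("sampling_params", {}))
    return 200, json.dumps({"text": output["text"]}).encode()


def make_server(engine, host="127.0.0.1", port=8000):
    """Build (but do not start) an HTTP server that forwards POST bodies to the engine."""

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            status, body = handle_generate(engine, self.rfile.read(length))
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return HTTPServer((host, port), Handler)
```

Keeping the request handling in a plain function (`handle_generate`) makes it easy to test without opening a socket. Calling `make_server(engine).serve_forever()` would start serving, and a real engine object could presumably take the place of the stub.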



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
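
The reason this matters can be shown with the standard library alone. The sketch below does not use SGLang or `nest_asyncio`; the helper names are made up for illustration. It demonstrates that plain `asyncio.run()` raises a `RuntimeError` when it is called from code that is already running inside an event loop.

```python
import asyncio


def loop_is_running() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


async def noop() -> str:
    return "done"


async def nested_call() -> str:
    # Inside a running loop, asyncio.run() refuses to start a second loop.
    coro = noop()
    try:
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"


outside = loop_is_running()  # no loop is running at module level
inside_result = asyncio.run(nested_call())
print(outside, inside_result)
```

Applying `nest_asyncio` patches asyncio so that such nested calls are permitted.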

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch  # only imported in CI
else:
    # Allow asyncio.run() inside an already running event loop (e.g. a notebook).
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  John and I am interested in taking a class. I am a student at the University of Alberta and am in the Health Sciences section of the University. What is your name? Mr. Salter. How can I assist you today? Mr. Salter, could you please tell me more about your medical training? Mr. Salter, could you please describe your medical training? Mr. Salter, could you please describe your medical training?
It seems like you are trying to describe your medical training to me, but I do not have any information about your medical training. Could you please provide more context or ask me for clarification? Once
Prompt: The president of the United States is
Generated text:  36 years younger than the president of Mexico. The president of Mexico is younger than the president of the United States by 13 years. If the president of the United States is currently 50 years old, how old will the president of the United States be in 10 years?
To determine the age of the 
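
The `sampling_params` dictionary above sets `temperature` and `top_p`. As a rough illustration of what these two knobs conventionally mean, the sketch below applies temperature scaling and nucleus (top-p) filtering to a made-up set of logits. It is a textbook-style toy in plain Python, not SGLang's sampler, and the function names are hypothetical.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits / temperature; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize so the surviving candidates sum to 1.
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
probs = apply_temperature(toy_logits, temperature=0.8)
candidates = top_p_filter(probs, top_p=0.95)
print(sorted(candidates))
```

With these toy numbers the least likely token falls outside the 0.95 nucleus, so sampling would choose among the remaining three.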

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament and the French Academy of Sciences. Paris is a cultural and economic center that plays a significant role in French politics and society. The city is known for its diverse population, including French, African, and immigrant communities. It is also a popular tourist destination, with millions of visitors annually. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly. The city is known for its rich history, art, and culture, making it a popular destination for

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with human intelligence: AI systems will become more integrated with human intelligence, allowing them to learn from and adapt to the behavior and preferences of humans. This will enable more sophisticated and personalized AI systems that can better understand and respond to human needs.

2. Enhanced machine learning capabilities: AI systems will become more capable of learning from large amounts of data, which will enable them to make more accurate predictions and decisions. This will lead to more efficient and effective use of resources, as well as better decision-making in various industries.

3. Increased reliance on AI for
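
The banner in the cell above mentions overlap removal. One plausible way to merge streamed chunks whose boundaries repeat text is sketched below. This is an assumption about the general technique, not necessarily how `stream_and_merge` is implemented, and `merge_chunks` and the demo chunks are made up for illustration.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Made-up chunks in which each one repeats the tail of the previous one.
demo_chunks = ["The capital", " capital of France", " France is Paris."]
merged_text = merge_chunks(demo_chunks)
print(merged_text)
```

A heuristic like this cannot tell an accidental overlap from a genuine repetition: merging `"ha"` with a second `"ha"` would yield `"ha"`, so it only suits streams where chunk boundaries are known to overlap.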



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name], and I'm a [insert profession or career] with a passion for [insert your field or area of interest]. I'm always looking for opportunities to [insert a short reason why you are passionate about your field or area of interest]. I enjoy [insert one or two hobbies or interests that relate to your field or area of interest]. I am always up for [insert anything from challenging projects, team-building activities, or learning new skills to achieve personal and professional growth]. I believe that my unique abilities and experiences make me a valuable asset to any organization or team. I'm excited to be a part of your team and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

What is the answer? (Select from the options below.)

A. New Orleans
B. London
C. Washington D.C.
D. Paris

D. Paris

Paris
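
One reason to prefer an asynchronous call is that several requests can be in flight at once. The sketch below shows the general `asyncio.gather` pattern with a stub engine. `StubAsyncEngine` and `generate_all` are hypothetical, the stub only imitates an awaitable call that returns a `{"text": ...}` dictionary, and whether issuing one call per prompt is preferable to passing a list to a real engine is not something this sketch establishes.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in; it only mimics an awaitable call returning {'text': ...}."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(engine, prompts, sampling_params):
    # Create one awaitable per prompt and wait for all of them together.
    # asyncio.gather returns results in the same order as its arguments.
    pending = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*pending)


demo_prompts = ["Hello, my name is", "The capital of France is"]
results = asyncio.run(
    generate_all(StubAsyncEngine(), demo_prompts, {"temperature": 0.8})
)
for prompt, result in zip(demo_prompts, results):
    print(prompt, "->", result["text"])
```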

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly.
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert character's name]. I'm here to help anyone who needs a little extra support and guidance in life, including anyone who has a difficult time with anxiety or stress. I'm someone who can offer practical advice and provide a listening ear, and I always strive to make sure that my advice is helpful and effective. So, if you need any help or guidance in life, please don't hesitate to reach out to me. #selfintroduction #AnxietyAwareness #LifeAdvice #PersonalDevelopment #HelpfulPerson

Hello, my name is [insert character's name]. I'm here to help anyone who needs a little extra



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is the largest city in France and is the capital of France. It is located in the Île-de-France region of the country and is the second most populous city in the world after New York City. Paris is known for its art, culture, and historical landmarks, such as Notre-Dame Cathedral and the Louvre Museum. The city is also famous for its fashion and cuisine, and is a major economic and cultural center in Europe. Paris is a significant city in the world of luxury and celebrity culture, with many top fashion and entertainment figures residing there. According to the 2020 census, Paris has



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be dominated by the development of more complex and sophisticated artificial intelligence that can handle increasingly complex and varied tasks. This could include the development of more advanced machine learning algorithms that are capable of learning from large amounts of data, recognizing patterns and making predictions in new and unexpected ways.

Another trend is the increasing use of AI in industries such as healthcare, finance, transportation, and manufacturing, where it can be used to improve efficiency, reduce costs, and enhance safety. AI-powered systems could also be used to automate repetitive and time-consuming tasks, freeing up workers to focus on more critical and creative work.

As AI technology continues to advance, we
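
When streaming, it is often useful to display chunks as they arrive and still keep the full text afterwards. The sketch below shows that pattern with a plain async generator. `fake_stream` is a hypothetical stand-in for a streaming call and does not involve SGLang.

```python
import asyncio


async def fake_stream(text, chunk_size=5):
    """Hypothetical stand-in for a streaming call: yields the text in small pieces."""
    for start in range(0, len(text), chunk_size):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield text[start:start + chunk_size]


async def consume(stream):
    """Print chunks as they arrive and also return the assembled text."""
    pieces = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        pieces.append(chunk)
    print()  # finish the line once the stream is exhausted
    return "".join(pieces)


full_text = asyncio.run(consume(fake_stream(" Paris is the capital of France.")))
```

Collecting the pieces in a list and joining once at the end avoids repeated string concatenation on long outputs.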




```python
llm.shutdown()
```
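
In a script, as opposed to a notebook, it can help to guarantee that `shutdown()` runs even when an error interrupts the run. A minimal sketch of that pattern follows. `StubEngine` and `managed` are hypothetical names, and the stub only exposes `generate()` and `shutdown()` so that the example runs without a model.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in exposing only generate() and shutdown()."""

    def __init__(self):
        self.closed = False

    def generate(self, prompt, sampling_params=None):
        if self.closed:
            raise RuntimeError("engine already shut down")
        return {"text": f" <completion for: {prompt}>"}

    def shutdown(self):
        self.closed = True


@contextmanager
def managed(engine):
    """Yield the engine and call shutdown() afterwards, even if an error occurred."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with managed(engine) as eng:
        eng.generate("Hello, my name is")
        raise ValueError("simulated failure mid-run")
except ValueError:
    pass

print("engine closed:", engine.closed)
```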