# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython, Jupyter, or any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
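
Whether the patch is needed depends on whether an event loop is already running in the current thread. The check below is a minimal sketch that uses only the standard library to decide; the helper name `has_running_loop` is hypothetical and not part of SGLang.

```python
import asyncio


def has_running_loop() -> bool:
    """Return True if an asyncio event loop is already running in this thread."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running.
        return False
    return True


if has_running_loop():
    # Typical for IPython/Jupyter: allow nested use of the running loop.
    import nest_asyncio

    nest_asyncio.apply()
```

In a plain Python script no loop is running at import time, so the patch is skipped and `asyncio.run(...)` works unmodified.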

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    # Notebooks already run an event loop; patch it so asyncio.run() works inside it.
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Mary, and I live in a house on the 10th floor. I work as a night guard at a local hotel. I have a friend who is also a night guard at a hotel. The friend's name is John. 

We are both very good friends. We have lots of fun activities to share. We both enjoy going out to have a good time and fun with our friends. We usually talk to each other on the phone or have a coffee. 

I am looking for a day trip to explore a famous mountain. I would like to take our friend to explore this place. We are interested in exploring the famous
Prompt: The president of the United States is
Generated text:  a high-ranking government official who is appointed by the president of which country and must serve at least eight years?

The president of the United States is a high-ranking government official who is appointed by the president of the United States and must serve at least eight years. The current president of the United States is Joe Biden, who served two t
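
The loop above pairs each prompt with its result via `zip` and reads the generated text from the `"text"` key. As a minimal sketch, the same pairing can be turned into records for later processing. `to_records` is a hypothetical helper, and `fake_outputs` stands in for real engine output so that the snippet runs without a model.

```python
import json
from typing import Dict, List


def to_records(prompts: List[str], outputs: List[dict]) -> List[Dict[str, str]]:
    """Pair each prompt with the text of its result, in order."""
    if len(prompts) != len(outputs):
        raise ValueError("expected exactly one output per prompt")
    return [
        {"prompt": prompt, "completion": output["text"].strip()}
        for prompt, output in zip(prompts, outputs)
    ]


# Assumption: placeholder results shaped like the dicts printed above.
fake_outputs = [{"text": " Paris."}, {"text": " bright."}]
records = to_records(
    ["The capital of France is", "The future of AI is"], fake_outputs
)
for record in records:
    print(json.dumps(record))
```

With a real engine, `fake_outputs` would be replaced by the return value of `llm.generate(prompts, sampling_params)`.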

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can you tell me about yourself? As a [job title], I'm always looking for ways to improve my skills and knowledge. I'm always eager to learn new things and try new things. I'm also a great communicator and enjoy working with people from all walks of life. What's your favorite hobby or activity? As a [job title], I enjoy spending time with my family and friends. I also love to read and travel. What's your favorite book or movie? As a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also the seat of the French government and the largest city in Europe by population. Paris is a cultural and historical center with a rich history dating back to the Roman Empire and the Middle Ages. It is also home to many famous museums, including the Musée d'Orsay and the Musée Rodin. The city is known for its cuisine, including its famous croissants and its traditional French cuisine. Paris is a vibrant and dynamic city with a diverse population and a rich cultural heritage

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some possible future trends in AI:

1. Increased automation and artificial intelligence: As AI technology continues to advance, we can expect to see more automation and artificial intelligence in our daily lives. This could include the development of more advanced robots and machines that can perform tasks that were previously done by humans.

2. Improved privacy and security: As AI technology becomes more advanced, we can expect to see more privacy and security concerns. This could include the development of more secure and transparent AI systems that can protect user
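
The banner printed by this cell mentions overlap removal: when consecutive streamed chunks repeat text that has already been emitted, only the new part should be appended. The following is an illustrative sketch of that idea, not the implementation of `stream_and_merge`; the names `trim_overlap` and `merge_chunks` are hypothetical here.

```python
from typing import Iterable


def trim_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that is also a suffix of `existing`."""
    max_len = min(len(existing), len(chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, appending only the non-overlapping part of each."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# The second chunk repeats " capital", which is appended only once.
print(merge_chunks(["The capital", " capital of France", " is Paris."]))
# prints: The capital of France is Paris.
```

One trade-off of this approach is that text which legitimately repeats across a chunk boundary is also collapsed, so it fits streams where overlaps are known to be artifacts.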



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [age] year old [occupation] [address]. I enjoy [my passion or hobby] and I live [location] with my [family or friend]. I am always [up-to-date] and always [able to adapt to new situations]. I am also [able to think on my feet] and [good at [skill]].

I am a [职业] who is [age] years old. I am [occupation] and I live in [location]. I am always [up-to-date] and always [able to adapt to new situations]. I am also [able to think on

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

[Mark the correct answer and explain why it is correct]

To determine the correct answer, I will analyze the following parts of the statement:

1. "The capital of France"
   - This is a clear reference to Paris, the capital city of France. The capital city is the main administrative center of a country and is know
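
Because `async_generate` is awaited, the event loop stays free to run other coroutines while a batch is in flight. A minimal sketch of submitting several batches concurrently with `asyncio.gather` follows. `FakeEngine` is a hypothetical stand-in that mimics only the call shape used above, and the sketch assumes the real engine accepts overlapping `async_generate` calls.

```python
import asyncio
from typing import List


class FakeEngine:
    """Hypothetical stand-in so the sketch runs without loading a model."""

    async def async_generate(
        self, prompts: List[str], sampling_params: dict
    ) -> List[dict]:
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches: List[List[str]], sampling_params: dict):
    # gather() runs the calls concurrently and returns results in input order.
    return await asyncio.gather(
        *(engine.async_generate(batch, sampling_params) for batch in batches)
    )


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(
    run_batches(FakeEngine(), batches, {"temperature": 0.8, "top_p": 0.95})
)
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```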

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [Age] year old [Occupation], currently based in [City]. I'm passionate about [Favorite Activity/Sport/Interest], and I am a [Extrovert/Introvert/Aggressive/Resilient]. I have a keen sense of [Skill/Ability], and I'm [Professional/Personal]. I have a diverse background, spanning [Number of Majors/Degree Programs/Grammar], and I'm always eager to learn new things, whether it's by teaching myself or just by asking questions. I am always looking for new experiences, whether it's in the workplace,



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in Europe and the largest city in the European Union by population. Paris is also a cultural and artistic capital and home to iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is known for its cuisine, fashion, and music scene, as well as its historic and romantic charm. As the seat of government, Paris has undergone significant development and urbanization over the years, with significant changes in land use and infrastructure. Its status as a European capital is recognized by the EU and its status as the 13th most-populous city in the world



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be a highly dynamic and rapidly evolving field. Here are some potential trends that could shape the AI landscape in the coming years:

1. Increased use of AI in healthcare: AI will continue to play a vital role in healthcare by improving diagnostic accuracy, predicting patient outcomes, and developing personalized treatment plans. AI algorithms will help doctors make more informed decisions and reduce errors in patient care.

2. Greater integration of AI into everyday life: AI will become more integrated into our daily lives, from voice assistants that understand spoken language to smart home devices that control our homes. AI will also enable more efficient and cost-effective transportation systems.

3. Increased
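
The cell above prints each chunk with `end=""`, so the chunks join into continuous text as they arrive. A minimal sketch of the same consumption pattern that also keeps the full text for later use is shown below; `fake_chunk_stream` is a hypothetical stand-in for an async stream of already de-duplicated chunks, so the snippet runs without a model.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_chunk_stream(chunks: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for an async stream of text chunks."""
    for chunk in chunks:
        await asyncio.sleep(0)  # hand control back to the loop between chunks
        yield chunk


async def print_and_collect(stream: AsyncIterator[str]) -> str:
    """Print chunks as they arrive and return the concatenated text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # newline once the stream is exhausted
    return "".join(parts)


full_text = asyncio.run(
    print_and_collect(fake_chunk_stream([" Paris", ".", " It", " is"]))
)
```

Collecting into a list and joining once at the end avoids repeated string concatenation for long generations.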




In [6]:
llm.shutdown()
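
The final cell shuts the engine down explicitly. In a script, an exception raised earlier would skip that call. A minimal sketch of guaranteeing the call with a context manager follows; `managed_engine` and `FakeEngine` are hypothetical names, and the fake class only mimics the `generate` and `shutdown` calls used in this document.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in so the sketch runs without loading a model."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with managed_engine(engine) as eng:
        eng.generate(["Hello, my name is"], {"temperature": 0.8})
        raise RuntimeError("simulated failure during inference")
except RuntimeError:
    pass

print(engine.is_shut_down)  # True: shutdown() ran despite the exception
```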