# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
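
Whether the patch is needed depends on whether an event loop is already running in the current thread, as is typically the case inside a Jupyter kernel. The following is a minimal sketch using only the standard library; the helper name `loop_is_running` is an illustrative assumption and not part of SGLang or `nest_asyncio`:

```python
import asyncio


def loop_is_running() -> bool:
    """Return True if an asyncio event loop is running in this thread."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running.
        return False
    return True


async def probe() -> bool:
    # Inside a coroutine a loop is always running, which mimics the
    # situation in a notebook cell.
    return loop_is_running()


if __name__ == "__main__":
    print("plain script:", loop_is_running())
    print("inside a coroutine:", asyncio.run(probe()))
```

If `loop_is_running()` returns `True` at the point where you would call `asyncio.run(...)`, that call would fail without the `nest_asyncio` patch shown above.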

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.59it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Robin. I am 15 years old. I want to be a doctor. I want to become a doctor because doctors help sick people and they can fix things for everyone. I want to be a doctor because doctors are always ready to help people. I want to be a doctor because doctors can make people happier because they can understand people's feelings and help them. Doctors can make people better because they can help people learn and get well faster. I want to be a doctor because doctors can be friendly because they are always ready to help people and they always have a smile on their faces. ( ) 1. What is Robin's
Prompt: The president of the United States is
Generated text:  a noble man who has achieved a great deal in his life, and the people of the United States have never stopped celebrating his achievements. What does this sentence mean?
A. The president of the United States is a common person.
B. The president of the United States is a noble person.
C. The presiden
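
The `temperature` and `top_p` entries in `sampling_params` control how each next token is drawn. The block below is a simplified, self-contained illustration of the usual definitions (a temperature-scaled softmax followed by nucleus filtering). It is not SGLang's sampler, and the function names and toy values are assumptions made for this example:

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature (illustrative only)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    # Renormalize so the surviving probabilities sum to 1.
    return {i: probs[i] / total for i in kept}


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical scores for 4 tokens
    probs = apply_temperature(toy_logits, temperature=0.8)
    print(top_p_filter(probs, top_p=0.95))
```

Lower temperatures concentrate probability on the highest-scoring token, and lower `top_p` values discard more of the low-probability tail before sampling.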

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I am a [Age] year old [Occupation]. I am a [Gender] [Gender Identity] who was born in [Birthplace] and grew up in [City]. I am a [Occupation] who has always been passionate about [What interests you in life]. I am a [Personality Type] who is [What makes you unique]. I am [What you like to do]. I am [What you are looking forward to doing]. I am [What you are most proud of]. I am [What you are most afraid of]. I am [What you are most excited about]. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history dating back to the Roman Empire and the Middle Ages. Paris is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city is also famous for its cuisine, fashion, and art scene. Paris is a vibrant and dynamic city with a diverse population and a rich cultural heritage. It is a popular tourist destination and a major economic center in Europe. The city is also home to many international organizations and institutions. Paris is a city of contrasts, with its modern architecture and historical landmarks

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased automation: AI is expected to become more and more integrated into various industries, from manufacturing to healthcare. Automation will likely become more prevalent as AI systems become more sophisticated and can perform tasks that were previously done by humans.

2. AI ethics and privacy: As AI systems become more advanced, there will be increasing concerns about their impact on society. There will likely be a push for more ethical and transparent AI systems, as well as greater privacy protections for individuals.

3. AI for education:
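
The "overlap removal" in the cell above refers to merging streamed chunks so that text repeated at a chunk boundary is not emitted twice. The sketch below shows one way to do this with plain strings. It illustrates the general technique under that assumption and is not the actual implementation of `stream_and_merge`; the names `trim_overlap_sketch` and `merge_chunks` are hypothetical:

```python
def trim_overlap_sketch(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing."""
    max_overlap = min(len(existing), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, removing boundary overlaps as they arrive."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


if __name__ == "__main__":
    # Hypothetical chunks in which each one repeats the tail of the last.
    print(merge_chunks([" Paris", " Paris is", " is the capital"]))
```

A purely textual rule like this cannot tell a repeated chunk from a genuine repetition in the output (for example `"ha"` followed by `"ha"`), so it is only appropriate when the stream is known to resend overlapping text.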



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Type of Character] who has been working in the tech industry for [X years] and [X years] before. I started as a [X amount] under a [X type of role] at [X company], and have recently taken on [X role] for [X amount of time]. I am a [Type of Character], and I’m passionate about [Type of Character’s Passion]. I believe in [Type of Character’s Core Values], and I strive to be a [Type of Character’s Character] with [Type of Character’s Character Traits]. Thank you. Let me know

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, often referred to as the "City of Light" and is the largest and most populous city in the European Union. It is located on the Seine River and serves as the cultural, political, and economic center of the country. Paris boasts a rich and diverse history dating back ove
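
Because the call is awaitable, several requests can be in flight while other coroutines keep running. The following sketch shows one common pattern, limiting the number of concurrent requests with a semaphore. So that it runs without a model, it uses `fake_async_generate`, a hypothetical stand-in that returns a dict shaped like the outputs above; it is not the engine's API, and `max_in_flight` is an assumed parameter:

```python
import asyncio


async def fake_async_generate(prompt, sampling_params):
    # Hypothetical stand-in for an awaitable engine call.
    await asyncio.sleep(0.01)
    return {"text": f" <completion for: {prompt}>"}


async def generate_bounded(prompts, sampling_params, max_in_flight=2):
    """Run one request per prompt, at most max_in_flight at a time."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt):
        async with semaphore:
            return await fake_async_generate(prompt, sampling_params)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(one(p) for p in prompts))


if __name__ == "__main__":
    demo_prompts = ["Hello, my name is", "The capital of France is"]
    for out in asyncio.run(generate_bounded(demo_prompts, {"temperature": 0.8})):
        print(out["text"])
```

Since the engine already schedules a batch passed in a single call, a wrapper like this is mainly relevant when requests arrive independently, for example in a custom server.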

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name] and I'm a [insert occupation or profession] who is passionate about [insert a personal passion or interest]. I enjoy [insert something about my hobbies, interests, or skills], and I'm always looking for new opportunities to grow and learn. I'm also known for my love of [insert a favorite hobby, like painting, cooking, or playing music], and I enjoy spending time with friends and family. If you're open to meeting me in person, I'd love to make some time to catch up. What's your name, and what's your occupation or profession? Let's catch up and get to know



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light. It is located on the Seine River and is the largest city in Europe by population. The city is known for its medieval architecture, art, music, and cuisine. It is a popular tourist destination and a center of culture, science, and technology. The city is home to many famous landmarks and attractions, including the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also a major financial center and a center of research and innovation.



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a fascinating area of research and development. Here are some potential trends we can expect to see in the next few decades:

1. More diverse and complex AI: With the increasing availability of data and computational power, AI will become even more diverse and complex. This could mean that AI will be able to solve complex problems with more flexibility and creativity than ever before.

2. AI will become more human-like: As AI becomes more sophisticated, it will start to exhibit more human-like behaviors and characteristics. This could include more empathy, understanding emotions, and even creativity.

3. AI will become more ethical: As AI continues to advance, we will
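
The streaming loop above prints each chunk as soon as it arrives. If the full text is also needed afterwards, the chunks can be accumulated while iterating. The sketch below shows this with `fake_chunk_stream`, a hypothetical async generator that stands in for the engine's stream so that the example runs without a model:

```python
import asyncio


async def fake_chunk_stream(text, chunk_size=4):
    # Hypothetical stand-in: yields the text in small pieces.
    for start in range(0, len(text), chunk_size):
        await asyncio.sleep(0)  # give control back to the event loop
        yield text[start:start + chunk_size]


async def collect_stream(stream, echo=False):
    """Consume an async stream of text chunks and return the full text."""
    parts = []
    async for chunk in stream:
        if echo:
            print(chunk, end="", flush=True)
        parts.append(chunk)
    return "".join(parts)


if __name__ == "__main__":
    demo = fake_chunk_stream(" Paris, also known as the City of Light.")
    full_text = asyncio.run(collect_stream(demo, echo=True))
    print()
    print(len(full_text), "characters received")
```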




In [6]:
llm.shutdown()
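
In a script, it can be convenient to guarantee that `shutdown()` runs even when generation raises an exception. The following is a minimal sketch of that pattern using `contextlib`. `StubEngine` is a hypothetical stand-in that exposes only `generate` and `shutdown`, and `engine_session` is an illustrative helper, not part of SGLang:

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for an engine, used so the sketch runs anywhere."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params=None):
        return [{"text": f" echo: {p}"} for p in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(engine):
    """Yield the engine and always call shutdown() on exit."""
    try:
        yield engine
    finally:
        engine.shutdown()


if __name__ == "__main__":
    engine = StubEngine()
    with engine_session(engine) as llm:
        for output in llm.generate(["Hello, my name is"]):
            print(output["text"])
    print("shut down:", engine.is_shut_down)
```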