# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where a server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
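
The core of such a server is a thin handler that turns a request payload into a call on the engine. The sketch below is a framework-agnostic illustration, not the code from the linked script: `handle_generate_request` and `StubEngine` are hypothetical names, and the only engine behavior assumed is the `generate(prompts, sampling_params)` call shown in this document, which returns one dict with a `"text"` field per prompt.

```python
from typing import Any, Dict, List


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


def handle_generate_request(engine: Any, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Validate a JSON-like payload and forward it to the engine."""
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt:
        return {"status": 400, "error": "'prompt' must be a non-empty string"}

    sampling_params = payload.get("sampling_params", {})
    if not isinstance(sampling_params, dict):
        return {"status": 400, "error": "'sampling_params' must be an object"}

    # A list goes in and a list comes out: take the single result.
    outputs = engine.generate([prompt], sampling_params)
    return {"status": 200, "text": outputs[0]["text"]}


response = handle_generate_request(StubEngine(), {"prompt": "The capital of France is"})
print(response["status"], response["text"])
```

In a real server, a web framework route would parse the request body, call a handler like this with the actual engine, and serialize the returned dict.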



## Nest Asyncio
If you use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
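
The patch matters only when an event loop is already running, as in a Jupyter kernel. A minimal sketch that applies it conditionally, so the same file also runs as a plain script; the helper name `in_running_event_loop` is illustrative and not part of SGLang:

```python
import asyncio


def in_running_event_loop() -> bool:
    """Return True when called from code that an event loop is already driving."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


if in_running_event_loop():
    # Imported lazily so plain scripts do not need the package installed.
    import nest_asyncio

    nest_asyncio.apply()
```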

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) (vision-language models) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# One call handles the whole batch; outputs come back in prompt order.
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Kara, and I am a young adult with a high school diploma and am currently in high school. I am passionate about the arts and have a keen interest in the history of the arts. I am also a fan of reading and writing. What are some of your favorite books or authors? I can't have you mislead, my primary focus is on the arts, but I do enjoy reading and writing. Please share some of your favorite books or authors! To answer your question, I will proceed with these instructions: 
1. Write a paragraph about my favorite book or author. 
2. Write a paragraph about a book or author
Prompt: The president of the United States is
Generated text:  25 years older than the president of Central America. The president of Central America is half the age of the president of Asia. If the president of Asia is 35 years old, how old is the president of the United States? Let's start by determining the age of the president of Central America. We know that the president o
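
For larger prompt sets, it can be convenient to process prompts in chunks so that partial results can be saved between calls. The sketch below is for bookkeeping only and is not a throughput recommendation. `chunked`, `generate_in_chunks`, and `StubEngine` are hypothetical names; the only engine behavior assumed is the `generate(prompts, sampling_params)` call shown above.

```python
from typing import Any, Dict, Iterator, List, Sequence, Tuple


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; returns canned text per prompt."""

    def generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        return [{"text": f" <completion {i}>"} for i, _ in enumerate(prompts)]


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start : start + size])


def generate_in_chunks(
    engine: Any,
    prompts: Sequence[str],
    sampling_params: Dict[str, Any],
    chunk_size: int = 2,
) -> List[Tuple[str, str]]:
    """Run the engine chunk by chunk and pair each prompt with its text."""
    results: List[Tuple[str, str]] = []
    for batch in chunked(prompts, chunk_size):
        outputs = engine.generate(batch, sampling_params)
        results.extend((p, o["text"]) for p, o in zip(batch, outputs))
        # A real script could write `results` to disk here as a checkpoint.
    return results


pairs = generate_in_chunks(StubEngine(), ["a", "b", "c"], {"temperature": 0.8})
for prompt, text in pairs:
    print(f"Prompt: {prompt}\nGenerated text: {text}")
```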

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

# stream_and_merge consumes the stream for one prompt and returns the merged text.
for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can you tell me about yourself? As a [job title], I have a passion for [job title-related interest or accomplishment]. I'm always looking for ways to [job title-related skill or hobby]. What's your background and what do you bring to the table? I have a background in [relevant field or area of expertise]. I bring a unique perspective and a passion for [job title-related interest or accomplishment]. What's your career goal? My career goal is to [career goal

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament and the French National Museum of Modern Art. Paris is a bustling city with a rich history and culture, and it is a popular tourist destination. The city is known for its fashion industry, art scene, and food culture, and it is a major center for business and finance in Europe. Paris is also a major transportation hub, with many major highways and rail lines connecting the city to other parts of France and the world. The city is home to many international

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be an increasing focus on ethical considerations. This will include issues such as bias, transparency, accountability, and privacy. AI developers will need to be more mindful of the potential consequences of their creations and work to ensure that they are developed in a way that is fair and responsible.

2. Greater integration with human intelligence: AI is likely to become more integrated with human intelligence in the future. This could involve
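
To illustrate what "overlap removal" can mean when merging streamed chunks, here is a minimal sketch. It is not the implementation of `stream_and_merge`; `trim_overlap` and `merge_chunks` are hypothetical helpers. The sketch assumes a chunk may repeat the tail of the text received so far.

```python
from typing import Iterable


def trim_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of `new_chunk` that already ends `existing`."""
    max_overlap = min(len(existing), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


print(merge_chunks(["Hello wor", "world", "!"]))
```

A purely textual heuristic like this cannot tell a resent fragment from a genuine repetition: merging `"ha"` and `"ha"` yields `"ha"`, not `"haha"`. It is therefore only safe when chunks are known to overlap.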



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    # Await the whole batch; outputs come back in prompt order.
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I am a [position] specialist at [Company Name]. I am [Brief Biography] with over [Number of Years] years of experience in the field. In my current role, I focus on [specific tasks or areas of expertise]. I am highly [Positive Traits or Qualities], and I strive to [Objective or goal]. What is your name, and what brings you here? [Your Name] [Brief Biography] [Position] Specialist at [Company Name] [Positive Traits or Qualities] [Objective or Goal] Hello, my name is [Your Name] and I am a [position]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a vibrant and historical city with a rich cultural heritage. It is located on the western edge of the Île-de-France region, near the mouth of the River Seine and surrounded by the rolling hills of the Loire Valley. Paris is renowned for its beau
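
An asynchronous API also makes it possible to issue several requests concurrently from one event loop. The sketch below bounds the number of in-flight calls with `asyncio.Semaphore`. `generate_concurrently` and `StubAsyncEngine` are hypothetical names, and the sketch assumes only that `async_generate(prompts, sampling_params)` returns one dict per prompt, as above. Whether concurrent calls help with a real engine depends on the workload and should be verified.

```python
import asyncio
from typing import Any, Dict, List, Tuple


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine that tracks concurrent calls."""

    def __init__(self) -> None:
        self.in_flight = 0
        self.peak_in_flight = 0

    async def async_generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        self.in_flight += 1
        self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
        await asyncio.sleep(0)  # yield control, as a real call would
        self.in_flight -= 1
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


async def generate_concurrently(
    engine: Any,
    prompts: List[str],
    sampling_params: Dict[str, Any],
    max_concurrency: int = 2,
) -> List[Tuple[str, str]]:
    """Send one request per prompt, with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(prompt: str) -> Tuple[str, str]:
        async with semaphore:
            outputs = await engine.async_generate([prompt], sampling_params)
            return prompt, outputs[0]["text"]

    # gather returns results in the order the coroutines were passed in.
    return await asyncio.gather(*(run_one(p) for p in prompts))
```

In a notebook with `nest_asyncio` applied, this can be driven with `asyncio.run(generate_concurrently(llm, prompts, sampling_params))`.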

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [First Name] and I'm a [Last Name] and I am [name] as a character. I'm a [character's profession or role]. I'm a [character's age] and I'm [character's height] inches tall. I have [character's hobbies or interests], and I have [character's family background], and I have [character's personal traits]. I'm [character's personality type], and I am a [character's style of transportation], and I love [character's favorite hobby or activity]. I'm excited to meet you and help you navigate this world. What is your name? What is

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is a vibrant metropolis with a rich history, renowned for its architecture, art, and cultural heritage. Paris is the second-largest city in Europe and has a population of over 2.7 million people. The city is known for its unique urban environment, including its narrow streets, historical buildings, and enchanting parks. Paris is also famous for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Palace of Versailles. The city has a diverse population of people from various nationalities and cultures, contributing to its multicultural and cosmopolitan atmosphere. Paris is a global

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  set to be a dynamic and rapidly evolving field, driven by advances in machine learning, computer architecture, and quantum computing. Here are some possible trends to watch out for:

1. Increased focus on ethical and safety concerns: With AI systems becoming more integrated into our daily lives, it's likely that we will see increased focus on ethical and safety concerns, particularly as AI systems become more complex and sensitive.

2. AI that learns at a faster rate than humans: One of the key trends in AI is the increase in the ability of AI systems to learn at a faster rate than humans. This could lead to more efficient and effective decision-making in
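
Streamed text arrives as many small chunks, so callers often want both incremental display and the final string. The following sketch shows one way to collect an async stream. `collect_stream` and `fake_stream` are hypothetical helpers; `fake_stream` stands in for an iterator such as the one returned by `async_stream_and_merge`, whose chunks are assumed to be plain strings, as in the loop above.

```python
import asyncio
from typing import AsyncIterator, Callable, List, Optional


async def fake_stream(chunks: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for an engine stream; yields chunks one by one."""
    for chunk in chunks:
        await asyncio.sleep(0)  # yield control between chunks
        yield chunk


async def collect_stream(
    stream: AsyncIterator[str],
    on_chunk: Optional[Callable[[str], None]] = None,
) -> str:
    """Forward each chunk to `on_chunk` (if given) and return the joined text."""
    parts: List[str] = []
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk)
        parts.append(chunk)
    return "".join(parts)


def show(chunk: str) -> None:
    # Print without a newline so the chunks read as continuous text.
    print(chunk, end="", flush=True)


full_text = asyncio.run(collect_stream(fake_stream([" Paris", ".", " It", " is"]), show))
print()
```

Collecting parts in a list and joining once at the end avoids repeated string concatenation for long generations.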




```python
# Shut the engine down when finished.
llm.shutdown()
```
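
In scripts, it can help to guarantee that `shutdown()` runs even when generation raises an exception. One possible pattern is a small context manager. `engine_session` and `StubEngine` are hypothetical names introduced for illustration, and the only engine method assumed is the `shutdown()` call shown above.

```python
from contextlib import contextmanager
from typing import Any, Iterator


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records whether it was shut down."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def shutdown(self) -> None:
        self.is_shut_down = True


@contextmanager
def engine_session(engine: Any) -> Iterator[Any]:
    """Yield the engine and always call shutdown() on exit."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with engine_session(engine) as llm_session:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print("shut down:", engine.is_shut_down)
```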