# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
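
As a rough illustration of the custom-server idea, the sketch below puts an engine-like object behind a single JSON endpoint using only the Python standard library. `EchoEngine`, `handle_generate`, `make_handler`, `serve`, and the `/generate` route are hypothetical names invented for this sketch; the stand-in engine only echoes its prompt, and the linked `custom_server` script remains the reference example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoEngine:
    """Hypothetical stand-in with a generate() call that returns a dict with a 'text' key."""

    def generate(self, prompt, sampling_params=None):
        return {"text": f" (echo of: {prompt})"}


def handle_generate(engine, raw_body: bytes):
    """Turn a JSON request body into an (HTTP status, JSON response bytes) pair."""
    try:
        payload = json.loads(raw_body)
        prompt = payload["prompt"]
    except (ValueError, KeyError, TypeError):
        error = {"error": "expected a JSON object with a 'prompt' field"}
        return 400, json.dumps(error).encode()
    output = engine.generate(prompt, payload.get("sampling_params"))
    return 200, json.dumps({"text": output["text"]}).encode()


def make_handler(engine):
    """Build a request handler class bound to one engine instance."""

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/generate":
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            status, body = handle_generate(engine, self.rfile.read(length))
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return Handler


def serve(engine, host="127.0.0.1", port=8000):
    """Block and serve requests; call this with a real engine in place of EchoEngine."""
    HTTPServer((host, port), make_handler(engine)).serve_forever()


# The request logic can be exercised without opening a socket.
status, body = handle_generate(EchoEngine(), b'{"prompt": "Hello"}')
print(status, body.decode())
```

Keeping the request handling in a plain function, separate from the HTTP wiring, makes it possible to test the server logic without starting a server or loading a model.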



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
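
To see why this is needed, the standard-library sketch below reproduces the underlying restriction: `asyncio.run()` refuses to start when an event loop is already running in the same thread, which is the situation inside environments such as Jupyter kernels. The function names `inner` and `outer` are made up for this illustration, and the snippet does not use SGLang at all.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, so a nested
    # asyncio.run() call is rejected with a RuntimeError.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

`nest_asyncio.apply()` patches asyncio so that such nested calls are allowed, which is why the notebook cells in this document call it before using `asyncio.run()`.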

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

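if is_in_ci():
    import patch
else:
    # Outside CI, allow nested event loops so that asyncio.run() works
    # inside notebooks and other code that already runs an event loop.
    import nest_asyncio

    nest_asyncio.apply()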


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  John. I'm from New York City. I love to run on Sundays and Saturdays. I like to run as a way of exercise. I also love to go on walks in the park. I walk for 10 minutes on Saturdays and Sundays and then go for 20 minutes for a walk in the park. Sometimes, I also take my dog to the park on weekends. I've been running for over 10 years now. I've lost 10 pounds and have become very fit. I know it sounds like a lot of work, but it is a great way to keep fit and have fun. I spend
Prompt: The president of the United States is
Generated text:  a person. Which of the following is NOT a characteristic of the president of the United States?
A. President of the United States is a person
B. The president of the United States is the head of state of the country
C. The president of the United States is elected by the people
D. The president of the United States is the highest government official in the country
Answer: A

Among the following options, which on

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [Age] year old [Occupation]. I'm a [Type of Vehicle] with [Number] wheels. I'm [Favorite Hobby] and I enjoy [Favorite Activity]. I'm [Favorite Color] and I have [Number] friends. I'm [Favorite Book] and I love [Favorite Food]. I'm [Favorite Movie] and I've seen [Number] movies. I'm [Favorite Sport] and I play [Number] sports. I'm [Favorite Music] and I love [Favorite Album]. I'm [Favorite Movie] and I've seen [Number] movies. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city of light and art. It is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also famous for its rich history, including the French Revolution and the French Revolution. The city is home to many famous museums, including the Louvre and the Musée d'Orsay. Paris is a bustling city with a diverse population and is known for its fashion, art, and food scenes. It is a popular tourist destination and is often referred to as the "City of Light" due to its vibrant nightlife and cultural scene. Paris is a city

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends that are expected to shape the future of AI:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI technology continues to improve, we can expect to see even more widespread use of AI in healthcare, particularly in areas such as diagnosis, treatment planning, and patient care.

2. Increased use of AI in finance: AI is already being used in finance to improve fraud detection, risk assessment, and portfolio management. As AI technology continues
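
The "overlap removal" mentioned in the cell above can be pictured with a small sketch: when consecutive streamed chunks repeat text that has already been received, only the new part of each chunk is appended. The helpers `trim_overlap` and `merge_chunks` below are written for this illustration under that assumption; they are not necessarily how `stream_and_merge` is implemented in `sglang.utils`.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Return the part of new_chunk that is not already at the end of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first and shrink until one matches.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, dropping any prefix that repeats the merged text so far."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


chunks = ["The capital", " capital of France", " France is Paris"]
print(merge_chunks(chunks))
```

One limitation of this naive approach is that it cannot tell an overlap from a genuine repetition: merging the chunks `"ha"` and `"ha"` yields `"ha"`, not `"haha"`.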



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm [Age]. I'm a [What is your occupation or profession?]. I have [Number of years of experience in this field] years of experience in [mention your field of expertise or skill]. I'm currently [What is your current position, if any?]. I'm always looking to learn and grow, always seeking to improve my skills and knowledge. What is your name, and what's your profession? You can have a neutral or slightly positive tone to your introduction. You can also suggest a specific skill or accomplishment that you're trying to showcase in your introduction. Good luck with your introduction! [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the Île-de-France region of the country. It serves as the administrative center for France and is the largest city in France by population. French cuisine 
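
An awaitable generate call is mainly useful when several requests, or other coroutines, should make progress at the same time. The sketch below shows that pattern with `asyncio.gather`. `FakeAsyncEngine` is a hypothetical stand-in that sleeps briefly instead of running a model; only the method name `async_generate` and the `output["text"]` shape are taken from the cell above, and how a real engine schedules concurrent calls is not shown here.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in for an engine with an awaitable generate call."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0.01)  # simulates time spent generating
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def run_batches(engine, batches):
    # Submit every batch at once; gather() returns results in submission order.
    pending = [engine.async_generate(batch) for batch in batches]
    return await asyncio.gather(*pending)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(run_batches(FakeAsyncEngine(), batches))

for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"{prompt!r} -> {output['text']}")
```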

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name] and I'm a [career field] expert. I have [number of years] years of experience in [mention the field]. I'm [any relevant skills or expertise]. I'm passionate about [mention a current area of interest or passion]. I believe that [something important about myself], [any relevant personal qualities]. I'm [any notable attributes or qualities that make me unique]. I'm excited to [mention what you want to do next in your career].
You can also include any personal anecdotes or stories to help connect with the reader. Remember to keep the introduction short and to the point, as it's important

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city famous for its iconic landmarks such as the Eiffel Tower and the Louvre Museum. The city is also known for its rich history, including the Roman and French empires, the French Revolution, and the French Revolution. Paris is also renowned for its fashion industry, particularly the couture industry, and its coffee culture, which is famous for its 24-hour coffee shops. Paris is a vibrant city with a rich tapestry of cultures and customs. The city is home to many famous landmarks and historical sites, including the Louvre Museum, Notre-Dame Cathedral, and the Arc de Triomphe. The

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be marked by significant advancements in several areas, including:

1. Enhanced AI capabilities: As we continue to improve our ability to process and analyze large amounts of data, AI will be able to understand more complex patterns and make more accurate predictions. This will enable AI to perform tasks that were previously impossible or expensive to accomplish.

2. Increased AI integration with human consciousness: AI will become more integrated with human consciousness, leading to more advanced forms of consciousness, emotion, and perception. This could lead to more profound understanding of the human experience and potential for greater empathy and compassion.

3. AI-driven autonomous weapons: AI will be able to


In [6]:
llm.shutdown()
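
In a script, as opposed to a notebook, it can be convenient to make sure `shutdown()` runs even when generation raises an exception. The sketch below wraps that in a small context manager. `FakeEngine` and `managed` are hypothetical names for this illustration; only the `shutdown()` method name comes from the cell above, and the stand-in holds no real resources.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in with the same shutdown() method name used above."""

    def __init__(self):
        self.running = True

    def generate(self, prompt):
        if not self.running:
            raise RuntimeError("engine has been shut down")
        return {"text": f" <completion for {prompt!r}>"}

    def shutdown(self):
        self.running = False


@contextmanager
def managed(engine):
    """Yield the engine and always call shutdown(), even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
with managed(engine) as active:
    print(active.generate("Hello")["text"])

print("running after block:", engine.running)
```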