# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
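
Nested event loops are a problem because `asyncio` refuses to start a loop while another one is already running in the same thread, which is the situation inside IPython and Jupyter. The following standard-library sketch reproduces that error without SGLang. The coroutine names `inner` and `outer` are made up for illustration, and the snippet is meant to be run as a plain script rather than in a notebook.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # A loop is already running here, much like inside a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second loop in the same thread
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches `asyncio` so that such nested calls are allowed, which is why the notebook cells below apply it before creating the engine.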

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.10it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Rudi and I am a person who is passionate about the development of a sustainable future. As an environmental scientist, my expertise is on the environmental impact of transportation and its potential to reduce air pollution. I have been working with the New South Wales Government to develop a plan to reduce air pollution from heavy vehicles, and my goal is to reduce the number of vehicles on the road.
To achieve this goal, I have developed a plan that includes the use of electric vehicles, which are powered by electricity generated from renewable sources such as wind and solar power. This plan is designed to reduce the number of vehicles on the road, which in turn will
Prompt: The president of the United States is
Generated text:  now considering whether to continue the trend of a two-term limit for the first-term incumbent, Donald Trump. He wants to understand the impact of a two-term limit and its potential side effects on the political lands

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a brief description of your job or profession]. I enjoy [insert a brief description of your hobbies or interests]. What's your favorite hobby or activity? I'm a [insert a brief description of your favorite activity]. I'm always looking for new experiences and learning new things. What's your favorite book or movie? I'm a [insert a brief description of your favorite book or movie]. I'm always on the lookout for new adventures

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic Eiffel Tower and rich cultural heritage. It is a bustling metropolis with a diverse population and a rich history dating back to the Middle Ages. The city is home to many famous landmarks such as the Louvre Museum, Notre-Dame Cathedral, and the Palace of Versailles. Paris is also known for its fashion industry, with many famous designers and boutiques. The city is a major transportation hub and a popular tourist destination, with many attractions and events throughout the year. Paris is a city of contrasts and a city of art, culture, and history. Its status as the capital of France is a

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn and adapt to human behavior and preferences. This could lead to more personalized and efficient AI systems.

2. Greater emphasis on ethical considerations: As AI becomes more integrated with human intelligence, there will be a greater emphasis on ethical considerations. This could lead to more stringent regulations and guidelines for AI development and deployment.

3. Increased use of AI for autonomous systems: Autonomous systems will become more prevalent, with machines being able to make decisions and take actions without human intervention. This could lead to

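The "overlap removal" mentioned in the cell above can be pictured as follows: when a streamed chunk repeats text that has already been emitted, only the new suffix is appended. This is a minimal sketch of that idea in plain Python. The function names `remove_overlap` and `merge_chunks` and the sample chunks are hypothetical, and the actual `stream_and_merge` helper in `sglang.utils` may be implemented differently.

```python
def remove_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing."""
    max_len = min(len(existing), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, skipping text that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += remove_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each one repeats the tail of the previous text.
sample_chunks = ["The capital", " capital of France", "France is Paris"]
merged_text = merge_chunks(sample_chunks)
print(merged_text)
```

Note that this heuristic cannot tell a repeated chunk from a legitimate repetition in the generated text (for example `"ha"` followed by `"ha"`), so it is only appropriate when chunks are known to overlap.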


### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name] and I am a [age] year old [gender] who was born in [birth year] and grew up in [city]. I grew up surrounded by the beauty of nature and I have always had a passion for [occupation or hobby]. I am a [occupation or hobby] who has always been [anything like your current hobby or favorite hobby]. I am currently living in [location] and I am here to [something related to your character's profession or hobbies]. What is your most amazing accomplishment so far? What is your current project? What is your favorite place to live? What is your favorite hobby? What

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the department of Île-de-France, near the English Channel. It is the largest city in the European Union and one of the largest cities in the world. It is the seat of the French government
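
Asynchronous generation is mostly useful when the call shares an event loop with other work. As a rough illustration using only `asyncio`, the sketch below awaits several coroutines concurrently with `asyncio.gather`. The `fake_generate` coroutine is a stand-in invented for this example and is not part of SGLang. It only mimics the shape of a result dictionary with a `"text"` key.

```python
import asyncio


async def fake_generate(prompt: str, delay: float) -> dict:
    """Stand-in for an awaitable generation call; NOT an SGLang API."""
    await asyncio.sleep(delay)  # simulates time spent generating
    return {"prompt": prompt, "text": f" <completion for {prompt!r}>"}


async def run_all(prompts):
    # Later prompts get shorter delays, so they finish first;
    # gather() still returns results in the order of its arguments.
    calls = [
        fake_generate(p, 0.01 * (len(prompts) - i)) for i, p in enumerate(prompts)
    ]
    return await asyncio.gather(*calls)


results = asyncio.run(run_all(["first prompt", "second prompt"]))
for item in results:
    print(item["prompt"], "->", item["text"])
```

In a notebook, the top-level `asyncio.run` call needs `nest_asyncio.apply()` first, as described in the Nest Asyncio section.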

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a writer, an editor, and an indie artist. I'm known for my work in the literary arts and have a passion for writing, editing, and sharing my work with others. I enjoy tackling complex themes, exploring new writing styles, and pushing the boundaries of what's possible in literature. My craft has been honed through countless hours of writing, editing, and practice, and I strive to create work that resonates with readers on a deeper level. I am always open to new experiences and learning new things, and I believe in the power of the written word to connect people on an emotional and spiritual

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, commonly known as the "City of Love". It was founded in the 8th century by the Moors, who named it after their native city of Rennes. Paris is the 14th largest city in the world by population and the 4th largest by area. It is a UNESCO World Heritage Site and a major center of the French economy. The city has many famous landmarks, including the Eiffel Tower, the Louvre Museum, Notre-Dame Cathedral, and the Arc de Triomphe. Paris is known for its artistic and cultural scene, with many museums, theaters, and theaters of music and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  rapidly evolving, and there are several possible trends that may emerge in the years to come:

1. Autonomous vehicles: As autonomous cars become more advanced and widely available, they could revolutionize transportation by reducing traffic accidents and improving fuel efficiency. Autonomous vehicles could also lead to a decrease in traffic congestion, reducing greenhouse gas emissions and increasing air quality.

2. Robotic healthcare: AI could be used to develop more accurate, personalized medical treatments, potentially saving lives and reducing the need for human intervention. Additionally, AI-powered robots could help with surgeries, assisting with the diagnosis and treatment of various diseases, and even assisting with rehabilitation.

3. Improved security

In [6]:
llm.shutdown()