# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
To use the **Offline Engine** in IPython, Jupyter, or any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
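
The restriction comes from `asyncio` itself: `asyncio.run()` refuses to start while an event loop is already running in the current thread, which is the situation inside a notebook cell. The standard-library sketch below reproduces that error without SGLang. The names `inner` and `outer` are illustrative only, and the sketch deliberately does not call `nest_asyncio`, whose job is to lift this restriction.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # Here an event loop is already running, as it is inside a Jupyter cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: rejected by plain asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

In a plain Python script there is no outer loop, so the top-level `asyncio.run()` call works without any patching.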

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Jiaqi, and I'm a self-taught learner who is passionate about learning, actively researching, and making new discoveries. I am a student of History. I am in Grade 10th. I am a relatively young person who is always learning. I am very curious and enthusiastic about learning, and I am always eager to share my knowledge with others. What is your favorite subject in school? I have been in a good relationship with my parents and am well-liked by everyone. My favorite teacher is Mr. Wang, who often provides me with positive encouragement and well-thought-out answers. The morning exercises in today
Prompt: The president of the United States is
Generated text:  a very important person in the country. Who is the president of the United States? The president of the United States is the President of the United States. The president is the head of state, head of government, and commander-in-chief of the armed forces of the United States. The president serv
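
Offline batch jobs usually persist their results rather than print them. The following is a minimal sketch of writing prompt/completion pairs as JSON Lines. It assumes only what the cell above shows: each output is a dict with a `"text"` key, and outputs come back in prompt order. The helper `write_results_jsonl` and the sample data are hypothetical and are not part of SGLang.

```python
import io
import json


def write_results_jsonl(prompts, outputs, fp):
    """Write one JSON object per prompt/output pair to a file-like object."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")


# Placeholder data shaped like the engine's return value; not real model output.
sample_prompts = ["The capital of France is"]
sample_outputs = [{"text": " <generated text>"}]

buffer = io.StringIO()
write_results_jsonl(sample_prompts, sample_outputs, buffer)
print(buffer.getvalue(), end="")
```

With a real engine, `sample_outputs` would be replaced by the list returned from `llm.generate(prompts, sampling_params)`, and `buffer` by an opened file.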

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm passionate about [job title] and I love [job title] because [reason for passion]. What do you do at work? I'm a [job title] at [company name], and I'm passionate about [job title] and I love [job title] because [reason for passion]. What do you enjoy doing outside of work? I enjoy [job title] because [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament and the French National Museum. Paris is a bustling metropolis with a rich cultural heritage and is a major tourist destination. The city is also known for its cuisine, including its famous croissants and its famous French fries. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly. It is a city that has played a significant role in French history and continues to be a major cultural and economic center in the world. Paris is the

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies will continue to improve, leading to more sophisticated and accurate AI systems that can perform a wide range of tasks with increasing accuracy and efficiency. Some potential future trends in AI include:

1. Increased integration with other technologies: AI systems will become more integrated with other technologies, such as sensors, actuators, and actuators, to create more complex and adaptive systems.

2. Enhanced privacy and security: As AI systems become more sophisticated, there will be increased concerns about privacy and security. There will be efforts to develop new technologies
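
The "overlap removal" mentioned above matters when consecutive streamed chunks repeat text that was already received. The sketch below shows one way such trimming can work: drop the longest prefix of the new chunk that matches the end of the text collected so far. It is an independent illustration; the function names and the demo chunks are assumptions and may differ from how `stream_and_merge` is implemented in `sglang.utils`.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the end of existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text that overlaps with what came before."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Made-up chunks whose boundaries overlap; not real engine output.
demo_chunks = ["The capital", " capital of", " of France"]
print(merge_chunks(demo_chunks))  # prints "The capital of France"
```

One design caveat: a purely textual overlap check cannot tell a repeated boundary from a genuine repetition, so a chunk such as `"ha"` following `"ha"` would be trimmed away.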



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I’m a/an [Age] year old. I am [Gender] and [Name], a/an [Title] at [Company]. I've always loved [What it is you love/What you are] and I am always eager to learn and grow. I’m always looking for new experiences and opportunities to expand my knowledge and skills. I'm a hard worker and a great communicator. I have a passion for [What is it you love/What you are] and I am always striving to make the world a better place. I am a/an [Greatness] who I believe in, and I am

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a bustling metropolis with a rich history and iconic landmarks.

Why is Paris considered the "Queen of the City"?
Paris is considered the "Queen of the City" due to its historical significance and cultural importance. It is the capital of France, the largest city in Europe, and on
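
Because `async_generate` is awaitable, several requests can be in flight at once on a single event loop. The sketch below shows the `asyncio.gather` pattern for submitting fixed-size batches concurrently while keeping outputs in prompt order. `StubEngine` is a hypothetical stand-in that mimics only the call shape used in the cell above, so the example runs without a model. Whether concurrent submission improves throughput with the real engine depends on its scheduler and is not claimed here.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the engine so the sketch runs without a model."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


async def generate_in_batches(engine, prompts, sampling_params, batch_size):
    """Submit fixed-size batches concurrently and return outputs in prompt order."""
    batches = [prompts[i : i + batch_size] for i in range(0, len(prompts), batch_size)]
    # asyncio.gather returns results in the order the awaitables were passed in.
    results = await asyncio.gather(
        *(engine.async_generate(batch, sampling_params) for batch in batches)
    )
    return [output for batch_outputs in results for output in batch_outputs]


demo_prompts = ["a", "b", "c"]
demo_outputs = asyncio.run(
    generate_in_batches(StubEngine(), demo_prompts, {"temperature": 0.8}, batch_size=2)
)
for prompt, output in zip(demo_prompts, demo_outputs):
    print(prompt, "->", output["text"])
```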

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # async_stream_and_merge yields each chunk with overlapping text already removed
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name] and I am a [Occupation] with over [Number] years of experience in [Industry]. I am dedicated to helping people get their life back on track after [Reason for Motivation]. I am always seeking to learn and improve my skills to make me the best [Skill]. What is your favorite hobby, and what is it used for?

I am a [Occupation] with over [Number] years of experience in [Industry]. I am dedicated to helping people get their life back on track after [Reason for Motivation]. I am always seeking to learn and improve my skills to make me the best [



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city where the Eiffel Tower stands tall and the Louvre Museum is housed. Paris is a bustling metropolis with many famous landmarks such as Notre-Dame Cathedral and the Arc de Triomphe, and is a major cultural and economic center in Europe. The French language is spoken in Paris, and the city is home to many prestigious universities and institutions. The French capital city is known for its vibrant nightlife, modern architecture, and rich cultural heritage. The city has a long and storied history, with its origins in ancient Gaul and its role as a trade center in the medieval period. Today, Paris remains a popular



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to continue to evolve and diversify, driven by the continued advancements in computing power, data storage, and machine learning algorithms. Here are some possible trends in the AI landscape in the next few years:

1. Increased Use of AI in Healthcare: With the rise of AI-powered medical imaging, AI-assisted diagnosis, and AI-driven drug discovery, the use of AI in healthcare is expected to increase significantly in the coming years.

2. Increased Integration of AI in Financial Services: AI-powered fraud detection, chatbots, and virtual assistants are expected to play a more prominent role in the financial services industry in the future.

3. Increased Use




In [6]:
llm.shutdown()
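
In a script, an exception raised during generation would skip a trailing `shutdown()` call. One way to guard against that is a small context manager that always calls `shutdown()` on exit. The sketch below assumes only that the engine object exposes a `shutdown()` method, as used in the cell above. `engine_session` and `StubEngine` are hypothetical names introduced for illustration and are not part of SGLang.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(engine):
    """Yield the engine and call its shutdown() on exit, even after an error."""
    try:
        yield engine
    finally:
        engine.shutdown()


class StubEngine:
    """Hypothetical stand-in that only records whether shutdown() ran."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


stub = StubEngine()
try:
    with engine_session(stub):
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(stub.is_shut_down)  # prints True
```

With a real engine, the `with` block would wrap the `generate` calls, for example `with engine_session(sgl.Engine(model_path=...)) as llm:`.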