# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example in the form of a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
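
To make the idea of a custom server concrete, here is a minimal sketch that uses only the Python standard library. `StubEngine`, `handle_generate`, and `make_handler` are hypothetical names introduced for illustration; the stub only mimics the `{"text": ...}` output shape shown later in this document, and the linked `custom_server` example may be structured quite differently. In a real setup you would presumably pass an `sgl.Engine` instance instead of the stub.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class StubEngine:
    """Hypothetical stand-in for an engine; swap in sgl.Engine(model_path=...)."""

    def generate(self, prompt, sampling_params=None):
        # Mimics the {"text": ...} dictionary shape used in this document.
        return {"text": f" (echo) {prompt}"}


def handle_generate(engine, body: bytes) -> bytes:
    """Decode a JSON request, call the engine, and encode a JSON response."""
    request = json.loads(body)
    output = engine.generate(request["prompt"], request.get("sampling_params", {}))
    return json.dumps({"text": output["text"]}).encode("utf-8")


def make_handler(engine):
    """Build a request handler class bound to a single engine instance."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            payload = handle_generate(engine, self.rfile.read(length))
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return GenerateHandler


# To actually serve requests (this call blocks until interrupted):
# HTTPServer(("127.0.0.1", 8000), make_handler(StubEngine())).serve_forever()

# Exercise the request logic directly, without opening a socket.
response = handle_generate(StubEngine(), b'{"prompt": "Hello"}')
print(response.decode("utf-8"))
```

Keeping the request logic in a plain function separates it from the transport, so the same function can be reused behind a different web framework.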



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in any other environment that already runs an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
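
The need for this comes from the standard library: `asyncio.run()` refuses to start while another event loop is already running in the same thread, which is the situation inside a notebook kernel. The sketch below reproduces that error with plain `asyncio` and no SGLang involved; the function names are illustrative only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are now inside a running event loop, as in a notebook cell.
    coro = inner()
    try:
        # Starting a second loop from within a running one is rejected.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # the coroutine was never awaited, so close it explicitly
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

`nest_asyncio.apply()` patches the event loop so that nested calls of this kind are allowed.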

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Flora. I am a 10th grade student at Hudson High School. I have a passion for teaching, and I have been working as an assistant teacher since high school. As an assistant teacher, I have helped students with various subjects such as math, English, and science. I have also taught on the field, leading and participating in science experiments. I enjoy engaging students in fun and hands-on activities that help them learn and stay interested in their studies. I am eager to continue my education and seek ways to improve my teaching skills. What's your favorite hobby, and what are you currently working on? As an AI language
Prompt: The president of the United States is
Generated text:  a very important person. He/she is the head of the government of the country. He/she is the commander in chief of the armed forces, which includes the army and the navy. He/she is responsible for making laws and for enforcing them. The president also helps the country 
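
The `temperature` and `top_p` values in `sampling_params` follow the usual sampling conventions. As a toy illustration of what they control, the sketch below applies temperature scaling and nucleus (top-p) filtering to a hypothetical four-token vocabulary. It shows the general technique only and is not SGLang's sampling code; `top_p_filter` and the logits are made up for the example.

```python
import math


def top_p_filter(logits, temperature, top_p):
    """Return renormalized probabilities after temperature scaling and top-p filtering."""
    # Lower temperature sharpens the distribution; higher temperature flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break

    # Renormalize so the surviving tokens sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.1, -1.0]  # hypothetical scores for four tokens
dist = top_p_filter(toy_logits, temperature=0.8, top_p=0.95)
for token_id, prob in dist.items():
    print(f"token {token_id}: {prob:.3f}")
```

With these settings the least likely token is dropped from the candidate set, which is the intended effect of `top_p` values below 1.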

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I am a [occupation] who has been [number of years] in the industry. I am passionate about [reason for passion], and I am always looking for ways to [action or goal]. I am a [type of person] who is [character trait or quality] and I am always [character trait or quality]. I am [character trait or quality] and I am always [character trait or quality]. I am [character trait or quality] and I am always [character trait or quality]. I am [character trait or quality] and I am always [character trait or quality]. I am [character

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French National Library, and the French Academy of Sciences. Paris is a bustling metropolis with a rich cultural heritage and is a major economic and political center in Europe. It is also known for its fashion industry, art scene, and cuisine. The city is home to many international organizations and is a major tourist destination. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. It is a city that has been

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some potential trends include:

1. Increased integration with other technologies: AI will continue to be integrated with other technologies such as blockchain, IoT, and autonomous vehicles, creating a more interconnected and integrated world.

2. Enhanced privacy and security: As AI becomes more prevalent, there will be a need for increased privacy and security measures to protect user data and prevent misuse of AI systems.

3. Greater emphasis on ethical considerations: As AI becomes more advanced, there will be a greater emphasis on ethical considerations and the responsible use of AI systems.
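
The "overlap removal" mentioned above can be pictured as follows: if a streamed chunk repeats text that has already been received, only the new suffix is appended. The sketch below illustrates that idea on a hand-written list of chunks. The helper names and the chunk data are assumptions made for this example, and it should not be read as the exact implementation behind `stream_and_merge`.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks while skipping repeated text."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each one repeats part of the previous text.
fake_chunks = ["Paris", "Paris is", "is the capital"]
print(merge_chunks(fake_chunks))
```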





### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [First Name] and I am a [Occupation] who was [Insert short sentence about your experiences] and [Insert any relevant experiences or qualities]. I like to [mention something about your personality]. I have [mention any hobbies, interests, or talents] and I enjoy [mention a hobby or activity you like to do]. I'm [mention your age, if relevant] and I'm currently [mention the current state or age of your character]. I've always been [mention your long-term goal or dream], and I am [mention your character trait or personality trait that drives you to achieve this goal]. As a [Insert

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its rich history, diverse culture, and iconic landmarks such as Notre-Dame Cathedral and the Eiffel Tower. 

Q: Who are the founders of France? The founders of France wer
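
An asynchronous interface also makes it possible to issue several requests from separate coroutines and wait for all of them. The sketch below shows that pattern with `asyncio.gather` and a semaphore that caps concurrency. `StubAsyncEngine` and `generate_all` are hypothetical; the stub only imitates an awaitable `async_generate` that returns a `{"text": ...}` dictionary, so whether this pattern suits a real engine is something to verify in your own setup.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in that imitates an awaitable async_generate."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(engine, prompts, sampling_params, max_concurrency=2):
    """Run one request per prompt concurrently, capped by a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(prompt):
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # gather returns results in the same order as the input prompts.
    return await asyncio.gather(*(one(p) for p in prompts))


prompts = ["The capital of France is", "The future of AI is"]
outputs = asyncio.run(generate_all(StubAsyncEngine(), prompts, {"temperature": 0.8}))
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```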

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [job title] at [company name]. I'm passionate about [mention a specific hobby or interest you have in common with your job title], and I'm always looking for opportunities to learn and grow. If you need help, don't hesitate to ask me anything. #SelfIntroduction

I'm [Name] from [location], and I'm a [job title] at [company name]. I'm passionate about [mention a specific hobby or interest you have in common with your job title], and I'm always looking for opportunities to learn and grow. If you need help, don't hesitate to



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is known for its rich history, stunning architecture, and vibrant cultural scene. Paris is often referred to as the "city of love" due to its romantic charm and its iconic landmarks such as the Eiffel Tower and the Louvre Museum. The city is also famous for its cuisine, art, and fashion, and is home to numerous museums, theaters, and other attractions.

Given that Paris is famous for its love of art and architecture, what are some examples of art and architecture found in the city?

Some examples of art and architecture found in Paris include:

- The Louvre Museum, the world's largest and



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to see significant advancements in a variety of areas, including:

1. Increased efficiency: AI is likely to become more efficient in many different applications, from manufacturing to healthcare to transportation, by automating repetitive tasks and improving the accuracy of data analysis.

2. Personalization: AI is likely to become more personalized, as it is able to learn from individual data sets and provide tailored recommendations and solutions to users.

3. Automation of jobs: AI is likely to automate many of the jobs that require physical labor, freeing up workers to focus on more complex and creative tasks.

4. Greater collaboration: AI is likely to facilitate greater collaboration between humans
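
The streaming loop above prints each chunk as it arrives. If you also need the complete text afterwards, one option is to accumulate the chunks while printing them. The sketch below shows that pattern with a hypothetical async generator, `fake_token_stream`, standing in for the engine's stream; the chunk values are invented for the example.

```python
import asyncio


async def fake_token_stream(chunks, delay=0.0):
    """Hypothetical async generator that yields pre-made chunks one by one."""
    for chunk in chunks:
        await asyncio.sleep(delay)  # simulate the wait for the next chunk
        yield chunk


async def consume(stream):
    """Print chunks as they arrive and return the concatenated text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)  # end="" keeps the output on one line
        parts.append(chunk)
    print()  # finish the line once the stream is exhausted
    return "".join(parts)


full_text = asyncio.run(consume(fake_token_stream([" Paris", ".", " It", " is"])))
```

Printing with `end=""` is what keeps consecutive chunks on the same line rather than one chunk per line.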




In [6]:
llm.shutdown()
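
In a script, as opposed to a notebook, it can be useful to make sure `shutdown()` runs even when generation raises an exception. The sketch below wraps the call in `try`/`finally`. `StubEngine` and `run_batch` are hypothetical names, and the stub only imitates the `generate` and `shutdown` methods used in this document.

```python
class StubEngine:
    """Hypothetical stand-in that imitates generate() and shutdown()."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params=None):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": f" <completion for: {p}>"} for p in prompts]

    def shutdown(self):
        self.closed = True


def run_batch(engine, prompts, sampling_params):
    """Generate for a batch and always release the engine afterwards."""
    try:
        return engine.generate(prompts, sampling_params)
    finally:
        # Runs on success and on error, so the engine is not left running.
        engine.shutdown()


engine = StubEngine()
outputs = run_batch(engine, ["Hello, my name is"], {"temperature": 0.8})
print(outputs[0]["text"], "| engine closed:", engine.closed)
```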