# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## SPECIAL WARNING!!!!

**To launch the offline engine in a Python script, the** `if __name__ == "__main__":` **guard is necessary, because SGLang uses** `spawn` **mode to create subprocesses. Please refer to this simple example:**

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
# Only the imports used by the cells in this document are kept.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

The following error message 'operation scheduled before its operands' can be ignored.





### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Aparna, and I am the newest addition to the team. I am thrilled to be here and learn from the talented and dedicated team at Hirea. I am a digital enthusiast with a passion for storytelling and creative problem-solving. I bring a unique blend of artistic and analytical skills to the table, having a background in fine arts and design. I am excited to apply my skills in content creation, social media management, and creative writing to help Hirea's clients achieve their marketing goals. In my free time, I love to explore new places, practice yoga, and read about different cultures and histories. I look forward to connecting with
Prompt: The president of the United States is
Generated text:  often described as the most powerful person in the world. With the power to sign bills into law, command the military, and negotiate treaties, the president plays a central role in shaping American domestic and foreign policy.
Despite the power that comes wit
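
The `sampling_params` dictionary in this cell sets `temperature` and `top_p`. As a rough illustration of what these two settings conventionally mean, the sketch below applies temperature scaling and nucleus (top-p) filtering to a made-up logit vector. It is a minimal toy under stated assumptions, not SGLang's sampler: the helper names `apply_temperature` and `top_p_filter` and the logit values are hypothetical.

```python
import math


def apply_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize over the surviving tokens; all others get probability 0.
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


# Hypothetical scores for four candidate tokens (illustration only).
toy_logits = [2.0, 1.0, 0.5, -1.0]
probs = apply_temperature(toy_logits, temperature=0.8)
nucleus = top_p_filter(probs, top_p=0.95)
print([round(p, 3) for p in nucleus])
```

In this toy setup, a lower temperature concentrates probability on the highest-scoring token, and a lower `top_p` removes more of the unlikely tail before sampling.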

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in Tokyo. I enjoy reading, hiking, and trying new foods. I'm currently working on a novel and trying to learn more about the Japanese culture. I'm a bit of a introvert, but I'm always up for a good conversation. I'm looking forward to meeting new people and making connections.
This is a good start, but it's a bit too long and could be more concise. Here's a revised version: Hello, my name is Kaida. I'm a 25-year-old freelance writer living in Tokyo. I enjoy reading, hiking, and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
The capital of France is Paris. This is a concise and factual statement about France’s capital city. It does not include any additional information or opinions, making it a clear and accurate statement. The statement is also easy to understand and does not require any further explanation. This is an example of a well-written factual statement. The statement is also neutral and does not express any bias or opinion. It simply states a fact, making it a reliable source of information. The statement is also concise, making it easy to read and understand. It does not include any unnecessary words or information, making it a clear and effective statement. Overall,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Some experts predict that AI will become increasingly integrated into our daily lives, while others warn of the potential risks and challenges associated with its development. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, with the potential to improve patient outcomes and reduce healthcare costs.
2. Widespread adoption of AI in education: AI has the potential to revolutionize the way we learn,
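
The banner printed by this cell mentions "overlap removal". One way to picture the idea: if a streamed chunk may begin with text that has already been received, the merge step drops that repeated prefix before appending. The sketch below assumes chunks are plain strings; `drop_repeated_prefix` and `merge_chunks` are hypothetical names for this illustration and are not claimed to match how `stream_and_merge` is implemented.

```python
def drop_repeated_prefix(existing, chunk):
    """Return chunk without the longest prefix that already ends `existing`."""
    max_overlap = min(len(existing), len(chunk))
    for size in range(max_overlap, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text repeated at each chunk boundary."""
    merged = ""
    for chunk in chunks:
        merged += drop_repeated_prefix(merged, chunk)
    return merged


# Hypothetical chunks in which each one repeats the tail of the previous one.
chunks = ["The capital of", " of France", " France is Paris."]
print(merge_chunks(chunks))
```

A suffix/prefix match like this is a heuristic: it would also collapse text that the model legitimately repeats across a chunk boundary, so it only makes sense when overlapping chunks are actually expected.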



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Astrid Winters. I'm a freelance writer and historian with a specialization in 19th-century European history. I've written articles and blog posts on a variety of topics, including the Paris Commune and the life of Florence Nightingale. I'm currently working on a book about the women of the British Suffragette movement.
Here are some suggestions for writing a neutral self-introduction for Astrid Winters:
Use a simple, straightforward format. Start with a brief statement of your name, followed by a concise statement of your profession or area of expertise. You can add a few details about your work or current projects if you

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
What is the name of the famous river that runs through Paris? The Seine River.
What is the name of the famous landmark located in Paris
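
Because the asynchronous call is awaited, it can share an event loop with other coroutines. The sketch below shows that general pattern with `asyncio.gather`. It uses a stand-in coroutine, `fake_async_generate`, which is hypothetical and only sleeps instead of calling a model, so the example runs without SGLang; whether issuing several real requests this way is beneficial is an assumption not covered here.

```python
import asyncio
import time


async def fake_async_generate(prompt, delay=0.1):
    """Stand-in for an awaitable generation call; returns a dict with 'text'."""
    await asyncio.sleep(delay)  # simulates waiting on the engine
    return {"text": f" <completion for: {prompt}>"}


async def generate_all(prompts):
    """Await one request per prompt concurrently and keep the input order."""
    tasks = [fake_async_generate(p) for p in prompts]
    return await asyncio.gather(*tasks)


prompts = ["Hello, my name is", "The capital of France is"]
start = time.perf_counter()
outputs = asyncio.run(generate_all(prompts))
elapsed = time.perf_counter() - start

for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
print(f"Elapsed: {elapsed:.2f}s")
```

`asyncio.gather` returns results in the order the awaitables were passed in, so pairing them with `prompts` via `zip` stays correct even if the individual calls finish in a different order.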

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Evelyn Vega. I am a 25-year-old nurse who recently moved to Los Angeles from a small town in the Midwest. I work on the oncology floor at a large hospital and enjoy hiking and trying new foods in my free time. That's me. I am a bit of a homebody and value my alone time, but I am also looking forward to exploring this new city and meeting new people.
Answer: Hello, my name is Evelyn Vega. I am a 25-year-old nurse who recently moved to Los Angeles from a small town in the Midwest. I work on the oncology floor at a large hospital and enjoy hiking

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the northern part of the country. Paris is a major city with a population of over 2.1 million people. It is also a significant cultural and economic hub, known for its iconic landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral.
The post What is the capital city of France? first appeared on . Here is a quick excerpt.
Provide a concise factual statement about France’s capital city.
The capital of France is Paris, located in the northern part of the country. Paris is a major city with a population of over 2.1 million people. It is

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a subject of significant interest and speculation. While it's difficult to predict the future with certainty, here are some possible future trends in artificial intelligence:
1. **Increased Adoption in Industries**: AI will continue to be adopted across various industries, including healthcare, finance, education, and transportation. We can expect to see more AI-powered solutions in areas like medical diagnosis, personalized medicine, and predictive maintenance.
2. **Edge AI and IoT**: With the proliferation of Internet of Things (IoT) devices, edge AI will become increasingly important. Edge AI refers to the processing of AI-related tasks on devices or at the edge of the network, rather
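
The streaming cell consumes an async generator with `async for` and prints each piece with `end=""` so the text appears on one line as it arrives. The sketch below isolates that consumption pattern. It uses a stand-in async generator, `fake_token_stream`, which is hypothetical and yields fixed strings rather than model output, so it runs without SGLang.

```python
import asyncio


async def fake_token_stream(pieces, delay=0.01):
    """Stand-in async generator: yields text pieces with a small pause."""
    for piece in pieces:
        await asyncio.sleep(delay)
        yield piece


async def print_and_collect(stream):
    """Print pieces as they arrive and return the concatenated text."""
    collected = []
    async for piece in stream:
        print(piece, end="", flush=True)
        collected.append(piece)
    print()  # finish the line once the stream is exhausted
    return "".join(collected)


# Hypothetical pieces, shaped like the token-level output shown in this section.
pieces = [" Paris", ",", " located", " in", " the", " northern", " part"]
full_text = asyncio.run(print_and_collect(fake_token_stream(pieces)))
```

Collecting the pieces in a list and joining once at the end avoids repeated string concatenation and leaves the full text available after the stream finishes.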




In [6]:
llm.shutdown()