# Offline Engine API

SGLang provides a direct inference engine that does not require an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## SPECIAL WARNING!!!!

**To launch the offline engine in your Python scripts, an** `if __name__ == "__main__":` **guard is necessary, because the engine uses** `spawn` **mode to create subprocesses. Please refer to this simple example**:

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py
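
To illustrate why the guard matters, here is a minimal standard-library sketch that does not use SGLang at all. With the `spawn` start method, a child process starts a fresh interpreter and re-imports the main module, so any process-launching code left at module top level would run again in the child. The names `describe_process`, `worker`, and `main` are hypothetical and exist only for this illustration.

```python
import multiprocessing as mp
import os


def describe_process(label: str) -> str:
    """Return a short description of the current process."""
    return f"{label} running in pid {os.getpid()}"


def worker() -> None:
    # Runs in the child. It must be defined at module top level so the
    # spawned interpreter can find it after re-importing this module.
    print(describe_process("child"))


def main() -> None:
    # Explicitly request the "spawn" start method for this illustration.
    ctx = mp.get_context("spawn")
    proc = ctx.Process(target=worker)
    proc.start()
    proc.join(timeout=60)
    print(describe_process("parent"), "| child exit code:", proc.exitcode)


if __name__ == "__main__":
    # Without this guard, the re-import performed by the spawned child
    # would try to start another process while it is still bootstrapping.
    main()
```

In a script that uses the engine, the same shape would apply: construct `sgl.Engine(...)` inside `main()` (or directly under the guard) rather than at module top level.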

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.38it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Ivonne and I am a photographer based in Mexico City. I am a documentary photographer and have been working in this field since 2007. My work has been exhibited in several countries around the world, including the United States, Mexico, Spain, and Argentina. I have also received several awards and recognitions for my work, including the World Press Photo award.
My photography focuses on telling the stories of people and communities that are often overlooked by mainstream media. I am particularly interested in exploring the ways in which people respond to social and economic change, and how these changes affect their daily lives. I have worked on several projects that have taken
Prompt: The president of the United States is
Generated text:  the head of the executive branch, responsible for enforcing the laws and serving as the commander in chief of the armed forces. The president is elected through the Electoral College and serves a four-year te

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor. I live in a small apartment in the city and spend most of my free time reading and writing. I'm a bit of a introvert and prefer to keep to myself, but I enjoy meeting new people and trying new things. I'm currently working on a novel and trying to get my writing career off the ground. That's me in a nutshell.
This is a good example of a neutral self-introduction because it doesn't reveal too much about Kaida's personality or background. It simply states her name, age, occupation, and a few basic facts about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about France’s capital city.
The capital of France is Paris. 
This statement is a concise factual statement about France’s capital city, providing a clear and direct answer to the question. It does not include any additional information or opinions, making it a suitable example of a concise factual statement. 
Note: This response is a direct answer to the question and does not require any further analysis or explanation. It is a simple and clear statement of fact.  The tone is neutral and objective, providing a factual answer without any emotional or biased language. 
Let me know if you need any modifications! 
Here

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, with the potential to revolutionize the way we diagnose and treat diseases.
2. Widespread adoption of AI in education: AI has the potential to revolutionize the way we learn, with the ability to personalize education and make it more accessible to people around the world. In the future, AI is likely to be used
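
The "overlap removal" mentioned above can be sketched as follows. This is one plausible approach and not necessarily how `stream_and_merge` is implemented: it assumes that a streamed chunk may repeat the tail of the text received so far, and it strips the longest such repeated prefix before appending. The helper names `trim_overlap` and `merge_chunks` and the sample chunks are illustrative assumptions.

```python
from typing import Iterable


def trim_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing."""
    max_len = min(len(existing), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, removing text that a chunk repeats from the tail so far."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each one repeats part of the previous text.
chunks = [" Paris", " Paris is the", " the capital"]
print(repr(merge_chunks(chunks)))
```

A heuristic like this cannot distinguish an overlapping chunk from genuinely repeated text (for example, `"ha"` followed by `"ha"` collapses to `"ha"`), so it only suits streams where chunks are expected to overlap.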



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Ethan Thompson. I'm a 35-year-old web developer living in Chicago. I'm interested in video games, hiking, and trying new foods. I'm not particularly outgoing, but I enjoy meeting new people and making friends. I'm looking forward to getting to know everyone better. Ethan Thompson, here.
Answer: Ethan Thompson. I'm a 35-year-old web developer living in Chicago. I'm interested in video games, hiking, and trying new foods. I'm not particularly outgoing, but I enjoy meeting new people and making friends. I'm looking forward to getting to know everyone better. Ethan Thompson, here. [Sentence

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
This response is concise, factual, and directly answers the prompt. However, to further enhance the answer, we can provide additional details or context that might be rele
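
One reason to prefer the asynchronous call is that other coroutines can make progress while generation is pending. The sketch below shows that pattern with `asyncio.gather`. It uses a hypothetical `StubEngine` that only mimics the call shape shown above (`async_generate(prompts, sampling_params)` returning a list of dicts with a `"text"` key) and does not run a model; `heartbeat` and `run` are likewise illustrative names.

```python
import asyncio
from typing import Dict, List, Tuple


class StubEngine:
    """Hypothetical stand-in for the engine: echoes prompts instead of running a model."""

    async def async_generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict]:
        await asyncio.sleep(0.01)  # simulate time spent waiting on generation
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(ticks: List[int]) -> None:
    """Other work that proceeds while generation is pending."""
    for i in range(3):
        ticks.append(i)
        await asyncio.sleep(0)


async def run(engine, prompts: List[str]) -> Tuple[List[Dict], List[int]]:
    ticks: List[int] = []
    # Await the batch request and the unrelated coroutine concurrently.
    outputs, _ = await asyncio.gather(
        engine.async_generate(prompts, {"temperature": 0.8, "top_p": 0.95}),
        heartbeat(ticks),
    )
    return outputs, ticks


outputs, ticks = asyncio.run(run(StubEngine(), ["Hello, my name is", "The capital of France is"]))
for output in outputs:
    print(output["text"])
```

With the real engine, `llm` would take the place of `StubEngine()`; how much the calls actually overlap depends on the engine and is not shown by this stub.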

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaia Blackwood. I'm a 17-year-old high school student and a part-time waitress. I work at a diner downtown, where I've been for about a year now. I enjoy listening to music and reading when I'm not at school or working. I'm a bit of a private person, but I like to make a good impression. I've recently moved to this town, so I'm still getting used to the area and the people. I'm hoping to make some new friends and connections here.
Character Development: What can you infer about Kaia from this self-introduction?
Kaia seems to be a bit

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Provide a concise factual statement about France’s second-largest city. The second-largest city in France is Marseille. Provide a concise factual statement about the capital city of France. The capital city of France is called Paris. Provide a concise factual statement about France’s largest city. The largest city in France is Paris. Provide a concise factual statement about France. France is a country located in Western Europe. Provide a concise factual statement about France’s official language. The official language of France is French. Provide a concise factual statement about the flag of France. The flag of France features three vertical bands of blue, white, and red. Provide a

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  not set in stone, but here are some possibilities based on current research and trends.
Some possible future trends in AI include:
1. Increased use of Explainable AI (XAI): As AI becomes more prevalent in decision-making processes, there is a growing need to understand how AI systems arrive at their conclusions. XAI will help to build trust in AI by providing transparent and interpretable explanations for its decisions.
2. More emphasis on Human-AI Collaboration: As AI becomes more capable, it will likely be used in conjunction with human workers to augment their abilities. This will lead to new job opportunities and new ways of working together with AI


In [6]:
llm.shutdown()