# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch  # imported only when running in CI


# Load the model and start the in-process engine.
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Ray and I am a proud veteran of the United States Marine Corps.
I am a 1986 graduate of the United States Naval Academy, where I earned a Bachelor of Science degree in English. After graduation, I was commissioned as a 2nd Lieutenant in the United States Marine Corps and served for 20 years as an Infantry Officer.
During my time in the Corps, I had the opportunity to serve in various positions, including Commanding Officer of a company-sized unit, Battalion Operations Officer, and Executive Officer of a battalion. I was also fortunate enough to serve as a liaison officer to the Joint Chiefs of Staff at the Pentagon. My
Prompt: The president of the United States is
Generated text:  giving a speech, but for some reason, you can only hear what he’s saying when he says the word “air.” You can hear every other word, but when he says “air” – nothing. A complete silence follows.
Here’s the excerpt of the speech from the radio:
President: “My fellow A
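
For offline batch jobs it is often convenient to store each prompt next to its completion. The following is a minimal sketch, assuming each output is a dict with a `text` key as in the cell above. The helper name `results_to_jsonl` is hypothetical, and the example outputs are hand-written stand-ins rather than real engine output.

```python
import json


def results_to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text and serialize as JSON Lines."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "text": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Hand-written stand-in for the list returned by llm.generate(...).
example_prompts = ["The capital of France is"]
example_outputs = [{"text": " Paris."}]
print(results_to_jsonl(example_prompts, example_outputs))
```

The returned string can be written to a `.jsonl` file so that later processing steps do not need the engine to be running.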

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in Tokyo. I enjoy exploring the city's hidden corners and trying new foods. I'm currently working on a novel about a young woman who discovers a mysterious underground art scene in the city. When I'm not writing, you can find me practicing yoga or browsing through used bookstores. I'm a bit of a introvert, but I'm always up for a good conversation.
This self-introduction is neutral because it doesn't reveal too much about Kaida's personality, interests, or motivations. It simply presents a brief overview of who she is and what she does.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is located in the northern part of the country, in the region of Île-de-France. Paris is known for its rich history, cultural landmarks, and romantic atmosphere. The city is home to many famous museums, such as the Louvre and the Orsay, as well as iconic landmarks like the Eiffel Tower and Notre-Dame Cathedral. Paris is also a major center for fashion, cuisine, and art. The city has a population of over 2.1 million people and is a popular tourist destination. The official language of Paris is French, and the city is divided into 20 arrondissements

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. While it's difficult to predict exactly what the future holds, here are some possible trends that could shape the development and impact of artificial intelligence:
1. Increased Adoption in Industries: AI is already being used in various industries, such as healthcare, finance, and transportation. As the technology improves, we can expect to see increased adoption in more sectors, leading to greater efficiency and productivity gains.
2. Advancements in Natural Language Processing (NLP): NLP is a key area of AI research, enabling machines to understand and generate human language. Future advancements in NLP could lead to more sophisticated chatbots
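
The "overlap removal" mentioned in the cell above can be pictured as trimming any text at the start of a new chunk that already ends the accumulated output. The sketch below illustrates that idea under this assumption. It is not presented as the implementation of `stream_and_merge`; the names `trim_overlap` and `merge_chunks` and the sample chunks are illustrative only.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Stand-in chunks; a real stream would come from the engine.
print(merge_chunks([" Paris", " Paris is", " is the capital"]))
```

A suffix/prefix heuristic like this cannot tell a streaming overlap from a genuine repetition in the text (for example `"ha"` followed by `"ha"`), so it is only appropriate when chunks are known to overlap.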



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Lyra Flynn. I'm a 25-year-old freelance journalist and writer, currently living in downtown Los Angeles. When I'm not writing, you can find me exploring the city's hidden corners or practicing yoga. I'm passionate about storytelling and social justice. I'm not quite sure what the future holds, but I'm excited to see where life takes me.
This self-introduction is neutral in tone, meaning it doesn't reveal too much about the character's personality, background, or motivations. It simply provides a brief overview of who they are and what they do. To make it more engaging, you could consider adding a few more

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is the largest city in France and is located in the north-central part of the country, near the Seine River. It has a population of more than 2.1 
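
When requests come from separate coroutines rather than as one list, it can help to cap how many are awaited at once. The sketch below shows one way to do that with `asyncio.Semaphore` and `asyncio.gather`. `fake_generate` is a hypothetical stand-in for an awaitable generation call, and `generate_bounded` and `max_in_flight` are illustrative names, not part of SGLang.

```python
import asyncio


async def fake_generate(prompt):
    """Hypothetical stand-in for an awaitable generation call."""
    await asyncio.sleep(0.01)
    return {"text": f" <completion for: {prompt}>"}


async def generate_bounded(prompts, generate_fn, max_in_flight=2):
    """Run generate_fn on every prompt with at most max_in_flight concurrent calls."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def run_one(prompt):
        async with semaphore:
            return await generate_fn(prompt)

    # gather preserves input order, so results line up with prompts.
    return await asyncio.gather(*(run_one(p) for p in prompts))


results = asyncio.run(generate_bounded(["a", "b", "c"], fake_generate))
print([r["text"] for r in results])
```

`asyncio.run` raises an error when an event loop is already running, as is typically the case inside a notebook, so this form suits standalone scripts.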

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Riven Wynter. I'm a 20-year-old student currently studying ancient history at the University of Oxford. I'm from a small town in the Midwest. What do you think? Is it neutral and suitable for a character profile?
The introduction is fairly neutral and provides basic information about the character. However, it might benefit from a bit more depth or personality. Here are a few suggestions to consider:
- Add a bit more detail about the character's interests or hobbies. For example, "I'm a 20-year-old student currently studying ancient history at the University of Oxford, and I'm particularly fascinated by the mythology of



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  located along the Seine River. The Seine is one of Europe’s major rivers. The capital is known for its iconic landmarks such as the Eiffel Tower and Notre-Dame Cathedral. The city has a population of over 2 million residents.
The city has a rich history dating back to the Roman Empire. It was occupied by the Germans during World War II and was a major center for the French Resistance. The city played a significant role in the French Revolution. The city has a diverse cultural scene with numerous museums and art galleries. The city is famous for its cuisine and wine. Paris is a major hub for fashion and design



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by advancements in areas such as natural language processing, computer vision, and machine learning. These advancements will lead to more sophisticated AI systems that can perform tasks that previously required human intelligence, such as understanding human emotions and behaviors, and making decisions based on complex data.
One trend that is likely to emerge in the future of AI is the development of more autonomous and self-sufficient systems. This could include robots that can navigate and interact with their environment without human intervention, as well as AI systems that can learn and adapt to new situations without being explicitly programmed.
Another trend that is likely to emerge is the increasing use of AI in
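
The streaming cell above prints each chunk as it arrives. A caller that also needs the complete text can collect the chunks while forwarding them to a callback. This is a minimal sketch of that pattern: `fake_stream` is a hypothetical stand-in for an async stream of text chunks, and `collect_stream` and `on_chunk` are illustrative names.

```python
import asyncio


async def fake_stream(text, chunk_size=4):
    """Hypothetical stand-in for an async stream of text chunks."""
    for start in range(0, len(text), chunk_size):
        await asyncio.sleep(0)  # yield control between chunks
        yield text[start:start + chunk_size]


async def collect_stream(stream, on_chunk=None):
    """Consume an async iterator of chunks and return the full text."""
    parts = []
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk)  # e.g. print the chunk as soon as it arrives
        parts.append(chunk)
    return "".join(parts)


full_text = asyncio.run(collect_stream(fake_stream(" Paris is the capital.")))
print(full_text)
```

Passing `on_chunk=lambda c: print(c, end="", flush=True)` would reproduce the incremental display used in the cell above while still returning the merged string.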




In [6]:
llm.shutdown()
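
In a script, an exception raised before `llm.shutdown()` would skip the shutdown call. One way to guard against that is a small context manager, sketched below. `shutting_down` is a hypothetical helper, and `FakeEngine` is a stand-in that exposes only the `shutdown()` method used in the cell above.

```python
from contextlib import contextmanager


@contextmanager
def shutting_down(engine):
    """Yield engine and call its shutdown() on exit, even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


class FakeEngine:
    """Hypothetical stand-in exposing only the shutdown() method used above."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


engine = FakeEngine()
with shutting_down(engine) as llm:
    pass  # run generation calls here
print(engine.is_shut_down)
```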