# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
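
To make the idea of a custom server concrete, here is a minimal, transport-agnostic sketch of a request handler. It is not the linked `custom_server.py`. `EchoEngine` is a hypothetical stand-in so the snippet runs without a GPU, and its `async_generate` is only assumed to return a dict with a `text` key, matching the output shape used elsewhere in this document. The request fields `prompt` and `sampling_params` are likewise illustrative choices.

```python
import asyncio
import json


class EchoEngine:
    """Hypothetical stand-in for the engine so the sketch runs anywhere."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0)  # yield control, as a real awaitable call would
        return {"text": f" [echo] {prompt}"}


async def handle_request(engine, raw_body: str) -> str:
    """Parse a JSON request body, call the engine, return a JSON response body."""
    request = json.loads(raw_body)
    output = await engine.async_generate(
        request["prompt"], request.get("sampling_params", {})
    )
    return json.dumps({"text": output["text"]})


async def demo():
    engine = EchoEngine()
    body = json.dumps({"prompt": "Hello", "sampling_params": {"temperature": 0.8}})
    return await handle_request(engine, body)


response = asyncio.run(demo())
print(response)
```

Keeping the handler independent of any web framework means the same function could be wired into whichever transport you choose, with the stand-in replaced by a real engine instance.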



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch  # imported only when running in CI


# Load the model and start the engine
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Christopher Walker, and I am a software engineer at Apple. My work primarily focuses on improving the user experience for Apple’s internal teams through various software applications. Prior to my current role, I worked at Facebook as a software engineer, where I contributed to various projects including the Facebook Ads platform.
As a software engineer, I have a strong passion for technology and innovation. My technical background includes proficiency in languages such as Java, Python, and Swift, and I have experience with various frameworks and tools such as Spring, Django, and Cocoa. I am excited to leverage my skills and expertise to contribute to the development of innovative solutions that drive business growth
Prompt: The president of the United States is
Generated text:  the head of state and the head of government of the United States, and is the commander-in-chief of the world's most powerful military forces. The president serves a fo
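
In batch jobs it is often convenient to store results instead of printing them. The sketch below pairs each prompt with its generated text and serializes the pairs as JSON Lines. It relies only on the `output['text']` field shown above. The helper names `to_records` and `to_jsonl` are hypothetical, and `example_outputs` is a hand-written stand-in for the list returned by `llm.generate`.

```python
import json


def to_records(prompts, outputs):
    """Pair each prompt with its generated text, preserving input order."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    return [{"prompt": p, "text": o["text"]} for p, o in zip(prompts, outputs)]


def to_jsonl(records):
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Stand-in for: outputs = llm.generate(prompts, sampling_params)
example_prompts = ["The capital of France is", "The future of AI is"]
example_outputs = [{"text": " Paris."}, {"text": " promising."}]

records = to_records(example_prompts, example_outputs)
print(to_jsonl(records))
```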

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor living in Tokyo. I enjoy reading, hiking, and trying new foods. I'm currently working on a novel and a collection of short stories. I'm looking forward to meeting new people and making connections in the writing community.
This self-introduction is neutral because it doesn't reveal any personal biases or opinions. It simply states the character's name, age, occupation, interests, and current projects. This kind of introduction is useful for a character who is trying to make a good impression or establish themselves in a new community. It's also a good way to show the

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
The capital of France is Paris. The city is located in the northern part of the country, along the Seine River. Paris is known for its rich history, art, fashion, and culture. It is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. The city is also a major hub for international business, finance, and tourism. Paris is a popular destination for visitors from around the world, attracting over 23 million tourists annually. The city is divided into 20 arrondissements, each with its own unique character and charm. Paris is a

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much debate and speculation. Some experts predict that AI will become increasingly integrated into our daily lives, while others warn of the potential risks and challenges associated with its development. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, with the potential to revolutionize the way we diagnose and treat diseases.
2. Widespread adoption of AI in education: AI has the potential to revolutionize the way we
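
The "overlap removal" mentioned in the cell above can be illustrated with a small, self-contained sketch. This is one possible way to merge streamed chunks whose text partially repeats what was already received. It is not claimed to be the implementation behind `stream_and_merge`. The names `trim_overlap` and `merge_chunks` and the `fake_stream` data are assumptions made for illustration.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunk texts, skipping any part that repeats the merged tail."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk["text"])
    return merged


# Hand-written stand-in for a stream of chunk dicts with overlapping text
fake_stream = [{"text": " Paris"}, {"text": " Paris is"}, {"text": " is the capital."}]
merged_text = merge_chunks(fake_stream)
print(merged_text)
```

One caveat of this heuristic: a chunk that legitimately repeats the end of the existing text (for example "ha" followed by "ha") is treated as overlap and dropped, so it fits best when chunks are known to resend earlier text.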



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Naomi. I'm a 25-year-old woman who works as an assistant professor in the field of linguistics at a small liberal arts college. I have a husband, a daughter, and a pet cat, and I enjoy reading, hiking, and playing the piano.
I'm a professor at a small liberal arts college where I teach linguistics courses to undergraduate students. I'm a family woman, a reader, a hiker, and a musician. I'm interested in language acquisition, phonetics, and sociolinguistics.
This self-introduction should be concise, neutral, and professional. It provides a brief overview of your character

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Provide a descriptive statement about France’s capital city. Paris is a beautiful city with stunning architecture, famous landmarks, and a romantic atmosphere.
Provide a statement about t
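
Because the asynchronous call is awaitable, several groups of prompts can be submitted from the same event loop and awaited together. The sketch below shows that pattern with `asyncio.gather`. `FakeAsyncEngine` is a hypothetical stand-in that only mimics an awaitable `async_generate` returning a list of dicts with a `text` key. Whether issuing concurrent calls is beneficial with the real engine is an assumption you should verify for your setup.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in; it only mimics an awaitable batch generate call."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate waiting on the model
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_groups(engine, groups, sampling_params):
    """Submit several prompt groups concurrently and wait for all of them."""
    tasks = [engine.async_generate(group, sampling_params) for group in groups]
    # gather returns results in the same order as the submitted awaitables
    return await asyncio.gather(*tasks)


groups = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(run_groups(FakeAsyncEngine(), groups, {"temperature": 0.8}))

for group, outputs in zip(groups, results):
    for prompt, output in zip(group, outputs):
        print(prompt, "->", output["text"])
```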

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Akira Akatsuki. I'm a 17-year-old high school student who's struggling to find my place in the world. I'm a bit of a daydreamer, often getting lost in my own thoughts, and I'm still figuring out who I am and where I belong. I enjoy reading, writing, and listening to music, but I'm not sure what the future holds for me. That's me in a nutshell. What do you think?
Here are a few areas where the introduction could be improved:
1. The introduction should be shorter. It's currently about 40 words, which is a bit too



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is located in the northern part of the country and is situated along the Seine River. The city is a major cultural and economic hub, known for its art museums, fashion, cuisine, and historical landmarks such as the Eiffel Tower and the Louvre. A population of around 2.1 million people live within the city limits, with over 12 million people in the metropolitan area. The city is divided into 20 arrondissements and is served by two international airports, Orly and Charles de Gaulle. The city is also home to many universities and research institutions, including the Sorbonne.



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  promising, with potential applications in healthcare, transportation, and education. However, it also raises concerns about job displacement and bias in decision-making. The development of more advanced AI systems will require careful consideration of these issues and the need for increased transparency and accountability. As AI continues to evolve, we can expect to see more sophisticated systems that can learn, reason, and interact with humans in more natural and intuitive ways. The increasing use of AI in various industries and aspects of life is inevitable, and it is essential to be aware of its potential implications.
Possible Future Trends in AI:
1. **Increased Adoption in Industries:** AI will continue to be
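
The streaming loop in the cell above prints each cleaned chunk as it arrives. The sketch below shows, with a stand-in stream, how an async generator can turn chunks into "new text only" pieces. It assumes each chunk dict carries the text generated so far under `text`, which is an assumption for illustration and not a statement about the engine's chunk format. `fake_stream` and `new_text_only` are hypothetical names.

```python
import asyncio


async def fake_stream(pieces):
    """Hypothetical stand-in for a streaming call: yields the text generated so far."""
    so_far = ""
    for piece in pieces:
        so_far += piece
        await asyncio.sleep(0)  # hand control back to the event loop
        yield {"text": so_far}


async def new_text_only(stream):
    """Yield only the part of each chunk that has not been seen yet."""
    seen = ""
    async for chunk in stream:
        text = chunk["text"]
        if text.startswith(seen):
            delta = text[len(seen):]  # cumulative chunk: keep only the tail
        else:
            delta = text  # chunk does not extend what we have: keep it whole
        seen += delta
        if delta:
            yield delta


async def collect():
    deltas = []
    async for delta in new_text_only(fake_stream([" Ak", "ira", " Akatsuki", "."])):
        deltas.append(delta)
    return deltas


deltas = asyncio.run(collect())
print("".join(deltas))
```

In an interactive setting each `delta` would be printed with `end=""` and `flush=True`, as in the cell above, so the text appears as one continuous line.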




In [6]:
llm.shutdown()
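
If generation code can raise, it may be worth guaranteeing that `shutdown()` is still called. A minimal sketch of that pattern using a context manager follows. `FakeEngine` is a hypothetical stand-in that exposes only a `shutdown()` method like the one called above, and `managed_engine` is an illustrative helper, not part of SGLang.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in exposing a shutdown() method like the engine above."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call shutdown(), even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with managed_engine(engine) as llm_like:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(engine.is_shut_down)
```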