# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a separate server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## SPECIAL WARNING!!!!

**To launch the offline engine in a Python script, the** `if __name__ == "__main__":` **guard is required, because the engine uses** `spawn` **mode to create subprocesses. Please refer to this simple example**:

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py
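
The guard pattern can be sketched as follows. This is a minimal illustration under stated assumptions, not the contents of the linked example: the `main` function name and the single prompt are chosen for illustration, the model path and sampling parameters are reused from the cells later in this document, and the `find_spec` check is an added convenience so the script exits cleanly where SGLang is not installed.

```python
import importlib.util

MODEL_PATH = "meta-llama/Meta-Llama-3.1-8B-Instruct"
SAMPLING_PARAMS = {"temperature": 0.8, "top_p": 0.95}


def main():
    # Imported here so this module can be imported without SGLang installed.
    import sglang as sgl

    llm = sgl.Engine(model_path=MODEL_PATH)
    try:
        outputs = llm.generate(["The capital of France is"], SAMPLING_PARAMS)
        for output in outputs:
            print(output["text"])
    finally:
        # Release the engine even if generation raises.
        llm.shutdown()


# With the `spawn` start method, each subprocess re-imports this module.
# The guard keeps the engine from being launched again during that re-import.
if __name__ == "__main__":
    if importlib.util.find_spec("sglang") is None:
        print("sglang is not installed; skipping engine launch")
    else:
        main()
```

Keeping all engine work inside `main()` means nothing heavy happens at import time, and that is the property the guard relies on.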

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]


Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:01<00:03,  1.02s/it]


Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.58it/s]


Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.29it/s]


Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.20it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.23it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Gabriel, I am a 20-year-old student at the University of Puerto Rico, Mayagüez campus. I am from the town of Arecibo, which is located in the northern coast of the island. I am majoring in mathematics with a minor in physics and I am currently in my third year of studies. I have a strong interest in mathematics, especially in number theory and algebraic geometry. My research interests lie in the areas of algebraic K-theory and its connections to the study of elliptic curves.
As an undergraduate student, I have had the opportunity to participate in various research projects related to number theory and
Prompt: The president of the United States is
Generated text:  required by the Constitution to take the oath of office only once, before beginning the first term of office. Therefore, a president does not take an oath of office at the beginning of each term. Instead, the president takes the oath of office before the beginning of the first term an
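
In an offline batch job, the results are often written to disk rather than printed. The sketch below pairs each prompt with the `text` field of its output and serializes the pairs as JSON Lines. The helper names `to_records` and `to_jsonl` and the stand-in data are assumptions made for illustration; only the `output['text']` access comes from the cell above.

```python
import json


def to_records(prompts, outputs):
    """Pair each prompt with the 'text' field of its output dict."""
    return [
        {"prompt": prompt, "text": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


# Stand-in data shaped like the prompts and outputs used above.
example_prompts = ["The capital of France is"]
example_outputs = [{"text": " Paris."}]
print(to_jsonl(to_records(example_prompts, example_outputs)))
```

With a real engine, `example_outputs` would be replaced by the list returned from `llm.generate(prompts, sampling_params)`.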

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student. I'm a bit of a bookworm and enjoy reading fantasy novels in my free time. I'm also a member of the school's debate team and enjoy arguing about current events and social issues. I'm a bit of a introvert and prefer to spend time alone or with close friends, but I'm working on being more outgoing. I'm a junior in high school and am trying to decide on a college major. I'm interested in studying English or history, but I'm still exploring my options. I'm a bit of a perfectionist and can get stressed out

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and cuisine. Paris is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre Dame Cathedral. The city has a population of over 2.1 million people and is a major hub for international business, culture, and tourism. Paris is also known for its romantic atmosphere and is often referred to as the "City of Light." The city has a diverse population and is home to people from all

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems can analyze large amounts of medical data, identify patterns, and make predictions, leading to more accurate diagnoses and personalized treatment plans.
2. Advancements in natural language processing: Natural language processing (NLP) is a subset of AI that enables computers to understand and generate human language. Future advancements in NLP are expected to lead to more sophisticated
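
The "overlap removal" mentioned in the cell above can be pictured with a small sketch. It assumes that successive streamed chunks may repeat text that has already been received, and it keeps only the new part of each chunk. The function names `trim_overlap` and `merge_stream` and the fake chunk data are chosen for illustration; the actual `stream_and_merge` helper in `sglang.utils` may be implemented differently.

```python
def trim_overlap(existing_text, new_chunk):
    """Return the part of new_chunk that does not repeat the end of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks):
    """Merge an iterable of {'text': ...} chunks into one string."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk["text"])
    return merged


# Hypothetical chunks; the second and third repeat text already received.
fake_chunks = [{"text": " Paris"}, {"text": " Paris is"}, {"text": " is the capital"}]
print(repr(merge_stream(fake_chunks)))
```

The suffix-prefix comparison is quadratic in chunk length in the worst case, which is usually acceptable for short streamed chunks.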



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Elianore Quasar. I'm a 27-year-old astrobiologist studying the unique conditions of planetary atmospheres. I'm currently based at the Mars Research Station on the red planet. I'm here to learn and contribute to the ongoing research in my field. That's me in a nutshell. Nice to meet you. The introduction is neutral, as per the character's request, and doesn't reveal too much about their personal life or motivations. However, the mention of the Mars Research Station provides some context about their work and location, which can be a good starting point for further conversation or development. What does this self-introduction

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The Eiffel Tower is a famous landmark in which city? Paris
Identify a notable river that runs through Paris. Seine River
What is the p
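
Because `async_generate` is awaited, it can in principle be scheduled alongside other coroutines. The sketch below shows that shape using `asyncio.gather` and a stand-in `FakeEngine` class, so it runs without a GPU or a model. `FakeEngine`, `run_batches`, and the placeholder completions are all assumptions for illustration; how the real engine behaves under concurrent calls is not shown in this document and should be checked before relying on it.

```python
import asyncio


class FakeEngine:
    """Stand-in for the real engine; it only mimics the shape of async_generate."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # yield control once, as an awaited call would
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Create one async_generate call per batch and wait for all of them.
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
params = {"temperature": 0.8, "top_p": 0.95}
results = asyncio.run(run_batches(FakeEngine(), batches, params))
print([len(batch_outputs) for batch_outputs in results])
```

`asyncio.gather` returns results in the order the awaitables were passed, so `results[i]` corresponds to `batches[i]`.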

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Lily. I'm a 25-year-old graphic designer working in a small firm in downtown Seattle. I'm currently studying for my Master's degree in the evenings. That's about me. I'm not really into sports, but I do enjoy hiking and trying new restaurants. How would you like me to proceed with introducing my character?
Your introduction is clear and concise, which is great for a neutral self-introduction. However, to help you proceed, I'll offer a few suggestions to add some personality to your character.
You could reveal a bit more about Lily's background or interests. For example, you could mention where she's from

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is the most populous city in France and the centre of the Île-de-France region. Paris is located in the northern part of France, at the confluence of the Seine River, which is approximately 300 miles (480 km) northwest of the city of Lyon. The city covers an area of approximately 1,028 square miles (2,661 square kilometers). The city has a population of more than 2.1 million people, with a metropolitan area population of more than 12.2 million people. Paris is known for its rich history, art museums, fashion, cuisine, and architecture. It

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  uncertain, but there are several trends that may shape the field in the coming years.
Potential Future Trends in Artificial Intelligence:
1. Hybrid Intelligence: AI systems will be designed to work together with human intelligence, creating a hybrid intelligence that leverages the strengths of both.
2. Explainable AI: As AI becomes more pervasive, there will be a growing need to understand how AI systems make decisions, leading to the development of explainable AI.
3. Edge AI: With the proliferation of IoT devices, AI will be deployed at the edge, closer to the data source, enabling faster and more efficient processing.
4. Human-AI Collaboration:
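
The loop in the cell above consumes an async generator that yields text with repeats removed. A minimal sketch of that pattern is shown here with a stand-in stream, so it runs without the engine. `fake_stream`, `stream_without_repeats`, `trim_overlap`, and the chunk contents are assumptions for illustration; the real `async_stream_and_merge` in `sglang.utils` may work differently.

```python
import asyncio


def trim_overlap(existing_text, new_chunk):
    """Return the part of new_chunk that does not repeat the end of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


async def fake_stream():
    """Stand-in async generator yielding chunks shaped like {'text': ...}."""
    for text in [" Paris", " Paris is", " is the capital"]:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield {"text": text}


async def stream_without_repeats(stream):
    """Yield only the new text from each chunk of an async stream."""
    seen = ""
    async for chunk in stream:
        fresh = trim_overlap(seen, chunk["text"])
        seen += fresh
        if fresh:
            yield fresh


async def collect():
    # Gather the cleaned pieces into a list for inspection.
    return [piece async for piece in stream_without_repeats(fake_stream())]


pieces = asyncio.run(collect())
print(pieces)
```

Yielding pieces as they arrive lets the caller print incrementally, as the cell above does with `end=""` and `flush=True`.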




In [6]:
llm.shutdown()