# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. It suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Special Warning

**To launch the offline engine from a Python script, create it under an** `if __name__ == "__main__":` **guard. The guard is required because SGLang uses** `spawn` **mode to create subprocesses.** See this simple example:

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py
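
The script layout this warning calls for can be sketched as follows. This is a minimal illustration, not the content of the linked example: `run_offline`, `engine_factory`, and `main` are hypothetical names, and the `find_spec` check is there only so the sketch exits cleanly on a machine without SGLang. Only `sgl.Engine`, `generate`, and `shutdown` are taken from this document.

```python
import importlib.util


def run_offline(engine_factory, prompts, sampling_params):
    """Create an engine, run one batch, and always shut the engine down."""
    engine = engine_factory()
    try:
        outputs = engine.generate(prompts, sampling_params)
        return [output["text"] for output in outputs]
    finally:
        engine.shutdown()


def main():
    # Imported lazily so that importing this module has no side effects.
    import sglang as sgl

    texts = run_offline(
        lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct"),
        ["The capital of France is"],
        {"temperature": 0.8, "top_p": 0.95},
    )
    for text in texts:
        print(text)


if __name__ == "__main__":
    # Subprocesses started in `spawn` mode re-import this module. Without the
    # guard, the engine-creation code would run again inside each child.
    if importlib.util.find_spec("sglang") is None:
        print("sglang is not installed; skipping the demo")
    else:
        main()
```

Keeping engine creation inside a function that only the guarded block calls means nothing at module level starts a subprocess.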

## Advanced Usage

The engine also supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) and [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.11it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.76it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.37it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.19it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.26it/s]

### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Manuel.
I am a 21 year old from Spain and I am studying Hospitality Management. I have recently arrived to Melbourne and I am excited to learn and experience everything that this amazing city has to offer.
I have experience in the service industry, working as a barista and a waiter, and I am eager to expand my knowledge and skills in the hospitality field.
I am a friendly and hard working person, always looking for new challenges and opportunities to grow.
I am confident that I will be able to adapt quickly to Melbourne and make new friends.
I am looking for a part-time job, preferably in a cafe or restaurant, where I
Prompt: The president of the United States is
Generated text:  a politician who serves as the head of the federal government and the commander-in-chief of the armed forces. He is also the head of the executive branch of the government. The president is elected through the Electoral College, a process that involves the selection o
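
For offline jobs it is often convenient to store each prompt next to its completion. The sketch below shows one way to do that and is not part of SGLang: `write_generations_jsonl` is a hypothetical helper. It assumes only what the cell above shows, namely that `generate` returns one dict per prompt, in order, each with a `"text"` key.

```python
import json


def write_generations_jsonl(engine, prompts, sampling_params, path):
    """Run one batch and write one {"prompt", "text"} JSON object per line."""
    outputs = engine.generate(prompts, sampling_params)
    if len(outputs) != len(prompts):
        raise ValueError("expected exactly one output per prompt")

    with open(path, "w", encoding="utf-8") as f:
        for prompt, output in zip(prompts, outputs):
            record = {"prompt": prompt, "text": output["text"]}
            # ensure_ascii=False keeps non-ASCII characters readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

    # Number of records written.
    return len(outputs)
```

With the engine from this notebook, a call might look like `write_generations_jsonl(llm, prompts, sampling_params, "generations.jsonl")`, where the file name is arbitrary.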

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in Tokyo. I enjoy exploring the city's hidden corners and trying new foods. I'm currently working on a novel about a young woman's journey through the Japanese countryside. When I'm not writing, you can find me practicing yoga or browsing through used bookstores. I'm a bit of a introvert, but I'm always up for a good conversation.
This self-introduction is neutral because it doesn't reveal any personal opinions or biases. It simply presents Kaida's background, interests, and personality in a straightforward and factual way. This is a good approach for a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
The capital of France is Paris.
Paris is the capital and largest city of France, with a population of over 2.1 million people within the city limits. It is the most populous urban area in the European Union and the second most populous city in the European Union, after London. Paris is a global center for art, fashion, cuisine, and culture, and is one of the world's leading tourist destinations. The city is home to many famous landmarks, including the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. Paris is also a major hub for business, finance, and education, with

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. While it is difficult to predict exactly what the future will hold, there are several trends that are likely to shape the development and impact of artificial intelligence in the coming years. Here are some possible future trends in AI:
1. Increased Adoption in Various Industries: AI is expected to become increasingly adopted in various industries, including healthcare, finance, transportation, and education. This will lead to improved efficiency, productivity, and decision-making in these sectors.
2. Advancements in Natural Language Processing (NLP): NLP is a key area of AI research, and significant advancements are expected in the coming years.



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Astra Lumen. I'm a skilled warrior from a planet called Elyria, where the skies are a deep shade of indigo and the trees are a vibrant green. I've traveled extensively throughout the galaxy, honing my combat skills and learning new ways to harness the power of the elements. When I'm not fighting for justice, I enjoy exploring abandoned ruins and practicing my art of aeromancy. I'm a bit of a wanderer at heart, always seeking new challenges and adventures. Astra Lumen, at your service. Astra Lumen is a warrior and aeromancer from the planet Elyria,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is located in the Île-de-France region and is the center of French politics, economy, culture, and history. It is home to famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Mus
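
Because `async_generate` is awaited, it can run alongside other asynchronous work. The sketch below submits each prompt as its own request and caps how many are in flight. `generate_concurrently` and `max_concurrency` are hypothetical names. The sketch also assumes that calling `async_generate` with a single prompt string resolves to a single dict with a `"text"` key, which this document does not show directly.

```python
import asyncio


async def generate_concurrently(engine, prompts, sampling_params, max_concurrency=4):
    """Send one request per prompt, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def generate_one(prompt):
        async with semaphore:
            # Assumption: a single prompt string yields a single output dict.
            output = await engine.async_generate(prompt, sampling_params)
            return output["text"]

    # gather returns results in input order, whatever the completion order.
    return await asyncio.gather(*(generate_one(p) for p in prompts))
```

Passing the whole list in one call, as in the cell above, is simpler. The per-request pattern is mainly useful when prompts arrive at different times or must be interleaved with other coroutines.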

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  August. I'm a bit of an introvert, but I'm trying to work on that. I enjoy reading, collecting rare books, and playing chess. I live in a small, cozy apartment in the city. That's me in a nutshell. Nothing too exciting, but I like it that way.
Describe the personality of a fictional character. August is a introspective and analytical person. He's not very outgoing, but he has a dry sense of humor and can be witty when he wants to be. He's very detail-oriented and values precision and accuracy. He's also a bit of a perfectionist, which can sometimes make

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Next, provide a brief description of a notable landmark or attraction in Paris. The Eiffel Tower is a famous iron lattice tower located in the heart of Paris.
Discuss the historical significance of the Eiffel Tower and its purpose when it was first built. The Eiffel Tower was built in 1889 for the World’s Fair, held in Paris that year, to serve as the entrance arch. It was intended to be a temporary structure, but it became an iconic symbol of the city and was left standing after the fair.
Next, provide some basic information about the structure of the Eiffel Tower, including its

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  vast and uncertain, with several trends that may shape the field. Here are some possible future trends in AI:
  1. Increased adoption of AI in various industries: AI is already being used in industries such as healthcare, finance, and transportation. In the future, we can expect to see increased adoption of AI in other industries such as education, customer service, and manufacturing.
  2. Advancements in natural language processing: Natural language processing (NLP) is a subset of AI that deals with the interaction between computers and humans in natural language. We can expect to see advancements in NLP that enable computers to better understand and
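
When the streamed text is needed as a value rather than only printed, the chunks can be accumulated while each one is still forwarded to a callback. The sketch below works with any async iterator of strings, which is how `async_stream_and_merge` is consumed in the cell above. `collect_stream` and `on_chunk` are hypothetical names.

```python
from typing import AsyncIterator, Callable, Optional


async def collect_stream(
    chunks: AsyncIterator[str],
    on_chunk: Optional[Callable[[str], None]] = None,
) -> str:
    """Consume an async stream of text chunks and return the joined text."""
    parts = []
    async for chunk in chunks:
        if on_chunk is not None:
            # For example, print the chunk as soon as it arrives.
            on_chunk(chunk)
        parts.append(chunk)
    return "".join(parts)
```

Inside a coroutine, a call might look like `text = await collect_stream(async_stream_and_merge(llm, prompt, sampling_params), on_chunk=lambda c: print(c, end="", flush=True))`.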




In [6]:
llm.shutdown()