# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## SPECIAL WARNING!!!!

**To launch the offline engine from a Python script, you must create it under an** `if __name__ == "__main__":` **guard, because the engine uses** `spawn` **mode to create subprocesses. Please refer to this simple example**:

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:01<00:03,  1.01s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:02<00:02,  1.03s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:03<00:01,  1.02s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.32it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.17it/s]

### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Tommy and I'm a huge fan of the Harry Potter series. I'm a Gryffindor at heart, but I'm not biased (or am I?). I love reading the books, watching the movies, and even trying to learn a few spells (just kidding, or am I? . I'm excited to share my love of Harry Potter with everyone here and discuss all things Hogwarts with you!

What about you? Are you a fellow Gryffindor, or do you prefer a more... noble (ahem, Slytherin) approach? What's your favorite book or movie in the series? Let's get the conversation
Prompt: The president of the United States is
Generated text:  not above the law. He or she must comply with federal laws, including those that apply to non-citizens. They also must respect the rights of all people, including those in detention, to a fair and impartial trial.
President Trump's decision to separate children from their families at the border has been widely condemned by human rights organizations, faith communities, and a grow
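
When the prompt list is very large, it can be convenient to feed the engine in chunks and store each prompt next to its completion. The following is a minimal sketch, not part of SGLang: `chunked`, `generate_records`, and `EchoEngine` are hypothetical names, and the only assumption about the engine is the one visible in the cell above, namely that `generate` returns one dict with a `"text"` key per prompt. The engine already schedules requests within a single call, so chunking here is about bookkeeping (for example, saving partial results), not throughput.

```python
from typing import Any, Dict, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def generate_records(
    engine: Any,
    prompts: Sequence[str],
    sampling_params: Dict[str, Any],
    chunk_size: int = 256,
) -> List[Dict[str, str]]:
    """Call `engine.generate` chunk by chunk and pair each prompt with its text."""
    records = []
    for chunk in chunked(prompts, chunk_size):
        outputs = engine.generate(list(chunk), sampling_params)
        for prompt, output in zip(chunk, outputs):
            records.append({"prompt": prompt, "text": output["text"]})
    return records


class EchoEngine:
    """Hypothetical stand-in for `sgl.Engine` so the sketch runs without a GPU."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


records = generate_records(
    EchoEngine(), ["a", "b", "c"], {"temperature": 0.8}, chunk_size=2
)
for record in records:
    print(record)
```

With a real engine, you would pass `llm` in place of `EchoEngine()`.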

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student. I'm a bit of a bookworm and enjoy reading about history and science. I'm also a member of the school's debate team. I like to think I'm pretty laid-back and easy-going, but I can get pretty passionate about the topics I care about. I'm not really sure what I want to do with my life yet, but I'm open to exploring different possibilities. I'm a bit of a introvert, but I'm working on being more outgoing and meeting new people. That's me in a nutshell. What do you think? Is this

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about France’s capital city.
The capital of France is Paris. This statement is a concise factual statement about France’s capital city. It provides a clear and direct answer to the question, without any additional information or elaboration. It is a simple and straightforward statement that conveys the necessary information. This type of statement is often used in encyclopedias, dictionaries, and other reference materials where brevity and accuracy are essential. It is also a good example of a statement that can be used as a starting point for further research or discussion. Overall, the statement is clear, concise, and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of Explainable AI (XAI): As AI becomes more pervasive in decision-making, there will be a growing need to understand how AI systems arrive at their conclusions. XAI will become more important to ensure transparency and accountability in AI decision-making.
2. Rise of Edge AI: With the proliferation of IoT devices, there will be a growing need for AI to be deployed at the edge, closer to where data is generated. Edge AI will enable faster processing and reduced
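
The "overlap removal" mentioned in the cell above matters when successive stream chunks repeat text that was already emitted. The sketch below illustrates one way such merging could work; it is an assumption about the idea, not a copy of what `stream_and_merge` does internally, and `strip_overlap` and `merge_chunks` are hypothetical names.

```python
from typing import Iterable


def strip_overlap(merged: str, chunk: str) -> str:
    """Return the part of `chunk` that does not repeat the tail of `merged`."""
    # Try the longest possible overlap first, then shrink.
    for size in range(min(len(merged), len(chunk)), 0, -1):
        if merged.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks while dropping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


overlapping = merge_chunks([" Paris", " Paris is", " is known"])
disjoint = merge_chunks(["ab", "cd"])
print(repr(overlapping))
print(repr(disjoint))
```

One limitation of this kind of suffix/prefix matching is that a genuine repetition at a chunk boundary (for example two consecutive chunks that are both `"ha"`) is indistinguishable from an overlap and gets collapsed.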



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Luna Nightshade, but my friends call me Lu. I’m a 20-year-old student at the Silvermist Academy of Magic, studying the art of shadow manipulation. I enjoy reading fantasy novels, practicing my spells in my spare time, and playing with my black cat, Onyx. That’s a bit about me! I hope you can get to know me better. Luna Nightshade is a student at a school for magic users and studies shadow manipulation. She is 20 years old, loves reading fantasy novels, and has a pet cat. Here is a short, neutral self-introduction for a fictional character. Note: This introduction

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Provide a concise factual statement about the historical significance of Paris. Paris has been a major center of culture and learning for over 2,000 years and is known for its significant contrib
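
The asynchronous API is mainly useful when requests arrive independently, for example from several coroutines in a custom server. The sketch below shows one way to issue one request per prompt with a cap on how many are in flight. It assumes that `async_generate` called with a single prompt string returns a single dict with a `"text"` key; `generate_concurrently` and `FakeAsyncEngine` are hypothetical names used only for illustration.

```python
import asyncio
from typing import Any, Dict, List, Sequence


async def generate_concurrently(
    engine: Any,
    prompts: Sequence[str],
    sampling_params: Dict[str, Any],
    max_in_flight: int = 8,
) -> List[str]:
    """Send one request per prompt, with at most `max_in_flight` pending at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt: str) -> str:
        async with semaphore:
            output = await engine.async_generate(prompt, sampling_params)
            return output["text"]

    # gather returns results in the same order as the prompts.
    return await asyncio.gather(*(one(p) for p in prompts))


class FakeAsyncEngine:
    """Hypothetical stand-in for `sgl.Engine` so the sketch runs without a GPU."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real request would
        return {"text": prompt.upper()}


texts = asyncio.run(
    generate_concurrently(FakeAsyncEngine(), ["a", "b", "c"], {}, max_in_flight=2)
)
print(texts)
```

For a fixed list of prompts known up front, passing the whole list to a single `async_generate` call, as in the cell above, is simpler.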

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Zara and I'm a 21-year-old communications major at the state university. I'm currently working on a project to create a social media campaign for a local non-profit organization. In my free time, I love to read and hike in the nearby mountains. I'm looking forward to learning more about this community and getting to know everyone. You could also try adding a bit of personality to your introduction, such as: "I'm a bit of a bookworm, but you can also find me trying out new hiking trails on the weekends" or "I'm a huge music fan, and I'm always on the lookout for new

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is known for its beautiful landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. These famous sites attract millions of tourists every year. France is a country with a rich history, culture, and language, known as the French language. France has a diverse population with people from various ethnic backgrounds living in the country. France is a popular destination for business and tourism, with major industries such as fashion, food, and wine. Paris is a major hub for international trade and commerce. France is a member of the European Union and plays a significant role in international relations. The country has a well

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  not just about the technology itself but also about how it will impact society, economies, and individual lives. Here are some possible future trends in AI:
1. **AI-Powered Virtual Assistants**: Virtual assistants like Alexa, Google Assistant, and Siri will become more sophisticated, integrating multiple AI technologies like natural language processing, computer vision, and speech recognition to provide more personalized and intuitive services.
2. **Increased Use of Edge AI**: With the proliferation of IoT devices, edge AI will become more prominent, allowing for real-time processing and decision-making at the edge of the network, reducing latency and improving response times.
3. **Explain



In [6]:
llm.shutdown()
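
In a script, an exception raised between engine creation and `llm.shutdown()` would skip the shutdown call and may leave subprocesses running. One way to guard against that is a small context manager, sketched below. `managed_engine` and `FakeEngine` are hypothetical names; the sketch only assumes the engine exposes `shutdown()`, as used above. If the `Engine` class in your installed version already supports `with` directly, prefer that.

```python
from contextlib import contextmanager


@contextmanager
def managed_engine(factory):
    """Create an engine via `factory` and always call `shutdown()` on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


class FakeEngine:
    """Hypothetical stand-in for `sgl.Engine` so the sketch runs without a GPU."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


with managed_engine(FakeEngine) as engine:
    pass  # call engine.generate(...) here
print(engine.closed)
```

With the real engine, the factory could be `lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")`, called from inside the `if __name__ == "__main__":` guard.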