# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## SPECIAL WARNING!!!!

**To launch the offline engine in your Python scripts, the** `if __name__ == "__main__":` **guard is necessary, because we use** `spawn` **mode to create subprocesses. Please refer to this simple example:**

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py
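
As a rough illustration of the guard described above, the following sketch creates the engine inside a `main()` function that only runs when the file is executed directly. With `spawn`, each subprocess starts a fresh interpreter that re-imports the main module, so any code left outside the guard would run again in every subprocess. The prompt, the `sglang_available()` helper, and the skip message are assumptions made for this example; `sgl.Engine`, `generate`, and `shutdown` are used the same way as in the cells of this document.

```python
import importlib.util

# Illustrative inputs; the values mirror those used later in this document.
PROMPTS = ["The capital of France is"]
SAMPLING_PARAMS = {"temperature": 0.8, "top_p": 0.95}


def sglang_available():
    """Hypothetical helper: report whether the sglang package can be imported."""
    return importlib.util.find_spec("sglang") is not None


def main():
    import sglang as sgl

    llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
    try:
        outputs = llm.generate(PROMPTS, SAMPLING_PARAMS)
        for prompt, output in zip(PROMPTS, outputs):
            print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
    finally:
        # Release the engine even if generation raises.
        llm.shutdown()


# Subprocesses re-import this file under a different module name,
# so the block below runs only in the process you started yourself.
if __name__ == "__main__":
    if sglang_available():
        main()
    else:
        print("sglang is not installed; skipping engine launch")
```
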

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

INFO 03-20 18:34:54 __init__.py:190] Automatically detected platform cuda.


The following error message 'operation scheduled before its operands' can be ignored.


INFO 03-20 18:35:10 __init__.py:190] Automatically detected platform cuda.
INFO 03-20 18:35:10 __init__.py:190] Automatically detected platform cuda.


Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.25it/s]

### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Reginald. I am a computer science major at a university, and I have a few problems with my computer. My computer is an HP Pavilion dv6 laptop. It is about 4 years old, and it has a 2.3 GHz processor, 4 GB of RAM, and a 500 GB hard drive. I am running Windows 8.1 on it.

Recently, I have been noticing that my computer is shutting down unexpectedly, and sometimes it freezes or becomes unresponsive. I have also noticed that my computer takes a long time to start up, especially after it has been turned off for a while. I
Prompt: The president of the United States is
Generated text:  the head of state and the head of government of the United States. The president serves a four-year term and is limited to two terms. The president is elected by the people through the Electoral College system.
The president is responsible for enforcing the laws of the United States, commanding the armed forces, and conducting the nation's foreign policy. The president
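
The `temperature` and `top_p` entries in `sampling_params` control how each next token is sampled. The sketch below shows the textbook definitions of temperature scaling and nucleus (top-p) filtering on a toy distribution. It is not SGLang's sampling code, and the function names and toy numbers are made up for this illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; assumes temperature > 0.

    Lower temperatures sharpen the distribution, higher ones flatten it.
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose total mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    # Renormalize so the surviving probabilities sum to 1.
    return {i: probs[i] / total for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]
for temperature in (0.2, 0.8):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

print(top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9))
```

In this toy setup, the lower temperature concentrates probability on the highest-scoring token, and `top_p=0.9` discards the least likely token before renormalizing.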

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor living in a small town in the Pacific Northwest. I enjoy hiking, reading, and trying out new coffee shops. I'm a bit of a introvert and prefer quieter environments, but I'm always up for a good conversation. I'm currently working on a novel and trying to build my writing portfolio. That's me in a nutshell. What do you think? Is it a good self-introduction?
The introduction is neutral and provides some basic information about the character, Kaida. It doesn't reveal too much about her personality, interests, or background, which is

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country. It is situated on the Seine River. The city is known for its art museums, fashion, and cuisine. Paris is a popular tourist destination. It is home to the Eiffel Tower, the Louvre Museum, and Notre Dame Cathedral. The city has a population of over 2.1 million people. The official language of Paris is French. The city has a rich history and culture, and it is considered one of the most beautiful cities in the world. Paris is a major economic and cultural center in Europe. It is a hub for international business, finance

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of Explainable AI (XAI): As AI becomes more pervasive in decision-making, there will be a growing need for transparency and accountability. XAI will become increasingly important to ensure that AI systems are explainable and trustworthy.
2. Rise of Edge AI: With the proliferation of IoT devices, there will be a growing need for AI to be deployed at the edge, closer to the data source. Edge AI will enable faster processing, reduced latency, and improved security
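
The cell above relies on `stream_and_merge` to print each completion without repeated text. As a minimal sketch of what overlap removal can look like, assuming a stream whose chunks may repeat the tail of the text already received, the hypothetical helpers `strip_overlap` and `merge_stream` below drop the repeated prefix of each new chunk. This illustrates the idea only and is not taken from the SGLang source.

```python
def strip_overlap(existing, chunk):
    """Return `chunk` without the longest prefix that already ends `existing`."""
    longest = min(len(existing), len(chunk))
    for size in range(longest, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_stream(chunks):
    """Concatenate chunks, skipping text that overlaps with what came before."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Chunks that partly repeat earlier text, as an overlapping stream might.
overlapping = [" Paris", " Paris is", " is the capital"]
print(merge_stream(overlapping))

# Chunks with no overlap pass through unchanged.
print(merge_stream([" Em", "ilia", "."]))
```

One limitation of this simple approach is that genuine repetition looks the same as overlap: `merge_stream(["ha", "ha"])` returns `"ha"`, so a production implementation would need more context than the text alone.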



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Luna Nightshade. I'm a 19-year-old college student, and I'm currently studying psychology. I have long, dark hair and piercing green eyes. I'm quiet and observant by nature, preferring to listen rather than speak. I'm working on building my confidence and speaking up more in class discussions. That's me in a nutshell. I'm looking forward to getting to know you.
As the narrative progresses, I can develop Luna's personality and relationships further. For now, this is a starting point that can help me create a character that readers can relate to and invest in.
It's a good idea to introduce Luna's

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the northern part of the country. It has a population of approximately 2.1 million people within its city limits. Paris is known for its rich cultural h
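
Because the asynchronous call is awaited, it can share an event loop with other coroutines. The sketch below shows that general `asyncio` pattern with a `StubEngine` class, a hypothetical stand-in that only mimics the call shape used in the cell above and returns placeholder text. It does not load a model, and it makes no claim about how the real engine schedules concurrent requests.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for an engine; NOT the SGLang implementation."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend the model is busy
        return [{"text": f" <completion for {prompt!r}>"} for prompt in prompts]


async def run_batches(engine, batches, sampling_params):
    # Start every batch at once and wait for all of them.
    # asyncio.gather returns results in the same order as its inputs.
    pending = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*pending)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

results = asyncio.run(run_batches(StubEngine(), batches, sampling_params))
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```
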

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Emilia. I'm a 25-year-old barista working at a small coffee shop in a trendy neighborhood. When I'm not making drinks, I enjoy reading and taking long walks around the city. I'm a bit of a homebody, but I do appreciate a good night out with friends. I'm a loyal friend and family member, and I value honesty and authenticity in my relationships. I'm also a bit of a perfectionist, which can sometimes make me a bit...prickly. That's me in a nutshell. What do you think? Does this introduction sound like a good starting point for a character? It

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. What is the city known for? Paris is the most visited city in the world with a rich history dating back to the Roman era. It is also known for its fashion, art, and culture. What are some of the famous landmarks in Paris? The famous landmarks in Paris include the Eiffel Tower, Notre Dame Cathedral, Arc de Triomphe, and the Louvre Museum. These landmarks are a testament to the city’s rich history and architectural style. What is the city’s atmosphere like? The city of Paris has a lively and romantic atmosphere, with beautiful streets, charming cafes, and a vibrant nightlife. It is

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and has the potential to revolutionize various aspects of our lives. Here are some possible future trends in AI:
1. Hybrid Intelligence: The integration of human and artificial intelligence to create a more powerful and efficient intelligence.
2. Explainable AI: The ability to understand and interpret the decision-making process of AI systems, making them more transparent and trustworthy.
3. Edge AI: AI processing at the edge of the network, closer to the data source, reducing latency and improving real-time processing.
4. Autonomous Systems: AI-driven systems that can operate independently, making decisions and taking actions without human intervention.
5. Human-AI Collaboration:

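The streaming cell above consumes chunks with `async for` and prints them with `end=""` so they form one continuous line. The sketch below reproduces that consumption pattern with a hypothetical `fake_token_stream` async generator in place of a model, so it can be run without an engine. The chunking by spaces is an assumption made for illustration and does not reflect real tokenization.

```python
import asyncio


async def fake_token_stream(text, delay=0.0):
    """Hypothetical chunk source: yields one space-prefixed word at a time."""
    for piece in text.split(" "):
        await asyncio.sleep(delay)  # stand-in for waiting on the next chunk
        yield " " + piece


async def collect(stream):
    """Print chunks as they arrive and return the concatenated text."""
    parts = []
    async for chunk in stream:
        # end="" keeps the chunks on one line; flush=True shows them immediately.
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # newline once the stream is exhausted
    return "".join(parts)


full_text = asyncio.run(collect(fake_token_stream("Paris is the capital of France.")))
```

Collecting the chunks in a list while printing them keeps the full completion available after the stream ends.
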
In [6]:
llm.shutdown()