# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Special Warning

**To launch the offline engine from a Python script, the engine must be created under an `if __name__ == "__main__":` guard, because SGLang uses `spawn` mode to create subprocesses.** See this simple example:

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py
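
The requirement above can be sketched as follows. This is a minimal sketch, not a copy of the linked example: the `main` function name, the prompt, and the `find_spec` check (added here only so the script exits cleanly where SGLang is not installed) are assumptions, while `sgl.Engine`, `generate`, and `shutdown` are the calls used later in this document.

```python
import importlib.util


def main() -> None:
    # Assumption: skip quietly when SGLang is not installed, so this sketch
    # can be run anywhere without failing on import.
    if importlib.util.find_spec("sglang") is None:
        print("sglang is not installed; skipping engine launch")
        return

    import sglang as sgl

    # The engine starts subprocesses in `spawn` mode. Each subprocess
    # re-imports this file, so engine creation must not run at import time.
    llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
    try:
        outputs = llm.generate(
            ["The capital of France is"],
            {"temperature": 0.8, "top_p": 0.95},
        )
        print(outputs[0]["text"])
    finally:
        # Always release the engine's subprocesses, even on errors.
        llm.shutdown()


if __name__ == "__main__":
    main()
```

Keeping all engine work inside `main()` means that importing the file (which is what a spawned subprocess does) only defines functions and never starts a second engine.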

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/hidden_states.py).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl

from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Alex and I am a Cybersecurity Engineer. I'm reaching out because I would like to create a fictional story and I'm looking for some inspiration and advice from experienced writers and cybersecurity professionals.
My story is set in a futuristic world where cybersecurity has become an essential part of everyday life. In this world, people have access to advanced technology and artificial intelligence, but with it comes the threat of cyber attacks and data breaches. My main character, a young and skilled cybersecurity engineer named Maya, is tasked with protecting a powerful artificial intelligence system that has the potential to revolutionize society.

I would like to create a story that is not only exciting and
Prompt: The president of the United States is
Generated text:  a constitutional office in the executive branch of the United States government. The president is both the head of state and the head of government. The office of the presid
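
For offline batch jobs, the prompt/output pairs from `generate` are often written to a file rather than printed. The following is a minimal sketch under stated assumptions: `run_batch`, `to_jsonl`, and `EchoEngine` are hypothetical names introduced for illustration, and `EchoEngine` only imitates the `generate(prompts, sampling_params)` call and the `output["text"]` field shown above so the helpers can run without a model.

```python
import json
from typing import Any


def run_batch(engine: Any, prompts: list[str], sampling_params: dict) -> list[dict]:
    """Generate once for the whole batch and pair each prompt with its text."""
    outputs = engine.generate(prompts, sampling_params)
    return [
        {"prompt": prompt, "text": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines, one object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


class EchoEngine:
    """Hypothetical stand-in for `sgl.Engine`, used only to exercise the helpers."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for: {p}>"} for p in prompts]


records = run_batch(EchoEngine(), ["Hello, my name is"], {"temperature": 0.8})
print(to_jsonl(records))
```

With a real engine, `run_batch(llm, prompts, sampling_params)` would be called in place of the stand-in; the whole prompt list is passed in one call so the engine can schedule the batch itself.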

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and artist living in Tokyo. I enjoy exploring the city's hidden corners and trying new foods. I'm a bit of a introvert, but I'm always up for a good conversation. I'm currently working on a few projects, including a novel and a graphic novel. I'm excited to see where my creative pursuits take me.
This self-introduction is neutral because it doesn't reveal too much about Kaida's personality, background, or motivations. It simply provides a brief overview of who she is and what she does. This can be helpful for a character who is still

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
The Eiffel Tower is a famous landmark in Paris. It was built for the 1889 World’s Fair and was intended to be a temporary structure. However, it has become an iconic symbol of Paris and one of the most recognizable landmarks in the world.
The Louvre Museum is another famous landmark in Paris. It is one of the world’s largest and most visited museums, housing a vast collection of art and artifacts from around the world, including the Mona Lisa.
The Champs-Élysées is a famous avenue in Paris known for its upscale shopping and dining. It is lined with cafes, restaurants, and high

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Some experts predict that AI will become increasingly integrated into our daily lives, while others warn of the potential risks and challenges associated with its development.
One possible future trend in AI is the rise of autonomous systems. Autonomous systems are capable of operating independently, making decisions, and adapting to new situations without human intervention. This could lead to significant advancements in areas such as transportation, healthcare, and manufacturing.
Another possible trend is the increasing use of AI in decision-making processes. AI systems can analyze vast amounts of data, identify patterns, and make predictions, which could lead to more informed decision-making in fields such as
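
The "overlap removal" mentioned in the cell above can be illustrated in isolation. This is a minimal sketch of one way to merge streamed chunks whose text may overlap; `trim_overlap` and `merge_chunks` are hypothetical helpers written for this illustration, and the actual `stream_and_merge` in `sglang.utils` may be implemented differently.

```python
from typing import Iterable


def trim_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of `new_chunk` that `existing` already ends with."""
    max_len = min(len(existing), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, skipping text that repeats the end of what was merged."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


print(merge_chunks(["The capital", " capital of", " of France"]))
```

One caveat of this approach is that it cannot tell a true overlap from text the model legitimately repeats: merging `"ha"` and `"ha"` yields `"ha"`, not `"haha"`.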



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Alice Everwood. I'm a 25-year-old freelance artist living in Portland, Oregon. I work from a small studio space in my shared house, surrounded by half-finished projects and cluttered canvases. Outside of work, I enjoy spending time with my cat, taking long walks along the Willamette River, and sipping coffee at local cafes. I'm a bit of a introvert, but I appreciate meeting new people and hearing their stories. I'm always looking for new inspiration and ideas, and I'm excited to see where my art takes me next.
This self-introduction includes details about Alice's name, age

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is a city located in northern-central France. It has a population of around 2.1 million people. The city is the largest in France and serves as the country’s political, economic, 
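
Because `async_generate` is awaitable, several independent requests can be awaited together, which is the pattern a custom server would likely use. The sketch below assumes only the `await engine.async_generate(prompts, sampling_params)` call shown above; `generate_groups` and `FakeAsyncEngine` are hypothetical names, and the fake engine exists solely so the example runs without a model.

```python
import asyncio
from typing import Any


async def generate_groups(
    engine: Any, groups: list[list[str]], sampling_params: dict
) -> list[list[str]]:
    """Submit each prompt group as its own request and await them together."""
    tasks = [engine.async_generate(group, sampling_params) for group in groups]
    # asyncio.gather returns results in the same order as the submitted tasks.
    results = await asyncio.gather(*tasks)
    return [[output["text"] for output in outputs] for outputs in results]


class FakeAsyncEngine:
    """Hypothetical stand-in that imitates the shape of `async_generate` results."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real request would
        return [{"text": f" <completion for: {p}>"} for p in prompts]


groups = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
texts = asyncio.run(
    generate_groups(FakeAsyncEngine(), groups, {"temperature": 0.8, "top_p": 0.95})
)
print(texts)
```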

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream with the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Oliver Welles. I'm a 25-year-old journalist with a degree in English literature and a keen interest in history.
I've worked for a local newspaper in the city where I grew up and have recently moved to the capital to pursue bigger stories and better opportunities. I'm a bit of a research enthusiast and love getting lost in old archives and dusty libraries.
I'm here to explore the city, meet new people, and uncover some of its hidden secrets. Nice to meet you! Oliver Welles. Nice to meet you too! Hello, I'm Sophia Patel, a 28-year-old historian with a specialization in 18th

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
This statement is factual and provides a clear and concise answer to the question about France's capital city. The answer is brief, accurate, and meets the requirements of the task. Next, I will look at the second requirement of the task. Next I will write a short paragraph about Paris. Paris is the capital and most populous city of France, with an area of 105.4 km2 and a population of approximately 2.1 million people in 2019. It is a global center for art, fashion, cuisine, and culture, and is one of the most popular tourist destinations in the world. Paris is known

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  constantly evolving, and it's difficult to predict exactly what will happen, but here are some possible future trends in artificial intelligence:
1. **Increased Autonomy**: AI systems will become more autonomous, making decisions without human intervention. This could lead to more efficient and effective decision-making, but also raises concerns about accountability and control.
2. **Edge AI**: As devices become more connected, AI will be deployed at the edge, closer to the data source, to process and analyze data in real-time. This will enable faster and more efficient processing of data.
3. **Explainable AI (XAI)**: As AI becomes more widespread

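Since the streaming helper yields text incrementally, a caller that also needs the complete string has to accumulate the chunks itself. The following is a minimal sketch: `collect_stream` and `fake_stream` are hypothetical names, and `fake_stream` stands in for an async iterator such as the one returned by `async_stream_and_merge(llm, prompt, sampling_params)`.

```python
import asyncio
from typing import AsyncIterator


async def collect_stream(chunks: AsyncIterator[str]) -> tuple[str, int]:
    """Consume an async stream of text chunks; return the full text and chunk count."""
    parts = []
    async for chunk in chunks:
        parts.append(chunk)
    return "".join(parts), len(parts)


async def fake_stream() -> AsyncIterator[str]:
    """Hypothetical stand-in for a streamed generation."""
    for chunk in [" Paris", ".", "\n", "This", " statement"]:
        await asyncio.sleep(0)  # yield control between chunks
        yield chunk


text, n_chunks = asyncio.run(collect_stream(fake_stream()))
print(repr(text), n_chunks)
```

Collecting into a list and joining once at the end avoids repeated string concatenation for long generations.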

In [6]:
llm.shutdown()
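
In a script, an exception raised before `llm.shutdown()` would skip the call. One way to guard against that is a small context manager, sketched below. `engine_session` and `FakeEngine` are hypothetical names introduced for this illustration; the only engine method assumed is the `shutdown()` call shown above.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def engine_session(factory: Callable[[], Any]) -> Iterator[Any]:
    """Create an engine via `factory` and always call `shutdown()` on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


class FakeEngine:
    """Hypothetical stand-in that fails during generation."""

    def __init__(self) -> None:
        self.closed = False

    def generate(self, prompts, sampling_params):
        raise RuntimeError("simulated failure")

    def shutdown(self) -> None:
        self.closed = True


fake = FakeEngine()
try:
    with engine_session(lambda: fake) as llm:
        llm.generate(["Hello, my name is"], {})
except RuntimeError:
    pass

print(fake.closed)
```

With the real engine, the factory could be `lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")`, placed inside the `__main__` guard.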