# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where an HTTP server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
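
As a rough illustration of the custom-server idea, the sketch below keeps the request handling separate from any web framework. It is not taken from the linked script: `EchoEngine` and `handle_generate` are hypothetical names, and `EchoEngine` only imitates the output shape used later in this document (a list of dicts with a `"text"` key) so that the example runs without a GPU.

```python
import json


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a model."""

    def generate(self, prompts, sampling_params=None):
        # Imitates the output shape shown in this document:
        # one dict with a "text" key per prompt.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def handle_generate(engine, raw_body: bytes) -> bytes:
    """Decode a JSON request, run one batched generate call, encode the reply."""
    request = json.loads(raw_body)
    prompts = request["prompts"]
    sampling_params = request.get("sampling_params", {})
    outputs = engine.generate(prompts, sampling_params)
    response = {"texts": [out["text"] for out in outputs]}
    return json.dumps(response).encode("utf-8")


body = json.dumps({"prompts": ["The capital of France is"]}).encode("utf-8")
print(handle_generate(EchoEngine(), body).decode("utf-8"))
```

Any HTTP framework could call a function like `handle_generate` from its route handler, with a real engine instance passed in place of the stand-in.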



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:09<00:00,  2.29s/it]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Raksha. I am an Indian artist based in Mumbai. I'm a self-taught painter with a passion for the beauty of nature, architecture and the human form. I create vibrant and emotive paintings that explore the connections between people, places and the world around us. I'm fascinated by the play of light and colour and the way these elements can evoke emotions and moods.
I've always been drawn to art, and as a child, I spent hours sketching and painting. As I grew older, my interest in art only deepened, and I began to explore different mediums and techniques. I'm now a full-time
Prompt: The president of the United States is
Generated text:  a democratically elected official who serves as the head of state and government of the United States. The president is both the commander-in-chief of the armed forces and the head of the federal executive branch of the U.S. government.
The president is elected to a four-year term through the Electoral College sy
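
In an offline batch job the results are usually saved rather than printed. The sketch below pairs each prompt with its completion and serializes the pairs as JSON Lines. It assumes only the output shape used above (`output['text']`); `to_jsonl` is a hypothetical helper, and the `outputs` list is a hand-written stand-in for what `llm.generate` returns.

```python
import json


def to_jsonl(prompts, outputs):
    """Serialize prompt/completion pairs as one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("expected exactly one output per prompt")
    lines = [
        json.dumps({"prompt": p, "completion": o["text"]}, ensure_ascii=False)
        for p, o in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Stand-in data with the same shape as the generate() results above.
prompts = ["The capital of France is"]
outputs = [{"text": " Paris."}]
print(to_jsonl(prompts, outputs))
```

The length check guards against silently misaligned pairs, which `zip` alone would truncate without warning.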

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student. I'm a bit of a bookworm and enjoy reading about history and science. I'm also a member of the school's debate team and enjoy arguing about current events. I'm a bit of a perfectionist, which can sometimes make me come across as stubborn or overly critical. I'm working on being more open-minded and flexible, but it's a work in progress. I'm looking forward to getting to know you better. How would you describe Kaida? What are her strengths and weaknesses? What are some potential conflicts or challenges she might face? Use evidence from

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country, in the Île-de-France region. It is situated on the Seine River. Paris is known for its rich history, art, fashion, and culture. The city is home to many famous landmarks, including the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. Paris is a major economic and cultural center, and it is one of the most visited cities in the world. The city has a population of over 2.1 million people, and the metropolitan area has a population of over 12.2 million people. Paris is a

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems may be able to analyze medical images, identify patterns in patient data, and provide personalized treatment recommendations.
2. Rise of explainable AI: As AI becomes more pervasive, there is a growing need for AI systems to be transparent and explainable. This trend is expected to continue, with AI systems being designed to provide clear explanations
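
The "overlap removal" mentioned in the cell above can be pictured as follows. This is a minimal sketch of one way to merge streamed chunks whose text may repeat the tail of what has already been received. It is not necessarily how `stream_and_merge` is implemented; `strip_overlap` and `merge_stream` are hypothetical names, and the chunk shape (dicts with a `"text"` key) is an assumption.

```python
def strip_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing."""
    max_len = min(len(existing), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks) -> str:
    """Concatenate chunk texts, skipping any part that repeats the merged tail."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk["text"])
    return merged


# Assumed chunk shape: dicts with a "text" key, possibly overlapping.
chunks = [{"text": " Paris"}, {"text": " Paris is"}, {"text": " is the capital"}]
print(repr(merge_stream(chunks)))
```

One caveat of this approach: a chunk that legitimately starts with text identical to the current tail (for example `"ha"` followed by `"ha"`) is treated as a repeat and dropped, so it fits cumulative or overlapping streams better than strictly incremental ones.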



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text: ...
Comments for Hello, my name is...
I like it. It's a simple and concise introduction that doesn't reveal too much about the character's background or personality. It leaves room for more information to be revealed as the story unfolds. Well done!
Here's a possible continuation:
"I'm a 25-year-old writer and artist who's always been fascinated by the world of science fiction and fantasy. I spend most of my free time reading, drawing, and experimenting with new ideas. But I'm also a bit of a introvert and often find myself lost in thought, wondering about the mysteries of the universe and my place in it

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is situated in the north-central part of the country, in the Île-de-France region. Paris is one of the world's most famous and romantic cities, known f
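
Because `async_generate` is awaited, several requests can in principle be issued from the same event loop. The sketch below runs one request per temperature with `asyncio.gather`. `FakeAsyncEngine` is a hypothetical stand-in that only imitates the call shape used above, and whether a real engine handles concurrent calls efficiently is an assumption to verify for your setup.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in imitating the async_generate call shape used above."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0.01)  # pretend to do some work
        temperature = (sampling_params or {}).get("temperature")
        return [{"text": f" <{p} @ T={temperature}>"} for p in prompts]


async def sweep_temperatures(engine, prompts, temperatures):
    """Issue one batched request per temperature and await them together."""
    tasks = [
        engine.async_generate(prompts, {"temperature": t, "top_p": 0.95})
        for t in temperatures
    ]
    results = await asyncio.gather(*tasks)
    # gather preserves argument order, so results line up with temperatures.
    return dict(zip(temperatures, results))


result = asyncio.run(sweep_temperatures(FakeAsyncEngine(), ["Hello"], [0.2, 0.8]))
for temperature, outputs in result.items():
    print(temperature, outputs[0]["text"])
```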

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Orion Blackwood and I'm a 28-year-old archaeologist. I've spent the past 10 years studying ancient civilizations, and I've worked on digs in Egypt, Greece, and Mesopotamia. I'm currently based in London, where I work for the British Museum. When I'm not analyzing artifacts or lecturing on history, I enjoy hiking and practicing yoga. I'm a bit of a history buff, and I'm always looking for new discoveries and insights into the past.
Here's a rewritten version of the introduction with a bit more flair: "I'm Orion Blackwood, a 28-year-old archaeologist


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is often called the City of Light. Paris is known for its many museums, art galleries, and historical landmarks such as the Eiffel Tower and Notre Dame Cathedral. The Seine River runs through the city. The population of Paris is approximately 2.1 million people. Paris has a diverse and vibrant cultural scene. The city hosts many international events, including fashion shows and concerts. Overall, Paris is a significant cultural and economic center in Europe. The city has a rich history and architecture, and it is one of the most popular tourist destinations in the world. The official language is French. The climate is temperate



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  vast and rapidly changing, with ongoing advancements in machine learning, natural language processing, and computer vision. Here are some potential future trends in AI:
1. Increased automation and job displacement: As AI becomes more sophisticated, there is a growing concern that it may displace human workers in various industries, leading to significant job displacement. However, this also creates opportunities for workers to upskill and reskill in emerging fields.
2. Greater focus on explainability and transparency: As AI becomes more prevalent in decision-making processes, there is a growing need for AI systems to be explainable and transparent, enabling humans to understand how AI
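
When consuming an asynchronous stream, it is often useful to display chunks as they arrive and also keep the full text. The sketch below shows that pattern with a plain `async for` loop. `fake_chunk_stream` is a hypothetical stand-in for a helper such as `async_stream_and_merge` (assumed here to yield text pieces), and `collect_stream` is an illustrative name.

```python
import asyncio


async def fake_chunk_stream(text, size=4):
    """Hypothetical stand-in for a streaming helper: yields text in small pieces."""
    for start in range(0, len(text), size):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield text[start : start + size]


async def collect_stream(chunk_stream, on_chunk=None):
    """Forward each piece to an optional callback and return the joined text."""
    pieces = []
    async for piece in chunk_stream:
        if on_chunk is not None:
            on_chunk(piece)
        pieces.append(piece)
    return "".join(pieces)


def show(piece):
    # Print pieces inline so the output reads as continuous text.
    print(piece, end="", flush=True)


full_text = asyncio.run(
    collect_stream(fake_chunk_stream(" Paris is the capital."), on_chunk=show)
)
print()
```

Printing with `end=""` keeps the streamed pieces on one line, which avoids the one-token-per-line look that some notebook renderers produce.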




In [6]:
llm.shutdown()
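
In a script, it can be safer to make sure `shutdown()` runs even when generation raises an exception. The sketch below wraps the engine in a small context manager. `engine_session` is a hypothetical helper, not part of SGLang, and `FakeEngine` is a stand-in so the example runs without a model.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in exposing only the shutdown() method used above."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


@contextmanager
def engine_session(make_engine):
    """Create an engine, hand it to the caller, and always shut it down."""
    engine = make_engine()
    try:
        yield engine
    finally:
        engine.shutdown()


with engine_session(FakeEngine) as engine:
    print("engine is open:", not engine.closed)
print("engine is closed:", engine.closed)
```

With a real engine, the factory could be something like `lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")`, reusing the constructor call from the first cell.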