# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
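As a rough illustration of what a custom server on top of the engine involves, the sketch below maps a JSON request body to a `generate` call and back to a JSON response. `EchoEngine` and `handle_generate` are hypothetical names introduced here so the snippet runs without a GPU; they are not part of SGLang. A real server would pass an `sgl.Engine` instance and call the handler from whichever web framework it uses. The linked `custom_server.py` remains the reference example.

```python
import json


class EchoEngine:
    """Hypothetical stand-in that mimics the generate() call shape used in this document."""

    def generate(self, prompts, sampling_params=None):
        # The engine in this document returns one dict per prompt with a "text" field.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def handle_generate(engine, body: bytes) -> bytes:
    """Turn a JSON request body into a JSON response body (illustrative only)."""
    request = json.loads(body)
    prompts = request["prompts"]
    sampling_params = request.get("sampling_params", {})
    outputs = engine.generate(prompts, sampling_params)
    return json.dumps({"texts": [o["text"] for o in outputs]}).encode("utf-8")


response = handle_generate(
    EchoEngine(),
    json.dumps({"prompts": ["Hello, my name is"]}).encode("utf-8"),
)
print(response.decode("utf-8"))
```

Keeping the handler a plain function of `(engine, body)` makes it easy to test without starting a server or loading a model.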



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Ayaan, and I am a senior from Pakistan. I am an International Student here at Caltech, and this is my second year at the institute. I am pursuing a degree in Computer Science. Before coming to Caltech, I was studying at NUST, Pakistan, where I completed my undergraduate degree in Computer Science and Engineering.

I am passionate about research and innovation, and I am excited to be a part of the Caltech community, which is known for its academic excellence and innovative culture. My research interests include Natural Language Processing, Artificial Intelligence, and Computer Vision.

During my free time, I enjoy playing football and badm
Prompt: The president of the United States is
Generated text:  an elected position, and the president of the United States has a lot of power. The president is both the head of the executive branch and the commander-in-chief of the armed forces. Some of the powers of the president include the ability to sign 
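For long prompt lists it can be convenient to submit work in chunks and keep each prompt paired with its output, for example to save intermediate results. The sketch below shows one way to do that bookkeeping. `EchoEngine`, `chunked`, and `run_in_chunks` are hypothetical helpers written for this illustration; only the `generate(prompts, sampling_params)` call shape and the `output["text"]` field come from the cell above. Chunking here is an organizational choice, not something the engine is known to require.

```python
from typing import Dict, Iterator, List


class EchoEngine:
    """Hypothetical stand-in so the sketch runs without a model."""

    def generate(self, prompts, sampling_params=None):
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_in_chunks(engine, prompts: List[str], sampling_params: Dict, chunk_size: int) -> List[Dict]:
    """Generate chunk by chunk and return prompt/text records in input order."""
    records = []
    for chunk in chunked(prompts, chunk_size):
        outputs = engine.generate(chunk, sampling_params)
        for prompt, output in zip(chunk, outputs):
            records.append({"prompt": prompt, "text": output["text"]})
    return records


records = run_in_chunks(
    EchoEngine(),
    ["Hello, my name is", "The capital of France is", "The future of AI is"],
    {"temperature": 0.8, "top_p": 0.95},
    chunk_size=2,
)
for record in records:
    print(record)
```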

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in Tokyo. I enjoy exploring the city's hidden corners and trying new foods. I'm currently working on a novel about a young woman who discovers a mysterious underground world in the city. When I'm not writing, you can find me practicing yoga or browsing through used bookstores. I'm a bit of a introvert, but I'm always up for a good conversation.
This self-introduction is neutral because it doesn't reveal too much about Kaida's personality, background, or motivations. It provides a brief overview of her current life and interests, without giving away too

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country, on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. The city has a population of over 2.1 million people and is a major hub for international business, finance, and tourism. Paris is also known for its romantic atmosphere and is often referred to as the "City of Love." The city has a diverse range of neighborhoods, each with its own unique

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. While it is difficult to predict exactly what the future holds, here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, with applications such as robotic surgery, AI-powered diagnostic tools, and personalized medicine.
2. Widespread adoption of AI in industries: AI is already being used in various industries such as finance, transportation, and customer service. In the future, AI is
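The cell above labels its output "with overlap removal". One plausible reading, which this document does not confirm, is that a streamed chunk may repeat text that has already been received, so the repeated prefix is dropped before merging. The sketch below illustrates that idea with two hypothetical helpers, `strip_overlap` and `merge_chunks`. It is not the implementation of `stream_and_merge`.

```python
from typing import Iterable


def strip_overlap(existing: str, new_chunk: str) -> str:
    """Return new_chunk without the longest prefix that already ends `existing`."""
    max_overlap = min(len(existing), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, skipping text repeated at each chunk boundary."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


print(merge_chunks([" Paris", " Paris is", " is located"]))
```

Checking the longest candidate overlap first means a chunk that fully repeats earlier text contributes nothing, while a chunk with no overlap is appended unchanged.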



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Maya Flynn. I'm a 25-year-old journalist who works as a freelance writer. I'm based in New York City. That's a bit about me.
Write a short, neutral self-introduction for a fictional character. Hello, my name is Axel Winters. I'm a 32-year-old scientist specializing in environmental conservation. I'm based in a small town in the Pacific Northwest. That's a bit about me.
Write a short, neutral self-introduction for a fictional character. Hello, my name is Ruby Singh. I'm a 28-year-old artist who works as a graphic designer by day. I'm based

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
This statement is one of the most basic and essential facts about France. It is the first thing that would be known about the country, and it is something that is taught in schools and is common knowledge. The capital of
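The asynchronous API also makes it possible to issue several independent requests and await them together. The sketch below uses `asyncio.gather` for that. `EchoAsyncEngine` is a hypothetical stand-in, and it assumes that calling `async_generate` with a single prompt returns a single dict with a `"text"` field, which this document does not show directly (the cell above passes a list).

```python
import asyncio


class EchoAsyncEngine:
    """Hypothetical stand-in with an awaitable, single-prompt async_generate()."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0)  # yield control, as a real request would while waiting
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(engine, prompts, sampling_params):
    """Start one request per prompt and wait for all of them."""
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    # gather() returns results in the same order as the awaitables it was given.
    return await asyncio.gather(*tasks)


results = asyncio.run(
    generate_concurrently(
        EchoAsyncEngine(),
        ["The capital of France is", "The future of AI is"],
        {"temperature": 0.8, "top_p": 0.95},
    )
)
for result in results:
    print(result["text"])
```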

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Ethan Wilder, and I'm a 25-year-old journalist who's been working at the local newspaper for three years. I'm a graduate of the University of Iowa, where I earned a degree in journalism. When I'm not working, I enjoy hiking, reading, and practicing yoga.
This self-introduction is neutral because it focuses on factual information about Ethan's job, education, and hobbies, without revealing his personality, values, or opinions. It provides a clear and concise overview of who he is and what he does, but doesn't give away anything about his motivations, conflicts, or character traits. This neutral tone is useful



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Provide an example of a French national holiday celebrated in Paris. One of the most popular national holidays in Paris is Bastille Day, celebrated on July 14th. It commemorates the storming of the Bastille prison, a symbol of the French Revolution.
Provide a description of a famous French landmark located in Paris. The Eiffel Tower is an iconic iron lattice tower located in the heart of Paris, standing at 324 meters tall. It was built for the 1889 World's Fair and is now one of the most recognizable landmarks in the world.
Provide a list of three French artists associated with the Impressionist



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a topic of much debate and speculation, but here are some possible trends that could shape the field in the coming years.
As artificial intelligence continues to advance, it's likely that we'll see significant developments in several areas, including:
1. Increased use of AI in daily life: As AI becomes more integrated into our daily lives, we can expect to see it used in more and more applications, from smart homes and cities to healthcare and education.
2. Advancements in natural language processing: AI systems will become increasingly proficient in understanding and generating human language, making it easier for people to interact with machines and each other.
3. Growth of
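When the full text is needed after streaming, for instance for logging, the chunks from an asynchronous iterator can be accumulated while they are consumed. The sketch below shows that pattern. `fake_stream` is a hypothetical async generator standing in for `async_stream_and_merge(llm, prompt, sampling_params)`, which the cell above iterates with `async for`. `collect_stream` is likewise an illustrative helper.

```python
import asyncio


async def fake_stream(text: str, chunk_size: int = 4):
    """Hypothetical stand-in that yields `text` in small pieces."""
    for start in range(0, len(text), chunk_size):
        await asyncio.sleep(0)  # simulate waiting for the next chunk
        yield text[start:start + chunk_size]


async def collect_stream(stream) -> str:
    """Consume an async iterator of text chunks and return the joined text."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)


full_text = asyncio.run(collect_stream(fake_stream(" Paris is the capital.")))
print(full_text)
```

Collecting into a list and joining once avoids repeated string concatenation for long generations.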




In [6]:
llm.shutdown()
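In a script, it can be easy to skip `shutdown()` when an exception is raised partway through. One way to guard against that is a small context manager that always calls `shutdown()` on exit. In the sketch below, `managed_engine` and `EchoEngine` are hypothetical names used for illustration. With SGLang, the factory argument would be something like `lambda: sgl.Engine(model_path=...)`.

```python
from contextlib import contextmanager


class EchoEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params=None):
        return [{"text": f" <completion for: {p}>"} for p in prompts]

    def shutdown(self):
        self.closed = True


@contextmanager
def managed_engine(factory):
    """Create an engine, hand it to the caller, and shut it down afterwards."""
    engine = factory()
    try:
        yield engine
    finally:
        # Runs on normal exit and when the with-body raises.
        engine.shutdown()


with managed_engine(EchoEngine) as engine:
    outputs = engine.generate(["Hello, my name is"])

print(outputs[0]["text"], engine.closed)
```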