# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
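
As a rough illustration of the custom-server idea, the core of such a server is a function that turns a request body into an engine call and a response body. The sketch below is not the linked `custom_server.py`; `FakeEngine`, `handle_generate`, and the `prompt` / `sampling_params` JSON fields are hypothetical names chosen for illustration, and the stub stands in for `sgl.Engine` so that the code runs without a GPU.

```python
import json
from typing import Tuple


class FakeEngine:
    """Hypothetical stand-in for the engine; returns canned text so the sketch runs anywhere."""

    def generate(self, prompt, sampling_params=None):
        return {"text": f" <completion for: {prompt}>"}


def handle_generate(engine, body: bytes) -> Tuple[int, bytes]:
    """Turn a JSON request body into a (status code, JSON response body) pair."""
    try:
        payload = json.loads(body)
        prompt = payload["prompt"]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, a missing "prompt" field, or a non-object payload.
        error = {"error": "expected a JSON object with a 'prompt' field"}
        return 400, json.dumps(error).encode()
    output = engine.generate(prompt, payload.get("sampling_params", {}))
    return 200, json.dumps({"text": output["text"]}).encode()


status, response = handle_generate(FakeEngine(), b'{"prompt": "The capital of France is"}')
print(status, response.decode())
```

Keeping the handler independent of any web framework means the same function could be mounted behind whichever HTTP library the server uses.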



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    # Imported only when running in CI.
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Priti Nath.
I am a neurologist with a special interest in the neurology of sleep and wakefulness.
I have worked as a neurologist in the UK, Australia and India, and have a wealth of experience in treating a wide range of neurological conditions, including those related to sleep.
My current role is as a Consultant Neurologist at the Royal London Hospital.
I am also a Clinical Associate Professor in the Department of Neurology at Queen Mary University of London.
I have worked in a variety of academic and clinical roles, including as a senior lecturer in the University of Melbourne, and as a specialist registrar in the
Prompt: The president of the United States is
Generated text:  not accountable to anyone
I'd like to know if this statement is true or false. I'd appreciate a definition of accountability and an explanation of the role of the president's accountability in the US system of government.
Accountability in the context of the US system o
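
The cells in this document pass `temperature` and `top_p` in `sampling_params`. As a rough illustration of what these two settings conventionally mean, the sketch below applies temperature scaling and nucleus (top-p) filtering to a toy distribution. It follows the textbook formulation and is not SGLang's sampler; `top_p_distribution` and the toy logits are hypothetical and exist only for this example.

```python
import math
from typing import Dict


def top_p_distribution(
    logits: Dict[str, float], temperature: float, top_p: float
) -> Dict[str, float]:
    """Scale logits by temperature, then keep the smallest token set whose mass reaches top_p."""
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # Walk tokens from most to least likely until the cumulative mass reaches top_p.
    kept = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize the surviving tokens so they sum to 1.
    kept_total = sum(kept.values())
    return {tok: p / kept_total for tok, p in kept.items()}


# Toy logits for the next token (made-up values).
logits = {" Paris": 4.0, " Lyon": 2.0, " a": 1.0, " the": 0.5}
warm = top_p_distribution(logits, temperature=0.8, top_p=0.95)
cold = top_p_distribution(logits, temperature=0.2, top_p=0.9)
print(warm)
print(cold)
```

With these toy logits, the warmer setting leaves two candidate tokens to sample from, while the colder setting leaves only the most likely one. This is why lower values tend to produce more deterministic text.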

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and artist living in a small town in the Pacific Northwest. I enjoy hiking, reading, and trying out new recipes in my spare time. I'm a bit of a introvert and prefer to spend time alone, but I'm always up for a good conversation when I'm feeling social. I'm currently working on a novel and a few art projects, and I'm excited to see where my creative endeavors take me. I'm looking forward to meeting new people and making connections in my community.
This self-introduction is neutral because it doesn't reveal too much about Kaida's

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre Dame Cathedral. The city is also known for its romantic atmosphere and is a popular tourist destination. The population of Paris is approximately 2.1 million people, but the metropolitan area has a population of over 12 million people. Paris is a global center for business, finance, fashion, and culture, and is

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Some experts predict that AI will become increasingly integrated into our daily lives, while others warn of the potential risks and challenges associated with its development. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, with the potential to improve patient outcomes and reduce healthcare costs.
2. Widespread adoption of AI in industries: AI is already being used in various industries, including finance,
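
The "overlap removal" named in the banner above refers to how streamed chunks are merged: a chunk may repeat text that has already been received, so only the unseen suffix is appended. The following is a minimal, self-contained sketch of that idea and not the code in `sglang.utils`. `trim_overlap`, `merge_stream`, and the hard-coded chunk list are hypothetical names and data chosen for illustration.

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the end of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first so the largest repeat is removed.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks: Iterable[str]) -> str:
    """Concatenate streamed chunks, dropping any text that was already received."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks: the second and third each repeat the tail of the text so far.
chunks = [" Paris is", " is the capital", " capital of France."]
merged_text = merge_stream(chunks)
print(merged_text)
```

A suffix-prefix heuristic like this one would also drop a genuinely repeated fragment (for example `"ha"` followed by `"ha"`), so it suits streams whose chunks overlap rather than streams that already deliver disjoint pieces.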



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Artemis Blackwood. I'm a 25-year-old botanist currently based in the Pacific Northwest.
Include a following sentence about the character's goals or interests. I'm fascinated by the unique plant species found in the region's temperate rainforests and hope to one day write a comprehensive guide to the area's flora.

## Step 1: Introduce the character's name and age.
Hello, my name is Artemis Blackwood. I'm a 25-year-old botanist.

## Step 2: Provide the character's current location.
Currently based in the Pacific Northwest.

## Step 3: Mention the

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is the largest city in France and is located in the northern part of the country in the Île-de-France region. It is situated on the Seine River and is home to many famous landmarks such as the Eiffel Tower, 
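
One reason to prefer the asynchronous interface is that several independent requests can be awaited together. The sketch below shows that pattern with `asyncio.gather`. `FakeEngine` is a hypothetical stub whose `async_generate` only mimics the call shape used above, so the example runs without a GPU. How a real engine schedules concurrent calls is not shown here.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for the engine so the sketch runs without a GPU."""

    async def async_generate(self, prompt: str, sampling_params: dict) -> dict:
        await asyncio.sleep(0.1)  # simulate generation latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(engine, prompts, sampling_params):
    # Create one coroutine per prompt and wait for all of them;
    # gather returns results in the same order as the inputs.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


prompts = ["Hello, my name is", "The capital of France is"]
outputs = asyncio.run(
    generate_concurrently(FakeEngine(), prompts, {"temperature": 0.8, "top_p": 0.95})
)

for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Inside a Jupyter notebook an event loop is typically already running, in which case `asyncio.run` raises `RuntimeError` unless something like `nest_asyncio` has been applied. There, `await generate_concurrently(...)` can be used directly in a cell.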

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # async_stream_and_merge wraps the streaming call and yields each chunk with overlapping text removed.
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Elyse Fanning, and I’m a 25-year-old freelance writer from Portland, Oregon. I enjoy hiking in the Pacific Northwest and experimenting with new recipes in my free time. My work has been featured in several local publications. That’s a little about me. You can call me Elyse.
Writing a self-introduction is about showing your personality, interests, and skills in a concise way. You can tailor your self-introduction to a specific situation, like meeting new colleagues or attending a networking event.
Here are some key elements to include in a self-introduction:
1. Your name: Start with your full name

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is often called the City of Light due to its historical association with the Enlightenment. It is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city has a population of over 2.1 million people, with a metropolitan area that covers over 12.2 million people. Paris is a major cultural, economic, and intellectual center in Europe, and is considered one of the world's greatest cities. The city has a long history dating back to the 3rd century BC, and has been the capital of France since 987. ## Step

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  rapidly evolving, with continuous advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased Adoption in Everyday Life: AI will become increasingly integrated into our daily lives, from personal assistants like Siri and Alexa to more sophisticated applications in healthcare, finance, and education.
2. Advancements in Edge AI: As IoT devices become more prevalent, edge AI will enable real-time processing and decision-making at the edge of the network, reducing latency and increasing efficiency.
3. Rise of Explainable AI: As AI becomes more ubiquitous, there will be a growing need to understand how AI-driven decisions


In [6]:
llm.shutdown()
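
Calling `llm.shutdown()` as the last cell works when every earlier cell succeeds. In a standalone script, one option is to guarantee the call with `try`/`finally`, for example through a small context manager. The sketch below assumes only that the engine object exposes `shutdown()`, as used above. `FakeEngine` and `running_engine` are hypothetical names introduced for illustration.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in exposing only the shutdown() method used above."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def running_engine(engine):
    """Yield the engine and always call shutdown(), even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with running_engine(engine) as active_engine:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(engine.is_shut_down)
```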