# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, for use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl

from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Bradley Winlow and I am a Senior Lecturer in Statistics at the University of Kent. I am also a Chartered Statistician (CStat) and have a Ph.D. in Statistics from the University of Kent. My research interests are in the areas of Bayesian inference, time series analysis, and statistical computing. I am currently the Director of Postgraduate Studies for the School of Mathematics, Statistics and Actuarial Science at the University of Kent.
I am a Senior Lecturer in Statistics and a Chartered Statistician (CStat). My research interests are in the areas of Bayesian inference, time series analysis, and statistical computing
Prompt: The president of the United States is
Generated text:  the head of state and head of government of the United States, and is the highest-ranking official in the federal government. The president is elected through the Electoral College system, with the person receiving the majority of the electoral votes becoming the presi
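For offline batch jobs it is often convenient to persist the results rather than print them. The following is a minimal sketch, assuming only what the cell above shows: one output per prompt, each a dict with a `"text"` key. The `to_jsonl` helper and the sample data are hypothetical and not part of SGLang; the list literal stands in for the value returned by `llm.generate`.

```python
import json

# Stand-in for the list returned by llm.generate(prompts, sampling_params).
# Only the "text" key used in the cell above is assumed here.
prompts = ["The capital of France is", "The future of AI is"]
outputs = [{"text": " Paris."}, {"text": " uncertain."}]


def to_jsonl(prompts, outputs):
    """Serialize prompt/completion pairs, one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("expected one output per prompt")
    lines = [
        json.dumps({"prompt": p, "completion": o["text"]}, ensure_ascii=False)
        for p, o in zip(prompts, outputs)
    ]
    return "\n".join(lines)


print(to_jsonl(prompts, outputs))
```

The length check guards against `zip` silently dropping trailing prompts if the two lists ever differ in size.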

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student. I'm a bit of a bookworm and enjoy reading about history and science. I'm also a member of the school's debate team and enjoy arguing about current events. I'm a bit of a perfectionist, which can sometimes make me come across as stubborn or overly critical. I'm working on balancing my desire for precision with being more open-minded and flexible. I'm looking forward to meeting new people and learning more about their perspectives.
This text was written by a student in a creative writing class. The assignment was to write a short, neutral self-introduction for

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
The capital of France is Paris.
The capital of France is Paris. This statement is a concise factual statement about France’s capital city. It provides a clear and direct answer to the question, without any additional information or context. It is a simple and straightforward statement that can be used as a fact or a piece of trivia. The statement is also grammatically correct and easy to understand, making it suitable for a variety of contexts, such as educational materials, trivia games, or general knowledge quizzes. Overall, the statement is a clear and concise expression of a factual piece of information. The statement is also neutral and does not express any

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is likely to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems may be able to analyze medical data, identify patterns, and make predictions about patient outcomes.
2. Rise of explainable AI: As AI becomes more pervasive, there is a growing need for transparency and explainability in AI decision-making. Explainable AI (XAI) aims to provide insights into how AI systems arrive at their conclusions,
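The "overlap removal" mentioned in the cell above can be illustrated in isolation. The sketch below shows one plausible way to merge streamed chunks when the start of a chunk repeats the tail of the text received so far; it is an illustration of the idea, not necessarily how `stream_and_merge` is implemented. The names `trim_overlap` and `merge_chunks` and the sample chunks are assumptions made for this example.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, removing any overlap with the text merged so far."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks whose beginnings repeat the tail of what came before.
chunks = ["The capital", "capital of France", "France is Paris."]
print(merge_chunks(chunks))
```

A suffix/prefix match cannot tell a resent fragment from a genuine repetition in the generated text, so this approach is only appropriate when chunks are known to overlap.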



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Sage. I'm a quiet, reserved individual who often finds myself lost in thought. I'm not one for grand gestures or loud declarations, preferring to observe and listen rather than speak. My hobbies include reading, nature walks, and practicing yoga. I enjoy solitude and find comfort in the stillness of the world around me. I'm not much of a planner, but I do my best to live in the present moment. That's me in a nutshell.
I will always be the one who isn't watching you, but you know I'm watching you. I'm the quiet observer, always lurking in the shadows, waiting for the perfect

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Location: Paris is situated in the northern part of France. Geographic coordinates: 48.8567° N, 2.2945° E. Area: Paris covers an area of 105.4 square kilometers. Population: As of 2020
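Because `async_generate` is awaited, several independent batches can in principle be submitted from the same event loop. The sketch below shows only the `asyncio` pattern, using a hypothetical `StubEngine` that mimics the awaitable call from the cell above; whether requests actually overlap in a real engine depends on its scheduler, which is not modeled here.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in that mimics only the awaitable call used above."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do some work
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Create one coroutine per batch and wait for all of them together.
    # asyncio.gather returns results in the same order as its arguments.
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(run_batches(StubEngine(), batches, {"temperature": 0.8}))

for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(prompt, "->", output["text"])
```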

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Aurélien. I work as a librarian in a small town surrounded by dense forest. When I'm not organizing books, I enjoy reading about history, listening to classical music, and going for long walks in the woods. I'm a quiet and observant person, content with keeping to myself, but always willing to lend a listening ear or offer a helping hand when needed. What are some specific things you'd like to know about me? I'm happy to share more about myself if you'd like.
1. What is the character's occupation?
2. What is the character's living situation like?
3. What are some


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The capital of France is Paris.
The city of Paris is known as the City of Light.
Paris is a beautiful and historic city.
The city is built along the Seine River.
The Louvre Museum is located in the city.
The Eiffel Tower is a famous landmark in the city. The Eiffel Tower was originally intended to be a temporary structure but it became a beloved landmark and was left standing.
The city of Paris is a hub for fashion, art, and culture.
Paris is home to many famous universities and research institutions.
Paris is the center of French politics and government.
France has a complex and rich



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  uncertain, but several trends are likely to shape its development and impact.
Artificial intelligence (AI) is a rapidly evolving field that is transforming various aspects of our lives, from healthcare and finance to transportation and education. As AI technology continues to advance, several trends are likely to shape its development and impact in the future. Here are some possible future trends in AI:
1. Increased Adoption of Edge AI:
Edge AI refers to the use of AI algorithms on devices or sensors that are close to the source of the data, rather than relying on cloud-based processing. This trend is expected to continue as more devices become connected and the need for faster
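The cell above prints each chunk as it arrives. When the complete text is also needed afterwards, the chunks can be accumulated while iterating. The following is a minimal sketch of that pattern; `stub_stream` is a hypothetical async generator standing in for a stream of already-cleaned chunks, and `collect` is an illustrative helper, not an SGLang function.

```python
import asyncio


async def stub_stream(text, size=5):
    """Hypothetical stand-in for an async generator of already-cleaned chunks."""
    for start in range(0, len(text), size):
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield text[start:start + size]


async def collect(stream):
    """Accumulate streamed chunks and return the full text."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)


full_text = asyncio.run(collect(stub_stream(" Paris. The capital of France is Paris.")))
print(repr(full_text))
```

Collecting the chunks in a list and joining once at the end avoids repeated string concatenation for long generations.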




In [6]:
llm.shutdown()