# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
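
As a rough illustration of what a custom server on top of the engine can look like, the sketch below wraps an engine-like object in a small request handler built on Python's standard `http.server`. It is not the linked `custom_server.py` example. `handle_generate`, `make_handler`, and the `EchoEngine` stand-in are hypothetical names, and the only engine behavior assumed is a `generate(prompts, sampling_params)` call that returns a list of dicts with a `"text"` key, as used later in this document.

```python
import json
from http.server import BaseHTTPRequestHandler


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompts, sampling_params=None):
        return [{"text": f" (echo) {p}"} for p in prompts]


def handle_generate(engine, body: bytes):
    """Turn a JSON request body into (status_code, response_bytes)."""
    try:
        payload = json.loads(body)
        prompts = payload["prompts"]
    except (ValueError, KeyError, TypeError):
        prompts = None
    if not isinstance(prompts, list) or not all(isinstance(p, str) for p in prompts):
        error = {"error": "expected a JSON object with a 'prompts' list of strings"}
        return 400, json.dumps(error).encode("utf-8")
    outputs = engine.generate(prompts, payload.get("sampling_params", {}))
    texts = [output["text"] for output in outputs]
    return 200, json.dumps({"texts": texts}).encode("utf-8")


def make_handler(engine):
    """Build a request handler class bound to one shared engine instance."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            status, response = handle_generate(engine, self.rfile.read(length))
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(response)))
            self.end_headers()
            self.wfile.write(response)

    return GenerateHandler
```

To serve requests, the handler class could be passed to `http.server.HTTPServer(("127.0.0.1", 8000), make_handler(llm))` followed by `serve_forever()`. The address and port here are placeholders. Keeping the request logic in a plain function makes it testable without opening a socket.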



## Advanced Usage

The engine also supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) and [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch  # imported only when running in CI


# Load the model once; every cell below reuses this engine instance.
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

INFO 03-26 05:02:13 __init__.py:190] Automatically detected platform cuda.
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.10it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Jim. I'm a graduate student in psychology, and I'm currently studying cognitive development and child psychology. I've been assigned a research paper on a topic of my choice, and I'm considering a paper on the effects of language on cognitive development in young children. I'm looking for some general information to get started, and I was wondering if you could provide me with some key points or a brief overview of the topic.
Language is a complex and multifaceted topic, and its effects on cognitive development in young children are still being studied and debated by researchers. Here are some key points to consider:
1. Language and cognitive development are
Prompt: The president of the United States is
Generated text:  the head of state and head of government for the United States. The president serves a four-year term and is limited to two terms by the 22nd Amendment to the Constitution.
The president is directly elected by the people throug
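
The loop above pairs each prompt with its output by position. A minimal sketch of the same pattern as a reusable helper is shown below. `run_batch` and the `CannedEngine` stand-in are hypothetical names introduced here so the example runs without a model; the sketch assumes only what the cell above shows, namely that `generate(prompts, sampling_params)` returns one dict with a `"text"` key per prompt, in prompt order.

```python
class CannedEngine:
    """Hypothetical stand-in for sgl.Engine; returns canned text per prompt."""

    def generate(self, prompts, sampling_params=None):
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def run_batch(engine, prompts, sampling_params):
    """Submit all prompts in one call and return prompt/completion records."""
    outputs = engine.generate(prompts, sampling_params)
    if len(outputs) != len(prompts):
        raise ValueError("engine returned a different number of outputs than prompts")
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


records = run_batch(
    CannedEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)
for record in records:
    print(record["prompt"] + record["completion"])
```

With the real engine, `run_batch(llm, prompts, sampling_params)` would take the place of the stand-in call.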

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor. I live in a small apartment in the city and spend most of my time working on various writing projects or reading books. I'm a bit of a introvert and enjoy spending time alone, but I also value my relationships with friends and family. I'm a bit of a perfectionist, which can sometimes make it difficult for me to start new projects, but I'm working on learning to be more flexible and adaptable. I'm interested in learning more about the world and meeting new people, and I'm excited to see where life takes me.
This is a good

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
The capital of France is Paris.
The capital of France is Paris. The city is known for its iconic landmarks, such as the Eiffel Tower and Notre-Dame Cathedral, as well as its rich history, art museums, and fashion industry. Paris is a major cultural and economic center, attracting millions of tourists and business travelers each year. The city is also home to many international organizations, including the United Nations Educational, Scientific and Cultural Organization (UNESCO) and the Organisation for Economic Co-operation and Development (OECD). Paris is a global hub for finance, fashion, and culture, and is considered one of the most

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by several factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is likely to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems may be able to analyze medical data, identify patterns, and make predictions about patient outcomes.
2. Rise of explainable AI: As AI becomes more pervasive, there is a growing need for transparency and accountability. Explainable AI (XAI) aims to provide insights into how AI systems make decisions, enabling users to understand and trust AI-driven
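
The cell above describes the streaming helper as performing overlap removal. The idea can be sketched as follows: when a new chunk begins with text that already ends the accumulated output, that repeated prefix is dropped before the chunk is appended. This is an illustrative sketch only, not necessarily how `stream_and_merge` is implemented; `drop_overlap` and `merge_chunks` are hypothetical names.

```python
def drop_overlap(existing: str, chunk: str) -> str:
    """Return `chunk` without the longest prefix that already ends `existing`."""
    max_len = min(len(existing), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate streamed chunks, skipping text repeated across boundaries."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


print(merge_chunks([" Paris", " Paris is", " is the capital"]))
```

One trade-off of this approach is that a chunk which legitimately repeats the tail of the text so far is trimmed as well, so it fits best when chunks are expected to overlap.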



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Jasmine Nguyen, and I'm a 20-year-old student at the University of California, Berkeley. I'm studying computer science and planning to graduate within the next two years. I'm currently living off-campus with three roommates in a small house in Berkeley. In my free time, I enjoy playing video games, reading science fiction novels, and listening to electronic music.
This introduction is short and neutral, providing some basic information about Jasmine's identity, education, and interests. It doesn't reveal too much about her personality or background, which is suitable for a neutral self-introduction.
Here are a few alternative versions of Jasmine's introduction,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris, also known as the City of Light, is a significant cultural and financial center with a r
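
One reason to prefer the asynchronous call is that it can share an event loop with other work. The sketch below runs several prompt batches concurrently with `asyncio.gather`. It assumes that `async_generate` can be awaited concurrently for independent batches, which is not verified here against the real engine; `generate_batches` and the `FakeAsyncEngine` stand-in are hypothetical names.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in for sgl.Engine with an awaitable batch call."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def generate_batches(engine, batches, sampling_params):
    """Run several prompt batches concurrently; results keep batch order."""
    pending = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*pending)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(
    generate_batches(FakeAsyncEngine(), batches, {"temperature": 0.8, "top_p": 0.95})
)
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(prompt + output["text"])
```

`asyncio.gather` returns results in the order the awaitables were passed, so each result list lines up with its batch.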

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly.
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Elianore Quasar. I'm a freelance astrologer, currently residing in a small, coastal town. I've had a curious mind since childhood, always trying to make sense of the stars and the mysteries of the universe. My approach to astrology is unconventional, often blending scientific knowledge with ancient traditions. I'm always eager to learn and explore new ideas, and I believe that understanding the cosmos can help us better understand ourselves and our place in the world.
Write a short, neutral self-introduction for a fictional character.
Hello, my name is Elianore Quasar. I'm a freelance astrologer, currently residing in



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Provide a concise factual statement about France’s capital city.
The capital of France is Paris. Location: Paris, the capital of France, is situated on the river Seine in the north-central part of the country. Population: The population of Paris is approximately 2.1 million, with over 12 million in the metropolitan area. History: Paris has a rich history dating back to the Roman era, and it has been the capital of France since the 12th century. Notable Landmarks: Some of Paris’ most famous landmarks include the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum.



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by advancements in various fields, including machine learning, natural language processing, computer vision, and robotics. Here are some possible future trends in AI:
1. Increased adoption of edge AI: As the number of IoT devices grows, edge AI will become more prevalent, enabling faster processing and reduced latency.
2. Development of more sophisticated conversational AI: Future AI systems will be able to engage in more natural and human-like conversations, leveraging advancements in natural language processing.
3. Greater use of computer vision: AI will be increasingly used in computer vision applications, such as object detection, facial recognition, and autonomous vehicles.
4.
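
The streaming cell prints each chunk as it arrives. When the complete text is also needed afterwards, the chunks can be collected while they are forwarded. The sketch below shows that pattern with a stand-in async generator so it runs without a model; `consume_stream` and `fake_stream` are hypothetical names, and the only assumption about the real helper is that it can be iterated with `async for`, as in the cell above.

```python
import asyncio


async def fake_stream(chunks, delay=0.0):
    """Hypothetical stand-in for an async chunk stream."""
    for chunk in chunks:
        await asyncio.sleep(delay)
        yield chunk


async def consume_stream(stream, on_chunk=None):
    """Forward each chunk to `on_chunk` as it arrives and return the full text."""
    parts = []
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk)
        parts.append(chunk)
    return "".join(parts)


seen = []
full_text = asyncio.run(
    consume_stream(fake_stream([" Paris", ".", "\n"]), seen.append)
)
```

With the real engine, the stream argument would be `async_stream_and_merge(llm, prompt, sampling_params)` and `on_chunk` could be `lambda chunk: print(chunk, end="", flush=True)`.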




In [6]:
llm.shutdown()
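
If an exception is raised before the final cell runs, `llm.shutdown()` is never reached. In a script, one way to guard against this is a small context manager that always shuts the engine down. This is a sketch under the assumption that calling `shutdown()` once at the end is all the cleanup required; `managed_engine` and the `TrackedEngine` stand-in are hypothetical names.

```python
from contextlib import contextmanager


class TrackedEngine:
    """Hypothetical stand-in for sgl.Engine that records whether it was shut down."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params=None):
        return [{"text": " ok"} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(factory):
    """Create an engine with `factory` and always call shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


with managed_engine(TrackedEngine) as engine:
    outputs = engine.generate(["Hello, my name is"])
```

With the real engine, the factory could be `lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")`.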