# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

**To launch the offline engine in a Python script, you must guard the entry point with `if __name__ == "__main__":`, because the engine uses `spawn` mode to create subprocesses. See this [simple example](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py) for more details.**

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/hidden_states.py).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.02it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.65it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.35it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.26it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Charlie and I'm a content writer with a passion for technology, science, and the environment. I enjoy writing about topics that have the potential to make a positive impact on people's lives, and I'm excited to share my knowledge with you through this blog.

When I'm not writing, you can find me exploring new hiking trails, practicing yoga, or indulging in a good book. I'm a lifelong learner, and I'm always looking for ways to improve my writing skills and stay up-to-date on the latest developments in my areas of interest.

My goal is to create content that is informative, engaging, and easy to understand,
Prompt: The president of the United States is
Generated text:  required by the U.S. Constitution to deliver a State of the Union address to Congress every year. However, presidents can be quite creative in how they choose to deliver this message to the American people. Over the years, presidents have used various tactics to make their State 

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student. I'm a bit of a bookworm and enjoy reading fantasy novels in my free time. I'm also a member of the school's debate team and enjoy arguing about current events and social issues. I'm a bit of a perfectionist, which can sometimes make me come across as uptight or critical, but I'm working on being more relaxed and open-minded. I'm a bit of a introvert, but I'm trying to step out of my comfort zone and make more friends. I'm excited to meet new people and learn more about their perspectives and experiences.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about France’s capital city.
The capital of France is Paris. This statement is a concise factual statement because it provides a clear and accurate piece of information about France’s capital city. It does not include any opinions, emotions, or unnecessary details, making it a concise and factual statement. This type of statement is often used in encyclopedias, dictionaries, and other reference materials to provide quick and reliable information. In this case, the statement is a simple and straightforward declaration of a well-known fact, making it a concise and factual statement about France’s capital city. 
This response meets the requirements

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. While it's difficult to predict exactly what the future will hold, here are some possible trends that could shape the development and impact of artificial intelligence in the coming years:
1. Increased Adoption in Industries: AI will continue to be adopted in various industries, including healthcare, finance, transportation, and education. This will lead to increased efficiency, productivity, and innovation in these sectors.
2. Advancements in Natural Language Processing (NLP): NLP will continue to improve, enabling AI systems to better understand and generate human-like language. This will lead to more effective communication between humans and machines.
3.
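
The "overlap removal" mentioned in the cell above can be illustrated with a small standalone sketch. It assumes that a streamed chunk may repeat text that was already received, and it trims the longest prefix of each new chunk that duplicates the tail of the merged text. The function names `trim_overlap` and `merge_chunks` are chosen here for illustration, and the actual helper in `sglang.utils` may be implemented differently.

```python
from typing import Iterable


def trim_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that duplicates the end of existing."""
    max_overlap = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks while skipping text that was already received."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Illustrative chunks (not real engine output): the second and third
# chunks repeat part of what came before.
merged_text = merge_chunks([" Paris", " Paris is", " is the capital"])
print(repr(merged_text))
```

One trade-off of this heuristic is that it cannot tell a repeated chunk from text that legitimately repeats, so a genuine repetition at a chunk boundary would also be trimmed.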



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Axiom. I am a skilled and resourceful individual who is often found in the midst of unexpected situations. I possess a unique combination of skills and knowledge that allow me to adapt to various environments and challenges. I prefer to keep a low profile and observe the world around me, but I will not hesitate to take action when necessary.
I will be playing the role of Axiom in a story or scenario. I am curious to see how I will interact with other characters and how the situation will unfold.
I am comfortable with the description and details provided, and I am ready to begin our story together.
I will respond to any

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. The city has been a major cultural, intellectual, and financial center for centuries. It is known for its iconic landmarks, museums, and f
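
When requests arrive independently, for example in a custom server, several awaitable calls can be issued concurrently and collected in order with `asyncio.gather`. The following is a minimal sketch of that pattern. `StandInEngine` is a hypothetical placeholder so that the snippet runs without a model or GPU. It is not part of SGLang, and it assumes that awaiting `async_generate` with a single prompt yields a single dictionary with a `"text"` key.

```python
import asyncio
from typing import Dict, List


class StandInEngine:
    """Hypothetical stand-in for an engine object (not an SGLang class)."""

    async def async_generate(self, prompt: str, sampling_params: Dict) -> Dict:
        # Simulate inference latency without blocking the event loop.
        await asyncio.sleep(0.01)
        return {"text": f" [completion for: {prompt}]"}


async def generate_all(engine, prompts: List[str], sampling_params: Dict) -> List[Dict]:
    """Issue one request per prompt and wait for all of them together."""
    coroutines = [engine.async_generate(p, sampling_params) for p in prompts]
    # gather returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*coroutines)


prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = {"temperature": 0.8, "top_p": 0.95}
outputs = asyncio.run(generate_all(StandInEngine(), prompts, sampling_params))

for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

With a real engine, passing the whole list in one call, as the cell above does, is the simpler choice for a fixed batch. The gather pattern is mainly relevant when prompts come from separate callers.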

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Caelum. I'm a detective with the city's police department. I've been working in law enforcement for over a decade, specializing in cold cases and missing persons. I've seen my fair share of tragedy and heartbreak, but I've also witnessed incredible resilience and determination from those who've been affected by these crimes. I'm driven by a desire to bring justice to those who've been wronged and to find closure for their loved ones. That's me in a nutshell. What do you think? Is there anything you'd change or add? I'm open to feedback.

Caelum

That's a great start!



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The above is a concise factual statement about the capital of France. It uses simple language and states the fact directly, making it a clear and easy-to-understand response. It doesn’t include any unnecessary information and is free from errors, as required by the given instructions. The statement is also very brief, as desired, making it a good example of a concise factual statement about the capital of France. Furthermore, it does not include any assumptions or inferences that might make it seem like an opinion. Overall, it meets all the requirements for a concise factual statement.
The final answer is: The capital of France is Paris.



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a complex and rapidly evolving field. Here are some possible future trends in AI:
1. Increased Adoption in Healthcare: AI will continue to improve healthcare outcomes by analyzing large amounts of medical data, predicting patient outcomes, and assisting in diagnosis.
2. Advancements in Natural Language Processing (NLP): NLP will become more sophisticated, enabling AI systems to better understand and generate human-like language, leading to more effective communication between humans and machines.
3. Rise of Explainable AI (XAI): As AI becomes more ubiquitous, there will be a growing need to understand how AI systems make decisions. XAI will provide insights into AI decision-making
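
The consumption pattern used in the cell above, an `async for` loop that prints each chunk with `end=""` so the text appears on one line, can be tried without an engine. This is a minimal sketch under stated assumptions: `stand_in_stream` is a hypothetical async generator that replays a fixed list of chunks. It is not an SGLang API, and the chunks are illustrative rather than real model output.

```python
import asyncio
from typing import AsyncIterator, List


async def stand_in_stream(chunks: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming source (not an SGLang API)."""
    for chunk in chunks:
        # Hand control back to the event loop, as a real stream would while waiting.
        await asyncio.sleep(0)
        yield chunk


async def print_and_collect(stream: AsyncIterator[str]) -> str:
    """Print chunks as they arrive on a single line and return the full text."""
    pieces = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        pieces.append(chunk)
    print()  # Finish the line once the stream is exhausted.
    return "".join(pieces)


chunks = ["The", " capital", " of", " France", " is", " Paris", "."]
full_text = asyncio.run(print_and_collect(stand_in_stream(chunks)))
```

Collecting the chunks in a list and joining them once at the end avoids repeated string concatenation on long outputs.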




In [6]:
llm.shutdown()