# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can easily build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## SPECIAL WARNING!!!!

**To launch the offline engine in your Python scripts, the** `if __name__ == "__main__":` **guard is necessary, since we use** `spawn` **mode to create subprocesses. Please refer to this simple example**:

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py
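
As a rough illustration of the guard described above, the sketch below wraps the engine launch in a `main()` function and calls it only from the `__main__` block. It uses the same `sgl.Engine`, `generate`, and `shutdown` calls that appear later on this page. The `importlib.util.find_spec` check and the names `MODEL_PATH` and `SAMPLING_PARAMS` are assumptions added here so the file also runs on a machine without SGLang; they are not part of the linked example.

```python
import importlib.util

MODEL_PATH = "meta-llama/Meta-Llama-3.1-8B-Instruct"
SAMPLING_PARAMS = {"temperature": 0.8, "top_p": 0.95}


def main() -> None:
    # Imported here so this file can still be loaded on machines without sglang.
    import sglang as sgl

    llm = sgl.Engine(model_path=MODEL_PATH)
    try:
        outputs = llm.generate(["The capital of France is"], SAMPLING_PARAMS)
        for output in outputs:
            print(output["text"])
    finally:
        # Release the engine's subprocesses even if generation fails.
        llm.shutdown()


# With the spawn start method, child processes import the main module again;
# the guard keeps them from trying to create another engine.
if __name__ == "__main__":
    if importlib.util.find_spec("sglang") is None:
        print("sglang is not installed; skipping the engine launch")
    else:
        main()
```

Without the guard, each spawned subprocess would re-run the module's top-level code, including the engine construction.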

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

INFO 03-19 17:07:00 __init__.py:190] Automatically detected platform cuda.




INFO 03-19 17:07:17 __init__.py:190] Automatically detected platform cuda.


Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.31it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Chris. I've been working on a video game for a while now, and I'm at a stage where I need to start thinking about making it available to the public. I'm thinking of making it available on the App Store and possibly Google Play. I've heard a lot about the app review process, but I'm not really sure what it's involved. Can you tell me a bit more about it?

I'd love to hear your thoughts on how to improve my app's chances of passing the review process. And, if you have any tips or recommendations on how to prepare for it, I'd be grateful to hear them.


Prompt: The president of the United States is
Generated text:  required by law to be a natural-born citizen of the United States, at least 35 years old, and a resident of the United States for at least 14 years.
I. President must be a natural-born citizen.
A. The president must have been born in the United States or one of its territories.
B. The president must have been born in the United States 

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in Tokyo. I enjoy reading, hiking, and trying new foods. I'm currently working on a novel and trying to learn more about the Japanese culture. That's me in a nutshell.
This is a good start, but it's a bit too straightforward. Let's try to add a bit more depth and personality to the introduction. Here's a revised version: Hi, I'm Kaida. I'm a writer, a wanderer, and a bit of a foodie. When I'm not scribbling away on my latest novel, you can find me exploring

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
The capital of France is Paris. The city is known for its rich history, art, fashion, and cuisine. Paris is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre Dame Cathedral. The city is also a major hub for international business and finance. Paris is a popular tourist destination and is often referred to as the "City of Light." The city has a population of over 2.1 million people and is a major cultural and economic center in Europe. The official language of Paris is French, and the city is divided into 20 arrondissements, or districts

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems will be able to analyze large amounts of medical data, identify patterns, and make predictions about patient outcomes.
2. Rise of explainable AI: As AI becomes more pervasive, there will be a growing need for AI systems to be transparent and explainable. This will involve developing techniques to interpret and understand the decisions made by AI
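
The "overlap removal" mentioned in the cell above can be pictured with a small standalone sketch. It assumes that a streamed chunk may repeat text that has already been received, and it drops the repeated prefix before appending. The helper names `strip_overlap` and `merge_chunks` are hypothetical and written only for this illustration; the actual `stream_and_merge` helper in `sglang.utils` may work differently.

```python
from typing import Iterable


def strip_overlap(existing: str, new_chunk: str) -> str:
    """Return new_chunk without the longest prefix that already ends `existing`."""
    max_overlap = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, skipping text that repeats the end of the merged output."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


merged_text = merge_chunks([" Paris", " Paris is", " is the capital"])
print(repr(merged_text))
```

One limitation of this approach is that it cannot tell a repeated prefix from new text that happens to start with the same characters, so it is only suitable when overlapping chunks are expected.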



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Renn. I'm a quiet and observant person who prefers to listen more than I speak. I'm a curious person who enjoys learning about the world around me. I work in a library as a cataloger, which allows me to surround myself with books and knowledge. I'm not much of a social butterfly, but I'm always happy to engage in a meaningful conversation.
What does the character of Renn reveal about themselves in this introduction?
Renn reveals that they are:
1. A quiet and observant person who prefers to listen more than they speak.
2. Curious and enjoy learning about the world around them.
3

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
This statement is concise and factual. It directly answers the question and provides no unnecessary information. It includes the name of the country, which makes it specific to the
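
One reason to prefer the asynchronous API is that generation can be awaited alongside other coroutines. The sketch below shows the general `asyncio` pattern only: `fake_generate` is a hypothetical stand-in that sleeps and returns a dictionary with a `text` key, shaped like the outputs printed above. It does not call SGLang.

```python
import asyncio
from typing import Dict, List


async def fake_generate(prompt: str, delay: float) -> Dict[str, str]:
    """Hypothetical stand-in for an awaitable generation call."""
    await asyncio.sleep(delay)
    return {"text": f" <completion for: {prompt}>"}


async def run_all(prompts: List[str]) -> List[Dict[str, str]]:
    # Give earlier prompts longer delays so they finish last.
    coroutines = [
        fake_generate(prompt, 0.01 * (len(prompts) - index))
        for index, prompt in enumerate(prompts)
    ]
    # gather() returns results in the order the awaitables were passed in,
    # regardless of which one finishes first.
    return await asyncio.gather(*coroutines)


demo_prompts = ["Hello, my name is", "The capital of France is"]
demo_outputs = asyncio.run(run_all(demo_prompts))
for prompt, output in zip(demo_prompts, demo_outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Because `asyncio.gather` preserves input order, pairing prompts with outputs through `zip` stays correct even when the requests complete out of order.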

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Elle, and I'm a 25-year-old freelance writer based in New York City. I've been working on various projects and honing my craft for the past five years, and I'm excited to collaborate with others and take on new challenges.
Elle is a freelance writer with 5 years of experience. She has a strong portfolio that showcases her writing skills, and she is excited to work with others on new projects. Elle is professional and reliable, with a passion for storytelling and a drive to learn and grow. She is based in New York City and is available to collaborate with clients on a variety of writing projects.
The tone

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The capital city of France is famous for its (Select all that apply):
A) Medieval architecture

B) Contemporary art

C) Both medieval and contemporary art

D) Medieval and historical architecture only

Answer: C) Both medieval and contemporary art

Explanation: The city of Paris is well-known for its beautiful architecture, including medieval structures such as the Notre Dame Cathedral, as well as contemporary art. While medieval and historical structures are prominent, contemporary art also plays a significant role in the city’s landscape, including famous artists and museums such as the Louvre.
Use evidence from the text to support an answer to the question

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a topic of ongoing debate and speculation, but here are some potential trends that could shape the field in the coming years:
1. Increased Integration with Human Behavior:
AI systems will become more integrated with human behavior, enabling them to better understand and respond to human emotions, needs, and behaviors. This could lead to more personalized and effective interactions between humans and AI systems.
2. Greater Emphasis on Explainability and Transparency:
As AI becomes more pervasive in various industries, there will be a growing need for AI systems to be transparent and explainable. This means that AI developers will focus on creating systems that can provide clear explanations for their decisions and
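
In the streaming output above, the text arrives in small pieces, often a single token or word fragment at a time. The following sketch shows how such pieces can be consumed with `async for` and joined back into the full text. `fake_token_stream` is a hypothetical stand-in for a streaming source and is not part of SGLang.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_token_stream(tokens: List[str], delay: float = 0.0) -> AsyncIterator[str]:
    """Hypothetical stand-in that yields one text chunk at a time."""
    for token in tokens:
        await asyncio.sleep(delay)
        yield token


async def collect(stream: AsyncIterator[str]) -> str:
    """Print chunks as they arrive and return the joined text."""
    parts: List[str] = []
    async for chunk in stream:
        # end="" keeps the chunks on one line, as in the cell above.
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()
    return "".join(parts)


streamed_text = asyncio.run(collect(fake_token_stream([" Paris", ".", "\n", "The"])))
```

Collecting the chunks in a list and joining once at the end avoids repeated string concatenation while still letting each chunk be displayed as soon as it arrives.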




In [6]:
llm.shutdown()