# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Advanced Usage

The engine also supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) and [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl

from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Lloyd and I am a security officer at a small manufacturing facility. I am reaching out to you for some advice. I have been noticing some unusual behavior from a few of my coworkers, and I am concerned that they may be at risk or even in danger. My concern is that they may be working while under the influence of a substance. I am not sure if it is my place to intervene or what steps I should take if I do decide to approach them.
Hello Lloyd, thank you for reaching out. It takes a lot of courage to address concerns like this, and I commend you for taking the initiative to speak up. However,
Prompt: The president of the United States is
Generated text:  not the president of the world. That’s a fundamental aspect of the American system of government, which was established to protect American interests and maintain American sovereignty. However, President Joe Biden’s statement that “America will lead in the world” has raised concerns among many, in
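
Offline batch jobs often need their results saved for later inspection. The following is a minimal sketch of one way to do that, assuming each output is a dict with a `text` key as in the loop above. The helpers `save_generations` and `load_generations`, the file name, and the stand-in data are all hypothetical and not part of SGLang.

```python
import json
from pathlib import Path


def save_generations(prompts, outputs, path):
    """Write one JSON object per line pairing each prompt with its generated text."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for prompt, output in zip(prompts, outputs):
            # Only the "text" key is assumed here; other keys are ignored.
            record = {"prompt": prompt, "text": output["text"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


def load_generations(path):
    """Read the JSONL file back into a list of dicts."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    # Hand-written stand-in shaped like the result of llm.generate(prompts, sampling_params).
    demo_prompts = ["The capital of France is"]
    demo_outputs = [{"text": " Paris."}]
    out_path = save_generations(demo_prompts, demo_outputs, "generations.jsonl")
    print(load_generations(out_path))
```

With a live engine, the `prompts` and `outputs` variables from the cell above could be passed in place of the stand-in data.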

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student who enjoys playing the guitar and listening to music. I'm a bit of a daydreamer and often get lost in my own thoughts. I'm a bit shy, but I'm working on being more outgoing and confident. I'm a junior in high school, and I'm trying to figure out my future plans and interests. I'm a bit of a bookworm and love reading about history and science. I'm also interested in photography and taking pictures of the world around me. I'm a bit of a perfectionist, but I'm trying to learn to be more

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. The city is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre Dame Cathedral. Paris is also a major economic and financial center, and is home to many international organizations and companies. The city has a population of over 2.1 million people and is a popular tourist destination. Paris is known for its romantic atmosphere, beautiful architecture, and vibrant cultural scene. The city has

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. While it's difficult to predict exactly what the future will hold, here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, potentially leading to breakthroughs in disease prevention and treatment.
2. Widespread adoption of AI in industries: AI is already being used in various industries, including finance, transportation, and customer service. In the future, AI is likely to become even
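
The "overlap removal" mentioned in the cell above can be pictured as follows. This is an illustrative sketch of the idea, not necessarily how `stream_and_merge` is implemented in `sglang.utils`. The names `strip_overlap` and `merge_stream` are hypothetical, and the stream is hand-written stand-in data, assuming each streamed chunk is a dict with a `text` key.

```python
def strip_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks) -> str:
    """Merge an iterable of {"text": ...} chunks into one string without repeats."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk["text"])
    return merged


# Hand-written stand-in for a stream whose chunks repeat earlier text.
fake_stream = [{"text": " Paris"}, {"text": " Paris is"}, {"text": " is the capital."}]
print(repr(merge_stream(fake_stream)))  # ' Paris is the capital.'
```

One limitation of this heuristic: a chunk that legitimately begins by repeating the end of the previous text (for example "ha" followed by "ha") would be trimmed as well, so it suits streams that are known to resend earlier text.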



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Jasper Flynn, and I'm a 27-year-old writer living in New York City. I work as a freelance journalist, and I'm currently researching a story on sustainable energy practices. I enjoy hiking and trying out new craft beers. I'm a bit of a introverted person, but I have a close-knit group of friends who keep me company. I'm excited to see where life takes me next.

## Step 1: Start with a basic introduction to establish the character's identity.
Begin by stating the character's name and a brief description of who they are.

## Step 2: Provide some context about the character's

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Note: This is a simple fact about Paris as a city, not the Paris Agreement or any other topic related to Paris. ## Step 1: Identify the city in question
The question asks for factual inf
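
The cell above submits all prompts in a single awaited call. When requests originate from different places in a program, one common asyncio pattern is to issue them concurrently with `asyncio.gather`. The sketch below shows only that pattern: `fake_async_generate` is a hypothetical stand-in that sleeps and returns a dict with a `text` key, and whether a real engine call can be dropped in its place unchanged is an assumption not verified here.

```python
import asyncio


async def fake_async_generate(prompt: str, delay: float = 0.01) -> dict:
    """Hypothetical stand-in for an awaitable generation call."""
    await asyncio.sleep(delay)
    return {"text": f" <completion for: {prompt}>"}


async def generate_all(prompts, generate_fn):
    """Issue one request per prompt concurrently; results keep the prompt order."""
    tasks = [generate_fn(p) for p in prompts]
    return await asyncio.gather(*tasks)


async def demo():
    prompts = ["The capital of France is", "The future of AI is"]
    outputs = await generate_all(prompts, fake_async_generate)
    for prompt, output in zip(prompts, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
    return outputs


if __name__ == "__main__":
    asyncio.run(demo())
```

`asyncio.gather` returns results in the order the awaitables were passed, so pairing them back with `zip(prompts, outputs)` stays correct even if individual requests finish out of order.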

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Evelyn Blackwood. I'm a 22-year-old graduate student in the field of botany. I've always been fascinated by the unique adaptations of plants and the ways in which they've evolved to thrive in different environments. Outside of academia, I enjoy spending time in nature, hiking and collecting rare specimens for my research. I'm currently working on a thesis project that explores the medicinal properties of a specific genus of flowering plants. I'm looking forward to meeting new people and learning from others in my field.
#Evelyn Blackwood
#Botany
#Graduate Student
#Nature Lover
#Researcher
#Ac

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
I couldn’t find a reputable source that provides the definition of a metropolis. It is possible that the term is not formally defined in language resources. However, based on general knowledge, a metropolis is a major city that is a center of commerce, culture, finance, and industry. The city of Paris is a metropolis.
I was unable to verify the definition of the term “metropolis” in French, but based on general knowledge, the city of Paris is referred to as a metropolis in French as well. Paris is called a “métropole” in French.
Provide a concise statement about the location

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a subject of great interest and speculation. While it is difficult to predict with certainty, here are some possible future trends in artificial intelligence:
1. Increased Adoption of AI in Various Industries:
AI is already being used in various industries such as healthcare, finance, and transportation. In the future, we can expect to see increased adoption of AI in other industries such as education, manufacturing, and customer service.
2. Advancements in Machine Learning:
Machine learning is a key aspect of AI, and we can expect to see significant advancements in this area. This could include the development of more sophisticated algorithms, the use of more advanced data analysis techniques,


In [6]:
llm.shutdown()
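
If generation raises an exception before the final cell runs, `shutdown()` is never reached. A minimal sketch of one way to guard against that is a small context manager that always calls `shutdown()` on exit. The helper `shutting_down` and the `FakeEngine` class are hypothetical and exist only to make the example runnable; the only assumption about the real engine is the `shutdown()` method shown above.

```python
from contextlib import contextmanager


@contextmanager
def shutting_down(engine):
    """Yield the engine and call engine.shutdown() on exit, even if an error occurs."""
    try:
        yield engine
    finally:
        engine.shutdown()


class FakeEngine:
    """Hypothetical stand-in exposing only the shutdown() method used above."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


engine = FakeEngine()
try:
    with shutting_down(engine) as llm:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass
print(engine.closed)  # True
```

The same `with` block would wrap the generation calls when a real engine object is used in place of `FakeEngine`.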