# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
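To give a rough idea of what a custom server involves, here is a minimal sketch built only on Python's standard library. It is not the linked `custom_server.py` example. The names `handle_generate`, `make_handler`, and `serve`, the JSON request shape (`prompt`, `sampling_params`), and the assumption that `engine.generate` returns a single dictionary with a `text` field for a single string prompt are all illustrative assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple


def handle_generate(engine, body: bytes) -> Tuple[int, bytes]:
    """Turn a JSON request body into a (status, JSON response body) pair."""
    try:
        payload = json.loads(body)
        prompt = payload["prompt"]
    except (ValueError, KeyError, TypeError):
        error = {"error": "expected a JSON object with a 'prompt' field"}
        return 400, json.dumps(error).encode()

    sampling_params = payload.get("sampling_params", {})
    # Assumption: a single string prompt yields a single dict with a "text" key.
    output = engine.generate(prompt, sampling_params)
    return 200, json.dumps({"text": output["text"]}).encode()


def make_handler(engine):
    """Build a request handler class bound to one engine instance."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            status, response = handle_generate(engine, self.rfile.read(length))
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(response)))
            self.end_headers()
            self.wfile.write(response)

    return GenerateHandler


def serve(engine, host="127.0.0.1", port=8000):
    """Block forever, answering POST requests with generated text."""
    HTTPServer((host, port), make_handler(engine)).serve_forever()
```

Keeping the request logic in a plain function (`handle_generate`) makes it possible to test it with a stand-in engine, without loading a model or opening a socket.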

## SPECIAL WARNING!!!!

**To launch the offline engine in a Python script, wrap the launch code in an** `if __name__ == "__main__":` **guard, because SGLang uses** `spawn` **mode to create subprocesses. See this simple example**:

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py
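The guard matters because, under the `spawn` start method, each subprocess re-imports the launching script, so unguarded top-level code would run again in every child. The following is a minimal sketch of a script layout that respects this, not a copy of the linked example. The function names `run_batch` and `main`, the script name in the usage message, and the choice to read the model path from the command line are illustrative assumptions.

```python
import sys


def run_batch(engine, prompts, sampling_params):
    """Generate one completion per prompt and return the generated texts."""
    outputs = engine.generate(prompts, sampling_params)
    return [output["text"] for output in outputs]


def main(model_path):
    # Imported here so that merely importing this module stays cheap.
    import sglang as sgl

    llm = sgl.Engine(model_path=model_path)
    try:
        prompts = ["The capital of France is"]
        sampling_params = {"temperature": 0.8, "top_p": 0.95}
        for text in run_batch(llm, prompts, sampling_params):
            print(text)
    finally:
        # Release the engine's subprocesses even if generation fails.
        llm.shutdown()


# Without this guard, every spawned subprocess would re-run the code below
# when it re-imports this script.
if __name__ == "__main__":
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        print("usage: python launch_engine_sketch.py <model_path>")
```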

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.19it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Leticia and I am a new member of this forum. I am thrilled to be a part of this community and I am looking forward to learning from all of you. I am a language learner and I am currently studying Spanish. I have been studying for about 3 years now and I have made some progress, but I still have a lot to learn. I am eager to practice my speaking and listening skills, so I would be happy to have a conversation with anyone who is willing to chat with me.
I am also interested in learning about the culture and customs of Spanish-speaking countries. I have always been fascinated by the rich history
Prompt: The president of the United States is
Generated text:  the head of the executive branch of the federal government of the United States, and is the commander-in-chief of the armed forces of the United States. The president is indirectly elected by the people through the Electoral College. The president serves a four-year term, and can be re-elected
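As the loop above shows, each element of `outputs` is a dictionary whose `text` field holds the completion. A small helper for pairing prompts with their texts might look like the sketch below. The name `pair_outputs` is hypothetical, and the handling of a single string prompt assumes, without confirmation from this document, that such a call returns one dictionary rather than a list.

```python
def pair_outputs(prompts, outputs):
    """Return (prompt, generated_text) tuples for a batch or a single request."""
    # Assumption: a single string prompt produces a single dict, not a list.
    if isinstance(prompts, str):
        prompts = [prompts]
    if isinstance(outputs, dict):
        outputs = [outputs]

    if len(prompts) != len(outputs):
        raise ValueError(
            f"got {len(outputs)} outputs for {len(prompts)} prompts"
        )
    return [(prompt, output["text"]) for prompt, output in zip(prompts, outputs)]
```

Checking the lengths explicitly avoids the silent truncation that `zip` would otherwise perform if the two sequences ever disagreed.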

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and artist living in Tokyo. I enjoy exploring the city's hidden corners and trying new foods. I'm a bit of a introvert, but I'm always up for a good conversation. I'm currently working on a novel and a graphic novel, and I'm excited to see where my creative projects take me. I'm looking forward to meeting new people and making connections in the Tokyo art and writing community. That's me in a nutshell! What do you think? Is there anything you'd like to add or change?
I think your self-introduction is clear and concise.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
The capital of France is Paris.
Paris is the capital and largest city of France, with a population of over 2.1 million people within the city limits and over 12.2 million people in the metropolitan area. It is one of the most famous and iconic cities in the world, known for its stunning architecture, art museums, fashion, cuisine, and romantic atmosphere. Paris is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum, which houses the Mona Lisa. The city is also a major center for business, finance, and culture, and is home

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Some experts predict that AI will become increasingly integrated into our daily lives, while others warn of the potential risks and challenges associated with its development. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, helping to improve patient outcomes and reduce costs.
2. Rise of autonomous vehicles: Self-driving cars and trucks are already being tested on public roads, and it's likely that they
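The "overlap removal" mentioned in the banner refers to merging streamed chunks without repeating text that has already been collected. The sketch below illustrates one way to do this; it is an illustrative reimplementation, not the source of `stream_and_merge`. The names `strip_overlap` and `merge_stream` are hypothetical, and the sketch assumes each streamed chunk is a dictionary with a `text` field.

```python
def strip_overlap(existing, chunk):
    """Drop the longest prefix of `chunk` that is already a suffix of `existing`."""
    max_len = min(len(existing), len(chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_stream(chunks):
    """Concatenate streamed chunks into one string, skipping repeated text."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk["text"])
    return merged
```

Under the further assumption that `llm.generate` accepts `stream=True` and then yields such chunks, the helper could be called as `merge_stream(llm.generate(prompt, sampling_params, stream=True))`. Note that suffix matching can drop legitimately repeated text when chunks are purely incremental, so this approach fits best when chunks may be cumulative.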



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Maya Ramos. I work as a freelance graphic designer. I'm a bit of a creative, always looking for new ideas and projects to work on. When I'm not designing, I enjoy trying out new restaurants and cooking new recipes. I'm a bit of a homebody, but I love meeting new people and making new friends.
Hello, my name is Olivia Thompson. I work as a marketing manager for a small company. I'm a bit of a planner, always looking for ways to improve our brand and reach new customers. In my free time, I enjoy reading and practicing yoga. I'm a bit of a perfectionist,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Next, provide a brief description of the city. The city of Paris is known for its beautiful architecture, famous landmarks like the Eiffel Tower, and world-class art museums like the Louvre.
Now, provide an 
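The cell above submits all prompts in one `async_generate` call. When prompts instead arrive independently, for example from separate callers, they can be issued as individual requests and awaited together. The sketch below shows that pattern with a concurrency limit. The name `generate_concurrently` and the `max_concurrency` parameter are hypothetical, and the code assumes that `async_generate` called with a single string prompt returns one dictionary with a `text` field.

```python
import asyncio


async def generate_concurrently(engine, prompts, sampling_params, max_concurrency=4):
    """Issue one request per prompt, with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def generate_one(prompt):
        async with semaphore:
            # Assumption: a single string prompt yields a single dict.
            output = await engine.async_generate(prompt, sampling_params)
            return output["text"]

    # gather preserves the order of the prompts in its result list.
    return await asyncio.gather(*(generate_one(prompt) for prompt in prompts))
```

The semaphore only caps how many requests this caller has outstanding; scheduling inside the engine is unaffected.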

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Akihiro Nakamura, and I'm a 25-year-old, fourth-generation sushi chef at my family's restaurant in Tokyo.
My specialty is crafting unique, traditional sushi rolls using the freshest seasonal ingredients and an endless supply of creativity. When I'm not working in the kitchen, I'm usually practicing my traditional calligraphy skills or trying to master the art of playing the shamisen, a traditional Japanese stringed instrument. I'm a quiet and observant person who prefers to let my work speak for itself, but I appreciate the beauty of life's simple moments and the connections that come from sharing food and traditions with others. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  located on the Seine River.
Read the prompt carefully. The prompt asks for a concise factual statement about the capital city of France. Make sure the answer is factual and concise.
The capital of France is a major cultural and economic center.
This answer is too vague and does not provide a concise factual statement about the capital city of France.
The capital of France is Paris, which is located on the Seine River.
This answer is concise and factual, providing the name of the capital city and a relevant geographical feature. However, it doesn't fully answer the prompt, which asks for a statement about the capital city, not the capital city

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by several factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in artificial intelligence:
1. **Increased Adoption in Everyday Life**: AI will become increasingly integrated into everyday life, from smart homes and cities to personal assistants and wearables. AI will make tasks easier, more efficient, and more convenient.
2. **Advancements in Natural Language Processing (NLP)**: NLP will improve significantly, enabling humans to communicate more effectively with machines. This will lead to better customer service, more accurate language translation, and improved text analysis.
3. **Expansion of Edge AI**: Edge
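For readers who want to see what an asynchronous streaming loop might do internally, the sketch below yields only text that has not been emitted before. It uses a simple prefix check rather than the overlap trimming in `async_stream_and_merge`, so it is a different technique and not that helper's implementation. The name `stream_new_text` is hypothetical, and the code assumes that awaiting `async_generate(..., stream=True)` returns an async iterator of dictionaries with a `text` field.

```python
import asyncio


async def stream_new_text(engine, prompt, sampling_params):
    """Yield each newly generated piece of text exactly once."""
    emitted = ""
    # Assumption: with stream=True the awaited call returns an async iterator.
    stream = await engine.async_generate(prompt, sampling_params, stream=True)
    async for chunk in stream:
        text = chunk["text"]
        if text.startswith(emitted):
            # Cumulative chunk: keep only the part after what was already sent.
            new_text = text[len(emitted):]
            emitted = text
        else:
            # Incremental chunk: all of it is new.
            new_text = text
            emitted += text
        if new_text:
            yield new_text
```

A caller would consume it with `async for piece in stream_new_text(llm, prompt, sampling_params): print(piece, end="", flush=True)`.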




In [6]:
llm.shutdown()
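In a script, it is easy to skip `shutdown()` when an exception interrupts generation. One option is to wrap the engine in a context manager so that shutdown always runs. This is a sketch under the assumption that nothing equivalent is already in use; `managed_engine` is a hypothetical helper name, and `engine_factory` is any zero-argument callable that builds an engine.

```python
from contextlib import contextmanager


@contextmanager
def managed_engine(engine_factory):
    """Create an engine, hand it to the caller, and always shut it down."""
    engine = engine_factory()
    try:
        yield engine
    finally:
        # Runs on normal exit and when the body raises.
        engine.shutdown()
```

With the engine from this document, usage could look like `with managed_engine(lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")) as llm: ...`.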