# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where an HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## SPECIAL WARNING!!!!

**To launch the offline engine in a Python script, the launch code must sit under an** `if __name__ == "__main__":` **guard, because SGLang uses** `spawn` **mode to create subprocesses. Please refer to this simple example**:

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py
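The script layout described above can be sketched as follows. This is a minimal illustration rather than the contents of `launch_engine.py`: the names `build_engine` and `main` are hypothetical, and the `importlib` check is an addition that lets the sketch exit cleanly on a machine where SGLang is not installed. The `sgl.Engine`, `generate`, and `shutdown` calls are the same ones used in the cells later in this document.

```python
import importlib.util


def build_engine():
    """Create the engine lazily so that importing this file has no side effects."""
    import sglang as sgl  # deferred import: only needed when the engine is launched

    return sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


def main(engine_factory=build_engine):
    """Launch an engine, run one small batch, and shut the engine down."""
    llm = engine_factory()
    outputs = llm.generate(
        ["The capital of France is"], {"temperature": 0.8, "top_p": 0.95}
    )
    texts = [output["text"] for output in outputs]
    llm.shutdown()
    return texts


if __name__ == "__main__":
    # Subprocesses created with `spawn` re-import this file. Without the guard,
    # each of them would try to launch an engine of its own.
    if importlib.util.find_spec("sglang") is None:
        print("sglang is not installed; skipping the engine launch")
    else:
        print(main())
```

Keeping everything that touches the engine inside functions, and calling them only under the guard, means a re-import of the file by a subprocess does nothing.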

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")






### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Ruby Rose, and I'm a London-based actress, model, and DJ. Born in Melbourne, Australia, I'm a natural performer with a passion for taking risks and pushing boundaries.
I've had the privilege of working on some incredible projects, including the hit TV series "Orphan Black" and the film "John Wick: Chapter 2." I've also modeled for top brands like Armani and Gucci, and I've even had the opportunity to DJ at some of the world's most iconic clubs.
But it's not just about the work – it's about the people I meet and the experiences I have along the way.
Prompt: The president of the United States is
Generated text:  reportedly planning to take executive action to address the issue of surprise medical billing.
President Donald Trump has called on Congress to pass legislation to ban surprise medical billing, but so far, lawmakers have been unable to reach an agreement on a bill.
Surprise medical billing occurs when patients receive a bill from a docto
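For offline batch jobs it is often convenient to store results rather than print them. The sketch below pairs each prompt with its generated text and serializes the result as JSON Lines. The helper names `collect_records` and `to_jsonl` are hypothetical; the only engine behavior assumed is what the cell above shows, namely that `generate` takes a list of prompts and returns one dict per prompt with a `"text"` key.

```python
import json
from typing import Any


def collect_records(engine: Any, prompts: list[str], sampling_params: dict) -> list[dict]:
    """Run one batched generate call and pair each prompt with its output text."""
    outputs = engine.generate(prompts, sampling_params)
    return [
        {"prompt": prompt, "text": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines, one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


# Usage with the engine created earlier in this document (not run here):
#   records = collect_records(llm, prompts, sampling_params)
#   with open("outputs.jsonl", "w", encoding="utf-8") as f:
#       f.write(to_jsonl(records) + "\n")
```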

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in Tokyo. I enjoy reading, hiking, and trying new foods. I'm currently working on a novel and experimenting with different writing styles. I'm a bit of a introvert, but I'm always up for a good conversation. I'm interested in learning more about the world and its cultures, and I'm always looking for new experiences to write about. I'm a bit of a perfectionist, but I'm also a firm believer in learning from my mistakes. I'm excited to meet new people and hear their stories. That's me in a nutshell. What do

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about France’s capital city.
The capital of France is Paris.  The city is located in the northern part of the country, along the Seine River.  It is a major cultural and economic center, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum.  Paris is also a popular tourist destination, attracting millions of visitors each year.  The city has a rich history, dating back to the Roman era, and has been a center of art, literature, and science for centuries.  Today, Paris is a vibrant and diverse

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Some experts predict that AI will become increasingly integrated into our daily lives, while others warn of the potential risks and challenges associated with its development. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, with applications such as robotic surgery, personalized medicine, and AI-assisted diagnosis.
2. Widespread adoption of AI in the workplace: AI is already being used in many
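The cell above relies on `stream_and_merge` to remove overlap between streamed chunks. To make the idea concrete, here is a minimal sketch of overlap removal in plain Python. It illustrates the technique only and is not the implementation used by `sglang.utils`; the names `strip_overlap` and `merge_chunks` are hypothetical.

```python
from typing import Iterable


def strip_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that already ends `existing`."""
    max_len = min(len(existing), len(chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk  # no overlap found: keep the whole chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


print(merge_chunks(["The capital", " capital of", " of France"]))
# prints: The capital of France
```

A purely textual check like this cannot tell a repeated boundary from a genuine repetition in the output (for example, a new chunk `"a"` after text ending in `"a"` would be dropped), so it is best treated as a heuristic.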



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Emilia Gray. I work as an art conservator at a small museum in a quiet town.
This introduction doesn't reveal too much about Emilia's personality or background, but gives a sense of what she does and where she is. This can be a good way to introduce a character in a story or script. It's a good example of a neutral self-introduction. A neutral self-introduction is a brief description of oneself that doesn't give away too much information about one's personality, interests, or background. It's good for making a good first impression or introducing oneself in a professional or formal setting. It can also be

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 
The population of Paris is approximately 2.1 million people, however, the larger urban area has a population of around 12.2 million people. 
Paris is s
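When requests arrive independently, as in a custom server built on the engine, each prompt can be submitted as its own coroutine. The sketch below shows one way to do that with `asyncio.gather` and a semaphore that caps the number of requests in flight. The helper name `generate_concurrently` and the `max_in_flight` parameter are hypothetical. The engine call uses the same list form of `async_generate` shown in the cell above, with a one-element list per request.

```python
import asyncio
from typing import Any


async def generate_concurrently(
    engine: Any,
    prompts: list[str],
    sampling_params: dict,
    max_in_flight: int = 2,
) -> list[str]:
    """Submit each prompt as a separate request, with bounded concurrency."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def generate_one(prompt: str) -> str:
        async with semaphore:
            # One-element batch, so the result is a one-element list.
            outputs = await engine.async_generate([prompt], sampling_params)
            return outputs[0]["text"]

    # gather returns results in the order the coroutines were passed in.
    return await asyncio.gather(*(generate_one(p) for p in prompts))


# Usage with the engine created earlier in this document (not run here):
#   texts = asyncio.run(generate_concurrently(llm, prompts, sampling_params))
```

Whether this outperforms a single batched call depends on the workload, since the engine already schedules a batch internally. The pattern is mainly useful when prompts are not all available up front.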

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Lena, and I'm a 25-year-old freelance writer living in a small town. I enjoy reading, hiking, and playing my guitar in my free time.
Writing a neutral self-introduction can be a bit challenging, as it requires finding a balance between providing enough information and not giving away too much about the character's personality or background. Here's a revised introduction that keeps Lena's character neutral:

"Hi, I'm Lena. I'm a writer and a bit of a outdoorsy person. When I'm not working, you can find me exploring the local trails or strumming my guitar."

This introduction still conveys Lena



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Provide a concise factual statement about France’s capital city. The capital of France is Paris.
Provide a concise factual statement about France’s capital city.
The capital of France is Paris.
What is the capital of France? The capital of France is Paris. What is the capital of France? The capital of France is Paris. What is the capital of France? The capital of France is Paris. What is the capital of France? The capital of France is Paris. What is the capital of France? The capital of France is Paris. What is the capital of France? The capital of France is Paris. What is the capital of France



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by various factors including advancements in computing power, data availability, and societal needs. Here are some possible future trends in artificial intelligence:
1. Increased focus on explainability and transparency: As AI becomes more pervasive, there will be a growing need for explainability and transparency in AI decision-making processes. This means that AI systems will need to provide clear and understandable explanations for their decisions, which will help to build trust and accountability.
2. Development of more specialized AI applications: As AI continues to evolve, we can expect to see more specialized AI applications that are tailored to specific industries or tasks. For example, AI-powered medical




In [6]:
llm.shutdown()
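In a script, an exception raised between engine creation and `llm.shutdown()` would skip the shutdown call. One way to guard against that is a small context manager, sketched below. The name `engine_session` is hypothetical, and the only engine method assumed is the `shutdown()` call shown above.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def engine_session(engine_factory: Callable[[], Any]) -> Iterator[Any]:
    """Create an engine and guarantee shutdown() runs, even if the body raises."""
    engine = engine_factory()
    try:
        yield engine
    finally:
        engine.shutdown()


# Usage with a real engine (not run here):
#   import sglang as sgl
#   factory = lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
#   with engine_session(factory) as llm:
#       outputs = llm.generate(prompts, sampling_params)
```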