# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## SPECIAL WARNING!!!!

**To launch the offline engine in your Python scripts, an** `if __name__ == "__main__":` **guard is necessary, since we use** `spawn` **mode to create subprocesses. Please refer to this simple example**:

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py

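The requirement above can be sketched as a standalone script. This is a minimal illustration that assumes only the `sgl.Engine`, `generate`, and `shutdown` calls used later in this document; the command-line handling and the lazy import are choices made for this sketch, not part of SGLang.

```python
import sys


def main(model_path: str) -> None:
    # Imported lazily for this sketch; a top-level import works equally well.
    import sglang as sgl

    # Creating the engine starts worker subprocesses in spawn mode.
    llm = sgl.Engine(model_path=model_path)
    try:
        outputs = llm.generate(
            ["The capital of France is"],
            {"temperature": 0.8, "top_p": 0.95},
        )
        print(outputs[0]["text"])
    finally:
        # Release the engine's subprocesses even if generation fails.
        llm.shutdown()


# Spawned subprocesses re-import this module, so any unguarded top-level
# code would run again in each child. The guard keeps engine creation in
# the parent process only.
if __name__ == "__main__":
    if len(sys.argv) != 2:
        print(f"usage: python {sys.argv[0]} MODEL_PATH")
    else:
        main(sys.argv[1])
```

Keeping all engine work inside `main()` means that importing the file has no side effects, which is what `spawn` mode relies on.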
## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl

from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Liam, and I am excited to be joining the team at Epiphanies as a Digital Communications Specialist! I come from a background in marketing and communications, with a passion for creating engaging and impactful content that resonates with diverse audiences.
As I settle into my new role, I am eager to learn from the talented team at Epiphanies and contribute my skills and experience to help drive the company's digital communications forward. With a strong foundation in social media marketing, content creation, and analytics, I am well-equipped to develop and implement effective digital strategies that support the company's goals and objectives.
Outside of work, I enjoy exploring
Prompt: The president of the United States is
Generated text:  the head of state and head of government of the federal government of the United States, and is the commander-in-chief of the United States Armed Forces. The president is also the leader of the executive branc

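For larger offline jobs, it can help to process prompts in chunks and keep each prompt paired with its generated text so results can be saved incrementally. The sketch below assumes only that `engine.generate(prompts, sampling_params)` returns one dict with a `"text"` key per prompt, as in the cell above. `batched`, `generate_records`, and `EchoEngine` are hypothetical names introduced for illustration; `EchoEngine` is a stand-in so the example runs without a model.

```python
import json
from typing import Any, Dict, Iterator, List


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def generate_records(
    engine: Any,
    prompts: List[str],
    sampling_params: Dict[str, Any],
    batch_size: int = 2,
) -> List[Dict[str, str]]:
    """Run prompts through `engine.generate` chunk by chunk and pair each
    prompt with its generated text."""
    records = []
    for chunk in batched(prompts, batch_size):
        outputs = engine.generate(chunk, sampling_params)
        for prompt, output in zip(chunk, outputs):
            records.append({"prompt": prompt, "text": output["text"]})
    return records


class EchoEngine:
    """Hypothetical stand-in for `sgl.Engine`, used only to run the sketch."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


records = generate_records(EchoEngine(), ["a", "b", "c"], {"temperature": 0.8})
for record in records:
    # One JSON object per line is a convenient format for batch results.
    print(json.dumps(record))
```

With a real engine, replace `EchoEngine()` with the `llm` object created earlier.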
### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and artist living in Tokyo. I enjoy exploring the city's hidden corners and trying new foods. I'm a bit of a introvert, but I'm always up for a good conversation.
This self-introduction is neutral because it doesn't reveal too much about Kaida's personality, interests, or background. It simply states her name, age, occupation, and a few general facts about her life. This is a good way to introduce yourself in a professional or social setting, as it gives people a basic idea of who you are without revealing too much.
Here are a few

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. The city has a population of over 2.1 million people and is a major hub for international business, finance, and tourism.
Here are some key points about Paris:
Capital of France
Located in the northern part of the country
Situated on the Seine River
Largest city in

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems can analyze medical images, identify patterns in patient data, and provide personalized recommendations for treatment.
2. Widespread adoption of AI in customer service: AI-powered chatbots and virtual assistants are becoming increasingly common in customer service, allowing businesses to provide 24/7 support to customers.
3. Growth of edge AI: Edge



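The "overlap removal" mentioned in the cell above can be illustrated in isolation. The sketch below shows one way to merge streamed chunks when a new chunk may repeat the end of the text received so far. The function names are hypothetical, and this is not necessarily how `stream_and_merge` is implemented.

```python
from typing import Iterable


def strip_repeated_prefix(existing_text: str, new_chunk: str) -> str:
    """Return the part of `new_chunk` that does not repeat the end of
    `existing_text`, preferring the longest possible overlap."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks while dropping text repeated between neighbours."""
    merged = ""
    for chunk in chunks:
        merged += strip_repeated_prefix(merged, chunk)
    return merged


chunks = [" Paris", " Paris is", " is the capital"]
print(merge_chunks(chunks))
```

The search is quadratic in the chunk length in the worst case, which is acceptable for short streamed chunks.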
### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Rainier Blackwood. I'm a 19-year-old journalism student at the University of New Haven. I'm originally from a small town in the Pacific Northwest, where I grew up surrounded by dense forests and the sound of the ocean. I'm interested in investigative reporting and enjoy taking long walks in the woods, reading historical nonfiction, and listening to folk music. What do you know about Rainier Blackwood?
Rainier Blackwood is a 19-year-old journalism student at the University of New Haven. He's originally from a small town in the Pacific Northwest. He's interested in investigative reporting. He enjoys taking long walks in

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the northern part of the country. More specifically, it lies in the Île-de-France region. Paris is situated in the Seine River v

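When requests arrive independently, for example in a custom server, each one can be awaited as its own task. The sketch below assumes only that `await engine.async_generate(prompts, sampling_params)` returns a list of dicts with a `"text"` key, as in the cell above. `generate_concurrently`, `max_in_flight`, and `FakeAsyncEngine` are hypothetical names introduced for illustration.

```python
import asyncio
from typing import Any, Dict, List


async def generate_concurrently(
    engine: Any,
    prompts: List[str],
    sampling_params: Dict[str, Any],
    max_in_flight: int = 2,
) -> List[str]:
    """Issue one request per prompt, with at most `max_in_flight` pending."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt: str) -> str:
        async with semaphore:
            outputs = await engine.async_generate([prompt], sampling_params)
            return outputs[0]["text"]

    # gather preserves the order of the awaitables it receives.
    return await asyncio.gather(*(one(p) for p in prompts))


class FakeAsyncEngine:
    """Hypothetical stand-in for `sgl.Engine`, used only to run the sketch."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real request would
        return [{"text": p.upper()} for p in prompts]


texts = asyncio.run(
    generate_concurrently(FakeAsyncEngine(), ["a", "b", "c"], {"temperature": 0.8})
)
print(texts)
```

Passing the whole prompt list in a single `async_generate` call, as shown earlier, is simpler when all prompts are known up front.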
### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through async_stream_and_merge, which yields chunks with overlapping text removed
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Laura, and I'm a 25-year-old writer currently living in Portland, Oregon. I enjoy reading and taking long walks through the city's parks. I'm working on a novel that explores themes of identity and belonging.
Laura is a fictional character in the novel "The Writer's Journey." The novel explores the experiences of young adults navigating the complexities of adulthood and finding their place in the world. The novel is written in a contemporary style and includes elements of magical realism.
I hope this introduction meets your requirements. Let me know if you have any further requests or changes.
Here is a short, neutral self-introduction for the fictional character

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the north-central part of the country.
What Is the Capital of France?
The capital of France is Paris, located in the north-central part of the country. Paris is one of the world’s most famous and romantic cities, known for its stunning architecture, art museums, and cultural landmarks. It is situated in the Île-de-France region, which is the most populous region in France. Paris has a rich history dating back to the Roman era and has been the capital of France since the 13th century. Today, it is a global center for business, education, fashion, and entertainment, attracting millions of

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  rapidly evolving, and several trends are expected to shape its development in the coming years. Some of the most notable trends include:

1.  **Increased Adoption in Industries:** AI is expected to be increasingly adopted in various industries, including healthcare, finance, transportation, and education. This will lead to improved efficiency, productivity, and decision-making.

2.  **Advancements in Deep Learning:** Deep learning techniques, such as neural networks and natural language processing, are expected to continue improving, enabling AI systems to learn from complex data and make more accurate predictions.

3.  **Emergence of Explainable AI (XAI):** As AI

In [6]:
llm.shutdown()
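
In a script, it may be safer to guarantee that `shutdown()` runs even when generation raises an exception. The sketch below wraps engine creation in a context manager. `managed_engine` and `FakeEngine` are hypothetical names for illustration, and the only assumption about the engine is that it exposes the `shutdown()` method used above.

```python
from contextlib import contextmanager


@contextmanager
def managed_engine(engine_factory):
    """Create an engine, hand it to the caller, and always shut it down."""
    engine = engine_factory()
    try:
        yield engine
    finally:
        engine.shutdown()


class FakeEngine:
    """Hypothetical stand-in for `sgl.Engine`, used only to run the sketch."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


created = []


def factory():
    engine = FakeEngine()
    created.append(engine)
    return engine


try:
    with managed_engine(factory) as engine:
        raise RuntimeError("generation failed")
except RuntimeError:
    pass

print(created[0].closed)  # the engine was shut down despite the error
```

With a real engine, the factory could be `lambda: sgl.Engine(model_path=...)`, using the model path of your choice.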