# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Kaci, and I'm a PhD student in the Department of Computer Science at the University of British Columbia. My research focuses on developing artificial intelligence and machine learning techniques to solve real-world problems. I am supervised by Dr. Alvin Cheung and Dr. Christopher Brooks.
I'm originally from California, but I moved to Vancouver to pursue my graduate studies. In my free time, I enjoy hiking, reading, and trying out new recipes in the kitchen. I'm excited to be a part of the UBC Computer Science community and look forward to meeting fellow students, faculty, and staff.
My research interests are in the areas of artificial
Prompt: The president of the United States is
Generated text:  an important office, one that requires a great deal of responsibility and a strong moral compass. In the past, presidents have been chosen for their integrity, wisdom, and ability to make tough decisions in times of crisis.
But it's clear that the cur
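
Offline batch jobs often need to persist their results. The following is a minimal sketch, assuming only what the cell above shows: that each element of `outputs` is a dict with a `"text"` field. The helper name `results_to_jsonl` and the record keys `prompt` and `completion` are hypothetical choices for illustration, not part of SGLang.

```python
import json


def results_to_jsonl(prompts, outputs):
    """Serialize prompt/output pairs as JSON Lines, one record per prompt."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = []
    for prompt, output in zip(prompts, outputs):
        # Assumption: each output is a dict with a "text" field, as in the cell above.
        record = {"prompt": prompt, "completion": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Stand-in data shaped like the values returned by llm.generate(...).
example_prompts = ["The capital of France is"]
example_outputs = [{"text": " Paris."}]
print(results_to_jsonl(example_prompts, example_outputs))
```

With a running engine, the same helper could be called as `results_to_jsonl(prompts, outputs)` right after `llm.generate(...)` and the returned string written to a file.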

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor. I live in a small apartment in the city, where I spend most of my time working on various writing projects and trying to stay organized. I enjoy reading, hiking, and trying out new recipes in my spare time. I'm a bit of a introvert, but I'm always up for a good conversation or a friendly debate. I'm looking forward to meeting new people and making connections in my community.
This self-introduction is neutral because it doesn't reveal any personal opinions or biases. It simply states the character's name, occupation, and interests, and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
The capital of France is Paris.
Paris is the capital and largest city of France, with a population of approximately 2.1 million people within the city limits. It is one of the most famous and romantic cities in the world, known for its stunning architecture, art museums, fashion, cuisine, and historic landmarks such as the Eiffel Tower and Notre-Dame Cathedral. Paris is a major cultural and economic center, and is home to many international organizations, including the United Nations Educational, Scientific and Cultural Organization (UNESCO) and the Organisation for Economic Co-operation and Development (OECD). The city is also a major

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Some experts predict that AI will become increasingly integrated into our daily lives, while others warn of the potential risks and challenges associated with its development. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, helping to improve patient outcomes and reduce healthcare costs.
2. Widespread adoption of AI in industries: AI is already being used in various industries, including finance, transportation,
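
The "overlap removal" mentioned above can be pictured as follows. This is a sketch of one way to merge streamed chunks whose beginnings repeat the end of the text received so far; it is not taken from the SGLang source, and the names `drop_overlap` and `merge_chunks` are hypothetical.

```python
def drop_overlap(existing_text, new_chunk):
    """Return the part of new_chunk that does not repeat the end of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text that overlaps with what was already merged."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


# Stand-in chunks; a real stream would come from the engine.
print(merge_chunks([" Paris is", " is the capital", " capital of France."]))
```

One caveat of this approach: a chunk that legitimately begins with the same characters that end the previous text (for example a repeated word) is also trimmed, so it suits streams that are known to resend overlapping text.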



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Zephyr Wilder. I'm a 19-year-old who lives in a small town surrounded by vast fields and forests. I work part-time as a mechanic at my family's repair shop, fixing cars and bikes for locals.
Zephyr Wilder is a young adult who lives in a small town surrounded by natural beauty. She is a hard worker and a skilled mechanic, having learned the trade from her family's repair shop. Zephyr is a neutral character, not giving away her personality, interests, or motivations in this introduction. The goal is to establish a sense of normalcy and ordinariness, making her more

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. The capital is located in the north central part of the country. It is the most populous city in the country with a population of about 2.1 million people. The city is known for its rich history,
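
When requests arrive independently rather than as one list (for example in a custom server), the asynchronous interface can be combined with standard `asyncio` tools. The sketch below limits the number of in-flight requests with a semaphore. `FakeAsyncEngine` is a hypothetical stand-in so the example runs without a model; it assumes that calling `async_generate` with a single prompt returns a single dict with a `"text"` field.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in exposing an async_generate coroutine."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real awaitable call would
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(engine, prompts, sampling_params, max_in_flight=2):
    """Issue one request per prompt, with at most max_in_flight running at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt):
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # asyncio.gather returns results in the order of its inputs.
    return await asyncio.gather(*(one(p) for p in prompts))


async def demo():
    outputs = await generate_concurrently(
        FakeAsyncEngine(), ["a", "b", "c"], {"temperature": 0.8}
    )
    for output in outputs:
        print(output["text"])


asyncio.run(demo())
```

For a fixed list of prompts, passing the whole list to `async_generate` as in the cell above is simpler and lets the engine schedule the batch itself.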

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Alethea LaRoux. I'm a 22-year-old communications major at the University of Chicago. I'm interested in... read more

alethea laroux

I'm a 22-year-old communications major at the University of Chicago. I'm interested in social justice, activism, and community organizing. In my free time, I enjoy playing guitar, practicing yoga, and exploring the city.
alethea laroux

I'm a 22-year-old communications major at the University of Chicago. I'm interested in social justice, activism, and community organizing. In my free time, I enjoy playing guitar, practicing yoga

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. France is a country located in Western Europe, and Paris is its largest city. The city is situated on the Seine River and is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is a major cultural, economic, and financial center and is home to many museums, art galleries, and historic sites. The city attracts millions of tourists each year. Paris is a cosmopolitan city with a rich history, vibrant culture, and a blend of modern and historic architecture. It is a popular destination for business, tourism, and education. The city has a diverse

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting, and there are several trends that could shape its development in the years to come. Here are some possible future trends in AI:
1. **Increased Adoption in Industries:** AI is expected to become more prevalent in various industries, including healthcare, finance, education, and transportation. Its use will lead to increased efficiency, productivity, and decision-making capabilities.
2. **Advancements in Deep Learning:** Deep learning techniques, such as neural networks, will continue to improve, enabling AI systems to learn and adapt more effectively. This could lead to breakthroughs in areas like computer vision, natural language processing, and speech recognition.
3. **Edge
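
The streaming loop above consumes an asynchronous generator that yields cleaned text pieces. As a rough illustration of that pattern, the sketch below wraps an async stream and yields only the text not already emitted. It is not SGLang's implementation: `fake_stream` and `deduplicated` are hypothetical names, and the stand-in stream assumes each chunk is a dict with a `"text"` field.

```python
import asyncio


async def fake_stream():
    """Hypothetical stand-in for a streaming call; yields dicts with a "text" field."""
    for text in [" Paris", " Paris is", " is the capital."]:
        await asyncio.sleep(0)
        yield {"text": text}


async def deduplicated(stream):
    """Yield only the new text from each chunk, skipping overlap with emitted text."""
    seen = ""
    async for chunk in stream:
        text = chunk["text"]
        overlap = 0
        # Find the longest prefix of this chunk that repeats the end of `seen`.
        for size in range(min(len(seen), len(text)), 0, -1):
            if seen.endswith(text[:size]):
                overlap = size
                break
        fresh = text[overlap:]
        seen += fresh
        if fresh:
            yield fresh


async def collect():
    return [piece async for piece in deduplicated(fake_stream())]


print(asyncio.run(collect()))
```

Printing each yielded piece with `end=""` and `flush=True`, as in the cell above, displays the text incrementally on one line.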




In [6]:
llm.shutdown()