# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
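As a rough illustration of what such a server does at its core, the sketch below wraps an engine-like object in a plain request handler that turns a JSON body into a `generate` call. The names `handle_generate` and `EchoEngine` and the JSON field names are hypothetical and chosen for this example only. `EchoEngine` is a stand-in so the snippet runs without a model; a real server would pass an `sgl.Engine` instance and mount the handler in a web framework.

```python
import json


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine; returns one {'text': ...} dict per prompt."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" (echo of: {p})"} for p in prompts]


def handle_generate(engine, body: bytes) -> bytes:
    """Decode a JSON request, run one generation, and encode a JSON response."""
    request = json.loads(body)
    prompt = request["text"]
    sampling_params = request.get("sampling_params", {})
    # Pass a one-element batch, matching the list-in/list-out usage in this document.
    outputs = engine.generate([prompt], sampling_params)
    return json.dumps({"text": outputs[0]["text"]}).encode("utf-8")


request_body = b'{"text": "Hello", "sampling_params": {"temperature": 0.8}}'
response = handle_generate(EchoEngine(), request_body)
print(response.decode("utf-8"))
```

Keeping the handler independent of any web framework makes it easy to test and to reuse behind whichever transport the server ends up using.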



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Sarah
I am a therapist and a coach, with a passion for empowering individuals to achieve their full potential. My approach is compassionate, non-judgmental and tailored to meet your unique needs.
I have worked with clients from diverse backgrounds and with a wide range of concerns, including anxiety, depression, trauma, relationships, and personal growth. My training includes trauma-informed practices, mindfulness, and cognitive-behavioral therapy (CBT).
I believe that everyone has the strength and resilience to overcome challenges and achieve their goals. My role as a therapist and coach is to provide a safe, supportive and collaborative space for you to explore
Prompt: The president of the United States is
Generated text:  not above the law. The fact that Donald Trump may have obstructed justice is one of the reasons why he should be investigated. The Obama administration had a number of investigations into Trump's business dealings and his 
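As the loop above shows, each element of `outputs` is a dict with a `text` key. For offline batch jobs it is often convenient to persist the results; the following is a minimal sketch that writes prompt/completion pairs as JSON Lines. The helper name `to_jsonl` and the sample data are hypothetical placeholders, not part of SGLang.

```python
import io
import json


def to_jsonl(prompts, outputs, fp):
    """Write one JSON object per prompt/completion pair to a text file object."""
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")


# Placeholder data in the same shape as the `outputs` list above.
sample_prompts = ["The capital of France is"]
sample_outputs = [{"text": " Paris."}]

buffer = io.StringIO()
to_jsonl(sample_prompts, sample_outputs, buffer)
print(buffer.getvalue(), end="")
```

With a real run, `sample_prompts` and `sample_outputs` would be replaced by `prompts` and `outputs`, and `buffer` by a file opened for writing.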

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student. I'm a bit of a bookworm and enjoy reading fantasy novels in my free time. I'm also a member of the school's debate team and enjoy arguing about current events. I'm a bit of a perfectionist, which can sometimes make me come across as uptight or critical, but I'm working on being more open-minded and flexible. I'm a bit of a introvert and prefer to spend time alone or with close friends, but I'm trying to step out of my comfort zone and meet new people. I'm a bit of a dreamer and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country, in the region of Île-de-France. It is situated on the Seine River. Paris is known for its rich history, art, fashion, and cuisine. The city is home to many famous landmarks, including the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. Paris is a major cultural and economic center, and it attracts millions of tourists each year. The city has a population of over 2.1 million people, making it one of the most populous cities in Europe. Paris is a global hub for business, finance, and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Some experts predict that AI will become increasingly integrated into our daily lives, while others warn of the potential risks and challenges associated with its development. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, with the potential to revolutionize the way we diagnose and treat diseases.
2. Rise of autonomous vehicles: Autonomous vehicles are already being tested on public roads, and it's likely
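The "overlap removal" mentioned in the cell above addresses streamed chunks that repeat text which has already been emitted. The sketch below shows one way such trimming could work: find the longest prefix of the new chunk that already ends the accumulated text, and drop it. The names `strip_overlap` and `merge_chunks` are hypothetical, and this is not necessarily how `stream_and_merge` is implemented.

```python
def strip_overlap(existing: str, chunk: str) -> str:
    """Return `chunk` without the longest prefix that already ends `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


print(merge_chunks([" Paris", " Paris is", " is the capital"]))
```

A heuristic like this cannot tell a genuine repetition in the model output apart from a boundary overlap, so it is only appropriate when the stream is known to resend earlier text.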



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Shelly. I live in a small town surrounded by fields and woods. I work as a librarian at the local library, where I help people find books and answer research questions. In my free time, I like to read, walk, and spend time with my family. I'm a quiet and introverted person who enjoys simple pleasures in life. That's me in a nutshell.
This introduction does not reveal anything about Shelly's background, personality traits, or motivations. It simply provides a brief overview of her life and daily activities. It is a neutral, low-key introduction that may help readers or listeners get to know Shelly gradually

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is located in the north-central part of the country and is situated along the Seine River. Paris is known for its iconic landmarks, rich history, an
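Because `async_generate` is awaited, it composes with the usual `asyncio` tools. The sketch below submits several batches concurrently with `asyncio.gather`. `StubEngine` and `run_batches` are hypothetical names, and the stub exists only so the example runs without a model; whether concurrent calls improve throughput with a real engine depends on its scheduler.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in exposing an async_generate coroutine like the one above."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real engine call would while waiting
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    """Await several batches concurrently; results come back in input order."""
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(run_batches(StubEngine(), batches, {"temperature": 0.8}))
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(prompt, "->", output["text"])
```

`asyncio.gather` preserves the order of its arguments, so `results[i]` always corresponds to `batches[i]` regardless of which call finishes first.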

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Ruby Blackwood, and I'm a 25-year-old software engineer. I've been working for a major tech company for three years now, where I specialize in creating user-friendly interfaces for their various applications. I enjoy solving problems and collaborating with my team to deliver high-quality results. When I'm not coding, you can find me trying out new recipes in the kitchen or practicing yoga in my free time. I'm a quiet and observant person, but I'm always up for a good conversation when the mood strikes me. That's a bit about me, I suppose. What do you think? Is this self-introduction effective and


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. France is a country located in Western Europe. Paris is its largest city and is a major cultural and financial center. It is known for its iconic landmarks, art museums, and fashion. Paris is also known as the City of Light due to its historical significance in the Enlightenment. The city has a population of over 2.1 million people within its city limits. The Paris metropolitan area has a population of over 12 million people, making it one of the largest urban areas in the European Union. The city is home to many famous landmarks, including the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  looking bright, with several trends and advancements that are set to shape the industry in the coming years. Some of the key trends include:
1. **Increased adoption of edge AI**: Edge AI involves processing data closer to where it is generated, reducing latency and improving real-time decision-making. This trend is expected to gain momentum as IoT devices and autonomous vehicles become more prevalent.
2. **Advancements in explainable AI (XAI)**: As AI becomes increasingly embedded in decision-making processes, there is a growing need for transparency and accountability. XAI aims to provide insights into how AI models make decisions, enabling humans to understand and trust the



In [6]:
llm.shutdown()
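To make sure `shutdown()` runs even when generation raises an exception, the engine can be wrapped in a context manager. The following is a minimal sketch; `managed_engine` and `DummyEngine` are hypothetical names, and the dummy class only mimics the `shutdown()` method used above so the example runs without a model.

```python
from contextlib import contextmanager


class DummyEngine:
    """Hypothetical stand-in with the same shutdown() method used above."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


@contextmanager
def managed_engine(factory):
    """Create an engine via `factory` and always call shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


try:
    with managed_engine(DummyEngine) as engine:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(engine.closed)
```

With the real engine, `factory` could be something like `lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")`, mirroring the launch cell at the top of this document.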