# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a separate server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
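
As a rough illustration of the custom-server use case, the sketch below puts an engine-like object behind a tiny HTTP endpoint using only the Python standard library. It is not the linked `custom_server.py` example. The `/generate` route, the JSON field names, and the `EchoEngine` stand-in are assumptions made so the block runs without a model; the only engine behavior it relies on is `generate(prompts, sampling_params)` returning a list of dicts with a `text` key, as in the generation cells later in this document.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a model."""

    def generate(self, prompts, sampling_params=None):
        return [{"text": f" (echo) {p}"} for p in prompts]


def handle_generate(engine, body: bytes) -> bytes:
    """Decode a JSON request, run one generation, and encode a JSON reply."""
    request = json.loads(body)
    prompt = request["prompt"]  # assumed request field
    sampling_params = request.get("sampling_params", {})  # assumed request field
    # Submit a one-element batch and take the first result.
    output = engine.generate([prompt], sampling_params)[0]
    return json.dumps({"text": output["text"]}).encode("utf-8")


def make_handler(engine):
    """Build a request handler class bound to a specific engine instance."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/generate":  # assumed route name
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            reply = handle_generate(engine, self.rfile.read(length))
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(reply)))
            self.end_headers()
            self.wfile.write(reply)

    return GenerateHandler


def make_server(engine, host="127.0.0.1", port=8000):
    """Bind a server to host:port; call serve_forever() on it to handle requests."""
    return HTTPServer((host, port), make_handler(engine))


print(handle_generate(EchoEngine(), b'{"prompt": "Hello, my name is"}'))
```

With a real engine in place of `EchoEngine`, `make_server(llm).serve_forever()` would block and answer requests until interrupted. Keeping the request logic in `handle_generate` lets it be exercised without opening a socket.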



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

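if is_in_ci():
    import patch  # imported only when running in CI


# Load the model and start the engine in this process (no HTTP server).
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")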


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Sabrina. I am a new therapist in your community and I am excited to introduce myself to you. I am a Licensed Professional Counselor (LPC) with a passion for helping individuals, couples, and families navigate life's challenges and achieve their goals.
I have experience working with individuals of all ages, from children to adults, and with a wide range of concerns, including anxiety, depression, trauma, relationships, and self-esteem. I also have experience working with couples and families to strengthen their bonds and improve communication.
I am a trauma-informed therapist, which means that I am trained to work with individuals who have experienced trauma and
Prompt: The president of the United States is
Generated text:  a symbol of American democracy and values. The president is elected by the people to serve as the head of the executive branch of the government. He or she has the power to make key decisions and set the overall direction fo
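
A single `generate` call already accepts the whole prompt list. If you prefer to submit a long list in smaller pieces, for example to save partial results between calls, a thin wrapper is enough. The sketch below is one way to do that; `generate_in_chunks`, `chunk_size`, and the `EchoEngine` stand-in are hypothetical names introduced for illustration, and the only engine behavior assumed is the list-in, list-of-dicts-out shape shown in the cell above.

```python
from typing import Dict, Iterator, List


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a model."""

    def generate(self, prompts, sampling_params=None):
        return [{"text": f" (echo) {p}"} for p in prompts]


def generate_in_chunks(
    engine, prompts: List[str], sampling_params: Dict, chunk_size: int = 2
) -> Iterator[Dict]:
    """Yield one {"prompt", "text"} record per prompt, chunk_size prompts per call."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start : start + chunk_size]
        outputs = engine.generate(chunk, sampling_params)
        # Outputs come back in the same order as the submitted prompts.
        for prompt, output in zip(chunk, outputs):
            yield {"prompt": prompt, "text": output["text"]}


records = list(
    generate_in_chunks(EchoEngine(), ["a", "b", "c"], {"temperature": 0.8, "top_p": 0.95})
)
print(records)
```

Because the function is a generator, each chunk is only submitted when the caller asks for its records, so results can be written out incrementally.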

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student. I'm a bit of a bookworm and enjoy reading about history and science. I'm also a member of the school's debate team and enjoy arguing about current events. I'm a bit of a perfectionist, which can sometimes make me come across as a bit bossy or critical. I'm working on being more open-minded and listening to others' perspectives. I'm a bit of a introvert, but I do enjoy spending time with close friends and family. I'm not really sure what I want to do with my life yet, but I'm excited to

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
The capital of France is Paris.
Paris is the capital and largest city of France, with a population of over 2.1 million people within its city limits. It is the center of France's economy, culture, and politics, and is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also a major hub for international business, fashion, and tourism, and is home to many of the world's most famous universities and research institutions. The city is divided into 20 arrondissements, or districts, and is served by a comprehensive public

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, with the potential to revolutionize the way we diagnose and treat diseases.
2. Rise of autonomous vehicles: Autonomous vehicles are already being tested on public roads, and it's likely that they will become more common in the future. AI will play a key role in the development of autonomous vehicles, enabling them to navigate complex
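
The banner printed by the cell above mentions overlap removal. One way to picture the idea: if a newly received chunk begins with text that already ends the accumulated output, only the new tail is appended. The helper below is an illustrative sketch of that idea and is not necessarily how `stream_and_merge` is implemented; `strip_overlap` and `merge_chunks` are names chosen for this example.

```python
def strip_overlap(existing: str, chunk: str) -> str:
    """Return the part of chunk that does not repeat the end of existing."""
    max_len = min(len(existing), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, skipping any prefix that repeats the text so far."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


print(merge_chunks([" The capital", " capital of France", " is Paris."]))
```

This heuristic can also drop a legitimately repeated character or word at a chunk boundary, so treat it as a picture of the technique rather than a drop-in replacement for the library helper.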



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Felicity A. Nightshade. I'm a 25-year-old freelance journalist with a passion for true crime stories. I'm based in the city of Ashwood, where I've spent the last five years honing my writing skills and building a network of contacts in the local police department and courts. I've written for several prominent online publications and have a knack for getting to the bottom of a good mystery. I'm always on the lookout for my next big scoop, and I'm not afraid to take risks to get the story. I'm a bit of a loner, but I have a strong sense of justice and a willingness

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The University of Virginia’s first president was a notable historical figure. Thomas Jefferson served as the first president of the University of Virginia.
France has a rich history that includes
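
Because `async_generate` is awaited, several calls can be scheduled together with `asyncio.gather`. The sketch below shows the pattern with a hypothetical `EchoAsyncEngine` stand-in so that it runs without a model; `run_batches` is a name introduced here, and whether concurrent calls bring any benefit with a real engine is an assumption that depends on the engine's scheduler.

```python
import asyncio


class EchoAsyncEngine:
    """Hypothetical stand-in exposing an async_generate coroutine."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0)  # yield control, as a real awaitable call would
        return [{"text": f" (echo) {p}"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    """Await several async_generate calls concurrently; results keep batch order."""
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(run_batches(EchoAsyncEngine(), batches, {"temperature": 0.8}))
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```

`asyncio.run` raises a `RuntimeError` when an event loop is already running, so in such an environment you would write `await run_batches(...)` instead.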

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware async_stream_and_merge helper instead of calling async_generate directly.
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Emily, but my friends call me Em. I'm a college student studying computer science and engineering. I'm from a small town in Ohio, but I've been living in New York City for the past year to pursue my education. I like to think of myself as a curious and creative problem solver. In my free time, I enjoy hiking and trying out new restaurants in the city. I'm a bit of a introvert, but I'm always up for a good conversation.
The self-introduction should be short and to the point, but also reveal some character traits and background information. It should also be neutral, meaning it should



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is often referred to as the City of Light. This nickname has a rich history behind it, and it reflects the city's significance in the Enlightenment.
France is known for its beautiful cities, rich history, and stunning art. Many of the most famous works of art have been created in France, including the Mona Lisa by Leonardo da Vinci.
The Eiffel Tower is a famous landmark in Paris, France. It was built in 1889 for the World's Fair and stands at 324 meters tall.
The Palace of Versailles is a beautiful palace that is located near Paris. It was the royal palace of France from



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be influenced by various factors, including advances in computing power, the development of new algorithms, and the availability of large datasets. Here are some possible future trends in AI:
1. Increased Adoption of AI in Everyday Life: AI will become increasingly integrated into our daily lives, from virtual assistants like Siri and Alexa to more advanced applications in healthcare, finance, and education.
2. Rise of Explainable AI: As AI becomes more pervasive, there will be a growing need to understand how AI systems make decisions. Explainable AI (XAI) will become a critical area of research, enabling humans to understand the reasoning behind AI-driven decisions
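
The streaming loop above prints each chunk as it arrives. If the complete text is also needed afterwards, the chunks can be accumulated while they are displayed. The sketch below shows that pattern with a hypothetical `fake_stream` async generator standing in for the engine stream; `collect_stream` and `on_chunk` are names introduced for this example.

```python
import asyncio


async def fake_stream(text: str):
    """Hypothetical chunk source that yields text word by word, mimicking a stream."""
    for i, word in enumerate(text.split(" ")):
        await asyncio.sleep(0)  # yield control between chunks
        yield word if i == 0 else " " + word


async def collect_stream(chunks, on_chunk=None) -> str:
    """Consume an async iterator of text chunks and return the joined text."""
    parts = []
    async for chunk in chunks:
        if on_chunk is not None:
            on_chunk(chunk)  # e.g. lambda c: print(c, end="", flush=True)
        parts.append(chunk)
    return "".join(parts)


full_text = asyncio.run(collect_stream(fake_stream("Paris is the capital of France.")))
print(full_text)
```

With a real engine, the same helper could be given the async iterator returned by `async_stream_and_merge(llm, prompt, sampling_params)` in place of `fake_stream(...)`.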




In [6]:
llm.shutdown()
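
In a script, as opposed to a notebook, it can be convenient to guarantee that `shutdown()` runs even when generation raises an exception. The sketch below wraps engine creation in a context manager; `engine_session` is a hypothetical helper introduced here, and `EchoEngine` is a stand-in so the block runs without a model.

```python
from contextlib import contextmanager


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine that records whether it was shut down."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params=None):
        return [{"text": f" (echo) {p}"} for p in prompts]

    def shutdown(self):
        self.closed = True


@contextmanager
def engine_session(make_engine):
    """Create an engine, hand it to the caller, and always call shutdown()."""
    engine = make_engine()
    try:
        yield engine
    finally:
        engine.shutdown()


with engine_session(EchoEngine) as engine:
    outputs = engine.generate(["Hello, my name is"], {"temperature": 0.8})

print(outputs[0]["text"], engine.closed)
```

With a real engine, the factory argument could be a lambda that constructs `sgl.Engine` with the same `model_path` used at the top of this document.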