# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example in the form of a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
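
As a rough illustration of the custom-server idea, the sketch below wraps an engine-like object in a plain request handler that maps a JSON payload to a JSON response. It assumes only that the engine has a `generate(prompt, sampling_params)` method returning a dict with a `"text"` key, as used later in this document. `EchoEngine` and `handle_generate` are hypothetical names introduced for illustration, and the linked script remains the reference example.

```python
import json


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a model."""

    def generate(self, prompt, sampling_params=None):
        return {"text": f" <completion for: {prompt}>"}


def handle_generate(engine, raw_body: str) -> str:
    """Parse a JSON request, call the engine, and return a JSON response string."""
    try:
        payload = json.loads(raw_body)
        prompt = payload["prompt"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return json.dumps({"error": "body must be JSON with a 'prompt' field"})
    sampling_params = payload.get("sampling_params", {})
    output = engine.generate(prompt, sampling_params)
    return json.dumps({"text": output["text"]})


if __name__ == "__main__":
    engine = EchoEngine()
    print(handle_generate(engine, '{"prompt": "The capital of France is"}'))
```

A route in any web framework could call `handle_generate` with the request body, with `EchoEngine` swapped for an engine created once at startup.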



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Kristen and I'm a 29-year-old freelance writer and editor living in Chicago. I'm excited to join the Red Thread community and connect with like-minded individuals who share my passions for writing, editing, and storytelling.
I've been writing for over a decade, and my work has appeared in various publications, including the Chicago Tribune, Chicago Reader, and Time Out Chicago. I've also edited for several clients, including authors, entrepreneurs, and small businesses.
I'm particularly interested in writing about social justice, culture, and technology, and I'm always looking for new ways to tell engaging stories that spark conversation and inspire action. When I
Prompt: The president of the United States is
Generated text:  responsible for executing the laws of the United States. The president is also the commander-in-chief of the armed forces of the United States. The president has the power to grant reprieves and pardons to individuals con

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor living in a small town in the Pacific Northwest. I enjoy hiking, reading, and trying out new recipes in my spare time. I'm a bit of a introvert, but I'm always up for a good conversation.
Kaida is a 25-year-old freelance writer and editor living in a small town in the Pacific Northwest. She enjoys hiking, reading, and trying out new recipes in her spare time. Kaida is a bit of an introvert, but she's always up for a good conversation.
Kaida is a 25-year-old freelance writer and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks, including the Eiffel Tower, the Louvre Museum, and Notre Dame Cathedral. The city has a population of over 2.1 million people and is a major hub for international business, finance, and tourism. Paris is also known for its romantic atmosphere and is often referred to as the City of Light. The city has a diverse population and is home to people from all over

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in AI:
1. Increased Adoption in Everyday Life: AI is likely to become more ubiquitous in everyday life, with applications in areas such as healthcare, finance, transportation, and education.
2. Advancements in Machine Learning: Machine learning, a subset of AI, is expected to continue to advance, enabling AI systems to learn from data and improve their performance over time.
3. Rise of Explainable AI: As AI becomes more pervasive, there will be a growing need to understand how AI systems make decisions, leading
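
The `stream_and_merge` helper hides how streamed chunks are stitched together. The following is a minimal sketch of one way overlap removal could work, assuming each new chunk may repeat a suffix of the text received so far. The names `trim_overlap_sketch` and `merge_chunks` are hypothetical, and this is not presented as SGLang's actual implementation.

```python
def trim_overlap_sketch(existing: str, new_chunk: str) -> str:
    """Return new_chunk without the longest prefix that repeats a suffix of existing."""
    max_overlap = min(len(existing), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, dropping repeated text at each boundary."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


if __name__ == "__main__":
    # Illustrative chunks only; real engine output will differ.
    print(merge_chunks(["The capital", " capital of France", " France is Paris."]))
```

A suffix/prefix match like this cannot tell an accidental repeat from an intended one (for example, "ha" followed by "ha"), so it is only appropriate when chunks are known to overlap.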



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Anastasia Reimann, but most people call me Ana. I'm a 22-year-old freelance writer living in a small town surrounded by lush green forests. I spend most of my free time hiking, reading, and experimenting with new recipes in my tiny kitchen. When I'm not working or outdoors, you can find me curled up with a good book and a cup of hot tea.
What are some common features of a self-introduction?
A self-introduction typically includes the person's name, age, occupation or studies, and any other relevant information that might be of interest to others.
What are some tips for writing a self

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The Eiffel Tower is one of the most iconic landmarks in France, and it has been rebuilt or renovated several times since its original construction in the late 19th century. Th
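
One reason to prefer the asynchronous API is that several requests can be awaited concurrently from a single event loop. The sketch below illustrates that pattern with `asyncio.gather`, assuming an engine object that exposes an `async_generate(prompt, sampling_params)` coroutine returning a dict with a `"text"` key, as in the cell above. `FakeEngine` and `generate_concurrently` are hypothetical names so the snippet runs without a GPU; they are not part of SGLang.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for an engine; echoes the prompt after a short delay."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(engine, prompts, sampling_params):
    """Issue one request per prompt and await them all together."""
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    # gather preserves input order, so results line up with prompts.
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    prompts = ["Hello, my name is", "The capital of France is"]
    outputs = asyncio.run(
        generate_concurrently(FakeEngine(), prompts, {"temperature": 0.8, "top_p": 0.95})
    )
    for prompt, output in zip(prompts, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Passing the whole list in one call, as the cell above does, already hands every prompt to the engine at once; per-request tasks are mainly useful when requests arrive at different times.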

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream chunks through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Emilia Grey and I work as a freelance writer in a small town. I enjoy the quiet pace of life and spend my free time exploring the local trails and reading old books. I'm a bit of a introvert and value my alone time, but I'm always happy to meet new people and hear their stories. I'm looking forward to meeting you and learning more about you. Suggested answers: a) I am happy to meet you and learn more about you. b) I am a good friend and would love to help you learn more about you. c) I am a skilled writer and would love to collaborate with you on



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is a famous city known for its historic landmarks, fashion, and cuisine. The city is home to many world-class museums, such as the Louvre and the Orsay. The Eiffel Tower is a famous landmark and a symbol of Paris.
Paris is also known for its romantic atmosphere, with its beautiful parks and gardens, such as the Luxembourg Gardens. The city is a popular tourist destination, attracting millions of visitors each year.
In addition to its cultural and historical significance, Paris is also a major economic and financial center, hosting many international organizations and multinational companies. The city has a diverse population and a strong sense of



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  rapidly evolving, with new developments and innovations emerging continuously. Here are some possible future trends in artificial intelligence that are worth considering:
1. Increased Integration with Other Technologies: As AI continues to advance, we can expect to see increased integration with other technologies, such as the Internet of Things (IoT), blockchain, and augmented reality (AR). This integration will enable AI systems to access and analyze vast amounts of data from various sources, leading to more accurate predictions and decision-making.
2. More Human-like Intelligence: Future AI systems will likely become more human-like in their intelligence, with the ability to understand and replicate human emotions, empathy, and




In [6]:
llm.shutdown()