# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch  # only imported when running in CI


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]


Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:02<00:07,  2.44s/it]


Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:05<00:05,  2.79s/it]


Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:08<00:02,  2.83s/it]


Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:09<00:00,  2.10s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:09<00:00,  2.33s/it]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Jamie and I am a PhD student studying Mathematics at the University of Auckland. I am currently in my second year of study and am really enjoying it so far. I am interested in algebraic topology and am currently working on a project involving homology and cohomology.
I am also passionate about science communication and am involved with a number of outreach and education projects. I believe that science and mathematics are essential for understanding the world we live in and I am committed to making them accessible to everyone.
In my free time, I enjoy playing music, hiking and trying out new recipes in the kitchen.

## Step 1: Understand the question

Prompt: The president of the United States is
Generated text:  expected to provide a clear vision of the country’s priorities and goals during the annual State of the Union address. While the speech is not a law or a binding document, it sets the tone for the country’s legislative agenda for the 

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and artist living in a small town in the Pacific Northwest. I enjoy hiking, reading, and trying out new recipes in my spare time. I'm a bit of a introvert, but I love meeting new people and hearing their stories. I'm currently working on a novel and a graphic novel, and I'm excited to see where my creative projects take me. I'm looking forward to connecting with like-minded individuals and learning from their experiences. How can I help you today?
This introduction is neutral because it doesn't reveal too much about Kaida's personality, background, or

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is located in the northern part of the country, in the region of Île-de-France. Paris is known for its rich history, cultural landmarks, and romantic atmosphere. The city is home to many famous museums, such as the Louvre and the Orsay, and iconic landmarks like the Eiffel Tower and Notre-Dame Cathedral. Paris is also a major hub for fashion, cuisine, and art, attracting millions of tourists and visitors each year. The city has a population of over 2.1 million people and is a global center for business, finance, and culture. Paris is a city that seamlessly blends

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by several factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, including the development of AI-powered robots that can assist with surgeries and other medical procedures.
2. Widespread adoption of AI in education: AI is already being used in education to personalize learning experiences, grade assignments, and provide feedback to students. In the future,
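
The streaming cell above calls `stream_and_merge`, and its banner mentions overlap removal. Below is a minimal sketch of what overlap-aware merging could look like, assuming each streamed chunk is a dict with a `text` field that may repeat the tail of the text already received. The names `trim_overlap`, `merge_stream`, and `fake_stream` are illustrative, and the chunks are made up; this is not presented as the implementation in `sglang.utils`.

```python
from typing import Iterable, Iterator


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks: Iterable[dict]) -> str:
    """Concatenate streamed chunks, skipping text that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk["text"])
    return merged


def fake_stream() -> Iterator[dict]:
    # Hypothetical stand-in for a streaming engine call; not real SGLang output.
    yield {"text": " Paris"}
    yield {"text": " Paris is the"}
    yield {"text": " the capital."}


result = merge_stream(fake_stream())
print(repr(result))
```

One design caveat of this simple approach: a chunk that legitimately repeats the end of the previous text (for example `"ha"` followed by `"ha"`) would also be trimmed, so it fits best when repeated prefixes are known to be duplicates.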



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Viktor Kuznetsov. I'm 32 years old, born and raised in the small town of Vostok in Eastern Siberia. I work as a mechanic in my family's garage, fixing cars and trucks for the locals. I'm a laid-back person who enjoys the simple things in life, like fishing and hiking in the taiga.
Here are a few suggestions for how you could revise the introduction to add a bit more depth and personality:

*   Add a few details about Viktor's interests or hobbies that give insight into his personality. For example, you could mention that he's an avid reader of science fiction novels,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Provide a concise factual statement about France’s capital city.
The capital of France is Paris. Paris is the largest city in France and is located in the northeastern part of the country, ne
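
Because the asynchronous call is awaited, other coroutines can run on the same event loop while it is pending. The sketch below illustrates that pattern with `asyncio.gather`, using a hypothetical `fake_async_generate` coroutine that only echoes its prompts in place of a real engine. Whether submitting several batches concurrently helps with the actual engine is an assumption to verify for your setup.

```python
import asyncio


async def fake_async_generate(prompts, sampling_params):
    # Hypothetical stand-in for an awaitable batch call; it only echoes prompts.
    await asyncio.sleep(0.01)
    return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(batches, sampling_params):
    # Schedule every batch at once; gather returns results in input order.
    tasks = [fake_async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(run_batches(batches, {"temperature": 0.8, "top_p": 0.95}))

for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Note that `asyncio.run` raises a `RuntimeError` when an event loop is already running, which is typically the case inside a Jupyter kernel; there you would `await run_batches(...)` directly instead.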

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream chunks through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Avis, and I'm a skilled codebreaker with a passion for cryptography. I'm currently working as a freelance analyst, taking on projects that allow me to challenge my skills and learn new techniques. I'm fluent in multiple programming languages, and my expertise spans a range of cryptographic algorithms, including AES, RSA, and Elliptic Curve Cryptography. When I'm not working, you can find me experimenting with new cryptographic methods or attending security conferences to stay up-to-date on the latest developments in the field.
This self-introduction is neutral because it:
Provides a clear, concise overview of Avis's professional background and expertise
Avoid

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The Eiffel Tower is one of the city's most famous landmarks. Paris is known for its fashion industry, and the city is home to the world-renowned fashion houses Chanel and Dior.
The Seine River runs through the heart of the city, and many famous bridges are located along its banks. Paris is a popular tourist destination and attracts millions of visitors every year. The Louvre Museum is one of the city's most visited attractions, and it houses a vast collection of art and artifacts from around the world.
Paris is a city of history, culture, and romance, and it has something to offer for everyone.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a topic of much debate and speculation, but here are some possible future trends:
1. Increased automation: AI will continue to automate many tasks, freeing humans from mundane and repetitive work. This could lead to increased productivity and efficiency, but also raises concerns about job displacement and the need for workers to develop new skills.
2. More human-like intelligence: As AI becomes more advanced, it may become increasingly difficult to distinguish from human intelligence. This could lead to breakthroughs in areas like natural language processing, machine learning, and computer vision.
3. Widespread adoption: AI will become increasingly ubiquitous in many areas of life, including healthcare,

In [6]:
llm.shutdown()
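
The final cell releases the engine with `llm.shutdown()`. In a standalone script, one way to make sure that call still happens when generation raises is a `try`/`finally` block. Here is a minimal sketch, using a hypothetical `StandInEngine` class that only mimics the `generate` and `shutdown` methods seen in this document; it is an illustration of the pattern, not the engine's actual behavior.

```python
class StandInEngine:
    """Hypothetical stand-in exposing generate() and shutdown()."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": f" <completion for: {p}>"} for p in prompts]

    def shutdown(self):
        self.is_shut_down = True


def run_and_shutdown(engine, prompts, sampling_params):
    # The finally clause runs whether generate() returns or raises.
    try:
        return engine.generate(prompts, sampling_params)
    finally:
        engine.shutdown()


engine = StandInEngine()
try:
    run_and_shutdown(engine, [], {"temperature": 0.8})
except ValueError as err:
    print(f"generation failed: {err}")
print("engine shut down:", engine.is_shut_down)
```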