# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.27it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.31it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.29it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.78it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.57it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Hannah. I am a photographer based in Melbourne, Australia. I am so glad you have stopped by my website. I specialise in capturing the love and joy of your special day through photography. I am passionate about creating timeless and beautiful images that tell the story of your wedding day.
I am based in Melbourne and I love capturing weddings all over Victoria. I have a keen eye for detail and I am always looking for unique and creative ways to capture your special day.
I love getting to know my couples and understanding what makes them tick. This helps me to create a truly personal and unique photo story for you.
I am committed to making
Prompt: The president of the United States is
Generated text:  the head of state and head of government of the United States, and is the highest official in the federal government. The president is directly elected by the people through the Electoral College, which was established by the Founding Fathers at th
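
The cell above pairs each prompt with its output by position. For offline jobs it is often convenient to store those pairs as records. The following is a minimal sketch, assuming only what the cell above shows: an object whose `generate(prompts, sampling_params)` returns one dict with a `"text"` key per prompt. `EchoEngine`, `batch_to_records`, and `records_to_jsonl` are hypothetical names introduced for illustration; `EchoEngine` is a stand-in so the snippet runs without a GPU and is not part of SGLang.

```python
import json


class EchoEngine:
    """Hypothetical stand-in for an engine; NOT part of SGLang."""

    def generate(self, prompts, sampling_params):
        # Mimic the output shape used above: one dict with a "text" key per prompt.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def batch_to_records(engine, prompts, sampling_params):
    """Run one batch and pair each prompt with its generated text."""
    outputs = engine.generate(prompts, sampling_params)
    if len(outputs) != len(prompts):
        raise ValueError("expected one output per prompt")
    return [
        {"prompt": prompt, "text": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


def records_to_jsonl(records):
    """Serialize records as JSON Lines, one object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


records = batch_to_records(
    EchoEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)
print(records_to_jsonl(records))
```

In a real run, the `llm` object created earlier would presumably take the place of `EchoEngine()`.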

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student. I'm a bit of a bookworm and enjoy reading about history and science. I'm also a member of the school's debate team and have a passion for public speaking. When I'm not studying or participating in extracurricular activities, I like to spend time with my friends and family, watching movies, and playing video games. I'm a bit of a perfectionist, but I'm working on being more relaxed and open-minded. I'm excited to meet new people and learn more about their interests and experiences.
This self-introduction is neutral because it doesn

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. The city is also known for its romantic atmosphere and is often referred to as the City of Light. Paris is a major tourist destination and is visited by millions of people each year. The city has a population of over 2.1 million people and is a hub for business, education, and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. While it is difficult to predict exactly what the future will hold, there are several trends that are likely to shape the development and impact of artificial intelligence in the coming years. Here are some possible future trends in AI:
1. Increased Adoption of AI in Various Industries: AI is already being used in various industries such as healthcare, finance, transportation, and customer service. In the future, we can expect to see increased adoption of AI in other industries such as education, agriculture, and manufacturing.
2. Advancements in Natural Language Processing (NLP): NLP is a subset of AI that deals with
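
The "overlap removal" mentioned in the cell above can be illustrated in isolation. The sketch below is one plausible way to merge streamed chunks whose beginnings repeat the end of the text received so far; it is an illustration under that assumption, not necessarily how `stream_and_merge` is implemented. `trim_overlap` and `merge_chunks` are hypothetical helpers defined here.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that is also a suffix of existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, removing text that repeats the end of the merged output."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


chunks = [" Paris. Paris is", " Paris is located", " located in the north"]
print(merge_chunks(chunks))
```

A suffix/prefix heuristic like this cannot distinguish a repeated chunk boundary from text that the model genuinely repeats, so it is only suitable when chunks are known to overlap.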



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Noah Fournier. I'm a 25-year-old college student, and I'm majoring in English literature. I enjoy reading, writing, and playing guitar in my free time.
What is the main purpose of a self-introduction?
The main purpose of a self-introduction is to briefly provide information about oneself, such as one's name, profession, education, interests, and other relevant details, in a professional or social setting. It serves as a way to establish a connection with others and help them get to know you.
What is the tone of the given self-introduction?
The tone of the given self-introduction is

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is the largest city in France and is located in the northern part of the country in the region of Île-de-France. It is situated on the Seine River and is known for its hi
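
Because `async_generate` is awaited, it can be combined with other coroutines in the same event loop. The sketch below shows the general `asyncio.gather` pattern for awaiting several independent batches. `FakeAsyncEngine` is a hypothetical stand-in that only mimics the call shape used above; whether issuing concurrent calls like this is appropriate for a real engine is an assumption to verify against the SGLang examples.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in; NOT part of SGLang."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do some work
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def generate_groups(engine, prompt_groups, sampling_params):
    """Await several independent batches concurrently; results keep group order."""
    tasks = [engine.async_generate(group, sampling_params) for group in prompt_groups]
    return await asyncio.gather(*tasks)


async def demo():
    groups = [
        ["Hello, my name is"],
        ["The capital of France is", "The future of AI is"],
    ]
    sampling_params = {"temperature": 0.8, "top_p": 0.95}
    return await generate_groups(FakeAsyncEngine(), groups, sampling_params)


results = asyncio.run(demo())
for group in results:
    for output in group:
        print(output["text"])
```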

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Zara and I'm a professional landscape designer. I work with a variety of clients to create unique outdoor spaces that blend seamlessly into their surroundings. I've been in the business for over five years and have a solid understanding of what makes a space truly beautiful. I'm also a bit of a nature lover and enjoy spending time outdoors whenever possible.
A complete and well-structured essay requires a strong introduction that sets the tone for the rest of the essay. The introduction provides background information on the topic and guides the reader through the rest of the essay. Here are some tips to help you write a great introduction: Choose a strong opening sentence:

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the largest city in the country. Paris is a global center for art, fashion, cuisine, and culture, attracting millions of tourists and international businesses. It has been a major hub for centuries, hosting the French monarchy and the French Revolution, among other historical events.
Now, let’s explore some interesting facts about Paris.
Interesting Facts About Paris
1. Name Origin: The name “Paris” is derived from the Celtic word “Lutetia,” which referred to the city located on the Seine River. The Romans later renamed it “Lutetia Parisiorum” or “Paris of the Parisii,” which

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  one of exciting possibilities and daunting challenges. As AI technology continues to evolve and improve, we can expect to see significant advancements in various areas, including:
More sophisticated natural language processing (NLP) capabilities, enabling AI systems to understand and generate human-like language, leading to improved human-AI collaboration and communication.
Increased use of deep learning and neural networks, allowing AI systems to learn from large datasets and improve their performance in areas like computer vision, speech recognition, and decision-making.
Expansion of AI into new domains, such as healthcare, education, and transportation, where AI can help improve outcomes, efficiency, and safety.
Growing use of Explain
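
The streaming cell above prints each chunk as it arrives. If the full text is also needed afterwards, the chunks can be accumulated while they are displayed. The sketch below shows that pattern with a plain async generator. `fake_stream` and `collect_stream` are hypothetical names; `fake_stream` merely stands in for a chunk stream such as the one iterated over above.

```python
import asyncio


async def fake_stream(chunks):
    """Hypothetical stand-in for a stream of text chunks."""
    for chunk in chunks:
        await asyncio.sleep(0)  # yield control, as a real stream would while waiting
        yield chunk


async def collect_stream(stream, on_chunk=None):
    """Consume an async stream, optionally reporting each chunk, and return the full text."""
    parts = []
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk)
        parts.append(chunk)
    return "".join(parts)


def show(chunk):
    # Print chunks on one line as they arrive.
    print(chunk, end="", flush=True)


text = asyncio.run(
    collect_stream(fake_stream([" Paris", ",", " the", " largest", " city"]), on_chunk=show)
)
print()
```

Collecting chunks in a list and joining once at the end avoids repeated string concatenation for long generations.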




In [6]:
llm.shutdown()
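
In a script, as opposed to a notebook, an exception raised mid-run would skip a trailing `shutdown()` call. One way to guard against that is a small context manager that always calls `shutdown()`. This is a sketch under the assumption that the engine object exposes `shutdown()` as used above; `managed_engine` and `FakeEngine` are hypothetical names, and `FakeEngine` is a stand-in that is not part of SGLang.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in; NOT part of SGLang."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for: {p}>"} for p in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and call shutdown() on exit, even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with managed_engine(engine) as llm:
        llm.generate(["Hello, my name is"], {"temperature": 0.8})
        raise RuntimeError("simulated failure mid-batch")
except RuntimeError:
    pass

print(engine.is_shut_down)  # True: shutdown() ran despite the exception
```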