# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
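
As a rough illustration of the custom-server use case, the sketch below wraps an engine-like object in a small HTTP handler built from the Python standard library. `EchoEngine`, `handle_generate`, `make_handler`, and `serve` are hypothetical names introduced here; `EchoEngine` is a stand-in so the snippet runs without a GPU. With SGLang installed you would pass an `sgl.Engine` instead, assuming its `generate(prompt, sampling_params)` call returns a dict with a `"text"` key as in the examples in this document. The linked `custom_server` script remains the reference example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs anywhere."""

    def generate(self, prompt, sampling_params=None):
        return {"text": f" (echo) {prompt}"}


def handle_generate(engine, body: bytes) -> bytes:
    """Decode a JSON request, call the engine, and encode a JSON response."""
    request = json.loads(body)
    output = engine.generate(request["prompt"], request.get("sampling_params", {}))
    return json.dumps({"text": output["text"]}).encode("utf-8")


def make_handler(engine):
    """Build a request handler class bound to one engine instance."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            payload = handle_generate(engine, self.rfile.read(length))
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return GenerateHandler


def serve(engine, host="127.0.0.1", port=8000):
    """Block and serve POST requests until interrupted."""
    HTTPServer((host, port), make_handler(engine)).serve_forever()
```

Calling `serve(EchoEngine())` starts the server; keeping the request logic in `handle_generate` means it can be exercised without opening a socket.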



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.32it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.25it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.24it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.72it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.52it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Patrick and I'm a digital artist. I specialize in creating unique and custom illustrations and artworks for individuals, businesses, and organizations. I work with a variety of mediums, including digital painting, drawing, and photography.
I've had the pleasure of working with a wide range of clients, from entrepreneurs to large corporations, to create personalized and engaging visual content that meets their specific needs and objectives. My goal is to bring your vision to life and provide you with high-quality, visually appealing artwork that exceeds your expectations.

Some of my services include:

1.  **Custom Illustrations**: I create bespoke illustrations that capture the essence of your brand,
Prompt: The president of the United States is
Generated text:  facing a major crisis on his hands. On Wednesday, a 34-year-old man climbed over the White House fence and ran inside the building, getting as far as the East Room before being caught 
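
For offline batch jobs it is often convenient to persist each prompt together with its completion. The helper below is a minimal sketch of that pattern and is not part of SGLang: `write_jsonl` and `StubEngine` are hypothetical names. The only assumption about the engine is the behaviour shown above, namely that `generate` takes a list of prompts and returns one dict with a `"text"` key per prompt, in the same order.

```python
import io
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; returns one dict per prompt."""

    def generate(self, prompts, sampling_params=None):
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def write_jsonl(engine, prompts, sampling_params, fp):
    """Run one batch and write a JSON object per line: prompt plus generated text."""
    outputs = engine.generate(prompts, sampling_params)
    if len(outputs) != len(prompts):
        raise ValueError("expected exactly one output per prompt")
    for prompt, output in zip(prompts, outputs):
        fp.write(json.dumps({"prompt": prompt, "text": output["text"]}) + "\n")
    return len(outputs)


buffer = io.StringIO()
count = write_jsonl(
    StubEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
    buffer,
)
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
```

In real use, `fp` would be an open file and `StubEngine()` would be replaced by the `llm` object created earlier.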

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in a small town in the Pacific Northwest. I enjoy hiking and reading in my free time. I'm a bit of a introvert, but I'm always up for a good conversation. I'm currently working on a novel and trying to get my writing career off the ground. That's me in a nutshell. What do you think? Is there anything you'd change or add?
I think your self-introduction is clear and concise, and it gives a good sense of who Kaida is. Here are a few suggestions for improvement:
* Consider adding a bit more personality

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. The city is also known for its romantic atmosphere and is a popular tourist destination. Paris is a global center for business, finance, and culture, and is considered one of the most beautiful and iconic cities in the world. The city has a population of over 2.1 million people and is

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems may be able to analyze medical images, identify patterns in patient data, and provide personalized treatment recommendations.
2. Rise of Explainable AI: As AI becomes more pervasive, there is a growing need to understand how AI systems make decisions. Explainable AI (XAI) is a subfield of AI that focuses on developing AI
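
The "overlap removal" mentioned in the cell above guards against a streamed chunk repeating text that has already been emitted. The sketch below illustrates the general idea with two hypothetical helpers, `trim_overlap_sketch` and `merge_chunks`. It is an assumption about the technique, not a copy of what `stream_and_merge` does internally.

```python
def trim_overlap_sketch(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that duplicates a suffix of `existing`."""
    max_len = min(len(existing), len(chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate chunks, trimming any overlap with the text so far."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


merged = merge_chunks([" Paris is", " is the capital", " capital of France."])
print(merged)
```

Note that this is a heuristic: a chunk that legitimately repeats the end of the previous text (for example `"ha"` following `"ha"`) would be trimmed as well.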



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Akira. I'm a 25-year-old botanist who studies rare plant species in the Amazon rainforest. My friends would describe me as adventurous and diligent. I'm not one for big cities or crowds, and I prefer the peacefulness of nature. What do you think of me?
The writer of this introduction is giving us a chance to imagine Akira in action. Akira is a botanist who works in the Amazon rainforest, which is a challenging and beautiful environment. The description of Akira as adventurous and diligent suggests that they are someone who is confident and determined, but also perhaps a bit reserved. Ak

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. The city of Paris is located in the northern region of the country along the Seine River. The Eiffel Tower, a famous landmark, is located in Paris.
This factual statement 
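
When requests arrive independently rather than as one list, a common asyncio pattern is to start one coroutine per prompt and await them together. The sketch below shows that pattern against a hypothetical `StubAsyncEngine`, so it runs without a model. It assumes only that `async_generate` can be awaited with a single prompt and returns a dict with a `"text"` key. `generate_concurrently` and `max_in_flight` are illustrative names, and whether this is preferable to passing the whole list in one call depends on the workload.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine with an awaitable async_generate."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(engine, prompts, sampling_params, max_in_flight=2):
    """Issue one request per prompt, at most `max_in_flight` at a time."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt):
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(one(p) for p in prompts))


results = asyncio.run(
    generate_concurrently(
        StubAsyncEngine(),
        ["prompt A", "prompt B", "prompt C"],
        {"temperature": 0.8, "top_p": 0.95},
    )
)
```

Inside an environment that already runs an event loop, such as a notebook, `asyncio.run` raises a `RuntimeError` unless the loop has been patched; there you would `await generate_concurrently(...)` directly.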

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Agent 93. I'm a seasoned operative with a talent for strategy and problem-solving. I have extensive experience working in high-pressure environments and have a strong background in covert operations. I'm a skilled fighter and tactician, and I'm comfortable working in a variety of different settings and situations. I'm looking forward to new challenges and opportunities to use my skills to make a positive impact.
This self-introduction focuses on the character's professional skills and background, and does not include any personal details or emotional language. It is concise and to the point, and does not reveal too much about the character's personality or motivations. This is a



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Previous Previous post: What country is Australia located in?
Next Next post: What is the capital of France? Paris. Next question: What is the capital of France? Paris. What is the capital of France? Paris. What is the capital of France? Paris. What is the capital of France? Paris. What is the capital of France? Paris. What is the capital of France? Paris. What is the capital of France? Paris. What is the capital of France? Paris. What is the capital of France? Paris. What is the capital of France? Paris. What is the capital of France? Paris.



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  an increasingly popular topic in discussions on technology and society, and several experts have ventured to predict its development. Some predictions have been relatively conservative, while others have been more speculative, including a few that suggest AI may become an existential threat to humanity. Here are some possible future trends in AI:
1. Continued advancement in machine learning and deep learning
2. Increased use of AI in various industries, including healthcare, finance, and education
3. Development of more sophisticated natural language processing and computer vision
4. Emergence of autonomous systems that can interact with the physical world
5. Potential for AI to surpass human intelligence and become a super
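
A streaming consumer usually wants both incremental display and the final text. The sketch below collects chunks from any async iterator of strings while forwarding each one to a callback. `fake_stream` and `collect_stream` are hypothetical names; `fake_stream` stands in for `async_stream_and_merge(llm, prompt, sampling_params)`, which the cell above iterates in the same way.

```python
import asyncio


async def fake_stream(chunks, delay=0.0):
    """Hypothetical stand-in for a streaming generator that yields text chunks."""
    for chunk in chunks:
        await asyncio.sleep(delay)
        yield chunk


async def collect_stream(stream, on_chunk=None):
    """Consume an async iterator of strings and return the concatenated text."""
    parts = []
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk)  # e.g. lambda c: print(c, end="", flush=True)
        parts.append(chunk)
    return "".join(parts)


seen = []
full_text = asyncio.run(
    collect_stream(fake_stream([" Paris", ".", " It", " is"]), on_chunk=seen.append)
)
```

Passing a printing callback reproduces the incremental display, while the return value gives the complete text for logging or post-processing.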




In [6]:
llm.shutdown()
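
To make sure `shutdown()` also runs when generation raises, the engine can be wrapped in a context manager. The following is a minimal sketch using `contextlib`; `engine_session` and `StubEngine` are hypothetical names, and in real use the factory would be something like `lambda: sgl.Engine(model_path=...)`.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records whether it was shut down."""

    def __init__(self):
        self.closed = False

    def generate(self, prompt, sampling_params=None):
        if not prompt:
            raise ValueError("empty prompt")
        return {"text": f" <completion for: {prompt}>"}

    def shutdown(self):
        self.closed = True


@contextmanager
def engine_session(factory):
    """Create an engine, yield it, and always call shutdown() afterwards."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


first = StubEngine()
with engine_session(lambda: first) as engine:
    text = engine.generate("Hello")["text"]

failed = StubEngine()
try:
    with engine_session(lambda: failed) as engine:
        engine.generate("")  # raises, but shutdown() still runs
except ValueError:
    pass
```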