# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Special Warning

**To launch the offline engine from a Python script, wrap the launch code in an** `if __name__ == "__main__":` **guard, because the engine uses** `spawn` **mode to create subprocesses. Please refer to this simple example**:

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py
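
The reason for the guard can be reproduced with the standard library alone. The sketch below does not use SGLang: `double`, `worker`, and `main` are hypothetical names, and the `multiprocessing` call merely stands in for the engine launch. With the `spawn` start method the child process re-imports the main module, so any process-creating code at module level would run again in the child; the guard prevents that. Save it as a script file and run it directly.

```python
import multiprocessing as mp


def double(x):
    """Trivial computation so the worker has something to do."""
    return x * 2


def worker(x, queue):
    # Runs in the child process. With "spawn", the child re-imports
    # this module before calling worker().
    queue.put(double(x))


def main():
    # Stand-in for the engine launch: start one subprocess with "spawn".
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    proc = ctx.Process(target=worker, args=(21, queue))
    proc.start()
    result = queue.get(timeout=60)
    proc.join(timeout=60)
    return result


if __name__ == "__main__":
    # Only the original script enters this branch; the re-import inside
    # the child sees a different module name and skips it.
    print(main())
```

In a real script, the body of `main()` would create the engine and run generation; the structure of the guard stays the same.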

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.28it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Lauren and I am the owner of Little Sunshine Childcare. I am a mother of 2 and have worked with children for over 10 years in various roles. I have a strong passion for early childhood education and providing a nurturing environment for children to learn and grow. I am dedicated to providing the highest quality care and education for the children in my center. I strive to build strong relationships with each family and provide individualized care to meet the unique needs of each child. I am committed to creating a safe, engaging, and inclusive environment that allows children to reach their full potential.
Here at Little Sunshine, we believe in creating a warm
Prompt: The president of the United States is
Generated text:  supposed to represent all Americans. But some Americans have a hard time accepting that, especially when it comes to issues like immigration, health care, or economic policy.
Take, for example, the backlash against President 
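
The cell above relies on the outputs coming back in the same order as the prompts and on each output being a dictionary with a `text` field. As a small sketch of how that result could be post-processed, the hypothetical helper `pair_outputs` below checks the lengths and returns `(prompt, text)` tuples. `fake_outputs` is made-up data standing in for the value returned by `llm.generate`, so the snippet runs without a model.

```python
def pair_outputs(prompts, outputs):
    """Return (prompt, generated_text) tuples, checking that the lengths match."""
    if len(prompts) != len(outputs):
        raise ValueError(
            f"expected {len(prompts)} outputs, got {len(outputs)}"
        )
    return [(prompt, output["text"]) for prompt, output in zip(prompts, outputs)]


# Made-up stand-in for the list returned by llm.generate(prompts, sampling_params).
fake_outputs = [{"text": " Paris."}, {"text": " hard to predict."}]
demo_prompts = ["The capital of France is", "The future of AI is"]

pairs = pair_outputs(demo_prompts, fake_outputs)
for prompt, text in pairs:
    print(f"{prompt!r} -> {text!r}")
```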

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and artist living in a small town in the Pacific Northwest. I enjoy hiking, reading, and trying out new recipes in my spare time. I'm a bit of a introvert, but I love meeting new people and hearing their stories. I'm currently working on a novel and a few art projects that I'm excited to share with the world someday. I'm looking forward to connecting with like-minded individuals and learning from their experiences. That's me in a nutshell! What do you think? Is it neutral enough?
Your self-introduction is neutral and concise. It provides a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about France’s capital city.
The capital of France is Paris.  The city is located in the northern part of the country, along the Seine River.  It is the largest city in France and is known for its rich history, art, fashion, and culture.  The city is home to many famous landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum.  Paris is also known for its romantic atmosphere and is often referred to as the "City of Light."  The city has a population of over 2.1 million people and is a major

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems will be able to analyze medical data, identify patterns, and make predictions about patient outcomes.
2. Widespread adoption of AI in industries: AI is expected to be adopted in various industries, including finance, transportation, and education. AI-powered systems will be able to automate tasks, improve efficiency, and enhance decision-making.
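
The "overlap removal" mentioned in the cell above can be pictured as follows. This is a minimal sketch of the idea, not necessarily how `stream_and_merge` is implemented: it assumes each streamed chunk is a dictionary with a `text` field and that a chunk may repeat text that was already received. The names `trim_overlap` and `merge_stream` and the sample chunks are hypothetical.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks):
    """Concatenate streamed chunks, skipping text that was already received."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk["text"])
    return merged


# Made-up chunks in which later chunks repeat part of the earlier text.
fake_chunks = [
    {"text": " Paris"},
    {"text": " Paris is"},
    {"text": " is the capital."},
]
print(merge_stream(fake_chunks))
```

With a real engine, the chunks would come from the streaming generator instead of a hard-coded list.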




### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Akira and I'm a 20-year-old university student studying computer science. I'm from a small town in the Midwest and I've lived here my whole life. I enjoy reading science fiction novels and spending time outdoors, especially hiking and camping. In my free time, I'm working on building my own computer from scratch. It's been a fun project, but it's also been a bit of a challenge.
Can you help me make this introduction more concise and engaging?
Here's a rewritten version of the introduction:
Hi, I'm Akira, a 20-year-old computer science major from the Midwest. When I'm not

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Provide a concise factual statement about France’s population. The population of France is over 67 million people. Provide a concise factual statement about France’s climate. France has a
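
An asynchronous API is mainly useful when several requests are in flight at once. The sketch below shows one way to issue one request per prompt and await them together with `asyncio.gather`. It runs without a model because `FakeEngine` is a hypothetical stand-in; it assumes that the real `async_generate` can also be awaited with a single prompt and returns a dictionary with a `text` field.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in that mimics an awaitable async_generate call."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # simulate waiting on the model
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(engine, prompts, sampling_params):
    # Create one coroutine per prompt and await them together.
    # asyncio.gather returns results in the order of its arguments.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


demo_prompts = ["The capital of France is", "The future of AI is"]
demo_outputs = asyncio.run(
    generate_all(FakeEngine(), demo_prompts, {"temperature": 0.8, "top_p": 0.95})
)
for prompt, output in zip(demo_prompts, demo_outputs):
    print(prompt, "->", output["text"])
```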

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Akira Yamato. I'm a 22-year-old university student majoring in Environmental Science. I enjoy spending my free time hiking and reading about sustainable living. I'm originally from Tokyo, Japan, but I've been living in Canada for the past two years. What do you think? This introduction is quite straightforward and doesn't reveal much about Akira's personality or background. Here are some suggestions to make it more engaging:

1.  **Add a personal touch**: Instead of simply stating your major, you could mention why you chose Environmental Science. For example, "I'm passionate about creating a more sustainable future, which is



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Capital Cities around the World

A capital city is the seat of government for a country, state, or other region. Capital cities often serve as the center of politics, economy, and culture for their respective areas. Here are some examples of capital cities from around the world:
1. **Paris, France**: Known for its art, fashion, and cuisine, Paris is one of the world's most romantic cities.
2. **Tokyo, Japan**: A bustling metropolis with cutting-edge technology and vibrant culture.
3. **Canberra, Australia**: A planned city with a strong focus on education and research.
4. **



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be shaped by various technological advancements and societal needs. Some possible future trends in AI include:
    1. Increased focus on Explainability and Transparency: As AI becomes more pervasive in decision-making, there will be a growing need for understanding and explaining AI-driven decisions. Techniques like model interpretability and feature attribution will become more prevalent to ensure accountability and trustworthiness.
    2. Integration with Edge AI: With the proliferation of IoT devices and smart sensors, Edge AI will play a crucial role in processing data closer to its source, reducing latency and enabling real-time decision-making. This trend will lead to more efficient and effective AI
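
The asynchronous streaming loop above prints each cleaned chunk as soon as it arrives. The sketch below shows the general shape of such an overlap-aware async generator. It is an illustration under stated assumptions rather than the code behind `async_stream_and_merge`: `fake_stream` is a hypothetical stand-in for the engine's async stream, and every chunk is assumed to be a dictionary with a `text` field that may repeat earlier text.

```python
import asyncio


def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


async def fake_stream():
    # Made-up chunks; later chunks repeat part of the earlier text.
    for text in [" Edge", " Edge AI", " AI matters."]:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield {"text": text}


async def merged_chunks(stream):
    """Yield only the text that has not been seen yet from each chunk."""
    seen = ""
    async for chunk in stream:
        fresh = trim_overlap(seen, chunk["text"])
        seen += fresh
        if fresh:
            yield fresh


async def collect():
    return [piece async for piece in merged_chunks(fake_stream())]


pieces = asyncio.run(collect())
print("".join(pieces))
```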




In [6]:
llm.shutdown()
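
Calling `shutdown()` in a final cell works in a notebook, but in a script an exception raised during generation would skip it. One possible pattern, sketched below, wraps the engine in a context manager so that `shutdown()` always runs. `running_engine` and `FakeEngine` are hypothetical names used only for illustration; with SGLang, the factory would be a function that returns `sgl.Engine(...)`.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in exposing generate() and shutdown() like the engine above."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for: {p}>"} for p in prompts]

    def shutdown(self):
        self.closed = True


@contextmanager
def running_engine(factory):
    """Create an engine, hand it to the caller, and always shut it down."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()  # runs even if the with-body raised


with running_engine(FakeEngine) as demo_llm:
    demo_result = demo_llm.generate(["Hello, my name is"], {"temperature": 0.8})

print(demo_result[0]["text"], "| shut down:", demo_llm.closed)
```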