# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
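The reason for this step can be seen with the standard library alone. The sketch below uses no SGLang code, and `inner` and `outer` are hypothetical names chosen for illustration. It shows that `asyncio.run` refuses to start while another event loop is already running, which is the situation inside IPython or Jupyter.

```python
import asyncio


async def inner():
    # Stand-in for any coroutine a library wants to run to completion.
    return "done"


async def outer():
    # We are already inside a running event loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Without patching, the nested call fails with a `RuntimeError`. `nest_asyncio.apply()` patches asyncio so that such nested calls are allowed.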

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.51it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Carlos and I am a swimming instructor. I am passionate about teaching people how to swim. I have been doing this for many years and have taught people of all ages and skill levels. I am patient, kind, and encouraging, and I love seeing my students learn and grow in the water.
I have a lot of experience teaching children and adults with special needs, and I am confident in my ability to adapt to each student's unique needs and learning style. I am also certified in first aid and CPR, so you can rest assured that you and your loved ones are in good hands with me.
In addition to teaching swimming lessons, I
Prompt: The president of the United States is
Generated text:  the head of the executive branch and is both the head of state and the head of government of the United States. He is responsible for commanding the military, signing laws into effect, and serving as the face of the country on the world stage. He is also the commander-in-chief of t
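For readers unfamiliar with the `temperature` and `top_p` entries in `sampling_params`, the following is a generic sketch of what these two knobs conventionally mean. It is not SGLang's sampler code, and the function name `nucleus_distribution` and the example logits are assumptions made for illustration.

```python
import math


def nucleus_distribution(logits, temperature, top_p):
    """Apply temperature scaling, then keep the smallest set of tokens
    whose cumulative probability reaches top_p, and renormalize."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Visit tokens from most to least likely until top_p mass is covered.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


# Hypothetical logits for a four-token vocabulary.
kept = nucleus_distribution([2.0, 1.0, 0.0, -1.0], temperature=0.8, top_p=0.95)
print(kept)
```

With these example logits, the least likely token falls outside the 0.95 nucleus and is dropped. A lower temperature sharpens the distribution, so fewer tokens are needed to reach the same `top_p`.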

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and artist living in Tokyo. I enjoy exploring the city's hidden corners and trying out new foods. I'm a bit of a introvert, but I'm always up for a good conversation. I'm currently working on a novel and a graphic novel, and I'm excited to see where my creative projects take me. I'm looking forward to meeting new people and learning about their experiences.
This is a good example of a neutral self-introduction because it:
Provides basic information about the character, such as their name, age, and occupation.
Highlights their interests and hobbies, which

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about France’s capital city.
The capital of France is Paris. 
This statement is a concise factual statement about France’s capital city, providing a clear and direct answer to the question. It does not include any additional information or opinions, making it a suitable example of a concise factual statement. 
Note: This response is a direct answer to the question and does not require any further analysis or explanation. It is a simple and straightforward statement of fact. 
Let me know if you would like me to generate another response. 
Here is another example: 
The city of Paris is located in the northern part

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. While it's difficult to predict exactly what the future holds, here are some possible trends that could shape the development and impact of artificial intelligence:
1. Increased Adoption in Everyday Life: AI will become more ubiquitous and integrated into various aspects of daily life, such as smart homes, personal assistants, and healthcare.
2. Advancements in Natural Language Processing (NLP): NLP will continue to improve, enabling more sophisticated and human-like interactions between humans and machines.
3. Rise of Explainable AI (XAI): As AI becomes more pervasive, there will be a growing need to understand how AI systems
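The "overlap removal" mentioned in the cell above can be pictured with a small, self-contained sketch. It assumes that a streamed chunk may repeat some text that was already received. The helpers `drop_overlap` and `merge_chunks` are hypothetical names, and the actual `stream_and_merge` helper in `sglang.utils` may work differently.

```python
def drop_overlap(existing_text, new_chunk):
    """Return new_chunk without the longest prefix that repeats the end of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, skipping text that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


# Chunks that partially repeat earlier text.
overlapping = merge_chunks(["Hello, my", " my name", "name is Kae"])
# Chunks that each carry the full text generated so far.
cumulative = merge_chunks(["The", "The capital", "The capital of France"])
print(overlapping)
print(cumulative)
```

The same routine handles both partially overlapping chunks and cumulative chunks, because in the cumulative case the whole previous text is simply the longest overlap.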



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Zara Elwes. I'm a 25-year-old graphic designer currently based in New York City. I enjoy exploring the city, trying new restaurants, and practicing yoga to maintain my work-life balance. I'm passionate about creating innovative designs that push the boundaries of traditional visual art. Outside of work, I'm a bit of a bookworm and love getting lost in a good novel. I'm always looking for new experiences and connections to broaden my horizons.
This self-introduction is neutral because it doesn't reveal any specific personality traits, emotional characteristics, or potential biases. It simply provides a straightforward overview of Zara's background

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the largest city and one of the most populated areas in France. It is located in the Île-de-France region and 
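When requests arrive independently, for example in a custom server, one common asyncio pattern is to issue one call per prompt and await them together. The sketch below runs without a model: `StandInEngine` is a hypothetical stand-in that only mimics the `{"text": ...}` result shape used in this document, and whether per-prompt calls are preferable to passing a list to the real engine is not something this sketch establishes.

```python
import asyncio


class StandInEngine:
    """Hypothetical stand-in for an engine; it does not call any model."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate generation latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(engine, prompts, sampling_params):
    # Start one request per prompt and wait for all of them.
    # asyncio.gather returns results in the same order as its arguments.
    requests = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*requests)


prompts = ["The capital of France is", "The future of AI is"]
sampling_params = {"temperature": 0.8, "top_p": 0.95}
outputs = asyncio.run(generate_all(StandInEngine(), prompts, sampling_params))

for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Because `asyncio.gather` preserves argument order, zipping `prompts` with `outputs` pairs each prompt with its own result.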

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaelin Darkshadow, but you can call me Kae. I'm a 25-year-old huntress who's been living in the city of Nightshade for about five years now. I've had a bit of a nomadic past, but I'm starting to feel settled here. I'm not much of a talker, but I'm always up for a challenge or a good fight. My skills with a bow are unmatched, and I'm pretty handy with a sword too. What do you want to know? I'm not much of a people person, but I'm not anti-social either. I like my independence,



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Next post: What are the three largest cities in the United States? The three largest cities in the United States are New York City, Los Angeles, and Chicago. Next post: What are the three largest cities in the United States? The three largest cities in the United States are New York City, Los Angeles, and Chicago. What are the three largest cities in the United States? The three largest cities in the United States are New York City, Los Angeles, and Chicago. Next post: What are the three largest cities in the United States? The three largest cities in the United States are New York City, Los Angeles,



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a topic of ongoing debate and speculation, but here are some possible future trends that are widely discussed:
1. Increased Adoption of Edge AI: As the number of devices and sensors connected to the internet of things (IoT) grows, edge AI will become more prevalent. Edge AI involves processing data closer to where it is generated, reducing latency and improving real-time decision-making.
2. Advancements in Explainable AI (XAI): As AI becomes more pervasive, there will be a growing need to understand how AI systems make decisions. Explainable AI will become more important to ensure transparency and trust in AI-driven systems.
3. Rise of
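The consumption pattern used in the streaming cell above, printing each chunk as it arrives while also keeping the full text, can be tried without a model. In this sketch `fake_token_stream` and `collect` are hypothetical helpers, and the async generator is a stand-in for a real streaming source.

```python
import asyncio


async def fake_token_stream(tokens, delay=0.0):
    """Hypothetical stand-in for a streaming source: yields one chunk at a time."""
    for token in tokens:
        await asyncio.sleep(delay)
        yield token


async def collect(stream):
    """Print chunks as they arrive and return the concatenated text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # final newline once the stream is exhausted
    return "".join(parts)


text = asyncio.run(collect(fake_token_stream([" Paris", ".", "\n", "Next", " post"])))
```

Printing with `end=""` and `flush=True` keeps the chunks on one line and makes them visible immediately, while the list of parts preserves the complete output for later use.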




In [6]:
llm.shutdown()