# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
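
To see why this is needed, the sketch below reproduces the underlying error with the standard library only (no SGLang involved): calling `asyncio.run()` from code that is already running inside an event loop, as a notebook cell is, raises `RuntimeError`. The names `inner` and `outer` are illustrative.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are now inside a running event loop, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # a second, nested call to asyncio.run()
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Applying `nest_asyncio` before such a nested call is what allows it to proceed instead of raising.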

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Alex! I'm a happy-go-lucky 25 year old who loves meeting new people and trying new things! In my free time, I enjoy playing music, watching movies, and trying out new restaurants and breweries.
I'm looking for someone who is fun, easy-going, and always up for an adventure! If you're a fellow foodie, beer enthusiast, or music lover, we're off to a great start!
Let's grab a drink or dinner and see where the night takes us! I promise to be a good listener and to make sure you feel comfortable and happy. I'm looking forward to meeting you! Let's
Prompt: The president of the United States is
Generated text:  the head of state and head of government of the federal government of the United States. The president is indirectly elected by the people through the Electoral College. The president serves a four-year term, which is fixed by the 22nd Amendment to the United States Constitution. The president is responsible for executing the laws and overseein
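
For offline batch jobs, results are usually written to disk rather than printed. The sketch below pairs each prompt with its output and serializes the pairs as JSON Lines. It assumes only what the loop above shows, namely that each output is a dict with a `text` key. The `outputs` list here is a hand-written stand-in for the return value of `llm.generate`, and `to_jsonl` is a hypothetical helper, not part of SGLang.

```python
import json


def to_jsonl(prompts, outputs):
    """Serialize prompt/output pairs as JSON Lines (one JSON object per line)."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = [
        json.dumps({"prompt": prompt, "text": output["text"]}, ensure_ascii=False)
        for prompt, output in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Stand-in data shaped like the result of `llm.generate(prompts, sampling_params)`.
prompts = ["The capital of France is", "The future of AI is"]
outputs = [{"text": " Paris."}, {"text": " a topic of much debate."}]

jsonl = to_jsonl(prompts, outputs)
print(jsonl)
```

Keeping the prompt next to its completion makes each line self-describing, which helps when a batch is later filtered or re-run in part.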

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor living in a small town in the Pacific Northwest. I enjoy hiking, reading, and trying out new recipes in my spare time. I'm a bit of a introvert and prefer to keep to myself, but I'm always up for a good conversation when the mood strikes. I'm currently working on a novel and trying to get my writing career off the ground. That's me in a nutshell. What do you think? Is there anything you'd like to add or change?
I think your self-introduction is great! It's concise, informative, and gives a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. The city has a population of over 2.1 million people and is a major hub for international business, tourism, and education. Paris is also known for its romantic atmosphere and is often referred to as the "City of Light." The city has a rich history dating back to the 3

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Some experts predict that AI will become increasingly integrated into our daily lives, while others warn of the potential risks and challenges associated with its development. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, with the potential to revolutionize the way we diagnose and treat diseases.
2. Widespread adoption of AI in the workplace: AI is already being used in many industries to
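
"Overlap removal" refers to handling streamed chunks that repeat text already received. The internals of `stream_and_merge` are not shown on this page, so the following is only a minimal sketch of one way such merging could work: find the longest suffix of the accumulated text that is also a prefix of the new chunk, and append only the remainder. `drop_overlap` and `merge_chunks` are illustrative names, and the chunk list is made up.

```python
def drop_overlap(existing_text, new_chunk):
    """Return new_chunk without the prefix that repeats the end of existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


# Made-up chunks in which each one repeats the tail of the previous one.
chunks = ["The capital", "capital of France", "France is Paris."]
print(merge_chunks(chunks))
```

A purely textual heuristic like this can also drop a legitimate repetition (for example, a model that really does emit the same word twice across a chunk boundary), so treat it as an illustration of the idea rather than a robust implementation.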



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  August Blackwood. I'm a 23-year-old student at the local community college, studying business administration. I've been living in this small town all my life, and I work part-time at my family's antique shop. That's about it. What do you want to know?
Note: This is a good starting point, but I would like to add a bit more depth and complexity to the character. Can you suggest some ways to expand on this introduction?
Here are a few suggestions to add more depth and complexity to August's introduction:
1.  **Add a hint of personality**: Consider adding a word or phrase that reveals

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Provide a concise factual statement about the Eiffel Tower. The Eiffel Tower is a wrought-iron lattice tower located in Paris, France, and was constructed for the 1889 World’s F
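
Because `async_generate` is awaited, it can share the event loop with other coroutines. The sketch below shows that pattern with `asyncio.gather`. To stay runnable without a GPU or a model, it uses `fake_async_generate`, a hypothetical stand-in that only mimics the shape seen above (a list of prompts in, a list of dicts with a `text` key out). It is not the SGLang API, and `heartbeat` is likewise made up for illustration.

```python
import asyncio


async def fake_async_generate(prompts, sampling_params):
    """Hypothetical stand-in for `llm.async_generate`; returns canned text."""
    await asyncio.sleep(0.01)  # pretend to wait for the engine
    return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(log):
    """Some unrelated work that runs while generation is pending."""
    for i in range(3):
        log.append(f"tick {i}")
        await asyncio.sleep(0)


async def main():
    log = []
    prompts = ["The capital of France is", "The future of AI is"]
    sampling_params = {"temperature": 0.8, "top_p": 0.95}
    # Run generation and the heartbeat concurrently on one event loop.
    outputs, _ = await asyncio.gather(
        fake_async_generate(prompts, sampling_params),
        heartbeat(log),
    )
    return outputs, log


outputs, log = asyncio.run(main())
for output in outputs:
    print(output["text"])
print(log)
```

In a notebook, where a loop is already running, the final call would be `await main()` or would rely on `nest_asyncio` as described earlier.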

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Dr. Lily Chen, and I'm a botanist with a specialization in orchid conservation. I'm currently working on a project to develop a sustainable system for propagating rare orchid species. I'm based in a research facility in the tropics.
Dr. Lily Chen is a botanist who specializes in orchid conservation. Her current project involves developing a sustainable system for propagating rare orchid species, and she works in a research facility located in the tropics. She is a neutral character, meaning that she doesn't have a clear personality or motivations that are immediately apparent. This introduction provides a brief overview of her profession

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is located in the northern part of the country and is situated on the River Seine. It is the largest city in France with a population of over 2 million people. Paris is known as the City of Light, and its rich history, art, fashion, and cuisine make it a popular tourist destination. Paris is home to famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum, among others. (182 words) The first paragraph states the obvious fact about France’s capital city, that is Paris. The next sentence tells about the geographical location of Paris in the northern part

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  anticipated to be shaped by advancements in several areas. Here are some of the possible future trends in artificial intelligence:
  1. Increased use of deep learning: Deep learning is a subset of machine learning that involves the use of neural networks to analyze data. As data continues to grow, deep learning is expected to become even more prevalent in AI applications, particularly in areas such as image and speech recognition.
  2. Development of more advanced natural language processing: Natural language processing (NLP) is the ability of AI systems to understand and generate human language. As AI continues to advance, we can expect to see more sophisticated NLP capabilities
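
The streaming cell in this section consumes an asynchronous generator chunk by chunk with `async for`. A minimal sketch of that consumption pattern follows. It uses a hypothetical `fake_stream` generator in place of `async_stream_and_merge` so that it runs without a model, and it also collects the chunks so the full text is available once streaming ends.

```python
import asyncio


async def fake_stream(prompt):
    """Hypothetical stand-in for a streaming call; yields canned chunks."""
    for chunk in [" Paris", ".", " It", " is", " on", " the", " Seine", "."]:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield chunk


async def main():
    pieces = []
    async for chunk in fake_stream("The capital of France is"):
        print(chunk, end="", flush=True)  # show text as it arrives
        pieces.append(chunk)  # keep it for later use
    print()
    return "".join(pieces)


full_text = asyncio.run(main())
```

Printing with `end=""` and `flush=True`, as the cell in this section does, is what makes the text appear incrementally on one line rather than one chunk per line.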




```python
llm.shutdown()
```
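
To make sure `shutdown()` runs even if generation raises, the call can be wrapped in a context manager. This is a sketch, not an SGLang feature: `engine_session` is a hypothetical helper, and `DummyEngine` is a stand-in so the example runs without loading a model. The only assumption about the real engine is the `shutdown()` method used above.

```python
from contextlib import contextmanager


class DummyEngine:
    """Stand-in for an engine object; only tracks whether it was shut down."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(engine):
    """Yield the engine and always call its shutdown() on exit."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = DummyEngine()
try:
    with engine_session(engine) as llm:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(engine.is_shut_down)
```

With a real engine, the `with` block would wrap the object returned by `sgl.Engine(...)` in the same way.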