# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
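The reason for this patch can be illustrated with the standard library alone. The sketch below is not SGLang code; `inner` and `outer` are hypothetical names. It shows that plain `asyncio` refuses to start a second event loop from inside a running one, which is the situation in IPython or Jupyter.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # attempt to start a nested event loop
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

Without `nest_asyncio.apply()`, the nested call fails with a `RuntimeError`. Applying the patch is intended to make such nested calls possible.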

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  David and I am an old dog. I am 15 years old and I've lived a good life. I've had many adventures, loved by my family and enjoyed many treats. I'm still a bit spry, but not as much as I used to be. My joints creak and I get tired easily, but I still have a heart full of love and a mind that is sharp.
I've seen many things in my life, from the birth of new puppies to the passing of old friends. I've been there for my family through thick and thin, always ready to lend a listening ear or a comforting nuzzle.
Prompt: The president of the United States is
Generated text:  the head of state and head of government of the United States. The president serves a four-year term and is elected by the Electoral College. The president is responsible for enforcing laws, commanding the armed forces, and serving as the commander-in-chief of the military. The president is also the chief diplomat, representing the United States at home and abroad. The president 

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student who enjoys playing the guitar and listening to music. I'm a bit of a bookworm and love reading fantasy novels. I'm a bit shy, but I'm working on being more outgoing. I'm a junior at Springdale High School. That's me in a nutshell.
I'm a 17-year-old high school student who enjoys playing the guitar and listening to music. I'm a bit of a bookworm and love reading fantasy novels. I'm a bit shy, but I'm working on being more outgoing. I'm a junior at Springdale High School.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country, near the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks, including the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city has a population of over 2.1 million people and is a major hub for international business, tourism, and education. Paris is also known for its romantic atmosphere and is often referred to as the "City of Light." Source: Wikipedia. The statement is concise and factual, providing a brief

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems may be able to analyze medical data, identify patterns, and make predictions about patient outcomes.
2. Advancements in natural language processing: AI-powered chatbots and virtual assistants are becoming increasingly sophisticated, allowing for more natural and human-like interactions. This trend is expected to continue, with AI systems becoming more capable of understanding and responding to human language
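The "overlap removal" mentioned in the banner above can be pictured as follows. This is a minimal sketch of the general idea, assuming that a streamed chunk may begin with text that was already emitted. The names `drop_repeated_prefix` and `merge_chunks` and the sample chunks are hypothetical and are not taken from the SGLang implementation of `stream_and_merge`.

```python
def drop_repeated_prefix(existing: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the tail of existing."""
    max_overlap = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while skipping text that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += drop_repeated_prefix(merged, chunk)
    return merged


# Hypothetical chunks in which each one repeats part of the previous text.
chunks = [" Paris", " Paris is", " is the capital", " capital of France."]
merged_text = merge_chunks(chunks)
print(merged_text)
```

Checking the longest candidate overlap first keeps the merge from dropping too little text when a short suffix also happens to match.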



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Elliot Waverly. I'm a quiet and observant person who works as a book conservator at a small library in a rural town. I spend my free time reading and tending to my small garden. I find comfort in routine and familiarity. That's me. This introduction focuses on the character's personality, occupation, and hobbies, providing a concise and neutral overview of who they are. It doesn't reveal any dramatic or sensational aspects of their life, keeping the tone straightforward and matter-of-fact. The use of the phrase "That's me" adds a touch of humility and authenticity, making the introduction feel more like a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The following is a statement about a particular topic. The statement needs to be rewritten to make it clear that it is about the city of Paris in France
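The asynchronous API is mainly useful when several independent requests should be in flight at the same time. The sketch below shows that pattern with `asyncio.gather`. It uses a stand-in coroutine, `fake_async_generate`, which is hypothetical and only imitates the `{"text": ...}` result shape seen above, so it runs without a model.

```python
import asyncio


async def fake_async_generate(prompt: str, delay: float = 0.01) -> dict:
    """Hypothetical stand-in for an engine call; sleeps instead of running a model."""
    await asyncio.sleep(delay)
    return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(prompts):
    # Schedule one request per prompt and wait for all of them together.
    tasks = [fake_async_generate(prompt) for prompt in prompts]
    # gather returns results in the same order as the submitted tasks.
    return await asyncio.gather(*tasks)


demo_prompts = ["first", "second", "third"]
results = asyncio.run(generate_concurrently(demo_prompts))
for prompt, result in zip(demo_prompts, results):
    print(prompt, "->", result["text"])
```

With a real engine, the stand-in call would presumably be replaced by `llm.async_generate(prompt, sampling_params)`, as used in the cell above.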

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Lucas Ellis, and I'm a 25-year-old photographer. I've been working as a freelance photographer for three years, capturing the beauty of the world around me through the lens of my camera. I love traveling to new places and experiencing different cultures, which often finds its way into my work. When I'm not behind the camera, you can find me exploring local coffee shops or hiking in the nearby woods. What do you think?
I think the introduction is well-written and it effectively conveys the character's personality and interests. The use of simple and concise language makes it easy to understand and relate to the character. However, to



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is located in the northern part of the country, along the Seine River. The population of Paris is approximately 2.1 million people within the city limits and 12.2 million in the larger metropolitan area. Paris is a major hub for culture, fashion, and tourism in Europe. The city is home to many famous landmarks, including the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. Paris has a rich history dating back to the Gallic era, and it has been a major center of politics, culture, and learning for centuries. Today, Paris is a global city and a popular



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  increasingly being discussed among experts and business leaders, with predictions ranging from exciting possibilities to dystopian nightmares. Here are some possible future trends in AI that are being considered:
1. Increased Adoption of Edge AI: As the Internet of Things (IoT) continues to grow, AI will be integrated into more devices and applications, enabling real-time processing and analysis of data at the edge. This will reduce latency and improve the efficiency of AI systems.
2. Advancements in Natural Language Processing: NLP will continue to improve, enabling more effective human-AI interaction and understanding. This will lead to more sophisticated chatbots, virtual assistants, and language




In [6]:
llm.shutdown()
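
If an exception is raised before the explicit `llm.shutdown()` call, the engine may be left running. One possible way to guard against that is a small context manager, sketched here under stated assumptions: `StandInEngine` and `engine_session` are hypothetical names, and the stand-in only mimics the `shutdown()` method used above.

```python
from contextlib import contextmanager


class StandInEngine:
    """Hypothetical placeholder exposing only the shutdown() method used above."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(engine):
    """Yield the engine and always call shutdown() afterwards, even on errors."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StandInEngine()
try:
    with engine_session(engine) as session_engine:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(engine.is_shut_down)
```

The same wrapper could be applied to a real engine object, since it relies only on the presence of a `shutdown()` method.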