# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
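
To illustrate why this is needed, here is a minimal sketch using only the standard library (no SGLang involved). It reproduces the error raised when `asyncio.run()` is called while an event loop is already running, which is the situation inside IPython or Jupyter. The function names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # At this point we are inside a running event loop, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: plain asyncio rejects this
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "nested run succeeded"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` beforehand patches asyncio so that nested calls of this kind are permitted.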

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    # Allow nested event loops when running inside a notebook
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.60it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Caroline and I'm 18 years old. I am a student at a community college, and I'm really interested in photography. My friends and family have always told me that I have a great eye for capturing special moments, and I would love to turn my passion into a career.
I have a lot of experience taking photos of my friends and family, and I even started a photography club at my high school. I would love the opportunity to take my skills to the next level and learn more about the photography industry.
I'm excited to learn more about the different types of photography, such as portrait, landscape, and wildlife photography. I
Prompt: The president of the United States is
Generated text:  the highest-ranking official in the federal government, serving as both the head of state and the head of government. The president is directly elected by the people through the Electoral College and serves a four-year term. The president has many important powers and resp
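
For offline batch jobs it is often convenient to persist each prompt together with its completion. The sketch below assumes only what the cell above shows, namely that each output is a dict with a `"text"` key. The helper `write_jsonl`, the record field names, and the sample data are hypothetical and not part of SGLang.

```python
import io
import json

# Stand-in data shaped like the result of llm.generate(prompts, sampling_params).
# The completion text here is made up for illustration.
prompts = ["The capital of France is", "The future of AI is"]
outputs = [{"text": " Paris."}, {"text": " bright."}]


def write_jsonl(prompts, outputs, fh):
    """Write one JSON object per prompt/output pair and return the count."""
    count = 0
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count


buffer = io.StringIO()  # swap for open("results.jsonl", "w") in practice
n_written = write_jsonl(prompts, outputs, buffer)
lines = buffer.getvalue().splitlines()
print(n_written, lines[0])
```

Writing one record per line keeps the file appendable, so a long batch can be flushed incrementally.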

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and artist living in a small town in the Pacific Northwest. I enjoy hiking, reading, and trying out new recipes in my spare time. I'm a bit of a introvert, but I love meeting new people and hearing their stories. I'm currently working on a novel and a graphic novel, and I'm excited to see where my creative projects take me.
This self-introduction is neutral because it doesn't reveal too much about Kaida's personality, background, or motivations. It simply provides a brief overview of who she is and what she does. This can be helpful

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and cuisine. Paris is home to many famous landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. The city has a population of over 2.1 million people and is a major center for business, education, and culture. Paris is also known for its romantic atmosphere and is a popular destination for tourists from around the world. The city has a diverse range of neighborhoods, each with its own unique

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by several factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is likely to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems may be able to analyze large amounts of medical data, identify patterns, and make predictions about patient outcomes.
2. Rise of explainable AI: As AI becomes more pervasive, there is a growing need for AI systems to be transparent and explainable. This means that AI systems will need to provide clear explanations for their decisions and actions,
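
The phrase "overlap removal" refers to dropping text that a streamed chunk repeats from what has already been received. The following is a minimal sketch of one way to do that; it is not necessarily how `stream_and_merge` is implemented. The names `trim_overlap` and `merge_chunks` are used here for illustration.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Drop the prefix of new_chunk that repeats the end of existing."""
    max_len = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, removing overlaps between neighbours."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged_text = merge_chunks(["Hello", "lo wor", "world"])
print(merged_text)
```

One caveat of this approach is that text the model legitimately repeats at a chunk boundary would also be trimmed, so it suits streams that are known to resend overlapping text.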



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Luna. I'm a 22-year-old freelance writer who has been working with clients across the world. I have a degree in creative writing from a local university. I enjoy writing short stories and poetry, and I'm always looking to collaborate with like-minded individuals who share my passion for creative writing. Outside of work, I enjoy practicing yoga and playing the guitar. I'm a creative and curious person who is always looking for new experiences and opportunities to grow.
This self-introduction is neutral because it:
Does not reveal too much personal information
Does not express strong opinions or biases
Does not try to persuade or influence others
It provides

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. The Paris region has approximately 12 million people living in it and is the largest metropolitan a
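
If you prefer to submit prompts as separate awaitable requests, `asyncio.gather` is the usual pattern. The sketch below uses a hypothetical `StandInEngine` so that it runs without a GPU. It is not the SGLang engine, and whether the real `async_generate` accepts a single prompt and supports concurrent calls in this way is an assumption that should be checked against the SGLang documentation.

```python
import asyncio


class StandInEngine:
    """Hypothetical stand-in with an async_generate method; NOT the SGLang engine."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do some work
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(engine, prompts, sampling_params):
    # Create one awaitable per prompt and run them concurrently.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    # gather returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*tasks)


demo_prompts = ["The capital of France is", "The future of AI is"]
results = asyncio.run(
    generate_all(StandInEngine(), demo_prompts, {"temperature": 0.8})
)
for prompt, result in zip(demo_prompts, results):
    print(prompt, "->", result["text"])
```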

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaida Akatsuki and I'm a 17-year-old high school student from Tokyo. I'm pretty average in every way, but I have a strong interest in music and fashion. When I'm not attending classes, I enjoy listening to J-pop, playing my guitar, and trying out new hairstyles. I'm still figuring out what I want to do with my life, but for now, I'm just taking things one day at a time. That's me in a nutshell. What do you think? Is it good or bad? The best self-introduction is one that is genuine and honest. Your self-introduction should reflect

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is located in the northern part of the country. Paris is known for its rich history, art, architecture, and fashion. The city is home to many famous landmarks, such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. Paris is also a center for education, with several prestigious universities and institutions of higher learning. Overall, Paris is a city that is steeped in history and culture, and it is a popular destination for tourists and visitors from around the world. Provide a factual answer to the following question: What is the population of Paris? According to the 2019 estimate, the

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  rapidly evolving, and various trends are emerging that will shape the industry in the coming years. Here are some possible future trends in artificial intelligence:
1. Integration with the Internet of Things (IoT): AI will be integrated with the Internet of Things (IoT) to create a more connected and intelligent world. AI-powered devices will be able to communicate with each other and make decisions based on data from sensors and other sources.
2. Increased use of Explainable AI (XAI): As AI becomes more widespread, there is a growing need to understand how AI systems make decisions. XAI will become more prevalent to provide transparency and accountability in

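When streaming asynchronously, you may want the full text as a single string in addition to printing chunks as they arrive. The sketch below shows that pattern with a hypothetical `stand_in_stream` async generator in place of a real engine stream. The helper `collect` is also illustrative and not part of SGLang.

```python
import asyncio


async def stand_in_stream(chunks):
    """Hypothetical stand-in for an async stream of text chunks."""
    for chunk in chunks:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield chunk


async def collect(stream) -> str:
    """Consume an async stream and return the concatenated text."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)


full_text = asyncio.run(collect(stand_in_stream([" Paris", ".", " It", " is"])))
print(repr(full_text))
```

Collecting the chunks in a list and joining once at the end avoids repeated string concatenation for long generations.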

In [6]:
llm.shutdown()