# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
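
As a rough illustration of the idea, the sketch below wraps an engine in a single async request handler, which is the core of what a custom server adds. It is not taken from the linked `custom_server.py`: `handle_generate`, the payload shape, and `EchoEngine` (a stand-in so the snippet runs without a model) are hypothetical. Only `async_generate` and the `output["text"]` field come from the examples in this document.

```python
import asyncio


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a model."""

    async def async_generate(self, prompts, sampling_params):
        # Mimic the output shape used in this document: one dict per prompt.
        return [{"text": f" (echo of {p!r})"} for p in prompts]


async def handle_generate(engine, payload):
    """Turn one request payload (a dict) into one response dict."""
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt:
        return {"status": 400, "error": "'prompt' must be a non-empty string"}
    sampling_params = payload.get(
        "sampling_params", {"temperature": 0.8, "top_p": 0.95}
    )
    outputs = await engine.async_generate([prompt], sampling_params)
    return {"status": 200, "text": outputs[0]["text"]}


response = asyncio.run(
    handle_generate(EchoEngine(), {"prompt": "The capital of France is"})
)
print(response)
```

In a real server, a web framework route would call `handle_generate` with the engine created by `sgl.Engine(...)` in place of `EchoEngine()`.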



## Nest Asyncio
To use the **Offline Engine** in IPython, Jupyter, or any other environment that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
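
The reason for this requirement can be reproduced with the standard library alone: `asyncio.run` refuses to start when an event loop is already running in the current thread, which is the situation inside a notebook cell. The sketch below triggers that error on purpose; the function names are illustrative.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are now inside a running event loop, as code in a notebook cell is.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call, rejected by stock asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches `asyncio` so that nested calls of this kind are allowed.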

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    # Notebooks already run an event loop; allow nested asyncio.run() calls.
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
```

```
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.23it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.12it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.11it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.50it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.34it/s]
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```
Prompt: Hello, my name is
Generated text:  Jonathan Haskins. I'm a first-year MFA student in fiction at the University of Wyoming. I'm excited to be here, and I look forward to getting to know my fellow writers and the wider Laramie community.
My interests and influences are diverse, but I'm particularly drawn to the intersection of the personal and the historical, as well as the relationships between identity, place, and narrative. I'm interested in exploring these themes through a variety of styles and forms, including short stories, novels, and possibly even hybrid or experimental work.
When I'm not writing, I enjoy hiking, reading, and cooking. I'm
Prompt: The president of the United States is
Generated text:  the head of state and head of government of the United States, and is the commander-in-chief of the armed forces. The president is indirectly elected by the people through the Electoral College. The president serves a four-year term and is limited to two terms in office.
The 
```
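
In offline batch jobs the results are usually stored rather than printed. The sketch below reuses the `zip(prompts, outputs)` pairing from the cell above to build JSON Lines records. `to_records` is a hypothetical helper, and the `outputs` list is placeholder data in the same `[{"text": ...}]` shape that `llm.generate` returns in this document, so the snippet runs without a model.

```python
import json


def to_records(prompts, outputs):
    """Pair each prompt with its generated text, as the zip() loop does."""
    if len(prompts) != len(outputs):
        raise ValueError("expected one output per prompt")
    return [
        {"prompt": prompt, "completion": output["text"].strip()}
        for prompt, output in zip(prompts, outputs)
    ]


# Placeholder data; with a real engine use: outputs = llm.generate(prompts, ...)
prompts = ["The capital of France is", "The future of AI is"]
outputs = [{"text": " Paris."}, {"text": " still being written."}]

for record in to_records(prompts, outputs):
    print(json.dumps(record))
```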

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


```
=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaida. I'm a 25-year-old freelance writer and editor living in Tokyo. I enjoy reading, hiking, and trying out new restaurants. I'm a bit of a homebody, but I love exploring the city and discovering new hidden gems. I'm currently working on a novel and trying to learn more about the Japanese language and culture. That's me in a nutshell! What do you think? Is it too long or too short? Should I add or remove anything?
Your self-introduction is a good length and provides a good balance of personal and professional information. It's also neutral, which is a good approach for a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is located in the northern part of the country, near the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre Dame Cathedral. The city has a population of over 2.1 million people and is a major hub for international business, education, and tourism. Paris is also known for its romantic atmosphere and is often referred to as the “City of Love.” The city has a diverse range of neighborhoods, each with its own unique character

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by several factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems can analyze large amounts of medical data, identify patterns, and make predictions, leading to more accurate diagnoses and personalized treatment plans.
2. Rise of Explainable AI (XAI): As AI becomes more pervasive, there is a growing need to understand how AI systems make decisions. XAI aims to provide transparency and interpretability of AI models
```
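
The "overlap removal" named in the cell above refers to merging streamed chunks without duplicating text that a new chunk repeats from the end of what has already been received. One possible way to do this is sketched below. It illustrates the idea only and is not claimed to be how `stream_and_merge` is implemented; `remove_overlap` and `merge_chunks` are names chosen for this example.

```python
def remove_overlap(existing, chunk):
    """Drop the longest prefix of `chunk` that already ends `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text repeated at each boundary."""
    merged = ""
    for chunk in chunks:
        merged += remove_overlap(merged, chunk)
    return merged


print(repr(merge_chunks([" Paris", " Paris is", " is the capital"])))
```

A heuristic like this can also remove genuine repetition (merging `"ha"` and `"ha"` yields `"ha"`), so it suits streams that resend earlier text rather than streams of pure deltas.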



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


```
=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Finnley Wychwood, and I'm a 19-year-old student at Ravenwood University. I'm studying environmental science with a focus on sustainable development. I enjoy hiking and reading about history. What is wrong with this self-introduction?
This introduction is too neutral. It lacks any unique personality, emotions, or background that might make Finnley stand out from other characters. The text is also too formal and doesn't reveal much about Finnley's interests or motivations. A more engaging introduction might include details about Finnley's goals, values, or conflicts that could make the character more interesting and relatable. Example: "My name

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Retain your reader's interest by making the least descriptive and most general statement about the city. Paris is 
```
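
Because `async_generate` is awaitable, it can share the event loop with other coroutines. The sketch below shows the general pattern with `asyncio.gather`. `SlowStubEngine` is a hypothetical stand-in that sleeps instead of running a model, so the timing and the generated strings are illustrative only; whether overlapping requests helps with a real engine depends on its scheduler.

```python
import asyncio
import time


class SlowStubEngine:
    """Hypothetical stand-in for sgl.Engine; sleeps instead of running a model."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.2)  # pretend to work without blocking the loop
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    """Submit every batch at once and wait for all of them together."""
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    # gather() returns results in the same order as the submitted batches.
    return await asyncio.gather(*tasks)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
start = time.perf_counter()
results = asyncio.run(run_batches(SlowStubEngine(), batches, {"temperature": 0.8}))
elapsed = time.perf_counter() - start

print(f"{len(results)} batches finished in {elapsed:.2f}s")
```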

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


```
=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Zephyr. I'm a young woman with short, spiky black hair and piercing green eyes. I've been living in the city for a few years, working as a freelance journalist to make ends meet. I'm generally quiet and observant, preferring to listen rather than speak, but when I do talk, I'm usually straightforward and to the point. I'm not particularly attached to any one place or group, and I tend to keep my own counsel. I'm here to learn, to observe, and to tell stories.
You could add more details, but it's a good starting point. This character seems to be a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the country’s largest city and a major financial, economic, and cultural hub. The city is famous for its iconic landmarks such as the Eiffel Tower and the Louvre Museum, as well as its fashion industry and romantic atmosphere. Paris is home to many prestigious educational institutions and has a diverse population of around 2.1 million people. It is also known for its culinary delights, such as French cuisine and cheese. Paris has a rich history dating back to the Roman era and has been an important center of politics, culture, and learning for centuries. Its stunning architecture, vibrant cultural scene, and iconic status make

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  based on some of the latest technology that we use today. Artificial intelligence is rapidly evolving and transforming the way we live and work. There are many possibilities of trends that we may see in the future of artificial intelligence.
Some possible future trends in artificial intelligence include:
1. Increased use of deep learning: Deep learning is a subset of machine learning that involves the use of neural networks to analyze data. It is currently being used in many applications, including image and speech recognition, natural language processing, and game playing. In the future, we can expect to see even more widespread adoption of deep learning, particularly in areas such as healthcare, finance,
```
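
When the full text is also needed after streaming, the chunks can be accumulated while they are printed. A minimal sketch follows. `fake_stream` is a hypothetical async generator standing in for `async_stream_and_merge(llm, prompt, sampling_params)`, and it is assumed, as in the cell above, that each yielded item is a string that does not repeat earlier text.

```python
import asyncio


async def fake_stream(chunks):
    """Hypothetical stand-in for a chunk stream; yields strings one by one."""
    for chunk in chunks:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield chunk


async def print_and_collect(stream):
    """Echo each chunk as it arrives and return the concatenated text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


full_text = asyncio.run(
    print_and_collect(fake_stream([" Paris", ".", " It", " is"]))
)
```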




```python
llm.shutdown()
```
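
In scripts, it can be convenient to guarantee that `shutdown()` runs even when generation raises an exception. One way to do that is a small context manager, sketched below. `managed_engine` is a hypothetical helper and not part of SGLang, and `StubEngine` is a stand-in that only records whether `shutdown()` was called, so the snippet runs without a model.

```python
from contextlib import contextmanager


@contextmanager
def managed_engine(factory):
    """Create an engine with `factory()` and always call shutdown() afterwards."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records its shutdown state."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


with managed_engine(StubEngine) as engine:
    print("inside the block, closed =", engine.closed)
print("after the block, closed =", engine.closed)
```

With the real library, the factory could be `lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")`.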