# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in any other environment where an event loop is already running, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
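To illustrate why this is needed, the following minimal sketch (standard library only, no SGLang required) reproduces the underlying limitation: `asyncio.run` refuses to start when an event loop is already running, which is the situation inside IPython. The names `inner` and `outer` are illustrative and not part of SGLang.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are already inside a running event loop here, as in an IPython cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: plain asyncio rejects this
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches `asyncio` so that nested calls of this kind are permitted.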

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Eliza, an American singer, songwriter, actress and model, born on 22 January 1990. What is your real name?

Your real name is Eliza Louise Carter. Is there anything else you'd like to know about yourself?
Prompt: The president of the United States is
Generated text:  trying to decide how many military personnel to send to a war zone. He knows from past experience that if he sends 6000 military personnel, he will have only 5000 left. However, if he sends 5000 military personnel, he will have 6000 left. 

1. If the president wants to maximize the number of military personnel left at the war zone, how many military personnel should he send? 

2. What is the maximum number of military personnel he can send without exceeding the 5000 limit, and how many are left at the war zone? 
Prompt: The capital of France is
Generated text:  ____
A. Paris
B. Versailles
C. Lille
D. Lyon
Answer:
A

When the power supply voltage is too high or too low, it can caus
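As the loop above suggests, `generate` returns one result per prompt, in input order, with the completion stored under the `"text"` key. The sketch below shows one way to collect those results into records for later processing. `StubEngine` is a hypothetical stand-in that only mimics this return shape so the snippet runs without a model; `collect_results` is likewise an illustrative helper, not part of SGLang.

```python
from typing import Dict, List


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; it only mimics the return shape."""

    def generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict[str, str]]:
        # One dict per prompt, in input order, with the completion under "text".
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def collect_results(engine, prompts: List[str], sampling_params: Dict) -> List[Dict[str, str]]:
    """Pair each prompt with its generated text, stripped of surrounding whitespace."""
    outputs = engine.generate(prompts, sampling_params)
    return [
        {"prompt": prompt, "completion": output["text"].strip()}
        for prompt, output in zip(prompts, outputs)
    ]


records = collect_results(
    StubEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)
for record in records:
    print(record)
```

With a real engine, `StubEngine()` would be replaced by the `llm` object created earlier.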

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your job or profession]. I enjoy [insert a short description of your hobbies or interests]. I'm always looking for new experiences and learning opportunities. What are some of your favorite things to do? I love [insert a short description of your favorite activity or hobby]. I'm always looking for new challenges and opportunities to grow and learn. What's your favorite thing to do? I love [insert a short description of

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum. It is also home to the French Parliament, the French Academy of Sciences, and the French Quarter, a historic neighborhood. Paris is a cultural and economic hub, known for its fashion, art, and cuisine. It is a popular tourist destination, attracting millions of visitors each year. The city is also home to many museums, including the Louvre, the Musée d'Orsay, and the Musée Rodin. Paris is a city of contrasts, with its rich history and modernity. Its status as

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some potential future trends in AI include:

1. Increased use of AI in healthcare: AI is already being used to diagnose and treat diseases, and it has the potential to revolutionize the field of medicine. AI-powered diagnostic tools could potentially detect diseases at an earlier stage, leading to better outcomes for patients.

2. AI in finance: AI is already being used to analyze financial data and make investment decisions. However, the potential for AI in finance is likely to grow as more data is collected and analyzed. AI-powered trading platforms and
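The "overlap removal" mentioned above can be pictured with a small, self-contained sketch. It assumes that a streamed chunk may repeat text that has already been accumulated, and drops the repeated prefix before appending. This is one plausible approach, written for illustration; it is not necessarily how `stream_and_merge` is implemented, and the names `trim_overlap` and `merge_chunks` are hypothetical.

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Accumulate chunks, appending only the part of each that is new."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged_text = merge_chunks(["Hello", "Hello, wor", "world!"])
print(merged_text)  # Hello, world!
```

One limitation of this naive approach is that genuinely repeated text is also trimmed: merging `"ha"` followed by `"ha"` yields `"ha"`, not `"haha"`.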



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Sarah Thompson, and I am a hardworking and curious college student who enjoys reading books and learning new things. I am passionate about environmental issues and have been volunteering at local environmental organizations for the past few years. I am always looking for ways to make a positive impact in the world. What other things can you tell me about yourself? Sarah Thompson is a 26-year-old college student with a passion for environmental issues. She enjoys reading books and learning new things, and has been volunteering at local environmental organizations for the past few years. She is also passionate about writing, and has recently published her first book. She is always looking for

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, often referred to as the "City of Love" due to its rich cultural 
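A practical reason to prefer the asynchronous call is that other coroutines can make progress while generation is awaited. The sketch below illustrates this with `asyncio.gather`. `StubAsyncEngine` is a hypothetical stand-in that only mimics an awaitable batch call returning dicts with a `"text"` key, and `heartbeat` represents unrelated background work; neither is part of SGLang.

```python
import asyncio
from typing import Dict, List


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine; only mimics the awaitable batch call."""

    async def async_generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict[str, str]]:
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(ticks: int) -> int:
    """Unrelated background work that keeps running during generation."""
    for _ in range(ticks):
        await asyncio.sleep(0.001)
    return ticks


async def run_batch():
    engine = StubAsyncEngine()
    # Both coroutines share the event loop; neither blocks the other.
    return await asyncio.gather(
        engine.async_generate(["The capital of France is"], {"temperature": 0.8}),
        heartbeat(3),
    )


outputs, ticks = asyncio.run(run_batch())
print(outputs[0]["text"], ticks)
```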

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a/an [Age] year old [Gender] [Occupation]. I'm currently [Current Location], and I'm here to [Your Role, if applicable]. I hope you're doing well. I enjoy [Whatever interests you, if applicable]. I'm always looking for [A challenge or interest], and I'm always looking for new ways to [A new skill, if applicable]. I'm a/an [What do you do best? If multiple bests, list them here]. I'm always ready to [Anything you can think of]. I'm looking forward to [What you'd like to achieve


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in both area and population of France, and also the seat of the government and the country’s capital, both being the oldest capital in Europe. It is located on the Seine river in the south-western part of the country, and is home to many of the country’s major landmarks and attractions, including the Eiffel Tower, the Louvre Museum, Notre-Dame Cathedral, the Champs-Elysées, and many others. It is also the birthplace of many famous artists, such as Leonardo da Vinci, Pablo Picasso, Vincent van Gogh, and Michelangelo. Paris is known


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  uncertain and uncertain. While there are many potential areas of development, one trend that is likely to continue and become more prevalent is the increasing use of AI in the healthcare industry. With the growing need for personalized treatment, AI has the potential to revolutionize the way we approach medical care. AI-powered diagnostic tools, such as MRI and CT scanners, and predictive models for disease development could lead to earlier and more accurate diagnoses, resulting in better patient outcomes.

Another area where AI is likely to have a significant impact is in customer service. With the rise of chatbots and AI-powered chat assistants, customers can now interact with businesses in a more convenient
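The streaming loop above relies on the `async for` protocol: each chunk is printed as soon as it arrives, instead of waiting for the full completion. The sketch below isolates that consumption pattern using a hypothetical async generator, `stub_stream`, as a stand-in for the real streaming source. It does not call SGLang, and `consume` is an illustrative helper.

```python
import asyncio
from typing import AsyncIterator, List


async def stub_stream(tokens: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming source; yields one chunk at a time."""
    for token in tokens:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield token


async def consume(tokens: List[str]) -> str:
    """Print chunks as they arrive and also return the assembled text."""
    pieces = []
    async for chunk in stub_stream(tokens):
        print(chunk, end="", flush=True)
        pieces.append(chunk)
    print()  # final newline once the stream is exhausted
    return "".join(pieces)


full_text = asyncio.run(consume([" Paris", ".", " It", " is"]))
```

Collecting the chunks in a list and joining them once at the end avoids repeated string concatenation while still allowing incremental display.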




```python
llm.shutdown()
```