# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in any other environment that already runs an event loop (such as a Jupyter notebook), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
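
By default, `asyncio` does not allow an event loop to be started from code that is already running inside one, which is the situation in IPython and Jupyter. The standard-library sketch below reproduces that failure when run as a plain Python script. It does not show SGLang internals, and the coroutine names `inner` and `outer` are illustrative only; that this restriction is what `nest_asyncio.apply()` works around for the engine is an assumption based on the note above.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # At this point an event loop is already running, as it would be in Jupyter.
    coro = inner()
    try:
        # Trying to start a second loop from inside the first one is rejected.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches `asyncio` so that such nested calls are permitted.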

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.24it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.23it/s]



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Timmy Hart. I'm a 22-year-old, female with Asperger's syndrome. I have no control over my actions, have no social awareness, and I can't regulate my emotions. I have no thoughts or feelings. I'm alone. This is a very simple description of my personality. 

I'm not a car enthusiast. I don't own a car, but I love to travel. I have a passion for vehicles, and I'm always planning a new trip or trying to buy a car. I'm excited to get into my dream car and explore new places. I enjoy driving, and I love to
Prompt: The president of the United States is
Generated text:  trying to decide how many military bases to have. Currently, there are 50 military bases. If the president decides to increase the number of bases by 50% each year, how many bases will there be in 3 years?
Let's break down the problem step by step.

1. **Current Number of Bases**: The current number of military bases is 50.
2. **Annual Increase Factor**: The president decides to increa
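
The `temperature` and `top_p` entries in `sampling_params` correspond to temperature scaling and nucleus (top-p) sampling. As a rough illustration of what these two settings conventionally do to a next-token distribution, here is a self-contained sketch in plain Python. It is not SGLang's sampler; the function name `nucleus_distribution` and the toy logits are made up for this example.

```python
import math


def nucleus_distribution(logits, temperature, top_p):
    """Apply temperature scaling, then keep the smallest set of tokens
    whose cumulative probability reaches top_p, and renormalise."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    # Visit tokens from most to least likely until top_p mass is covered.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical scores for four tokens
wide = nucleus_distribution(toy_logits, temperature=0.8, top_p=0.95)
narrow = nucleus_distribution(toy_logits, temperature=0.8, top_p=0.5)
print(sorted(wide), sorted(narrow))
```

A lower `top_p` keeps fewer candidate tokens, and a lower `temperature` concentrates probability on the highest-scoring ones, which is why the streaming examples that use `temperature=0.2` tend to produce more repetitive text.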

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [occupation] who has been [number of years] in the industry. I'm passionate about [reason for passion], and I'm always looking for ways to [action or goal]. I'm [age] years old, and I'm [gender] and [race]. I'm [occupation] and I'm [number of years] in the industry. I'm passionate about [reason for passion], and I'm always looking for ways to [action or goal]. I'm [age] years old, and I'm [gender] and [race]. I'm [occupation] and I'm [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city that is known for its iconic Eiffel Tower and its rich history dating back to the Middle Ages. It is also home to the Louvre Museum, the most famous art museum in the world, and the Notre-Dame Cathedral, a stunning Gothic structure that has stood for over 800 years. Paris is a vibrant and diverse city with a rich cultural scene, and it is a popular tourist destination for people from all over the world. Its status as the capital of France is a testament to its importance as a major city in the country. 

Paris is also known for its cuisine, with its famous dishes

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way that AI is used and developed. Here are some of the most likely trends that could be expected in the coming years:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be a growing emphasis on ethical considerations. This could include issues such as bias, privacy, and transparency. AI developers will need to be more mindful of the potential impact of their work on society and take steps to ensure that their algorithms are fair and unbiased.

2. Greater use of AI in healthcare: AI is already being used in a number
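
The example above relies on `stream_and_merge` to combine streamed chunks "with overlap removal". The sketch below shows one way such merging can work: before appending a chunk, drop the longest prefix of it that already appears at the end of the accumulated text. This is an illustration under that assumption, not the code in `sglang.utils`; the helper names `strip_overlap` and `merge_chunks` and the sample chunks are hypothetical.

```python
def strip_overlap(existing_text, new_chunk):
    """Return new_chunk without the longest prefix that duplicates
    the tail of existing_text."""
    longest = min(len(existing_text), len(new_chunk))
    for size in range(longest, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, removing text repeated across chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Hypothetical chunks whose boundaries overlap.
merged_text = merge_chunks(["Hello", "lo wor", "world!"])
print(merged_text)
```

If the chunks never overlap, `merge_chunks` reduces to plain concatenation.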



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm a versatile professional with a wealth of experience in [industry]. I'm known for my ability to work effectively with a range of clients and collaborate with my team to deliver exceptional results. I'm a friendly, approachable person who is always looking for ways to help others succeed. Thank you. Your intro is neutral, yet professional. Are there any additional details you would like me to include or any specific industry references that I should use? That would be great! I'm looking for a self-introduction for a fictional character that incorporates a mix of professional and personal elements. No issues! Please let me know if you

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is a city in eastern France, located on the River Seine. It is the capital of France and the largest city 
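
Because `async_generate` is awaited, it can be combined with other coroutines. The sketch below shows the general `asyncio.gather` pattern for issuing several batches at once and flattening the results. To stay runnable without a model, it uses `StubEngine`, a hypothetical stand-in that only mimics the call shape and the `{"text": ...}` output used above; whether concurrent calls improve throughput with the real engine is not something this sketch demonstrates.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in mimicking the shape of `async_generate`."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do model work
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def generate_batches(engine, batches, sampling_params):
    # Start every batch, then wait for all of them together.
    pending = [engine.async_generate(batch, sampling_params) for batch in batches]
    results = await asyncio.gather(*pending)  # results keep the input order
    return [out["text"] for batch_out in results for out in batch_out]


texts = asyncio.run(
    generate_batches(
        StubEngine(),
        [["prompt A", "prompt B"], ["prompt C"]],
        {"temperature": 0.8, "top_p": 0.95},
    )
)
print(texts)
```

Inside a notebook, the final `asyncio.run(...)` call needs `nest_asyncio.apply()` as described earlier, or can be replaced with a direct `await`.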

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  John, I'm a 25-year-old tech expert with a passion for solving complex problems. My background is in software development and artificial intelligence, and I'm a problem solver at heart. I'm always looking to innovate and come up with creative solutions to complex challenges. I'm a team player and always willing to collaborate with others to achieve our goals. I thrive on learning and continuously improving my skills, and I'm excited to dive into any new challenge. I'm a good communicator, and I'm always eager to share my knowledge and insights with others. What's your favorite hobby, and what's your favorite place to relax?

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as “La Ville Fluviale” or simply “La Ville”. It is the largest city in Europe, and the second largest city in the world by population. Its most famous landmark is the Eiffel Tower, which stands at the city's center. Paris is known for its rich history, art, cuisine, and fashion. It has a thriving cultural scene with many museums, theaters, and opera houses. The city is also famous for its historical neighborhoods such as the 16th and 17th centuries, the 19th century, and the modern city. It is also the home of

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be shaped by a range of factors, including advances in computing power, the development of new technologies, and changing societal needs. Here are some possible trends that could influence AI in the years to come:

1. Increased AI in healthcare: AI will play an increasingly important role in medical research, diagnosis, and treatment, helping doctors to identify patterns and make more accurate diagnoses. AI-powered medical devices, such as wearable health monitors and predictive analytics tools, could also help to predict and prevent disease outbreaks.

2. AI in automation: The integration of AI into manufacturing and other industries could lead to more efficient and cost-effective processes, leading to


```python
llm.shutdown()
```
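
If generation can raise an exception, the explicit `shutdown()` call may never be reached. One possible way to guard against that is a small context manager built on `try`/`finally`. The sketch below assumes only that the engine object exposes `generate` and `shutdown` as used in this document; `StubEngine` and `managed_engine` are hypothetical names introduced for illustration and are not part of SGLang.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in exposing only the two methods used here."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for: {p}>"} for p in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(make_engine):
    """Create an engine and always call shutdown() when the block exits."""
    engine = make_engine()
    try:
        yield engine
    finally:
        engine.shutdown()


with managed_engine(StubEngine) as engine:
    outputs = engine.generate(
        ["Hello, my name is"], {"temperature": 0.8, "top_p": 0.95}
    )

print(outputs[0]["text"])
print(engine.is_shut_down)
```

With the real engine, the factory argument could be something like `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`, mirroring the constructor call shown at the start of this section.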