# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
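
To see what this works around, the sketch below reproduces the underlying error with the standard library only: calling `asyncio.run` while an event loop is already running raises `RuntimeError`. The names `inner` and `outer` are illustrative, and `outer` merely stands in for an environment (such as a notebook) that already owns a running loop. Nothing here is SGLang-specific.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are now inside a running event loop, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: not allowed by plain asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Once `nest_asyncio.apply()` has been called, such nested calls are permitted, which is why the cells in this document can call `asyncio.run(main())` directly.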

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine.
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.48it/s]



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Dusan. I have been working in my research center since 2001. I’m currently a professor in the School of Computing, which is located in the heart of the metropolitan area of Vienna, Austria. My main research areas are in the field of distributed computing, network technology, and distributed programming. I’m the director of the European Graduate School (ESRC) for the Center for Computer Science, which is located in Vienna. I’m also the director of the Center for the Foundation of Computing, which is located in Vienna. I also serve as the director of the Center for the Future of Computing, which is located in
Prompt: The president of the United States is
Generated text:  a noble-born nobleman, born in 1959, and he has a complex and multifaceted personality, which includes a penchant for being affectionate towards his family members, a tendency to be a bit of a prankster, and a lack of clear-cut decisions. His birth year is 1959, and his death ye
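
The `temperature` and `top_p` entries in `sampling_params` control how each next token is drawn. As a rough illustration of the general technique (not SGLang's actual sampler), the following sketch applies temperature scaling and nucleus (top-p) filtering to a hypothetical list of logits. The function name `top_p_filter` and the logit values are made up for this example.

```python
import math


def top_p_filter(logits, temperature=0.8, top_p=0.95):
    """Return {token_index: probability} for the tokens kept by top-p filtering."""
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalize over the kept tokens; sampling would draw from this set.
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


# Hypothetical logits for a four-token vocabulary.
candidates = top_p_filter([4.0, 3.0, 1.0, -2.0], temperature=0.8, top_p=0.95)
print(candidates)
```

With these made-up logits only the two most likely tokens survive the filter. A lower `temperature` or `top_p` narrows the candidate set further and makes the output more deterministic.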

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can you tell me about yourself? I'm a [job title] at [company name], and I'm passionate about [job title] and I'm always looking for ways to [job title] my skills and knowledge. I'm always eager to learn and grow, and I'm always looking for opportunities to contribute to the company and help it succeed. What's your favorite hobby or activity? I'm a [job title] at [company name], and I enjoy [job title]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light, and is the largest city in the European Union and the second-largest city in the world by population. It is located in the south of France and is the seat of government, administration, and culture for the country. Paris is famous for its architecture, art, and cuisine, and is home to many famous landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also a major center for business, finance, and tourism in Europe. Paris is a cultural and intellectual hub, and is known for its annual festivals and events such as the World Cup

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased automation: As AI becomes more advanced, it is likely to be used more extensively in areas such as manufacturing, transportation, and customer service. This will lead to increased automation of tasks and processes, freeing up workers to focus on more complex and creative work.

2. AI ethics and privacy: As AI becomes more advanced, there will be a growing concern about its impact on society. This will lead to increased regulation and scrutiny of AI development and deployment, as well as a need for greater
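
The example above relies on a helper that merges streamed chunks "with overlap removal". As an illustration of that idea only, the sketch below merges chunks whose beginnings may repeat the end of the text received so far. The helper names `drop_repeated_prefix` and `merge_chunks` and the sample chunks are hypothetical. This is not the code behind `stream_and_merge`, whose internals are not shown in this document.

```python
def drop_repeated_prefix(existing: str, chunk: str) -> str:
    """Remove the longest prefix of `chunk` that already ends `existing`."""
    limit = min(len(existing), len(chunk))
    for size in range(limit, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text that overlaps the running result."""
    merged = ""
    for chunk in chunks:
        merged += drop_repeated_prefix(merged, chunk)
    return merged


# Hypothetical stream in which some chunks partly repeat earlier text.
chunks = [" Paris", " Paris, also", " also known as", " the City of Light"]
merged = merge_chunks(chunks)
print(merged)
```

A suffix/prefix heuristic like this one can also drop a genuine repetition that happens to fall on a chunk boundary, so it is best treated as a display convenience rather than an exact reconstruction.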



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am [Age]. I am [Occupation]. I am the type of person you don't want to be in a relationship with. I am a[Occupation] who is dedicated to [My Hobby/Interest/Goal]. I am [My Attitude]. What kind of person are you? Myattitude is me, and I am [Attitude]. How is it that you got into this business?
I have always had a passion for [Your Hobby/Interest/Goal], and my drive for success has always been my driving force. I am always looking for ways to make a difference and help others

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, an international metropolis known for its rich history and world-class museums, including the Louvre and the Musée d'Orsay. Paris is also a major tourist destination, famous for its cultural offerings, fashion scene, and gastronomy. The French Parliament is located in the
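
The batch call above submits all prompts at once. When requests arrive independently, as in a custom server, a common asyncio pattern is to start one coroutine per prompt and await them together with `asyncio.gather`. The sketch below shows that pattern with a stand-in engine so it runs without a model. `StubEngine` and `generate_all` are hypothetical, and the stub only mimics the `{"text": ...}` output shape seen above. It also assumes that a single-prompt call yields a single result dict.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the engine; it does not run a model."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do some work
        return {"text": f" <stub completion for {prompt!r}>"}


async def generate_all(engine, prompts, sampling_params):
    # One coroutine per prompt; gather returns results in input order.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


prompts = ["The capital of France is", "The future of AI is"]
sampling_params = {"temperature": 0.8, "top_p": 0.95}
outputs = asyncio.run(generate_all(StubEngine(), prompts, sampling_params))

for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Because `asyncio.gather` preserves the order of its arguments, the results can still be paired with the prompts using `zip`.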

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [job title] at [company]. I bring with me a strong work ethic, reliability, and a positive attitude. I strive to make a difference in my team's work and strive to always look for ways to improve my work habits. I am always ready to help others and am always willing to learn new skills. I'm also a good listener and I take the time to listen to feedback and suggestions from others. I'm a team player and always strive to build positive relationships with others. I love to work on ideas and always have a fresh perspective on things. I'm a hard worker and I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic landmarks, cultural attractions, and rich history, including the Louvre Museum, the Eiffel Tower, and the Notre-Dame Cathedral. Paris is also home to many famous museums, including the Musée d'Orsay, the Musée d'Orsay, and the Musée de l'Orangerie. In terms of cuisine, Paris is famous for its French cuisine, including its famous Paris baguette, cheese, and wine. The city is also known for its fashion industry, with iconic fashion houses such as Chanel, Louis Vuitton, and Fendi. Paris is a bustling

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by a number of trends and developments that are expected to shape the way we live, work, and interact with technology. Some of the potential future trends in artificial intelligence include:

1. Autonomous vehicles: The development of autonomous vehicles is already underway, and it is expected to become more prevalent in the future. These vehicles will be able to navigate roads and navigate their own routes without human intervention, reducing the risk of accidents and improving safety.

2. Improved language understanding: AI systems will become more capable of understanding human language, enabling them to better communicate and provide information to users.

3. Increased use of AI in healthcare:
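
The streaming loop above consumes an asynchronous generator and prints each chunk as soon as it arrives, using `end=""` and `flush=True` so the text appears as one continuous line. The sketch below reproduces that consumption pattern with a stand-in generator so it runs without a model. `stub_stream` and `collect` are hypothetical names, and the stub simply splits a fixed string into pieces.

```python
import asyncio


async def stub_stream(text):
    """Hypothetical stand-in for a streaming call: yields the text in pieces."""
    for piece in text.split(" "):
        await asyncio.sleep(0)  # give control back to the event loop
        yield piece + " "


async def collect(stream):
    parts = []
    async for chunk in stream:
        # Print each chunk immediately, without a trailing newline.
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


streamed = asyncio.run(collect(stub_stream("Paris is a bustling city")))
```

Collecting the chunks in a list while printing them keeps the full text available after the stream ends, for example for logging or post-processing.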




```python
llm.shutdown()
```