# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
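
To see why the patch is needed, the following minimal sketch uses only the standard library to reproduce the situation: calling `asyncio.run` from code that is already running inside an event loop, as a notebook cell typically is. The function names `inner` and `outer` are illustrative and not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second loop in the same thread
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Without a patch such as `nest_asyncio.apply()`, the inner `asyncio.run` call is rejected with a `RuntimeError`, which is the failure the snippet above is meant to avoid.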

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.89it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Aisha and I am an educator and community organizer focused on creating an inclusive environment for all. My passion is to promote equity, justice, and fairness for all people through education and activism. I believe in the power of diversity, inclusion, and collaboration to create a more just, equitable, and sustainable world for everyone. I am committed to continuous learning and engagement in social justice issues.
What is your background and how did you get into education? My background is in education and my passion for education is rooted in a desire to inspire and empower people. I started my education journey through college and graduate school, where I gained valuable experience and knowledge
Prompt: The president of the United States is
Generated text:  a position that can be filled by someone who has reached a certain age. The last person to hold this position was George W. Bush. Why is this position so important? The president of t
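
The `sampling_params` used above set `temperature` and `top_p`. As a rough illustration of what these two knobs usually mean, here is a toy sketch in plain Python. It is not SGLang's implementation (real engines sample over the full vocabulary on the accelerator); the function name `top_p_distribution` and the four made-up logits are assumptions for the example.

```python
import math


def top_p_distribution(logits, temperature=0.8, top_p=0.95):
    """Return the renormalized distribution after temperature scaling and top-p filtering."""
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the peak for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-likely tokens whose cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


toy_logits = [2.0, 1.0, 0.5, -3.0]  # made-up scores for a four-token vocabulary
dist = top_p_distribution(toy_logits)
print(dist)
```

With these toy numbers the least likely token falls outside the 0.95 nucleus and is never sampled, while the remaining three are renormalized to sum to 1.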

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your profession or role]. I enjoy [insert a short description of your hobbies or interests]. What brings you to [company name] and what do you do there? I'm always looking for new challenges and opportunities to grow and learn. What do you think is the most important thing for a successful career in [company name]? I believe that a successful career in [company name] is all about [insert a short

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historical and cultural center with a rich history dating back to the Roman Empire and the Middle Ages. Paris is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum, as well as its vibrant arts scene and world-class cuisine. The city is also home to many museums, theaters, and other cultural institutions, making it a popular destination for tourists and locals alike. Paris is a major hub for business, finance, and international affairs, and is a major center for the arts and culture industry. It is also known for

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the development of the technology in the coming years. Here are some of the most likely trends that are expected to shape the future of AI:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI technology continues to improve, we can expect to see even more widespread use of AI in healthcare, particularly in areas such as diagnosis, treatment planning, and patient care.

2. Increased use of AI in finance: AI is already being used in finance to improve fraud detection and risk management. As AI technology
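
The banner above mentions "overlap removal". One plausible way to merge streamed chunks without repeating text is to drop the part of each new chunk that already ends the accumulated output. The sketch below illustrates that idea only; `strip_overlap`, `merge_chunks`, and the sample chunks are hypothetical and are not taken from `sglang.utils`.

```python
def strip_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that already ends `existing`."""
    limit = min(len(existing), len(chunk))
    for size in range(limit, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, skipping any repeated overlap."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each piece repeats the tail of the previous one.
chunks = [" Paris, also", " also known as", " as the City of Light."]
merged = merge_chunks(chunks)
print(merged)
```

A suffix/prefix heuristic like this cannot tell an accidental overlap from a genuine repetition in the model output (for example `"ha"` followed by `"ha"`), so it is best treated as a convenience for display rather than an exact reconstruction.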



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [job title] with a strong passion for [occupation]. I have always been driven by a desire to help others and make a positive impact in the world. I believe that every person has the potential to reach their full potential and that by working together, we can achieve great things. 

I am dedicated to using my skills, knowledge and passion to make a positive difference in the world. I am always looking for new opportunities to learn and grow, and I am a self-starter with a can-do attitude. I am here to help anyone I meet, no matter their background or circumstances. I am

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is an international and cultural center known for its historic landmarks, iconic landmarks, museums, and a famous fashion industry.

What are some unique attractions or 
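
Because `async_generate` is awaited, it can be combined with other coroutines. The sketch below shows the general `asyncio.gather` pattern using a stand-in class: `StubEngine` is hypothetical and only mimics the call shape used in the cell above (a list of prompts in, a list of dicts with a `"text"` key out). How a real engine schedules overlapping calls is not covered here.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for an engine; it only mimics the call shape used above."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do inference
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Schedule every batch at once and wait for all of them together.
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)  # results keep the order of `batches`


batches = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
params = {"temperature": 0.8, "top_p": 0.95}
results = asyncio.run(run_batches(StubEngine(), batches, params))
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(prompt, "->", output["text"])
```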

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Age] year-old with [Occupation]. I'm currently living in [City] and have been looking forward to meeting you as it's been a while. I've always been fascinated by the world and it's cultures, and I love trying new foods and cultures. I'm really excited to meet you, [Name]!

This is a fictional character, and I'm not using any names or actual identities. I'd like to keep it neutral and factual, without any personal or cultural bias. Let me know if you would like me to adapt or modify anything. Let me know how I can



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is the largest city in France by population and is known for its iconic Eiffel Tower and numerous museums and art galleries. It is also home to the Louvre Museum, the Musée d'Orsay, and the Palais Garnier opera house. Paris is the world's 7th-largest city by population and has a rich cultural heritage that has been influenced by its history of invasions, conquests, and immigration. Its unique blend of old-world charm and modernity has made it a must-visit destination for tourists and locals alike. Paris is a city that constantly challenges and evolves, always striving to maintain



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  rapidly evolving and will likely see continued advancements in several key areas. Here are some possible future trends in artificial intelligence:

1. Increased Use of AI in Healthcare: AI is being used in healthcare to improve patient outcomes, reduce costs, and improve diagnosis and treatment. This includes using AI to analyze medical images, predict disease outbreaks, and improve patient care.

2. Enhanced Personalized Health: AI is being used to analyze a patient's genetic and medical history to develop personalized treatment plans. This can help doctors make better decisions and improve patient outcomes.

3. Automation of Routine Tasks: AI is being used to automate routine tasks in industries such as manufacturing
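
The streaming cell above consumes chunks with `async for`. The consumption pattern itself can be tried without a model: the sketch below replaces the engine with a hypothetical async generator, `fake_token_stream`, which yields word-sized chunks, and collects them the same way the cell does.

```python
import asyncio


async def fake_token_stream(text, delay=0.0):
    """Hypothetical stand-in for a streaming engine: yields one word-sized chunk at a time."""
    words = text.split(" ")
    for i, word in enumerate(words):
        await asyncio.sleep(delay)  # simulate waiting for the next chunk
        yield word if i == 0 else " " + word


async def collect(stream):
    pieces = []
    async for chunk in stream:
        print(chunk, end="", flush=True)  # show each chunk as soon as it arrives
        pieces.append(chunk)
    print()
    return "".join(pieces)


full_text = asyncio.run(collect(fake_token_stream("Paris is the capital of France.")))
```

Printing with `end=""` and `flush=True` shows each chunk as it arrives, while joining the collected pieces recovers the complete text once the stream ends.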




In [6]:
llm.shutdown()