# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. It is meant for use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. For a complete, working Python script, see [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
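The linked script is the reference implementation. The sketch below is a rough, framework-free illustration of the same idea. It wraps an engine object in a minimal standard-library WSGI app that exposes a hypothetical `POST /generate` endpoint. `EchoEngine` is a hypothetical stand-in whose `generate(prompt, sampling_params)` returns a dict with a `"text"` key. This is meant to mirror how `llm.generate` is used later on this page, assuming a single-prompt call returns one such dict. The endpoint path and JSON shape are assumptions, not SGLang's server API.

```python
import io
import json
from wsgiref.util import setup_testing_defaults


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine; a real server would call llm.generate(...) here."""

    def generate(self, prompt, sampling_params):
        temperature = sampling_params.get("temperature")
        return {"text": f" <{len(prompt)}-char prompt, temperature={temperature}>"}


def make_app(engine):
    """Build a minimal WSGI app that exposes POST /generate on top of an engine object."""

    def app(environ, start_response):
        if environ["REQUEST_METHOD"] != "POST" or environ.get("PATH_INFO") != "/generate":
            start_response("404 Not Found", [("Content-Type", "text/plain")])
            return [b"not found"]
        # Read the JSON request body: {"prompt": ..., "sampling_params": {...}}
        length = int(environ.get("CONTENT_LENGTH") or 0)
        body = json.loads(environ["wsgi.input"].read(length))
        output = engine.generate(body["prompt"], body.get("sampling_params", {}))
        payload = json.dumps({"text": output["text"]}).encode()
        start_response(
            "200 OK",
            [("Content-Type", "application/json"), ("Content-Length", str(len(payload)))],
        )
        return [payload]

    return app


# Exercise the app in-process, without opening a socket.
app = make_app(EchoEngine())
request_body = json.dumps({"prompt": "Hello", "sampling_params": {"temperature": 0.8}}).encode()
environ = {}
setup_testing_defaults(environ)  # fill in the standard WSGI keys
environ.update(
    {
        "REQUEST_METHOD": "POST",
        "PATH_INFO": "/generate",
        "CONTENT_LENGTH": str(len(request_body)),
        "wsgi.input": io.BytesIO(request_body),
    }
)
statuses = []
response = b"".join(app(environ, lambda status, headers: statuses.append(status)))
print(statuses[0], json.loads(response))
```

To serve this over HTTP, you could pass `app` to `wsgiref.simple_server.make_server(host, port, app)` and call `serve_forever()`. WSGI handles each request synchronously. A production server would more likely use an async framework together with `llm.async_generate`. See the linked example for the project's own version.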



## Nest Asyncio
If you use the **Offline Engine** in IPython, Jupyter, or any other code that already runs an asyncio event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
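IPython and Jupyter already run an asyncio event loop, and plain asyncio refuses to start another loop from inside a running one. The standard-library snippet below reproduces the error you would hit by calling `asyncio.run()` from within a running loop, which the async examples on this page do. `nest_asyncio.apply()` patches asyncio so that such nested calls are allowed. `generate_once` is a hypothetical placeholder coroutine.

```python
import asyncio


async def generate_once():
    """Hypothetical placeholder for an async generation call."""
    return "done"


async def notebook_cell():
    # Inside this coroutine an event loop is already running, as it is in a Jupyter cell.
    coro = generate_once()
    try:
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"


result = asyncio.run(notebook_cell())
print(result)
```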

## Advanced Usage

The engine also supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) and [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch  # CI-specific setup
else:
    import nest_asyncio

    nest_asyncio.apply()  # allow nested event loops in notebooks


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```

```
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.19it/s]
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```
Prompt: Hello, my name is
Generated text:  Marie. I'm a young woman who was born in 2013. How old are you now? I can tell you that you were born in 2013, so you are 13 years old. That's a lot for you to be doing! Marie, how old are you now? (I'll wait a moment before answering)
I'm 14 years old, so you're right. That's a good age to be! Marie, how old are you now? (I'll wait a moment before answering) Marie, how old are you now? (I'll wait a moment before answering
Prompt: The president of the United States is
Generated text:  trying to decide whether to go to war with another country. The president has two options: Option A, which would cost the nation $200 billion in today's dollars, and Option B, which would cost the nation $300 billion in today's dollars, but the country would have to pay the full amount of $200 billion in the future. The president is concerned that the country's economy would collapse if the country went to war with another country. The government, however, can pa
```
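The `sampling_params` dict controls decoding. `temperature` and `top_p` are the two knobs used throughout this page. The sketch below shows roughly what they do to one next-token distribution, using plain Python: temperature scaling followed by nucleus (top-p) filtering. It is a generic textbook illustration with made-up logits, not SGLang's sampling implementation.

```python
import math
import random


def sample_top_p(logits, temperature=0.8, top_p=0.95, rng=random):
    """Sample a token index using temperature scaling, then nucleus (top-p) filtering."""
    # Temperature: divide logits before the softmax; lower values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p: keep the smallest set of most likely tokens whose cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the kept tokens and draw one of them.
    kept_mass = sum(probs[i] for i in kept)
    return rng.choices(kept, weights=[probs[i] / kept_mass for i in kept], k=1)[0]


logits = [4.0, 3.5, 1.0, -2.0]  # made-up scores for a 4-token vocabulary
rng = random.Random(0)
samples = [sample_top_p(logits, temperature=0.2, top_p=0.9, rng=rng) for _ in range(5)]
print(samples)
```

With temperature 0.2, the top token already holds about 92% of the probability mass. So `top_p=0.9` keeps only that token, and every draw returns 0. With the settings from the cell above (temperature 0.8, top_p 0.95), the same logits keep the two most likely tokens.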

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


```
=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short, positive, enthusiastic statement about yourself]. I'm always looking for new challenges and opportunities to grow and learn. What do you like to do in your free time? I enjoy [insert a short, positive, enthusiastic statement about your hobbies or interests]. I'm always looking for new experiences and learning opportunities to expand my horizons. What's your favorite hobby or activity? I love [insert a short, positive, enthusiastic statement

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is the largest city in Europe and the third-largest city in the world by population. Paris is known for its rich history, beautiful architecture, and vibrant culture. The city is home to many famous landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. Paris is also a major hub for business, finance, and tourism, making it a popular destination for tourists and locals alike. The city is home to many cultural institutions and events throughout the year, including the annual Eiffel Tower Festival and the annual World Cup football tournament. Paris is a city of contrasts, with its modern architecture and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that are expected to shape the future of AI:

1. Increased automation: AI is already being used in a wide range of industries, from manufacturing to healthcare to finance. As the technology continues to advance, we can expect to see more automation in various sectors, leading to increased efficiency and productivity.

2. Enhanced privacy and security: As AI becomes more integrated into our daily lives, there will be an increased need for privacy and security. This will require the development of new technologies and regulations
```
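The `stream_and_merge` helper, imported from `sglang.utils` in the first cell, assembles the streamed chunks into one string, and the printed banner refers to overlap removal. The idea can be sketched as follows. Assume a new chunk may begin with text that already ends the accumulated output. Find the longest suffix of the merged text that is also a prefix of the new chunk, and drop that part. This illustrates the technique and is not necessarily the helper's exact code.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    for size in range(min(len(existing_text), len(new_chunk)), 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks):
    """Concatenate streamed chunks, removing the overlap between each chunk and the text so far."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged = merge_stream(["Paris is", " is the capital", " capital of France."])
print(merged)
```

This is a heuristic. A legitimate repeat at a chunk boundary is dropped too: merging `["ha", "ha"]` yields `"ha"` rather than `"haha"`.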



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


```
=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert character's name] and I'm a [insert fictional profession or role]. I'm really good at [insert a specific skill or trait that makes me unique or impressive]. I have a lot of energy, so I'm always able to get things done quickly and efficiently. I'm always ready for action and can take on any challenge. I enjoy working with people and always try to build strong relationships with them. I'm a great listener and always try to understand the perspectives of others. I love to learn and always strive to improve myself. Thank you for asking! What makes you unique or impressive?
Hello, my name is [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city known as "The City of Light" and "The City of Letters". It is located on the Loire River in the center of the country and is home to many famous landma
```
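In the cell above, a single `async_generate` call handles the whole prompt list. When requests arrive one at a time, as in a custom server, a common asyncio pattern is to issue one call per request and await them together with `asyncio.gather`, which returns results in input order. The sketch below shows the pattern with a hypothetical `FakeEngine` so it runs without a model. Its `async_generate(prompt, sampling_params)` returns a `{"text": ...}` dict. The sketch assumes a real `sgl.Engine` tolerates concurrent `async_generate` calls, which you should verify for your version.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine so the pattern runs without a GPU."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # pretend to wait for the model
        return {"text": f" <completion for {prompt!r}>"}


async def generate_concurrently(engine, prompts, sampling_params):
    # Start one request per prompt and wait for all of them; gather preserves input order.
    return await asyncio.gather(
        *(engine.async_generate(p, sampling_params) for p in prompts)
    )


outputs = asyncio.run(
    generate_concurrently(FakeEngine(), ["A", "B"], {"temperature": 0.8, "top_p": 0.95})
)
for output in outputs:
    print(output["text"])
```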

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


```
=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name], and I'm a [job title] at [company name]. I'm excited to dive into this new chapter and see what surprises await me! [Insert relevant details about your character, such as your role, what you do, and any unique skills or experiences you bring to the table.]

---

Please modify the existing dialogue to make it more concise and flowery, while also incorporating a hint of sarcasm. Perhaps you could say something like: "I'm a [job title] at [company name], and I've got some pretty cool stuff going on! I'm excited to get started and see what this place

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest and most populous city in the country. It is located on the left bank of the Seine river and is the seat of the Government, the head of state and head of government of France. It has a population of over 2 million people. Paris is known for its history, art, music, and cuisine. It is also an important financial center and home to many world-renowned museums and galleries. Despite its size, Paris is also a charming city with beautiful parks, monuments, and ancient streets. It is considered one of the world's most beautiful cities and is often called "The City of Light

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  uncertain and constantly evolving, but here are some potential trends that are likely to shape the field in the coming years:

1. Increased AI integration with human decision-making: With the increasing adoption of AI in various industries, we are likely to see more integration between AI and human decision-making processes. This could lead to more complex and nuanced AI systems that require human oversight.

2. AI systems becoming more accessible and affordable: As the cost of AI technology continues to decline, we may see more AI systems becoming accessible to individuals and organizations at a lower cost. This could lead to more widespread adoption of AI in various sectors, and more people being able
```
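The loop above drains one stream before it starts the next. To stream several prompts at the same time, one option is to start a task per prompt and let each task consume its own stream. The sketch below does this with a hypothetical `fake_stream` async generator in place of the engine's streaming output, and it records the order in which chunks arrive. With a real engine, the interleaving depends on scheduling. Running concurrent streams against one `Engine` instance is an assumption you should check.

```python
import asyncio


async def fake_stream(prompt, n_chunks=3):
    """Hypothetical stand-in for a streaming generation call: yields text chunks for one prompt."""
    for i in range(n_chunks):
        await asyncio.sleep(0)  # let other tasks run, as a real stream would while waiting
        yield f"{prompt}-chunk{i} "


async def consume(prompt, arrivals):
    # Read one stream to the end, logging each chunk in arrival order.
    text = ""
    async for chunk in fake_stream(prompt):
        arrivals.append(chunk.strip())
        text += chunk
    return text


async def main(prompts):
    arrivals = []
    # Create every task before awaiting any of them so all streams progress concurrently.
    tasks = [asyncio.create_task(consume(p, arrivals)) for p in prompts]
    texts = [await task for task in tasks]
    return texts, arrivals


texts, arrivals = asyncio.run(main(["A", "B"]))
print(texts)
print(arrivals)
```

Every task is created before any is awaited. As a result, chunks from the two prompts arrive interleaved (A, B, A, B, ...) rather than all of A followed by all of B. When printing live, tag each chunk with its prompt so the mixed output stays readable.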




```python
llm.shutdown()
```