# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
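
As a rough illustration of the custom-server use case, the sketch below shows the request-handling core that such a server might wrap around the engine. `StubEngine` and `handle_generate` are hypothetical names introduced for this example. The stub only mimics the `generate(prompts, sampling_params)` call and the `output['text']` field used later in this document, so it runs without a model or GPU. The linked `custom_server.py` may be structured quite differently.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a model."""

    def generate(self, prompts, sampling_params):
        # Mirrors the shape used in this document: one dict with a "text" key per prompt.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def handle_generate(engine, body: bytes) -> bytes:
    """Decode a JSON request, call the engine, and encode a JSON response."""
    request = json.loads(body)
    prompts = request["prompts"]
    sampling_params = request.get("sampling_params", {})
    outputs = engine.generate(prompts, sampling_params)
    response = {"texts": [output["text"] for output in outputs]}
    return json.dumps(response).encode("utf-8")


engine = StubEngine()
raw = json.dumps({"prompts": ["The capital of France is"]}).encode("utf-8")
reply = json.loads(handle_generate(engine, raw))
print(reply["texts"])
```

Keeping the handler independent of any web framework means the same function could sit behind whichever HTTP or RPC layer a project already uses.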



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
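
A likely reason for this requirement is that IPython and Jupyter already run an event loop, and plain `asyncio` refuses to start a second one inside it. The sketch below reproduces that error using only the standard library. It does not use SGLang or `nest_asyncio`, and the function names are illustrative.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # outer() already runs inside an event loop, much like a notebook cell does.
    coro = inner()
    try:
        asyncio.run(coro)  # Attempt to start a second loop from inside the first.
    except RuntimeError as exc:
        coro.close()  # Avoid a "never awaited" warning for the unused coroutine.
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

`nest_asyncio.apply()` patches `asyncio` so that such nested calls are allowed.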

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:08<00:00,  8.03s/it]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  2013Year3.

Hello! I am 2013Year3 and I am a high school student from China. I just got a new model of my phone, which is very good. I am excited to share my story with you guys.

First, I want to share my story about my phone. I want to share my phone with you all, and we should all learn from it. I use my phone for shopping, studying, and chatting with friends. I use my phone to record my daily life, and I have a good relationship with my mom. I also use my phone to take pictures and
Prompt: The president of the United States is
Generated text:  a male. The vice president of the United States is a female. Can the vice president of the United States be married to the president of the United States? To determine if the vice president of the United States can be married to the president, let's break down the given information and analyze it step by step.

1. The president of the United States is a male. This means the president is not married t
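
The `sampling_params` dictionary above sets `temperature` and `top_p`. As a generic illustration of what these two knobs usually mean, the sketch below applies temperature scaling and top-p (nucleus) filtering to a toy set of logits. It is not SGLang's sampler. The function name `nucleus_candidates` and the logit values are made up for this example.

```python
import math


def nucleus_candidates(logits, temperature, top_p):
    """Return the (token, probability) pairs kept after temperature scaling and top-p filtering."""
    # Temperature rescales logits: values below 1 sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    # Softmax with max-subtraction for numerical stability.
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in probs:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    mass = sum(prob for _, prob in kept)
    return [(tok, prob / mass) for tok, prob in kept]


toy_logits = {" Paris": 5.0, " Lyon": 2.0, " a": 1.0, " the": 0.5}
kept = nucleus_candidates(toy_logits, temperature=0.8, top_p=0.95)
print(kept)
```

With these made-up logits, a temperature of 0.8 leaves a single candidate, while a temperature of 2.0 keeps all four. Higher temperatures therefore give more varied completions.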

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I am a [job title] at [company name]. I have been working in this field for [number of years] years. I am passionate about [reason for being in this field]. I am always looking for ways to [what I am looking for in my job]. I am [how I am different from other candidates]. I am [what I am looking for in a potential employer]. I am [how I am looking for in a potential employer]. I am [how I am looking for in a potential employer]. I am [how I am looking for in a potential employer]. I am [how I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city that serves as the political, cultural, and economic center of the country. It is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum, as well as its rich history and diverse cultural scene. Paris is also known for its fashion industry, art scene, and its role as a global hub for business and commerce. The city is home to many famous landmarks and attractions, including the Palace of Versailles, the Louvre Museum, and the Arc de Triomphe. Paris is a city that is steeped in history and culture, and continues to be

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more advanced, it is likely to become more integrated with human intelligence, allowing it to learn and adapt in ways that are difficult for humans to do. This could lead to more efficient and effective AI systems that can perform tasks that are currently beyond the capabilities of humans.

2. Greater emphasis on ethical and social implications: As AI becomes more integrated with human intelligence, there will be a greater emphasis on ethical and social implications. This could lead to more careful design and development of AI systems, as well as more consideration of the potential impact of AI
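
The "overlap removal" banner and the helper name `stream_and_merge` suggest that streamed chunks may repeat text that has already been received. The sketch below shows one generic way to merge such chunks. `merge_chunk` is a hypothetical helper written for this illustration, not SGLang's implementation, and the chunk strings are invented.

```python
def merge_chunk(accumulated: str, chunk: str) -> str:
    """Append chunk to accumulated, dropping the longest prefix of chunk
    that already appears as a suffix of accumulated."""
    max_overlap = min(len(accumulated), len(chunk))
    for size in range(max_overlap, 0, -1):
        if accumulated.endswith(chunk[:size]):
            return accumulated + chunk[size:]
    return accumulated + chunk


chunks = [" Paris", " Paris, the city", " city that serves", " as the capital"]
merged = ""
for chunk in chunks:
    merged = merge_chunk(merged, chunk)
print(merged)
```

A suffix/prefix match like this can also swallow text that the model genuinely repeated, so a production helper would likely need additional information, such as character offsets, to stay exact.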



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [profession] and [age]. I have always been passionate about [your profession] and have spent years honing my skills. I am constantly learning and growing, and I believe that my dedication to my craft makes me unique and exciting to work with. I am an [any profession] and I am always looking for new and exciting challenges to pursue. I am confident and hardworking, and I am always willing to invest in myself and my team. I am committed to [any profession] and I am always on the lookout for opportunities to learn and grow. I am excited to work with you and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is the largest city in France and serves as the political, economic, cultural, and intellectual center of the country. It is also the seat of the French government and capital of 
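
One benefit of an awaitable API is that several requests can be in flight at once. The sketch below submits two prompt batches concurrently with `asyncio.gather`. `StubAsyncEngine` is a hypothetical stand-in that only mimics the `async_generate(prompts, sampling_params)` call shown above. How the real engine schedules concurrent calls is not covered here and should be treated as an assumption.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine; only mimics async_generate's shape."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # Simulate inference latency.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    """Submit several prompt batches concurrently and return results in batch order."""
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
params = {"temperature": 0.8, "top_p": 0.95}
results = asyncio.run(run_batches(StubAsyncEngine(), batches, params))
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

`asyncio.gather` returns results in the order the awaitables were passed, so each result list can be zipped back to its batch.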

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name] and I am an [insert your profession]. I have a passion for [insert something that inspires you], and I love exploring new experiences. I'm always eager to learn and grow, and I'm always looking for new adventures and opportunities to grow as a person. I'm confident in my abilities and am always ready to take on new challenges. Whether it's a new skill or a new adventure, I'm ready to step up and take on the world! I'm excited to meet you and see what I can do for you! [Your Name] [insert how you would like to be introduced, such as "



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as "La Petite-Poule". It is the largest city in France, with an estimated population of over 2. 3 million people. Paris is famous for its architecture, museums, and cultural institutions, including the Louvre and the Eiffel Tower. It is also known for its art, food, and fashion, as well as its annual Fête de la Feuille, a colorful carnival. Paris is the heart of French culture and is a major international center of education, business, and entertainment. It is also known for its romantic history and romantic architecture.

Paris is often referred to as


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to involve a number of different trends that will shape the way we interact with technology, work, and even our families. Some potential trends include:

1. Increased focus on ethical AI: There is growing concern that AI can have unintended consequences if not developed and used ethically. As a result, there is increasing pressure for developers to consider the ethical implications of their creations and to ensure that they are designed for the most positive outcomes.

2. Rise of specialized AI: As AI becomes more complex and capable, there may be a trend towards creating specialized AI that can be tailored to specific tasks or applications. This could lead to increased efficiency,
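
When streaming, it is often useful both to display chunks as they arrive and to keep the full text for later use. The sketch below shows that pattern with an `async for` loop. `stub_stream` is a hypothetical async generator that stands in for a streaming call such as `async_stream_and_merge`, and `collect` is likewise an illustrative name.

```python
import asyncio


async def stub_stream(text: str):
    """Hypothetical stand-in for a streaming call: yields one word-sized chunk at a time."""
    for index, word in enumerate(text.split(" ")):
        await asyncio.sleep(0)  # Hand control back to the event loop between chunks.
        yield word if index == 0 else " " + word


async def collect(stream) -> str:
    """Print chunks as they arrive and also return the assembled text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # New line once the stream is exhausted.
    return "".join(parts)


full_text = asyncio.run(collect(stub_stream("Paris is the capital of France.")))
```

Collecting chunks in a list and joining once at the end avoids repeated string concatenation for long generations.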




In [6]:
llm.shutdown()
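
In a standalone script, as opposed to a notebook, it can be safer to place the `shutdown()` call in a `finally` block so that it also runs when generation fails. The sketch below demonstrates the pattern with a hypothetical `StubEngine` that always raises. Only the `generate` and `shutdown` method names are taken from this document.

```python
class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records whether shutdown ran."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        raise RuntimeError("simulated failure during generation")

    def shutdown(self):
        self.is_shut_down = True


engine = StubEngine()
try:
    engine.generate(["Hello, my name is"], {"temperature": 0.8})
except RuntimeError as exc:
    print(f"generation failed: {exc}")
finally:
    engine.shutdown()  # Runs whether or not generation succeeded.
```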