# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
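To see why this is needed, the sketch below reproduces the underlying error using only the standard library: calling `asyncio.run()` from code that is already inside a running event loop raises `RuntimeError`. It is an illustration meant to be run as a plain script, and the helper names `inner` and `outer` are made up for this example. `nest_asyncio.apply()` patches asyncio so that such nested calls are permitted.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are now inside a running event loop, as in IPython or Jupyter.
    coro = inner()
    try:
        # Starting a second loop from inside the first one is rejected.
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return f"RuntimeError: {exc}"


result = asyncio.run(outer())
print(result)
```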

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```
Prompt: Hello, my name is
Generated text:  Madeline. I'm a high school student in Hong Kong and I love studying the piano. As an exchange student in a Chinese school, I have to study Chinese as my second language. I have to be very careful with my Chinese, so I need to practice a lot to improve my Chinese. 

My Chinese teacher introduced me to the piano and gave me some very useful instructions. She explained how to play the piano and taught me the right way to play the music I like. 

I enjoy listening to Chinese music as well. I'm learning about the history of Chinese music, and I'm interested in different kinds of Chinese music.
Prompt: The president of the United States is
Generated text:  a very important person. He is like a king. He is here to help the people of the country and protect their safety. He has a lot of important things to do. Sometimes he is busy, but he loves his job. His job is to make sure the country is safe. He makes sure everyone can live safely and have a goo
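The `temperature` and `top_p` keys in `sampling_params` control how each next token is sampled. The toy sketch below applies the standard definitions (temperature scaling of the logits, then nucleus filtering) to a made-up four-token distribution. It illustrates the idea only and is not SGLang's sampler; the function name `nucleus_filter` and the example logits are assumptions for this illustration.

```python
import math


def nucleus_filter(logits, temperature, top_p):
    """Return {token: prob} after temperature scaling and top-p truncation."""
    # Temperature scaling: lower values sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())
    exps = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize the surviving tokens so that they sum to 1.
    return {tok: p / cumulative for tok, p in kept.items()}


# Made-up logits for four candidate tokens.
toy_logits = {" Paris": 4.0, " Lyon": 2.0, " a": 1.0, " the": 0.5}
filtered = nucleus_filter(toy_logits, temperature=0.8, top_p=0.95)
print(filtered)
```

With these toy numbers only the two most likely tokens survive the filter, so the two least likely candidates can never be sampled.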

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Job Title] at [Company Name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [Age] year old, and I have a [Skill] in [Skill]. I'm always looking for ways to improve my [Skill], and I'm always eager to learn new things. I'm a [Interest] in [Interest], and I'm always looking for ways to make my life more interesting. I'm a [Personality] person, and I'm always looking for ways to make my life more enjoyable. I'm a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history dating back to the Roman Empire and the Middle Ages. Paris is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city is also famous for its cuisine, fashion, and music, and is a popular tourist destination. Paris is a cultural and economic hub of France and a major international city. It is home to many world-renowned museums, theaters, and art galleries, as well as a vibrant nightlife and fashion scene. Paris is a city of contrasts, with its modern

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are likely to shape the way we live, work, and interact with technology. Here are some potential trends that could be expected in the future:

1. Increased integration with human intelligence: As AI becomes more advanced, it is likely to become more integrated with human intelligence. This could lead to more sophisticated forms of AI that can learn from and adapt to human behavior, making it easier for humans to interact with AI systems.

2. Greater emphasis on ethical considerations: As AI becomes more advanced, there will be a greater emphasis on ethical considerations. This could lead to more rigorous testing and validation of
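The streaming cell above mentions overlap removal. As a rough illustration of that idea (a sketch under assumptions, not the implementation of `stream_and_merge`), suppose each streamed chunk may repeat some text that was already emitted. The hypothetical helpers below, `strip_overlap` and `merge_chunks`, drop the longest prefix of a new chunk that duplicates the end of the accumulated text.

```python
def strip_overlap(existing: str, chunk: str) -> str:
    """Return chunk without the longest prefix that repeats the end of existing."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate chunks, removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Made-up chunks in which the second and third repeat earlier text.
demo_chunks = [" Paris is", " is the capital", " capital of France."]
merged_text = merge_chunks(demo_chunks)
print(merged_text)
```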



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [Occupation]. I've always been fascinated by [Interest or Hobby] and have been learning it for years. My work is a testament to my passion, and I believe it's important to share it with the world. 

Thank you for considering my application. Let's connect to discuss how we can help achieve our goals together. 

Note: Don't include any personal information or opinions in your introduction. Start with a neutral greeting. Include your name, occupation, and an interest or hobby that makes you unique. Your introduction should be brief and concise, encouraging potential connections without revealing too much about your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the Loire Valley region of southwestern France. Paris is known for its stunning architecture, rich history, and diver
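Because `async_generate` is awaited, it can be combined with other asyncio work. The sketch below shows the general pattern with `asyncio.gather`, using a hypothetical stand-in coroutine `fake_async_generate` in place of the engine so that it runs without a model. The returned dictionaries mimic only the `text` key used above, and the script is meant to be run outside a notebook.

```python
import asyncio


async def fake_async_generate(prompt, sampling_params):
    # Hypothetical stand-in for an engine call; sleeps instead of running a model.
    await asyncio.sleep(0.01)
    return {"text": f" <completion for: {prompt}>"}


async def run_batch(prompts, sampling_params):
    # Schedule one request per prompt and wait for all of them concurrently.
    tasks = [fake_async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)  # results keep the order of prompts


demo_prompts = ["The capital of France is", "The future of AI is"]
demo_outputs = asyncio.run(
    run_batch(demo_prompts, {"temperature": 0.8, "top_p": 0.95})
)
for prompt, output in zip(demo_prompts, demo_outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```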

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm currently a [Your Occupation] at [Your Company/Institution]. I've always been passionate about [Your Passion], and I'm committed to [Your Mission/Your Mission Statement]. I thrive in [Your Role] and I'm always looking to learn and grow in order to be a better [Your Skill/Ability]. I've always been passionate about [Your Hobby] and I enjoy [Your Personality Trait], which has allowed me to be a good friend to my family and my friends. I'm always ready to learn and improve my skills and I'm confident that I can accomplish [Your Goal].

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

The statement is: Paris is the capital city of France.

A simpler version could be: Paris is the largest city in France.

This statement is factually correct and provides a concise summary of the capital's location within the country. However, it may not be the most direct or clear way to introduce Paris to a new visitor.

A more direct and straightforward way to introduce Paris might be:

"The French capital, located in the south of the country, is known as Paris."

This version is more concise, avoids using the word "capital, " and gives a more accurate and natural-sounding description of the city

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to involve a combination of advancements in hardware, software, and natural language processing. Some potential trends that are likely to shape AI in the coming years include:

1. Increased miniaturization: As the cost of manufacturing hardware continues to decline, the size and power of AI systems will likely increase. This will enable more powerful and intelligent machines to be built, and will make it easier for researchers to develop new AI technologies.

2. Greater integration with other fields: AI is already being used in many different fields, but there is potential for even more integration with other fields such as healthcare, finance, and transportation. This could lead to new
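The `async for` loop in the streaming cell consumes an asynchronous generator. A minimal sketch of that pattern follows, with a hypothetical generator `fake_stream` standing in for `async_stream_and_merge` so that no model is needed; the yielded pieces are made up, and the script is meant to be run outside a notebook.

```python
import asyncio


async def fake_stream(prompt):
    # Hypothetical stand-in: yields made-up text pieces one at a time.
    for piece in [" Paris", ".", " It", " is", " in", " France", "."]:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield piece


async def collect(prompt):
    pieces = []
    async for piece in fake_stream(prompt):
        print(piece, end="", flush=True)  # show text as soon as it arrives
        pieces.append(piece)
    print()
    return "".join(pieces)


streamed_text = asyncio.run(collect("The capital of France is"))
```

Printing with `end=""` and `flush=True` keeps the pieces on one line and shows them as they arrive, which is the same choice the streaming cell makes.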




```python
# Shut down the engine when finished
llm.shutdown()
```