# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
To use the **Offline Engine** in IPython, Jupyter, or any other code that already runs inside an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()

```
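
The patch is needed because a notebook kernel already runs an event loop, and plain `asyncio` refuses to start a second one inside it. The following is a minimal, standard-library-only sketch that reproduces the failure without SGLang; the function names `inner` and `outer` are illustrative and not part of any library.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # A loop is already running here, which mimics code inside a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # fails unless the loop has been patched to allow nesting
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "nested run succeeded"


result = asyncio.run(outer())
print(result)
```

With `nest_asyncio.apply()` called beforehand, the nested `asyncio.run` is expected to succeed instead of raising.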

## Advanced Usage

The engine also supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) and [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.46it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Aya. I'm an Indian female who has a passion for learning about different cultures and languages. I like to explore the beauty of diverse societies and how they contribute to the global culture. I also like to study and learn new languages, especially those from other parts of the world.

My academic background is in the field of journalism and I hold a degree in Linguistics. While studying journalism, I also received a certificate in Cultural Communication from the University of Cambridge. My passion for learning is driven by a desire to broaden my horizons and explore new ideas and perspectives.

I enjoy sharing my knowledge and experiences with others. I believe that learning
Prompt: The president of the United States is
Generated text:  a very important person. The job of president is to take care of the country. Here are some things that the president does. First, the president has to make decisions about the country. He has to decide what
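
When a prompt list is too large to hold in memory together with its outputs, one option is to feed the engine in fixed-size chunks. The sketch below assumes only the `generate(prompts, sampling_params)` call shape and the `output['text']` field shown above. `chunked`, `generate_in_chunks`, and `FakeEngine` are hypothetical names, and `FakeEngine` is a stand-in so the snippet runs without a model. Since the engine schedules batches itself, chunking on the client side may be unnecessary for moderate workloads.

```python
from typing import Dict, Iterator, List, Tuple


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start : start + size]


class FakeEngine:
    """Hypothetical stand-in that mimics the generate() call shape used above."""

    def generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict]:
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


def generate_in_chunks(
    engine, prompts: List[str], sampling_params: Dict, chunk_size: int = 2
) -> List[Tuple[str, str]]:
    """Call engine.generate() once per chunk; return (prompt, text) pairs in input order."""
    results: List[Tuple[str, str]] = []
    for batch in chunked(prompts, chunk_size):
        outputs = engine.generate(batch, sampling_params)
        results.extend((p, o["text"]) for p, o in zip(batch, outputs))
    return results


pairs = generate_in_chunks(
    FakeEngine(),
    ["Hello, my name is", "The capital of France is", "The future of AI is"],
    {"temperature": 0.8, "top_p": 0.95},
)
for prompt, text in pairs:
    print(f"Prompt: {prompt}\nGenerated text: {text}")
```

Replacing `FakeEngine()` with the `llm` object created earlier should work as long as the call shape matches.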

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history and a vibrant culture. Paris is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum. The city is also famous for its fashion industry, with many famous fashion houses and designers operating in the area. Paris is a popular tourist destination and a major economic center in France. It is home to many important museums, art galleries, and theaters. The city is also known for its cuisine, with many famous French dishes and restaurants. Overall, Paris is a city of art, culture, and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies will continue to improve, leading to more sophisticated and accurate AI systems that can perform a wide range of tasks, from simple tasks like image recognition to complex tasks like autonomous driving and medical diagnosis. Additionally, AI will continue to be integrated into various industries, from finance and healthcare to manufacturing and transportation, leading to increased efficiency and productivity. Finally, AI will continue to be used for ethical and social reasons, such as in the development of personalized medicine and the prevention of diseases. Overall, the future of AI is likely to be
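
The "overlap removal" mentioned in the banner refers to merging streamed chunks so that text repeated at a chunk boundary is not emitted twice. The sketch below illustrates the idea only; it is not necessarily how `stream_and_merge` is implemented, and `trim_overlap` and `merge_chunks` are illustrative names.

```python
from typing import Iterable


def trim_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing."""
    max_len = min(len(existing), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, removing text each chunk repeats from the previous ones."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


chunks = [" Paris, also", " also known as", " as the City of Light."]
print(merge_chunks(chunks))
```

One caveat of this approach: a chunk that legitimately repeats the end of the previous text (for example `"ha"` followed by `"ha"`) would also be trimmed, so it only suits streams that are known to resend overlapping text.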



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a 34-year-old software engineer with [number] years of experience in [specific field or technology]. I'm driven, organized, and passionate about improving technology through innovation and problem-solving. I love coding, helping people, and creating digital experiences that make a difference in people's lives. I believe in continuous learning and collaboration, and I'm always looking for new challenges and opportunities to grow as a team member. My goal is to contribute to the advancement of technology and make a positive impact on the world. Thank you for taking the time to meet me. What technology is your field of expertise in?

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in France by population and the third-largest city in the world by land area. Paris is al
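
Because `async_generate` is awaited, it can share an event loop with other coroutines. The sketch below uses a stub to show several batches running concurrently with `asyncio.gather`, with a semaphore capping how many are in flight. `FakeAsyncEngine`, `run_batches`, and `max_in_flight` are hypothetical; only the `async_generate(prompts, sampling_params)` call shape and the `['text']` field come from the example above.

```python
import asyncio
from typing import Dict, List


class FakeAsyncEngine:
    """Hypothetical stand-in mimicking the async_generate() call shape above."""

    async def async_generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict]:
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params, max_in_flight: int = 2):
    """Run several batches concurrently, at most max_in_flight at a time."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def run_one(batch):
        async with semaphore:
            return await engine.async_generate(batch, sampling_params)

    # gather() returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(run_one(b) for b in batches))


batches = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(run_batches(FakeAsyncEngine(), batches, {"temperature": 0.8}))
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Whether client-side concurrency helps with a real engine depends on how it schedules requests internally, so treat the semaphore limit as a tuning knob rather than a recommendation.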

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [Age] year old [Gender] who currently resides in [City]. I have always been fascinated by the world around me, and my interest in history, science, and literature has only grown stronger. I am passionate about exploring new places and cultures, and I believe that everyone should have the opportunity to learn about the world around them. In my free time, I enjoy reading, traveling, and playing music. What's your favorite hobby? I'd love to hear about your hobbies as well. [Name]? [Name]? [Name]? Hello, my name is [Name], and I am a



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its rich history, beautiful architecture, and vibrant culture.
Paris is the capital city of France, famous for its majestic architecture, stunning views of the city, and vibrant cultural scene. The city is also known for its rich history, including the Moat Humble, which was once a royal residence. Paris is a UNESCO World Heritage Site and a major tourist destination, attracting millions of visitors each year. The city's diverse population, including the many languages spoken, and its emphasis on the arts and education also make it a popular destination for many people. Paris is a cultural and historic melting pot of Europe and beyond. Its



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be heavily influenced by the rapid advancements in technology, particularly in the areas of machine learning and natural language processing. Some of the possible future trends in AI include:

1. Increased Automation: AI is expected to automate many tasks that are currently done by humans, including data analysis, routine maintenance, and routine repairs. This automation will require more sophisticated AI models to handle the complexity of new and complex tasks.

2. Enhanced Privacy: With the increasing amount of data being collected and shared, there will be a growing need for AI systems that are secure and ethically aligned. This means that AI systems will be designed to protect user data and


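
The streaming loop above prints each chunk as it arrives. When the full text is also needed afterwards, the chunks can be accumulated while they are echoed. The sketch below shows that pattern with a stub async generator; `fake_stream` and `collect` are hypothetical names, and `fake_stream` merely stands in for a chunk stream such as the one `async_stream_and_merge` yields.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(prompt: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for a stream of text chunks."""
    for chunk in [" Paris", ",", " known", " for", " its", " rich", " history", "."]:
        await asyncio.sleep(0)  # hand control back to the loop between chunks
        yield chunk


async def collect(stream: AsyncIterator[str], echo: bool = True) -> str:
    """Optionally print chunks as they arrive and return the concatenated text."""
    parts: List[str] = []
    async for chunk in stream:
        if echo:
            print(chunk, end="", flush=True)
        parts.append(chunk)
    if echo:
        print()  # finish the line once the stream ends
    return "".join(parts)


text = asyncio.run(collect(fake_stream("The capital of France is")))
```

Collecting into a list and joining once at the end avoids repeated string concatenation on long generations.
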

In [6]:
llm.shutdown()
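
If generation code can raise, it may be worth guaranteeing that `shutdown()` still runs. The sketch below wraps an engine in a small context manager for that purpose. `managed` and `FakeShutdownEngine` are hypothetical helpers, not part of SGLang; the stub exposes only the `shutdown()` method used above so the snippet runs without a model.

```python
from contextlib import contextmanager


class FakeShutdownEngine:
    """Hypothetical stand-in exposing only the shutdown() method used above."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def shutdown(self) -> None:
        self.is_shut_down = True


@contextmanager
def managed(engine):
    """Yield the engine and call shutdown() on exit, even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeShutdownEngine()
try:
    with managed(engine) as active_engine:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(engine.is_shut_down)
```

A plain `try`/`finally` around the generation calls achieves the same effect without the helper.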