# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed, working example is available as a Python script in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note: to use the **Offline Engine** in IPython, or in any other code that already runs inside an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()
```
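
For context, the failure that `nest_asyncio` works around can be reproduced with the standard library alone: plain `asyncio` refuses to start a second loop while one is already running, which is the situation inside an IPython cell. The sketch below is illustrative only and does not involve SGLang; `inner` and `outer` are hypothetical names.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: plain asyncio rejects this
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Calling `nest_asyncio.apply()` beforehand patches `asyncio` so that nested calls of this kind are permitted.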

## Advanced Usage

The engine supports [vision-language model (VLM) inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    # CI runs import a `patch` module instead of applying nest_asyncio.
    import patch
else:
    # Notebooks already run an event loop, so allow nested loops.
    import nest_asyncio

    nest_asyncio.apply()


# Load a small instruct model into the in-process engine.
llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.13it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Janusz and I am a 36 year old man who has been diagnosed with Fibromyalgia. I have been having some significant pain in my lower back and my shoulders and my thighs. 

I have been told to go to the doctor because I have been having numbness and tingling in my hands. I also have joint pain that spreads to my elbows and my knees. I am taking long-term pain medication including the prescription for my fibromyalgia.

What kind of treatment does fibromyalgia have for joint pain? Also, I am wondering what treatments are available for my numbness and tingling in my hands
Prompt: The president of the United States is
Generated text:  trying to decide how many military personnel he should have. According to him, the fewer the number of military personnel, the better. If the number of military personnel is increased from 500 to 525, the cost of the entire military force increases from $1.5 billion to $1.625 billion. How many military personnel does the 
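
The `sampling_params` dictionary above sets `temperature` and `top_p`. As a rough intuition for what these two settings do, the following sketch applies temperature scaling and nucleus (top-p) truncation to a toy list of logits. It is a generic illustration in plain Python, not SGLang's sampling implementation; `top_p_filter` and `toy_logits` are hypothetical names.

```python
import math


def top_p_filter(logits, temperature, top_p):
    """Return renormalized probabilities after temperature scaling and top-p truncation."""
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
filtered = top_p_filter(toy_logits, temperature=0.8, top_p=0.95)
sharp = top_p_filter(toy_logits, temperature=0.2, top_p=0.9)
print(sorted(filtered), sorted(sharp))
```

With these toy logits, a temperature of 0.8 with `top_p` of 0.95 keeps three candidate tokens, while a temperature of 0.2 with `top_p` of 0.9 keeps only the most likely one, so sampling becomes close to deterministic.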

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your character or profession here]. I enjoy [insert a short description of your hobbies or interests here]. I'm always looking for new experiences and learning opportunities, so I'm always eager to learn more about the world around me. What's your favorite hobby or activity? I love [insert a short description of your favorite hobby or activity here]. I'm always looking for new challenges and experiences, so I'm always eager

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. 

This statement is accurate and provides a brief overview of the capital city's location and significance within France. It is a widely recognized and well-known city in the world, known for its rich history, beautiful architecture, and vibrant culture. Paris is the largest city in France and is home to many of the country's most famous landmarks, including the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. The city is also known for its fashion industry, art scene, and culinary traditions. Paris is a major hub for international business and trade, and is a popular tourist destination for millions of visitors each year. Overall

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more sophisticated, it is likely to become more integrated with human intelligence, allowing for more complex and nuanced decision-making. This could lead to more sophisticated forms of AI, such as those that can understand and interpret human emotions and behaviors.

2. Greater use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes, such as through the use of predictive analytics to identify patients at risk of developing certain diseases. As AI becomes more advanced, it is likely to be used in even more sophisticated ways, such as through the development
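
The cell above mentions "overlap removal" when merging streamed chunks. One way such merging could work is to drop any prefix of a new chunk that already ends the accumulated text. The sketch below shows that idea on hand-written chunks; it is an assumption about the general technique, not a copy of how `stream_and_merge` is implemented, and `trim_overlap` and `merge_chunks` are illustrative names.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while skipping text that repeats the current tail."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks: each one repeats the tail of what was already received.
chunks = ["The capital", "capital of France", "France is Paris."]
merged = merge_chunks(chunks)
print(merged)
```

A heuristic like this cannot tell a retransmitted tail from text the model genuinely repeated, so it may also remove legitimate repetitions.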



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Job or Hobby]. I'm passionate about [Why you like this hobby or occupation]. Before becoming an AI, I was a [What you did before becoming an AI] and I've been living life to the fullest. What brings you to this world? I'm a [What brings you to this world] and I hope to make a difference in the world. Thank you for taking the time to meet me, [Name]. 

This is a friendly introduction between a human and an AI, showing the person being introduced has experience with AI technology. If there are other character types, such as a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is a historical and cultural center with a rich history dating back over 2,000 years. The city is known for its stunning architecture, museums, and cuisine. The city is also home to numerous museums, theaters, and
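
Because `async_generate` is awaited, other coroutines can make progress while a batch is pending. The sketch below illustrates that pattern with `asyncio.gather`. It uses a hypothetical `StubEngine` that returns canned text instead of running a model, so it only demonstrates the control flow, not SGLang's behavior.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for an engine; returns canned text, runs no model."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to wait on the model
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def count_to(n):
    """Unrelated work that runs while generation is pending."""
    total = 0
    for i in range(n):
        total += i
        await asyncio.sleep(0)  # yield control to the event loop
    return total


async def run_demo():
    engine = StubEngine()
    prompts = ["The capital of France is", "The future of AI is"]
    # gather() runs both coroutines concurrently and preserves argument order.
    outputs, total = await asyncio.gather(
        engine.async_generate(prompts, {"temperature": 0.8, "top_p": 0.95}),
        count_to(5),
    )
    return outputs, total


outputs, total = asyncio.run(run_demo())
print(len(outputs), total)
```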

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm [Age] years old. I am [Position] in the [Industry] industry, and I have been working in this role for [Number of Years] years. Throughout my career, I have honed my skills and knowledge in [Specific Skill/Position], and have successfully completed [Specific Task/Role]. I am always looking for opportunities to learn new things and expand my knowledge in order to continue to excel in my career. Thank you for asking! [Name], what do you do? [Name]: Hi, how are you doing? [Name]: I'm [Name], [Age], I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic Eiffel Tower, iconic Eiffel Tower, iconic Eiffel Tower, iconic Eiffel Tower. It is the cultural and economic hub of France and is home to many of the world’s famous landmarks and historical sites. The city is also home to several world-renowned museums, art galleries, and theaters. Paris is a vibrant and eclectic mix of traditional French culture and international fashion, cuisine, and entertainment. It is a popular destination for tourists and locals alike and has a rich history that dates back over 2000 years. The city is also home to many other notable institutions and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  very uncertain, but here are some of the most likely trends:

1. Increased sophistication: AI will get even smarter as technology advances. This will lead to new applications and applications that don't yet exist today.

2. Globalization: As AI becomes more advanced, there will be a need for more global collaboration and integration. This will lead to increased competition and global competition.

3. Ethics and safety: As AI becomes more advanced, there will be a need for more ethical and safe use of AI. This will lead to new ethical standards and guidelines for AI development and use.

4. Robotics and automation: AI will be used in a
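
The streaming loop above consumes an asynchronous generator with `async for` and prints each piece as it arrives. The sketch below isolates that consumption pattern using a hypothetical `fake_stream` in place of the engine. It assumes, purely for illustration, that each chunk carries all text generated so far, and `stream_deltas` then yields only the new part.

```python
import asyncio


async def fake_stream(tokens):
    """Hypothetical stream: each chunk carries all text generated so far."""
    text = ""
    for token in tokens:
        text += token
        await asyncio.sleep(0)  # simulate waiting for the next token
        yield {"text": text}


async def stream_deltas(stream):
    """Yield only the text that was not present in the previous chunk."""
    seen = 0
    async for chunk in stream:
        delta = chunk["text"][seen:]
        seen = len(chunk["text"])
        if delta:
            yield delta


async def collect():
    pieces = []
    async for delta in stream_deltas(fake_stream([" Paris", ",", " known", " for"])):
        pieces.append(delta)
    return pieces


pieces = asyncio.run(collect())
print("".join(pieces))
```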




In [6]:
llm.shutdown()
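
In a script, as opposed to a notebook, it can be useful to guarantee that `shutdown()` runs even when generation raises an exception. A minimal sketch of that pattern follows. `StubEngine` and `managed_engine` are hypothetical names introduced here, and the stub only records whether `shutdown()` was called.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in that only records whether shutdown() was called."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        raise RuntimeError("simulated failure during generation")

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and call its shutdown() on exit, even after an exception."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with managed_engine(engine) as llm:
        llm.generate(["Hello, my name is"], {"temperature": 0.8})
except RuntimeError as exc:
    print(f"generation failed: {exc}")

print(engine.is_shut_down)
```

The same wrapper could be applied to any object that exposes a `shutdown()` method.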