# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
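To see why the patch is needed, the following standard-library sketch reproduces the underlying problem: calling `asyncio.run()` while an event loop is already running, which is the situation inside IPython or Jupyter. The function names `inner` and `outer` are illustrative only, and no SGLang code is involved.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # A loop is already running here, just as it is inside IPython/Jupyter.
    coro = inner()
    try:
        asyncio.run(coro)  # plain asyncio refuses to start a nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Without `nest_asyncio.apply()`, the nested call raises a `RuntimeError`; with the patch applied, nested loops are permitted.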

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    # `patch` is imported only in CI; it is not needed for a normal run.
    import patch
else:
    # Allow nested event loops when running inside IPython/Jupyter.
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# Passing a list of prompts runs them as one batch; outputs keep the input order.
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```
Prompt: Hello, my name is
Generated text:  Liya and I am a 14 year old girl who wants to become an astronaut. What would you tell me to prepare for my journey to the moon? Becoming an astronaut is an incredibly exciting and challenging journey that requires a significant amount of preparation. Here are some key areas you might want to consider:

1. Health and Fitness: To prepare for the long and difficult journey to the moon, you will need to undergo rigorous training. You will need to be in excellent physical condition, with the ability to perform all of the necessary tasks, such as climbing on a rocket, breathing in space, and performing physical exertion.

2
Prompt: The president of the United States is
Generated text:  a private individual, and therefore, the president is not subject to the same legal constraints as a private individual.
A. True
B. False
C. None of the above
Answer:
B

Based on the provisions of the Negotiable Instruments Law, the validity of a bill of exchange is
```
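The `sampling_params` dictionary used for this call sets `temperature` and `top_p`. As a rough illustration of what these two settings conventionally mean, here is a toy sketch in plain Python. It is not SGLang's sampler; the helper names `softmax_with_temperature` and `top_p_filter` and the example numbers are made up for this illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest high-probability set whose mass reaches top_p, then renormalize."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for index in ranked:
        kept.append(index)
        mass += probs[index]
        if mass >= top_p:
            break
    return {index: probs[index] / mass for index in kept}


toy_logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
warm = softmax_with_temperature(toy_logits, temperature=0.8)
cool = softmax_with_temperature(toy_logits, temperature=0.2)
nucleus = top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.7)
print(warm, cool, nucleus)
```

In this toy setting, the lower temperature concentrates probability on the top-scoring token, and `top_p=0.7` discards the two least likely candidates before renormalizing.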

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    # stream_and_merge consumes the stream and returns the merged text.
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


```
=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a brief description of your job or experience here]. How can I assist you today? I look forward to hearing from you. [Name] [Company Name] [Phone number] [Email address] [LinkedIn profile link] [Twitter handle] [Facebook page] [Instagram account] [GitHub repository] [LinkedIn group] [Twitter hashtag] [Facebook group] [Instagram group] [GitHub repository] [LinkedIn group] [Twitter

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting numerous museums, theaters, and art galleries. Paris is a popular tourist destination and a major hub for international business and diplomacy. The city is home to many notable French artists, writers, and musicians, and is known for its rich history and cultural heritage. Paris is a vibrant and dynamic city with a rich history and a strong sense of identity. Its status as the capital of France has made it a major center of politics, culture, and business in the

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies are expected to continue to improve and become more integrated into our daily lives, from self-driving cars and robots to personalized medicine and virtual assistants. Additionally, AI is likely to continue to be used for a wide range of applications, from healthcare and finance to transportation and entertainment. As AI becomes more integrated into our daily lives, we can expect to see even more widespread adoption and integration of AI into our society. However, it is also important to note that AI is not without its challenges and risks, and it is important to
```
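The "overlap removal" mentioned in this cell can be pictured as follows: if a streamed chunk begins by repeating the end of the text received so far, only the new part is appended. The sketch below is a minimal illustration of that idea under stated assumptions; the helpers `trim_overlap` and `merge_chunks` and the sample chunks are hypothetical and are not presented as the implementation behind `stream_and_merge`.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the end of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first and shrink until one matches.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # no overlap found


def merge_chunks(chunks):
    """Concatenate chunks, dropping any prefix that repeats the merged text's tail."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Made-up chunks in which each one repeats the tail of the previous text.
sample_chunks = ["The capital", " capital of France", " France is Paris."]
merged = merge_chunks(sample_chunks)
print(merged)
```

One caveat of this simple approach is that a chunk which legitimately starts with the same characters the text ends with would also be trimmed, so it suits streams where overlaps are known to be repeats.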



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    # Await the whole batch; outputs come back in the same order as the prompts.
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


```
=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name here]. I'm a kind, gentle soul with a heart full of kindness. I'm always ready to lend a helping hand and be there for those in need. I'm not just someone who knows how to take care of yourself, I'm someone who also carries the weight of the world with me. I'm someone who will never give up on a loved one, even if it means going the extra mile. I am a soulmate, and I love to be with people who are kind and compassionate. Let's make this world a better place together. [insert name here]. I am [insert name here]. How

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

A. True B. False
A. True
The capital of France is Paris.

Paris is the capital city of France, and it is located in the Northwestern region of the country. The city is famous for its rich history, beautiful architecture, and cul
```
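One reason to reach for the asynchronous API is that generation can be awaited alongside other coroutines in the same event loop. The sketch below shows that pattern with a hypothetical stand-in, `fake_async_generate`, which only simulates latency so the example runs without a model; `heartbeat` and `run_all` are likewise illustrative names.

```python
import asyncio


async def fake_async_generate(prompt):
    """Hypothetical stand-in for an engine call; it only simulates latency."""
    await asyncio.sleep(0.01)
    return {"text": f" <completion for {prompt!r}>"}


async def heartbeat(ticks):
    """Other work that keeps running while generation is awaited."""
    for _ in range(3):
        await asyncio.sleep(0.001)
        ticks.append("tick")


async def run_all(prompts):
    ticks = []
    # gather() schedules every coroutine and returns results in input order.
    *outputs, _ = await asyncio.gather(
        *(fake_async_generate(p) for p in prompts), heartbeat(ticks)
    )
    return outputs, ticks


demo_prompts = ["Hello, my name is", "The capital of France is"]
demo_outputs, demo_ticks = asyncio.run(run_all(demo_prompts))
print(demo_outputs, demo_ticks)
```

In a real application the stand-in would be replaced by an awaited engine call, and the other coroutine by whatever work should proceed while generation is in flight.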

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


```
=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am [Age] years old. I have always had a passion for learning and exploring the world around me, and I am always seeking new experiences to expand my horizons. I have a natural ability to communicate and interact with people, and I am able to empathize with others in ways that many people struggle with. I enjoy working hard and dedicating myself to learning and improving myself, and I am always eager to learn new things and keep up with the latest trends and technologies. I am also a good listener and an excellent communicator, and I enjoy sharing my knowledge with others. I am confident in my abilities

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a historic city located on the French Riviera. It is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Latin Quarter. Paris is also a major center for music, art, and culture, and is home to numerous museums, theaters, and events throughout the year. It is a popular tourist destination, known for its rich history, cuisine, and vibrant culture. Paris is the fifth-largest city in the world and is a major economic center in Europe. The city is also known for its innovative use of modern technology, including the use of the Internet and mobile

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  uncertain and complex, but there are several potential trends that could shape its development and evolution:

1. Increased integration with human decision-making: As AI becomes more capable, it is expected to become more integrated with human decision-making processes. AI systems will be able to learn and adapt to the complexities of human decision-making, and will be able to provide more accurate and nuanced feedback.

2. Enhanced ethical considerations: As AI becomes more advanced, there will be an increased focus on ethical considerations. This may include issues such as privacy, bias, and accountability.

3. Expansion of AI applications: As AI technology becomes more advanced, there will be an
```
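The streaming loop in this section relies on `async for`, which pulls one chunk at a time from an asynchronous generator. For readers unfamiliar with the pattern, here is a minimal sketch that runs without a model; `fake_stream` is a hypothetical stand-in for a streaming engine call and `consume` is an illustrative name.

```python
import asyncio


async def fake_stream(text):
    """Hypothetical stand-in for a streaming engine call: yields one word at a time."""
    for word in text.split(" "):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield word


async def consume(text):
    pieces = []
    async for piece in fake_stream(text):
        # A real consumer could print(piece, end="", flush=True) here instead.
        pieces.append(piece)
    return pieces


streamed = asyncio.run(consume("Paris is the capital"))
print(" ".join(streamed))
```

Because each chunk is handled as soon as it arrives, the consumer can display partial text immediately instead of waiting for the full completion.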




```python
# Shut the engine down once all generation is finished.
llm.shutdown()
```