# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
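
As a rough illustration of the custom-server idea, the sketch below wraps an engine-like object in a plain request handler using only the standard library. `EchoEngine` is a hypothetical stand-in so the snippet runs without a GPU, and the JSON field names (`prompt`, `sampling_params`, `text`), `handle_generate`, `make_handler`, and `serve` are assumptions made for illustration. This is not the layout used by the linked `custom_server.py`.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoEngine:
    """Hypothetical stand-in exposing generate(prompt, sampling_params) -> {"text": ...}."""

    def generate(self, prompt, sampling_params=None):
        return {"text": f" (echo) {prompt}"}


def handle_generate(engine, body: bytes) -> bytes:
    """Decode a JSON request, call the engine, and encode a JSON response."""
    request = json.loads(body)
    # Assumption: a single-prompt call returns a dict with a "text" key.
    output = engine.generate(request["prompt"], request.get("sampling_params", {}))
    return json.dumps({"text": output["text"]}).encode("utf-8")


def make_handler(engine):
    """Build a request handler class bound to one engine instance."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            payload = handle_generate(engine, self.rfile.read(length))
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(payload)

    return GenerateHandler


def serve(engine, host="127.0.0.1", port=8000):
    """Serve until interrupted; in real use an sgl.Engine would replace EchoEngine."""
    HTTPServer((host, port), make_handler(engine)).serve_forever()
```

Keeping `handle_generate` separate from the HTTP plumbing makes the request logic easy to exercise without opening a socket.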



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
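
To see why this matters, the sketch below checks whether an event loop is already running, as is the case inside Jupyter notebook cells. `loop_is_running` and `probe` are hypothetical helper names, and only the standard library is used. When a loop is already running, a plain `asyncio.run(...)` call raises `RuntimeError`, which is the situation `nest_asyncio.apply()` works around.

```python
import asyncio


def loop_is_running() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises when no loop is running in this thread.
        return False
    return True


async def probe() -> bool:
    # Inside a coroutine driven by asyncio.run, a loop is always running.
    return loop_is_running()


outside = loop_is_running()      # evaluated at module level
inside = asyncio.run(probe())    # evaluated inside a running loop
print(f"outside loop: {outside}, inside loop: {inside}")
```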

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Lucy. I'm 12 years old. I live in a big house. My parents are teachers. The house is big and beautiful. My bedroom is near the front of the house. I have a big desk and a big bed in my bedroom. My bedroom is clean and comfortable. I have a small living room. My room is next to the living room. There is a small table and a small sofa in the living room. My living room is very nice. I like to read books in my living room. There is a small garden behind my house. I often go to the garden to play in the flowers and eat
Prompt: The president of the United States is
Generated text:  a politician, but he/she does not hold the office of the head of a government department.
The answer to the question "What is the first thing that the president needs to do before taking office?" is "greet everyone". Is this answer to the question correct?
Select from: I. yes. II. no.
II.Premise: "A group of people walking along a beach." Hypothesis: "A group of people w
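
The engine already schedules a batch efficiently, so splitting a prompt list is optional. It can still be convenient for very large prompt sets, for example to write results out incrementally. The sketch below shows one way to do that; `chunked`, `generate_in_chunks`, and `fake_generate` are hypothetical names, and `fake_generate` only mimics the list-of-dicts output shape seen above so the snippet runs without a model.

```python
from typing import Callable, Dict, Iterator, List, Tuple


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def generate_in_chunks(
    generate_fn: Callable[[List[str], Dict], List[Dict]],
    prompts: List[str],
    sampling_params: Dict,
    chunk_size: int = 2,
) -> List[Tuple[str, str]]:
    """Call generate_fn once per chunk and pair each prompt with its text."""
    results: List[Tuple[str, str]] = []
    for batch in chunked(prompts, chunk_size):
        outputs = generate_fn(batch, sampling_params)
        results.extend(zip(batch, (output["text"] for output in outputs)))
    return results


def fake_generate(batch: List[str], sampling_params: Dict) -> List[Dict]:
    # Stand-in for llm.generate: one dict with a "text" key per prompt.
    return [{"text": f" <completion for {prompt!r}>"} for prompt in batch]


pairs = generate_in_chunks(fake_generate, ["a", "b", "c"], {"temperature": 0.8})
for prompt, text in pairs:
    print(f"Prompt: {prompt}\nGenerated text: {text}")
```

With a real engine, `llm.generate` would be passed in place of `fake_generate`.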

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is the largest city in France and the second-largest city in the European Union. It is located on the Seine River and is the seat of government, administration, and culture for the country. Paris is known for its rich history, art, and cuisine. It is also a major tourist destination and a major economic center. The city is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is a cultural and political center of France and a major hub for international trade and diplomacy. It is also a major hub for the French economy and a major economic

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some possible future trends in AI:

1. Increased automation and robotics: As AI technology continues to advance, we are likely to see more automation and robotics in various industries, including manufacturing, transportation, and healthcare. This will lead to increased efficiency, productivity, and cost savings for businesses and individuals.

2. Enhanced personalization: AI will enable more personalized experiences for users, with the ability to learn and adapt to individual preferences and behaviors. This will lead to more efficient and effective communication, as well as
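
The "overlap removal" mentioned in the cell above can be pictured as follows. This is a minimal sketch of the general idea, assuming streamed chunks may repeat text that was already received; `strip_overlap` and `merge_chunks` are hypothetical names, and the code is not claimed to match how `stream_and_merge` is implemented.

```python
from typing import Iterable


def strip_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends `existing`."""
    for size in range(min(len(existing), len(new_chunk)), 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks while skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


merged = merge_chunks([" Paris", " Paris is", " is the capital"])
print(repr(merged))
```

One trade-off of this approach is that a chunk which legitimately repeats the end of the previous text (for example "ha" followed by "ha") would be trimmed as well.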



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Emily and I have always been passionate about helping people. Whether it's assisting them in their personal or professional lives, I strive to provide them with the support and guidance they need to achieve their goals. I believe in the power of empathy and compassion to truly make a difference. I have a natural talent for connecting with people and helping them feel valued and understood. Thank you. (500 words) Emily is a relatable and optimistic character who embodies the essence of empathy and compassion. She is passionate about helping others and strives to provide them with the support and guidance they need to achieve their goals. Her natural talent for connecting with people

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is the political, economic, and cultural center of France. It is hom
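
`async_generate` accepts the whole prompt list in one call, as shown above. When requests arrive independently, for example in a custom server, it can be useful to cap how many are in flight at once. The sketch below shows that pattern with `asyncio.Semaphore` and `asyncio.gather`; `generate_bounded` and `fake_generate_one` are hypothetical names, and `fake_generate_one` merely stands in for an awaitable single-prompt call.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def generate_bounded(
    generate_one: Callable[[str, Dict], Awaitable[Dict]],
    prompts: List[str],
    sampling_params: Dict,
    max_concurrency: int = 2,
) -> List[Dict]:
    """Run one request per prompt with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run(prompt: str) -> Dict:
        async with semaphore:
            return await generate_one(prompt, sampling_params)

    # gather returns results in the same order as the prompts.
    return await asyncio.gather(*(run(prompt) for prompt in prompts))


async def fake_generate_one(prompt: str, sampling_params: Dict) -> Dict:
    await asyncio.sleep(0)  # yield control, as a real request would
    return {"text": f" <completion for {prompt!r}>"}


outputs = asyncio.run(
    generate_bounded(fake_generate_one, ["a", "b", "c"], {"temperature": 0.8})
)
for output in outputs:
    print(output["text"])
```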

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm a [Role] who has been following the [Topic] for [Number of years] years. I'm always eager to learn more and always ready to share what I've discovered. I'm always willing to help others and support them in their own journey. My journey is always about sharing knowledge and not just listening. I'm a [Type of person] who is always up for adventure and always ready to take on new challenges. Thank you for taking the time to meet me! What are some ways to get someone to like you based on your self-introduction?

I understand that your self-introduction is neutral

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its rich history, diverse cultural scene, and beautiful architecture. The city is home to many famous landmarks and museums, including the Louvre and the Notre-Dame Cathedral. It is also a bustling hub for international trade and finance, attracting businesses and tourists from around the world. Paris is a must-visit destination for anyone interested in French culture and history. It has been recognized as a UNESCO World Heritage site for its historical and cultural significance. With its elegant architecture, charming neighborhoods, and vibrant atmosphere, Paris is a truly unforgettable destination for travelers. Based on the information provided, what is the capital city of France?

The

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by increased complexity, scalability, and personalization. In terms of complexity, AI systems are expected to become more sophisticated, able to learn and adapt to new situations. They will be able to handle a wider range of tasks and be more efficient at them. In terms of scalability, AI will continue to evolve to become more efficient and effective at processing large amounts of data. It will also be able to scale up to handle larger and more complex problems. In terms of personalization, AI will be able to learn from users' interactions and provide personalized recommendations and solutions. It will also be able to adapt to users' preferences
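
When streaming asynchronously, it is often handy to display chunks as they arrive and also keep the full text afterwards. The sketch below shows one way to do both with a small helper; `collect_stream` and `fake_stream` are hypothetical names, and `fake_stream` stands in for an async generator of already-cleaned chunks such as the one iterated in the cell above.

```python
import asyncio
from typing import AsyncIterator, Callable, List, Optional


async def collect_stream(
    chunks: AsyncIterator[str],
    on_chunk: Optional[Callable[[str], None]] = None,
) -> str:
    """Forward each chunk to on_chunk (if given) and return the joined text."""
    parts: List[str] = []
    async for chunk in chunks:
        if on_chunk is not None:
            on_chunk(chunk)
        parts.append(chunk)
    return "".join(parts)


async def fake_stream() -> AsyncIterator[str]:
    # Stand-in for a stream of cleaned chunks coming from the engine.
    for piece in [" Paris", ",", " known", " for"]:
        await asyncio.sleep(0)
        yield piece


seen: List[str] = []
full_text = asyncio.run(collect_stream(fake_stream(), on_chunk=seen.append))
print(repr(full_text))
```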




```python
llm.shutdown()
```
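
In a script, an exception raised before the final cell would skip the `shutdown()` call. One way to guard against that is a small context manager, sketched below. `engine_session` and `FakeEngine` are hypothetical names introduced for illustration; `FakeEngine` only imitates the `generate` and `shutdown` methods used in this document so the snippet runs without a model.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(factory):
    """Create an engine via factory() and always call shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


class FakeEngine:
    """Stand-in with the same two methods the examples above rely on."""

    def __init__(self):
        self.closed = False

    def generate(self, prompt, sampling_params=None):
        return {"text": " ok"}

    def shutdown(self):
        self.closed = True


with engine_session(FakeEngine) as engine:
    result = engine.generate("Hello, my name is")

print(result["text"], engine.closed)
```

With the real engine, the factory could be something like `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`.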