# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you use the **Offline Engine** in IPython, Jupyter, or any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
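
The patch is needed because asyncio does not normally allow an event loop to be re-entered while it is already running, which is the situation inside an IPython or Jupyter kernel. The following sketch uses only the standard library to show how that situation can be detected; `has_running_loop` is a hypothetical helper name introduced here, not part of SGLang or `nest_asyncio`.

```python
import asyncio


def has_running_loop() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running.
        return False
    return True


async def _check_inside_loop() -> bool:
    # This coroutine is driven by an event loop, so the check succeeds here.
    return has_running_loop()


# In a plain script, no loop is running at module level.
outside = has_running_loop()
# Inside a coroutine started by asyncio.run(), a loop is running.
inside = asyncio.run(_check_inside_loop())
print(f"outside a loop: {outside}, inside a loop: {inside}")
```

When run as a plain script, the first check is false and the second is true. In a notebook cell the first check would already be true, which is the case where `nest_asyncio.apply()` is required.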

## Advanced Usage

The engine also supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) and [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

See [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.63it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Tom. I have a problem. I am a student in my second year of university. I have difficulty with my studies. Every time I have a test, I have trouble with remembering what to write down. I believe that I am just in trouble, but I really do not understand. I am not sure what to do to improve my studies.

Help me with this problem.

Sure, I'd be happy to help! Let's break down the situation and address it step by step.

### Common Problems in Learning and Studying

1. **Review and Rehearsal**
   - **Problem:** You are trying to remember information
Prompt: The president of the United States is
Generated text:  trying to decide which candidate to recommend for the position of president of the United States. The two candidates are the incumbent candidate and the incumbent candidate's successor. The incumbent candidate is extremely optimistic about the future of the country, and says that the country will be better off with the incumbent candidate bei
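
The loop above relies on the outputs coming back in the same order as the prompts, with the generated string stored under the `"text"` key. A minimal sketch of turning that pairing into plain records for later processing follows. It assumes each output is a dict with a `"text"` key, as in the cell above; `collect_results` is a hypothetical helper, and the sample data is a stand-in, not real model output.

```python
from typing import Dict, List


def collect_results(prompts: List[str], outputs: List[Dict]) -> List[Dict[str, str]]:
    """Pair each prompt with the whitespace-stripped text of its output."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    return [
        {"prompt": prompt, "completion": output["text"].strip()}
        for prompt, output in zip(prompts, outputs)
    ]


# Stand-in data shaped like the engine outputs used above (assumption).
example_prompts = ["The capital of France is"]
example_outputs = [{"text": " Paris."}]
records = collect_results(example_prompts, example_outputs)
print(records)
```

With a live engine, the same helper could be called as `collect_results(prompts, llm.generate(prompts, sampling_params))`.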

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your interests and what you're looking for in a job. Let's chat! [Name] [Job Title] [Company Name] [Company Address] [City, State, Zip Code] [Phone Number] [Email Address] [LinkedIn Profile] [Twitter Profile] [Facebook Profile] [Website URL] [LinkedIn Profile] [Twitter Profile] [Facebook Profile] [Website URL] [LinkedIn Profile] [Twitter Profile] [Facebook Profile] [LinkedIn Profile] [Twitter Profile] [Facebook Profile

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French Academy of Sciences, and the French Quarter. Paris is a bustling metropolis with a rich cultural heritage and is a popular tourist destination. It is also known for its cuisine, fashion, and art scene. The city is home to many international organizations and is a major economic center in Europe. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. The city is also known for its diverse population, with

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some potential future trends in AI include:

1. Increased integration with other technologies: AI is already being integrated into a wide range of technologies, including smartphones, smart homes, and self-driving cars. As these technologies continue to evolve, we can expect to see even more integration of AI into other areas of our lives.

2. Greater emphasis on ethical considerations: As AI becomes more integrated into our daily lives, there will be a greater emphasis on ethical considerations. This will include issues such as bias, privacy, and transparency.

3.
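
The idea of overlap removal can be illustrated without the engine. The sketch below assumes a stream may resend text that was already emitted, and drops the longest prefix of each new chunk that repeats the end of the text collected so far. It is an illustration only, not SGLang's implementation of `stream_and_merge`; `strip_overlap` and `merge_chunks` are hypothetical names defined in this snippet.

```python
from typing import Iterable


def strip_overlap(existing: str, chunk: str) -> str:
    """Return the part of `chunk` that does not repeat the tail of `existing`."""
    longest = min(len(existing), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(longest, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, skipping text that overlaps what is already merged."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


merged_text = merge_chunks(["The cap", "capital of", " of France"])
print(merged_text)
```

One caveat of this naive approach: it cannot tell a resent fragment from a legitimate repetition, so a chunk such as `"ha"` following `"ha"` would be dropped entirely.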



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm [Occupation or Title]. I've been [Age or Current Status] for [Number] years, [age/position]. I work in the [Industry/Field] with [Title], where I'm currently [Current Position]. My name is [Name], and I'm here to share my knowledge and experience with you. I'm excited to be here, and I look forward to learning more about the world around me and sharing my knowledge with you. I'm happy to answer any questions or give you a tour of the office. Let's get to know each other better. [Name] [Occup

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.  

The statement is factually correct. Paris is the capital of France, which has a population of over 6 million people. It is the largest city in France and the third-largest city in the world by population. The city has a rich history dating back to Ro
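
An asynchronous API also makes it possible to issue independent requests concurrently from one coroutine. The sketch below shows one way to do that with `asyncio.gather` and a semaphore that caps the number of requests in flight. `generate_concurrently` and `fake_generate` are hypothetical names; `fake_generate` is a stand-in so the snippet runs without a model, and with a real engine it could be replaced by a thin wrapper around `llm.async_generate` (an assumption about how you would wire it up).

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def generate_concurrently(
    generate_fn: Callable[[str], Awaitable[Dict]],
    prompts: List[str],
    max_in_flight: int = 2,
) -> List[Dict]:
    """Run one request per prompt, at most `max_in_flight` at a time."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def run_one(prompt: str) -> Dict:
        async with semaphore:
            return await generate_fn(prompt)

    # gather() returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(run_one(p) for p in prompts))


async def fake_generate(prompt: str) -> Dict:
    """Stand-in for an engine call; echoes the prompt after a short delay."""
    await asyncio.sleep(0.01)
    return {"text": f" <completion for: {prompt}>"}


demo_outputs = asyncio.run(
    generate_concurrently(
        fake_generate, ["first prompt", "second prompt", "third prompt"]
    )
)
for item in demo_outputs:
    print(item["text"])
```

Because `gather` preserves input order, the results can still be zipped with the prompts exactly as in the cell above.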

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [occupation]. I'm [age], and I'm from [city]. I've always been an [interest] and I enjoy [reason]. I'm a [occupation] because I [job description], and I've always been an [interest] from a young age. I'm [interest] because I'm [reason]. I've always loved [job description] because I wanted to [reason]. My dream is to [description] and I believe in [reason]. I'm a [occupation] because [job description], and I've always been an [interest] from a young age. I



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city known for its rich history, vibrant culture, and iconic landmarks such as the Eiffel Tower and Notre-Dame Cathedral. It is also a major center for business, science, and culture. Paris serves as the cultural heart of Europe, known for its romantic atmosphere, art galleries, and historic landmarks, including the Louvre Museum, the Musée d'Orsay, and the Arc de Triomphe. The city is also home to many international institutions and is one of the world's leading centers for fashion, cinema, and entertainment. With its diverse population and world-renowned cuisine, Paris continues to be a



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly uncertain and will likely be shaped by a complex interplay of technological, economic, social, and political factors. Here are some potential trends that could potentially shape the future of AI:

1. Increased automation and artificial general intelligence (AGI): While some see AGI as a potential future outcome, others see it as a distant and uncertain possibility. Some experts predict that automation will continue to increase, but other AI researchers are skeptical of the extent to which automation will reach the point where it will be able to perform tasks that require human-level intelligence.

2. Development of new types of AI: AI researchers are constantly developing new types of AI
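
Each iteration of the `async for` loop above yields one piece of text, which is printed immediately. If the full completion is also needed once streaming finishes, the pieces can be accumulated while they are printed. The sketch below shows that pattern with a stand-in async generator; `fake_stream` and `collect_stream` are hypothetical names, and with a real engine the stream argument would be the async iterator returned by `async_stream_and_merge(llm, prompt, sampling_params)`.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(pieces: List[str]) -> AsyncIterator[str]:
    """Stand-in for a streaming source; yields text pieces one at a time."""
    for piece in pieces:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield piece


async def collect_stream(stream: AsyncIterator[str]) -> str:
    """Print pieces as they arrive and return the concatenated text."""
    parts: List[str] = []
    async for piece in stream:
        print(piece, end="", flush=True)
        parts.append(piece)
    print()  # finish the line once the stream ends
    return "".join(parts)


full_text = asyncio.run(collect_stream(fake_stream([" Paris", ",", " the", " city"])))
```

Collecting into a list and joining once at the end avoids repeated string concatenation for long streams.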




In [6]:
llm.shutdown()