# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example in a standalone Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
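
To see why the patch is needed, the following minimal sketch uses only the standard library (no SGLang involved) to reproduce the error raised when `asyncio.run()` is called while an event loop is already running, which is the situation inside IPython or Jupyter. The function names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # a nested asyncio.run() is rejected by default
    except RuntimeError as exc:
        coro.close()  # the coroutine never started; close it to avoid a warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` beforehand patches asyncio so that this kind of nested call is permitted instead of raising.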

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```text
Prompt: Hello, my name is
Generated text:  Ben. I work at a school.
As a teacher, I will assist students in the learning process, provide a structure and clear guidance to the students, encourage students to ask questions, and create a positive learning environment.
What should I do as a teacher at a school? As a teacher at a school, there are a few key steps you can take to assist students in the learning process, provide structure and guidance, encourage questions, and create a positive learning environment. Here are some strategies you can consider:

### 1. **Assess Student Needs and Experiences**
   - **Knowledge and Skill Set Assessment:** Evaluate students' knowledge
Prompt: The president of the United States is
Generated text:  trying to decide how many military officers to reserve for the future, when his chief of staff suggests they reserve more. So, he holds a meeting with the highest ranked generals and gives them an option to either sell their army or buy new military offic
```
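
For offline batch jobs it is often useful to persist each prompt next to its completion. The sketch below shows one way to do this with JSON Lines. It assumes only what the cell above shows, namely that each output is a dict with a `"text"` key. `fake_generate` is a hypothetical stand-in for `llm.generate` so the example runs without a model, and `write_jsonl` is an illustrative helper, not part of SGLang.

```python
import io
import json


def fake_generate(prompts, sampling_params):
    # Hypothetical stand-in for llm.generate: one dict with a "text" key per prompt.
    return [{"text": f" <completion for: {p}>"} for p in prompts]


def write_jsonl(prompts, outputs, fp):
    # One JSON object per line keeps each prompt next to its completion.
    for prompt, output in zip(prompts, outputs):
        fp.write(json.dumps({"prompt": prompt, "completion": output["text"]}) + "\n")


prompts = ["Hello, my name is", "The capital of France is"]
outputs = fake_generate(prompts, {"temperature": 0.8, "top_p": 0.95})

buffer = io.StringIO()  # swap for open("results.jsonl", "w") to write a real file
write_jsonl(prompts, outputs, buffer)

# Read the lines back to confirm that every record round-trips.
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(records[0]["prompt"])
```

With a real engine, replacing `fake_generate(...)` with `llm.generate(prompts, sampling_params)` should be the only change needed, provided the output shape matches the one shown above.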

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


```text
=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your job or profession]. I enjoy [insert a short description of your hobbies or interests]. I'm always looking for new challenges and opportunities to grow and learn. What's your favorite hobby or activity? I love [insert a short description of your favorite activity or hobby]. I'm always looking for ways to expand my knowledge and skills. What's your favorite book or movie? I love [insert a short description of

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in France and the second-largest city in the European Union. Paris is known for its rich history, beautiful architecture, and vibrant culture. It is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also a major center for business, finance, and tourism in France. It is a popular tourist destination and a cultural hub for Europe. The city is home to many museums, theaters, and other cultural institutions. Paris is a city of contrasts, with its modern architecture and historical landmarks blending together to create a unique and fascinating city.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with human intelligence: AI systems are likely to become more integrated with human intelligence, allowing them to learn from and adapt to human behavior and decision-making processes.

2. Enhanced natural language processing: AI systems will become more capable of understanding and generating human-like language, allowing for more natural and effective communication.

3. Improved predictive analytics: AI will become more capable of predicting future events and outcomes, allowing for more accurate and timely decision-making.

4. Enhanced security: AI systems will become more secure, with better algorithms and machine learning techniques to detect and prevent
```
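
The phrase "overlap removal" refers to merging streamed chunks so that text repeated at a chunk boundary is not emitted twice. The sketch below illustrates the general idea under the assumption that a new chunk may begin with the tail of the text received so far. `drop_overlap` and `merge_chunks` are hypothetical helpers written for this illustration and are not presented as the implementation behind `stream_and_merge`.

```python
def drop_overlap(existing_text, new_chunk):
    # Find the longest suffix of existing_text that is also a prefix of new_chunk,
    # then return new_chunk with that repeated prefix removed.
    max_overlap = min(len(existing_text), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    # Append each chunk after trimming whatever it repeats from the merged text.
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


merged = merge_chunks([" Paris", " Paris. It is", " is the capital"])
print(repr(merged))
```

A suffix/prefix heuristic like this cannot distinguish a boundary overlap from a genuine repetition in the generated text, so it is best treated as a display aid.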



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


```text
=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [major] major in [major's field of study]. I have always had an interest in [major's subject] and have been studying it for [number of years]. I believe I am very knowledgeable in [major's subject], and I am excited to help you or your team achieve success in [major]. What is your area of expertise or expertise? Is there anything specific you are looking for in a potential team member or someone in the market? How can I help you? [Name] is looking for a [mention a specific area of interest or expertise]. How can I help? [Name

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic Eiffel Tower, as well as its historical landmarks like the Louvre Museum and Notre-Dame Cathedral. The city is also famous for its diverse French culture, including its cuisine, music, and fash
```
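
One reason to prefer the asynchronous call is that other coroutines can make progress while generation is awaited. The sketch below shows that pattern with `asyncio.gather`. `fake_async_generate` is a hypothetical stand-in for `llm.async_generate` (assumed, as in the cell above, to return a list of dicts with a `"text"` key), and `other_work` is a placeholder for any unrelated coroutine.

```python
import asyncio


async def fake_async_generate(prompts, sampling_params):
    # Hypothetical stand-in for llm.async_generate.
    await asyncio.sleep(0.01)  # simulates time spent waiting on the engine
    return [{"text": f" <completion for: {p}>"} for p in prompts]


async def other_work():
    # Placeholder for unrelated work, e.g. reading the next batch of prompts.
    await asyncio.sleep(0.01)
    return "other work finished"


async def run_both():
    # gather schedules both coroutines; each runs while the other is waiting.
    outputs, status = await asyncio.gather(
        fake_async_generate(["The capital of France is"], {"temperature": 0.8}),
        other_work(),
    )
    return outputs, status


outputs, status = asyncio.run(run_both())
print(status, outputs[0]["text"])
```

`asyncio.gather` returns results in the order the awaitables were passed, so the generation result can be unpacked first regardless of which coroutine finishes first.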

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


```text
=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a software developer with a passion for helping people understand complex systems. I am a flexible, adaptable, and energetic individual who thrives on the challenge of finding creative solutions to problems. I enjoy teaching others to learn how to code and helping them understand the concepts behind programming. My goal is to make technology accessible to everyone, and I am constantly learning and growing in my field. I am looking forward to working with you. How can I find out more about me and my work? Let me know if you need any other information about me. Thank you for your time and consideration! [Name] [Or your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in France, and one of the world’s most populous cities. It is a historic center and major cultural, economic and political center of France. Paris is known for its architecture, museums, restaurants, and music, as well as its historical and cultural heritage. The city is also known for its romantic and romantic ambiance. Paris is known for its literary culture, and hosts numerous cultural and artistic events and exhibitions throughout the year. It is one of the world’s major financial centers and is home to many international financial institutions. It is also known for its fashion and fashion industry. Paris is a city with a

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  an exciting and rapidly evolving area of research and development. Here are some possible trends in the development of artificial intelligence in the next decade:

1. Increased Focus on Explainable AI: As AI is becoming more complex and sophisticated, its ability to explain its decisions and actions will become increasingly important. This will require the development of more sophisticated algorithms that can provide a clear and understandable explanation of the AI's decision-making process.

2. Advancements in Natural Language Processing: With the rise of the Internet of Things (IoT) and the increasing availability of large amounts of text data, natural language processing (NLP) will become more important as AI
```
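
When consuming a stream, it is common to both display each piece as it arrives and keep the full text for later use. The sketch below shows that consumption pattern with a plain async generator. `fake_stream` is a hypothetical stand-in for a chunk stream such as the one iterated in the cell above, and the pieces it yields are made up for illustration.

```python
import asyncio


async def fake_stream(prompt):
    # Hypothetical stand-in for an async chunk stream; yields text pieces in order.
    for piece in [" Paris", ".", " It", " is", " a", " city"]:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield piece


async def collect(prompt):
    pieces = []
    async for piece in fake_stream(prompt):
        print(piece, end="", flush=True)  # display each piece as it arrives
        pieces.append(piece)
    print()  # finish the line once the stream ends
    return "".join(pieces)


full_text = asyncio.run(collect("The capital of France is"))
```

Collecting pieces in a list and joining once at the end avoids repeated string concatenation on long generations.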




```python
llm.shutdown()
```
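
In a script, an exception raised during generation would skip a trailing `shutdown()` call. One way to guard against that is a small context manager, sketched below. `engine_session` is a hypothetical helper written for this example, and `FakeEngine` is a stand-in that exposes only the `shutdown()` method used above, so the example runs without a model.

```python
from contextlib import contextmanager


class FakeEngine:
    # Hypothetical stand-in exposing only the shutdown() method used above.
    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(engine):
    try:
        yield engine
    finally:
        engine.shutdown()  # runs on normal exit and when the body raises


engine = FakeEngine()
try:
    with engine_session(engine) as llm_handle:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass  # the error is expected here; shutdown has already run

print(engine.is_shut_down)
```

The same wrapper could be applied to a real engine object, assuming its `shutdown()` method is safe to call after a failed generation.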