# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
To use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
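One likely reason this is needed is that `asyncio.run()` refuses to start when an event loop is already running, which is the situation inside IPython and Jupyter. The standard-library sketch below reproduces that error without SGLang. The function names `inner` and `outer` are illustrative only, and the script is meant to be run as a plain Python file rather than inside a notebook.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # Inside a coroutine an event loop is already running, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call into a running loop
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return f"RuntimeError: {exc}"
    return "no error"


# Run as a plain script, so this outer call starts the only event loop.
result = asyncio.run(outer())
print(result)
```

Calling `nest_asyncio.apply()` patches asyncio so that such nested calls are allowed.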

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    # In CI, a `patch` module is imported instead of nest_asyncio.
    import patch
else:
    # Allow nested event loops, e.g. when running inside a notebook.
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.47it/s]



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  K.C. I am 13 years old. I have a friend named John. We are both in the same class. When I was younger, we used to be very good friends. But now that we have grown up, we seem to be not good friends anymore. It's hard for me to tell John what to do. I don't know if he is still trying to make me stop. Sometimes I think that he is. I am not sure if he wants to be friends with me. I am sorry for the trouble we have caused each other. 

Compose an email to John expressing your feelings. [These are the
Prompt: The president of the United States is
Generated text:  trying to decide between two different versions of the same policy proposal. He decides to compare the cost of implementing each version. In the first version, it costs $100,000 to implement. In the second version, it costs $1,000,000 to implement. Calculate the total cost difference between the two versions.

To determine the total cost difference between the two versions of the policy pr
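The `temperature` and `top_p` entries in `sampling_params` control how each next token is sampled. The sketch below illustrates the usual meaning of these two settings, temperature scaling followed by nucleus (top-p) filtering, on a toy distribution in plain Python. It is not SGLang's sampler: `softmax`, `top_p_filter`, and the toy logits are hypothetical names and values chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four tokens
sharp = softmax(toy_logits, temperature=0.2)  # low temperature: peaked distribution
flat = softmax(toy_logits, temperature=0.8)  # higher temperature: flatter distribution
nucleus = top_p_filter(flat, top_p=0.9)  # least likely token is removed

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
print([round(p, 3) for p in nucleus])
```

A lower temperature concentrates probability on the top token, which is why the streaming examples that use `temperature=0.2` tend to produce more repeatable text than those using `0.8`.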

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your character here]. I'm always looking for new challenges and opportunities to grow and learn. What's your favorite hobby or activity? I love [insert a short description of your favorite activity here]. I'm always looking for new experiences and adventures to try. What's your favorite book or movie? I love [insert a short description of your favorite book or movie here]. I'm always looking for new ways to explore

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also the seat of the French government and the country's cultural and political capital. Paris is a major tourist destination and a popular destination for French cuisine and fashion. It is also known for its rich history and cultural heritage, including the influence of the French Revolution and the influence of the French language. The city is home to many museums, theaters, and other cultural institutions, and is a major economic center in France. Paris is a city of contrasts, with its modern architecture and historical landmarks blending

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies are expected to continue to improve and become more integrated into our daily lives, from self-driving cars and robots to personalized medicine and virtual assistants. As AI becomes more integrated into our daily lives, we may see a shift towards more ethical and responsible use of the technology, with a focus on minimizing harm and maximizing benefits. Additionally, there may be a growing emphasis on developing AI that is more transparent and accountable, with greater consideration given to the ethical implications of AI development and deployment. Finally, there may be a growing focus on
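The banner above mentions overlap removal. As a rough illustration of that idea, and not the actual `stream_and_merge` implementation, the sketch below drops from each new chunk any prefix that repeats the tail of the text collected so far. The names `drop_overlap` and `merge_chunks` and the sample chunks are assumptions made for this example.

```python
def drop_overlap(existing: str, chunk: str) -> str:
    """Return chunk without the longest prefix that duplicates the end of existing."""
    longest = min(len(existing), len(chunk))
    for size in range(longest, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate streamed chunks, skipping text that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


# Hypothetical stream in which consecutive chunks partially repeat each other.
sample_chunks = [" Paris", " Paris, the city", " city known for", " its landmarks"]
merged = merge_chunks(sample_chunks)
print(merged)
```

A suffix/prefix heuristic like this cannot tell an accidental repeat from an intended one (for example `"ha"` followed by `"ha"`), so it is only appropriate when chunks are known to overlap.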



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  ____________ and I am a/an ________. I am ________. I _____.

I will provide some examples of neutral self-introductions for fictional characters, but as a machine, I don't have the ability to create names or personalities. I can only provide neutral introductions to characters, which typically don't include any personal details or subjective characteristics. For example:

1. I am an AI assistant. My name is AI. I am an AI.
2. I am a data scientist. I am a data scientist. 
3. I am an engineer. I am an engineer.
4. I am an economist. I am

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. The city is known for its rich history, stunning architecture, and diverse cultural scene. It was founded in the 6th century by the Romans and has been the capital of France since 1800. Paris is a major hub for internatio
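An asynchronous call is mainly useful when generation has to share an event loop with other work. The sketch below shows that pattern with a stand-in class so that it runs without a model. `FakeEngine` is hypothetical and only mimics the call shape used above, where `await llm.async_generate(prompts, sampling_params)` returns a list of dicts with a `text` key; `heartbeat` and `run_demo` are likewise illustrative names.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in that mimics the call shape of async_generate."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to wait for the model
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(log):
    # Other coroutines keep running while generation is being awaited.
    for tick in range(3):
        log.append(f"tick {tick}")
        await asyncio.sleep(0.001)


async def run_demo():
    engine = FakeEngine()
    log = []
    demo_prompts = ["Hello, my name is", "The capital of France is"]
    outputs, _ = await asyncio.gather(
        engine.async_generate(demo_prompts, {"temperature": 0.8}),
        heartbeat(log),
    )
    return outputs, log


outputs, log = asyncio.run(run_demo())
for out in outputs:
    print(out["text"])
print(log)
```

With a real engine, the same `asyncio.gather` structure would let request handling or logging proceed while a batch is being generated.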

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm a [Your Job Title] at [Your Company]. I'm currently [Your Position] at [Your Company] and have been working on a project that involves [Your Area of Expertise]. I'm passionate about [Your Hobby or Interest], and I enjoy [Your Personal Qualities/Attitudes]. I'm committed to [Your Business Philosophy] and strive to achieve [Your Goals]. As someone who is always looking for new opportunities and is always eager to learn, I'm always eager to learn more about your company and its products or services. Thank you for considering [Your Name] as a potential



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

French citizens and visitors are familiar with this city, known for its rich history, iconic landmarks, and diverse culture. The city is home to numerous museums, galleries, theaters, and restaurants, offering a unique blend of modern and traditional influences. The French Parliament is also situated in Paris, making it the seat of the national government. The city is famous for its art, cuisine, and fashion, with the iconic Eiffel Tower standing as a symbol of the city's rich history and cultural heritage. Overall, Paris is a fascinating destination for visitors from all over the world.

Paris is the largest city in France and is



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by a number of trends that are currently being explored by researchers and developers. Here are some of the key trends that could shape the AI landscape in the coming years:

1. Increased Use of Machine Learning: As the demand for AI continues to grow, it is likely that we will see an increase in the use of machine learning, which is the process of training algorithms to learn from data without being explicitly programmed. This could lead to more efficient and effective applications of AI, such as personalized medicine and autonomous vehicles.

2. AI Integration with NLP and NLP Integration with AI: The integration of AI with natural language processing
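The streaming loop in this section consumes an asynchronous generator with `async for` and prints each chunk as it arrives. The sketch below isolates that consumption pattern so it can be run without a model. `fake_stream` is a hypothetical stand-in for a streaming call such as `async_stream_and_merge`, and its chunks are made up for illustration.

```python
import asyncio


async def fake_stream(prompt, delay=0.0):
    """Hypothetical stand-in for a streaming call: yields text chunks one at a time."""
    for chunk in [" Paris", " is", " the", " capital", "."]:
        await asyncio.sleep(delay)  # yield control, as a real stream would while waiting
        yield chunk


async def collect(prompt):
    pieces = []
    async for chunk in fake_stream(prompt):
        print(chunk, end="", flush=True)  # show text as soon as it arrives
        pieces.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(pieces)


full_text = asyncio.run(collect("The capital of France is"))
```

Printing with `end=""` and `flush=True` keeps the chunks on one line in a terminal, and collecting them in a list lets the caller keep the full text once the stream is exhausted.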




```python
llm.shutdown()
```