# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
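
As a rough illustration of the idea, the sketch below wraps an engine-like object in a request handler built only on the Python standard library. It is not the approach used in the linked `custom_server.py`; `FakeEngine`, `handle_generate`, `make_handler`, and the JSON request shape (`prompts`, `sampling_params`) are all hypothetical names chosen for this example. `FakeEngine` only mimics the `generate(prompts, sampling_params)` call and the `{"text": ...}` output shape shown later in this document.

```python
import json
from http.server import BaseHTTPRequestHandler


class FakeEngine:
    """Hypothetical stand-in for the engine so the sketch runs without a GPU."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def handle_generate(engine, body: bytes) -> bytes:
    """Decode a JSON request, call the engine, and encode a JSON response."""
    request = json.loads(body)
    outputs = engine.generate(
        request["prompts"], request.get("sampling_params", {})
    )
    return json.dumps({"texts": [o["text"] for o in outputs]}).encode("utf-8")


def make_handler(engine):
    """Build a request handler class bound to one engine instance."""

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            payload = handle_generate(engine, self.rfile.read(length))
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return Handler


# Exercise the request logic directly, without opening a socket.
response = handle_generate(FakeEngine(), b'{"prompts": ["Hello"]}')
print(response.decode("utf-8"))
```

To serve requests, the handler class could be passed to `http.server.HTTPServer`, with a real engine in place of the stand-in. Keeping the request logic in a plain function makes it testable without a running server.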



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or any other environment that already runs an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
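
The reason is that `asyncio.run` refuses to start a new event loop while another one is already running, which is the situation inside a notebook cell. The standard-library sketch below reproduces that error without SGLang; `inner` and `outer` are hypothetical names used only for illustration.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: rejected by plain asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that nested calls like this one are allowed.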

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```

Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.92it/s]

### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Milena and I live in the United States. I work in a local library. I have a certain amount of money to spend. My friend wants to buy a book and a calculator. The book costs $5 and the calculator costs $10. My friend can only use 25% of his money. How much money does he have left?

To determine how much money Milena's friend has left after buying the book and the calculator, we need to follow these steps:

1. **Calculate the amount of money Milena's friend has:**
   Milena's friend can use 25% of his money to
Prompt: The president of the United States is
Generated text:  a member of the following political party: ____
A. Democratic Party
B. Republican Party
C. Socialist Party
D. Green Party
Answer:
A

In the context of the debate between democracy and dictatorship, which of the following statements about the democracy of the socialist party is correct? ____
A. The socialist party is a democracy without any limitations.
B. The socialist party op
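
In batch jobs the results are usually stored rather than printed. The sketch below pairs each prompt with its completion and serializes the pair as one JSON line. `pair_results` is a hypothetical helper, and `fake_outputs` is a stand-in that only mimics the list of `{"text": ...}` dictionaries indexed in the loop above.

```python
import json


def pair_results(prompts, outputs):
    """Return one JSON line per prompt/completion pair."""
    if len(prompts) != len(outputs):
        raise ValueError("expected exactly one output per prompt")
    return [
        json.dumps({"prompt": p, "completion": o["text"]}, ensure_ascii=False)
        for p, o in zip(prompts, outputs)
    ]


# Stand-in for the list returned by llm.generate(prompts, sampling_params).
fake_outputs = [{"text": " Paris."}, {"text": " bright."}]
lines = pair_results(
    ["The capital of France is", "The future of AI is"], fake_outputs
)
print("\n".join(lines))
```

The explicit length check guards against `zip` silently truncating when the two lists differ in length.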

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history dating back to the Roman Empire and the Middle Ages. Paris is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city is also famous for its fashion industry, art, and cuisine. Paris is a popular tourist destination and a major economic center in France. It is home to many famous museums, theaters, and restaurants. The city is also known for its annual fashion week and its role in the French Revolution. Paris is a vibrant and dynamic city that continues to thrive today

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior and experiences. This could lead to more sophisticated and personalized AI systems that can better understand and respond to human needs.

2. Enhanced ethical considerations: As AI becomes more integrated with human intelligence, there will be increased scrutiny of its ethical implications. This could lead to more stringent regulations and guidelines for AI development and deployment.

3. Greater reliance on AI for decision-making: AI is likely to become more integrated with human decision-making processes, allowing machines to make
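
To make "overlap removal" concrete, the sketch below shows one way to merge streamed chunks when a new chunk may repeat the tail of the text received so far. It is an illustration under that assumption, not the implementation behind `stream_and_merge`; `merge_chunk` and `merge_stream` are hypothetical names.

```python
def merge_chunk(existing: str, chunk: str) -> str:
    """Append chunk, dropping the longest prefix of it that
    duplicates the current tail of existing."""
    max_overlap = min(len(existing), len(chunk))
    for size in range(max_overlap, 0, -1):
        if existing.endswith(chunk[:size]):
            return existing + chunk[size:]
    return existing + chunk


def merge_stream(chunks):
    """Fold a sequence of possibly overlapping chunks into one string."""
    merged = ""
    for chunk in chunks:
        merged = merge_chunk(merged, chunk)
    return merged


chunks = [" Paris is", " is the capital", " capital of France."]
merged = merge_stream(chunks)
print(merged)
```

A purely textual heuristic like this one cannot tell an accidental overlap from a genuine repetition in the generated text, so it may drop legitimately repeated words.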



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am [Age]. I am a [Job or Profession], and I am dedicated to [My Profession's Main Goal or Mission]. I have been dedicated to this field for [Number] years, and I believe that I have the skills and experience to make a positive impact in [Your Profession's Field of Interest]. I am [Favorite Hobby or Activity], and I look forward to [Favorite Activity]. I am always looking for new challenges, and I am eager to learn and grow in my field. Thank you. I hope you have a great day. What would you like me to say next? Let me know

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the “City of Love” and the “City of Light”. Paris is a city of stunning architecture, unique food and music, and a unique way of life, making it the largest city in Europe by population. It is the seat of the 
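
The benefit of the asynchronous call is that other coroutines can make progress while generation is pending. The sketch below illustrates this with `asyncio.gather`. `FakeEngine` is a hypothetical stand-in that only mimics the `async_generate(prompts, sampling_params)` call used above, and `heartbeat` is an illustrative background task.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for the engine; not part of SGLang."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate time spent generating
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(log):
    """Background work that runs while generation is in flight."""
    for i in range(3):
        log.append(f"tick {i}")
        await asyncio.sleep(0.001)


async def main():
    engine = FakeEngine()
    log = []
    # Run generation and the background task concurrently.
    outputs, _ = await asyncio.gather(
        engine.async_generate(["a", "b"], {"temperature": 0.8}),
        heartbeat(log),
    )
    return outputs, log


outputs, log = asyncio.run(main())
print(outputs)
print(log)
```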

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [position] at [Company Name]. I'm passionate about [mention something that shows your love or enthusiasm for your job], and I'm always up for [mention an activity or project that showcases your interests and skills]. I thrive in [mention a situation or challenge that showcases your skills and abilities]. I'm driven and ambitious, always looking for ways to [mention a problem or challenge you are trying to solve]. I'm always learning and eager to [mention a skill or interest that is developing]. I'm [mention your age] years old, and I'm a [mention your gender]. I'm



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

The Paris metro system includes over 3, 000 stations, including major shopping districts, cultural venues, and landmarks. The city's metro system is one of the most extensive in the world, serving over 20 million commuters each day. It was designed to be a central hub for transportation, offering the freedom to get anywhere in the city. The metro system is designed to be accessible to all residents of France and some visitors to the country. The Paris metro system has a high capacity and operates on a wide range of modes of transportation, including buses, trains, and bicycles. The Paris metro system is also



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and constantly evolving, with many potential trends that are shaping its direction. Here are a few key areas that are likely to see significant advancements in the coming years:

1. Increased focus on ethical AI: As more people become aware of the potential risks associated with AI, there is a growing emphasis on making sure that AI systems are designed and used in a way that is fair, transparent, and responsible for the decisions they make.

2. More diverse AI: As technology continues to advance, there will be more opportunities for AI to learn and improve, but there will also be more opportunities for it to have biases and be biased in its decision
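
The consumption pattern in the cell above, an `async for` loop that prints each piece as it arrives, can be tried without a model. In the sketch below, `fake_stream` is a hypothetical async generator standing in for the engine's stream, and `consume` both prints the pieces and returns the assembled text.

```python
import asyncio


async def fake_stream(text, chunk_size=4):
    """Hypothetical stand-in that yields a completion piece by piece."""
    for start in range(0, len(text), chunk_size):
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield text[start:start + chunk_size]


async def consume(stream):
    """Print pieces as they arrive and return the assembled text."""
    pieces = []
    async for piece in stream:
        print(piece, end="", flush=True)
        pieces.append(piece)
    print()  # newline once the stream is exhausted
    return "".join(pieces)


full_text = asyncio.run(consume(fake_stream(" Paris is the capital of France.")))
```

Passing `flush=True` makes each piece appear immediately instead of waiting for the output buffer to fill.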




```python
llm.shutdown()
```