# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
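
As a rough illustration of the custom-server idea, the core of such a server is a handler that turns a request body into an engine call. The sketch below is not taken from the linked script: `handle_generate`, the JSON field names, and `StubEngine` are hypothetical, and the stub only mimics the `async_generate` call and the `{"text": ...}` output shape used later in this document so that the example runs without a model.

```python
import asyncio
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a model."""

    async def async_generate(self, prompts, sampling_params):
        # Mimic the output shape used in this document: one dict with "text" per prompt.
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def handle_generate(engine, raw_body: str) -> str:
    """Parse a JSON request body, call the engine, and return a JSON response."""
    try:
        body = json.loads(raw_body)
        prompts = body["prompts"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return json.dumps({"error": "expected a JSON object with a 'prompts' list"})
    if not isinstance(prompts, list) or not all(isinstance(p, str) for p in prompts):
        return json.dumps({"error": "'prompts' must be a list of strings"})

    # The "sampling_params" field name is an assumption made for this sketch.
    sampling_params = body.get("sampling_params", {})
    outputs = await engine.async_generate(prompts, sampling_params)
    return json.dumps({"completions": [o["text"] for o in outputs]})


request = json.dumps({"prompts": ["The capital of France is"]})
response = asyncio.run(handle_generate(StubEngine(), request))
bad_response = asyncio.run(handle_generate(StubEngine(), "not json"))
print(response)
print(bad_response)
```

In a real server, the stub would be replaced by an engine instance created once at startup, and the handler would be registered with whatever web framework you choose.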



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
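
The restriction behind this requirement is general `asyncio` behavior: a new event loop cannot be started with `asyncio.run` from code that is already running inside one. The sketch below reproduces that error using only the standard library. That the engine's synchronous calls hit this restriction because they drive an event loop internally is an assumption here, not something this page states.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # attempts to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

`nest_asyncio.apply()` patches `asyncio` so that such nested calls are permitted.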

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl

from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Ishaan. I'm 16 years old and I'm a big fan of Japanese cuisine, so I am curious about the foods of Japan, and I would love to know your thoughts on the various dishes that are typically prepared there.

1. In the majority of Japanese homes, how much do you usually use of the Japanese language?

2. What is your favorite Japanese dish? What is your favorite ingredient in it?

3. What are some of the popular Japanese restaurants in Tokyo?

4. Is there a Japanese dish that you usually use as a reference when cooking Japanese cuisine?

5. Do you have any favorite Japanese desserts?


Prompt: The president of the United States is
Generated text:  a very important person. He has a very important job. He works to make sure that everyone has enough food and clothing. He also makes sure that everyone has enough money to pay for their bills. He also helps to keep the country safe. That's why the president is very important. President Obama was the first 

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm passionate about [job title] and [job title]. I love [job title] because [reason for passion]. What do you do at work? I'm a [job title] at [company name], and I'm passionate about [job title] and [job title]. I love [job title] because [reason for passion]. What do you enjoy doing in your free time

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French National Museum, and the French Academy of Sciences. Paris is a bustling city with a rich history and culture, and it is a popular tourist destination. It is also known for its fashion industry, with Paris Fashion Week being one of the largest in the world. The city is also home to many famous restaurants and cafes, and it is a popular destination for tourists and locals alike. Paris is a city of contrasts, with its modern architecture and historical landmarks blending

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies are expected to continue to improve and become more integrated into our daily lives, from self-driving cars to personalized medicine. Additionally, AI is likely to become more integrated with other technologies, such as blockchain and quantum computing, creating new possibilities for innovation and collaboration. Finally, the ethical and social implications of AI are likely to become increasingly important, with concerns about privacy, bias, and the potential for AI to replace human workers. Overall, the future of AI is likely to be characterized by continued innovation, integration with other technologies,
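
The "overlap removal" mentioned in the cell above can be illustrated with a small standalone sketch. It assumes that a streamed chunk may begin with text that was already received, and it drops the longest such repeated prefix before appending. The helper names `trim_overlap` and `merge_chunks` are chosen for this illustration; this is one plausible way to do it, not necessarily how `stream_and_merge` is implemented.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, removing any overlap with the text merged so far."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each one partly repeats the previous one.
chunks = [" Paris", " Paris is", " is the capital", "."]
merged_text = merge_chunks(chunks)
print(repr(merged_text))  # ' Paris is the capital.'
```

One trade-off of this approach is that text the model genuinely repeats across a chunk boundary is indistinguishable from overlap and is also trimmed.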



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a writer. I'm an introverted person who enjoys writing word documents and planning out what I want to write. I'm not really a public speaker, but I do have a lot of experience speaking in front of groups. I also have a really bad habit of using my personal pronouns instead of first-person pronouns in my writing. That's why I like to say "I'm writing" instead of "I am writing." I'm really interested in the world of writing and have always been fascinated by the stories we tell. Can you tell me about yourself and what you love to do? Hello,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is known for its rich history, iconic architecture, and diverse culture. The city is home to numerous famous landmarks such as the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. It i

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert character name]. I am a [insert age, height, weight, etc.]. I love [insert favorite hobby or activity]. What brings you to this world? How would you like to be remembered? What do you dream of accomplishing in the future? How do you handle stress and make time for yourself? What makes you different from your peers? What's the ultimate goal of your career? What do you aspire to be in 10 years? I hope I can meet you somewhere on Earth. Your dream world is very real, and I am so happy to meet you! (long pause) Yeah, that’s



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

A. Incorrect, as the capital of France is not Paris.
B. Correct, as Paris is the capital of France.
C. Incorrect, as Paris is not the capital of France.
D. Correct, as Paris is the capital of France. To determine the correct answer, let's break down the options step by step:

A. Incorrect, as the capital of France is not Paris.
- This statement is incorrect because the capital of France is indeed Paris, which is the only city in France.

B. Correct, as Paris is the capital of France.
- This statement is correct because Paris is the capital city of France



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly uncertain and difficult to predict, but there are several possible trends that could play a role in shaping the course of AI development. Here are some of the key trends that are currently being considered:

1. Increased focus on ethical AI: As concerns over the potential risks of AI have grown, there is a push towards increased focus on ethical AI. This could lead to more sophisticated AI systems that take into account the ethical implications of their actions, and could also lead to more stringent regulation of AI development and deployment.

2. Improved precision and accuracy: AI is getting better at performing tasks that were once considered beyond the realm of human capability. This




In [6]:
llm.shutdown()
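
In a script, as opposed to a notebook, it can help to guarantee that `shutdown()` runs even when generation raises an exception. The sketch below wraps an engine in a small context manager. `engine_session` and `StubEngine` are hypothetical names introduced for this illustration; the stub stands in for `sgl.Engine` so the example runs without a model.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; records whether shutdown() ran."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(engine):
    """Yield the engine and always call shutdown() on exit."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with engine_session(engine) as llm:
        llm.generate([], {})  # raises ValueError inside the session
except ValueError:
    pass

print(engine.is_shut_down)  # True
```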