# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an extra HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
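
The reason for this requirement can be illustrated with the standard library alone. The sketch below is not SGLang code; `inner` and `outer` are hypothetical names, and `outer` stands in for an environment such as IPython where an event loop is already running. It shows that `asyncio.run()` refuses to start in that situation, which is the restriction `nest_asyncio.apply()` is meant to lift.

```python
import asyncio


async def inner():
    # Hypothetical coroutine standing in for any async call.
    return "done"


async def outer():
    # Inside this coroutine an event loop is already running,
    # similar to code executed in an IPython or Jupyter cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call into a running loop
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Without `nest_asyncio`, the nested call fails with a `RuntimeError`; the patch is only needed when a loop is already running, so plain scripts usually do not require it.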

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl

from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    # Allow asyncio.run() inside an already running event loop (e.g. IPython)
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Ivan, I'm 24 years old, I have a passion for technology and I currently study at a university. What are your interests, hobbies, and skills? As an AI language model, I do not have interests, hobbies, or skills in the traditional sense. However, I am designed to assist with a wide range of tasks and provide information to people around the world. If you have any questions or need assistance with a specific topic, feel free to ask! I'm always here to help. 

Wow, it's amazing how versatile you are. Can you tell me more about your ability to assist with tasks and provide information
Prompt: The president of the United States is
Generated text:  trying to decide how many military planes to buy. He finds a graph showing the budget cuts for the previous year and the budget cuts for the upcoming year. The graph shows a budget cut of $500 million for the previous year and a budget cut of $200 million for the upcoming year. If the president decides to 
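
In offline batch jobs the generated text is usually stored rather than printed. The following is a minimal sketch of writing prompt and completion pairs as JSON Lines. It assumes only what the cell above shows, namely that each output is a dict with a `"text"` key; the helper `write_jsonl` and the placeholder outputs are hypothetical and not part of SGLang.

```python
import io
import json


def write_jsonl(prompts, outputs, fp):
    """Write one JSON object per prompt/output pair to the file-like object fp."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "text": output["text"]}
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(prompts)


prompts = ["Hello, my name is", "The capital of France is"]
# Placeholder results shaped like the dicts used above (assumption for illustration).
outputs = [{"text": " <generated text 1>"}, {"text": " <generated text 2>"}]

buffer = io.StringIO()  # replace with open("results.jsonl", "w") for a real file
n_written = write_jsonl(prompts, outputs, buffer)
lines = buffer.getvalue().splitlines()
print(n_written, "records written")
```

With a real engine, `outputs` would be the list returned by `llm.generate(prompts, sampling_params)`.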

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is the largest city in Europe and the second-largest city in the world by population. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also a major center for art, music, and fashion. The city is home to many famous museums, including the Louvre, the Musée d'Orsay, and the Musée d'Art Moderne. Paris is a popular tourist destination and a major economic center in France. It is also a cultural hub for Europe and the world. The city is home to many important institutions, including the

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased automation: AI is likely to become more prevalent in various industries, from manufacturing to healthcare to transportation. Automation will likely lead to increased efficiency and productivity, but it will also lead to job displacement for some workers.

2. Enhanced privacy and security: As AI becomes more integrated into our daily lives, there will be increased concerns about privacy and security. AI systems will need to be designed and implemented with the utmost care to ensure that they are not used to harm or mislead individuals.

3. AI ethics and governance: As AI becomes more prevalent, there will be a need for ethical
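
The phrase "overlap removal" can be made concrete with a small sketch. The code below is one possible way to merge streamed chunks whose beginnings repeat the end of the text received so far. It is an illustration under that assumption, not the implementation of `stream_and_merge`; the names `trim_overlap` and `merge_chunks` are used here only for the example.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing."""
    max_overlap = min(len(existing), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while removing repeated boundary text."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks whose boundaries overlap.
chunks = [" Paris is", " is the capital", " capital of France."]
merged = merge_chunks(chunks)
print(merged)
```

A caveat of this approach is that a genuine repetition at a chunk boundary would be trimmed as well, so it only suits streams that are known to resend overlapping text.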



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name]. I am a [insert age] year-old woman with [insert profession or background]. I enjoy [insert something you like or feel], [insert something related to your profession or background]. I am a [insert nationality or ethnicity]. I currently live in [insert location]. I am passionate about [insert a personal interest or hobby]. I enjoy [insert a random fact about yourself]. I am excited to meet you and learn more about your life. What's your name? And maybe a little bit about yourself. Sure, let's get started. How about you, [insert other person's name]? And what's your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the city of light.

Paris is often referred to as the city of light because of its iconic skyline, breathtaking architecture, and vibrant culture. The city is home to
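
When requests come from independent coroutines instead of one prepared list, it can be useful to cap how many are in flight. The sketch below shows that pattern with `asyncio.Semaphore` and `asyncio.gather`. The function `fake_async_generate` is a stand-in written for this example, assumed to take a single prompt and return a dict with a `"text"` key; it is not an SGLang API, and `generate_with_limit` is likewise hypothetical.

```python
import asyncio


async def fake_async_generate(prompt, sampling_params):
    # Stand-in for an awaitable generation call (assumption for illustration).
    await asyncio.sleep(0.01)
    return {"text": f" <completion for {prompt!r}>"}


async def generate_with_limit(generate_fn, prompts, sampling_params, max_concurrency):
    """Run generate_fn for every prompt, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(prompt):
        async with semaphore:
            return await generate_fn(prompt, sampling_params)

    # gather() returns results in the same order as the prompts.
    return await asyncio.gather(*(one(p) for p in prompts))


prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
outputs = asyncio.run(
    generate_with_limit(fake_async_generate, prompts, {"temperature": 0.8}, 2)
)
for prompt, output in zip(prompts, outputs):
    print(f"{prompt!r} -> {output['text']!r}")
```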

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I am [Your Age] years old. I'm a [Your occupation] and I've been working in the field of [Your field of expertise] for [Your career length] years. I enjoy [Your hobby or passion]. I'm currently [Your current occupation or position]. I believe in [Your value or belief]. And I'm always looking to [Your goal or aspiration] and I'm committed to achieving it. What interests you? What are you passionate about? Why do you think you might be interested in working in this field or with this organization?
[Your Name] starts to answer the interview questions


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the capital city of France, located on the River Seine in the North East of the country. It is known for its iconic landmarks such as the Eiffel Tower, Notre Dame Cathedral, Louvre Museum, and Notre Dame de Paris, and for its rich history, culture, and diverse neighborhoods. The city is also famous for its food, fashion, and music scenes. Paris is a globally recognized and popular destination for tourists, and is home to many world-renowned museums, historical sites, and cultural institutions. Despite the challenges of aging populations and economic disparities, Paris remains an important cultural and social hub in France


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by rapid advancements, innovations, and applications of AI technologies in various industries and fields. Here are some possible future trends in artificial intelligence:

1. Increased automation and personalization: As AI technologies continue to improve, the level of automation in businesses will likely increase. Personalization will become even more prominent, as AI algorithms will be able to analyze vast amounts of data to understand customer preferences and tailor their experiences accordingly.

2. AI-based healthcare advancements: AI is already being used to develop new treatments and therapies for various diseases. As AI becomes more sophisticated, it may be able to identify new potential treatments and develop personalized treatment plans
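
The streaming loop above prints each chunk as it arrives. If the complete text is also needed afterwards, the chunks can be collected while they are printed. The sketch below shows this with a stand-in async generator, `fake_stream`, which is hypothetical and only mimics a source of text pieces; `collect_stream` is likewise a helper written for this example.

```python
import asyncio


async def fake_stream(prompt):
    # Stand-in async generator yielding text pieces (assumption for illustration).
    for piece in [" Paris", ".", " It", " is", " in", " France", "."]:
        await asyncio.sleep(0)
        yield piece


async def collect_stream(stream):
    """Print pieces as they arrive and return the joined text."""
    pieces = []
    async for piece in stream:
        print(piece, end="", flush=True)
        pieces.append(piece)
    print()  # finish the line once the stream ends
    return "".join(pieces)


full_text = asyncio.run(collect_stream(fake_stream("The capital of France is")))
```

In the cell above, the same helper could wrap `async_stream_and_merge(llm, prompt, sampling_params)` in place of `fake_stream(...)`.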




In [6]:
llm.shutdown()
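
In a script, an exception during generation would skip a trailing `shutdown()` call. A `try`/`finally` block is one way to make sure the engine is always released. The sketch below uses a hypothetical `FakeEngine` class so that it runs without a model; only the `shutdown()` method name is taken from the cell above.

```python
class FakeEngine:
    """Stand-in with the same shutdown() method name as the engine above."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": " <generated text>"} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


engine = FakeEngine()
error_message = None
try:
    engine.generate([], {"temperature": 0.8})  # fails on purpose
except ValueError as exc:
    error_message = str(exc)
finally:
    engine.shutdown()  # runs whether or not generation succeeded

print("shut down:", engine.is_shut_down)
```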