# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
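
The underlying problem can be reproduced with the standard library alone. The sketch below is illustrative and does not touch SGLang: `inner`, `outer`, and `demo_nested_loop_error` are hypothetical names. It only shows that `asyncio.run` refuses to start while another event loop is already running, which is the situation inside IPython or Jupyter.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # At this point an event loop is already running, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: not allowed by default
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


def demo_nested_loop_error():
    # Start the outer loop, then attempt a nested asyncio.run inside it.
    return asyncio.run(outer())


if __name__ == "__main__":
    print(demo_nested_loop_error())
```

Calling `nest_asyncio.apply()` beforehand patches asyncio so that nested calls of this kind are permitted.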

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine (imports unused in this document were removed)
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Kevin and I am a lawyer who specializes in corporate law. I represent individuals and companies in all areas of legal issues, including contract law, corporate governance, tax law, employment law, and intellectual property law. Can you tell me about a specific case you have represented in the past? Yes, as a lawyer, I represent individuals and companies in all areas of legal issues, including contract law, corporate governance, tax law, employment law, and intellectual property law. One of my recent cases involves a contract dispute between a retail company and a delivery service provider. The dispute arose out of a misunderstanding about the delivery schedule and resulted in the delivery
Prompt: The president of the United States is
Generated text:  represented by the vice president. In how many different ways can the vice president be chosen? There are 4 candidates for president, so there are 4 choices for the president.
Since the vice presi
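
The `temperature` and `top_p` entries in `sampling_params` control how the next token is drawn. The sketch below is a textbook illustration of temperature scaling followed by nucleus (top-p) filtering on a toy distribution. It is not SGLang's sampler: `top_p_filter` and `toy_logits` are hypothetical names and the logit values are made up for the example.

```python
import math


def top_p_filter(logits, temperature=0.8, top_p=0.95):
    """Return {token: prob} for the nucleus kept after temperature scaling."""
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}

    # Numerically stable softmax over the scaled logits.
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    mass = sum(kept.values())
    return {tok: p / mass for tok, p in kept.items()}


# Made-up logits for the token after "The capital of France is".
toy_logits = {" Paris": 5.0, " Lyon": 2.0, " the": 1.0, " a": -1.0}

if __name__ == "__main__":
    print(top_p_filter(toy_logits))                   # sharp: one candidate survives
    print(top_p_filter(toy_logits, temperature=2.0))  # flatter: more candidates survive
```

A lower temperature concentrates probability on the top token, so fewer candidates fall inside the nucleus; a higher temperature spreads the mass and lets more candidates through.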

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] with [number of years] years of experience in [industry]. I'm passionate about [reason for interest in the industry]. I'm always looking for new challenges and opportunities to grow and learn. I'm a [reason for interest in the industry] and I'm always eager to learn and improve. I'm a [reason for interest in the industry] and I'm always eager to learn and improve. I'm a [reason for interest in the industry] and I'm always eager to learn and improve. I'm a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is the largest city in France and the third-largest city in the world by population. Paris is known for its rich history, beautiful architecture, and vibrant culture. It is home to many famous landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. Paris is also a major center for the arts, music, and fashion. It is a popular tourist destination and a major economic hub in Europe. The city is home to many international organizations and is a major center for research and development. Paris is a city of contrasts, with its rich history and culture

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn and adapt to human behavior and preferences. This could lead to more personalized and adaptive AI systems that can better understand and respond to human needs.

2. Greater emphasis on ethical considerations: As AI becomes more integrated with human intelligence, there will be a greater emphasis on ethical considerations. This could lead to more robust AI systems that are designed to be transparent, accountable, and responsible.

3.
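
The "overlap removal" mentioned in the cell above can be pictured as follows. This is a minimal sketch under the assumption that consecutive streamed chunks may repeat the tail of the text received so far; `strip_overlap` and `merge_chunks` are hypothetical helpers written for illustration and are not presented as the implementation behind `stream_and_merge`.

```python
def strip_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text that repeats the current tail."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Illustrative chunks in which each one repeats part of the previous text.
chunks = [" Paris", " Paris, also", " also known as", " the City of Light."]

if __name__ == "__main__":
    print(merge_chunks(chunks))
```

One caveat of this greedy approach: a genuine repetition in the generated text (for example, "ha" followed by "ha") is indistinguishable from an overlap and would be trimmed as well.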



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a professional linguist. I have a deep understanding of the English language, and I can translate and paraphrase texts in various languages fluently. I enjoy my work, and I find great satisfaction in helping people understand the nuances of language. I specialize in language learning and teaching, and I believe that my skills can make a positive difference in people's lives. Thank you for taking the time to meet me. [Name] [Optional: Add a brief quote or note that reflects your professional experience or perspective.] [Note: If the character is a fictional character, ensure they have a clear self-introduction that

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

(6 points) 

1. What is the capital of France?
2. Name the major cities in France. (6 points) 

3. Which city is not in France
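
Because `async_generate` is awaitable, several batches can be submitted concurrently from one coroutine. The sketch below shows that pattern with `asyncio.gather`. To stay runnable without a model, it uses `StandInEngine`, a hypothetical class that mimics only the call shape seen above (a list of prompts in, a list of dicts with a `"text"` key out); `run_batches` is likewise an illustrative name.

```python
import asyncio


class StandInEngine:
    """Hypothetical stand-in; it copies only the call shape used above."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate work without loading a model
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Create one awaitable per batch and wait for all of them together.
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    # gather returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*tasks)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
params = {"temperature": 0.8, "top_p": 0.95}
results = asyncio.run(run_batches(StandInEngine(), batches, params))

for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

With a real engine in place of the stand-in, the same structure would let other coroutines make progress while generation is pending.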

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I am a [occupation] who has [any relevant skills or experience] and have been [any relevant achievements or accomplishments]. I am always [any relevant personality trait] and always ready to [any relevant goal or objective]. I am a [any relevant hobby or interest] who enjoys [any relevant interests or passions]. I am [any relevant personality trait] and always [any relevant trait or quality]. I am [any relevant personality trait] and always [any relevant trait or quality]. I am [any relevant personality trait] and always [any relevant trait or quality]. I am [any relevant personality trait] and always



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the Loire Valley region of southwestern France. It is one of the most populous cities in the European Union and is known for its rich history, beautiful architecture, and vibrant culture. Paris was founded as the seat of the French monarchy and has played a significant role in the country’s history, including the French Revolution, the French Revolution, and the French Revolution. It is home to numerous famous landmarks and attractions, including the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. Paris is also famous for its fashion, gastronomy, and art scene, attracting millions of tourists each year. The



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to see continued advancements in technologies like machine learning, natural language processing, and computer vision. AI is already being used in a wide range of applications such as autonomous vehicles, self-driving cars, and virtual assistants. In the future, we can expect to see AI integrated into more and more industries, from healthcare and finance to manufacturing and transportation. AI will likely become more personalized and adaptable, allowing it to learn from the data it receives and make better decisions in real-time. Additionally, AI will likely become more ethical and considerate, with more transparency and accountability in how it is used. Finally, AI will likely continue to be used to




In [6]:
llm.shutdown()