# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
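
As a rough illustration of the "custom server" idea, the sketch below separates request handling from the engine call. It does not use SGLang: `StubEngine` and `handle_generate` are hypothetical names introduced here, and the stub only assumes that `generate` takes a list of prompts and returns a list of dictionaries with a `text` key, matching the shape of the outputs printed on this page. The linked `custom_server` script remains the reference for a real server.

```python
import json


class StubEngine:
    """Hypothetical stand-in for an engine object; not part of SGLang."""

    def generate(self, prompts, sampling_params=None):
        # A real engine would run the model here; the stub just echoes.
        return [{"text": f" (stub completion for: {p})"} for p in prompts]


def handle_generate(engine, raw_body: bytes) -> bytes:
    """Turn a JSON request body into a JSON response body."""
    request = json.loads(raw_body)
    outputs = engine.generate(request["prompts"], request.get("sampling_params"))
    return json.dumps({"texts": [o["text"] for o in outputs]}).encode("utf-8")


engine = StubEngine()
body = json.dumps(
    {"prompts": ["Hello", "Bonjour"], "sampling_params": {"temperature": 0.8}}
).encode("utf-8")
response = handle_generate(engine, body)
print(response.decode("utf-8"))
```

Because `handle_generate` only deals with bytes in and bytes out, it could be called from whichever web framework or transport you prefer.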



## Nest Asyncio
If you want to use the **Offline Engine** in IPython, or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
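
To see why the patch is needed, the following standard-library sketch reproduces the underlying problem without SGLang: unpatched `asyncio` refuses to start a new event loop from code that is already running inside one, which is the situation in a notebook cell. The function names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second loop from inside the first
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches `asyncio` so that such nested calls are allowed, which is why the notebook cells on this page can call `asyncio.run(main())` directly.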

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.45it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.44it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Alix and I am a 36 year old pregnant woman with a 2 year old infant. I am a new mom and I have noticed on my abdomen that I have developed a lump. I am concerned about the possibility of a pituitary tumor. What should I do? I have a one year old daughter. I am very worried about her health and should I go to a doctor for this?
Thank you for your question. While it's important to keep a close eye on any changes in the abdomen, the lump you're noticing in your abdomen could be a variety of things, including a pituitary tumor. Pit
Prompt: The president of the United States is
Generated text:  expected to vote on Friday. In each of the next two weeks, the president will vote on the following day. If the president makes the correct decision, the winner is the president who gets more than half of the votes cast. If the president makes an incorrect decision, the winner is the president who gets the fewest votes cast. In a certain election, the first 
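
The `sampling_params` dictionary above sets `temperature` and `top_p`. The sketch below shows the textbook meaning of these two knobs on a made-up distribution: temperature rescales the logits before the softmax, and top-p (nucleus) filtering keeps the smallest set of most likely tokens whose cumulative probability reaches `top_p`. It is a conceptual illustration with hypothetical helper names, not SGLang's sampler.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
probs = apply_temperature(logits, temperature=0.8)
filtered = top_p_filter(probs, top_p=0.95)
print([round(p, 3) for p in filtered])
```

With these values the least likely token falls outside the nucleus and gets probability zero. A lower temperature sharpens the distribution, which is one reason the streaming example below uses `0.2` for more repeatable output.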

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm passionate about [job title] and [job title] at [company name]. I'm always looking for ways to [job title] and [job title] at [company name], and I'm always eager to learn and grow. What's your job title and what do you do at your current job? I'm a [job title] at [company name], and I'm

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting the headquarters of many major companies and institutions. Paris is known for its rich history, including the influence of the French Revolution and the influence of the French language. It is also home to many famous museums, including the Musée d'Orsay and the Musée Rodin. Paris is a popular tourist destination, with millions of visitors each year. The city is also known for its cuisine, including French cuisine, and its fashion industry. Overall

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some possible future trends include:

1. Increased integration with other technologies: AI will continue to be integrated with other technologies such as blockchain, IoT, and autonomous vehicles, creating a more interconnected and integrated world.

2. Enhanced privacy and security: As AI becomes more prevalent, there will be a need for greater privacy and security measures to protect user data and prevent misuse of AI systems.

3. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be a greater emphasis on ethical considerations and the responsible
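
The banner printed by the cell above mentions "overlap removal". One plausible way to merge streamed chunks that may repeat the tail of the text received so far is sketched below. The helper names `drop_repeated_prefix` and `merge_chunks` and the sample chunks are made up for illustration; this is not necessarily how `stream_and_merge` is implemented.

```python
def drop_repeated_prefix(existing_text: str, new_chunk: str) -> str:
    """Remove the longest prefix of new_chunk that already ends existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # no overlap found, keep the chunk as is


def merge_chunks(chunks):
    """Concatenate chunks while skipping text that was already received."""
    merged = ""
    for chunk in chunks:
        merged += drop_repeated_prefix(merged, chunk)
    return merged


# Made-up chunks in which each one may repeat the end of the previous text.
chunks = [" Paris", " Paris, the city", " city known for", " its landmarks"]
merged = merge_chunks(chunks)
print(merged)
```

A known limitation of this approach is that it cannot distinguish an accidental repeat from a deliberate one, so text that legitimately repeats itself at a chunk boundary would be shortened.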



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Job Title] at [Company Name]. My hobbies and interests range from running and hiking to playing guitar. I'm passionate about taking risks and exploring new places, and I love to travel. I'm always looking for new experiences and adventures, and I enjoy writing my own stories and managing my own projects.
I'm a true believer in the power of self-improvement and strive to constantly grow and learn. I'm always looking for new opportunities to challenge myself and grow as a person. I'm a hard worker who's always putting in the extra effort to achieve my goals. I'm confident in

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the most populous city in Europe, with a population of over 2 million people. The city is home to iconic landmarks such as the Eiffel Tower and Notre-Dame Cathe
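
An awaitable generation call lets other coroutines make progress while a batch is being processed. The sketch below illustrates the pattern with `asyncio.gather` and a stand-in engine. `StubAsyncEngine` and `run_batches` are hypothetical names, the stub sleeps instead of running a model, and how a real engine behaves under concurrent calls is not covered on this page.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in; a real engine would run the model instead of sleeping."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0.01)  # pretend to work without blocking the event loop
        return [{"text": f" <stub output {i}>"} for i, _ in enumerate(prompts)]


async def run_batches(engine, batches):
    # gather runs the calls concurrently and returns results in input order.
    return await asyncio.gather(*(engine.async_generate(b) for b in batches))


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(run_batches(StubAsyncEngine(), batches))
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```

Inside a notebook, the final `asyncio.run(...)` call relies on `nest_asyncio.apply()` having been executed, as in the first cell of this page.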

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [job title] at [company name]. I'm a skilled [skill or expertise] that is passionate about [reason for being passionate]. I'm determined to [reason for being determined] and work tirelessly towards achieving [reason for being determined]. I'm also [reason for being determined] by [reason for being determined]. I'm committed to [reason for being committed] and always strive to [reason for being committed]. I'm [reason for being committed] and always strive to [reason for being committed]. I'm [reason for being committed] and always strive to [reason for being committed].



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

That statement is true. Paris is the largest and most populous city in France, with an estimated population of over 2.7 million people. It serves as the capital of France and plays a central role in its political, cultural, and economic life. Paris is also home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city is known for its distinctive architecture, stunning views of the city and the Atlantic Ocean, and its rich history and culture. Its status as the capital is an important part of France's political and cultural identity. The statement also includes the



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly uncertain, but some possible trends that could emerge are:

1. Automation of tasks: AI is becoming more prevalent in automation of tasks, such as manufacturing, healthcare, transportation, and customer service. This will allow businesses to increase efficiency and reduce costs.

2. AI will continue to evolve: As AI technology improves, it will become more capable of performing tasks that were previously considered impossible. This could lead to new opportunities for innovation and progress.

3. AI will be more ethical: There is growing awareness about the potential of AI, and there are efforts to ensure that it is used responsibly and ethically. As more AI technology is developed



In [6]:
llm.shutdown()