# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```

```
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.04it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.03it/s]
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```
Prompt: Hello, my name is
Generated text:  6th grade student and I would like to ask about the concept of an equality and why is it called an equality?
A. The concept of equality is a fundamental and universal property of numbers that allows us to perform operations with numbers and make inferences about the relationship between numbers. It is called equality because it states that two or more numbers are equivalent to each other, meaning that one number is equal to another number. This concept is based on the idea that equal numbers have the same value, regardless of how they are arranged or represented. For example, 2 and 3 are equal because they represent the same amount of something,
Prompt: The president of the United States is
Generated text:  trying to choose between two medical research candidates. Candidate A is currently receiving medical research grants from 6 major pharmaceutical companies. Candidate B is currently receiving medical research grants from only 2 major pharmac
```
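
For batch jobs it is often handier to keep each prompt next to its completion than to print them. The sketch below relies only on the output shape shown above (one dict with a `"text"` key per prompt, in prompt order). `EchoEngine` and `generate_records` are hypothetical names; `EchoEngine` is a stand-in so the example runs without a model and would be replaced by the `llm` object in practice:

```python
from typing import Any, Dict, List


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        # Mimics only the output shape used above: one {"text": ...} dict per prompt.
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


def generate_records(
    engine: Any, prompts: List[str], sampling_params: Dict[str, Any]
) -> List[Dict[str, str]]:
    """Run one batched call and pair every prompt with its generated text."""
    outputs = engine.generate(prompts, sampling_params)
    if len(outputs) != len(prompts):
        # Guard against silently misaligned pairs when zipping.
        raise ValueError("expected exactly one output per prompt")
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


if __name__ == "__main__":
    records = generate_records(
        EchoEngine(),
        ["Hello, my name is", "The capital of France is"],
        {"temperature": 0.8, "top_p": 0.95},
    )
    for record in records:
        print(record)
```

The resulting list of dicts can be written to JSON Lines or loaded into a data frame without further reshaping.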

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```

```
=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] with [number of years] years of experience in [industry]. I'm passionate about [job title] and I'm always looking for ways to [job title] my skills and knowledge. I'm always eager to learn and improve, and I'm always willing to take on new challenges. I'm a [job title] and I'm excited to be here at [company name]. I'm looking forward to [job title] and I'm looking forward to [job title] with you. I'm excited to be here

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also home to the French Parliament, the French Academy of Sciences, and the French Riviera. Paris is a bustling metropolis with a rich cultural heritage and is a major tourist destination. The city is known for its fashion, art, and cuisine, and is a major economic center in Europe. It is also home to the French Parliament, the French Academy of Sciences, and the French Riviera. Paris is a vibrant and dynamic city with a rich cultural heritage and is a major tourist destination. The city

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that are expected to shape the future of AI:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI technology continues to improve, we can expect to see even more widespread use of AI in healthcare, particularly in areas such as diagnosis, treatment planning, and patient monitoring.

2. Greater integration of AI into everyday life: AI is already being integrated into everyday life through the use of voice assistants, self-driving cars
```
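
"Overlap removal" here means that when a streamed chunk repeats text that has already been received, the repeated prefix is dropped before the chunk is appended. The following is a sketch of that idea in plain Python. It is not necessarily how `sglang.utils.stream_and_merge` is implemented, and the helper names `trim_overlap` and `merge_chunks` are assumptions made for illustration:

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that existing_text already ends with."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first so the largest match wins.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate streamed chunks while removing repeated boundary text."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


if __name__ == "__main__":
    print(merge_chunks(["The cap", "capital of", " of France"]))
```

One consequence of this heuristic is that a chunk which legitimately repeats the end of the previous text (for example `"ha"` followed by `"ha"`) would also be trimmed, so it suits streams that may resend text better than streams that only ever send new tokens.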



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```

```
=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [job title] at [company]. Before joining the company, I worked at [previous company] as a [current job title]. I'm an [age] year old, and I grew up in [city/region]. I love [reason why I like my job], and I'm always [type of person]. I love [why I love my job], and I'm always [type of person]. I enjoy [reason for being] and I'm always [type of person]. I'm [name] and I'm [age]. I love to [reason for being] and I'm

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic Eiffel Tower, museums, and vibrant cultural scene.

### The Capital of France: Paris

Paris, the heart of France, is renowned for its iconic Eiffel Tower, rich cultural heritage, and vibrant street life. Known as "La Ville Blanche," Paris is a dynamic and ever-evolving city that captivates visitors with i
```
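
When prompts arrive one at a time rather than as a ready-made list, one option is to issue a separate awaitable call per prompt and cap how many are in flight. The sketch below shows that pattern with `asyncio.Semaphore` and `asyncio.gather`. `FakeAsyncEngine` is a hypothetical stand-in so the code runs without a model, and it assumes that `async_generate` called with a single prompt string returns a single dict with a `"text"` key:

```python
import asyncio
from typing import Any, Dict, List


class FakeAsyncEngine:
    """Hypothetical stand-in for sgl.Engine; it only mimics the awaitable call shape."""

    async def async_generate(
        self, prompt: str, sampling_params: Dict[str, Any]
    ) -> Dict[str, str]:
        await asyncio.sleep(0)  # yield control, as a real request would while waiting
        return {"text": f" <completion of {prompt!r}>"}


async def generate_bounded(
    engine: Any,
    prompts: List[str],
    sampling_params: Dict[str, Any],
    max_in_flight: int = 2,
) -> List[str]:
    """Send one request per prompt with at most max_in_flight running at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt: str) -> str:
        async with semaphore:
            output = await engine.async_generate(prompt, sampling_params)
            return output["text"]

    # gather returns results in argument order, so they line up with prompts.
    return await asyncio.gather(*(one(prompt) for prompt in prompts))


if __name__ == "__main__":
    texts = asyncio.run(
        generate_bounded(FakeAsyncEngine(), ["a", "b", "c"], {"temperature": 0.8})
    )
    print(texts)
```

For a fixed list of prompts, the single batched `async_generate(prompts, sampling_params)` call shown above is the simpler choice.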

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```

```
=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert your name], and I'm a [insert your profession or occupation]. I recently graduated from [insert your alma mater] and have been working in [insert your field of work] for [insert your number] years. I'm always on the lookout for new challenges and opportunities to learn and grow. What are some of the things you like to do in your free time? I enjoy going hiking, playing sports, reading books, and listening to music. What do you like to do in your free time? I enjoy going hiking, playing sports, reading books, and listening to music. I'm always looking for new challenges and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is known for its iconic landmarks like the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum.
Paris, the capital city of France, is renowned for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be characterized by rapid advancement in the areas of natural language processing, computer vision, and machine learning. There is also growing interest in developing ethical considerations around AI, including issues related to bias, transparency, and accountability. Additionally, there is an increasing focus on developing AI that is more effective and less prone to errors in decision making. Finally, there is a growing emphasis on developing AI that is more compatible with humans and can operate in a broader range of environments.

One potential future trend is the development of AI that can operate in human-like environments with humans in the loop. This could include developing AI that can work alongside humans in
```
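
A streaming consumer often needs both the live chunks (to display them) and the full text (to store it). Here is a minimal sketch of that pattern using an async generator. `fake_stream` is a hypothetical stand-in for `async_stream_and_merge(llm, prompt, sampling_params)`, and `collect_stream` is an illustrative helper, not an SGLang function:

```python
import asyncio
from typing import AsyncIterator, Callable, List, Optional


async def fake_stream(chunks: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for the engine's async chunk stream."""
    for chunk in chunks:
        await asyncio.sleep(0)  # give other tasks a chance to run between chunks
        yield chunk


async def collect_stream(
    stream: AsyncIterator[str],
    on_chunk: Optional[Callable[[str], None]] = None,
) -> str:
    """Forward each chunk to on_chunk as it arrives and return the joined text."""
    parts: List[str] = []
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk)
        parts.append(chunk)
    return "".join(parts)


if __name__ == "__main__":
    text = asyncio.run(
        collect_stream(
            fake_stream([" Paris", ",", " which", " is"]),
            on_chunk=lambda chunk: print(chunk, end="", flush=True),
        )
    )
    print()
    print(repr(text))
```

Collecting the parts in a list and joining once at the end avoids repeated string concatenation on long generations.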




```python
llm.shutdown()
```
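
In a script, an exception raised between creating the engine and calling `shutdown()` would skip the cleanup. One way to guard against that is a small context manager, sketched below. `engine_session` and `FakeEngine` are hypothetical names used for illustration; the only engine method assumed is the `shutdown()` call shown above:

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine that records whether it was shut down."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def shutdown(self) -> None:
        self.is_shut_down = True


@contextmanager
def engine_session(factory: Callable[[], Any]) -> Iterator[Any]:
    """Create an engine and always call shutdown(), even if the body raises."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


if __name__ == "__main__":
    with engine_session(FakeEngine) as engine:
        print("engine in use, shut down:", engine.is_shut_down)
    print("after the block, shut down:", engine.is_shut_down)
```

With a real engine the factory could be `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`, reusing the constructor call from the launch cell.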