# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
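
As a rough illustration of the custom-server use case, the sketch below wraps a generation callable in a tiny HTTP endpoint using only the Python standard library. It is not the linked `custom_server.py` example: `make_handler`, the `/generate` route, and the `fake_generate` stand-in are hypothetical names chosen for this illustration. In a real setup, `generate_fn` would presumably delegate to the engine's `generate` method shown later in this document.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_handler(generate_fn):
    """Build a handler class that forwards POST /generate to generate_fn."""

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/generate":
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length) or b"{}")
            text = generate_fn(
                payload.get("prompt", ""), payload.get("sampling_params", {})
            )
            body = json.dumps({"text": text}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep the example quiet

    return Handler


def fake_generate(prompt, sampling_params):
    # Hypothetical stand-in for the engine; a real server would call it here.
    return prompt.upper()


def demo():
    # Port 0 asks the OS for any free port.
    server = HTTPServer(("127.0.0.1", 0), make_handler(fake_generate))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
        conn.request(
            "POST",
            "/generate",
            body=json.dumps({"prompt": "hello"}),
            headers={"Content-Type": "application/json"},
        )
        result = json.loads(conn.getresponse().read())
        conn.close()
        return result
    finally:
        server.shutdown()
        server.server_close()


print(demo())
```

Passing the generation callable into `make_handler` keeps the HTTP plumbing independent of the engine, so the same handler can be exercised with a stand-in during testing.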



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
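
The underlying issue is that `asyncio.run()` refuses to start while an event loop is already running in the current thread, which is the situation inside IPython and Jupyter kernels. The following sketch reproduces that failure with the standard library alone (no SGLang or `nest_asyncio` involved); `inner` and `outer` are illustrative names.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are now inside a running event loop, much like code in a Jupyter cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: rejected by default
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Applying `nest_asyncio` patches the event loop so that such nested calls are allowed.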

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Wanda Brown. I just got a new job in a grocery store. I’ve been waiting for my job for about two weeks. The job gives me my own truck. I am taking out a loan to pay for it. I am considering taking a loan to pay for my wife’s house. I will be taking out a loan to pay for my car as well. I plan to take out two more loans to pay for them. My husband and I live in different counties. Our car is in our home, and we can afford to pay for it. How is it possible that the government is taking out more loans than we are
Prompt: The president of the United States is
Generated text:  a man, who has a salary of $300,000, and has received his salary for 10 years. If the president is now 75 years old and is only eligible for his current salary until the end of his term in office, how much will the president have in his retirement fund at the end of his term?
The president's current salary is $300,000 and he has received his salary for 10 years, so he has bee
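
To give some intuition for the two sampling parameters used above, here is a small, self-contained sketch of temperature scaling followed by nucleus (top-p) filtering over a toy distribution. It illustrates the general technique only and is not SGLang's actual sampler; the function name `top_p_filter` and the toy logits are made up for the example.

```python
import math


def top_p_filter(logits, temperature=0.8, top_p=0.95):
    """Return {token: prob} after temperature scaling and top-p truncation."""
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    norm = sum(kept.values())
    return {tok: p / norm for tok, p in kept.items()}


# Hypothetical next-token logits, for illustration only.
toy_logits = {"Paris": 5.0, "Lyon": 2.0, "Nice": 1.0, "banana": -3.0}
filtered = top_p_filter(toy_logits)
print(filtered)
```

In this toy case the most likely token alone already exceeds the `top_p` threshold, so every other candidate is discarded. Raising `top_p` or the temperature lets more candidates through.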

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a unique trait or skill that sets me apart from other characters in the story]. And what's your name? I'm [insert your name]. I'm looking forward to meeting you and learning more about you. What can you tell me about yourself? I'm a [insert a unique trait or skill that sets me apart from other characters in the story]. And what's your name? I'm [insert your name]. I'm looking

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as "La Ville Flottante" (floating city). It is the largest city in France and the second-largest city in the European Union. Paris is known for its rich history, art, and culture, and is a major tourist destination. The city is also home to many famous landmarks, including the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is a vibrant and dynamic city with a rich cultural and artistic heritage. The city is also known for its food scene, with many famous restaurants and cafes serving up delicious cuisine. Overall, Paris is a city of contrasts and excitement,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies will continue to improve and become more integrated into our daily lives, from self-driving cars and robots to personalized medicine and virtual assistants. As AI becomes more integrated into our daily lives, we may see a shift towards more ethical and responsible use of AI, with a focus on minimizing harm and maximizing benefits. Additionally, AI will continue to evolve and adapt to new challenges and opportunities, leading to new applications and innovations in the field. Overall, the future of AI is likely to be a rapidly evolving and transformative field, with a
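
The "overlap removal" named in the cell above can be pictured as follows: if a newly streamed chunk begins with text that the accumulated output already ends with, only the non-overlapping tail is appended. The helper below is a simplified sketch of that idea, written for illustration. `append_without_overlap` and `merge_chunks` are hypothetical names, and this is not necessarily how `stream_and_merge` is implemented in `sglang.utils`.

```python
def append_without_overlap(existing: str, chunk: str) -> str:
    """Append chunk to existing, dropping the longest prefix of chunk
    that duplicates a suffix of existing."""
    max_len = min(len(existing), len(chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return existing + chunk[size:]
    return existing + chunk


def merge_chunks(chunks):
    """Fold a sequence of possibly overlapping chunks into one string."""
    merged = ""
    for chunk in chunks:
        merged = append_without_overlap(merged, chunk)
    return merged


print(merge_chunks(["The capital", " capital of France", " is Paris."]))
# prints: The capital of France is Paris.
```

A purely textual heuristic like this cannot tell a duplicated boundary from a genuine repetition (merging `"ha"` and `"ha"` yields `"ha"`), so it is best suited to streams that are known to resend overlapping text.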



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  __________. I'm a/an ____________ (a/an, the, /) professional __________. My ____________ is to help people find happiness and fulfillment in life. I'm a/an ____________ (a/an, the, /) dedicated __________. I'm a/an ____________ (a/an, the, /) traveler who loves to explore the world and try new things. I'm a/an ____________ (a/an, the, /) communicator who helps people connect with others and build meaningful relationships. I'm a/an ____________ (a/an, the, /) learner who loves

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city famous for its iconic Eiffel Tower and romantic Canal Saint-Martin. It is located on the River Seine and is one of the world’s most important cultural and economic centers. The city has a rich history dating back to the 6th century and has undergone numerous transformations
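
One reason to prefer the asynchronous API is that the event loop stays free for other work while generation is in flight. The sketch below shows this pattern with a stand-in coroutine instead of a real engine, so that it runs without a GPU or model. `fake_async_generate` and `heartbeat` are hypothetical names, and only the `await` / `asyncio.gather` structure is meant to carry over.

```python
import asyncio


async def fake_async_generate(prompts, sampling_params):
    # Hypothetical stand-in for `await llm.async_generate(prompts, sampling_params)`.
    await asyncio.sleep(0.05)  # pretend the model is working
    return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(log, ticks=3):
    # Unrelated work that should keep running during generation.
    for i in range(ticks):
        log.append(f"tick {i}")
        await asyncio.sleep(0.01)


async def main():
    log = []
    prompts = ["The capital of France is", "The future of AI is"]
    # Run generation and the heartbeat concurrently on one event loop.
    outputs, _ = await asyncio.gather(
        fake_async_generate(prompts, {"temperature": 0.8, "top_p": 0.95}),
        heartbeat(log),
    )
    return outputs, log


outputs, log = asyncio.run(main())
for out in outputs:
    print(out["text"])
print(log)
```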

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [job title] at [company name]. I'm [age] years old. I have [number] years of experience in [job title]. I love [occupation] and I'm passionate about [what you do for a living]. I'm always looking for opportunities to grow in my field and I'm always ready to learn new things. I'm a [job title] and I thrive on [positive trait]. I'm [job title] and I love [job title] and I'm [positive trait]. I'm [job title] and I'm always looking for [positive trait]. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris, the City of Lights, is the capital city of France and is renowned for its beautiful architecture, vibrant culture, and annual summer celebrations. Its iconic landmarks include the Eiffel Tower, the Louvre Museum, Notre-Dame Cathedral, and the Notre Dame de Paris Basilica, among others. Paris is also home to many museums, theaters, and other cultural institutions, making it a major destination for tourists and a cultural hub for the region. Its annual La Tombeau de la Reine Marie procession and the World Cup football match in the Parc des Princes are among its most famous events. The French people

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and constantly evolving. Here are some of the most likely trends to see in AI over the next decade:

1. Improved precision and efficiency: One of the biggest trends in AI is improving the accuracy and efficiency of machine learning algorithms. As AI becomes more sophisticated, it will be able to perform more complex tasks and find more accurate solutions to problems.

2. Increased use of natural language processing: As more people rely on artificial intelligence-powered chatbots and virtual assistants, the need for more advanced natural language processing will only grow. This will involve building better models that can understand human language and generate human-like responses.

3. Greater emphasis on ethical
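
The cell above streams one prompt at a time. When interleaved output is not a concern (for example when collecting text rather than printing it), several streams can be consumed concurrently. The following sketch uses a stand-in async generator, `fake_stream`, in place of `async_stream_and_merge(llm, prompt, sampling_params)`. Both `fake_stream` and `collect` are hypothetical names for illustration, and whether this helps with a real engine depends on the workload.

```python
import asyncio


async def fake_stream(prompt):
    # Hypothetical stand-in for an engine stream: yields text chunks one by one.
    for word in (" one", " two", " three"):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield word


async def collect(prompt):
    # Accumulate the chunks of a single stream into one string.
    parts = []
    async for chunk in fake_stream(prompt):
        parts.append(chunk)
    return prompt + "".join(parts)


async def main():
    prompts = ["A:", "B:", "C:"]
    # gather() returns results in the same order as its inputs.
    return await asyncio.gather(*(collect(p) for p in prompts))


results = asyncio.run(main())
print(results)
```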




```python
llm.shutdown()
```