# Offline Engine API

SGLang provides a direct inference engine that does not require an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
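The core of a custom server is usually a thin handler that decodes a request, calls the engine, and encodes the response. The sketch below is framework-agnostic and is not taken from the linked script. `handle_generate_request` and `StubEngine` are hypothetical names, and the stub assumes that a single-prompt `generate` call returns a dict with a `text` field, so the example runs without loading a model.

```python
import json


class StubEngine:
    """Hypothetical stand-in for an engine; returns canned text instead of running a model."""

    def generate(self, prompt, sampling_params=None):
        # Assumption: a single prompt yields one dict with a "text" field.
        return {"text": f" (completion for: {prompt})"}


def handle_generate_request(engine, body: bytes) -> bytes:
    """Decode a JSON request, call the engine, and encode a JSON response."""
    request = json.loads(body)
    output = engine.generate(request["prompt"], request.get("sampling_params"))
    return json.dumps({"text": output["text"]}).encode("utf-8")


if __name__ == "__main__":
    raw = json.dumps(
        {"prompt": "Hello", "sampling_params": {"temperature": 0.8}}
    ).encode("utf-8")
    print(handle_generate_request(StubEngine(), raw).decode("utf-8"))
```

Keeping the handler independent of any web framework makes it easy to test and to mount behind whichever HTTP library the server uses.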



## Nest Asyncio
To use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, first add the following:
```python
import nest_asyncio

nest_asyncio.apply()

```
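To see why this patch is needed, the standard-library sketch below tries to call `asyncio.run()` while an event loop is already running, which is the situation in a notebook cell. It uses no SGLang code, and the function names are illustrative only.

```python
import asyncio


async def inner():
    return "done"


async def call_run_inside_running_loop():
    """Try to start a second event loop while one is already running."""
    coro = inner()
    try:
        # Mirrors code that calls asyncio.run() from inside a running loop.
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return f"RuntimeError: {exc}"


if __name__ == "__main__":
    # Run as a plain script: the outer asyncio.run() works because no loop is
    # running yet, while the nested call inside the coroutine is rejected.
    print(asyncio.run(call_run_inside_running_loop()))
```

Without `nest_asyncio`, the nested call raises `RuntimeError`. After `nest_asyncio.apply()`, nested calls of this kind are permitted.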

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine (imports not used in this document are omitted).
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.11it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Zhongli and I’m a PhD student in the department of Molecular and Cellular Biology at University of California, Berkeley. My research focuses on the function of the Wnt signaling pathway and its role in neuronal development and cell fate specification. I am especially interested in the mechanisms of the Wnt signaling in the context of neurodegenerative diseases, such as Alzheimer’s disease and Parkinson’s disease.
My current research project focuses on the characterization of the Wnt pathway in neurons of the vertebrate brain. I am looking at the role of the Wnt pathway in the development of neurons and in the specification of neuronal identity. The Wnt pathway
Prompt: The president of the United States is
Generated text:  a political office with political power and authority. The president is the head of the executive branch of the United States government. While not a leader in the traditional sense, the president of the United States is a na
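In batch jobs the results are usually stored rather than printed. The sketch below pairs each prompt with the `text` field of its output, as in the loop above, and serializes one JSON object per line. `to_jsonl` is a hypothetical helper, and the fake outputs stand in for the list returned by `llm.generate` so the example runs without a model.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text, one JSON object per line."""
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


if __name__ == "__main__":
    # Stand-in for the list returned by llm.generate(prompts, sampling_params).
    fake_outputs = [{"text": " Paris."}, {"text": " bright."}]
    print(to_jsonl(["The capital of France is", "The future of AI is"], fake_outputs))
```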

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [Age] year old [Occupation]. I'm a [Skill] with [Number] years of experience in [Field]. I'm a [Skill] with [Number] years of experience in [Field]. I'm a [Skill] with [Number] years of experience in [Field]. I'm a [Skill] with [Number] years of experience in [Field]. I'm a [Skill] with [Number] years of experience in [Field]. I'm a [Skill] with [Number] years of experience in [Field]. I'm a [Skill] with [Number]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light, and is the largest city in the country. It is located on the Seine River and is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is known for its rich history, art, and culture, and is a popular tourist destination for visitors from around the world. It is also home to the French Parliament and the French National Library. The city is known for its vibrant nightlife, fashion, and food scene, and is a major center for business and commerce in Europe. Paris is a city of contrasts, with

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior and experiences. This could lead to more natural and intuitive interactions between humans and machines.

2. Enhanced machine learning capabilities: AI is likely to become even more powerful and capable, with the ability to learn from vast amounts of data and make more accurate predictions and decisions. This could lead to more efficient and effective applications of AI in various fields.

3. Increased focus on ethical considerations: As AI becomes more integrated with human intelligence,
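The cell above relies on `stream_and_merge` to remove overlap between streamed chunks. As an illustration of what overlap removal can mean, the sketch below merges chunks while dropping any prefix of a new chunk that the accumulated text already ends with. This is an assumption-laden sketch, not the library's implementation: `trim_overlap` and `merge_chunks` are hypothetical names, and the chunks are plain strings rather than engine outputs.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, removing text that repeats the end of what came before."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


if __name__ == "__main__":
    print(merge_chunks([" Paris", " Paris is", " is the capital"]))
```

A heuristic like this cannot distinguish a resent fragment from a genuine repetition, so `merge_chunks(["ha", "ha"])` yields `"ha"`. That trade-off is worth keeping in mind when the generated text is expected to repeat itself.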



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm a [Occupation] who is passionate about [What you do best]. I enjoy [Reason why you enjoy it]. My favorite [Food] is [Your favorite food]. I hope to be [Future goal], and I'm always looking for ways to [Additional goal]. Thank you for considering me for an interview! [Name]: This is [First Name] from [Company]. We met on [Date/Time], and I'm excited to have you here today to meet me! [Name]: I'm [Last Name]. I'm the [Title] at [Company], and I've been working there

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is the largest city in France and the second-largest city in the European Union. It is located on the Seine River and is home to the Louvre Museum, the Eiffel Tower, and other cultural and historical landmarks. Paris has a diverse population of about 2.7 million people and is t
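When requests arrive independently rather than as one list, a common asyncio pattern is to issue one call per prompt and cap how many are in flight. The sketch below shows that pattern with a stub in place of the engine. `StubEngine` and `generate_all` are hypothetical, and the stub only assumes an `async_generate` coroutine that returns a dict with a `text` field for a single prompt.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in exposing an async_generate coroutine."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion of {prompt!r}>"}


async def generate_all(engine, prompts, sampling_params, max_in_flight=2):
    """Issue one request per prompt, with at most max_in_flight running at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt):
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # gather returns results in the same order as the input prompts.
    return await asyncio.gather(*(one(p) for p in prompts))


if __name__ == "__main__":
    results = asyncio.run(generate_all(StubEngine(), ["a", "b", "c"], {"temperature": 0.8}))
    for result in results:
        print(result["text"])
```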

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  ___________ and I am a/an _____________. My name is ___________ and I am a/an _____________. I am a/an _____________. I am a/an _____________. I am a/an _____________. I am a/an _____________. I am a/an _____________. I am a/an _____________. I am a/an _____________. I am a/an _____________. I am a/an _____________. I am a/an _____________. I am a/an _____________. I am a/an _____________. I am a/an __



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the oldest continuously inhabited city in Europe. It is a bustling metropolis with a rich cultural and historical heritage, home to numerous renowned museums, galleries, and theatres. Paris is also renowned for its cuisine, art, fashion, and nightlife. The city has a vibrant and diverse population, with residents from all over the world. Despite its historical and cultural significance, Paris remains a global metropolis and continues to be a popular tourist destination. It is the country’s second-largest city and is often referred to as the “Paris of Paris” due to its vibrant atmosphere.
Human: Wow, Paris sounds like an amazing place to



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by a number of factors, including technological advancements, changing societal needs, and emerging trends in business. Here are some possible future trends in AI:

1. Increased focus on ethics and privacy: As more data becomes available, there will be increased scrutiny of how AI is used and the impact it has on individuals and society as a whole. There will be a greater emphasis on ensuring that AI is developed and used ethically and that it does not harm individuals or harm society.

2. Integration of AI with other technologies: AI is increasingly being integrated with other technologies like the Internet of Things (IoT) and the Internet of
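The streaming loop above prints each chunk as it arrives. If the full text is also needed afterwards, the chunks can be accumulated while they are printed. The sketch below shows this with a stand-in async generator, so it runs without a model. `fake_stream` and `collect` are hypothetical names, and in real use the stream would be the async iterator returned by `async_stream_and_merge(llm, prompt, sampling_params)`.

```python
import asyncio


async def fake_stream():
    """Stand-in for an async stream of already-cleaned text chunks."""
    for chunk in [" Paris", ",", " the", " capital"]:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield chunk


async def collect(stream) -> str:
    """Print chunks as they arrive and return the concatenated text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # newline after the streamed text
    return "".join(parts)


if __name__ == "__main__":
    full_text = asyncio.run(collect(fake_stream()))
    print(repr(full_text))
```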




In [6]:
llm.shutdown()
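In a script, it can help to make sure `shutdown()` runs even when generation raises an exception. One way to do that, sketched below under the assumption that the engine only needs its `shutdown()` method called once, is a small context manager. `managed_engine` and `StubEngine` are hypothetical names; the stub replaces a real engine so the example runs without a model.

```python
from contextlib import contextmanager


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call its shutdown() on exit."""
    try:
        yield engine
    finally:
        engine.shutdown()


class StubEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


if __name__ == "__main__":
    engine = StubEngine()
    with managed_engine(engine) as llm:
        pass  # call llm.generate(...) here in real use
    print("shutdown called:", engine.closed)
```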