# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython, Jupyter, or any other code that already runs an event loop (a nested event loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
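
The patch is needed because environments such as IPython already run an event loop, and standard `asyncio` refuses to start a second one inside it. The stdlib-only sketch below reproduces that failure without `nest_asyncio`; the coroutine names `inner` and `outer` are illustrative and not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # attempts to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` patches `asyncio` so that this kind of nested call is permitted.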

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    # Allow nested event loops, e.g. inside Jupyter or IPython
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Gabriela and I am a PhD student at the University of Oklahoma. My current research interests are in algebraic geometry, especially in the study of moduli spaces of stable curves. I am also interested in the role of moduli spaces of vector bundles and their dualities in geometrically abundant phenomena. I have published my research in journals such as the Journal of Algebra, the Journal of Mathematical Sciences, and the International Journal of Mathematics. My PhD supervisor is János Kollár. I am currently pursuing my PhD from the University of Oklahoma. To get started, I need to make a few changes in the given text. First,
Prompt: The president of the United States is
Generated text:  a political office, and that office was first filled in 1789. The person who filled this office was: 
A. George Washington
B. John Adams
C. Thomas Jefferson
D. James Madison

To determine the correct answer, let's examine each option and consider the timeline of 
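
For larger prompt sets it can be convenient to collect results as records rather than printing them. The sketch below assumes only what the cell above shows, namely that `generate` returns one dict with a `"text"` key per prompt. `generate_in_chunks` and `FakeEngine` are hypothetical names used for illustration; in real use you would pass the `llm` object instead of the stand-in.

```python
import json


def generate_in_chunks(engine, prompts, sampling_params, chunk_size=2):
    """Call engine.generate on slices of prompts and collect one record per prompt."""
    records = []
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start : start + chunk_size]
        outputs = engine.generate(chunk, sampling_params)
        for prompt, output in zip(chunk, outputs):
            records.append({"prompt": prompt, "text": output["text"]})
    return records


class FakeEngine:
    """Hypothetical stand-in that mimics the list-of-dicts return shape."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for: {p}>"} for p in prompts]


records = generate_in_chunks(
    FakeEngine(),
    ["Hello, my name is", "The capital of France is", "The future of AI is"],
    {"temperature": 0.8, "top_p": 0.95},
)
for record in records:
    print(json.dumps(record))
```

Because the engine schedules batches itself, chunking like this is mainly useful for checkpointing results or bounding how many outputs are held in memory at once, not for throughput.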

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] with [number of years] years of experience in [industry]. I'm passionate about [job title] and I'm always looking for ways to [job title] and [job title]. I'm a [job title] who is always [job title] and I'm always [job title]. I'm a [job title] who is always [job title] and I'm always [job title]. I'm a [job title] who is always [job title] and I'm always [job title]. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also a major cultural and economic center, hosting numerous museums, theaters, and other attractions. Paris is a popular tourist destination, known for its rich history, art, and cuisine. The city is home to many notable French artists, writers, and musicians, and is considered one of the most beautiful cities in the world. Paris is a vibrant and dynamic city, with a diverse population and a rich cultural heritage. The city is also known for its role in the French Revolution and its influence on French literature and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more sophisticated, it is likely to become more integrated with human intelligence. This could lead to more sophisticated forms of AI, such as those that can understand and adapt to human emotions and behaviors.

2. Greater reliance on data: AI will become more data-driven, with more data being collected and analyzed to improve its performance. This could lead to more efficient and effective AI systems, as well as more accurate predictions and insights.

3. Greater use of machine learning: Machine learning will continue to play a larger role in AI, with more complex algorithms
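
The `stream_and_merge` helper hides how streamed chunks are combined into one string. As a rough illustration of what "overlap removal" can mean, the sketch below drops any part of a new chunk that repeats the end of the text collected so far. It is not necessarily how `sglang.utils` implements the helper; `trim_overlap`, `merge_stream`, and `fake_stream` are illustrative names, and the fake stream stands in for the engine.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks):
    """Accumulate streamed chunks into one string, skipping repeated text."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk["text"])
    return merged


# Hypothetical stream in which every chunk repeats all text produced so far.
fake_stream = [{"text": " Paris"}, {"text": " Paris is"}, {"text": " Paris is lovely"}]
merged_text = merge_stream(fake_stream)
print(repr(merged_text))
```

The same function also handles purely incremental chunks, since a chunk with no overlap is appended unchanged.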



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [occupation]. I'm passionate about [project or hobby you're passionate about]. What inspired you to become a writer?
When I was [age], my family moved to a small town where I found a love for writing. I started by writing a letter to a local newspaper, but it didn't take off. I quit my job to pursue my passion and started writing full time. I love the creative process and the thrill of the unknown. What are your writing goals? I want to become a [type of writer] and write more books. 
What's one piece of advice you have for aspiring

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is known for its iconic landmarks such as Notre-Dame Cathedral and Eiffel Tower, as well as its rich history, beautiful gardens, and lively nightlife. French people are known for their enthusiasm and qualit
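
One reason to prefer the asynchronous call is that other coroutines can make progress while generation is pending. The sketch below illustrates this with a stand-in engine, so it runs without a model. `FakeAsyncEngine`, `heartbeat`, and `run_together` are hypothetical names; the only assumption carried over from the cell above is that `async_generate` is awaitable and returns one dict per prompt.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in; only mimics awaiting and the return shape."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.05)  # pretend the model is busy
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(ticks, log):
    # Other work that keeps running while generation is pending.
    for i in range(ticks):
        log.append(f"tick {i}")
        await asyncio.sleep(0.01)


async def run_together(engine, prompts, sampling_params):
    log = []
    # gather returns results in argument order; heartbeat returns None.
    outputs, _ = await asyncio.gather(
        engine.async_generate(prompts, sampling_params),
        heartbeat(3, log),
    )
    return outputs, log


outputs, log = asyncio.run(
    run_together(FakeAsyncEngine(), ["The capital of France is"], {"temperature": 0.8})
)
print(outputs[0]["text"], log)
```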

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Emily, and I am a writer. I enjoy writing about my own experiences and emotions, and I love to share my thoughts and insights with others. What can you tell me about yourself? I have a background in journalism and have a deep appreciation for the arts and literature. I enjoy reading and appreciating the works of great writers and thinkers. What are your favorite books or authors to read? As an author, I am particularly fond of writing about my personal experiences and emotions. I enjoy exploring different perspectives and sharing my thoughts and insights with others. What are your favorite hobbies or activities? As an author, I am constantly learning new things

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a bustling city known for its rich history, artistic heritage, and stunning architecture. Visitors can explore the city's numerous museums, picturesque streets, and iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre. Paris is also home to numerous restaurants, bars, and cafes, and is a popular tourist destination for its diverse culture, cuisine, and annual events like the Eiffel Tower celebration. The city's rich history, art, and tourism industry make it a unique and unforgettable destination for visitors to France.

Note: This statement is factually accurate and provides a comprehensive overview of the city

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be shaped by a number of trends that are currently in development, but which are likely to shape the field of AI in the coming decades.

1. Deep Learning: The use of deep learning, a type of machine learning, has already shown great promise in natural language processing, computer vision, and speech recognition. In the future, it is likely that deep learning will continue to be the dominant tool for AI, as it is already widely used in a wide variety of applications and is expected to become even more powerful as the technology continues to evolve.

2. Biometric Security: As more and more people start to rely on biometric


In [6]:
llm.shutdown()