# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
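To make the custom-server idea concrete, here is a minimal, framework-agnostic sketch of a request handler. It is not the linked `custom_server.py` example: `StubEngine` and `handle_request` are hypothetical names, and the stub only mimics the call shape used later in this document (a list of prompts in, a list of dicts with a `text` key out).

```python
import json


class StubEngine:
    """Hypothetical stand-in for an engine, used so the sketch runs without a model."""

    def generate(self, prompts, sampling_params=None):
        # Mimic the output shape shown in this document: one dict per prompt.
        return [{"text": f" (echo of: {p})"} for p in prompts]


def handle_request(engine, body: str) -> str:
    """Turn a JSON request body into a JSON response by calling the engine."""
    request = json.loads(body)
    prompt = request["prompt"]
    sampling_params = request.get("sampling_params", {})
    output = engine.generate([prompt], sampling_params)[0]
    return json.dumps({"prompt": prompt, "text": output["text"]})


response = handle_request(StubEngine(), json.dumps({"prompt": "Hello"}))
print(response)
```

A web framework route would only need to pass the request body to a function like this, which keeps the engine call separate from the transport layer.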



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in any other environment where an event loop is already running (a nested event loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
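If you are unsure whether your environment already runs an event loop, the following sketch shows one way to check using only the standard library. The helper names `event_loop_is_running` and `check_inside_loop` are hypothetical and are not part of SGLang or `nest_asyncio`.

```python
import asyncio


def event_loop_is_running() -> bool:
    """Return True when called from code that already runs inside an event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running.
        return False
    return True


async def check_inside_loop() -> bool:
    # Called from a coroutine, so a loop is running at this point.
    return event_loop_is_running()


print(event_loop_is_running())           # checked from synchronous code
print(asyncio.run(check_inside_loop()))  # checked from inside a coroutine
```

When the first check reports a running loop (as it typically does in a notebook cell), calling `asyncio.run` directly would fail, which is the situation `nest_asyncio.apply()` is meant to handle.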

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    # Allow nested event loops (see the Nest Asyncio section)
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```

```
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.21it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.20it/s]
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```
Prompt: Hello, my name is
Generated text:  Beth, I am a 17 year old female. I was diagnosed with a genetic condition 3 years ago that causes an autoimmune response, meaning the body overreacts and attacks its own cells. I have been on immunosuppressive medication for the past 1 year and am on dialysis. I currently have a very low T-cell count, in my peripheral blood. I have been on a B-cell inhibitor for a year, that helps my immune system by killing off lymphoma, but does not have a long lasting effect. I have never been on a bone marrow transplant, or received a bone marrow transplant. I was
Prompt: The president of the United States is
Generated text:  visiting a small town, and he decides to give a small gift to each resident. He starts with $1000. After giving out the gifts, the amount of money left is $400. However, he also decides to keep $100 as a permanent gift for the town. How much money did the president give out in total, including the permanent gift? To determine the tota
```
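In an offline batch job you usually want to persist results rather than print them. The sketch below pairs prompts with outputs and serializes them as JSON Lines. It relies only on the `text` key used above. The helper names `to_records` and `to_jsonl` and the sample data are hypothetical.

```python
import json


def to_records(prompts, outputs):
    """Pair each prompt with the generated text of the matching output dict."""
    if len(prompts) != len(outputs):
        # zip() would silently drop the unmatched items, so fail loudly instead.
        raise ValueError("prompts and outputs must have the same length")
    return [{"prompt": p, "text": o["text"]} for p, o in zip(prompts, outputs)]


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Hypothetical sample data in the same shape as the outputs printed above.
sample_prompts = ["The capital of France is"]
sample_outputs = [{"text": " Paris."}]

print(to_jsonl(to_records(sample_prompts, sample_outputs)))
```

With a real engine, `sample_outputs` would be replaced by the list returned from `llm.generate(prompts, sampling_params)`.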

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


```
=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can you tell me about yourself? I'm a [insert a brief description of your profession or experience here]. I enjoy [insert a brief description of your hobbies or interests here]. What's your favorite hobby or activity? I love [insert a brief description of your favorite activity here]. What's your favorite book or movie? I love [insert a brief description of your favorite book or movie here]. What's your favorite place to go? I love [insert a brief description of your favorite

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also home to the French Parliament and the French Parliament House. Paris is a cultural and historical center with a rich history dating back to the Roman Empire and the French Revolution. It is a major transportation hub, with the Eiffel Tower serving as a symbol of the city. The city is also known for its cuisine, with dishes like croissants, escargot, and foie gras being popular. Paris is a vibrant and dynamic city with a diverse population and a rich cultural heritage. It is

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased automation and artificial intelligence: As AI becomes more advanced, it is likely to become more integrated into our daily lives, from manufacturing and transportation to healthcare and customer service. Automation will likely become more prevalent, with machines taking on tasks that were previously done by humans.

2. AI ethics and privacy concerns: As AI becomes more advanced, there will likely be increased concerns about its ethical implications and the potential for misuse. There will also be a growing need for regulations and guidelines
```
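The "overlap removal" mentioned in the cell above can be illustrated with a small, self-contained sketch. It assumes streamed chunks may repeat text that was already received. It is not necessarily how `stream_and_merge` is implemented, and `trim_overlap`, `merge_chunks`, and the sample chunks are hypothetical.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first and shrink until one matches.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text that repeats the current tail."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each chunk repeats part of the previous text.
chunks = [" Paris", " Paris is", " is the capital"]
print(merge_chunks(chunks))
```

One trade-off of suffix/prefix matching is that genuinely repeated text (for example `"ha"` followed by `"ha"`) would also be trimmed, so this approach only suits streams where chunks are expected to overlap.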



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


```
=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert character's name]. I'm a/an [insert fictional character's profession or occupation]. And I really enjoy [insert reason why you enjoy your profession or occupation]. I believe that being knowledgeable and skilled in my field allows me to make a significant impact in my community and beyond. And I also enjoy [insert reason why you enjoy this job]. So, if you have any questions, please feel free to ask me anything. I'm always happy to help and learn from you. [insert character's name] [insert character's profession or occupation] [insert reason why you enjoy your profession or occupation] [insert reason why you enjoy

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the largest city in France and the seat of government and politics. The city is located on the left bank of the Seine River,
```
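The benefit of an awaitable call is that other coroutines can make progress while it is pending. The sketch below shows that pattern with `asyncio.gather` and a stub coroutine. `fake_async_generate` and `run_batches` are hypothetical stand-ins, and whether issuing several concurrent calls suits a particular engine setup is an assumption you should verify.

```python
import asyncio


async def fake_async_generate(prompts, sampling_params=None):
    """Hypothetical stand-in for an async batch call; returns one dict per prompt."""
    await asyncio.sleep(0.01)  # simulate waiting on the engine
    return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def run_batches(batches):
    """Start all batch calls together and wait for every one to finish."""
    # gather() returns results in the same order as the awaitables it was given.
    return await asyncio.gather(*(fake_async_generate(b) for b in batches))


results = asyncio.run(run_batches([["a", "b"], ["c"]]))
for batch_outputs in results:
    print([o["text"] for o in batch_outputs])
```

In a notebook, where an event loop is already running, `asyncio.run` only works after `nest_asyncio.apply()`, as described in the Nest Asyncio section.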

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


```
=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [Age] year old [Gender], and I currently reside in [City/State]. My passion is [Favorite Hobby/Activity/Interest]. I live in a [Town/City/State] with my [Family/Caregiver]. What brings you to this place? [Show some enthusiasm or interest in the environment]. Thank you for the opportunity to meet you, [Name]. [Name] is looking forward to meeting you as well and to exploring your world. [Name] and I are excited to get to know each other and to learn more about you. [Name] is looking forward

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is a bustling metropolis with a long history and is known for its beautiful architecture, vibrant music, and annual cultural and artistic events. The city is also home to the French Parliament building, Louvre Museum, and the Notre-Dame Cathedral. It is a major tourist destination, attracting millions of visitors annually. Paris has a rich cultural and artistic heritage and is celebrated for its French cuisine, nightlife, and fashion. Its population is around 2.7 million people, making it the most populous city in the European Union. Paris is a modern and vibrant city with a strong sense of French identity and culture. Its historical significance

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly exciting, with numerous trends and potential applications shaping its development. Here are some possible future trends in AI:

1. Increased Integration with Other Technologies: As AI becomes more integrated with other technologies, such as robotics, artificial intelligence, and big data, it will become even more powerful and capable. This integration will enable AI to perform tasks that are currently the domain of human beings, such as image recognition, natural language processing, and decision making.

2. Democratization of AI: AI is currently mostly owned by the tech industry and research institutions, but there is growing recognition that AI should be democratized and accessible to all. This means that
```
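The streaming loop above prints each chunk as it arrives. If you also need the complete text afterwards, you can accumulate the chunks while iterating. The sketch below shows this with a stub async generator. `fake_stream` and `collect_stream` are hypothetical names, and the stub simply yields the pieces it is given.

```python
import asyncio


async def fake_stream(text_pieces):
    """Hypothetical async generator standing in for a streaming call."""
    for piece in text_pieces:
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield piece


async def collect_stream(stream):
    """Consume an async stream of text chunks and return the joined text."""
    parts = []
    async for piece in stream:
        parts.append(piece)
    return "".join(parts)


full_text = asyncio.run(collect_stream(fake_stream([" Paris", ".", " Paris", " is"])))
print(full_text)
```

Collecting into a list and joining once avoids building many intermediate strings when the stream is long.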




```python
# shut down the offline engine
llm.shutdown()
```