# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
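
To see why the patch is needed, the following minimal sketch uses only the standard library to reproduce the error that appears when `asyncio.run` is called while an event loop is already running, which is the situation inside a notebook cell. The function names `inner` and `outer` are illustrative and not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # At this point an event loop is already running, much like in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # attempts to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` first patches asyncio so that such nested calls are allowed, which is why the notebook cells in this document can call `asyncio.run(main())` directly.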

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:09<00:00,  9.10s/it]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Sherry and I am a high school senior in the 9th grade. I am from New York City and I love reading, cooking, and being outdoors. I enjoy participating in music, which I play a variety of instruments. I am also an avid golfer, which I play with my family. I have a passion for sustainability and I am determined to do my part in protecting the planet and making it a better place to live. Additionally, I am interested in helping those in need, such as helping with food banks and donating blood. I am currently a volunteer at a local food bank and I have been participating in the local
Prompt: The president of the United States is
Generated text:  an elected official. He is the leader of the federal government and the highest executive official in the United States. He is the head of the executive branch of the federal government and is responsible for guiding the administration of the government and carrying out the policies of the administration. T
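
The `sampling_params` dictionary above sets `temperature` and `top_p`. As a rough illustration of what these two settings conventionally mean, here is a generic textbook sketch of temperature scaling followed by nucleus (top-p) filtering. It is not SGLang's sampler, and `nucleus_distribution` is a hypothetical helper introduced only for this example.

```python
import math


def nucleus_distribution(logits, temperature, top_p):
    """Return {token_index: probability} after temperature scaling and top-p filtering."""
    # Lower temperatures sharpen the distribution; higher ones flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for four candidate tokens
warm = nucleus_distribution(toy_logits, temperature=1.0, top_p=0.9)
cold = nucleus_distribution(toy_logits, temperature=0.2, top_p=0.9)
print(sorted(warm), sorted(cold))
```

With these toy logits, the lower temperature concentrates almost all probability on the top token, so fewer candidates survive the top-p cut than at temperature 1.0.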

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is the largest city in Europe and the third-largest city in the world by population. Paris is known for its rich history, beautiful architecture, and vibrant culture. It is home to many famous landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. Paris is also a major center for business, finance, and tourism, making it a popular destination for tourists and locals alike. The city is home to many cultural institutions and events throughout the year, including the annual Eiffel Tower Festival and the annual Spring Festival. Paris is a city of contrasts, with its modern architecture and historical landmarks blending

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some possible future trends in AI:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be a growing emphasis on ethical considerations. This includes issues such as bias, transparency, accountability, and privacy.

2. Development of more advanced AI systems: As AI technology continues to advance, we may see the development of more advanced AI systems that can perform tasks that were previously thought to be impossible or difficult.

3. Integration of AI with other technologies: AI is already being integrated
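
The streaming cell above relies on `stream_and_merge` to remove overlap between successive chunks. The sketch below shows one way such overlap-aware merging could work, under the assumption that a new chunk may begin with text that was already emitted. `trim_overlap` and `merge_chunks` here are illustrative stand-ins written for this example and are not presented as the library's implementation.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # no overlap found


def merge_chunks(chunks):
    """Concatenate chunks while skipping any repeated overlap."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hand-made chunks with deliberate overlap between neighbours.
merged = merge_chunks(["Hello wor", "world", "ld!"])
print(merged)  # Hello world!
```

The same idea applies to the asynchronous variant: each incoming chunk is trimmed against the text accumulated so far before it is printed.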



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [field or occupation] who has been working in the [industry] for [number] years. I am passionate about [reason why you love your job], and I am always looking for ways to grow and improve my skills. I am confident in my ability to excel in this field, and I am eager to share my experiences and knowledge with others. Please feel free to ask me any questions you have about my work or life. I am excited to meet you! [Name] // [Address] // [City] // [State] // [Zip] // [Phone] // [LinkedIn]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

That's correct! The capital of France is Paris. It's known for its historic and cultural attractions, beautiful architecture, and lively city life. The city has a rich history dating back over 2, 000 years, and has been home to many notable figures, incl
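
One reason to prefer the asynchronous call is that other coroutines can make progress while generation is pending. The sketch below illustrates that pattern without loading a model: `StubEngine` is a hypothetical stand-in that only mimics the `async_generate(prompts, sampling_params)` call shape used above, and `heartbeat` is an arbitrary background task invented for the example.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the engine; it returns canned text instead of running a model."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def heartbeat(ticks):
    # Any other async work, e.g. logging or handling requests.
    for _ in range(3):
        ticks.append("tick")
        await asyncio.sleep(0.005)


async def run_batch(engine, prompts, sampling_params):
    ticks = []
    # Both coroutines share the event loop, so the heartbeat runs while generation is awaited.
    outputs, _ = await asyncio.gather(
        engine.async_generate(prompts, sampling_params),
        heartbeat(ticks),
    )
    return outputs, ticks


outputs, ticks = asyncio.run(
    run_batch(
        StubEngine(),
        ["The capital of France is"],
        {"temperature": 0.8, "top_p": 0.95},
    )
)
print(outputs[0]["text"], len(ticks))
```

Swapping `StubEngine()` for the real `llm` object created earlier should follow the same structure, assuming the engine is used from a single event loop.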

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert character's name]. I'm a skilled adventurer who has always been fascinated by the mysteries of the world. With a sharp mind and a natural thirst for adventure, I'm always on the lookout for new challenges and opportunities to explore the world. I love to travel, learn new languages, and try new foods, and I'm always up for the adventure of discovery. Whether I'm adventuring solo or with a group, I'm always on the lookout for new experiences and new challenges. So, if you're looking for someone with a keen sense of curiosity and a willingness to try new things, you're in the right place.



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Explanation: Paris is the largest city in France, the 23rd-largest city in the world, and the 13th-largest city in Europe. It is the capital of France and the largest city in the French Overseas Department of Paris. It is located in the center of the Paris region and is known for its historic landmarks, art, and cuisine. Paris is also famous for its fashion industry and is a popular destination for tourists from all over the world.

To summarize, Paris is the capital city of France, known for its rich history, stunning architecture, vibrant culture, and delicious cuisine. The city



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  incredibly exciting and promises to change the world in many ways. Some possible trends in AI that are currently in progress and that are likely to continue:

1. Increased autonomy: As AI systems become more advanced, they are being designed to be more autonomous. This means that AI systems can be made to make decisions without direct human intervention, which can lead to a more efficient and cost-effective system.

2. Personalization: AI is increasingly being used to personalize experiences for customers. This can lead to more effective marketing and targeted advertising.

3. Self-awareness: AI is becoming more self-aware, with the ability to think for itself and make decisions




In [6]:
llm.shutdown()