# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code where an event loop is already running, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
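
To see why this is needed, the following minimal sketch reproduces the error raised when `asyncio.run` is called while an event loop is already running, as is the case inside IPython or Jupyter. It uses only the standard library; `inner` and `outer` are illustrative names, and `outer` merely simulates the loop that a notebook keeps running.

```python
import asyncio


async def inner():
    # Stand-in for any coroutine that would be driven with asyncio.run().
    return "done"


async def outer():
    # Inside this coroutine an event loop is already running,
    # which mimics the situation in IPython or Jupyter.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Calling `nest_asyncio.apply()` beforehand patches asyncio so that such nested calls are allowed, which is why the snippet above is required in those environments.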

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl

from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.86it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Boris. I can speak and write English, but I can't read. How can I learn to read? What should I do? Boris's speech sounds quite informal. It's not clear what questions or concerns you have regarding reading. Please provide more details about your query or ask a more specific question, and I'll be happy to assist you. Let me know if you need any assistance in reading. Is there a particular book you'd like to learn from, or do you prefer to study reading skills yourself? Regardless of your preferences, I would be glad to help you. Please feel free to ask about how to improve your reading skills
Prompt: The president of the United States is
Generated text:  a political office. The longest a president has held the office of President of the United States is five years. If the current president has been in office for 32 years, what is the number of terms in which the president has been in office?
To determine the number of terms in which the preside

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I am a [job title] at [company name]. I am passionate about [reason for being at the company]. I am always looking for ways to [add value to the company]. I am [age] years old and I am [gender]. I am [occupation] and I am [hobbies or interests]. I am [language] and I am [country]. I am [email address]. I am [phone number]. I am [website address]. I am [social media handle]. I am [any other relevant information]. I am [any other relevant information]. I am [any other relevant information

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also a major cultural and economic center, hosting numerous museums, theaters, and festivals throughout the year. Paris is a popular tourist destination and is home to many international institutions and organizations. The city is known for its rich history, diverse culture, and vibrant nightlife. It is a major hub for international business and trade, and is a major transportation hub for Europe. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into the city's vibrant culture. The city is also known for

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies are expected to continue to improve and become more integrated into our daily lives, from self-driving cars and personalized medicine to virtual assistants and chatbots. Additionally, AI is likely to continue to be used for tasks such as fraud detection, cybersecurity, and environmental monitoring, as well as for tasks such as language translation and language generation. As AI becomes more integrated into our daily lives, we can expect to see even more widespread adoption of these technologies, and to see even more significant changes in the way we live and work. However
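
The "overlap removal" mentioned above can be illustrated with a small, self-contained sketch. It assumes that consecutive streamed chunks may repeat text that was already emitted, and it trims that repetition before appending. The names `strip_overlap`, `merge_stream`, and `fake_stream` are hypothetical and written for this illustration only; the actual `stream_and_merge` helper in `sglang.utils` may be implemented differently.

```python
def strip_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks):
    """Accumulate streamed chunks into one string without repeated text."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk["text"])
    return merged


def fake_stream():
    # Hypothetical stand-in for a streaming engine call; the overlapping
    # chunks are made up to exercise the trimming logic.
    yield {"text": " Paris"}
    yield {"text": " Paris is"}
    yield {"text": " is the capital"}


merged = merge_stream(fake_stream())
print(repr(merged))
```

With the real engine, the stand-in generator would be replaced by the stream of chunks that the engine yields for a prompt.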



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm an [age] year old [age] year old [gender]. I am an [occupation] who has been pursuing a [career] for [number of years] years. I am a [occupation] who has been pursuing a [career] for [number of years] years and who is passionate about [interest] and [job]. I love [job] because I am [job] and I am passionate about [interest]. I believe in [your belief or value] and I am determined to [your belief or value] for [reason] and I am determined to [your belief or value

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a city renowned for its rich history, art, and vibrant culture. The city is situated on the Loire River and has a population of approximately 1.2 million people. Paris is home to numerous famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum, as well
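
Because the asynchronous call is awaited, it can be combined with other coroutines. The sketch below shows one way to issue one request per prompt and wait for all of them with `asyncio.gather`. So that it runs without a model, it uses `fake_async_generate`, a hypothetical stand-in for `llm.async_generate` that is not part of SGLang; whether per-prompt requests are preferable to passing the whole list in one call depends on your workload.

```python
import asyncio


async def fake_async_generate(prompt, sampling_params):
    """Hypothetical stand-in for `llm.async_generate`; not part of SGLang."""
    await asyncio.sleep(0.01)  # simulate waiting on the engine
    return {"text": f" <completion for: {prompt}>"}


async def generate_all(prompts, sampling_params):
    # Create one request per prompt and wait for all of them together.
    requests = [fake_async_generate(p, sampling_params) for p in prompts]
    # gather() returns results in the same order as `prompts`.
    return await asyncio.gather(*requests)


prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = asyncio.run(generate_all(prompts, sampling_params))
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```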

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Sarah. I am a professional copywriter with over 5 years of experience in the industry. I specialize in crafting engaging and impactful copy for websites, blogs, and social media posts. My goal is to help companies grow their brand by creating content that resonates with their target audience and drives conversions. I love working with clients who are passionate about their products or services and want to share their message in a way that is both informative and memorable. I am always looking for new challenges and opportunities to learn and grow as a copywriter. Thank you for considering me for an interview! That sounds like a great position for me. Can you give

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
A. True
B. False
A. True

Paris is the largest city in France and the country's capital. It is known for its rich history, beautiful architecture, and diverse cultural scene. The city is home to many famous landmarks, including the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. Paris is also known for its cuisine, fashion, and music. The city has a reputation for being a hub of creativity and innovation, and has played a significant role in the development of French culture and language. As a result, it is often referred to as "the Paris of Paris."

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  promising and has the potential to revolutionize many industries and solve some of the most challenging problems in the world. Here are some possible future trends in AI:

1. Increased focus on ethical AI: As more ethical concerns are raised about AI, such as bias and privacy, there will be a growing emphasis on developing AI that is more ethical and responsible.

2. Deep learning and reinforcement learning: These two areas of AI are rapidly advancing and will continue to play a key role in the development of new and advanced AI systems.

3. Improved natural language processing: Natural language processing is becoming more sophisticated and is being used to automate tasks such as customer

In [6]:
llm.shutdown()