# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an extra HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
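
The reason for this patch can be seen with the standard library alone. The sketch below is a minimal illustration, not SGLang code: it calls `asyncio.run()` from inside a coroutine that is already running on an event loop, which is roughly the situation in a notebook cell. The names `inner` and `outer` are made up for this example.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, similar to a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # Starting a second loop from a running one is refused.
    except RuntimeError as exc:
        coro.close()  # Avoid a "coroutine was never awaited" warning.
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Without a patch such as `nest_asyncio.apply()`, the nested call raises a `RuntimeError` instead of running the coroutine.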

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    # Allow asyncio.run() inside an already running event loop (e.g. Jupyter)
    nest_asyncio.apply()


# Launch the offline engine
llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Jason and I have been an active member in the Fannie Mae community for many years now. For those of you who don't know, Fannie Mae is the national mortgage finance agency that assists the government in the purchase and sale of mortgages. It does not directly finance the mortgages. Fannie Mae is the largest national mortgage finance organization in the country, and has a mission of building access to affordable housing for low- and moderate-income families. My name is Jason and I have been an active member in the Fannie Mae community for many years now. For those of you who don't know, Fannie Mae is the national mortgage finance
Prompt: The president of the United States is
Generated text:  very busy. He has to work for 20 years. At the beginning of each year, he works for 5 years, and then he takes a 3-year vacation. After 20 years, how many years will he work in a year?
To determine how many years the president of the United States will work 
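
The `sampling_params` dictionary in this cell sets `temperature` and `top_p`. As a rough illustration of what these two settings usually mean, the sketch below applies temperature scaling and nucleus (top-p) filtering to a toy set of logits. It shows the general technique only and is not SGLang's sampler; the function names `apply_temperature` and `top_p_filter` and the logit values are hypothetical.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature (lower = sharper)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # Subtract the max for numerical stability.
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize so the surviving probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.1, -3.0]  # Made-up scores for four candidate tokens.
probs = apply_temperature(toy_logits, temperature=0.8)
nucleus = top_p_filter(probs, top_p=0.95)
print(nucleus)
```

With these toy values the least likely token falls outside the nucleus, so it can never be sampled. A lower temperature concentrates more probability on the top token.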

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I am a [occupation] who has been [number of years] in the industry. I have a passion for [occupation] and have always been [number of years] in the field. I am always looking for new challenges and opportunities to grow and learn. I am a [number of years] in the industry and have always been [number of years] in the field. I am always looking for new challenges and opportunities to grow and learn. I am a [number of years] in the industry and have always been [number of years] in the field. I am always looking for new challenges and opportunities

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic Eiffel Tower and its rich history dating back to the Middle Ages. It is also the seat of the French government and the country's cultural and political center. Paris is a bustling metropolis with a diverse population, and it is home to many famous landmarks such as the Louvre Museum and the Notre-Dame Cathedral. The city is also known for its fashion industry, with Paris Fashion Week being one of the largest and most prestigious in the world. Overall, Paris is a vibrant and exciting city that is a must-visit for anyone interested in French culture and history.Human: Can you

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior and decision-making processes. This could lead to more efficient and effective AI systems that can make better decisions and solve complex problems.

2. Greater emphasis on ethical AI: As AI becomes more integrated with human intelligence, there will be a greater emphasis on ethical considerations. This could lead to more transparent and accountable AI systems that are designed to minimize harm and maximize benefits.

3. Increased use of AI in healthcare: AI is already being used in healthcare to improve
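
The cell above relies on `stream_and_merge` to remove overlap between streamed chunks. To make the idea concrete, here is a minimal sketch of one way overlap removal can work: drop the part of a new chunk that repeats the end of the text collected so far. This is an assumption about the general approach, not the actual helper from `sglang.utils`; `trim_overlap`, `merge_chunks`, and the sample chunks are hypothetical.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # No overlap found, keep the chunk unchanged.


def merge_chunks(chunks):
    """Concatenate chunks while skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Made-up chunks whose boundaries repeat a few characters.
chunks = ["Paris is", " is the capital", " capital of France."]
merged = merge_chunks(chunks)
print(merged)
```

One limitation of this simple approach is that text the model genuinely repeats at a chunk boundary would also be trimmed.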



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I'm an [age] year old [role] here. I'm currently [some of your experiences, such as working at [company], studying at [school], or traveling to [place]]. I'm [able to describe your personality traits or hobbies]. I'm a [choose one] with [describe a specific skill, interest, or passion]. I have a goal [state one, such as "to improve my [specific skill or area]"] and I'm [one or two words to describe my motivation for this goal]. I'm [one or two words to describe my overall personality or character traits].



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is the capital city of France, located on the Île de la Cité, on the North Atlantic Ocean, in the Provence region, in the northwestern Mediterranean. It is the largest city in Europe and one of the most populous urban areas in the wo
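
A reason to prefer the asynchronous call is that other coroutines can make progress while generation is awaited. The sketch below illustrates that pattern with `asyncio.gather`. It does not use SGLang: `StubEngine` is a hypothetical stand-in that only mimics the output shape seen in this document (a list of dicts with a `"text"` key), and `heartbeat` is an invented background task.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the engine; NOT part of SGLang."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # Pretend that inference takes some time.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(log):
    # Unrelated work that runs while generation is being awaited.
    for i in range(3):
        log.append(f"tick {i}")
        await asyncio.sleep(0.001)


async def main():
    engine = StubEngine()
    log = []
    outputs, _ = await asyncio.gather(
        engine.async_generate(["prompt a", "prompt b"], {"temperature": 0.8}),
        heartbeat(log),
    )
    return outputs, log


outputs, log = asyncio.run(main())
print(outputs)
print(log)
```

In a real application the stub would be replaced by the engine, and the background task by whatever else the event loop needs to serve.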

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Jane and I'm a midlife crisis survivor who used to be a successful chef. After a few years of struggling financially, I've decided to start a cookbook. My story is a tale of how I faced my own struggles and found my way back to success. I'm passionate about sharing my knowledge and helping others achieve their dreams. What's your story? I'm Jane. What inspired you to start a cookbook? I used to be a successful chef, but after a few years of struggling financially, I decided to start a cookbook to share my knowledge and inspire others. It was a big leap for me to take on this new challenge



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its beautiful architecture, rich history, and vibrant culture. It is also home to the Eiffel Tower and the Louvre Museum.

The city of Paris is a must-visit destination for anyone interested in French culture, history, and food. The city is also home to many famous landmarks such as the Notre-Dame Cathedral and the Arc de Triomphe. Paris has a diverse population, with residents from many different cultures and nationalities.

The French language is spoken in the capital city of Paris, though many people speak English as their first language. Paris has a rich history and culture that has made it



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be one of rapid advancement and innovation. Here are some possible future trends in artificial intelligence:

1. Increased integration of AI into various industries: AI is already being used in many industries, but the trend is likely to continue. As AI becomes more integrated into different sectors, we can expect to see more complex and sophisticated applications of AI.

2. Increased focus on ethics and privacy: As AI becomes more advanced and pervasive, there will be an increasing focus on the ethical and privacy implications of AI. This will likely lead to greater regulation of AI development and deployment, and a greater emphasis on transparency, accountability, and fairness in AI-based




In [6]:
llm.shutdown()
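
In a script, an exception raised before `shutdown()` is reached would leave the engine running. One possible way to guard against that is a small context manager that always calls `shutdown()`. The sketch below is only an illustration: `managed_engine` is a hypothetical helper, and `StubEngine` is a stand-in that simulates a failure so the example runs without SGLang.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for the engine; NOT part of SGLang."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        raise RuntimeError("simulated failure during generation")

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and call shutdown() even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
error_message = None
try:
    with managed_engine(engine) as llm:
        llm.generate(["Hello, my name is"], {"temperature": 0.8})
except RuntimeError as exc:
    error_message = str(exc)

print(engine.is_shut_down, error_message)
```

The same effect can be had with a plain `try`/`finally` block around the generation calls.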