# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
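As a rough illustration of the custom-server idea, the sketch below shows only the request-handling core that a server would wrap around the engine. It is not the linked `custom_server.py`: `StubEngine`, `handle_generate`, and the JSON field names are hypothetical, and the stub merely imitates the `generate(prompts, sampling_params)` call and the `output["text"]` shape used later in this document.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompts, sampling_params):
        # Imitates the shape used in this document: a list of dicts with a "text" key.
        return [{"text": f" (echo of: {p})"} for p in prompts]


def handle_generate(engine, raw_body: bytes):
    """Turn a JSON request body into a (status_code, JSON response body) pair."""
    try:
        payload = json.loads(raw_body)
        prompt = payload["prompt"]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, a missing field, or a non-object payload.
        error = {"error": "expected a JSON object with a 'prompt' field"}
        return 400, json.dumps(error).encode()

    # Fall back to the sampling parameters used elsewhere in this document.
    sampling_params = payload.get(
        "sampling_params", {"temperature": 0.8, "top_p": 0.95}
    )
    outputs = engine.generate([prompt], sampling_params)
    return 200, json.dumps({"text": outputs[0]["text"]}).encode()


status, body = handle_generate(StubEngine(), b'{"prompt": "The capital of France is"}')
print(status, json.loads(body)["text"])
```

With a real engine, the same function could sit behind whichever HTTP framework you prefer; keeping it free of framework types makes it easy to test in isolation.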



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
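To see why the patch is needed, the standard-library sketch below reproduces the underlying problem without SGLang: calling `asyncio.run()` while an event loop is already running, as is the case inside a notebook cell, raises a `RuntimeError`. The function names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # A loop is already running at this point, much like inside a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # attempts to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that such nested calls are allowed, which is why the notebook cells in this document can call `asyncio.run(main())` directly.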

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.17it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Cindy. I am 16 years old. My favorite subject is English. I like to write stories and read books. I love the thrill of adventure. I love the people. I like to stay at home and watch movies on my computer. I am 14 years old and my favorite subject is science. I like to play with my friends and watch cartoons. My mom and dad like to play with me and I like to write stories and do homework. I love my family and my parents. I have a friend called Taylor. He is 15 years old. He likes to play with his friends and read books.
Prompt: The president of the United States is
Generated text:  an elected office. The United States president is elected by the people and is the chief executive of the United States government. The president's main role is to make decisions that will help the government to operate effectively. They are the head of the executive branch. Some of the duties of the president are making decisions in the national defense, foreign pol

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your interests and what you're looking for in a job. What can I do for you today? [Name] is looking for a [job title] at [company name]. [Name] is interested in [job title] at [company name]. [Name] is looking for a [job title] at [company name]. [Name] is interested in [job title] at [company name]. [Name] is looking for a [job title] at [company name]. [Name] is

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum. It is also a major center for art, culture, and politics, and is home to many world-renowned museums, theaters, and other cultural institutions. Paris is a popular tourist destination, known for its rich history, beautiful architecture, and vibrant culture. It is the largest city in France and a major economic and political center in Europe. The city is also home to many international organizations and institutions, including UNESCO and the European Union. Paris is a city of contrasts, with its modern architecture and historical landmarks

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be a greater emphasis on ethical considerations. This will include issues such as bias, privacy, and transparency.

2. Greater use of AI in healthcare: AI is already being used to improve the accuracy of medical diagnoses and treatment plans. As AI becomes more advanced, we may see even more widespread use in healthcare.

3. Increased use of AI in manufacturing: AI is already being used to optimize production processes
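
The "overlap removal" named in the cell above can be pictured with a small standard-library sketch. It assumes that overlap means a streamed chunk re-sending text that was already received; `strip_overlap` and `merge_chunks` are hypothetical names used for illustration and should not be read as the implementation behind `stream_and_merge`.

```python
def strip_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that already ends `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, skipping any re-sent overlap."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Each chunk repeats the tail of the previous one; the merged text has no repeats.
print(merge_chunks([" Paris", " Paris, which", " which is known"]))
```

A purely textual heuristic like this would also drop a genuine repetition that happens to fall on a chunk boundary, so it only fits streams where overlap is known to be an artifact.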



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I am a [job title] at [Company Name]. I'm excited to be here and make a difference in [objective]. 

What brings you to [Company Name] and what makes you unique to the role? 

Please share your story and how you got here. What inspired you to pursue this career path? 

I want to know about the challenges you've faced and how you've overcome them. 

Additionally, how do you balance your work and personal life? 

Lastly, I would love to know your vision for [Company Name]. Can you explain your mission statement and how you believe it will shape

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in the country and is home to many of France's cultural and historical landmarks. It is the seat of government, the heart of the European Union, and a popular tourist destinati

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name]. I’m a person who enjoys [insert hobby or interest]. I spend a lot of time exploring the world, reading and writing, and learning about different cultures and perspectives. I’m always open to new experiences and ideas, and I try to use my knowledge to help people in my personal and professional life. I’m passionate about sharing my experiences and learning with others, and I’m always up for a good challenge. Thank you for taking the time to meet me.

Please note that the name and profession should be fictional and not have any specific meanings or connotations. Your response should be a short, neutral self-int

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

To verify this statement, I will:
1. Search for information about France's capital city.
2. Look for the name of Paris.
3. Check if it's a significant city in France.
4. Confirm it's the capital city of France.

After researching, I can confirm that Paris is indeed the capital city of France. Therefore, I can summarize the information in the following way:

The capital of France is Paris. This statement is factual and accurate. It is widely recognized as the official and historical center of France, serving as the seat of government, administrative, cultural, and commercial activities. Paris is known for

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  complex and will likely involve many different trends and developments. Here are some possible trends:

1. Increased use of AI in healthcare: AI is already being used in medical diagnosis, drug development, and patient care, but it is likely to become even more widespread in the coming years. AI will be used to improve the accuracy of diagnoses, reduce the risk of errors, and provide personalized treatment plans.

2. AI in manufacturing: AI is already being used in manufacturing, from automating production lines to optimizing supply chains. As AI technology continues to advance, it is likely to be used even more extensively in manufacturing to improve efficiency, reduce costs,

In [6]:
llm.shutdown()
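
In a script, as opposed to a notebook, it can help to guarantee that `shutdown()` runs even when generation raises. The sketch below wraps that in a context manager. `engine_session` and `StubEngine` are hypothetical names, and the stub stands in for `sgl.Engine` only so the example runs without a GPU.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(make_engine):
    """Create an engine, hand it to the caller, and always shut it down."""
    engine = make_engine()
    try:
        yield engine
    finally:
        engine.shutdown()


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; it only records whether shutdown ran."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


with engine_session(StubEngine) as engine:
    pass  # with a real engine, call engine.generate(...) here

print(engine.closed)
```

With SGLang installed, the factory argument could be something like `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`, matching the engine created at the top of this document.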