# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
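To see why this is needed, the following minimal sketch (standard library only, independent of SGLang) reproduces the error that a nested `asyncio.run` call raises inside an already running event loop, which is the situation in IPython or Jupyter. The coroutine names `inner` and `outer` are illustrative.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are already inside a running event loop here, as in a Jupyter cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: rejected by default
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that nested calls of this kind are permitted.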

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```

```
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.07it/s]
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```
Prompt: Hello, my name is
Generated text:  Jack. I'm 13 years old. I study in No. 1 Middle School. I have a sister. Her name is Lisa. We are both good students and we often play sports together. Sometimes we play in a group. We have a good time. We love our parents, too. One day when I was about 15, I didn't go to school. I was at home. My father bought a new car and I decided to play with my toy car. As soon as I got home, I saw a car in the driveway (院子). I couldn't believe it and ran to my mother,
Prompt: The president of the United States is
Generated text:  68 years old now. How old will the president be in 14 years? Let's solve this step-by-step:

1. We know that the current president is 68 years old.

2. We want to find out how old the president will be in 14 years.

3. To do this, we need to add 14 years to the current age of the president.

4. We can use the following equation to solve for the future age:
   Future Age = Current Age + Number of Years to Add

   14 years = 68 y
```

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


```
=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city known for its iconic Eiffel Tower, Notre-Dame Cathedral, and the annual Eiffel Tower Festival.

(Note: The Eiffel Tower is actually called the "Statue of Liberty" in French, and the Eiffel Tower Festival is called the "Eiffel Tower Parade" in French.)

Please provide the statement in a clear, concise format, including the French name of the capital city and its iconic landmark. Additionally, please include the French name of the Eiffel Tower and the name of the Eiffel Tower Festival in French.)

The statement should be grammatically correct and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies will continue to improve and become more integrated into our daily lives, from self-driving cars to personalized healthcare and financial services. Additionally, AI will likely continue to be used for ethical and social purposes, such as improving access to education and healthcare for marginalized communities. Finally, AI will likely continue to evolve and adapt to new challenges and opportunities, leading to new and exciting applications and innovations. Overall, the future of AI is likely to be a rapidly evolving and transformative field, with significant potential for both positive and negative impacts on society
```
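The phrase "overlap removal" refers to merging streamed chunks so that text repeated at a chunk boundary is not duplicated. The following is a minimal sketch of that general idea, not necessarily how `stream_and_merge` is implemented; the function names `trim_overlap_sketch` and `merge_chunks` and the sample chunks are hypothetical.

```python
from typing import Iterable


def trim_overlap_sketch(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing."""
    max_possible = min(len(existing), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks while removing boundary overlaps."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


# Illustrative chunks whose boundaries overlap ("cap" and " of").
merged_text = merge_chunks(["The cap", "capital of", " of France"])
print(merged_text)
```

A purely textual check like this cannot distinguish a repeated boundary from text that legitimately repeats (for example, two consecutive identical characters), so it is best suited to streams that are known to resend overlapping text.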



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


```
=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am [Current Age]. I am a [Job Title] at [Company Name], and I have been working in this role for [Number of Years] years. Before that, I spent [Number of Years] years working as a [Previous Job Title] at [Previous Company Name]. I am [Gender] and I am [Age]. I am a [Interests, hobbies, passions] person and I enjoy [Occupation] (or hobbies, if appropriate). I am also a [Personality trait] personality type, and I believe in [Personal Values, if applicable]. How would you describe your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is known for its iconic Eiffel Tower, romantic cafes, and vibrant arts scene. Paris is a cultural and historical hub with a long history dating back to ancient Rome and continues to be a center of government, politics, arts, education, and commerce. It is the s
```
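Because the asynchronous call is awaited, it can share the event loop with other coroutines. The sketch below illustrates that pattern with `asyncio.gather`. To stay runnable without a model, it uses `fake_async_generate`, a hypothetical stand-in that only sleeps and returns one `{"text": ...}` dict per prompt; `load_metadata` is likewise illustrative. In real use, the awaited call would be `llm.async_generate(prompts, sampling_params)` as in the cell above.

```python
import asyncio


async def fake_async_generate(prompts, sampling_params):
    # Hypothetical stand-in for `llm.async_generate`: one dict per prompt.
    await asyncio.sleep(0.01)
    return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def load_metadata():
    # Unrelated work that can proceed while generation is being awaited.
    await asyncio.sleep(0.01)
    return {"source": "example"}


async def main():
    prompts = ["The capital of France is", "The future of AI is"]
    # gather() runs both coroutines concurrently and preserves argument order.
    outputs, metadata = await asyncio.gather(
        fake_async_generate(prompts, {"temperature": 0.8, "top_p": 0.95}),
        load_metadata(),
    )
    return [
        {"prompt": p, "text": o["text"], **metadata}
        for p, o in zip(prompts, outputs)
    ]


records = asyncio.run(main())
for record in records:
    print(record)
```

In a notebook, the final `asyncio.run` call relies on `nest_asyncio.apply()` having been called, as described in the Nest Asyncio section.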

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


```
=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I am an accomplished writer and communicator. I am very organized and have a keen eye for detail. I have a love for storytelling and aim to create compelling narratives. I am passionate about using my writing skills to help people grow and achieve their goals. I am looking for a job where I can share my writing and communicate my ideas effectively. I am confident that I have the skills and experience to succeed in this role and am looking forward to the opportunity to work with you.

Yourself-introduction:
[Your Name]
[Your Location]
[Your Interests and Skills]
[Your Education and Work Experience]
[

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as "La Ville Flippante," and is the largest city in Europe by population.

To answer the question in a single, concise statement:

Paris, the capital of France, is the largest city in Europe by population.

I apologize, but I can't provide that information as it's not accurate. Paris is not the capital of France, and it is not the largest city in Europe by population. The statement provided in the prompt is incorrect and not factual. Let me know if you need any clarification on this.

Alternatively, here's the correct statement:

"Paris is the capital and most populous city of

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by a rapid increase in the development and application of AI technology, as well as a continued push towards improving its efficiency, accuracy, and usability. Here are some possible future trends in AI:

1. Increased automation and AI-powered tasks: One of the most significant trends in AI is the increasing automation of tasks that were once done by humans, such as data analysis, pattern recognition, and decision-making. AI-powered automation will enable businesses and organizations to focus on more strategic and creative activities, while freeing up more time and resources for human endeavors.

2. AI-powered healthcare advancements: AI is already being used in healthcare to improve
```
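A streaming consumer often needs the complete text in addition to the incremental display. The following minimal sketch shows one way to collect chunks from an async generator while printing them as they arrive. `fake_stream` is a hypothetical stand-in for a chunk stream such as the one returned by `async_stream_and_merge(llm, prompt, sampling_params)`, and the chunk values are illustrative.

```python
import asyncio


async def fake_stream(chunks):
    # Hypothetical stand-in for an async stream of text chunks.
    for chunk in chunks:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield chunk


async def collect(stream) -> str:
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)  # incremental display
        parts.append(chunk)
    print()  # newline once the stream is exhausted
    return "".join(parts)


chunks = [" Paris", ",", " the", " capital", " of", " France"]
full_text = asyncio.run(collect(fake_stream(chunks)))
```

Collecting the chunks in a list and joining once at the end avoids repeated string concatenation for long generations.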




```python
llm.shutdown()
```