# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
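
The reason for this step can be illustrated with the standard library alone. The sketch below does not use SGLang or `nest_asyncio`; it only shows that `asyncio.run()` refuses to start when a loop is already running, which is the situation inside IPython or Jupyter. The function names `inner` and `outer` are illustrative.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # A loop is already running here, just like inside a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # Avoid a "coroutine was never awaited" warning.
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Applying `nest_asyncio` patches the event loop so that such nested calls are allowed.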

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Liza Johnson. I'm 14 years old and I'm from America. I'm a student in a middle school. I'm in Class 2, Grade 7. I have a pet dog named Timmy. He is 5 years old. He is a brown and white dog. He loves me very much. He is a very friendly dog. My family likes to play with Timmy every day. We go to the dog park on Saturday to play with him. On Sunday, we usually play on the school playground. Timmy likes to play with me and sometimes we play with other kids. He is very smart
Prompt: The president of the United States is
Generated text:  running for a second term. Before he can take office, he needs to raise his personal fortune to the square of his age. If he is currently 6 feet tall and plans to stay in the White House for 3 more years, what would be his new fortune? To determine the president's new fortune, we first need to calculate his age when he plans to take office. The president plans to stay in the White House for 3 more years, which means
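
The two keys in `sampling_params` can be pictured with a toy distribution. The sketch below follows the textbook definitions of temperature scaling and top-p (nucleus) filtering; it is not SGLang's sampler, and the function name and logits are made up for illustration.

```python
import math


def apply_temperature_and_top_p(logits, temperature, top_p):
    """Return the renormalized probabilities of the tokens kept by top-p filtering."""
    # Temperature scaling: lower values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most likely tokens whose cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four tokens
warm = apply_temperature_and_top_p(toy_logits, temperature=0.8, top_p=0.95)
cold = apply_temperature_and_top_p(toy_logits, temperature=0.2, top_p=0.9)
print(len(warm), len(cold))
```

With these toy numbers, the warmer setting keeps three candidate tokens and the colder one keeps only the most likely token, which is why lower temperatures tend to give more repetitive, deterministic text.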

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light, a historic and cultural center with a rich history dating back to the Roman Empire. It is the largest city in France and the second-largest city in the European Union, with a population of over 2. 5 million people. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Arc de Triomphe. It is also a major center for fashion, art, and music, and is home to many world-renowned museums, theaters, and other cultural institutions. Paris is a popular tourist destination and a major

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that could be expected in the future:

1. Increased automation and artificial intelligence: As AI technology continues to advance, we can expect to see more automation and artificial intelligence in various industries. This could lead to increased efficiency, productivity, and cost savings for businesses and individuals.

2. Improved privacy and security: As AI technology becomes more advanced, there will be an increased need for privacy and security measures to protect personal data. This could lead to new regulations and standards for AI development and use
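
The "overlap removal" mentioned in the cell above can be pictured as follows. This is a minimal sketch under the assumption that consecutive streamed chunks may repeat the tail of the text already received; the names `strip_overlap` and `merge_chunks` are hypothetical and are not claimed to be the helper that `stream_and_merge` uses internally.

```python
def strip_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while removing repeated boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Hypothetical chunks whose boundaries repeat part of the previous text.
chunks = ["The capital", " capital of France", " France is Paris"]
merged = merge_chunks(chunks)
print(merged)
```

A heuristic like this cannot tell an accidental overlap from text that legitimately repeats (for example `"ha"` followed by `"ha"`), so it suits display purposes better than exact reconstruction.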



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  __________ and I am a/an _______ [insert profession, occupation, or role in the fictional world]. I am a/an __________ [insert your major, field of study, or background] student at __________ [insert your school's name and location]. My __________ [insert a distinguishing characteristic or quality] is _________. I am passionate about __________ [insert something that interests you or excites you]. I am also a/an __________ [insert your hobbies, interests, or personal interests] and I enjoy __________ [insert activities you enjoy or places you like to visit]. I am a/an __________ [insert

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is also known as "La Vie". The city is famous for its rich history, culture, and bustling street life. It is home to many famous landmarks such as the Eiffel Tower, 

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [Age] year old [Profession]. I recently graduated [Year] from [University or College Name], and I have always been interested in learning and improving myself. I have a good work ethic and am eager to learn and grow in all areas of life. What kind of projects or hobbies do you enjoy? I find myself captivated by creative writing and I enjoy watching films and listening to music. I like to keep my mind active by reading books and engaging in outdoor activities like hiking and fishing. What do you like to do outside? I like to spend time with my family and friends, especially



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Explanation for an 8-year-old: Paris is the big city where the French people live and work. It's like a giant playground with lots of yummy food and fun places to go. Paris is special because it was where the French first lived a very long time ago. It's also home to many important places like the Eiffel Tower and the Louvre Museum. Paris is very beautiful and exciting, just like a magical world in your mind!

In simple terms: Paris is the capital city of France, where lots of people live and have fun. It's a big city with tall buildings and many beautiful things



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  bright, with endless possibilities for technology that will transform our world and society in exciting ways. Here are some potential trends that AI is likely to continue to develop and evolve:

1. Increased AI Ethics and Bias: With the increasing amount of data that AI algorithms can process, we will need to address the ethical implications of AI systems. There will be a push to develop ethical guidelines for AI that take into account privacy, fairness, and accountability. This will require ongoing development and testing of new ethical frameworks and algorithms.

2. AI for Environmental Sustainability: With the planet facing many challenges, there will be a push for AI to play a role in



```python
llm.shutdown()
```
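
In a longer script it can help to make sure `shutdown()` runs even when generation fails. The sketch below shows one way to do that with a small context manager. `StubEngine` and `managed_engine` are hypothetical names introduced for illustration; the stub only mimics the `shutdown()` method so that the example runs without a model.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in exposing only the shutdown() method used above."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call shutdown() on exit."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with managed_engine(engine):
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(engine.is_shut_down)
```

The same wrapper could be applied to a real engine object, assuming it exposes `shutdown()` as in the cell above.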