# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example is available as a Python script in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
To use the **Offline Engine** in IPython, Jupyter, or any other code that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
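
To illustrate why this is needed, the following standard-library sketch reproduces the error that occurs when `asyncio.run` is called while a loop is already running in the same thread, which is the situation inside a notebook cell. The function names `inner` and `outer` are illustrative only and are not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # A loop is already running here, much like inside a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second loop in the same thread
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return str(exc)
    return "no error"


nested_error = asyncio.run(outer())
print(nested_error)
```

Applying `nest_asyncio` patches asyncio so that such nested calls are allowed.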

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.27it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.14it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.13it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.52it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.36it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Jamie, and I’m a Marketing and Communications professional with over 10 years of experience in developing and executing successful marketing campaigns, managing social media platforms, and creating compelling content to engage audiences.
I am a creative problem-solver, a strategic thinker, and a collaborative team player. I have a passion for storytelling and a talent for crafting messages that resonate with diverse audiences. I am also a digital native, always staying up-to-date with the latest trends and technologies in the marketing and communications landscape.
Over the years, I have worked with a range of clients, from startups to established brands, across various industries, including finance, healthcare, education
Prompt: The president of the United States is
Generated text:  meeting with the head of the legislative branch and the head of the judicial branch. What is the process by which these three branches of government come together

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student who enjoys reading, playing video games, and spending time with friends. I'm a bit of a introvert, but I'm working on being more outgoing. I'm a junior, so I'm trying to balance schoolwork with extracurricular activities and a social life. I'm not really sure what I want to do with my life yet, but I'm taking things one step at a time. I'm a bit of a perfectionist, which can be both a blessing and a curse. I'm looking forward to seeing what the future holds.
This self-int

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country, near the Seine River. It is the most populous city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks, including the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city is a major hub for international business, finance, and tourism. Paris is also known for its romantic atmosphere and is often referred to as the City of Light. The city has a population of over 2.1 million people and is a popular destination for visitors from around the world. Paris

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is likely to play a larger role in healthcare, with applications in medical diagnosis, personalized medicine, and patient care.
2. Rise of explainable AI: As AI becomes more pervasive, there will be a growing need for explainable AI, which can provide insights into how AI systems make decisions.
3. Growing importance of human-AI collaboration: As AI becomes more capable, humans and AI systems will need to work together more effectively,
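
The cell above mentions overlap removal when merging streamed chunks. As a rough illustration of the idea, here is a minimal sketch that drops the part of each new chunk that repeats the end of the text collected so far. The names `strip_repeated_prefix` and `merge_chunks` and the sample chunks are hypothetical, and the actual `stream_and_merge` helper may be implemented differently.

```python
def strip_repeated_prefix(existing_text: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the end of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while skipping text that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += strip_repeated_prefix(merged, chunk)
    return merged


# Hypothetical chunks whose beginnings repeat the tail of the previous text.
sample_chunks = [" Paris.", "ris. The city", " city is large."]
merged_text = merge_chunks(sample_chunks)
print(repr(merged_text))
```

Checking the longest candidate overlap first means that a chunk which fully repeats earlier text contributes nothing to the merged result.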



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaida Yamato, and I'm a 25-year-old software engineer from Tokyo, Japan. I currently live in a small apartment in a quiet neighborhood with my cat, Luna. When I'm not working, you can find me exploring new coffee shops or practicing yoga. I'm a bit of a introverted person, but I enjoy meeting new people and learning about their cultures. What do you like to do in your free time? This introduction doesn't reveal any major secrets about Kaida's personality, skills, or motivations, but it gives a sense of who she is and what she's interested in. It also invites the reader

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. The city is often referred to as the City of Light (La Ville Lumière) due to its historic status as a center of learning and culture, and its significant role in the Enlightenment.
Provide 
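
One reason to prefer the asynchronous call is that other coroutines can make progress while generation is awaited. The sketch below shows this with `asyncio.gather`. `StubEngine` is a stand-in written for this example that only mimics the `async_generate` call shape used above; it performs no inference and is not the SGLang engine.

```python
import asyncio


class StubEngine:
    """Stand-in for the real engine; it only mimics the awaitable call shape."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do inference
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(log):
    # Unrelated work that keeps running while generation is awaited.
    for i in range(3):
        log.append(f"tick {i}")
        await asyncio.sleep(0.001)


async def run_demo():
    engine = StubEngine()
    log = []
    outputs, _ = await asyncio.gather(
        engine.async_generate(["a", "b"], {"temperature": 0.8}),
        heartbeat(log),
    )
    return outputs, log


demo_outputs, demo_log = asyncio.run(run_demo())
print([o["text"] for o in demo_outputs], demo_log)
```

With the real engine, the same pattern would presumably let a custom server handle other requests while a batch is being generated.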

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Zephyr Wystan. I am a 23-year-old botanist from a small town in the English countryside. I have a degree in plant ecology and have worked for several years in various gardens and research institutions. My interests include the study of rare and unusual plant species, and the development of sustainable practices in horticulture. I enjoy long walks in the countryside, reading about the natural world, and experimenting with new recipes in the kitchen.
This is a good starting point for a self-introduction. Zephyr's background and interests are clearly stated, and the tone is neutral and professional. However, you may want

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. The city is located at the heart of the Île-de-France region in northern-central France. Paris is situated along the Seine River and has a population of approximately 2.1 million people.
The city is home to many historical landmarks, museums, art galleries, and cultural institutions. The city is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris has been the capital of France since 987 and has played a significant role in the country’s history and politics.
The city is also a major economic and cultural center in Europe and is known for

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be more sophisticated, widespread, and integral to various aspects of our lives. Some possible future trends in AI include:
1. **Edge AI:** With the increasing demand for real-time processing, AI is expected to be deployed at the edge of the network, closer to the source of the data, to reduce latency and improve efficiency.
2. **Explainable AI (XAI):** As AI becomes more pervasive, there is a growing need to understand how AI decisions are made. XAI aims to provide transparent and interpretable AI models, enabling humans to understand the reasoning behind AI-driven decisions.
3. **Human-AI

In [6]:
llm.shutdown()
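
In a standalone script, it can help to make sure `shutdown()` runs even when generation raises an exception. A minimal sketch of that pattern with `try`/`finally` follows. `StubEngine` is a stand-in defined only for this example; it copies the `generate` and `shutdown` call shapes used above and simulates a failure.

```python
class StubEngine:
    """Stand-in with the same generate/shutdown call shape as the cells above."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        raise RuntimeError("simulated failure during generation")

    def shutdown(self):
        self.is_shut_down = True


engine = StubEngine()
try:
    engine.generate(["Hello"], {"temperature": 0.8})
except RuntimeError as exc:
    print(f"generation failed: {exc}")
finally:
    engine.shutdown()  # runs whether or not generate() raised

print(engine.is_shut_down)
```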