# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
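For scripts that may run both as plain Python and inside a notebook, the patch can be applied only when an event loop is already running. The following is a minimal sketch rather than part of SGLang: the helper names `running_inside_event_loop` and `apply_nest_asyncio_if_needed` are hypothetical, and the check relies only on the standard library's `asyncio.get_running_loop()`.

```python
import asyncio


def running_inside_event_loop() -> bool:
    """Return True when called from code that an event loop is currently running."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running.
        return False
    return True


def apply_nest_asyncio_if_needed() -> bool:
    """Apply nest_asyncio only in nested-loop environments (hypothetical helper).

    Returns True if the patch was applied, False otherwise.
    """
    if not running_inside_event_loop():
        return False
    import nest_asyncio  # third-party; imported lazily so plain scripts do not need it

    nest_asyncio.apply()
    return True
```

In a plain script both helpers return `False` and nothing is patched. In a notebook cell, where a loop is already running, the patch is applied once.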

## Advanced Usage

The engine supports [VLM (vision-language model) inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling. The cells below launch the engine once and reuse it for all four inference modes.

```python
# Launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# generate() takes the whole batch and returns one output dict per prompt
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Aimée and I am a Ceramic Artist. I am thrilled to share with you my love of ceramics and the art of shaping and forming clay into beautiful, functional pieces that bring joy to people's lives.

My ceramic journey began many years ago, but it's only been in the past five years that I've become passionate about hand-building and throwing on the potter's wheel. The process of creating something from raw clay is meditative and fulfilling for me, and I love the challenge of shaping and transforming a lump of clay into a beautiful, functional piece.

I am constantly experimenting and pushing the boundaries of what is possible with clay, and
Prompt: The president of the United States is
Generated text:  the head of state and head of government of the United States. The president serves as commander-in-chief of the Armed Forces, has the power to grant reprieves and pardons to individuals, and has the power to sign bills into law or veto them.
The pres
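
In batch jobs the generated text is usually stored rather than printed. The sketch below pairs each prompt with its output and serializes the pairs as JSON Lines. It assumes only what the cell above shows, namely that `generate` returns one dict with a `"text"` key per prompt. `FakeEngine`, `collect_generations`, and `to_jsonl` are hypothetical names used for illustration, and `FakeEngine` stands in for `sgl.Engine` so that the example runs without a GPU.

```python
import json


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine; returns canned completions."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def collect_generations(engine, prompts, sampling_params):
    """Run one batch and pair every prompt with its generated text."""
    outputs = engine.generate(prompts, sampling_params)
    return [{"prompt": p, "text": o["text"]} for p, o in zip(prompts, outputs)]


def to_jsonl(records):
    """Serialize records as JSON Lines, one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


records = collect_generations(
    FakeEngine(),
    ["The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)
print(to_jsonl(records))
```

With a real engine, passing `llm` instead of `FakeEngine()` should behave the same way, provided the output dicts have the shape shown above.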

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    # stream_and_merge consumes the stream and returns the merged text
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor. I live in a small apartment in the city with my cat, Luna. I enjoy reading, hiking, and trying out new recipes in my free time. I'm currently working on a novel and a few short stories, and I'm excited to see where my writing takes me.
This is a good start, but it's a bit too focused on your writing. You might want to add a bit more about your personality or interests to give a sense of who you are beyond your profession. Here's an example of how you could revise it: Hello, my name

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is located in the northern part of the country and is situated on the Seine River. Paris is known for its rich history, art, fashion, and culture. It is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. Paris is also a major business and financial center, hosting many international organizations and companies. The city has a population of over 2.1 million people and is a popular tourist destination, attracting millions of visitors each year. Paris is also known for its romantic atmosphere, with its beautiful parks, gardens, and bridges. The city has

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even larger role in healthcare, with the potential to revolutionize the way we diagnose and treat diseases.
2. Widespread adoption of AI in industries: AI is already being used in various industries such as finance, transportation, and manufacturing. In the future, AI is likely to become even more widespread, with the potential to automate many tasks and improve efficiency
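
The "overlap removal" mentioned above can be pictured as follows: if a streamed chunk begins with text that the accumulated output already ends with, the repeated prefix is dropped before the chunk is appended. The sketch below illustrates that idea only. It is not the implementation of `stream_and_merge`, and the function names are hypothetical.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that duplicates a suffix of existing."""
    max_overlap = min(len(existing), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate streamed chunks while removing overlapping text."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


print(merge_chunks(["Hello", "lo wor", "world"]))  # prints "Hello world"
```

One trade-off of this approach is that text which legitimately repeats at a chunk boundary is also removed, so it is best suited to streams whose chunks are known to overlap.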



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    # Await the whole batch; one output dict is returned per prompt
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Astrid, and I work as a botanist studying the unique plant species of the Amazon rainforest. I'm a bit of a introvert and enjoy spending time in solitude, often getting lost in the intricacies of the natural world. I'm not one for grand adventures or making a lot of friends, but I find comfort in the quiet, predictable routine of my work. My focus is on observing and documenting the various plant species, and I'm passionate about preserving the Amazon's rich biodiversity for future generations. I'm not particularly interested in being the center of attention or taking risks, but I'm dedicated to my craft and will

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
This is a simple statement that gives the required information about the capital of France. No further detail is provided.
Provide a concise fac
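
When prompts arrive one at a time rather than as a ready-made list, each can be awaited as its own request while a semaphore caps how many are in flight. This is a sketch under assumptions: it supposes that `async_generate` also accepts a single prompt string and returns a single dict, whereas the cell above only shows the list form. `FakeEngine` and `generate_concurrently` are hypothetical names, and `FakeEngine` replaces the real engine so that the example runs anywhere.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine with an awaitable generate call."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real request would
        return {"text": f" echo of {prompt}"}


async def generate_concurrently(engine, prompts, sampling_params, max_in_flight=2):
    """Issue one request per prompt, with at most max_in_flight running at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt):
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # gather() returns results in the order of its arguments, not completion order
    return await asyncio.gather(*(one(p) for p in prompts))


results = asyncio.run(
    generate_concurrently(FakeEngine(), ["a", "b", "c"], {"temperature": 0.8})
)
for result in results:
    print(result["text"])
```

Because `asyncio.gather` preserves argument order, the results can be zipped back with the prompts exactly as in the batch example.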

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Zephyr and I'm a 19-year-old university student majoring in environmental science. I'm originally from a small town in the Midwest, but I've been living in the city for the few years. I'm interested in sustainable energy and wildlife conservation. In my free time, I like to hike and read about climate change. That's a little bit about me.
Incorporate the following details: I'm from a small town, I love hiking, I'm interested in wildlife conservation, and I'm a university student.
I'm a 19-year-old university student from a small town in the Midwest, where I grew



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Write a concise summary of the capital city in a neutral tone. Paris is the capital city of France, located in the northern part of the country. It is known for its rich history, cultural landmarks, and romantic atmosphere. The city is situated along the Seine River and is home to many famous museums, art galleries, and historical sites, including the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is a major tourist destination and a hub for international business, education, and culture.
Define the term "capital city." A capital city is the city or town that is the seat of government



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a topic of much speculation and debate, but there are several possible trends that are likely to shape the field. Here are some of the most promising areas of development: 
Improved Natural Language Processing (NLP): NLP is a key area of AI research, and we can expect to see significant improvements in the coming years. This could lead to more sophisticated chatbots, virtual assistants, and language translation systems.
Increased Use of Machine Learning: Machine learning is a type of AI that enables systems to learn from data and improve their performance over time. We can expect to see more widespread use of machine learning in areas such as computer vision, natural
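
When the streamed text is needed as a single string instead of being printed chunk by chunk, the chunks can be accumulated as they arrive. A minimal sketch, assuming the stream is any async iterator of text chunks: `fake_stream` is a hypothetical stand-in for `async_stream_and_merge(llm, prompt, sampling_params)`, and `collect_stream` is a hypothetical helper.

```python
import asyncio


async def fake_stream(chunks):
    """Hypothetical stand-in for a streaming generator; yields chunks one by one."""
    for chunk in chunks:
        await asyncio.sleep(0)  # simulate waiting for the next chunk
        yield chunk


async def collect_stream(stream) -> str:
    """Consume an async iterator of text chunks and return the joined text."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)


text = asyncio.run(collect_stream(fake_stream([" Par", "is", "."])))
print(repr(text))
```

Collecting into a list and joining once avoids repeated string concatenation, and the caller can still print or forward each chunk inside the loop if incremental display is wanted.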




```python
# Shut the engine down once all generation is finished
llm.shutdown()
```
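
In a script, an exception raised during generation would skip a trailing `shutdown()` call. One way to guard against that is to wrap the engine in a context manager. The sketch below assumes only that the engine exposes `shutdown()`, as shown above. `managed_engine` and `FakeEngine` are hypothetical names, and with the real library the factory could be something like `lambda: sgl.Engine(model_path=...)`.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine that records whether it was shut down."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(factory):
    """Create an engine and guarantee shutdown() runs, even if an error occurs."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


with managed_engine(FakeEngine) as engine:
    pass  # run generation calls here

print(engine.is_shut_down)  # prints True
```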