# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
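
To see why the patch is needed, the following minimal sketch uses only the standard library (no SGLang) to reproduce the error raised when `asyncio.run` is called while an event loop is already running. The function names `inner` and `outer` are illustrative, and the exact error message may vary between Python versions.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # An event loop is already running here, much like inside a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second loop from inside the first
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

With `nest_asyncio.apply()` in place, nested calls of this kind are permitted, which is why the notebook cells below can call `asyncio.run` directly.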

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Daniela Simões and I am a photographer based in Portugal. I love capturing life’s precious moments, from the most intimate and personal to the most joyous and celebratory. My passion is to tell stories through images, to freeze time and to create a lasting memory for you.
I specialize in wedding, portrait, and family photography. I am known for my warm and natural approach, and I strive to make my clients feel comfortable and relaxed in front of the camera.
I am a member of the Portuguese Photographers Association and the National Association of Professional Photographers (ANPP). I have a degree in Photography from the Istituto
Prompt: The president of the United States is
Generated text:  a member of the executive branch of the federal government and is both the head of state and head of government of the United States. The president serves as the commander-in-chief of the armed forces and is responsible for executing the duties and powers of
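
For offline batch jobs it is often convenient to persist the results. The sketch below is one possible approach, assuming only what the loop above shows: each output is a dict with a `"text"` key. The helper `to_jsonl` and the sample data are hypothetical placeholders, not part of SGLang.

```python
import io
import json

# Placeholder data shaped like the results above; real outputs come from llm.generate.
prompts = ["The capital of France is", "The future of AI is"]
outputs = [{"text": " Paris."}, {"text": " still being written."}]


def to_jsonl(prompts, outputs):
    """Serialize prompt/completion pairs as one JSON object per line."""
    buffer = io.StringIO()
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        buffer.write(json.dumps(record, ensure_ascii=False) + "\n")
    return buffer.getvalue()


jsonl_text = to_jsonl(prompts, outputs)
print(jsonl_text, end="")
```

Writing `jsonl_text` to a file gives one record per prompt, which keeps large batches easy to append to and to read back line by line.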

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student living in a small town in Japan. I'm a bit of a bookworm and enjoy reading about history and science. I'm also a member of the school's debate team and try to stay involved in local community events. I'm not really sure what I want to do with my life yet, but I'm taking things one step at a time. I'm a bit of a introvert and prefer to spend time alone, but I'm working on being more outgoing and confident. I'm a bit of a perfectionist, which can sometimes make things difficult for me,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
The capital of France is Paris. This is a factual statement that provides a concise and accurate description of the capital city of France. It does not include any additional information or opinions, making it a clear and straightforward statement. This type of statement is often used in encyclopedias, dictionaries, and other reference materials to provide a quick and reliable source of information. It is also a good example of a declarative sentence, which is a type of sentence that makes a statement or assertion. In this case, the statement is a simple and direct assertion about the capital city of France. Overall, the statement is clear, concise

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, including the development of AI-powered robots that can assist with surgeries and other medical procedures.
2. Widespread adoption of AI in industries: AI is already being used in various industries such as finance, transportation, and customer service. In the future, AI is likely to be adopted in many other industries, including manufacturing,
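
The phrase "overlap removal" refers to dropping text that a new chunk repeats from what has already been accumulated. The sketch below illustrates the idea in plain Python. It is an assumption about the general technique, not the actual implementation of `stream_and_merge`, and the names `strip_overlap` and `merge_chunks` are hypothetical.

```python
def strip_overlap(existing: str, new_chunk: str) -> str:
    """Return new_chunk without the leading part that already ends `existing`."""
    max_overlap = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, skipping any text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


merged = merge_chunks([" Paris", " Paris is", " is the capital"])
print(repr(merged))
```

Checking the longest overlap first means a chunk that repeats the whole accumulated text contributes only its new suffix.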



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Raven Blackwood, but you can call me Raven. I'm a 25-year-old professional thief with a passion for art and a talent for getting out of tight spots. I've been all over the world, from the neon-lit streets of Tokyo to the dusty alleys of Marrakech, and I've got a story or two to tell if you're willing to listen. I'm not looking for trouble, but it seems to find me anyway. What's your story?
I've been around the world, but I've never been one for grand adventures or high-stakes heists. I like to keep a low profile

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the northern region of the country, along the River Seine. The city is known for its iconic landmarks, rich history, and cultural significance, attracting millions of visitors every year.
What are the main reasons why Paris is a popular
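
One reason to prefer the asynchronous interface is that several requests can be in flight at once. The sketch below shows that pattern with `asyncio.gather`. `StubEngine` is a hypothetical stand-in that only mimics the call shape used above (a list of prompts in, a list of dicts with a `"text"` key out), so the example runs without a model.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in that mimics the shape of async_generate results."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Schedule every batch at once; gather returns results in submission order.
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(run_batches(StubEngine(), batches, {"temperature": 0.8}))
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```

Whether concurrent batches help in practice depends on the engine's own scheduling, so treat this as a pattern to measure rather than a guaranteed speedup.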

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Adalyn Blackwood. I'm a 25-year-old freelance writer and editor living in Portland, Oregon. I've been writing professionally for about three years now, and I've had a few pieces published in local publications. I'm also a bit of a coffee aficionado, and I can often be found exploring the city's best cafes.
This example is short, neutral, and to the point. It establishes the character's profession and location, but doesn't reveal much about her personality or personal life. It also hints at her interests and hobbies, but doesn't commit to anything too deeply. This kind of introduction can be useful



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The following is a list of key facts about Paris:
The official name of Paris is the City of Paris.
The City of Paris is the largest city in France by population.
Paris is located in the north-central region of France.
The city is situated along the Seine River.
The official language of Paris is French.
The city has a population of approximately 2.1 million people.
Paris is a major economic and cultural center in Europe.
The city is home to many famous landmarks and museums, including the Eiffel Tower and the Louvre Museum.
Paris is known for its



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by technological advancements, societal needs, and ethical considerations. Here are some possible future trends in artificial intelligence:
  1. Increased use of Edge AI: Edge AI involves processing AI computations on devices, such as smartphones or smart home devices, rather than relying on cloud-based services. This trend is expected to continue as IoT devices become increasingly prevalent.
  2. AI for Social Good: AI has the potential to address some of the world's most pressing challenges, such as climate change, healthcare, and education. Future AI applications will focus on using AI for social good, such as predicting natural disasters, developing personalized medicine
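
The `async for` loop used in this section both displays chunks as they arrive and can accumulate them into the final text. The sketch below shows that consumption pattern on its own. `fake_stream` is a hypothetical stand-in for a streaming source and makes no claim about how SGLang splits its output.

```python
import asyncio


async def fake_stream(text, chunk_size=6):
    """Hypothetical stand-in for a streaming source: yields text in small pieces."""
    for start in range(0, len(text), chunk_size):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield text[start:start + chunk_size]


async def collect(stream):
    pieces = []
    async for chunk in stream:
        print(chunk, end="", flush=True)  # show progress as chunks arrive
        pieces.append(chunk)
    print()
    return "".join(pieces)


full_text = asyncio.run(collect(fake_stream(" Paris is the capital of France.")))
```

Collecting chunks in a list and joining once at the end avoids repeated string concatenation for long generations.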




```python
llm.shutdown()
```
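
In a script, as opposed to a notebook, it can be worth making sure the shutdown call runs even when generation fails. The sketch below shows a `try`/`finally` pattern for that. `StubEngine` is a hypothetical stand-in whose `generate` and `shutdown` methods only mirror the call shapes used in this document.

```python
class StubEngine:
    """Hypothetical stand-in; only the generate/shutdown shape mirrors the text."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.closed = True


engine = StubEngine()
try:
    engine.generate([], {"temperature": 0.8})
except ValueError as exc:
    print(f"generation failed: {exc}")
finally:
    engine.shutdown()  # always runs, even after the error above
```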