# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
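
For context, the failure that `nest_asyncio` works around can be reproduced with the standard library alone. The sketch below calls `asyncio.run()` from inside a loop that is already running, which is the situation in IPython or Jupyter, and captures the resulting `RuntimeError`. The coroutine names `inner` and `outer` are illustrative and not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # At this point an event loop is already running, as it is in IPython.
    coro = inner()
    try:
        # asyncio.run() refuses to start a second loop inside a running one.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # the coroutine was never started; close it explicitly
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Calling `nest_asyncio.apply()` before such code patches the event loop so that nested calls are allowed.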

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```text
Prompt: Hello, my name is
Generated text:  Elizabeth and I am a 13 year old student at The Albrighton School. I love writing, drawing and playing music with my friends. I was inspired to enter the competition by my English teacher, Mrs White, who has been teaching me how to write about my experiences and ideas.
I think that winning the Young Poet of the Year award would be a fantastic opportunity for me. It would be amazing to see my work in a real book and to have the chance to meet other young people who are passionate about writing. I believe that I have the potential to be a successful writer and this award would be a great boost to my
Prompt: The president of the United States is
Generated text:  the head of state and head of government of the United States. The president serves as both the commander-in-chief of the armed forces and the head of the executive branch of the federal government. The president is also the chief diplomat and is responsible for representing the country i
```
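
In offline batch jobs the results are usually persisted rather than printed. The following is a minimal sketch of writing prompt and completion pairs to a JSON Lines file. The `prompts` list and the `output["text"]` field follow the cell above; the placeholder outputs, the `save_jsonl` helper, and the output file name are assumptions made for illustration so that the snippet runs without a model.

```python
import json
import tempfile
from pathlib import Path

prompts = ["Hello, my name is", "The capital of France is"]
# Placeholder outputs with the same shape as above: a list of dicts with a "text" key.
outputs = [{"text": " <generated text 1>"}, {"text": " <generated text 2>"}]


def save_jsonl(prompts, outputs, path):
    """Write one {"prompt", "completion"} record per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for prompt, output in zip(prompts, outputs):
            record = {"prompt": prompt, "completion": output["text"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


out_path = Path(tempfile.gettempdir()) / "offline_batch_outputs.jsonl"
save_jsonl(prompts, outputs, out_path)

# Read the file back to confirm the round trip.
records = [
    json.loads(line)
    for line in out_path.read_text(encoding="utf-8").splitlines()
]
print(records[0])
```

With a real engine, `outputs` would be the return value of `llm.generate(prompts, sampling_params)`.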

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


```text
=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaida. I'm a 25-year-old freelance writer and editor. I've been working on my own for about five years now, and I've had the opportunity to work with a variety of clients and projects. I'm interested in exploring different genres and styles, and I'm always looking for new challenges and opportunities. I'm a bit of a introvert, but I enjoy meeting new people and learning about their experiences and perspectives. I'm based in a small town in the Pacific Northwest, where I enjoy hiking and exploring the outdoors. That's a bit about me. What do you think? Is there anything you'd like to

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Provide a concise factual statement about France’s capital city.
The capital of France is Paris.
This statement is a concise factual statement about France’s capital city. It directly answers the question and provides a clear and accurate piece of information. The statement is also free from any subjective language or personal opinions, making it a reliable source of information.
Note: The statement is a simple and direct answer to the question, which is a key characteristic of a concise factual statement. It does not include any additional information or context that is not necessary to answer the question.
Let me know if you want me to generate another statement.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a topic of much speculation and debate. While it is difficult to predict exactly what the future will hold, there are several trends that are likely to shape the development and impact of artificial intelligence in the coming years. Here are some possible future trends in AI:
1. Increased Adoption in Various Industries: AI is expected to be adopted in various industries such as healthcare, finance, transportation, and education. This will lead to increased efficiency, productivity, and innovation in these sectors.
2. Advancements in Natural Language Processing (NLP): NLP is a subset of AI that deals with the interaction between computers and humans in natural language. Future advancements
```
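
The "overlap removal" mentioned in the streaming cell refers to dropping text that a new chunk repeats from what has already been emitted. The sketch below illustrates the idea only; `strip_overlap` and `merge_chunks` are hypothetical helpers written for this page and are not the implementation behind `stream_and_merge`.

```python
def strip_overlap(existing_text: str, new_chunk: str) -> str:
    """Return new_chunk without the longest prefix that existing_text already ends with."""
    max_overlap = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first and shrink until one matches.
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping any text that repeats the tail of the merged result."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


merged_text = merge_chunks(["Hello, wor", "world", "ld!"])
print(merged_text)
```

Checking the longest overlap first ensures that a chunk which fully repeats the current tail contributes nothing to the merged text.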



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


```text
=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaida. I'm a 19-year-old student at a local community college, studying graphic design. I enjoy drawing, reading, and playing video games in my free time.
The introduction presents the character's name, age, occupation, and interests in a neutral and factual way, without revealing any personal biases or motivations. This allows the reader to form their own opinions about Kaida and their character. The tone is relaxed and conversational, making it easy for the reader to imagine Kaida as a real person.
Here are a few things to consider when writing a self-introduction for a fictional character:
1. Keep it concise:

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is located in the northern part of the country. It is situated on the Seine River and has a population of approximately 2.1 million people.
```
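
Because `async_generate` is awaited, it can be combined with other coroutines, which is the typical pattern when several independent requests arrive at a custom server. The sketch below uses a hypothetical `StubEngine` in place of `sgl.Engine` so that it runs without a GPU; only the call shape (`await engine.async_generate(prompts, sampling_params)` returning a list of dicts with a `"text"` key) is taken from the cell above, and `handle_request` is an illustrative name.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a model."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def handle_request(engine, prompt, sampling_params):
    # Each request awaits its own generation call.
    outputs = await engine.async_generate([prompt], sampling_params)
    return outputs[0]["text"]


async def main():
    engine = StubEngine()
    sampling_params = {"temperature": 0.8, "top_p": 0.95}
    prompts = ["Hello, my name is", "The capital of France is"]
    # gather() runs the handlers concurrently and returns results in input order.
    return await asyncio.gather(
        *(handle_request(engine, p, sampling_params) for p in prompts)
    )


texts = asyncio.run(main())
print(texts)
```

In a notebook, `asyncio.run` only works here after `nest_asyncio.apply()`, as in the first cell of this page.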

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


```text
=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Emily Waters, and I’m a 25-year-old graphic designer. I spend most of my free time reading and experimenting with new artistic techniques.
Now, let's make this introduction more engaging. I'll focus on adding sensory details and making Emily's personality shine through.

---

Here are some ideas to consider:

*   **Add sensory details**: Describe what Emily sees, hears, reads, or experiences. For instance, "I spend most of my free time curled up with a good book, surrounded by half-finished art projects and the sound of indie music playing in the background."
*   **Highlight a unique trait**: Share something that

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The capital of France is a city with a rich history and culture.
Paris is a major tourist destination.
The city is home to many world-renowned museums, including the Louvre and the Orsay.
Paris is also known for its iconic landmarks, such as the Eiffel Tower and Notre Dame Cathedral.
The city has a diverse and vibrant cultural scene, with numerous art galleries, theaters, and music venues.
Paris is a major hub for fashion and cuisine, with many high-end designers and restaurants.
The city has a long and complex history, with various periods of occupation and cultural influence.
Paris has been the capital of France

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  shaped by various factors including technological advancements, societal needs, and human values. Based on current developments and trends, here are some possible future trends in artificial intelligence:
1. Increased Adoption in Industries: AI will continue to be adopted in various industries, including healthcare, finance, transportation, and education. This will lead to improved efficiency, accuracy, and decision-making.
2. Advancements in Edge AI: As the Internet of Things (IoT) expands, AI will be increasingly deployed on edge devices, allowing for faster processing and reduced latency. This will enable more real-time decision-making and control.
3. Rise of Explainable AI (X
```
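
When streaming, it is often useful to display chunks as they arrive and still keep the full text for later use. The following sketch shows that pattern with a hypothetical `stub_stream` async generator standing in for the engine's stream; `collect_stream` is likewise an illustrative helper and not part of SGLang.

```python
import asyncio


async def stub_stream(chunks):
    """Hypothetical stand-in for a streamed generation: yields chunks one at a time."""
    for chunk in chunks:
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield chunk


async def collect_stream(stream):
    """Print each chunk as it arrives and return the concatenated text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # final newline after the stream ends
    return "".join(parts)


full_text = asyncio.run(
    collect_stream(stub_stream([" Paris", ".", "\n", "The", " capital"]))
)
```

Joining the list once at the end avoids repeated string concatenation inside the loop.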




```python
llm.shutdown()
```
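
If an exception is raised between creating the engine and calling `shutdown()`, the shutdown call is skipped. One way to guard against that is a small context manager, sketched below under the assumption that calling `shutdown()` once on exit is sufficient. `StubEngine` and `managed_engine` are hypothetical names used so that the example runs without a model; only the `shutdown()` method name comes from the cell above.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records whether shutdown() ran."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        raise RuntimeError("simulated failure during generation")

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call shutdown() on exit, even after an error."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with managed_engine(engine) as llm:
        llm.generate(["Hello, my name is"], {"temperature": 0.8})
except RuntimeError as exc:
    print(f"generation failed: {exc}")

print(engine.is_shut_down)
```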