# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
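
The core of such a server is often a thin handler that turns a request payload into an engine call. The sketch below is framework-agnostic and is not taken from the linked `custom_server.py`. `handle_generate` and `EchoEngine` are hypothetical names, and `EchoEngine` only stands in for `sgl.Engine` so the snippet runs without a model. It assumes the output shape used on this page: `generate` takes a list of prompts and returns one dict with a `"text"` key per prompt.

```python
from typing import Any, Dict, List


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a model."""

    def generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        # Mirror the output shape used on this page: one dict per prompt.
        return [{"text": f" [echo] {p}"} for p in prompts]


def handle_generate(engine: Any, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Validate a request payload and forward it to the engine."""
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt:
        return {"status": 400, "error": "'prompt' must be a non-empty string"}
    # Fall back to illustrative defaults when the caller sends no parameters.
    sampling_params = payload.get(
        "sampling_params", {"temperature": 0.8, "top_p": 0.95}
    )
    outputs = engine.generate([prompt], sampling_params)
    return {"status": 200, "text": outputs[0]["text"]}


ok = handle_generate(EchoEngine(), {"prompt": "The capital of France is"})
bad = handle_generate(EchoEngine(), {})
print(ok)
print(bad)
```

Keeping the handler independent of any web framework means the same function can be called from a route, a queue worker, or a test.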



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop (such as a Jupyter notebook), you need to add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
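
For context, the snippet below reproduces the underlying restriction using only the standard library: `asyncio.run()` refuses to start while an event loop is already running in the current thread, which is the situation inside a notebook. That the engine runs into this same restriction is an assumption here. The demo only shows the generic asyncio behavior that `nest_asyncio.apply()` works around.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    """Try to start a second event loop from inside a running one."""
    coro = inner()
    try:
        asyncio.run(coro)  # a loop is already running here, so this raises
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```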

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    # Allow nested event loops so asyncio.run() works inside a notebook
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.59it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Tim and I am the founder of DrillUp. I’m excited to share my story with you and hope to inspire others to take the leap and start their own business.
I spent over 12 years in the corporate world, working for companies like IBM and Symantec. It was a great experience, but it wasn’t for me. I was a creative person at heart, always looking for new challenges and opportunities to innovate. I felt like I was just going through the motions, day in and day out.
One day, I had an epiphany. I realized that I had the skills and knowledge to start my own business.
Prompt: The president of the United States is
Generated text:  the leader of the government of the United States. The President is both the head of state and the head of government of the country. The President is responsible for the execution of the laws, as well as the overall conduct of the government. The President has the authority to sign bills into law, veto bills, and appoint federal ju

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor. I live in a small apartment in the city, and I spend most of my free time reading and writing. I'm a bit of a introvert, but I enjoy meeting new people and trying new things. I'm currently working on a novel, and I'm excited to see where my writing takes me. I'm looking forward to getting to know you better.
This is a good example of a neutral self-introduction because it doesn't reveal too much about the character's personality, background, or motivations. It simply provides a brief overview of who they are and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. The city is also a major center for business, education, and tourism. Paris is a popular destination for visitors from around the world, attracting over 23 million tourists each year. The city is divided into 20 arrondissements, or districts, and has a population of over 2

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, with applications such as:
a. Predictive analytics: AI can analyze large amounts of data to predict patient outcomes and identify high-risk patients.
b. Personalized medicine: AI can help develop personalized treatment plans based on a patient's genetic profile, medical history, and lifestyle.
c. Virtual nursing assistants: AI-powered virtual
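
The "overlap removal" mentioned in the cell above refers to merging streamed chunks without repeating text that was already emitted. A minimal sketch of one way to do this follows. `trim_overlap` and `merge_chunks` are illustrative names written for this page, the chunk list is made up, and the actual `stream_and_merge` helper may be implemented differently.

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first and shrink until one matches.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, removing text repeated at each chunk boundary."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Made-up chunks in which each one repeats the tail of the previous text.
chunks = [" Paris", " Paris is", " is the capital"]
merged = merge_chunks(chunks)
print(repr(merged))
```

Note that a purely textual heuristic like this cannot tell an accidental overlap from a legitimate repetition (for example `"ha"` followed by `"ha"`), so it is only appropriate when chunks are known to overlap.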



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Fiona.
I'm a 25-year-old freelance writer and editor who spends most of my days staring at screens and trying to come up with clever things to say. When I'm not working, I enjoy taking long walks in the park, reading books, and experimenting with new recipes in the kitchen.
I'm a bit of a introvert and a bit of an outsider in my social circle, but I've learned to appreciate my alone time and use it to fuel my creativity. I'm also a bit of a perfectionist, which can be both a blessing and a curse. I'm working on finding a balance between my high standards and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is located in the northern part of the country and has a population of approximately 2.1 million people. It is also the largest metropolitan area in the country, with a population of over 12 mil
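
Because `async_generate` is awaited, it can share the event loop with other coroutines. The sketch below illustrates that pattern with `asyncio.gather`. `StubEngine` is a hypothetical stand-in that sleeps instead of running a model, so it says nothing about real engine throughput, and whether several concurrent calls are preferable to one batched call depends on the workload.

```python
import asyncio
import time
from typing import Any, Dict, List


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; sleeps instead of running a model."""

    async def async_generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        await asyncio.sleep(0.1)  # simulate inference latency
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def run_groups(engine: StubEngine) -> List[List[Dict[str, str]]]:
    sampling_params = {"temperature": 0.8, "top_p": 0.95}
    groups = [
        ["Hello, my name is"],
        ["The capital of France is", "The future of AI is"],
    ]
    # Both requests are in flight at once; gather returns results in input order.
    return await asyncio.gather(
        *(engine.async_generate(group, sampling_params) for group in groups)
    )


start = time.perf_counter()
results = asyncio.run(run_groups(StubEngine()))
elapsed = time.perf_counter() - start
print([len(group_outputs) for group_outputs in results], f"{elapsed:.2f}s")
```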

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Bernadette "Bernie" Davenport. I'm a freelance writer, currently based in the city of Ashwood, where I've lived for the past five years. I work from home, and spend most of my time writing articles and stories, as well as doing research for my writing projects. I'm quite fond of old books and antique furniture, and enjoy collecting both whenever I can. I'm a bit of a loner, but I value my independence and enjoy my freedom to pursue my passions without too much interference. I'm always looking for new projects and opportunities to learn, and I'm open to collaborating with others who

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
This statement describes Paris, France, and the fact that it is the country’s capital. It is a straightforward and factual declaration. No extra information is provided beyond what is necessary to convey the basic fact about the city.
Next Post: What is the significance of the Eiffel Tower in Paris? The Eiffel Tower is a historical monument located in Paris, France. It was built for the 1889 World’s Fair and was intended to be a temporary structure, but it became an iconic symbol of the city and a popular tourist attraction. The tower is 324 meters tall and was the tallest man-made structure in the

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  going to be huge. From automation to deep learning, the possibilities are endless. This future is not too far away; here are the trends that will shape the future of artificial intelligence.
Predictive Maintenance: Predictive maintenance is a technology that uses AI and machine learning to forecast when a machine or equipment will require maintenance. This technology can help reduce downtime, increase efficiency, and improve the overall performance of the equipment.
Automation of Routine Tasks: Automation of routine tasks is a trend that is expected to continue in the future of AI. With the help of AI and machine learning, routine tasks can be automated, freeing up human resources for more complex
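
The loop above prints each chunk as it arrives. If the full text is also needed afterwards, the chunks can be accumulated while streaming. In the minimal sketch below, `fake_stream` is a hypothetical async generator standing in for `async_stream_and_merge(llm, prompt, sampling_params)`, which is assumed (as the cell above suggests) to yield already de-duplicated string chunks.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(chunks: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for a stream of de-duplicated text chunks."""
    for chunk in chunks:
        await asyncio.sleep(0)  # hand control back to the loop between chunks
        yield chunk


async def stream_and_collect(stream: AsyncIterator[str]) -> str:
    """Print chunks as they arrive and return the concatenated text."""
    parts: List[str] = []
    async for chunk in stream:
        print(chunk, end="", flush=True)  # show progress immediately
        parts.append(chunk)  # keep the chunk for later use
    print()  # newline once the stream is exhausted
    return "".join(parts)


demo_chunks = [" Paris", ".", " It", " is", " the", " capital", "."]
full_text = asyncio.run(stream_and_collect(fake_stream(demo_chunks)))
```

Collecting into a list and joining once at the end avoids repeated string concatenation inside the loop.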




In [6]:
llm.shutdown()
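
In a standalone script, it can be worth guaranteeing that `shutdown()` runs even when generation raises. One way to do that is a small context manager. `managed_engine` and `DummyEngine` below are hypothetical helpers written for this sketch, not part of SGLang, and `DummyEngine` only records calls so the pattern can be run without a model.

```python
from contextlib import contextmanager
from typing import Any, Dict, Iterator, List


class DummyEngine:
    """Hypothetical stand-in for sgl.Engine that records whether shutdown ran."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def generate(self, prompts: List[str], sampling_params: Dict[str, Any]) -> None:
        raise RuntimeError("simulated failure during generation")

    def shutdown(self) -> None:
        self.is_shut_down = True


@contextmanager
def managed_engine(engine: Any) -> Iterator[Any]:
    """Yield the engine and always call shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = DummyEngine()
try:
    with managed_engine(engine) as llm:
        llm.generate(["Hello, my name is"], {"temperature": 0.8})
except RuntimeError as exc:
    print(f"generation failed: {exc}")

print("shut down:", engine.is_shut_down)
```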