# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two common use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you use the **Offline Engine** in IPython, Jupyter, or any other code that already runs inside an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()

```
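
The reason for this patch can be seen with the standard library alone. The sketch below does not use SGLang or `nest_asyncio`; it only shows that plain `asyncio` refuses to start a second event loop from inside a running one, which is the situation a notebook cell is in. The names `inner` and `outer` are illustrative.

```python
import asyncio


async def inner() -> int:
    return 42


async def outer() -> str:
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second loop in the same thread
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Applying `nest_asyncio` patches `asyncio` so that this kind of re-entrant call is allowed, which is why the engine's notebook examples call it before creating the engine.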

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.66it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.65it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Sabrina and I'm a graphic designer. I specialize in digital design, especially in Photoshop and Illustrator. I like to help people to do their best work and to share my knowledge.
Currently I am working in a graphic design company. I am also a part-time freelancer.
Do you have a specific area of expertise? Can you provide me with some information on your portfolio? I'm interested in learning more about your work, so please share some details about your portfolio. I am looking for inspiration to learn more about your skills and approach. How can I get started with my own freelance career? You can start by identifying your unique style and
Prompt: The president of the United States is
Generated text:  a very busy man. He works in a big building that is 20 floors high. He has to go up and down 10 times a day for work. What is the least number of times that the president will go up and down the stairs in a day? The answer is 1000.
To determine the
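
For batch jobs it is often convenient to persist each prompt together with its completion. The following is a minimal sketch that assumes only what the cell above shows: each output is a dict with a `"text"` key, and outputs come back in the same order as the prompts. The helper name `to_jsonl` and the record field names are hypothetical, and the stub data stands in for a real `llm.generate` result.

```python
import json
from typing import Dict, List


def to_jsonl(prompts: List[str], outputs: List[Dict[str, str]]) -> str:
    """Pair each prompt with its generated text, one JSON record per line."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = [
        json.dumps({"prompt": p, "completion": o["text"]}, ensure_ascii=False)
        for p, o in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Stub data standing in for the result of llm.generate(prompts, sampling_params).
stub_prompts = ["The capital of France is"]
stub_outputs = [{"text": " Paris."}]
print(to_jsonl(stub_prompts, stub_outputs))
```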

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also the birthplace of the French Revolution and the home of the French language. Paris is a bustling metropolis with a rich cultural heritage and is a major tourist destination. The city is also home to many famous museums, including the Musée d'Orsay and the Musée Rodin. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. It is a city that has played a significant role in French history and continues to be a major

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased integration with human intelligence: As AI becomes more advanced, it is likely to become more integrated with human intelligence, allowing it to learn and adapt to new situations and tasks. This could lead to more efficient and effective use of AI in various fields, such as healthcare, finance, and transportation.

2. Greater use of AI in autonomous vehicles: As autonomous vehicles become more advanced, they are likely to become more integrated with AI, allowing them to make decisions and take actions
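
The "overlap removal" mentioned in the cell above can be illustrated without the engine. The sketch below is one possible way to merge streamed chunks whose beginnings repeat the end of the text received so far; it is not presented as the implementation of `stream_and_merge`. The names `strip_overlap` and `merge_chunks` and the fake chunk list are assumptions made for this example.

```python
from typing import Iterable


def strip_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, removing any overlap with the text merged so far."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Fake stream in which some chunks repeat the tail of the previous text.
fake_stream = [" Paris", " Paris, the", " the city", " of light"]
print(merge_chunks(fake_stream))
```

A suffix/prefix match like this cannot tell a repeated chunk boundary from a genuine repetition in the generated text (for example a model that really outputs "the the"), so it is a heuristic rather than an exact reconstruction.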



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [X] year [Y] college student who loves [X] sports. I'm passionate about [X] and would love to help others learn and grow in [X] areas. What excites you the most about [X] sports?

[Name] - Thank you for asking. I'm excited to learn more about [X] sports and help others achieve their goals in [X] areas. Do you have any questions about [X] sports? I'm always happy to answer any questions or provide any information I can to help you better understand the sport. Good luck with your studies and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

French fries are a type of snack food. In which year was the French fries mentioned? In 1999. French fries are a type of snack food. They originated in France and are popular in the United States. They are often made with potatoes and can be served wit

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Age] year-old [Occupation], [Job Title], who specializes in [Specialization/Job]. I'm confident, strong-willed, and driven to achieve my goals, even if they're challenging. I have a natural ability to solve problems, handle stress, and connect with others, which I aim to utilize to help others grow and thrive. I'm a joy to be around, and my quick wit and sense of humor help to lighten the mood. I value patience and empathy, and I strive to make people's lives better by solving problems, making others happy, and keeping everyone's

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum. It is also known for its rich history, art, and cuisine. It is an important center of politics, culture, and business in the country. Its annual public holidays include Christmas and Easter, and it is home to numerous world-renowned museums and galleries. The French capital is a vibrant and diverse city with a rich history and culture. The Eiffel Tower is one of the most recognizable landmarks in the world, and the city is known for its delicious food and wine, as well as its modern architecture and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and we can see many potential trends that could shape the evolution of AI technology:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be increased focus on ethical considerations, including issues related to bias, privacy, and transparency.

2. Advanced natural language processing: With the increasing volume and complexity of data, we will see more advanced natural language processing that is able to understand and generate natural language better than what humans can currently do.

3. More emphasis on machine learning: With the rise of big data and artificial intelligence, we will see a greater focus on machine learning, which involves the use

In [6]:
llm.shutdown()
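
In a script, an exception raised during generation would skip the final `shutdown()` call. One way to guard against that is a small context manager, sketched here under the assumption that the engine object exposes a `shutdown()` method as used above. The helper `shutting_down` and the `StubEngine` class are hypothetical and exist only so the example runs without a model.

```python
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def shutting_down(engine: Any) -> Iterator[Any]:
    """Yield the engine and call its shutdown() method even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


class StubEngine:
    """Stand-in for a real engine; it only records whether shutdown() was called."""

    def __init__(self) -> None:
        self.closed = False

    def shutdown(self) -> None:
        self.closed = True


stub = StubEngine()
try:
    with shutting_down(stub) as engine:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass
print(stub.closed)
```

With a real engine, the same pattern would wrap the generation calls inside the `with` block in place of the simulated failure.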