# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. It suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine

import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Elena and I am the new Teacher at the International School of Zug and Luzern. I am excited to be here and look forward to getting to know all of my new students and colleagues.
I have a degree in Early Childhood Education from the University of Athens and a Master's degree in Special Education from the University of New Mexico. I have been teaching for over 15 years and have experience working with students of all ages and backgrounds.
My classroom will be a collaborative and engaging environment where students are encouraged to explore, learn, and grow. I believe that every student is unique and talented and I will work with each student to help them reach
Prompt: The president of the United States is
Generated text:  the head of state and head of government of the United States. The president is directly elected by the people through the Electoral College. The president serves a four-year term, and can be re-elected once. The president has t
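
When a prompt list is too large to submit comfortably in one call, or when you want to save results between submissions, one option is to feed the engine in fixed-size chunks. The sketch below is not part of SGLang: `generate_in_chunks` and `fake_generate` are hypothetical names, and `fake_generate` only mimics the list-of-dicts return shape used above so that the snippet runs without a GPU. With a real engine, `llm.generate` would presumably be passed in its place.

```python
from typing import Callable, Dict, Iterator, List


def generate_in_chunks(
    generate: Callable[[List[str], Dict], List[Dict]],
    prompts: List[str],
    sampling_params: Dict,
    chunk_size: int = 2,
) -> Iterator[Dict]:
    """Yield {"prompt", "text"} records, submitting chunk_size prompts per call."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start : start + chunk_size]
        outputs = generate(chunk, sampling_params)
        for prompt, output in zip(chunk, outputs):
            yield {"prompt": prompt, "text": output["text"]}


def fake_generate(prompts: List[str], sampling_params: Dict) -> List[Dict]:
    """Hypothetical stand-in for llm.generate: one dict with a "text" key per prompt."""
    return [{"text": f" <completion for {p!r}>"} for p in prompts]


prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

for record in generate_in_chunks(fake_generate, prompts, sampling_params):
    print(f"Prompt: {record['prompt']}\nGenerated text: {record['text']}")
```

Because the helper is a generator, each chunk is submitted only when the caller asks for its records, which makes it straightforward to write results to disk as they arrive.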

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()


=== Testing synchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Kaitlyn.
I'm a 32-year-old, energetic, and optimistic individual who loves exploring new places and trying new things. I'm a bit of a perfectionist, but I'm always up for a challenge and enjoy learning from my mistakes.
When I'm not working, you can find me practicing yoga, cooking in the kitchen, or attempting to learn a new language (I'm currently learning Spanish!). I'm a bit of a night owl, so you'll often find me sipping coffee or tea late into the night, binge-watching my favorite TV shows or reading a good book.
As for my interests, I'm

Prompt: The capital of France is
Generated text:  a city that requires no introduction. Its grand boulevards, historic landmarks, and artistic treasures are renowned worldwide. Yet, Paris remains a city full of surprises, offering an endless array of experiences for visitors. Whether you’re looking for the iconic Eiffel Tower, a romantic stroll along the Seine, or an exploration of the city’s many art museums, Paris has something for everyone.
The City of Light is a must-visit destination for anyone interested in history, art, fashion, cuisine, and more. Here are some of the top things to do and see in Paris:
Explore the Louvre Museum: The world’s largest

Prompt: The future of AI is
Generated text:  all about multi-agent systems, where many agents work together to achieve complex tasks.
As AI continues to evolve, we're moving from single-agent systems, where one AI model performs a specific task, to multi-agent systems. In this setup, multiple AI models work together to achieve more complex and nuanced goals.
At the heart of this shift is the concept of agent-to-agent communication and cooperation, where AI models interact with each other to achieve a shared objective.
Here are some key aspects of multi-agent systems and their potential applications:
Agent-to-Agent Communication

In traditional AI systems, models communicate with humans through APIs or interfaces. In contrast, multi
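
A streaming consumer often needs the complete text as well as the incremental display. The sketch below assumes, as the output above suggests, that each chunk's `"text"` field holds only the newly generated text; if your SGLang version returns cumulative text instead, the joining step would need to change. `collect_stream` and `fake_stream` are hypothetical names, and `fake_stream` stands in for `llm.generate(..., stream=True)` so that the snippet runs without a model.

```python
from typing import Dict, Iterable, Iterator


def collect_stream(chunks: Iterable[Dict], echo: bool = True) -> str:
    """Print each chunk as it arrives and return the concatenated text."""
    pieces = []
    for chunk in chunks:
        piece = chunk["text"]
        if echo:
            print(piece, end="", flush=True)
        pieces.append(piece)
    if echo:
        print()
    return "".join(pieces)


def fake_stream(text: str) -> Iterator[Dict]:
    """Hypothetical stand-in for a streaming call: yields one word per chunk."""
    for i, word in enumerate(text.split(" ")):
        yield {"text": word if i == 0 else " " + word}


full_text = collect_stream(fake_stream("Paris is the capital of France."))
```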




### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Hello, my name is
Generated text:  Claire. I'm a 23-year-old American living in the south of Spain. I've been here for about 6 months, and I have to say, it's been a rollercoaster of emotions. I'm not sure if I'm fully embracing the Spanish way of life, or if I'm just pretending to be a local. Either way, I've learned a lot about myself and the culture.
My friends and I often have heated debates about the pros and cons of living in Spain. Some people love it here, while others, like me, are still figuring things out. We've talked about everything from the siesta

Prompt: The capital of France is
Generated text:  Paris. While the majority of visitors to Paris come for the city’s culture, art, and history, there are numerous things to do and see in the city’s surroundings. The Louvre Museum, the Arc de Triomphe, and the Eiffel Tower are some of the most well-known Parisian landmarks. The city is famous for its cuisine, including croissants, cheese, and wine.
The Eiffel Tower, on
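
Because `async_generate` is awaited, independent batches can in principle be overlapped with `asyncio.gather` rather than awaited one after another. The following is a minimal sketch under that assumption: `fake_async_generate` is a hypothetical stand-in that sleeps instead of running a model, and `run_batches` is an illustrative helper, not an SGLang function.

```python
import asyncio
from typing import Dict, List


async def fake_async_generate(prompts: List[str], sampling_params: Dict) -> List[Dict]:
    """Hypothetical stand-in for llm.async_generate; sleeps instead of running a model."""
    await asyncio.sleep(0.01)
    return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def run_batches(batches: List[List[str]], sampling_params: Dict) -> List[List[Dict]]:
    """Submit every batch concurrently; gather returns results in submission order."""
    tasks = [fake_async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(run_batches(batches, {"temperature": 0.8, "top_p": 0.95}))

for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```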

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())


=== Testing asynchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Kirsty and I am a student at the University of Edinburgh. I am a third year Psychology student, currently studying the research methods module, and I am excited to be one of the researchers on this project. I am looking forward to working alongside the other team members and contributing to the study of how music can impact our emotional responses.
I am particularly interested in the effects of music on our emotional responses, as I have noticed how music can evoke a range of emotions in myself and others. I am eager to explore this topic further and gain a deeper understanding of the mechanisms underlying the impact of music on our emotions. I believe that this project

Prompt: The capital of France is
Generated text:  a city of grandeur, style and romance. From the iconic Eiffel Tower to the world-class museums and fashion boutiques, Paris is a city that has something for everyone.
The city has a rich history dating back to the 12th century, and its architectural landmarks reflect its many transformations over the years. Visitors can explore the magnificent Notre Dame Cathedral, the imposing Arc de Triomphe and the beautiful Palace of Versailles, among many other iconic sites.
Paris is also renowned for its cuisine, with famous dishes such as escargots, ratatouille and croissants. The city is home to some of the

Prompt: The future of AI is
Generated text:  coming, and it's all about reinforcement learning

Reinforcement learning is a type of machine learning that is becoming increasingly important in the field of artificial intelligence (AI). It's a subfield of machine learning that deals with the training of artificial agents to make decisions and take actions in complex, uncertain environments. In this article, we'll explore the future of reinforcement learning and how it's shaping the field of AI.
Reinforcement learning is a type of machine learning that is inspired by the way humans learn through trial and error. In reinforcement learning, an artificial agent learns to make decisions and take actions in a specific environment, such as
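
The loop above streams one prompt at a time. If interleaved printing is not required, one stream per prompt could be consumed concurrently and the joined text collected per prompt. This is only a sketch: `fake_async_generate` is a hypothetical stand-in that, like the call above, returns an async generator when awaited, and the chunks are assumed to be incremental.

```python
import asyncio
from typing import AsyncIterator, Dict, List


async def fake_async_generate(prompt: str, sampling_params: Dict, stream: bool = True):
    """Hypothetical stand-in: awaiting it returns an async generator of chunks."""

    async def chunks() -> AsyncIterator[Dict]:
        for word in [" one", " two", " three"]:
            await asyncio.sleep(0)  # hand control back to the event loop
            yield {"text": word}

    return chunks()


async def stream_to_text(prompt: str, sampling_params: Dict) -> str:
    """Consume one stream and return the joined text (chunks assumed incremental)."""
    generator = await fake_async_generate(prompt, sampling_params, stream=True)
    pieces = []
    async for chunk in generator:
        pieces.append(chunk["text"])
    return "".join(pieces)


async def stream_all(prompts: List[str], sampling_params: Dict) -> Dict[str, str]:
    """Run one streaming request per prompt concurrently."""
    texts = await asyncio.gather(*(stream_to_text(p, sampling_params) for p in prompts))
    return dict(zip(prompts, texts))


completions = asyncio.run(
    stream_all(["Hello, my name is", "The capital of France is"], {"temperature": 0.8})
)
print(completions)
```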




In [6]:
llm.shutdown()
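
In a script, as opposed to a notebook, an exception raised before `llm.shutdown()` would skip the cleanup. One way to guard against that is a small context manager. The sketch below uses hypothetical names (`managed_engine`, `FakeEngine`); `FakeEngine` only imitates the `generate` and `shutdown` methods used in this document so that the example runs without a GPU.

```python
from contextlib import contextmanager


@contextmanager
def managed_engine(factory):
    """Create an engine with factory() and always call its shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


class FakeEngine:
    """Hypothetical stand-in exposing the two methods used in this document."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


with managed_engine(FakeEngine) as engine:
    outputs = engine.generate(["Hello, my name is"], {"temperature": 0.8})

print(engine.is_shut_down)
```

With a real engine, the factory might be `lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")`.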