# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine

import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Liz! I am a certified birth doula and postpartum doula, providing support to expectant parents during pregnancy, labor, and postpartum. I am passionate about empowering women to have a positive and fulfilling birth experience and supporting them in the postpartum period.
What is a birth doula?
A birth doula is a trained professional who provides physical, emotional, and informational support to a woman and her partner before, during, and after childbirth. Doulas do not perform medical tasks, such as taking vital signs or administering medications, but rather focus on providing comfort measures, emotional support, and advocacy during the birth
Prompt: The president of the United States is
Generated text:  not a regular citizen; rather, he or she is the head of the federal government. The president's office is the highest elected position in the land, with a wide range of powers and responsibilities. The president serves as both the head of stat

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()


=== Testing synchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Katie and I am an associate at the Spokane office of the law firm of Gordon Thomas Honeywell Malanca Peterson & Daheim. I have been with the firm since 2004 and have been working in the Spokane office since 2007. My practice focuses primarily on representing individuals and families in personal injury and wrongful death claims. I have extensive experience in handling cases involving motor vehicle accidents, pedestrian and bicycle accidents, and premises liability.
I am licensed to practice law in both Washington and Idaho and am a member of the American Association for Justice and the Washington State Association for Justice. I am committed to providing my clients with exceptional representation and personalized

Prompt: The capital of France is
Generated text:  Paris and the currency used is the Euro. The average temperature is around 12 degrees Celsius.
France is famous for its delicious cuisine, including escargots, croissants, and baguettes. The country is also known for its beautiful landscapes, historic landmarks, and art museums.
Some popular destinations in France include:
The Eiffel Tower, which is an iconic iron lattice tower built in 1889.
The Louvre Museum, which is home to the Mona Lisa and other famous works of art.
The Palace of Versailles, which is a former royal palace with opulent decorations and beautiful gardens.
The French Riviera,

Prompt: The future of AI is
Generated text:  not about artificial intelligence; it's about intelligence augmentation. | Sandy Pentland, director of the MIT Human-Computer Interaction Group

The future of AI is not about artificial intelligence; it's about intelligence augmentation. | Sandy Pentland, director of the MIT Human-Computer Interaction Group

Artificial intelligence has the potential to augment human intelligence, rather than replace it. This shift in perspective is changing the way we think about AI and its applications in various fields. Instead of creating intelligent machines that think and act like humans, we are now focusing on developing systems that can enhance and support human capabilities.
The concept of intelligence augmentation is not new,



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Hello, my name is
Generated text:  Max and I'm a 3-year-old Papillon (butterfly dog) with a big personality. I love to play, cuddle, and explore the world around me. My favorite things to do are chasing squirrels and snuggling up with my favorite people. I'm a bit of a little diva, but I'm always up for a good time. I love to run around and play, but I also enjoy curling up in someone's lap for a nice long nap.
I'm still a puppy, so I do get a bit rambunctious at times, but my owners are working with me on being

Prompt: The capital of France is
Generated text:  a city that has been built and rebuilt over the centuries, with various styles of architecture reflecting its rich history. From the grandeur of the Baroque to the modernity of the contemporary era, Paris has a wealth of landmarks and monuments that have been influenced by its historical development.
The most famous landmark in Paris is the Eiffel Tower, a symbol of French culture and engineering prowess. Built in 1889

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())


=== Testing asynchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Laura and I'm a freelance writer and editor. I'm excited to be joining the team at Blue Pencil Agency as a content writer and editor. I'm looking forward to working with the team to create high-quality content for our clients. My background is in English literature and journalism, and I've been writing and editing for various publications and companies for several years. I'm passionate about crafting compelling stories and informative content that engages readers and drives results. Outside of work, I love to hike, read, and try out new restaurants in my free time. I'm also a proud dog mom to my furry companion, Luna.
Laura joined our

Prompt: The capital of France is
Generated text:  known for its stunning architecture, art museums, and romantic atmosphere. Paris has been a major tourist destination for centuries, and for good reason. The city has a wealth of cultural and historical landmarks to explore, from the iconic Eiffel Tower to the world-class Louvre Museum.
Here are some of the top attractions to visit in Paris:
The Eiffel Tower: This iconic iron lattice tower is a must-visit attraction in Paris. Visitors can take the elevator to the top for breathtaking views of the city.
The Louvre Museum: One of the world's largest and most famous museums, the Louvre is home to an impressive collection

Prompt: The future of AI is
Generated text:  set to transform healthcare in numerous ways, improving patient care, diagnosis, and outcomes. These emerging technologies will revolutionize the industry.
AI in Healthcare: The Future of Medicine

The future of AI is set to transform healthcare in numerous ways, improving patient care, diagnosis, and outcomes. These emerging technologies will revolutionize the industry. For instance, AI-powered systems can help analyze medical images and detect conditions such as cancer earlier and more accurately than human radiologists.
Another area where AI is making a significant impact is in personalized medicine. By analyzing a patient's genetic profile and medical history, AI can suggest tailored treatments and therapies that are more likely
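
The asynchronous streaming loop above handles the prompts one after another. If you would rather consume several streams concurrently, one possible pattern is to wrap each stream in its own coroutine and schedule them together with `asyncio.gather`. This is only a sketch under stated assumptions: `fake_async_stream`, `consume`, and `run_all` are hypothetical names, `fake_async_stream` stands in for the generator returned by `await llm.async_generate(prompt, sampling_params, stream=True)`, and whether overlapping requests are appropriate for your engine configuration is something to verify separately.

```python
import asyncio
from typing import AsyncIterator, Dict, List


async def fake_async_stream(prompt: str) -> AsyncIterator[Dict[str, str]]:
    # Hypothetical stand-in for the generator returned by
    # `await llm.async_generate(prompt, sampling_params, stream=True)`.
    for piece in [" one", " two", " three"]:
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield {"text": piece}


async def consume(prompt: str) -> str:
    # Accumulate the chunks of one stream into a single string.
    parts: List[str] = []
    async for chunk in fake_async_stream(prompt):
        parts.append(chunk["text"])
    return prompt + "".join(parts)


async def run_all(prompts: List[str]) -> List[str]:
    # gather() schedules every consumer at once and returns results in input order.
    return await asyncio.gather(*(consume(p) for p in prompts))


results = asyncio.run(run_all(["A:", "B:"]))
print(results)
```

Because `asyncio.gather` preserves input order, `results[i]` corresponds to `prompts[i]` even when the streams finish at different times.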


In [6]:
llm.shutdown()