# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation
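
The four modes differ along two axes: whether the call blocks (synchronous) or is awaited (asynchronous), and whether the result arrives whole or as a sequence of chunks. The sketch below illustrates only those call shapes. `EchoEngine` is a hypothetical stand-in written for this example; it is not part of SGLang and merely mimics the `generate` / `async_generate` call signatures used in the cells later in this document, for a single prompt.

```python
import asyncio


class EchoEngine:
    """Hypothetical stand-in for an engine; it just echoes the prompt word by word."""

    def _pieces(self, prompt):
        # Split the prompt into chunks, keeping the separating spaces.
        words = prompt.split(" ")
        return [w if i == 0 else " " + w for i, w in enumerate(words)]

    def generate(self, prompt, sampling_params=None, stream=False):
        if stream:
            # Streaming: return an iterator of chunk dicts.
            return ({"text": piece} for piece in self._pieces(prompt))
        # Non-streaming: return one dict holding the full text.
        return {"text": prompt}

    async def async_generate(self, prompt, sampling_params=None, stream=False):
        if stream:
            async def chunks():
                for piece in self._pieces(prompt):
                    yield {"text": piece}
            # Awaiting the call hands back an async generator of chunk dicts.
            return chunks()
        return {"text": prompt}


engine = EchoEngine()

# 1. Non-streaming synchronous: one blocking call, one result.
sync_full = engine.generate("hello offline engine")["text"]

# 2. Streaming synchronous: iterate over chunks with a plain for loop.
sync_stream = [c["text"] for c in engine.generate("hello offline engine", stream=True)]


async def async_modes():
    # 3. Non-streaming asynchronous: await the full result.
    full = (await engine.async_generate("hello offline engine"))["text"]
    # 4. Streaming asynchronous: await the generator, then iterate with async for.
    gen = await engine.async_generate("hello offline engine", stream=True)
    streamed = [c["text"] async for c in gen]
    return full, streamed


async_full, async_stream = asyncio.run(async_modes())
print(sync_full, sync_stream, async_full, async_stream)
```

In every mode the generated text is read from the `"text"` key; only the way the caller waits for it changes.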

You can also build a custom server on top of the SGLang offline engine. A detailed, working example is available as a Python script in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
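
To give a rough idea of what "a server on top of the engine" involves, the sketch below isolates the request-handling step: parse a JSON body, call the engine, and serialize the result. It is not the linked `custom_server.py` example. `StubEngine`, `handle_generate`, and the request fields `prompt` and `sampling_params` are assumptions made for illustration, and the function is deliberately independent of any HTTP framework.

```python
import json


class StubEngine:
    """Hypothetical stand-in exposing a generate(prompt, sampling_params) method."""

    def generate(self, prompt, sampling_params=None):
        return {"text": f" (echo) {prompt}"}


def handle_generate(engine, body):
    """Turn a JSON request body into an (HTTP status, JSON response body) pair."""
    try:
        request = json.loads(body)
        prompt = request["prompt"]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, a missing field, or a non-object payload.
        error = {"error": "expected a JSON object with a 'prompt' field"}
        return 400, json.dumps(error).encode()
    output = engine.generate(prompt, request.get("sampling_params"))
    return 200, json.dumps({"text": output["text"]}).encode()


status, payload = handle_generate(StubEngine(), b'{"prompt": "Hello"}')
print(status, payload.decode())
```

A real server would call a function like this from its route handler and pass in a long-lived engine instance instead of the stub.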

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.

import asyncio

import sglang as sgl
from sglang.utils import print_highlight

# Load the model weights and start the engine in this process.
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

INFO 11-05 09:17:25 weight_utils.py:243] Using model weights format ['*.safetensors']



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print_highlight("===============================")
    print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")
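
For offline batch jobs the results usually need to be stored rather than printed. The sketch below writes one JSON object per line, relying only on what the cell above shows: `outputs` is a list aligned with `prompts`, and each item has a `"text"` key. The helper name `to_jsonl` and the sample data are illustrative and not part of SGLang.

```python
import io
import json


def to_jsonl(prompts, outputs, fp):
    """Write one JSON object per line pairing each prompt with its generated text."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "text": output["text"]}
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")


# Made-up data standing in for the result of llm.generate(prompts, sampling_params).
sample_prompts = ["The capital of France is", "Hello, my name is"]
sample_outputs = [{"text": " Paris."}, {"text": " Ada."}]

buffer = io.StringIO()  # swap for open("results.jsonl", "w") to write a file
to_jsonl(sample_prompts, sample_outputs, buffer)
print(buffer.getvalue(), end="")
```

Keeping the prompt next to its completion makes the file usable on its own, without the original prompt list.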

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print_highlight(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()

Generated text: Samantha (Sam) and I am a licensed esthetician and makeup artist. I love helping people feel confident and beautiful in their own skin. I specialize in bridal and event makeup, but I also offer regular maintenance and treatments for everyday beauty. I have a passion for helping others and I love the feeling of making someone feel like the best version of themselves.

My journey as a makeup artist started in 2008 when I worked as a makeup artist for a popular beauty store. I quickly fell in love with the art of makeup and the impact it can have on a person's self-esteem. I furthered my education by attending the A

Generated text: a city that is full of magic and mystery. Paris is a must-visit destination for any traveler. The city is known for its beautiful architecture, art museums, and romantic atmosphere.

One of the most famous landmarks in Paris is the Eiffel Tower. This iconic iron lattice tower was built for the 1889 World's Fair and has become a symbol of the city. Visitors can take the stairs or elevator to the top for breathtaking views of the city.

Another popular destination in Paris is the Louvre Museum. The Louvre is one of the world's largest and most famous museums, housing over 550,000 works of art

Generated text: not just about machines learning from data, but also about how humans interact with these machines. At NVIDIA, we are pushing the boundaries of AI with new technologies that enable humans and machines to collaborate more effectively.

The first step towards achieving this collaboration is to develop new interfaces that allow humans to communicate with AI systems in a more natural way. This is where the concept of Human-Computer Interaction (HCI) comes in.

HCI is the study of how people interact with technology. It involves designing interfaces that are intuitive, user-friendly, and effective in conveying information. In the context of AI, HCI is crucial in enabling humans to communicate with machines in
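
When streaming, it is often useful to keep the full text as well as display it piece by piece. The sketch below assumes each chunk carries only the newly generated piece in `chunk["text"]`, as the streamed output above suggests; if your SGLang version returns the cumulative text in every chunk, keep only the last chunk instead. `collect_stream` and `fake_stream` are hypothetical names used for illustration.

```python
def collect_stream(chunks, on_piece=None):
    """Join incremental chunk texts into one string, optionally echoing each piece."""
    pieces = []
    for chunk in chunks:
        piece = chunk["text"]
        if on_piece is not None:
            on_piece(piece)  # e.g. print the piece as soon as it arrives
        pieces.append(piece)
    return "".join(pieces)


def fake_stream():
    # Stand-in for llm.generate(prompt, sampling_params, stream=True).
    for piece in [" a", " city", " that", " is", " full", " of", " magic", "."]:
        yield {"text": piece}


full_text = collect_stream(
    fake_stream(), on_piece=lambda p: print(p, end="", flush=True)
)
print()
```

With the real engine, the first argument would be the generator returned by the streaming `llm.generate` call in the cell above.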




### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print_highlight(f"\nPrompt: {prompt}")
        print_highlight(f"Generated text: {output['text']}")


asyncio.run(main())
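
The asynchronous API is most useful when requests come from independent sources and should be in flight at the same time. The sketch below shows one way to do that with `asyncio.gather` and a semaphore that caps concurrency. `fake_async_generate` is a hypothetical stand-in for `await llm.async_generate(prompt, sampling_params)`, and `max_in_flight` is an illustrative parameter; whether a cap is needed at all depends on your workload.

```python
import asyncio


async def fake_async_generate(prompt, sampling_params=None):
    # Stand-in for `await llm.async_generate(prompt, sampling_params)`.
    await asyncio.sleep(0.01)
    return {"text": f" <completion of {prompt!r}>"}


async def generate_all(prompts, max_in_flight=2):
    """Run one request per prompt concurrently, at most `max_in_flight` at a time."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt):
        async with semaphore:
            return await fake_async_generate(prompt)

    # gather returns results in the same order as the input prompts.
    return await asyncio.gather(*(one(p) for p in prompts))


results = asyncio.run(generate_all(["Hello, my name is", "The capital of France is"]))
for r in results:
    print(r["text"])
```

Note that `asyncio.run` raises a `RuntimeError` when an event loop is already running, which is the case inside most notebook kernels; there, `await generate_all(...)` directly in a cell is the usual alternative.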

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print_highlight(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())

Generated text: Adrian and I am an artist and writer, currently based in the beautiful city of Melbourne. I have a passion for creating unique and imaginative artworks that bring joy to those who experience them. My artistic journey began at a young age, experimenting with a variety of mediums and techniques, and has since evolved to encompass a range of styles and themes.

In addition to my visual art practice, I am also a writer and have published several short stories and poetry collections. My writing often explores themes of identity, belonging and the human condition, and I am drawn to the intersection of words and images in my creative practice.

I am always excited to meet new

Generated text: Paris. Paris is home to the Eiffel Tower, one of the world’s most famous landmarks. It is also known for its art museums, fashion, and cuisine. The city has a rich history and a unique cultural heritage. The name Paris is derived from the Celtic language and means "city of love." It has been the capital of France since 987.

Paris is a major center for art, fashion, and culture. The city is home to the Louvre Museum, the Musée d'Orsay, and the Centre Pompidou, which are among the world's largest and most famous art museums. The city is

Generated text: all about Human-AI collaboration

AI is rapidly becoming more pervasive in our daily lives, and it's no longer just about replacing human tasks but rather augmenting our abilities and working alongside us to achieve better outcomes.

The future of AI is all about Human-AI collaboration, a concept that has been gaining traction in recent years. This collaboration is not just about machines performing tasks, but rather about humans and machines working together in a way that leverages the strengths of both.

The idea of Human-AI collaboration is based on the recognition that AI systems are capable of processing vast amounts of data quickly and accurately, but they often lack the nu
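
The cell above consumes one stream at a time. If several streams should progress concurrently, printing chunks directly would interleave text from different prompts, so a per-prompt buffer helps. The sketch below is one possible arrangement. `fake_async_stream` is a hypothetical stand-in that mimics the call shape used above (awaiting it returns an async generator of chunk dicts), and it assumes each chunk holds only the new piece of text.

```python
import asyncio


async def fake_async_stream(prompt, delay):
    # Stand-in for `await llm.async_generate(prompt, sampling_params, stream=True)`:
    # awaiting it returns an async generator of chunk dicts.
    async def chunks():
        for piece in [" one", " two", " three"]:
            await asyncio.sleep(delay)
            yield {"text": piece}

    return chunks()


async def consume(prompt, delay, texts):
    generator = await fake_async_stream(prompt, delay)
    async for chunk in generator:
        # Append each piece to the buffer that belongs to this prompt.
        texts[prompt] += chunk["text"]


async def stream_all(prompts):
    texts = {p: "" for p in prompts}
    # Each stream is consumed by its own task, so chunks from different
    # prompts arrive interleaved but end up in separate buffers.
    await asyncio.gather(
        *(consume(p, 0.01 * (i + 1), texts) for i, p in enumerate(prompts))
    )
    return texts


texts = asyncio.run(stream_all(["A", "B"]))
print(texts)
```

The different delays only simulate streams that advance at different speeds; the final buffers are the same regardless of arrival order.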




In [6]:
llm.shutdown()
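
In a script, an exception raised during generation would skip the final `llm.shutdown()` call. One way to guard against that is a small context manager that always calls `shutdown()` on exit. This is a sketch: `engine_session` and `DummyEngine` are hypothetical names, and the only thing assumed about the real engine is the `shutdown()` method shown above.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(make_engine):
    """Create an engine, hand it to the caller, and always call shutdown() afterwards."""
    engine = make_engine()
    try:
        yield engine
    finally:
        engine.shutdown()


class DummyEngine:
    """Hypothetical stand-in that only records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


try:
    with engine_session(DummyEngine) as engine:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(engine.closed)
```

With the real engine, `make_engine` could be `lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")`, matching the first cell of this document.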