# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine

import sglang as sgl
from sglang.utils import print_highlight
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

INFO 11-24 12:51:06 weight_utils.py:243] Using model weights format ['*.safetensors']


Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:01<00:00,  2.06it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print_highlight("===============================")
    print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")
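
For offline batch jobs, it is often convenient to store each prompt together with its completion. The following is a minimal sketch, assuming each element of `outputs` is a dict with a `"text"` key as in the loop above. The helper name `to_jsonl` and the record field names are illustrative choices, not part of SGLang.

```python
import json


def to_jsonl(prompts, outputs):
    """Serialize prompt/completion pairs as JSON Lines (one JSON object per line).

    `to_jsonl` is a hypothetical helper. It assumes each output is a dict
    with a "text" key, as in the non-streaming call shown above.
    """
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Example with stand-in outputs (no engine required):
example_prompts = ["The capital of France is"]
example_outputs = [{"text": " Paris."}]
print(to_jsonl(example_prompts, example_outputs))
```

With a real engine, `outputs` would be the list returned by `llm.generate(prompts, sampling_params)`.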

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print_highlight(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()

Generated text: Sadie! I'm a sweet and gentle 6-year-old girl who loves playing with my family and snuggling with my favorite plushies, especially my stuffed animal named Mr. Whiskers. I love making new friends and going on adventures, whether it's in the backyard or on a fun outing. I'm a bit shy at first, but once I get to know you, I'll be your best buddy! I hope you'll be my friend too!

I'm still learning to communicate with you using this special tool, so please be patient with me. I'll do my best to respond with simple messages. If you have

Generated text: home to the Eiffel Tower, a beautiful and iconic landmark that attracts millions of tourists every year. In addition to its famous tower, the city offers a wide range of cultural and historical attractions, including the Louvre Museum, the Arc de Triomphe, and Notre-Dame Cathedral. Paris is also known for its romantic atmosphere, charming streets, and picturesque river Seine, making it a popular destination for couples and honeymooners. Whether you're interested in art, history, fashion, or food, Paris has something for everyone.

The city's cuisine is also a major draw, with famous dishes like escargots, ratat

Generated text: not just about the technology itself, but about how we use it to make a positive impact on society.

In this episode, we explore the potential of AI for social good with Arvind Krishna, the CEO of IBM. We talk about the power of AI to drive innovation and progress, and how it can be used to address some of the world's most pressing challenges, from climate change to healthcare.

Arvind shares his vision for the future of AI, and how IBM is working to develop and deploy AI solutions that can make a real difference in people's lives. We also discuss the importance of ethics and responsibility in the development and
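
When post-processing a stream, it can help to separate chunk handling from printing. Below is a minimal sketch of a helper that yields only the newly generated text from each chunk. The output above suggests that each chunk's `"text"` holds just the new piece, but this may differ across SGLang versions (some may return the full text generated so far), so the hypothetical `cumulative` flag covers both cases. `iter_new_text` is an illustrative name, not an SGLang function.

```python
def iter_new_text(chunks, cumulative=False):
    """Yield the newly generated text for each streamed chunk.

    Hypothetical helper, not part of SGLang. Each chunk is assumed to be a
    dict with a "text" key. If `cumulative` is True, chunk["text"] is taken
    to be the full text so far and only the unseen suffix is yielded.
    """
    seen = 0  # number of characters already yielded (cumulative mode only)
    for chunk in chunks:
        text = chunk["text"]
        if cumulative:
            yield text[seen:]
            seen = len(text)
        else:
            yield text


# Stand-in chunks so the sketch runs without an engine:
incremental = [{"text": " Par"}, {"text": "is"}, {"text": "."}]
running_total = [{"text": " Par"}, {"text": " Paris"}, {"text": " Paris."}]

print("".join(iter_new_text(incremental)))
print("".join(iter_new_text(running_total, cumulative=True)))
```

With a real engine, `chunks` would be the iterator returned by `llm.generate(prompt, sampling_params, stream=True)`.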




### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print_highlight(f"\nPrompt: {prompt}")
        print_highlight(f"Generated text: {output['text']}")


asyncio.run(main())
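
Because `async_generate` is awaited, it composes with standard `asyncio` tools. The sketch below awaits several single-prompt requests together with `asyncio.gather`. `StubEngine` and `generate_all` are hypothetical names: the stub only mimics the call shape used above and assumes a single prompt yields a single dict with a `"text"` key. Whether this is preferable to passing a list of prompts in one call depends on the workload; the sketch only illustrates the pattern.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in that mimics the `async_generate` call shape used above."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0)  # yield control to the event loop once
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(engine, prompts, sampling_params):
    """Issue one request per prompt and wait for all of them concurrently."""
    requests = [engine.async_generate(p, sampling_params) for p in prompts]
    # gather returns results in the same order as the awaitables passed in
    return await asyncio.gather(*requests)


results = asyncio.run(
    generate_all(
        StubEngine(),
        ["Hello, my name is", "The future of AI is"],
        {"temperature": 0.8, "top_p": 0.95},
    )
)
for result in results:
    print(result["text"])
```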

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print_highlight(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())

Generated text: Deb, and I'm the proud owner and baker of this delightful bakery, Sweet Treats. I've been in the baking business for over 20 years, and I must say, it's a dream come true to be able to share my passion with all of you.

I've always loved baking, and I come from a long line of bakers. My grandmother was a legendary baker in our community, and I used to love helping her in the kitchen. She taught me the art of traditional baking, and I've carried on that tradition with my own unique twist.

My bakery offers a wide variety of sweet treats, from classic cookies

Generated text: Paris. It is situated in the north-central part of the country, along the Seine River. The Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum are all located in Paris.

What is the capital of France?

A. Paris
B. Lyon
C. Berlin
D. Rome

Answer: A

Reasoning Skill: Identifying Pros And Cons

In this question, the student is asked to identify the capital of France, which is a basic fact about the country. To answer this question correctly, the student needs to have knowledge of geography and be able to identify the correct city. This question does

Generated text: uncertain, but one thing is clear: it will be determined by the people who design and deploy these systems. In this book, Gary Marcus and Ernest Davis argue that current approaches to AI are misguided, and that we need a new approach that prioritizes transparency, interpretability, and human values. They propose a new framework for AI that is grounded in human cognition and centered on the human experience.

Gary Marcus and Ernest Davis are two of the most prominent and respected figures in the field of artificial intelligence. Marcus is a renowned researcher and entrepreneur, and Davis is a leading expert in artificial intelligence and cognitive science.

"The Future of the Mind" is
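
If the streamed text is needed as a single string rather than printed piece by piece, the chunks can be collected inside the `async for` loop. This is a minimal sketch under the assumption that each chunk's `"text"` holds only the newly generated piece, as the output above suggests. `stub_stream` and `collect_stream` are hypothetical names used so the example runs without an engine.

```python
import asyncio


async def stub_stream(pieces):
    """Hypothetical async generator standing in for the engine's stream."""
    for piece in pieces:
        await asyncio.sleep(0)  # yield control to the event loop once
        yield {"text": piece}


async def collect_stream(stream):
    """Consume an async stream of chunks and return the concatenated text.

    Assumes each chunk's "text" holds only the newly generated piece.
    """
    parts = []
    async for chunk in stream:
        parts.append(chunk["text"])
    return "".join(parts)


full_text = asyncio.run(collect_stream(stub_stream([" Par", "is", "."])))
print(full_text)
```

With a real engine, `stream` would be the generator obtained from `await llm.async_generate(prompt, sampling_params, stream=True)`.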




In [6]:
llm.shutdown()
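
In a script, an exception raised before the final cell would skip the `shutdown()` call. One way to guard against that is a small context manager, sketched below. `engine_session` and `ShutdownStub` are hypothetical names; the only assumption about the engine is that it exposes the `shutdown()` method used above.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(engine):
    """Yield `engine` and call its `shutdown()` on exit, even if an error occurs."""
    try:
        yield engine
    finally:
        engine.shutdown()


class ShutdownStub:
    """Hypothetical stand-in exposing only the `shutdown()` method used above."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


stub = ShutdownStub()
with engine_session(stub) as engine:
    pass  # generation calls would go here
print(stub.is_shut_down)
```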