# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
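
As a rough illustration of what sits between an incoming request and the engine, here is a minimal sketch of a request handler. It is not the linked `custom_server.py`: `EchoEngine`, `handle_generate`, and the JSON field names are hypothetical, and the stand-in engine only mimics the `generate(prompt, sampling_params)` call shape used later in this document so that the sketch runs without a GPU.

```python
import json


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine; it only mimics the call shape."""

    def generate(self, prompt, sampling_params=None):
        # A real engine would run the model; this stub echoes the prompt.
        return {"text": f" (echo of: {prompt})"}


def handle_generate(engine, raw_body):
    """Turn a JSON request body into a (status, JSON response body) pair."""
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError:
        return 400, json.dumps({"error": "body must be valid JSON"})

    if not isinstance(payload, dict):
        return 400, json.dumps({"error": "body must be a JSON object"})

    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt:
        return 400, json.dumps({"error": "'prompt' must be a non-empty string"})

    # The field names are illustrative, not part of any SGLang protocol.
    sampling_params = payload.get("sampling_params") or {}
    output = engine.generate(prompt, sampling_params)
    return 200, json.dumps({"text": output["text"]})
```

Keeping the handler independent of any web framework means the same function could be called from whichever server library you choose, with the real engine passed in place of the stub.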

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine

import asyncio

import sglang as sgl
from sglang.utils import print_highlight

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print_highlight("===============================")
    print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")
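
For offline batch jobs it is often convenient to persist results rather than print them. The sketch below writes one JSON object per prompt to a JSONL stream. It assumes, as the `zip(prompts, outputs)` loop above does, that outputs come back in the same order as the prompts and that each output is a dict with a `"text"` key. The helper `write_jsonl`, the record field names, and the sample completions are made up for illustration.

```python
import io
import json


def write_jsonl(prompts, outputs, stream):
    """Write one JSON object per prompt/output pair and return the count."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        stream.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(prompts)


# Placeholder data shaped like the result of llm.generate(prompts, ...);
# the completions are invented for this example.
example_prompts = ["The capital of France is", "Hello, my name is"]
example_outputs = [{"text": " Paris."}, {"text": " Ada."}]

buffer = io.StringIO()  # swap in open("results.jsonl", "w") to write a file
written = write_jsonl(example_prompts, example_outputs, buffer)
```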

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print_highlight(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()

Generated text: Paul and I am the new Editor-in-Chief of J. I am honored to take on this role and I am excited to see where the publication will go in the future. I have been a member of the J team for several years, and I have seen firsthand the dedication and hard work that goes into producing this publication.
As the new Editor-in-Chief, my goal is to build on the strong foundation that has been established and to take the publication in new and exciting directions. I plan to do this by engaging with the community and soliciting feedback from readers, writers, and staff members. I also plan to explore new formats

Generated text: a city of romance, art, fashion, and beauty. Paris, the City of Light, is known for its stunning architecture, rich history, and cultural landmarks. With its picturesque streets, charming cafes, and world-class museums, Paris is a must-visit destination for any traveler.
In this article, we will explore some of the top attractions in Paris, including the Eiffel Tower, the Louvre Museum, Notre Dame Cathedral, and the Arc de Triomphe. We will also provide tips for visiting these popular destinations and some insider secrets for making the most of your trip.
1. The Eiffel Tower

The E

Generated text: human-centered: New AI strategies for the post-pandemic world

The future of AI is human-centered: New AI strategies for the post-pandemic world

The COVID-19 pandemic has accelerated the adoption of artificial intelligence (AI) in various industries, from healthcare to finance. However, as AI continues to transform the world, it's essential to ensure that its development and deployment prioritize human well-being, safety, and dignity. In this context, human-centered AI strategies are emerging as a crucial aspect of post-pandemic recovery and growth.
Human-centered AI refers to the design and development of AI systems that prioritize human values, needs

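If you need the full completion as a single string after streaming, the chunks can be joined as they arrive. This is a minimal sketch that assumes each chunk's `"text"` field carries only the newly generated piece, as the output above suggests. If your version returns the accumulated text in every chunk, the joining logic would need adjusting. `collect_stream` and `fake_stream` are hypothetical names, and `fake_stream` stands in for `llm.generate(prompt, sampling_params, stream=True)`.

```python
def collect_stream(chunks):
    """Join the "text" field of each streamed chunk into one string."""
    pieces = []
    for chunk in chunks:
        pieces.append(chunk["text"])
    return "".join(pieces)


def fake_stream():
    """Hypothetical stand-in for llm.generate(prompt, ..., stream=True)."""
    for piece in [" Paris", ",", " the", " City", " of", " Light"]:
        yield {"text": piece}


full_text = collect_stream(fake_stream())
```

Collecting pieces in a list and joining once at the end avoids repeatedly rebuilding a growing string inside the loop.
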
### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print_highlight(f"\nPrompt: {prompt}")
        print_highlight(f"Generated text: {output['text']}")


asyncio.run(main())
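
When prompts arrive from separate coroutines rather than as one list, a common asyncio pattern is to issue one request per prompt and cap how many are in flight. The sketch below shows that pattern with a hypothetical stand-in, `fake_async_generate`, in place of `await llm.async_generate(prompt, sampling_params)`. The helper `generate_all` and its `max_in_flight` parameter are illustrative, and the sketch does not establish whether this is preferable to a single batched call on a real engine.

```python
import asyncio


async def fake_async_generate(prompt, sampling_params=None):
    """Hypothetical stand-in for `await llm.async_generate(prompt, ...)`."""
    await asyncio.sleep(0.01)  # pretend the model is working
    return {"text": f" <completion for {prompt!r}>"}


async def generate_all(prompts, max_in_flight=2):
    """Run one request per prompt, at most `max_in_flight` at a time."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt):
        async with semaphore:
            return await fake_async_generate(prompt)

    # gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(one(p) for p in prompts))


results = asyncio.run(generate_all(["a", "b", "c"]))
```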

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print_highlight(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())

Generated text: Abbie.
I was born in 1995 and I am 25 years old now.
I am a very outgoing person, who loves to travel and try new things. I am also a bit of a homebody, who loves to stay at home and binge watch TV shows.
I love animals, especially cats and dogs. I have a pet cat named Luna, who is very spoiled and loves to cuddle.
I work as a software engineer, which I love. I find it very challenging and rewarding.
I am a very optimistic person, who believes that everything will work out in the end. I am also a bit of a

Generated text: home to countless iconic landmarks, historic neighborhoods, and world-class museums. Paris has been the capital of France since 987 and is one of the world’s most romantic cities. Here are some of the top things to do in Paris:
The Eiffel Tower is one of the most iconic landmarks in the world. It was built for the 1889 World’s Fair and stands at an impressive 324 meters tall. Visitors can take the elevator to the top for breathtaking views of the city.
The Louvre Museum is one of the world’s largest and most famous museums. It houses an impressive collection of art and artifacts from around the world

Generated text: uncertain, but one thing is certain: it will continue to change the world in ways both seen and unseen. The AI industry is rapidly evolving, with breakthroughs and advancements happening at an exponential rate. We explore the latest developments, trends, and innovations in the field, and how they will impact our lives, work, and the planet.
So, what does the future of AI hold? What are the potential risks and benefits? And how can we ensure that the benefits are realized and the risks are mitigated? These are just a few of the questions we'll be exploring in this series.

In this first episode, we're joined by

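The loop above consumes one stream at a time. Several streams can also be consumed concurrently and collected per prompt, which the following sketch illustrates with plain asyncio. `fake_token_stream` is a hypothetical stand-in for the async generator obtained with `stream=True`, and `collect` and `collect_all` are illustrative helpers. As with the printed output above, it assumes each chunk holds only newly generated text.

```python
import asyncio


async def fake_token_stream(prompt):
    """Hypothetical stand-in for the async generator returned with stream=True."""
    for piece in [" one", " two", " three"]:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield {"text": piece}


async def collect(prompt):
    """Consume one stream and return the prompt with its joined text."""
    pieces = [chunk["text"] async for chunk in fake_token_stream(prompt)]
    return prompt, "".join(pieces)


async def collect_all(prompts):
    # Each stream is consumed in its own task, so the streams interleave.
    pairs = await asyncio.gather(*(collect(p) for p in prompts))
    return dict(pairs)


texts = asyncio.run(collect_all(["A", "B"]))
```
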
In [6]:
llm.shutdown()
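
In a script, an exception raised before the final cell would skip `llm.shutdown()`. One way to guard against that is a small context manager that calls `shutdown()` on exit. The sketch below uses a hypothetical `StubEngine` so it runs anywhere; `managed` is an illustrative helper, not something SGLang is known to provide.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records shutdown calls."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompt, sampling_params=None):
        if not prompt:
            raise ValueError("empty prompt")
        return {"text": " ..."}

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed(engine):
    """Yield the engine and call shutdown() on exit, even after an error."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with managed(engine) as active_engine:
        active_engine.generate("")  # raises, simulating a failure mid-run
except ValueError:
    pass  # the engine has still been shut down at this point
```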