# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling that helps prevent out-of-memory (OOM) errors on large batches. For details on the cache-aware scheduling algorithm, see our [paper](https://arxiv.org/pdf/2312.07104).

In [1]:
# Launch the offline engine.

import asyncio

import sglang as sgl
from sglang.utils import print_highlight

# The model is loaded in-process; no HTTP server is started.
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

INFO 11-03 22:00:20 weight_utils.py:243] Using model weights format ['*.safetensors']


Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.29it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print_highlight("===============================")
    print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")
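
For offline jobs it is often handy to keep each prompt together with its completion. The sketch below assumes only what the cell above shows: `generate` takes a list of prompts plus a sampling-parameter dict and returns one dict with a `"text"` key per prompt. `FakeEngine` and `generate_records` are hypothetical names introduced for illustration; `FakeEngine` stands in for `sgl.Engine` so the snippet runs without a GPU.

```python
import json


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompts, sampling_params):
        # Mimics the shape used above: one dict with a "text" key per prompt.
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


def generate_records(engine, prompts, sampling_params):
    """Run one batched generate() call and pair each prompt with its text."""
    outputs = engine.generate(prompts, sampling_params)
    if len(outputs) != len(prompts):
        raise ValueError("expected one output per prompt")
    return [
        {"prompt": prompt, "text": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


records = generate_records(
    FakeEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)
# One JSON object per line is convenient for later offline analysis.
jsonl = "\n".join(json.dumps(record) for record in records)
print(jsonl)
```

With a real engine, passing `llm` instead of `FakeEngine()` should work the same way, provided the output shape matches the one shown above.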

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print_highlight(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()

Generated text: Emily,
I have a disability, I'm proud to say.
I use a wheelchair to get around,
It helps me to stay on my feet, no need to be found!
I love to play and have fun,
Just like everyone else, I've just begun.
I can play games, and read a book,
I can even do my homework, that's no big hook!
I have friends who like to play,
We have a blast, every single day.
We play games, and go to the park,
And sometimes we even go to the dark.
I like to watch TV, and listen to music too,
I like to

Generated text: home to many museums and galleries that showcase a diverse range of art and artifacts from around the world.
The Louvre Museum is one of the world's largest and most famous museums, with a collection that includes the Mona Lisa, Venus de Milo, and other masterpieces of European art. The museum's collection spans from ancient civilizations to the 19th century, with a focus on European painting, sculpture, and decorative arts.
The Orsay Museum is another world-renowned museum in Paris, featuring an impressive collection of Impressionist and Post-Impressionist art, including works by Monet, Renoir, and Van Gogh.

Generated text: looking bright – and personalized. The latest advancements in artificial intelligence are focused on creating tailored experiences for individuals, revolutionizing the way we interact with technology, and transforming industries from healthcare to finance.
One of the key drivers of this trend is the rise of explainable AI (XAI). XAI is a type of AI that provides transparent and interpretable results, allowing humans to understand how and why decisions were made. This transparency is essential for building trust in AI systems, particularly in applications where accuracy and fairness are critical.
Another area of innovation is the use of natural language processing (NLP) to create conversational interfaces that mimic human-like

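When the full completion is needed after streaming, the chunks can be accumulated while they are displayed. The sketch below assumes each chunk carries only newly generated text under `"text"`, as the streamed output above suggests; if your SGLang version yields cumulative text instead, the joining step would need to change. `fake_stream` and `collect_stream` are hypothetical names, and `fake_stream` stands in for `llm.generate(prompt, sampling_params, stream=True)`.

```python
def fake_stream(tokens):
    """Hypothetical stand-in for llm.generate(prompt, sampling_params, stream=True)."""
    for token in tokens:
        yield {"text": token}


def collect_stream(chunks, on_chunk=None):
    """Join incremental chunk texts; optionally call on_chunk for live display."""
    parts = []
    for chunk in chunks:
        text = chunk["text"]
        if on_chunk is not None:
            on_chunk(text)
        parts.append(text)
    return "".join(parts)


full_text = collect_stream(
    fake_stream([" Paris", ",", " a", " city"]),
    on_chunk=lambda text: print(text, end="", flush=True),
)
print()
```
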
### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print_highlight(f"\nPrompt: {prompt}")
        print_highlight(f"Generated text: {output['text']}")


asyncio.run(main())
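
Because `async_generate` is awaitable, several calls can be awaited together with `asyncio.gather`. The sketch below is an assumption-laden illustration, not a statement about how the engine schedules work internally: `FakeAsyncEngine` is a hypothetical stand-in for `sgl.Engine`, and `generate_groups` is a helper name introduced here.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in for sgl.Engine exposing async_generate()."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real engine call would
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def generate_groups(engine, groups, sampling_params):
    """Submit several prompt groups concurrently and keep results in order."""
    tasks = [engine.async_generate(group, sampling_params) for group in groups]
    # gather() returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*tasks)


groups = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(
    generate_groups(FakeAsyncEngine(), groups, {"temperature": 0.8, "top_p": 0.95})
)
for group, outputs in zip(groups, results):
    for prompt, output in zip(group, outputs):
        print(prompt, "->", output["text"])
```
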

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print_highlight(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())

Generated text: Robert and I’m a senior developer and team lead at a company that specializes in developing mobile and web applications. I have been working with PHP for over 10 years and have extensive experience with various frameworks such as Laravel, Symfony, and CodeIgniter. My expertise also includes MySQL, MongoDB, and PostgreSQL databases. I have worked with a variety of web services and APIs, including RESTful APIs, GraphQL, and SOAP. I'm excited to contribute to this community and help others with their PHP and related technologies.

### Top 5 Technologies Used

- PHP

- Laravel

- MySQL

- MongoDB

- PostgreSQL

### Favorite Technologies

Generated text: a city that has been known for centuries as a center of culture, art, fashion, and cuisine. Paris is also home to some of the world’s most famous landmarks, including the Eiffel Tower and Notre-Dame Cathedral. The city is situated on the Seine River and has a long history dating back to the Roman era. Paris is a popular destination for tourists and is known for its romantic atmosphere and rich cultural heritage.
As of 2021, the population of Paris is approximately 2.1 million people, and the metropolitan area has a population of around 12.2 million people. The city is divided into

Generated text: bright – and it's been forecasted to be very lucrative. In fact, a recent report by the International Data Corporation (IDC) predicts that the global AI market will reach a whopping $190 billion by 2025, growing at a compound annual growth rate (CAGR) of 38.3% from 2020 to 2025.
AI is transforming industries in various ways, from healthcare and finance to transportation and education. Here are some of the most promising AI applications and their potential impact on the job market:
1. Virtual assistants and chatbots: AI-powered virtual assistants and chatbots are becoming increasingly popular in customer

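The cell above consumes one stream at a time. A possible variation is to consume several streams concurrently and collect each completion as a string. The sketch below assumes the calling pattern shown above (awaiting `async_generate(..., stream=True)` returns an async generator of dicts with a `"text"` key) and that each chunk holds only new text. `FakeAsyncEngine` and `stream_to_text` are hypothetical names, and the fake tokens are placeholders rather than model output.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in for sgl.Engine with streaming async_generate()."""

    async def async_generate(self, prompt, sampling_params, stream=False):
        async def chunks():
            for token in [" bright", " and", " open"]:
                await asyncio.sleep(0)  # let other streams make progress
                yield {"text": token}

        return chunks()


async def stream_to_text(engine, prompt, sampling_params):
    """Consume one stream and return the concatenated text."""
    generator = await engine.async_generate(prompt, sampling_params, stream=True)
    parts = []
    async for chunk in generator:
        parts.append(chunk["text"])
    return "".join(parts)


async def main():
    engine = FakeAsyncEngine()
    params = {"temperature": 0.8, "top_p": 0.95}
    prompts = ["The capital of France is", "The future of AI is"]
    texts = await asyncio.gather(
        *(stream_to_text(engine, prompt, params) for prompt in prompts)
    )
    return dict(zip(prompts, texts))


result = asyncio.run(main())
for prompt, text in result.items():
    print(f"{prompt}{text}")
```
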
In [6]:
llm.shutdown()
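
In a script, as opposed to a notebook, it can help to guarantee that `shutdown()` runs even when generation raises. One way to sketch this is a small context manager built with `contextlib.contextmanager`. `managed_engine` and `FakeEngine` are hypothetical names introduced for illustration; `FakeEngine` stands in for `sgl.Engine` so the snippet runs without a GPU.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine that records whether it was shut down."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.closed = True


@contextmanager
def managed_engine(factory):
    """Create an engine via factory() and always call shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


with managed_engine(FakeEngine) as engine:
    outputs = engine.generate(["Hello, my name is"], {"temperature": 0.8})

print("engine closed:", engine.closed)
```

With a real engine, the factory could be something like `lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")`, reusing the constructor call from the first cell.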