# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where an HTTP server would add unnecessary complexity or overhead. There are two common use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
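
As a rough illustration of the custom-server use case, the sketch below wraps an engine-like object in a plain async request handler. It is not the linked `custom_server.py`: `StubEngine`, `handle_generate`, and the request fields are hypothetical names, and the stub only mimics the `async_generate(prompt, sampling_params)` call used later in this document so that the snippet runs without a model or GPU.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; it echoes the prompt instead of running a model."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0)  # yield control, as a real engine call would
        return {"text": f" <completion for: {prompt}>"}


async def handle_generate(engine, request):
    """Validate a request dict, call the engine, and build a response dict."""
    prompt = request.get("prompt")
    if not isinstance(prompt, str) or not prompt:
        return {"status": 400, "error": "'prompt' must be a non-empty string"}
    # Assumed default; these values match the sampling parameters used in this document.
    sampling_params = request.get("sampling_params", {"temperature": 0.8, "top_p": 0.95})
    output = await engine.async_generate(prompt, sampling_params)
    return {"status": 200, "text": output["text"]}


response = asyncio.run(handle_generate(StubEngine(), {"prompt": "The capital of France is"}))
print(response)
```

In a real server, a handler like this would be registered with whatever web framework you use, and the stub would be replaced by the engine instance.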

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling that helps prevent out-of-memory (OOM) errors on large batches. For details on the cache-aware scheduling algorithm, see our [paper](https://arxiv.org/pdf/2312.07104).

In [1]:
# Launch the offline engine (this loads the model weights).

import asyncio

import sglang as sgl
from sglang.utils import print_highlight

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# Passing a list of prompts returns one output dict per prompt, in the same order.
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print_highlight("===============================")
    print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")
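
For offline batches you will often want to save the results rather than print them. The sketch below pairs prompts with outputs and serializes them as JSON Lines. It assumes only what the cell above shows, namely that each output is a dict with a `"text"` key; `to_records` and `stub_outputs` are hypothetical names, and the stub list stands in for the value returned by `llm.generate(prompts, sampling_params)`.

```python
import json


def to_records(prompts, outputs):
    """Pair each prompt with its generated text, checking that the lengths match."""
    if len(prompts) != len(outputs):
        raise ValueError(f"expected {len(prompts)} outputs, got {len(outputs)}")
    return [{"prompt": p, "text": o["text"]} for p, o in zip(prompts, outputs)]


# Stand-in for the list returned by llm.generate(prompts, sampling_params).
stub_outputs = [{"text": " Alice."}, {"text": " Paris."}]
records = to_records(["Hello, my name is", "The capital of France is"], stub_outputs)

# One JSON object per line (JSONL) is convenient for large offline batches.
jsonl = "\n".join(json.dumps(r) for r in records)
print(jsonl)
```

The explicit length check guards against `zip` silently dropping items if the two lists ever get out of step.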

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print_highlight(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    # With stream=True, generate() returns an iterator that yields chunks as they are produced.
    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()

Generated text: Arthur Lubow. I am an assistant professor of classics at the University of Chicago. I am a philologist, that is to say, a classical philologist, with special interests in ancient Greek language and literature. My research focuses on the transmission of texts from ancient Greek into Latin, and the cultural and historical context of literary and poetic creation in the Hellenistic period, which spans from the death of Alexander the Great to the rise of the Roman Empire. My book, The Story of a Poem by Two Poets: Callimachus and Catullus on the Origins of Callimachus' Hymn to Apollo

Generated text: known for its rich history, stunning architecture, and world-class museums, but it's also a hub for art, fashion, and food. Here are some things to do and see in Paris, from the Eiffel Tower to the Catacombs:

The Eiffel Tower (Tour Eiffel)

No trip to Paris would be complete without a visit to the iconic Eiffel Tower. Built for the 1889 World's Fair, it's an engineering marvel and a symbol of the city. You can take the stairs or elevator to the top for breathtaking views of the city.

The Louvre Museum (Musée du Lou

Generated text: not a zero-sum game, where progress in one area necessarily means a decrease in another. Rather, AI has the potential to positively impact multiple areas simultaneously, and we're already seeing this in various industries and applications.

For instance, AI-powered healthcare systems can help diagnose diseases more accurately and quickly, while also reducing costs and improving patient outcomes. Similarly, AI-driven transportation systems can optimize routes, reduce congestion, and improve safety, while also creating new job opportunities in the sector.

Moreover, AI has the potential to enable more sustainable and environmentally-friendly practices across various industries. For example, AI can help optimize energy consumption, reduce waste, and improve
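
If you need the complete text after streaming finishes, you can accumulate the chunks as they arrive. The sketch below assumes each chunk's `"text"` holds only the newly generated piece, as the output of the cell above suggests. `collect_stream` and `fake_stream` are hypothetical names, and the `cumulative` flag is an assumed switch for setups where each chunk carries the full text so far.

```python
def collect_stream(chunks, cumulative=False):
    """Join streamed chunks into one string.

    cumulative=False: each chunk carries only new text, so append it.
    cumulative=True:  each chunk carries all text so far, so keep the latest.
    """
    full_text = ""
    for chunk in chunks:
        if cumulative:
            full_text = chunk["text"]
        else:
            full_text += chunk["text"]
    return full_text


def fake_stream():
    """Stand-in for llm.generate(prompt, sampling_params, stream=True)."""
    for piece in [" Paris", ".", " Paris", " is"]:
        yield {"text": piece}


text = collect_stream(fake_stream())
print(repr(text))
```

The mode is an explicit argument rather than something inferred from the chunks, because a repeated token would make the two cases indistinguishable.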




### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous batch generation ===")


async def main():
    # async_generate() is a coroutine; awaiting it returns one output dict per prompt.
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print_highlight(f"\nPrompt: {prompt}")
        print_highlight(f"Generated text: {output['text']}")


asyncio.run(main())
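
The asynchronous API is most useful when several requests are in flight at once. The sketch below submits multiple batches concurrently with `asyncio.gather`. `StubEngine` and `run_batches` are hypothetical names; the stub only imitates the `async_generate(prompts, sampling_params)` call shown above so that the snippet runs without a model, and it makes no claim about how the real engine schedules concurrent calls.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that needs no model or GPU."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate generation latency
        return [{"text": f" <completion {i}>"} for i, _ in enumerate(prompts)]


async def run_batches(engine, batches, sampling_params):
    """Submit every batch concurrently and return results in submission order."""
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(run_batches(StubEngine(), batches, {"temperature": 0.8, "top_p": 0.95}))
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```

`asyncio.gather` returns results in the order the awaitables were passed, so each result list lines up with its batch.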

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print_highlight(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Awaiting async_generate(..., stream=True) returns an async generator of chunks.
        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())

Generated text: Tamara and I'm 28 years old. I'm a marketing professional with 5+ years of experience in developing marketing strategies for various industries, including retail, healthcare, and finance. I'm a creative problem solver, a strategic thinker, and a passionate storyteller. I'm excited to connect with like-minded professionals and collaborate on innovative marketing projects. Let's get creative together!

What are you passionate about? What drives you? Let's start a conversation!

Feel free to reach out to me if you need any marketing advice or just want to chat about the latest marketing trends. I'm always up for a good conversation!

Generated text: Paris. Paris is the largest city in France and is the country's political and economic center. Paris is a major cultural, historical, and economic center. Paris is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. The city is also known for its fashion industry, with many famous fashion designers and brands having their headquarters there. Paris is a popular tourist destination, attracting millions of visitors each year.

France has a diverse geography, with mountains, rivers, and coastlines along the Atlantic and the Mediterranean. The country has several regions, each with its own unique culture, language

Generated text: human

The future of AI is human

The current boom in artificial intelligence (AI) is often described as an "AI revolution." But the truth is that the future of AI is not about replacing humans, but about working with them.

AI has made tremendous progress in recent years, and it's now possible to automate many tasks that were previously done by humans. However, AI systems are not yet able to replicate the complexity and nuance of human decision-making.

In fact, a recent survey by the Pew Research Center found that 67% of experts believe that AI will augment human capabilities, while only 24% think it will replace
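
The cell above streams one prompt at a time. If interleaved output is acceptable, several streams can be consumed concurrently. The sketch below shows only the asyncio pattern: `fake_async_stream` and `collect` are hypothetical names, and the fake generator stands in for the async generator obtained from `await llm.async_generate(prompt, sampling_params, stream=True)`.

```python
import asyncio


async def fake_async_stream(prompt):
    """Stand-in for the async generator returned by async_generate(..., stream=True)."""
    for piece in [" one", " two", " three"]:
        await asyncio.sleep(0)  # let other streams make progress
        yield {"text": piece}


async def collect(prompt):
    """Consume one stream and return the prompt with its full generated text."""
    pieces = []
    async for chunk in fake_async_stream(prompt):
        pieces.append(chunk["text"])
    return prompt, "".join(pieces)


async def main():
    prompts = ["Hello, my name is", "The capital of France is"]
    # gather() interleaves the streams instead of finishing one prompt before starting the next.
    return await asyncio.gather(*(collect(p) for p in prompts))


streamed = asyncio.run(main())
print(streamed)
```

Each stream is buffered per prompt here, which keeps the printed text readable even though chunks from different prompts arrive interleaved.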




In [6]:
llm.shutdown()
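
In a script, it can help to guarantee that `shutdown()` runs even when generation fails partway through. One way to sketch this is a small context manager. `managed_engine` and `StubEngine` are hypothetical names; the stub only mimics the `generate` and `shutdown` methods used in this document and deliberately raises to show the cleanup path.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records whether shutdown() ran."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params=None):
        raise RuntimeError("simulated failure during generation")

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(factory):
    """Create an engine and always call shutdown(), even if the body raises."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


try:
    with managed_engine(StubEngine) as llm:
        engine_ref = llm  # keep a reference so the state can be inspected afterwards
        llm.generate(["Hello, my name is"])
except RuntimeError as exc:
    print(f"generation failed: {exc}")

print(f"engine shut down: {engine_ref.is_shut_down}")
```

With a real engine, the factory would be something like `lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")`.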