# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where an extra HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
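
As a rough illustration of the custom-server idea, the sketch below wraps an engine-like object in a small JSON request handler. It is not the linked `custom_server.py` script. `handle_generate`, `StubEngine`, and the request/response shape are hypothetical names chosen for illustration; the only assumption about the engine is that `generate(prompts, sampling_params)` returns one dict with a `"text"` key per prompt, as in the batch example later in this document.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompts, sampling_params):
        # Mirrors the batch call shape used in this document: one dict per prompt.
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


def handle_generate(engine, request_body: str) -> str:
    """Turn a JSON request string into a JSON response string."""
    request = json.loads(request_body)
    prompts = request["prompts"]
    sampling_params = request.get("sampling_params", {})
    outputs = engine.generate(prompts, sampling_params)
    return json.dumps({"texts": [out["text"] for out in outputs]})


# Example request, as a web framework might hand it to the handler.
body = json.dumps(
    {
        "prompts": ["The capital of France is"],
        "sampling_params": {"temperature": 0.8, "top_p": 0.95},
    }
)
print(handle_generate(StubEngine(), body))
```

Keeping the handler independent of any web framework means the same function could sit behind whichever server library you prefer, with a real engine passed in place of the stub.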

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine

import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Sarah and I'm a proud owner of this beautiful cat. Her name is Lola and she's a sweetheart. I was wondering if anyone could help me with a problem I'm having with her.
Lola has been doing some things that are really stressing me out and I was hoping someone could give me some advice. She's been waking me up in the middle of the night and demanding attention. I've tried ignoring her and letting her go back to sleep, but she just gets louder and more persistent. If I try to pet her or give her treats, she'll calm down for a bit but then just start again a few hours later
Prompt: The president of the United States is
Generated text:  not a mere mortal. He or she is the leader of the free world and a symbol of American power and prestige. The office is one of great dignity and responsibility. However, as we have seen with the antics of Donald Trump, the president can also be a source of controversy and ridicule.
The office of the presidency is pro
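
For offline jobs it is often useful to persist results rather than print them. The following is a minimal sketch under stated assumptions: `run_batch`, `chunked`, and `fake_generate` are hypothetical helpers, and the generate callable is assumed to return one dict with a `"text"` key per prompt, matching the call above. Chunking here only serves to write results incrementally; the engine still does its own scheduling within each call.

```python
import io
import json
from typing import Callable, Iterable, List


def chunked(items: List[str], size: int) -> Iterable[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_batch(generate: Callable, prompts: List[str], sampling_params: dict,
              out_file, chunk_size: int = 2) -> int:
    """Generate in chunks, write one JSON line per prompt, return the line count."""
    written = 0
    for chunk in chunked(prompts, chunk_size):
        outputs = generate(chunk, sampling_params)
        for prompt, output in zip(chunk, outputs):
            record = {"prompt": prompt, "text": output["text"]}
            out_file.write(json.dumps(record) + "\n")
            written += 1
    return written


def fake_generate(prompts, sampling_params):
    # Hypothetical stub; with a real engine you would pass llm.generate instead.
    return [{"text": " ..."} for _ in prompts]


buffer = io.StringIO()  # stands in for an open file
count = run_batch(
    fake_generate,
    ["Hello, my name is", "The capital of France is", "The future of AI is"],
    {"temperature": 0.8, "top_p": 0.95},
    buffer,
)
print(count)
```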

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()


=== Testing synchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  George. I have been a professional dancer for over 15 years, teaching ballet, contemporary, and modern dance to students of all ages. My goal is to make learning dance fun and accessible to everyone. I am certified in the Royal Academy of Dance (RAD) syllabus, and I am a member of the National Dance Education Organization (NDEO). I have taught students in both group and private settings, and I have choreographed for various dance companies and productions. In addition to my teaching and choreographing experience, I have also worked as a freelance dancer and have performed with several dance companies and on stage productions.
I am

Prompt: The capital of France is
Generated text:  Paris, but have you ever been to the city of Aix-en-Provence? Aix-en-Provence is a charming and vibrant city in the Provence-Alpes-Côte d'Azur region of southern France. It's a popular tourist destination known for its rich history, stunning architecture, and beautiful surroundings. Here are 10 interesting facts about Aix-en-Provence:
1. Aix-en-Provence was an ancient Roman settlement

The city was founded by the Romans in the 1st century BC and was an important center for trade and commerce. The Romans built the city's foundation, which is

Prompt: The future of AI is
Generated text:  being written in neural networks, which are modeled after the structure and function of the human brain. Neural networks are made up of layers of interconnected nodes or “neurons” that process and transmit information. These networks are trained on large datasets to learn patterns and relationships, and they have achieved state-of-the-art performance in many areas, including computer vision, natural language processing, and speech recognition.
However, traditional neural networks have some limitations. They require large amounts of data to train, can be slow to converge, and may not generalize well to new, unseen data. To address these limitations, researchers have been exploring new types of neural networks,

### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Hello, my name is
Generated text:  Kaitlyn. I'm a 20-year-old graduate student pursuing my Master's degree in Speech-Language Pathology. I'm so glad to be a part of the Speech Therapy Helper community and I'm excited to share my knowledge and experience with you all.
As a future speech-language pathologist, I'm passionate about helping individuals of all ages communicate more effectively and confidently. I believe that communication is a fundamental aspect of our humanity, and that every individual deserves the opportunity to express themselves in a way that is clear and meaningful.

Throughout my undergraduate studies, I had the opportunity to work with children and adults with a variety of communication disorders,

Prompt: The capital of France is
Generated text:  Paris. France is a unitary semi-presidential constitutional republic. The French presidential election was held on 22 April 2012. French presidential election, 2012 (first round), François Hollande (PS) 28.63% Nico
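
When requests arrive independently rather than as one list, a common pattern is to issue them concurrently and cap how many are in flight. The sketch below illustrates that pattern with plain `asyncio`. `StubAsyncEngine` and `generate_all` are hypothetical names, and the stub assumes that `async_generate` called with a single prompt returns a single dict with a `"text"` key; verify that against the engine version you use.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in exposing an async_generate coroutine."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # simulate generation latency
        return {"text": f" <completion for {prompt!r}>"}


async def generate_all(engine, prompts, sampling_params, max_concurrency=2):
    """Issue one request per prompt with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(prompt):
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # gather preserves input order, so results line up with prompts.
    return await asyncio.gather(*(one(p) for p in prompts))


results = asyncio.run(
    generate_all(StubAsyncEngine(), ["a", "b", "c"], {"temperature": 0.8})
)
for result in results:
    print(result["text"])
```

Passing the whole list in one call, as in the cell above, is simpler when all prompts are known up front; the semaphore variant is mainly useful when each caller has its own prompt.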

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())


=== Testing asynchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Tammy. I have been a licensed insurance agent for over 30 years. I specialize in providing personal and business insurance solutions to my clients. I work with a variety of insurance companies to find the best coverage at the best price for my clients. My goal is to provide excellent customer service and to build long-term relationships with my clients.
I am licensed to sell insurance in multiple states and have experience working with a variety of industries, including small businesses, non-profits, and individuals. I am committed to staying up-to-date on the latest insurance trends and technologies to ensure that my clients have access to the best insurance solutions available.
If

Prompt: The capital of France is
Generated text:  Paris and the largest city is Lyon. France is part of the European Union and has a population of 67 million.
France is a beautiful country known for its stunning landscapes, art, fashion and rich history. From the famous Eiffel Tower in Paris to the stunning beaches of the French Riviera, France has a lot to offer to visitors and residents alike.
The official language of France is French and the currency is the Euro. The climate in France varies from region to region, with the north being cooler and the south being warmer.
There are many things to do and see in France, including visiting the famous Louvre Museum, exploring

Prompt: The future of AI is
Generated text:  human-centered, and it’s driven by the increasing need for augmentation, not replacement. AI is no longer a tool of efficiency but a means of enabling new experiences, enhancing human capabilities, and shaping a more inclusive and compassionate society. AI, as a human-centered technology, is a reflection of our values, and it’s our responsibility to ensure that it aligns with our collective aspirations for a better world.
The Future of AI: Human-Centered and Augmentative

The future of AI is human-centered, focusing on augmentation, not replacement. It's driven by the need for new experiences, enhancing human capabilities, and shaping a more inclusive

In [6]:
llm.shutdown()
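
In a script, an exception raised during generation can skip the final `shutdown()` call. One way to guard against that, sketched here with a hypothetical `engine_session` helper and a `StubEngine` stand-in, is to wrap the engine in a context manager that always calls `shutdown()` on exit. The only assumption about the real engine is the `shutdown()` method shown above.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(make_engine):
    """Create an engine, yield it, and always call shutdown() afterwards."""
    engine = make_engine()
    try:
        yield engine
    finally:
        engine.shutdown()


class StubEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


try:
    with engine_session(StubEngine) as engine:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(engine.closed)  # shutdown() ran despite the exception
```

With the real engine, `make_engine` could be a zero-argument callable such as `lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")`.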