# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
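
As a rough illustration of the custom-server use case, the sketch below wraps an engine-like object in a plain request handler. It is not the linked `custom_server.py`: `StubEngine` and `handle_generate` are hypothetical names, and the stub only imitates the `generate(prompt, sampling_params)` call shape and the `{"text": ...}` result used later in this document. With a real `sgl.Engine` you would plug the handler into the web framework of your choice.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompt, sampling_params=None):
        # A real engine would run the model; the stub just echoes the prompt.
        return {"text": f" <completion for: {prompt}>"}


def handle_generate(engine, request_body: str) -> str:
    """Parse a JSON request, call the engine, and return a JSON response."""
    try:
        payload = json.loads(request_body)
        prompt = payload["prompt"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return json.dumps({"error": "expected a JSON object with a 'prompt' field"})

    sampling_params = payload.get("sampling_params", {})
    output = engine.generate(prompt, sampling_params)
    return json.dumps({"prompt": prompt, "text": output["text"]})


if __name__ == "__main__":
    body = json.dumps({"prompt": "The capital of France is"})
    print(handle_generate(StubEngine(), body))
```

Keeping the handler independent of any HTTP framework makes it easy to test without a running server.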

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine

import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Nicole and I am a 4th grade teacher at a public elementary school. I have been teaching for 10 years and I absolutely love my job! I teach a range of subjects including reading, writing, social studies, and science.
I am excited to be a part of this community and look forward to sharing my teaching experiences and tips with you. I am passionate about creating engaging and fun lesson plans that cater to the needs of all learners. In this blog, I will be sharing my favorite lesson plans, classroom management ideas, and educational technology tips.
I also love connecting with other teachers and learning from their experiences, so please don
Prompt: The president of the United States is
Generated text:  elected to serve as the head of state and the head of government for a four-year term. The president is also the commander-in-chief of the armed forces and is responsible for setting national policy. The president is elected through the Electoral C
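
The cell above pairs each prompt with its output by position. If you want to keep the results rather than only print them, the pairs can be gathered into records. This is a minimal sketch: `collect_results` is a hypothetical helper, and the sample `outputs` list is hand-written to mimic the `{"text": ...}` dictionaries above, not produced by a model.

```python
def collect_results(prompts, outputs):
    """Pair each prompt with its generated text, failing loudly on a length mismatch."""
    if len(prompts) != len(outputs):
        raise ValueError(
            f"got {len(outputs)} outputs for {len(prompts)} prompts"
        )
    return [
        {"prompt": prompt, "text": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


# Hand-written stand-ins for engine outputs (not real model generations).
sample_prompts = ["Hello, my name is", "The capital of France is"]
sample_outputs = [{"text": " Ada."}, {"text": " Paris."}]

for record in collect_results(sample_prompts, sample_outputs):
    print(record["prompt"] + record["text"])
```

The explicit length check guards against `zip` silently dropping items when the two lists differ in length.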

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()


=== Testing synchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Dan and I am a 5th grader at John Adams Middle School in San Carlos, CA. I am very excited to be a part of the Kidblog program and I look forward to sharing my thoughts and ideas with all of you. My favorite subjects are math and science and I enjoy learning about space and the environment. In my free time, I love to play basketball, ride my bike, and go to the beach.
I like learning about space because it is so vast and mysterious. There are still so many things we don't know about it. I think it's cool that we can explore it with robots and satellites.

Prompt: The capital of France is
Generated text:  in the north-east of the country, located on the Seine River. It is one of the most beautiful and romantic cities in the world, with a rich history dating back over 2,000 years. Paris is famous for its iconic landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum, which is home to the Mona Lisa.
In addition to its rich history and cultural landmarks, Paris is also known for its fashion, cuisine, and art scene. The city is home to many world-renowned fashion designers, including Chanel and Dior, and its culinary scene is renowned for its haute cuisine and

Prompt: The future of AI is
Generated text:  bright, but it’s also uncertain. The possibilities are endless, and the potential impact is vast. Artificial intelligence (AI) has the potential to revolutionize numerous industries, transform the way we live and work, and bring about unprecedented levels of efficiency, productivity, and innovation. However, the future of AI is also shrouded in uncertainty, with many unknowns and challenges to be addressed.
One of the biggest challenges facing AI is the issue of bias and fairness. AI systems are only as good as the data they are trained on, and if that data is biased or incomplete, the AI system will reflect those biases. This can lead
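
In the run above, each chunk's `text` field holds only the newly generated piece, so the full completion can be rebuilt by concatenating the chunks. The sketch below assumes that incremental behaviour, which may differ across SGLang versions, so check what the chunks contain in your installation. `fake_stream` is a hypothetical generator standing in for `llm.generate(prompt, sampling_params, stream=True)`.

```python
def join_stream(chunks):
    """Concatenate the 'text' field of each streamed chunk into one string."""
    return "".join(chunk["text"] for chunk in chunks)


def fake_stream():
    """Hypothetical stand-in for a streaming generate call."""
    for piece in [" bright", ",", " but", " it", "'s", " also", " uncertain", "."]:
        yield {"text": piece}


full_text = join_stream(fake_stream())
print(full_text)
```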




### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Hello, my name is
Generated text:  Saeed and I am a software engineer with a passion for design and user experience. I am also the founder and lead developer of several applications, including our company’s flagship app, which is a leading productivity app in the Middle East. I enjoy creating clean and intuitive designs, as well as writing efficient and scalable code. My work experience ranges from freelancing to leading a team of developers, and I have a strong background in both design and development. I am also a certified Scrum Master and Agile practitioner, with experience in implementing Agile methodologies in different projects.
I have a strong background in design and development, and I am well-

Prompt: The capital of France is
Generated text: , of course, Paris, but many people know that it was actually established by the Gauls, a tribe of Celtic people. They called it Lutetia and it was a small fortified settlement. The Romans, who conquered the Gauls in the 1st cen
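
One reason to prefer the asynchronous API is that other coroutines can make progress while a request is pending. The sketch below shows that pattern with `asyncio.gather`. `StubAsyncEngine` is a hypothetical stand-in whose `async_generate` only mimics the call shape used above; whether issuing several calls at once suits your workload with a real engine is an assumption you should verify.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine; it sleeps instead of running a model."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate waiting for generation
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(engine, prompts, sampling_params):
    """Issue one request per prompt and await them together.

    asyncio.gather returns results in the order the awaitables were passed,
    so the results line up with the prompts.
    """
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


async def demo():
    prompts = ["Hello, my name is", "The capital of France is"]
    outputs = await generate_concurrently(
        StubAsyncEngine(), prompts, {"temperature": 0.8, "top_p": 0.95}
    )
    for prompt, output in zip(prompts, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
    return outputs


if __name__ == "__main__":
    asyncio.run(demo())
```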

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())


=== Testing asynchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Patrice, and I am a modern-day Jewish mystic and spiritual teacher. I am passionate about helping people connect with their own inner wisdom and deepen their spiritual practice. My approach is holistic and inclusive, drawing on Jewish mysticism, mindfulness, and somatics to help people cultivate greater awareness, compassion, and joy in their lives. I offer workshops, retreats, and private coaching to help people deepen their spiritual practice and live more authentically and meaningfully. I also lead meditation and yoga classes, and write articles and blogs on spiritual topics.
I believe that every person has a unique inner wisdom and that it is our birthright

Prompt: The capital of France is
Generated text:  known for its beautiful history, art, and architecture. Paris, the City of Light, is a popular destination for travelers and art lovers alike. With its charming streets, picturesque bridges, and iconic landmarks like the Eiffel Tower, Paris is a must-visit destination for anyone interested in history, culture, and art.
If you're planning to visit Paris, consider the following top attractions:
The Eiffel Tower - This iconic iron lattice tower is a symbol of Paris and one of the most recognizable landmarks in the world. Visitors can take the elevator to the top for breathtaking views of the city.
The Louvre Museum - The Lou

Prompt: The future of AI is
Generated text:  bright, but it's also fraught with challenges. One of the biggest challenges facing AI researchers is ensuring that AI systems are transparent and explainable.
There are many different definitions of transparency and explainability in AI, but at a high level, they both refer to the ability to understand how an AI system makes its decisions and predictions. This is important for a number of reasons, including:
1. Trust: If we can't understand how an AI system works, how can we trust it to make decisions that are in our best interests?
2. Accountability: If an AI system makes a mistake, how can we hold it accountable if we don
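
To keep the streamed text instead of only printing it, the chunks can be collected inside the `async for` loop. This is a minimal sketch, assuming incremental chunks as in the output above. `fake_async_stream` is a hypothetical async generator standing in for the object returned by `await llm.async_generate(prompt, sampling_params, stream=True)`.

```python
import asyncio


async def fake_async_stream():
    """Hypothetical stand-in for the async generator returned when streaming."""
    for piece in [" Paris", ",", " the", " City", " of", " Light", "."]:
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield {"text": piece}


async def collect_stream(stream):
    """Accumulate streamed chunk text and return the full completion."""
    pieces = []
    async for chunk in stream:
        pieces.append(chunk["text"])
    return "".join(pieces)


if __name__ == "__main__":
    print(asyncio.run(collect_stream(fake_async_stream())))
```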




In [6]:
llm.shutdown()
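
In a standalone script, it can help to guarantee that `shutdown()` runs even if generation raises an exception. One way to sketch this is a small context manager. `engine_session` and `StubEngine` are hypothetical names introduced here for illustration; only the `shutdown()` call comes from the cell above.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records whether shutdown ran."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(engine):
    """Yield the engine and always call shutdown() on exit, even after an error."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with engine_session(engine) as llm_like:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(engine.is_shut_down)
```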