# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
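
To give a rough idea of what "a custom server on top of the engine" can look like, here is a minimal sketch that uses only the Python standard library. It is not the linked `custom_server.py` example. `fake_generate` is a hypothetical stand-in for `llm.generate`, and the `/generate` path, the JSON field names, and the helper names are all assumptions made for illustration.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


def fake_generate(prompt, sampling_params):
    # Hypothetical stand-in for `llm.generate`; same return shape ({"text": ...}).
    return {"text": f" (echo) {prompt}"}


def make_handler(generate_fn):
    """Build a request handler class that forwards POST bodies to generate_fn."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            body = json.loads(self.rfile.read(length))
            result = generate_fn(body["prompt"], body.get("sampling_params", {}))
            payload = json.dumps({"text": result["text"]}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, format, *args):
            pass  # silence per-request logging for the demo

    return GenerateHandler


def post_prompt(port, prompt):
    """Send one prompt to the local server and return the decoded JSON reply."""
    conn = HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request(
            "POST",
            "/generate",
            body=json.dumps({"prompt": prompt}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        return json.loads(conn.getresponse().read())
    finally:
        conn.close()


def demo():
    # Port 0 asks the OS for any free port.
    server = HTTPServer(("127.0.0.1", 0), make_handler(fake_generate))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return post_prompt(server.server_address[1], "Hello")
    finally:
        server.shutdown()
        server.server_close()


print(demo())
```

The handler only depends on a callable with the shape `generate_fn(prompt, sampling_params)`, so the stand-in can be replaced without touching the HTTP code. How a real engine behaves when called from a server thread is outside the scope of this sketch.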

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine

import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.49it/s]

100%|██████████| 23/23 [00:07<00:00,  3.20it/s]

### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Emily and I am the owner and primary instructor of Bloom Yoga Studio. I am passionate about sharing the benefits of yoga with my community and helping each student develop a personal practice that nourishes their body, mind, and spirit.
I have been practicing yoga for over 10 years and have completed numerous training programs to become a certified yoga instructor. I hold a 500-hour certification in Hatha and Vinyasa Yoga through the Yoga Alliance. My teaching style is compassionate, clear, and adaptable to meet the needs of each student.
I believe that yoga is a journey, not a destination, and that every body is capable of growth and transformation
Prompt: The president of the United States is
Generated text:  the head of state and head of government of the United States. The president serves a four-year term and is elected by the people through the Electoral College. The president is responsible for appointing federal judges, ambassadors, an
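
For offline batch jobs it is often convenient to keep each prompt together with its completion, for example to write the results to a JSONL file. The following is a minimal sketch, assuming (as in the cell above) that `llm.generate` returns one dict per prompt with a `text` key. `to_records`, `to_jsonl`, and the hard-coded `outputs` list are illustrative stand-ins, not part of SGLang.

```python
import json


def to_records(prompts, outputs):
    """Pair each prompt with the generated text of the matching output."""
    if len(prompts) != len(outputs):
        raise ValueError("expected exactly one output per prompt")
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


def to_jsonl(records):
    """Serialize records as JSON Lines, one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


# Stand-in for `outputs = llm.generate(prompts, sampling_params)`.
prompts = ["Hello, my name is", "The capital of France is"]
outputs = [{"text": " Emily."}, {"text": " Paris."}]

records = to_records(prompts, outputs)
print(to_jsonl(records))
```

The length check guards against silently misaligned pairs, which `zip` alone would hide by truncating to the shorter list.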

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()


=== Testing synchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Wendy, and I am a home cook. I love trying out new recipes and experimenting with different flavors and ingredients. My passion for cooking started when I was a child, helping my mother in the kitchen and learning the basics of cooking. As I grew older, I began to explore more complex recipes and techniques, and I have been hooked on cooking ever since.
When I’m not cooking, you can find me spending time with my family and friends, practicing yoga, or taking long walks in nature. I believe that cooking is not just about following a recipe, but about creating a sense of community and bringing people together through food.
I’m



Prompt: The capital of France is
Generated text:  in lockdown due to rising tensions between pro- and anti-fascist groups, with police in riot gear patrolling the streets and making arrests. The city has been plagued by clashes between rival factions in recent months, with both sides vowing to take action against each other.
As the situation continues to escalate, the government has deployed more troops to the city to help maintain order. The situation is becoming increasingly volatile, with both sides using increasingly violent tactics.
The city's mayor has called for calm and urged citizens to stay indoors and avoid any areas of conflict. However, many are choosing to take to the streets to demonstrate their support for one



Prompt: The future of AI is
Generated text:  a topic of much debate and discussion. While some people believe that AI will bring about a utopian future, others fear that it will lead to a dystopian nightmare. In this article, we will explore some of the possible futures of AI and the potential implications of its development.
The Utopian Future:
In this scenario, AI is used to solve some of the world's most pressing problems, such as poverty, hunger, and disease. AI systems are used to optimize resource allocation, predict and prevent natural disasters, and develop new treatments for diseases. This leads to a significant improvement in the quality of life for people around the world.
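
When streaming, you often want both to display text as it arrives and to keep the complete result. The sketch below shows one way to do that. It assumes, as the output above suggests, that each chunk carries only the newly generated text under `text`; if your SGLang version instead reports the full text so far in every chunk, keep only the last chunk rather than joining them. `collect_stream` and `fake_stream` are hypothetical helpers, with `fake_stream` standing in for `llm.generate(prompt, sampling_params, stream=True)`.

```python
def collect_stream(chunks, on_text=None):
    """Consume a stream of chunk dicts and return the concatenated text.

    Assumes each chunk carries only newly generated text under "text".
    """
    parts = []
    for chunk in chunks:
        text = chunk["text"]
        if on_text is not None:
            on_text(text)  # e.g. print the piece as soon as it arrives
        parts.append(text)
    return "".join(parts)


def fake_stream():
    # Stand-in for `llm.generate(prompt, sampling_params, stream=True)`.
    for piece in [" Wendy", ",", " and", " I", " am", " a", " home", " cook", "."]:
        yield {"text": piece}


seen = []
full_text = collect_stream(fake_stream(), on_text=seen.append)
print(full_text)
```

Passing `on_text=lambda t: print(t, end="", flush=True)` would reproduce the live printing of the cell above while still returning the full string.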




### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Hello, my name is
Generated text:  Jaycee and I am a proud and devoted daughter, sister, and friend. I am a strong believer in the importance of family and the value of every individual within it. I am passionate about making a difference in the lives of others, whether it be through volunteering, mentoring, or simply being a supportive and caring presence in someone's life. I am a bit of a hopeless romantic at heart and believe that love and kindness can conquer even the toughest of challenges.
I am a bit of a goofball and enjoy making people laugh. I am also a bit of a bookworm and love getting lost in a good novel or learning new things

Prompt: The capital of France is
Generated text: , of course, Paris. But have you ever wondered what makes this city so special? Here are some of the reasons why Paris is the City of Light.
One of the most famous landmarks in Paris is the Eiffel Tower. This iconic iron structure was built for the 1889 World’s Fair and stands 324 meters tall
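
The cell above passes the whole prompt list in a single call. When requests arrive independently instead (for example from a custom server), one option is to issue one awaitable call per prompt and cap how many are in flight. This is a sketch under stated assumptions: `FakeEngine` is a hypothetical stand-in that only mimics the call shape `await llm.async_generate(prompt, sampling_params)`, and `generate_many` and `max_in_flight` are illustrative names. Whether a client-side cap is useful with the real engine's scheduler is not something this document establishes.

```python
import asyncio


class FakeEngine:
    """Stand-in with the same call shape as `llm.async_generate` (no streaming)."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # simulate generation latency
        return {"text": f" <completion for {prompt!r}>"}


async def generate_many(engine, prompts, sampling_params, max_in_flight=2):
    """Run one request per prompt, with at most `max_in_flight` running at once."""
    limiter = asyncio.Semaphore(max_in_flight)

    async def one(prompt):
        async with limiter:
            return await engine.async_generate(prompt, sampling_params)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(one(p) for p in prompts))


prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
results = asyncio.run(generate_many(FakeEngine(), prompts, {"temperature": 0.8}))
for prompt, result in zip(prompts, results):
    print(prompt, "->", result["text"])
```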

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())


=== Testing asynchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Diana and I am a 26 year old from a small town in Brazil. I have been living in Rio de Janeiro for a few months now, and I am looking to make some new connections.
I am a very outgoing person, I love to try new things and meet new people. When I am not working, you can find me dancing to samba music or trying out new restaurants and cafes.
I am looking for people who are also looking for new experiences and connections. If you are a fellow foodie, music lover or simply someone who loves to try new things, we would get along great!
I am also interested in practicing



Prompt: The capital of France is
Generated text:  known for its iconic landmarks, beautiful gardens, and rich history. But, there's more to Paris than meets the eye. Let's explore the city's hidden gems and off-the-beaten-path attractions.
1. Musée de la Vie Romantique

Located in a charming 19th-century townhouse, this museum is dedicated to the art and literature of the Romantic era. Its collection includes paintings, sculptures, and decorative arts that reflect the era's emphasis on emotion and individualism.
2. Le Jardin des Plantes

While many visitors flock to the Luxembourg Gardens, Le Jardin des Plantes is a lesser-known gem



Prompt: The future of AI is
Generated text:  bright, but it is also fraught with risks and challenges. AI has the potential to revolutionize many aspects of our lives, from healthcare to finance to transportation. However, it also raises important questions about accountability, bias, and the impact on jobs. In this article, we will explore some of the key challenges and risks associated with AI, and what we can do to mitigate them.
One of the biggest challenges facing AI is the problem of bias. AI systems are only as good as the data they are trained on, and if that data is biased, the AI system will learn to replicate those biases. This can lead to AI systems that
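
The loop above consumes one stream at a time. Async generators can also be drained concurrently, which the following sketch illustrates. It assumes each chunk carries only new text under `text`, as the output above suggests. `fake_async_stream` is a hypothetical stand-in for the async generator returned by `await llm.async_generate(prompt, sampling_params, stream=True)`, and `collect_async_stream` is an illustrative helper.

```python
import asyncio


async def fake_async_stream(pieces):
    # Stand-in for the async generator returned by
    # `await llm.async_generate(prompt, sampling_params, stream=True)`.
    for piece in pieces:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield {"text": piece}


async def collect_async_stream(stream):
    """Join the text of all chunks from an async stream."""
    parts = []
    async for chunk in stream:
        parts.append(chunk["text"])
    return "".join(parts)


async def main():
    streams = [
        fake_async_stream([" Diana", " and", " I"]),
        fake_async_stream([" known", " for", " its"]),
    ]
    # Drain both streams concurrently; results keep the input order.
    return await asyncio.gather(*(collect_async_stream(s) for s in streams))


texts = asyncio.run(main())
print(texts)
```

Printing chunks from several streams at once would interleave their text, so this sketch collects each stream into its own string instead.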




In [6]:
llm.shutdown()
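
If an exception is raised before `llm.shutdown()` is reached, the engine is left running. One way to guard against that is a small context manager that always calls `shutdown()` on exit. This is a minimal sketch: `engine_session` is a hypothetical helper, not an SGLang API, and `FakeEngine` is a stand-in for `sgl.Engine` that only records whether `shutdown()` ran.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(make_engine):
    """Create an engine and always call its `shutdown()` on exit."""
    engine = make_engine()
    try:
        yield engine
    finally:
        engine.shutdown()


class FakeEngine:
    """Stand-in for `sgl.Engine`; only records whether shutdown() ran."""

    def __init__(self):
        self.closed = False

    def generate(self, prompt, sampling_params):
        return {"text": " ok"}

    def shutdown(self):
        self.closed = True


with engine_session(FakeEngine) as llm:
    result = llm.generate("Hello", {"temperature": 0.8})

print(result, llm.closed)
```

With the real engine, the factory argument could be something like `lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")`.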