# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
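
To make the idea of a custom server concrete, here is a minimal sketch that uses only the Python standard library. It is independent of the linked `custom_server` example and is not taken from it. The names `handle_generate`, `make_handler`, and `serve`, the JSON request shape (`prompts`, `sampling_params`), and the `fake_generate` stand-in are all assumptions made for illustration; in real use you would pass a callable that wraps `llm.generate` instead of the stand-in.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, List, Tuple

GenerateFn = Callable[[List[str], Dict], List[Dict[str, str]]]


def fake_generate(prompts: List[str], sampling_params: Dict) -> List[Dict[str, str]]:
    """Hypothetical stand-in for llm.generate: returns one dict per prompt."""
    return [{"text": f" <completion of {p!r}>"} for p in prompts]


def handle_generate(body: bytes, generate_fn: GenerateFn) -> Tuple[int, bytes]:
    """Turn a JSON request body into (status_code, JSON response body)."""
    try:
        payload = json.loads(body)
        prompts = payload["prompts"]
        sampling_params = payload.get("sampling_params", {})
    except (ValueError, KeyError, TypeError, AttributeError):
        error = {"error": "expected a JSON object with a 'prompts' list"}
        return 400, json.dumps(error).encode()
    outputs = generate_fn(prompts, sampling_params)
    return 200, json.dumps({"texts": [o["text"] for o in outputs]}).encode()


def make_handler(generate_fn: GenerateFn):
    """Build a request handler class bound to the given generate function."""

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            status, response = handle_generate(self.rfile.read(length), generate_fn)
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(response)))
            self.end_headers()
            self.wfile.write(response)

    return Handler


def serve(generate_fn: GenerateFn, host: str = "127.0.0.1", port: int = 8000) -> None:
    """Block forever, answering POST requests with generated text."""
    HTTPServer((host, port), make_handler(generate_fn)).serve_forever()
```

Calling `serve(fake_generate)` starts the server with the stand-in. Keeping the request logic in `handle_generate` lets you test it without opening a socket.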

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine (no HTTP server is started).

import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Katherine, and I am a 4th grade teacher at South Elementary. I am excited to have the opportunity to share my educational background and teaching philosophy with you. I have been teaching for 10 years, and I have always had a passion for creating engaging and interactive lessons that cater to different learning styles.
My educational background includes a Bachelor’s degree in Elementary Education from Western Governors University. I have also completed several graduate courses in Reading Education and Math Education. I have been certified to teach grades K-6 in the state of Michigan.
My teaching philosophy is centered around creating a positive and inclusive learning environment. I believe that every student learns
Prompt: The president of the United States is
Generated text:  the head of state and head of government of the United States of America. The president is indirectly elected by the people through the Electoral College, which was esta
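
For a prompt list much longer than the four prompts above, it can be convenient to submit the work in slices and keep each prompt paired with its output. The following is a minimal sketch of that pattern, not part of the SGLang API: `chunked`, `generate_in_batches`, and `fake_generate` are hypothetical names, and `fake_generate` only imitates the return shape seen above (a list of dicts with a `"text"` key). Whether slicing is needed at all depends on your dataset size, since the engine already schedules the prompts it is given.

```python
from typing import Callable, Dict, Iterator, List, Sequence

GenerateFn = Callable[[List[str], Dict[str, float]], List[Dict[str, str]]]


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start : start + size])


def generate_in_batches(
    generate_fn: GenerateFn,
    prompts: Sequence[str],
    sampling_params: Dict[str, float],
    batch_size: int,
) -> List[Dict[str, str]]:
    """Run `generate_fn` slice by slice and pair each prompt with its text."""
    results: List[Dict[str, str]] = []
    for batch in chunked(prompts, batch_size):
        outputs = generate_fn(batch, sampling_params)
        if len(outputs) != len(batch):
            raise RuntimeError("expected exactly one output per prompt")
        for prompt, output in zip(batch, outputs):
            results.append({"prompt": prompt, "text": output["text"]})
    return results


def fake_generate(
    prompts: List[str], sampling_params: Dict[str, float]
) -> List[Dict[str, str]]:
    """Hypothetical stand-in for llm.generate, used only to exercise the helper."""
    return [{"text": f" <completion of {p!r}>"} for p in prompts]
```

With a real engine you would pass `llm.generate` in place of `fake_generate`, for example `generate_in_batches(llm.generate, prompts, sampling_params, batch_size=256)`. The batch size here is an arbitrary illustrative value.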

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()


=== Testing synchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Greg and I am a semi-retired Environmentalist. I have spent the last 30 years working on various projects and causes that promote sustainability, conservation, and eco-friendly practices. Some of my most notable achievements include:
Establishing a network of community gardens in urban areas, providing fresh produce to low-income families and promoting green spaces in densely populated cities.
Developing a successful recycling program for electronic waste, diverting thousands of tons of hazardous materials from landfills and promoting responsible e-waste disposal practices.
Creating educational programs and workshops on sustainable living, environmental science, and conservation, reaching thousands of students, teachers, and community members.
Coll

Prompt: The capital of France is
Generated text:  known for its stunning architecture, art museums, and beautiful parks. One of the most famous landmarks in the city is the Eiffel Tower, which was built for the 1889 World's Fair. The tower is made of iron and stands at over 1,000 feet tall. Visitors can take a lift to the top for panoramic views of the city.
The Louvre Museum is another popular destination in Paris. The museum is home to an impressive collection of art and artifacts from around the world, including the Mona Lisa. The museum is housed in a beautiful 16th-century palace and has a stunning glass pyramid entrance.
The Se

Prompt: The future of AI is
Generated text:  being shaped by the convergence of various fields, including machine learning, natural language processing, computer vision, and robotics. Here are some of the key trends and developments that are likely to shape the future of AI:
  1. Explainability and Transparency: As AI systems become increasingly complex, there is a growing need for explainability and transparency. This involves developing techniques to understand how AI models make decisions and providing clear explanations for their outputs.
  2. Edge AI: The proliferation of edge devices such as smartphones, smart home devices, and autonomous vehicles is driving the need for edge AI. This involves deploying AI models on edge devices,
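
If you need the complete text as well as the live printout, you can accumulate the streamed chunks while echoing them. The sketch below assumes that each chunk is a dict whose `"text"` value holds only the newly generated piece, which is what the output above suggests. If your installed version emits cumulative text instead, joining the pieces this way would duplicate content. `collect_stream` and `fake_stream` are hypothetical names, and `fake_stream` merely stands in for `llm.generate(prompt, sampling_params, stream=True)`.

```python
from typing import Dict, Iterable, Iterator, List


def collect_stream(chunks: Iterable[Dict[str, str]], echo: bool = False) -> str:
    """Join incremental chunks into one string, optionally printing as they arrive."""
    parts: List[str] = []
    for chunk in chunks:
        piece = chunk["text"]
        if echo:
            print(piece, end="", flush=True)
        parts.append(piece)
    if echo:
        print()  # finish the echoed line
    return "".join(parts)


def fake_stream(pieces: List[str]) -> Iterator[Dict[str, str]]:
    """Hypothetical stand-in for a streaming generate call."""
    for piece in pieces:
        yield {"text": piece}
```

For example, `collect_stream(fake_stream([" Greg", " and", " I"]), echo=True)` prints the pieces as they arrive and returns the joined string.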




### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Hello, my name is
Generated text:  Lorena Aguilar, and I am a senior at The University of Texas at Austin. I am an Advertising and Public Relations major with a minor in Spanish. I am a first-generation college student, a proud Latina, and a passionate advocate for women's rights and social justice.
When I'm not in class, you can find me volunteering at the Texas Book Festival, interning at a local advertising agency, or practicing yoga with my favorite teachers. I love trying new restaurants, hiking in Barton Creek, and exploring the vibrant culture of Austin.
I am excited to join the 2022-2023 editorial team as the Latinx &

Prompt: The capital of France is
Generated text:  Paris, which is located in the northern part of the country. The city is famous for its history, art, fashion, cuisine, and architecture. Paris is a major tourist destination, attracting millions of visitors each year. It is also the seat of the French government and the country's largest city.
The city o
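
The cell above awaits one call for the whole prompt list. When you would rather issue one request per prompt and bound how many are in flight, `asyncio.gather` with a semaphore is a common pattern. The following is a minimal sketch under stated assumptions: `generate_concurrently` and `fake_async_generate` are hypothetical names, the stand-in only imitates an awaitable that returns a dict with a `"text"` key, and whether concurrent per-prompt calls suit your engine version is something to verify yourself.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

AsyncGenerateFn = Callable[[str, Dict[str, float]], Awaitable[Dict[str, str]]]


async def fake_async_generate(
    prompt: str, sampling_params: Dict[str, float]
) -> Dict[str, str]:
    """Hypothetical stand-in for an awaitable single-prompt generate call."""
    await asyncio.sleep(0.01)  # simulate time spent generating
    return {"text": f" <completion of {prompt!r}>"}


async def generate_concurrently(
    async_generate_fn: AsyncGenerateFn,
    prompts: List[str],
    sampling_params: Dict[str, float],
    max_concurrency: int = 2,
) -> List[Dict[str, str]]:
    """Run one request per prompt, with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(prompt: str) -> Dict[str, str]:
        async with semaphore:
            return await async_generate_fn(prompt, sampling_params)

    # gather returns results in the same order as the prompts.
    return await asyncio.gather(*(run_one(p) for p in prompts))
```

Because `asyncio.gather` preserves input order, the results can be zipped with `prompts` exactly as in the cell above.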

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())


=== Testing asynchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Connor!
I'm a 14 year old dog, a gorgeous golden retriever with a fluffy coat and a loving personality. I'm still as playful and energetic as I was when I was a puppy, and I love to run around and play fetch in the park. I'm also very gentle and loving, and I love to snuggle up with my humans on the couch for a good cuddle.
I've been diagnosed with degenerative myelopathy, a condition that affects my hind legs. It makes it harder for me to walk and run, but I'm still a happy and loving pup at heart. I just need a little

Prompt: The capital of France is
Generated text:  a city of eternal romance and beauty. From the iconic Eiffel Tower to the stunning Notre-Dame Cathedral, Paris is a city that will capture your heart and leave you in awe. Here are some of the top attractions and experiences you shouldn't miss when visiting Paris:
1. Eiffel Tower: This iconic iron lattice tower is a symbol of Paris and one of the most recognizable landmarks in the world. You can take the stairs or elevator to the top for breathtaking views of the city.
2. Notre-Dame Cathedral: This beautiful Gothic cathedral has been the site of coronations, royal weddings, and other significant events throughout history

Prompt: The future of AI is
Generated text:  here now

Artificial intelligence (AI) is no longer a futuristic concept, but a reality that is transforming industries and changing the way we live and work. AI has been integrated into various aspects of our lives, from virtual assistants like Siri and Alexa to self-driving cars and personalized product recommendations.
The rapid advancement of AI has led to the development of various AI applications, including:
  1. Machine learning: enables machines to learn from data and improve their performance over time.
  2. Natural language processing: allows computers to understand and generate human language.
  3. Computer vision: enables machines to interpret and understand visual data



In [6]:
llm.shutdown()