# Offline Engine API

SGLang provides a direct inference engine that does not require an HTTP server. This is useful when a separate HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed, working example as a Python script is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
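The linked script is the reference for this pattern. To give a rough idea of the shape, the sketch below wraps an engine object in Python's standard-library `http.server`. `StubEngine`, the `/generate` route, and the JSON request body are hypothetical and exist only so the snippet runs without a GPU. With SGLang, you would pass an `sgl.Engine` instead, assuming its `generate(prompt, sampling_params)` returns a dict with a `"text"` field, as the cells below use it.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompt, sampling_params):
        # Returns the shape the notebook reads: a dict with a "text" field.
        return {"text": f" <completion of {prompt!r}>"}


def make_handler(engine):
    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/generate":  # hypothetical route name
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            body = json.loads(self.rfile.read(length) or b"{}")
            if "prompt" not in body:
                self.send_error(400, "missing 'prompt'")
                return
            output = engine.generate(body["prompt"], body.get("sampling_params", {}))
            payload = json.dumps({"text": output["text"]}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, format, *args):
            pass  # silence per-request logging for the demo

    return GenerateHandler


def start_server(engine, port=0):
    """Serve on a background thread; port=0 lets the OS pick a free port.

    HTTPServer handles one request at a time, so this sketch never calls
    the engine concurrently.
    """
    server = HTTPServer(("127.0.0.1", port), make_handler(engine))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Usage: send one request to the running server, then stop it.
server = start_server(StubEngine())
url = f"http://127.0.0.1:{server.server_address[1]}/generate"
request = urllib.request.Request(
    url,
    data=json.dumps({"prompt": "Hi", "sampling_params": {"temperature": 0.8}}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    result = json.loads(response.read())
print(result)
server.shutdown()
server.server_close()
```

The single-threaded `HTTPServer` is a deliberate conservative choice. This page does not say whether the engine may be called from several threads at once.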

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```text
Prompt: Hello, my name is
Generated text:  Shelby and I am a senior at the University of North Georgia (UNG). I am a part of the Army ROTC program and will be commissioning as a Second Lieutenant in the United States Army. I am also a member of the UNG Diving Team and a member of the UNG community service organization, Alpha Phi Omega.
As a student in the Army ROTC program, I have had the opportunity to participate in various leadership development training courses. Some of these courses include Cadet Leadership Development, Advanced Leadership Development, and the Leader Development and Assessment Course (LDAC). I have also had the opportunity to lead a platoon of
Prompt: The president of the United States is
Generated text:  not immune from the law. The president can be sued for actions taken while in office. However, there are some exceptions to the president's immunity from prosecution.
The President is entitled to certain protections from prosecution, including executive privilege
```
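The cell above passes `{"temperature": 0.8, "top_p": 0.95}`. The sketch below applies the textbook definitions of these two parameters to a toy logit vector:

1. Divide the logits by the temperature and apply softmax.
2. Keep the smallest set of highest-probability tokens whose cumulative probability reaches `top_p` (nucleus sampling).
3. Renormalize the kept probabilities and sample from them.

This illustrates the standard technique only. It is not SGLang's sampler, and the helper names and toy vocabulary are made up.

```python
import math
import random

# Illustrative helpers (hypothetical names), not SGLang's sampler.


def temperature_softmax(logits, temperature):
    """Scale logits by 1/temperature and normalize them into probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest high-probability set whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample(logits, temperature=0.8, top_p=0.95, rng=random):
    """Draw one token index using temperature scaling followed by top-p filtering."""
    nucleus = top_p_filter(temperature_softmax(logits, temperature), top_p)
    tokens = list(nucleus)
    return rng.choices(tokens, weights=[nucleus[t] for t in tokens], k=1)[0]


toy_vocab = ["Paris", "Lyon", "a", "the", "banana"]
toy_logits = [3.0, 2.0, 1.5, 1.0, -3.0]
probs = temperature_softmax(toy_logits, 0.8)
nucleus = top_p_filter(probs, 0.95)
print({toy_vocab[i]: round(p, 3) for i, p in nucleus.items()})
print(toy_vocab[sample(toy_logits, rng=random.Random(0))])
```

With these toy numbers, the lowest-probability token ("banana") falls outside the 0.95 nucleus and can never be sampled. A lower temperature would concentrate even more mass on "Paris".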

### Streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()
```


```text
=== Testing synchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Karen, and I am a recovering addict. I have been in recovery for 5 years now, and I want to share my story with you. I hope it might help someone who is struggling with addiction to know that there is hope and that recovery is possible.
I grew up in a loving family, but I have always been a bit of a perfectionist. I was always striving to be the best, to get the best grades, to be the best athlete. And that perfectionism followed me into adulthood. I was always trying to be in control, always trying to be perfect.
But beneath all of that perfectionism, I was

Prompt: The capital of France is
Generated text:  a must-visit destination for any traveler. The city has a long history and is known for its rich culture, stunning architecture, and romantic atmosphere. There are so many things to do and see in Paris, from visiting famous landmarks like the Eiffel Tower and Notre Dame Cathedral to exploring the city's many museums, galleries, and historic neighborhoods.
Getting Around Paris

Getting around Paris is relatively easy, with a comprehensive public transportation system that includes the metro, buses, and trains. The city is also very walkable, with many streets and boulevards lined with cafes, shops, and restaurants.
Some popular ways to get around Paris

Prompt: The future of AI is
Generated text:  not what you think

AI is often seen as a futuristic technology that will change the world. While it’s true that AI has the potential to revolutionize many industries, its impact will be more subtle and evolutionary than you might expect. Here’s a more nuanced view of AI’s future.
AI is already all around us

Artificial intelligence is already embedded in many aspects of our lives, from voice assistants like Siri and Alexa to personalized product recommendations on e-commerce websites. AI-powered chatbots are being used in customer service, and AI-driven analytics are helping businesses make data-driven decisions. These applications are often invisible to us, but they are
```
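The streaming loop only prints chunks as they arrive. If you also need the full completion afterwards, one option is to accumulate the chunks while printing them. This sketch assumes, as the printed output suggests, that each chunk's `"text"` holds only the newly generated piece rather than the cumulative text. `fake_stream` is a hypothetical stand-in for `llm.generate(prompt, sampling_params, stream=True)` so the snippet runs without a model.

```python
import sys
from typing import Dict, Iterable, Iterator


def fake_stream(pieces: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Hypothetical stand-in for llm.generate(..., stream=True)."""
    for piece in pieces:
        yield {"text": piece}


def print_and_collect(stream: Iterable[Dict[str, str]], out=sys.stdout) -> str:
    """Echo each incremental chunk and return the concatenated completion."""
    parts = []
    for chunk in stream:
        out.write(chunk["text"])
        out.flush()
        parts.append(chunk["text"])
    out.write("\n")
    return "".join(parts)


full_text = print_and_collect(fake_stream([" Karen", ",", " and", " I"]))
```

If an engine version returned cumulative text per chunk instead, the last chunk alone would already be the full completion, and joining would duplicate text.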




### Non-streaming Asynchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


```text
=== Testing asynchronous batch generation ===

Prompt: Hello, my name is
Generated text:  Sarah and I am a UX Designer. I recently worked with a startup to redesign their website, and I am looking for feedback on my work. Specifically, I am seeking feedback on the following aspects:

1.  **Usability**: Is the website intuitive and easy to navigate? Are the main elements and actions clear and accessible?
2.  **Information Architecture**: Does the website's information architecture make sense and effectively organize content?
3.  **Visual Design**: Is the visual design aesthetically pleasing, and does it effectively communicate the brand's identity?
4.  **Accessibility**: Has the website been designed with accessibility in mind,

Prompt: The capital of France is
Generated text:  known for its stunning beauty, rich history, and vibrant culture. Whether you're interested in art, architecture, cuisine, or fashion, Paris has something to offer. From the iconic Eiffel Tower to the stunning Notre-Dame Cathedral, the City of Light is a mus
```
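Because `async_generate` is awaited like any other coroutine, it can be combined with standard `asyncio` utilities. The sketch below bounds a batch call with `asyncio.wait_for`. `fake_async_generate` is a hypothetical stand-in with the same call shape as `llm.async_generate(prompts, sampling_params)`, and the timeout values are arbitrary. This page does not say whether the real engine aborts in-flight requests when the awaiting task is cancelled, so treat the timeout as a client-side bound only.

```python
import asyncio
from typing import Dict, List, Optional


async def fake_async_generate(prompts: List[str], sampling_params: Dict) -> List[Dict[str, str]]:
    """Hypothetical stand-in for llm.async_generate on a list of prompts."""
    await asyncio.sleep(0.01)  # pretend to do some work
    return [{"text": f" <completion of {p!r}>"} for p in prompts]


async def generate_with_timeout(
    prompts: List[str], sampling_params: Dict, timeout_s: float
) -> Optional[List[Dict[str, str]]]:
    """Return the batch outputs, or None if they take longer than timeout_s."""
    try:
        return await asyncio.wait_for(
            fake_async_generate(prompts, sampling_params), timeout=timeout_s
        )
    except asyncio.TimeoutError:
        return None  # the caller decides how to handle a timed-out batch


results = asyncio.run(
    generate_with_timeout(
        ["Hello, my name is", "The capital of France is"],
        {"temperature": 0.8, "top_p": 0.95},
        timeout_s=5.0,
    )
)
for r in results:
    print(r["text"])
```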

### Streaming Asynchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())
```


```text
=== Testing asynchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Bobbi and I'm a bit of a laid-back, whimsical gal with a love for the outdoors. I live in beautiful Colorado, where I spend most of my time hiking, camping, and exploring the great Rocky Mountains. I'm a bit of a free spirit and love trying new things, whether it's a new recipe in the kitchen or a new adventure on the trails.
My friends would describe me as friendly, adventurous, and a bit quirky. I'm always up for a spontaneous hike or a night out with friends, and I love trying new foods and drinks. I'm a bit of a homebody, too, and

Prompt: The capital of France is
Generated text:  a city that is steeped in history and culture, with iconic landmarks like the Eiffel Tower and the Louvre Museum drawing millions of visitors every year. But it's not all about the sights – Paris is also a city of romance, fashion, and cuisine, with a rich cultural scene and a vibrant nightlife.
If you're planning a trip to Paris, here are some tips to help you make the most of your visit:
1. Get familiar with the city's layout: Paris is a big city, and it can be easy to get lost in its winding streets. Take some time to study a map of the city, and

Prompt: The future of AI is
Generated text:  here!
Click here to see our latest press release.
Don't just talk about innovation, make it happen.
At Quobani, we believe that technology should be simple, intuitive and accessible to all. Our mission is to empower businesses to innovate and transform their operations using cutting-edge AI and IoT technologies.
Our team of experts has extensive experience in AI, IoT, and software development, and we're passionate about bringing this expertise to the market. We're not just coders, we're problem solvers, with a focus on delivering tangible results that drive real value for our clients.
We're a team of innovators, entrepreneurs and techn
```
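The asynchronous streaming cell consumes the prompts one after another. To consume several streams at once, each stream can run in its own task under `asyncio.gather`. This is a sketch under stated assumptions:

- `fake_async_generate_stream` is a hypothetical stand-in that mirrors the cell's call shape, where awaiting it returns an async iterator of chunks.
- Whether the real engine schedules concurrent streaming requests together is not verified here.

The text is collected per prompt rather than printed, because interleaved prints from several streams would be hard to read.

```python
import asyncio
from typing import AsyncIterator, Dict, List


async def _chunks(prompt: str) -> AsyncIterator[Dict[str, str]]:
    for i in range(3):
        await asyncio.sleep(0)  # yield control so other streams can progress
        yield {"text": f" {prompt}-{i}"}


async def fake_async_generate_stream(prompt: str, sampling_params: Dict, stream: bool = True):
    """Hypothetical stand-in: awaiting it returns an async iterator of chunks."""
    return _chunks(prompt)


async def collect(prompt: str, sampling_params: Dict) -> str:
    """Consume one stream and return its concatenated text."""
    generator = await fake_async_generate_stream(prompt, sampling_params, stream=True)
    parts = []
    async for chunk in generator:
        parts.append(chunk["text"])
    return "".join(parts)


async def main(prompts: List[str], sampling_params: Dict) -> List[str]:
    # asyncio.gather returns results in the same order as its arguments.
    return await asyncio.gather(*(collect(p, sampling_params) for p in prompts))


texts = asyncio.run(main(["A", "B", "C"], {"temperature": 0.8, "top_p": 0.95}))
print(texts)
```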




```python
llm.shutdown()
```
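The final cell releases the engine. In a standalone script, one way to ensure `shutdown()` runs even if generation raises an exception is to wrap the engine in a context manager. `engine_session` and `StubEngine` below are hypothetical helpers for illustration. With SGLang, you would pass a factory such as `lambda: sgl.Engine(model_path=...)`, assuming it behaves like the object used in the cells above.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; records whether shutdown() ran."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.closed = True


@contextmanager
def engine_session(factory):
    """Create an engine and always call shutdown() when the block exits."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()  # runs on normal exit and when an exception propagates


with engine_session(StubEngine) as llm:
    outputs = llm.generate(["Hello, my name is"], {"temperature": 0.8})
```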