# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine

import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Kaitlyn Pitzer. I am a visual arts student at the University of California, Davis. My passion for photography began when I was a teenager, and I have been taking photos ever since. My favorite subjects are landscape, portrait, and still life photography. I love experimenting with different lighting, composition, and editing techniques to bring my photos to life. Currently, I am focusing on capturing the beauty of the natural world around me through my photography.
In my free time, I enjoy exploring the outdoors, hiking, and taking photos of the beautiful landscapes and scenery that California has to offer. I also love experimenting with different photography techniques and
Prompt: The president of the United States is
Generated text:  often considered the most powerful person in the world. Here are some of the most interesting and unusual facts about the office of the presidency.
The president is not just the head of state but also the head of
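
Each result of the non-streaming call above is a dictionary, and the example reads the generated string from its `text` key. A minimal sketch of pairing prompts with results for later inspection could look like the following. The `outputs` list here is stand-in data shaped like the results above (not real model output), and `to_jsonl` is a hypothetical helper, not part of SGLang.

```python
import json


def to_jsonl(prompts, outputs):
    """Serialize (prompt, generated text) pairs as JSON Lines."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = [
        json.dumps({"prompt": p, "completion": o["text"]})
        for p, o in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Stand-in data; in practice `outputs` would come from llm.generate(...).
prompts = ["The capital of France is"]
outputs = [{"text": " Paris."}]
jsonl = to_jsonl(prompts, outputs)
print(jsonl)
```

The length check guards against silently dropping items, since `zip` stops at the shorter sequence.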

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()


=== Testing synchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Jorge, and I am a Certified Personal Trainer and Nutritionist. I have been working with clients for over 10 years and have a deep understanding of what it takes to achieve success in fitness and nutrition. My approach is holistic and individualized, focusing on helping my clients achieve their goals while maintaining a healthy and balanced lifestyle. I am passionate about empowering individuals to take control of their health and well-being, and I am committed to providing exceptional guidance and support every step of the way.
I specialize in creating customized fitness and nutrition plans that cater to each client's unique needs, goals, and lifestyle. My services include:
Personal Training: One



Prompt: The capital of France is
Generated text:  home to some of the most iconic landmarks in the world, including the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. But Paris is more than just its famous attractions – it's also a city with a rich history, a vibrant cultural scene, and a romantic atmosphere that has captivated visitors for centuries.
One of the best ways to experience the real Paris is to explore its lesser-known neighborhoods, or arrondissements. Here are a few of the most charming neighborhoods to visit in Paris:
1. Le Marais: This historic neighborhood in the 3rd and 4th arrondissements is



Prompt: The future of AI is
Generated text:  human-like, not robotic
AI should be able to simulate a human thought process rather than relying on traditional computer programming. This is known as "cognitive AI".
The world of AI is often perceived as a field of robotic machines and cold, hard logic. However, this perception is beginning to shift. A new wave of AI innovation focuses on creating systems that can think like humans, making decisions based on contextual understanding and emotional intelligence.
This approach is known as "cognitive AI," and it has the potential to revolutionize industries such as healthcare, finance, and education. By simulating a human thought process, cognitive AI can provide more



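With `stream=True`, the loop above receives a sequence of chunks and reads each chunk's `text` key. A minimal sketch of collecting those chunks into one string while still printing them as they arrive is shown below. `fake_stream` is a hypothetical stand-in for the engine's streaming generator so that the example runs without a model, and the sketch assumes each chunk carries only newly generated text, as the output above suggests. If chunks turn out to be cumulative in your version, keep only the last chunk instead of joining them.

```python
def fake_stream(text_pieces):
    """Stand-in for llm.generate(..., stream=True): yields dicts with a 'text' key."""
    for piece in text_pieces:
        yield {"text": piece}


def collect_stream(chunks):
    """Print each chunk as it arrives and return the concatenated text."""
    parts = []
    for chunk in chunks:
        print(chunk["text"], end="", flush=True)
        parts.append(chunk["text"])
    print()
    return "".join(parts)


full_text = collect_stream(fake_stream([" Paris", " is", " lovely", "."]))
```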

### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Hello, my name is
Generated text:  Austin and I am a seasoned web developer with over 8 years of experience in designing and developing cutting-edge web applications using a range of technologies including PHP, Laravel, JavaScript, React, and Node.js. I have a strong background in e-commerce, having developed several successful e-commerce platforms and integrations with various payment gateways.

Some of my key skills include:

* PHP and Laravel framework
* JavaScript and React for front-end development
* Node.js for back-end development
* MySQL and MongoDB databases
* API design and integration
* Payment gateway integration
* E-commerce platform development
* Responsive web design

I am passionate about

Prompt: The capital of France is
Generated text:  Paris, a beautiful city famous for its stunning architecture, famous museums, and romantic atmosphere. If you ever find yourself in Paris, you should not miss visiting the Eiffel Tower, the Louvre, Notre Dame Cathedral, and th
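
One reason to prefer the asynchronous interface is that generation can overlap with other work on the same event loop. The sketch below illustrates that pattern with `asyncio.gather`. `fake_async_generate` is a hypothetical stand-in coroutine that mimics the call shape of `llm.async_generate` used above, so the example runs without a model, and `other_work` is a placeholder for unrelated I/O.

```python
import asyncio


async def fake_async_generate(prompts, sampling_params):
    """Stand-in for `await llm.async_generate(...)`: one dict per prompt."""
    await asyncio.sleep(0.01)  # simulate time spent generating
    return [{"text": f" <completion for: {p}>"} for p in prompts]


async def other_work():
    """Placeholder for unrelated I/O that can overlap with generation."""
    await asyncio.sleep(0.01)
    return "done"


async def main():
    # Both awaitables make progress concurrently on one event loop.
    outputs, status = await asyncio.gather(
        fake_async_generate(["Hello, my name is"], {"temperature": 0.8}),
        other_work(),
    )
    return outputs, status


outputs, status = asyncio.run(main())
print(status, outputs[0]["text"])
```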

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())


=== Testing asynchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Emma, I'm 21 years old, and I'm a long-time fan of anime and manga. I've been watching anime for over a decade, and I've read numerous manga series. I'm here to share my passion with you and discuss all things anime and manga.
Recent Discussions:
What is your favorite anime of all time?
The Most Iconic Anime Characters of All Time
10 Reasons Why You Should Watch "Attack on Titan"
Top 5 Action-Packed Anime Series for Beginners
10 Reasons Why You Should Read "Death Note"
Recent Reviews:
"Attack on Titan" Review: A Gripping and Emotional Experience
"



Prompt: The capital of France is
Generated text:  a city that needs no introduction. Paris is a city steeped in history, romance, and culture. Whether you’re interested in art, architecture, food, or fashion, the City of Light has something to offer. Here are some of the top things to do in Paris:
1. The Eiffel Tower:
The iconic Eiffel Tower is a must-visit attraction in Paris. This iron lattice tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city from its top level.
2. The Louvre Museum:
The Louvre is one of the world’s largest and most famous museums, housing



Prompt: The future of AI is
Generated text:  often described as bright and exciting, but it also raises concerns about the impact on workers and society. AI is increasingly being used to automate routine and repetitive tasks, which can lead to job displacement for certain workers. While some jobs may become obsolete, others will be created that we cannot yet imagine. This is the story of how AI is transforming the workplace and what we can do to prepare for the future.
AI has been rapidly advancing in recent years, leading to significant improvements in efficiency and productivity. AI can analyze vast amounts of data, identify patterns, and make predictions, which has led to the automation of many routine tasks. While this can


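The asynchronous streaming loop above handles prompts one after another. A possible variation is to consume several streams concurrently and collect each one's full text. The sketch below shows only the asyncio pattern: `fake_async_stream` is a hypothetical stand-in for the async generator obtained with `stream=True`, and whether concurrent streams against a real engine behave the same way is an assumption to verify separately.

```python
import asyncio


async def fake_async_stream(pieces):
    """Stand-in for the async generator returned with stream=True."""
    for piece in pieces:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield {"text": piece}


async def collect(prompt, pieces):
    """Consume one stream and return (prompt, full text)."""
    parts = []
    async for chunk in fake_async_stream(pieces):
        parts.append(chunk["text"])
    return prompt, "".join(parts)


async def main():
    # Consume two streams concurrently instead of one after another.
    return await asyncio.gather(
        collect("A", [" one", " two"]),
        collect("B", [" three"]),
    )


results = asyncio.run(main())
print(results)
```

`asyncio.gather` returns results in the order the awaitables were passed, so each entry stays matched to its prompt.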


In [6]:
llm.shutdown()
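
If generation raises an exception, the `shutdown()` call above may never run. One way to guard against that is a small context manager that always calls `shutdown()`. The sketch below is illustrative only: `FakeEngine` is a hypothetical stand-in that exposes just a `shutdown()` method, and `managed` is a helper defined here, not an SGLang API.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in exposing only shutdown(), like `llm` above."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


@contextmanager
def managed(engine):
    """Yield the engine and always call shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with managed(engine) as llm:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(engine.closed)  # shutdown() ran despite the exception
```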