# Offline Engine API

SGLang provides a direct inference engine that works without an HTTP server. This is useful when an HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete, working example as a Python script is available at [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
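The linked script is the reference example. To illustrate the idea only, the sketch below wraps a generate function in Python's standard-library `http.server`. The `/generate` path, the JSON request shape (`prompt`, optional `sampling_params`), and `fake_generate` are hypothetical choices for this sketch; `fake_generate` stands in for `llm.generate` so the code runs without a GPU. `HTTPServer` handles one request at a time, which avoids assuming that the engine is safe to call from several threads.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def fake_generate(prompt, sampling_params):
    """Hypothetical stand-in for `llm.generate` on a single prompt."""
    return {"text": f" <completion for {prompt!r}>"}


def make_handler(generate):
    """Build a request handler class bound to a `generate(prompt, params)` callable."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/generate":  # hypothetical endpoint name
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            try:
                body = json.loads(self.rfile.read(length) or b"{}")
                prompt = body["prompt"]
            except (ValueError, KeyError, TypeError):
                self.send_error(400, "expected a JSON object with a 'prompt' field")
                return
            output = generate(prompt, body.get("sampling_params", {}))
            payload = json.dumps({"text": output["text"]}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, *args):
            pass  # silence the default per-request logging

    return GenerateHandler


def serve(generate=fake_generate, host="127.0.0.1", port=8000):
    """Create (but do not start) a single-threaded HTTP server around `generate`."""
    return HTTPServer((host, port), make_handler(generate))
```

To put a real engine behind it, you might pass `llm.generate` as `generate` (assuming that, for a single string prompt, it returns one dict with a `"text"` key) and call `serve(llm.generate).serve_forever()`.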

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.
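For very large prompt lists, one option is to feed the engine in fixed-size chunks so that results can be written out as each chunk finishes. The helper below is a hypothetical sketch, not part of SGLang. It only assumes that `generate(list_of_prompts, sampling_params)` returns one output per prompt in the same order, as the `zip(prompts, outputs)` loops in this document do. Passing the whole list at once also works; smaller chunks may give the scheduler fewer requests to batch together, in exchange for incremental progress.

```python
from typing import Any, Callable, Dict, Iterator, List, Sequence, Tuple

Output = Dict[str, Any]


def generate_in_chunks(
    generate: Callable[[List[str], Dict[str, Any]], List[Output]],
    prompts: Sequence[str],
    sampling_params: Dict[str, Any],
    chunk_size: int = 256,
) -> Iterator[Tuple[str, Output]]:
    """Yield (prompt, output) pairs, calling `generate` on at most `chunk_size` prompts at a time."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    for start in range(0, len(prompts), chunk_size):
        batch = list(prompts[start : start + chunk_size])
        outputs = generate(batch, sampling_params)
        # Outputs of each call line up with that call's prompts.
        yield from zip(batch, outputs)


# Usage with the engine (not run here):
#   for prompt, output in generate_in_chunks(llm.generate, prompts, sampling_params):
#       save(prompt, output["text"])  # `save` is a hypothetical sink
```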

```python
# Launch the offline engine.
import asyncio

import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
```

### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```text
Prompt: Hello, my name is
Generated text:  Zachary and I am a passionate software developer with a strong interest in creating innovative and user-friendly applications. With a strong foundation in computer science and a keen eye for detail, I strive to deliver high-quality solutions that exceed expectations.
I have a degree in Computer Science from [University Name], where I honed my skills in programming languages such as Java, Python, and C++. I also have experience with various development frameworks and tools, including Spring, React, and Docker.
My experience spans across multiple industries, including finance, healthcare, and e-commerce. I have worked on various projects, from building scalable web applications to developing mobile apps for
Prompt: The president of the United States is
Generated text:  required to be a natural-born citizen of the United States, at least 35 years old, and a resident of the country for at least 14 years. Is it constitutional for a 19-year-old to b
```
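The `sampling_params` above set `temperature` and `top_p`. As a textbook illustration of what these two settings mean (not SGLang's actual sampling kernel, whose details may differ), the sketch below computes the distribution a sampler would draw the next token from. Logits are divided by the temperature before the softmax; then only the smallest set of most likely tokens whose cumulative probability reaches `top_p` is kept and renormalized. The function name and the toy logits are made up for illustration.

```python
import math
from typing import List, Sequence


def nucleus_distribution(
    logits: Sequence[float], temperature: float = 0.8, top_p: float = 0.95
) -> List[float]:
    """Return the next-token distribution after temperature scaling and top-p filtering."""
    if temperature <= 0 or not 0 < top_p <= 1:
        raise ValueError("need temperature > 0 and 0 < top_p <= 1")

    # Temperature: divide logits before the softmax; T < 1 sharpens, T > 1 flattens.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p: keep the smallest set of most likely tokens whose total mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalize over the kept tokens; every other token gets probability 0.
    kept_mass = sum(probs[i] for i in kept)
    return [probs[i] / kept_mass if i in kept else 0.0 for i in range(len(probs))]


dist = nucleus_distribution([2.0, 1.0, 0.1, -3.0], temperature=0.8, top_p=0.95)
print([round(p, 3) for p in dist])
```

With these toy logits the least likely token falls outside the 0.95 nucleus and gets probability 0, while the remaining three are rescaled to sum to 1.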

### Streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()
```


```text
=== Testing synchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Jenn, and I'm a teacher, blogger, and mom of two little munchkins. I'm so glad you're here! On this blog, I share my thoughts on parenting, teaching, and life in general. I'm a bit of a perfectionist, but I'm also a firm believer in the importance of being real and authentic. I hope you'll join me on this journey and find some inspiration, encouragement, and maybe even a few laughs along the way. Thanks for stopping by!
I'm a teacher at heart, and I love sharing my passion for learning with others. On this blog, I'll be sharing some of

Prompt: The capital of France is
Generated text:  a city that needs no introduction. Paris is known for its rich history, stunning architecture, world-class museums, and romantic atmosphere. From the Eiffel Tower to the Louvre, the Seine River to the Montmartre district, there's no shortage of iconic landmarks and experiences to explore.
But Paris is more than just its famous sights – it's also a vibrant and diverse city with a thriving food scene, charming neighborhoods, and a wide range of cultural attractions. Whether you're interested in history, art, fashion, or entertainment, there's something for everyone in this beautiful and captivating city.
Paris is home to many of the

Prompt: The future of AI is
Generated text:  uncertain. In this timely and thought-provoking book, leading philosopher and AI researcher, Nick Bostrom, explores the risks and opportunities that AI poses to humanity. He argues that the development of superintelligent machines could pose an existential risk to human civilization, and that we need to develop strategies for mitigating these risks, while also harnessing the benefits of AI.
“Superintelligence: Paths, Dangers, Strategies” is a book that challenges us to think critically about the future of AI and our relationship with machines. It is a must-read for anyone interested in the intersection of technology, philosophy, and humanity.
Nick Bost
```
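In the output above, each chunk's `"text"` appears to carry only the newly generated piece, which is why the loop prints with `end=""`. If you also want the full string afterwards, a small helper can echo and accumulate the pieces. This is a hypothetical sketch that assumes delta-style chunks, as observed here; if an engine version returned cumulative text per chunk instead, joining the pieces would duplicate content.

```python
from typing import Iterable, Mapping


def collect_stream(chunks: Iterable[Mapping[str, str]], echo: bool = True) -> str:
    """Print streamed pieces as they arrive and return the concatenated text.

    Assumes each chunk["text"] holds only the newly generated piece (a delta).
    """
    pieces = []
    for chunk in chunks:
        piece = chunk["text"]
        if echo:
            print(piece, end="", flush=True)
        pieces.append(piece)
    if echo:
        print()  # finish the line once the stream ends
    return "".join(pieces)


# Usage with the engine (not run here):
#   text = collect_stream(llm.generate(prompt, sampling_params, stream=True))
```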




### Non-streaming Asynchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


```text
=== Testing asynchronous batch generation ===

Prompt: Hello, my name is
Generated text:  Kathleen Fong, and I am a fellow writer and blogger. I am excited to be part of this community, and I look forward to sharing my writing and interacting with others.
Welcome to the community, Kathleen! I'm excited to have you on board. What kind of writing do you do, and what kind of topics would you like to discuss or explore in your posts? Looking forward to seeing your contributions!
Welcome Kathleen. I'm glad you're here. What kind of writing do you do? Do you have a blog or are you looking to start one? I'm happy to help in any way I can.
Hello

Prompt: The capital of France is
Generated text:  known for its stunning architecture, world-class art museums, and romantic atmosphere. Whether you're a foodie, a history buff, or a shopaholic, Paris has something for everyone. Here are some of the top things to do in Paris:
1. Visit the Eiffel Tower: The iconic Eiffel Tower is a must-visit attraction in Paris. You can take the stairs or eleva
```
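Passing a list to `async_generate`, as above, submits the whole batch in one call. If prompts arrive one by one, or need different sampling parameters, you can instead start one call per prompt and await them together with `asyncio.gather`, which returns results in the order the calls were passed. The sketch below uses a hypothetical `fake_async_generate` in place of `llm.async_generate` so that it runs without a model; it assumes that awaiting a single-prompt call returns one dict with a `"text"` key.

```python
import asyncio
from typing import Any, Dict, List, Sequence


async def fake_async_generate(prompt: str, sampling_params: Dict[str, Any]) -> Dict[str, str]:
    """Hypothetical stand-in for `await llm.async_generate(prompt, sampling_params)`."""
    await asyncio.sleep(0.01)  # pretend to wait on the engine
    return {"text": f" <completion for {prompt!r}>"}


async def generate_concurrently(
    prompts: Sequence[str],
    params_per_prompt: Sequence[Dict[str, Any]],
    generate=fake_async_generate,
) -> List[Dict[str, str]]:
    """Start one request per prompt, each with its own sampling params, and await them all."""
    if len(prompts) != len(params_per_prompt):
        raise ValueError("need one sampling_params dict per prompt")
    coros = [generate(p, params) for p, params in zip(prompts, params_per_prompt)]
    # gather runs the coroutines concurrently and keeps the input order in its result.
    return await asyncio.gather(*coros)


async def main():
    prompts = ["Hello, my name is", "The capital of France is"]
    params = [{"temperature": 0.8, "top_p": 0.95}, {"temperature": 0.5, "top_p": 0.9}]
    outputs = await generate_concurrently(prompts, params)
    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")
    return outputs


if __name__ == "__main__":
    asyncio.run(main())
```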

### Streaming Asynchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # With stream=True, awaiting async_generate yields an async iterator of chunks.
        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())
```


```text
=== Testing asynchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Daniel, and I am a senior at Albright College, majoring in Computer Science. I am a member of the Albright College Coding Club, and I am passionate about building software applications and systems. In my free time, I enjoy reading, hiking, and playing music.

**Projects**

I have completed several projects throughout my college career, including:

1. **Personal Website**: I built a personal website using HTML, CSS, and JavaScript. The website includes a blog, a portfolio of my projects, and a contact form.

2. **E-Commerce Website**: I worked on a team to build an e-commerce website using Python

Prompt: The capital of France is
Generated text:  the perfect destination for a romantic getaway, especially during the spring season. The city is filled with beautiful gardens, picturesque streets, and stunning architecture that will make your heart skip a beat. Here are some romantic things to do in Paris during the spring:
Take a Seine River Cruise

A romantic cruise along the Seine River is a classic Parisian experience. You can enjoy the stunning city views while sipping champagne and holding hands with your loved one.
Visit the Luxembourg Gardens

The Luxembourg Gardens are a beautiful green oasis in the heart of the city. Take a stroll through the gardens, visit the beautiful fountains, and enjoy the vibrant

Prompt: The future of AI is
Generated text:  uncertain, but one thing is clear: AI will be in every aspect of our lives, including healthcare.
In this article, we'll explore the role of artificial intelligence (AI) in healthcare, its benefits, and the challenges it poses.
What is AI in healthcare?
AI in healthcare refers to the application of artificial intelligence techniques to medical data, diagnosis, treatment, and patient care. AI can be used in various ways, such as:
Predictive analytics: AI can analyze patient data to predict the likelihood of developing a certain disease or condition.
Image analysis: AI can analyze medical images, such as X-rays and MRIs, to help
```



When you are done, shut down the engine to release its resources:

```python
llm.shutdown()
```
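In a script, it can help to guarantee that `shutdown()` runs even if generation raises an exception. Below is a minimal sketch using `contextlib`; `shutdown_on_exit` is a hypothetical helper, and it assumes only that the engine object has the `shutdown()` method used above.

```python
from contextlib import contextmanager
from typing import Iterator, TypeVar

E = TypeVar("E")


@contextmanager
def shutdown_on_exit(engine: E) -> Iterator[E]:
    """Yield `engine`, then call its shutdown() however the `with` block exits."""
    try:
        yield engine
    finally:
        engine.shutdown()


# Usage with the engine (not run here):
#   with shutdown_on_exit(sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")) as llm:
#       outputs = llm.generate(prompts, sampling_params)
```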