# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
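
To give a rough idea of what sits between a web route and the engine, here is a minimal, framework-agnostic sketch. The function `handle_generate_request`, the JSON schema (`prompt`, `sampling_params`), and the `EchoEngine` stand-in are all hypothetical names introduced for illustration; they are not taken from the linked `custom_server.py`. The only engine call assumed is `generate(prompts, sampling_params)` with a list of prompts, as used later in this document.

```python
import json


def handle_generate_request(engine, raw_body: str) -> str:
    """Turn a JSON request body into a JSON response using the engine.

    Hypothetical schema: {"prompt": "...", "sampling_params": {...}}
    """
    body = json.loads(raw_body)
    prompt = body["prompt"]
    sampling_params = body.get("sampling_params", {})
    # Pass a one-element list so the call mirrors the batch usage in this document.
    outputs = engine.generate([prompt], sampling_params)
    return json.dumps({"prompt": prompt, "text": outputs[0]["text"]})


class EchoEngine:
    """Stand-in for sgl.Engine so the sketch runs without a GPU (assumption)."""

    def generate(self, prompts, sampling_params=None):
        return [{"text": f" [echo of: {p}]"} for p in prompts]
```

A route handler in any web framework could call `handle_generate_request(llm, request_body)` and return the resulting string. A production server would more likely use the asynchronous API so that one slow request does not block the others.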

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine

import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Stephen and I am a student at SUNY Brockport. I am currently pursuing a Bachelor of Arts degree in English, with a concentration in Creative Writing. My major interests include poetry, short stories, and creative nonfiction. I have a passion for the arts and a love for music and film. I am excited to share my thoughts and insights about the world around me and engage in meaningful discussions with others. I am eager to share my creative work with others and to learn from their experiences and perspectives.
In my free time, I enjoy listening to music, watching classic movies, and hiking in the beautiful Finger Lakes region of New York.
Prompt: The president of the United States is
Generated text:  not allowed to make a speech on television in the United States during the time of war. This is because the news media has a certain level of power over the news flow and can modify the president's message to reflect their own political views.
Q: Is i
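
For larger offline jobs it can be convenient to process prompts in slices and emit one record per prompt, so that results can be written out as they arrive. The sketch below shows one way to do that. `iter_batch_records`, `chunk_size`, and the `UpperEngine` stand-in are hypothetical names for illustration; the only engine behaviour assumed is the one shown above, where `generate` takes a list of prompts and returns a list of dicts with a `"text"` field.

```python
def iter_batch_records(engine, prompts, sampling_params, chunk_size=2):
    """Yield {"prompt", "text"} records, calling the engine on one slice at a time."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(prompts), chunk_size):
        batch = prompts[start : start + chunk_size]
        outputs = engine.generate(batch, sampling_params)
        # Outputs are assumed to come back in the same order as the prompts.
        for prompt, output in zip(batch, outputs):
            yield {"prompt": prompt, "text": output["text"]}


class UpperEngine:
    """Stand-in for sgl.Engine so the sketch runs without a GPU (assumption)."""

    def generate(self, prompts, sampling_params=None):
        return [{"text": p.upper()} for p in prompts]
```

With the real engine, `for record in iter_batch_records(llm, prompts, sampling_params, chunk_size=256): ...` could append each record to a JSONL file. Since the engine already schedules requests within a batch, a larger `chunk_size` is usually the better default; slicing mainly bounds how much output is held in memory at once.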

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()


=== Testing synchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Aurora and I'm a psychic medium. I am here to help you navigate life's challenges and provide guidance through my connections with loved ones who have passed on. My abilities allow me to tap into the spiritual realm, communicate with spirits, and offer insight into your personal and spiritual journey.
Aurora's Bio

I've been aware of my abilities from a young age, but it wasn't until I experienced a near-death experience at 19 that I truly understood the depth of my connection to the spiritual realm. During this experience, I was able to see and communicate with loved ones who had passed on, which validated my abilities and gave

Prompt: The capital of France is
Generated text:  divided into 20 arrondissements. Which of the following names does not belong to this list?
A) Champs-Élysées

B) Le Marais

C) 1er

D) Montmartre

E) Saint-Germain-des-Prés

Answer: C

Explanation: The arrondissements of Paris are numbered one through 20. Therefore, 1er is the only option that does not belong to the list. 1er is the first arrondissement, and the rest are referred to by their number and name. The names listed are all famous neighborhoods in Paris.

Prompt: The future of AI is
Generated text:  uncertain, and its impact on our lives will depend on the choices we make today

Our AI reality check is far from over. The world is still grappling with the implications of AI and its impact on our lives. As we move forward, we must acknowledge the complexity of the issue and the need for a multi-faceted approach.
In this section, we examine the potential benefits and risks of AI, as well as the importance of responsible AI development and governance. We also look at the need for a diverse and inclusive AI development process and the role of education and awareness in shaping our relationship with AI.
The Future of AI: Challenges and



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Hello, my name is
Generated text:  Emma, and I'm a Creative who is all about spreading positivity and inspiring others to chase their dreams. As a passionate artist, I've always been fascinated by the connection between creativity, self-expression, and mental health. I believe that art can be a powerful tool for healing, growth, and self-discovery.
I love to explore various mediums, from painting and drawing to mixed media and digital art. My artwork often features vibrant colors, abstract shapes, and dreamy landscapes that aim to evoke a sense of calmness and wonder. I'm also a big fan of collaging, as it allows me to combine different textures, patterns,

Prompt: The capital of France is
Generated text:  known for its romanticism, rich history, and stunning architecture. But what makes Paris so famous? Let’s dive into the reasons behind its allure.
1. Architectural Marvels:
Paris is home to some of the world’s most iconic landmarks, such as the Eiffel Tower, Notre-Dame Cathe
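
One practical benefit of the asynchronous API is that a batch call can be combined with standard `asyncio` tools. As an illustration, the sketch below bounds the wait with `asyncio.wait_for`. `generate_with_timeout`, `timeout_s`, and the `SlowEngine` stand-in are hypothetical names; the sketch assumes only that `async_generate(prompts, sampling_params)` is awaitable, as in the cell above. Whether the engine also aborts the underlying request when the awaiting task is cancelled is not covered here.

```python
import asyncio


async def generate_with_timeout(engine, prompts, sampling_params, timeout_s):
    """Await a batch, returning None if it does not finish within timeout_s seconds."""
    try:
        return await asyncio.wait_for(
            engine.async_generate(prompts, sampling_params), timeout=timeout_s
        )
    except asyncio.TimeoutError:
        return None


class SlowEngine:
    """Stand-in for sgl.Engine with an artificial delay (assumption)."""

    def __init__(self, delay_s):
        self.delay_s = delay_s

    async def async_generate(self, prompts, sampling_params=None, stream=False):
        await asyncio.sleep(self.delay_s)
        return [{"text": f" reply to {p!r}"} for p in prompts]
```

Inside a coroutine, `outputs = await generate_with_timeout(llm, prompts, sampling_params, timeout_s=60)` would then return either the usual list of outputs or `None`.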

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())


=== Testing asynchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Jasmine, and I'm a life coach and wellness expert. I help women overcome their limitations and tap into their inner strength and resilience to achieve their goals and live a more balanced and fulfilling life.
After a decade of working in the corporate world, I realized that my true passion lay in helping others achieve their full potential. I've spent years studying and practicing various forms of holistic healing, personal development, and wellness, and I've had the privilege of working with clients from all walks of life.
My approach is centered around empowering women to take ownership of their lives, their health, and their happiness. I believe that every woman has the strength

Prompt: The capital of France is
Generated text:  a city of beauty, history, and culture. With its stunning architecture, world-class museums, and vibrant atmosphere, Paris is a destination that has something to offer for everyone. Here are some of the top things to do in Paris:
1. Visit the Eiffel Tower: The iconic Eiffel Tower is a must-visit attraction in Paris. You can take the stairs or elevator to the top for breathtaking views of the city.
2. Explore the Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive collection of art and artifacts from around the world, including the Mona Lisa

Prompt: The future of AI is
Generated text:  exciting, but also raises questions about responsibility and ethics. How will we ensure that the benefits of AI are equitably distributed and its negative consequences are mitigated?
The future of AI is exciting, but also raises questions about responsibility and ethics. How will we ensure that the benefits of AI are equitably distributed and its negative consequences are mitigated?
The future of AI is exciting, but also raises questions about responsibility and ethics. How will we ensure that the benefits of AI are equitably distributed and its negative consequences are mitigated?
The future of AI is exciting, but also raises questions about responsibility and ethics. How will we ensure that the
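
The cell above streams one prompt at a time. A possible variation is to start all streams concurrently and collect each one into a string. The sketch below assumes that the engine accepts overlapping streaming requests, that `await async_generate(..., stream=True)` returns an async iterator as in the cell above, and that chunks are incremental. `stream_one`, `stream_all`, and the `TokenEngine` stand-in are hypothetical names.

```python
import asyncio


async def stream_one(engine, prompt, sampling_params):
    """Consume one asynchronous stream and return its concatenated text."""
    generator = await engine.async_generate(prompt, sampling_params, stream=True)
    pieces = []
    async for chunk in generator:
        pieces.append(chunk["text"])  # assumed to be incremental
    return "".join(pieces)


async def stream_all(engine, prompts, sampling_params):
    """Run one stream per prompt concurrently and map each prompt to its text."""
    texts = await asyncio.gather(
        *(stream_one(engine, p, sampling_params) for p in prompts)
    )
    return dict(zip(prompts, texts))


class TokenEngine:
    """Stand-in for sgl.Engine that streams three short chunks (assumption)."""

    async def async_generate(self, prompt, sampling_params=None, stream=False):
        async def chunks():
            for token in [" ->", " " + prompt.split()[-1], "!"]:
                await asyncio.sleep(0)  # yield control so streams interleave
                yield {"text": token}

        return chunks()
```

Because `asyncio.gather` preserves argument order, the returned dictionary lines up with `prompts` even when the streams finish in a different order.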




In [6]:
llm.shutdown()