# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine

import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  [your name] and I am a [your profession or student] from [your university name]. I am reaching out to you today because I am interested in learning more about your company and the opportunities that you offer. I came across your company while researching [industry/field] and was impressed by your company's commitment to [specific values or initiatives].
I would love the opportunity to speak with you more about how your company is making an impact in this field. Could we schedule a meeting or call at your convenience? I have attached my resume for your reference.
Thank you for considering my inquiry. I look forward to hearing from you soon
Prompt: The president of the United States is
Generated text:  the head of state and head of government of the United States. The president serves a four-year term and is limited to two terms. The president is elected through the Electoral College system, where each state is allocated a certain number of elec
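
When batch results need to be kept rather than printed, the prompt and output pairs can be collected into records. The sketch below relies only on what the cell above shows: `generate` returns one dict per prompt, in order, each with a `"text"` key. The names `collect_results`, `demo_prompts`, and `fake_outputs` are hypothetical, and the stand-in list replaces a real `llm.generate(prompts, sampling_params)` call so the snippet runs without a model.

```python
from typing import Dict, List


def collect_results(prompts: List[str], outputs: List[Dict]) -> List[Dict[str, str]]:
    """Pair each prompt with its generated text (hypothetical helper)."""
    if len(prompts) != len(outputs):
        raise ValueError("expected one output per prompt")
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


# Stand-in for llm.generate(prompts, sampling_params); not real model output.
demo_prompts = ["Hello, my name is", "The capital of France is"]
fake_outputs = [{"text": " Ada."}, {"text": " Paris."}]

records = collect_results(demo_prompts, fake_outputs)
for record in records:
    print(record["prompt"] + record["completion"])
```

The length check guards against `zip` silently dropping items if the two lists ever differ in size.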

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()


=== Testing synchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Jim and I am the "new" member of the NRT team! I have been a long-time fan of the sport of parkour and I am excited to be a part of the NRT team. I've been training parkour for about 10 years now and I've had the pleasure of learning from and training with some of the best in the world.
As you may know, NRT has been around for over 20 years, providing top-notch training and education to parkour enthusiasts from all over the world. I am honored to be a part of such a well-respected organization and I am excited to see what the

Prompt: The capital of France is
Generated text:  not the most visited city in the world. However, it is one of the most visited cities in the world and a must-see destination for many people. The city of Paris is home to some of the most famous landmarks in the world, such as the Eiffel Tower, the Louvre Museum, and Notre Dame Cathedral. Paris is also known for its fashion, cuisine, and romantic atmosphere.
The most visited city in the world is Bangkok, Thailand. Bangkok is a bustling metropolis with a rich culture and history. The city is home to many beautiful temples, including the famous Wat Phra Kaew and Wat Arun.

Prompt: The future of AI is
Generated text:  likely to be shaped by its ability to learn from humans and improve over time. This has sparked interest in developing more sophisticated AI systems that can learn from data, adapt to changing environments, and make decisions that are informed by human values.
In this context, we can explore the following AI topics:
Human-centered AI: This refers to AI systems that are designed to learn from humans and improve over time, with a focus on understanding human values, behaviors, and motivations.
Explainability and transparency: As AI systems become more complex, it is essential to develop methods that can explain their decision-making processes, making them more transparent and accountable.
Value



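If the full completion is needed after streaming, the chunks can be joined as they arrive. This is a minimal sketch, assuming each chunk's `"text"` field holds only the newly generated piece, as the output above suggests; if your SGLang version yields cumulative text instead, plain concatenation would duplicate content. `join_stream` and `fake_stream` are hypothetical names, and `fake_stream` stands in for `llm.generate(prompt, sampling_params, stream=True)`.

```python
from typing import Dict, Iterable, Iterator


def join_stream(chunks: Iterable[Dict]) -> str:
    """Concatenate the "text" field of each streamed chunk."""
    return "".join(chunk["text"] for chunk in chunks)


def fake_stream() -> Iterator[Dict]:
    # Stand-in for llm.generate(prompt, sampling_params, stream=True).
    for piece in [" Paris", ".", " It", " is", " a", " city", "."]:
        yield {"text": piece}


full_text = join_stream(fake_stream())
print(repr(full_text))
```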

### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Hello, my name is
Generated text:  Olegs and I’m a new member of the community. I’m here to learn and I’m interested in everything related to darknet, Tor browser, VPN, and all things related to online privacy and security. I also have some knowledge in coding and would like to participate in the forums and maybe even contribute to some projects if possible.

Welcome to the community, Olegs! We're glad to have you on board.

Before we dive into the world of darknet and online security, I'd like to clarify a few things to ensure you have a safe and informed experience here. Our community values knowledge sharing, critical thinking, and responsible

Prompt: The capital of France is
Generated text:  Paris.
The capital of France is Paris. The Eiffel Tower is the symbol of the city.
Paris is the capital of France. The Eiffel Tower is the symbol of Paris. Paris is a beautiful city in Europe.
Paris is the capital of France. The Eiffel Tower is a famous landmark in Paris.
The capital 
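
When prompts arrive one at a time rather than as a ready-made list, separate asynchronous requests can be awaited together. The sketch below shows the general `asyncio.gather` pattern with a cap on in-flight requests. `generate_concurrently`, `max_in_flight`, and `fake_generate` are hypothetical names; `fake_generate` stands in for an awaitable engine call, and whether `llm.async_generate` accepts a single prompt in this way is an assumption to verify against your SGLang version.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def generate_concurrently(
    generate_one: Callable[[str], Awaitable[Dict]],
    prompts: List[str],
    max_in_flight: int = 2,
) -> List[Dict]:
    """Await one request per prompt, at most max_in_flight at a time."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def guarded(prompt: str) -> Dict:
        async with semaphore:
            return await generate_one(prompt)

    # gather returns results in the order of its arguments.
    return await asyncio.gather(*(guarded(p) for p in prompts))


async def fake_generate(prompt: str) -> Dict:
    # Stand-in for an awaitable engine call; echoes the prompt in upper case.
    await asyncio.sleep(0)
    return {"text": prompt.upper()}


results = asyncio.run(generate_concurrently(fake_generate, ["a", "b", "c"]))
print([r["text"] for r in results])
```

Because `asyncio.gather` preserves argument order, the results can still be zipped with the prompts as in the cell above.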

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())


=== Testing asynchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Emily. I am a freelance writer and editor based in London. I have been working in the media industry for over 15 years, and I have a strong background in writing, editing and proofreading.
I have worked with a wide range of clients, from small businesses to large corporations, and I have a proven track record of delivering high-quality content on time and on budget.
My areas of expertise include:
Content creation: I can write a wide range of content, including articles, blog posts, social media posts, press releases and more.
Copywriting: I can write engaging and persuasive copy for websites, brochures, marketing materials

Prompt: The capital of France is
Generated text:  known for its iconic landmarks, rich history, and cultural attractions. Here's a brief overview:
The Eiffel Tower: This iron lattice tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city.
Notre-Dame Cathedral: A beautiful Gothic church that has been the site of coronations, royal weddings, and other important events throughout history.
The Louvre Museum: One of the world's largest and most famous museums, housing an impressive collection of art and artifacts, including the Mona Lisa.
The Arc de Triomphe: A monumental arch honoring the soldiers who fought and died for France, offering stunning

Prompt: The future of AI is
Generated text:  here – and it’s being driven by a new kind of intelligence.
For decades, humans have been the pinnacle of intelligence on this planet. But with the advent of Artificial General Intelligence (AGI), we are witnessing a new era in which machines can learn, reason, and apply knowledge across a wide range of tasks, much like humans.
AGI represents a significant leap forward in AI capabilities. It can process information, understand language, and make decisions on its own, often more efficiently and accurately than humans. However, this also raises concerns about the potential impact on jobs, the economy, and our very way of life.
One of the



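The loop above streams one prompt at a time. Several async streams can also be consumed concurrently, with each stream's chunks accumulated into its own string. The following is a sketch of that pattern only: `collect_streams`, `open_stream`, and `fake_async_stream` are hypothetical names. With the real engine the stream is obtained via `await llm.async_generate(prompt, sampling_params, stream=True)`, as in the cell above, so `open_stream` would need a small adapter.

```python
import asyncio
from typing import AsyncIterator, Callable, Dict, List


async def collect_streams(
    open_stream: Callable[[str], AsyncIterator[Dict]],
    prompts: List[str],
) -> List[str]:
    """Consume one async stream per prompt concurrently; return full texts."""

    async def consume(prompt: str) -> str:
        pieces = []
        async for chunk in open_stream(prompt):
            pieces.append(chunk["text"])
        return "".join(pieces)

    return await asyncio.gather(*(consume(p) for p in prompts))


async def fake_async_stream(prompt: str) -> AsyncIterator[Dict]:
    # Stand-in stream: yields the prompt's words one at a time.
    for word in prompt.split():
        await asyncio.sleep(0)  # give other streams a chance to run
        yield {"text": word + " "}


texts = asyncio.run(collect_streams(fake_async_stream, ["a b c", "d e"]))
print(texts)
```

Each stream keeps its own `pieces` list, so interleaved chunks from different prompts never mix.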

In [6]:
llm.shutdown()
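
In a script, an exception raised between engine creation and `llm.shutdown()` would skip the shutdown call. One way to guard against that is a small context manager. This is a sketch under the assumption that the only cleanup required is the `shutdown()` call shown above; `shutting_down` and `DummyEngine` are hypothetical names, and `DummyEngine` stands in for `sgl.Engine` so the snippet runs without a model.

```python
from contextlib import contextmanager


@contextmanager
def shutting_down(engine):
    """Yield engine, then call engine.shutdown() even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


class DummyEngine:
    # Stand-in for sgl.Engine so the sketch runs without a model.
    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


dummy = DummyEngine()
with shutting_down(dummy) as engine:
    pass  # call engine.generate(...) here in real use
print(dummy.closed)
```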