# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where a server would add unnecessary complexity or overhead. There are two common use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.

import asyncio

import sglang as sgl

# Creating the engine loads the model checkpoint.
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# generate() takes the whole batch; the loop below reads one output dict per prompt.
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Richard. I am a 40-year-old web developer who loves to learn and explore new things. I have a degree in Computer Science and have been working in the industry for over 15 years. I'm a bit of a nerd, but I like to think that's a good thing!

In my free time, I enjoy playing video games, watching movies, and reading science fiction novels. I'm also a bit of a coffee snob and love trying out new coffee shops in my area.

I'm excited to start this blog and share my thoughts and experiences with the world. I'm not sure what topics I'll cover, but
Prompt: The president of the United States is
Generated text:  not a monarch, but a public servant with a specific set of duties and responsibilities. One of those duties is to nominate a new justice to the Supreme Court when a vacancy arises, a power granted to the president by Article II, Section 2 of the Constitution. The Senate then advises and consents on the nomination, which is typically a lengthy p
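
A common follow-up for offline batch jobs is to persist each prompt together with its completion. The sketch below relies only on what the cell above shows: `outputs` holds one dict per prompt, and each dict has a `"text"` key. The helper names `to_records` and `dump_jsonl`, the record field names, and the stand-in data are illustrative assumptions, not part of SGLang.

```python
import json


def to_records(prompts, outputs):
    """Pair each prompt with the "text" field of its output dict."""
    if len(prompts) != len(outputs):
        raise ValueError("expected one output per prompt")
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


def dump_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


# Stand-in data shaped like the result of llm.generate(prompts, sampling_params).
example_prompts = ["The capital of France is"]
example_outputs = [{"text": " Paris."}]

records = to_records(example_prompts, example_outputs)
print(dump_jsonl(records))
```

With a real engine, you would pass the `prompts` list and the `outputs` returned by `llm.generate` instead of the stand-in data.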

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    # With stream=True, generate() yields chunks while text is being produced.
    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()


=== Testing synchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Nicky Waterson and I am a fully qualified, insured and DBS checked member of the Register of Exercise Professionals (REPs). I am a Personal Trainer based in Harrogate, North Yorkshire.
I have been a Personal Trainer for over 10 years and have helped numerous clients achieve their health and fitness goals. I am passionate about helping people achieve a healthy lifestyle and improving their overall well-being. My approach is always tailored to the individual, taking into account their needs, goals and fitness level.
As well as being a Personal Trainer, I am also a qualified Pilates instructor and have experience working with clients of all ages and abilities.



Prompt: The capital of France is
Generated text:  the perfect destination for a romantic getaway. The city is full of history, art, and culture, and its stunning architecture, beautiful parks, and picturesque streets make it a dream destination for couples. Here are some reasons why Paris is a top choice for honeymooners and couples celebrating special occasions:
The City of Love: Paris has been known as the City of Love for centuries. Its charming atmosphere, beautiful landscapes, and romantic atmosphere make it the perfect destination for couples to celebrate their love and strengthen their bond.
Art and Culture: Paris is a hub for art, culture, and history. Couples can visit the famous Louvre Museum, which houses



Prompt: The future of AI is
Generated text:  human-centric

As AI technology continues to advance, the future of AI is increasingly becoming human-centric. This means that AI will be designed to enhance human capabilities, rather than replace them. In this article, we'll explore what this means and how it will shape the future of AI.
The current state of AI

Currently, AI is being used in various industries such as healthcare, finance, transportation, and education. While AI has made significant progress in these areas, it is still largely machine-centric. AI systems are designed to perform specific tasks, often with limited human oversight.
However, as AI technology continues to advance, we're seeing a
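
When you want the full completion as a single string in addition to live printing, the streamed chunks can be accumulated as they arrive. This is a minimal sketch that assumes each chunk's `"text"` carries only newly generated text, as the output above suggests. That behavior may differ between SGLang versions; if chunks overlap or are cumulative, a plain join would duplicate text. `collect_stream` and `fake_stream` are hypothetical names, and `fake_stream` stands in for `llm.generate(prompt, sampling_params, stream=True)`.

```python
def collect_stream(chunks, on_text=None):
    """Join the "text" of each streamed chunk into one string.

    Assumes every chunk carries only newly generated text (an assumption
    based on the output above, not a documented guarantee).
    """
    pieces = []
    for chunk in chunks:
        text = chunk["text"]
        if on_text is not None:
            on_text(text)  # e.g. print the piece as it arrives
        pieces.append(text)
    return "".join(pieces)


def fake_stream():
    """Stand-in for llm.generate(prompt, sampling_params, stream=True)."""
    for piece in [" Paris", " is", " lovely", "."]:
        yield {"text": piece}


seen = []
full_text = collect_stream(fake_stream(), on_text=seen.append)
print(full_text)
```

Passing `on_text=lambda t: print(t, end="", flush=True)` would reproduce the live printing from the cell above while still returning the merged string.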




### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    # Await the whole batch; the loop below reads one output dict per prompt.
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Hello, my name is
Generated text:  Sophia and I am a 22-year-old college student. I am a senior in a small liberal arts college in the Midwest, and I am planning to graduate in May with a degree in English literature. I have a passion for writing, reading, and learning, and I am excited to share my experiences and perspectives with you.
Throughout my college career, I have been involved in various extracurricular activities that have allowed me to grow both personally and professionally. As an English major, I have had the opportunity to take a wide range of courses that have helped me develop my critical thinking, writing, and communication skills. I have also

Prompt: The capital of France is
Generated text:  often called the "City of Light" and the most visited city in the world. Paris is renowned for its stunning architecture, artistic heritage, and world-class cuisine. The Eiffel Tower, the Louvre, the Seine River, and Notre-Dame Cathedral are just a few of the many famou
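
One reason to use the asynchronous API is that several requests can be awaited together on one event loop. The sketch below shows that pattern with `asyncio.gather`. `FakeEngine` and `generate_all` are hypothetical names; `FakeEngine` is a stand-in so the example runs without a model. Whether a real engine instance handles overlapping `async_generate` calls the way your workload needs is an assumption to verify.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in exposing an async_generate coroutine."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # simulate waiting on the model
        return {"text": f" <completion for {prompt!r}>"}


async def generate_all(engine, prompts, sampling_params):
    """Issue one request per prompt and await them together.

    asyncio.gather returns results in the order the awaitables were passed.
    """
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


prompts = ["Hello, my name is", "The capital of France is"]
results = asyncio.run(generate_all(FakeEngine(), prompts, {"temperature": 0.8}))
for prompt, result in zip(prompts, results):
    print(prompt, "->", result["text"])
```

Because `asyncio.gather` preserves input order, `zip(prompts, results)` pairs each prompt with its own result even if requests finish out of order.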

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Awaiting the call returns an async generator, which is then consumed with `async for`.
        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())


=== Testing asynchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Maya and I am a young, independent woman from a developing country. I am a passionate advocate for climate justice, human rights, and gender equality. I have a degree in Environmental Science and have been working with communities in my country to promote sustainable development and raise awareness about the impacts of climate change.
I have been following the news about the climate crisis and the growing activism around the world, and I must say that I am both inspired and concerned. Inspired by the courage and determination of activists like Greta Thunberg, and concerned about the lack of action from governments and corporations to address this crisis.
As a young woman from a developing



Prompt: The capital of France is
Generated text:  facing a crisis of housing shortages, but French housing law says that landlords are responsible for providing heat, light, and cleanliness in a rental property. This means that tenants have rights and protections.
For example, landlords are required to ensure that the building has adequate insulation and that windows are properly sealed to maintain a comfortable temperature. They also have to provide enough light, which means installing suitable lighting systems, such as LED light bulbs. Additionally, landlords are responsible for ensuring that the property is clean and hygienic, including regular cleaning of common areas and maintaining the building's systems, like plumbing and electrical.
Furthermore, landlords have to maintain the property



Prompt: The future of AI is
Generated text:  here, and it's a lot more like you than you think. A cutting-edge AI-powered chatbot named "Meta Llama 3" has just been unveiled, and it's capable of carrying on conversations that are surprisingly human-like.
Meta Llama 3 is an AI model developed by Meta AI, the AI research lab of Meta Platforms, Inc. It's designed to be a conversational AI that can engage in discussions on a wide range of topics, from everyday conversations to complex subjects like science, history, and culture.
The chatbot has been trained on a massive dataset of text from various sources, including books, articles,
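
The asynchronous streaming cell uses a two-step pattern: first `await` the call to obtain an async generator, then consume it with `async for`. The sketch below isolates that pattern and collects the chunks into one string. `fake_async_generate` and `collect_async_stream` are hypothetical names; the former is a stand-in for `llm.async_generate(..., stream=True)`. As with synchronous streaming, the sketch assumes each chunk holds only new text.

```python
import asyncio


async def fake_async_generate(prompt, sampling_params, stream=True):
    """Hypothetical stand-in: a coroutine that returns an async generator."""

    async def chunks():
        for piece in [" here", ",", " and", " growing", "."]:
            await asyncio.sleep(0)  # yield control, as real I/O would
            yield {"text": piece}

    return chunks()


async def collect_async_stream(prompt, sampling_params):
    """Await the call to get the generator, then iterate it with async for."""
    generator = await fake_async_generate(prompt, sampling_params, stream=True)
    pieces = []
    async for chunk in generator:
        pieces.append(chunk["text"])
    return "".join(pieces)


streamed = asyncio.run(collect_async_stream("The future of AI is", {"top_p": 0.95}))
print(streamed)
```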




In [6]:
llm.shutdown()
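
In a script, as opposed to a notebook, an exception raised before the final cell would skip `llm.shutdown()`. One way to guard against that is to wrap the engine in a context manager that always calls `shutdown()`. This is a sketch only: `managed_engine` is a hypothetical helper, not an SGLang API, and `FakeEngine` is a stand-in used so the example runs without a model.

```python
from contextlib import contextmanager


@contextmanager
def managed_engine(factory):
    """Create an engine via factory() and always call shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


class FakeEngine:
    """Hypothetical stand-in that records whether shutdown() ran."""

    def __init__(self):
        self.closed = False

    def generate(self, prompt, sampling_params):
        return {"text": " ok"}

    def shutdown(self):
        self.closed = True


with managed_engine(FakeEngine) as engine:
    reply = engine.generate("Hello", {"temperature": 0.8})

print(reply["text"], engine.closed)
```

With a real engine, the factory could be a lambda that builds `sgl.Engine(model_path=...)` with the same arguments used in the first cell.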