# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
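
As a rough illustration of the idea (not the linked example itself), a request handler can validate an incoming payload and delegate to the engine's `generate` call. In the sketch below, `handle_generate`, the payload fields, and `EchoEngine` are hypothetical names. `EchoEngine` only stands in for `sgl.Engine` so the snippet runs without a model, and it assumes nothing beyond the call shape used later in this document: a list of prompts plus a sampling-parameter dict in, a list of dicts with a `"text"` key out.

```python
import json
from typing import Tuple


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine; it only mimics the call shape."""

    def generate(self, prompts, sampling_params):
        # Assumed shape: a list of prompts in, a list of {"text": ...} dicts out.
        return [{"text": f" (echo of {p!r})"} for p in prompts]


def handle_generate(engine, raw_body: bytes) -> Tuple[int, bytes]:
    """Turn a JSON request body into a (status_code, JSON response body) pair."""
    try:
        payload = json.loads(raw_body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return 400, json.dumps({"error": "body is not valid JSON"}).encode()

    prompt = payload.get("prompt") if isinstance(payload, dict) else None
    if not isinstance(prompt, str) or not prompt:
        return 400, json.dumps({"error": "missing 'prompt' string"}).encode()

    # Fall back to the sampling parameters used throughout this document.
    sampling_params = payload.get("sampling_params", {"temperature": 0.8, "top_p": 0.95})
    outputs = engine.generate([prompt], sampling_params)
    return 200, json.dumps({"text": outputs[0]["text"]}).encode()


status, body = handle_generate(EchoEngine(), b'{"prompt": "The capital of France is"}')
print(status, json.loads(body)["text"])
```

Keeping the handler independent of any web framework makes it easy to test in isolation; a real server would call it from its routing layer.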

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine

import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Rachel. I am a 29-year-old teacher and a mother of two young children. I am writing this letter to my future self, hoping to capture the essence of this moment in time.
As I sit here in my cozy home, surrounded by the chaos of motherhood, I am filled with a sense of gratitude and wonder. My children, Emily and Jackson, are growing up so fast. They are at that magical age where everything is new and exciting, and they are constantly exploring the world around them.
As a teacher, I have the privilege of being a part of my students' educational journey, guiding them through the ups and
Prompt: The president of the United States is
Generated text:  the head of state and head of government of the United States, and is the highest-ranking official in the federal government. The president is also the commander-in-chief of the armed forces and has significant executive power. The president is elected by the Electoral College and serves a four-year ter
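
For offline batch jobs it is often handy to persist each prompt together with its completion. The helper below is a minimal sketch: `to_jsonl` is a hypothetical name, and the only assumption about the engine is the one visible in the cell above, namely that each element of `outputs` is a dict with a `"text"` key. The `fake_outputs` list stands in for a real `llm.generate` result so the snippet runs on its own.

```python
import json


def to_jsonl(prompts, outputs):
    """Serialize prompt/completion pairs as JSON Lines, one record per prompt."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = [
        json.dumps({"prompt": p, "completion": o["text"]}, ensure_ascii=False)
        for p, o in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Stand-in for `llm.generate(prompts, sampling_params)`; real outputs may carry more keys.
fake_outputs = [{"text": " Paris."}, {"text": " uncertain."}]
print(to_jsonl(["The capital of France is", "The future of AI is"], fake_outputs))
```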

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()


=== Testing synchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Jeannie and I am thrilled to be one of the newest members of the Colonial Heights Animal Shelter team! As an animal lover and advocate, I am committed to making a difference in the lives of our furry friends here at the shelter.
I bring with me a strong background in customer service, with a passion for working with people of all ages and backgrounds. My experience has taught me the importance of empathy, active listening, and clear communication – essential skills for building strong relationships with our shelter visitors, volunteers, and community partners.
As an animal lover, I have always been drawn to the joy and unconditional love that animals bring to our lives.

Prompt: The capital of France is
Generated text:  a city like no other. From the stunning architecture to the rich history and culture, Paris is a destination that has something to offer everyone. Whether you're interested in art, fashion, food, or adventure, Paris is the perfect place to explore. In this article, we will explore the top 10 things to do in Paris and why they are a must-visit.
1. Visit the Eiffel Tower

The Eiffel Tower is one of the most iconic landmarks in the world and a must-visit attraction in Paris. Built for the 1889 World's Fair, the tower stands 324 meters tall and offers breathtaking

Prompt: The future of AI is
Generated text:  in the hands of the users

Artificial intelligence (AI) is advancing rapidly and becoming increasingly ubiquitous in our daily lives. From virtual assistants like Siri and Alexa to self-driving cars and predictive analytics, AI is transforming the way we live, work, and interact with one another. As AI continues to evolve and improve, it’s becoming clear that its future is not just in the hands of developers and engineers, but also in the hands of users.
The democratization of AI

The democratization of AI refers to the process of making AI accessible and usable by a wider range of people, beyond just experts and developers. This is happening through
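
When the streamed text is needed as a single string afterwards (for logging or post-processing), the chunks can be accumulated while they are printed. This sketch assumes, as the streaming loop in this section does, that each chunk's `"text"` field holds only the newly generated piece; if an engine instead returned the cumulative text in every chunk, only the last chunk would need to be kept. `collect_stream` and `fake_stream` are hypothetical names, and `fake_stream` replaces `llm.generate(prompt, sampling_params, stream=True)` so the example runs without a model.

```python
from typing import Dict, Iterable, Iterator


def fake_stream(pieces: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Hypothetical stand-in for a streaming generate call."""
    for piece in pieces:
        yield {"text": piece}


def collect_stream(chunks: Iterable[Dict[str, str]], echo: bool = True) -> str:
    """Optionally print chunks as they arrive and return the concatenated text."""
    parts = []
    for chunk in chunks:
        if echo:
            print(chunk["text"], end="", flush=True)
        parts.append(chunk["text"])
    if echo:
        print()  # finish the line once the stream is exhausted
    return "".join(parts)


full_text = collect_stream(fake_stream([" Je", "annie", " and", " I"]))
```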




### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Hello, my name is
Generated text:  Joseph and I'm a 17 year old student at a private boarding school in New York. I'm currently a junior and I'm struggling with keeping up with schoolwork and extracurricular activities. I'm not sure if I should take a gap year or stick with the traditional 4-year college route. I'm worried that if I take a gap year, I'll fall behind and not be as competitive when I apply to colleges. But, I'm also worried that if I stick with the traditional route, I'll burn out and not be able to fully enjoy my last year of high school.
It sounds like you're feeling

Prompt: The capital of France is
Generated text:  a city that has been shrouded in history, culture, and romance for centuries. Paris, the City of Light, is a place that has been a source of inspiration for artists, writers, and musicians for centuries. From the Eiffel Tower to the Louvre, there are countless iconic landmarks and attractions that make Paris a must-visit destination for anyone who
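
Because `async_generate` is awaitable, several independent requests can be overlapped with standard `asyncio` tools. The sketch below uses a hypothetical `StubEngine` whose `async_generate` only mimics the call shape used above (a list of prompts in, a list of `{"text": ...}` dicts out), and `generate_groups` is likewise an illustrative name. Whether overlapping calls improves throughput with a real engine depends on its scheduler and is not measured here.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for sgl.Engine with the same awaitable call shape."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do some work
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def generate_groups(engine, groups, sampling_params):
    """Await one async_generate call per prompt group, all running concurrently."""
    tasks = [engine.async_generate(group, sampling_params) for group in groups]
    # gather preserves input order, so results[i] belongs to groups[i].
    return await asyncio.gather(*tasks)


groups = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
params = {"temperature": 0.8, "top_p": 0.95}
results = asyncio.run(generate_groups(StubEngine(), groups, params))
for group, outputs in zip(groups, results):
    for prompt, output in zip(group, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

In an environment where an event loop is already running, such as many notebook front ends, `await generate_groups(...)` would typically be used in place of `asyncio.run(...)`.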

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())


=== Testing asynchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  David and I'm the owner of DeliWebb.  I'm a foodie and a entrepreneur at heart.  After years of working in the food industry I decided to start my own business and share my passion for food with the world.  I believe that good food should be accessible to everyone and that's why I created DeliWebb.  I'm proud to say that our online deli is one of the best places to find unique and delicious food items from around the world.
I'm a firm believer that food has the power to bring people together.  That's why I'm committed to providing the best possible

Prompt: The capital of France is
Generated text:  a city that is steeped in history and culture. Paris is a city that has something to offer everyone, from its stunning architecture and art museums to its romantic Seine River and charming cafes. Here are some of the top things to do in Paris:
1. Visit the Eiffel Tower: The Eiffel Tower is one of the most iconic landmarks in the world and a must-visit attraction in Paris. Visitors can take a lift to the top of the tower for breathtaking views of the city.
2. Explore the Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive

Prompt: The future of AI is
Generated text:  being shaped by the growing importance of Explainability

Artificial Intelligence (AI) has made tremendous strides in recent years, with applications in areas such as image recognition, natural language processing, and predictive analytics. However, as AI continues to become more ubiquitous, there is a growing recognition of the need for Explainability in AI systems. Explainability refers to the ability of an AI system to provide transparent and understandable explanations for its decisions and actions.
Why is Explainability important?
Explainability is essential for several reasons:
1. **Trust**: When people can understand how an AI system arrives at a decision, they are more likely to trust the
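
An asynchronous stream can also be bounded by a deadline, which is useful when a caller cannot wait for the full completion. The sketch below wraps the consumption loop in `asyncio.wait_for` and returns whatever text arrived in time. `SlowStubEngine` and `stream_with_deadline` are hypothetical names; the stub only imitates the pattern shown above, where awaiting `async_generate(..., stream=True)` yields an async generator of `{"text": ...}` chunks. Whether cancelling the consumer also releases the request inside a real engine is not something this sketch verifies.

```python
import asyncio


class SlowStubEngine:
    """Hypothetical stand-in for sgl.Engine that streams fixed chunks with a delay."""

    def __init__(self, delay_s):
        self.delay_s = delay_s

    async def async_generate(self, prompt, sampling_params, stream=False):
        async def chunks():
            for piece in (" a", " city", " of", " light"):
                await asyncio.sleep(self.delay_s)
                yield {"text": piece}

        return chunks()


async def stream_with_deadline(engine, prompt, sampling_params, timeout_s):
    """Collect streamed text, giving up once timeout_s seconds have elapsed."""
    parts = []

    async def consume():
        generator = await engine.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            parts.append(chunk["text"])

    try:
        await asyncio.wait_for(consume(), timeout=timeout_s)
        finished = True
    except asyncio.TimeoutError:
        finished = False  # keep whatever arrived before the deadline
    return "".join(parts), finished


params = {"temperature": 0.8, "top_p": 0.95}
fast = asyncio.run(stream_with_deadline(SlowStubEngine(0.0), "The capital of France is", params, 1.0))
slow = asyncio.run(stream_with_deadline(SlowStubEngine(0.5), "The capital of France is", params, 0.05))
print(fast)
print(slow)
```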




In [6]:
llm.shutdown()
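
To make sure `shutdown()` runs even when generation raises an exception, the engine's lifetime can be wrapped in a context manager. This is a minimal sketch: `engine_session` and `StubEngine` are hypothetical names, and the stub only records the `shutdown()` call so the snippet runs without a model. With the real engine, the factory could be something like `lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")`.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(make_engine):
    """Create an engine and guarantee that shutdown() runs on exit."""
    engine = make_engine()
    try:
        yield engine
    finally:
        engine.shutdown()


with engine_session(StubEngine) as engine:
    pass  # call engine.generate(...) here
print(engine.is_shut_down)
```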