# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when an additional HTTP server would add unnecessary complexity or overhead. Here are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A complete example implemented as a Python script is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
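
At its core, such a server needs a request handler that validates an incoming payload and forwards it to the engine's `async_generate` method. The following minimal, framework-agnostic sketch is not the code in `custom_server.py`. The payload keys (`prompt`, `sampling_params`), the `handle_generate` function, and `StubEngine` are all hypothetical. `StubEngine` stands in for `sgl.Engine` so the snippet runs without a GPU. The sketch also assumes that a single-prompt, non-streaming call returns one dict with a `text` key.

```python
import asyncio
import json


async def handle_generate(engine, body: str) -> str:
    """Hypothetical handler: parse a JSON request, call the engine, return JSON."""
    payload = json.loads(body)
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt:
        return json.dumps({"error": "'prompt' must be a non-empty string"})
    # Fall back to the sampling parameters used throughout this page.
    sampling_params = payload.get("sampling_params", {"temperature": 0.8, "top_p": 0.95})
    # Assumption: a single-prompt call returns one dict with a "text" key.
    output = await engine.async_generate(prompt, sampling_params)
    return json.dumps({"text": output["text"]})


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; echoes the prompt instead of running a model."""

    async def async_generate(self, prompt, sampling_params):
        return {"text": f" <completion of {prompt!r}>"}


if __name__ == "__main__":
    reply = asyncio.run(handle_generate(StubEngine(), '{"prompt": "Hello"}'))
    print(reply)
```

To use it with a real engine, pass `llm` instead of `StubEngine()`, and call `handle_generate` from the route of whichever web framework you choose.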

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine (this loads the model checkpoint).
import asyncio

import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
```

### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# Non-streaming: the call blocks and returns one output dict per prompt.
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```text
Prompt: Hello, my name is
Generated text:  Liam and I am 10 years old. I am in the 5th grade. I love playing sports, reading books, and riding my bike. My favorite sport is soccer, and I have been playing it for 4 years. I also really like to read books about adventure and science fiction. Some of my favorite books are the Harry Potter series, the Diary of a Wimpy Kid series, and the Captain Underpants series. My favorite bike is a red and black mountain bike, and I like to ride it in the woods near my house. When I am older, I want to be a professional soccer player and travel
Prompt: The president of the United States is
Generated text:  a person who has been elected to serve as the head of state and government of the United States. The president serves a four-year term and is responsible for implementing the laws passed by Congress and serving as the commander-in-chief of the armed forces. The president is also the symbol of national unity and is responsible for representing the cou
```
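
Offline batch jobs usually write results to disk instead of printing them. Here is a minimal sketch that stores prompt/completion pairs as JSON Lines. It assumes `outputs` is the list returned by `llm.generate(prompts, sampling_params)` above, with one dict per prompt and a `text` key in each. The `save_jsonl` and `load_jsonl` helpers and the file name are hypothetical.

```python
import json
from pathlib import Path


def save_jsonl(prompts, outputs, path):
    """Write one JSON object per line, pairing each prompt with its completion."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    with Path(path).open("w", encoding="utf-8") as f:
        for prompt, output in zip(prompts, outputs):
            record = {"prompt": prompt, "completion": output["text"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_jsonl(path):
    """Read the records back, one dict per non-empty line."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

With the engine above, you would call `save_jsonl(prompts, outputs, "batch_results.jsonl")` after `llm.generate` returns. JSON Lines keeps each record independent, so a partially written file can still be read line by line.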

### Streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    # stream=True returns an iterator of chunks; print each chunk's text as it arrives.
    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()
```

```text
=== Testing synchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Viviana.
I’m a 16-year-old student from Spain.
I love dancing and singing and I enjoy learning new languages. I also love playing with my pets and spending time with my family.
I'm currently in my second year of high school and I'm studying English and French. I'm trying my best to become fluent in both languages.
I'm excited to learn more about other cultures and traditions through this exchange. I hope to make new friends and learn from them. I also hope to teach them about Spanish culture and traditions.
I love trying new foods, especially desserts! Do you have a favorite dessert?
Haha, I

Prompt: The capital of France is
Generated text:  in a state of unrest, with protests and riots continuing to unfold across the city. The Yellow Vest movement, a grassroots uprising against government policies, has been ongoing for months and shows no signs of abating.
The protests are not just about economic inequality, but also about the erosion of social protections and the growing sense of disillusionment with the political establishment. The government's response has been criticized as heavy-handed and ineffective, with many calling for greater reforms and a more inclusive approach to governance.
The city is a hub of activity, with demonstrations and rallies taking place across various neighborhoods. The streets are filled with the sound of chanting and the smell

Prompt: The future of AI is
Generated text:  human

AI will need human imagination, creativity, and empathy to truly be beneficial for society

The future of AI is human. Or, at the very least, the future of AI will rely heavily on humans to be truly beneficial for society.
Right now, artificial intelligence (AI) is being trained on vast amounts of data to perform tasks such as image recognition, natural language processing, and decision-making. These tasks are often narrow and focused, and while AI has made tremendous progress in these areas, it still lacks a key ingredient: human imagination, creativity, and empathy.
Imagine a world where AI is not just a tool, but a
```

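
If you need the full completion as well as the live display, accumulate the chunks as they arrive. The sketch below assumes that each streamed chunk's `text` field carries only the newly generated piece, which is what the printed output above suggests. `stream_and_collect` is a hypothetical helper, and the list of dicts in the usage stands in for `llm.generate(prompt, sampling_params, stream=True)`.

```python
import sys
from typing import Iterable, Mapping, TextIO


def stream_and_collect(chunks: Iterable[Mapping[str, str]], out: TextIO = sys.stdout) -> str:
    """Echo each chunk's text as it arrives and return the concatenated completion."""
    pieces = []
    for chunk in chunks:
        piece = chunk["text"]  # assumed to be the incremental piece only
        out.write(piece)
        out.flush()
        pieces.append(piece)
    out.write("\n")
    return "".join(pieces)


if __name__ == "__main__":
    # Hypothetical stand-in for llm.generate(prompt, sampling_params, stream=True).
    fake_stream = [{"text": " Viv"}, {"text": "iana"}, {"text": "."}]
    full_text = stream_and_collect(fake_stream)
```

The helper accepts any iterable of chunk dicts, so the same function can wrap the real engine's stream without changes, provided the incremental-text assumption holds.
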
### Non-streaming Asynchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    # async_generate is a coroutine, so it must be awaited inside an async function.
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```

```text
=== Testing asynchronous batch generation ===

Prompt: Hello, my name is
Generated text:  Janine Smith and I am a midwife and lactation consultant. I am excited to introduce you to the world of breastfeeding support. As a healthcare provider, I understand the challenges that women face when trying to nurse their babies. That's why I'm dedicated to providing you with the best possible care and support throughout your breastfeeding journey.

With over 10 years of experience in midwifery and lactation consulting, I have helped countless women overcome breastfeeding difficulties and achieve their breastfeeding goals. My approach is holistic and compassionate, taking into account the physical, emotional, and social aspects of breastfeeding.

Whether you're a first-time mom or a

Prompt: The capital of France is
Generated text:  also known for its incredible selection of museums, art galleries, and historical landmarks. From the Eiffel Tower to the Louvre, Paris has something to offer for all interests and ages. Here are some of the top
```
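
In a larger asyncio application, prompts may arrive at different times. In that case, independent requests can be issued concurrently instead of as a single batch. Here is a minimal sketch using `asyncio.gather`. `StubEngine` is a hypothetical stand-in for `sgl.Engine`; pass `llm` instead to run against the real engine. The sketch also assumes that calling `async_generate` with a single prompt returns one dict with a `text` key.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that sleeps instead of generating."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # simulate generation latency
        return {"text": f" ...continuation of {prompt!r}"}


async def generate_concurrently(engine, prompts, sampling_params):
    """Issue one request per prompt and await them together; results keep prompt order."""
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


async def main():
    prompts = ["Hello, my name is", "The capital of France is"]
    sampling_params = {"temperature": 0.8, "top_p": 0.95}
    outputs = await generate_concurrently(StubEngine(), prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")


if __name__ == "__main__":
    asyncio.run(main())
```

`asyncio.gather` returns results in the order the awaitables were passed, so results can still be zipped with `prompts` even if the requests finish out of order.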

### Streaming Asynchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # With stream=True, awaiting async_generate gives an async generator of chunks.
        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())
```

```text
=== Testing asynchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Tina and I'm a senior at Colorado State University. I'm a little nervous about graduating and entering the real world. I've heard so many mixed messages about the job market and it's hard to know what to expect. I've always been interested in writing, but I'm not sure if that's a stable or well-paying career path.
First, let's talk about your background and interests. What kind of writing are you interested in doing? Do you have any specific industries or fields in mind? For example, are you interested in copywriting, journalism, technical writing, or something else?
Second, let's talk about your goals

Prompt: The capital of France is
Generated text:  in chaos. Protests and strikes have been taking place for weeks over President Emmanuel Macron's proposed pension reform. The demonstrations are led by a vast coalition of unions, activists, and ordinary citizens who fear the reform will lead to a reduction in their retirement benefits.
France's pension system is a complex and highly decentralized system that covers over 40 million workers. The system has been criticized for being inefficient and plagued by abuses. The proposed reform aims to consolidate the 42 different pension schemes into a single system, which could save the government billions of euros.
The protests have turned violent, with clashes between protesters and police. The government has deployed thousands

Prompt: The future of AI is
Generated text:  now: Meet the brilliant students behind the world’s top AI projects

In recent years, AI has become a hot topic in the tech world, with many companies investing heavily in AI research and development. But behind the scenes, students are working tirelessly to push the boundaries of what is possible with AI. Here, we’ll meet some of the brilliant students behind the world’s top AI projects.
The winners of the 2022 Google AI Impact Challenge

The Google AI Impact Challenge is a prestigious competition that aims to identify and support AI projects with the potential to drive meaningful impact. This year, the winners were chosen from over 2,000
```
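
The cell above streams the prompts one after another. To consume several streams at once, give each stream its own coroutine and gather them. This is a sketch under two assumptions: `await engine.async_generate(prompt, sampling_params, stream=True)` returns an async iterator of chunks, as in the cell above, and each chunk's `text` holds only the new piece. `StubEngine`, `collect_stream`, and `stream_all` are hypothetical names.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; only the streaming path is modeled."""

    async def async_generate(self, prompt, sampling_params, stream=True):
        async def chunks():
            for word in (" one", " two", " three"):
                await asyncio.sleep(0)  # yield control, as a real stream would
                yield {"text": word}

        return chunks()


async def collect_stream(engine, prompt, sampling_params):
    """Consume one stream and return (prompt, full completion)."""
    generator = await engine.async_generate(prompt, sampling_params, stream=True)
    pieces = [chunk["text"] async for chunk in generator]
    return prompt, "".join(pieces)


async def stream_all(engine, prompts, sampling_params):
    """Run every stream concurrently; gather preserves prompt order."""
    return await asyncio.gather(
        *(collect_stream(engine, p, sampling_params) for p in prompts)
    )


if __name__ == "__main__":
    results = asyncio.run(stream_all(StubEngine(), ["A", "B"], {}))
    for prompt, text in results:
        print(f"Prompt: {prompt}\nGenerated text:{text}")
```

Because the streams interleave, this sketch collects each completion and prints it at the end rather than printing chunks live. Printing live output from several concurrent streams would mix their text on one console.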

Finally, shut down the engine to release its resources:

```python
llm.shutdown()
```
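
In a standalone script, wrap the engine's lifetime in `try`/`finally` so that `shutdown()` runs even if generation raises an exception. Here is a minimal sketch using `contextlib.contextmanager`. `engine_session` and `StubEngine` are hypothetical. With the real engine, the factory would be something like `lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")`.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(make_engine):
    """Create an engine, hand it to the caller, and always shut it down afterwards."""
    engine = make_engine()
    try:
        yield engine
    finally:
        engine.shutdown()


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records whether it was shut down."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        return [{"text": f" <reply to {p!r}>"} for p in prompts]

    def shutdown(self):
        self.closed = True


if __name__ == "__main__":
    # Real usage (assumption):
    # engine_session(lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct"))
    with engine_session(StubEngine) as llm:
        print(llm.generate(["Hello, my name is"], {"temperature": 0.8})[0]["text"])
```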