# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
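
The linked script is the reference implementation. As a rough illustration of the idea only, the request-handling core of such a server could look like the sketch below. `StubEngine` and `handle_generate` are hypothetical names introduced here, and the stub stands in for `sgl.Engine` so the snippet runs without a GPU. It assumes that a single-prompt `generate` call returns one dict with a `"text"` field.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompt, sampling_params=None):
        # Assumed call shape: one prompt in, one dict with a "text" field out.
        return {"text": f" <completion for {prompt!r}>"}


def handle_generate(engine, body: bytes) -> bytes:
    """Turn a JSON request body into an engine call and a JSON response."""
    request = json.loads(body)
    output = engine.generate(request["prompt"], request.get("sampling_params", {}))
    return json.dumps({"text": output["text"]}).encode("utf-8")


engine = StubEngine()
response = handle_generate(engine, b'{"prompt": "The capital of France is"}')
print(response.decode("utf-8"))
```

Any web framework can call a function like `handle_generate` from its route handler; keeping it free of framework code makes it easy to test on its own.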

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine

import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Sante Applegate and I am excited to be part of the Lee’s Summit R-7 School District team as a school counselor at Lee’s Summit High School. This will be my 14th year in education, and I am passionate about helping students succeed academically, personally, and professionally.
Prior to joining the district, I worked as a counselor in several schools in the Kansas City area. I have a Master’s degree in School Counseling from the University of Kansas and a Bachelor’s degree in Psychology from Benedictine College.
I am committed to providing a supportive and inclusive environment for all students. As a school counselor, my goal
Prompt: The president of the United States is
Generated text:  the head of state and head of government of the United States, and is the most powerful person in the world. The president is elected by the Electoral College, and serves a four-year term. The president has many responsibilities, including setting national polic
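
For offline batch jobs, the results usually need to be saved rather than printed. The sketch below pairs each prompt with its generated text and serializes the pairs as JSON Lines. `to_jsonl`, `demo_prompts`, and `demo_outputs` are hypothetical names, and the outputs are hand-written placeholders in the same shape (`{"text": ...}`) that the cell above reads from `llm.generate`.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text, one JSON object per line."""
    lines = []
    for prompt, output in zip(prompts, outputs):
        lines.append(json.dumps({"prompt": prompt, "completion": output["text"]}))
    return "\n".join(lines)


# Placeholder data, not real model output.
demo_prompts = ["Hello, my name is", "The capital of France is"]
demo_outputs = [{"text": " Ada."}, {"text": " Paris."}]

print(to_jsonl(demo_prompts, demo_outputs))
```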

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()


=== Testing synchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Rev. Linda Campbell and I am a licensed and ordained United Methodist Minister. I have been serving in a variety of settings, including congregations, camps, and non-profit organizations, for over 20 years. My passion is to help people discover and deepen their faith, to find meaning and purpose in their lives, and to live more authentically as children of God. I have a Master's degree in Divinity from Drew University, and I am a certified spiritual director and coach. I am also a trained mediator and conflict resolver, and I enjoy facilitating small groups and workshops on a variety of topics related to faith, spirituality, and

Prompt: The capital of France is
Generated text:  one of the most romantic and beautiful cities in the world. From the Eiffel Tower to the Louvre Museum, there is so much to see and experience in Paris. Here are a few reasons why you should visit this amazing city:
The Eiffel Tower, one of the most iconic landmarks in the world, is a must-visit attraction in Paris. You can take the stairs or elevator to the top for breathtaking views of the city. The tower was built for the World's Fair in 1889 and was initially intended to be a temporary structure. However, it has become an iconic symbol of Paris and one of the most

Prompt: The future of AI is
Generated text:  not about the machines themselves, but about how they can improve and enhance our lives. In his book, “Life 3.0: Being Human in the Age of Artificial Intelligence,” Max Tegmark, a professor of physics at MIT, argues that AI has the potential to solve some of the world’s most pressing problems, such as disease, poverty, and climate change.
Tegmark suggests that AI can be a powerful tool for good, but it also raises important questions about the future of work, the nature of consciousness, and the potential risks of creating intelligent machines. He argues that we need to have a more nuanced and informed
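
When the complete text is needed after streaming finishes, the chunks can be accumulated while they are displayed. This is a minimal sketch that assumes each chunk carries only the newly generated text, as the output above suggests. `fake_stream` is a hypothetical stand-in for `llm.generate(..., stream=True)`, and `collect_stream` is an illustrative helper.

```python
def fake_stream(text):
    """Hypothetical stand-in for llm.generate(..., stream=True)."""
    for piece in text.split(" "):
        yield {"text": " " + piece}


def collect_stream(chunks):
    """Join incremental chunks into the complete generated string."""
    parts = []
    for chunk in chunks:
        parts.append(chunk["text"])
    return "".join(parts)


full_text = collect_stream(fake_stream("Paris is the capital"))
print(repr(full_text))
```

If the installed version instead returns the cumulative text in every chunk, keep only the last chunk rather than joining them.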




### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Hello, my name is
Generated text:  Captain Jack Sparrow, and I have been transported back in time to the year 1777. The American colonies are still under British rule, and the seeds of revolution are beginning to be sown. I am in the midst of a heist, stealing a valuable shipment of rum, when I am suddenly confronted by a group of Patriot militia. They are armed to the teeth and look like they mean business.
As the leader of the group steps forward, I can see the fire of rebellion burning in his eyes. He introduces himself as Benjamin Franklin, and I can sense the intellectual and cunning that lies beneath his humble demeanor. I quickly

Prompt: The capital of France is
Generated text:  known for its rich history, stunning architecture, and famous landmarks. It is the epicenter of French culture, cuisine, and fashion. However, the city is also known for its less-than-favorable reputation when it comes to street crime and petty theft. Visitors to Paris are often warned to be ca
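
The asynchronous API is most useful when requests arrive independently and should overlap. The sketch below issues one request per prompt and waits for all of them with `asyncio.gather`, which returns results in the order the awaitables were passed. `StubAsyncEngine` and `generate_all` are hypothetical names; the stub only imitates the `async_generate` call shape used above and assumes a single prompt yields a single dict.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in exposing the async_generate call shape used above."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0)  # yield control, as a real request would while waiting
        return {"text": f" <completion for {prompt!r}>"}


async def generate_all(engine, prompts, sampling_params):
    # Create one coroutine per prompt and wait for all of them together.
    requests = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*requests)


demo_prompts = ["Hello, my name is", "The capital of France is"]
results = asyncio.run(
    generate_all(StubAsyncEngine(), demo_prompts, {"temperature": 0.8})
)
for prompt, result in zip(demo_prompts, results):
    print(prompt, "->", result["text"])
```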

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())


=== Testing asynchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Victoria. I am a senior at Oklahoma Christian University, majoring in early childhood education. This is my first year as a student teacher and I am so excited to be working with you and your child this year! I am looking forward to getting to know you and your child and to watching them grow and learn over the next few months.
As a student teacher, I will be working closely with Mrs. [Teacher's Name] to plan and implement lessons that will meet the needs of your child. I will also be assisting with classroom management, grading, and other tasks that will help make our classroom run smoothly.
If you have any questions

Prompt: The capital of France is
Generated text:  getting a major makeover
Paris is getting a massive renovation that will transform the city's infrastructure, public spaces, and architecture. The city's mayor, Anne Hidalgo, has announced a €20 billion investment plan to revamp the city, which will include the construction of new metro lines, the renovation of historic buildings, and the creation of new public spaces. The plan is expected to create thousands of jobs and boost the city's economy.
One of the most significant projects is the creation of a new metro line, which will connect the city center to the suburbs. The line will be built using innovative and sustainable technologies, and will feature

Prompt: The future of AI is
Generated text:  being shaped by innovations in various fields, including computer vision, natural language processing, and machine learning. One of the most exciting areas of research is in the development of multimodal AI, which enables machines to process and understand multiple sources of data, such as text, images, and audio, simultaneously. In this article, we'll explore the advancements in multimodal AI and their potential applications.
What is Multimodal AI?
Multimodal AI refers to the ability of machines to process and integrate multiple modalities of data, such as text, images, audio, and video. This involves developing AI models that can recognize and understand patterns across
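
The cell above streams one prompt at a time. If several streams should progress concurrently, printing chunks directly would interleave their text, so one option is to buffer each stream and emit it once complete. The sketch below shows that pattern with a hypothetical `fake_async_stream` in place of the async generator obtained with `stream=True`; `collect` and `run_all` are illustrative helpers, not part of SGLang.

```python
import asyncio


async def fake_async_stream(prompt):
    """Hypothetical stand-in for the async generator returned with stream=True."""
    for piece in (" one", " two", " three"):
        await asyncio.sleep(0)  # let other streams make progress
        yield {"text": piece}


async def collect(prompt):
    # Buffer this prompt's chunks so concurrent streams do not interleave.
    parts = []
    async for chunk in fake_async_stream(prompt):
        parts.append(chunk["text"])
    return prompt, "".join(parts)


async def run_all(prompts):
    return await asyncio.gather(*(collect(p) for p in prompts))


streamed = asyncio.run(run_all(["A", "B"]))
print(streamed)
```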




In [6]:
llm.shutdown()
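
In a script, an exception raised during inference would skip the final `shutdown()` call. One way to guard against that is a small context manager that always shuts the engine down on exit. `managed_engine` and `StubEngine` are hypothetical names used for illustration; the stub only records whether `shutdown()` was called.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; only tracks whether shutdown ran."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(factory):
    """Create an engine and guarantee shutdown() runs, even on error."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


try:
    with managed_engine(StubEngine) as engine:
        raise RuntimeError("simulated failure during inference")
except RuntimeError:
    pass

print(engine.is_shut_down)
```

With the real engine, the factory could be a lambda that wraps the `sgl.Engine(model_path=...)` call shown at the top of this document.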