# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
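
As a rough illustration of the custom-server idea, the sketch below wraps an engine-like object in a transport-agnostic request handler. `StubEngine`, `handle_generate`, and the JSON request shape are hypothetical names chosen for this example and are not part of SGLang. The stub only mimics the `async_generate(prompts, sampling_params)` call shown later in this document, so the code runs without a model or GPU.

```python
import asyncio
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a model."""

    async def async_generate(self, prompts, sampling_params):
        # Mimics the batch call: one {"text": ...} dict per prompt.
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def handle_generate(engine, body: bytes) -> bytes:
    """Decode a JSON request, run one generation, and encode a JSON response."""
    request = json.loads(body)
    prompt = request["prompt"]
    sampling_params = request.get("sampling_params", {})
    # Pass a one-element batch and take the single result back out.
    outputs = await engine.async_generate([prompt], sampling_params)
    response = {"prompt": prompt, "text": outputs[0]["text"]}
    return json.dumps(response).encode("utf-8")


if __name__ == "__main__":
    body = json.dumps({"prompt": "The capital of France is"}).encode("utf-8")
    print(asyncio.run(handle_generate(StubEngine(), body)).decode("utf-8"))
```

In an actual server you would presumably pass the real engine instance instead of the stub and call the handler from whichever web framework you use; the linked script remains the authoritative example.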

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine

import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Travis and I am a senior at the University of Southern California. I am currently studying business with a focus on finance. I am excited to start this internship program with American Express and I am looking forward to learning more about the financial industry.
As a student, I have been involved in several extracurricular activities including being a member of the Trojan Investment Club, where I work alongside other students to manage a portfolio of stocks and bonds. I have also been involved in a few entrepreneurial ventures, including starting a small e-commerce business with a friend.
I am excited to learn more about the financial industry and gain experience in a professional setting.
Prompt: The president of the United States is
Generated text:  about to take a very big step into the world of climate change, one that has the potential to make a significant dent in America’s contributions to global warming.
President Joe Biden is set to
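
For offline batch jobs it is often convenient to persist each prompt alongside its completion. The sketch below pairs prompts with outputs the same way as the loop above and serializes them as JSON Lines. `to_jsonl` is a hypothetical helper, and the sample lists are hand-written placeholders shaped like the `{'text': ...}` dictionaries accessed above; real output dictionaries may carry additional fields, which this helper ignores.

```python
import json


def to_jsonl(prompts, outputs):
    """Return one JSON object per line, pairing each prompt with its text."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = [
        json.dumps({"prompt": prompt, "text": output["text"]}, ensure_ascii=False)
        for prompt, output in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Placeholder data standing in for llm.generate(prompts, sampling_params).
sample_prompts = ["Hello, my name is", "The capital of France is"]
sample_outputs = [{"text": " Travis."}, {"text": " Paris."}]

print(to_jsonl(sample_prompts, sample_outputs))
```

The length check guards against silently dropping results, since `zip` stops at the shorter of its inputs.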

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()


=== Testing synchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Tom, and I am a 17-year-old high school student. I am excited to share my journey with you, as I have just started my senior year and have big plans ahead.
I am a bit of a perfectionist, which can sometimes be a blessing and a curse. I like to stay organized and have a plan in place, but at the same time, it can make me a bit too hard on myself when things don’t go according to plan.
Throughout my high school years, I have been fortunate enough to have had many opportunities to explore my interests and passions. I have been involved in the school debate team, which

Prompt: The capital of France is
Generated text:  a city of enchantment, a place where art, culture, fashion, and history come together in a swirl of joie de vivre. Whether you're wandering the charming streets of Montmartre, strolling along the Seine, or shopping on the Champs-Élysées, Paris is a city that will leave you breathless and yearning for more.
The City of Light is famous for its iconic landmarks, including the Eiffel Tower, Notre-Dame Cathedral, and the Arc de Triomphe. But there's so much more to Paris than its grand monuments. The city is a treasure trove of museums

Prompt: The future of AI is
Generated text:  bright, but we must navigate its challenges

Artificial intelligence has the potential to revolutionize many aspects of our lives, from healthcare and education to transportation and finance. However, as AI becomes increasingly sophisticated, it also raises important questions about its impact on society and our collective future.
One of the biggest challenges facing the development of AI is ensuring that it is developed and used in a way that benefits all people, not just those with the resources to invest in it. This includes addressing issues such as bias, privacy, and job displacement.
To ensure that AI is developed and used responsibly, we need to prioritize ethics and transparency in its development.

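If you need the complete text once streaming finishes, you can accumulate the chunks as they arrive. The sketch below assumes each chunk's `text` field holds only the newly generated piece, as in the output above; if your engine version instead returns the accumulated text in every chunk, you would keep just the last chunk. `collect_stream` and `fake_stream` are hypothetical names, and `fake_stream` stands in for `llm.generate(prompt, sampling_params, stream=True)` so the example runs without a model.

```python
def collect_stream(chunks, on_piece=None):
    """Join incremental chunks into one string, optionally echoing each piece."""
    pieces = []
    for chunk in chunks:
        piece = chunk["text"]
        if on_piece is not None:
            on_piece(piece)  # e.g. print the piece as soon as it arrives
        pieces.append(piece)
    return "".join(pieces)


def fake_stream():
    """Hypothetical stand-in for llm.generate(..., stream=True)."""
    for piece in [" Tom", ",", " and", " I", " am", " a", " student", "."]:
        yield {"text": piece}


full_text = collect_stream(
    fake_stream(), on_piece=lambda piece: print(piece, end="", flush=True)
)
print()
print(repr(full_text))
```

Collecting pieces in a list and joining once at the end avoids repeated string concatenation on long generations.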

### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Hello, my name is
Generated text:  Chris, I am a second-year MSc student at the University of Warwick studying a course called Business and Management. I have lived in Warwick since 2017 and I love the campus and the university environment. I am an international student from Malaysia and I am excited to share my experiences and perspectives on studying in the UK.
I have always been fascinated by the world of business and management, and I chose this course to broaden my knowledge and skills in this field. As a Malaysian student, I was attracted to the UK's high standard of education and the diversity of students at the university. I am enjoying my time at the University of Warwick

Prompt: The capital of France is
Generated text:  Paris. The city of Paris is situated on the Seine River, in the northern part of the country. The population of Paris is approximately 2.2 million people. The city is one of the most visited cities in the world, with millions of tourists visiting eac

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())


=== Testing asynchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Emma. I am a 20-year-old psychology student with a passion for playing the piano. I enjoy writing poetry and short stories in my free time. I like to read books on psychology, philosophy, and history. I am interested in traveling and experiencing different cultures.
The psychological perspectives on why we do the things we do are complex and multifaceted. The interaction of internal and external factors is intricate, making it challenging to determine the exact reasons behind human behavior. However, the following perspectives can provide some insight into why people engage in certain behaviors.
The Biological Perspective suggests that behavior is determined by genetics and neurotransmitters in the brain.

Prompt: The capital of France is
Generated text:  a city of unparalleled beauty and elegance. The City of Light has been a cultural and artistic hub for centuries, attracting visitors from around the world with its stunning architecture, world-class museums, and charming atmosphere.
Must-see attractions in Paris include the iconic Eiffel Tower, the majestic Notre Dame Cathedral, and the Louvre Museum, which houses some of the world's most famous paintings, including the Mona Lisa. Visitors can also explore the charming streets of Montmartre, the Latin Quarter, and the Champs-Élysées, where they can find cafes, shops, and street performers.
Paris is also known for its fashion,

Prompt: The future of AI is
Generated text:  looking brighter by the day, with breakthroughs in natural language processing, computer vision, and reinforcement learning propelling the field forward. One of the most exciting areas of research is multimodal learning, which involves the integration of different AI models and data types to create more comprehensive and effective systems.
Multimodal learning has many potential applications, including:
1. Image and text analysis: AI systems can learn to recognize patterns in images and text, and combine this information to improve tasks such as object detection, sentiment analysis, and image captioning.
2. Human-computer interaction: Multimodal learning can enable AI systems to understand and respond to

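The loop above consumes each prompt's stream to completion before starting the next. When the text is collected per prompt rather than printed token by token, the streams can be consumed concurrently with `asyncio.gather`. This is a sketch under stated assumptions: `fake_async_stream` is a hypothetical stand-in for the generator returned by `await llm.async_generate(prompt, sampling_params, stream=True)`, and whether concurrent streams help in practice depends on how the engine schedules requests.

```python
import asyncio


async def fake_async_stream(prompt):
    """Hypothetical stand-in for an engine's async streaming generator."""
    for piece in [" one", " two", " three"]:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield {"text": piece}


async def collect(prompt):
    """Consume one stream and return the prompt with its joined text."""
    pieces = []
    async for chunk in fake_async_stream(prompt):
        pieces.append(chunk["text"])
    return prompt, "".join(pieces)


async def run_all(prompts):
    # gather returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*(collect(p) for p in prompts))


if __name__ == "__main__":
    results = asyncio.run(run_all(["Hello, my name is", "The future of AI is"]))
    for prompt, text in results:
        print(f"Prompt: {prompt}\nGenerated text: {text}")
```

Because `gather` preserves argument order, each result lines up with its prompt even if the streams finish at different times.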

In [6]:
llm.shutdown()