# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an HTTP server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine

import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.32it/s]

### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  David and I am a 30-year-old Software Engineer from Buenos Aires, Argentina. I am here to share my experience with the community. I am passionate about Artificial Intelligence, Machine Learning, and Cloud Computing, and I enjoy sharing my knowledge and learning from others.
My professional experience started about 8 years ago as a junior developer, working on web and mobile applications using various technologies like Java, PHP, and JavaScript. Over time, I became interested in AI and ML, and I started to learn and work with libraries like TensorFlow and PyTorch. I also got involved with cloud computing, and I gained experience with AWS and Google Cloud
Prompt: The president of the United States is
Generated text:  an elected official, as well as the head of state and head of government for the country. The president is responsible for executing the laws of the land and overseeing the federal government. The president is elected through the El
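
The engine schedules a submitted batch on its own, so splitting the prompt list is optional. For long lists it can still be handy to submit fixed-size chunks, for example to report progress or to process results incrementally. The following is a minimal sketch under stated assumptions, not part of SGLang: `generate_in_chunks` and `fake_generate` are hypothetical names, and `fake_generate` only imitates the return shape used above (a list of dicts with a `"text"` key) so that the snippet runs without a model.

```python
from typing import Callable, Dict, Iterator, List, Tuple


def generate_in_chunks(
    generate_fn: Callable[[List[str], Dict], List[Dict]],
    prompts: List[str],
    sampling_params: Dict,
    chunk_size: int = 2,
) -> Iterator[Tuple[str, str]]:
    """Yield (prompt, generated_text) pairs, submitting chunk_size prompts at a time."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start : start + chunk_size]
        outputs = generate_fn(chunk, sampling_params)
        # Outputs are assumed to come back in the same order as the prompts.
        for prompt, output in zip(chunk, outputs):
            yield prompt, output["text"]


def fake_generate(prompts: List[str], sampling_params: Dict) -> List[Dict]:
    # Hypothetical stand-in for a batch generate call; it ignores sampling_params.
    return [{"text": f" <completion for {p!r}>"} for p in prompts]


pairs = list(generate_in_chunks(fake_generate, ["a", "b", "c"], {"temperature": 0.8}))
for prompt, text in pairs:
    print(f"Prompt: {prompt}\nGenerated text: {text}")
```

With a live engine, `llm.generate` would take the place of `fake_generate`, assuming it keeps the call shape shown in the cell above.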

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()


=== Testing synchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Dr. Siamak PourgholamRoozbahani, and I am an Assistant Professor in the Department of Computer Science and Engineering at the University of Nebraska-Lincoln. I received my Ph.D. in Electrical Engineering from the University of California, Los Angeles (UCLA) in 2014. My research interests are in the areas of computer networks, network security, and machine learning. I have published numerous papers in top-tier conferences and journals, including IEEE Infocom, ACM Mobicom, and IEEE Transactions on Information Theory.
My current research focuses on designing and analyzing machine learning-based solutions for various network security and

Prompt: The capital of France is
Generated text:  a city that is famous for its stunning beauty, rich history, and world-class attractions. From the iconic Eiffel Tower to the Louvre Museum, there are countless things to see and do in Paris. Here are some of the top attractions and experiences to add to your Paris itinerary:
1. The Eiffel Tower: This iron lattice tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city from its observation decks.
2. The Louvre Museum: One of the world's largest and most famous museums, the Louvre is home to an impressive collection of art and artifacts, including the Mona Lisa

Prompt: The future of AI is
Generated text:  shrouded in uncertainty

Artificial intelligence (AI) has revolutionized the way we live and work, transforming industries and altering the economic landscape. However, the future of AI is uncertain and shrouded in controversy.
There are several reasons why the future of AI is uncertain:
1. Lack of regulation: There is a lack of regulation and oversight in the development and deployment of AI systems, which raises concerns about their safety and accountability.
2. Job displacement: AI has the potential to automate many jobs, which could lead to widespread unemployment and social unrest.
3. Bias and discrimination: AI systems can perpetuate and amplify biases and

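In the streamed output above, each chunk appears to carry only the newly generated piece of text. Whether every version of the engine behaves this way is an assumption worth checking: some streaming APIs yield the cumulative text so far instead. The sketch below shows one way to rebuild the full text in either case. `collect_stream` is a hypothetical helper, and the two lists are hand-written stand-ins for what a stream might yield.

```python
from typing import Dict, Iterable


def collect_stream(chunks: Iterable[Dict], cumulative: bool = False) -> str:
    """Join streamed chunks into the full generated text.

    cumulative=False: each chunk["text"] holds only the new piece.
    cumulative=True: each chunk["text"] holds the whole text so far.
    """
    pieces = []
    seen = 0  # characters already collected; only used in cumulative mode
    for chunk in chunks:
        text = chunk["text"]
        if cumulative:
            pieces.append(text[seen:])  # keep only the unseen suffix
            seen = len(text)
        else:
            pieces.append(text)
    return "".join(pieces)


# Hypothetical stand-ins for the two possible chunk styles.
incremental = [{"text": " Paris"}, {"text": " is"}, {"text": " lovely"}]
running = [{"text": " Paris"}, {"text": " Paris is"}, {"text": " Paris is lovely"}]

full_a = collect_stream(incremental)
full_b = collect_stream(running, cumulative=True)
print(repr(full_a))
print(repr(full_b))
```

Both calls rebuild the same string, so downstream code does not need to know which chunk style the stream used.
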
### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Hello, my name is
Generated text:  Kyle and I am the lead author of this blog. I am a co-founder of the company Agiloft, and my main area of expertise is in Contract Lifecycle Management (CLM) and Enterprise Legal Management (ELM). I have over 20 years of experience in the software industry, with a strong background in product development, operations, and customer success.
Over the years, I have worked with numerous companies and organizations to develop and implement contract management solutions that meet their specific needs and goals. My experience has shown me the importance of having a robust contract management system in place to streamline processes, reduce costs, and increase efficiency.
In

Prompt: The capital of France is
Generated text:  Paris. Paris is a major city located in the north-central part of the country, where the Seine River flows into the English Channel. The city is famous for its fashion, art, cuisine, and historical landmarks, such as the Eiffel Tow
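
The cell above awaits a single batch. When requests arrive independently, one option is to issue them as separate coroutines and combine them with `asyncio.gather`, using a semaphore to cap how many are awaited at once. This is a sketch under assumptions rather than a recommended SGLang pattern: `fake_async_generate` is a hypothetical stand-in that imitates a single-prompt async call returning a dict with a `"text"` key, and `generate_concurrently` and `max_in_flight` are illustrative names.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def fake_async_generate(prompt: str, sampling_params: Dict) -> Dict:
    # Hypothetical stand-in for an async single-prompt generate call.
    await asyncio.sleep(0.01)
    return {"text": f" <completion for {prompt!r}>"}


async def generate_concurrently(
    generate_fn: Callable[[str, Dict], Awaitable[Dict]],
    prompts: List[str],
    sampling_params: Dict,
    max_in_flight: int = 2,
) -> List[str]:
    """Run one request per prompt with at most max_in_flight active at once."""
    limiter = asyncio.Semaphore(max_in_flight)

    async def one(prompt: str) -> str:
        async with limiter:
            output = await generate_fn(prompt, sampling_params)
            return output["text"]

    # gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(one(p) for p in prompts))


texts = asyncio.run(
    generate_concurrently(fake_async_generate, ["x", "y", "z"], {"temperature": 0.8})
)
print(texts)
```

Because `gather` preserves input order, `texts[i]` corresponds to `prompts[i]` even when requests finish out of order.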

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())


=== Testing asynchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  John. I am a survivor of the hurricane that hit your city. I lost my home and all my belongings in the storm. My family and I are now living in a shelter. We are in need of assistance. We need food, clothing, and other essential items to get us back on our feet. We are grateful for any help we can get.
I can be reached at the shelter address below. Please contact me to arrange for assistance.
I am just a regular guy trying to make a living in the city. I don't have any connections or resources to fall back on. I am struggling to make ends meet as it is

Prompt: The capital of France is
Generated text:  a city of beauty, romance, and intellectualism. From the iconic Eiffel Tower to the majestic Notre Dame Cathedral, Paris is a city that has inspired artists, writers, and musicians for centuries. Visitors can stroll along the Seine River, visit the famous Louvre Museum, and indulge in the city's culinary delights, including croissants, cheese, and wine. Whether you're interested in history, art, fashion, or food, Paris is a city that has something for everyone.
Paris is a popular tourist destination, and many people visit the city every year. However, it can be overwhelming to navigate the city, especially

Prompt: The future of AI is
Generated text:  now – and it’s all about virtual assistants

In 2011, the world of artificial intelligence (AI) was on the cusp of a revolution. Virtual assistants were about to take center stage, and they would change the game forever. From Siri on Apple devices to Google Assistant and Alexa, the AI-powered virtual assistant has become an integral part of our daily lives.
So, what exactly is a virtual assistant?
A virtual assistant, also known as a conversational AI, is a computer program that uses natural language processing (NLP) and machine learning to interact with humans in a conversational manner. These AI-powered assistants can perform


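The cell above prints chunks as they arrive and handles prompts one after another. To also keep the full text, or to consume several streams at the same time, the same `await` then `async for` shape can be wrapped in a coroutine. The sketch below assumes that shape and nothing more: `fake_async_stream` is a hypothetical stand-in that, when awaited, returns an async generator of `{"text": ...}` chunks, and whether a real engine benefits from concurrent streams is not verified here.

```python
import asyncio
from typing import AsyncIterator, Dict, List


async def fake_async_stream(prompt: str) -> AsyncIterator[Dict]:
    # Hypothetical stand-in: awaiting it returns an async generator of chunks.
    async def chunks() -> AsyncIterator[Dict]:
        for piece in [" one", " two", " three"]:
            await asyncio.sleep(0)  # give other tasks a chance to run
            yield {"text": piece}

    return chunks()


async def stream_to_text(prompt: str) -> str:
    """Consume one stream and return the concatenated text."""
    generator = await fake_async_stream(prompt)
    pieces: List[str] = []
    async for chunk in generator:
        pieces.append(chunk["text"])
    return "".join(pieces)


async def run_all(prompts: List[str]) -> List[str]:
    # One consumer coroutine per prompt; results keep the prompt order.
    return await asyncio.gather(*(stream_to_text(p) for p in prompts))


results = asyncio.run(run_all(["a", "b"]))
print(results)
```
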
In [6]:
llm.shutdown()
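
In a script, as opposed to a notebook, an exception raised during generation could skip the `shutdown()` call. One way to guard against that is a small context manager. This is a sketch: `engine_session` and `FakeEngine` are hypothetical names, and `FakeEngine` exposes only the `shutdown()` method used above so the example runs without a model.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in exposing only a shutdown() method."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


@contextmanager
def engine_session(engine):
    """Yield the engine and call engine.shutdown() on exit, even after an error."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with engine_session(engine) as llm_like:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(engine.closed)
```

The `finally` clause runs whether the `with` body completes or raises, so the engine is released in both cases.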