# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine

import sglang as sgl
from sglang.utils import print_highlight
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

INFO 11-15 19:11:59 weight_utils.py:243] Using model weights format ['*.safetensors']


Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.46it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print_highlight("===============================")
    print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")
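For offline batch jobs it is often convenient to persist the results rather than print them. The following is a minimal sketch, assuming `outputs` has the shape used above (a list of dicts with a `"text"` key, in the same order as `prompts`). The helper name `to_jsonl` and the sample data are hypothetical; the sample data only stands in for a real `llm.generate` call.

```python
import json


def to_jsonl(prompts, outputs):
    """Serialize prompt/output pairs as JSON Lines, one record per prompt."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Hypothetical stand-in for `llm.generate(prompts, sampling_params)`.
sample_prompts = ["The capital of France is"]
sample_outputs = [{"text": " Paris."}]
jsonl = to_jsonl(sample_prompts, sample_outputs)
```

With a real engine, the same helper could be called as `to_jsonl(prompts, outputs)` right after the `llm.generate` call and the returned string written to a file.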

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print_highlight(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()

Generated text: Rosa and I am an artist from Italy. I have been teaching English as a second language for a long time. I love to learn, teach, and explore new cultures.

I am a highly qualified and experienced English teacher with a degree in English Language and Literature. I have a TEFL (Teaching English as a Foreign Language) certification and a Master’s degree in Education. I have taught students of all levels, from beginners to advanced learners, and have helped them achieve their language goals.

I enjoy teaching English because it allows me to share my passion for language and culture with my students. I believe that learning a language is not just

Generated text: famous for its rich history, art, fashion, cuisine, and culture. Paris is the most visited city in the world, attracting millions of tourists every year. The city is home to many iconic landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. In addition to its historical and cultural attractions, Paris is also known for its fashion, beauty, and luxury industries. The city is a hub for fashion designers, models, and photographers, and is home to many famous fashion houses and brands.

The city is also famous for its cuisine, with a wide range of traditional French dishes such as escargots

Generated text: about more than just machines – it's about people, too

As artificial intelligence (AI) becomes increasingly prominent in our daily lives, we must not forget that it is not just a machine, but a tool created by and for humans. While AI systems can process vast amounts of data, make predictions, and even learn from their experiences, they are ultimately limited by their programming and the data they are trained on.

The future of AI is not just about the technology itself, but about how it will impact human societies and individuals. As AI becomes more pervasive, we must consider the social, economic, and cultural implications of its adoption.

Here




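When the full text is needed after streaming finishes, the chunks can be merged as they arrive. The sketch below assumes each chunk is a dict with a `"text"` key, as in the loop above. The captured output suggests each chunk carries only newly generated text, but whether chunks are incremental or cumulative may depend on the SGLang version, so the hypothetical helper `collect_stream` takes a flag for both cases. `fake_stream` is a stand-in for `llm.generate(prompt, sampling_params, stream=True)`.

```python
def collect_stream(chunks, cumulative=False):
    """Merge streamed chunks into the final generated text.

    cumulative=False: each chunk holds only newly generated text.
    cumulative=True:  each chunk holds all text generated so far.
    """
    text = ""
    for chunk in chunks:
        if cumulative:
            text = chunk["text"]   # the latest chunk already contains everything
        else:
            text += chunk["text"]  # append the new piece
    return text


def fake_stream():
    # Hypothetical stand-in for a streaming generate call.
    for piece in [" Paris", ".", " It", " is"]:
        yield {"text": piece}


merged = collect_stream(fake_stream())
```

Printing a few chunks from a real run is a quick way to find out which mode applies before relying on the merged result.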
### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print_highlight(f"\nPrompt: {prompt}")
        print_highlight(f"Generated text: {output['text']}")


asyncio.run(main())
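An alternative to passing the whole list in one call is to issue one request per prompt and await them together. The sketch below shows that pattern with `asyncio.gather`. It assumes that `async_generate` accepts a single prompt and returns one dict with a `"text"` key, and that the engine tolerates concurrent calls; neither is confirmed here. `FakeEngine` and `generate_all` are hypothetical names, and `FakeEngine` only echoes the prompt so the example runs without a model.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine; echoes the prompt instead of running a model."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0)  # yield control, as a real engine would while generating
        return {"text": f" <completion of {prompt!r}>"}


async def generate_all(engine, prompts, sampling_params):
    """Issue one request per prompt and wait for all; results keep the prompt order."""
    requests = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*requests)


results = asyncio.run(generate_all(FakeEngine(), ["a", "b"], {"temperature": 0.8}))
```

`asyncio.gather` returns results in the order the awaitables were passed, so `zip(prompts, results)` pairs them correctly even if individual requests finish out of order.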

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print_highlight(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())

Generated text: Natasha De Sanctis and I am a 19 year old student from Malta. I am currently studying at the University of Malta and I will be graduating this year with a degree in Environmental Science. I am passionate about sustainability and want to make a positive impact on the environment. I love spending time outdoors and exploring nature. In my free time, I enjoy hiking, reading and cooking. I am excited to share my experiences and knowledge with the world and I am looking forward to the opportunity to contribute to this amazing community. Hello, my name is Natasha De Sanctis and I am a 19 year old student from Malta. I am currently

Generated text: Paris. It is the most populated city in the country and is the center of politics, economy, culture, and entertainment. Paris is known for its iconic landmarks like the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. It is also famous for its romantic atmosphere, fashion, cuisine, and art.

The official language of France is French, but many Parisians also speak English, especially in tourist areas. French is a Romance language that originated from Latin and is closely related to other languages like Spanish, Italian, and Portuguese.

Here are some basic French phrases that you may find useful during your visit to Paris:

Hello

Generated text: being shaped by the convergence of various technologies, including machine learning, natural language processing, computer vision, and robotics. The increasing availability of data, computational power, and storage capacity are driving the development of more sophisticated AI systems. These systems are expected to have a significant impact on various industries, including healthcare, finance, transportation, and education.

However, the development and deployment of AI systems also raise important ethical and societal concerns, including issues related to bias, transparency, and accountability. For example, AI systems can perpetuate existing biases if they are trained on biased data, and they can also be used to manipulate public opinion through disinformation.




In [6]:
llm.shutdown()
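In a script, as opposed to a notebook, an exception between engine creation and `llm.shutdown()` would skip the shutdown call. One way to guard against that is a small context manager. This is a sketch: `running_engine` is a hypothetical helper, and `FakeEngine` is a stand-in that only records whether `shutdown()` ran, so the example works without loading a model.

```python
from contextlib import contextmanager


@contextmanager
def running_engine(factory):
    """Create an engine via `factory()` and always call its shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()  # runs even if the body raised


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine that records whether shutdown() ran."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


with running_engine(FakeEngine) as engine:
    pass  # generation calls would go here
```

With the real engine, the factory could be `lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")`, and the generation calls would go inside the `with` block.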