# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a separate server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
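
As a rough illustration of the custom-server idea, the sketch below wraps an engine-like object in a framework-agnostic request handler. It is not the code from the linked `custom_server.py`: `handle_generate`, the JSON field names, and the `EchoEngine` stand-in are all hypothetical. The only assumption about the real engine is that `generate(prompts, sampling_params)` returns one dict with a `"text"` key per prompt.

```python
import json


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompts, sampling_params=None):
        # Mimics the list-of-dicts shape returned for a list of prompts.
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


def handle_generate(engine, body: bytes):
    """Turn a JSON request body into (status_code, response_dict)."""
    try:
        payload = json.loads(body)
    except ValueError:  # covers invalid JSON and invalid UTF-8
        return 400, {"error": "body must be valid JSON"}

    prompt = payload.get("prompt") if isinstance(payload, dict) else None
    if not isinstance(prompt, str) or not prompt:
        return 400, {"error": "'prompt' must be a non-empty string"}

    sampling_params = payload.get("sampling_params") or {}
    outputs = engine.generate([prompt], sampling_params)
    return 200, {"prompt": prompt, "text": outputs[0]["text"]}


status, response = handle_generate(EchoEngine(), b'{"prompt": "Hello"}')
print(status, response["text"])
```

Keeping the handler independent of any web framework means the same function could be called from whichever HTTP layer you choose, with the engine created once at startup.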

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine

import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Alvin, and I'm a programmer with a passion for building innovative web applications. I'm currently working on a project that involves creating a web scraper to extract data from a website. The website in question has a complex structure, with multiple nested tables and a lot of JavaScript-generated content.

To tackle this challenge, I've decided to use Selenium WebDriver for browser automation, as it allows me to render the website in a real browser and execute JavaScript code. However, I'm running into some issues with the website's layout and JavaScript-generated content.

Here's a simplified example of the HTML structure:
```html
<div class="main-container">
```

Prompt: The president of the United States is
Generated text:  not the same as the president of the United States government. The United States government is a sovereign state that governs the country, while the president is the head of that government. A president is a political lead
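
For offline batch jobs it is often handy to persist the results. The sketch below assumes `generate` returns one dict per prompt, in input order, as the `zip` in the cell above relies on. `to_jsonl` is a hypothetical helper and the sample outputs are made up for illustration.

```python
import json


def to_jsonl(prompts, outputs):
    """Serialize prompt/completion pairs as one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("expected exactly one output per prompt")
    lines = [
        json.dumps({"prompt": p, "completion": o["text"]}, ensure_ascii=False)
        for p, o in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Illustrative data shaped like the result of llm.generate(prompts, sampling_params).
prompts = ["The capital of France is", "The future of AI is"]
outputs = [{"text": " Paris."}, {"text": " bright."}]
print(to_jsonl(prompts, outputs))
```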

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()


=== Testing synchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Karen, and I'm a struggling writer.
I've been a writer for many years, but lately, I've been feeling stuck and uncertain about my writing path. I'm a creative nonfiction writer, which means my work often involves a lot of research, self-reflection, and experimentation with form and style. While I love the freedom and flexibility of creative nonfiction, I've been feeling frustrated by my inability to break through to a wider audience.
I've written several books, essays, and articles that have been well-received by critics and readers, but they haven't translated into commercial success. I've tried pitching my work to

Prompt: The capital of France is
Generated text:  Paris, and it is the most populous city in the country. The city is known for its beautiful and historic landmarks, such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral.
France has a long history and has been an important center of culture, art, and politics for centuries. The country has produced many famous artists, writers, and thinkers, including Claude Monet, Vincent van Gogh, and Charles de Gaulle.
France is also known for its cuisine, which is famous for its use of butter, cheese, and wine. Some popular French dishes include escargots, ratatouille

Prompt: The future of AI is
Generated text:  here, and it’s happening right now. Artificial Intelligence (AI) is increasingly becoming a driving force in our personal and professional lives. From smart home devices to self-driving cars, AI is transforming the way we live, work, and interact with each other.
As AI continues to evolve, it’s essential to stay up-to-date with the latest developments and trends in the field. In this article, we’ll explore the future of AI, including its potential applications, benefits, and challenges.
The Future of AI: Emerging Trends

1. **Edge AI**: Edge AI involves processing data closer to where it’s generated, reducing latency and improving



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Hello, my name is
Generated text:  Steve. I am a 56-year-old father of two grown children, living in a small town in upstate New York. I have been married to my wonderful wife, Mary, for 34 years. I work as a printer for a small business in town. I enjoy reading, hiking, and gardening in my free time.
I also enjoy photography, and I have been taking pictures for over 40 years. I have even had a few of my photos published in local magazines and newspapers. I love capturing the beauty of the natural world, and I am always looking for new and interesting subjects to photograph.
I am a bit of

Prompt: The capital of France is
Generated text:  a city of grandeur, beauty and elegance. Paris has been a center of culture, fashion, and history for centuries and attracts millions of visitors every year. From the Eiffel Tower to the Louvre, the city is a treasure trove of iconic landmarks and artistic treasures. Here are some of the top things to do in Paris.
The Eiffel Tower is a must-v
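
An asynchronous API makes it possible to keep several requests in flight at once. The sketch below assumes only that `async_generate(prompts, sampling_params)` is awaitable and returns one dict per prompt, as in the cell above. `FakeAsyncEngine` and `run_batches` are hypothetical, and whether concurrent submission actually helps depends on the engine's scheduler.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in for sgl.Engine; sleeps instead of running a model."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    """Submit every batch concurrently and return results in batch order."""
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(run_batches(FakeAsyncEngine(), batches, {"temperature": 0.8}))
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```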

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())


=== Testing asynchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Holly! I am an NTR (Non-Triggering Romantic) Therapist, which means I specialize in working with individuals who are seeking to heal and resolve their romantic relationships. My approach is holistic and focuses on empowering you to tap into your own inner wisdom and develop a deeper understanding of yourself and your needs.
As a therapist, I am passionate about creating a safe, non-judgmental space for you to explore your thoughts, feelings, and experiences. I am trained in a variety of evidence-based approaches, including attachment theory, trauma-informed care, and mindfulness-based therapies.
My approach is centered around the idea that relationships are

Prompt: The capital of France is
Generated text:  Paris. Paris is known as the City of Light and is a popular tourist destination. It is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre Dame Cathedral.
A: The capital of France is Paris.
B: Paris is a popular tourist destination.
C: Paris is known as the City of Light.
D: Paris is home to many famous landmarks.
Answer: A, B, C, D

Reasoning Skill: Consequence Evaluation

This question requires the ability to evaluate the consequences of a statement being true or false. In this case, the statement is about the capital of France

Prompt: The future of AI is
Generated text:  now: What to expect in the coming years

Artificial intelligence (AI) has made tremendous progress in recent years, transforming industries and revolutionizing the way we live and work. As AI continues to evolve, we can expect significant advancements in the coming years. Here are some trends and predictions for the future of AI:
1. Increased adoption and integration: AI will become more ubiquitous, with increased adoption across various industries, including healthcare, finance, education, and transportation.
2. Improved natural language processing (NLP): AI-powered chatbots and virtual assistants will become more sophisticated, enabling more human-like conversations and better understanding of user intent.
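
The cell above streams the prompts one after another. As a sketch of a possible variation, the streams can also be consumed concurrently. The code assumes the same calling pattern as above (`await async_generate(..., stream=True)` yields an async generator of dicts with a `"text"` key) and that each chunk carries only new text. `FakeStreamingEngine`, `stream_one`, and `stream_all` are hypothetical names.

```python
import asyncio


class FakeStreamingEngine:
    """Hypothetical stand-in mimicking `await async_generate(..., stream=True)`."""

    async def async_generate(self, prompt, sampling_params=None, stream=False):
        async def chunks():
            for word in (" one", " two", " three"):
                await asyncio.sleep(0)  # yield control, as real I/O would
                yield {"text": word}

        return chunks()


async def stream_one(engine, prompt, sampling_params):
    """Await the generator, then join the streamed pieces for one prompt."""
    generator = await engine.async_generate(prompt, sampling_params, stream=True)
    pieces = [chunk["text"] async for chunk in generator]
    return prompt, "".join(pieces)


async def stream_all(engine, prompts, sampling_params):
    """Stream every prompt concurrently instead of one after another."""
    return await asyncio.gather(
        *(stream_one(engine, p, sampling_params) for p in prompts)
    )


pairs = asyncio.run(stream_all(FakeStreamingEngine(), ["A", "B"], {"top_p": 0.95}))
print(dict(pairs))
```

Printing chunks from concurrent streams directly would interleave their text, which is why this sketch collects each stream before displaying it.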





In [6]:
llm.shutdown()
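
Calling `shutdown()` releases the engine's resources, so in scripts it is worth making sure it runs even when generation raises. The sketch below uses a small context manager for that. `engine_session` and `DummyEngine` are hypothetical. With the real engine, `make_engine` could be a callable such as `lambda: sgl.Engine(model_path=...)`.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(make_engine):
    """Create an engine and guarantee shutdown() runs, even on errors."""
    engine = make_engine()
    try:
        yield engine
    finally:
        engine.shutdown()


class DummyEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


with engine_session(DummyEngine) as engine:
    pass  # run generate(...) calls here
print(engine.closed)
```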