[examples] Improve the S3 Source example (LangStream#706)
eolivelli committed Nov 9, 2023
1 parent 5534f33 commit ba7d9c4
Showing 12 changed files with 250 additions and 188 deletions.
8 changes: 4 additions & 4 deletions dev/s3_upload.sh
@@ -18,11 +18,11 @@

# Usage: ./s3_upload.sh <host> <url> <bucket> <file>

bucket=$1
file=$2
host=$1
url=$2
bucket=$3
file=$4

host=localhost
url=http://localhost:9000
s3_key=minioadmin
s3_secret=minioadmin
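The body of the script is collapsed in this view. What an upload script like this typically does under the hood is sign a `PUT` request with the legacy AWS v2 signature scheme; a standard-library Python sketch of that signing (an illustration, not the collapsed script body):

```python
# Sketch: build the curl command an S3 upload script like dev/s3_upload.sh
# would run, signing the PUT with the legacy AWS v2 scheme.
# Illustration only; endpoint and credentials mirror the script's defaults.
import base64
import hashlib
import hmac
from email.utils import formatdate

def sign_v2_put(secret: str, resource: str, content_type: str, date: str) -> str:
    """Return the AWS v2 signature for a PUT of `resource`."""
    string_to_sign = f"PUT\n\n{content_type}\n{date}\n{resource}"
    digest = hmac.new(secret.encode(), string_to_sign.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()

def curl_command(host, url, bucket, file,
                 key="minioadmin", secret="minioadmin"):
    resource = f"/{bucket}/{file}"
    content_type = "application/octet-stream"
    date = formatdate(usegmt=True)  # RFC 1123 date for the Date header
    signature = sign_v2_put(secret, resource, content_type, date)
    return (f'curl -X PUT -T {file} -H "Host: {host}" -H "Date: {date}" '
            f'-H "Content-Type: {content_type}" '
            f'-H "Authorization: AWS {key}:{signature}" {url}{resource}')

print(curl_command("localhost", "http://localhost:9000", "documents", "README.md"))
```

The signature is an HMAC-SHA1 over the verb, content type, date, and resource path, base64-encoded, which is why changing any header after signing invalidates the request.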

11 changes: 1 addition & 10 deletions examples/applications/docker-chatbot/crawler.yaml
@@ -42,21 +42,12 @@ pipeline:
allowed-domains: ["https://docs.langstream.ai"]
forbidden-paths: []
min-time-between-requests: 500
reindex-interval-seconds: 3600
max-error-count: 5
max-urls: 1000
max-depth: 50
handle-robots-file: true
user-agent: "" # this is computed automatically, but you can override it
scan-html-documents: true
http-timeout: 10000
handle-cookies: true
max-unflushed-pages: 100
bucketName: "${secrets.s3.bucket-name}"
endpoint: "${secrets.s3.endpoint}"
access-key: "${secrets.s3.access-key}"
secret-key: "${secrets.s3.secret}"
region: "${secrets.s3.region}"
state-storage: disk
- name: "Extract text"
type: "text-extractor"
- name: "Normalise text"
File renamed without changes.
40 changes: 40 additions & 0 deletions examples/applications/s3-source/README.md
@@ -0,0 +1,40 @@
# Preprocessing Text

This sample application shows how to read documents from an S3 bucket, preprocess the text with some common NLP techniques, and then write the results to a Vector Database.

We have two pipelines:

The extract-text.yaml file defines a pipeline that will:

- Extract text from document files (PDF, Word...)
- Detect the language and filter out non-English documents
- Normalize the text
- Split the text into chunks
- Write the chunks to a Vector Database, in this case HerdDB via JDBC

The chatbot.yaml file defines a pipeline that takes questions from the questions-topic, retrieves the most relevant chunks from the Vector Database, re-ranks them, and streams the LLM answer to the answers-topic.

## Prerequisites

Prepare some PDF files and upload them to a bucket in S3.

## Deploy the LangStream application

```
./bin/langstream docker run test -app examples/applications/s3-source -s examples/secrets/secrets.yaml --docker-args="-p9900:9000"
```

Please note that --docker-args="-p9900:9000" maps the S3 API (port 9000 inside the container) to port 9900 on the host.


## Write some documents in the S3 bucket

```
# Upload a document to the S3 bucket
dev/s3_upload.sh localhost http://localhost:9900 documents README.md
dev/s3_upload.sh localhost http://localhost:9900 documents examples/applications/s3-source/simple.pdf
```

## Interact with the Chatbot

Now you can use the developer UI to ask questions to the chatbot about your documents.

If you have uploaded the README file, you should be able to ask "What is LangStream?"
101 changes: 101 additions & 0 deletions examples/applications/s3-source/chatbot.yaml
@@ -0,0 +1,101 @@
#
# Copyright DataStax, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

topics:
- name: "questions-topic"
creation-mode: create-if-not-exists
- name: "answers-topic"
creation-mode: create-if-not-exists
- name: "log-topic"
creation-mode: create-if-not-exists
errors:
on-failure: "skip"
pipeline:
- name: "convert-to-structure"
type: "document-to-json"
input: "questions-topic"
configuration:
text-field: "question"
- name: "compute-embeddings"
type: "compute-ai-embeddings"
configuration:
model: "${secrets.open-ai.embeddings-model}" # This needs to match the name of the model deployment, not the base model
embeddings-field: "value.question_embeddings"
text: "{{ value.question }}"
flush-interval: 0
- name: "lookup-related-documents"
type: "query-vector-db"
configuration:
datasource: "JdbcDatasource"
query: "SELECT text,embeddings_vector FROM documents ORDER BY cosine_similarity(embeddings_vector, CAST(? as FLOAT ARRAY)) DESC LIMIT 20"
fields:
- "value.question_embeddings"
output-field: "value.related_documents"
- name: "re-rank documents with MMR"
type: "re-rank"
configuration:
max: 5 # keep only the top 5 documents, because we have a hard limit on the prompt size
field: "value.related_documents"
query-text: "value.question"
query-embeddings: "value.question_embeddings"
output-field: "value.related_documents"
text-field: "record.text"
embeddings-field: "record.embeddings_vector"
algorithm: "MMR"
lambda: 0.5
k1: 1.2
b: 0.75
- name: "ai-chat-completions"
type: "ai-chat-completions"

configuration:
model: "${secrets.open-ai.chat-completions-model}" # This needs to be set to the model deployment name, not the base name
# on the log-topic we add a field with the answer
completion-field: "value.answer"
# we are also logging the prompt we sent to the LLM
log-field: "value.prompt"
# here we configure the streaming behavior
# as soon as the LLM emits a chunk, it is sent to the answers-topic
stream-to-topic: "answers-topic"
# for the streamed answer, each chunk becomes the whole message payload
# the 'value' syntax refers to the entire value of the message
stream-response-completion-field: "value"
# we batch up to 20 chunks into each message
# to reduce latency for the first tokens, the agent sends the first message
# with 1 chunk, then 2 chunks... up to the min-chunks-per-message value
# larger messages then reduce the per-message overhead on the topic
min-chunks-per-message: 20
messages:
- role: system
content: |
A user is going to ask you some questions. The documents below may help you answer them.
Please try to leverage them in your answer as much as possible.
If you provide code or YAML snippets, please explicitly state that they are examples.
Do not provide information that is not related to the documents provided below.
Documents:
{{# value.related_documents}}
{{ text}}
{{/ value.related_documents}}
- role: user
content: "{{ value.question}}"
- name: "cleanup-response"
type: "drop-fields"
output: "log-topic"
configuration:
fields:
- "question_embeddings"
- "related_documents"
38 changes: 38 additions & 0 deletions examples/applications/s3-source/configuration.yaml
@@ -0,0 +1,38 @@
#
#
# Copyright DataStax, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

configuration:
resources:
- type: "datasource"
name: "JdbcDatasource"
configuration:
service: "jdbc"
driverClass: "herddb.jdbc.Driver"
url: "${secrets.herddb.url}"
user: "${secrets.herddb.user}"
password: "${secrets.herddb.password}"
- type: "open-ai-configuration"
name: "OpenAI Azure configuration"
configuration:
url: "${secrets.open-ai.url}"
access-key: "${secrets.open-ai.access-key}"
provider: "${secrets.open-ai.provider}"
dependencies:
- name: "HerdDB.org JDBC Driver"
url: "https://repo1.maven.org/maven2/org/herddb/herddb-jdbc/0.28.0/herddb-jdbc-0.28.0-thin.jar"
sha512sum: "d8ea8fbb12eada8f860ed660cbc63d66659ab3506bc165c85c420889aa8a1dac53dab7906ef61c4415a038c5a034f0d75900543dd0013bdae50feafd46f51c8e"
type: "java-library"
@@ -15,9 +15,23 @@
#

name: "Extract and manipulate text"
topics:
- name: "chunks-topic"
assets:
- name: "documents-table"
asset-type: "jdbc-table"
creation-mode: create-if-not-exists
config:
table-name: "documents"
datasource: "JdbcDatasource"
create-statements:
- |
CREATE TABLE documents (
filename TEXT,
chunk_id int,
num_tokens int,
lang TEXT,
text TEXT,
embeddings_vector FLOATA,
PRIMARY KEY (filename, chunk_id));
pipeline:
- name: "Read from S3"
type: "s3-source"
@@ -27,7 +41,6 @@ pipeline:
access-key: "${secrets.s3.access-key}"
secret-key: "${secrets.s3.secret}"
region: "${secrets.s3.region}"
idle-time: 5
- name: "Extract text"
type: "text-extractor"
- name: "Normalise text"
@@ -73,10 +86,40 @@ pipeline:
- name: "compute-embeddings"
id: "step1"
type: "compute-ai-embeddings"
output: "chunks-topic"
configuration:
model: "${secrets.open-ai.embeddings-model}" # This needs to match the name of the model deployment, not the base model
embeddings-field: "value.embeddings_vector"
text: "{{ value.text }}"
batch-size: 10
flush-interval: 500
- name: "Delete stale chunks"
type: "query"
configuration:
datasource: "JdbcDatasource"
when: "fn:toInt(properties.text_num_chunks) == (fn:toInt(properties.chunk_id) + 1)"
mode: "execute"
query: "DELETE FROM documents WHERE filename = ? AND chunk_id > ?"
output-field: "value.delete-results"
fields:
- "value.filename"
- "fn:toInt(value.chunk_id)"
- name: "Write"
type: "vector-db-sink"
configuration:
datasource: "JdbcDatasource"
table-name: "documents"
fields:
- name: "filename"
expression: "value.filename"
primary-key: true
- name: "chunk_id"
expression: "value.chunk_id"
primary-key: true
- name: "embeddings_vector"
expression: "fn:toListOfFloat(value.embeddings_vector)"
- name: "lang"
expression: "value.language"
- name: "text"
expression: "value.text"
- name: "num_tokens"
expression: "value.chunk_num_tokens"
@@ -16,6 +16,23 @@
#

gateways:
- id: "consume-chunks"
- id: "user-input"
type: produce
topic: "questions-topic"
parameters:
- sessionId
produceOptions:
headers:
- key: langstream-client-session-id
valueFromParameters: sessionId

- id: "bot-output"
type: consume
topic: "chunks-topic"
topic: "answers-topic"
parameters:
- sessionId
consumeOptions:
filters:
headers:
- key: langstream-client-session-id
valueFromParameters: sessionId
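Clients reach these gateways over the LangStream gateway WebSocket API, and the shared `sessionId` parameter is what correlates a message produced on `user-input` with the filtered answers consumed from `bot-output`. A sketch of how a client might assemble the two endpoint URLs (the host, port, and exact URL layout here are assumptions; check the LangStream gateway documentation for your deployment):

```python
# Sketch: build produce/consume gateway URLs sharing one sessionId.
# The ws://localhost:8091/v1/... layout is an assumption for illustration.
import uuid
from urllib.parse import urlencode

def gateway_url(base: str, mode: str, tenant: str, app: str,
                gateway: str, session_id: str) -> str:
    # "param:"-prefixed query values are matched against the gateway's
    # declared parameters (sessionId above).
    qs = urlencode({"param:sessionId": session_id})
    return f"{base}/v1/{mode}/{tenant}/{app}/{gateway}?{qs}"

session = str(uuid.uuid4())
ask = gateway_url("ws://localhost:8091", "produce", "default", "test", "user-input", session)
listen = gateway_url("ws://localhost:8091", "consume", "default", "test", "bot-output", session)
print(ask)
print(listen)
```

Because `bot-output` filters on the `langstream-client-session-id` header that `user-input` stamps from the same parameter, two browser tabs with different session ids each see only their own answers.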
File renamed without changes.
