# A simple Vector Pipeline for a RAG-based AI using Slack as Knowledge Base

This notebook has three sections:
1. Scraping the conversations from a slack channel
2. Preparing and chunking the scraped conversations
3. Configure an embedding model to use for computating the vector embeddings for the conversations
4. Uploading the vector embeddings to a vector database
5. Prompting an LLM with a retrieval augmentation using the vector database

## Scraping Slack Conversations using Slack API 

In [18]:
!pip install slack_sdk | grep -v 'already satisfied'


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m23.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


Create a little helper class to simplify the usage of Slack SDK for our purposes.

In [1]:
import datetime
import getpass
import time
import os
from slack_sdk import WebClient

class SlackHelper:

    def __init__(self, token: str):
        self._username_cache = dict()
        self._slack_client = WebClient(token)
        self._slack_client.api_test()

    def get_messages_from_channel(self, channel):
        messages = self._slack_client.conversations_history(channel=channel)["messages"]
        conversation = ""
        for message in reversed(messages):
            message_ts = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(float(message["ts"])))
            conversation = conversation + "(" + message_ts + ", user=" + self.get_user_name(message["user"]) + "): " + message["text"] + "\n"
            if "thread_ts" in message:
                replies = self._slack_client.conversations_replies(channel=channel, ts=message["thread_ts"])["messages"]
                for reply in replies[1:]:
                    reply_ts = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(float(message["ts"])))
                    conversation = conversation + "      (" + reply_ts + ", user=" + self.get_user_name(reply["user"]) + "): " + reply["text"]  + "\n"
        return conversation    

    def get_user_name(self, user):
        if user not in self._username_cache:
            self._username_cache[user] = self._slack_client.users_info(user=user)["user"]["real_name"]
        return self._username_cache[user]

    def get_conversations_info(self, channel):
        response = self._slack_client.conversations_info(channel=channel, include_num_members=1)
        return response["channel"]

    def list_channels(self):
        response = self._slack_client.conversations_list(types="public_channel, private_channel, mpim, im")
        return response["channels"]

    def get_channel_by_name(self, channel_name):
        for channel in self.list_channels():
            if channel['name'] == channel_name:
                return channel['id']
        return None

We need a Slack API Token for a custom app. You can create one [here](https://api.slack.com/apps/). Select `Create New App`->`
From scratch` and provide an app name and a workspace where you develop the app. This needs to be a personal Slack workspace as you don't have permissions to create apps in IBM Slack workspaces.

Now set up a bot token for your app in your workspace. In the navigation pane to the left select `Features`->`OAuth & Permissions`. Scroll down to section `Scopes` and add scopes for `app_mentions:read`, `channels:history`, `channels:read`, `groups:history`, `groups:read`,  `im:read`, `mpim:read` and `users:read`.
Now scroll up to section `OAuth Tokens for Your Workspace` and click `Install to Workspace`. Now select your workspace where the app should be deployed (again note: You can't deploy to IBM Slack workspaces yourselfes. This requires an IBM Slack admin). Slack generates a new `Bot User OAuth Token` as part of this workspace deployment. Copy this token. You need it to run the below logic.

You now need to add the new app to the channel(s) for which you want to retrieve the conversations. To do so, go to your Slack client, open the channel you want to use and type `@<your app name>` and press enter. Slack opens a dialog asking you whether you want to `Add to channel`. Click this option to add the app bot to your channel. 

You are now set up to proceed with the notebook.

Set the Slack API Token:

In [2]:
try:
    slack_api_token = os.environ["SLACK_API_TOKEN"]
except KeyError:
    slack_api_token = getpass.getpass("Please enter your Slack API Token (hit enter): ")

Initialize our Slack helper class.

In [3]:
slack = SlackHelper(slack_api_token)

Set the slack channel:

In [4]:
try:
    slack_channel_name = os.environ["SLACK_CHANNEL"]
except KeyError:
    slack_channel_name = input("Please enter the name of your Slack channel to work with (hit enter): ")

slack_channel = slack.get_channel_by_name(slack_channel_name)
if not slack_channel:
    raise BaseException("Channel #" + slack_channel_name + " does not exist")
print("Your slack channel is #" + slack_channel_name + " with channel ID " + slack_channel + ".")

Your slack channel is #vector-test with channel ID C062WE3UXS7.


Retrieve the full conversation history of the selected channel:

In [5]:
slack_conversation = slack.get_messages_from_channel(slack_channel)
print(slack_conversation)

(2023-10-26 08:42:00, user=torsten): <@U05S5PHRU7J> has joined the channel
(2023-10-26 08:42:08, user=Michael Behrendt): <@U05RALRP84W> has joined the channel
(2023-10-26 08:42:08, user=Volkmar): <@U05RHA6C4US> has joined the channel
(2023-10-26 08:43:30, user=torsten): I have a large tree in my garden that I need to get rid of. Any suggestions?
      (2023-10-26 08:43:30, user=torsten): Hmm, how large is it?
      (2023-10-26 08:43:30, user=torsten): Probably 10 meters.
      (2023-10-26 08:43:30, user=torsten): Oh, this is a large tree indeed. Do you own a chainsaw?
      (2023-10-26 08:43:30, user=torsten): Yes, I got one.
      (2023-10-26 08:43:30, user=torsten): Well, you could theoretically cut it with your chainsaw. But that requires some skills. It can be dangerous for such a large tree.
      (2023-10-26 08:43:30, user=torsten): I would recommend that you better hire an expert to get rid of the tree for you.
      (2023-10-26 08:43:30, user=torsten): OK, thank you for the adv

List all channels for demonstration purposes.

In [6]:
for channel in slack.list_channels():
    if "name" in channel:
        print("#" + channel["name"] + "(id: " + channel["id"] + ")")

#random(id: C05RGLSNBFV)
#foo(id: C05RK7JB336)
#general(id: C05RVD0B141)
#vector-test(id: C062WE3UXS7)


## Prepare and Chunk the Slack Conversations

We use LangChain as the framework to run the remainder parts of this Notebook.

In [19]:
%pip install langchain --upgrade | grep -v 'already satisfied'

Note: you may need to restart the kernel to use updated packages.


In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1100,
    chunk_overlap  = 100,
)

slack_conversations = text_splitter.create_documents([slack_conversation])

In [8]:
for idx, conversation in enumerate(slack_conversations):
    print("==========> Chunk " + str(idx))
    print(conversation.page_content)

(2023-10-26 08:42:00, user=torsten): <@U05S5PHRU7J> has joined the channel
(2023-10-26 08:42:08, user=Michael Behrendt): <@U05RALRP84W> has joined the channel
(2023-10-26 08:42:08, user=Volkmar): <@U05RHA6C4US> has joined the channel
(2023-10-26 08:43:30, user=torsten): I have a large tree in my garden that I need to get rid of. Any suggestions?
      (2023-10-26 08:43:30, user=torsten): Hmm, how large is it?
      (2023-10-26 08:43:30, user=torsten): Probably 10 meters.
      (2023-10-26 08:43:30, user=torsten): Oh, this is a large tree indeed. Do you own a chainsaw?
      (2023-10-26 08:43:30, user=torsten): Yes, I got one.
      (2023-10-26 08:43:30, user=torsten): Well, you could theoretically cut it with your chainsaw. But that requires some skills. It can be dangerous for such a large tree.
      (2023-10-26 08:43:30, user=torsten): I would recommend that you better hire an expert to get rid of the tree for you.
      (2023-10-26 08:43:30, user=torsten): OK, thank you for the adv

## Configure AI model for Computation of Vector Embeddings for our Slack Conversations

We use an embedding model from HuggingFace that we load and run in local runtime:

In [9]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.embeddings.base import Embeddings

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

Test embeddings computation with a sample text:

In [10]:
import numpy as np
np.set_printoptions(threshold=0)
query_result = embeddings.embed_query("Hello Embedding!")
print(np.array(query_result))

[-0.03261888 -0.04178722  0.03815975 ...  0.04368008  0.0326837
  0.03922174]


## Upload Embeddings to a Vector Database

We use <a href="https://cloud.ibm.com/docs/databases-for-elasticsearch?topic=databases-for-elasticsearch-getting-started" target="_blank" rel="noopener no referrer">IBM Cloud® Databases for Elasticsearch.</a> as vector database.

The following cell retrieves the Elasticsearch users, password, host and port from the environment if available and prompts you otherwise.

In [23]:
%pip install elasticsearch --upgrade | grep -v 'already satisfied'

Note: you may need to restart the kernel to use updated packages.


In [11]:
try:
    esuser = os.environ["ESUSER"]
except KeyError:
    esuser = input("Please enter your Elasticsearch user name (hit enter): ")
try:
    espassword = os.environ["ESPASSWORD"]
except KeyError:
    espassword = getpass.getpass("Please enter your Elasticsearch password (hit enter): ")
try:
    eshost = os.environ["ESHOST"]
except KeyError:
    eshost = input("Please enter your Elasticsearch hostname (hit enter): ")
try:
    esport = os.environ["ESPORT"]
except KeyError:
    esport = input("Please enter your Elasticsearch port number (hit enter): ")

By default Elasticsearch will start with security features like authentication and TLS enabled. To connect to the Elasticsearch cluster you’ll need to configure the Python Elasticsearch client to use HTTPS with the generated CA certificate in order to make requests successfully. Details can be found <a href="https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#connect-self-managed-new" target="_blank" rel="noopener no referrer">here</a>. In this notebook certificate fingerprints will be used for authentication. 

**Verifying HTTPS with certificate fingerprints (Python 3.10 or later)** If you don’t have access to the generated CA file from Elasticsearch you can use the following script to output the root CA fingerprint of the Elasticsearch instance with openssl s_client <a href="https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#_verifying_https_with_certificate_fingerprints_python_3_10_or_later" target="_blank" rel="noopener no referrer"> (docs)</a>:

The following cell retrieves the fingerprint information using a shell command and stores it in variable `ssl_assert_fingerprint`.

In [12]:
es_ssl_fingerprint = !openssl s_client -connect $ESHOST:$ESPORT -showcerts </dev/null 2>/dev/null | openssl x509 -fingerprint -sha256 -noout -in /dev/stdin
es_ssl_fingerprint = es_ssl_fingerprint[0].lstrip("SHA256 Fingerprint=")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [23]:
from langchain.vectorstores.elasticsearch import ElasticsearchStore
from elasticsearch import Elasticsearch
es_index="slack_index"

es_connection = Elasticsearch([f"https://{esuser}:{espassword}@{eshost}:{esport}"],
                              basic_auth=(esuser, espassword),
                              request_timeout=None,
                              ssl_assert_fingerprint=es_ssl_fingerprint)

knowledge_base = ElasticsearchStore(es_connection=es_connection,
                                    index_name=es_index,
                                    embedding=embeddings,
                                    strategy=ElasticsearchStore.ApproxRetrievalStrategy(),
                                    distance_strategy="DOT_PRODUCT")

The `add_texts()` function of the ElasticsearchStore wrapper in LangChain is a compound function that prepares the document data, computes the embeddings using the HuggingFace embedding model and then loads everything to Elasticsearch.

In [24]:
if es_connection.indices.exists(index=es_index):
    es_connection.indices.delete(index=es_index)
_ = knowledge_base.add_texts(texts=[chunk.page_content for chunk in slack_conversations],
                             metadatas=[{'title': "Chunk-"+str(idx), 'id': idx}
                                for idx, chunk in enumerate(slack_conversations)],
                             index_name=es_index,
                             ids=[str(idx) for idx, chunk in enumerate(slack_conversations)]  # unique for each doc
                            )

In [38]:
dict(es_connection.indices.get(index=es_index))["slack_index"]["mappings"]

{'properties': {'metadata': {'properties': {'id': {'type': 'long'},
    'title': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}}}},
  'text': {'type': 'text',
   'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
  'vector': {'type': 'dense_vector',
   'dims': 384,
   'index': True,
   'similarity': 'dot_product'}}}

In [39]:
es_connection.count(index=es_index)["count"]

3

For testing we run a few simple vector similarity searches:

In [32]:
for similar_document in knowledge_base.similarity_search_with_score("Do you know anything about card games?", k=10):
    if similar_document[1] > 0.59:
        print("Similarity: " + str(similar_document[1]))
        print(similar_document[0].page_content)

In [33]:
for similar_document in knowledge_base.similarity_search_with_score("Any cooking best practices?", k=10):
    if similar_document[1] > 0.56:
        print("Similarity: " + str(similar_document[1]))
        print(similar_document[0].page_content)

Similarity: 0.6478079
(2023-10-26 08:49:17, user=Torsten): <@U05S5QZF6TA> has joined the channel
(2023-10-26 11:20:33, user=torsten): I never manage to get my eggs cooked to the right point.
      (2023-10-26 11:20:33, user=torsten): Oh boy. What's the problem? Are they too soft or too hard?
      (2023-10-26 11:20:33, user=torsten): Well, it happens both. Sometimes they are too soft, and another time when I cook them for exactly the same amount of time they are too hard.
      (2023-10-26 11:20:33, user=torsten): Hmm, did you pay attention to the size of the eggs?
      (2023-10-26 11:20:33, user=torsten): What? Why should I do that?
      (2023-10-26 11:20:33, user=torsten): Well, larger eggs need longer time to become hard than smaller eggs.
      (2023-10-26 11:20:33, user=torsten): Wow! I did not know that. Stupid me! Thank you very much for this information. That explains everything! I need to adjust the cooking time depending on the size of the eggs.
      (2023-10-26 11:20:33, 

In [35]:
for similar_document in knowledge_base.similarity_search_with_score("Do you know anything about trees?", k=5):
    if similar_document[1] > 0.59:
        print("Similarity: " + str(similar_document[1]))
        print(similar_document[0].page_content)

Similarity: 0.6981352
(2023-10-26 08:42:00, user=torsten): <@U05S5PHRU7J> has joined the channel
(2023-10-26 08:42:08, user=Michael Behrendt): <@U05RALRP84W> has joined the channel
(2023-10-26 08:42:08, user=Volkmar): <@U05RHA6C4US> has joined the channel
(2023-10-26 08:43:30, user=torsten): I have a large tree in my garden that I need to get rid of. Any suggestions?
      (2023-10-26 08:43:30, user=torsten): Hmm, how large is it?
      (2023-10-26 08:43:30, user=torsten): Probably 10 meters.
      (2023-10-26 08:43:30, user=torsten): Oh, this is a large tree indeed. Do you own a chainsaw?
      (2023-10-26 08:43:30, user=torsten): Yes, I got one.
      (2023-10-26 08:43:30, user=torsten): Well, you could theoretically cut it with your chainsaw. But that requires some skills. It can be dangerous for such a large tree.
      (2023-10-26 08:43:30, user=torsten): I would recommend that you better hire an expert to get rid of the tree for you.
      (2023-10-26 08:43:30, user=torsten): OK,

## Configure Large Language Model for for the AI Generation with RAG

In [42]:
try:
    apikey = os.environ["IBM_CLOUD_API_KEY"]
except KeyError:
    apikey = getpass.getpass("Please enter your WML api key (hit enter): ")

In [43]:
credentials = {
    "url": "https://us-south.ml.cloud.ibm.com",
    "apikey": apikey
}

The API requires a WatsonX project id that provides the context for the call. We will obtain the id from the project in which this notebook runs. Otherwise, please provide the project id.

**Hint**: You can find the `project_id` as follows. Open the prompt lab in watsonx.ai. At the very top of the UI, there will be `Projects / <project name> /`. Click on the `<project name>` link. Then get the `project_id` from Project's Manage tab (Project -> Manage -> General -> Details).

In [44]:
try:
    project_id = os.environ["PROJECT_ID"]
except KeyError:
    project_id = input("Please enter your project_id (hit enter): ")

In [92]:
from ibm_watson_machine_learning.foundation_models.utils.enums import ModelTypes

model_id = ModelTypes.FLAN_T5_XXL

Set model parameters that will influence the result:

In [94]:
from ibm_watson_machine_learning.metanames import GenTextParamsMetaNames as GenParams
from ibm_watson_machine_learning.foundation_models.utils.enums import DecodingMethods

parameters = {
    GenParams.DECODING_METHOD: DecodingMethods.GREEDY,
    GenParams.MIN_NEW_TOKENS: 1,
    GenParams.MAX_NEW_TOKENS: 50
}

Initialize the Model in WatsonX.ai:

In [95]:
from ibm_watson_machine_learning.foundation_models import Model

llm = Model(
    model_id=model_id,
    params=parameters,
    credentials=credentials,
    project_id=project_id
).to_langchain()

## Prompting an LLM with a Retrieval Augmentation using the Vector Database

We use the WatsonX large language model to built a Question-Answer prompt chain and we use the vector database that we prepared above as knowledge base:

In [96]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff",
                                 retriever=knowledge_base.as_retriever(), return_source_documents=True)

In [97]:
print(qa({"query": "What can I do with a chainsaw?"})["result"])

Cut a tree.


In [98]:
print(qa({"query": "What is the best method to get rid of a 12 meter high tree?"})["result"])

Hire an expert to get rid of the tree for you.


In [99]:
print(qa({"query": "How can I cook eggs to the point?"})["result"])

Adjust the cooking time depending on the size of the eggs.


In [100]:
qa({"query": "What is a good card game?"})["result"]

"I don't know."