The chatbot streamlines gathering organization-wide information so that internal questions can be answered in real time. This documentation is for internal Generate members who may implement or work on the chatbot in the future. Familiarity with Python, the Slack APIs, LLMs, Hugging Face, and DigitalOcean is helpful for understanding this documentation.
- Create a Slack App at https://api.slack.com/apps
- Disable Socket Mode
- Add bot token scopes:
  app_mentions:read,channels:history,channels:read,chat:write,chat:write.public,groups:history,groups:read,im:history,im:read,im:write,incoming-webhook,mpim:history,users:read
- Install the app to your workspace
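- Install the required libraries. `RecursiveCharacterTextSplitter` and `SentenceTransformer` are classes, not package names; they come from the `langchain` and `sentence-transformers` packages respectively: `pip install langchain sentence-transformers numpy faiss-cpu torch`. Alternatively, install everything listed in the requirements file: `pip install -r requirements.txt`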
We chose DigitalOcean for hosting. Two components need to be hosted. The first is the Slack backend, which lives in app.py. You can host it by creating an app on DigitalOcean's App Platform. Once it is deployed, update the Event Subscriptions request URL on the Slack API website so that it points to the new deployment.

The second component is the model. We originally planned to run an LLM on top of our RAG model so that it could turn the relevant text retrieved by the RAG model into conversational answers. We explored using droplets, but the droplet sizes we chose did not have enough resources to run the LLM at an acceptable cost. This still needs to be investigated: GPU droplets or higher-spec droplets could work. DigitalOcean also offers a GenAI Platform, which we did not evaluate.
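When the Event Subscriptions request URL changes, Slack sends a one-time `url_verification` request to the new address and signs every request with the app's signing secret. The sketch below shows both checks using only the standard library. The function names are hypothetical and this is not taken from app.py; if app.py is built on a framework such as Bolt, these steps are likely handled for you.

```python
import hashlib
import hmac
import json
import time


def verify_slack_signature(signing_secret, timestamp, body, signature, now=None):
    """Return True if the raw request body was signed with our signing secret.

    timestamp comes from the X-Slack-Request-Timestamp header and
    signature from the X-Slack-Signature header.
    """
    now = time.time() if now is None else now
    # Reject requests older than five minutes to limit replay attacks.
    if abs(now - int(timestamp)) > 60 * 5:
        return False
    base = f"v0:{timestamp}:{body}".encode("utf-8")
    digest = hmac.new(signing_secret.encode("utf-8"), base, hashlib.sha256).hexdigest()
    return hmac.compare_digest("v0=" + digest, signature)


def handle_event_payload(body):
    """Echo the URL verification challenge, otherwise report the event type."""
    payload = json.loads(body)
    if payload.get("type") == "url_verification":
        # Slack expects the challenge value to be sent back in the response.
        return {"challenge": payload["challenge"]}
    event = payload.get("event", {})
    return {"ok": True, "event_type": event.get("type")}
```

A web handler would call `verify_slack_signature` on the raw body first and return the result of `handle_event_payload` as JSON only if the signature is valid.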
The data we used was an export of the Generate Notion Wiki (the folder named "Wiki Export"). The export contains one Markdown file per page. To preprocess this data, we converted all the .md files into .txt files (md_to_txt.py) and stored them in a folder called "Wiki_txt". We then cleaned the .txt files to remove leftover Markdown formatting, emojis, and empty lines, and combined them into a single .json file (wiki_json.py produces gen_wiki.json). gen_wiki.json was used as the knowledge base for the models.
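The cleaning and combining steps can be sketched roughly as follows. This is an illustration, not the contents of md_to_txt.py or wiki_json.py: the regular expressions cover only common Markdown syntax, the emoji range is approximate, and the `title`/`content` record layout is an assumed schema.

```python
import json
import re
from pathlib import Path

# Rough pattern for common emoji ranges; an assumption, not an exhaustive list.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")


def clean_markdown(text):
    """Strip common Markdown syntax, emojis, and empty lines from one page."""
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", text)        # drop images
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)    # keep link text only
    text = re.sub(r"^[ \t]{0,3}#{1,6}[ \t]*", "", text, flags=re.MULTILINE)  # heading markers
    text = re.sub(r"^[ \t]*[-*+][ \t]+", "", text, flags=re.MULTILINE)       # bullet markers
    text = re.sub(r"[*`]+", "", text)                       # emphasis and code markers
    text = EMOJI.sub("", text)
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


def build_knowledge_base(txt_dir, out_path):
    """Combine every .txt file in txt_dir into one JSON list of pages."""
    pages = []
    for path in sorted(Path(txt_dir).glob("*.txt")):
        content = clean_markdown(path.read_text(encoding="utf-8"))
        pages.append({"title": path.stem, "content": content})  # assumed schema
    Path(out_path).write_text(
        json.dumps(pages, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return pages
```

For example, `build_knowledge_base("Wiki_txt", "gen_wiki.json")` would write one record per page. Underscores are deliberately left alone so that file names such as gen_wiki.json survive cleaning.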
We manually created training data made up of question-answer pairs, stored in training.json. The pairs cover content that can be found in Generate's Notion. We also wrote a parser that converts the .json file to a .jsonl file (json_jsonl.py produces training.jsonl), because we found that training.jsonl worked better for the DistilBERT model. This data was not used for the RAG model.
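A JSON-to-JSONL conversion of this kind can be sketched in a few lines. The sketch assumes that training.json holds a top-level array of records; the function name is hypothetical and json_jsonl.py may differ.

```python
import json


def json_to_jsonl(json_path, jsonl_path):
    """Write each record of a JSON array as one line of a JSONL file."""
    with open(json_path, encoding="utf-8") as src:
        records = json.load(src)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array of records")
    with open(jsonl_path, "w", encoding="utf-8") as dst:
        for record in records:
            # One compact JSON object per line, keeping non-ASCII text readable.
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)
```

JSONL is convenient for training because records can be streamed line by line instead of loading the whole file at once.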
- TODO: fix and clean up this section
- We initially tried rag.py, which did not work well
- rag3.py works well
- TODO: describe the other limitations and the process we followed to reach the final stage
- Does not run on an M1 Mac Pro with 8 GB of memory
- Runs on an M1 Mac Pro with 16 GB of memory, but very slowly, and it may fail to run depending on how much memory is available on the local machine
- If you are taking an LLM class, you may have access to a GPU, which will be much faster; we successfully ran this on a GPU with 48 GB of memory
- Comments explaining how the code works are in the file itself. If you have more questions, try running the code through Claude.ai first or reach out to us.
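For orientation, the retrieval step of a RAG pipeline can be illustrated with a dependency-free sketch. This is not the code in rag3.py: a toy bag-of-words vector stands in for the SentenceTransformer embeddings, a linear scan stands in for the FAISS index, and a fixed character window stands in for RecursiveCharacterTextSplitter. All names and default values are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping character windows (a simplified splitter)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=3):
    """Return up to k chunks most similar to the query, best first."""
    q = embed(query)
    scored = sorted(
        ((cosine(q, embed(chunk)), chunk) for chunk in chunks),
        key=lambda pair: pair[0],
        reverse=True,
    )
    # Chunks with no overlap at all are dropped instead of padding the result.
    return [chunk for score, chunk in scored[:k] if score > 0]
```

In a real pipeline the embedding model and the vector index account for most of the memory use, which is consistent with the hardware limits noted above.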
This file is legacy code. It was the first implementation of our RAG model; we later improved on it by switching to JSON input in rag2.py.
- Requires downloading Ollama and the models
- Uses the RAG model for retrieval and feeds the result into Ollama
- Ollama produces good responses; however, our initial research suggests that Ollama is meant to be used on a local machine, not to be deployed
- This model is a good example of what the chatbot's responses should look like
- To run Ollama, download it from https://ollama.com/ and then download the mistral model
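Feeding retrieved text into a locally running Ollama server can be sketched as follows. The prompt wording and function names are hypothetical and are not taken from our code; the sketch assumes Ollama is listening on its default local port and that the mistral model has already been downloaded.

```python
import json
import urllib.request

# Default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_prompt(question, context_chunks):
    """Combine retrieved chunks and the user question into one prompt."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(context_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def ask_ollama(question, context_chunks, model="mistral", url=OLLAMA_URL, timeout=120):
    """Send the prompt to Ollama and return the generated answer text."""
    payload = json.dumps(
        {"model": model, "prompt": build_prompt(question, context_chunks), "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # With streaming disabled, the reply is a single JSON object.
        return json.loads(response.read().decode("utf-8"))["response"]
```

Keeping `build_prompt` separate from the network call makes the prompt easy to inspect and test without a running Ollama server.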
- DistilBERT
- Uses training.jsonl to train the model
- Slackbot
- rag3.py?
- rag3.py is too resource intensive. Do we need a less resource-intensive approach, or more budget for more RAM on the hosting platform?
TODO
- Create an API with the Notion SDK to integrate with the Generate Notion
- Integrate with the Notion Calendar and automate reminders for events
- Automate reminders for team meetings
- Integrate with Slack message history to return more personalized answers
| Name | Email Address | Role | Date of Last Edit |
|---|---|---|---|
| Chigo Ike | ike.c@northeastern.edu | Data Analyst | 4/11/2025 |
| Matthew Li | li.matt@northeastern.edu | Data Analyst | 4/11/2025 |
| Kaydence Lin | lin.kay@northeastern.edu | Data Analyst | 4/11/2025 |