Genny - Generate Slack Chatbot

Operations - Data Solutions - Internal Insights

Spring 2025

Chigo Ike, Matthew Li, Kaydence Lin

TODO: Add any resources or links referenced.

The purpose of the chatbot is to streamline the process of gathering organization-wide information to answer internal questions in real time. This documentation is for internal Generate members who may implement or work on the chatbot in the future. Knowledge of Python, the Slack API, LLMs, Hugging Face, and DigitalOcean may be useful for understanding this documentation.

User Guide - Matthew

app.py

Slack

  1. Create a Slack App at https://api.slack.com/apps
    • Disable Socket Mode
    • Add bot token scopes: app_mentions:read, channels:history, channels:read, chat:write, chat:write.public, groups:history, groups:read, im:history, im:read, im:write, incoming-webhook, mpim:history, users:read
    • Install the app to your workspace
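
Once the app is installed, the backend needs to receive Slack events over HTTP (this is why Socket Mode is disabled). Below is a hedged sketch of what the event handling in app.py might look like, assuming the slack_bolt library; the handler logic and environment variable names are illustrative, not the exact contents of app.py.

    import os

    from slack_bolt import App

    # Tokens come from the "OAuth & Permissions" and "Basic Information"
    # pages of the Slack app created above
    app = App(
        token=os.environ["SLACK_BOT_TOKEN"],
        signing_secret=os.environ["SLACK_SIGNING_SECRET"],
    )

    @app.event("app_mention")
    def handle_mention(event, say):
        # Placeholder reply; the real bot would route event["text"] through
        # the RAG pipeline described under Technical Details
        say(f"You asked: {event['text']}")

    if __name__ == "__main__":
        # Starts a local HTTP server that listens on /slack/events
        app.start(port=int(os.environ.get("PORT", 3000)))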

Environment

  • Libraries needed: pip install langchain-text-splitters sentence-transformers numpy faiss-cpu torch, or install everything with pip install -r requirements.txt. (RecursiveCharacterTextSplitter and SentenceTransformer are class names, not pip packages; they are provided by langchain-text-splitters and sentence-transformers.)

DigitalOcean

DigitalOcean is the platform we decided to use for hosting. There are two components that need to be hosted. First, there is the Slack backend, found in the app.py file; you can host this by creating an App Platform app. Once this is done, you will have to update the event subscription URL on the Slack API website to point at the new deployment (a sketch of how app.py could be exposed for App Platform follows below). The second component is hosting the model. We originally planned to run an LLM on top of our RAG model to produce conversational answers from the relevant text the RAG model retrieves. We explored using Droplets, but the resources we chose were not enough to run the LLM at that cost. This still needs to be investigated; GPU Droplets or higher-spec Droplets could work. There is also a GenAI Platform on DigitalOcean that we did not evaluate.
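
This is a hedged sketch, not the actual app.py: it assumes slack_bolt's Flask adapter so the bot can run under a production web server such as gunicorn.

    import os

    from flask import Flask, request
    from slack_bolt import App
    from slack_bolt.adapter.flask import SlackRequestHandler

    bolt_app = App(
        token=os.environ["SLACK_BOT_TOKEN"],
        signing_secret=os.environ["SLACK_SIGNING_SECRET"],
    )

    flask_app = Flask(__name__)
    handler = SlackRequestHandler(bolt_app)

    @flask_app.route("/slack/events", methods=["POST"])
    def slack_events():
        # Slack posts event callbacks here; bolt verifies and dispatches them
        return handler.handle(request)

App Platform would then run something like gunicorn app:flask_app, and https://<your-app>.ondigitalocean.app/slack/events becomes the request URL under Event Subscriptions on the Slack API website.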

Technical Details

Data

The data used was an export of the Generate Notion wiki (folder named "Wiki Export"). The export contained a markdown file for each page. To preprocess this data, we converted all the .md files into .txt files (md_to_txt.py); the text files are stored in a folder called "Wiki_txt". We cleaned the .txt files to remove unnecessary markdown formatting, emojis, and empty lines, then combined all the .txt files into a single .json file (wiki_json.py and gen_wiki.json). gen_wiki.json serves as the knowledge base for the models.
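
The preprocessing flow is roughly the following. This is a minimal sketch that assumes the folder layout above and a simple {"title", "text"} record shape for gen_wiki.json, not the exact md_to_txt.py / wiki_json.py code:

    import json
    import re
    from pathlib import Path

    SRC = Path("Wiki Export")   # Notion markdown export
    DST = Path("Wiki_txt")      # cleaned .txt files
    DST.mkdir(exist_ok=True)

    def clean(text: str) -> str:
        text = re.sub(r"[#*_`>\[\]]", "", text)          # strip md formatting
        text = text.encode("ascii", "ignore").decode()   # drop emojis
        lines = (line.strip() for line in text.splitlines())
        return "\n".join(line for line in lines if line)  # drop empty lines

    pages = []
    for md in SRC.rglob("*.md"):
        cleaned = clean(md.read_text(encoding="utf-8"))
        (DST / (md.stem + ".txt")).write_text(cleaned, encoding="utf-8")
        pages.append({"title": md.stem, "text": cleaned})

    Path("gen_wiki.json").write_text(json.dumps(pages, indent=2), encoding="utf-8")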

Training Data

We manually created training data of question-answer pairs, stored in training.json. This data covers content that can be found in Generate's Notion. We wrote a parser to convert the .json file to a .jsonl file (json_jsonl.py and training.jsonl; a sketch of the conversion appears below). We found that training.jsonl worked better for the DistilBERT model. This data was not used for the RAG pipeline.

  • TODO: fix and clean the training data.
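
A minimal sketch of the .json-to-.jsonl conversion; the field names in training.json are assumed to be "question" and "answer":

    import json

    # training.json is assumed to be a list of {"question": ..., "answer": ...}
    with open("training.json", encoding="utf-8") as f:
        pairs = json.load(f)

    with open("training.jsonl", "w", encoding="utf-8") as out:
        for pair in pairs:
            out.write(json.dumps(pair) + "\n")  # one JSON object per line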

Models

Introduction

  • We initially tried rag.py, which did not work well.
  • rag3.py works well.
  • TODO: document other limitations and the process that led to the final model.

rag3.py - Matthew

  • Does not run on an M1 MacBook Pro with 8 GB of memory.
  • Runs on an M1 MacBook Pro with 16 GB of memory, but very slowly; it may not run at all depending on the memory available on your machine.
  • If you are taking an LLM class, you may have access to a GPU you can run this on; it will be much faster. We successfully ran it on a GPU with 48 GB of memory.
  • Comments on how the code works are in the file. If you have more questions, try running the code through Claude.ai first, or reach out to us. A condensed sketch of the retrieval core follows below.
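
For orientation, here is a condensed sketch of the retrieval core implied by the Environment section (chunking, sentence-transformer embeddings, FAISS search). It is not a copy of rag3.py; the embedding model, chunk sizes, and k are assumptions:

    import json

    import faiss
    import numpy as np
    from langchain_text_splitters import RecursiveCharacterTextSplitter
    from sentence_transformers import SentenceTransformer

    # Chunk the knowledge base (field names and sizes are assumptions)
    with open("gen_wiki.json", encoding="utf-8") as f:
        pages = json.load(f)
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    chunks = [c for page in pages for c in splitter.split_text(page["text"])]

    # Embed every chunk and build an exact L2 index over the vectors
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(chunks, convert_to_numpy=True).astype(np.float32)
    index = faiss.IndexFlatL2(embeddings.shape[1])
    index.add(embeddings)

    def retrieve(query: str, k: int = 3) -> list[str]:
        # Return the k chunks closest to the query embedding
        q = model.encode([query], convert_to_numpy=True).astype(np.float32)
        _, ids = index.search(q, k)
        return [chunks[i] for i in ids[0]]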

rag.py - Matthew

This file is legacy code. It was the first implementation of our RAG model; we improved it to use JSON in rag2.py.

rag_ollama.py - Kaydence

  • Download Ollama and the required models.
  • Uses RAG retrieval and feeds the retrieved text into Ollama.
  • Ollama produces good responses; however, based on our initial research, Ollama is not meant to be deployed, only run on your local machine.
  • This model is a good example of what the chatbot's responses should look like.
  • To run Ollama, download it from https://ollama.com/ and pull the Mistral model. A sketch of querying Ollama follows below.
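
A minimal sketch of handing retrieved text to a local Ollama server over its default REST API, assuming ollama pull mistral has been run; the prompt wording and helper names are illustrative:

    import requests

    def ask(question: str, context: str) -> str:
        # context would come from the RAG retriever (e.g., the retrieve()
        # sketch above); the prompt wording is illustrative
        prompt = (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}"
        )
        resp = requests.post(
            "http://localhost:11434/api/generate",  # Ollama's default local API
            json={"model": "mistral", "prompt": prompt, "stream": False},
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["response"]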

training.py - Chigo

  • Fine-tunes DistilBERT.
  • Uses training.jsonl to train the model; a sketch of loading the data follows below.
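
As a starting point, here is a minimal sketch of loading and tokenizing training.jsonl with the Hugging Face stack (datasets + transformers). The field names and the sequence-pair framing are assumptions; the actual fine-tuning loop lives in training.py:

    from datasets import load_dataset
    from transformers import AutoTokenizer

    # Field names ("question"/"answer") are assumptions about training.jsonl
    dataset = load_dataset("json", data_files="training.jsonl", split="train")
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

    def tokenize(example):
        # Encode question and answer together as a sequence pair
        return tokenizer(example["question"], example["answer"],
                         truncation=True, padding="max_length", max_length=256)

    tokenized = dataset.map(tokenize)
    print(tokenized[0]["input_ids"][:10])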

Final Product

  • Slack bot (app.py)
  • rag3.py?

Maintenance and Updates

  • rag3.py is too resource-intensive to host as-is; future maintainers need to find a less resource-intensive approach or budget for more RAM on the hosting plan.

FAQs

TODO: add frequently asked questions.

Potential Future Work

  1. Create an API with Notion SDK to integrate with the Generate Notion
  2. Integrate with the Notion Calendar and automate reminders for events
  3. Automate reminders for team meetings
  4. Integrate with Slack message history to return more personalized answers

Contact Information

Name          Email Address               Role          Date of Last Edit
Chigo Ike     ike.c@northeastern.edu      Data Analyst  4/11/2025
Matthew Li    li.matt@northeastern.edu    Data Analyst  4/11/2025
Kaydence Lin  lin.kay@northeastern.edu    Data Analyst  4/11/2025
