This project's origin is here.
In this project, you'll discover the power of semantic search. First, we will index the Toxic Comment Classification dataset in Weaviate. This dataset comprises two columns: a comment and a binary label indicating whether it is toxic or not. When a user enters a comment and wants to determine if it is toxic or not, we will conduct a semantic search and display the label of the comment that is most similar to the entered one.
This demo relies on a technique called contextual classification. It makes predictions about cross-references based on context, without relying on pre-existing training data. Contextual classification is a good choice when you need to assess the similarity between a source item and a potential target item, particularly when your data has a strong semantic connection (for instance, 'The Iconic Statue of Liberty' and 'The Vibrant New York City').
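The lookup described above (return the label of the indexed comment most similar to the entered one) can be sketched in plain Python. This is only an illustration under stated assumptions: the three-dimensional vectors are made up, the labels are assumed to be 1 for toxic and 0 for non-toxic, and `classify_by_nearest` is a hypothetical helper. In the demo itself, Weaviate computes the vectors and performs the search.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_by_nearest(query_vector, indexed):
    """Return the label of the indexed item closest to the query.

    `indexed` is a list of (vector, label) pairs.
    """
    _, best_label = max(
        indexed, key=lambda item: cosine_similarity(query_vector, item[0])
    )
    return best_label


# Toy data: made-up vectors, assumed labels (1 = toxic, 0 = non-toxic).
indexed_comments = [
    ([0.9, 0.1, 0.0], 1),
    ([0.1, 0.8, 0.3], 0),
]

print(classify_by_nearest([0.2, 0.7, 0.2], indexed_comments))
```

The query vector points in nearly the same direction as the second indexed vector, so the sketch returns that comment's label.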
- Python
- Weaviate
- Streamlit
text2vec-contextionary (Contextionary) vectorizer
This vectorizer module is used to build 300-dimensional vectors using a Weighted Mean of Word Embeddings (WMOWE) technique.
Contextionary is Weaviate's own language vectorizer, trained using fastText on Wiki and CommonCrawl data.
To use it, you need to enable it in the docker compose file.
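To give a feel for what a weighted mean of word embeddings looks like, here is a minimal sketch in plain Python. The two-dimensional embeddings and the per-word weights are invented for illustration, and `weighted_mean_embedding` is a hypothetical helper. The real module uses 300-dimensional vectors and determines its own weights, which are not described here.

```python
def weighted_mean_embedding(words, embeddings, weights):
    """Combine word vectors into one vector using a weighted mean.

    Words missing from `embeddings` are skipped; words missing from
    `weights` get a default weight of 1.0.
    """
    dims = len(next(iter(embeddings.values())))
    total = [0.0] * dims
    weight_sum = 0.0
    for word in words:
        if word not in embeddings:
            continue
        w = weights.get(word, 1.0)
        for i, value in enumerate(embeddings[word]):
            total[i] += w * value
        weight_sum += w
    if weight_sum == 0:
        return total  # no known words: zero vector
    return [v / weight_sum for v in total]


# Toy values, not taken from Contextionary.
toy_embeddings = {"you": [1.0, 0.0], "are": [0.0, 1.0], "great": [1.0, 1.0]}
toy_weights = {"you": 1.0, "are": 1.0, "great": 2.0}

vector = weighted_mean_embedding(["you", "are", "great"], toy_embeddings, toy_weights)
print(vector)  # [0.75, 0.75]
```

Giving "great" twice the weight pulls the resulting vector toward that word's embedding, which is the general idea behind weighting some words more heavily than others.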
- Python 3 interpreter installed
- Ability to run docker compose (the most straightforward way on Windows/Mac is to install Docker Desktop)
- Clone the repository
- Create a virtual environment and activate it
  python3 -m venv venv
  source venv/bin/activate
- Install all required dependencies
  pip install -r requirements.txt
- Run a containerized instance of Weaviate. It also includes the vectorizer module that computes the embeddings.
  Note: Make sure nothing is occupying port 8080. If something is, either stop that process or change the port that Weaviate uses.
  docker compose up
- Index the dataset in Weaviate
  python add_data.py
- Run the Streamlit demo
  streamlit run app.py
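The Weaviate step above asks you to make sure port 8080 is free. One way to check this before running docker compose is a short Python snippet like the following. It is a sketch, not part of the repository: `is_port_in_use` is a hypothetical helper, and it assumes that Weaviate would be reachable on the local host.

```python
import socket


def is_port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if is_port_in_use(8080):
    print("Port 8080 is already in use")
else:
    print("Port 8080 looks free")
```

A successful connection only shows that some process is listening on the port; it does not tell you which process it is.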
./test.sh
- Both the Streamlit app and docker compose can be stopped with Ctrl+C in the corresponding terminal window
- To remove the created docker containers and volumes, use
  docker compose down -v
- Enter a comment
- Press the Classify button to see whether it is classified as toxic or not.
The dataset used for this example is available on Kaggle: https://www.kaggle.com/datasets/akashsuper2000/toxic-comment-classification