Toxic Comment Classifier

This project's origin is here.

Overview

This project demonstrates the power of semantic search. First, we index the Toxic Comment Classification dataset in Weaviate. The dataset has two columns: a comment and a binary label indicating whether the comment is toxic. When a user enters a comment, we run a semantic search and display the label of the indexed comment that is most similar to it.
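
The lookup described above can be sketched in plain Python. This is a minimal illustration, not the code in `app.py`: the function names and the toy 3-dimensional vectors are hypothetical, and in the real demo Weaviate computes the vectors and performs the similarity search.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def classify_by_nearest(query_vector, indexed):
    """Return the label of the indexed item most similar to query_vector.

    indexed is a list of (vector, label) pairs.
    """
    _, best_label = max(
        indexed, key=lambda item: cosine_similarity(query_vector, item[0])
    )
    return best_label


# Hand-made toy vectors (assumption: real vectors come from the vectorizer).
indexed_comments = [
    ([0.9, 0.1, 0.0], "toxic"),
    ([0.1, 0.8, 0.3], "non-toxic"),
    ([0.0, 0.2, 0.9], "non-toxic"),
]

print(classify_by_nearest([0.8, 0.2, 0.1], indexed_comments))  # prints "toxic"
```

The query vector points in nearly the same direction as the first indexed vector, so the entered comment inherits that comment's label.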

Contextual classification

This demo relies on a technique called contextual classification. It makes predictions about cross-references based on context, without relying on pre-existing training data. Contextual classification is a good choice when you need to assess the similarity between a source item and a potential target item, particularly when your data has a strong semantic connection (for instance, 'The Iconic Statue of Liberty' and 'The Vibrant New York City').

Technology stack

Python
Weaviate
Streamlit

Used Weaviate modules/models

text2vec-contextionary (Contextionary) vectorizer
This vectorizer module is used to build 300-dimensional vectors using a Weighted Mean of Word Embeddings (WMOWE) technique. Contextionary is Weaviate's own language vectorizer that is trained using fastText on Wiki and CommonCrawl data.

To use this module, you need to enable it in the docker compose file.
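
To give a feel for what a weighted mean of word embeddings does, here is a small sketch. It is not Contextionary's implementation: the function name, the 2-dimensional toy embeddings, and the per-word weights are all made up for illustration, and the real module uses 300-dimensional vectors with its own weighting scheme.

```python
def weighted_mean_embedding(tokens, embeddings, weights):
    """Average the word vectors of known tokens, weighted per token."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for token in tokens:
        if token not in embeddings:
            continue  # skip out-of-vocabulary words
        w = weights.get(token, 1.0)  # default weight is an assumption
        for i, value in enumerate(embeddings[token]):
            total[i] += w * value
        weight_sum += w
    if weight_sum == 0:
        return total  # no known words: return the zero vector
    return [value / weight_sum for value in total]


# Hypothetical toy vocabulary and weights.
toy_embeddings = {"you": [1.0, 0.0], "are": [0.0, 1.0], "great": [1.0, 1.0]}
toy_weights = {"you": 1.0, "are": 1.0, "great": 2.0}

sentence_vector = weighted_mean_embedding(
    "you are great".split(), toy_embeddings, toy_weights
)
print(sentence_vector)  # prints [0.75, 0.75]
```

Giving a word a larger weight pulls the sentence vector toward that word's embedding, which is how more informative words can dominate the result.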

Prerequisites

Python 3 interpreter installed
Ability to run docker compose (the most straightforward way on Windows/Mac is to install Docker Desktop)

Setup instructions

Start up

Clone the repository

Create a virtual environment and activate it

```
python3 -m venv venv
source venv/bin/activate
```

Install all required dependencies
```
pip install -r requirements.txt
```
Run a containerized instance of Weaviate. It also includes the vectorizer module that computes the embeddings.

Note: Make sure nothing is occupying port 8080. If something is, either stop that process or change the port that Weaviate uses.
```
docker compose up
```
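
Before running the command above, you may want to check whether port 8080 is already taken. The following is one possible way to do that from Python using only the standard library; the helper name `port_in_use` is hypothetical and not part of this repository.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success and an error code otherwise
        return sock.connect_ex((host, port)) == 0


if port_in_use(8080):
    print("Port 8080 is already in use")
else:
    print("Port 8080 looks free")
```

This only tests whether a local process is listening on the port; it does not tell you which process it is.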
Index the dataset in Weaviate
```
python add_data.py
```
Run the Streamlit demo
```
streamlit run app.py
```

Run integration test

```
./test.sh
```

Shut down

Both the Streamlit app and docker compose can be stopped with Ctrl+C in the corresponding terminal window.
To remove the created Docker containers and volumes, use:

```
docker compose down -v
```

Usage instructions

Enter a comment.
Press the Classify button to see whether it is classified as toxic.

Dataset license

The dataset used for this example is available on Kaggle: https://www.kaggle.com/datasets/akashsuper2000/toxic-comment-classification
