
Toxic Comment Classifier

This project's origin is here.

Overview

Demo

In this project, you'll discover the power of semantic search. First, we index the Toxic Comment Classification dataset in Weaviate. The dataset has two columns: a comment and a binary label indicating whether the comment is toxic. When a user enters a comment and wants to know whether it is toxic, we run a semantic search and display the label of the indexed comment that is most similar to the input.
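
As a rough sketch of how that lookup can work with the Weaviate Python client (v3 syntax; the class name "Comment" and the properties "text" and "toxic" are assumptions here, the real schema lives in add_data.py):

    import weaviate

    # Connect to the local Weaviate instance started via docker compose
    client = weaviate.Client("http://localhost:8080")

    user_comment = "thanks, that was really helpful"

    # Fetch the single most semantically similar indexed comment and read its label
    result = (
        client.query
        .get("Comment", ["text", "toxic"])
        .with_near_text({"concepts": [user_comment]})
        .with_limit(1)
        .do()
    )

    nearest = result["data"]["Get"]["Comment"][0]
    print("toxic" if nearest["toxic"] else "not toxic")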

Contextual classification

This demo relies on a technique called contextual classification. It makes predictions about cross-references based on context alone, without relying on pre-existing training data. When you need to assess the similarity between a source item and a potential target item, contextual classification is an excellent choice, particularly when your data has a strong semantic connection (for example, 'The Iconic Statue of Liberty' and 'The Vibrant New York City').
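
For reference, this is roughly how a contextual classification job can be scheduled with the Weaviate Python client (v3); the class, property, and cross-reference names below are purely illustrative and are not taken from this repository:

    import weaviate

    client = weaviate.Client("http://localhost:8080")

    # Schedule a contextual classification: Weaviate compares the vector of the
    # source property with the vectors of the candidate cross-reference targets
    # and assigns the closest one -- no training data is required.
    (
        client.classification
        .schedule()
        .with_type("text2vec-contextionary-contextual")  # module-provided type (assumed)
        .with_class_name("Comment")                      # illustrative class
        .with_based_on_properties(["text"])              # property whose vector is compared
        .with_classify_properties(["ofCategory"])        # cross-reference to fill in
        .with_wait_for_completion()
        .do()
    )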

Technology stack

  • Python
  • Weaviate
  • Streamlit

Used Weaviate modules/models

text2vec-contextionary (Contextionary) vectorizer
This vectorizer module is used to build 300-dimensional vectors using a Weighted Mean of Word Embeddings (WMOWE) technique. Contextionary is Weaviate's own language vectorizer that is trained using fastText on Wiki and CommonCrawl data.

To use it, you need to enable the module in the docker-compose file.
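
A minimal sketch of the relevant docker-compose configuration (the actual docker-compose.yml in this repository may differ in image versions and additional settings):

    services:
      weaviate:
        image: semitechnologies/weaviate:1.19.6
        ports:
          - "8080:8080"
        environment:
          AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: "true"
          ENABLE_MODULES: text2vec-contextionary
          DEFAULT_VECTORIZER_MODULE: text2vec-contextionary
          CONTEXTIONARY_URL: contextionary:9999
      contextionary:
        image: semitechnologies/contextionary:en0.16.0-v1.2.1
        environment:
          EXTENSIONS_STORAGE_MODE: weaviate
          EXTENSIONS_STORAGE_ORIGIN: http://weaviate:8080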

Prerequisites

  1. A Python 3 interpreter installed
  2. Ability to run docker compose (the most straightforward way on Windows/Mac is to install Docker Desktop)

Setup instructions

Start up

  1. Clone the repository

  2. Create a virtual environment and activate it

    python3 -m venv venv
    source venv/bin/activate
  3. Install all required dependencies

    pip install -r requirements.txt
  4. Run a containerized instance of Weaviate. The setup also includes the vectorizer module that computes the embeddings.

    Note: Make sure nothing else is occupying port 8080.
    If something is, either stop that process or change the port Weaviate uses.

    docker compose up
  5. Index the dataset in Weaviate (a sketch of what this script does is shown after this list)

    python add_data.py
  6. Run the Streamlit demo

    streamlit run app.py
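
For orientation, step 5 essentially creates a class that uses the contextionary vectorizer and batch-imports the CSV rows. The sketch below assumes hypothetical class, property, file, and column names; add_data.py is the authoritative version:

    import csv
    import weaviate

    client = weaviate.Client("http://localhost:8080")

    # Class whose "text" property is vectorized by text2vec-contextionary;
    # the binary label is stored alongside it (names are assumptions)
    client.schema.create_class({
        "class": "Comment",
        "vectorizer": "text2vec-contextionary",
        "properties": [
            {"name": "text", "dataType": ["text"]},
            {"name": "toxic", "dataType": ["boolean"]},
        ],
    })

    # Batch-import the dataset; column names depend on the CSV actually used
    client.batch.configure(batch_size=100)
    with open("train.csv", newline="") as f, client.batch as batch:
        for row in csv.DictReader(f):
            batch.add_data_object(
                {"text": row["comment_text"], "toxic": row["toxic"] == "1"},
                "Comment",
            )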

Run integration test

    ./test.sh

Shut down

  1. Both the Streamlit app and docker compose can be stopped with Ctrl+C in the corresponding terminal window
  2. To remove the created Docker containers and volumes, use

    docker compose down -v

Usage instructions

  1. Enter a comment
  2. Press the Classify button to see whether it is classified as toxic or not.

Dataset license

The dataset used for this example is available on Kaggle: https://www.kaggle.com/datasets/akashsuper2000/toxic-comment-classification
