BullyDetect (Techniques to Detect Cyberbullying)

My final year project at Multimedia University, Cyberjaya. It uses Natural Language Processing to detect cyberbullying in text comments through a combination of supervised and unsupervised learning.

Supervised Learning

The dataset used for classification is from Kaggle, and the following supervised machine learning algorithms were used (a minimal training sketch follows the list):

  • Random Forest (100 Trees)
  • Naive Bayes (Gaussian Model)
  • Support Vector Machines (Linear SVC)
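A minimal sketch of this setup is below. It assumes the Kaggle data is a CSV with hypothetical `comment` and `bully` columns and uses a simple bag-of-words representation; the project's actual file layout and feature extraction may differ.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Hypothetical file and column names -- adjust to the actual Kaggle dataset layout.
data = pd.read_csv("kaggle_comments.csv")
texts, labels = data["comment"], data["bully"]

# Simple bag-of-words features; kept dense because GaussianNB needs dense input.
X = CountVectorizer(max_features=5000).fit_transform(texts).toarray()

classifiers = {
    "Random Forest (100 trees)": RandomForestClassifier(n_estimators=100),
    "Naive Bayes (Gaussian)": GaussianNB(),
    "SVM (Linear SVC)": LinearSVC(),
}

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, labels, cv=10, scoring="accuracy")
    print("{}: mean accuracy {:.3f}".format(name, scores.mean()))
```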

KAGGLE DATASET LINK

EXTRA: Now that the main phases are over, the following approaches are being tried out:

  • Fine-tuning the parameters of the machine learning approaches
  • Trying out XGBoost (see the sketch below)
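A possible XGBoost try-out, reusing the `X` and `labels` features from the sketch above; the hyperparameters here are illustrative placeholders, not tuned values.

```python
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

# Illustrative hyperparameters -- tuning them would be part of the fine-tuning phase.
xgb = XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1)
scores = cross_val_score(xgb, X, labels, cv=10, scoring="accuracy")
print("XGBoost: mean accuracy {:.3f}".format(scores.mean()))
```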

Unsupervised Learning

The framework used is the Word2Vec Skip-Gram model. The model was trained on comments from the Reddit corpus, from January 2015 to May 2015. K-Means Clustering was also used in conjunction with Word2Vec. The skip-gram model is shown below:

[Figure: Skip-gram model]
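A minimal training sketch with Gensim (pre-4.0 API, where the vector dimension is passed as `size`). The tokenized comments and hyperparameters shown are placeholders, not the project's actual preprocessing or settings.

```python
from gensim.models import Word2Vec

# Placeholder for the iterable of tokenized Reddit comments pulled from MongoDB.
sentences = [
    ["this", "is", "a", "tokenized", "reddit", "comment"],
    ["another", "comment", "goes", "here"],
]

# sg=1 selects the Skip-Gram architecture; the remaining hyperparameters are
# illustrative defaults rather than the values used in the project.
model = Word2Vec(sentences, sg=1, size=300, window=5, min_count=1, workers=4)
model.save("reddit_skipgram.w2v")
```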

REDDIT CORPUS LIST

Methods used

Some of the main methods used are:

  • Average Words: The most basic approach. Sum the feature vectors of a comment's words, then divide by the number of words (see the sketch after this list).
  • Mean Similarity: Using only the feature vectors of words whose cosine similarity is above a mean threshold. The threshold is computed word-by-word by taking the top-n most similar words and averaging their similarities.
  • Word Feature: Using the mean feature of each specific word, provided it is in the model.
  • Clustering Word Vectors: Using K-Means Clustering to group related words into clusters.

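A rough sketch of the Average Words and Clustering Word Vectors ideas, assuming the Skip-Gram model saved above and a pre-4.0 Gensim API; the cluster count is illustrative, not the project's value.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

w2v = Word2Vec.load("reddit_skipgram.w2v")  # model from the training sketch above

def average_vector(tokens, model):
    """Average Words: sum the vectors of in-vocabulary words, divide by their count."""
    vectors = [model[w] for w in tokens if w in model.wv.vocab]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

comment_vector = average_vector(["another", "comment", "goes", "here"], w2v)

# Clustering Word Vectors: group the vocabulary into clusters with K-Means so a
# comment can be represented by the clusters its words fall into.
word_vectors = w2v.wv.syn0                         # pre-4.0 attribute for all word vectors
kmeans = KMeans(n_clusters=5).fit(word_vectors)    # illustrative cluster count
word_to_cluster = dict(zip(w2v.wv.index2word, kmeans.labels_))
```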
Some of the above methods can be combined with TF-IDF weights computed from the Kaggle dataset, as in the weighted-average sketch below.
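One way to combine Average Words with TF-IDF, using IDF weights learned from the Kaggle comments (`texts` from the supervised sketch) and the older scikit-learn API (`get_feature_names`); the weighting scheme is a sketch, not necessarily the exact combination used in the project.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# IDF weights learned from the Kaggle comments (`texts` from the supervised sketch).
tfidf = TfidfVectorizer().fit(texts)
idf_weight = dict(zip(tfidf.get_feature_names(), tfidf.idf_))

def tfidf_weighted_average(tokens, model, weights):
    """Average Words weighted by IDF: rarer words contribute more to the comment vector."""
    pairs = [(model[w], weights.get(w, 1.0)) for w in tokens if w in model.wv.vocab]
    if not pairs:
        return np.zeros(model.vector_size)
    vecs, ws = zip(*pairs)
    return np.average(np.array(vecs), axis=0, weights=ws)
```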

Evaluation and Results

The following evaluation metrics were used after cross-validation with Stratified 10-Fold Sampling: Accuracy, Precision, False Positive Rate (FPR), Area Under ROC, Log Loss, Brier Score Loss, and Run-Time Prediction. Because the dataset is skewed toward the non-bully class (about 75% non-bully comments), particular importance was placed on Precision, FPR, Brier Score Loss, and Run-Time Prediction. The results are divided into two Jupyter notebooks, based on two different datasets:

The evaluation of Word2Vec can be found here.
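A sketch of how the cross-validated evaluation might be wired up with scikit-learn, reusing `X` and `labels` from the supervised sketch; the per-fold metric set mirrors the list above.

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import (accuracy_score, precision_score, roc_auc_score,
                             log_loss, brier_score_loss, confusion_matrix)

y = labels.values                      # labels from the supervised sketch
clf = RandomForestClassifier(n_estimators=100)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

fold_metrics = []
for train_idx, test_idx in skf.split(X, y):
    clf.fit(X[train_idx], y[train_idx])
    start = time.time()
    pred = clf.predict(X[test_idx])
    prob = clf.predict_proba(X[test_idx])[:, 1]
    predict_time = time.time() - start

    tn, fp, fn, tp = confusion_matrix(y[test_idx], pred).ravel()
    fold_metrics.append({
        "Accuracy": accuracy_score(y[test_idx], pred),
        "Precision": precision_score(y[test_idx], pred),
        "FPR": fp / float(fp + tn),
        "Area Under ROC": roc_auc_score(y[test_idx], prob),
        "Log Loss": log_loss(y[test_idx], prob),
        "Brier Score Loss": brier_score_loss(y[test_idx], prob),
        "Run-Time Prediction (s)": predict_time,
    })

for name in fold_metrics[0]:
    print("{}: {:.3f}".format(name, np.mean([m[name] for m in fold_metrics])))
```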

Tools Used

Python 3.5+ was used as the scripting language, while MongoDB was used to store the comments from Reddit. Some of the main libraries used were:

  • Gensim: For Word2Vec.
  • Scikit-learn: For Machine Learning and Evaluation Metrics.
  • Regex: For handling character-level expressions in text (see the cleaning sketch below).
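For example, a basic character-level cleanup pass with `re` might look like the following; the exact cleaning rules used in the project may differ.

```python
import re

def clean_comment(text):
    """Basic character-level cleanup; the project's actual rules may differ."""
    text = text.lower()
    text = re.sub(r"http\S+", " ", text)       # drop URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # keep letters and whitespace only
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace

print(clean_comment("YOU are such a LOSER!!! http://example.com"))
# -> "you are such a loser"
```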