My final year project at Multimedia University, Cyberjaya. It uses Natural Language Processing to detect cyberbullying in text comments, combining supervised and unsupervised learning.
The dataset used for classification is from Kaggle, and the following supervised machine learning algorithms were used:
- Random Forest (100 Trees)
- Naive Bayes (Gaussian Model)
- Support Vector Machines (Linear SVC)
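For illustration, the three configurations above map onto scikit-learn roughly as follows. This is a sketch: only the settings named in the list are fixed, and everything else is left at its default.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC

# Only the settings named above are fixed; everything else is a default.
classifiers = {
    "Random Forest": RandomForestClassifier(n_estimators=100),  # 100 trees
    "Naive Bayes": GaussianNB(),                                # Gaussian model
    "Linear SVM": LinearSVC(),                                  # Linear SVC
}
```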
EXTRA: With the main project phases complete, the following approaches are currently being explored:
- Fine-tuning parameters of machine learning approaches
- Trying out XGBoost as an additional classifier
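A rough sketch of what these experiments could look like; the parameter grid and XGBoost settings below are illustrative assumptions, not the project's final values.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Hypothetical grid for fine-tuning the Random Forest.
grid_search = GridSearchCV(
    RandomForestClassifier(),
    param_grid={"n_estimators": [100, 200], "max_depth": [None, 10, 20]},
    scoring="precision",  # precision matters most given the class imbalance
    cv=5,
)

# XGBoost try-out with assumed settings.
xgb = XGBClassifier(n_estimators=100, learning_rate=0.1)
```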
The word-embedding framework used is Word2Vec's skip-gram model, trained on comments from the Reddit corpus spanning January 2015 to May 2015. K-Means clustering was also used in conjunction with Word2Vec.
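Training such a skip-gram model with Gensim looks roughly like this. The hyperparameter values are assumptions, and older Gensim versions name `vector_size` as `size`.

```python
from gensim.models import Word2Vec

# `sentences` is an iterable of tokenized Reddit comments,
# e.g. streamed out of the MongoDB comment store.
sentences = [["this", "is", "a", "comment"], ["another", "tokenized", "comment"]]

model = Word2Vec(
    sentences,
    sg=1,             # 1 = skip-gram (0 would be CBOW)
    vector_size=300,  # embedding dimensionality (assumed)
    window=5,
    min_count=1,      # a higher cut-off would be used on the full corpus
    workers=4,
)
model.save("reddit_word2vec.model")
```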
Some of the main methods used are:
- Average Words: The most basic approach. Sum the feature vectors of a comment's words, then divide by the number of words (see the sketch after this list).
- Mean Similarity: Keeping only the feature vectors of words whose cosine similarity is above a mean threshold. The threshold is computed word-by-word, by taking each word's top-n most similar words and averaging their similarities.
- Word Feature: Using the mean feature vector of each specific word, provided the word is in the model's vocabulary.
- Clustering Word Vectors: Using K-Means clustering to group related words into clusters.
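As referenced above, here is a minimal sketch of two of these feature builders: averaging word vectors, and a bag-of-centroids over K-Means clusters. The function names and the cluster count are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def average_words(tokens, model, dim=300):
    """Average Words: sum the vectors of in-vocabulary words, divide by their count."""
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def bag_of_centroids(tokens, model, kmeans):
    """Clustering Word Vectors: count how many of a comment's words land in each cluster."""
    counts = np.zeros(kmeans.n_clusters)
    for w in tokens:
        if w in model.wv:
            counts[kmeans.predict(model.wv[w].reshape(1, -1))[0]] += 1
    return counts

# Fit K-Means once on the whole vocabulary's vectors, then featurize comments:
# kmeans = KMeans(n_clusters=500).fit(model.wv.vectors)
# features = bag_of_centroids("some tokenized comment".split(), model, kmeans)
```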
Some of the above methods can be combined with TF-IDF weights computed from the Kaggle dataset.
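One way such a combination can look: weighting each word vector by its TF-IDF score before averaging. This is a sketch under assumed names (`model` is the trained Word2Vec model).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_weighted_average(comments, model, dim=300):
    """Average word vectors, weighting each word by its TF-IDF score."""
    tfidf = TfidfVectorizer()
    matrix = tfidf.fit_transform(comments)   # comments x vocabulary
    vocab = tfidf.vocabulary_                # word -> column index
    features = np.zeros((len(comments), dim))
    for i, comment in enumerate(comments):
        total_weight = 0.0
        for word in comment.split():
            if word in model.wv and word in vocab:
                weight = matrix[i, vocab[word]]
                features[i] += weight * model.wv[word]
                total_weight += weight
        if total_weight > 0:
            features[i] /= total_weight
    return features
```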
The following evaluation metrics were used, cross-validated with stratified 10-fold sampling: Accuracy, Precision, False Positive Rate (FPR), Area Under ROC, Log Loss, Brier Score Loss, and Prediction Run-Time. Because the dataset is skewed towards non-bully comments (about 75% of the data), particular importance was placed on Precision, FPR, Brier Score Loss, and Prediction Run-Time. The results are divided into two Jupyter notebooks, based on two different datasets (a cross-validation sketch follows this list):
- Balanced Dataset: Using an even number of bully and non-bully comments
- Imbalanced Dataset: Using the full dataset
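A sketch of how these metrics can be gathered with scikit-learn's stratified 10-fold machinery. Scikit-learn has no built-in FPR scorer, so one is hand-rolled here; `neg_brier_score` needs a reasonably recent scikit-learn version, and the probability-based metrics (Log Loss, Brier Score, ROC AUC) assume the classifier exposes `predict_proba` (LinearSVC does not).

```python
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_validate

def false_positive_rate(y_true, y_pred):
    """FPR = FP / (FP + TN), taken from the confusion matrix."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return fp / (fp + tn)

scoring = {
    "accuracy": "accuracy",
    "precision": "precision",
    "roc_auc": "roc_auc",
    "log_loss": "neg_log_loss",
    "brier": "neg_brier_score",
    "fpr": make_scorer(false_positive_rate, greater_is_better=False),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
# results = cross_validate(clf, X, y, cv=cv, scoring=scoring)
# results["fit_time"] and results["score_time"] cover the run-time measurements.
```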
The evaluation of the Word2Vec model itself can be found here.
Python 3.5+ was used as the scripting language, while MongoDB was used to store the comments from Reddit. Some of the main libraries used:
- Gensim: For Word2Vec.
- Scikit-learn: For Machine Learning and Evaluation Metrics.
- Regex: For handling character-level patterns when cleaning text (see the sketch below).
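As an illustration of the kind of regex-based cleaning applied before training; the specific patterns here are assumptions, not the project's exact rules.

```python
import re

def clean_comment(text):
    """Normalize a raw comment before tokenization (hypothetical rules)."""
    text = text.lower()
    text = re.sub(r"http\S+", " ", text)       # strip URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # keep letters and spaces only
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

tokens = clean_comment("Check THIS out!! http://example.com").split()
# -> ['check', 'this', 'out']
```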