Using text CNN and other methods to classify toxic comments


The dataset for this project came from the Jigsaw Toxic Comment Classification Challenge on Kaggle. I also participated in the challenge and submitted predictions, but the focus was on exploring CNNs on text and comparing their performance with other methods.


The challenge task was, given a comment, to predict probabilities for six toxic comment classes: toxic, severe_toxic, obscene, threat, insult, and identity_hate. A comment can fall into more than one of these classes, so the probability for each class is independent (in theory; in practice, most comments that fell under any other class were also labeled toxic). The evaluation metric was the mean of the area under the ROC curve across the six classes.
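The multi-label setup above, where each class gets its own independent probability, can be sketched with one binary classifier per class. This is a minimal illustration using scikit-learn's `OneVsRestClassifier` on toy data (the texts and labels are invented for the example, not from the actual dataset):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Column order follows the six challenge classes.
CLASSES = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Toy corpus; labels are multi-hot, so a comment can belong to several classes.
texts = [
    "you are awful and stupid",
    "have a nice day",
    "I will hurt you",
    "what a lovely article",
]
y = np.array([
    [1, 0, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
])

# One independent binary classifier per class -> one independent probability
# per class, matching the competition's multi-label formulation.
X = TfidfVectorizer().fit_transform(texts)
model = OneVsRestClassifier(LogisticRegression()).fit(X, y)

probs = model.predict_proba(X)  # shape (n_comments, 6), each entry in [0, 1]
print(probs.shape)              # (4, 6)
```

The neural models in this repo achieve the same effect with six sigmoid output units instead of six separate classifiers.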


Note: Higher is better. The practical range is from 0.5 to 1, with 0.5 being a random coin toss and 1 being perfect classification.

| Model | Mean Area Under ROC Curve |
| --- | --- |
| Bidirectional LSTM | 0.9686 |
| NB-SVM + Bigram Features | 0.9762 |
| Text CNN with 2 fully connected layers | 0.9745 |
| Optimized text CNN | 0.9771 |