System Submission for SemEval-2019 Task 6: OffensEval (https://competitions.codalab.org/competitions/20011)
Abstract: A vote-based ensemble classifier for offensive language detection, trained on the OLID dataset (https://scholar.harvard.edu/malmasi/olid). A simple LSTM network is also included to compare performance against deep learning (DL) methods.
Files:
- proto.py - Ensemble model approach
- LSTM.ipynb - Deep Learning Approach (Rudimentary Model)
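
To illustrate what "vote based" means here, the following is a minimal sketch of hard (majority) voting over the label predictions of several classifiers. It is not the code from proto.py: the function name `majority_vote` and the tie-breaking rule are assumptions made for illustration only.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one label list by hard voting.

    predictions: list of equal-length label lists, one list per classifier.
    Ties go to the tied label that appears first in classifier order
    (an arbitrary choice made for this sketch).
    """
    if not predictions:
        raise ValueError("need predictions from at least one classifier")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all classifiers must predict the same number of samples")

    combined = []
    for votes in zip(*predictions):  # votes = one label per classifier
        counts = Counter(votes)
        top = max(counts.values())
        # Pick the first label, in classifier order, that reaches the top count.
        combined.append(next(v for v in votes if counts[v] == top))
    return combined


if __name__ == "__main__":
    # Three hypothetical classifiers voting on three tweets.
    preds = [
        ["OFF", "NOT", "OFF"],
        ["OFF", "OFF", "NOT"],
        ["NOT", "NOT", "NOT"],
    ]
    print(majority_vote(preds))
```

With an odd number of classifiers and two labels (as in subtasks a and b) ties cannot occur; the tie-break only matters for an even number of voters or for the three-way subtask c.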
Resources Required:
- CMU POS Tagger (http://www.cs.cmu.edu/~ark/TweetNLP/)
- OLID Training Data (https://scholar.harvard.edu/malmasi/olid)
- GloVe Embeddings for the LSTM (the current version uses the Twitter variant: 2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors) (https://nlp.stanford.edu/projects/glove/)
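
GloVe vectors are distributed as plain text, one token per line followed by its vector components separated by spaces. A minimal sketch of reading such a file into a dictionary is shown below. The helper name `load_glove` and the file name in the comment are assumptions, and LSTM.ipynb may load the embeddings differently.

```python
import io


def load_glove(handle, dim):
    """Parse GloVe text format from an open text file.

    Each line is expected to hold a token followed by `dim` floats.
    Lines with a different number of fields are skipped.
    """
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # malformed or unexpected line
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


if __name__ == "__main__":
    # Tiny in-memory stand-in for a real embeddings file.
    sample = io.StringIO("hello 0.1 0.2\nworld 0.3 0.4\n")
    print(load_glove(sample, dim=2))

    # With a real download (path is an assumption, adjust to your setup):
    # with open("glove.twitter.27B.100d.txt", encoding="utf-8") as f:
    #     embeddings = load_glove(f, dim=100)
```

The `dim` argument must match the variant that was downloaded (25, 50, 100, or 200).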
Getting the code Ready:
- Set the global variables filename and test_filename to your dataset paths
- Download ARK Tweet NLP (https://bit.ly/33x2WJT) and extract it into the code directory; also download the Python wrapper CMUTweetTagger.py (https://github.com/ianozsvald/ark-tweet-nlp-python/blob/master/CMUTweetTagger.py), which is used as a library
Changing Subtasks/Running the code:
- Use the terminal command: python3 proto.py q
- Replace q with a, b, or c to run the corresponding subtask
- Change test_filename when generating submission predictions
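
The subtask switch above can be pictured with the sketch below, which maps the command-line letter to a label column and keeps only the rows labelled for that subtask. This is an illustration rather than the code in proto.py: the helper names are hypothetical, and the column names and the `NULL` placeholder are assumptions based on the layout of the OLID training TSV.

```python
import csv
import io

# Assumed OLID column names: id, tweet, subtask_a, subtask_b, subtask_c.
SUBTASK_COLUMNS = {"a": "subtask_a", "b": "subtask_b", "c": "subtask_c"}


def parse_subtask(argv):
    """Return the subtask letter from argv, e.g. ['proto.py', 'b'] -> 'b'."""
    if len(argv) != 2 or argv[1].lower() not in SUBTASK_COLUMNS:
        raise SystemExit("usage: python3 proto.py {a|b|c}")
    return argv[1].lower()


def read_rows(handle):
    """Read a tab-separated file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))


def select_subtask(rows, subtask):
    """Return (tweet, label) pairs for rows that carry a label for the subtask."""
    column = SUBTASK_COLUMNS[subtask]
    # Rows outside the subtask are assumed to be marked NULL.
    return [(r["tweet"], r[column]) for r in rows if r[column] != "NULL"]


if __name__ == "__main__":
    # Two made-up rows in the assumed OLID layout.
    sample = io.StringIO(
        "id\ttweet\tsubtask_a\tsubtask_b\tsubtask_c\n"
        "1\texample one\tOFF\tTIN\tIND\n"
        "2\texample two\tNOT\tNULL\tNULL\n"
    )
    rows = read_rows(sample)
    print(select_subtask(rows, parse_subtask(["proto.py", "b"])))
```

Filtering on the label column matters for subtasks b and c, which are only annotated for a subset of the tweets.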
Detailed System Description: https://www.aclweb.org/anthology/papers/S/S19/S19-2124/