As social platforms have become widely accessible, more and more people are accustomed to posting opinions on various topics online. Negative online behaviors, such as hateful comments, are unfortunately unavoidable. These platforms have thus become prolific sources for hate detection, motivating many researchers to apply various techniques to detect hateful users or hateful speech.
This project investigates content from Reddit. Its goal is to distinguish hateful posts from normal ones, which not only enables platforms to improve user experience but also helps maintain a positive online environment.
The project is mainly built upon the following packages:
- Data Preprocessing & Feature Extraction
- Labeling & ML Deployment
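As a rough, self-contained illustration of the labeling stage (synthetic data and scikit-learn stand in for the project's actual pipeline), a binary classifier over post feature vectors might look like:

```python
# Hypothetical sketch: classifying posts as hateful (1) vs. normal (0)
# from fixed-length feature vectors. The data here is synthetic; the
# real project derives features from Reddit content and a heterogeneous
# information network.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Two synthetic clusters standing in for "normal" and "hateful" posts.
X = np.vstack([rng.normal(0.0, 1.0, (100, 8)), rng.normal(3.0, 1.0, (100, 8))])
y = np.array([0] * 100 + [1] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
print(round(clf.score(X_te, y_te), 2))  # held-out accuracy
```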
You can build a Docker image from the provided Dockerfile:

```
docker build .  # This will build using the same env as in a)
```

Run a container, replacing `<container_id_or_tag>` with the output of the previous command:

```
docker run -it -p 8888:8888 -p 8787:8787 <container_id_or_tag>
```

The above command prints a URL (like `http://(container_id or 127.0.0.1):8888/?token=`) which can be used to access the notebook from a browser. You may need to replace the given hostname with `localhost` or `127.0.0.1`.
- Modify the config file located in `config/data-params.json`. For testing, use `config/test-params.json`. You may define an output root `[data-path]` under `config/data-params.json`.
- HinReddit's ETL process uses the Python script `run.py` with the target `data[-test]`.
- You may replace the `nlp_model.zip` file with custom NLP labeling rules.
- The ETL results will be placed under the `<data-path>/raw` and `<data-path>/interim/label` directories.
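For instance, a minimal `config/data-params.json` might look like the following (the value of `data-path` is illustrative; any writable directory works):

```json
{
    "data-path": "data"
}
```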
- HinReddit's graph process uses the Python script `run.py` with the target `graph[-test]`.
- The graph process results will be under `<data-path>/interim/graph/*.mat`.
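Once the graph step finishes, the `.mat` outputs can be inspected with SciPy. The sketch below writes its own stand-in file; the key name `A` is an assumption, and `loadmat` will reveal whatever keys a real file contains:

```python
# Hypothetical sketch: writing and inspecting a .mat file like those the
# graph step leaves under <data-path>/interim/graph/.
import numpy as np
from scipy.io import loadmat, savemat

savemat("example_graph.mat", {"A": np.eye(3)})  # stand-in output file
mat = loadmat("example_graph.mat")
data_keys = [k for k in mat if not k.startswith("__")]
print(data_keys)  # -> ['A']
```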
- Modify the config files located in `config/embedding/graph_<1/2>/[test-]<informax/metapath2vec/node2vec>.json` for the corresponding parameters of the embedding models.
- HinReddit's embedding process uses the Python script `run.py` with the following targets:
  - `node2vec[-test]`: for node2vec embedding.
  - `metapath2vec[-test]`: for metapath2vec embedding.
  - `infomax[-test]`: for deep graph infomax (DGI) embedding.
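To clarify what the `node2vec` target computes, here is a toy, stdlib-only sketch of the biased second-order random walk at the heart of node2vec (the project relies on a library implementation; `p` and `q` are the return and in-out parameters from the node2vec paper):

```python
# Toy node2vec-style biased random walk over an adjacency-list graph.
import random

def node2vec_walk(graph, start, length, p=1.0, q=1.0, seed=0):
    rng = random.Random(seed)
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = graph[cur]
        if not nbrs:
            break  # dead end: stop early
        if len(walk) == 1:
            walk.append(rng.choice(nbrs))
            continue
        prev = walk[-2]
        # Unnormalized transition weights: 1/p to return to prev,
        # 1 to stay at distance 1 from prev, 1/q to move further away.
        weights = []
        for nxt in nbrs:
            if nxt == prev:
                weights.append(1.0 / p)
            elif nxt in graph[prev]:
                weights.append(1.0)
            else:
                weights.append(1.0 / q)
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk

graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
walk = node2vec_walk(graph, 0, 6)
print(walk)
```

Walks like these are then fed to a skip-gram model to produce the node embeddings.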
- Run:

```
$ python run.py data[-test] graph[-test] node2vec[-test] metapath2vec[-test] infomax[-test]
```
- You can find a detailed explanation of the configuration arguments here.
A Development Guide is provided under `./writeups/DEVGUIDE.md`.
Authors contributed equally to this project.
@inproceedings{Hou/Ye/2017,
title={{HinDroid}: An Intelligent Android Malware Detection System Based on Structured Heterogeneous Information Network},
author={Hou, Shifu and Ye, Yanfang and Song, Yangqiu and Abdulhayoglu, Melih},
booktitle={Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
year={2017}
}
@inproceedings{Fey/Lenssen/2019,
title={Fast Graph Representation Learning with {PyTorch Geometric}},
author={Fey, Matthias and Lenssen, Jan E.},
booktitle={ICLR Workshop on Representation Learning on Graphs and Manifolds},
year={2019},
}
@article{turc2019,
title={Well-Read Students Learn Better: On the Importance of Pre-training Compact Models},
author={Turc, Iulia and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
journal={arXiv preprint arXiv:1908.08962v2},
year={2019}
}
@misc{Koutra/2018,
title={Mining Large-scale Graph Data},
author={Danai Koutra},
howpublished={\url{http://web.eecs.umich.edu/~dkoutra/courses/W18_598/}},
year={2018}
}
@misc{src-d/2019,
title={Awesome Machine Learning On Source Code},
author={src-d},
howpublished={\url{https://github.com/src-d/awesome-machine-learning-on-source-code}},
year={2019}
}