This project is done as part of the course Advanced Data Mining in the fall of 2019 (currently a work-in-progress).
Abstract: Because of their wide reach, online social networks are a frequent target for spam, scams, malware distribution and, more recently, state-actor propaganda. In this paper, we review a number of recent approaches to fake account and bot classification. Based on this review and our experiments, we propose our own method, which leverages the social graph's topology and the differences between the ego graphs of legitimate and fake user accounts to improve identification of the latter. We evaluate our approach against other common approaches on a real-world dataset of Twitter users.
Keywords: Fake account detection, social graph, network topology, neighborhood aggregation
Requires Python3 as well as a working installation of pip3 and venv for Python3 (on Ubuntu you can get this via "sudo apt install python3-venv"). Other requirements can be installed via the makefile. On some systems it may be necessary to install tkinter to display matplotlib plots ("sudo apt install python3-tk" on Ubuntu).
Use the makefile to install the requirements. A Python3 virtualenv will be installed in ".env". See below for a description of what the makefile can do for you:
make help        show this message
make clean       remove intermediate files and clean the directory
make install     make a virtualenv in the base directory and install requirements
make paper.pdf   build the paper (recommended)
make pdf         compile the paper's .tex source using latexmk
make pdflatex    compile the paper's .tex source using pdflatex, might require multiple runs
make all.tar     creates a .tar ready for distribution
To run the project, activate the virtualenv using the command "source .env/bin/activate". You can then run any of our project files using Python3. To leave the virtualenv, use the command "deactivate".
Files:
baseline.py               Trains and evaluates the baseline models
rf.py                     Trains and evaluates our random forest classifier
rf_minimal.py             Trains and evaluates a minimal random forest classifier using only three features
nn.py                     Trains and evaluates our neural network classifier (NF + GF)
nn_baseline.py            Trains and evaluates the baseline neural network model
nn_nf.py                  Trains and evaluates our neural network classifier (only NF)
paper/                    Contains the LaTeX source files to build the paper
data/                     Contains datasets
data/train_baseline.csv   Dataset with the features retrieved using the Twitter API, used to train the baseline
data/train.csv            Dataset for neighborhood features
data/train_graph.csv      Dataset containing all features used in the paper
4.6 million accounts
~13,000 labelled accounts
Below are some interesting aggregate distributions over a node's predecessors and successors in the social graph. For the full version, please read our paper.
Median reputation of predecessors
Median in-degree of predecessors
Median in-degree of successors
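To make the aggregates listed above more concrete, here is a minimal sketch of how a median in-degree over a node's predecessors and successors could be computed. It is not the code used in the paper: the function names, the toy edge list and the edge direction (an edge (u, v) is assumed to mean "u follows v") are all assumptions made for illustration, and it uses only the Python standard library.

```python
from collections import defaultdict
from statistics import median


def build_graph(edges):
    """Build predecessor and successor maps from directed edges (u, v).

    Assumption: an edge (u, v) means "u follows v", so u is a
    predecessor of v and v is a successor of u.
    """
    preds = defaultdict(set)
    succs = defaultdict(set)
    for u, v in edges:
        succs[u].add(v)
        preds[v].add(u)
    return preds, succs


def median_in_degree(nodes, preds):
    """Median in-degree over the given nodes, or None if there are none."""
    degrees = [len(preds.get(n, ())) for n in nodes]
    return median(degrees) if degrees else None


def neighborhood_features(node, preds, succs):
    """Aggregate the in-degree over a node's predecessors and successors."""
    return {
        "median_in_degree_predecessors": median_in_degree(preds.get(node, ()), preds),
        "median_in_degree_successors": median_in_degree(succs.get(node, ()), preds),
    }


# Hypothetical toy graph, not taken from the dataset.
edges = [("a", "c"), ("b", "c"), ("c", "d"), ("a", "b"), ("d", "b")]
preds, succs = build_graph(edges)
features_c = neighborhood_features("c", preds, succs)
features_a = neighborhood_features("a", preds, succs)
print(features_c)
print(features_a)
```

A node without predecessors (such as "a" in the toy graph) yields None for the predecessor aggregate in this sketch; a real pipeline would have to decide how to encode such missing values before training a classifier.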
UNIX/Linux: type 'make' to build the paper/main.pdf file.
Windows: use a LaTeX distribution to build the paper/main.pdf file.