This repository aims to provide a set of tools for data-driven media studies on the gab.ai platform.
These tools require Python 3 and access to a MongoDB server. On a Debian system, run:
sudo apt-get install python3-pymongo python3-igraph python3-nltk python3-scipy mongodb-server
pip3 install hatesonar gensim pyLDAvis
The minegab.py script is meant for scraping data from the gab.ai platform. All scraped data is stored in MongoDB for further parsing/analysis.
Scraping data from gab.ai starts at a particular account, whose username has to be manually provided to the script:
./minegab.py -u <username>
From there, the script will discover other accounts through reposts, follow-relations, comments, and quotes. Once the first account has been processed, the -a parameter will tell the script to scrape data from all the discovered accounts. In doing so, more accounts will likely be discovered:
./minegab.py -a
Keep running the script with -a until no new accounts are discovered; at that point the giant graph within gab.ai has been scraped. With the -d flag, minegab.py gives verbose output. This output may contain special characters that your terminal cannot print, so set the I/O encoding first:
export PYTHONIOENCODING=UTF-8
./minegab.py -da
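The repeated -a runs amount to a fixed-point iteration over the set of known accounts. The sketch below illustrates that idea on a toy in-memory graph. It is not taken from minegab.py: TOY_LINKS, scrape_account, and crawl are hypothetical names, and the link table merely stands in for reposts, follow-relations, comments, and quotes.

```python
# Toy illustration of "run -a until no new accounts are discovered".
# None of these names come from minegab.py; they are made up for the sketch.
TOY_LINKS = {
    "alice": ["bob", "carol"],
    "bob": ["alice"],
    "carol": ["dave"],
    "dave": [],
    "erin": ["frank"],   # not connected to alice in any way
    "frank": ["erin"],
}


def scrape_account(username):
    """Pretend to scrape one account and return the accounts it links to."""
    return TOY_LINKS.get(username, [])


def crawl(seed):
    """Repeat scraping passes until a pass discovers no new accounts."""
    known = {seed}       # accounts discovered so far
    scraped = set()      # accounts already processed
    passes = 0
    while known - scraped:
        passes += 1
        # Each pass handles every account that is known but not yet scraped.
        for username in sorted(known - scraped):
            scraped.add(username)
            known.update(scrape_account(username))
    return known, passes


accounts, passes = crawl("alice")
print(sorted(accounts), passes)
```

Starting from alice, the toy crawl needs three passes and never reaches erin or frank, because nothing links them to the seed account.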
To keep a logfile of the scraping, you could use the following command:
./minegab.py -a | tee -a ./scrapelog.txt
To re-scrape an account, first remove it from the profiles collection, then scrape it again:
./minegab.py -d <username> ; ./minegab.py -u <username>
To scrape the news section, simply run:
./minegab.py -n
Performance will increase when multiple scrapers are run simultaneously. Ideally, the scrapers would use different outbound IP addresses to decrease the impact of rate-limiting, but performance is already greatly improved when running multiple scrapes from the same node. Note that running scrapers from multiple nodes requires replication of the MongoDB backend.
The minegab.py script cannot scrape beyond the giant graph of which the manually provided accounts are a part. It will not find other communities if they are completely isolated from the accounts provided to the script.
Furthermore, the minegab.py script does not retrieve any media content. It will store links to media assets in the database, which could be used as an input for a downloading script, but this functionality is not provided by the script. Note that scraping all media content will require considerable bandwidth and storage capacity.
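A downloading script would first have to collect the stored links. The sketch below is one possible approach, assuming that post documents are nested dicts and lists and that media links can be recognised by file extension. The function name collect_media_urls, the extension list, and the field names in the example document are all hypothetical, not taken from the scraped data.

```python
from urllib.parse import urlparse

# Assumption: media links are recognisable by their file extension.
MEDIA_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".mp4", ".webm")


def collect_media_urls(document):
    """Recursively collect media-looking http(s) URLs from a nested dict/list."""
    found = []
    if isinstance(document, dict):
        for value in document.values():
            found.extend(collect_media_urls(value))
    elif isinstance(document, list):
        for item in document:
            found.extend(collect_media_urls(item))
    elif isinstance(document, str):
        parsed = urlparse(document)
        # Compare only the path, so query strings do not hide the extension.
        if parsed.scheme in ("http", "https") and parsed.path.lower().endswith(MEDIA_EXTENSIONS):
            found.append(document)
    return found


# Made-up document; the field names are illustrative only.
example_post = {
    "body": "look at this",
    "attachment": {"type": "media", "value": [{"url_full": "https://example.org/media/abc.JPG"}]},
    "links": ["https://example.org/article", "https://example.org/clip.mp4?x=1"],
}
media_urls = collect_media_urls(example_post)
print(media_urls)
```

Walking the whole document avoids depending on specific field names, at the cost of also picking up media-looking links that appear in unrelated fields.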
Finally, the 'groups' section of gab is mostly ignored. Group metadata is shown in the posts, but group membership is not scraped.
The gabcommunities.py script reads a GraphML file generated by the gabgraph.py script. It detects communities and can write its results to file as well as to MongoDB.
Usage:
./gabcommunities.py -i <graphml file> [-n <community type>] [-p] [-o output directory]
The script gives the modularity score as output on the command line.
If the -p parameter is given, the script will calculate the pagerank for each user within the detected community.
If the -n parameter is given, user profiles in MongoDB will be enriched with the community id and, if -p was also given, the pagerank. The parameter expects a name for the edge type the community is based on, e.g., follow, quote, repost, or comment. Values will be written under the communities attribute of the user profile.
If the -o parameter is given, an output directory will be created and a GraphML file for each detected community will be written to this directory. The filenames match the 'id' field written to MongoDB if the -n parameter was given.
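The modularity score printed on the command line measures how much denser the connections inside the detected communities are than would be expected by chance. The sketch below computes it in plain Python for an undirected, unweighted toy graph. It is only an illustration of the measure: the function name and toy data are made up, and whether gabcommunities.py treats edges as directed or weighted is not stated here.

```python
def modularity(edges, membership):
    """Newman modularity of an undirected, unweighted graph.

    edges: list of (u, v) pairs; membership: dict mapping node -> community id.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = {}   # community -> number of edges with both ends inside it
    degree = {}     # community -> summed degree of its nodes
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    # Q = sum over communities of (internal edge fraction - expected fraction).
    return sum(
        internal.get(c, 0) / m - (degree[c] / (2 * m)) ** 2
        for c in degree
    )


# Two triangles joined by a single bridge edge (c, d).
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
toy_membership = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
q = modularity(toy_edges, toy_membership)
print(round(q, 4))
```

Putting every node in one community gives a score of 0, so a clearly positive value indicates that the split captures real structure.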
Once you are done with all community detection, run com2posts.py to copy the community metadata from the profiles collection to the actuser attribute of every post and the user attribute of every comment.
The gabgroups.py script will gather all group metadata found in the scraped posts and fill a MongoDB collection named groups. It will also add a post count to the metadata.
By default, gabgroups.py will only consider original posts. Use the -r parameter to also include reposts in the gathering of groups and counting of posts.
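The gathering and counting can be pictured as follows. This is a sketch on plain dicts, not the code of gabgroups.py: the field names group, id, title, and is_repost, as well as the post_count key, are assumptions made for the illustration.

```python
def gather_groups(posts, include_reposts=False):
    """Collect group metadata from posts and count the posts per group.

    include_reposts mirrors the idea behind the -r parameter.
    """
    groups = {}
    for post in posts:
        group = post.get("group")
        if not group:
            continue                      # post was not made in a group
        if post.get("is_repost") and not include_reposts:
            continue                      # default: original posts only
        # First sighting of a group copies its metadata and starts the count.
        entry = groups.setdefault(group["id"], dict(group, post_count=0))
        entry["post_count"] += 1
    return groups


# Made-up posts; field names are illustrative only.
sample_posts = [
    {"group": {"id": "g1", "title": "News"}, "is_repost": False},
    {"group": {"id": "g1", "title": "News"}, "is_repost": True},
    {"group": None, "is_repost": False},
    {"group": {"id": "g2", "title": "Tech"}, "is_repost": True},
]
originals_only = gather_groups(sample_posts)
with_reposts = gather_groups(sample_posts, include_reposts=True)
print(originals_only)
print(with_reposts)
```

In this toy data the second group only appears when reposts are included, which is the kind of difference the -r parameter can make.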
The gabhate.py script uses HateSonar to detect hate speech and offensive speech in all English posts and comments. Other languages are not supported. The classification and its confidence are stored in the hateometer attribute of all affected posts and comments.
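For reference, a single classification with HateSonar can look like the sketch below. It assumes the Sonar().ping(text=...) call and the result layout described in the HateSonar README. The summarise and classify helpers and the class and confidence keys are hypothetical, since the exact layout of the hateometer attribute is not described here, and the numbers in the sample result are made up.

```python
def summarise(result):
    """Reduce a Sonar.ping()-style result to its top class and confidence."""
    top = result["top_class"]
    confidence = next(
        c["confidence"] for c in result["classes"] if c["class_name"] == top
    )
    return {"class": top, "confidence": confidence}


def classify(text):
    """Classify one English text with HateSonar (needs the hatesonar package)."""
    from hatesonar import Sonar  # imported lazily; loading the model takes a while
    return summarise(Sonar().ping(text=text))


# Hand-written result in the shape ping() is assumed to return; values are made up.
sample_result = {
    "text": "example text",
    "top_class": "neither",
    "classes": [
        {"class_name": "hate_speech", "confidence": 0.05},
        {"class_name": "offensive_language", "confidence": 0.15},
        {"class_name": "neither", "confidence": 0.80},
    ],
}
summary = summarise(sample_result)
print(summary)
```

When classifying many texts, creating the Sonar object once and reusing it avoids reloading the model for every call.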
The gabtopics.py script uses LDA modelling to generate topics for a specific community. It will output plaintext as well as generate a visualisation in HTML. Be sure to have run com2posts.py first. Usage:
./gabtopics.py -l [language] -e [edgetype] -c [community id] -t [number of topics] -o [output file]
Currently only English, Dutch, and German are supported. Note that running this script on larger communities requires serious computational resources, in particular a lot of memory.
The gabactivity.py script will export a CSV with, per month, the number of active users, posts, reposts, and comments. Use -o to export to a specific filename; by default the export is written to gabactivity.csv.
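The per-month aggregation can be sketched as follows. This is an illustration with in-memory tuples rather than MongoDB documents. The function name, the event layout, and the column names are assumptions, and a user is counted as active here if they posted, reposted, or commented in that month.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime


def monthly_activity(events):
    """events: iterable of (timestamp, username, kind), kind in post/repost/comment."""
    months = defaultdict(lambda: {"users": set(), "post": 0, "repost": 0, "comment": 0})
    for timestamp, username, kind in events:
        bucket = months[timestamp.strftime("%Y-%m")]
        bucket["users"].add(username)     # a set, so each user counts once per month
        bucket[kind] += 1
    return [
        [month, len(b["users"]), b["post"], b["repost"], b["comment"]]
        for month, b in sorted(months.items())
    ]


# Made-up events for illustration.
sample_events = [
    (datetime(2018, 1, 3), "alice", "post"),
    (datetime(2018, 1, 15), "alice", "comment"),
    (datetime(2018, 1, 20), "bob", "repost"),
    (datetime(2018, 2, 1), "bob", "post"),
]
rows = monthly_activity(sample_events)

# Write to an in-memory buffer; a real export would open a file instead.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["month", "active_users", "posts", "reposts", "comments"])
writer.writerows(rows)
print(buffer.getvalue())
```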
The gabgraph.py script will export to a GraphML file for further processing with, for instance, iGraph or (if you have a powerful desktop) Gephi. It supports four edge types: follow edges, repost edges, quote edges, and comment edges. To see all possible parameters, run:
./gabgraph.py -h
Note that the language attribute is taken from Gab itself; take these values with a grain of salt.
The groups2csv.py script will export group metadata to a CSV file. Use -o to export to a specific filename; by default the export is written to gabgroups.csv.
The export is a comma-separated CSV in which fields are delimited by single quotes.
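Because the fields are quoted with single quotes rather than the usual double quotes, a CSV reader has to be told about the quote character. The sketch below does a round trip with Python's csv module. The column names and values are made up, and the assumption that a single quote inside a field is escaped by doubling it is exactly that, an assumption about the export.

```python
import csv
import io

# Made-up rows; the real export's columns are not listed here.
original_rows = [
    ["id", "title", "post_count"],
    ["1", "News, politics", "42"],
    ["2", "Patriots' corner", "7"],
]

# Write comma-separated fields, every field wrapped in single quotes.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, quotechar="'", quoting=csv.QUOTE_ALL)
writer.writerows(original_rows)
text = buffer.getvalue()
print(text)

# Read it back; without quotechar="'" the comma in the title would split the field.
reader = csv.reader(io.StringIO(text, newline=""), quotechar="'")
parsed_rows = list(reader)
print(parsed_rows)
```

The same quotechar argument applies when loading the export with other tools that build on the csv module's conventions.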
The gabhashtags.py script will export a sorted list of all hashtags used in posts and comments on gab, including a count of how often each was used. Use -o to export to a specific filename; by default the export is written to gabhashtags.csv.
The export is a comma-separated CSV in which fields are delimited by single quotes.
Note that no weighting is applied in the hashtag count.
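An unweighted count is read here as: every occurrence of a hashtag adds one to its total, regardless of who posted it or how often that text was reposted. The sketch below shows such a count on plain strings. The regular expression and the choice to fold case are assumptions, not taken from gabhashtags.py.

```python
import re
from collections import Counter

# Assumption: a hashtag is "#" followed by word characters.
HASHTAG = re.compile(r"#(\w+)")


def count_hashtags(texts):
    """Count every hashtag occurrence once, sorted from most to least used."""
    counts = Counter()
    for text in texts:
        # Case folding is a choice made for this sketch.
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts.most_common()


# Made-up texts standing in for post and comment bodies.
sample_texts = ["#MAGA #maga now", "nothing here", "#news and #MAGA"]
hashtag_counts = count_hashtags(sample_texts)
print(hashtag_counts)
```

A weighted variant could, for example, count each hashtag at most once per user, which would dampen the influence of very prolific accounts.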
The gabhatestats.py script will output statistics on the overall amount of hate speech and offensive speech detected by the gabhate.py script, as well as statistics per community detected by the gabcommunities.py script. Beware that these statistics only account for English posts and comments.