
Gabber - data-analysis tools for gab.ai

This repository aims to provide a set of tools for data-driven media studies on the gab.ai platform.

Requirements

These tools require python3 and access to a MongoDB server. On a Debian system, run:

sudo apt-get install python3-pymongo python3-igraph python3-nltk python3-scipy mongodb-server
pip3 install hatesonar gensim pyLDAvis
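
To verify that the dependencies and the MongoDB server are in place before scraping, a quick check along these lines can help (a minimal sketch, assuming MongoDB listens on the default localhost:27017):

#!/usr/bin/env python3
# Sanity check: import the analysis libraries and ping MongoDB.
from pymongo import MongoClient
import igraph
import nltk
import scipy
client = MongoClient("localhost", 27017, serverSelectionTimeoutMS=2000)
client.admin.command("ping")  # raises ServerSelectionTimeoutError if unreachable
print("MongoDB reachable, libraries importable")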

Scraping

The minegab.py script is meant for scraping data from the gab.ai platform. All scraped data is stored in MongoDB for further parsing/analysis.

Usage

Scraping data from gab.ai starts at a particular account, whose username has to be manually provided to the script:

./minegab.py -u <username>

From there, the script will discover other accounts through reposts, follow-relations, comments, and quotes. Once the first account has been processed, the -a parameter will tell the script to scrape data from all the discovered accounts. In doing so, more accounts will likely be discovered:

./minegab.py -a

Keep running the script with -a until no new accounts are discovered. The giant graph within gab.ai has now been scraped. The minegab.py script gives verbose output with the -d flag. Note that this output might contain special characters that are problematic to print on some terminals:

export PYTHONIOENCODING=UTF-8
./minegab.py -da

To keep a logfile of the scraping, you could use the following command:

./minegab.py -a | tee -a ./scrapelog.txt

To re-scrape an account, first remove it from the profiles collection, and then scrape it again:

./minegab.py -d <username> ; ./minegab.py -u <username>
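
The removal can also be done by hand with pymongo. A sketch, where the database name gabber and the username field are assumptions; check minegab.py for the names it actually uses:

#!/usr/bin/env python3
# Manually drop one account from the profiles collection.
from pymongo import MongoClient
db = MongoClient()["gabber"]  # hypothetical database name
result = db.profiles.delete_one({"username": "someuser"})  # field name assumed
print("removed", result.deleted_count, "profile(s)")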

To scrape the news section, simply run:

./minegab.py -n

Performance

Performance will increase when multiple scrapers are run simultaneously. Ideally, the scrapers would use different outbound IP addresses to decrease the impact of rate limiting, but running multiple scrapers from the same node already improves performance considerably. Note that running scrapers from multiple nodes requires replication of the MongoDB backend.

Limitations

The minegab.py script cannot scrape beyond the giant graph of which the manually provided accounts are a part. It will not find other communities if they are completely isolated from the accounts provided to the script.

Furthermore, the minegab.py script does not retrieve any media content. It will store links to media assets in the database, which could be used as an input for a downloading script, but this functionality is not provided by the script. Note that scraping all media content will require considerable bandwidth and storage capacity.
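
As a starting point for such a downloading script, something along these lines could work. The database name, the media field on posts, and the URL layout are all assumptions here and need to be adapted to the actual schema written by minegab.py:

#!/usr/bin/env python3
# Sketch of a media downloader fed by the stored links.
import os
import urllib.request
from pymongo import MongoClient
db = MongoClient()["gabber"]  # hypothetical database name
os.makedirs("media", exist_ok=True)
for post in db.posts.find({"media": {"$exists": True}}):  # hypothetical field
    for url in post["media"]:
        target = os.path.join("media", url.rsplit("/", 1)[-1])
        if not os.path.exists(target):
            urllib.request.urlretrieve(url, target)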

Finally, the 'groups' section of Gab is mostly ignored. Group metadata is shown in the posts, but group membership is not scraped.

Processing

Communities

The gabcommunities.py script reads a GraphML file generated by the gabgraph.py script. It detects communities and can write its output to file as well as to MongoDB.

Usage:

./gabcommunities.py -i <graphml file> [-n <community type>] [-p] [-o <output directory>]

The script gives the modularity score as output on the command line.

If the -p parameter is given, the script will calculate the pagerank for each user within the detected communities.

If the -n parameter is given, user profiles in MongoDB will be enriched with the community id and optionally the pagerank. The parameter expects a name for the edge type the community is based on, e.g., follow, quote, repost, or comment. Values will be written under the communities attribute of the user profile.

If the -o parameter is given, an output directory will be created and GraphML files for each detected community will be written to this directory. The filenames match the 'id' field written to MongoDB if the -n parameter was given.
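
Once enriched, the profiles can be queried by community, for example to list the top users of one follow-based community by pagerank. This is only a sketch: the database name and the exact sub-document layout under the communities attribute are assumptions, so inspect an enriched profile first to confirm the field names:

#!/usr/bin/env python3
# List the highest-ranked users of one follow-based community.
from pymongo import MongoClient
db = MongoClient()["gabber"]  # hypothetical database name
query = {"communities.follow.id": 3}  # sub-document layout assumed
top = db.profiles.find(query).sort("communities.follow.pagerank", -1).limit(10)
for profile in top:
    print(profile["username"])  # field name assumed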

Once you are done with all community detection, run the com2posts.py script to copy the community metadata from the profiles collection to the actuser attribute of every post and the user attribute of every comment.

Groups

The gabgroups.py script will gather all group metadata found in the scraped posts and fill a MongoDB collection named groups. It will also add a post count to the metadata.

By default, gabgroups.py will only consider original posts. Use the -r parameter to also include reposts in the gathering of groups and counting of posts.
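
The resulting collection can then be inspected directly, for instance to list the largest groups by post count. The database name and the field holding the post count are assumptions; check gabgroups.py for the actual names:

#!/usr/bin/env python3
# Show the ten groups with the most posts.
from pymongo import MongoClient
db = MongoClient()["gabber"]  # hypothetical database name
for group in db.groups.find().sort("posts", -1).limit(10):  # count field assumed
    print(group)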

Hatespeech

The gabhate.py script uses the HateSonar library to detect hate speech and offensive speech in all English posts and comments. Other languages are not supported. The classification and its confidence are stored in the hateometer attribute of all affected posts and comments.
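
The enriched posts can then be filtered on that attribute. A sketch, where the database name and the field names inside hateometer are assumptions based on the description above; verify them against gabhate.py:

#!/usr/bin/env python3
# Count posts classified as hate speech with high confidence.
from pymongo import MongoClient
db = MongoClient()["gabber"]  # hypothetical database name
query = {"hateometer.classification": "hate_speech",  # field and label assumed
         "hateometer.confidence": {"$gt": 0.9}}
print(db.posts.count_documents(query))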

Topics

The gabtopics.py script uses LDA modelling to generate topics for a specific community. It will output plaintext as well as generate an HTML visualisation. Be sure to have run com2posts.py first. Usage:

./gabtopics.py -l [language] -e [edgetype] -c [community id] -t [number of topics] -o [output file]

Currently, only English, Dutch, and German are supported. Note that running this script on larger communities will require serious computational resources, in particular lots of memory.

Exporting

Activity

The gabactivity.py script will export a CSV with per-month counts of active users, posts, reposts, and comments. Use -o to export to a specific filename; by default, the export will be written to gabactivity.csv.

GraphML

The gabgraph.py script will export to a GraphML file for further processing with, for instance, igraph or (if you have a powerful desktop) Gephi. It supports four different edge types: follow, repost, quote, and comment edges. Run:

./gabgraph.py -h

to see all possible parameters.

Note that the language attribute is taken from Gab itself; take these values with a grain of salt.
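
Loading the export back into igraph for analysis is straightforward; Read_GraphML is the standard igraph call, and the filename here is just a placeholder for whatever gabgraph.py wrote:

#!/usr/bin/env python3
# Load an exported graph and show what it contains.
import igraph
g = igraph.Graph.Read_GraphML("gabgraph.graphml")  # placeholder filename
print(g.summary())  # vertex/edge counts and attribute names
print(g.vs.attribute_names())  # per-vertex attributes, e.g. language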

Groups

The groups2csv.py script will export group metadata to a CSV file. Use -o to export to a specific filename; by default, the export will be written to gabgroups.csv.

The export is comma-separated, single-quote-delimited CSV.

Hashtags

The gabhashtags.py script will export a sorted list of all hashtags used in posts and comments on Gab, including a count of how often each was used. Use -o to export to a specific filename; by default, the export will be written to gabhashtags.csv.

The export is comma-separated, single-quote-delimited CSV.

Note that no weighting is applied in the hashtag count.
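
Reading the export back in only requires setting the quote character; a minimal sketch using Python's csv module, which assumes nothing about the columns beyond the stated delimiting:

#!/usr/bin/env python3
# Read the single-quote-delimited export back in.
import csv
with open("gabhashtags.csv", newline="") as fh:
    for row in csv.reader(fh, quotechar="'"):
        print(row)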

Hate statistics

The gabhatestats.py script will output statistics on the overall amounts of hate speech and offensive speech detected by the gabhate.py script, as well as statistics per community detected by the gabcommunities.py script. Note that these statistics only cover English posts and comments.
