ragger

Identifies highly upvoted removed comments and posts on Reddit by aggregating historical data provided by files.pushshift.io/reddit. Results are displayed on subreddit top pages: Reveddit.com/r/<subreddit>/top

Requirements

To process a full month's worth of comment data you need,

  • 2 TB disk: about 1 TB to download the data and another 400 GB for intermediate processing files
  • 40 GB RAM: for the 2-aggregate-monthly.py step. Splitting monthly files into smaller parts may reduce memory usage; see the sketch below.

Without this hardware, you can still run the code on the included test set in under a minute.
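
As noted above, splitting a monthly file into smaller parts can reduce the memory needed by the aggregation step. A minimal sketch, assuming the dump is a zstd-compressed newline-delimited JSON file; the file name and chunk size are illustrative, and whether the pipeline consumes split parts directly should be verified against the scripts:

# Decompress a monthly comment dump and split it into 10M-line parts.
# RC_2020-01.zst is a hypothetical file name; some Pushshift dumps may
# need zstd's --long=31 option to decompress.
zstd -dc data/0-pushshift_raw/RC_2020-01.zst \
  | split -l 10000000 - data/0-pushshift_raw/RC_2020-01.part-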

Environment

Create a conda virtual environment and activate it,

conda create --name reveddit --file requirements-conda.txt
conda activate reveddit

Optionally, install PostgreSQL and put the credentials in a dbconfig.ini file, following the format of dbconfig-example.ini.
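
A minimal sketch of what that file might contain; the section and key names below are placeholders, so copy dbconfig-example.ini for the real format rather than trusting them:

# Placeholder values only; see dbconfig-example.ini for the actual keys.
cat > dbconfig.ini <<'EOF'
[postgresql]
host=localhost
port=5432
dbname=reveddit
user=reveddit
password=change-me
EOF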

Test

To process the test dataset included in this repo,

./processData.sh all test

Results appear in test/3-aggregate_all and test/4-add_fields.
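
For a quick sanity check, list the two output directories; the file names and formats inside them are not documented here, so this only confirms the run produced output:

ls -l test/3-aggregate_all test/4-add_fields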

To load results into a database, prepare database credentials (see dbconfig-example.ini) and run either,

  • ./test.sh, which runs the above command and loads the results into a local PostgreSQL database, or
  • ./test.sh normal, which loads the full results into the database if the files have been downloaded (see below)
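
One way to confirm the load worked is to list the tables with psql. The connection parameters below are placeholders for whatever you put in your database config, and the tables shown are whatever the load step creates:

psql -h localhost -U reveddit -d reveddit -c '\dt'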

Download

To download the subset of Pushshift comment and submission dumps used by this project, run

./downloadPushshiftDumps.sh

The downloaded files will be in data/0-pushshift_raw/. Comments in that script explain why only a subset of the data is used.

Then run ./groupDaily.sh. This creates monthly files from daily files and moves the daily files to another directory.
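
To sanity-check the download before processing, look at what landed in the raw-data directory; the exact file names depend on the Pushshift naming scheme and the date range the script fetches:

ls -lh data/0-pushshift_raw/ | head
du -sh data/0-pushshift_raw/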

Other Pushshift download scripts:

Usage

To process full results,

  1. Download Pushshift monthly dumps
  2. Store them in data/0-pushshift_raw/ as specified in config.ini
  3. Run ./processData.sh all normal
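
Taken together, and assuming config.ini points at data/0-pushshift_raw/ as described above, a full run looks roughly like this (the download and grouping scripts are the ones from the Download section):

./downloadPushshiftDumps.sh   # fetch the Pushshift dumps
./groupDaily.sh               # group daily files into monthly files
./processData.sh all normal   # run the full pipeline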

With a remote database

I used a DigitalOcean droplet. These are the rough steps,

  1. Set up ssh keys
  2. Install Postgres with docker
  3. Create a database login and password for your script
  4. Add the top 4 lines of droplet-config/pg_hba.conf.head to /var/lib/docker/volumes/hasura_db_data/_data/pg_hba.conf (see the sketch after this list)
  5. sudo docker-compose up -d
  6. git clone this repo
  7. Put the database login and password into a file called dbconfig.ini in the root directory of this repo
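
Step 4 can be done with a couple of shell commands, run as root from the repo checkout on the droplet. This is only a sketch: the .head suffix suggests the four lines are meant to be prepended, since pg_hba.conf rules are matched in order, but verify that against the repo:

PG_HBA=/var/lib/docker/volumes/hasura_db_data/_data/pg_hba.conf
cp "$PG_HBA" "$PG_HBA.bak"                       # keep a backup
head -4 droplet-config/pg_hba.conf.head | cat - "$PG_HBA.bak" > "$PG_HBA"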

Then, locally,

  1. In prod.sh, change ssh.rviewit.com to the domain name of your droplet (a sed one-liner is sketched below)
  2. Run prod.sh
  3. Check the local and remote logs to know when it's done
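
Step 1 can be done with a one-line sed (GNU sed shown; on macOS use sed -i ''). droplet.example.com is a placeholder for your droplet's domain or IP:

sed -i 's/ssh\.rviewit\.com/droplet.example.com/' prod.sh
./prod.sh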
