# Setup

Edit the variables below, then run the entire notebook.

With the values below, the notebook collects every submission and comment that contains the keyword "China" in its body, title, or selftext field, from 01/2006 to 12/2010.
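
As a rough illustration of what "matching a keyword on a field" means here, the sketch below checks one record against comma-separated keyword and field lists. The function name `matches_keywords` and the case-sensitive substring rule are assumptions made for this example (the case-sensitivity guess comes from `KEYWORDS` listing both "China" and "china"); the actual filtering done by `db_to_csv.py` may differ.

```python
def matches_keywords(record, keywords, fields):
    """Return True if any keyword occurs in any of the listed fields.

    Hypothetical helper: assumes case-sensitive substring matching.
    """
    keyword_list = [k for k in keywords.split(",") if k]
    field_list = [f for f in fields.split(",") if f]
    for field in field_list:
        value = record.get(field)
        if not isinstance(value, str):
            continue  # field is absent or null for this record
        if any(keyword in value for keyword in keyword_list):
            return True
    return False


# Made-up records, used only to exercise the helper.
submission = {"title": "Trade with China grows", "selftext": ""}
comment = {"body": "nothing relevant here"}

print(matches_keywords(submission, "China,china", "body,title,selftext"))  # True
print(matches_keywords(comment, "China,china", "body,title,selftext"))     # False
```

Records that lack one of the listed fields are skipped for that field rather than treated as errors.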

In [1]:
START_YEAR = 2006
START_MONTH = 1
END_YEAR = 2010
END_MONTH = 12
KEYWORDS = "China,china"
FIELDS = "body,title,selftext"

# Installation


## Install Rust

In [2]:
!apt update
!apt install -y rustc

Get:1 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Get:3 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:4 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease [15.9 kB]
Hit:5 http://archive.ubuntu.com/ubuntu bionic InRelease
Ign:6 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Get:7 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release [696 B]
Get:8 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Hit:9 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Get:10 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release.gpg [836 B]
Hit:11 http://ppa.launchpad.net/cran/libgit2/ubuntu bionic InRelease
Get:12 https://cloud.r-project.org/bin/linux/ubuntu bi

## Install dependencies

In [3]:
%cd /content/
!pip install pandas loguru requests
!git clone https://github.com/Paul-E/Pushshift-Importer.git
!git clone https://github.com/ruanchaves/reddit_keywords.git

/content
Collecting loguru
  Downloading loguru-0.6.0-py3-none-any.whl (58 kB)
     |████████████████████████████████| 58 kB 5.4 MB/s
Installing collected packages: loguru
Successfully installed loguru-0.6.0
Cloning into 'Pushshift-Importer'...
remote: Enumerating objects: 197, done.
remote: Counting objects: 100% (197/197), done.
remote: Compressing objects: 100% (148/148), done.
remote: Total 197 (delta 117), reused 125 (delta 46), pack-reused 0
Receiving objects: 100% (197/197), 41.81 KiB | 13.94 MiB/s, done.
Resolving deltas: 100% (117/117), done.
Cloning into 'reddit_keywords'...
remote: Enumerating objects: 28, done.
remote: Counting objects: 100% (28/28), done.
remote: Compressing objects: 100% (27/27), done.
remote: Total 28 (delta 3), reused 19 (delta 1), pack-reused 0
Unpacking objects: 100% (28/28), done.


# Download dumps 

Download every monthly dump from `START_YEAR`/`START_MONTH` through `END_YEAR`/`END_MONTH`, inclusive.
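
The inclusive range can be pictured as a list of (year, month) pairs. The sketch below is only an illustration: `month_range` is a hypothetical helper, not a function from `reddit_download.py`, and the `RS_YYYY-MM` stem is taken from the submission file names that appear in the download log.

```python
def month_range(start_year, start_month, end_year, end_month):
    """Yield (year, month) pairs from start to end, both ends included."""
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        yield year, month
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1


months = list(month_range(2006, 1, 2010, 12))
print(len(months))            # 60
print(months[0], months[-1])  # (2006, 1) (2010, 12)

# File-name stems for the first three submission dumps in that range.
print([f"RS_{year}-{month:02d}" for year, month in months[:3]])
# ['RS_2006-01', 'RS_2006-02', 'RS_2006-03']
```

If the end date precedes the start date, the generator yields nothing.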

In [None]:
%cd reddit_keywords
!python reddit_download.py --start_year {START_YEAR} --end_year {END_YEAR} --start_month {START_MONTH} --end_month {END_MONTH}

/content/reddit_keywords
2022-03-07 23:00:54.064 | DEBUG    | __main__:download_dumps:89 - Downloading https://files.pushshift.io/reddit/submissions/RS_2006-01.zst.
2022-03-07 23:00:54.955 | DEBUG    | __main__:download_dumps:92 - https://files.pushshift.io/reddit/submissions/RS_2006-01.bz2 does not exist. Skipping.
2022-03-07 23:00:55.353 | DEBUG    | __main__:download_dumps:89 - Downloading https://files.pushshift.io/reddit/submissions/RS_2006-02.zst.
2022-03-07 23:00:56.679 | DEBUG    | __main__:download_dumps:92 - https://files.pushshift.io/reddit/submissions/RS_2006-02.bz2 does not exist. Skipping.
2022-03-07 23:00:57.092 | DEBUG    | __main__:download_dumps:89 - Downloading https

# Generate dataframes

The commands below will generate the `comment.csv` and `submission.csv` files in the `reddit_keywords` folder.

In [None]:
%cd /content/reddit_keywords
!bash build_db.sh
!python db_to_csv.py --keywords {KEYWORDS} --fields {FIELDS}

# Download

In [None]:
import json
import time

from google.colab import files

# Record the run parameters so the archive documents how it was produced.
metadata = {
    "start_year": START_YEAR,
    "start_month": START_MONTH,
    "end_year": END_YEAR,
    "end_month": END_MONTH,
    "keywords": KEYWORDS,
    "fields": FIELDS
}

with open("metadata.json", "w") as f:
    json.dump(metadata, f)

# Bundle both CSV files and the metadata into a timestamped zip, then download it.
timestamp = str(int(time.time()))
filename = f"archive_{timestamp}.zip"
!zip {filename} comment.csv submission.csv metadata.json
files.download(filename)
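
Once the archive is on your machine, it can be read back without unpacking it. The sketch below is one possible way to do that with the standard library. `load_archive` is a hypothetical helper, and it assumes both CSV files are UTF-8 encoded and begin with a header row, which this notebook does not confirm.

```python
import csv
import io
import json
import zipfile


def load_archive(path):
    """Return (metadata, comments, submissions) read from a downloaded archive.

    Assumption: each CSV is UTF-8 and starts with a header row, so every
    row can be returned as a dict keyed by column name.
    """
    with zipfile.ZipFile(path) as archive:
        metadata = json.loads(archive.read("metadata.json"))
        tables = {}
        for name in ("comment.csv", "submission.csv"):
            with archive.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                tables[name] = list(csv.DictReader(text))
    return metadata, tables["comment.csv"], tables["submission.csv"]
```

Calling `load_archive` with the path of the downloaded zip returns the run parameters alongside the rows of both tables, so the keywords and date range that produced a dataset stay attached to it.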

# Disconnect

In [None]:
# Kill every process in the runtime to disconnect it (NR > 1 skips the ps header row).
!kill $(ps aux | awk 'NR > 1 {print $2}')