# Data from 2005-06 to 2022-12

Follow the steps in this Pushshift Reddit data dump post for data from 2005-06 to 2022-12: https://www.reddit.com/r/pushshift/comments/11ef9if/separate_dump_files_for_the_top_20k_subreddits/

In case Reddit removes the post, its contents are copied below:

***
### Separate dump files for the top 20k subreddits

I've gotten a number of requests for subreddit specific dump files extracted from the monthly dumps. So I've extracted out the top twenty thousand subreddits and uploaded them as a torrent so they can be individually downloaded without having to download the entire set of dumps.

https://academictorrents.com/details/c398a571976c78d346c325bd75c47b82edf6124e

### How to download the subreddit you want

This is a torrent. If you are not familiar, torrents are a way to share large files like these without having to pay hundreds of dollars in server hosting costs. They are peer to peer, which means that as you download, you're also uploading the files to other people. Because of this, you can't just click a download button in your browser; you have to install a type of program called a torrent client. There are many different torrent clients, but I recommend a simple, open source one called qBittorrent (https://www.qbittorrent.org/).

Once you have that installed, go to the torrent link (https://academictorrents.com/details/c398a571976c78d346c325bd75c47b82edf6124e) and click download. This will download a small ".torrent" file. In qBittorrent, click the plus at the top and select this torrent file. This will open the list of all the subreddits. Click "Select None" to unselect everything, then use the filter box in the top right to search for the subreddit you want. Select the files you're interested in (there's a separate one for the comments and submissions of each subreddit), then click okay. The files will then be downloaded.

### How to use the files

These files are in a format called zstandard compressed ndjson. Zstandard is a very efficient compression format, similar to a zip file. NDJSON is "Newline Delimited JavaScript Object Notation", with a separate JSON object on each line of the text file.
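
As a minimal sketch of what NDJSON looks like once it has been decompressed, the snippet below parses a small in-memory sample one line at a time. The sample records and the `iter_ndjson` helper are hypothetical and made up for illustration; real dump objects carry many more fields.

```python
import io
import json

# Hypothetical sample: two records, one JSON object per line (not real dump data)
sample = io.StringIO(
    '{"id": "abc", "author": "user_one", "title": "first post"}\n'
    '{"id": "def", "author": "user_two", "title": "second post"}\n'
)


def iter_ndjson(handle):
    """Yield one parsed object per non-empty line of an NDJSON text stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


records = list(iter_ndjson(sample))
print([record["id"] for record in records])
```

Because each line is a complete object, a reader never needs the whole file in memory, which is what makes line-by-line processing of very large dumps practical.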

There are a number of ways to interact with these files, but they all have various drawbacks due to the massive size of many of these files. The efficient compression means a file like "wallstreetbets_submissions.zst" is 5.5 gigabytes uncompressed, far larger than most programs can open at once.

You can extract the files yourself with 7Zip. You can install 7Zip from here (https://www.7-zip.org/) and then install this plugin (https://github.com/mcmilk/7-Zip-zstd) to extract ZStandard files, or you can directly install the modified 7Zip with the plugin already included from that plugin page. Then simply open the zst file you downloaded with 7Zip and extract it.

Once you've extracted it, you'll need a text editor capable of opening very large files. I use glogg (https://glogg.bonnefon.org/) which lets you open files like this without loading the whole thing at once.

As an alternative, if you want to save the data in a different format or extract lines matching specific filters (keyword searching, dates, etc.), you can use a Python script like the examples I have here (https://github.com/Watchful1/PushshiftDumps). This lets you iterate through each comment/submission in the file without having to extract the whole thing.

If you only want a specific part of a subreddit file, you can use this script (https://github.com/Watchful1/PushshiftDumps/blob/master/scripts/filter_file.py) which lets you filter by time period or return objects that have a certain word in the title or body.

If you have a specific use case and can't figure out how to extract the data you want, send me a DM; I'm happy to help put something together.

### API script

In addition to the dump files, Pushshift offers an API with powerful filtering options. The main limitation is that it takes quite some time to download a substantial amount of data. If you have a use case that doesn't cleanly align to specific subreddits, take a look at my API download script here (https://github.com/Watchful1/Sketchpad/blob/master/postDownloader.py). Again, I'm happy to work with you to build something for a specific use case.

### Can I cite you in my research paper

This data was originally collected by Pushshift; extracted, split and re-packaged by me, u/Watchful1 (https://www.reddit.com/user/Watchful1); and hosted on academictorrents.com.

Having never published a paper myself, I'm not very familiar with the correct format for citing, but I would recommend including all three.

If you do complete a project or publish a paper using this data, I'd love to hear about it! Send me a DM once you're done.

### How can I donate as thanks

I pay roughly $30 a month for the seedbox I use to host the torrent. If you'd like to chip in towards that cost, you can donate here (https://ko-fi.com/watchful1).

You can also donate to the organization behind Pushshift here (https://www.paypal.com/US/fundraiser/charity/3521050) so they can continue hosting and collecting more data like this.

### Note

This data currently runs from the beginning of Reddit to the end of December 2022. I will update it with a new torrent every 6 months.
***

# Data from 2023-01 to 2023-03, and 2023-07

Follow the steps in this Arctic Shift Reddit data dump post for data from 2023-01 to 2023-03, and 2023-07: https://old.reddit.com/r/pushshift/comments/15h172e/post_comment_data_dumps_202307/

In case Reddit removes the post, its contents are copied below:

***
First off, I'm not associated with Pushshift. Still, mods, please don't delete this :)

For downloads and usage instructions, visit the GitHub page (https://github.com/ArthurHeitmann/arctic_shift).

How is this possible under Reddit's new rate limit rules?

Over the last month almost 300 million posts and comments were created. That's about 6,500 per minute. With one API request you can fetch 100 posts/comments. So you need to make about 65 requests per minute. Now, what are the new rate limits? 100 requests per minute. That leaves enough room to handle peaks and to retrieve older content.

There's a small catch though. The dumps use a slightly different file format than the one Pushshift uses. It is easier for me to maintain. But fear not, usage instructions are on the above GitHub page.
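
The rate-limit arithmetic above can be checked with a few lines of Python. This is only a sketch: it assumes a 30-day month and uses the round figure of 300 million as an upper bound, so the results come out slightly above the post's rounded estimates of about 6,500 items and 65 requests per minute.

```python
# Figures taken from the post above; a 30-day month is an assumption
items_per_month = 300_000_000      # "almost 300 million" posts and comments
minutes_per_month = 30 * 24 * 60   # 43,200 minutes
items_per_request = 100            # items returned by one API request
rate_limit_per_minute = 100        # requests allowed per minute

items_per_minute = items_per_month / minutes_per_month
requests_per_minute = items_per_minute / items_per_request
headroom = rate_limit_per_minute - requests_per_minute

print(f"{items_per_minute:,.0f} items/minute")
print(f"{requests_per_minute:.0f} requests/minute needed")
print(f"{headroom:.0f} requests/minute of headroom")
```

Even with the upper-bound figure, the required request rate stays below the limit, which is the point the post is making.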

If you want to help speed up the archiving of the previous 3 months, DM me.
***

In [1]:
# Extracting submissions (i.e. posts) where the title contains the keywords ethereum or eth

import zstandard
import os
import json
import sys
import csv
from datetime import datetime
import logging.handlers

# put the path to the input file
input_file = r"C:\Users\leo_c\Downloads\reddit\subreddits\CryptoCurrency_submissions.zst"

# put the name or path to the output file. The file extension from below will be added automatically
output_file = r"C:\Users\leo_c\Downloads\filtered_title_CryptoCurrency_submissions"

# the format to output in, pick from the following options
#   zst: same as the input, a zstandard compressed ndjson file. Can be read by the other scripts in the repo
#   txt: an ndjson file, which is a text file with a separate json object on each line. Can be opened by any text editor
#   csv: a comma separated value file. Can be opened by a text editor or excel
# WARNING READ THIS: if you use txt or csv output on a large input file without filtering out most of the rows, the resulting file will be 
# extremely large. Usually about 7 times as large as the compressed input file
output_format = "csv"

# override the above format and output only this field into a text file, one per line. Useful if you want to make a list of authors or ids. 
# See the examples below
# any field that's in the dump is supported, but useful ones are
#   author: the username of the author
#   id: the id of the submission or comment
#   link_id: only for comments, the fullname of the submission the comment is associated with
#   parent_id: only for comments, the fullname of the parent of the comment. Either another comment or the submission if it's top level
single_field = None

# the fields in the file are different depending on whether it has comments or submissions. If we're writing a csv, we need to know which 
# fields to write.
# The filename from the torrent has which type it is, but you'll need to change this if you removed that from the filename
# EXTRA COMMENT: is_submission = "submission" in input_file sets the is_submission to true if it is a submission file, and false otherwise. You don't need to change this at all.
is_submission = "submission" in input_file

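# only output items between these two dates
# NOTE: both dates are parsed as midnight, so items created after 00:00 on to_date are excluded;
# set to_date to the following day (e.g. "2023-01-01") to include all of 2022-12-31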
from_date = datetime.strptime("2022-12-01", "%Y-%m-%d")
to_date = datetime.strptime("2022-12-31", "%Y-%m-%d")

# the field to filter on, the values to filter with and whether it should be an exact match
# some examples:
# return only objects where the author is u/watchful1 or u/spez
# field = "author"
# values = ["watchful1","spez"]
# exact_match = True
#
# return only objects where the title contains either "stonk" or "moon"
# field = "title"
# values = ["stonk","moon"]
# exact_match = False
#
# return only objects where the body contains either "stonk" or "moon". 
# For submissions the body is in the "selftext" field, for comments it's in the "body" field
# field = "selftext"
# values = ["stonk","moon"]
# exact_match = False

# filter a submission file and then get a file with all the comments only in those submissions. 
# This is a multi-step process. Add your submission filters and set the output file name to something unique
# input_file = "redditdev_submissions.zst"
# output_file = "filtered_submissions"
# output_format = "csv"
# field = "author"
# values = ["watchful1"]
#
# run the script, this will result in a file called "filtered_submissions.csv" that contains only submissions by u/watchful1
# now we'll run the script again with the same input and same filters, but set the output to single field.
# Be sure to change the output file to a new name, but don't change any of the other inputs
# output_file = "submission_ids"
# single_field = "id"
#
# run the script again, this will result in a file called "submission_ids.txt" that has an id on each line
# now we'll remove all the other filters and update the script to input from the comments file, and use the submission ids list we created before. 
# And change the output name again so we don't overwrite anything
# input_file = "redditdev_comments.zst"
# output_file = "filtered_comments"
# single_field = None  # resetting this back so it's not used
# field = "link_id"  # in the comment object, this is the field that contains the submission id
# exact_match = False  # the link_id field has a prefix on it, so we can't do an exact match
# values_file = "submission_ids.txt"
#
# run the script one last time and now you have a file called "filtered_comments.csv" that only has comments from your submissions above
# if you want only top level comments instead of all comments, you can set field to "parent_id" instead of "link_id"

field = "title"
values = ["ethereum","eth"]
exact_match = False

# if you have a long list of values, you can put them in a file and put the filename here. If set this overrides the value list above
# if this list is very large, it could greatly slow down the process
values_file = None




# sets up logging to the console as well as a file
log = logging.getLogger("bot")
log.setLevel(logging.INFO)
log_formatter = logging.Formatter('%(asctime)s - %(levelname)s: %(message)s')
# attach the handlers only once: re-running this cell in a notebook would otherwise add them again and duplicate every log line
if not log.handlers:
	log_str_handler = logging.StreamHandler()
	log_str_handler.setFormatter(log_formatter)
	log.addHandler(log_str_handler)
	if not os.path.exists("logs"):
		os.makedirs("logs")
	log_file_handler = logging.handlers.RotatingFileHandler(os.path.join("logs", "bot.log"), maxBytes=1024*1024*16, backupCount=5)
	log_file_handler.setFormatter(log_formatter)
	log.addHandler(log_file_handler)


def write_line_zst(handle, line):
	handle.write(line.encode('utf-8'))
	handle.write("\n".encode('utf-8'))


def write_line_json(handle, obj):
	handle.write(json.dumps(obj))
	handle.write("\n")


def write_line_single(handle, obj, field):
	if field in obj:
		handle.write(obj[field])
	else:
		log.info(f"{field} not in object {obj['id']}")
	handle.write("\n")


def write_line_csv(writer, obj, is_submission):
	output_list = []
	output_list.append(str(obj['score']))
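	# int() guards against created_utc being stored as a string (the main loop does the same);
	# UTC keeps the CSV date consistent with the UTC-based date filter
	output_list.append(datetime.utcfromtimestamp(int(obj['created_utc'])).strftime("%Y-%m-%d"))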
	if is_submission:
		output_list.append(obj['title'])
	output_list.append(f"u/{obj['author']}")
	output_list.append(f"https://www.reddit.com{obj['permalink']}")
	if is_submission:
		if obj['is_self']:
			if 'selftext' in obj:
				output_list.append(obj['selftext'])
			else:
				output_list.append("")
		else:
			output_list.append(obj['url'])
	else:
		output_list.append(obj['body'])
	writer.writerow(output_list)


def read_and_decode(reader, chunk_size, max_window_size, previous_chunk=None, bytes_read=0):
	chunk = reader.read(chunk_size)
	bytes_read += chunk_size
	if previous_chunk is not None:
		chunk = previous_chunk + chunk
	try:
		return chunk.decode()
	except UnicodeDecodeError:
		if bytes_read > max_window_size:
			raise UnicodeError(f"Unable to decode frame after reading {bytes_read:,} bytes")
		log.info(f"Decoding error with {bytes_read:,} bytes, reading another chunk")
		return read_and_decode(reader, chunk_size, max_window_size, chunk, bytes_read)


def read_lines_zst(file_name):
	with open(file_name, 'rb') as file_handle:
		buffer = ''
		reader = zstandard.ZstdDecompressor(max_window_size=2**31).stream_reader(file_handle)
		while True:
			chunk = read_and_decode(reader, 2**27, (2**29) * 2)

			if not chunk:
				break
			lines = (buffer + chunk).split("\n")

			for line in lines[:-1]:
				yield line.strip(), file_handle.tell()

			buffer = lines[-1]

		reader.close()


if __name__ == "__main__":
	if single_field is not None:
		log.info("Single field output mode, changing output file format to txt")
		output_format = "txt"
	output_path = f"{output_file}.{output_format}"

	writer = None
	if output_format == "zst":
		log.info("Output format set to zst")
		handle = zstandard.ZstdCompressor().stream_writer(open(output_path, 'wb'))
	elif output_format == "txt":
		log.info("Output format set to txt")
		handle = open(output_path, 'w', encoding='UTF-8')
	elif output_format == "csv":
		log.info("Output format set to csv")
		handle = open(output_path, 'w', encoding='UTF-8', newline='')
		writer = csv.writer(handle)
	else:
		log.error(f"Unsupported output format {output_format}")
		sys.exit()

	if values_file is not None:
		values = []
		with open(values_file, 'r') as values_handle:
			for value in values_handle:
				values.append(value.strip().lower())
		log.info(f"Loaded {len(values)} from values file")
	else:
		values = [value.lower() for value in values]  # convert to lowercase

	file_size = os.stat(input_file).st_size
	file_bytes_processed = 0
	created = None
	matched_lines = 0
	bad_lines = 0
	total_lines = 0
	for line, file_bytes_processed in read_lines_zst(input_file):
		total_lines += 1
		if total_lines % 100000 == 0:
			log.info(f"{created.strftime('%Y-%m-%d %H:%M:%S')} : {total_lines:,} : {matched_lines:,} : {bad_lines:,} : {file_bytes_processed:,}:{(file_bytes_processed / file_size) * 100:.0f}%")

		try:
			obj = json.loads(line)
			created = datetime.utcfromtimestamp(int(obj['created_utc']))

			if created < from_date:
				continue
			if created > to_date:
				continue

			field_value = obj[field].lower()
			matched = False
			for value in values:
				if exact_match:
					if value == field_value:
						matched = True
						break
				else:
					if value in field_value:
						matched = True
						break
			if not matched:
				continue

			matched_lines += 1
			if output_format == "zst":
				write_line_zst(handle, line)
			elif output_format == "csv":
				write_line_csv(writer, obj, is_submission)
			elif output_format == "txt":
				if single_field is not None:
					write_line_single(handle, obj, single_field)
				else:
					write_line_json(handle, obj)
		except (KeyError, json.JSONDecodeError) as err:
			bad_lines += 1

	handle.close()
	log.info(f"Complete : {total_lines:,} : {matched_lines:,} : {bad_lines:,}")

2023-07-17 23:43:12,102 - INFO: Output format set to csv
2023-07-17 23:43:14,213 - INFO: 2017-10-25 08:45:13 : 100,000 : 0 : 0 : 36,045,625:9%
2023-07-17 23:43:16,000 - INFO: 2018-01-14 09:25:19 : 200,000 : 0 : 0 : 51,774,625:13%
2023-07-17 23:43:18,125 - INFO: 2018-04-05 19:57:54 : 300,000 : 0 : 0 : 79,038,225:20%
2023-07-17 23:43:20,562 - INFO: 2018-08-26 20:03:59 : 400,000 : 0 : 0 : 102,631,725:25%
2023-07-17 23:43:23,126 - INFO: 2019-03-18 09:46:35 : 500,000 : 0 : 0 : 126,094,150:31%
2023-07-17 23:43:25,604 - INFO: 2020-02-21 18:02:14 : 600,000 : 0 : 0 : 146,804,000:36%
2023-07-17 23:43:28,309 - INFO: 2021-01-12 06:47:46 : 700,000 : 0 : 0 : 166,727,400:41%
2023-07-17 23:43:31,326 - INFO: 2021-03-23 01:55:36 : 800,000 : 0 : 0 : 193,991,000:48%
2023-07-17 23:43:34,045 - INFO: 2021-05-09 11:45:41 : 900,000 : 0 : 0 : 212,865,800:53%
2023-07-17 23:43:37,203 - INFO: 2021-06-24 12:15:51 : 1,000,000 : 0 : 0 : 241,833,375:60%
2023-07-17 23:43:41,833 - INFO: 2021-08-18 07:31:59 : 1,100,000 :
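
With `exact_match = False`, the script above tests `value in field_value`, which is a plain substring check. For a short keyword such as "eth" this also matches unrelated words like "method" or "together". The snippet below is a minimal sketch of one possible alternative, whole-word matching with a regular expression. The `build_word_pattern` helper and the sample titles are hypothetical and not part of the original script.

```python
import re


def build_word_pattern(values):
    """Compile one case-insensitive pattern that matches any value as a whole word."""
    escaped = (re.escape(value) for value in values)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


pattern = build_word_pattern(["ethereum", "eth"])

# Hypothetical titles, made up for illustration
titles = [
    "ETH hits a new high",          # whole word: matches
    "Is Ethereum still worth it?",  # whole word: matches
    "A new method for staking",     # "eth" only appears inside "method": no match
]
matches = [bool(pattern.search(title)) for title in titles]
print(matches)
```

Whether whole-word matching is preferable depends on the use case: it drops false positives such as "method", but it also misses compounds like "ethusd" that the substring check would catch.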

In [4]:
# Extracting submissions (i.e. posts) where the body contains the keywords ethereum or eth

import zstandard
import os
import json
import sys
import csv
from datetime import datetime
import logging.handlers

# put the path to the input file
input_file = r"C:\Users\leo_c\Downloads\reddit\subreddits\CryptoCurrency_submissions.zst"

# put the name or path to the output file. The file extension from below will be added automatically
output_file = r"C:\Users\leo_c\Downloads\filtered_selftext_CryptoCurrency_submissions"

# the format to output in, pick from the following options
#   zst: same as the input, a zstandard compressed ndjson file. Can be read by the other scripts in the repo
#   txt: an ndjson file, which is a text file with a separate json object on each line. Can be opened by any text editor
#   csv: a comma separated value file. Can be opened by a text editor or excel
# WARNING READ THIS: if you use txt or csv output on a large input file without filtering out most of the rows, the resulting file will be 
# extremely large. Usually about 7 times as large as the compressed input file
output_format = "csv"

# override the above format and output only this field into a text file, one per line. Useful if you want to make a list of authors or ids. 
# See the examples below
# any field that's in the dump is supported, but useful ones are
#   author: the username of the author
#   id: the id of the submission or comment
#   link_id: only for comments, the fullname of the submission the comment is associated with
#   parent_id: only for comments, the fullname of the parent of the comment. Either another comment or the submission if it's top level
single_field = None

# the fields in the file are different depending on whether it has comments or submissions. If we're writing a csv, we need to know which 
# fields to write.
# The filename from the torrent has which type it is, but you'll need to change this if you removed that from the filename
# EXTRA COMMENT: is_submission = "submission" in input_file sets the is_submission to true if it is a submission file, and false otherwise. You don't need to change this at all.
is_submission = "submission" in input_file

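# only output items between these two dates
# NOTE: both dates are parsed as midnight, so items created after 00:00 on to_date are excluded;
# set to_date to the following day (e.g. "2023-01-01") to include all of 2022-12-31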
from_date = datetime.strptime("2022-12-01", "%Y-%m-%d")
to_date = datetime.strptime("2022-12-31", "%Y-%m-%d")

# the field to filter on, the values to filter with and whether it should be an exact match
# some examples:
# return only objects where the author is u/watchful1 or u/spez
# field = "author"
# values = ["watchful1","spez"]
# exact_match = True
#
# return only objects where the title contains either "stonk" or "moon"
# field = "title"
# values = ["stonk","moon"]
# exact_match = False
#
# return only objects where the body contains either "stonk" or "moon". 
# For submissions the body is in the "selftext" field, for comments it's in the "body" field
# field = "selftext"
# values = ["stonk","moon"]
# exact_match = False

# filter a submission file and then get a file with all the comments only in those submissions. 
# This is a multi-step process. Add your submission filters and set the output file name to something unique
# input_file = "redditdev_submissions.zst"
# output_file = "filtered_submissions"
# output_format = "csv"
# field = "author"
# values = ["watchful1"]
#
# run the script, this will result in a file called "filtered_submissions.csv" that contains only submissions by u/watchful1
# now we'll run the script again with the same input and same filters, but set the output to single field.
# Be sure to change the output file to a new name, but don't change any of the other inputs
# output_file = "submission_ids"
# single_field = "id"
#
# run the script again, this will result in a file called "submission_ids.txt" that has an id on each line
# now we'll remove all the other filters and update the script to input from the comments file, and use the submission ids list we created before. 
# And change the output name again so we don't overwrite anything
# input_file = "redditdev_comments.zst"
# output_file = "filtered_comments"
# single_field = None  # resetting this back so it's not used
# field = "link_id"  # in the comment object, this is the field that contains the submission id
# exact_match = False  # the link_id field has a prefix on it, so we can't do an exact match
# values_file = "submission_ids.txt"
#
# run the script one last time and now you have a file called "filtered_comments.csv" that only has comments from your submissions above
# if you want only top level comments instead of all comments, you can set field to "parent_id" instead of "link_id"

field = "selftext"
values = ["ethereum","eth"]
exact_match = False

# if you have a long list of values, you can put them in a file and put the filename here. If set this overrides the value list above
# if this list is very large, it could greatly slow down the process
values_file = None




# sets up logging to the console as well as a file
log = logging.getLogger("bot")
log.setLevel(logging.INFO)
log_formatter = logging.Formatter('%(asctime)s - %(levelname)s: %(message)s')
# attach the handlers only once: re-running this cell in a notebook would otherwise add them again and duplicate every log line
if not log.handlers:
	log_str_handler = logging.StreamHandler()
	log_str_handler.setFormatter(log_formatter)
	log.addHandler(log_str_handler)
	if not os.path.exists("logs"):
		os.makedirs("logs")
	log_file_handler = logging.handlers.RotatingFileHandler(os.path.join("logs", "bot.log"), maxBytes=1024*1024*16, backupCount=5)
	log_file_handler.setFormatter(log_formatter)
	log.addHandler(log_file_handler)


def write_line_zst(handle, line):
	handle.write(line.encode('utf-8'))
	handle.write("\n".encode('utf-8'))


def write_line_json(handle, obj):
	handle.write(json.dumps(obj))
	handle.write("\n")


def write_line_single(handle, obj, field):
	if field in obj:
		handle.write(obj[field])
	else:
		log.info(f"{field} not in object {obj['id']}")
	handle.write("\n")


def write_line_csv(writer, obj, is_submission):
	output_list = []
	output_list.append(str(obj['score']))
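	# int() guards against created_utc being stored as a string (the main loop does the same);
	# UTC keeps the CSV date consistent with the UTC-based date filter
	output_list.append(datetime.utcfromtimestamp(int(obj['created_utc'])).strftime("%Y-%m-%d"))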
	if is_submission:
		output_list.append(obj['title'])
	output_list.append(f"u/{obj['author']}")
	output_list.append(f"https://www.reddit.com{obj['permalink']}")
	if is_submission:
		if obj['is_self']:
			if 'selftext' in obj:
				output_list.append(obj['selftext'])
			else:
				output_list.append("")
		else:
			output_list.append(obj['url'])
	else:
		output_list.append(obj['body'])
	writer.writerow(output_list)


def read_and_decode(reader, chunk_size, max_window_size, previous_chunk=None, bytes_read=0):
	chunk = reader.read(chunk_size)
	bytes_read += chunk_size
	if previous_chunk is not None:
		chunk = previous_chunk + chunk
	try:
		return chunk.decode()
	except UnicodeDecodeError:
		if bytes_read > max_window_size:
			raise UnicodeError(f"Unable to decode frame after reading {bytes_read:,} bytes")
		log.info(f"Decoding error with {bytes_read:,} bytes, reading another chunk")
		return read_and_decode(reader, chunk_size, max_window_size, chunk, bytes_read)


def read_lines_zst(file_name):
	with open(file_name, 'rb') as file_handle:
		buffer = ''
		reader = zstandard.ZstdDecompressor(max_window_size=2**31).stream_reader(file_handle)
		while True:
			chunk = read_and_decode(reader, 2**27, (2**29) * 2)

			if not chunk:
				break
			lines = (buffer + chunk).split("\n")

			for line in lines[:-1]:
				yield line.strip(), file_handle.tell()

			buffer = lines[-1]

		reader.close()


if __name__ == "__main__":
	if single_field is not None:
		log.info("Single field output mode, changing output file format to txt")
		output_format = "txt"
	output_path = f"{output_file}.{output_format}"

	writer = None
	if output_format == "zst":
		log.info("Output format set to zst")
		handle = zstandard.ZstdCompressor().stream_writer(open(output_path, 'wb'))
	elif output_format == "txt":
		log.info("Output format set to txt")
		handle = open(output_path, 'w', encoding='UTF-8')
	elif output_format == "csv":
		log.info("Output format set to csv")
		handle = open(output_path, 'w', encoding='UTF-8', newline='')
		writer = csv.writer(handle)
	else:
		log.error(f"Unsupported output format {output_format}")
		sys.exit()

	if values_file is not None:
		values = []
		with open(values_file, 'r') as values_handle:
			for value in values_handle:
				values.append(value.strip().lower())
		log.info(f"Loaded {len(values)} from values file")
	else:
		values = [value.lower() for value in values]  # convert to lowercase

	file_size = os.stat(input_file).st_size
	file_bytes_processed = 0
	created = None
	matched_lines = 0
	bad_lines = 0
	total_lines = 0
	for line, file_bytes_processed in read_lines_zst(input_file):
		total_lines += 1
		if total_lines % 100000 == 0:
			log.info(f"{created.strftime('%Y-%m-%d %H:%M:%S')} : {total_lines:,} : {matched_lines:,} : {bad_lines:,} : {file_bytes_processed:,}:{(file_bytes_processed / file_size) * 100:.0f}%")

		try:
			obj = json.loads(line)
			created = datetime.utcfromtimestamp(int(obj['created_utc']))

			if created < from_date:
				continue
			if created > to_date:
				continue

			field_value = obj[field].lower()
			matched = False
			for value in values:
				if exact_match:
					if value == field_value:
						matched = True
						break
				else:
					if value in field_value:
						matched = True
						break
			if not matched:
				continue

			matched_lines += 1
			if output_format == "zst":
				write_line_zst(handle, line)
			elif output_format == "csv":
				write_line_csv(writer, obj, is_submission)
			elif output_format == "txt":
				if single_field is not None:
					write_line_single(handle, obj, single_field)
				else:
					write_line_json(handle, obj)
		except (KeyError, json.JSONDecodeError) as err:
			bad_lines += 1

	handle.close()
	log.info(f"Complete : {total_lines:,} : {matched_lines:,} : {bad_lines:,}")

2023-07-18 00:48:49,001 - INFO: Output format set to csv
2023-07-18 00:48:50,875 - INFO: 2017-10-25 08:45:13 : 100,000 : 0 : 0 : 36,045,625:9%
2023-07-18 00:48:52,446 - INFO: 2018-01-14 09:25:19 : 200,000 : 0 : 0 : 51,774,625:13%
2023-07-18 00:48:54,286 - INFO: 2018-04-05 19:57:54 : 300,000 : 0 : 0 : 79,038,2
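
The commented walkthrough in the cell above matches comments to submissions by substring, because `link_id` carries a prefix in front of the submission id. A possible alternative, sketched below under the assumption that the prefix is always `t3_` (Reddit's type prefix for submissions), is to strip the prefix and test membership in a set, which avoids scanning the whole id list for every comment. The helper name, the sample ids and the sample comments are hypothetical and not part of the original script.

```python
def submission_id_from_link_id(link_id):
    """Strip the "t3_" fullname prefix, if present, to get the bare submission id."""
    prefix = "t3_"
    return link_id[len(prefix):] if link_id.startswith(prefix) else link_id


# Hypothetical ids, standing in for the contents of submission_ids.txt
submission_ids = {"abc123", "def456"}

# Hypothetical comment objects with only the fields needed here
comments = [
    {"id": "c1", "link_id": "t3_abc123"},
    {"id": "c2", "link_id": "t3_zzz999"},
    {"id": "c3", "link_id": "t3_def456"},
]

kept = [
    comment["id"]
    for comment in comments
    if submission_id_from_link_id(comment["link_id"]) in submission_ids
]
print(kept)
```

An exact comparison like this also avoids accidental matches where one id happens to be a substring of another.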

In [5]:
# Extracting comments where the body contains the keywords ethereum or eth

import zstandard
import os
import json
import sys
import csv
from datetime import datetime
import logging.handlers

# put the path to the input file
input_file = r"C:\Users\leo_c\Downloads\reddit\subreddits\CryptoCurrency_comments.zst"

# put the name or path to the output file. The file extension from below will be added automatically
output_file = r"C:\Users\leo_c\Downloads\filtered_body_CryptoCurrency_comments"

# the format to output in, pick from the following options
#   zst: same as the input, a zstandard compressed ndjson file. Can be read by the other scripts in the repo
#   txt: an ndjson file, which is a text file with a separate json object on each line. Can be opened by any text editor
#   csv: a comma separated value file. Can be opened by a text editor or excel
# WARNING READ THIS: if you use txt or csv output on a large input file without filtering out most of the rows, the resulting file will be 
# extremely large. Usually about 7 times as large as the compressed input file
output_format = "csv"

# override the above format and output only this field into a text file, one per line. Useful if you want to make a list of authors or ids. 
# See the examples below
# any field that's in the dump is supported, but useful ones are
#   author: the username of the author
#   id: the id of the submission or comment
#   link_id: only for comments, the fullname of the submission the comment is associated with
#   parent_id: only for comments, the fullname of the parent of the comment. Either another comment or the submission if it's top level
single_field = None

# the fields in the file are different depending on whether it has comments or submissions. If we're writing a csv, we need to know which 
# fields to write.
# The filename from the torrent has which type it is, but you'll need to change this if you removed that from the filename
# EXTRA COMMENT: is_submission = "submission" in input_file sets the is_submission to true if it is a submission file, and false otherwise. You don't need to change this at all.
is_submission = "submission" in input_file

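# only output items between these two dates
# NOTE: both dates are parsed as midnight, so items created after 00:00 on to_date are excluded;
# set to_date to the following day (e.g. "2023-01-01") to include all of 2022-12-31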
from_date = datetime.strptime("2022-12-01", "%Y-%m-%d")
to_date = datetime.strptime("2022-12-31", "%Y-%m-%d")

# the field to filter on, the values to filter with and whether it should be an exact match
# some examples:
# return only objects where the author is u/watchful1 or u/spez
# field = "author"
# values = ["watchful1","spez"]
# exact_match = True
#
# return only objects where the title contains either "stonk" or "moon"
# field = "title"
# values = ["stonk","moon"]
# exact_match = False
#
# return only objects where the body contains either "stonk" or "moon". 
# For submissions the body is in the "selftext" field, for comments it's in the "body" field
# field = "selftext"
# values = ["stonk","moon"]
# exact_match = False

# filter a submission file and then get a file with all the comments only in those submissions. 
# This is a multi-step process. Add your submission filters and set the output file name to something unique
# input_file = "redditdev_submissions.zst"
# output_file = "filtered_submissions"
# output_format = "csv"
# field = "author"
# values = ["watchful1"]
#
# run the script, this will result in a file called "filtered_submissions.csv" that contains only submissions by u/watchful1
# now we'll run the script again with the same input and same filters, but set the output to single field.
# Be sure to change the output file to a new name, but don't change any of the other inputs
# output_file = "submission_ids"
# single_field = "id"
#
# run the script again. This will result in a file called "submission_ids.txt" that has an id on each line
# now remove all the other filters and update the script to read from the comments file, using the submission ids list we created before.
# And change the output name again so we don't overwrite anything
# input_file = "redditdev_comments.zst"
# output_file = "filtered_comments"
# single_field = None  # resetting this back so it's not used
# field = "link_id"  # in the comment object, this is the field that contains the submission id
# exact_match = False  # the link_id field has a prefix on it, so we can't do an exact match
# values_file = "submission_ids.txt"
#
# run the script one last time and now you have a file called "filtered_comments.csv" that only has comments from your submissions above
# if you want only top level comments instead of all comments, you can set field to "parent_id" instead of "link_id"

field = "body"
values = ["ethereum","eth"]
exact_match = False

# if you have a long list of values, you can put them in a file and put the filename here. If set, this overrides the values list above
# if this list is very large, it could greatly slow down the process
values_file = None




# sets up logging to the console as well as a file
log = logging.getLogger("bot")
log.setLevel(logging.INFO)
# only attach the handlers once: running this setup again in the same session (e.g. re-running a notebook cell)
# would otherwise add them a second time and print every message more than once
if not log.handlers:
	log_formatter = logging.Formatter('%(asctime)s - %(levelname)s: %(message)s')
	log_str_handler = logging.StreamHandler()
	log_str_handler.setFormatter(log_formatter)
	log.addHandler(log_str_handler)
	if not os.path.exists("logs"):
		os.makedirs("logs")
	log_file_handler = logging.handlers.RotatingFileHandler(os.path.join("logs", "bot.log"), maxBytes=1024*1024*16, backupCount=5)
	log_file_handler.setFormatter(log_formatter)
	log.addHandler(log_file_handler)


def write_line_zst(handle, line):
	handle.write(line.encode('utf-8'))
	handle.write("\n".encode('utf-8'))


def write_line_json(handle, obj):
	handle.write(json.dumps(obj))
	handle.write("\n")


def write_line_single(handle, obj, field):
	if field in obj:
		handle.write(obj[field])
	else:
		log.info(f"{field} not in object {obj['id']}")
	handle.write("\n")


def write_line_csv(writer, obj, is_submission):
	output_list = []
	output_list.append(str(obj['score']))
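	# use UTC here as well, so the date written to the csv agrees with the date range filter in the main loop
	output_list.append(datetime.utcfromtimestamp(int(obj['created_utc'])).strftime("%Y-%m-%d"))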
	if is_submission:
		output_list.append(obj['title'])
	output_list.append(f"u/{obj['author']}")
	output_list.append(f"https://www.reddit.com{obj['permalink']}")
	if is_submission:
		if obj['is_self']:
			if 'selftext' in obj:
				output_list.append(obj['selftext'])
			else:
				output_list.append("")
		else:
			output_list.append(obj['url'])
	else:
		output_list.append(obj['body'])
	writer.writerow(output_list)


def read_and_decode(reader, chunk_size, max_window_size, previous_chunk=None, bytes_read=0):
	chunk = reader.read(chunk_size)
	bytes_read += chunk_size
	if previous_chunk is not None:
		chunk = previous_chunk + chunk
	try:
		return chunk.decode()
	except UnicodeDecodeError:
		if bytes_read > max_window_size:
			raise UnicodeError(f"Unable to decode frame after reading {bytes_read:,} bytes")
		log.info(f"Decoding error with {bytes_read:,} bytes, reading another chunk")
		return read_and_decode(reader, chunk_size, max_window_size, chunk, bytes_read)
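

# Illustrative sketch, not part of the original script and not called by it: why read_and_decode retries.
# A chunk boundary can fall in the middle of a multi-byte UTF-8 character. Decoding that chunk on its own
# fails, while decoding it together with the bytes that follow succeeds.
# demo_split_multibyte is a hypothetical helper name and the sample string is invented.
def demo_split_multibyte():
	data = "h\u00e9llo".encode("utf-8")  # the accented e is encoded as two bytes: b'h\xc3\xa9llo'
	first, second = data[:2], data[2:]  # cut between the two bytes of the accented e
	try:
		first.decode()
		first_decodes = True
	except UnicodeDecodeError:
		first_decodes = False
	# joining the chunks before decoding, as read_and_decode does, recovers the text
	return first_decodes, (first + second).decode()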


def read_lines_zst(file_name):
	with open(file_name, 'rb') as file_handle:
		buffer = ''
		reader = zstandard.ZstdDecompressor(max_window_size=2**31).stream_reader(file_handle)
		while True:
			chunk = read_and_decode(reader, 2**27, (2**29) * 2)

			if not chunk:
				break
			lines = (buffer + chunk).split("\n")

			for line in lines[:-1]:
				yield line.strip(), file_handle.tell()

			buffer = lines[-1]

		reader.close()
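

# Illustrative sketch, not part of the original script and not called by it: the matching rule that the
# main loop below applies using field, values and exact_match. Both sides are lowercased, then compared
# for equality (exact_match = True) or for substring containment (exact_match = False).
# demo_matches is a hypothetical helper name; the sample values in the notes below are invented.
def demo_matches(field_value, values, exact_match):
	field_value = field_value.lower()
	for value in values:
		value = value.lower()
		if exact_match and value == field_value:
			return True
		if not exact_match and value in field_value:
			return True
	return False
# demo_matches("t3_abc123", ["abc123"], True) is False: the "t3_" prefix on link_id defeats an exact match
# demo_matches("t3_abc123", ["abc123"], False) is True
# demo_matches("a new method", ["eth"], False) is also True: substring matching hits any text that
# merely contains the value, so short values such as "eth" can over-match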


if __name__ == "__main__":
	if single_field is not None:
		log.info("Single field output mode, changing output file format to txt")
		output_format = "txt"
	output_path = f"{output_file}.{output_format}"

	writer = None
	if output_format == "zst":
		log.info("Output format set to zst")
		handle = zstandard.ZstdCompressor().stream_writer(open(output_path, 'wb'))
	elif output_format == "txt":
		log.info("Output format set to txt")
		handle = open(output_path, 'w', encoding='UTF-8')
	elif output_format == "csv":
		log.info("Output format set to csv")
		handle = open(output_path, 'w', encoding='UTF-8', newline='')
		writer = csv.writer(handle)
	else:
		log.error(f"Unsupported output format {output_format}")
		sys.exit()

	if values_file is not None:
		values = []
		with open(values_file, 'r') as values_handle:
			for value in values_handle:
				values.append(value.strip().lower())
		log.info(f"Loaded {len(values)} from values file")
	else:
		values = [value.lower() for value in values]  # convert to lowercase

	file_size = os.stat(input_file).st_size
	file_bytes_processed = 0
	created = None
	matched_lines = 0
	bad_lines = 0
	total_lines = 0
	for line, file_bytes_processed in read_lines_zst(input_file):
		total_lines += 1
		if total_lines % 100000 == 0:
			log.info(f"{created.strftime('%Y-%m-%d %H:%M:%S')} : {total_lines:,} : {matched_lines:,} : {bad_lines:,} : {file_bytes_processed:,}:{(file_bytes_processed / file_size) * 100:.0f}%")

		try:
			obj = json.loads(line)
			created = datetime.utcfromtimestamp(int(obj['created_utc']))

			if created < from_date:
				continue
			if created > to_date:
				continue

			field_value = obj[field].lower()
			matched = False
			for value in values:
				if exact_match:
					if value == field_value:
						matched = True
						break
				else:
					if value in field_value:
						matched = True
						break
			if not matched:
				continue

			matched_lines += 1
			if output_format == "zst":
				write_line_zst(handle, line)
			elif output_format == "csv":
				write_line_csv(writer, obj, is_submission)
			elif output_format == "txt":
				if single_field is not None:
					write_line_single(handle, obj, single_field)
				else:
					write_line_json(handle, obj)
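		# AttributeError covers a field whose value is not a string (e.g. null), which has no .lower()
		except (KeyError, AttributeError, json.JSONDecodeError):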
			bad_lines += 1

	handle.close()
	log.info(f"Complete : {total_lines:,} : {matched_lines:,} : {bad_lines:,}")

2023-07-18 01:09:26,559 - INFO: Output format set to csv
2023-07-18 01:09:27,611 - INFO: 2016-10-15 12:45:09 : 100,000 : 0 : 0 : 21,758,450:1%
2023-07-18 01:09:28,245 - INFO: 2017-07-13 09:25:00 : 200,000 : 0 : 0 : 21,758,450:1%
2023-07-18 01:09:28,245 - I