An extensive reddit comment scraper
Kularity provides a simple, customisable and straightforward way to scrape Reddit for comments and their scores. The data can be used for machine learning datasets, statistics or other tasks. Its algorithm scrapes for comments while ignoring inactive users and continually looking for more related users to scrape from. Each scrape is split into layers.
Creation stage:
- Get posts from a starting point, along with their posters
- For each post, collect comments and their authors
- Build the first layer of users from those authors
- Start the layer process from this first layer
Layer processing stage: each pass of this stage produces one layer of data.
- For each username:
  - Get all comments and their scores made by that user
  - For each comment, get the username of the poster of the submission it was made under
  - Add that submission poster to the next layer
- Repeat the process while there are remaining layers
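The two stages above can be sketched as follows. This is an illustrative outline rather than Kularity's actual code; the three fetcher callables (`fetch_posts`, `fetch_comments`, `fetch_user_comments`) are hypothetical stand-ins for the underlying Reddit API calls.

```python
from typing import Dict, List, Set

def scrape_layers(fetch_posts, fetch_comments, fetch_user_comments, max_layers=3):
    """Collect comment scores per user, layer by layer.

    fetch_posts()              -> list of starting post IDs
    fetch_comments(post_id)    -> list of (author, score) pairs
    fetch_user_comments(user)  -> list of (score, submission_poster) pairs
    """
    scores: Dict[str, List[int]] = {}  # username -> comment scores
    layer: Set[str] = set()

    # Creation stage: collect comments under the starting posts and
    # seed the first layer with their authors.
    for post_id in fetch_posts():
        for author, score in fetch_comments(post_id):
            scores.setdefault(author, []).append(score)
            layer.add(author)

    # Layer processing stage: scrape each user's comment history, and
    # build the next layer from the posters of the submissions those
    # comments were made under.
    for _ in range(max_layers):
        next_layer: Set[str] = set()
        for user in layer:
            for score, submission_poster in fetch_user_comments(user):
                scores.setdefault(user, []).append(score)
                next_layer.add(submission_poster)
        layer = next_layer - set(scores)  # skip users already scraped
    return scores
```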
To get a local copy up and running, follow these steps.
You will need to install some Python modules for this to work.
python -m pip install -r requirements.txt
You also need a client ID, client secret & user agent.
They should be stored in `functions/.env`. A typical `.env` file looks like this:
client_id="wai98jtujsorsomething"
client_secret="jdwa8a9hrrui8jawd89jua09aju"
user_agent="windows:myscraper:1.0.0 (by u/user)"
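The `.env` format above is one `key="value"` pair per line. As a sketch of how such a file can be read (Kularity's own loader may differ, or use a library such as python-dotenv), a minimal stdlib-only parser:

```python
from pathlib import Path

def load_env(path: str) -> dict:
    """Parse a minimal key="value" .env file like the sample above."""
    env = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything without an '='
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"')
    return env
```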
- Get a client ID and secret using these steps
- Insert the client ID, secret and your own user agent into `functions/.env`
- Clone the repo
git clone https://github.com/Cyclip/Kularity.git
- Run the test (`debug.bat` or `info.bat`)
- Collect results at `test/dump.db`
For more examples, please refer to the Documentation
Distributed under the GNU General Public License v3.0. See LICENSE for more information.
Run `python main.py -h` for a shortened version of the following argument reference.
Default: 100 (capped at 100)
Number of initial posts to start off with.
Default: all
Scrape `startingPostLimit` posts from this subreddit at the start. This does not mean all scraped posts will come from here, unless restricted via `restrictSubs`.
Default: hot
Sorting to use when scraping the initial posts
Default: 5000
Maximum number of comments to scrape from each of the initial posts.
A higher value means more comments and more users to start scraping from.
Default: 1000 (capped at 1000)
Maximum number of comments to scrape from a single user.
Default: None
Maximum number of users to scrape from per layer.
Default: 15
Maximum number of submissions to scrape from under a user's comments.
For example when scraping a user's comments, the first 15 comments will have the submissions retrieved as well (the username of the submission poster specifically).
This option is very performance-heavy.
More verbose logging (for debugging, or just nice to look at)
Log to `build/log.log` as well as the console
Default: 3
Maximum number of layers to process excluding creation.
Default: dump
Directory to dump all data in.
nargs: 2 (int, int)
Example: `--normalize -5 100`
Normalize scores within the given range to an output between 0 and 1.
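With `--normalize -5 100`, scores in the range [-5, 100] map onto [0, 1]. A plausible min-max formula for this (the exact out-of-range behaviour is an assumption; Kularity may drop such comments instead of clamping them):

```python
def normalize(score: float, low: float, high: float) -> float:
    """Min-max normalize a comment score into [0, 1].

    Scores outside [low, high] are clamped to the boundary
    (an assumed behaviour, not confirmed from the source).
    """
    score = max(low, min(high, score))
    return (score - low) / (high - low)
```

For example, `normalize(100, -5, 100)` yields 1.0 and `normalize(-5, -5, 100)` yields 0.0.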
Disable all possible inputs
Store the data in `dump.json` as well as `dump.db`
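Since `dump.db` is an SQLite database, it can be inspected with Python's standard library. The table layout is not documented here, so a safe first step is listing the tables:

```python
import sqlite3

def list_tables(db_path: str) -> list:
    """Return the names of all tables in an SQLite dump file."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type='table'"
        ).fetchall()
        return [name for (name,) in rows]
    finally:
        con.close()
```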
Default: None
Example: --blockUsers users.txt
Argument should be a file path. File should follow this format:
user1
user2
...
Default: None
Example: --blockSubreddits subreddits.txt
Argument should be a file path. File should follow this format:
subreddit1
subreddit2
...
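Both block-list arguments take a plain text file with one name per line, as shown above. A loader sketch (the case-insensitive matching is an assumption, since Reddit names are case-insensitive; Kularity may compare them verbatim):

```python
def load_blocklist(path: str) -> set:
    """Read a newline-delimited block list of users or subreddits,
    skipping blank lines and lowercasing names for matching."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}
```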
Block all NSFW profiles/subreddits. This may significantly decrease the number of comments scraped.
Default: -10000000
Minimum comment score required
Default: 10000000
Maximum comment score required
Example: --minTime "01/01/2021 00:00:00"
All comments/submissions must be made at or after this time (d/m/y H:M:S)
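The d/m/y H:M:S timestamp in the example corresponds to the following format string in Python's `datetime` (assuming a four-digit year, as in `01/01/2021`):

```python
from datetime import datetime

def parse_min_time(value: str) -> datetime:
    """Parse a --minTime argument in "d/m/y H:M:S" form."""
    return datetime.strptime(value, "%d/%m/%Y %H:%M:%S")
```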
Default: None
Example: --restrictSubs subreddits.txt
Restrict all comments and posts to the subreddits listed in the file. Not intended for this algorithm, but it's an option nonetheless.
Play a notification sound when scraping is complete.
--
Sound effects obtained from https://www.zapsplat.com