
Kularity

An extensive reddit comment scraper
Explore the docs »

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. License
  5. Information

About The Project

Kularity screenshot

Kularity provides a simple, customisable and straightforward way to scrape reddit for comments and their scores. The data can be used for machine learning datasets, statistics or other tasks. Its algorithm scrapes comments while ignoring inactive users and continually discovering related users to scrape next. Each scrape is split into layers.

Algorithm

Creation stage:

  • Get posts from a starting point and their posters
  • For each post, collect comments and their users
  • Build next layer of users
  • Start layer process from first layer

Layer processing stage: each pass of this stage creates one layer of data (a code sketch follows the list).

  • For each username:
    • Get all comments + scores made by the user
    • For each comment, get the parent submission's poster username
    • Add that poster to the next layer
  • Repeat the process if there are remaining layers
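
A minimal sketch of the two stages above, assuming Kularity talks to Reddit through PRAW (the function and variable names here are illustrative, not Kularity's actual code):

import praw

reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="...")

def creation_stage(subreddit="all", post_limit=100):
    # Creation stage: gather the starting posts, flatten their comment
    # trees, and collect the commenters as the first layer of users.
    users = set()
    for submission in reddit.subreddit(subreddit).hot(limit=post_limit):
        submission.comments.replace_more(limit=0)  # expand "load more" stubs
        for comment in submission.comments.list():
            if comment.author is not None:  # skip deleted accounts
                users.add(comment.author.name)
    return users

def process_layer(usernames, user_comment_limit=1000):
    # Layer stage: record each user's comments and scores, and queue the
    # posters of the parent submissions as the next layer of users.
    rows, next_layer = [], set()
    for name in usernames:
        for comment in reddit.redditor(name).comments.new(limit=user_comment_limit):
            rows.append((comment.body, comment.score))
            poster = comment.submission.author  # poster of the parent submission
            if poster is not None:
                next_layer.add(poster.name)
    return rows, next_layer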

Built With

Kularity is written in Python and uses the Reddit API, authenticated with a client ID, client secret and user agent (see Prerequisites below).

Getting Started

To get a local copy up and running, follow these simple steps.

Prerequisites

You will need to install some Python modules for this to work.

python -m pip install -r requirements.txt

You also need a client ID, client secret & user agent. They should be stored in functions/.env. A typical .env file looks like this:

client_id="wai98jtujsorsomething"
client_secret="jdwa8a9hrrui8jawd89jua09aju"
user_agent="windows:myscraper:1.0.0 (by u/user)"
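
How these credentials are consumed is up to Kularity's internals, but a minimal sketch, assuming python-dotenv and PRAW, would be:

import os
import praw
from dotenv import load_dotenv

load_dotenv("functions/.env")  # loads client_id, client_secret, user_agent

reddit = praw.Reddit(
    client_id=os.environ["client_id"],
    client_secret=os.environ["client_secret"],
    user_agent=os.environ["user_agent"],
)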

Installation

  1. Get a client ID and secret using these steps
  2. Insert the client ID, secret and your own user agent into functions/.env
  3. Clone the repo
    git clone https://github.com/Cyclip/Kularity.git
  4. Run the test (debug.bat or info.bat)
  5. Collect results at test/dump.db

Usage

Kularity is run from the command line via main.py, configured with the arguments described under Information below.
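
For instance, a run that scrapes a chosen subreddit for two layers with verbose, file-based logging might look like this (flag spellings per the Information section below; the subreddit is just an example):

python main.py --startingSubreddit askreddit --layers 2 --verbose --fileLogging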

For more examples, please refer to the Documentation

License

Distributed under GNU General Public License v3.0. See LICENSE for more information.

Information

Arguments

Run python main.py -h for a shortened version of this list.

startingPostLimit

Default: 100 (capped at 100). Number of initial posts to start with.

startingSubreddit

Default: all. Scrape startingPostLimit posts from this subreddit at the start. This does not mean all scraped posts will come from this subreddit, unless restricted via restrictSubs.

startingSort

Default: hot. Sorting to use when scraping the initial posts.

postCommentLimit

Default: 5000. Maximum number of comments to scrape from each of the initial posts. A higher value means more comments and more users to start scraping from.

userCommentLimit

Default: 1000 (capped at 1000). Maximum number of comments to scrape from a single user.

userLimit

Default: None. Maximum number of users to scrape per layer.

submissionLimit

Default: 15. Maximum number of submissions to scrape from under a user's comments. For example, when scraping a user's comments, the first 15 comments will also have their submissions retrieved (specifically, the username of each submission's poster). This is very performance-heavy.
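
In PRAW terms (assuming that is the underlying library), resolving a comment's parent submission costs an extra API request per comment, which is why this is expensive. An illustrative fragment, with reddit and name as in the sketch under Algorithm:

for i, comment in enumerate(reddit.redditor(name).comments.new(limit=1000)):
    if i < 15:  # submissionLimit: only the first 15 comments pay this cost
        poster = comment.submission.author  # lazy fetch: one extra API request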

verbose

More verbose logging (useful for debugging, or just nice to look at).

fileLogging

Log to build/log.log as well as the console

layers

Default: 3. Maximum number of layers to process, excluding creation.

dir

Default: dump. Directory to dump all data in.

normalize

nargs: 2 (int, int). Example: --normalize -5 100. Normalize scores from the given range to outputs between 0 and 1.
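
This presumably applies standard min-max normalization (not verified against Kularity's source); a sketch of what --normalize -5 100 would compute:

def normalize(score, lo=-5, hi=100):
    # Min-max normalization: lo maps to 0.0, hi maps to 1.0
    return (score - lo) / (hi - lo)

normalize(-5)   # 0.0
normalize(100)  # 1.0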

noInput

Disable all possible inputs

formatJSON

Store the data in dump.json as well as dump.db

blockUsers

Default: None. Example: --blockUsers users.txt. The argument should be a file path; the file should follow this format:

user1
user2
...

blockSubreddits

Default: None. Example: --blockSubreddits subreddits.txt. The argument should be a file path; the file should follow this format:

subreddit1
subreddit2
...

blockNSFW

Block all NSFW profiles/subreddits. This may significantly decrease the number of comments scraped.

minScore

Default: -10000000. Minimum comment score required.

maxScore

Default: 10000000. Maximum comment score allowed.

minTime

Example: --minTime "01/01/2021 00:00:00". All comments/submissions must have been made on or after this time (d/m/y H:M:S format).

restrictSubs

Default: None. Example: --restrictSubs subreddits.txt. Restrict all comments and posts to the subreddits listed in the file. Not what this algorithm is intended for, but it's an option nonetheless.

notify

Play a notification sound when scraping is complete.

--

Sound effects obtained from https://www.zapsplat.com
