Content-Based Community Discovery and Recommendation on Reddit
The Python scripts for this project can be used to download and vectorize subreddit data. They require the standard numerical computing libraries from the Anaconda environment, as well as a handful of other packages for content processing. The code has been tested on Python 3.6. Note that the downloader relies on a pre-made dataset of subreddit lists, which can be downloaded here.
- download.py: Subreddit content scraper (See file for details)
- make_vecs.py: Content vectorizer (See file for details)
- visualize.py: Tool for dimensionality reduction and plotting (See file for details)
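To make the pipeline concrete, here is a minimal, self-contained sketch of the kind of content-based similarity the vectorizer enables: each subreddit is represented as a bag-of-words vector, and neighbours are ranked by cosine similarity. The actual `make_vecs.py` may use a richer representation (e.g. TF-IDF or embeddings); the toy corpus and function names below are illustrative assumptions, not the repo's code.

```python
import math
from collections import Counter

# Toy stand-in data: one concatenated text blob per subreddit.
subreddit_texts = {
    "python": "code scripts interpreter pip modules programming",
    "learnprogramming": "code beginner programming tutorials modules",
    "cooking": "recipes baking ingredients oven dinner",
}

# Bag-of-words vector for each subreddit (word -> count).
vectors = {name: Counter(text.split()) for name, text in subreddit_texts.items()}

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def most_similar(name, k=2):
    """Rank the other subreddits by content similarity to `name`."""
    scores = [(cosine(vectors[name], vec), other)
              for other, vec in vectors.items() if other != name]
    return [other for _, other in sorted(scores, reverse=True)][:k]

print(most_similar("python"))  # → ['learnprogramming', 'cooking']
```

Swapping the `Counter` vectors for the output of `make_vecs.py` would give the same nearest-neighbour lookup over real subreddit content.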
I try to keep records of all the experiments I perform in the following files. Each notebook usually covers the analysis or creation of a single data file, so the journal is fairly modular in structure. The notebooks link to the external site plot.ly, which I use to save interactive visualizations of the data.
For now, user-ready functionality is limited to the scripts mentioned in the 'running' section. However, I hope to deploy the models created here at app scale once both a useful model and a front end are complete.
Seeing how many of today's newsfeed and content-suggestion systems work, I was inspired to create a new sort of discovery platform, using data science techniques to construct the models.
Reddit has long been my go-to platform for aggregated web content. Recently, in an effort to break free of the closed circle of communities (subreddits) I had found myself in, I started making new accounts just so I could go through the process of re-subscribing to new subs that looked interesting. One of my hopes for this project is to automate and enhance that process by training a content-based recommendation system, one that hopefully incorporates an element of deviation from my current sources.
Right now this repo is mostly me just tinkering around from time to time, but feel free to contribute or give feedback!