Alternative Content-Based Community Discovery and Recommendation
Content-Based Community Discovery and Recommendation on Reddit


The Python scripts in this project download and vectorize subreddit data. They require the standard numerical computing libraries included with the Anaconda environment, plus a handful of other packages for content processing. The code has been tested on Python 3.6. Note that the downloader relies on a pre-made dataset of subreddit lists, which can be downloaded here.


  • Subreddit content scraper (See file for details)
  • Content vectorizer (See file for details)
  • Tool for dimensionality reduction and plotting (See file for details)



I try to keep a record of every experiment I perform in the following notebooks. Each notebook usually covers the analysis or creation of a single data file, so the journal is fairly modular in structure. The notebooks link to the external site, which I use to save interactive visualizations of the data.

  1. TFIDF Vectorization and t-SNE Dimensionality Reduction
  2. Representation Analysis and Basic Clustering
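
As a rough illustration of the idea behind these notebooks, here is a minimal sketch of TF-IDF vectorization followed by a cosine-similarity lookup. It uses only the standard library and toy data; the function names (`tfidf_vectors`, `cosine`, `most_similar`), the whitespace tokenizer, the smoothed IDF formula, and the sample texts are all assumptions made for this example, not the code or data used in the repo.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each document name to a sparse {term: tf-idf weight} dict.

    `docs` is a dict of name -> raw text. Tokenization is a plain
    lowercase whitespace split (an assumption made for this sketch).
    """
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    vectors = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        total = len(tokens)
        vectors[name] = {
            # Term frequency times a smoothed inverse document frequency.
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(name, vectors):
    """Rank every other document by similarity to `name`, best first."""
    scores = [
        (other, cosine(vectors[name], vec))
        for other, vec in vectors.items()
        if other != name
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for scraped subreddit text (hypothetical data).
toy_docs = {
    "python": "python code script library code",
    "learnprogramming": "code tutorial python beginner",
    "cooking": "recipe oven dinner recipe",
}
toy_vectors = tfidf_vectors(toy_docs)
ranking = most_similar("python", toy_vectors)
```

With this toy data, `ranking` places `learnprogramming` ahead of `cooking`, since the first two texts share vocabulary and the third shares none. A real run would swap in scraped subreddit content and a proper tokenizer, and would likely add a dimensionality reduction step such as t-SNE before plotting.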


User-ready functionality is so far limited to the scripts mentioned in the 'running' section. However, I hope to deploy the models created here at app scale once a useful model and front end are both complete.


Seeing how many newsfeed and content-suggestion systems work nowadays, I was inspired to create a new sort of discovery platform using data science techniques to construct the models.

Why Reddit?

Reddit has long been my go-to platform for aggregated web content. Recently, in an effort to break free of the closed circle of communities (subreddits) I had found myself in, I started making new accounts just so I could go through the process of subscribing to new subs that I thought might be interesting. One of my hopes for this project is to automate and enhance this process by training a content-based recommendation system, ideally one that incorporates an element of deviation from a user's current sources.


Right now this repo is mostly me just tinkering around from time to time, but feel free to contribute or give feedback!

