Subreddit classification via Natural Language Processing
Project 3 Subreddit NLP.pdf

Project 3: Subreddit Classification


  1. Using Reddit's API, collect posts from two subreddits: Communism & Socialism
  2. Use NLP to train a classifier that predicts which subreddit a given post came from.

Executive Summary


Use NLP and classification algorithms to distinguish posts from two similar subreddits: Communism vs. Socialism


  • Query PushShift API to retrieve submissions
  • Clean & pre-process text
  • Tokenize & vectorize text
  • Grid search to optimize hyperparameters across two classification algorithms
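
As a rough illustration of the grid search step above, the sketch below tunes two classifiers inside a TF-IDF pipeline with scikit-learn's `GridSearchCV`. Random Forest is named elsewhere in this README; logistic regression as the second algorithm, the parameter grids, and the toy data are assumptions made only for this example, not the settings used in the notebooks.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-in data (hypothetical); the real input is the cleaned post text.
texts = [
    "workers should control the factories",
    "abolish private property and class society",
    "the vanguard party leads the revolution",
    "a classless stateless society is the goal",
    "public healthcare funded through taxation",
    "democratic ownership of key industries",
    "stronger unions and a higher minimum wage",
    "social programs reduce economic inequality",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = Communism, 1 = Socialism


def run_grid_search(estimator, param_grid, texts, labels):
    """Fit a TF-IDF + classifier pipeline over every combination in param_grid."""
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
        ("clf", estimator),
    ])
    # cv=2 only because the toy data set is tiny; use more folds on real data.
    search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
    search.fit(texts, labels)
    return search


# Assumed candidate models and grids; the "clf__" prefix targets the pipeline step.
candidates = {
    "random_forest": (
        RandomForestClassifier(random_state=42),
        {"clf__n_estimators": [10, 50], "clf__max_features": ["sqrt", "log2"]},
    ),
    "logistic_regression": (
        LogisticRegression(max_iter=1000),
        {"clf__C": [0.1, 1.0]},
    ),
}

results = {
    name: run_grid_search(estimator, grid, texts, labels)
    for name, (estimator, grid) in candidates.items()
}

for name, search in results.items():
    print(name, search.best_params_, round(search.best_score_, 3))
```

Keeping the vectorizer inside the pipeline means it is refit on each training fold, so the cross-validated score is not inflated by vocabulary learned from the held-out fold.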

Natural Language Processing:

  1. Removed extraneous content: ‘removed’ tags, moderator posts, hyperlinks, and non-letter characters

  2. Lemmatized text so that inflected forms of a word collapse to one token, making posts easier to compare

  3. Vectorized data using TF-IDF:

  • 75K vectors
  • Removed / edited stopwords
  • N-Gram Range: (1, 2)
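
The cleaning and vectorization steps above could look roughly like the following sketch. The regular expressions, the `[removed]` placeholder string, the `clean_post` helper, and reading "75K" as `max_features=75_000` are all assumptions for illustration. Lemmatization is left as an optional callable because a real lemmatizer (for example NLTK's `WordNetLemmatizer`) needs a separate corpus download.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # hyperlinks
NON_LETTER_RE = re.compile(r"[^a-zA-Z\s]")      # digits, punctuation, symbols


def clean_post(text, lemmatize=None):
    """Strip links, '[removed]' placeholders and non-letter characters.

    `lemmatize` is an optional word -> lemma callable (hypothetical hook);
    when it is None the tokens are left unchanged.
    """
    text = URL_RE.sub(" ", text)
    text = text.replace("[removed]", " ")  # assumed form of the 'removed' tag
    text = NON_LETTER_RE.sub(" ", text)
    tokens = text.lower().split()
    if lemmatize is not None:
        tokens = [lemmatize(token) for token in tokens]
    return " ".join(tokens)


# Assumed settings mirroring the list above: unigrams and bigrams,
# English stopwords removed, vocabulary capped at 75,000 features.
vectorizer = TfidfVectorizer(
    max_features=75_000,
    stop_words="english",
    ngram_range=(1, 2),
)

raw_posts = [
    "Class struggle shapes history!!! https://example.com",
    "[removed]",
    "Public ownership of industry, explained in 5 points",
]
docs = [clean_post(post) for post in raw_posts]
docs = [doc for doc in docs if doc]  # drop posts that are empty after cleaning

matrix = vectorizer.fit_transform(docs)
print(matrix.shape)
```

Posts that consist only of a removed placeholder end up empty after cleaning, so the sketch filters them out before fitting the vectorizer.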


This analysis includes the following:

  • A README markdown file that provides an introduction to and overview of the project
  • Two (2) Jupyter notebooks that cover the following: (1) API query via the PushShift API (2) Natural Language Processing and model selection / evaluation
  • Accompanying presentation slideshow rendered as a .pdf file.


For this analysis, I used the PushShift API to query the following subreddits: Communism and Socialism
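
A minimal sketch of such a query is shown below, assuming the public PushShift submission search endpoint and its `subreddit`, `size`, and `before` parameters. The helper names `build_query_url` and `fetch_submissions` are hypothetical, and access to PushShift may be restricted, so treat this as an outline rather than the exact code in the notebook.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint for submission search; availability may vary.
PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission"


def build_query_url(subreddit, size=100, before=None):
    """Build the search URL for one page of submissions from a subreddit."""
    params = {"subreddit": subreddit, "size": size}
    if before is not None:
        # Epoch timestamp: only return submissions created before this time.
        params["before"] = before
    return PUSHSHIFT_URL + "?" + urllib.parse.urlencode(params)


def fetch_submissions(subreddit, size=100, before=None):
    """Request one page of submissions and return the list under 'data'."""
    url = build_query_url(subreddit, size=size, before=before)
    with urllib.request.urlopen(url, timeout=30) as response:
        payload = json.load(response)
    return payload.get("data", [])


# Example call (performs a network request, so it is left commented out):
# posts = fetch_submissions("socialism", size=100)
# oldest = min(post["created_utc"] for post in posts)
# older_posts = fetch_submissions("socialism", size=100, before=oldest)
```

Passing the oldest `created_utc` value from one page as `before` on the next request is one common way to page backwards through a subreddit's history.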

SOURCES: Reddit / PushShift API

Next Steps

  1. Increase the maximum number of features produced by TF-IDF vectorization
  2. Increase the number of features considered in Random Forest

