# Post Here: The Subreddit Suggester

## Notebook 1: The Data

> Small Model (limited dataset)

### By _Tobias Reaper_

---

## Notebook outline

* [_Imports and Configuration_](#Imports-and-Configuration)
* [Introduction](#Introduction)
  * [The Problem](#The-Problem)
  * [The Solution (The App)](#The-Solution)
  * [My Role](#My-Role)
* [The Data](#The-Data)
  * [Wrangling](#Wrangling)
  * [Exploration](#Exploration)
* [Modeling](#Modeling)
  * [Challenges](#Challenges)
  * [Feature Selection](#Feature-Selection)
  * [Vectorization](#Vectorization)
  * [Baseline](#Baseline)
  * [Multinomial Naive Bayes](#Multinomial-Naive-Bayes)
* [Final Thoughts](#Final-Thoughts)

---

## Imports and Configuration

In [1]:
# === General imports === #
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import janitor

In [2]:
# === ML imports === #
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_selection import chi2, SelectKBest

# === NLP Imports === #
from sklearn.feature_extraction.text import TfidfVectorizer
# import spacy

In [3]:
# === Configure === #
# Load spacy language model
# nlp = spacy.load("en_core_web_md")

# Configure pandas display settings
pd.options.display.max_colwidth = 200

# Set random seed
seed = 92

---

## Introduction

### The Problem

Reddit is an expansive site. Anyone who has spent any significant amount of time on it knows what I mean. There is a subreddit for seemingly every topic anyone could ever want to discuss or even think about (and many that most do not want to think about).

Reddit is a powerful site; a tool for connecting and sharing information with like- or unlike-minded individuals around the world. When used well, it can be a very useful resource.

On the other hand, the deluge of information that's constantly piling into the pages of the site can be overwhelming and lead to wasted time. As with any tool, it can be used for good or for not-so-good.

A common problem that Redditors experience, particularly those who are relatively new to the site, is where to post content. Given that there are subreddits for just about everything, with wildly varying degrees of specificity, it can be quite overwhelming trying to find the best place for each post.

Just to illustrate the point, some subreddits get _weirdly_ specific. I won't go into the _really_ weird or NSFW, but here are some good examples of what I mean by specific:

* [r/Borderporn](https://www.reddit.com/r/Borderporn/)
* [r/BreadStapledtoTrees](https://www.reddit.com/r/BreadStapledToTrees/)
* [r/birdswitharms](https://www.reddit.com/r/birdswitharms/)
* [r/totallynotrobots](https://old.reddit.com/r/totallynotrobots)

...need I go on? (If you're curious and/or want to be entertained indefinitely, here is a [thread](https://www.reddit.com/r/AskReddit/comments/dd49gw/what_are_some_really_really_weird_subreddits/) with these and many, many more.)

Most of the time when a post is deemed irrelevant to a particular subreddit, it will simply be removed by moderators or a bot. However, depending on the subreddit and how welcoming they are to newbies, sometimes it can lead to very unfriendly responses and/or bans.

So how does one go about deciding where to post or pose a question?

Post Here aims to take the guesswork out of this process.

### The Solution

The goal with the Post Here app, as mentioned, is to provide a tool that makes it quick and easy to find the most appropriate subreddits for any given post. A user would simply provide the title and text of their prospective post and the app would provide the user with a list of subreddit recommendations.

Recommendations are produced by a model that attempts to predict which subreddit a given post would belong to. The model was built using Scikit-learn, and was trained on a large dataset of Reddit posts. In order to serve the recommendations to the web app, an API was built using Flask and deployed to Heroku.
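
To make the idea of "a list of subreddit recommendations" concrete, here is a minimal sketch of how a fitted text classifier can be turned into a top-k recommender. It is not the app's actual code: the toy posts, the `recommend` helper, and the choice of a TF-IDF plus Multinomial Naive Bayes pipeline are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (hypothetical posts, not taken from the real dataset)
train_texts = [
    "balancing a redox reaction in acid solution",
    "organic chemistry lab reaction yield question",
    "aura and nausea before every migraine attack",
    "which migraine medication helps with aura",
    "lightweight wireless mouse for office work",
    "mouse sensor and click latency comparison",
]
train_labels = [
    "chemistry", "chemistry",
    "migraine", "migraine",
    "MouseReview", "MouseReview",
]

# Vectorizer and classifier bundled so raw text goes in, probabilities come out
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)


def recommend(title, selftext, k=3):
    """Return the k most probable subreddits for a post, best first."""
    text = f"{title} {selftext}"
    probs = model.predict_proba([text])[0]
    top = np.argsort(probs)[::-1][:k]  # indices of the k largest probabilities
    return [str(model.classes_[i]) for i in top]


print(recommend("Aura without headache", "is a visual aura still a migraine", k=2))
```

Returning a ranked list rather than a single prediction is what makes a metric evaluated at several values of `k` meaningful for this kind of app.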

The live version of the app is linked below.

[Post Here: The Subreddit Suggester](https://github.com/tobias-fyi/post_here_ds)

### My Role

I worked on the Post Here app with a remote, interdisciplinary team of data scientists, machine learning engineers, and web developers. I was one of two machine learning engineers on the team, responsible for the entire process of building and training the machine learning models. The two data scientists on the team were primarily responsible for building and deploying the API.

The main challenge we ran into, which directed much of the iterative process, was scope management.

At this point in my machine learning journey, this was one of the larger datasets that I'd taken on. Uncompressed, the dataset we used was over 800 MB of mostly natural language text. The size of the dataset and the time constraint (we had less than four full days of work to finish the project) were the primary causes of the challenges we ended up facing.

With such a dataset, one important concept we had to keep in mind was the [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality), an umbrella term for the various problems and phenomena that occur when dealing with very high-dimensional datasets. When processed, a natural language dataset of this size would likely fall prey to this curse and could prove unwieldy without large amounts of processing power.

I ended up researching and applying various methods of addressing this problem in order to fit the processing/modeling pipeline on the free Heroku Dyno, with a memory limit of 500 MB, while preserving adequate performance. Many of our deployments failed because the pipeline, when loaded into memory on the server, exceeded that limit.
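
To give a feel for why the vocabulary size drives the memory footprint, the sketch below pickles a small Naive Bayes classifier fitted with and without a cap on the number of TF-IDF features. The toy corpus, the `fitted_model_bytes` helper, and the 100,000-term vocabulary used in the estimate are all hypothetical; only the 305 classes come from this project.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus (hypothetical posts, three classes)
texts = [
    "balancing a redox reaction in acid solution",
    "organic chemistry lab reaction yield question",
    "aura and nausea before every migraine attack",
    "which migraine medication helps with aura",
    "lightweight wireless mouse for office work",
    "mouse sensor and click latency comparison",
]
labels = ["chemistry", "chemistry", "migraine", "migraine", "MouseReview", "MouseReview"]


def fitted_model_bytes(max_features=None):
    """Fit vectorizer + classifier; return (n_features, pickled classifier size)."""
    vec = TfidfVectorizer(max_features=max_features)
    X = vec.fit_transform(texts)
    clf = MultinomialNB().fit(X, labels)
    return X.shape[1], len(pickle.dumps(clf))


n_full, bytes_full = fitted_model_bytes()
n_capped, bytes_capped = fitted_model_bytes(max_features=5)
print(f"{n_full} features -> {bytes_full} bytes")
print(f"{n_capped} features -> {bytes_capped} bytes")

# Back-of-the-envelope size of ONE dense (n_classes x n_features) float64 array.
# 100,000 is a hypothetical vocabulary size, not a measured one.
n_classes, n_features = 305, 100_000
approx_mb = n_classes * n_features * 8 / 1e6
print(f"~{approx_mb:.0f} MB per dense array")
```

Because the classifier stores per-class, per-feature arrays, trimming the feature count shrinks the serialized model roughly in proportion.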

One important tradeoff we had to wrestle with was how much, and in what ways, we could limit the dataset, i.e., how many classes to try to predict and how many observations per class to include when training. The original dataset contains data for 1,013 subreddits. It was not within the scope of a four-day project to build a classification model of a caliber that could accurately distinguish that many classes.

In the beginning, we did try to build a basic model trained on all of the classes. But with the time and processing power I had, it proved to be untenable. In the end, we settled for a model that classified text into 305 subreddits with a test precision-at-k of .75, .88, and .92 for 'k' of 1, 3, and 5, respectively.
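
The notebook does not spell out how precision-at-k is computed. One common reading for a single-label task, assumed here, is the fraction of posts whose true subreddit appears among the model's k highest-scoring classes. The function name `top_k_accuracy` and the probability matrix below are made up for illustration.

```python
import numpy as np


def top_k_accuracy(y_true, probs, k):
    """Fraction of rows whose true class index is among the k highest scores."""
    # Sort class indices by descending score and keep the first k per row
    top_k = np.argsort(probs, axis=1)[:, ::-1][:, :k]
    hits = (top_k == np.asarray(y_true)[:, None]).any(axis=1)
    return float(hits.mean())


# Four hypothetical posts scored against three classes
probs = np.array([
    [0.6, 0.3, 0.1],  # true class 0: ranked first
    [0.2, 0.5, 0.3],  # true class 2: ranked second
    [0.5, 0.4, 0.1],  # true class 2: ranked third
    [0.1, 0.2, 0.7],  # true class 2: ranked first
])
y_true = [0, 2, 2, 2]

for k in (1, 2, 3):
    print(k, top_k_accuracy(y_true, probs, k))
```

The score can only stay flat or rise as `k` grows, which matches the pattern of the reported figures.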

---

## The Data

The dataset we ended up using to train the recommendation system is called the [Reddit Self-Post Classification Task dataset](https://www.kaggle.com/mswarbrickjones/reddit-selfposts), available on Kaggle thanks to Evolution AI. The full dataset clocks in at over 800 MB, containing 1,013,000 rows: 1,000 posts each from 1,013 subreddits.

For more details on the dataset, including a nice interactive plot of all the subreddits and their relevance to one another, refer to Evolution AI's [blog post](https://evolution.ai//blog/page/5/an-imagenet-like-text-classification-task-based-on-reddit-posts/).

### Wrangling and Exploration

First, I needed to reduce the size of the dataset. I defined a subset of 12 categories which I thought were most relevant to the task at hand, and used that list to do the initial pruning. Those 12 categories left me with 305 unique subreddits and 305,000 rows. The list I used was as follows:

* health
* profession
* electronics
* hobby
* writing/stories
* advice/question
* social_group
* stem
* parenting
* books
* finance/money
* travel

Next, I took a random sample of those 305,000 rows. The result was a dataset with 91,500 rows, now consisting of between 250 and 340 rows per subreddit. If I tried to use all of the features (tokens, or words) that resulted from this corpus, even in its reduced state, it would still result in a serialized vocabulary and/or model too large for our free Heroku Dyno. However, the features used in the final model can be chosen based on how useful they are for the classification.
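
As a rough illustration of choosing features by usefulness, the sketch below scores TF-IDF terms against the class labels with a chi-squared test and keeps only the top few. `SelectKBest` and `chi2` are imported at the top of this notebook, but the toy corpus and the value `k=5` are assumptions for the example, not the settings used for the real model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Toy corpus (hypothetical posts, three classes)
texts = [
    "balancing a redox reaction in acid solution",
    "organic chemistry lab reaction yield question",
    "aura and nausea before every migraine attack",
    "which migraine medication helps with aura",
    "lightweight wireless mouse for office work",
    "mouse sensor and click latency comparison",
]
labels = ["chemistry", "chemistry", "migraine", "migraine", "MouseReview", "MouseReview"]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

# Keep the 5 terms whose chi-squared score against the labels is highest
selector = SelectKBest(chi2, k=5)
X_small = selector.fit_transform(X, labels)

# Map the kept column indices back to their terms
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
kept = [terms[i] for i in selector.get_support(indices=True)]

print(X.shape, "->", X_small.shape)
print(kept)
```

Only the kept columns need to be carried into the classifier, which is one way to shrink what has to be serialized and loaded on the server.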

According to the dataset preview on Kaggle, there are quite a large number of missing values in each of the features: 12%, 25%, and 39% of the subreddit, title, and selftext columns, respectively. However, I did not find any sign of those null values in the dataset nor mention of them in the dataset's companion blog post or article. I chalked it up to an error in the Kaggle preview.

Finally, I went about doing some basic preprocessing to get the data ready for vectorization. As described in the description page on Kaggle, newline and tab characters were replaced with their HTML equivalents, `<lb>` and `<tab>`. I removed those and other HTML entities using a simple regular expression. I also concatenated `title` and `selftext` into a single text feature in order to process them together.

In [4]:
# === Load the dataset === #
rspct = pd.read_csv("assets/data/rspct.tsv", sep="\t")
print(rspct.shape)
rspct.head(3)

(1013000, 4)


Unnamed: 0,id,subreddit,title,selftext
0,6d8knd,talesfromtechsupport,Remember your command line switches...,"Hi there, <lb>The usual. Long time lerker, first time poster, be kind etc. Sorry if this isn't the right place...<lb><lb>Alright. Here's the story. I'm an independent developer who produces my ow..."
1,58mbft,teenmom,"So what was Matt ""addicted"" to?","Did he ever say what his addiction was or is he still chugging beers while talking about how sober he is?<lb><lb>Edited to add: As an addict myself, anyone I know whose been an addict doesn't drin..."
2,8f73s7,Harley,No Club Colors,Funny story. I went to college in Las Vegas. This was before I knew anything about motorcycling whatsoever. Me and some college buddies would always go out on the strip to the dance clubs. We alwa...


#### Nulls

Kaggle says that 12%, 25%, and 39% of the subreddit, title, and selftext columns are null, respectively. If that is indeed the case, they did not get read into the dataframe correctly. However, it could be an error on Kaggle's part, seeing as there is no mention of these anywhere else in the description or blog post or article, nor sign of them during my explorations.

In [5]:
# === Null values === #
rspct.isnull().sum()

id           0
subreddit    0
title        0
selftext     0
dtype: int64
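
`isnull()` only reports true missing values. As a complementary check, and purely as a sketch on a made-up dataframe, the snippet below also counts fields that are present but empty or whitespace-only, which `isnull()` would not flag.

```python
import pandas as pd

# Hypothetical frame mixing real text, blank strings, and one true null
df = pd.DataFrame({
    "title": ["A real title", "", "   ", "Another"],
    "selftext": ["Body text", "Body", "More body", None],
})

# True missing values
null_counts = df.isnull().sum()

# Present-but-blank values: strip whitespace, then compare with ""
blank_counts = df.apply(lambda col: col.str.strip().eq("").sum())

print(null_counts)
print(blank_counts)
```

Note that `pd.read_csv` converts empty fields to `NaN` by default, so fully empty cells in the source file would already appear in the null counts.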

#### Preprocess

To prune the list of subreddits, I'll load in the `subreddit_info.csv` file, join, then choose a certain number of categories (`category_1`) to filter on.

In [6]:
# === Load info === #
info = pd.read_csv("assets/data/subreddit_info.csv", usecols=["subreddit", "category_1", "category_2"])
print(info.shape)
info.head()

(3394, 3)


Unnamed: 0,subreddit,category_1,category_2
0,whatsthatbook,advice/question,book
1,CasualConversation,advice/question,broad
2,Clairvoyantreadings,advice/question,broad
3,DecidingToBeBetter,advice/question,broad
4,HelpMeFind,advice/question,broad


In [7]:
# === Join the two dataframes === #
rspct = pd.merge(rspct, info, on="subreddit").drop(columns=["id"])
print(rspct.shape)
rspct.head()

(1013000, 5)


Unnamed: 0,subreddit,title,selftext,category_1,category_2
0,talesfromtechsupport,Remember your command line switches...,"Hi there, <lb>The usual. Long time lerker, first time poster, be kind etc. Sorry if this isn't the right place...<lb><lb>Alright. Here's the story. I'm an independent developer who produces my ow...",writing/stories,tech support
1,talesfromtechsupport,I work IT for a certain clothing company and they use iPod Touchs for scanning some items,"[ME]- Thank you fro calling Store support, this is David. How may I help you?<lb><lb>[Store]- Yeah, my iPod is frozen<lb><lb>[ME]- Okay, can I have you hold down the power and the home button at t...",writing/stories,tech support
2,talesfromtechsupport,It... It says right there on the screen...?,"Hi guys! <lb><lb>&amp;nbsp;<lb><lb>LTL, FTP - all that jazz. Starting you off with a short one.<lb><lb>&amp;nbsp;<lb><lb>I'm the senior supporter at a smaller tech company with clients all over t...",writing/stories,tech support
3,talesfromtechsupport,The computers not working. FIX IT NOW!,"Hey there TFTS! This is my second time posting. I don't work for any tech support company, but I do have friends, family and teachers at school that have no idea how stuff works.<lb><lb>This tale ...",writing/stories,tech support
4,talesfromtechsupport,A Storm of Unreasonableness,"Usual LTL, FTP. I have shared this story on a different site, but after reading TFTS for sometime I figured it'd belong here as well. <lb><lb>This is from when I worked at a 3rd party call center ...",writing/stories,tech support
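
The row count staying at 1,013,000 suggests every post found exactly one match in the info table. A more explicit way to confirm that, sketched here on toy frames, is to ask pandas to validate the join. The use of `how="left"`, `validate`, and `indicator` is an assumption for illustration and differs from the plain inner merge above.

```python
import pandas as pd

# Toy stand-ins for the posts and the subreddit info table
posts = pd.DataFrame({
    "subreddit": ["chemistry", "chemistry", "migraine"],
    "title": ["t1", "t2", "t3"],
})
info = pd.DataFrame({
    "subreddit": ["chemistry", "migraine", "travel"],
    "category_1": ["stem", "health", "travel"],
})

# validate raises if a subreddit appears more than once in `info`;
# indicator adds a `_merge` column showing where each row came from
merged = pd.merge(
    posts, info, on="subreddit", how="left",
    validate="many_to_one", indicator=True,
)
unmatched = int((merged["_merge"] == "left_only").sum())
print(len(merged), "rows,", unmatched, "without a category")

# A duplicated key in the info table is caught instead of silently adding rows
dup_info = pd.concat([info, info.iloc[[0]]], ignore_index=True)
try:
    pd.merge(posts, dup_info, on="subreddit", validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print("duplicate keys detected:", duplicate_detected)
```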


In [8]:
# === Still no nulls === #
rspct.isnull().sum()  # That's a good sign

subreddit     0
title         0
selftext      0
category_1    0
category_2    0
dtype: int64

In [9]:
# === Look at categories === #
rspct["category_1"].value_counts()

video_game               100000
tv_show                   68000
health                    58000
profession                56000
software                  52000
electronics               51000
music                     43000
sports                    40000
sex/relationships         31000
hobby                     30000
geo                       29000
crypto                    29000
company/website           28000
other                     27000
anime/manga               26000
drugs                     23000
writing/stories           22000
programming               21000
arts                      21000
autos                     20000
advice/question           18000
education                 17000
animals                   17000
politics/viewpoint        16000
social_group              16000
card_game                 15000
food/drink                15000
stem                      14000
hardware/tools            14000
parenting                 13000
religion/supernatural     13000
books   

In [10]:
# === Define list of categories to keep === #
keep_cats = [
    "health",
    "profession",
    "electronics",
    "hobby",
    "writing/stories",
    "advice/question",
    "social_group",
    "stem",
    "parenting",
    "books",
    "finance/money",
    "travel",
]

# === Prune dataset to above categories === #
# Overwriting to save memory
rspct = rspct[rspct["category_1"].isin(keep_cats)]
print(rspct.shape)
print("Unique subreddits:", rspct["subreddit"].nunique())
rspct.head(2)

(305000, 5)
Unique subreddits: 305


Unnamed: 0,subreddit,title,selftext,category_1,category_2
0,talesfromtechsupport,Remember your command line switches...,"Hi there, <lb>The usual. Long time lerker, first time poster, be kind etc. Sorry if this isn't the right place...<lb><lb>Alright. Here's the story. I'm an independent developer who produces my ow...",writing/stories,tech support
1,talesfromtechsupport,I work IT for a certain clothing company and they use iPod Touchs for scanning some items,"[ME]- Thank you fro calling Store support, this is David. How may I help you?<lb><lb>[Store]- Yeah, my iPod is frozen<lb><lb>[ME]- Okay, can I have you hold down the power and the home button at t...",writing/stories,tech support


In [11]:
# === Take a sample of that === #
rspct = rspct.sample(frac=.3, random_state=seed)
print(rspct.shape)
rspct.head()

(91500, 5)


Unnamed: 0,subreddit,title,selftext,category_1,category_2
594781,stepparents,Ex Wants Toddler Son (2M) to Meet Her AP/SO - x-post from /r/divorce,Quick background: My soon-to-be ex-wife (26F) and I (27M) have been separated for about 5 months now. She has been in a serious relationship her AP (23M) whom she met and cheated on me with 6 mont...,parenting,step parenting
617757,bigseo,Do we raise our pricing?,"I took a management role at an agency. We're way, way under $500/mo for SEO pricing - and I'm embarrassed to say that we're hurting for business. Seems to me that it's a struggle to get clients to...",profession,seo
642368,chemistry,Mac vs. PC?,"Hello, all! I am currently a senior in high school and in the fall I will be going to SUNY Geneseo, majoring in chemistry and minoring in mathematics. <lb><lb>Geneseo requires it’s students to get...",stem,chemistry
325221,migraine,Beer as an aural abortive?,"Hiya folks,<lb><lb>I've been a migraine sufferer pretty much my whole life. For me intense auras, numbness, confusion, the inability to speak or see is BY FAR the worst aspect of the ordeal. When ...",health,migraine
524939,MouseReview,Recommend office mouse,I was hoping you folks could help me out. Here's my situation and requirements:<lb><lb>* I don't play games at all<lb>* Budget $30.00 or less<lb>* Shape as close to old Microsoft Intellimouse Opti...,electronics,computer mouse
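
A plain `sample(frac=...)` draws from the whole frame, so the number of rows kept per subreddit varies. If equal class sizes were wanted, one option (not what was done here) is to sample within each group. The toy frame below is hypothetical, and `groupby(...).sample` requires pandas 1.1 or newer.

```python
import pandas as pd

seed = 92

# Toy frame: three subreddits with ten posts each
df = pd.DataFrame({
    "subreddit": ["a"] * 10 + ["b"] * 10 + ["c"] * 10,
    "text": [f"post {i}" for i in range(30)],
})

# Simple random sample: per-class counts are left to chance
simple = df.sample(frac=0.3, random_state=seed)

# Stratified sample: 30% drawn from each subreddit separately
stratified = df.groupby("subreddit").sample(frac=0.3, random_state=seed)

print(simple["subreddit"].value_counts().to_dict())
print(stratified["subreddit"].value_counts().to_dict())
```

Both approaches keep the same total number of rows; only the stratified one guarantees the per-class split.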


In [12]:
# === Clean up a bit === #
# Concatenate title and selftext
rspct["text"] = rspct["title"] + " " + rspct["selftext"]

# Drop categories
rspct = rspct.drop(columns=["category_1", "category_2", "title", "selftext"])

print(rspct.shape)
rspct.head(2)

(91500, 2)


Unnamed: 0,subreddit,text
594781,stepparents,Ex Wants Toddler Son (2M) to Meet Her AP/SO - x-post from /r/divorce Quick background: My soon-to-be ex-wife (26F) and I (27M) have been separated for about 5 months now. She has been in a serious...
617757,bigseo,"Do we raise our pricing? I took a management role at an agency. We're way, way under $500/mo for SEO pricing - and I'm embarrassed to say that we're hurting for business. Seems to me that it's a s..."


In [13]:
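# === Remove <lb>, <tab>, and other HTML entities === #
# NOTE: may take a while to run on the full sample
# Plain alternation (no `*`), so every token is tried at each position.
# regex=True is explicit: pandas >= 2.0 treats the pattern as a literal otherwise.
rspct["text"] = rspct["text"].str.replace(r"<lb>|<tab>|&amp;|nbsp;", "", regex=True)
rspct.head()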

Unnamed: 0,subreddit,text
594781,stepparents,Ex Wants Toddler Son (2M) to Meet Her AP/SO - x-post from /r/divorce Quick background: My soon-to-be ex-wife (26F) and I (27M) have been separated for about 5 months now. She has been in a serious...
617757,bigseo,"Do we raise our pricing? I took a management role at an agency. We're way, way under $500/mo for SEO pricing - and I'm embarrassed to say that we're hurting for business. Seems to me that it's a s..."
642368,chemistry,"Mac vs. PC? Hello, all! I am currently a senior in high school and in the fall I will be going to SUNY Geneseo, majoring in chemistry and minoring in mathematics. Geneseo requires it’s students to..."
325221,migraine,"Beer as an aural abortive? Hiya folks,I've been a migraine sufferer pretty much my whole life. For me intense auras, numbness, confusion, the inability to speak or see is BY FAR the worst aspect o..."
524939,MouseReview,Recommend office mouse I was hoping you folks could help me out. Here's my situation and requirements:* I don't play games at all* Budget $30.00 or less* Shape as close to old Microsoft Intellimou...


In [14]:
# === Reset the index === #
rspct = rspct.reset_index(drop=True)
rspct.head(2)

Unnamed: 0,subreddit,text
0,stepparents,Ex Wants Toddler Son (2M) to Meet Her AP/SO - x-post from /r/divorce Quick background: My soon-to-be ex-wife (26F) and I (27M) have been separated for about 5 months now. She has been in a serious...
1,bigseo,"Do we raise our pricing? I took a management role at an agency. We're way, way under $500/mo for SEO pricing - and I'm embarrassed to say that we're hurting for business. Seems to me that it's a s..."


In [15]:
# === Save pruned dataset to file === #
rspct.to_csv("assets/data/rspct_small.csv")
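
One thing to keep in mind when reloading a file saved this way: `to_csv` writes the index as an unnamed first column by default, which comes back as `Unnamed: 0` unless it is handled on read. The round trip below uses an in-memory buffer and a two-row toy frame, both assumptions for the sake of a self-contained example.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "subreddit": ["stepparents", "bigseo"],
    "text": ["first post", "second post"],
})

# Default: the index is written as an unnamed leading column
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
naive = pd.read_csv(buffer)

# Option 1: tell read_csv that the first column is the index
buffer.seek(0)
with_index = pd.read_csv(buffer, index_col=0)

# Option 2: skip writing the index in the first place
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
clean = pd.read_csv(buffer)

print(list(naive.columns))
print(list(with_index.columns))
print(list(clean.columns))
```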

In [16]:
# === List of subreddits === #
subreddits = rspct["subreddit"].unique()
print(len(subreddits))
subreddits[:50]

305


array(['stepparents', 'bigseo', 'chemistry', 'migraine', 'MouseReview',
       'Malazan', 'Standup', 'preppers', 'Invisalign', 'whatsthisplant',
       'CrohnsDisease', 'KingkillerChronicle', 'OccupationalTherapy',
       'churning', 'Libraries', 'acting', 'eczema', 'Allergies',
       'bigboobproblems', 'AskAnthropology', 'psychotherapy',
       'WayfarersPub', 'synthesizers', 'StopGaming', 'stopsmoking',
       'eroticauthors', 'amazonecho', 'TalesFromThePizzaGuy',
       'rheumatoid', 'homestead', 'VoiceActing', 'FinancialCareers',
       'Sleepparalysis', 'ProtectAndServe', 'short', 'Fibromyalgia',
       'teaching', 'PlasticSurgery', 'insomnia', 'PLC', 'rapecounseling',
       'peacecorps', 'paintball', 'autism', 'Nanny', 'Plumbing',
       'Epilepsy', 'asmr', 'fatpeoplestories', 'Magic'], dtype=object)

In [17]:
rspct["subreddit"].value_counts()

Dreams             340
Gifts              337
HFY                333
Cubers             333
cassetteculture    333
                  ... 
foreignservice     265
WritingPrompts     263
immigration        263
TryingForABaby     262
Physics            250
Name: subreddit, Length: 305, dtype: int64