# Post Here: The Subreddit Suggester

### Notebook by: _Tobias Reaper_

---

## Notebook outline

* [_Imports and Configuration_](#Imports-and-Configuration)
* [Introduction](#Introduction)
  * [The Problem](#The-Problem)
  * [The Solution (The App)](#The-Solution)
  * [My Role](#My-Role)
* [The Data](#The-Data)
  * [Wrangling](#Wrangling)
  * [Exploration](#Exploration)
* [Modeling](#Modeling)
  * [Challenges](#Challenges)
  * [Feature Selection](#Feature-Selection)
  * [Vectorization](#Vectorization)
  * [Baseline](#Baseline)
  * [Multinomial Naive Bayes](#Multinomial-Naive-Bayes)
* [Final Thoughts](#Final-Thoughts)

---

## Imports and Configuration

In [1]:
# === General imports === #
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# === ML imports === #
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_selection import chi2, SelectKBest

# === NLP Imports === #
from sklearn.feature_extraction.text import TfidfVectorizer
import spacy
nlp = spacy.load("en_core_web_md")

In [3]:
# === Configure pandas display settings === #
pd.options.display.max_colwidth = 200

---

## Introduction

### The Problem

Reddit is an expansive site. Anyone who has spent any significant amount of time on it knows what I mean. There is a subreddit for seemingly every topic anyone could ever want to discuss or even think about (and many that most do not want to think about).

Reddit is a powerful site; a tool for connecting and sharing information with like- or unlike-minded individuals around the world. When used well, it can be a very useful resource.

On the other hand, the deluge of information that's constantly piling into the pages of Reddit can be overwhelming and lead to wasted time. As with any tool, it can be used for good or for not-so-good.

A common problem that Redditors experience, particularly those who are relatively new to the site, is deciding where to post content. Given that there are subreddits for just about everything, with wildly varying degrees of specificity, it can be quite overwhelming trying to find the best place for each post.

Just to illustrate the point, some subreddits get _weirdly_ specific. I won't go into the _really_ weird or NSFW, but here are some good examples of what I mean by specific:

* [r/Borderporn](https://www.reddit.com/r/Borderporn/)
* [r/BreadStapledtoTrees](https://www.reddit.com/r/BreadStapledToTrees/)
* [r/birdswitharms](https://www.reddit.com/r/birdswitharms/)
* [r/totallynotrobots](https://old.reddit.com/r/totallynotrobots)

...need I go on? (If you're curious and/or want to be entertained indefinitely, here is a [thread](https://www.reddit.com/r/AskReddit/comments/dd49gw/what_are_some_really_really_weird_subreddits/) with these and many, many more.)

Most of the time when a post is deemed irrelevant to a particular subreddit, it will simply be removed by moderators or a bot. However, depending on the subreddit and how welcoming they are to newbies, sometimes it can lead to very unfriendly responses and/or bans.

So how does one go about deciding where to post or pose a question?

Post Here aims to take the guesswork out of this process.

### The Solution

The goal with the Post Here app, as mentioned, is to provide a tool that makes it quick and easy to find the most appropriate subreddits for any given post. A user simply provides the title and text of their prospective post, and the app returns a list of subreddit recommendations.

Recommendations are produced by a model that attempts to predict which subreddit a given post belongs to. The model was built using Scikit-learn and was trained on a large dataset of Reddit posts. In order to serve the recommendations to the web app, an API was built using Flask and deployed to Heroku.
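As a rough illustration of how a list of recommendations can be derived from a classifier's output, the sketch below ranks classes by predicted probability and returns the top few names. The function name `top_k_subreddits`, the example class names, and the probabilities are all hypothetical; this is not the app's actual serving code.

```python
import numpy as np


def top_k_subreddits(proba, class_names, k=5):
    """Return the k class names with the highest predicted probability."""
    proba = np.asarray(proba)
    # argsort is ascending, so reverse it and keep the first k indices
    top_idx = np.argsort(proba)[::-1][:k]
    return [class_names[i] for i in top_idx]


# Hypothetical class names and probabilities for a single post
class_names = ["learnpython", "datascience", "MachineLearning", "AskReddit"]
proba = [0.10, 0.55, 0.30, 0.05]

recommendations = top_k_subreddits(proba, class_names, k=3)
print(recommendations)
```

With a fitted Scikit-learn classifier, `proba` would typically be one row of `predict_proba` and `class_names` the label encoder's `classes_`.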

The live version of the app is linked below.

[Post Here: The Subreddit Suggester](https://github.com/tobias-fyi/post_here_ds)

### My Role

I worked on the Post Here app with a remote, interdisciplinary team of data scientists, machine learning engineers, and web developers. I was the sole machine learning engineer on the team, responsible for the entire process of building and training the machine learning models.

The main challenge I ran into, which directed the iterative process, was scope and dimensionality management.

At this point in my machine learning journey, this was one of the larger datasets that I'd taken on. Uncompressed, the dataset we used was over 800 MB of mostly natural language text.

One aspect of natural language processing to keep in mind with such a dataset is the curse of dimensionality. Once vectorized, a text corpus of this size yields an enormous, sparse feature space (one column per distinct term or n-gram) that is unwieldy to model without large amounts of processing power.

I was forced to research and apply various methods of addressing this problem in order to fit the resulting models on the free Heroku Dyno (500 MB) while preserving adequate performance.
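To give a feel for why the number of features matters for deployment size, here is a minimal sketch that fits a Naive Bayes model on random count data and measures the size of the pickled model. The data, the class count, and the helper `pickled_size_bytes` are made up for illustration; the figures it prints say nothing about the real models.

```python
import pickle

import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
n_samples, n_classes = 200, 20
# Cycle through the labels so that every class appears in the toy target
y = np.arange(n_samples) % n_classes


def pickled_size_bytes(n_features):
    """Fit a model on random counts and return the size of its pickle."""
    X = rng.integers(0, 5, size=(n_samples, n_features))
    model = MultinomialNB().fit(X, y)
    return len(pickle.dumps(model))


sizes = {k: pickled_size_bytes(k) for k in (1_000, 10_000)}
for k, size in sizes.items():
    print(f"{k:>6} features -> {size / 1e6:.2f} MB")
```

The fitted model stores per-class, per-feature arrays, so its serialized size grows roughly in proportion to the number of features kept.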

One important way I had to wrangle with scope management was in deciding how many classes to try to predict. The original dataset contains data for just over 1,000 subreddits. It was not within the scope of a four-day project to build a classification model that could accurately classify posts into that many classes.

In the beginning, I did try to build a basic model trained on all of the classes. But with the time and processing power I had, it proved to be untenable. In the end, I settled for a model that classified text into 200 subreddits with a test precision-at-5 of over 90%.
[review after re-validating]

---

## The Data

The dataset we ended up using to train the recommendation system is called the [Reddit Self-Post Classification Task dataset](https://www.kaggle.com/mswarbrickjones/reddit-selfposts), available on Kaggle thanks to Evolution AI. The full dataset clocks in at over 800 MB, containing 1,013,000 rows: 1,000 posts each from 1,013 subreddits.

The posts were submitted to Reddit between June 2016 and June 2018.

[more info from the article]

For more details on the dataset, refer to Evolution AI's [blog post](https://evolution.ai//blog/page/5/an-imagenet-like-text-classification-task-based-on-reddit-posts/).

### Wrangling

As seems to be common with NLP projects, the process of wrangling the data was very much intertwined with the modeling process. Of course, this could be said about any machine learning project. However, I feel like it is particularly so in the case of NLP.

Therefore, this section, the one dedicated solely to data wrangling, will be rather brief and basic. I go into much more detail in the Modeling section.

In [4]:
# === Load the dataset === #
rspct = pd.read_csv("assets/data/rspct.tsv", sep="\t")
print(rspct.shape)
rspct.head()

(1013000, 4)


|   | id | subreddit | title | selftext |
|---|---|---|---|---|
| 0 | 6d8knd | talesfromtechsupport | Remember your command line switches... | Hi there, <lb>The usual. Long time lerker, first time poster, be kind etc. Sorry if this isn't the right place...<lb><lb>Alright. Here's the story. I'm an independent developer who produces my ow... |
| 1 | 58mbft | teenmom | So what was Matt "addicted" to? | Did he ever say what his addiction was or is he still chugging beers while talking about how sober he is?<lb><lb>Edited to add: As an addict myself, anyone I know whose been an addict doesn't drin... |
| 2 | 8f73s7 | Harley | No Club Colors | Funny story. I went to college in Las Vegas. This was before I knew anything about motorcycling whatsoever. Me and some college buddies would always go out on the strip to the dance clubs. We alwa... |
| 3 | 6ti6re | ringdoorbell | Not door bell, but floodlight mount height. | I know this is a sub for the 'Ring Doorbell' but has anyone used the Floodlight? I already have the wire and existing bracket for the floodlight on the back of my house, but the problem is that i... |
| 4 | 77sxto | intel | Worried about my 8700k small fft/data stress results... | Prime95 (regardless of version) and OCCT both, the "small" tests (including those parts of blend) make my temps shoot up to 100c+/throttling even at pure stock with MCE off instantaneously (I find... |


#### Nulls

Kaggle reports that 12% of the subreddit column is null. If that is indeed the case, those nulls were not read into the dataframe as such, because the check below finds none.

In [5]:
# === Null values === #
rspct.isnull().sum()

id           0
subreddit    0
title        0
selftext     0
dtype: int64

In [46]:
# === Get list of subreddits === #
# Unique names come back in order of first appearance in the dataframe
subreddits = rspct["subreddit"].unique()
subreddits

array(['talesfromtechsupport', 'teenmom', 'Harley', ..., 'halo',
       'gtaonline', 'mead'], dtype=object)

In [47]:
# === Prune list of subreddits === #
num_classes = 200
sub_small = subreddits[:num_classes]
sub_small.shape

(200,)
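The pruned list only becomes useful once the dataframe is filtered down to those subreddits. A minimal sketch of that step on a toy dataframe follows; the names `posts` and `posts_small` are hypothetical, and the notebook's own filtering code is not shown here.

```python
import pandas as pd

# Toy stand-in for the full dataset
posts = pd.DataFrame({
    "subreddit": ["intel", "mead", "halo", "intel", "teenmom"],
    "title": ["a", "b", "c", "d", "e"],
})

# Same pruning idea as above: keep the first N unique subreddits
subreddits = posts["subreddit"].unique()
sub_small = subreddits[:2]

# Keep only the rows whose subreddit is in the pruned list
posts_small = posts[posts["subreddit"].isin(sub_small)].reset_index(drop=True)
print(posts_small["subreddit"].value_counts())
```

Note that slicing the first N unique values picks classes by order of appearance, not by any measure of how easy they are to tell apart.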

### Exploration

In [None]:
# === The list of subreddits === #

---

## Modeling

The Kaggle page description:

> We recommend splitting out the last 20% of the data as a test set (we have organised so that this is a random, stratified sample of all the data). In our experiments, we have been optimising for the precision-at-K metric for K = {1, 3, 5}.

In [None]:
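# === Split up dataset into train and test === #

# First 80% is train; last 20% is test
# NOTE: assumes the split is taken over `rspct`; substitute the pruned dataframe if one is used
split_idx = int(len(rspct) * 0.8)
train, test = rspct.iloc[:split_idx], rspct.iloc[split_idx:]

train.shape, test.shape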

In [54]:
# === Encode the target using LabelEncoder === #

# This process naively transforms each class of the target into a number
le = LabelEncoder() # Instantiate a new encoder instance
le.fit(y_train)  # Fit it on training label data

# Transform both using the train-fit instance
y_train = le.transform(y_train)
y_test  = le.transform(y_test)

y_train[:8]

array([ 48,  87,  27,  76,  84,  59,  20, 111])

### Vectorization
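The feature selection step further down expects sparse matrices named `X_train_sparse` and `X_test_sparse`. One plausible way to produce them is TF-IDF vectorization, sketched below on toy text. The vectorizer settings (`stop_words`, `ngram_range`, `max_features`) and the example strings are assumptions for illustration, not the configuration actually used in this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the combined title + selftext of each post
train_text = [
    "My CPU temperature spikes under stress tests",
    "Floodlight mount height for the back of my house",
    "First batch of mead is fermenting slowly",
]
test_text = ["Stress tests make my CPU throttle"]

# Assumed settings: unigrams and bigrams, English stop words, capped vocabulary
tfidf = TfidfVectorizer(stop_words="english", ngram_range=(1, 2), max_features=100_000)

# Learn the vocabulary on the training text only, then reuse it for the test text
X_train_sparse = tfidf.fit_transform(train_text)
X_test_sparse = tfidf.transform(test_text)

print(X_train_sparse.shape, X_test_sparse.shape)
```

Fitting the vectorizer on the training split alone keeps test-set vocabulary from leaking into the features.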

### Dimensionality Reduction

...and Feature Selection

* Large dataset = tons of text features
* Dimensionality reduction techniques
  * Chi^2 & SelectKBest

In [56]:
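# === Feature Selection === #

# Keep the 10,000 features with the highest chi-squared score against the target
# (k must be passed by keyword in recent scikit-learn releases)
selector = SelectKBest(chi2, k=10000)

# Fit on the training data only, then apply the same selection to both splits
selector.fit(X_train_sparse, y_train)

X_train_select = selector.transform(X_train_sparse)
X_test_select  = selector.transform(X_test_sparse)

X_train_select.shape, X_test_select.shape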

((16032, 10000), (4009, 10000))

### Baseline

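In [None]:
# === Evaluate performance using precision-at-k === #
def precision_at_k(y_true, y_pred, k=5):
    """Share of rows whose true label is among the k highest-scored classes."""
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)
    # Rank class indices from highest to lowest score and keep the top k
    y_pred = np.argsort(y_pred, axis=1)
    y_pred = y_pred[:, ::-1][:, :k]
    # A row counts as a hit when its true label appears in its top-k indices
    arr = [y in s for y, s in zip(y_true, y_pred)]
    return np.mean(arr)

In [None]:
# === Baseline RandomForest model === #
rfc = RandomForestClassifier(max_depth=32, n_jobs=-1, n_estimators=200)
rfc.fit(X_train, y_train)

# Predict on the test features before scoring (X_test is assumed to mirror X_train)
y_pred_rfc = rfc.predict(X_test)
y_pred_proba_rfc = rfc.predict_proba(X_test)

print('precision@1 =', np.mean(y_test == y_pred_rfc))
print('precision@3 =', precision_at_k(y_test, y_pred_proba_rfc, 3))
print('precision@5 =', precision_at_k(y_test, y_pred_proba_rfc, 5))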
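To make the metric concrete, here is a small worked example of the `precision_at_k` function from this section, run on three made-up rows of class probabilities. The labels and probabilities are invented for illustration. The function assumes labels are encoded as integers that match the column order of the probability matrix, which is what `LabelEncoder` produces when the classifier has seen every class.

```python
import numpy as np


def precision_at_k(y_true, y_pred, k=5):
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)
    y_pred = np.argsort(y_pred, axis=1)
    y_pred = y_pred[:, ::-1][:, :k]
    arr = [y in s for y, s in zip(y_true, y_pred)]
    return np.mean(arr)


y_true = np.array([2, 0, 1])
y_proba = np.array([
    [0.1, 0.3, 0.6],  # true class 2 is ranked first
    [0.2, 0.7, 0.1],  # true class 0 is ranked second
    [0.5, 0.1, 0.4],  # true class 1 is ranked third
])

p_at_1 = precision_at_k(y_true, y_proba, k=1)
p_at_2 = precision_at_k(y_true, y_proba, k=2)
p_at_3 = precision_at_k(y_true, y_proba, k=3)
print(p_at_1, p_at_2, p_at_3)
```

Each row scores a hit as soon as its true label falls inside the top k, so the score can only stay the same or rise as k grows.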

### Multinomial Naive Bayes

In [57]:
# === Naive Bayes model === #
nb = MultinomialNB(alpha=0.1)
nb.fit(X_train_select, y_train)

MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)

In [59]:
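# === Evaluate precision at k === #
# Generate class predictions and per-class probabilities on the test set
y_pred = nb.predict(X_test_select)
y_pred_proba = nb.predict_proba(X_test_select)

print('precision@1 =', np.mean(y_test == y_pred))
print('precision@3 =', precision_at_k(y_test, y_pred_proba, 3))
print('precision@5 =', precision_at_k(y_test, y_pred_proba, 5))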

precision@1 = 0.786979296582689
precision@3 = 0.8877525567473186
precision@5 = 0.9156896981790971


---

## Final Thoughts



### Scope Management, Revisited

[Potential improvements to the models here]

* Classifying first into category, then by specific subreddit
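A minimal sketch of what the two-stage idea could look like: one model predicts a broad category, then a per-category model picks the subreddit. Everything here is hypothetical, including the `category_of` mapping, the toy posts, and the choice of a TF-IDF plus Naive Bayes pipeline for both stages. The dataset itself would need a category label for each subreddit for this to work at scale.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical mapping from subreddit to a broader category
category_of = {"intel": "tech", "buildapc": "tech", "mead": "food", "Breadit": "food"}

# Toy (text, subreddit) pairs
posts = [
    ("cpu overclock temperature throttling", "intel"),
    ("intel processor stress test temps", "intel"),
    ("which motherboard and case for first pc build", "buildapc"),
    ("pc build parts list gpu and power supply", "buildapc"),
    ("honey fermentation yeast for first mead batch", "mead"),
    ("mead batch honey racking and aging", "mead"),
    ("sourdough starter and bread flour hydration", "Breadit"),
    ("bread loaf crumb after baking sourdough", "Breadit"),
]
texts = [text for text, _ in posts]
subs = [sub for _, sub in posts]
cats = [category_of[sub] for sub in subs]

# Stage 1: predict the broad category
category_model = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(texts, cats)

# Stage 2: one subreddit model per category, trained only on that category's posts
subreddit_models = {}
for cat in set(cats):
    idx = [i for i, c in enumerate(cats) if c == cat]
    subreddit_models[cat] = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(
        [texts[i] for i in idx], [subs[i] for i in idx]
    )


def predict_subreddit(text):
    """Route the text through the category model, then that category's model."""
    cat = category_model.predict([text])[0]
    return cat, subreddit_models[cat].predict([text])[0]


result = predict_subreddit("my mead batch needs more honey")
print(result)
```

Each second-stage model only has to separate the subreddits within one category, which keeps the individual problems small. The trade-off is that a mistake in the first stage cannot be recovered in the second.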