# Post Here: The Subreddit Suggester

## Small Model (limited dataset)

### Notebook by: _Tobias Reaper_

---

## Notebook outline

* [_Imports and Configuration_](#Imports-and-Configuration)
* [Introduction](#Introduction)
  * [The Problem](#The-Problem)
  * [The Solution (The App)](#The-Solution)
  * [My Role](#My-Role)
* [The Data](#The-Data)
  * [Wrangling](#Wrangling)
  * [Exploration](#Exploration)
* [Modeling](#Modeling)
  * [Challenges](#Challenges)
  * [Feature Selection](#Feature-Selection)
  * [Vectorization](#Vectorization)
  * [Baseline](#Baseline)
  * [Multinomial Naive Bayes](#Multinomial-Naive-Bayes)
* [Final Thoughts](#Final-Thoughts)

---

## Imports and Configuration

In [35]:
# === General imports === #
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import janitor

In [2]:
# === ML imports === #
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_selection import chi2, SelectKBest

# === NLP Imports === #
from sklearn.feature_extraction.text import TfidfVectorizer
import spacy

In [3]:
# === Configure === #
# Load spacy language model
nlp = spacy.load("en_core_web_md")

# Configure pandas display settings
pd.options.display.max_colwidth = 200

# Set random seed
seed = 92

---

## Introduction

### The Problem

Reddit is an expansive site. Anyone who has spent any significant amount of time on it knows what I mean. There is a subreddit for seemingly every topic anyone could ever want to discuss or even think about (and many that most do not want think about).

Reddit is a powerful site; a tool for connecting and sharing information with like- or unlike-minded individuals around the world. When used well, it can be a very useful resource.

On the other hand, the deluge of information that's constantly piling into the pages of can be overwhelming and lead to wasted time. As with any tool, it can be used for good or for not-so-good.

A common problem that Redditors experience, particularly those who are relatively new to the site, is where to post content. Given that there are subreddits for just about everything, with wildly varying degrees of specificity it can be quite overwhelming trying to find the best place for each post.

Just to illustrate the point, some subreddits get _weirdly_ specific. I won't go into the _really_ weird or NSFW, but here are some good examples of what I mean by specific:

* [r/Borderporn](https://www.reddit.com/r/Borderporn/)
* [r/BreadStapledtoTrees](https://www.reddit.com/r/BreadStapledToTrees/)
* [r/birdswitharms](https://www.reddit.com/r/birdswitharms/)
* [r/totallynotrobots](https://old.reddit.com/r/totallynotrobots)

...need I go on? (If you're curious and/or want to be entertained indefinitely, here is a [thread](https://www.reddit.com/r/AskReddit/comments/dd49gw/what_are_some_really_really_weird_subreddits/) with these and much, much more.)

Most of the time when a post is deemed irrelevant to a particular subreddit, it will simply be removed by moderators or a bot. However, depending on the subreddit and how welcoming they are to newbies, sometimes it can lead to very unfriendly responses and/or bans.

So how does one go about deciding where to post or pose a question?

Post Here aims to take the guesswork out of this process.

### The Solution

The goal with the Post Here app, as mentioned, is to provide a tool that makes it quick and easy to find the most appropriate subreddits for any given post. A user would simply provide the title and text of the their prospective post and the app would provide the user with a list of subreddit recommendations.

Recommendations are produced by a model attempts to predict which subreddit a given post would belong to. The model was built using Scikit-learn, and was trained on a large dataset of reddit posts. In order to serve the recommendations to the web app, an API was built using Flask and deployed to Heroku.

The live version of the app is linked below.

[Post Here: The Subreddit Suggester](https://github.com/tobias-fyi/post_here_ds)

### My Role

I worked on the Post Here app with a remote, interdisciplinary team of data scientists, machine learning engineers, and web developers. I was the sole machine learning engineer on the team, responsible for the entire process of building and training the machine learning models.

The main challenge I ran into, which directed the iterative process, was scope and dimensionality management.

At this point in my machine learning journey, this was one of the larger datasets that I'd taken on. Uncompressed, the dataset we used was over 800mb of mostly natural language text.

One aspect of natural language processing to keep in mind with such a dataset is the curse of dimensionality. When processed, a natural language dataset of this size would likely fall prey to the curse of dimensionality and prove somewhat unwieldy without large amounts of processing power.

I was forced to research and apply various methods of addressing this problem in order to fit the resulting models on the free Heroku Dyno (500mb) while preserving adequate performance.

One important way I had to wrangle with scope management was in deciding how many classes to try and predict. The original dataset contains data for 1,000 subreddits. It was not within the scope of a a four-day project to build a classification model of a caliber that could accurately classify 1,000 classes.

In the beginning, I did try to build a basic model trained on all 1,000 classes. But with the time and processing power I had, it proved to be untenable. In the end, I settled for a model that classified text into 200 subreddits with a test accuracy of over 90%.
[review after re-validating]

---

## The Data

The dataset we ended up using to train the recommendation system is called the [Reddit Self-Post Classification Task dataset](https://www.kaggle.com/mswarbrickjones/reddit-selfposts), available on Kaggle thanks to Evolution AI. The full dataset clocks in at over 800mb, containing 1,013,000 rows: 1,000 posts each from 1,013 subreddits.

The data was posted to reddit between June 2016 and June 2018.

[more info from the article]

For more details on the dataset, refer to Evolution AI's [blog post](https://evolution.ai//blog/page/5/an-imagenet-like-text-classification-task-based-on-reddit-posts/).

### Wrangling and Exploration

As seems to be common with NLP projects, the process of wrangling the data was very much intertwined with the modeling process. Of course, this could be said about any machine learning project. However, I feel like it is particularly so in the case of NLP.

Therefore, this section—the one dedicated to data wrangling only—will be rather brief and basic. I go into much more detail in the Modeling section.

First, I needed to reduce the size of the dataset. I defined a subset of 12 categories which I thought were most relevant to the task at hand, and used that list to do the initial pruning. Those 12 categories left me with 305 unique subreddits and 305,000 rows. The list I used was as follows:

* health
* profession
* electronics
* hobby
* writing/stories
* advice/question
* social_group
* stem
* parenting
* books
* finance/money
* travel

Next, I took a random sample of those 305,000 rows. The result was a dataset with 91,500 rows, now consisting of between 250 and 340 rows per subreddit. If I tried to use all of the features (tokens, or words) that resulted from this corpus, even in its reduced state, it would still result in a serialized vocabulary and/or model too large for our free Heroku Dyno. However, the features used in the final model can be chosen based on how useful they are for the classification.

According to the dataset preview on Kaggle, there are quite a large number of missing values in each of the features—12%, 25%, and 39% of the subreddit, title, and selftext columns, respectively. However, I did not find any sign of those null values in the dataset nor mention of them in the dataset's companion blog post or article. I chocked it up to an error in the Kaggle preview.

Finally, I went about doing some basic preprocessing to get the data ready for vectorization. As described in the description page on Kaggle, newline and tab characters were replaced with their HTML equivalents, `<lb>` and `<tab>`. I removed those and other HTML entities using a regular expression. I also concatenated the `title` and `selftext` features into a single text feature in order to process them together.

In [4]:
# === Load the dataset === #
rspct = pd.read_csv("assets/data/rspct.tsv", sep="\t")
print(rspct.shape)
rspct.head(3)

(1013000, 4)


Unnamed: 0,id,subreddit,title,selftext
0,6d8knd,talesfromtechsupport,Remember your command line switches...,"Hi there, <lb>The usual. Long time lerker, first time poster, be kind etc. Sorry if this isn't the right place...<lb><lb>Alright. Here's the story. I'm an independent developer who produces my ow..."
1,58mbft,teenmom,"So what was Matt ""addicted"" to?","Did he ever say what his addiction was or is he still chugging beers while talking about how sober he is?<lb><lb>Edited to add: As an addict myself, anyone I know whose been an addict doesn't drin..."
2,8f73s7,Harley,No Club Colors,Funny story. I went to college in Las Vegas. This was before I knew anything about motorcycling whatsoever. Me and some college buddies would always go out on the strip to the dance clubs. We alwa...


#### Nulls

Kaggle says that 12%, 25%, and 39% of the subreddit, title, and selftext columns are null, respectively. If that is indeed the case, they did not get read into the dataframe correctly. However, it could be an error on Kaggle's part, seeing as there is no mention of these anywhere else in the description or blog post or article, nor sign of them during my explorations.

In [5]:
# === Null values === #
rspct.isnull().sum()

id           0
subreddit    0
title        0
selftext     0
dtype: int64

#### Preprocess

To prune the list of subreddits, I'll load in the `subreddit_info.csv` file, join, then choose a certain number of categories (category_1) to filter on.

In [6]:
# === Load info === #
info = pd.read_csv("assets/data/subreddit_info.csv", usecols=["subreddit", "category_1", "category_2"])
print(info.shape)
info.head()

(3394, 3)


Unnamed: 0,subreddit,category_1,category_2
0,whatsthatbook,advice/question,book
1,CasualConversation,advice/question,broad
2,Clairvoyantreadings,advice/question,broad
3,DecidingToBeBetter,advice/question,broad
4,HelpMeFind,advice/question,broad


In [7]:
# === Join the two dataframes === #
rspct = pd.merge(rspct, info, on="subreddit").drop(columns=["id"])
print(rspct.shape)
rspct.head()

(1013000, 5)


Unnamed: 0,subreddit,title,selftext,category_1,category_2
0,talesfromtechsupport,Remember your command line switches...,"Hi there, <lb>The usual. Long time lerker, first time poster, be kind etc. Sorry if this isn't the right place...<lb><lb>Alright. Here's the story. I'm an independent developer who produces my ow...",writing/stories,tech support
1,talesfromtechsupport,I work IT for a certain clothing company and they use iPod Touchs for scanning some items,"[ME]- Thank you fro calling Store support, this is David. How may I help you?<lb><lb>[Store]- Yeah, my iPod is frozen<lb><lb>[ME]- Okay, can I have you hold down the power and the home button at t...",writing/stories,tech support
2,talesfromtechsupport,It... It says right there on the screen...?,"Hi guys! <lb><lb>&amp;nbsp;<lb><lb>LTL, FTP - all that jazz. Starting you off with a short one.<lb><lb>&amp;nbsp;<lb><lb>I'm the senior supporter at a smaller tech company with clients all over t...",writing/stories,tech support
3,talesfromtechsupport,The computers not working. FIX IT NOW!,"Hey there TFTS! This is my second time posting. I don't work for any tech support company, but I do have friends, family and teachers at school that have no idea how stuff works.<lb><lb>This tale ...",writing/stories,tech support
4,talesfromtechsupport,A Storm of Unreasonableness,"Usual LTL, FTP. I have shared this story on a different site, but after reading TFTS for sometime I figured it'd belong here as well. <lb><lb>This is from when I worked at a 3rd party call center ...",writing/stories,tech support


In [8]:
# === Still no nulls === #
rspct.isnull().sum()  # That's a good sign

subreddit     0
title         0
selftext      0
category_1    0
category_2    0
dtype: int64

In [9]:
# === Look at categories === #
rspct["category_1"].value_counts()

video_game               100000
tv_show                   68000
health                    58000
profession                56000
software                  52000
electronics               51000
music                     43000
sports                    40000
sex/relationships         31000
hobby                     30000
geo                       29000
crypto                    29000
company/website           28000
other                     27000
anime/manga               26000
drugs                     23000
writing/stories           22000
programming               21000
arts                      21000
autos                     20000
advice/question           18000
education                 17000
animals                   17000
social_group              16000
politics/viewpoint        16000
food/drink                15000
card_game                 15000
stem                      14000
hardware/tools            14000
religion/supernatural     13000
parenting                 13000
books   

In [10]:
# === Define list of categories to keep === #
keep_cats = [
    "health",
    "profession",
    "electronics",
    "hobby",
    "writing/stories",
    "advice/question",
    "social_group",
    "stem",
    "parenting",
    "books",
    "finance/money",
    "travel",
]

# === Prune dataset to above categories === #
# Overwriting to save memory
rspct = rspct[rspct["category_1"].isin(keep_cats)]
print(rspct.shape)
print("Unique subreddits:", len(rspct["subreddit"].unique()))
rspct.head(2)

(305000, 5)
Unique subreddits: 305


Unnamed: 0,subreddit,title,selftext,category_1,category_2
0,talesfromtechsupport,Remember your command line switches...,"Hi there, <lb>The usual. Long time lerker, first time poster, be kind etc. Sorry if this isn't the right place...<lb><lb>Alright. Here's the story. I'm an independent developer who produces my ow...",writing/stories,tech support
1,talesfromtechsupport,I work IT for a certain clothing company and they use iPod Touchs for scanning some items,"[ME]- Thank you fro calling Store support, this is David. How may I help you?<lb><lb>[Store]- Yeah, my iPod is frozen<lb><lb>[ME]- Okay, can I have you hold down the power and the home button at t...",writing/stories,tech support


In [11]:
# === Take a sample of that === #
rspct = rspct.sample(frac=.3, random_state=seed)
print(rspct.shape)
rspct.head()

(91500, 5)


Unnamed: 0,subreddit,title,selftext,category_1,category_2
594781,stepparents,Ex Wants Toddler Son (2M) to Meet Her AP/SO - x-post from /r/divorce,Quick background: My soon-to-be ex-wife (26F) and I (27M) have been separated for about 5 months now. She has been in a serious relationship her AP (23M) whom she met and cheated on me with 6 mont...,parenting,step parenting
617757,bigseo,Do we raise our pricing?,"I took a management role at an agency. We're way, way under $500/mo for SEO pricing - and I'm embarrassed to say that we're hurting for business. Seems to me that it's a struggle to get clients to...",profession,seo
642368,chemistry,Mac vs. PC?,"Hello, all! I am currently a senior in high school and in the fall I will be going to SUNY Geneseo, majoring in chemistry and minoring in mathematics. <lb><lb>Geneseo requires it’s students to get...",stem,chemistry
325221,migraine,Beer as an aural abortive?,"Hiya folks,<lb><lb>I've been a migraine sufferer pretty much my whole life. For me intense auras, numbness, confusion, the inability to speak or see is BY FAR the worst aspect of the ordeal. When ...",health,migraine
524939,MouseReview,Recommend office mouse,I was hoping you folks could help me out. Here's my situation and requirements:<lb><lb>* I don't play games at all<lb>* Budget $30.00 or less<lb>* Shape as close to old Microsoft Intellimouse Opti...,electronics,computer mouse


In [12]:
# === Clean up a bit === #
# Concatenate title and selftext
rspct["text"] = rspct["title"] + " " + rspct["selftext"]

# Drop categories
rspct = rspct.drop(columns=["category_1", "category_2", "title", "selftext"])

print(rspct.shape)
rspct.head(2)

(91500, 2)


Unnamed: 0,subreddit,text
594781,stepparents,Ex Wants Toddler Son (2M) to Meet Her AP/SO - x-post from /r/divorce Quick background: My soon-to-be ex-wife (26F) and I (27M) have been separated for about 5 months now. She has been in a serious...
617757,bigseo,"Do we raise our pricing? I took a management role at an agency. We're way, way under $500/mo for SEO pricing - and I'm embarrassed to say that we're hurting for business. Seems to me that it's a s..."


In [13]:
# === Remove <lb>, <tab>, and other HTML entities === #
# NOTE: takes a couple minutes to run
rspct["text"] = rspct["text"].str.replace("(<lb>)*|(<tab>)*|(&amp;)*|(nbsp;)*", "")
rspct.head()

Unnamed: 0,subreddit,text
594781,stepparents,Ex Wants Toddler Son (2M) to Meet Her AP/SO - x-post from /r/divorce Quick background: My soon-to-be ex-wife (26F) and I (27M) have been separated for about 5 months now. She has been in a serious...
617757,bigseo,"Do we raise our pricing? I took a management role at an agency. We're way, way under $500/mo for SEO pricing - and I'm embarrassed to say that we're hurting for business. Seems to me that it's a s..."
642368,chemistry,"Mac vs. PC? Hello, all! I am currently a senior in high school and in the fall I will be going to SUNY Geneseo, majoring in chemistry and minoring in mathematics. Geneseo requires it’s students to..."
325221,migraine,"Beer as an aural abortive? Hiya folks,I've been a migraine sufferer pretty much my whole life. For me intense auras, numbness, confusion, the inability to speak or see is BY FAR the worst aspect o..."
524939,MouseReview,Recommend office mouse I was hoping you folks could help me out. Here's my situation and requirements:* I don't play games at all* Budget $30.00 or less* Shape as close to old Microsoft Intellimou...


In [14]:
# === Reset the index === #
rspct = rspct.reset_index(drop=True)
rspct.head(2)

Unnamed: 0,subreddit,text
0,stepparents,Ex Wants Toddler Son (2M) to Meet Her AP/SO - x-post from /r/divorce Quick background: My soon-to-be ex-wife (26F) and I (27M) have been separated for about 5 months now. She has been in a serious...
1,bigseo,"Do we raise our pricing? I took a management role at an agency. We're way, way under $500/mo for SEO pricing - and I'm embarrassed to say that we're hurting for business. Seems to me that it's a s..."


In [15]:
# === Save pruned dataset to file === #
rspct.to_csv("assets/data/rspct_small.csv")

In [16]:
# === List of subreddits === #
subreddits = rspct["subreddit"].unique()
print(len(subreddits))
subreddits[:50]

305


array(['stepparents', 'bigseo', 'chemistry', 'migraine', 'MouseReview',
       'Malazan', 'Standup', 'preppers', 'Invisalign', 'whatsthisplant',
       'CrohnsDisease', 'KingkillerChronicle', 'OccupationalTherapy',
       'churning', 'Libraries', 'acting', 'eczema', 'Allergies',
       'bigboobproblems', 'AskAnthropology', 'psychotherapy',
       'WayfarersPub', 'synthesizers', 'StopGaming', 'stopsmoking',
       'eroticauthors', 'amazonecho', 'TalesFromThePizzaGuy',
       'rheumatoid', 'homestead', 'VoiceActing', 'FinancialCareers',
       'Sleepparalysis', 'ProtectAndServe', 'short', 'Fibromyalgia',
       'teaching', 'PlasticSurgery', 'insomnia', 'PLC', 'rapecounseling',
       'peacecorps', 'paintball', 'autism', 'Nanny', 'Plumbing',
       'Epilepsy', 'asmr', 'fatpeoplestories', 'Magic'], dtype=object)

In [17]:
rspct["subreddit"].value_counts()

Dreams             340
Gifts              337
Cubers             333
cassetteculture    333
HFY                333
                  ... 
foreignservice     265
immigration        263
WritingPrompts     263
TryingForABaby     262
Physics            250
Name: subreddit, Length: 305, dtype: int64

---

## Modeling

In [18]:
# === Split up dataset into train and test === #
train, test = train_test_split(rspct, test_size=0.2)
train.shape, test.shape

((73200, 2), (18300, 2))

In [19]:
# === Split out feature/target === #
X_train = train["text"]
X_test = test["text"]

y_train = train["subreddit"]
y_test = test["subreddit"]

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

(73200,) (18300,)
(73200,) (18300,)


In [20]:
# === Encode the target using LabelEncoder === #

# This process naively transforms each class of the target into a number
le = LabelEncoder() # Instantiate a new encoder instance
le.fit(y_train)  # Fit it on training label data

# Transform both using the trained instance
y_train = le.transform(y_train)
y_test  = le.transform(y_test)

y_train[:8]

array([233, 271,  89,  86, 278, 197, 201, 126])

### Vectorization

Custom tokenizer function that removes stop words and punctuation, and reduces each token down to its lemma.

In [21]:
def tokenize(doc):
    """Simple version: extracts spacy lemmas and returns them as a list.
    Only filters spacy stopwords and punctuation.
    """
    doc = nlp(doc)
    tokens = []
    
    for token in doc:
        if (token.is_stop == False) and (token.is_punct == False):
            tokens.append(token.lemma_.strip().lower())

    return tokens

In [23]:
# === Vectorize! === #

# Extract features from the text data using bag-of-words method

tfidf = TfidfVectorizer(
    max_features=100000,
    min_df=5,
#     max_df=.98,
    ngram_range=(1,2),
#     tokenizer=tokenize
#     stop_words=nlp.Defaults.stop_words,  # Use spacy's stop words
    stop_words="english",
)

# Fit the vectorizer on the feature column to create vocab (doc-term matrix)
vocab = tfidf.fit(X_train)

# Get sparse document-term matrices
X_train_sparse = vocab.transform(X_train)
X_test_sparse = vocab.transform(X_test)

In [24]:
X_train_sparse

<73200x100000 sparse matrix of type '<class 'numpy.float64'>'
	with 5332578 stored elements in Compressed Sparse Row format>

### Feature Selection

As mentioned previously, the size of the corpus means the dimensionality of the featureset after vectorization will be very high. In fact, I passed in  100,000 as the maximum number of features to the vectorizer. It is generally not good practice to have a larger number of features (100,000) than observations (91,500).

To reduce it down from that 100,000, I used a process called select k best that does exactly what it sounds like: selects a certain number of the best features. The key aspect of this process is how to measure the value of the features; how to find which ones are the "best". The scoring function I used in this case is called [ch2](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html) (chi-squared).

This function calculates chi-squared statistics between each feature and the target, measuring the dependence, or correlation, between them. The intuition here is that features which are more correlated with the target are more likely to be useful to the model.

I played around with some different values for the maximum number of features to be selected. Ultimately, I was once again limited by the size of the free Heroku Dyno, and settled on 10,000. This allowed the deployment to go smoothly while retaining enough information for the model to have adequate performance.

In [26]:
# === Feature Selection === #
selector = SelectKBest(chi2, k=10000)

selector.fit(X_train_sparse, y_train)

X_train_select = selector.transform(X_train_sparse)
X_test_select  = selector.transform(X_test_sparse)

X_train_select.shape, X_test_select.shape

((73200, 10000), (18300, 10000))

In [27]:
X_train_select

<73200x10000 sparse matrix of type '<class 'numpy.float64'>'
	with 1272777 stored elements in Compressed Sparse Row format>

### Model validation

In this case, the model has a target that it is attempting to predict—a supervised problem. Therefore, the performance can be measured on a validation and test set.

To test out the recommendations I picked some posts and put them through the prediction pipeline to see what kinds of subreddits were getting recommended. For the most part, the predictions were decent. Sometimes it would fall over...[put some examples here]

The Kaggle page description:

> In our experiments, we have been optimising for the precision-at-K metric for K = {1, 3, 5}.

#### Baseline

For the baseline model, I decided to go with a basic random forest.

Model performance...

In [28]:
# === Evaluate performance using precision-at-k === #
def precision_at_k(y_true, y_pred, k=5):
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)
    y_pred = np.argsort(y_pred, axis=1)
    y_pred = y_pred[:, ::-1][:, :k]
    arr = [y in s for y, s in zip(y_true, y_pred)]
    return np.mean(arr)

In [29]:
# === Baseline RandomForest model === #
rfc = RandomForestClassifier(max_depth=32, n_jobs=-1, n_estimators=200)
rfc.fit(X_train_select, y_train)

RandomForestClassifier(max_depth=32, n_estimators=200, n_jobs=-1)

In [30]:
# === Create predictions on test feature === #
y_pred_proba_rfc = rfc.predict_proba(X_test_select)

# === For each prediction, find the index with the highest probability === #
y_pred_rfc = np.argmax(y_pred_proba_rfc, axis=1)
y_pred_rfc[:10]

array([135, 302, 274, 302, 129, 195,  44, 137,  70, 199])

In [31]:
print('precision@1 =', np.mean(y_test == y_pred_rfc))
print('precision@3 =', precision_at_k(y_test, y_pred_proba_rfc, 3))
print('precision@5 =', precision_at_k(y_test, y_pred_proba_rfc, 5))

precision@1 = 0.5002185792349727
precision@3 = 0.5840983606557377
precision@5 = 0.6060655737704918


#### Multinomial Naive Bayes

Multinomial naive Bayes is one of two classic naive Bayes models used for text classification. It is a probabilistic learning method for multinomially distributed data.

Model performance

In [32]:
# === Naive Bayes model === #
nb = MultinomialNB(alpha=0.1)
nb.fit(X_train_select, y_train)

MultinomialNB(alpha=0.1)

In [33]:
# === Create predictions on test feature === #
y_pred_proba = nb.predict_proba(X_test_select)

# === For each pred, find index with highest proba === #
y_pred = np.argmax(y_pred_proba, axis=1)
y_pred[:10]

array([135, 171, 273, 214, 129, 260, 250, 263,  70, 124])

In [34]:
# === Evaluate precision at k === #
print('precision@1 =', np.mean(y_test == y_pred))
print('precision@3 =', precision_at_k(y_test, y_pred_proba, 3))
print('precision@5 =', precision_at_k(y_test, y_pred_proba, 5))

precision@1 = 0.7566120218579235
precision@3 = 0.8753551912568306
precision@5 = 0.9073770491803279


### Recommendations

The API should return a list of recommendations, not a single prediction. To accomplish this, I wrote a function that returns the top 5 most likely subreddits and their respective probabilities.

In [36]:
# === Function to serve predictions === #
# The main functionality of the predict API endpoint

def predict(title: str, submission_text: str, return_count: int = 5):
    """
    Serve subreddit predictions.
    
    Parameters
    ----------
    title : string
        Title of post.
    submission_text : string
        Selftext that needs a home.
    return_count    : integer
        The desired number of recommendations.

    Returns
    -------
    Python dictionary formatted as follows:
        [{'subreddit': 'PLC', 'proba': 0.014454},
         ...
         {'subreddit': 'Rowing', 'proba': 0.005206}]
    """
    # Concatenate title and post text
    fulltext = str(title) + str(submission_text)
    # Vectorize the post -> sparse doc-term matrix
    post_sparse = vocab.transform([fulltext])
    # Feature selection
    post_select = selector.transform(post_sparse)
    # Generate predicted probabilities from trained model
    proba = nb.predict_proba(post_select)
    # Wrangle into correct format
    proba_dict = (pd
                .DataFrame(proba, columns=[le.classes_])  # Classes as column names
                .T  # Transpose so column names become index
                .reset_index()  # Pull out index into a column
                .rename(columns={"level_0": "name", 0: "proba"})  # Rename for aesthetics
                .sort_values(by="proba", ascending=False)  # Sort by probability
                .iloc[:return_count]  # n-top predictions to serve
                .to_dict(orient="records")
               )
    proba_json = {"predictions": proba_dict}
    
    return proba_json

In [40]:
title_science = """Is there an evolutionary benefit to eating spicy food that lead to consumption across numerous cultures throughout history? Or do humans just like the sensation?"""

post_science = """I love spicy food and have done ever since I tried it. By spicy I mean HOT, like chilli peppers (we say spicy in England, I don't mean to state the obvious I'm just not sure if that's a global term and I've assumed too much before). I love a vast array of spicy foods from all around the world. I was just wondering if there was some evolutionary basis as to why spicy food managed to become some widely consumed historically. Though there seem to

It way well be that we just like a tingly mouth, the simple things in life."""

science_recs = predict(title_science, post_science)
science_recs

{'predictions': [{'name': 'GERD', 'proba': 0.00990604371082869},
  {'name': 'misophonia', 'proba': 0.009255419029406112},
  {'name': 'AskAnthropology', 'proba': 0.008865383231338406},
  {'name': 'fatpeoplestories', 'proba': 0.008636894366240455},
  {'name': 'emetophobia', 'proba': 0.008542160336507437}]}

In [41]:
# === Test post from r/buildmeapc === #

title_pc = """Looking for help with a build"""

post_pc = """I posted my wants for my build about 2 months ago. Ordered them and when I went to build it I was soooooo lost. It took 3 days to put things together because I was afraid I would break something when I finally got the parts together it wouldn’t start, I was so defeated. With virtually replacing everything yesterday it finally booted and I couldn’t be more excited!"""

post_pc_recs = predict(title_pc, post_pc, 10)
post_pc_recs

{'predictions': [{'name': 'lego', 'proba': 0.009030090934930563},
  {'name': 'vandwellers', 'proba': 0.0077818300061768255},
  {'name': 'Luthier', 'proba': 0.006943420255904971},
  {'name': 'rccars', 'proba': 0.006840724486847484},
  {'name': 'cosplay', 'proba': 0.006534249663711068},
  {'name': 'fightsticks', 'proba': 0.006515524774501326},
  {'name': 'Machinists', 'proba': 0.006406856199114954},
  {'name': 'cade', 'proba': 0.0063971277684109805},
  {'name': 'MechanicalKeyboards', 'proba': 0.006307983344624362},
  {'name': 'robotics', 'proba': 0.00599901604666438}]}

In [42]:
# === Example post from 'r/learnprogramming' === #

post_title = """What to do about java vs javascript"""

post = """I am a new grad looking for a job and currently in the process with a company for a junior backend engineer role. I was under the impression that the position was Javascript but instead it is actually Java. My general programming and "leet code" skills are pretty good, but my understanding of Java is pretty shallow. How can I use the next three days to best improve my general Java knowledge? Most resources on the web seem to be targeting complete beginners. Maybe a book I can skim through in the next few days?

Edit:

A lot of people are saying "the company is a sinking ship don't even go to the interview". I just want to add that the position was always for a "junior backend engineer". This company uses multiple languages and the recruiter just told me the incorrect language for the specific team I'm interviewing for. I'm sure they're mainly interested in seeing my understanding of good backend principles and software design, it's not a senior lead Java position."""

# === Test out the function === #
post_pred = predict(post_title, post)  # Default is 5 results
post_pred

{'predictions': [{'name': 'cscareerquestions', 'proba': 0.4090955393654394},
  {'name': 'devops', 'proba': 0.02566286765112885},
  {'name': 'interviews', 'proba': 0.02562100876400959},
  {'name': 'resumes', 'proba': 0.024405441404559483},
  {'name': 'datascience', 'proba': 0.023768874262813995}]}

In [44]:
# === Test it out with another dummy post === #

title_book = "Looking for books with great plot twists"

# This one comes from r/suggestmeabook
post2 = """I've been dreaming about writing my own stort story for a while but I want to give it an unexpected ending. I've read lots of books, but none of them had the plot twist I want. I want to read books with the best plot twists, so that I can analyze what makes a good plot twist and write my own story based on that points. I don't like romance novels and I mostly enjoy sci-fi or historical books but anything beside romance novels would work for me, it doesn't have to be my type of novel. I'm open to experience after all. I need your help guys. Thanks in advance."""

# === This time with 10 results === #
post2_pred = predict(title_book, post2, 10)
post2_pred

{'predictions': [{'name': 'suggestmeabook', 'proba': 0.3762950415068399},
  {'name': 'writing', 'proba': 0.1661331608591028},
  {'name': 'whatsthatbook', 'proba': 0.06101540341494861},
  {'name': 'eroticauthors', 'proba': 0.04492331535382657},
  {'name': 'ComicBookCollabs', 'proba': 0.024625040605292445},
  {'name': 'TheDarkTower', 'proba': 0.019166284059049136},
  {'name': 'Malazan', 'proba': 0.01742526148901986},
  {'name': 'DestructiveReaders', 'proba': 0.01740848101764242},
  {'name': 'WoT', 'proba': 0.013847077414874202},
  {'name': 'WritingPrompts', 'proba': 0.007749859890078947}]}

### Model deployment

As mentioned, the model, vocab, and feature selector were all serialized using Python's pickle module. In the Flask app, the pickled objects are loaded and ready for use, just like that.

I will go over the details of how the Flask app was set up in a separate blog post.

---

## Final Thoughts



### Scope Management, Revisited

[Potential improvements to the models here]

* Tune hyperparameters
* Deploying larger model to AWS
* Classifying first into category, then by specific subreddit
* Using more data for more subreddits up front, then reducing the number of features
  * TODO: train a model with the full dataset in separate notebook