# Post Here: Subreddit Predictor

## Recommendation API - 1.6

> aka: the MVP Classifier® with fewer classes

---
---

## Intro - MVP Classifier (Model #6)

This is the sixth iteration of the model for recommending (predicting) appropriate subreddits for a post.

The model will be trained using the [reddit self-post classification task dataset](https://www.kaggle.com/mswarbrickjones/reddit-selfposts), available on Kaggle thanks to [Evolution AI](https://evolution.ai//blog/page/5/an-imagenet-like-text-classification-task-based-on-reddit-posts/).

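The full dataset includes 1,013,000 rows (1,000 records from each of 1,013 subreddits). This iteration works from a saved 100,000-row sample and keeps only the first 200 subreddits that appear in it.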

---

### Imports

In [1]:
# === General imports === #
import pandas as pd
import numpy as np
import os

In [2]:
# === sklearn imports === #
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

---

### Load and preprocess the data

In [45]:
# === Load the saved version === #
df1 = pd.read_csv("rspct_100k.csv", sep="\t")

# === First looks === #
print(df1.shape)
df1.head()

(100000, 4)


Unnamed: 0,id,subreddit,title,selftext
0,0,talesfromtechsupport,Remember your command line switches...,"Hi there, <lb>The usual. Long time lerker, fi..."
1,1,teenmom,"So what was Matt ""addicted"" to?",Did he ever say what his addiction was or is h...
2,2,Harley,No Club Colors,Funny story. I went to college in Las Vegas. T...
3,3,ringdoorbell,"Not door bell, but floodlight mount height.",I know this is a sub for the 'Ring Doorbell' b...
4,4,intel,Worried about my 8700k small fft/data stress r...,"Prime95 (regardless of version) and OCCT both,..."


In [46]:
# === Get list of subreddits === #
subreddits = df1["subreddit"].unique()
subreddits

array(['talesfromtechsupport', 'teenmom', 'Harley', ..., 'halo',
       'gtaonline', 'mead'], dtype=object)

In [47]:
# === Prune list of subreddits === #
num_classes = 200
sub_small = subreddits[:num_classes]
sub_small.shape

(200,)

In [65]:
sub_small

array(['talesfromtechsupport', 'teenmom', 'Harley', 'ringdoorbell',
       'intel', 'residentevil', 'BATProject', 'hockeyplayers', 'asmr',
       'rawdenim', 'steinsgate', 'DBZDokkanBattle', 'Nootropics', 'l5r',
       'NameThatSong', 'homeless', 'antidepressants', 'absolver',
       'KissAnime', 'sissyhypno', 'oculusnsfw', 'dpdr', 'Garmin',
       'AskLiteraryStudies', 'poetry_critics', 'skiing', 'shrimptank',
       'logorequests', 'Stargate', 'foreskin_restoration', 'sharepoint',
       'synthesizers', 'gravityfalls', 'androiddev', 'Grimdawn',
       'driving', 'FORTnITE', 'dndnext', 'Magic', 'MtvChallenge',
       'FoWtcg', 'harrypotter', 'TryingForABaby', 'sewing', 'foxholegame',
       'madmen', 'JUSTNOMIL', 'APStudents', 'sharditkeepit',
       'amateurradio', 'sleeptrain', 'fatpeoplestories', 'GameStop',
       'scuba', 'Firefighting', 'Mustang', 'riverdale', 'flying',
       'bartenders', 'scooters', 'trumpet', 'projecteternity',
       'musictheory', 'factorio', 'SexToys', 'E

In [48]:
# === Extract only those rows from original dataframe === #
df2 = df1[df1["subreddit"].isin(sub_small)]
df2.shape

(20041, 4)
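Because `unique()` returns values in order of first appearance, slicing it keeps whichever subreddits happen to show up first in the file. A minimal sketch on a toy frame of how one might check how many posts each retained class contributes (the data and the names `toy`, `keep`, `pruned`, and `counts` are illustrative, not from the notebook):

```python
import pandas as pd

# Toy stand-in for the real dataframe
toy = pd.DataFrame({
    "subreddit": ["a", "b", "a", "c", "b", "a", "d"],
    "title": list("tuvwxyz"),
})

# unique() preserves order of first appearance, so this keeps "a" and "b"
keep = toy["subreddit"].unique()[:2]

# Keep only the rows belonging to the retained classes
pruned = toy[toy["subreddit"].isin(keep)]

# Posts per retained class; a very uneven result would suggest rebalancing
counts = pruned["subreddit"].value_counts()
print(counts)
```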

In [49]:
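# === Concatenate title and selftext === #
# NOTE: df3 was never defined in the original cell. Building it from df2 without
# the "id" column is an assumption that matches the 4-column shapes shown below.
df3 = df2.drop(columns=["id"], errors="ignore").copy()
df3["text"] = df3["title"] + " " + df3["selftext"]
df3.head()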

Unnamed: 0,subreddit,title,selftext,text
0,talesfromtechsupport,Remember your command line switches...,"Hi there, <lb>The usual. Long time lerker, fi...",Remember your command line switches... Hi ther...
1,teenmom,"So what was Matt ""addicted"" to?",Did he ever say what his addiction was or is h...,"So what was Matt ""addicted"" to? Did he ever sa..."
2,Harley,No Club Colors,Funny story. I went to college in Las Vegas. T...,No Club Colors Funny story. I went to college ...
3,ringdoorbell,"Not door bell, but floodlight mount height.",I know this is a sub for the 'Ring Doorbell' b...,"Not door bell, but floodlight mount height. I ..."
4,intel,Worried about my 8700k small fft/data stress r...,"Prime95 (regardless of version) and OCCT both,...",Worried about my 8700k small fft/data stress r...
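One caveat with plain string concatenation in pandas: if either column is missing for a row, the combined value is missing too. Whether the real data contains such rows is not shown here, so the following is only a sketch on toy data of how the concatenation could be made robust (the frame and the names `naive` and `safe` are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["No Club Colors", "Title only"],
    "selftext": ["Funny story.", None],
})

# Plain concatenation propagates the missing selftext
naive = toy["title"] + " " + toy["selftext"]

# Filling missing values with empty strings keeps the title
safe = toy["title"].fillna("") + " " + toy["selftext"].fillna("")

print(naive.isna().tolist())
print(safe.tolist())
```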


In [50]:
# === Drop unneeded columns === #
df4 = df3.drop(columns=["title", "selftext"])
df4.head()

Unnamed: 0,subreddit,text
0,talesfromtechsupport,Remember your command line switches... Hi ther...
1,teenmom,"So what was Matt ""addicted"" to? Did he ever sa..."
2,Harley,No Club Colors Funny story. I went to college ...
3,ringdoorbell,"Not door bell, but floodlight mount height. I ..."
4,intel,Worried about my 8700k small fft/data stress r...


In [51]:
# === Export new pruned dataframe to csv === #
df4.to_csv("rspct_200_class.csv")

In [52]:
# === Split up dataset into train and test === #
# from sklearn.model_selection import train_test_split

# 80% train, 20% test (random split; pass stratify=df3["subreddit"] to stratify on the target)
train, test = train_test_split(df3, test_size=0.2)

train.shape, test.shape

((16032, 4), (4009, 4))
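The split above is a plain random split. If keeping the class proportions identical in train and test matters, `train_test_split` accepts a `stratify` argument. A small sketch on toy data, with `random_state` added only to make the example repeatable (both the data and the names `train_s` and `test_s` are assumptions for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Two balanced toy classes, ten posts each
toy = pd.DataFrame({
    "subreddit": ["a"] * 10 + ["b"] * 10,
    "text": [f"post {i}" for i in range(20)],
})

# stratify keeps the class ratio the same in both splits
train_s, test_s = train_test_split(
    toy, test_size=0.2, stratify=toy["subreddit"], random_state=42
)

print(test_s["subreddit"].value_counts())
```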

In [53]:
# === Arrange data into feature and target === #

# MVP model only uses 'text' feature
X_train = train["text"]
X_test = test["text"]

# Predict the subreddit of each post
y_train = train["subreddit"]
y_test = test["subreddit"]

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

(16032,) (4009,)
(16032,) (4009,)


In [54]:
# === Encode the target using LabelEncoder === #

# This process naively transforms each class of the target into a number
# from sklearn.preprocessing import LabelEncoder

le = LabelEncoder() # Instantiate a new encoder instance
le.fit(y_train)  # Fit it on training label data

# Transform both using the train-fit instance
y_train = le.transform(y_train)
y_test  = le.transform(y_test)

y_train[:8]

array([ 48,  87,  27,  76,  84,  59,  20, 111])
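The integer codes are only meaningful together with the fitted encoder, which is why it gets pickled later alongside the model. A minimal round-trip sketch on a few subreddit names (`toy_le`, `codes`, and `names` are illustrative names):

```python
from sklearn.preprocessing import LabelEncoder

toy_le = LabelEncoder()
toy_le.fit(["teenmom", "Harley", "intel", "Harley"])

# classes_ holds the sorted unique labels; a label's code is its position there
print(toy_le.classes_)

# Encode, then map the codes back to subreddit names
codes = toy_le.transform(["intel", "Harley"])
names = toy_le.inverse_transform(codes)
print(codes, names)
```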

---
---

### Vectorization

In [55]:
# === Vectorize! === #

# Extract features from the text data using bag-of-words (single words + bigrams).
# Uses tfidf weighting (helps a little for Naive Bayes in general).
# from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    max_features=100000,
    min_df=5,
    ngram_range=(1,2),
    stop_words="english",  # TODO: try out spacy's or sklearn's stopwords
)

# Fit the vectorizer on the training text to learn the vocabulary and idf weights.
# fit() returns the fitted vectorizer itself, so `vocab` is the same object as `tfidf`.
vocab = tfidf.fit(X_train)

# Get sparse document-term matrices
X_train_sparse = vocab.transform(X_train)
X_test_sparse = vocab.transform(X_test)
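To make the effect of `ngram_range=(1, 2)` and `stop_words="english"` concrete, here is a small sketch on three made-up documents. The documents and the names `toy_tfidf`, `X_toy`, and `terms` are illustrative, and `min_df` and `max_features` are left at their defaults because the toy corpus is tiny:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the gpu fan is loud", "the cpu fan is quiet", "loud gpu fan"]

toy_tfidf = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X_toy = toy_tfidf.fit_transform(docs)

# vocabulary_ maps each learned term (unigram or bigram) to its column index
terms = sorted(toy_tfidf.vocabulary_)
print(terms)
print(X_toy.shape)  # (number of documents, number of terms)
```

Stop words are removed before bigrams are formed, so "gpu fan" becomes a feature while "the" never appears in the vocabulary.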

In [56]:
# === Feature Selection === #

from sklearn.feature_selection import chi2, SelectKBest

selector = SelectKBest(chi2, k=10000)  # k must be passed by keyword in recent scikit-learn

selector.fit(X_train_sparse, y_train)

X_train_select = selector.transform(X_train_sparse)
X_test_select  = selector.transform(X_test_sparse)

X_train_select.shape, X_test_select.shape

((16032, 10000), (4009, 10000))
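The chi-squared score rewards features whose values are distributed unevenly across classes. A toy sketch with one informative column and one constant column (the arrays and the names `toy_selector` and `kept` are illustrative only):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Column 0 is non-zero only for class 1; column 1 is the same for every row
X_toy = np.array([[3, 1], [4, 1], [0, 1], [0, 1]])
y_toy = np.array([1, 1, 0, 0])

toy_selector = SelectKBest(chi2, k=1).fit(X_toy, y_toy)

# Indices of the columns that survive selection
kept = toy_selector.get_support(indices=True)
print(kept)
print(toy_selector.transform(X_toy).shape)
```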

In [24]:
# === Baseline RandomForest model === #
from sklearn.ensemble import RandomForestClassifier

# Instantiate and train the model
rfc = RandomForestClassifier(max_depth=32, n_jobs=-1, n_estimators=200)
rfc.fit(X_train_sparse, y_train)  # Fit on the vectorized text, not the raw strings

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=32, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=200,
                       n_jobs=-1, oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

In [26]:
# === Evaluate performance using precision-at-k === #

def precision_at_k(y_true, y_pred, k=5):
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)
    y_pred = np.argsort(y_pred, axis=1)
    y_pred = y_pred[:, ::-1][:, :k]
    arr = [y in s for y, s in zip(y_true, y_pred)]
    return np.mean(arr)

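# Predicted classes and class probabilities on the vectorized test set
y_pred_rfc = rfc.predict(X_test_sparse)
y_pred_proba_rfc = rfc.predict_proba(X_test_sparse)

print('precision@1 =', np.mean(y_test == y_pred_rfc))
print('precision@3 =', precision_at_k(y_test, y_pred_proba_rfc, 3))
print('precision@5 =', precision_at_k(y_test, y_pred_proba_rfc, 5))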

precision@1 = 0.27855
precision@3 = 0.319
precision@5 = 0.32755
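As defined here, precision-at-k counts a prediction as correct when the true class is among the k highest-probability classes (this is often called top-k accuracy). A worked toy example, restating the notebook's function in compact form; the probability matrix and labels are made up:

```python
import numpy as np

def precision_at_k(y_true, y_proba, k=5):
    """Fraction of rows whose true label is among the k highest-scoring classes."""
    y_true = np.asarray(y_true)
    # Sort class indices by descending score and keep the first k per row
    top_k = np.argsort(np.asarray(y_proba), axis=1)[:, ::-1][:, :k]
    return np.mean([y in row for y, row in zip(y_true, top_k)])

proba = np.array([
    [0.6, 0.3, 0.1],  # true class 0 is ranked first
    [0.5, 0.1, 0.4],  # true class 2 is ranked second
    [0.7, 0.1, 0.2],  # true class 1 is ranked last
])
labels = [0, 2, 1]

p1 = precision_at_k(labels, proba, k=1)  # only the first row is a hit
p2 = precision_at_k(labels, proba, k=2)  # the first two rows are hits
print(p1, p2)
```

This only works because the column order of `predict_proba` matches the integer codes produced by the `LabelEncoder`.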


In [30]:
# === Function to serve predictions - Random Forest === #


def predict(post: str, n: int = 5) -> list:
    """
    Serve subreddit predictions.
    
    Parameters
    ----------
    post : string
        Selftext that needs a home.
    n    : integer
        The number of top predictions to return.

    Returns
    -------
    Python list of dictionaries, sorted by descending probability:
        [{'subreddit': 'PLC', 'proba': 0.014454},
         ...
         {'subreddit': 'Rowing', 'proba': 0.005206}]
    """
    
    # Vectorize the post -> sparse doc-term matrix
    post_vec = vocab.transform([post])
    
    # Generate predicted probabilities from trained model
    proba = rfc.predict_proba(post_vec)
    
    # Wrangle into correct format
    return (pd
                .DataFrame(proba, columns=[le.classes_])  # Classes as column names
                .T  # Transpose so column names become index
                .reset_index()  # Pull out index into a column
                .rename(columns={"level_0": "subreddit", 0: "proba"})  # Rename for aesthetics
                .sort_values(by="proba", ascending=False)  # Sort by probability
                .iloc[:n]  # n-top predictions to serve
                .to_dict(orient="records")
               )

In [57]:
# === Naive Bayes model === #
# from sklearn.naive_bayes import MultinomialNB

# Instantiate and train the model
nb = MultinomialNB(alpha=0.1)
nb.fit(X_train_select, y_train)

MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)

In [59]:
def precision_at_k(y_true, y_pred, k=5):
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)
    y_pred = np.argsort(y_pred, axis=1)
    y_pred = y_pred[:, ::-1][:, :k]
    arr = [y in s for y, s in zip(y_true, y_pred)]
    return np.mean(arr)

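# Predicted classes and class probabilities on the selected test features
y_pred = nb.predict(X_test_select)
y_pred_proba = nb.predict_proba(X_test_select)

print('precision@1 =', np.mean(y_test == y_pred))
print('precision@3 =', precision_at_k(y_test, y_pred_proba, 3))
print('precision@5 =', precision_at_k(y_test, y_pred_proba, 5))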

precision@1 = 0.786979296582689
precision@3 = 0.8877525567473186
precision@5 = 0.9156896981790971


In [43]:
# === Function to serve predictions - Naive Bayes === #


def predict(post: str, n: int = 5) -> list:
    """
    Serve subreddit predictions.
    
    Parameters
    ----------
    post : string
        Selftext that needs a home.
    n    : integer
        The number of top predictions to return.

    Returns
    -------
    Python list of dictionaries, sorted by descending probability:
        [{'subreddit': 'PLC', 'proba': 0.014454},
         ...
         {'subreddit': 'Rowing', 'proba': 0.005206}]
    """
    
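    # Vectorize the post -> sparse doc-term matrix
    post_vec = vocab.transform([post])
    
    # Apply the same feature selection the model was trained with
    post_vec = selector.transform(post_vec)
    
    # Generate predicted probabilities from trained model
    proba = nb.predict_proba(post_vec)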
    
    # Wrangle into correct format
    return (pd
                .DataFrame(proba, columns=[le.classes_])  # Classes as column names
                .T  # Transpose so column names become index
                .reset_index()  # Pull out index into a column
                .rename(columns={"level_0": "subreddit", 0: "proba"})  # Rename for aesthetics
                .sort_values(by="proba", ascending=False)  # Sort by probability
                .iloc[:n]  # n-top predictions to serve
                .to_dict(orient="records")
               )

In [62]:
# === Function to serve predictions === #
# The main functionality of the predict API endpoint

def predict(title: str, submission_text: str, return_count: int = 5) -> dict:
    """
    Serve subreddit predictions.
    
    Parameters
    ----------
    title : string
        Title of the post.
    submission_text : string
        Selftext that needs a home.
    return_count : integer
        The number of top predictions to return.

    Returns
    -------
    Python dictionary formatted as follows:
        {'predictions': [{'name': 'PLC', 'proba': 0.014454},
                         ...
                         {'name': 'Rowing', 'proba': 0.005206}]}
    """
    # Concatenate title and post text, separated by a space as in training
    fulltext = str(title) + " " + str(submission_text)
    
    # Vectorize the post -> sparse doc-term matrix
    post_sparse = vocab.transform([fulltext])
    
    # Feature selection
    post_select = selector.transform(post_sparse)
    
    # Generate predicted probabilities from trained model
    proba = nb.predict_proba(post_select)
    
    # Wrangle into correct format
    proba_dict = (pd
                .DataFrame(proba, columns=[le.classes_])  # Classes as column names
                .T  # Transpose so column names become index
                .reset_index()  # Pull out index into a column
                .rename(columns={"level_0": "name", 0: "proba"})  # Rename for aesthetics
                .sort_values(by="proba", ascending=False)  # Sort by probability
                .iloc[:return_count]  # n-top predictions to serve
                .to_dict(orient="records")
               )
    
    proba_json = {"predictions": proba_dict}
    
    return proba_json
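The function above applies three fitted objects in a fixed order: vectorizer, selector, classifier. One possible alternative, not what this notebook does, is to chain them in a scikit-learn `Pipeline` so that a single object handles both training and prediction. A rough sketch on a made-up four-post corpus (the data, the subreddit labels, and the names `pipe`, `top_n`, and `recs` are all assumptions for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up training set
texts = [
    "gpu fan is loud",
    "cpu overheating under load",
    "sourdough starter help",
    "bread dough too sticky",
]
labels = ["buildapc", "buildapc", "Breadit", "Breadit"]

# Same three stages as the notebook, with toy-sized settings
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("select", SelectKBest(chi2, k=5)),
    ("nb", MultinomialNB(alpha=0.1)),
])
pipe.fit(texts, labels)

def top_n(post, n=2):
    """Return the n most probable classes for one post, best first."""
    proba = pipe.predict_proba([post])[0]
    order = np.argsort(proba)[::-1][:n]
    return [{"name": str(pipe.classes_[i]), "proba": float(proba[i])} for i in order]

recs = top_n("my dough is sticky")
print(recs)
```

With a pipeline there would be one object to pickle instead of three, and the label names come from `pipe.classes_` rather than a separate encoder.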

In [63]:
title_science = """Is there an evolutionary benefit to eating spicy food that lead to consumption across numerous cultures throughout history? Or do humans just like the sensation?"""

post_science = """I love spicy food and have done ever since I tried it. By spicy I mean HOT, like chilli peppers (we say spicy in England, I don't mean to state the obvious I'm just not sure if that's a global term and I've assumed too much before). I love a vast array of spicy foods from all around the world. I was just wondering if there was some evolutionary basis as to why spicy food managed to become some widely consumed historically. Though there seem to

It way well be that we just like a tingly mouth, the simple things in life."""

science_recs = predict(title_science, post_science)
science_recs

{'predictions': [{'name': 'tacobell', 'proba': 0.03354927608126176},
  {'name': 'bourbon', 'proba': 0.025214174214901387},
  {'name': 'AskAnthropology', 'proba': 0.015927474358951307},
  {'name': 'communism101', 'proba': 0.012132597568157405},
  {'name': 'homeless', 'proba': 0.012027423707963732}]}

In [22]:
# === Test post from r/buildmeapc === #

post_pc = """I posted my wants for my build about 2 months ago. Ordered them and when I went to build it I was soooooo lost. It took 3 days to put things together because I was afraid I would break something when I finally got the parts together it wouldn’t start, I was so defeated. With virtually replacing everything yesterday it finally booted and I couldn’t be more excited!"""

post_pc_recs = predict(post_pc, 10)
post_pc_recs

[{'subreddit': 'scooters', 'proba': 0.01277275318447536},
 {'subreddit': 'Volkswagen', 'proba': 0.012102325928117287},
 {'subreddit': 'Nootropics', 'proba': 0.01152084521575761},
 {'subreddit': 'Harley', 'proba': 0.01143704193857149},
 {'subreddit': 'woodworking', 'proba': 0.009904775134851692},
 {'subreddit': 'intel', 'proba': 0.009557180707430255},
 {'subreddit': 'Mustang', 'proba': 0.00940683354156282},
 {'subreddit': 'Warframe', 'proba': 0.009253723053607617},
 {'subreddit': 'cosplay', 'proba': 0.009198133668539598},
 {'subreddit': 'needforspeed', 'proba': 0.008983413839178141}]

In [23]:
# === Example post from 'r/learnprogramming' === #

post = """I am a new grad looking for a job and currently in the process with a company for a junior backend engineer role. I was under the impression that the position was Javascript but instead it is actually Java. My general programming and "leet code" skills are pretty good, but my understanding of Java is pretty shallow. How can I use the next three days to best improve my general Java knowledge? Most resources on the web seem to be targeting complete beginners. Maybe a book I can skim through in the next few days?

Edit:

A lot of people are saying "the company is a sinking ship don't even go to the interview". I just want to add that the position was always for a "junior backend engineer". This company uses multiple languages and the recruiter just told me the incorrect language for the specific team I'm interviewing for. I'm sure they're mainly interested in seeing my understanding of good backend principles and software design, it's not a senior lead Java position."""

# === Test out the function === #
post_pred = predict(post)  # Default is 5 results
post_pred

[{'subreddit': 'androiddev', 'proba': 0.058162389152266694},
 {'subreddit': 'resumes', 'proba': 0.05783334198143744},
 {'subreddit': 'PLC', 'proba': 0.042594243616314705},
 {'subreddit': 'graphic_design', 'proba': 0.02975165820725173},
 {'subreddit': 'digitalnomad', 'proba': 0.027636577001649492}]

In [24]:
# === Test it out with another dummy post === #

# This one comes from r/suggestmeabook
post2 = """I've been dreaming about writing my own stort story for a while but I want to give it an unexpected ending. I've read lots of books, but none of them had the plot twist I want. I want to read books with the best plot twists, so that I can analyze what makes a good plot twist and write my own story based on that points. I don't like romance novels and I mostly enjoy sci-fi or historical books but anything beside romance novels would work for me, it doesn't have to be my type of novel. I'm open to experience after all. I need your help guys. Thanks in advance."""

# === This time with 10 results === #
post2_pred = predict(post2, n=10)
post2_pred

[{'subreddit': 'AskLiteraryStudies', 'proba': 0.06125011240086272},
 {'subreddit': 'dresdenfiles', 'proba': 0.036049946904906274},
 {'subreddit': 'TheExpanse', 'proba': 0.022492214853231957},
 {'subreddit': 'fatestaynight', 'proba': 0.02155218197542317},
 {'subreddit': 'blackmirror', 'proba': 0.020893245561080725},
 {'subreddit': 'Stargate', 'proba': 0.02070034165375228},
 {'subreddit': 'WhiteWolfRPG', 'proba': 0.019633219869847146},
 {'subreddit': 'harrypotter', 'proba': 0.01505526636330939},
 {'subreddit': 'steinsgate', 'proba': 0.013516360533768814},
 {'subreddit': 'APStudents', 'proba': 0.012423117830967478}]

---

### Pickle Time

In [25]:
# === Create pickle func to make pickling (a little) easier === #

def picklizer(to_pickle, filename, path):
    """
    Creates a pickle file.
    
    Parameters
    ----------
    to_pickle : Python object
        The trained / fitted instance of the 
        transformer or model to be pickled.
    filename : string
        The desired name of the output file,
        not including the '.pkl' extension.
    path : string or path-like object
        The path to the desired output directory.
    """
    import os
    import pickle

    # Create the path to save location
    picklepath = os.path.join(path, filename)

    # Use context manager to open file
    with open(picklepath, "wb") as p:
        pickle.dump(to_pickle, p)

In [64]:
# === Picklize! === #
filepath = "./pickles"  # Change this accordingly

# Export LabelEncoder as pickle
picklizer(le, "06_le.pkl", filepath)

# Export selector as pickle
picklizer(selector, "06_selector.pkl", filepath)

# Export vectorizer as pickle
picklizer(vocab, "06_vocab.pkl", filepath)

# Export naive bayes model as pickle
picklizer(nb, "06_nb.pkl", filepath)

Load and consume pickles...

> _Enjoy responsibly._
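Two practical notes on the pickling step: `open()` does not create missing directories, and pickle files should only be loaded from trusted sources because unpickling can execute arbitrary code. The following round-trip sketch uses a temporary directory and a hypothetical `load_pickle` helper (not part of the notebook) as a counterpart to `picklizer`:

```python
import os
import pickle
import tempfile

from sklearn.preprocessing import LabelEncoder

def load_pickle(filename, path):
    """Load and return the object stored at path/filename."""
    with open(os.path.join(path, filename), "rb") as p:
        return pickle.load(p)

toy_le = LabelEncoder().fit(["Harley", "intel", "teenmom"])

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "pickles")
    os.makedirs(out_dir, exist_ok=True)  # Create the output directory first

    # Write the fitted encoder, then read it back
    with open(os.path.join(out_dir, "toy_le.pkl"), "wb") as p:
        pickle.dump(toy_le, p)
    restored = load_pickle("toy_le.pkl", out_dir)

# The restored encoder carries the same fitted classes
same = list(restored.classes_) == list(toy_le.classes_)
print(same)
```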

In [68]:
# === Load the fitted LabelEncoder === #
import pickle

le_path = os.path.join(filepath, "04_le.pkl")

# Use context manager to open and load pickle
with open(le_path, "rb") as p:
    le = pickle.load(p)

In [69]:
# === Load the trained vectorizer === #
import pickle

vocab_path = os.path.join(filepath, "04_vocab.pkl")

# Use context manager to open and load pickle
with open(vocab_path, "rb") as p:
    vocab = pickle.load(p)

In [70]:
# === Load the somewhat-trained Random Forest classifier === #
# import pickle

rfc_path = os.path.join(filepath, "04_rfc.pkl")

# Use context manager to open and load pickle
with open(rfc_path, "rb") as p:
    rfc = pickle.load(p)

In [71]:
# === Test out the pickled versions === #

# This one comes from r/buildmeapc
post3 = """I posted my wants for my build about 2 months ago. Ordered them and when I went to build it I was soooooo lost. It took 3 days to put things together because I was afraid I would break something when I finally got the parts together it wouldn’t start, I was so defeated. With virtually replacing everything yesterday it finally booted and I couldn’t be more excited!"""

# This time I'll pass in 20, because why not?
post3_recs = predict(post3, 20)
post3_recs

[{'subreddit': 'parrots', 'proba': 0.00211370360467277},
 {'subreddit': 'flexibility', 'proba': 0.002096921513234382},
 {'subreddit': 'bladeandsoul', 'proba': 0.002071709687128192},
 {'subreddit': 'Rowing', 'proba': 0.0020525020657867544},
 {'subreddit': 'StudentLoans', 'proba': 0.0020488032050522157},
 {'subreddit': 'gigantic', 'proba': 0.0019414817435439607},
 {'subreddit': 'PLC', 'proba': 0.0018997017900174518},
 {'subreddit': 'TransDIY', 'proba': 0.001892048211568393},
 {'subreddit': 'breastfeeding', 'proba': 0.0018844645436299533},
 {'subreddit': 'RocketLeague', 'proba': 0.0018758224984953434},
 {'subreddit': 'benzodiazepines', 'proba': 0.0018668138495396874},
 {'subreddit': 'The100', 'proba': 0.0018580496088653011},
 {'subreddit': 'minecraftsuggestions', 'proba': 0.0018453927006669075},
 {'subreddit': 'techsupport', 'proba': 0.0018442340100311113},
 {'subreddit': 'Blink182', 'proba': 0.0018239715732455283},
 {'subreddit': 'dpdr', 'proba': 0.0018163028805450827},
 {'subreddit': 'l