# Post Here: Subreddit Predictor

## Recommendation API - 1.2

> aka: the Most Voluptuous Pipeline

---
---

## Intro - MVP Classifier (Model #3)

This is the third iteration of the model for predicting the most appropriate subreddits for a given post.

- Text embeddings:
- Prediction model:

The model will be trained using data from the [reddit self-post classification task dataset](https://www.kaggle.com/mswarbrickjones/reddit-selfposts), available on Kaggle thanks to [Evolution AI](https://evolution.ai//blog/page/5/an-imagenet-like-text-classification-task-based-on-reddit-posts/).

The full dataset includes 1,013,000 rows (1000 records each from 1013 subreddits). To keep things manageable for the proof-of-concept, the first 100,000 records will be used to train the embeddings and models.

According to the dataset description on Kaggle, the rows have already been shuffled. Reading the first 100k records should therefore be equivalent to drawing a random sample and should not introduce additional bias.
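One quick way to sanity-check that assumption is to count posts per subreddit in the rows that were read. The sketch below uses a tiny made-up table rather than `rspct.tsv`; the `toy_tsv` contents and the `head` and `counts` names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical stand-in for rspct.tsv: tab-separated, one post per row
toy_tsv = io.StringIO(
    "id\tsubreddit\ttitle\tselftext\n"
    "a1\tyoga\tt1\ttext one\n"
    "a2\tbash\tt2\ttext two\n"
    "a3\tyoga\tt3\ttext three\n"
    "a4\tbash\tt4\ttext four\n"
    "a5\tyoga\tt5\ttext five\n"
    "a6\tbash\tt6\ttext six\n"
)

# Read only the first 4 rows, mirroring the nrows argument used for the real data
head = pd.read_csv(toy_tsv, sep="\t", nrows=4)

# Count posts per subreddit in the sample
counts = head["subreddit"].value_counts()
print(counts.to_dict())

# A ratio near 1 suggests the head of the file is not ordered by label
print("min/max ratio:", counts.min() / counts.max())
```

On the real sample, a ratio far below 1 or many missing subreddits would be a hint that the first rows are not representative.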

---

### Load and preprocess the data

In [1]:
# === General imports === #
import pandas as pd
import numpy as np
import os

In [2]:
# === Load the dataset === #
data_filepath = "../../dox/reddit-selfposts/rspct.tsv"

# Tab-separated, read first 100k rows
# The dataset has already been randomized, so not randomizing the sample will not introduce bias
df1 = pd.read_csv(data_filepath, sep="\t", nrows=100000)

In [3]:
# === Confirm it worked as expected === #
df1.shape

(100000, 4)

In [9]:
# === Remove the "id" column === #
# The index will be used as the new "id"

df2 = df1.drop(columns=["id"])
df2.head(4)

| | subreddit | title | selftext |
|---|---|---|---|
| 0 | talesfromtechsupport | Remember your command line switches... | Hi there, <lb>The usual. Long time lerker, fi... |
| 1 | teenmom | So what was Matt "addicted" to? | Did he ever say what his addiction was or is h... |
| 2 | Harley | No Club Colors | Funny story. I went to college in Las Vegas. T... |
| 3 | ringdoorbell | Not door bell, but floodlight mount height. | I know this is a sub for the 'Ring Doorbell' b... |


In [10]:
# === Save the sampled dataset to file === #
df2.to_csv("./rspct_100k.csv", sep="\t", index=True, index_label="id")

In [11]:
# === Load it again from file to use === #
df3 = pd.read_csv("rspct_100k.csv", sep="\t")
df3.head()

| | id | subreddit | title | selftext |
|---|---|---|---|---|
| 0 | 0 | talesfromtechsupport | Remember your command line switches... | Hi there, <lb>The usual. Long time lerker, fi... |
| 1 | 1 | teenmom | So what was Matt "addicted" to? | Did he ever say what his addiction was or is h... |
| 2 | 2 | Harley | No Club Colors | Funny story. I went to college in Las Vegas. T... |
| 3 | 3 | ringdoorbell | Not door bell, but floodlight mount height. | I know this is a sub for the 'Ring Doorbell' b... |
| 4 | 4 | intel | Worried about my 8700k small fft/data stress r... | Prime95 (regardless of version) and OCCT both,... |


---

### Encoding the label

Many ML models and libraries expect a numerical target rather than strings. Therefore, the target (the subreddit name) is encoded into a numerical representation.

Ideally, the encoding is consistent between the model, which returns the index (or indices) of the predicted subreddits, and the Flask API, which uses those indices to look up the corresponding subreddit names.
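As a rough illustration of that consistency requirement, the sketch below builds both lookup directions from one sorted list of names. The `labels` list and the `id_to_name` / `name_to_id` names are hypothetical and not part of the notebook.

```python
# Hypothetical labels standing in for the "subreddit" column
labels = ["yoga", "3d6", "ynab", "3Dprinting", "yoga", "ynab"]

# Sort the unique names so the same input always yields the same ids
names = sorted(set(labels))

# id -> name (what the API would need) and name -> id (what training needs)
id_to_name = dict(enumerate(names))
name_to_id = {name: idx for idx, name in id_to_name.items()}

# Encode the labels, then decode them again to confirm the round trip
encoded = [name_to_id[name] for name in labels]
decoded = [id_to_name[idx] for idx in encoded]

print(names)    # ['3Dprinting', '3d6', 'ynab', 'yoga']
print(encoded)  # [3, 1, 2, 0, 3, 2]
assert decoded == labels
```

Because both dictionaries come from the same sorted list, saving that list once (for example as a CSV) is enough for the model side and the API side to agree.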

In [42]:
# === Create set of unique subreddits === #

# The data includes a supplementary csv that has subreddit information
# However, best to generate one based on the sample of data that is used

subreddits = (pd.DataFrame(df3["subreddit"].unique(), columns=["subreddit"])
              .sort_values(by="subreddit")
              .reset_index(drop=True))
subreddits

| | subreddit |
|---|---|
| 0 | 13ReasonsWhy |
| 1 | 3Dprinting |
| 2 | 3d6 |
| 3 | 4Runner |
| 4 | 7daystodie |
| ... | ... |
| 1008 | yandere_simulator |
| 1009 | ynab |
| 1010 | yoga |
| 1011 | yorku |


In [47]:
# === Save the subreddits === #
subreddits.to_csv("subreddits_100k.csv", index=True, index_label="id")

In [48]:
# === Look at generated Subreddit / ID Index === #
df_lookup = pd.read_csv("subreddits_100k.csv")
df_lookup.head()

| | id | subreddit |
|---|---|---|
| 0 | 0 | 13ReasonsWhy |
| 1 | 1 | 3Dprinting |
| 2 | 2 | 3d6 |
| 3 | 3 | 4Runner |
| 4 | 4 | 7daystodie |


In [51]:
# === Create mapping dictionary from lookup dataframe === #
# This mapper will be passed into the ce.OrdinalEncoder
# to_dict() returns {column: {index: value}}, i.e. {"subreddit": {0: "13ReasonsWhy", ...}}

mapper = df_lookup.drop(columns=["id"]).to_dict()

In [93]:
# === Convert mapping to list of (subreddit, id) tuples === #
mapper_tuples = [(name, idx) for idx, name in mapper["subreddit"].items()]

In [94]:
mapper_tuples

[('13ReasonsWhy', 0),
 ('3Dprinting', 1),
 ('3d6', 2),
 ('4Runner', 3),
 ('7daystodie', 4),
 ('90DayFiance', 5),
 ('ABDL', 6),
 ('ABraThatFits', 7),
 ('ACL', 8),
 ('ACT', 9),
 ('ADHD', 10),
 ('APStudents', 11),
 ('ASUS', 12),
 ('AcademicPsychology', 13),
 ('Accounting', 14),
 ('AdobeIllustrator', 15),
 ('Adoption', 16),
 ('AirBnB', 17),
 ('Allergies', 18),
 ('AlphaBayMarket', 19),
 ('AmericanHorrorStory', 20),
 ('Anarchism', 21),
 ('AndroidAuto', 22),
 ('Anki', 23),
 ('ApocalypseRising', 24),
 ('ArcherFX', 25),
 ('Archery', 26),
 ('AskAnthropology', 27),
 ('AskEconomics', 28),
 ('AskHR', 29),
 ('AskLiteraryStudies', 30),
 ('AskVet', 31),
 ('AstralProjection', 32),
 ('Audi', 33),
 ('AutoDetailing', 34),
 ('AutoHotkey', 35),
 ('AutoModerator', 36),
 ('AvPD', 37),
 ('Ayahuasca', 38),
 ('BATProject', 39),
 ('BDSMcommunity', 40),
 ('BMW', 41),
 ('BackYardChickens', 42),
 ('Bass', 43),
 ('BeardedDragons', 44),
 ('Beatmatch', 45),
 ('BeautyBoxes', 46),
 ('Bedbugs', 47),
 ('Beekeeping', 48),
 

In [70]:
# === Arrange data into feature and target === #

# Use selftext column as feature, X
X = df3[["selftext"]]

# Predict subreddit, y
y = df3[["subreddit"]]

# Confirm the shape is as expected
print(X.shape, y.shape)

(100000, 1) (100000, 1)


In [75]:
y

| | subreddit |
|---|---|
| 0 | talesfromtechsupport |
| 1 | teenmom |
| 2 | Harley |
| 3 | ringdoorbell |
| 4 | intel |
| ... | ... |
| 99995 | bash |
| 99996 | MrRobot |
| 99997 | Cloververse |
| 99998 | sweden |


In [95]:
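# === Encode the target using OrdinalEncoder === #
import category_encoders as ce

# Instantiate encoder instance
# NOTE: the inner "mapping" here is a list of tuples. The TypeError below most
# likely comes from pandas trying to call that list as a function; recent
# category_encoders versions expect a dict of label -> code, e.g. dict(mapper_tuples)
encoder = ce.OrdinalEncoder(cols=["subreddit"], mapping=[{"col": "subreddit", "mapping": mapper_tuples}])

# Fit the encoder to the data
encoder.fit_transform(y)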

TypeError: 'list' object is not callable

In [83]:
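# === Encode the target using category_encoders' OrdinalEncoder with a dict mapping === #
# (sklearn's OrdinalEncoder has no `cols` or `mapping` arguments)
import category_encoders as ce

# Instantiate encoder instance
# NOTE: mapper["subreddit"] maps id -> name, the reverse of the name -> id
# direction the encoder looks up, so every label is treated as unknown (-1)
encoder = ce.OrdinalEncoder(cols=["subreddit"], mapping=[{"col": "subreddit", "mapping": mapper["subreddit"]}])

# Fit the encoder to the data
encoder.fit_transform(y)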

| | subreddit |
|---|---|
| 0 | -1.0 |
| 1 | -1.0 |
| 2 | -1.0 |
| 3 | -1.0 |
| 4 | -1.0 |
| ... | ... |
| 99995 | -1.0 |
| 99996 | -1.0 |
| 99997 | -1.0 |
| 99998 | -1.0 |


In [72]:
encoder.category_mapping

[{'col': 'subreddit',
  'mapping': {0: '13ReasonsWhy',
   1: '3Dprinting',
   2: '3d6',
   3: '4Runner',
   4: '7daystodie',
   5: '90DayFiance',
   6: 'ABDL',
   7: 'ABraThatFits',
   8: 'ACL',
   9: 'ACT',
   10: 'ADHD',
   11: 'APStudents',
   12: 'ASUS',
   13: 'AcademicPsychology',
   14: 'Accounting',
   15: 'AdobeIllustrator',
   16: 'Adoption',
   17: 'AirBnB',
   18: 'Allergies',
   19: 'AlphaBayMarket',
   20: 'AmericanHorrorStory',
   21: 'Anarchism',
   22: 'AndroidAuto',
   23: 'Anki',
   24: 'ApocalypseRising',
   25: 'ArcherFX',
   26: 'Archery',
   27: 'AskAnthropology',
   28: 'AskEconomics',
   29: 'AskHR',
   30: 'AskLiteraryStudies',
   31: 'AskVet',
   32: 'AstralProjection',
   33: 'Audi',
   34: 'AutoDetailing',
   35: 'AutoHotkey',
   36: 'AutoModerator',
   37: 'AvPD',
   38: 'Ayahuasca',
   39: 'BATProject',
   40: 'BDSMcommunity',
   41: 'BMW',
   42: 'BackYardChickens',
   43: 'Bass',
   44: 'BeardedDragons',
   45: 'Beatmatch',
   46: 'BeautyBoxes',
   47: 

In [9]:
# === Vectorize! === #

# Extract features from the text data using the bag-of-words approach (single words + bigrams).
# Uses tfidf weighting (helps a little for Naive Bayes in general).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2, SelectKBest

# TODO: use spacy's stopwords

tfidf = TfidfVectorizer(
    max_features=100000,
    min_df=5,
    ngram_range=(1, 2),
    stop_words=None,
    token_pattern=r"(?u)\b\w+\b",
)

# Fit the vectorizer on the text column to create vocab
# (passing the whole DataFrame would iterate over column names, not posts)
# This process is split into component parts to make pickling the vocab simpler
# Note: fit() returns the fitted vectorizer itself, so vocab is the same object as tfidf
vocab = tfidf.fit(X_train["selftext"])

# Get sparse document-term matrix for training data
X_train_sparse = tfidf.transform(X_train["selftext"])

# Get sparse document-term matrix for test data
X_test_sparse = tfidf.transform(X_test["selftext"])
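The model cells that follow use `X_train_select` and `X_test_select`, and the pickling cell exports a `chi2_selector`, but the cell that creates them is not shown. The sketch below is one plausible version of that step, assuming scikit-learn's `SelectKBest` with the `chi2` score function on a toy corpus. The corpus, the labels, and `k=5` are made up for illustration and are not the notebook's settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy corpus and integer labels (0 = baking, 1 = coding)
train_docs = [
    "bake the bread in the oven",
    "knead the dough then bake it",
    "debug the python code",
    "python code fails to compile",
]
train_labels = [0, 0, 1, 1]
test_docs = ["bake python bread"]

# Fit the vectorizer on the training text only, then transform both sets
tfidf = TfidfVectorizer()
X_train_sparse = tfidf.fit_transform(train_docs)
X_test_sparse = tfidf.transform(test_docs)

# Keep the k features with the highest chi-squared score against the labels.
# k=5 is an arbitrary value for this toy corpus, not the notebook's setting.
chi2_selector = SelectKBest(chi2, k=5)
X_train_select = chi2_selector.fit_transform(X_train_sparse, train_labels)
X_test_select = chi2_selector.transform(X_test_sparse)

print(X_train_sparse.shape, "->", X_train_select.shape)
print(X_test_sparse.shape, "->", X_test_select.shape)
```

The selector is fit on the training matrix only and then reused on the test matrix, so both end up with the same `k` columns.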

In [13]:
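# === Naive Bayes model === #
from sklearn.naive_bayes import MultinomialNB

# Instantiate and train the model
# X_train_select is presumably the chi2-selected feature matrix;
# the cell that creates it is not shown here
nb = MultinomialNB(alpha=0.1)
nb.fit(X_train_select, y_train)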

MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)

In [15]:
# === Create predictions on test feature === #
y_pred_proba = nb.predict_proba(X_test_select)

print(y_pred_proba.shape)
y_pred_proba[:10]

(202600, 1013)


array([[4.56545489e-04, 1.18151347e-04, 5.38448045e-04, ...,
        1.80345381e-03, 5.22980232e-04, 8.04658308e-04],
       [1.12450547e-04, 2.32793869e-04, 1.05751893e-03, ...,
        5.59297607e-03, 1.47642710e-02, 3.06511115e-04],
       [5.54748051e-08, 2.21058597e-08, 4.80524535e-08, ...,
        4.95766515e-08, 2.78150536e-08, 1.54200669e-07],
       ...,
       [3.85584688e-06, 4.60790915e-05, 2.15397656e-05, ...,
        3.67156913e-05, 6.14331168e-05, 5.16849255e-05],
       [2.09804425e-04, 6.02437836e-04, 1.87576608e-03, ...,
        3.34070611e-03, 9.57996639e-04, 8.29576071e-04],
       [2.65022901e-05, 6.97262528e-06, 2.47540456e-06, ...,
        1.61589692e-05, 2.66521737e-05, 1.58858054e-04]])
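Since the goal is to recommend several subreddits rather than one, the rows of a probability matrix like this can be reduced to the top few class indices. The sketch below shows one way to do that with NumPy; the `proba` values and the `top_k` helper are hypothetical.

```python
import numpy as np

# Hypothetical probabilities for 2 posts over 5 classes (each row sums to 1)
proba = np.array([
    [0.10, 0.50, 0.05, 0.30, 0.05],
    [0.60, 0.05, 0.20, 0.05, 0.10],
])

def top_k(proba, k=3):
    """Return the indices of the k largest probabilities per row, best first."""
    # argsort is ascending, so take the last k columns and reverse them
    return np.argsort(proba, axis=1)[:, -k:][:, ::-1]

print(top_k(proba, k=3))
# [[1 3 0]
#  [0 2 4]]
```

The returned values are column positions. They match the encoded subreddit ids only if the model's `classes_` attribute is `0..n-1` in order, which is worth confirming before using them for lookups.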

In [12]:
# === Split up dataset into train and test === #
from sklearn.model_selection import train_test_split

# 80% train, 20% test, stratified on the target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y["subreddit"])

X_train.shape, X_test.shape

((80000, 1), (20000, 1))

---
---

### Pipeline

In [2]:
# === sklearn imports === #
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV

In [None]:
# === Custom tokenizer function === #
import spacy

# Load the medium-sized english language model
nlp = spacy.load("en_core_web_md")

def tokenize(doc):
    """
    Extracts nouns and adjectives from a string of text.
    Returns a list of lowercased lemma strings.
    """

    doc = nlp(doc)
    na_tokens = []

    for token in doc:
        # Keep only nouns and adjectives that are neither stop words nor punctuation
        if (
            not token.is_stop
            and not token.is_punct
            and token.pos_ in ("NOUN", "ADJ")
        ):
            na_tokens.append(token.lemma_.strip().lower())

    return na_tokens
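To show how a custom tokenizer could plug into the pipeline, here is a minimal sketch that chains a `TfidfVectorizer` and `MultinomialNB` in a scikit-learn `Pipeline`. It is not the notebook's final pipeline: `simple_tokenize` is a hypothetical stand-in for the spaCy tokenizer so the example runs without a language model, and the toy posts are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def simple_tokenize(doc):
    """Hypothetical stand-in for the spaCy tokenizer: whitespace split."""
    return doc.split()

# Chain vectorizer and classifier so they are fit and applied as one object
pipe = Pipeline([
    # token_pattern=None because a custom tokenizer replaces the regex
    ("tfidf", TfidfVectorizer(tokenizer=simple_tokenize, token_pattern=None)),
    ("nb", MultinomialNB(alpha=0.1)),
])

# Hypothetical toy posts and their subreddits
docs = [
    "bash script loop over files",
    "bash variable not expanding in script",
    "morning yoga stretch routine",
    "yoga poses for back pain",
]
labels = ["bash", "bash", "yoga", "yoga"]

pipe.fit(docs, labels)
print(pipe.predict(["script variable error", "yoga stretch for beginners"]))
```

A pipeline like this can also be handed to `RandomizedSearchCV`, with step parameters addressed by name, for example `tfidf__min_df` or `nb__alpha`.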

In [8]:
# === Encode the target using LabelEncoder === #

# This process naively transforms each category of the target into a number
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder() # Instantiate a new encoder instance
le.fit(y_train)  # Fit it on training label data

# Transform both using the train fit
y_train = le.transform(y_train)
y_test  = le.transform(y_test)

y_train[:8]

array([691, 289, 117, 952,  77, 532, 332, 995])

---

### Predict subreddit from new input

Now that the model is trained and has a baseline performance metric that is not half-bad, it can be used to predict which subreddits a new piece of data (a post) belongs in.

To do this, the post first has to be vectorized with the same fitted vectorizer that was used on the training data.

In [18]:
# === Example post === #

# The example comes from 'r/learnprogramming'
post = """I am a new grad looking for a job and currently in the process with a company for a junior backend engineer role. I was under the impression that the position was Javascript but instead it is actually Java. My general programming and "leet code" skills are pretty good, but my understanding of Java is pretty shallow. How can I use the next three days to best improve my general Java knowledge? Most resources on the web seem to be targeting complete beginners. Maybe a book I can skim through in the next few days?

Edit:

A lot of people are saying "the company is a sinking ship don't even go to the interview". I just want to add that the position was always for a "junior backend engineer". This company uses multiple languages and the recruiter just told me the incorrect language for the specific team I'm interviewing for. I'm sure they're mainly interested in seeing my understanding of good backend principles and software design, it's not a senior lead Java position."""
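The prediction step for a new post might look roughly like the sketch below. It trains on a made-up toy corpus instead of the notebook's data and skips feature selection, so every name and value here is an assumption. The point is that the new post is passed through `transform` on the already fitted vectorizer and is never refit.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy training data; the notebook uses the 100k-row sample
docs = [
    "bash script loop over files",
    "bash variable not expanding in script",
    "morning yoga stretch routine",
    "yoga poses for back pain",
    "intel cpu temperature under stress test",
    "intel cpu overclock stress results",
]
subs = ["bash", "bash", "yoga", "yoga", "intel", "intel"]

# Fit every step on the training data once
le = LabelEncoder()
y_toy = le.fit_transform(subs)
tfidf = TfidfVectorizer()
nb = MultinomialNB(alpha=0.1).fit(tfidf.fit_transform(docs), y_toy)

# For a new post, only transform (never refit) with the fitted vectorizer
new_post = "my bash script fails in a loop"
post_vec = tfidf.transform([new_post])  # note the list: one document
proba = nb.predict_proba(post_vec)[0]

# Rank classes by probability and map the encoded ids back to names
top2 = np.argsort(proba)[::-1][:2]
print(le.inverse_transform(nb.classes_[top2]))
```

If a feature selector sits between the vectorizer and the model, its `transform` has to be applied to `post_vec` in the same position before calling `predict_proba`.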

---

### Picklization

In [61]:
# === Create pickle func to make pickling (a little) easier === #

def picklizer(to_pickle, filename, path):
    """
    Creates a pickle file.
    
    Parameters
    ----------
    to_pickle : Python object
        The trained / fitted instance of the 
        transformer or model to be pickled.
    filename : string
        The desired name of the output file,
        including the '.pkl' extension.
    path : string or path-like object
        The path to the desired output directory.
    """
    import os
    import pickle

    # Create the path to save location
    picklepath = os.path.join(path, filename)

    # Use context manager to open file
    with open(picklepath, "wb") as p:
        pickle.dump(to_pickle, p)

In [62]:
# === Picklize! === #
filepath = "../../models"

# NOTE: each call needs a distinct filename; as written, all three write to
# "03_.pkl", so each export overwrites the previous one

# Export vectorizer as pickle
picklizer(vocab, "03_.pkl", filepath)

# Export chi2 selector as pickle
picklizer(chi2_selector, "03_.pkl", filepath)

# Export naive bayes model as pickle
picklizer(nb, "03_.pkl", filepath)
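On the API side, the pickled objects would need to be loaded back in. The round trip can be sketched as follows, using plain Python objects as hypothetical stand-ins for the fitted vectorizer, selector, and model. The file names here are made up for the example and are not the notebook's.

```python
import os
import pickle
import tempfile

# Hypothetical stand-ins for the fitted vectorizer, selector, and model
artifacts = {
    "vectorizer": {"vocab": ["bash", "yoga"]},
    "selector": [0, 1],
    "model": {"alpha": 0.1},
}

with tempfile.TemporaryDirectory() as model_dir:
    # Write each object to its own file
    for name, obj in artifacts.items():
        with open(os.path.join(model_dir, f"{name}.pkl"), "wb") as f:
            pickle.dump(obj, f)

    # Load them back, as the API process would at startup
    loaded = {}
    for name in artifacts:
        with open(os.path.join(model_dir, f"{name}.pkl"), "rb") as f:
            loaded[name] = pickle.load(f)

print(loaded == artifacts)  # True
```

Two caveats apply to real models: pickle files should only be loaded from trusted sources, and a vectorizer pickled with a custom tokenizer function needs that function to be importable in the process that unpickles it.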