# Post Here: Subreddit Predictor

## Recommendation API - 1.0

> aka: the MVP models

---
---

## Intro - MVP

The first iteration of the model for recommending (predicting) appropriate subreddit(s) will be built using a somewhat naive approach to text.

The model will be trained using the [reddit self-post classification task dataset](https://www.kaggle.com/mswarbrickjones/reddit-selfposts), available on Kaggle thanks to [Evolution AI](https://evolution.ai//blog/page/5/an-imagenet-like-text-classification-task-based-on-reddit-posts/).

The dataset should include 1,013,000 rows (1000 records each from 1013 subreddits).

---

### Imports

In [1]:
# === General imports === #
import pandas as pd
import janitor
import os

In [2]:
# === sklearn imports === #
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier

---

### Load and preprocess the data

In [3]:
# === Load the dataset === #
data_filepath = "/Users/Tobias/workshop/buildbox/post_here_ds/dox/reddit-selfposts/rspct.tsv"
df1 = pd.read_csv(data_filepath, sep="\t")

In [4]:
# === First looks === #
print(df1.shape)
df1.head()

(1013000, 4)


Unnamed: 0,id,subreddit,title,selftext
0,6d8knd,talesfromtechsupport,Remember your command line switches...,"Hi there, <lb>The usual. Long time lerker, fi..."
1,58mbft,teenmom,"So what was Matt ""addicted"" to?",Did he ever say what his addiction was or is h...
2,8f73s7,Harley,No Club Colors,Funny story. I went to college in Las Vegas. T...
3,6ti6re,ringdoorbell,"Not door bell, but floodlight mount height.",I know this is a sub for the 'Ring Doorbell' b...
4,77sxto,intel,Worried about my 8700k small fft/data stress r...,"Prime95 (regardless of version) and OCCT both,..."


The MVP model will be trained on the selftext column only. Therefore, the only columns that are needed are `selftext` (feature, X) and `subreddit` (target, y).

First, let's take a sample of the dataset to reduce the memory load and runtime of the training. The full dataset can be used after the MVP is complete.

In [5]:
# === Extract sample of 1/10th of total records === #
# Should be around 50k records
df1 = df1.sample(frac=0.1, replace=True, random_state=92)

In [6]:
# === Confirm it worked as expected === #
df1.shape

(101300, 4)

In [7]:
# === Split up dataset into train and test === #

# 80% train, 20% test, stratified on the target
train, test = train_test_split(df1, test_size=0.2, stratify=df1["subreddit"])

train.shape, test.shape

((81040, 4), (20260, 4))

In [8]:
# === Arrange data into feature and target === #

X_train = train["selftext"]
X_test = test["selftext"]

y_train = train["subreddit"]
y_test = test["subreddit"]

In [9]:
X_train.head()

929865    I cant figure out if i like this card or not. ...
9279      Kanyes apparently got 7 songs on this next alb...
896004    I have done 2 tri's to date, 1 sprint and 1 in...
952229    So, my parents know I'm genderqueer, though th...
741843    [Mine Oddity](https://www.youtube.com/watch?v=...
Name: selftext, dtype: object

In [10]:
y_train.head()

929865         FoWtcg
9279            Kanye
896004      triathlon
952229    genderqueer
741843    MemeEconomy
Name: subreddit, dtype: object

In [11]:
# === Look at number of classes === #
y_train.value_counts()

trumpet            106
weddingplanning    105
aznidentity        104
howyoudoin         103
Blink182           103
                  ... 
Volvo               59
Grimdawn            59
nihilism            59
bash                58
Lisk                54
Name: subreddit, Length: 1013, dtype: int64

---
---

### Vectorization

The vectorizer that will be used to convert the words into numbers will not analyze the text for meaning or anything like that - remember, this is the MVP model. We can get as crazy as we want after we have a working baseline.

TF-IDF vectorization finds the unique aspects of documents of text, based on a simple count of the words within each document. In this context, "document" refers to an individual reddit post.

The vectorizer can be instantiated then "trained" on the dataset. One way to think about the training step is that it builds a vectorized vocabulary of the words in the dataset.

The TF-IDF implementation that will be used in the MVP comes from scikit-learn:

> [sklearn.feature_extraction.text.TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

A custom tokenizer function can be passed into the vectoirizer to increase the quality of the tokens (number representations of words). The tokenizer function will use the NLP library [spaCy](https://spacy.io/usage/).

In [None]:
# === Spacy === #
import spacy

# Load the medium-sized english language model
nlp = spacy.load("en_core_web_md")

In [12]:
# === Encode the target using LabelEncoder === #

# This process naively transforms each category of the target into a number
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(y_train)
y_train = le.transform(y_train)
y_test  = le.transform(y_test)

y_train[:8]

array([132, 192, 949, 617, 237, 327,  62, 105])

In [None]:
# === Custom tokenizer function === #

def tokenize(doc):
    """
    Extracts nouns and adjectives from a string of text.
    Returns a list of spacy token.lemma objects.
    """
    
    doc = nlp(doc)
    na_tokens = []
    
    for token in doc:
        if (
            ((token.is_stop == False) and (token.is_punct == False))
            and (token.pos_ == "NOUN")
            or (token.pos_ == "ADJ")
        ):
            na_tokens.append(token.lemma_.strip().lower())

    return na_tokens

In [15]:
# === Vectorize the data === #

# Instantiate the vectorizer object
# This extracts features from the text using a bag-of-words approach
# To start, the default tokenizer function will be used (instead of the above)
tfidf = TfidfVectorizer(
#     tokenizer=tokenize,
    stop_words="english",
    max_features=100000,
    min_df=0.025,
    max_df=0.98
)

# Fit the vectorizer on the corpus to create the vocabulary
dtm = tfidf.fit(X_train)

# Get sparse doc-term matrix / word counts
# This is done by using the trained vocabulary to transform the corpus into vectors
sparse = tfidf.transform(X_train)

# Get the feature names (words) to use as column names
# Also, convert the sparse matrix into dense form - fill in empty counts with 0
vdtm = pd.DataFrame(sparse.todense(), columns=tfidf.get_feature_names())

In [16]:
# === Preview the resulting dataframe of vectorized tokens === #
pd.options.display.max_rows = 200
vdtm.head()

Unnamed: 0,10,100,12,15,20,30,able,actually,add,advance,...,work,working,works,world,worth,wouldn,wrong,www,year,years
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.154543,0.295731
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.510678,0.191263,0.0


---
---

### Recommendations

The goal of the API is to be able to recommend subreddits with posts that are the most similar to the input text. With that in mind, we can convert the input document into a vector that matches the format of the dataset (transformed via the vocabulary that was generated in the vectorization step).

That input vector can be fed into a algorithm that finds similarities between the input and the dataset. In the case of this MVP, the algorithm is called nearest neighbors.

As the name suggests, the algorithm returns the so-called "nearest neighbors" to the input vector, based on a distance metric. Ideally, these will be the subreddits with content that is most similar to the input.

The nearest neighbor implementation that will be used in the MVP comes from scikit-learn: 

> [sklearn.neighbors.NearestNeighbors](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html)

Classification with KNearestNeighbor (Post-MVP)

> [sklearn.neighbors.KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.predict_proba)



In [17]:
# === Train the nearest neighbor model === #

# Instantiate the NN model
from sklearn.neighbors import NearestNeighbors

# Instantiate the model with initial parameters - 10 neighbors
nn = NearestNeighbors(n_neighbors=10, algorithm='kd_tree')

# Fit the NN model on TF-IDF Vectors
nn.fit(vdtm)

NearestNeighbors(algorithm='kd_tree', leaf_size=30, metric='minkowski',
                 metric_params=None, n_jobs=None, n_neighbors=10, p=2,
                 radius=1.0)

In [18]:
# === Find neighbors for example input post === #

# The example comes from 'r/learnprogramming'
post = """I am a new grad looking for a job and currently in the process with a company for a junior backend engineer role. I was under the impression that the position was Javascript but instead it is actually Java. My general programming and "leet code" skills are pretty good, but my understanding of Java is pretty shallow. How can I use the next three days to best improve my general Java knowledge? Most resources on the web seem to be targeting complete beginners. Maybe a book I can skim through in the next few days?

Edit:

A lot of people are saying "the company is a sinking ship don't even go to the interview". I just want to add that the position was always for a "junior backend engineer". This company uses multiple languages and the recruiter just told me the incorrect language for the specific team I'm interviewing for. I'm sure they're mainly interested in seeing my understanding of good backend principles and software design, it's not a senior lead Java position."""

In [19]:
# === Vectorize the example post using the trained vocab === #
post_sparse = tfidf.transform([post])

In [21]:
# Run the transformed (vectorized) input through the NN model
# See where the input fits into the corpus, and return the 10 nearest neighbors
# In order to use the vectorized input, the labels can be reverse_transformed back through the encoder
# Or looked up via the index
rec_array = nn.kneighbors(post_sparse.todense(), n_neighbors=10)
rec_array

(array([[1.        , 1.        , 1.        , 1.        , 1.        ,
         1.        , 1.        , 1.03440271, 1.04268939, 1.06792867]]),
 array([[ 9019,  5143,  2281, 71968, 72305, 34391, 70197, 65457, 19212,
         34797]]))

In [22]:
# Extract the second item in the outer array
# This is the list of the review indices that are 'closest' to input
rec_id_list = rec_array[1][0]
rec_id_list

array([ 9019,  5143,  2281, 71968, 72305, 34391, 70197, 65457, 19212,
       34797])

In [24]:
# Hydrate that list with the rest of the data from the (almost) original dataframe
recommendations = df1.iloc[rec_id_list]["subreddit"]

# The resulting dataframe should have 10 rows
assert recommendations.shape[0] == 10

recommendations

482379      Cloververse
819094    adventuretime
328843          TheWire
195943     bladeandsoul
571435             SCCM
106538    callofcthulhu
657019           Ripple
942412            Cisco
289069      woodworking
36363       freemasonry
Name: subreddit, dtype: object

### Model #1 Result

Out of the top ten nearest neighbors, only two of them have anything at all to do with programming.

That's not very good. Not good at all.

Will we be able to do better than that?!

> Find out next time, on Post Here at `02-post_here-mvp-model.ipynb`

In [25]:
# === But first, let's try with a different example post === #

# This one comes from r/suggestmeabook
post2 = """I've been dreaming about writing my own stort story for a while but I want to give it an unexpected ending. I've read lots of books, but none of them had the plot twist I want. I want to read books with the best plot twists, so that I can analyze what makes a good plot twist and write my own story based on that points. I don't like romance novels and I mostly enjoy sci-fi or historical books but anything beside romance novels would work for me, it doesn't have to be my type of novel. I'm open to experience after all. I need your help guys. Thanks in advance."""

In [26]:
# === Predict function === #
# To make the recommendation and integration with the Flask api easier,
# Here's a function to generate the recommendations

def recommend(req, n=10):
    """Function to recommend top n subreddits given a request."""
    # Create vector from request
    req_vec = tfidf.transform([req])

    # Get indexes for n nearest neighbors
    top_id = nn.kneighbors(req_vec.todense(), n_neighbors=n)[1][0]

    # Index-locate the neighbors in original dataframe
    top_array = df1.iloc[top_id]["subreddit"]

    return top_array

In [27]:
# === Run it again! === #
post2_recs = recommend(post2)
post2_recs

270901    netneutrality
195943     bladeandsoul
819094    adventuretime
571435             SCCM
657019           Ripple
328843          TheWire
482379      Cloververse
106538    callofcthulhu
391942      learnpython
860458             volt
Name: subreddit, dtype: object

### Model #1 Results, Part 2

Again, the recommendations do not seem to be very good. However, in the interest of getting the MVP up and working. Let's pickle this model anyways, to start integrating it with the Flask API.

## Pickling

In order to use the model in the Flask app, it can be pickled. 
The pickle module, and the pickle file format, allows Python objects
to be serialized and de-serialized. In this case, the trained vectorizer
and model can be made into pickle files, which are then loaded into the
Flask app for use in the recommendation API.

In [28]:
# === Create pickle func to make pickling (a little) easier === #

def picklizer(to_pickle, filename, path):
    """
    Creates a pickle file.
    
    Parameters
    ----------
    to_pickle : Python object
        The trained / fitted instance of the 
        transformer or model to be pickled.
    filename : string
        The desired name of the output file,
        not including the '.pkl' extension.
    path : string or path-like object
        The path to the desired output directory.
    """
    import os
    import pickle

    # Create the path to save location
    picklepath = os.path.join(path, filename)

    # Use context manager to open file
    with open(picklepath, "wb") as p:
        pickle.dump(to_pickle, p)

In [41]:
# === Picklize! === #
filepath = "../../models"

# Export vectorizer as pickle
picklizer(dtm, "vec_01.pkl", filepath)

# Export knn model as pickle
picklizer(nn, "nn_01.pkl", filepath)

### Run it again! (this time with pickles)

In [42]:
import pickle

# Load the vdtm pickle into new object for testing
vv_path = os.path.join(filepath, "vec_01.pkl")

# Use context manager to open and load pickle file
with open(vv_path, "rb") as p:
    vv = pickle.load(p)

In [43]:
# Load the knn pickle into new object for testing
knn_path = os.path.join(filepath, "nn_01.pkl")

# Use context manager to open and load pickle file
with open(knn_path, "rb") as p:
    nn2 = pickle.load(p)

In [44]:
# === Get (pick)les === #

# Another slightly modified version that uses pickle objects

def rec_pickle(req, n=10):
    """Function to recommend n subreddits given a request."""
    # Create vector from request
    req_vec = vv.transform([req])

    # Access the top n indexes
    rec_id = nn2.kneighbors(req_vec.todense(), n_neighbors=n)[1][0]

    # Index-locate the neighbors in original dataframe
    rec_array = df1.iloc[rec_id]["subreddit"]

    return rec_array

In [45]:
# === Test out the modified function / pickle objects === #

# This one comes from r/buildmeapc
post3 = """I posted my wants for my build about 2 months ago. Ordered them and when I went to build it I was soooooo lost. It took 3 days to put things together because I was afraid I would break something when I finally got the parts together it wouldn’t start, I was so defeated. With virtually replacing everything yesterday it finally booted and I couldn’t be more excited!"""

# This time I'll pass in 20, because why not?
post_3_recs = rec_pickle(post3, 20)
post_3_recs

106538      callofcthulhu
571435               SCCM
819094      adventuretime
328843            TheWire
482379        Cloververse
657019             Ripple
195943       bladeandsoul
509055          ethtrader
169129         instantpot
906320       bladeandsoul
805929            Hyundai
991606          fragrance
558214         Journalism
14804            brandnew
333284    DBZDokkanBattle
767414             kindle
104371         Machinists
790023          reloading
55396        digitalnomad
253547         oculusnsfw
Name: subreddit, dtype: object

# Pickled Results

Results of the pickled model are adequately terrible. But hey, it gives an output!