# Data Preparation

## Formatting
The Kaggle dataset is a single TSV file that must be formatted before training. The data must also be split into a training set and a test set, since a separate validation file was not provided.

In [96]:
import pandas as pd
import numpy as np

raw_data = pd.read_csv('rspct.tsv', sep='\t')

# Taking a look at our original dataset
raw_data.head()

Unnamed: 0,id,subreddit,title,selftext
0,6d8knd,talesfromtechsupport,Remember your command line switches...,"Hi there, <lb>The usual. Long time lerker, fi..."
1,58mbft,teenmom,"So what was Matt ""addicted"" to?",Did he ever say what his addiction was or is h...
2,8f73s7,Harley,No Club Colors,Funny story. I went to college in Las Vegas. T...
3,6ti6re,ringdoorbell,"Not door bell, but floodlight mount height.",I know this is a sub for the 'Ring Doorbell' b...
4,77sxto,intel,Worried about my 8700k small fft/data stress r...,"Prime95 (regardless of version) and OCCT both,..."


The dataset provides separate fields for **title** and **self-text**, but for this project we want to classify an entire post (title + self-text).

In [None]:
# Combining title and text of reddit post into one field
rspct = pd.DataFrame({
    'text':raw_data['title'] + " " + raw_data['selftext'],
    'subreddit':raw_data['subreddit']
}) 

After creating the dataframe, every unique subreddit needs to be mapped to a numerical index in order to use them with the SimpleTransformers library. An inverse mapping is also created to easily get the subreddit name from a numerical index.

In [97]:
labels = raw_data['subreddit'].unique()

# Create dictionary to map subreddits to numerical values
label_dict = {label: i for i, label in enumerate(labels)}

# Inverse mapping from numerical values back to subreddit names
inv_label_dict = {v: k for k, v in label_dict.items()}
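
As a small illustration of the two dictionaries, the sketch below builds them for a made-up list of subreddit names and shows the round trip from name to index and back. The list `toy_subreddits` and every `toy_` name are invented for this example and are not part of the dataset code. Indices are assigned in order of first appearance, which is also the order pandas' `unique()` returns.

```python
# Toy stand-in for raw_data['subreddit']; the values are invented for illustration.
toy_subreddits = ["intel", "Harley", "intel", "teenmom", "Harley"]

# Drop duplicates while keeping the order of first appearance
# (dict.fromkeys preserves insertion order).
toy_labels = list(dict.fromkeys(toy_subreddits))

# Forward mapping: subreddit name -> numerical index
toy_label_dict = {name: i for i, name in enumerate(toy_labels)}

# Inverse mapping: numerical index -> subreddit name
toy_inv_label_dict = {i: name for name, i in toy_label_dict.items()}

# Encode the names, then decode them again to confirm nothing is lost
encoded = [toy_label_dict[name] for name in toy_subreddits]
decoded = [toy_inv_label_dict[i] for i in encoded]

print(toy_label_dict)
print(encoded)
```

Because each name maps to exactly one index, decoding the encoded list gives back the original names.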

In [None]:
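# Changing subreddits to numerical values.
# Assigning to the row yielded by iterrows() is not guaranteed to update the
# DataFrame, so the column is mapped in one vectorized step instead.
rspct['subreddit'] = rspct['subreddit'].map(label_dict)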

In [77]:
# Taking a look at our formatted dataset
rspct.head()

Unnamed: 0,text,subreddit
0,Remember your command line switches... Hi ther...,0
1,"So what was Matt ""addicted"" to? Did he ever sa...",1
2,No Club Colors Funny story. I went to college ...,2
3,"Not door bell, but floodlight mount height. I ...",3
4,Worried about my 8700k small fft/data stress r...,4


## Train and Test Set
Now that the data is in a simple two-column format, we can create a training set and a test set using sklearn's train_test_split helper.

In [None]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(rspct, test_size=0.2, shuffle=True, stratify=rspct['subreddit'])

We split the dataset into 80% for training and 20% for testing. The stratify argument ensures that both sets have the same distribution of subreddits, which avoids class-balance differences between them. Note that the subreddit labels are not separated from the post text, because the SimpleTransformers library requires a single dataframe containing both the text and the label and handles the separation internally.
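
One way to confirm that stratification behaved as expected is to compare the share of each label in the two sets. The sketch below is a minimal check. The helper `class_proportions` and the toy label lists are hypothetical and stand in for the real label columns.

```python
from collections import Counter

def class_proportions(labels):
    """Return the fraction of samples belonging to each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Invented label lists standing in for the train and test label columns.
toy_train_labels = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
toy_test_labels = [0, 1, 2]

train_props = class_proportions(toy_train_labels)
test_props = class_proportions(toy_test_labels)

# Largest absolute difference in class share between the two sets;
# a label missing from the test set counts as a share of 0.
max_gap = max(
    abs(train_props[label] - test_props.get(label, 0.0))
    for label in train_props
)
print(max_gap)
```

On the real split, the same helper could be applied to `train_df['subreddit'].tolist()` and `test_df['subreddit'].tolist()`. A gap close to zero would suggest that both sets share the same subreddit distribution.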

# Training the Model
Using the SimpleTransformers library, we can easily instantiate one of many pre-trained NLP models, such as BERT. BERT comes in several variants, including BERT Base and BERT Large, with 110M and 340M parameters, respectively. To reduce memory usage and training time, we use the BERT Base variant trained on lowercased text.

In [78]:
from simpletransformers.classification import ClassificationModel

# Create a ClassificationModel
model = ClassificationModel('bert', 
                            'bert-base-uncased', 
                            num_labels=len(label_dict), 
                            args={'reprocess_input_data': True, 'overwrite_output_dir': True, 'num_train_epochs': 1}, 
                            use_cuda=True) 

# Train the model
model.train_model(train_df)

# Evaluation
After training the model, we can evaluate its performance. By default, the SimpleTransformers library reports MCC (Matthews correlation coefficient) for classification tasks, but additional metric functions can also be supplied. Here we use the accuracy and classification report functions from sklearn.

In [81]:
from sklearn.metrics import accuracy_score, classification_report

# Evaluate the model
result, model_outputs, wrong_predictions = model.eval_model(test_df, 
                                                            acc=accuracy_score, 
                                                            rep=classification_report)
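
The extra metrics do not have to come from sklearn. The sketch below defines a hypothetical `error_rate` function. It assumes that SimpleTransformers calls each extra metric as `func(labels, preds)`, the same argument order that `accuracy_score` accepts. The toy label lists are invented for illustration.

```python
def error_rate(labels, preds):
    """Fraction of predictions that do not match the true label."""
    labels = list(labels)
    preds = list(preds)
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have the same length")
    wrong = sum(1 for y, p in zip(labels, preds) if y != p)
    return wrong / len(labels)

# Invented true labels and predictions; exactly one of the five is wrong.
toy_true = [0, 1, 2, 2, 1]
toy_pred = [0, 2, 2, 2, 1]

toy_error = error_rate(toy_true, toy_pred)
print(toy_error)
```

Under that assumption, the function could be passed next to the sklearn metrics as `err=error_rate`, and its value would then be expected under `result['err']`.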

In [90]:
print("MCC: %s" %(result['mcc']))
print("Accuracy: %s" %(result['acc']))
print("Loss: %s" %(result['eval_loss']))

MCC: 0.9056156667582788
Accuracy: 0.905705824284304
Loss: 0.41924166499685794


## Precision-at-K Results
The dataset curator on Kaggle reported Precision-at-K. Precision-at-K measures how often the correct subreddit appears among the K most likely predictions. The function below has been adapted from the fastai library.

In [91]:
def top_k_accuracy(input, targs, k:int=5):
    "Computes the Top-k accuracy (target is in the top k predictions)."
    input = input.topk(k=k, dim=-1)[1]
    targs = targs.unsqueeze(dim=-1).expand_as(input)
    return (input == targs).max(dim=-1)[0].float().mean()

Since this function requires tensor inputs, we need to convert the model's evaluation output into PyTorch tensors. We also need the true labels from the test set.

In [93]:
import torch 

# Creating a numpy array of labels in our test set
test_labels = test_df['subreddit'].to_numpy(dtype=np.int64)

input = torch.from_numpy(model_outputs)
targs = torch.from_numpy(test_labels)

In [94]:
k1 = top_k_accuracy(input,targs,1)
k3 = top_k_accuracy(input,targs,3)
k5 = top_k_accuracy(input,targs,5)

In [95]:
print('Precision-at-K=1 is %s' %(k1))
print('Precision-at-K=3 is %s' %(k3))
print('Precision-at-K=5 is %s' %(k5))

Precision-at-K=1 is tensor(0.9057)
Precision-at-K=3 is tensor(0.9587)
Precision-at-K=5 is tensor(0.9702)
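
To make the definition concrete, the plain-Python sketch below computes the same kind of score on a tiny hand-made example. The function `top_k_hit_rate` and the toy scores are invented for illustration. The sketch is meant to mirror the idea of the tensor version, not to replace it. With exactly one correct label per post, this is the quantity often called top-k accuracy.

```python
def top_k_hit_rate(scores, targets, k=5):
    """Fraction of rows whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, target in zip(scores, targets):
        # Class indices sorted from highest to lowest score
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(targets)

# Invented scores for 4 posts over 3 classes; each row is one post.
toy_scores = [
    [2.0, 0.5, 0.1],  # highest score: class 0
    [0.2, 0.3, 1.5],  # highest score: class 2
    [0.9, 1.0, 0.8],  # highest score: class 1, then class 0
    [0.1, 0.7, 0.6],  # highest score: class 1, then class 2
]
toy_targets = [0, 2, 0, 2]

p_at_1 = top_k_hit_rate(toy_scores, toy_targets, k=1)
p_at_2 = top_k_hit_rate(toy_scores, toy_targets, k=2)
print(p_at_1, p_at_2)
```

The first two posts are correct at K=1. The last two only become hits once the second-ranked class is allowed, so the score can only stay the same or rise as K grows.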


# Prediction
The code below can be used to predict the parent subreddit for an arbitrary piece of text.

In [99]:
# The trained model can be loaded by uncommenting and running the lines below
# model = ClassificationModel('bert', 
#                            'bert-weights/', 
#                             num_labels=len(label_dict), 
#                             args={'reprocess_input_data': True, 'overwrite_output_dir': True, 'num_train_epochs': 1}, 
#                             use_cuda=True) 

In [100]:
predictions, raw_outputs = model.predict(["Some arbitrary sentence"])
print("Prediction: %s" %(inv_label_dict[predictions[0]]))
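
Since the top-3 and top-5 scores are noticeably higher than the top-1 score, it may be useful to show several candidate subreddits instead of one. The sketch below assumes that `raw_outputs` holds one row of per-class scores for each input text, ordered by label index. The helper `top_k_labels`, the toy scores, and the toy mapping are hypothetical.

```python
def top_k_labels(scores, index_to_name, k=3):
    """Return the names of the k highest-scoring classes, best first."""
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return [index_to_name[c] for c in ranked[:k]]

# Invented stand-ins for inv_label_dict and for one row of raw_outputs.
toy_inv_label_dict = {0: "talesfromtechsupport", 1: "teenmom", 2: "Harley", 3: "intel"}
toy_raw_output = [0.3, -1.2, 0.1, 2.4]

candidates = top_k_labels(toy_raw_output, toy_inv_label_dict, k=3)
print(candidates)
```

If the assumption about `raw_outputs` holds, the real call would look like `top_k_labels(raw_outputs[0], inv_label_dict)`.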