<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: NLP subreddit post classifier 
(wrong class evaluation)

## Problem Statement:
---

For this project, I want to train a Natural Language Processing classifier that distinguishes Reddit posts from the [`Marvel`](https://www.reddit.com/r/Marvel/) and [`DCcomics`](https://www.reddit.com/r/DCcomics/) subreddits. This is a binary classification problem.

In [`Project_3_data.ipynb`](./Project_3_data.ipynb), I scraped the data from two subreddits, [`Marvel`](https://www.reddit.com/r/Marvel/) and [`DCcomics`](https://www.reddit.com/r/DCcomics/). Please refer to [`Project_3_data.ipynb`](./Project_3_data.ipynb) for more details on how the data was extracted.

In [`Project_3_ML.ipynb`](./Project_3_ML.ipynb), I transformed the text data with two word vectorizers, `CountVectorizer` and `TfidfVectorizer`, and trained 2 classification models, a **`Random Forest Classifier`** and a **`Multinomial Naive Bayes`** classifier. Please refer to [`Project_3_ML.ipynb`](./Project_3_ML.ipynb) for more details on how the data was used.

Now, I want to investigate the `post`s that each classifier got wrong.

### Contents:

1. [Imported libraries](#Imports:)
1. [Text normalisation](#Tokenizing-and-Lemmatizing)
1. [Count Vectorization](#Count-Vectorization)
1. [Choose a row to study](#Choose-a-row-in-Z_misclass_cvec-to-find-out-word-features)

#### Imports:

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

sns.set_style('whitegrid')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [2]:
# import data
model_df = pd.read_csv('./data/train_cvec.csv')
misclass_df = pd.read_csv('./data/wrong_class.csv')
display(model_df.head(3))
display(model_df.shape)
display(misclass_df.head(3))

Unnamed: 0,ability,able,absolutely,act,action,action comic,actor,actual,actually,adam,...,www youtube,yeah,year,year ago,yellow,yes,young,young justice,youtube,youtube com
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,1,1,0,0,0,0,0,0,0


(1614, 1000)

Unnamed: 0.1,Unnamed: 0,full_post,true_values,pred_probs,model
0,1572,"People who dislike/hate Damian Wayne, why? Jus...",0,0.514977,RF
1,1780,"Why did some people die in COIE, but others di...",0,0.50131,RF
2,1622,"if you were to raise a Kryptonian, What values...",0,0.50131,RF


In [3]:
# rename 'Unnamed:0' column
misclass_df.rename(columns={'Unnamed: 0':'index_origin'}, inplace=True)
display(misclass_df.head(3))

Unnamed: 0,index_origin,full_post,true_values,pred_probs,model
0,1572,"People who dislike/hate Damian Wayne, why? Jus...",0,0.514977,RF
1,1780,"Why did some people die in COIE, but others di...",0,0.50131,RF
2,1622,"if you were to raise a Kryptonian, What values...",0,0.50131,RF


### Tokenizing and Lemmatizing
---

Apply the same preprocessing steps to `misclass_df` that were applied to `model_df`.

In [4]:
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
import re

In [5]:
# function for tokenizing and lemming
def lemmatize_join(text):
    tokenizer = RegexpTokenizer('[a-z]+', gaps=False) # instantiate tokenizer
    lemmer = WordNetLemmatizer() # instantiate lemmatizer
    return ' '.join([lemmer.lemmatize(w) for w in tokenizer.tokenize(text.lower())]) 
    # lowercase, join back together with spaces so that word vectorizers can still operate 
    # on cell contents as strings

In [6]:
Z_misclass = misclass_df['full_post'].apply(lemmatize_join)
display(Z_misclass[:2])

0    people who dislike hate damian wayne why just ...
1    why did some people die in coie but others did...
Name: full_post, dtype: object

### Count Vectorization
---

In [7]:
# instantiate word vectorizer
cvec = CountVectorizer(lowercase=False, 
                       max_df=0.6, 
                       max_features=1000,
                       min_df=3,
                       ngram_range=(1,2),
                       stop_words='english',
                       strip_accents='unicode')

In [8]:
Z_train = pd.read_csv('./data/Z_train.csv') # just for fitting count vectorizer
Z_train.shape

(1614, 1)

In [9]:
cvec.fit(Z_train['full_post']) # unable to fit to dataframe, had to convert to pd.series

CountVectorizer(lowercase=False, max_df=0.6, max_features=1000, min_df=3,
                ngram_range=(1, 2), stop_words='english',
                strip_accents='unicode')

In [10]:
# no train-test-split since not planning for modelling
# Use fitted CountVectorizer on the whole lemmatized misclass_df
Z_misclass_cvec = pd.DataFrame(cvec.transform(Z_misclass).todense(),
                           columns=cvec.get_feature_names_out())
display(Z_misclass_cvec.shape) # 1000 word vocabulary from train data,
# most features will be zero since data size is small
display(Z_misclass_cvec.head(3))

(74, 1000)

Unnamed: 0,ability,able,absolutely,act,action,action comic,actor,actual,actually,adam,...,www youtube,yeah,year,year ago,yellow,yes,young,young justice,youtube,youtube com
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Choose a row in `Z_misclass_cvec` to find out word features
---

`model_df.columns` and `Z_misclass_cvec.columns` each have 1000 entries, so I will not print them out, to save space.

In [11]:
# find out whether any rows contain only zeros (no vocabulary word present)
def row_all_zeros(data):
    row_list = []
    for row_num in range(len(data)):
        # .eq(0).all() is True only when every count in the row is zero
        if data.iloc[row_num].eq(0).all():
            row_list.append(row_num)

    if len(row_list) == 0:
        print('No rows had all zeros.')

    return row_list

In [12]:
row_all_zeros(data=Z_misclass_cvec)

No rows had all zeros.


[]
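
A check for rows that contain only zeros can also be written without an explicit loop. The sketch below is a minimal illustration on a small made-up DataFrame; `toy_counts` and its values are hypothetical and are not taken from the project data.

```python
import pandas as pd

# hypothetical count-vectorized data: row 0 contains no vocabulary words
toy_counts = pd.DataFrame({'batman': [0, 1, 0],
                           'war':    [0, 0, 2]})

# True for every row whose counts are all zero
all_zero_mask = toy_counts.eq(0).all(axis=1)

# row labels of the all-zero rows
all_zero_rows = all_zero_mask[all_zero_mask].index.tolist()
print(all_zero_rows)
```

Applied to `Z_misclass_cvec`, the same two lines would list any posts that share no words with the fitted vocabulary.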

In [13]:
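# pick a row to study
def features_in_row(row_num, data):
    words = []
    # .eq(1) keeps the features whose count in this row is exactly 1
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for feature, present in data.iloc[row_num].eq(1).items():
        if present:
            words.append(feature)
    return words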

In [14]:
features_in_row(row_num=0, data=Z_misclass_cvec) # features in row 0

['comic', 'curious', 'damian', 'just', 'movie', 'specific', 'spoiler', 'wayne']

From the word features alone, I cannot tell why the classifier misclassified this post. I will need a different approach, such as looking up the weight the classifier assigns to each word feature and summing those weights for the post.

### Sum the weights of the features in a post
---

In [15]:
# import feature importances for both Random forest and Multinomial NB models
rf_feature_importances = pd.read_csv('./data/randomforest_feature_weights.csv')
nb_feature_probs = pd.read_csv('./data/multinomialNB_feature_probs.csv')

# check imports
display(rf_feature_importances.tail(3))
display(nb_feature_probs.head(3))

Unnamed: 0,feature,weight
997,young justice,0.001386
998,youtube,0.0
999,youtube com,0.0


Unnamed: 0,feature,predict_marvel,predict_Dcomics
0,ability,0.000593,0.000556
1,able,0.001717,0.000915
2,absolutely,0.000562,0.000163


In [16]:
# going back to analysing the first row, classified wrongly by Random Forest as Marvel
# sum the weights of the features present in that row
def total_weights(features, reference, weight):
    # map each feature name to its weight in the reference table
    lookup = dict(zip(reference['feature'], reference[weight]))
    # features missing from the reference table are skipped
    # Series.iteritems() was removed in pandas 2.0, so a dict lookup is used instead
    return sum(lookup[feature] for feature in features if feature in lookup)

In [17]:
# total weight of the features in the first row (row_num=0, index_origin 1572)
# it was classified as Marvel instead of DC comics
total_weights(features=features_in_row(row_num=0, data=Z_misclass_cvec),
              reference=rf_feature_importances,
              weight='weight'
             )

0.01609735099328855

### Compare with correctly classified data
---

In [18]:
# import reference of correctly classified data
rightclass_df = pd.read_csv('./data/right_class.csv')

display(rightclass_df.shape)
display(rightclass_df.head(3))

(734, 5)

Unnamed: 0.1,Unnamed: 0,full_post,true_values,pred_probs,model
0,1574,"People who dislike/hate Damian Wayne, why? I w...",0,0.48222,RF
1,1771,Is John Stewart really the “most well known Gr...,0,0.319268,RF
2,984,when do i read the spider-gwen annual #1? righ...,1,0.541272,RF


In [19]:
# rename 'Unnamed:0' column
rightclass_df.rename(columns={'Unnamed: 0':'index_origin'}, inplace=True)
display(rightclass_df.head(3))

Unnamed: 0,index_origin,full_post,true_values,pred_probs,model
0,1574,"People who dislike/hate Damian Wayne, why? I w...",0,0.48222,RF
1,1771,Is John Stewart really the “most well known Gr...,0,0.319268,RF
2,984,when do i read the spider-gwen annual #1? righ...,1,0.541272,RF


Notice that the posts with `index_origin` **1572** in `misclass_df` and **1574** in `rightclass_df` are very similar, yet they are assigned different probability values by the same `Random Forest classifier`: **1572** is assigned 0.514977, while **1574** is assigned 0.482220. These 2 posts are most likely re-posts, since the data was pulled in epoch_time sequence.
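
To make the gap between these two probabilities concrete, the sketch below measures how far each one sits from the decision boundary. It assumes that `pred_probs` is the predicted probability of the Marvel class (label 1) and that a post is predicted as Marvel above a 0.5 threshold. Both assumptions are inferred from the tables above and are not confirmed by the modelling notebook.

```python
# pred_probs copied from the tables above; THRESHOLD is an assumed cut-off
THRESHOLD = 0.5
pred_probs = {1572: 0.514977, 1574: 0.482220}

margins = {}
for index_origin, prob in pred_probs.items():
    predicted = 1 if prob > THRESHOLD else 0  # 1 = Marvel, 0 = DC comics (assumed)
    margins[index_origin] = prob - THRESHOLD  # signed distance from the boundary
    print(index_origin, predicted, round(margins[index_origin], 6))
```

Under these assumptions both posts sit within 0.02 of the boundary, on opposite sides of it.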

Let's see if post with index_origin **1574** has slightly different word features that allowed it to be classified correctly.

In [20]:
# lemmatize the correctly classified posts
Z_rightclass = rightclass_df['full_post'].apply(lemmatize_join)
display(Z_rightclass[:2])

0    people who dislike hate damian wayne why i wan...
1    is john stewart really the most well known gre...
Name: full_post, dtype: object

In [21]:
# count vectorizer already fitted to original train data
# no train-test-split since not planning for modelling
# Use fitted CountVectorizer on the whole lemmatized rightclass_df
Z_rightclass_cvec = pd.DataFrame(cvec.transform(Z_rightclass).todense(),
                           columns=cvec.get_feature_names_out())
display(Z_rightclass_cvec.shape) # 1000 word vocabulary from train data,
# most features will be zero since data size is small
display(Z_rightclass_cvec.head(3))

(734, 1000)

Unnamed: 0,ability,able,absolutely,act,action,action comic,actor,actual,actually,adam,...,www youtube,yeah,year,year ago,yellow,yes,young,young justice,youtube,youtube com
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
# check if any rows are all zero
row_all_zeros(data=Z_rightclass_cvec)

No rows had all zeros.


[]

In [23]:
# get the features in post index_origin 1574
features_in_row(row_num=0, data=Z_rightclass_cvec)

['damian', 'hate', 'point', 'robin', 'understand', 'view', 'want', 'wayne']

In [24]:
# total weight of the features in the first row (row_num=0, index_origin 1574)
# it was correctly classified as DC comics
total_weights(features=features_in_row(row_num=0, data=Z_rightclass_cvec),
              reference=rf_feature_importances,
              weight='weight'
             )

0.0202662044502592

The total weight is 0.0202662044502592 for index_origin **1574** (classified correctly by Random Forest) and 0.01609735099328855 for index_origin **1572** (classified wrongly by Random Forest). There is no significant difference between the two.

However, the word features of the two posts differ considerably:
* Both posts have `damian`, `wayne` as common features
* Post **1572** has `comic`, `curious`, `just`, `movie`, `specific`, `spoiler` (wrongly classed)
* Post **1574** has `hate`, `point`, `robin`, `understand`, `view`, `want` (rightly classed)
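
The comparison in the list above can be reproduced with set operations on the two feature lists returned by `features_in_row`. The lists below are copied from the outputs above; the variable names are illustrative only.

```python
# feature lists copied from the features_in_row outputs above
features_1572 = ['comic', 'curious', 'damian', 'just', 'movie',
                 'specific', 'spoiler', 'wayne']
features_1574 = ['damian', 'hate', 'point', 'robin', 'understand',
                 'view', 'want', 'wayne']

common = sorted(set(features_1572) & set(features_1574))     # in both posts
only_1572 = sorted(set(features_1572) - set(features_1574))  # wrongly classed post only
only_1574 = sorted(set(features_1574) - set(features_1572))  # rightly classed post only

print(common)
print(only_1572)
print(only_1574)
```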

From here, it is really challenging to identify the specific words that contributed to the wrong classification, and how to go about improving the model by tweaking the text inputs.

With more time and further study, I would like to try [LIME (open source package)](https://blog.insightdatascience.com/how-to-solve-90-of-nlp-problems-a-step-by-step-guide-fda605278e4e), a black-box explainer.

### Study Multinomial Naive Bayes wrongly classified data
---

In [25]:
misclass_df.tail(3) 
# pick a datapoint to study, I pick index_origin 1750, index 73.
# classified as Marvel, but it is actually DC comics

Unnamed: 0,index_origin,full_post,true_values,pred_probs,model
71,131,Spiderman vs Batman Who got better Rogues Gall...,1,1.156167e-08,NB
72,799,How come marvel never had a crossover with Tat...,1,0.1221143,NB
73,1750,"Batman TAS was for kids? I mean, this is not t...",0,0.9247457,NB


In [26]:
# reference this
nb_feature_probs.head(3) 

Unnamed: 0,feature,predict_marvel,predict_Dcomics
0,ability,0.000593,0.000556
1,able,0.001717,0.000915
2,absolutely,0.000562,0.000163


In [27]:
print(features_in_row(row_num=73, data=Z_misclass_cvec)) # features in row 73
# classified as 'Marvel' wrongly

['batman', 'cartoon', 'child', 'deal', 'example', 'having', 'hell', 'kid', 'long', 'mean', 'meant', 'real', 'saying', 'shot', 'similar', 'thing', 'think', 'wa', 'war']


In [28]:
# function for filtering datapoint word feature probabilities
def point_probs(row_num, data):
    feature_list = features_in_row(row_num=row_num, data=data)
    new_df = nb_feature_probs[nb_feature_probs['feature'].isin(feature_list)]
    return new_df

In [29]:
point_probs(row_num=73, data=Z_misclass_cvec).sort_values('predict_marvel', ascending=False)
# sort based on predict_marvel column, highest contributor to lowest

Unnamed: 0,feature,predict_marvel,predict_Dcomics
933,wa,0.020508,0.015033
871,think,0.006992,0.006275
945,war,0.004932,0.001569
870,thing,0.003964,0.002778
583,mean,0.001967,0.001275
499,kid,0.001592,0.001111
544,long,0.001405,0.001144
783,similar,0.001217,0.000425
718,real,0.001155,0.001601
410,having,0.001124,0.001209


Based on the table of probabilities above for the features present in the post with index_origin **1750**, the top 8 words contributed to the misclassification: `wa`, `think`, `war`, `thing`, `mean`, `kid`, `long`, `similar`. Each of these words is assigned a higher probability under `marvel` than under `DC comics`. The remaining two words in the table, `real` and `having`, lean the other way.

With this short analysis, I learnt that the results of the Multinomial NB model, and the reasons for its misclassifications, are easier to explain than those of the Random Forest classifier.
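
One way to rank how strongly each word leans towards `marvel` is the log of the ratio between the two columns. The sketch below is a rough illustration that uses only the ten rows shown in the table above. It ignores the class prior, the word counts in the post, and any features not listed, so it is not the classifier's full calculation.

```python
import math

# (predict_marvel, predict_Dcomics) copied from the table above
probs = {
    'wa':      (0.020508, 0.015033),
    'think':   (0.006992, 0.006275),
    'war':     (0.004932, 0.001569),
    'thing':   (0.003964, 0.002778),
    'mean':    (0.001967, 0.001275),
    'kid':     (0.001592, 0.001111),
    'long':    (0.001405, 0.001144),
    'similar': (0.001217, 0.000425),
    'real':    (0.001155, 0.001601),
    'having':  (0.001124, 0.001209),
}

# positive log-ratio: the word is more probable under marvel than under DC comics
log_ratios = {word: math.log(marvel / dc) for word, (marvel, dc) in probs.items()}
favours_marvel = [word for word, ratio in log_ratios.items() if ratio > 0]

# print the words from strongest marvel lean to strongest DC lean
for word, ratio in sorted(log_ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{word:8s} {ratio:+.3f}')
```

In this sketch `war` and `similar` have the largest ratios, even though `wa` has the highest raw probability in the `predict_marvel` column.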