GOAL:

1. TOPIC MODELLING

Process:
1. Reading in the data [sampled data in this case]
2. Preprocessing + regex cleaning
3. Vectorizing with a TF-IDF vectorizer
4. Selecting the number of topics using NMF
5. Assigning docs to topics
6. Adding topic labels

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import pandas as pd
import numpy as np
import math
import io
from tqdm import tqdm
tqdm.pandas()

In [3]:
data_train = pd.read_pickle('/content/drive/My Drive/NLP Project/pickled_dataframes/data_train.pickle')
data_test = pd.read_pickle('/content/drive/My Drive/NLP Project/pickled_dataframes/data_test.pickle')

In [4]:
data_train.tail()

Unnamed: 0,id,comment_text,toxicity,toxicity_label,toxicity_bins,processed_comment,lemmatized_comment
18044,5319739,Where did you get the idea that anyone would b...,0.0,0,-1,Where did you get the idea that anyone would b...,idea force carry training s entirely voluntary
18045,852189,The MSM said that about Trump as well.,0.0,0,-1,The MSM said that about Trump as well,MSM say _PERSON_
18046,936562,"Hi, gtm. Wise words. Nope...no rust. Not si...",0.166667,0,-1,Hi gtm Wise words Nope no rust Not si...,hi gtm wise word Nope rust sit i...
18047,5239147,Perfect illustration of what's wrong with poli...,0.0,0,-1,Perfect illustration of what s wrong with poli...,perfect illustration s wrong politic politicia...
18048,262545,"You mean...no, it can't be...that the organic ...",0.0,0,-1,You mean no it can not be that the organi...,mean organic ice cream small batch s...


In [5]:
data_test.tail()

Unnamed: 0,id,comment_text,toxicity,toxicity_label,toxicity_bins,processed_comment,lemmatized_comment
1941,7048780,Government is clueless folks and downright dum...,0.5,1,1,Government is clueless folks and downright dum...,government clueless folk downright dumb Exam...
1942,7187867,Differentiating between Canada and the US on w...,0.0,0,-1,Differentiating between Canada and the US on w...,differentiate _GPE_ health care right citizen...
1943,7056283,"""Substance?"" Name a single instance. just one...",0.0,0,-1,Substance Name a single instance just one...,substance single instance wait
1944,7085860,Fire them.,0.166667,0,-1,Fire them,fire
1945,7010117,"Lynn, thanks for your devotion to hope and hel...",0.0,0,-1,Lynn thanks for your devotion to hope and hel...,_PERSON_ thank devotion hope help notice ...


In [6]:
# Combine the toxic comments (toxicity_label == 1) from the train and test sets
combined_data= pd.concat([data_train[data_train.toxicity_label == 1],data_test[data_test.toxicity_label == 1]])
combined_data.reset_index(inplace= True, drop = True)


Regex cleaning

In [7]:
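# Regex cleaning: wrap generic placeholder terms (which may introduce bias) in underscores,
# so that the vectorizer's token_pattern in In [9] no longer matches them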
import re
def clean(phrase):
    phrase = phrase.lower()
    phrase = re.sub(r"\burl\b", "_url_", phrase)
    phrase = re.sub(r"\bemail\b", "_email_", phrase)
    phrase = re.sub(r"\bnumber\b", "_number_", phrase)
    phrase = re.sub(r"\bpeople\b", "_people_", phrase)
    return phrase

In [8]:
combined_data.lemmatized_comment = combined_data.lemmatized_comment.apply(clean)  
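
To see what this cleaning step does in practice, here is a small sketch on a made-up sentence (not a row from the dataset). It repeats the substitutions from In [7] in a loop and then applies the token_pattern that In [9] passes to the vectorizer. Because `_` counts as a word character, a wrapped placeholder has no word boundary next to its letters, so the pattern skips it. That appears to be how the generic terms are kept out of the vocabulary, although the notebook does not say so explicitly.

```python
import re

def clean(phrase):
    # Same substitutions as In [7], written as a loop
    phrase = phrase.lower()
    for term in ("url", "email", "number", "people"):
        phrase = re.sub(rf"\b{term}\b", f"_{term}_", phrase)
    return phrase

# Token pattern copied from the TfidfVectorizer call in In [9]
TOKEN_PATTERN = r"\b[A-Za-z]{5,}\b"

sample = "People send email number to stupid peoples"  # made-up example
cleaned = clean(sample)
tokens = re.findall(TOKEN_PATTERN, cleaned)

print(cleaned)  # _people_ send _email_ _number_ to stupid peoples
print(tokens)   # ['stupid', 'peoples']
```

Note that only whole-word matches are wrapped: "peoples" is left alone and still reaches the vectorizer.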

TOPIC MODELING

In [9]:
# TF-IDF vectorizer: word n-grams of length 2 to 4, built from tokens with at least 5 letters
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(2,4), min_df=2, stop_words="english", token_pattern = r'\b[A-Za-z]{5,}\b')
X_comments, tox_terms = vectorizer.fit_transform(combined_data.lemmatized_comment), vectorizer.get_feature_names_out()

comments_tf_idf = pd.DataFrame(X_comments.toarray(), columns=tox_terms)
print(f"Comments TF-IDF: {comments_tf_idf.shape}")

Comments TF-IDF: (2931, 1272)
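
The following is a rough pure-Python approximation of which n-grams the vectorizer settings above consider. It is not scikit-learn's code: the helper name `word_ngrams` and the sample sentence are made up for illustration, and the English stop-word list is left out (the `stop_words` argument defaults to empty here). The point it illustrates is that tokens shorter than 5 letters are dropped before the n-grams are formed, so a feature such as "refuse black" can join words that were not adjacent in the lemmatized comment.

```python
import re

# Token pattern copied from In [9]
TOKEN_PATTERN = r"\b[A-Za-z]{5,}\b"

def word_ngrams(text, min_n=2, max_n=4, stop_words=frozenset()):
    # Tokenize first, drop stop words, then build contiguous n-grams
    tokens = [tok for tok in re.findall(TOKEN_PATTERN, text.lower())
              if tok not in stop_words]
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

doc = "white supremacist s group refuse to hire black man"  # made-up example
grams = word_ngrams(doc)
for gram in grams:
    print(gram)
```

Only "white", "supremacist", "group", "refuse" and "black" survive the token pattern, so the n-grams are built from those five tokens alone.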


In [10]:
# Function to print the top tokens for each topic
from typing import List
import numpy as np

def get_top_tf_idf_tokens_for_topic(H: np.ndarray, feature_names: List[str], num_top_tokens: int = 5):
  # Each row of H is one topic; each column is one TF-IDF term
  for topic, vector in enumerate(H):
    print(f"TOPIC {topic}\n")
    total = vector.sum()
    # Indices of the highest-weighted terms, largest first
    top_scores = vector.argsort()[::-1][:num_top_tokens]

    token_names = [feature_names[idx] for idx in top_scores]
    # Strength = the term's share of the topic's total weight
    strengths = [vector[idx] / total for idx in top_scores]

    for strength, token_name in zip(strengths, token_names):
      print(f"{token_name} ({round(strength * 100, 1)}%)\n")
    print("=" * 50)
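
The percentages this function prints are each term's share of the topic's total weight in H, not probabilities. As a minimal sketch of the same calculation on a toy matrix, the variant below returns the values instead of printing them. The name `top_terms_per_topic`, the toy matrix and the term names are assumptions made for illustration and do not appear elsewhere in this notebook.

```python
import numpy as np

def top_terms_per_topic(H, feature_names, num_top_tokens=5):
    """Return, per topic, a list of (term, share of the topic's total weight)."""
    result = []
    for vector in H:
        total = vector.sum()
        top_idx = vector.argsort()[::-1][:num_top_tokens]
        result.append([(feature_names[i], float(vector[i] / total)) for i in top_idx])
    return result

# Toy topic-term matrix: 2 topics x 4 terms (made-up numbers)
H_toy = np.array([[6.0, 3.0, 1.0, 0.0],
                  [0.0, 1.0, 2.0, 7.0]])
terms = ["term a", "term b", "term c", "term d"]

top = top_terms_per_topic(H_toy, terms, num_top_tokens=2)
for topic, pairs in enumerate(top):
    print(topic, [(name, round(share * 100, 1)) for name, share in pairs])
```

In topic 0 of the toy matrix, "term a" holds 6 of the 10 units of weight, so it is reported as 60.0%.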


In [11]:
# NMF model with 6 topics - the categorization is unclear and topics repeat
nmf = NMF(n_components=6)
W_comments = nmf.fit_transform(X_comments)
H_comments = nmf.components_
print(f"Original shape of X comments is {X_comments.shape}")
print(f"Decomposed W comments matrix is {W_comments.shape}")
print(f"Decomposed H comments matrix is {H_comments.shape}")

get_top_tf_idf_tokens_for_topic(H_comments, comments_tf_idf.columns.tolist(), 5)



Original shape of X comments is (2931, 1272)
Decomposed W comments matrix is (2931, 6)
Decomposed H comments matrix is (6, 1272)
TOPIC 0

white supremacist (48.2%)

racist white (3.4%)

claim thing (2.7%)

refuse black (2.5%)

supremacist group (2.4%)

TOPIC 1

drain swamp (76.5%)

right snowflake (6.6%)

double standard (5.3%)

private sector (5.0%)

professional wrestling (4.3%)

TOPIC 2

human right (43.9%)

person person (5.4%)

corruption greed (4.3%)

right freedom (3.6%)

teaching cultural (2.6%)

TOPIC 3

sexual assault (34.0%)

charge sexual (4.2%)

assault woman (3.6%)

position power (3.2%)

admit sexual (3.1%)

TOPIC 4

politically correct (39.4%)

civil right (8.8%)

government agency (4.6%)

little potato (4.3%)

school teacher (4.3%)

TOPIC 5

common sense (29.4%)

police officer (3.5%)

stupid stupid (3.3%)

illegal country (3.2%)

stupid think (2.4%)



In [12]:
# NMF model with 5 topics - topics still repeat
nmf = NMF(n_components=5)
W_comments = nmf.fit_transform(X_comments)
H_comments = nmf.components_
print(f"Original shape of X comments is {X_comments.shape}")
print(f"Decomposed W comments matrix is {W_comments.shape}")
print(f"Decomposed H comments matrix is {H_comments.shape}")

get_top_tf_idf_tokens_for_topic(H_comments, comments_tf_idf.columns.tolist(), 5)



Original shape of X comments is (2931, 1272)
Decomposed W comments matrix is (2931, 5)
Decomposed H comments matrix is (5, 1272)
TOPIC 0

white supremacist (48.2%)

racist white (3.4%)

claim thing (2.7%)

refuse black (2.5%)

supremacist group (2.4%)

TOPIC 1

drain swamp (76.5%)

right snowflake (6.6%)

double standard (5.3%)

private sector (5.0%)

professional wrestling (4.3%)

TOPIC 2

human right (43.7%)

person person (5.3%)

corruption greed (4.3%)

right freedom (3.6%)

teaching cultural (2.6%)

TOPIC 3

sexual assault (32.4%)

charge sexual (4.1%)

assault woman (3.5%)

position power (3.1%)

admit sexual (2.9%)

TOPIC 4

politically correct (37.9%)

civil right (8.4%)

government agency (4.4%)

little potato (4.2%)

school teacher (4.1%)



In [13]:
# NMF model for 4 topics - SELECTED
nmf = NMF(n_components=4)
W_comments = nmf.fit_transform(X_comments)
H_comments = nmf.components_
print(f"Original shape of X comments is {X_comments.shape}")
print(f"Decomposed W comments matrix is {W_comments.shape}")
print(f"Decomposed H comments matrix is {H_comments.shape}")

get_top_tf_idf_tokens_for_topic(H_comments, comments_tf_idf.columns.tolist(), 5)



Original shape of X comments is (2931, 1272)
Decomposed W comments matrix is (2931, 4)
Decomposed H comments matrix is (4, 1272)
TOPIC 0

white supremacist (47.8%)

racist white (3.4%)

claim thing (2.7%)

refuse black (2.5%)

white supremacist group (2.4%)

TOPIC 1

drain swamp (74.7%)

right snowflake (6.5%)

double standard (5.2%)

private sector (5.0%)

professional wrestling (4.2%)

TOPIC 2

human right (43.7%)

person person (5.3%)

corruption greed (4.3%)

right freedom (3.6%)

teaching cultural (2.6%)

TOPIC 3

sexual assault (32.3%)

charge sexual (4.0%)

assault woman (3.5%)

position power (3.1%)

admit sexual (2.9%)
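
The choice of 4 topics was made by reading the top terms. One possible way to supplement that reading, offered here as an assumption and not as the method used in this notebook, is to score how similar the topic vectors are to each other: two rows of H with a cosine similarity close to 1 put their weight on nearly the same terms. The helper name `topic_cosine_similarity` and the toy matrix are made up for illustration; in the notebook, H_comments would be passed instead.

```python
import numpy as np

def topic_cosine_similarity(H):
    """Pairwise cosine similarity between the rows (topics) of H."""
    norms = np.linalg.norm(H, axis=1, keepdims=True)
    unit = H / np.where(norms == 0, 1.0, norms)  # leave all-zero rows as zeros
    return unit @ unit.T

# Toy topic-term matrix: topics 0 and 2 weight the same terms (made-up numbers)
H_toy = np.array([[5.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 4.0, 2.0],
                  [4.5, 1.5, 0.0, 0.0]])

sim = topic_cosine_similarity(H_toy)

# Find the most similar pair of distinct topics
rows, cols = np.triu_indices(len(H_toy), k=1)
pairs = [(float(sim[a, b]), int(a), int(b)) for a, b in zip(rows, cols)]
best = max(pairs)

print(np.round(sim, 2))
print(f"Most similar pair: topics {best[1]} and {best[2]} (cosine {best[0]:.2f})")
```

A score like this only flags candidates; whether two topics really repeat still needs a look at their top terms and top documents.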



In [15]:
# Function to print the top comments for each topic
def get_top_documents_for_each_topic(W: np.ndarray, documents: List[str], num_docs: int = 5):
    # For each topic (column of W), document indices ordered from highest to lowest weight
    sorted_docs = W.argsort(axis=0)[::-1]
    top_docs = sorted_docs[:num_docs].T
    # Total weight of each document across all topics
    per_document_totals = W.sum(axis=1)
    for topic, top_documents_for_topic in enumerate(top_docs):
        print(f"Topic {topic}")
        for doc in top_documents_for_topic:
            score = W[doc][topic]
            # Share of the document's total weight that falls on this topic
            percent_about_topic = round(score / per_document_totals[doc] * 100, 1)
            print(f"{percent_about_topic}%", documents[doc])
        print("=" * 50)
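
The function ranks documents by their raw weight in W but prints a different quantity: the share of the document's own total weight that falls on the topic. A small sketch on a made-up 4 x 2 matrix shows the difference. All numbers and the name `W_toy` are assumptions for illustration.

```python
import numpy as np

# Toy document-topic matrix: 4 documents x 2 topics (made-up numbers)
W_toy = np.array([[0.9, 0.0],
                  [0.1, 0.7],
                  [0.5, 0.5],
                  [0.0, 0.2]])

# Per column: document indices ordered from highest to lowest weight
order = W_toy.argsort(axis=0)[::-1]
top_docs = order[:2].T  # shape (n_topics, 2): top 2 documents per topic

# Per document: share of its total weight on each topic
shares = W_toy / W_toy.sum(axis=1, keepdims=True)

print(top_docs)
print(np.round(shares[top_docs[0], 0] * 100, 1))
```

For topic 0 the top documents are 0 and 2. Document 0 is reported as 100.0% because all of its weight is on topic 0, while document 2 is reported as 50.0%. A value of 100.0% therefore means the document loads on no other topic, not that it has the largest weight.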

In [16]:
get_top_documents_for_each_topic(W_comments, combined_data.comment_text.tolist(), 6)

Topic 0
100.0% The fact is that the Trump regime is white supremacist Christian Republican extremists who hate everything our country stands for.
100.0% Wow these Nazi White Supremacist guys don't mess around... carrying guns and don't hesitate to use them. Scary dudes...
100.0% "a land of freedom for all" Dejain?

Even for white supremacists?
100.0% Yes, no more White supremacists should be allowed at UVa.  UVa should admit only BLM minority members.
100.0% Meh. Replace white supremacists with Muslims and you'd be all for shutting them down. The Greatest Generation sacrificed too much to let Nazis rear their ugly heads again. Color me unsympathetic to their "plight".
100.0% He did not give any kind of approval to white supremacists.   Now like I said, leftists give blatant approval to racists every day.
Topic 1
100.0% You'd have to drain the swamp to find the loathsome bottom-dwellers populating the Trump cabal.
100.0% Kill `em all, but save 6 for pallbearers! The Trump "administratio

In [17]:
# Create columns for each document's share of weight per topic (belonging) and a binary label per topic
combined_data['TOPIC1Belonging'] = np.nan
combined_data['TOPIC2Belonging'] = np.nan
combined_data['TOPIC3Belonging'] = np.nan
combined_data['TOPIC4Belonging'] = np.nan

combined_data['TOPIC1LABEL'] = 0
combined_data['TOPIC2LABEL'] = 0
combined_data['TOPIC3LABEL'] = 0
combined_data['TOPIC4LABEL'] = 0

t = 0.3  # threshold: a document is labeled with a topic when its belonging share is >= t


In [18]:
# Compute each document's share of weight per topic, then label every topic whose share reaches the threshold
# (vectorized; avoids chained assignment through .iloc, which raises SettingWithCopyWarning)
row_totals = W_comments.sum(axis=1, keepdims=True)
belonging = W_comments / row_totals  # rows whose total is 0 become NaN and receive no label

for k in range(W_comments.shape[1]):
  combined_data[f'TOPIC{k + 1}Belonging'] = belonging[:, k]
  combined_data[f'TOPIC{k + 1}LABEL'] = (belonging[:, k] >= t).astype(int)


In [19]:
# Inspect a single comment and its topic assignments
combined_data.loc[461]

id                                                              5803028
comment_text          Here I'll fix this Leftist sophomore literary ...
toxicity                                                       0.723077
toxicity_label                                                        1
toxicity_bins                                                         1
processed_comment     Here I will fix this Leftist sophomore literar...
lemmatized_comment    fix leftist sophomore literary crap   _org_   ...
TOPIC1Belonging                                                  0.5748
TOPIC2Belonging                                                0.017416
TOPIC3Belonging                                                0.329323
TOPIC4Belonging                                                 0.07846
TOPIC1LABEL                                                           1
TOPIC2LABEL                                                           0
TOPIC3LABEL                                                     
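
As a worked check of the threshold rule from In [18], the sketch below applies t = 0.3 to the four belonging values printed above for comment 461. The values are copied from that output; the dictionary keys are shortened for illustration.

```python
# Belonging shares for comment 461, copied from the output above
belonging = {
    "TOPIC1": 0.5748,
    "TOPIC2": 0.017416,
    "TOPIC3": 0.329323,
    "TOPIC4": 0.07846,
}
t = 0.3  # same threshold as In [17]

# A topic is assigned whenever its share reaches the threshold
labels = {topic: int(share >= t) for topic, share in belonging.items()}
total = sum(belonging.values())

print(labels)
print(round(total, 4))
```

Under this rule the comment receives two labels, TOPIC1 and TOPIC3, because both shares reach 0.3. The shares sum to 1 up to rounding, so with t = 0.3 a document can carry at most three labels.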

In [20]:
combined_data.loc[461,'comment_text']

"Here I'll fix this Leftist sophomore literary crap once and for all\n#WHITELIVESMATTER\nNow STFU communist marxist evil\nIf you respond in any way that denies me my right to protect my whiteness\nyou are f'in racist"