## 1 Summary, background, value proposition

This data set will be used to train and/or evaluate a multi-head classifier model that screens for sensitive sociopolitical topics, thereby helping curb manipulative online content that seeks to influence the political process, such as elections, the passage of legislation, and campaigning. The data set contains excerpts from social media posts, news articles, and other online content related to various sociopolitical topics. It also includes entries labeled as non-sensitive, which are used to improve and evaluate the model's precision.

A key question about this data set is whether, and by how much, sensitive and non-sensitive entries overlap, because such overlap may affect the quality of the model and/or the efficacy of the evaluation. Topic modeling techniques were used to answer this question. The results indicate that sensitive and non-sensitive entries, at least in this data set, do not overlap significantly.

In [1]:
import pathlib
import pandas as pd

data_set = pathlib.Path('./Data_Set_Select_Sociopolitical_Topics.csv')
df = pd.read_csv(data_set)

display(df)


Unnamed: 0,Sr.No,RawQuery,Sensitive Topic
0,5,No not elected. Slipped in as a runner up to ...,1
1,28,"GREAT WOMAN, GREAT LEGISLATOR, AND A GREAT FRI...",0
2,31,I LOVE what they did with the residential scho...,0
3,37,Wonder what would have happened if everybody h...,0
4,38,And least but not last those undoubtedly 'trut...,1
...,...,...,...
2629,3934,"So basically, a whiny complaining column to su...",0
2630,3936,Oh come off it already. Wearing a beard is no...,0
2631,3941,"Please note that any criticism of Wavemaker, a...",0
2632,3944,Lukashenkos days are numbered. His troops have...,1


## 2 Data acquisition, preprocessing


The data set was scraped from various online social media sources. I labeled the data points as sensitive or non-sensitive based on the sociopolitical topics discussed in the entries.

Since all the entries are natural language narratives, I performed pre-processing steps such as tokenization, stop word removal, lemmatization, etc.

## 3 Data analysis

I performed topic modeling to answer the question about the overlap between sensitive and non-sensitive entries; to do so, I split the data set into a sensitive subset and a non-sensitive subset and modeled each one separately. The results of the topic modeling are shown in the code below. The lists of top words indicate that sensitive and non-sensitive entries have different topics. The intertopic distance maps under the "Data visualization" section also show that sensitive and non-sensitive entries do not have significant overlap. Even where sensitive and non-sensitive topics appear in similar areas on the maps, the term frequency charts show dissimilarities in the terms making up the topics.
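The comparison of top-word lists described above can also be expressed as a number, for example with the Jaccard similarity between the top words of each sensitive topic and each non-sensitive topic. The following is a minimal sketch of that idea, not part of the original analysis: the helpers `jaccard` and `overlap_matrix` and both word lists are hypothetical and for illustration only.

```python
def jaccard(words_a, words_b):
    """Return |A & B| / |A | B| for two collections of words."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    if not union:
        # Two empty lists share nothing; define the similarity as 0.
        return 0.0
    return len(set_a & set_b) / len(union)


def overlap_matrix(topics_a, topics_b):
    """Pairwise Jaccard similarity between two lists of top-word lists."""
    return [[jaccard(a, b) for b in topics_b] for a in topics_a]


# Hypothetical top-word lists, for illustration only.
sensitive_topics = [
    ["war", "ukraine", "russia", "biden"],
    ["trump", "people", "government", "war"],
]
not_sensitive_topics = [
    ["great", "love", "school", "people"],
    ["time", "years", "like", "know"],
]

matrix = overlap_matrix(sensitive_topics, not_sensitive_topics)
for row in matrix:
    print([round(value, 3) for value in row])
```

With gensim models, the word lists could be built from `show_topics(..., formatted=False)` by keeping only the word from each `(word, probability)` pair. Values near 0 across the whole matrix would support the claim of little overlap; values near 1 would flag topic pairs worth a closer look.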

In [2]:
from gensim.utils import tokenize
from gensim.parsing.preprocessing import remove_stopwords, strip_punctuation

# extra stopwords to drop on top of gensim's built-in list
custom_stopwords = ['displaystyle']

def preprocess_tokenize_with_gensim(text):
    # convert to lowercase, remove leading and trailing whitespace
    text = text.lower().strip()
    # replace punctuation characters with spaces
    text = strip_punctuation(text)
    # remove stopwords
    text = remove_stopwords(text)
    # tokenize
    tokens = list(tokenize(text))
    # remove any additional stopwords if needed (this is a custom extra step if
    # you see words in the topics that don't belong)
    # also remove any tokens of 2 characters or fewer
    tokens = [token for token in tokens if token not in custom_stopwords and len(token) > 2]
    return tokens

# create tokens column
df['paragraph_tokens'] = df['RawQuery'].apply(preprocess_tokenize_with_gensim)
df.head()

Unnamed: 0,Sr.No,RawQuery,Sensitive Topic,paragraph_tokens
0,5,No not elected. Slipped in as a runner up to ...,1,"[elected, slipped, runner, majority, voted, me..."
1,28,"GREAT WOMAN, GREAT LEGISLATOR, AND A GREAT FRI...",0,"[great, woman, great, legislator, great, frien..."
2,31,I LOVE what they did with the residential scho...,0,"[love, residential, schools]"
3,37,Wonder what would have happened if everybody h...,0,"[wonder, happened, everybody, personal, gun, c..."
4,38,And least but not last those undoubtedly 'trut...,1,"[undoubtedly, truthful, polls]"


In [3]:
# create a dataframe with data from RawQuery column that are sensitive
df_sensitive = df[df['Sensitive Topic'] == 1]

# create a dataframe with data from RawQuery column that are not sensitive
df_not_sensitive = df[df['Sensitive Topic'] == 0]

In [4]:
# Imports for building the dictionary, the bag-of-words corpus, and the
# term frequency-inverse document frequency (TFIDF) model
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
import warnings

# silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

In [5]:
# Generate the dictionary and document-term matrix (bag of words) for each subset,
# then fit a TFIDF model on each
dictionary_sensitive = Dictionary(df_sensitive['paragraph_tokens'])
corpus_sensitive = [dictionary_sensitive.doc2bow(text) for text in df_sensitive['paragraph_tokens']]
tfidf_sensitive = TfidfModel(corpus_sensitive)

dictionary_not_sensitive = Dictionary(df_not_sensitive['paragraph_tokens'])
corpus_not_sensitive = [dictionary_not_sensitive.doc2bow(text) for text in df_not_sensitive['paragraph_tokens']]
tfidf_not_sensitive = TfidfModel(corpus_not_sensitive)
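For readers unfamiliar with TFIDF, the weighting fitted above can be sketched in plain Python. This is an illustration under stated assumptions rather than gensim's actual implementation: it assumes the default scheme of raw term frequency multiplied by log2(N / document frequency), followed by L2 normalization of each document. The function `tfidf_weights` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Compute L2-normalized TFIDF weights for a list of token lists."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        # Raw term frequency times log2(N / df).
        raw = {t: c * math.log2(n_docs / doc_freq[t]) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        # Guard against documents whose terms all occur in every document.
        weighted.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return weighted


# Hypothetical tokenized documents, for illustration only.
docs = [
    ["war", "ukraine", "war"],
    ["war", "election"],
    ["election", "polls", "vote"],
]
weights = tfidf_weights(docs)
for doc_weights in weights:
    print({term: round(weight, 3) for term, weight in doc_weights.items()})
```

In the first document, "ukraine" ends up with a higher weight than "war" even though "war" occurs twice, because "war" also appears in another document. Note that the LDA models in this notebook are trained on the raw word counts, not on the TFIDF-weighted corpus.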

In [6]:
# Specify the number of topics
total_sensitive_topics = 10
total_not_sensitive_topics = 10

In [7]:
# Train LDA models using word counts

from gensim.models import LdaModel
lda_sensitive = LdaModel(corpus_sensitive, id2word=dictionary_sensitive, num_topics=total_sensitive_topics)

lda_not_sensitive = LdaModel(corpus_not_sensitive, id2word=dictionary_not_sensitive, num_topics=total_not_sensitive_topics)

In [8]:
# Compare the top 7 words for each of the 10 topics from the LDA models
for topic in lda_sensitive.show_topics(num_topics=10, num_words=7, formatted=False):
    print(topic)

for topic in lda_not_sensitive.show_topics(num_topics=10, num_words=7, formatted=False):
    print(topic)
    

(0, [('war', 0.0118306), ('biden', 0.010292057), ('ukraine', 0.0100329025), ('trump', 0.009076829), ('people', 0.0077459067), ('money', 0.0041311006), ('russia', 0.003679826)])
(1, [('years', 0.0046200496), ('time', 0.0040901685), ('war', 0.0039145993), ('people', 0.003912225), ('like', 0.0037987418), ('democrats', 0.003711267), ('trump', 0.0036875475)])
(2, [('war', 0.0123109585), ('ukraine', 0.010948614), ('biden', 0.009136377), ('start', 0.004821789), ('russia', 0.0048157354), ('want', 0.004499425), ('low', 0.0042325016)])
(3, [('trump', 0.0072178487), ('know', 0.005289762), ('government', 0.00510377), ('war', 0.0047824676), ('need', 0.0044445167), ('obama', 0.0042887754), ('russia', 0.003941494)])
(4, [('ukraine', 0.009048695), ('russia', 0.0071237823), ('war', 0.007059266), ('trump', 0.0070298295), ('state', 0.0067854067), ('government', 0.0060946248), ('abortion', 0.005092202)])
(5, [('trump', 0.0128178345), ('people', 0.0073555573), ('right', 0.006372023), ('russia', 0.005609835

## 4 Data visualization

In [9]:
# Visualize the topics in the LDA model
#!pip install pyldavis
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()


In [10]:
map_sensitive = pyLDAvis.gensim_models.prepare(lda_sensitive, corpus_sensitive, dictionary_sensitive)
map_sensitive

In [11]:
map_not_sensitive = pyLDAvis.gensim_models.prepare(lda_not_sensitive, corpus_not_sensitive, dictionary_not_sensitive)
map_not_sensitive
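The intertopic distance maps above position topics according to how different their word distributions are. pyLDAvis's default layout is based on the Jensen-Shannon divergence between topics, projected to two dimensions with principal coordinate analysis. The sketch below illustrates only the divergence step, on hypothetical distributions over a shared four-word vocabulary; the functions `kl_divergence` and `js_divergence` are written here for illustration and are not pyLDAvis APIs. The sketch uses base-2 logarithms, which affects only the scale of the values.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL divergence to the midpoint distribution."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical topic-word distributions over the same four-word vocabulary.
topic_a = [0.7, 0.2, 0.1, 0.0]
topic_b = [0.6, 0.3, 0.1, 0.0]
topic_c = [0.0, 0.1, 0.2, 0.7]

near = js_divergence(topic_a, topic_b)
far = js_divergence(topic_a, topic_c)
print(f"a vs b: {near:.4f}")
print(f"a vs c: {far:.4f}")
```

Topics with similar word distributions (a and b) get a small divergence and would be drawn close together, while dissimilar topics (a and c) get a large one. The calculation requires both distributions to be defined over the same vocabulary in the same order, so comparing a topic from the sensitive model with one from the non-sensitive model in this way would first require mapping both onto a shared dictionary.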