In [16]:
import warnings
import os
import re
import math
from statistics import mean
import pandas as pd
from gdtm.models import TND, NLDA, GTM
from nltk.corpus import stopwords

warnings.filterwarnings('ignore')

### Welcome to the Topic Noise Models Tutorial

#### In this tutorial you will...
1. Learn to use Topic Noise Discriminator (TND), Noiseless LDA (NLDA), and Guided Topic Noise Model (GTM).
2. Learn how to interpret the results of these models.

#### Steps to setting up your code:
1. Make sure that your Python environment has `gdtm` installed.  You can install it using pip: `pip install gdtm`.  There is also a `requirements.txt` file in the root directory that you can use to install everything you need in one go.
2. You will also need the source code of the topic models, which are built in Java. The models you will need for this tutorial are `mallet-gtm`, `mallet-lda`, and `mallet-tnd`. You can find them in this repository: https://github.com/GU-DataLab/topic-noise-models-source
3. Download the data set that we will be using from the Massive Data Institute (MDI) at Georgetown University here: https://portals.mdi.georgetown.edu/public/covid-19-misinformation-detection
4. By following the directions in the link above, you will be sent an email with the data set attached in a zipped folder.  Unzip the folder and put it inside the `data/` folder in the root directory.
5. Follow the directions throughout this notebook.
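
Before going further, it can help to confirm that the model launchers are where the notebook expects them. The sketch below is only a convenience check: the helper name `missing_mallet_paths` is made up for this example, and the three paths assume that you saved the source code to the root directory, as the later cells do.

```python
import os

def missing_mallet_paths(paths):
    """Return the entries of `paths` that do not point at an existing file."""
    return [p for p in paths if not os.path.isfile(p)]

# Paths used later in this notebook, relative to the root directory (assumed layout).
expected = ['mallet-tnd/bin/mallet', 'mallet-lda/bin/mallet', 'mallet-gtm/bin/mallet']
missing = missing_mallet_paths(expected)
if missing:
    print('Missing model launchers: {}'.format(missing))
else:
    print('All model launchers found.')
```

If anything is reported as missing, adjust the paths in the model cells to match where you cloned the repository.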

#### About this data set
This data set was created by MDI to facilitate testing of misinformation detection models.  This data set revolves around the Covid-19 pandemic and contains six different topics that were prone to misinformation.  The topics are: hydroxychloroquine, antibiotics, 5G internet, UV-light, mosquitoes, and disinfectants.  This data was collected from March 2020 to August 2020.

In [3]:
# Let's load our data. There are multiple files in the data set, so we need to iterate and concatenate.
data_path = 'data/covid-19-misinformation-detection/'
files = os.listdir(data_path)

datasets = []
for file_name in files:
    data = pd.read_csv('{}{}'.format(data_path, file_name))
    datasets.append(data)

raw_dataset = pd.concat(datasets, ignore_index=True)
raw_dataset = raw_dataset['text'].values
print('Number of tweets in the data set: {}'.format(len(raw_dataset)))

Number of tweets in the data set: 3761
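
`os.listdir` returns everything in the folder, including hidden files that some operating systems add. A slightly more defensive variation, sketched below, only picks up files ending in `.csv`. The helper name `list_csv_files` is ours, and the `.csv` extension is an assumption based on the use of `pd.read_csv` above.

```python
import glob
import os

def list_csv_files(data_path):
    """Return the sorted paths of all .csv files directly inside `data_path`."""
    return sorted(glob.glob(os.path.join(data_path, '*.csv')))

# Assumes the data set was unzipped into the folder used above.
csv_files = list_csv_files('data/covid-19-misinformation-detection/')
print('Found {} CSV files'.format(len(csv_files)))
```

Each returned path can be passed straight to `pd.read_csv`, so the loop above would iterate over `csv_files` instead of `files`.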


In [4]:
# Let's take a look at our data.
print(raw_dataset[:10])

['YouTube to suppress content spreading #coronavirus 5G conspiracy theory - guardian <em>URL01 Removed</em> #ConspiracyTheory'
 '@USER01 5g is Chinese corona virus and spyware network which cause cancer. #5g #Covid_19 #coronavirus #HUAWEI #spyware #Cancer'
 'Health is wealth...keep social distance:victory_hand_medium-dark_skin_tone:#Weston #5G #20YearsHuaweiEurope #10over10 #coronaviruskenya #coronavirus'
 'If countries respond to the pandemic by upgrading their wireless networks so that more of us can work from home more easily, then coronavirus causes 5G \n#Showerthoughts #5G #coronavirus'
 '"No, #5G Does Not Cause #Coronavirus" <em>URL01 Removed</em>'
 'What is the deal with 5G? \n\nWhy are people against it? \n\n#Covid_19 #Day9ofLockdown <em>URL01 Removed</em>'
 '#M_D_P_News :thumbs_down:\nForeign Policy via @USER01 The Imagined Threats of 5G Conspiracy Theorists Are Causing Real-World Harm #5G #Disinformation #Coronavirus <em>URL01 Removed</em>'
 'Fake 5G coronavirus theories have

#### Yuck.
There are HTML tags, removed Users and URLs, and other things we want to remove from this data before we pass it into a topic model.  Let's do some preprocessing.

In [5]:
# Quick Preprocessing

stopwords_list = stopwords.words('english')

dataset = []
for tweet in raw_dataset: 
    tweet = tweet.lower()
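    tweet = re.sub(r'<[^<]+?>', '', tweet)           # strip HTML tags such as <em>
    tweet = re.sub(r'url[0-9]* removed', '', tweet)  # strip URL placeholders
    tweet = re.sub(r'user[0-9]*', '', tweet)         # strip user placeholders
    tweet = re.sub(r'[^\w\s]', '', tweet)            # strip punctuation (raw string avoids an invalid-escape warning)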
    tweet = tweet.replace('\n', ' ').replace('_', '').replace('ー', '')
    tweet = [x for x in tweet.split(' ') if len(x) > 1 and x != 'amp']
    tweet = [x for x in tweet if x not in stopwords_list]
    dataset.append(tweet)

_ = [print(dataset[i]) for i in range(0, 10)]

['youtube', 'suppress', 'content', 'spreading', 'coronavirus', '5g', 'conspiracy', 'theory', 'guardian', 'conspiracytheory']
['5g', 'chinese', 'corona', 'virus', 'spyware', 'network', 'cause', 'cancer', '5g', 'covid19', 'coronavirus', 'huawei', 'spyware', 'cancer']
['health', 'wealthkeep', 'social', 'distancevictoryhandmediumdarkskintoneweston', '5g', '20yearshuaweieurope', '10over10', 'coronaviruskenya', 'coronavirus']
['countries', 'respond', 'pandemic', 'upgrading', 'wireless', 'networks', 'us', 'work', 'home', 'easily', 'coronavirus', 'causes', '5g', 'showerthoughts', '5g', 'coronavirus']
['5g', 'cause', 'coronavirus']
['deal', '5g', 'people', 'covid19', 'day9oflockdown']
['mdpnews', 'thumbsdown', 'foreign', 'policy', 'via', 'imagined', 'threats', '5g', 'conspiracy', 'theorists', 'causing', 'realworld', 'harm', '5g', 'disinformation', 'coronavirus']
['fake', '5g', 'coronavirus', 'theories', 'realworld', 'consequences', 'via', '5g', 'fake5g', 'covid19', 'coronavirus', 'cnet']
['covi

#### That looks a bit better.  Preprocessing is never perfect, but we now have tokenized documents with stopwords, punctuation, and other obvious noise words removed.
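
One imperfection is visible in the third tweet above: the emoji shortcode `:victory_hand_medium-dark_skin_tone:` was glued onto its neighbours. The sketch below is one possible variation of the preprocessing step, not the approach used in the rest of this notebook. It wraps the same steps in a function (the name `preprocess` is ours) and drops shortcodes before punctuation is removed. The small `DEMO_STOPWORDS` set is a stand-in so the example runs without NLTK data; in the notebook you would pass `stopwords.words('english')`.

```python
import re

# Tiny stand-in stopword set (an assumption for this sketch only).
DEMO_STOPWORDS = {'is', 'the', 'to', 'and', 'of'}

def preprocess(tweet, stopword_set=DEMO_STOPWORDS):
    """Tokenize one tweet using the notebook's steps plus emoji-shortcode removal."""
    tweet = tweet.lower()
    tweet = re.sub(r'<[^<]+?>', '', tweet)            # HTML tags
    tweet = re.sub(r'url[0-9]* removed', '', tweet)   # URL placeholders
    tweet = re.sub(r'user[0-9]*', '', tweet)          # user placeholders
    tweet = re.sub(r':[a-z0-9_-]+:', ' ', tweet)      # emoji shortcodes like :thumbs_down:
    tweet = re.sub(r'[^\w\s]', '', tweet)             # punctuation
    tweet = tweet.replace('_', '').replace('ー', '')
    tokens = [x for x in tweet.split() if len(x) > 1 and x != 'amp']
    return [x for x in tokens if x not in stopword_set]

print(preprocess('Health is wealth...keep social distance:victory_hand_medium-dark_skin_tone:#Weston #5G'))
```

The shortcode pattern is deliberately simple, so it would also remove something like the `:30:` in a timestamp. Whether that trade-off is acceptable depends on your data.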

#### Now, let's get on to some topic modeling

In [6]:
# We have to link our topic model source code to the Python wrapper. 
# If you have saved the source code to your root directory, the path should look like this:
tnd_path = 'mallet-tnd/bin/mallet'
lda_path = 'mallet-lda/bin/mallet'

# This line constructs the NLDA model using the TND and LDA models at the locations above.
model = NLDA(dataset=dataset, mallet_tnd_path=tnd_path, mallet_lda_path=lda_path, 
             tnd_k=10, lda_k=10, phi=10, top_words=20)

topics = model.get_topics()
noise = model.get_noise_distribution()

Mallet NFT: 10 topics, 4 topic bits, 1111 topic mask
Data loaded.
Max Noise Dist as of Initialization: 721
Max Topic Value as of Initialization: 46
skew: 25.0
max tokens: 26
total tokens: 31121
<10> LL/token: -7.33562
<20> LL/token: -7.04461
<30> LL/token: -6.90808
<40> LL/token: -6.822

0	5	spread covid spray study antibiotic dr prevent dangerous fake due claims fda resistance uk masts stayhome hcq give real risk 
1	5	mosquitoes coronavirusoutbreak uv light kill covid2019 coronaviruspandemic sarscov2 good carry body works disinfectant water day americans science medical free found 
2	5	5g virus conspiracy time 5gcoronavirus theory technology theories radiation hands state uproar deacon whats unleashes outbreak clean comfort source stupid 
3	5	hydroxychloroquine corona trump drug covid19 taking drink question chloroquine fauci usa information inject doesnt visit public clinical news vaccine stock 
4	5	effective dont treating mosquitos news preventing im make cases network coronavirus o

#### Once the cell above finishes running, we should be able to take a look at our topics and noise distributions

Let's do that now.

In [7]:
_= [print(x) for x in topics]

['mosquito', 'corona', 'virus', 'transmitted', 'bites', 'person', 'bite', 'fact', 'infected', 'due', 'blood', 'stayhome', 'day', 'information', 'indiafightscorona', 'stay', 'spreads', 'experts', 'coronavirusupdates', 'suggest']
['hydroxychloroquine', 'covid', 'patients', 'treatment', 'drug', 'doctors', 'doctor', 'fda', 'chloroquine', 'drugs', 'fauci', 'usa', 'hcq', 'azithromycin', 'deaths', 'qanon', 'clinical', 'days', 'found', 'antimalarial']
['mosquitoes', 'virus', 'transmit', 'mosquitos', 'study', 'pandemic', 'evidence', 'sarscov2', 'carry', 'question', 'disease', 'malaria', 'cases', 'humans', 'heres', 'summer', 'back', 'confirmed', 'diseases', 'human']
['5g', 'conspiracy', 'news', '5gcoronavirus', 'theory', 'technology', 'theories', 'uk', 'towers', 'masts', 'radiation', 'network', 'claims', 'dangerous', 'lockdown', 'linking', 'free', 'mobile', 'link', 'networks']
['people', 'drinking', 'stop', 'drink', 'doesnt', 'good', 'covidiots', 'fake', 'media', 'home', 'government', 'wuhancoro

### What do we see in these topics?

First of all, NLDA is a randomized algorithm.  That means that the results will not be exactly the same every time.  Often, we will run a topic model a number of times and take the result that maximizes some metric.  The two metrics we like best are diversity and coherence: together they measure how distinct the topics are from one another, and how easy each topic is for a human to interpret.  We look at both in the Evaluation Metrics section later in this tutorial, and you can find a fuller description [here](https://www.churchill.io/papers/topic_noise_models.pdf).
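
The "run it several times and keep the best" idea can be sketched as a small helper. Everything here is illustrative: `best_of_runs`, `run_model`, and `score` are names invented for this example, and the demonstration uses canned results rather than a real model so that it runs instantly.

```python
def best_of_runs(run_model, score, n_runs=5):
    """Call `run_model` n_runs times and keep the topic set with the highest score."""
    best_topics, best_score = None, float('-inf')
    for _ in range(n_runs):
        topics = run_model()
        current = score(topics)
        if current > best_score:
            best_topics, best_score = topics, current
    return best_topics, best_score

def unique_fraction(topics):
    """Toy score: fraction of words across all topics that are distinct."""
    words = [w for topic in topics for w in topic]
    return len(set(words)) / len(words)

# Canned "runs" standing in for repeated model fits.
fake_runs = iter([[['a', 'a']], [['a', 'b']], [['b', 'b']]])
best, best_score = best_of_runs(lambda: next(fake_runs), unique_fraction, n_runs=3)
print(best, best_score)
```

In this notebook, `run_model` could be a function that constructs `NLDA(...)` with the arguments shown above and returns `model.get_topics()`. Keep in mind that each run takes as long as the original cell.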

What we see in our topic set are topics about mosquitoes, 5g, antibiotics, hydroxychloroquine, disinfectants (bleach, alcohol), and UV light.  We can very easily see that the six topics included in our data set are present.  In our topic set, the UV light and disinfectants topics are somewhat merged together, and there are words belonging to the disinfectant topic that are off in other topics  (clorox, lysol).

#### So why did we use ten topics instead of just 6?
We always ask for more topics than we expect to see because there are usually other topics that we didn't know existed that might take the place of one that we expect to see.  If we look at the four other topics in this set, we can see that one is about general remedies (hand sanitizer, water, washing hands, masks, and stayhome).  Another is a more general covid topic (containing words like corona, flu, pandemic, china, wuhan, and healthcare).  Another topic seems to revolve around covid cases and deaths, while the last is a mashup, containing words like world, health, myth, lockdown, and facts.

By allowing for more topics than we expect to see, we will discover new topics and allow space for the most meaningful topics to grow without interference.

#### Now that we have taken a look at the topics, let's glance at noise.
Note that each word is accompanied by a frequency.  That is the frequency with which a word was selected as a noise word.

In [8]:
noise

[('5g', 947),
 ('disinfectant', 723),
 ('virus', 693),
 ('antibiotics', 603),
 ('hydroxychloroquine', 562),
 ('mosquitoes', 435),
 ('mosquito', 393),
 ('coronavirusoutbreak', 310),
 ('people', 300),
 ('spread', 293),
 ('corona', 284),
 ('trump', 259),
 ('covid', 244),
 ('bleach', 242),
 ('work', 196),
 ('patients', 192),
 ('cure', 192),
 ('dont', 187),
 ('bacteria', 182),
 ('viruses', 177),
 ('effective', 171),
 ('treatment', 168),
 ('conspiracy', 164),
 ('uv', 155),
 ('treat', 155),
 ('transmitted', 154),
 ('coronaoutbreak', 144),
 ('flu', 141),
 ('china', 134),
 ('drug', 133),
 ('light', 131),
 ('bites', 130),
 ('drinking', 126),
 ('person', 125),
 ('transmit', 125),
 ('world', 125),
 ('spray', 125),
 ('antibiotic', 123),
 ('infections', 121),
 ('mosquitos', 118),
 ('india', 116),
 ('health', 113),
 ('wuhan', 109),
 ('bacterial', 108),
 ('2019ncov', 104),
 ('bite', 102),
 ('study', 102),
 ('treating', 102),
 ('kill', 100),
 ('ad', 100),
 ('caused', 97),
 ('infection', 97),
 ('wipes',

### What is the first thing that we notice in the noise distribution?
What we notice is that the titles of the six topics are all at the top of the noise distribution.  Why?
This occurs because of the way that the data was collected.  Tweets were collected based on these keywords, and so every tweet must contain one of these.  Therefore, these words are incredibly common in the data set and are more likely to be chosen as noise.  However, you will also notice that in many cases, their probability of being in their respective topics was so strong that they were still chosen as topic words.  This is exactly how we expect the noise removal process to work.  Regardless of a word's overall frequency, it is the relation between its probability in a topic and its probability in noise that matters. 
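
To see this for yourself, you can look up how often each topic word was also selected as noise. The sketch below assumes that the noise distribution is a list of `(word, count)` pairs, as in the output above. The helper name `noise_counts_for_topics` is ours, and the demo data is a short excerpt of the topics and noise counts shown earlier.

```python
def noise_counts_for_topics(topics, noise, topn=5):
    """For the top-n words of each topic, pair each word with its noise count (0 if absent)."""
    noise_lookup = dict(noise)
    return [[(w, noise_lookup.get(w, 0)) for w in topic[:topn]] for topic in topics]

# Short excerpt of the outputs above; a word missing from this excerpt shows up as 0.
demo_topics = [['5g', 'conspiracy', 'theory'], ['mosquitoes', 'virus', 'transmit']]
demo_noise = [('5g', 947), ('virus', 693), ('mosquitoes', 435),
              ('conspiracy', 164), ('transmit', 125)]

for row in noise_counts_for_topics(demo_topics, demo_noise, topn=3):
    print(row)
```

Calling it with the real `topics` and `noise` variables shows which keywords sit high in both lists at once.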

### Now that we have seen how an unsupervised topic noise model works, let's take a look at a semi-supervised model: Guided Topic Noise Model (GTM)
GTM requires a set of seed topics to perform semi-supervised learning.  In the root directory, there is an empty file named `seeds.csv`.  Based on the topics that you saw above, put together a list of seed topics.  Each row should be one seed topic, with its words separated by commas, like this: 

```word1,word2,word3,word4```

```word5,word6,word7,word8```
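
If you would rather build the file from Python than edit it by hand, a minimal sketch follows. The helper name `write_seed_topics` and the seed words are purely illustrative (the words are picked from the topics printed above), and the sketch writes to `seeds_example.csv` so that it does not overwrite your own `seeds.csv`.

```python
def write_seed_topics(path, seed_topics):
    """Write one seed topic per row, with its words separated by commas."""
    with open(path, 'w', encoding='utf-8') as f:
        for words in seed_topics:
            f.write(','.join(words) + '\n')

# Hypothetical seed topics; replace them with your own choices.
example_seeds = [
    ['5g', 'radiation', 'towers'],
    ['mosquitoes', 'mosquito', 'bite'],
    ['hydroxychloroquine', 'chloroquine', 'hcq'],
]
write_seed_topics('seeds_example.csv', example_seeds)
```

Seed words should match the preprocessed tokens (lowercase, no punctuation), since those are what the model sees.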

In [83]:
# Let's define our path to the GTM source code.  If you saved it to your root directory, your path should look like this:
gtm_path = 'mallet-gtm/bin/mallet'

# We also need to know where to find our seed topics
seeds_path = 'seeds.csv'

# Now, let's construct GTM using our data set and seed topics.  
# We want to set tnd_k and gtm_k to a value greater than the number of seed topics that we have.
model = GTM(dataset=dataset, mallet_tnd_path=tnd_path, mallet_gtm_path=gtm_path, 
             tnd_k=10, gtm_k=10, phi=10, top_words=20, seed_topics_file=seeds_path)

topics = model.get_topics()
noise = model.get_noise_distribution()

Mallet NFT: 10 topics, 4 topic bits, 1111 topic mask
Data loaded.
Max Noise Dist as of Initialization: 719
Max Topic Value as of Initialization: 35
skew: 25.0
max tokens: 26
total tokens: 31121
<10> LL/token: -7.30129
<20> LL/token: -7.03272
<30> LL/token: -6.91279
<40> LL/token: -6.8236

0	5	work bacteria treatment antibiotic viruses treating dont covid19 infection resistance preventing masts disinfectant carry drugs die hcq give fauci science 
1	5	virus transmitted world 2019ncov time coronaviruswuhan caused wuhanflu today fact live claims place join secret lord enter quarantine facebook class 
2	5	bleach cure effective infections bacterial im prevent health treat cases doesnt radiation dangerous day usa deaths surfaces secondary million body 
3	5	antibiotics spray wipes viruses dr stop treat viral sanitizer amazon prevention water hands government means free make pneumonia cleaning visit 
4	5	mosquito spread bites person bite evidence theories disease health good network heres myths

#### Once the cell above finishes, let's take a look at our topics

In [84]:
_= [print(x) for x in topics]

['5g', 'conspiracy', '5gcoronavirus', 'theory', 'technology', 'theories', 'uk', 'masts', 'towers', 'radiation', 'network', 'fake', 'claims', 'media', 'government', 'spreading', 'mobile', 'causing', 'video', 'link']
['mosquitoes', 'mosquito', 'spread', 'transmitted', 'bites', 'transmit', 'person', 'mosquitos', 'bite', 'evidence', 'carry', 'question', 'infected', 'malaria', 'blood', 'heres', 'summer', 'information', 'experts', 'spreads']
['hydroxychloroquine', 'patients', 'treatment', 'drug', 'india', 'taking', 'works', 'doctors', 'prevention', 'vaccine', 'fda', 'chloroquine', 'drugs', 'hcq', 'medicine', 'azithromycin', 'covid2019', 'clinical', 'antimalarial', 'trumps']
['coronavirusoutbreak', 'bleach', 'coronaoutbreak', 'flu', 'china', 'spray', 'virus', 'wuhan', '2019ncov', 'ad', 'wipes', 'corona', 'coronaviruswuhan', 'clorox', 'lysol', 'outbreak', 'coronavirususa', 'wuhanflu', 'chinaflu', 'healthcare']
['trump', 'uv', 'light', 'kill', 'dr', 'doesnt', 'make', 'good', 'medical', 'body', 

### What do we see?
Because this data set is so clean, there is not a huge gap in topic quality between GTM and NLDA.  However, there are some notable differences.  First, all of the seed topics are in the order that they appear in your `seeds.csv` file.  This makes them far easier to identify.  Second, we can see that the disinfectants and UV light topics are clearly their own topics this time around, and that the disinfectants topic is much stronger than before, with clorox, lysol, and wipes included.

These topics are easy for models to find, because they are built into the data set by design.  What about topics that we don't know for certain are there?  Try adding other seed topics (and increasing the number of topics in the model), and see if any pop out at you.
We tried adding a seed topic about masks (seed words: mask, masks).  What do you think about the mask topic?  Is there a strong mask topic?

Iterate through the process of curating seed topics and see how many solid topics you can get out of this data set on top of the original six.  What else is in there?  At what point do you start to lose quality?

### What's next?
Topic Noise Models are designed to perform better with more data.  The noise distribution in particular benefits from as many observations as it can get.  There is another fun, and much larger, data set that we can investigate using topic models.  It revolves around the 2020 United States presidential election.  You can find it on Kaggle, here: https://www.kaggle.com/datasets/manchunhui/us-election-2020-tweets

Download the data set at the link above, and load it into this notebook below. **MAKE SURE NOT TO USE THE WHOLE DATA SET FOR THIS PART!** Topic models are generally pretty slow, and performing topic modeling on such a large data set will take forever on your laptop.  There are optimizations built in for multiple processors and larger servers, which is why the source code of these models is Java and not Python (where the global interpreter lock of the standard CPython interpreter keeps threads from running CPU-bound code in parallel).

**TAKE A RANDOM SAMPLE OF 20,000-50,000 TWEETS FROM THE DATA SET!!**

Once you have your random sample, perform preprocessing (it might look a little bit different than above because the raw data is different), and then run NLDA and GTM on the data.  See what kind of topics are present in the election data set.  Create some seed topics and see if you can refine the topics into a strong, actionable topic set.
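
Taking the random sample can be as simple as the sketch below. It uses only the standard library, and the helper name `sample_tweets` is ours. Fixing the seed makes the sample reproducible between runs.

```python
import random

def sample_tweets(tweets, n, seed=0):
    """Return a reproducible random sample of n tweets (or all of them if there are fewer)."""
    tweets = list(tweets)
    if n >= len(tweets):
        return tweets
    return random.Random(seed).sample(tweets, n)

# Toy demonstration with stand-in "tweets".
demo = ['tweet {}'.format(i) for i in range(100)]
print(sample_tweets(demo, 5))
```

If you load the data with pandas, `DataFrame.sample(n=20000, random_state=0)` does the same job directly on the DataFrame.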

In [85]:
# Load data set


In [86]:
# Preprocess data set


In [87]:
# Run NLDA


In [88]:
# Run GTM


### Evaluation Metrics
Evaluating topic models requires both mathematical metrics and human intuition.  It's one thing to get a good score, but it's another thing altogether to get a topic set that is easily interpretable by humans.  In this section, we will look at the evaluation metrics called topic coherence and topic diversity.  Remember that while high coherence and diversity are positive indicators, a human still needs to be able to understand the topics.

Topic coherence is a measure of how well the words in a given topic fit together.  To calculate coherence, we use normalized pointwise mutual information (NPMI).  Topic coherence is the average NPMI score over every pair of words in a topic.  NPMI compares the probability of two words appearing together with the probability we would expect if they appeared independently of each other, and normalizes the result.  So, if a pair of words only ever appear together, their score will be 1.  If the two words are independent, their score will be 0, and if they never appear together, their score will be -1.
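
For reference, the standard textbook definitions are given below, where $P(w_i)$ is the probability of seeing word $w_i$ in a document and $P(w_i, w_j)$ is the probability of seeing both words together. The coherence of a topic with $n$ top words is the average over all of its word pairs.

$$\mathrm{PMI}(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i)\,P(w_j)}, \qquad \mathrm{NPMI}(w_i, w_j) = \frac{\mathrm{PMI}(w_i, w_j)}{-\log_2 P(w_i, w_j)}$$

$$\mathrm{coherence}(t) = \frac{2}{n(n-1)} \sum_{i<j} \mathrm{NPMI}(w_i, w_j)$$

When the two words are independent, $P(w_i, w_j) = P(w_i)P(w_j)$ and the score is 0. When they only ever appear together, $P(w_i) = P(w_j) = P(w_i, w_j)$ and the score is 1. By convention, a pair that never co-occurs is assigned -1.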

Let's start by getting word frequencies and cofrequencies (the frequency of a pair of words).

In [9]:
def word_frequency(frequency, docs):
    '''
    :param frequency: passed explicitly so that you can increment existing frequencies if using in online mode
    :param docs:
    :return: updated frequency

    '''
    for doc in docs:
        for word in doc:
            if word in frequency:
                frequency[word] += 1
            else:
                frequency[word] = 1
    return frequency


def word_co_frequency(frequency, docs):
    for doc in docs:
        for i in range(0, len(doc) - 1):
            w1 = doc[i]
            for j in range(i + 1, len(doc)):
                w2 = doc[j]
                word_list = sorted([w1, w2])
                word_tup = tuple(word_list)
                if word_tup not in frequency:
                    frequency[word_tup] = 0
                frequency[word_tup] += 1
    return frequency

Next, let's see how NPMI is calculated, and how we get topic coherence and diversity.

In [10]:
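def npmi(topic, frequencies, cofrequencies):
    '''
    Average pairwise score over all word pairs in one topic.
    Note: frequencies and cofrequencies hold raw counts, not probabilities,
    so the result is not confined to the [-1, 1] range of textbook NPMI.
    '''
    v = 0
    x = max(2, len(topic))  # avoids division by zero for very short topics
    for i in range(0, len(topic)):
        w_i = topic[i]
        p_i = 0
        if w_i in frequencies:
            p_i = frequencies[w_i]
        for j in range(i+1, len(topic)):
            w_j = topic[j]
            p_j = 0
            if w_j in frequencies:
                p_j = frequencies[w_j]
            # co-frequency keys are stored as sorted tuples
            word_tup = tuple(sorted([w_i, w_j]))
            p_ij = 0
            if word_tup in cofrequencies:
                p_ij = cofrequencies[word_tup]
            if p_ij < 2:
                # pairs seen together fewer than twice get the minimum score
                v -= 1
            else:
                pmi = math.log(p_ij / (p_i * p_j), 2)
                denominator = -1 * math.log(p_ij, 2)
                v += (pmi / denominator)
    # average over the x*(x-1)/2 word pairs
    return (2*v) / (x*(x-1))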


def topic_npmis(T, frequencies, cofrequencies, topn=20):
    npmis = []
    for topic in T:
        n = npmi(topic[:topn], frequencies, cofrequencies)
        npmis.append(n)
    return npmis


def topic_coherence(T, frequencies, cofrequencies, topn=20):
    '''
    Computes the coherence of a topic set (average NPMI of topics)
    :param T:
    :param frequencies:
    :param cofrequencies:
    :param topn: top-n words per topic to consider
    :return:
    '''
    npmis = topic_npmis(T, frequencies, cofrequencies, topn)
    if len(npmis) > 0:
        return mean(npmis)
    return 0


def topic_diversity(T, topn=20):
    '''
    fraction of words in top-n words of each topic that are unique
    :param T:
    :param topn: top-n words per topic
    :return:
    '''
    top_words = []
    for topic in T:
        top_words.extend(topic[:topn])
    unique_words = set(top_words)
    if len(top_words) > 0:
        return len(unique_words)/len(top_words)
    return 0

In [22]:
# Topic diversity is straightforward.  Let's see it at work: 
topic_diversity(topics, topn=20)

0.98
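
A diversity of 0.98 means that 98% of the top-word slots hold distinct words. If all ten topics returned 20 words each, that is 196 distinct words out of 200. To find out which words are repeated, a small helper like the one below can be used. The name `shared_words` is ours, and the demo topics are toy data.

```python
from collections import Counter

def shared_words(topics, topn=20):
    """Map each word that appears in more than one topic's top-n words to its topic count."""
    counts = Counter(w for topic in topics for w in set(topic[:topn]))
    return {w: c for w, c in counts.items() if c > 1}

# Toy topics: 'virus' appears in both.
demo_topics = [['5g', 'virus', 'theory'], ['mosquitoes', 'virus', 'bite']]
print(shared_words(demo_topics))
```

Running `shared_words(topics)` on the real topic set lists the overlapping words, which is often a quick hint about which topics are bleeding into each other.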

In [23]:
# Topic Coherence is more involved to compute.  Let's compute frequencies and cofrequencies, and then use them to calculate coherence.
freqs = word_frequency({}, dataset)
cofreqs = word_co_frequency({}, dataset)

topic_coherence(topics, freqs, cofreqs, topn=20)

2.115925191447977
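
Note that the value above is larger than 1. The `npmi` function in this notebook is handed raw counts rather than probabilities, so its scores are useful for comparing topic sets scored the same way, but they are not on the usual [-1, 1] NPMI scale. If you want scores on that scale, one option is to work with document-level probabilities, as in the sketch below. This is an alternative under our own assumptions, not the method used above: the function names are invented for this example, a pair counts once per document, pairs that never co-occur are scored -1, and a pair that occurs in every document is scored 1.

```python
import math
from itertools import combinations

def document_counts(docs):
    """Count, per word and per sorted word pair, the number of documents containing it."""
    word_df, pair_df = {}, {}
    for doc in docs:
        vocab = sorted(set(doc))
        for w in vocab:
            word_df[w] = word_df.get(w, 0) + 1
        for pair in combinations(vocab, 2):
            pair_df[pair] = pair_df.get(pair, 0) + 1
    return word_df, pair_df, len(docs)

def normalized_npmi(topic, word_df, pair_df, n_docs):
    """Average NPMI over all word pairs of a topic, using document-level probabilities."""
    scores = []
    for w_i, w_j in combinations(topic, 2):
        c_ij = pair_df.get(tuple(sorted((w_i, w_j))), 0)
        if c_ij == 0:
            scores.append(-1.0)   # never seen together
            continue
        p_ij = c_ij / n_docs
        if p_ij == 1.0:
            scores.append(1.0)    # together in every document (limit case)
            continue
        p_i = word_df[w_i] / n_docs
        p_j = word_df[w_j] / n_docs
        pmi = math.log2(p_ij / (p_i * p_j))
        scores.append(pmi / -math.log2(p_ij))
    return sum(scores) / len(scores) if scores else 0.0

demo_docs = [['a', 'b'], ['a', 'b'], ['c'], ['d']]
word_df, pair_df, n_docs = document_counts(demo_docs)
print(normalized_npmi(['a', 'b'], word_df, pair_df, n_docs))
```

With the real data, you would call `document_counts(dataset)` once and then average `normalized_npmi` over the topics, much as `topic_coherence` does.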

We want to maximize coherence and diversity, so the higher the better.  Coherence is u

### References:
1. Haber, J.; Kawintiranon, K.; Singh, L.; Chen, A.; Pizzo, A.; Pogrebivsky, A. and Yang, J. Identifying High-Quality Training Data for Misinformation Detection. In Proceedings of the 12th International Conference on Data Science, Technology and Applications (2022), ISBN 978-989-758-664-4, ISSN 2184-285X, pages 64-76. DOI: 10.5220/0012089000003541
2. Churchill, R., Singh, L.: Topic-noise models: Modeling topic and noise distributions in social media post collections. In: International Conference on Data Mining (ICDM) (2021)
3. Churchill, R., Singh, L., Ryan, R., & Davis-Kean, P. A Guided Topic-Noise Model for Short Texts. In Proceedings of the ACM Web Conference 2022 (pp. 2870-2878).