# Module 6 - Topic Modelling

The original data for the next steps can be downloaded from the [Kaggle News Category Dataset](https://www.kaggle.com/rmisra/news-category-dataset).

It is important to read about the dataset that you are using so that you understand what it contains and also what it doesn't contain.

### Subset exploration

Often we only need to analyse part of a dataset, so we start by extracting the subset we are interested in.

In [1]:
import json

# load the complete dataset
with open('data/News_Category_Dataset_v2.json', 'r') as f:
    news_list = f.readlines()

# convert each line (string) to json (dict)
news_json = list(map(json.loads,news_list))

print("Number of stories: ",len(news_json))

# view the first 20 elements in the list
news_json[:20]

Number of stories:  200853


[{'category': 'CRIME',
  'headline': 'There Were 2 Mass Shootings In Texas Last Week, But Only 1 On TV',
  'authors': 'Melissa Jeltsen',
  'link': 'https://www.huffingtonpost.com/entry/texas-amanda-painter-mass-shooting_us_5b081ab4e4b0802d69caad89',
  'short_description': 'She left her husband. He killed their children. Just another day in America.',
  'date': '2018-05-26'},
 {'category': 'ENTERTAINMENT',
  'headline': "Will Smith Joins Diplo And Nicky Jam For The 2018 World Cup's Official Song",
  'authors': 'Andy McDonald',
  'link': 'https://www.huffingtonpost.com/entry/will-smith-joins-diplo-and-nicky-jam-for-the-official-2018-world-cup-song_us_5b09726fe4b0fdb2aa541201',
  'short_description': 'Of course it has a song.',
  'date': '2018-05-26'},
 {'category': 'ENTERTAINMENT',
  'headline': 'Hugh Grant Marries For The First Time At Age 57',
  'authors': 'Ron Dicker',
  'link': 'https://www.huffingtonpost.com/entry/hugh-grant-marries_us_5b09212ce4b0568a880b9a8c',
  'short_description

What categories are available in this dataset?

In [2]:
set([story['category'] for story in news_json])

{'ARTS',
 'ARTS & CULTURE',
 'BLACK VOICES',
 'BUSINESS',
 'COLLEGE',
 'COMEDY',
 'CRIME',
 'CULTURE & ARTS',
 'DIVORCE',
 'EDUCATION',
 'ENTERTAINMENT',
 'ENVIRONMENT',
 'FIFTY',
 'FOOD & DRINK',
 'GOOD NEWS',
 'GREEN',
 'HEALTHY LIVING',
 'HOME & LIVING',
 'IMPACT',
 'LATINO VOICES',
 'MEDIA',
 'MONEY',
 'PARENTING',
 'PARENTS',
 'POLITICS',
 'QUEER VOICES',
 'RELIGION',
 'SCIENCE',
 'SPORTS',
 'STYLE',
 'STYLE & BEAUTY',
 'TASTE',
 'TECH',
 'THE WORLDPOST',
 'TRAVEL',
 'WEDDINGS',
 'WEIRD NEWS',
 'WELLNESS',
 'WOMEN',
 'WORLD NEWS',
 'WORLDPOST'}
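
The set above only tells us which labels exist; it also contains near-duplicates such as 'ARTS & CULTURE' and 'CULTURE & ARTS', or 'WORLDPOST' and 'THE WORLDPOST'. Counting stories per label can help decide which ones are worth merging or analysing. The following is a minimal sketch on a hypothetical four-story sample (`sample_stories` and `count_categories` are illustrative names, not part of the notebook); with the real data, `news_json` would be passed in instead.

```python
from collections import Counter

# Hypothetical sample standing in for news_json (a list of dicts with a 'category' key)
sample_stories = [
    {'category': 'SCIENCE', 'headline': 'A'},
    {'category': 'ARTS & CULTURE', 'headline': 'B'},
    {'category': 'SCIENCE', 'headline': 'C'},
    {'category': 'CULTURE & ARTS', 'headline': 'D'},
]

def count_categories(stories):
    """Count how many stories fall into each category."""
    return Counter(story['category'] for story in stories)

category_counts = count_categories(sample_stories)

# most_common() lists categories from most to least frequent
for category, n in category_counts.most_common():
    print(category, n)
```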

Extract just the science stories from the dataset...

In [3]:
# filter the list for stories that are in the category SCIENCE
science_json = [story for story in news_json if story['category']=='SCIENCE']

# for each, create the 'story' by adding together the headline and the short_description
science_stories = [story['headline']+' - '+story['short_description'] for story in science_json]

print("Number of science stories: ",len(science_stories))

# look at first 10
science_stories[:10]

Number of science stories:  2178


['Scientists Turn To DNA Technology To Search For Loch Ness Monster - The researchers plan to scour the Loch Ness next month for evidence of its supposed inhabitant.',
 'Unusual Asteroid Could Be An Interstellar Guest To Our Solar System - The supposed "interstellar immigrant" is located near Jupiter but has an atypical orbit.',
 "China Marks Another Milestone In Quest To Become Space Superpower - It's the first time a rocket designed by a Chinese private company has successfully entered orbit.",
 'Terrifying Clip Shows Why You Should Never Run Under A Tree During Thunderstorms - YIKES!',
 "U.S. Climate Scientists Flee For France To ‘Make Our Planet Great Again’ - Some of America's top researchers will move to France to continue their research.",
 'Stephen Hawking Finished Mind-Bending Parallel Universe Paper Days Before His Death - The new treatise on the existence of parallel universes was published on Friday.',
 "Mysterious Yellowstone Geyser Eruptions Stump Scientists - Steamboat b

In [4]:
#science_json

### Word frequency

How do we find anything meaningful in these science news stories?

We could start by just extracting words and looking at the frequencies...

In [5]:
import re

story1 = science_stories[0]

# raw string so that \W is passed to the regex engine unchanged
re.split(r'\W+',story1.lower())

['scientists',
 'turn',
 'to',
 'dna',
 'technology',
 'to',
 'search',
 'for',
 'loch',
 'ness',
 'monster',
 'the',
 'researchers',
 'plan',
 'to',
 'scour',
 'the',
 'loch',
 'ness',
 'next',
 'month',
 'for',
 'evidence',
 'of',
 'its',
 'supposed',
 'inhabitant',
 '']

In [6]:
story1

'Scientists Turn To DNA Technology To Search For Loch Ness Monster - The researchers plan to scour the Loch Ness next month for evidence of its supposed inhabitant.'

In [7]:
word_counts = {}

for story in science_stories:
    # raw string so that \W is passed to the regex engine unchanged
    words = re.split(r'\W+',story.lower())
    for word in words:
        if word in word_counts:
            word_counts[word] += 1
        else:
            word_counts[word] = 1
        
# sort the word_counts by counts
sorted_counts = {k: v for k, v in sorted(word_counts.items(), key=lambda item: item[1],reverse=True)}

sorted_counts

{'the': 2754,
 'of': 1537,
 '': 1515,
 'to': 1334,
 'a': 1244,
 'in': 1034,
 'and': 860,
 's': 829,
 'is': 630,
 'that': 509,
 'for': 484,
 'on': 476,
 'it': 445,
 'this': 361,
 'be': 309,
 'new': 307,
 'from': 301,
 'are': 297,
 'you': 296,
 'space': 286,
 'with': 280,
 'have': 272,
 'at': 267,
 'may': 260,
 'as': 258,
 'scientists': 253,
 'we': 253,
 'by': 231,
 'science': 223,
 't': 215,
 'about': 214,
 'how': 211,
 'study': 207,
 'video': 206,
 'nasa': 203,
 'has': 197,
 'can': 195,
 'what': 188,
 'more': 187,
 'but': 185,
 'our': 174,
 'earth': 170,
 'an': 169,
 'not': 167,
 'one': 164,
 'was': 158,
 'will': 151,
 'your': 145,
 'could': 142,
 'i': 142,
 'than': 139,
 'all': 133,
 'like': 130,
 'life': 126,
 'years': 124,
 'time': 123,
 'their': 123,
 'world': 123,
 'mars': 122,
 'first': 121,
 'why': 119,
 'up': 117,
 'just': 117,
 'out': 117,
 'planet': 116,
 'they': 116,
 'say': 114,
 'into': 113,
 'shows': 107,
 'some': 107,
 'research': 106,
 'or': 106,
 'there': 104,
 'its': 
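
The `''` entry in these counts is a side effect of `re.split`: when a story ends (or starts) with punctuation, the split produces an empty string. One way to avoid it is to collect the matches with `re.findall` and let `collections.Counter` do the tallying. This is a minimal sketch on a hypothetical two-story sample; `count_words` and `word_counts_sketch` are illustrative names, and in the notebook `science_stories` would be passed in.

```python
import re
from collections import Counter

def count_words(stories):
    """Count lower-cased word tokens across an iterable of strings."""
    counts = Counter()
    for story in stories:
        # findall returns only the matched words, so no empty strings appear
        counts.update(re.findall(r'\w+', story.lower()))
    return counts

# Hypothetical two-story sample; in the notebook this would be science_stories
sample = ['The Loch Ness monster.', 'The search for the monster!']
word_counts_sketch = count_words(sample)

# most_common(n) returns the n highest counts, already sorted
print(word_counts_sketch.most_common(3))
```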

### Alternatives to finding information in text

This does give us some information, but there are some problems:
- small, meaningless words dominate the counts
- the most significant words are scattered through the list

The field of **Information Retrieval** has developed techniques to help with this issue. We're going to look at two...
1. TF/IDF as a better term frequency
2. LDA for topic modelling

First we need some additional packages not installed in our Jupyter environment...
- [gensim](https://radimrehurek.com/gensim/) for topic modelling
- [pyLDAvis](https://github.com/bmabey/pyLDAvis) for interactive visualisation of topic models

In [8]:
# Install a pip package in the current Jupyter kernel
import sys
#!{sys.executable} -m pip install gensim
#!{sys.executable} -m pip install pyLDAvis


In [9]:
from gensim.models import TfidfModel
from gensim.corpora import Dictionary
from gensim.utils import tokenize
from gensim.utils import simple_preprocess
from gensim.corpora.textcorpus import remove_stopwords
from gensim.models.ldamodel import LdaModel
import pyLDAvis
# note: pyLDAvis 3.x renamed this module to pyLDAvis.gensim_models
import pyLDAvis.gensim
import pandas as pd

### Pre-processing with gensim

Let's bring our stories into a dataframe and use some of the gensim tools...

In [10]:
stories_df = pd.DataFrame(science_stories,columns=['story'])
stories_df


| | story |
|---|---|
| 0 | Scientists Turn To DNA Technology To Search Fo... |
| 1 | Unusual Asteroid Could Be An Interstellar Gues... |
| 2 | China Marks Another Milestone In Quest To Beco... |
| 3 | Terrifying Clip Shows Why You Should Never Run... |
| 4 | U.S. Climate Scientists Flee For France To ‘Ma... |
| ... | ... |
| 2173 | Treating a World Without Antibiotics? - Becaus... |
| 2174 | Russian Cargo Ship Docks At International Spac... |
| 2175 | Robots Play Catch, Starring Agile Justin And R... |
| 2176 | Thomas Edison Voted Most Iconic Inventor In U.... |


In [11]:
# get a list of tokens for first story
tokens = list(tokenize(stories_df['story'][0],lowercase=True))
tokens


['scientists',
 'turn',
 'to',
 'dna',
 'technology',
 'to',
 'search',
 'for',
 'loch',
 'ness',
 'monster',
 'the',
 'researchers',
 'plan',
 'to',
 'scour',
 'the',
 'loch',
 'ness',
 'next',
 'month',
 'for',
 'evidence',
 'of',
 'its',
 'supposed',
 'inhabitant']

In [14]:
# get a list of tokens for first story using simple_preprocess
tokens = list(simple_preprocess(stories_df['story'][0],min_len=3))
tokens


['scientists',
 'turn',
 'dna',
 'technology',
 'search',
 'for',
 'loch',
 'ness',
 'monster',
 'the',
 'researchers',
 'plan',
 'scour',
 'the',
 'loch',
 'ness',
 'next',
 'month',
 'for',
 'evidence',
 'its',
 'supposed',
 'inhabitant']

In [15]:
# remove the 'stopwords' from first story
remove_stopwords(tokens)


['scientists',
 'turn',
 'dna',
 'technology',
 'search',
 'loch',
 'ness',
 'monster',
 'researchers',
 'plan',
 'scour',
 'loch',
 'ness',
 'month',
 'evidence',
 'supposed',
 'inhabitant']
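
To make the two steps above less of a black box, here is a rough plain-Python approximation: lower-case the text, keep alphabetic tokens within a length range, then drop stopwords. It is a sketch only, not gensim's implementation; `rough_preprocess` and the four-word `SKETCH_STOPWORDS` set are hypothetical, and gensim's own stopword list is far longer.

```python
import re

# Hypothetical, deliberately tiny stopword set; gensim ships a much longer list
SKETCH_STOPWORDS = {'the', 'for', 'its', 'next'}

def rough_preprocess(text, min_len=3, max_len=15, stopwords=SKETCH_STOPWORDS):
    """Lower-case, keep alphabetic tokens of acceptable length, drop stopwords."""
    tokens = re.findall(r'[a-z]+', text.lower())
    return [t for t in tokens
            if min_len <= len(t) <= max_len and t not in stopwords]

story = ('Scientists Turn To DNA Technology To Search For Loch Ness Monster - '
         'The researchers plan to scour the Loch Ness next month for evidence '
         'of its supposed inhabitant.')
rough_terms = rough_preprocess(story)
print(rough_terms)
```

On this one story the sketch keeps the same 17 terms as the gensim output above, but that agreement depends on the hand-picked stopwords and should not be expected in general.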

In [16]:
# do this for whole dataframe
stories_df['terms'] = [remove_stopwords(simple_preprocess(story,min_len=3)) for story in stories_df['story']]
stories_df


| | story | terms |
|---|---|---|
| 0 | Scientists Turn To DNA Technology To Search Fo... | [scientists, turn, dna, technology, search, lo... |
| 1 | Unusual Asteroid Could Be An Interstellar Gues... | [unusual, asteroid, interstellar, guest, solar... |
| 2 | China Marks Another Milestone In Quest To Beco... | [china, marks, milestone, quest, space, superp... |
| 3 | Terrifying Clip Shows Why You Should Never Run... | [terrifying, clip, shows, run, tree, thunderst... |
| 4 | U.S. Climate Scientists Flee For France To ‘Ma... | [climate, scientists, flee, france, planet, gr... |
| ... | ... | ... |
| 2173 | Treating a World Without Antibiotics? - Becaus... | [treating, world, antibiotics, overuse, antibi... |
| 2174 | Russian Cargo Ship Docks At International Spac... | [russian, cargo, ship, docks, international, s... |
| 2175 | Robots Play Catch, Starring Agile Justin And R... | [robots, play, catch, starring, agile, justin,... |
| 2176 | Thomas Edison Voted Most Iconic Inventor In U.... | [thomas, edison, voted, iconic, inventor, hist... |


In [17]:
vocab = Dictionary(stories_df['terms'])
print(vocab.token2id)




### Term Frequency, Inverse Document Frequency (TF/IDF)

TF/IDF works on a Bag of Words (BoW) representation, in which each document is reduced to counts of its terms and word order is ignored. For more information on these terms, see:
- [A gentle introduction to the Bag-of-words model](https://machinelearningmastery.com/gentle-introduction-bag-words-model/)
- [tf-idf Wikipedia](https://en.wikipedia.org/wiki/Tf–idf)
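
To see what the scores mean, it can help to compute them by hand on a toy corpus. The sketch below assumes a weighting of term count × log2(N / document frequency), followed by scaling each document vector to unit length. We believe this matches the defaults of `TfidfModel`, but treat that as an assumption rather than a statement about the library. `tfidf_vectors` and `toy_docs` are hypothetical names.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenised document."""
    n_docs = len(docs)
    # document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # bag of words: term -> count, order ignored
        raw = {t: tf[t] * math.log2(n_docs / df[t]) for t in tf}
        # drop zero weights (terms that occur in every document)
        raw = {t: w for t, w in raw.items() if w > 0}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()} if norm else {})
    return vectors

# Hypothetical toy corpus of already-tokenised documents
toy_docs = [
    ['loch', 'ness', 'loch', 'ness', 'scientists'],
    ['mars', 'rover', 'scientists'],
    ['mars', 'water', 'scientists'],
]
toy_vectors = tfidf_vectors(toy_docs)
print(toy_vectors[0])
```

A term that appears in every document ('scientists' here) carries no distinguishing information and drops out, while terms unique to one document receive the largest weights.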

In [18]:
# convert corpus to BoW format
corpus = [vocab.doc2bow(terms) for terms in stories_df['terms']]  

# fit a tf-idf model to the corpus
model = TfidfModel(corpus)

# apply model to the first corpus document
tfidf_doc = model[corpus[0]] 


In [19]:
tfidf_doc


[(0, 0.16403388746852715),
 (1, 0.1597481855313059),
 (2, 0.25445072197667334),
 (3, 0.5089014439533467),
 (4, 0.21447616377683318),
 (5, 0.1811356326658227),
 (6, 0.5089014439533467),
 (7, 0.20400844566836726),
 (8, 0.1188861098946524),
 (9, 0.07979906913162561),
 (10, 0.27967186013082634),
 (11, 0.1671999191881768),
 (12, 0.2396973019309862),
 (13, 0.1765813951116847),
 (14, 0.17066791455735678)]

In [20]:
[(vocab[w[0]],w[1]) for w in tfidf_doc]


[('dna', 0.16403388746852715),
 ('evidence', 0.1597481855313059),
 ('inhabitant', 0.25445072197667334),
 ('loch', 0.5089014439533467),
 ('monster', 0.21447616377683318),
 ('month', 0.1811356326658227),
 ('ness', 0.5089014439533467),
 ('plan', 0.20400844566836726),
 ('researchers', 0.1188861098946524),
 ('scientists', 0.07979906913162561),
 ('scour', 0.27967186013082634),
 ('search', 0.1671999191881768),
 ('supposed', 0.2396973019309862),
 ('technology', 0.1765813951116847),
 ('turn', 0.17066791455735678)]

In [21]:
[(vocab[w[0]],w[1]) for w in tfidf_doc if w[1]>0.3]


[('loch', 0.5089014439533467), ('ness', 0.5089014439533467)]

In [22]:
stories_df['terms'][0]


['scientists',
 'turn',
 'dna',
 'technology',
 'search',
 'loch',
 'ness',
 'monster',
 'researchers',
 'plan',
 'scour',
 'loch',
 'ness',
 'month',
 'evidence',
 'supposed',
 'inhabitant']

In [23]:
# try the second story
terms = stories_df['terms'][1]
print("terms: ",terms)
tfidf_doc2 = model[corpus[1]]
tfidf2 = [(vocab[w[0]],w[1]) for w in tfidf_doc2 if w[1]>0.1]
print("tf/idf: ",tfidf2)

terms:  ['unusual', 'asteroid', 'interstellar', 'guest', 'solar', 'supposed', 'interstellar', 'immigrant', 'located', 'near', 'jupiter', 'atypical', 'orbit']
tf/idf:  [('supposed', 0.2973839839656211), ('asteroid', 0.19989766083509466), ('atypical', 0.315688031591355), ('guest', 0.315688031591355), ('immigrant', 0.3469790076849819), ('interstellar', 0.4860652701662278), ('jupiter', 0.20351105568474043), ('located', 0.24778896024626032), ('near', 0.2054323386599618), ('orbit', 0.20953910821319077), ('solar', 0.1644621856949979), ('unusual', 0.315688031591355)]



### Most relevant terms

What is probably more interesting is the top n terms, which are expected to be the most relevant.

Let's create a function to take the top 5 terms based on tf/idf.

In [24]:
def get_tfidf(idx):
    term_values = [(vocab[el[0]],el[1]) for el in model[corpus[idx]] if el[1]>0]
    srt =  sorted(term_values, key=lambda x: x[1],reverse=True)
    return list(map(lambda x: x[0],srt[:5]))


In [25]:
get_tfidf(1)


['interstellar', 'immigrant', 'atypical', 'guest', 'unusual']

In [26]:
get_tfidf(0)


['loch', 'ness', 'scour', 'inhabitant', 'supposed']

Now we apply this function to the whole dataframe.

In [27]:
stories_df['tfidf'] = stories_df.index.map(get_tfidf)
stories_df


| | story | terms | tfidf |
|---|---|---|---|
| 0 | Scientists Turn To DNA Technology To Search Fo... | [scientists, turn, dna, technology, search, lo... | [loch, ness, scour, inhabitant, supposed] |
| 1 | Unusual Asteroid Could Be An Interstellar Gues... | [unusual, asteroid, interstellar, guest, solar... | [interstellar, immigrant, atypical, guest, unu... |
| 2 | China Marks Another Milestone In Quest To Beco... | [china, marks, milestone, quest, space, superp... | [superpower, chinese, milestone, quest, entered] |
| 3 | Terrifying Clip Shows Why You Should Never Run... | [terrifying, clip, shows, run, tree, thunderst... | [thunderstorms, clip, terrifying, yikes, run] |
| 4 | U.S. Climate Scientists Flee For France To ‘Ma... | [climate, scientists, flee, france, planet, gr... | [france, flee, continue, america, great] |
| ... | ... | ... | ... |
| 2173 | Treating a World Without Antibiotics? - Becaus... | [treating, world, antibiotics, overuse, antibi... | [antibiotics, outpacing, overuse, resistance, ... |
| 2174 | Russian Cargo Ship Docks At International Spac... | [russian, cargo, ship, docks, international, s... | [russian, station, spaceships, vote, expedition] |
| 2175 | Robots Play Catch, Starring Agile Justin And R... | [robots, play, catch, starring, agile, justin,... | [justin, agile, dlr, hizook, ratios] |
| 2176 | Thomas Edison Voted Most Iconic Inventor In U.... | [thomas, edison, voted, iconic, inventor, hist... | [drove, relentlessly, voted, wake, inventor] |


In [28]:
stories_df.iloc[2]


story    China Marks Another Milestone In Quest To Beco...
terms    [china, marks, milestone, quest, space, superp...
tfidf     [superpower, chinese, milestone, quest, entered]
Name: 2, dtype: object

Although TF/IDF does a good job of distinguishing between documents - identifying what is unique about a document - it doesn't use human meaning-making.

Algorithmic 'semantics' is not the same as human semantics.

It is worth considering how this might be a problem in a world that increasingly uses computation to process language.


### Latent Dirichlet Allocation (LDA)

However, there are approaches that are closer to human meaning-making than TF/IDF. LDA is one. For more detail on LDA, see the [LDA Wikipedia page](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation).
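
LDA assumes that each document is a mixture of topics and that each topic is a probability distribution over words; fitting a model means working backwards from the observed words to plausible topics. The sketch below runs that assumed generative story forwards on hypothetical hand-written topics, using only the standard library. It illustrates the idea and is not how gensim trains a model; `sample_dirichlet`, `generate_document` and `toy_topics` are illustrative names.

```python
import random

def sample_dirichlet(alpha, size, rng):
    """Draw a probability vector from a symmetric Dirichlet via gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, n_words, alpha=0.5, rng=None):
    """Generate words the LDA way: pick a topic per word, then a word from it."""
    rng = rng or random.Random()
    topic_names = list(topics)
    # each document gets its own mixture over the topics
    mixture = sample_dirichlet(alpha, len(topic_names), rng)
    words = []
    for _ in range(n_words):
        topic = rng.choices(topic_names, weights=mixture)[0]
        candidates = list(topics[topic])
        weights = [topics[topic][w] for w in candidates]
        words.append(rng.choices(candidates, weights=weights)[0])
    return mixture, words

# Hypothetical topics: each maps words to probabilities
toy_topics = {
    'space': {'mars': 0.5, 'nasa': 0.3, 'orbit': 0.2},
    'biology': {'dna': 0.6, 'cells': 0.4},
}
mixture, words = generate_document(toy_topics, n_words=8, rng=random.Random(0))
print(mixture, words)
```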

In [29]:
# create an lda model from our corpus and vocab - we need to specify the number of topics
lda_model = LdaModel(corpus=corpus, id2word=vocab, num_topics=20)


In [30]:
# view the topics in the model
for topic in lda_model.show_topics(num_topics=20,num_words=15):
    print("Topic "+str(topic[0])+"\n"+topic[1]+"\n")

Topic 0
0.022*"science" + 0.014*"borealis" + 0.013*"earth" + 0.011*"scientists" + 0.011*"world" + 0.010*"supermoon" + 0.010*"closely" + 0.009*"edt" + 0.008*"shuttle" + 0.008*"photos" + 0.008*"mars" + 0.007*"planet" + 0.007*"moon" + 0.007*"enterprise" + 0.007*"interior"

Topic 1
0.013*"edt" + 0.010*"psychosis" + 0.010*"april" + 0.009*"study" + 0.009*"shows" + 0.008*"earth" + 0.008*"water" + 0.007*"day" + 0.007*"change" + 0.007*"biology" + 0.007*"rule" + 0.007*"explanation" + 0.007*"force" + 0.007*"cognitive" + 0.007*"science"

Topic 2
0.016*"feathers" + 0.016*"maya" + 0.015*"suggests" + 0.014*"video" + 0.014*"study" + 0.011*"spitzer" + 0.011*"activism" + 0.011*"worked" + 0.011*"disorders" + 0.011*"add" + 0.011*"wild" + 0.010*"day" + 0.010*"new" + 0.008*"patterns" + 0.007*"human"

Topic 3
0.018*"april" + 0.015*"northern" + 0.014*"independents" + 0.013*"space" + 0.012*"lights" + 0.011*"miami" + 0.011*"sky" + 0.010*"black" + 0.010*"venus" + 0.009*"attracts" + 0.009*"aspects" + 0.009*"ameri


For each document, we can get the probability that the document belongs to a particular topic.

In [31]:
doc = stories_df['story'][1]
print("doc:\n",doc)
doc_topics = lda_model.get_document_topics(corpus[1],minimum_probability=0.3)
print("doc_topics:\n",doc_topics)
for topic in doc_topics:
    terms = [term for term, prob in lda_model.show_topic(topic[0])]
    print(terms)

doc:
 Unusual Asteroid Could Be An Interstellar Guest To Our Solar System - The supposed "interstellar immigrant" is located near Jupiter but has an atypical orbit.
doc_topics:
 [(19, 0.80538857)]
['video', 'higgs', 'solar', 'science', 'mass', 'boson', 'activity', 'new', 'livescience', 'count']



We can create a function to get the top terms for the top topic for each document. This will enable us to assign the top topic words to the original dataframe.

In [32]:
def get_topic_terms(idx):
    doc_topics = lda_model.get_document_topics(corpus[idx], minimum_probability=0.3)
    top_topic = doc_topics[0]
    return [term for term, prob in lda_model.show_topic(top_topic[0])]


In [33]:
# try out the function
get_topic_terms(1)


['video',
 'higgs',
 'solar',
 'science',
 'mass',
 'boson',
 'activity',
 'new',
 'livescience',
 'count']

In [34]:
# add to our original dataframe
stories_df['lda'] = stories_df.index.map(get_topic_terms)
stories_df


IndexError: list index out of range
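
The `IndexError` most likely comes from `doc_topics[0]`: for a story where no topic reaches `minimum_probability=0.3`, `get_document_topics` returns an empty list, so there is no first element to take. The returned list also does not appear to be ordered by probability, so the first entry is not necessarily the strongest topic. A minimal sketch of a guard follows; `pick_top_topic` and `safe_topic_terms` are hypothetical helpers, and `model` is assumed to behave like the `LdaModel` used above.

```python
def pick_top_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability, or None."""
    if not doc_topics:
        return None
    return max(doc_topics, key=lambda pair: pair[1])

def safe_topic_terms(model, bow, minimum_probability=0.3):
    """Top-topic terms for one document, or [] when no topic passes the threshold.

    `model` is assumed to behave like the LdaModel used above.
    """
    top = pick_top_topic(
        model.get_document_topics(bow, minimum_probability=minimum_probability))
    if top is None:
        return []
    return [term for term, prob in model.show_topic(top[0])]

# The selection logic can be checked without a trained model
print(pick_top_topic([(3, 0.35), (19, 0.55)]))
print(pick_top_topic([]))
```

With these helpers, the dataframe column could be filled with `stories_df['lda'] = [safe_topic_terms(lda_model, bow) for bow in corpus]`, leaving an empty list for stories with no dominant topic.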

To help us explore the model, we can visualise the topics using pyLDAvis. **NOTE:** This visualisation can take a while to produce (up to 5 minutes) so be patient!

In [33]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, vocab)
vis
