<div style="background:#E9FFF6; color:#440404; padding:8px; border-radius: 4px; text-align: center; font-weight: 500;">IFN619 - Data Analytics for Strategic Decision Makers (2023_sem1)</div>

# IFN619 :: C1-UnstructuredAnalytics

For this session, the focus will be on analysis of unstructured text. However, the thinking required is similar to approaches to analysing images, video, sound and other unstructured data. Primarily, the analysis is based on the notion that there are useful patterns in the unstructured data which can be obtained mathematically. By converting the data to a mathematical structure, various algorithms can be applied to the structure with the aim of identifying patterns. 

In the case of the `topic modelling` approaches below, many of the techniques are *probabilistic* - that is they mathematically identify the *likelihood* that a feature might be important. Thus, they are never 100% accurate, and their use needs to be mediated by a more pragmatic *useful or not* approach, rather than *right or wrong*.

In [36]:
# Import the necessary libraries
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF
import pandas as pd
import json
import random

### Accessing the data via The Guardian API

See the `Accessing_the_Guardian_API.ipynb` notebook file for details on getting the data. **Note:** This approach may be used for additional data for Assignment 2.

### Read in pre-saved data

To save time, we're loading in pre-saved data that was fetched using the Guardian API.

In [2]:
# Load the data - articles from The Guardian about the war in Ukraine
file_path = "data/"
file_name = "ukraine_articles.json"

with open(f"{file_path}{file_name}",'r', encoding='utf-8') as fp:
    articles = json.load(fp)

print(f"Loaded {len(articles)} articles from {file_name}")

Loaded 202 articles from ukraine_articles.json


Each dictionary entry includes the *title [date]* as `key` and the *body text* from the article as `value`.

In [3]:
article1 = list(articles.items())[0]
print("Key:",article1[0])
print("Value:",article1[1][:300],"...") # Just show first 300 characters

Key: Russia-Ukraine war live: Ukraine ‘buying time’ in Bakhmut – as it happened [2023-03-12T18:26:12Z]
Value:   6.26pm GMT       Good evening, we are closing this blog now. You can see all our Ukraine coverage here       6.06pm GMT    A summary of today's developments     Ukraine and Russia both claimed that hundreds of enemy troops had been killed in the previous 24 hours in the fight for Bakhmut, with Kyi ...


So the values gives us a list of documents that we can analyse.

In [37]:
# Get a list of documents
documents = list(articles.values())

# View first 400 characters of the 1st document
documents[0][:400]

"  6.26pm GMT       Good evening, we are closing this blog now. You can see all our Ukraine coverage here       6.06pm GMT    A summary of today's developments     Ukraine and Russia both claimed that hundreds of enemy troops had been killed in the previous 24 hours in the fight for Bakhmut, with Kyiv fending off attacks and a small river that bisects the town now marking the new frontline. Serhiy "

### Term Count 

**Finding important terms by the frequency of their occurance**

Using `CountVectorizer` create a `vector` for each document where the dimensionality of the vector is the `vocabulary` (all terms in the collection), and the value of each component is the number of times that the `term` occurs in the document.

All of these analyses, approach the document as a [Bag of Words](https://en.wikipedia.org/wiki/Bag-of-words_model) model. In this approach, the order of the words don't matter. A popular approach that takes into account order is [Word embedding](https://en.wikipedia.org/wiki/Word_embedding). This session does not explore word embedding.

In [5]:
# Only count terms that in maximum of 75% of documents, and a minimum of 2 documents. 
# Count a maximum of 10000 terms, and remove common english stop words
count_vectorizer = CountVectorizer(max_df=0.75,min_df=2,max_features=10000,stop_words="english")
count_dt_matrix = count_vectorizer.fit_transform(articles.values())

In [6]:
# Take a look at the vector for the first document
doc001_vector = count_dt_matrix.toarray()[0]
doc001_vector

array([4, 0, 0, ..., 0, 0, 0])

In [7]:
# Get the 1000 terms identified during the vectorization process
feature_names = count_vectorizer.get_feature_names_out()
feature_names

array(['000', '000km', '00am', ..., 'подоляк', 'сhildren', 'та'],
      dtype=object)

In [8]:
# Look at how the counts match up to the terms (for the 1st doc)
doc001_term_counts = list(zip(feature_names,doc001_vector))
doc001_term_counts

[('000', 4),
 ('000km', 0),
 ('00am', 0),
 ('00pm', 1),
 ('01am', 0),
 ('01pm', 0),
 ('020', 0),
 ('02am', 0),
 ('02pm', 0),
 ('03am', 1),
 ('03pm', 1),
 ('04am', 1),
 ('04pm', 0),
 ('05am', 0),
 ('05pm', 1),
 ('06am', 0),
 ('06pm', 2),
 ('07am', 0),
 ('07pm', 0),
 ('08am', 0),
 ('08pm', 1),
 ('09am', 1),
 ('09pm', 0),
 ('10', 8),
 ('100', 1),
 ('1000', 0),
 ('100bn', 0),
 ('100m', 0),
 ('101', 0),
 ('106', 0),
 ('10am', 0),
 ('10bn', 0),
 ('10m', 0),
 ('10pm', 0),
 ('10th', 0),
 ('11', 11),
 ('110', 0),
 ('110m', 0),
 ('114', 0),
 ('115bn', 0),
 ('116', 0),
 ('11am', 1),
 ('11pm', 0),
 ('11th', 0),
 ('12', 6),
 ('120', 1),
 ('122', 0),
 ('123', 0),
 ('125mm', 0),
 ('129', 0),
 ('12am', 0),
 ('12pm', 0),
 ('13', 1),
 ('130', 0),
 ('131', 0),
 ('136', 0),
 ('13am', 1),
 ('13pm', 0),
 ('14', 0),
 ('140', 0),
 ('141', 0),
 ('147', 0),
 ('14am', 1),
 ('14bn', 0),
 ('14pm', 1),
 ('14th', 0),
 ('15', 3),
 ('150', 1),
 ('150bn', 0),
 ('150m', 2),
 ('152', 1),
 ('155', 0),
 ('155mm', 0),
 ('15

In [9]:
# Take a look at the vocabulary which shows the total counts for whole collection
count_vectorizer.vocabulary_

{'26pm': 191,
 'gmt': 4069,
 'good': 4083,
 'evening': 3369,
 'closing': 1943,
 'blog': 1381,
 'coverage': 2370,
 '06pm': 16,
 'summary': 8695,
 'today': 9079,
 'developments': 2757,
 'claimed': 1894,
 'hundreds': 4499,
 'enemy': 3240,
 'troops': 9229,
 'killed': 5062,
 'previous': 6847,
 '24': 176,
 'hours': 4472,
 'fight': 3677,
 'bakhmut': 1171,
 'kyiv': 5138,
 'attacks': 1067,
 'small': 8284,
 'river': 7727,
 'town': 9126,
 'marking': 5578,
 'frontline': 3920,
 'serhiy': 8049,
 'cherevatyi': 1833,
 'ukrainian': 9303,
 'military': 5764,
 'spokesperson': 8419,
 '221': 168,
 'pro': 6891,
 'moscow': 5884,
 '300': 211,
 'wounded': 9881,
 'defence': 2603,
 'ministry': 5787,
 '210': 162,
 'soldiers': 8312,
 'broader': 1528,
 'donetsk': 2958,
 'repelled': 7508,
 '92': 406,
 'assaults': 1020,
 'areas': 953,
 'past': 6453,
 'day': 2527,
 'general': 4007,
 'staff': 8460,
 'armed': 969,
 'forces': 3820,
 'institute': 4749,
 'study': 8624,
 'did': 2767,
 'make': 5509,
 'advances': 600,
 'saturd

#### Display matrix in dataframe

Take the term count matrix and display in a dataframe to make visible the structure


In [10]:
# Create a new dataframe with the matrix - use titles for the index and terms for the columns
count_df = pd.DataFrame(count_dt_matrix.toarray(), index=articles.keys(), columns=feature_names)
count_df

Unnamed: 0,000,000km,00am,00pm,01am,01pm,020,02am,02pm,03am,...,zoom,zoopark,zuma,çavuşoğlu,володимир,зеленський,михайло,подоляк,сhildren,та
Russia-Ukraine war live: Ukraine ‘buying time’ in Bakhmut – as it happened [2023-03-12T18:26:12Z],4,0,0,1,0,0,0,0,0,1,...,0,0,0,0,1,1,0,0,0,0
Russia-Ukraine war – as it happened: Ukraine to boost defences along border with Belarus [2023-04-08T17:16:35Z],3,0,0,0,0,0,2,0,1,1,...,0,0,0,0,0,0,1,1,4,0
Russia-Ukraine war: Zelenskiy and UK prime minister discuss accelerating military support for Ukraine – as it happened [2023-04-14T17:51:38Z],0,0,2,4,0,2,0,0,0,1,...,0,0,0,0,1,1,0,0,0,0
Russia-Ukraine war live: Putin visits Mariupol in first trip to occupied eastern Ukraine – as it happened [2023-03-19T19:12:28Z],3,0,0,0,0,2,0,0,1,0,...,0,0,1,0,0,0,0,0,0,0
"Russia-Ukraine war: Russia nearly shot down British spy plane near Ukraine, alleged leaked US document claims – as it happened [2023-04-10T18:24:19Z]",12,0,1,0,0,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
BoM shifts to El Niño watch after La Niña officially declared over – as it happened [2023-03-14T07:36:07Z],13,0,3,2,1,4,0,1,7,2,...,0,0,0,0,0,0,0,0,0,0
PM reportedly issued invitation to visit Beijing – as it happened [2023-04-04T08:40:05Z],2,0,1,0,2,0,0,4,0,3,...,0,0,0,0,0,0,0,0,0,0
Myer to close flagship Brisbane store – as it happened [2023-03-16T07:52:58Z],19,0,2,0,1,0,0,3,1,2,...,0,0,0,0,0,0,0,0,0,0
Voice referendum question and constitutional amendment could come tomorrow – as it happened [2023-03-22T07:38:24Z],6,0,2,1,4,1,0,2,0,4,...,0,0,0,0,0,0,0,0,0,0


By selecting a row from the dataframe and sorting the values (counts), we can identify the top 10 terms

In [65]:
# Sample 5 random articles
samples = random.sample(range(0,202),5)

for sample in samples:
    doc = count_df.iloc[sample]
    top_terms = dict(count_df.iloc[sample].sort_values(ascending=False).head(10))
    print(f"[{sample}] {doc.name}")
    print("\t- Top terms:",top_terms)

[105] Russian ambassador to the US says Moscow does not want confrontation after drone crash – as it happened [2023-03-15T00:38:07Z]
	- Top terms: {'gmt': 95, 'ukrainian': 71, 'forces': 55, 'drone': 52, 'military': 41, 'defence': 41, 'state': 36, 'black': 35, 'sea': 35, 'reports': 30}
[65] Russia-Ukraine war live: US sanctions over 120 people and entities supporting Russia’s invasion — as it happened [2023-04-12T17:59:24Z]
	- Top terms: {'bst': 86, 'ukrainian': 67, 'military': 47, 'sanctions': 41, 'forces': 34, 'video': 34, 'updated': 32, 'documents': 31, 'wednesday': 27, 'president': 27}
[151] Morning Mail: Dutton hangs on, Russian pro-war blogger killed in bomb blast, Sakamoto dies at 71 [2023-04-02T21:01:43Z]
	- Top terms: {'australia': 9, 'photograph': 6, 'says': 5, 'crisis': 5, 'melbourne': 4, 'images': 4, 'getty': 4, 'federal': 3, 'report': 3, 'day': 3}
[186] Government says opposition leader’s Aukus comments ‘irresponsible’ – as it happened [2023-03-01T08:06:45Z]
	- Top terms: {

#### Create a top10 terms dataframe

Using the index from the documents, create a dataframe that can hold the top10 terms for each document. We also include columns for our other analysis (tfidf, lda, nmf)

In [12]:
# Create a dataframe to hold top terms for each analysis type
terms_df = pd.DataFrame(index=count_df.index,columns=['count','tfidf','lda','nmf'])
terms_df

Unnamed: 0,count,tfidf,lda,nmf
Russia-Ukraine war live: Ukraine ‘buying time’ in Bakhmut – as it happened [2023-03-12T18:26:12Z],,,,
Russia-Ukraine war – as it happened: Ukraine to boost defences along border with Belarus [2023-04-08T17:16:35Z],,,,
Russia-Ukraine war: Zelenskiy and UK prime minister discuss accelerating military support for Ukraine – as it happened [2023-04-14T17:51:38Z],,,,
Russia-Ukraine war live: Putin visits Mariupol in first trip to occupied eastern Ukraine – as it happened [2023-03-19T19:12:28Z],,,,
"Russia-Ukraine war: Russia nearly shot down British spy plane near Ukraine, alleged leaked US document claims – as it happened [2023-04-10T18:24:19Z]",,,,
...,...,...,...,...
BoM shifts to El Niño watch after La Niña officially declared over – as it happened [2023-03-14T07:36:07Z],,,,
PM reportedly issued invitation to visit Beijing – as it happened [2023-04-04T08:40:05Z],,,,
Myer to close flagship Brisbane store – as it happened [2023-03-16T07:52:58Z],,,,
Voice referendum question and constitutional amendment could come tomorrow – as it happened [2023-03-22T07:38:24Z],,,,


Populate the count column with data created by the count vectorizer.

In [40]:
#For each doc, get the 10 columns with the largest counts
for idx in terms_df.index:
    counts = dict(count_df.loc[idx].sort_values(ascending=False).head(10))
    #print(counts)
    terms_df.at[idx,'count'] = list(counts.keys()) # Just the list of terms

terms_df

Unnamed: 0,count,tfidf,lda,nmf
Russia-Ukraine war live: Ukraine ‘buying time’ in Bakhmut – as it happened [2023-03-12T18:26:12Z],"[gmt, ukrainian, bakhmut, updated, forces, mos...","{'gmt': 0.4254, 'ukrainian': 0.1954, 'swiss': ...","[ukrainian, bst, military, forces, president, ...","[gmt, updated, march, 12, 2023, today, 11, 10,..."
Russia-Ukraine war – as it happened: Ukraine to boost defences along border with Belarus [2023-04-08T17:16:35Z],"[ukrainian, bst, children, photograph, forces,...","{'bst': 0.3907, 'ukrainian': 0.2819, 'children...","[ukrainian, bst, military, forces, president, ...","[ukrainian, bakhmut, forces, defence, city, ky..."
Russia-Ukraine war: Zelenskiy and UK prime minister discuss accelerating military support for Ukraine – as it happened [2023-04-14T17:51:38Z],"[bst, ukrainian, china, friday, military, defe...","{'bst': 0.4633, 'friday': 0.1966, 'teixeira': ...","[ukrainian, bst, military, forces, president, ...","[bst, updated, april, 2023, 11, photograph, re..."
Russia-Ukraine war live: Putin visits Mariupol in first trip to occupied eastern Ukraine – as it happened [2023-03-19T19:12:28Z],"[putin, gmt, mariupol, city, ukrainian, march,...","{'gmt': 0.3642, 'mariupol': 0.3471, 'putin': 0...","[ukrainian, bst, military, forces, president, ...","[gmt, updated, march, 12, 2023, today, 11, 10,..."
"Russia-Ukraine war: Russia nearly shot down British spy plane near Ukraine, alleged leaked US document claims – as it happened [2023-04-10T18:24:19Z]","[bst, ukrainian, documents, defence, air, city...","{'bst': 0.414, 'documents': 0.2066, 'ukrainian...","[ukrainian, bst, military, forces, president, ...","[documents, pentagon, leak, korea, classified,..."
...,...,...,...,...
BoM shifts to El Niño watch after La Niña officially declared over – as it happened [2023-03-14T07:36:07Z],"[gmt, australia, nuclear, aukus, updated, defe...","{'gmt': 0.458, 'aukus': 0.3142, 'australia': 0...","[gmt, australia, updated, bst, australian, say...","[gmt, updated, march, 12, 2023, today, 11, 10,..."
PM reportedly issued invitation to visit Beijing – as it happened [2023-04-04T08:40:05Z],"[bst, updated, tiktok, australia, australian, ...","{'bst': 0.5509, 'tiktok': 0.2961, 'updated': 0...","[gmt, australia, updated, bst, australian, say...","[bst, updated, april, 2023, 11, photograph, re..."
Myer to close flagship Brisbane store – as it happened [2023-03-16T07:52:58Z],"[gmt, australia, updated, says, lehrmann, auku...","{'gmt': 0.4896, 'lehrmann': 0.2857, 'keating':...","[gmt, australia, updated, bst, australian, say...","[gmt, updated, march, 12, 2023, today, 11, 10,..."
Voice referendum question and constitutional amendment could come tomorrow – as it happened [2023-03-22T07:38:24Z],"[gmt, updated, australia, prime, australian, s...","{'gmt': 0.5582, 'updated': 0.204, 'samoa': 0.1...","[gmt, australia, updated, bst, australian, say...","[gmt, updated, march, 12, 2023, today, 11, 10,..."


### Term Frequency / Inverse Document Frequency (TF/IDF)

**Finding terms that are very common in a document, but less common in the whole collection**

The [TF/IDF](https://en.wikipedia.org/wiki/Tf–idf) algorithm takes the term frequencies for a document and divides them by the frequencies of the terms in the whole collection.


In [14]:
# Only count terms that in maximum of 75% of documents, and a minimum of 2 documents. 
# Count a maximum of 10000 terms, and remove common english stop words
tfidf_vectorizer = TfidfVectorizer(
    max_df=0.75, min_df=2, max_features=10000, stop_words="english"
)

In [15]:
# Get the document vectors
tfidf_dt_matrix = tfidf_vectorizer.fit_transform(articles.values())

# Display the vector for the first document
tfidf_dt_matrix.toarray()[0]

array([0.0210132, 0.       , 0.       , ..., 0.       , 0.       ,
       0.       ])

#### Display matrix in dataframe

In [16]:
tfidf_df = pd.DataFrame(tfidf_dt_matrix.toarray(), index=articles.keys(), columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df

Unnamed: 0,000,000km,00am,00pm,01am,01pm,020,02am,02pm,03am,...,zoom,zoopark,zuma,çavuşoğlu,володимир,зеленський,михайло,подоляк,сhildren,та
Russia-Ukraine war live: Ukraine ‘buying time’ in Bakhmut – as it happened [2023-03-12T18:26:12Z],0.021013,0.0,0.000000,0.007957,0.000000,0.000000,0.000000,0.000000,0.000000,0.008099,...,0.0,0.0,0.000000,0.0,0.012844,0.012844,0.000000,0.000000,0.000000,0.0
Russia-Ukraine war – as it happened: Ukraine to boost defences along border with Belarus [2023-04-08T17:16:35Z],0.019765,0.0,0.000000,0.000000,0.000000,0.000000,0.039577,0.000000,0.010642,0.010158,...,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.016531,0.016531,0.083775,0.0
Russia-Ukraine war: Zelenskiy and UK prime minister discuss accelerating military support for Ukraine – as it happened [2023-04-14T17:51:38Z],0.000000,0.0,0.008853,0.018321,0.000000,0.009496,0.000000,0.000000,0.000000,0.004662,...,0.0,0.0,0.000000,0.0,0.007393,0.007393,0.000000,0.000000,0.000000,0.0
Russia-Ukraine war live: Putin visits Mariupol in first trip to occupied eastern Ukraine – as it happened [2023-03-19T19:12:28Z],0.018333,0.0,0.000000,0.000000,0.000000,0.019190,0.000000,0.000000,0.009871,0.000000,...,0.0,0.0,0.015772,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
"Russia-Ukraine war: Russia nearly shot down British spy plane near Ukraine, alleged leaked US document claims – as it happened [2023-04-10T18:24:19Z]",0.047873,0.0,0.005840,0.000000,0.000000,0.006264,0.000000,0.000000,0.006444,0.000000,...,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
BoM shifts to El Niño watch after La Niña officially declared over – as it happened [2023-03-14T07:36:07Z],0.015163,0.0,0.005122,0.003533,0.001815,0.007325,0.000000,0.001866,0.013188,0.003597,...,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
PM reportedly issued invitation to visit Beijing – as it happened [2023-04-04T08:40:05Z],0.004288,0.0,0.003138,0.000000,0.006670,0.000000,0.000000,0.013719,0.000000,0.009915,...,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
Myer to close flagship Brisbane store – as it happened [2023-03-16T07:52:58Z],0.037121,0.0,0.005720,0.000000,0.003039,0.000000,0.000000,0.009377,0.003156,0.006024,...,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
Voice referendum question and constitutional amendment could come tomorrow – as it happened [2023-03-22T07:38:24Z],0.011477,0.0,0.005600,0.002897,0.011903,0.003003,0.000000,0.006120,0.000000,0.011796,...,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0


#### Update the terms matrix

In [41]:
for idx in terms_df.index:
    tfidf = dict(tfidf_df.loc[idx].sort_values(ascending=False).head(10))
    #print(counts)
    terms_df.at[idx,'tfidf'] = list(tfidf.keys()) 

terms_df

Unnamed: 0,count,tfidf,lda,nmf
Russia-Ukraine war live: Ukraine ‘buying time’ in Bakhmut – as it happened [2023-03-12T18:26:12Z],"[gmt, ukrainian, bakhmut, updated, forces, mos...","[gmt, ukrainian, swiss, bakhmut, updated, satu...","[ukrainian, bst, military, forces, president, ...","[gmt, updated, march, 12, 2023, today, 11, 10,..."
Russia-Ukraine war – as it happened: Ukraine to boost defences along border with Belarus [2023-04-08T17:16:35Z],"[ukrainian, bst, children, photograph, forces,...","[bst, ukrainian, children, reunited, photograp...","[ukrainian, bst, military, forces, president, ...","[ukrainian, bakhmut, forces, defence, city, ky..."
Russia-Ukraine war: Zelenskiy and UK prime minister discuss accelerating military support for Ukraine – as it happened [2023-04-14T17:51:38Z],"[bst, ukrainian, china, friday, military, defe...","[bst, friday, teixeira, ukrainian, china, upda...","[ukrainian, bst, military, forces, president, ...","[bst, updated, april, 2023, 11, photograph, re..."
Russia-Ukraine war live: Putin visits Mariupol in first trip to occupied eastern Ukraine – as it happened [2023-03-19T19:12:28Z],"[putin, gmt, mariupol, city, ukrainian, march,...","[gmt, mariupol, putin, ukrainian, city, warran...","[ukrainian, bst, military, forces, president, ...","[gmt, updated, march, 12, 2023, today, 11, 10,..."
"Russia-Ukraine war: Russia nearly shot down British spy plane near Ukraine, alleged leaked US document claims – as it happened [2023-04-10T18:24:19Z]","[bst, ukrainian, documents, defence, air, city...","[bst, documents, ukrainian, easter, orthodox, ...","[ukrainian, bst, military, forces, president, ...","[documents, pentagon, leak, korea, classified,..."
...,...,...,...,...
BoM shifts to El Niño watch after La Niña officially declared over – as it happened [2023-03-14T07:36:07Z],"[gmt, australia, nuclear, aukus, updated, defe...","[gmt, aukus, australia, nuclear, submarine, su...","[gmt, australia, updated, bst, australian, say...","[gmt, updated, march, 12, 2023, today, 11, 10,..."
PM reportedly issued invitation to visit Beijing – as it happened [2023-04-04T08:40:05Z],"[bst, updated, tiktok, australia, australian, ...","[bst, tiktok, updated, gambling, devices, aust...","[gmt, australia, updated, bst, australian, say...","[bst, updated, april, 2023, 11, photograph, re..."
Myer to close flagship Brisbane store – as it happened [2023-03-16T07:52:58Z],"[gmt, australia, updated, says, lehrmann, auku...","[gmt, lehrmann, keating, australia, updated, p...","[gmt, australia, updated, bst, australian, say...","[gmt, updated, march, 12, 2023, today, 11, 10,..."
Voice referendum question and constitutional amendment could come tomorrow – as it happened [2023-03-22T07:38:24Z],"[gmt, updated, australia, prime, australian, s...","[gmt, updated, samoa, australia, australian, p...","[gmt, australia, updated, bst, australian, say...","[gmt, updated, march, 12, 2023, today, 11, 10,..."


#### Compare approaches

In [42]:
# Sample 5 random articles
samples = random.sample(range(0,202),5)

for sample in samples:
    doc = terms_df.iloc[sample]
    print(f"[{sample}] {doc.name}")
    print("\t- Counts:\t",doc['count'])
    print("\t- TFIDF:\t",doc['tfidf'])
    print()

[82] Russia-Ukraine war: Zaporizhzhia nuclear plant reconnected to energy grid as UN warns ‘one day our luck will run out’ – as it happened [2023-03-09T18:50:51Z]
	- Counts:	 ['gmt', 'power', 'kyiv', 'strikes', 'ukrainian', 'plant', 'region', 'thursday', 'nuclear', 'updated']
	- TFIDF:	 ['gmt', 'kyiv', 'power', 'strikes', 'plant', 'thursday', 'lviv', 'ukrainian', 'updated', 'zaporizhzhia']

[101] Russia-Ukraine war: Xi to visit Russia as early as next week; Moscow says it could agree to shorter Black Sea grain deal – as it happened [2023-03-13T18:48:18Z]
	- Counts:	 ['gmt', 'ukrainian', 'forces', 'photograph', 'defence', 'military', 'updated', 'moscow', 'monday', 'president']
	- TFIDF:	 ['gmt', 'matsievskyi', 'ukrainian', 'forces', 'patriarch', 'updated', 'photograph', 'warrants', 'pope', 'monday']

[179] PM urged to ‘reset tone’ of voice debate – as it happened [2023-04-11T08:41:13Z]
	- Counts:	 ['bst', 'leeser', 'voice', 'updated', 'australia', 'julian', 'shadow', 'australian', 'dutt

### Topic modelling with Latent Dirichlet Allocation (LDA)

[LDA](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) is an algorithm for obtaining *topics* (a list of terms) from a document-term matrix. It is a generative probabilistic approach to *decomposition* of the document-term matrix into 2 factor matrices: document-topic and topic-term.

![img](https://editor.analyticsvidhya.com/uploads/26864dtm.JPG)

*Source: [Analytics Vidhya](https://www.analyticsvidhya.com/blog/2021/06/part-2-topic-modeling-and-latent-dirichlet-allocation-lda-using-gensim-and-sklearn/)*

The LDA model requires the number of topics to be set in advance. As it is a generative model, it also runs over a number of iterations. These values usually need to be experimented with to obtain quality topics.

In [44]:
# Set number of topics
num_topics = 20
# Set max number of iteractions
max_iterations = 100

# Create the model
lda_model = LatentDirichletAllocation(n_components=num_topics,max_iter=max_iterations,learning_method='online')

# Fit the model to the data, and use the model to transform the data (do the decomposition)
doc_topic_matrix = lda_model.fit_transform(count_dt_matrix)

# Obtain the topics
topic_term_matrix = lda_model.components_

#### View the topics

In [51]:
# Get the topics and their terms
lda_topic_dict = {}
for index, topic in enumerate(topic_term_matrix):
    zipped = zip(feature_names, topic)
    top_terms=dict(sorted(zipped, key = lambda t: t[1], reverse=True)[:10])
    print(top_terms)
    top_terms_list= {key : round(top_terms[key], 4) for key in top_terms.keys()}
    lda_topic_dict[f"topic_{index}"] = top_terms_list

# Print the topics with their terms    
for k,v in lda_topic_dict.items():
    print(k)
    print(v)
    print()

{'bst': 0.05062464022508545, 'ukrainian': 0.05061109650082177, 'defence': 0.0504366215310309, 'military': 0.05042837788002845, 'gmt': 0.050419838073145165, 'updated': 0.05039230826154305, 'photograph': 0.05037395271982523, 'ministry': 0.05036467842970517, 'reuters': 0.050357087389385005, 'kyiv': 0.05035115512062176}
{'ukrainian': 0.051202509519921266, 'bst': 0.05116161690282603, 'gmt': 0.05072423087382788, 'president': 0.050671732857621235, 'updated': 0.05065403289872125, 'military': 0.05060117887101784, 'defence': 0.05059045647762943, 'forces': 0.05058565232149159, 'kyiv': 0.05049298563817203, 'putin': 0.05048584778797993}
{'ukrainian': 0.05137603896664686, 'bst': 0.050922551015423645, 'defence': 0.050879954925659554, 'bakhmut': 0.05083775879006237, 'forces': 0.05069537350784578, 'city': 0.05065282893988138, 'military': 0.05064124135067999, 'foreign': 0.05064004161059797, 'moscow': 0.050616215260691946, 'state': 0.05059373141348848}
{'ukrainian': 0.05125545174709058, 'president': 0.05

#### List of topics for each document

In [52]:
doc_topic_matrix[0]

array([1.74398326e-05, 1.74398326e-05, 1.74398326e-05, 1.74398326e-05,
       1.74398329e-05, 1.74398326e-05, 1.74398326e-05, 1.74398326e-05,
       9.99668643e-01, 1.74398326e-05, 1.74398326e-05, 1.74398344e-05,
       1.74398326e-05, 1.74398326e-05, 1.74398330e-05, 1.74398326e-05,
       1.74398326e-05, 1.74398326e-05, 1.74398326e-05, 1.74398326e-05])

In [53]:
doc_topic_matrix[200]

array([5.41653125e-06, 5.41653125e-06, 5.41653125e-06, 5.41653125e-06,
       5.41653135e-06, 5.41653125e-06, 5.41653125e-06, 5.41653125e-06,
       5.41653137e-06, 5.41653125e-06, 5.41653125e-06, 7.62981737e-01,
       5.41653126e-06, 5.41653126e-06, 2.36920765e-01, 5.41653125e-06,
       5.41653125e-06, 5.41653125e-06, 5.41653125e-06, 5.41653125e-06])

#### Update the terms matrix

In [58]:
for idx,topic in enumerate(doc_topic_matrix):
    topic_num = topic.argmax()
    top_topic = lda_topic_dict[f"topic_{topic_num}"]
    terms_df['lda'].iloc[idx] = list(top_topic.keys())

terms_df

Unnamed: 0,count,tfidf,lda,nmf
Russia-Ukraine war live: Ukraine ‘buying time’ in Bakhmut – as it happened [2023-03-12T18:26:12Z],"[gmt, ukrainian, bakhmut, updated, forces, mos...","[gmt, ukrainian, swiss, bakhmut, updated, satu...","[ukrainian, gmt, bst, forces, military, presid...","[gmt, updated, march, 12, 2023, today, 11, 10,..."
Russia-Ukraine war – as it happened: Ukraine to boost defences along border with Belarus [2023-04-08T17:16:35Z],"[ukrainian, bst, children, photograph, forces,...","[bst, ukrainian, children, reunited, photograp...","[ukrainian, gmt, bst, forces, military, presid...","[ukrainian, bakhmut, forces, defence, city, ky..."
Russia-Ukraine war: Zelenskiy and UK prime minister discuss accelerating military support for Ukraine – as it happened [2023-04-14T17:51:38Z],"[bst, ukrainian, china, friday, military, defe...","[bst, friday, teixeira, ukrainian, china, upda...","[ukrainian, gmt, bst, forces, military, presid...","[bst, updated, april, 2023, 11, photograph, re..."
Russia-Ukraine war live: Putin visits Mariupol in first trip to occupied eastern Ukraine – as it happened [2023-03-19T19:12:28Z],"[putin, gmt, mariupol, city, ukrainian, march,...","[gmt, mariupol, putin, ukrainian, city, warran...","[ukrainian, gmt, bst, forces, military, presid...","[gmt, updated, march, 12, 2023, today, 11, 10,..."
"Russia-Ukraine war: Russia nearly shot down British spy plane near Ukraine, alleged leaked US document claims – as it happened [2023-04-10T18:24:19Z]","[bst, ukrainian, documents, defence, air, city...","[bst, documents, ukrainian, easter, orthodox, ...","[ukrainian, gmt, bst, forces, military, presid...","[documents, pentagon, leak, korea, classified,..."
...,...,...,...,...
BoM shifts to El Niño watch after La Niña officially declared over – as it happened [2023-03-14T07:36:07Z],"[gmt, australia, nuclear, aukus, updated, defe...","[gmt, aukus, australia, nuclear, submarine, su...","[australia, gmt, aukus, nuclear, china, submar...","[gmt, updated, march, 12, 2023, today, 11, 10,..."
PM reportedly issued invitation to visit Beijing – as it happened [2023-04-04T08:40:05Z],"[bst, updated, tiktok, australia, australian, ...","[bst, tiktok, updated, gambling, devices, aust...","[gmt, updated, australia, bst, australian, say...","[bst, updated, april, 2023, 11, photograph, re..."
Myer to close flagship Brisbane store – as it happened [2023-03-16T07:52:58Z],"[gmt, australia, updated, says, lehrmann, auku...","[gmt, lehrmann, keating, australia, updated, p...","[gmt, updated, australia, bst, australian, say...","[gmt, updated, march, 12, 2023, today, 11, 10,..."
Voice referendum question and constitutional amendment could come tomorrow – as it happened [2023-03-22T07:38:24Z],"[gmt, updated, australia, prime, australian, s...","[gmt, updated, samoa, australia, australian, p...","[gmt, updated, australia, bst, australian, say...","[gmt, updated, march, 12, 2023, today, 11, 10,..."


#### Compare approaches

In [57]:
# Sample 5 random articles
samples = random.sample(range(0,202),5)

for sample in samples:
    doc = terms_df.iloc[sample]
    print(f"[{sample}] {doc.name}")
    print("\t- Counts:\t",doc['count'])
    print("\t- TFIDF:\t",doc['tfidf'])
    print("\t- LDA:\t\t",doc['lda'])
    print()

[153] Australia’s drought planning should begin now, not when the rain dries up | Gabrielle Chan [2023-03-06T14:00:28Z]
	- Counts:	 ['farmers', 'drought', 'years', 'record', 'australian', 'agricultural', 'scenario', 'australia', 'climate', 'dry']
	- TFIDF:	 ['drought', 'farmers', 'scenario', 'dry', 'floods', 'agricultural', 'record', 'drier', 'crops', 'climate']
	- LDA:		 ['ukrainian', 'bst', 'gmt', 'president', 'updated', 'military', 'defence', 'forces', 'kyiv', 'putin']

[58] Russia-Ukraine war at a glance: what we know on day 374 of the invasion [2023-03-04T01:30:24Z]
	- Counts:	 ['city', 'ukrainian', 'foreign', 'friday', 'western', 'bakhmut', 'president', 'eastern', 'biden', 'group']
	- TFIDF:	 ['city', 'ukrainian', 'garland', 'friday', 'kupiansk', 'serbia', 'evacuation', 'foreign', 'reznikov', 'biden']
	- LDA:		 ['nato', 'bst', 'president', 'finland', 'putin', 'military', 'ukrainian', 'children', 'state', 'security']

[57] ‘Algebra under air raids’: the children in a Ukraine war z

### Topic modelling with Non-negative Matrix Factorisation (NMF)


[NMF](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) is a different algorithm for obtaining *topics* (a list of terms) from a document-term matrix. It also factorises the document-term matrix into 2 factor matrices: document-topic and topic-term.

In [59]:
# Set the number of topics
num_topics = 20

# Create the model
nmf_model = NMF(n_components=num_topics,init='random',beta_loss='frobenius')

# Fit the model to the data and use it to transform the data
doc_topic_nmf = nmf_model.fit_transform(tfidf_dt_matrix)

topic_term_nmf = nmf_model.components_

In [61]:
# Get the topics and their terms
nmf_topic_dict = {}
for index, topic in enumerate(topic_term_nmf):
    zipped = zip(feature_names, topic)
    top_terms=dict(sorted(zipped, key = lambda t: t[1], reverse=True)[:10])
    #print(top_terms)
    top_terms_list= {key : round(top_terms[key], 4) for key in top_terms.keys()}
    nmf_topic_dict[f"topic_{index}"] = top_terms_list

# Print the topics with their terms    
for k,v in nmf_topic_dict.items():
    print(k)
    print(v)
    print()

topic_0
{'gmt': 4.1325, 'updated': 1.2509, 'march': 0.3253, '12': 0.2976, '2023': 0.277, 'today': 0.2655, '11': 0.2216, '10': 0.2041, 'says': 0.1871, 'twitter': 0.1797}

topic_1
{'bst': 2.011, 'updated': 0.5584, '2023': 0.1229, 'reuters': 0.1195, 'april': 0.1191, 'photograph': 0.1148, '11': 0.112, 'reports': 0.1056, '12': 0.1002, 'today': 0.0947}

topic_2
{'ukrainian': 1.2683, 'bakhmut': 0.9844, 'city': 0.7187, 'forces': 0.6798, 'defence': 0.3882, 'moscow': 0.3665, 'kyiv': 0.3664, 'military': 0.3482, 'ministry': 0.3014, 'eastern': 0.298}

topic_3
{'australia': 1.1637, 'photograph': 0.9061, 'crossword': 0.8935, 'notifications': 0.8529, 'day': 0.7912, 'guardian': 0.7466, 'morning': 0.7056, 'sydney': 0.6937, 'app': 0.6695, 'sign': 0.6634}

topic_4
{'imf': 4.113, 'growth': 2.1194, 'inflation': 1.4525, 'australia': 1.3082, 'economy': 1.1185, 'forecasts': 1.0599, 'global': 0.9709, 'economic': 0.8915, 'rates': 0.8137, 'chalmers': 0.7537}

topic_5
{'nato': 2.281, 'finland': 1.9902, 'military':

#### Update the terms matrix

In [62]:
for idx,topic in enumerate(doc_topic_nmf):
    topic_num = topic.argmax()
    top_topic = lda_topic_dict[f"topic_{topic_num}"]
    terms_df['nmf'].iloc[idx] = list(top_topic.keys())

terms_df

Unnamed: 0,count,tfidf,lda,nmf
Russia-Ukraine war live: Ukraine ‘buying time’ in Bakhmut – as it happened [2023-03-12T18:26:12Z],"[gmt, ukrainian, bakhmut, updated, forces, mos...","[gmt, ukrainian, swiss, bakhmut, updated, satu...","[ukrainian, gmt, bst, forces, military, presid...","[bst, ukrainian, defence, military, gmt, updat..."
Russia-Ukraine war – as it happened: Ukraine to boost defences along border with Belarus [2023-04-08T17:16:35Z],"[ukrainian, bst, children, photograph, forces,...","[bst, ukrainian, children, reunited, photograp...","[ukrainian, gmt, bst, forces, military, presid...","[ukrainian, bst, gmt, president, updated, mili..."
Russia-Ukraine war: Zelenskiy and UK prime minister discuss accelerating military support for Ukraine – as it happened [2023-04-14T17:51:38Z],"[bst, ukrainian, china, friday, military, defe...","[bst, friday, teixeira, ukrainian, china, upda...","[ukrainian, gmt, bst, forces, military, presid...","[ukrainian, bst, gmt, president, updated, mili..."
Russia-Ukraine war live: Putin visits Mariupol in first trip to occupied eastern Ukraine – as it happened [2023-03-19T19:12:28Z],"[putin, gmt, mariupol, city, ukrainian, march,...","[gmt, mariupol, putin, ukrainian, city, warran...","[ukrainian, gmt, bst, forces, military, presid...","[ukrainian, bst, updated, president, defence, ..."
"Russia-Ukraine war: Russia nearly shot down British spy plane near Ukraine, alleged leaked US document claims – as it happened [2023-04-10T18:24:19Z]","[bst, ukrainian, documents, defence, air, city...","[bst, documents, ukrainian, easter, orthodox, ...","[ukrainian, gmt, bst, forces, military, presid...","[ukrainian, bst, gmt, president, updated, mili..."
...,...,...,...,...
BoM shifts to El Niño watch after La Niña officially declared over – as it happened [2023-03-14T07:36:07Z],"[gmt, australia, nuclear, aukus, updated, defe...","[gmt, aukus, australia, nuclear, submarine, su...","[australia, gmt, aukus, nuclear, china, submar...","[qin, suppression, surely, modernisation, myth..."
PM reportedly issued invitation to visit Beijing – as it happened [2023-04-04T08:40:05Z],"[bst, updated, tiktok, australia, australian, ...","[bst, tiktok, updated, gambling, devices, aust...","[gmt, updated, australia, bst, australian, say...","[ukrainian, bst, gmt, president, updated, mili..."
Myer to close flagship Brisbane store – as it happened [2023-03-16T07:52:58Z],"[gmt, australia, updated, says, lehrmann, auku...","[gmt, lehrmann, keating, australia, updated, p...","[gmt, updated, australia, bst, australian, say...","[bst, ukrainian, defence, military, gmt, updat..."
Voice referendum question and constitutional amendment could come tomorrow – as it happened [2023-03-22T07:38:24Z],"[gmt, updated, australia, prime, australian, s...","[gmt, updated, samoa, australia, australian, p...","[gmt, updated, australia, bst, australian, say...","[bst, ukrainian, defence, military, gmt, updat..."


### Compare approaches

In [63]:
# Sample 5 random articles
samples = random.sample(range(0,202),5)

for sample in samples:
    doc = terms_df.iloc[sample]
    print(f"[{sample}] {doc.name}")
    print("\t- Counts:\t",doc['count'])
    print("\t- TFIDF:\t",doc['tfidf'])
    print("\t- LDA:\t\t",doc['lda'])
    print("\t- NMF:\t\t",doc['nmf'])
    print()

[4] Russia-Ukraine war: Russia nearly shot down British spy plane near Ukraine, alleged leaked US document claims – as it happened [2023-04-10T18:24:19Z]
	- Counts:	 ['bst', 'ukrainian', 'documents', 'defence', 'air', 'city', 'bakhmut', 'president', 'updated', 'south']
	- TFIDF:	 ['bst', 'documents', 'ukrainian', 'easter', 'orthodox', 'palm', 'bakhmut', 'updated', 'air', 'defence']
	- LDA:		 ['ukrainian', 'gmt', 'bst', 'forces', 'military', 'president', 'defence', 'updated', 'bakhmut', 'city']
	- NMF:		 ['ukrainian', 'bst', 'gmt', 'president', 'updated', 'military', 'defence', 'forces', 'kyiv', 'putin']

[64] What happened in the Russia-Ukraine war this week? Catch up with the must-read news and analysis [2023-04-21T20:00:50Z]
	- Counts:	 ['ukrainian', 'reported', 'photograph', 'kyiv', 'military', 'city', 'state', 'lula', 'putin', 'grain']
	- TFIDF:	 ['ukrainian', 'lula', 'belgorod', 'brazil', 'accidentally', 'reported', 'kyiv', 'corruption', 'photograph', 'easter']
	- LDA:		 ['ukraini

In [64]:
doc

count    [says, portrait, finlay, archibald, blake, pho...
tfidf    [finlay, portrait, archibald, wiggins, blake, ...
lda      [australia, gmt, aukus, nuclear, china, submar...
nmf      [gmt, updated, australia, bst, australian, say...
Name: ‘We get to hear the stories’: unpacking the Archibald prize at the Art Gallery of NSW [2023-03-29T14:00:36Z], dtype: object