<div style="background:#E9FFF6; color:#440404; padding:8px; border-radius: 4px; text-align: center; font-weight: 500;">IFN619 - Data Analytics for Strategic Decision Makers (2024_sem1)</div>

# IFN619 :: C1-UnstructuredAnalytics

For this tutorial, you will use the studio notebook as a guide, and:

1. Use the Guardian API to undertake your own search and obtain a json file of documents
2. Create a TF/IDF document-term matrix for your documents
3. Perform topic modelling of your documents using NMF

In [1]:
# Import the necessary libraries
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF
import pandas as pd
import json
import random



### 1. Accessing the data via The Guardian API

Make a copy of the studio notebook file, and modify it to perform your own search of the Guardian API. **NOTE:** you will need to obtain a developer API key first.

A suggested search term is "ukraine", or come up with another that is of interest to you and will return a fair amount of data.

Save your search results in a json file, then read in that data below...
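One way to produce such a file is sketched below. It assumes the Guardian Content API search endpoint and its `q`, `api-key`, `show-fields` and `page-size` parameters, and it assumes you want a dictionary keyed by "title [publication date]", which is the shape of the data loaded in the next cell. The function names are hypothetical, and the studio notebook may do this differently.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint for the Guardian Content API search
GUARDIAN_SEARCH_URL = "https://content.guardianapis.com/search"


def fetch_results(query, api_key, page_size=50):
    """Request one page of search results, including each article's body text."""
    params = urllib.parse.urlencode({
        "q": query,
        "api-key": api_key,
        "show-fields": "bodyText",
        "page-size": page_size,
    })
    with urllib.request.urlopen(f"{GUARDIAN_SEARCH_URL}?{params}") as resp:
        return json.load(resp)["response"]["results"]


def results_to_articles(results):
    """Map 'title [publication date]' to body text, skipping items with no text."""
    articles = {}
    for item in results:
        text = item.get("fields", {}).get("bodyText", "")
        if text:
            key = f"{item['webTitle']} [{item['webPublicationDate']}]"
            articles[key] = text
    return articles


def save_articles(articles, path):
    """Write the articles dictionary to a UTF-8 JSON file."""
    with open(path, "w", encoding="utf-8") as fp:
        json.dump(articles, fp, ensure_ascii=False, indent=2)


# Example usage (needs a real developer key and network access):
# articles = results_to_articles(fetch_results("ukraine", "YOUR_API_KEY"))
# save_articles(articles, "data/ukraine_articles.json")
```

Keeping the download step separate from the conversion step means the conversion can be checked on a small hand-made result list before spending API quota.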

In [2]:
# Load the data - articles from The Guardian
file_path = "data/"
file_name = "ukraine_articles.json"

with open(f"{file_path}{file_name}",'r', encoding='utf-8') as fp:
    articles = json.load(fp)

print(f"Loaded {len(articles)} articles from {file_name}")

Loaded 99 articles from ukraine_articles.json


#### Create a top terms dataframe

Using the document titles as the index, create a dataframe that can hold the top terms for each document: the five highest-weighted TF/IDF terms, and the ten terms of the document's dominant NMF topic.

In [3]:
# Create a dataframe to hold top terms for each analysis type
terms_df = pd.DataFrame(index=articles.keys(),columns=['tfidf','nmf'])
terms_df

| | tfidf | nmf |
|---|---|---|
| Ukraine war briefing: Nato foreign ministers to discuss proposal for €100bn fund for Ukraine [2024-04-02T23:54:56Z] | | |
| Ukraine war briefing: Biden scrapes together $300m more for Ukraine weapons [2024-03-13T00:33:45Z] | | |
| Ukraine war briefing: Russian forces occupying Ukraine use torture as ‘policy’, says UN expert [2024-03-09T02:58:43Z] | | |
| Ukraine war briefing: Russian fighter jet crashes off Crimea [2024-03-29T02:35:32Z] | | |
| Ukraine war briefing: Pope urges Ukraine to have courage of ‘white flag’ and negotiate end to war [2024-03-10T01:50:45Z] | | |
| ... | ... | ... |
| Israel-Gaza war: US condemns ‘cynical’ Russia and China veto of ceasefire deal; Israel to go into Rafah ‘with or without US support’ – as it happened [2024-03-22T17:44:42Z] | | |
| Collapse at regional mine – as it happened [2024-03-13T07:19:08Z] | | |
| Nowland family reach settlement – as it happened [2024-03-07T07:25:37Z] | | |
| King takes aim at Liberals over preselection of women – as it happened [2024-03-25T07:11:38Z] | | |


### Term Frequency / Inverse Document Frequency (TF/IDF)


In [4]:
# Set parameters appropriate to your data
tfidf_vectorizer = TfidfVectorizer(
    max_df=0.75, min_df=2, max_features=10000, stop_words="english"
)

In [5]:
# Get the document vectors
tfidf_dt_matrix = tfidf_vectorizer.fit_transform(articles.values())

# Display the vector for the first document
tfidf_dt_matrix.toarray()[0]

array([0.08039513, 0.        , 0.        , ..., 0.        , 0.        ,
       0.        ])

#### Update the terms matrix

In [6]:
# list of feature names
feature_names = tfidf_vectorizer.get_feature_names_out()

# create a df to combine matrix with feature names
tfidf_df = pd.DataFrame(tfidf_dt_matrix.toarray(), index=articles.keys(), columns=feature_names)
tfidf_df

| | 000 | 000km | 00am | 00pm | 01am | 01pm | 02am | 02pm | 03am | 03pm | ... | zdaniel | zealand | zelenskiy | zero | zoe | zomi | zone | zones | zoo | zuhri |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ukraine war briefing: Nato foreign ministers to discuss proposal for €100bn fund for Ukraine [2024-04-02T23:54:56Z] | 0.080395 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.030689 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 |
| Ukraine war briefing: Biden scrapes together $300m more for Ukraine weapons [2024-03-13T00:33:45Z] | 0.000000 | 0.048352 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 |
| Ukraine war briefing: Russian forces occupying Ukraine use torture as ‘policy’, says UN expert [2024-03-09T02:58:43Z] | 0.051112 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.087800 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 |
| Ukraine war briefing: Russian fighter jet crashes off Crimea [2024-03-29T02:35:32Z] | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.130636 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 |
| Ukraine war briefing: Pope urges Ukraine to have courage of ‘white flag’ and negotiate end to war [2024-03-10T01:50:45Z] | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Israel-Gaza war: US condemns ‘cynical’ Russia and China veto of ceasefire deal; Israel to go into Rafah ‘with or without US support’ – as it happened [2024-03-22T17:44:42Z] | 0.019069 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.004061 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.005059 | 0.000000 | 0.0 |
| Collapse at regional mine – as it happened [2024-03-13T07:19:08Z] | 0.012499 | 0.000000 | 0.008862 | 0.000000 | 0.008862 | 0.000000 | 0.000000 | 0.005091 | 0.009084 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.009323 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 |
| Nowland family reach settlement – as it happened [2024-03-07T07:25:37Z] | 0.042069 | 0.004821 | 0.006779 | 0.011683 | 0.010169 | 0.003390 | 0.007131 | 0.003894 | 0.000000 | 0.003665 | ... | 0.000000 | 0.000000 | 0.000000 | 0.014263 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 |
| King takes aim at Liberals over preselection of women – as it happened [2024-03-25T07:11:38Z] | 0.021930 | 0.000000 | 0.006479 | 0.003722 | 0.009718 | 0.006479 | 0.006815 | 0.003722 | 0.003320 | 0.000000 | ... | 0.004922 | 0.000000 | 0.000000 | 0.003408 | 0.011551 | 0.0 | 0.006641 | 0.000000 | 0.000000 | 0.0 |


In [7]:
# Record the five highest-weighted TF/IDF terms for each document
for idx in terms_df.index:
    top_tfidf = tfidf_df.loc[idx].sort_values(ascending=False).head(5)
    terms_df.at[idx,'tfidf'] = list(top_tfidf.index)

terms_df

| | tfidf | nmf |
|---|---|---|
| Ukraine war briefing: Nato foreign ministers to discuss proposal for €100bn fund for Ukraine [2024-04-02T23:54:56Z] | [adm, nato, russian, tuesday, refinery] | |
| Ukraine war briefing: Biden scrapes together $300m more for Ukraine weapons [2024-03-13T00:33:45Z] | [johnson, russian, kursk, tuesday, nizhny] | |
| Ukraine war briefing: Russian forces occupying Ukraine use torture as ‘policy’, says UN expert [2024-03-09T02:58:43Z] | [friday, torture, russian, serbia, ukrainian] | |
| Ukraine war briefing: Russian fighter jet crashes off Crimea [2024-03-29T02:35:32Z] | [polish, thursday, ukrainian, russian, poland] | |
| Ukraine war briefing: Pope urges Ukraine to have courage of ‘white flag’ and negotiate end to war [2024-03-10T01:50:45Z] | [saturday, boy, ria, moscow, afraid] | |
| ... | ... | ... |
| Israel-Gaza war: US condemns ‘cynical’ Russia and China veto of ceasefire deal; Israel to go into Rafah ‘with or without US support’ – as it happened [2024-03-22T17:44:42Z] | [gaza, israel, gmt, blinken, resolution] | |
| Collapse at regional mine – as it happened [2024-03-13T07:19:08Z] | [gmt, updated, australia, funding, government] | |
| Nowland family reach settlement – as it happened [2024-03-07T07:25:37Z] | [gmt, updated, super, women, parental] | |
| King takes aim at Liberals over preselection of women – as it happened [2024-03-25T07:11:38Z] | [gmt, updated, labor, government, australia] | |


### Topic modelling with Non-negative Matrix Factorisation (NMF)


[NMF](https://en.wikipedia.org/wiki/Non-negative_matrix_factorization) is a different algorithm from Latent Dirichlet Allocation (LDA) for obtaining *topics* (lists of weighted terms) from a document-term matrix. Like LDA, it factorises the document-term matrix into two factor matrices: document-topic and topic-term.
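The factorisation can be seen on a toy matrix before applying it to the articles. The sketch below uses made-up weights for 4 documents and 5 terms. The choice of `init="nndsvda"` and a fixed `random_state` is an assumption made here so that repeated runs give the same factors; it is not the setting used in the tutorial cell.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy document-term matrix: 4 documents x 5 terms (made-up weights).
# Documents 0-1 mostly use terms 0-2; documents 2-3 mostly use terms 3-4.
X = np.array([
    [0.9, 0.8, 0.7, 0.0, 0.0],
    [0.8, 0.9, 0.6, 0.1, 0.0],
    [0.0, 0.1, 0.0, 0.9, 0.8],
    [0.0, 0.0, 0.1, 0.8, 0.9],
])

# random_state fixes the initialisation so repeated runs give the same factors
model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(X)   # document-topic matrix, shape (4, 2)
H = model.components_        # topic-term matrix, shape (2, 5)

print(W.shape, H.shape)

# W @ H approximates X; the norm of the difference is the reconstruction error
print(np.linalg.norm(X - W @ H))
```

Each row of `W` says how strongly a document expresses each topic, and each row of `H` says how strongly a topic weights each term, which is why the largest entries in a row of `H` can be read as that topic's top terms.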

In [8]:
# Set the number of topics
num_topics = 20

# Create the model
nmf_model = NMF(n_components=num_topics, max_iter=200, init='random', beta_loss='frobenius')

# Fit the model to the data and use it to transform the data
doc_topic_nmf = nmf_model.fit_transform(tfidf_dt_matrix)

topic_term_nmf = nmf_model.components_



In [9]:
# Get the topics and their terms
nmf_topic_dict = {}
for index, topic in enumerate(topic_term_nmf):
    # Pair each term with its weight in this topic and keep the 10 heaviest
    zipped = zip(feature_names, topic)
    top_terms = dict(sorted(zipped, key=lambda t: t[1], reverse=True)[:10])
    # Round the weights for display
    nmf_topic_dict[f"topic_{index}"] = {term: round(weight, 4) for term, weight in top_terms.items()}

# Print the topics with their terms    
for k,v in nmf_topic_dict.items():
    print(k)
    print(v)
    print()

topic_0
{'gmt': 3.3166, 'updated': 1.5042, 'government': 0.8724, 'labor': 0.5851, 'australia': 0.5505, 'minister': 0.4633, 'greens': 0.415, 'says': 0.4012, 'discrimination': 0.3888, 'religious': 0.3851}

topic_1
{'australia': 0.5136, 'crossword': 0.3282, 'notifications': 0.3282, 'sign': 0.2947, 'day': 0.2929, 'morning': 0.2768, 'world': 0.274, 'story': 0.2598, 'sydney': 0.2404, 'australian': 0.2172}

topic_2
{'russian': 1.3995, 'saturday': 1.3263, 'sunday': 1.01, 'german': 0.93, 'drone': 0.8888, 'odesa': 0.8645, 'recording': 0.8532, 'kara': 0.8329, 'murza': 0.8329, 'navalny': 0.8086}

topic_3
{'says': 1.3199, 'australia': 0.9891, 'jiang': 0.9395, 'visas': 0.8944, 'families': 0.7617, 'palestinians': 0.72, 'support': 0.7177, 'mh370': 0.7127, 'humanitarian': 0.6996, 'gaza': 0.6702}

topic_4
{'polish': 2.0955, 'russian': 1.6905, 'missile': 1.5784, 'poland': 1.5721, 'ukrainian': 1.4107, 'air': 1.3873, 'missiles': 1.3235, 'sunday': 1.3219, 'energy': 1.3138, 'lviv': 1.2572}

topic_5
{'austral

#### Update the terms matrix

In [10]:
# Assign each document the terms of its highest-weighted (dominant) topic
for idx,topic in enumerate(doc_topic_nmf):
    topic_num = topic.argmax()
    top_topic = nmf_topic_dict[f"topic_{topic_num}"]
    # Single-step .at assignment avoids the pandas chained-assignment warning
    terms_df.at[terms_df.index[idx], 'nmf'] = list(top_topic.keys())

terms_df


| | tfidf | nmf |
|---|---|---|
| Ukraine war briefing: Nato foreign ministers to discuss proposal for €100bn fund for Ukraine [2024-04-02T23:54:56Z] | [adm, nato, russian, tuesday, refinery] | [russian, defence, nato, military, sergei, ger... |
| Ukraine war briefing: Biden scrapes together $300m more for Ukraine weapons [2024-03-13T00:33:45Z] | [johnson, russian, kursk, tuesday, nizhny] | [ukrainian, russian, refinery, region, oil, be... |
| Ukraine war briefing: Russian forces occupying Ukraine use torture as ‘policy’, says UN expert [2024-03-09T02:58:43Z] | [friday, torture, russian, serbia, ukrainian] | [russian, ukrainian, eu, zelenskiy, thursday, ... |
| Ukraine war briefing: Russian fighter jet crashes off Crimea [2024-03-29T02:35:32Z] | [polish, thursday, ukrainian, russian, poland] | [russian, ukrainian, eu, zelenskiy, thursday, ... |
| Ukraine war briefing: Pope urges Ukraine to have courage of ‘white flag’ and negotiate end to war [2024-03-10T01:50:45Z] | [saturday, boy, ria, moscow, afraid] | [russian, saturday, sunday, german, drone, ode... |
| ... | ... | ... |
| Israel-Gaza war: US condemns ‘cynical’ Russia and China veto of ceasefire deal; Israel to go into Rafah ‘with or without US support’ – as it happened [2024-03-22T17:44:42Z] | [gaza, israel, gmt, blinken, resolution] | [gaza, israel, hamas, israeli, aid, gmt, human... |
| Collapse at regional mine – as it happened [2024-03-13T07:19:08Z] | [gmt, updated, australia, funding, government] | [gmt, updated, government, labor, australia, m... |
| Nowland family reach settlement – as it happened [2024-03-07T07:25:37Z] | [gmt, updated, super, women, parental] | [gmt, updated, government, labor, australia, m... |
| King takes aim at Liberals over preselection of women – as it happened [2024-03-25T07:11:38Z] | [gmt, updated, labor, government, australia] | [gmt, updated, government, labor, australia, m... |


### Check against articles

In [11]:
# Sample 5 random articles
samples = random.sample(range(0,len(terms_df)),5)

for sample in samples:
    doc = terms_df.iloc[sample]
    print(f"[{sample}] {doc.name}")
    print("\t- TFIDF:\t",doc['tfidf'])
    print("\t- NMF:\t\t",doc['nmf'])
    print()

[66] Afternoon Update: Labor unveils fuel efficiency standard; Crown keeps Melbourne casino licence; and Diddy’s properties raided [2024-03-26T06:00:20Z]
	- TFIDF:	 ['festival', 'word', 'casino', 'crown', 'properties']
	- NMF:		 ['word', 'biden', 'starter', 'time', 'batty', 'festival', 'australia', 'transgender', 'post', 'casino']

[69] Afternoon Update: Taiwan rocked by 7.7 magnitude quake; Sam Mostyn to be next governor general; and what’s behind Sydney’s cocaine coast [2024-04-03T06:04:02Z]
	- TFIDF:	 ['word', 'batty', 'japan', 'cruise', 'starter']
	- NMF:		 ['word', 'biden', 'starter', 'time', 'batty', 'festival', 'australia', 'transgender', 'post', 'casino']

[28] Ukraine war briefing: Zelenskiy says Putin trying to falsely blame Kyiv for Moscow concert attack [2024-03-24T07:37:36Z]
	- TFIDF:	 ['lviv', 'air', 'saturday', 'attack', 'missiles']
	- NMF:		 ['polish', 'russian', 'missile', 'poland', 'ukrainian', 'air', 'missiles', 'sunday', 'energy', 'lviv']

[84] Middle East crisis: s

## Refine your analysis

Once you have worked through the process, try tweaking the parameters of the TF/IDF vectorizer and the NMF topic model to obtain better results for your data.
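One way to compare settings is to fit several models and look at a summary number alongside the topics themselves. The sketch below is a minimal illustration on a made-up mini corpus: it varies only the number of topics and records scikit-learn's `reconstruction_err_` for each fit. The helper name `reconstruction_errors`, the `nndsvda` initialisation and the fixed seed are assumptions made here. A lower error means a closer approximation of the matrix, not necessarily more readable topics, so the top terms still need to be inspected.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus standing in for the article texts
docs = [
    "russian missile strike hits ukrainian energy grid",
    "ukrainian air defence intercepts russian missile",
    "labor government announces budget funding",
    "government funding for housing says labor minister",
    "gaza aid convoy reaches israel border crossing",
    "israel and gaza ceasefire talks on aid",
]

matrix = TfidfVectorizer(stop_words="english").fit_transform(docs)


def reconstruction_errors(matrix, candidates, seed=0):
    """Fit one NMF model per candidate topic count and record its reconstruction error."""
    errors = {}
    for k in candidates:
        model = NMF(n_components=k, init="nndsvda", random_state=seed, max_iter=500)
        model.fit(matrix)
        errors[k] = model.reconstruction_err_
    return errors


for k, err in reconstruction_errors(matrix, [2, 3, 4]).items():
    print(f"{k} topics: reconstruction error {err:.3f}")
```

The same loop can be pointed at `tfidf_dt_matrix` with larger candidate values, and repeated for different `max_df` and `min_df` settings.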

#### Advanced

You may obtain better results by doing the following:

1. Creating smaller documents (e.g. article paragraphs)
2. Pre-processing the text by stemming or lemmatizing, and by removing additional stop words.
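Two of these suggestions can be sketched with scikit-learn alone. The example below assumes that paragraphs in the article text are separated by newlines, which may not hold for your data. The helper name `split_into_paragraphs` and the extra stop-word list (live-blog terms such as "gmt" and "updated", plus weekday names) are illustrative choices. Stemming or lemmatizing would need an additional library and is not shown.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer


def split_into_paragraphs(articles):
    """Turn {title: text} into {"title #n": paragraph}, one entry per non-empty line."""
    paragraphs = {}
    for title, text in articles.items():
        parts = [p.strip() for p in text.split("\n") if p.strip()]
        for n, part in enumerate(parts):
            paragraphs[f"{title} #{n}"] = part
    return paragraphs


# Illustrative additions to the built-in English stop-word list
extra_stop_words = {
    "gmt", "updated", "monday", "tuesday", "wednesday",
    "thursday", "friday", "saturday", "sunday",
}
stop_words = list(ENGLISH_STOP_WORDS.union(extra_stop_words))

# Made-up article used only to demonstrate the two steps
sample = {
    "Example article [2024-01-01T00:00:00Z]":
        "Updated at 10.00 GMT\n\nRussian missiles hit Lviv on Sunday.",
}
paragraph_docs = split_into_paragraphs(sample)

vectorizer = TfidfVectorizer(stop_words=stop_words)
vectorizer.fit(paragraph_docs.values())
print(sorted(vectorizer.vocabulary_))
```

Because the paragraph keys keep the article title, paragraph-level topics can still be traced back to the article they came from.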