# DTM Analysis Examples

Hello! Here I outline a couple of examples analysing DTM-induced topics, fit on a sample corpus from the paper:
""

This is by no means an exhaustive list of ways to analyse the DTM output; please feel free to fork the repository and extend the analysis.

In [1]:
from dtm_toolkit.dtm.analysis import DTMAnalysis

In [2]:
NDOCS = 2686 # number of lines in -mult.dat file.
NTOPICS = 30
analysis = DTMAnalysis(
    NDOCS, 
    NTOPICS,
    model_root="./example_dtm_model/",
    doc_year_map_file_name="model-year.dat",
    seq_dat_file_name="model-seq.dat",
    vocab_file_name="vocab.txt",
    model_out_dir="out",
    thesaurus=None, # defaults to thesaurus used in Scelsi et al. paper
)

Initialising EuroVoc...


## Quick look at the label thesaurus:

In [6]:
labels = analysis.auto_labelling.sorted_labels
print(f"NUM. LABELS: {len(labels)}")
print(f"labels: {labels}")

NUM. LABELS: 40
labels: ['0436 executive power and public service', '0821 defence', '1216 criminal law', '1611 economic conditions', '1616 regions and regional policy', '1631 economic analysis', '2016 business operations and trade', '2031 marketing', '2036 distributive trades', '2446 taxation', '2451 prices', '2816 demography and population', '3606 natural and applied sciences', '3611 humanities', '4026 accounting', '4811 organisation of transport', '4816 land transport', '5206 environmental policy', '5211 natural environment', '5216 deterioration of the environment', '5621 cultivation of agricultural land', '5631 agricultural activity', '6036 food technology', '6406 production', '6411 technology and technical regulations', '6416 research and intellectual property', '6611 coal and mining industries', '6616 oil and gas industry', '6621 electrical and nuclear industries', '6626 renewable energy', '6811 chemistry', '6816 iron, steel and other metal industries', '6821 mechanical engineerin

# DTM - Automatic Labeling
## Get automatic topic labels for each topic

In [7]:
# We can use either the match-based ("tfidf") strategy or the embedding-based ("embedding") strategy.
# NOTE: the GloVe embeddings are initialised only once, when labels are first created, so later
# embedding-based labelling runs are a fair amount faster than the first one.

emb_topic_labels = analysis.get_topic_labels(_type="embedding")
TOPIC = 1 # let's see the best 4 labels from topic 1

print(f"embedding labels: {emb_topic_labels[TOPIC]}")
print(f"top-10 words: {[x for _,x in analysis.top_word_arr[TOPIC]]}")
print("==========")

tfidf_topic_labels = analysis.get_topic_labels(_type="tfidf")

print(f"tfidf labels: {tfidf_topic_labels[TOPIC]}")
print(f"top-10 words: {[x for _,x in analysis.top_word_arr[TOPIC]]}")

Initialising gloVe embeddings...
embedding labels: [('6616 oil and gas industry', 0.89), ('6611 coal and mining industries', 0.88), ('6406 production', 0.81), ('6626 renewable energy', 0.77)]
top-10 words: ['coal', 'ton', 'production', 'cost', 'percent', 'productivity', 'export', 'u.s', 'increase', 'region']
tfidf labels: [('6611 coal and mining industries', 1.88), ('6406 production', 1.13), ('1616 regions and regional policy', 0.32), ('4026 accounting', 0.15)]
top-10 words: ['coal', 'ton', 'production', 'cost', 'percent', 'productivity', 'export', 'u.s', 'increase', 'region']
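
The two strategies agree on some labels and differ on others, and their scores are on different scales, so they are not directly comparable. As a minimal sketch (not part of `dtm_toolkit`; `shared_labels` is a hypothetical helper), the label lists printed above can be intersected to keep only the labels that both strategies propose:

```python
# Label lists copied from the output above: (label, score) pairs, best first.
emb_labels = [
    ("6616 oil and gas industry", 0.89),
    ("6611 coal and mining industries", 0.88),
    ("6406 production", 0.81),
    ("6626 renewable energy", 0.77),
]
tfidf_labels = [
    ("6611 coal and mining industries", 1.88),
    ("6406 production", 1.13),
    ("1616 regions and regional policy", 0.32),
    ("4026 accounting", 0.15),
]


def shared_labels(first, second):
    """Hypothetical helper: labels present in both lists, in the order of `first`."""
    second_names = {name for name, _ in second}
    return [name for name, _ in first if name in second_names]


agreed = shared_labels(emb_labels, tfidf_labels)
print(agreed)
```

Only label names are compared here; the scores are ignored because the two strategies score on different scales.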


# DTM best words for a topic
## Get the top-10 words for all topics

In [8]:
top_10 = analysis.get_top_words(n=10) # top-10
print(f"top-10: {top_10[TOPIC]}")
print("----------")
top_5 = analysis.get_top_words(n=5) # top-5
print(f"top-5: {top_5[TOPIC]}")
print("----------")
top_30 = analysis.get_top_words(n=30, with_prob=False) # top-30
print(f"top-30, with_prob=False: {top_30[TOPIC]}")

top-10: [(0.43320466929331153, 'coal'), (0.12312055209114217, 'ton'), (0.08423354362035818, 'production'), (0.06680367109857119, 'cost'), (0.05474056957099318, 'percent'), (0.048806098992659884, 'productivity'), (0.047816823540786764, 'export'), (0.047374993883538295, 'u.s'), (0.0471022988693782, 'increase'), (0.04679677903926059, 'region')]
----------
top-5: [(0.5684332249942692, 'coal'), (0.16155368916593393, 'ton'), (0.11052776723511393, 'production'), (0.08765701040564376, 'cost'), (0.07182830819903915, 'percent')]
----------
top-30, with_prob=False: ['coal', 'ton', 'production', 'cost', 'percent', 'productivity', 'export', 'u.s', 'increase', 'region', 'decline', 'price', 'project', 'low', 'mining', 'sulfur', 'year', 'demand', 'average', 'minemouth', 'high', 'transportation', 'use', 'labor', 'import', 'supply', 'mine', 'western', 'btu', 'market']
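
Comparing the top-10 and top-5 outputs above, the probabilities appear to be renormalised over the selected words: the ten top-10 values sum to roughly 1, and the top-5 values match the first five top-10 values rescaled to sum to 1. This is an observation from the printed numbers, not documented behaviour of the toolkit. A minimal sketch of that rescaling, with `renormalise` as a hypothetical helper:

```python
# (probability, word) pairs copied from the top-10 output above.
top_10 = [
    (0.43320466929331153, "coal"),
    (0.12312055209114217, "ton"),
    (0.08423354362035818, "production"),
    (0.06680367109857119, "cost"),
    (0.05474056957099318, "percent"),
    (0.048806098992659884, "productivity"),
    (0.047816823540786764, "export"),
    (0.047374993883538295, "u.s"),
    (0.0471022988693782, "increase"),
    (0.04679677903926059, "region"),
]


def renormalise(pairs, n):
    """Hypothetical helper: keep the first n pairs and rescale them to sum to 1."""
    head = pairs[:n]
    total = sum(prob for prob, _ in head)
    return [(prob / total, word) for prob, word in head]


top_5 = renormalise(top_10, 5)
for prob, word in top_5:
    print(f"{word}: {prob:.4f}")
```

The practical consequence is that probabilities from calls with different `n` should not be compared directly.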


## Get the top-n words for a topic over time

In [10]:
"""Here we get the top-5 words for topic 1 over time and utilise the 'years' attribute to pair top words to their timesteps
"""
top_5_ot = analysis.get_top_words(n=5, over_time=True, with_prob=False)
print(f"top_5 in {analysis.years[0]}: {top_5_ot[TOPIC][0]}")
print("----------")
print(f"top_5 in {analysis.years[10]}: {top_5_ot[TOPIC][10]}")
print("----------")
print(f"top_5 in {analysis.years[20]}: {top_5_ot[TOPIC][20]}")

top_5 in 1997: ['coal', 'ton', 'percent', 'low', 'productivity']
----------
top_5 in 2008: ['coal', 'ton', 'production', 'cost', 'btu']
----------
top_5 in 2018: ['coal', 'production', 'region', 'ton', 'interior']
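
Comparing the lists above by eye shows words such as 'production' entering the top 5 over time. A small sketch of making that comparison explicit with set differences follows; the data is copied from the output above, and `word_changes` is a hypothetical helper rather than part of the toolkit:

```python
# Top-5 words for topic 1 at three timesteps, copied from the output above.
snapshots = {
    1997: ["coal", "ton", "percent", "low", "productivity"],
    2008: ["coal", "ton", "production", "cost", "btu"],
    2018: ["coal", "production", "region", "ton", "interior"],
}


def word_changes(earlier, later):
    """Return (entered, left): words new in `later`, and words dropped from `earlier`."""
    entered = sorted(set(later) - set(earlier))
    left = sorted(set(earlier) - set(later))
    return entered, left


years = sorted(snapshots)
for prev_year, next_year in zip(years, years[1:]):
    entered, left = word_changes(snapshots[prev_year], snapshots[next_year])
    print(f"{prev_year} -> {next_year}: entered={entered}, left={left}")
```

Set differences ignore rank changes within the top 5, so a word that only moves position is reported as unchanged.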


# More complex data aggregation

In [19]:
"""Here we create a pandas dataframe where each row is a topic-year combination and its top words at that time.
This is useful for visualisations.
"""
all_topics = analysis.get_top_words_ot()
print(all_topics)
print("==========")
print("==========")
# we can also specify a particular topic
topic_1 = analysis.get_top_words_ot(topic_idx=TOPIC)
print(topic_1)

   topic_idx  year                                          top_words
0          0  1997  [capacity, percent, coal, generation, electric...
1          0  1999  [capacity, percent, coal, generation, electric...
2          0  2000  [capacity, percent, generation, coal, electric...
3          0  2001  [project, capacity, percent, electricity, expe...
4          0  2002  [project, capacity, electricity, percent, gene...
..       ...   ...                                                ...
18        29  2016  [cost, time, price, cent, transmission, market...
19        29  2017  [cost, time, market, cent, price, transmission...
20        29  2018  [cost, time, market, cent, price, transmission...
21        29  2019  [cost, time, market, cent, price, transmission...
22        29  2020  [cost, time, market, cent, price, transmission...

[690 rows x 3 columns]
   topic_idx  year                                          top_words
0          1  1997  [coal, ton, percent, low, productivity, cost, 
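
For plotting, a long format with one row per (topic, year, word) is often more convenient than a list-valued column. The sketch below does this on plain records, of the kind pandas' `DataFrame.to_dict("records")` produces. The word lists are shortened to the words visible in the truncated output above, and `to_long_format` is a hypothetical helper:

```python
# Two rows shaped like the DataFrame above; word lists are cut to the visible words.
records = [
    {"topic_idx": 0, "year": 1997,
     "top_words": ["capacity", "percent", "coal", "generation"]},
    {"topic_idx": 0, "year": 2001,
     "top_words": ["project", "capacity", "percent", "electricity"]},
]


def to_long_format(rows):
    """Hypothetical helper: emit one record per word, with its 1-based rank."""
    long_rows = []
    for row in rows:
        for rank, word in enumerate(row["top_words"], start=1):
            long_rows.append({
                "topic_idx": row["topic_idx"],
                "year": row["year"],
                "rank": rank,
                "word": word,
            })
    return long_rows


long_rows = to_long_format(records)
for entry in long_rows:
    print(entry)
```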

In [17]:
"""Here we fetch topic representations at each timestep over the entire vocabulary.
There are 23 timesteps and a vocabulary of 1774 words, so each row of the matrix is a
probability distribution over the vocabulary. This is useful for seeing which terms are prevalent at each
timestep, but only in custom cases; otherwise the create_top_words_df or get_top_words functions
are easier to use.
"""
topic_word_dist_ot = analysis.get_topic_representation_ot(TOPIC)
print(f"shape: {topic_word_dist_ot.shape}")
print(topic_word_dist_ot)

shape: (23, 1774)
[[1.11228261e-05 1.11228261e-05 1.11228261e-05 ... 1.11228261e-05
  1.11228261e-05 1.11228261e-05]
 [1.13076434e-05 1.13076434e-05 1.13076434e-05 ... 1.13076434e-05
  1.13076434e-05 1.13076434e-05]
 [1.16562103e-05 1.16562103e-05 1.16562103e-05 ... 1.16562103e-05
  1.16562103e-05 1.16562103e-05]
 ...
 [7.94686081e-06 7.94686081e-06 7.94686081e-06 ... 7.94686081e-06
  7.94686081e-06 7.94686081e-06]
 [7.75087899e-06 7.75087899e-06 7.75087899e-06 ... 7.75087899e-06
  7.75087899e-06 7.75087899e-06]
 [7.63462480e-06 7.63462480e-06 7.63462480e-06 ... 7.63462480e-06
  7.63462480e-06 7.63462480e-06]]
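
As a sketch of such a custom case: since each row is a distribution over the vocabulary, the most probable terms at a timestep can be recovered by sorting the row's indices by probability and mapping them through the vocabulary. The toy vocabulary and values below are made up for illustration, and `top_terms` is a hypothetical helper:

```python
# Toy stand-ins for the vocabulary and the (timesteps x vocab) matrix; values are made up.
vocab = ["coal", "ton", "btu", "region"]
topic_word_dist = [
    [0.6, 0.2, 0.1, 0.1],  # timestep 0
    [0.5, 0.1, 0.1, 0.3],  # timestep 1
]


def top_terms(row, vocab, k):
    """Hypothetical helper: the k most probable vocabulary terms in one row."""
    ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
    return [vocab[i] for i in ranked[:k]]


for timestep, row in enumerate(topic_word_dist):
    print(timestep, top_terms(row, vocab, k=2))
```

With the real matrix, the same idea applies row by row, with the vocabulary loaded in the same order as the matrix columns.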


## Raw output data formatted as a DataFrame

In [22]:
"""For those who want to analyse from scratch, a DataFrame with each documents topic mixtures is available,
sorted in the same order as the input to the DTM, i.e. index 0 in the DataFrame corresponds to
document 0 in /example_dtm_model/model-mult.dat
"""
doc_topic_mix = analysis.get_doc_topic_mixtures()
doc_topic_mix

Unnamed: 0  topic_dist                                         year
0           [0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 3.9...  1997
1           [0.01, 0.01, 0.01, 17.036593312282466, 0.01, 0...  1997
2           [0.01, 0.01, 7.633104041666984, 0.01, 12.09599...  1997
3           [0.01, 0.01, 17.700227347196016, 0.01, 0.01, 0...  1997
4           [0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.0...  1997
...         ...                                                ...
2681        [0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 93.415298...  2020
2682        [43.23118368511311, 0.01, 0.01, 0.01, 0.01, 19...  2020
2683        [0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.0...  2020
2684        [22.039384264951234, 0.01, 0.01, 0.01, 0.01, 0...  2020
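
Several `topic_dist` values shown above exceed 1, so they look like unnormalised topic weights rather than proportions; that reading is an assumption. Under it, a minimal sketch of converting one row to proportions and finding its dominant topic, using made-up values and hypothetical helper names:

```python
def normalise(weights):
    """Hypothetical helper: rescale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def dominant_topic(weights):
    """Hypothetical helper: index of the topic with the largest weight."""
    return max(range(len(weights)), key=lambda i: weights[i])


# Made-up weights for one document over five topics (not taken from the model output).
row = [0.01, 0.01, 7.5, 0.01, 12.47]

proportions = normalise(row)
print([round(p, 4) for p in proportions])
print(f"dominant topic index: {dominant_topic(row)}")
```

Normalising per document makes rows comparable across documents of different lengths, which matters when averaging topic proportions per year.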
