# Topic Term Significance

DS 5001 Text as Data

**Purpose:** To compare several published measures of term significance within topics (relevance, saliency, and TF-IDF) as ways of labeling LDA topics.

# Set Up

## Config

In [1]:
import configparser
config = configparser.ConfigParser()
config.read('../../../env.ini')
data_home = config['DEFAULT']['data_home']
output_dir = config['DEFAULT']['output_dir']
local_lib = config['DEFAULT']['local_lib']

## Imports

In [2]:
import pandas as pd
import numpy as np
import plotly.express as px

In [20]:
pd.set_option('display.max_colwidth', None)

In [4]:
import sys; sys.path.append(local_lib)
from hac2 import HAC

In [5]:
data_prefix = 'austen-melville'
colors = "YlGnBu"
n_topics = 40
OHCO = ['book_id','chap_id']

# Get the Data

In [68]:
TOPICS = pd.read_csv(f"{output_dir}/{data_prefix}-LDA_TOPICS-{n_topics}.csv").set_index('topic_id')
PHI = pd.read_csv(f"{output_dir}/{data_prefix}-LDA_PHI-{n_topics}.csv").set_index('topic_id')
THETA = pd.read_csv(f"{output_dir}/{data_prefix}-LDA_THETA-{n_topics}.csv").set_index(OHCO)

# Define Label Function

In [7]:
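# One row per topic; each column will hold the labels produced by one scoring method
LABELS = pd.DataFrame({'default':None}, index=PHI.index)
def extract_label(phi, label, n_terms=7):
    """Store in LABELS[label] the top n_terms terms of each topic, ranked by the topic-term scores in phi."""
    LABELS[label] = LABELS.apply(lambda x: ', '.join(phi.loc[x.name].sort_values(ascending=False).head(n_terms).index), axis=1)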
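
To see what the label function produces, here is a minimal, self-contained sketch on a toy matrix. The helper `top_terms`, the `toy_phi` values, and the term names are hypothetical and not part of the notebook. Unlike `extract_label`, this variant returns the labels instead of writing to `LABELS`.

```python
import pandas as pd

def top_terms(phi: pd.DataFrame, n_terms: int = 7) -> pd.Series:
    """Return one comma-separated label per topic (row), built from its highest-scoring terms (columns)."""
    return phi.apply(lambda row: ', '.join(row.nlargest(n_terms).index), axis=1)

# Hypothetical topic-term matrix: 2 topics x 4 terms
toy_phi = pd.DataFrame(
    [[0.5, 0.3, 0.2, 0.0],
     [0.1, 0.1, 0.2, 0.6]],
    index=pd.Index(['T00', 'T01'], name='topic_id'),
    columns=['sea', 'ship', 'time', 'letter'],
)

toy_labels = top_terms(toy_phi, n_terms=2)
print(toy_labels)
```

Any topic-by-term score matrix with the same shape as `PHI` can be passed in the same way, which is how the sections below reuse the label function.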

# Default

In [8]:
extract_label(PHI, 'default')

# Relevance

Sievert, Carson, and Kenneth Shirley. 2014. “LDAvis: A Method for Visualizing and Interpreting Topics.” In _Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces_, 63–70. https://aclanthology.org/W14-3110.pdf

![image-3.png](images/topic-relevance.png)
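
For readers without access to the image, the relevance of term $w$ to topic $k$, as defined by Sievert and Shirley, can be written as follows, where $\phi_{kw}$ is the probability of term $w$ in topic $k$, $p_w$ is the marginal probability of $w$ in the corpus, and $0 \le \lambda \le 1$ balances the two parts:

$$
r(w, k \mid \lambda) = \lambda \log(\phi_{kw}) + (1 - \lambda) \log\left(\frac{\phi_{kw}}{p_w}\right)
$$

The ratio $\phi_{kw} / p_w$ is the term's lift. In the code, `PHI` plays the role of $\phi$, and `PW` appears to estimate $p_w$ from the column sums of `PHI`.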

In [37]:
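λ = .5 # Weight on within-topic probability; 1 - λ is the weight on lift
PW = PHI.sum() / PHI.sum().sum() # Marginal term probability P(w), estimated from PHI's column sums
PHI_REL = λ * np.log(PHI) + (1 - λ) * np.log(PHI/PW) # Relevance of each term to each topic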
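
The effect of λ is easiest to see at its extremes. The sketch below applies the same computation to a hypothetical toy matrix; the function name `relevance`, the `toy_phi` values, and the term names are assumptions made for illustration. With λ = 1 terms are ranked by topic probability alone, and with λ = 0 they are ranked by lift alone.

```python
import numpy as np
import pandas as pd

def relevance(phi: pd.DataFrame, lam: float) -> pd.DataFrame:
    """Relevance of each term (column) to each topic (row) for a given weight lam."""
    p_w = phi.sum() / phi.sum().sum()  # marginal term probability from column sums
    return lam * np.log(phi) + (1 - lam) * np.log(phi / p_w)

# Hypothetical topic-term matrix: 'time' is common to both topics
toy_phi = pd.DataFrame(
    [[0.50, 0.30, 0.20],
     [0.50, 0.10, 0.40]],
    index=['T00', 'T01'],
    columns=['time', 'ship', 'letter'],
)

top_by_prob = relevance(toy_phi, 1.0).idxmax(axis=1)  # ranks by log(phi) only
top_by_lift = relevance(toy_phi, 0.0).idxmax(axis=1)  # ranks by log(phi / p_w) only
print(top_by_prob.tolist(), top_by_lift.tolist())
```

In this toy case the shared term `time` tops both topics when λ = 1, while λ = 0 promotes the term each topic holds more of than the corpus average. Note that any zero in `phi` yields `-inf` or `nan` under the logarithm, so a real matrix may need smoothing first.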

In [253]:
# PHI_REL

In [38]:
extract_label(PHI_REL, 'rel')

# Saliency

Chuang, Jason, Christopher D. Manning, and Jeffrey Heer. 2012. “Termite: Visualization Techniques for Assessing Textual Topic Models.” In _Proceedings of the International Working Conference on Advanced Visual Interfaces_, 74–77. AVI ’12. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/2254556.2254572.

<img src="images/topic-distinctiveness-formula.png" height=5 width=450>
<img src="images/topic-saliency-formula.png" height=5 width=450>

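So, distinctiveness compares a term's local topic distribution $P(T|w)$ with the global topic distribution $P(T)$: a term whose topics are distributed like the corpus as a whole scores near zero. Saliency then weights distinctiveness by the term's overall probability $P(w)$.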
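
For readers without access to the images, the two definitions from Chuang, Manning, and Heer can be written as:

$$
\text{distinctiveness}(w) = \sum_{T} P(T|w) \log \frac{P(T|w)}{P(T)}
$$

$$
\text{saliency}(w) = P(w) \times \text{distinctiveness}(w)
$$

Distinctiveness is the Kullback-Leibler divergence between $P(T|w)$ and $P(T)$, so it is never negative and equals zero only when the term's topic distribution matches the global one.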

In [247]:
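PTw = (PHI / PHI.sum()).T # P(T|w): each term's distribution over topics (terms x topics)
PT = (PHI.T.sum() / PHI.T.sum().sum()) # P(T): overall topic probability
PHI_DST = PTw * np.log(PTw/PT) # Per-topic summand of distinctiveness (not summed over topics)
PHI_SAL = PHI_DST.T * PW # Weight by P(w); back to topics x terms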
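
The cell above keeps one value per topic and term, which is what the label function needs. Summing those values over topics gives the single per-term score of the published definition. The sketch below shows this on a hypothetical toy matrix; the variable names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical topic-term matrix: 'time' is spread evenly across topics
toy_phi = pd.DataFrame(
    [[0.50, 0.30, 0.20],
     [0.50, 0.10, 0.40]],
    index=['T00', 'T01'],
    columns=['time', 'ship', 'letter'],
)

p_w = toy_phi.sum() / toy_phi.sum().sum()       # P(w), indexed by term
p_t_given_w = (toy_phi / toy_phi.sum()).T       # P(T|w), terms x topics
p_t = toy_phi.T.sum() / toy_phi.T.sum().sum()   # P(T), indexed by topic

# Per-topic summands, as in the notebook's PHI_DST
contrib = p_t_given_w * np.log(p_t_given_w / p_t)

# Summing over topics gives one distinctiveness score per term
distinctiveness = contrib.sum(axis=1)
saliency = p_w * distinctiveness
print(saliency.sort_values(ascending=False))
```

Here `time` has zero distinctiveness because it is equally likely under both topics, so it also has zero saliency despite being the most frequent term. Individual summands in `contrib` can be negative even though their sum over topics cannot.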

In [248]:
extract_label(PHI_SAL, 'sal')

# TFIDF

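We treat topics as documents: each row of `PHI`, normalized to sum to 1, serves as a term frequency vector, and a term counts as occurring in a topic when its normalized weight exceeds a small threshold (0.00025).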

In [249]:
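TF = (PHI.T / PHI.T.sum()).T # Normalize each topic (row) to sum to 1
DF = (TF > 0.00025).sum() # Number of topics in which each term's weight exceeds the threshold
PHI_TFIDF = TF * np.log2((len(TF) + 1)/(DF + 1) + 1) # Smoothed IDF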
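
A small worked example may help show how the inverse document frequency penalizes terms that appear in every topic. The toy matrix, the term names, and the variable names below are hypothetical; the threshold and the IDF formula follow the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical topic-term matrix: 'time' occurs in all 3 topics, 'whale' in 1, 'letter' in 2
toy_phi = pd.DataFrame(
    [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.5],
     [0.6, 0.0, 0.4]],
    index=['T00', 'T01', 'T02'],
    columns=['time', 'whale', 'letter'],
)

tf = toy_phi.div(toy_phi.sum(axis=1), axis=0)      # each topic normalized to sum to 1
df = (tf > 0.00025).sum()                          # topics in which each term "occurs"
idf = np.log2((len(tf) + 1) / (df + 1) + 1)        # same smoothed IDF as above
tfidf = tf * idf

print(idf)
print(tfidf.idxmax(axis=1))
```

In this toy case `time` receives the lowest IDF because it occurs in every topic, so it loses the top spot in `T00` and `T01`. It still leads `T02`, where its weight is high enough to outweigh the penalty.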

In [250]:
extract_label(PHI_TFIDF, 'tfidf')

# Compare

In [251]:
LABELS.join(TOPICS.doc_weight_sum).sort_values('doc_weight_sum', ascending=False)

| topic_id | default | rel | sal | tfidf | doc_weight_sum |
|---|---|---|---|---|---|
| T23 | time, letter, sister, feelings, heart, mind, family | letter, feelings, happiness, sister, affection, marriage, family | letter, feelings, sister, happiness, family, affection, marriage | sister, letter, feelings, happiness, affection, time, marriage | 163.569493 |
| T15 | time, room, thing, man, day, oh, friend | room, oh, evening, thing, dear, friend, pleasure | room, oh, evening, thing, friend, dear, pleasure | room, sister, oh, dear, thing, time, friend | 140.340735 |
| T32 | man, men, deck, ship, time, sailors, sea | deck, mate, men, ship, sailors, frigate, board | deck, men, ship, mate, sailors, man, board | deck, ship, man, men, mate, frigate, sailors | 94.88859 |
| T31 | sea, water, air, ship, wind, boat, day | sea, wind, water, breeze, waves, air, sky | sea, water, wind, air, breeze, land, waves | sea, wind, boat, water, ship, land, air | 76.190798 |
| T01 | lord, man, things, men, gods, soul, time | lord, gods, mortals, yoomy, minstrel, isle, things | lord, gods, things, mortals, isle, god, yoomy | lord, gods, man, things, isle, men, yoomy | 72.28868 |
| T05 | valley, natives, island, time, islanders, house, fruit | valley, natives, islanders, fruit, savages, tappa, island | valley, natives, islanders, fruit, island, trees, savages | natives, valley, islanders, island, savages, fruit, trees | 52.151496 |
| T18 | man, friend, sort, stranger, way, time, gentleman | man, cosmopolitan, stranger, charity, press, friend, gentleman | man, stranger, cosmopolitan, friend, gentleman, sort, charity | man, cosmopolitan, friend, stranger, confidence, sort, charity | 45.478509 |
| T36 | whale, whales, boat, ship, head, boats, sea | whale, whales, fish, boat, boats, harpooneer, whaling | whale, whales, boat, boats, fish, ship, head | whale, whales, boat, boats, ship, fish, harpooneer | 45.423579 |
| T00 | room, house, door, time, day, night, bed | room, house, door, bed, doors, windows, abbey | room, house, door, bed, night, hour, hours | room, house, door, bed, night, time, hours | 40.27335 |
| T02 | sir, thou, thee, boy, man, brother, way | sir, thou, thee, boy, barber, hast, brother | sir, thou, thee, boy, brother, barber, art | sir, thou, thee, boy, barber, brother, man | 35.694933 |
