# Topic Matching
This notebook labels our unnamed topics by matching each one to its most similar IPTC news topic.

## Imports
Necessary imports.

In [None]:
import pandas as pd # data manipulation
from sentence_transformers import SentenceTransformer, util # embeddings
import torch # similarity matching
import json # accessing IPTC topic schema

## Load Documents
Load the IPTC news topic schema and the table of found topics.
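
The loading cell below flattens language-keyed fields into plain strings. As a rough illustration, here is a minimal sketch of that flattening on a made-up two-entry concept set. It assumes each entry stores `prefLabel` and `definition` as dictionaries keyed by language code. The sample entries and the `flatten_concept` helper are hypothetical and are not taken from the real schema file.

```python
# Hypothetical sample mimicking the assumed shape of the "conceptSet" list.
sample_concepts = [
    {"prefLabel": {"en-US": "arts and entertainment"},
     "definition": {"en-US": "Matters pertaining to the arts."}},
    {"prefLabel": {"en-US": "weather"}, "definition": {}},  # no en-US definition
]


def flatten_concept(concept, lang="en-US"):
    """Return a 'Label: definition' string, or just the label if no definition exists."""
    label = concept["prefLabel"][lang].title()
    definition = concept.get("definition", {}).get(lang)
    return f"{label}: {definition}" if definition else label


sample_topics = [flatten_concept(c) for c in sample_concepts]
```

In this sketch an entry without a definition falls back to its label alone, so no placeholder text ends up in the string that is later embedded.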

In [None]:
with open("data/cptall-en-US.json") as f:
    d = json.load(f)["conceptSet"]

iptc_news_schema = pd.DataFrame(d)
iptc_news_schema["prefLabel"] = iptc_news_schema["prefLabel"].apply(lambda x: x["en-US"])
# entries without a definition (or without an en-US one) become None
iptc_news_schema["definition"] = iptc_news_schema["definition"].apply(lambda x: x.get("en-US") if isinstance(x, dict) else None)

iptc_news_topics = []

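for _, row in iptc_news_schema.iterrows():
    label = row["prefLabel"].title()
    # single quotes inside the f-string keep this valid before Python 3.12;
    # the definition is skipped when missing so "None" is not embedded
    topic = f"{label}: {row['definition']}" if pd.notna(row["definition"]) else label
    iptc_news_topics.append(topic)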

In [None]:
topic_info = pd.read_csv("data/topic_info.csv")

## Similarity Sort
Here, we match each found topic to its most similar IPTC news topic.

1. Embed the IPTC topics and each found topic's representative words with the same sentence transformer.
2. Calculate the cosine similarity between each found topic's representative words and every IPTC topic.
3. Match each found topic with its closest IPTC topic.
4. Sort from highest to lowest cosine similarity so the most likely matches are at the top.
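
Steps 2 to 4 can be sketched without a model by using small hand-made vectors in place of real embeddings. This is only an illustration under stated assumptions: the labels, vectors, and the `cosine` helper are made up for the example, and the notebook itself relies on `util.cos_sim` and `torch.max` rather than this pure-Python version.

```python
import math

# Toy 3-dimensional vectors standing in for sentence embeddings (made-up values).
reference_labels = ["Sport", "Politics"]
reference_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
found_topics = {"topic_a": [0.6, 0.8, 0.0], "topic_b": [1.0, 0.0, 0.0]}


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


matches = []
for name, vec in found_topics.items():
    # Step 2: score this topic against every reference label.
    scores = [cosine(vec, ref) for ref in reference_vecs]
    # Step 3: keep the index of the highest-scoring label.
    best = max(range(len(scores)), key=scores.__getitem__)
    matches.append((name, reference_labels[best], scores[best]))

# Step 4: strongest matches first.
matches.sort(key=lambda m: m[2], reverse=True)
```

Sorting by score puts the pairs most likely to be correct at the top of the list, which is where hand review is quickest.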

In [None]:
model = SentenceTransformer("all-MiniLM-L6-v2")

# encoding IPTC topics individually
iptc_news_embeddings = model.encode(iptc_news_topics, convert_to_tensor=True)
# encoding all of a found topic's most representative words (passed as a plain list of strings)
topic_embeddings = model.encode(topic_info["Representation"].astype(str).tolist(), convert_to_tensor=True)

# calculating cosine similarities between representative topic words and IPTC topics
similarities = util.cos_sim(topic_embeddings, iptc_news_embeddings)  # shape: (n_topics, n_iptc_topics)

# getting each topic's best IPTC match
best_scores, best_idxs = torch.max(similarities, dim=1)

# applying closest topic label as feature and its associated similarity score
topic_info["iptc_news_topic"] = [iptc_news_topics[idx].split(":")[0] for idx in best_idxs.tolist()]
topic_info["similarity"] = best_scores.tolist()

# sorting best matches at the top
topic_info = topic_info.sort_values("similarity", ascending=False)

In [None]:
topic_info.head()

## Exports
Export labeled and sorted topics for hand review.

In [None]:
topic_info.to_csv("data/iptc_news_topic_info.csv", index=False)