In [2]:
import warnings
import pandas as pd
warnings.filterwarnings("ignore", category=DeprecationWarning)
pd.options.mode.chained_assignment = None  # default='warn'

from tqdm import tqdm
from tqdm.notebook import tqdm_notebook
tqdm.pandas()
import numpy as np

from utils import clean_dataset_basedOn_media
from utils import cluster_by_month

We can use the coherence score in topic modeling to measure how interpretable the topics are to humans. In this case, topics are represented as the top N words with the highest probability of belonging to that particular topic. Briefly, the coherence score measures how similar these words are to each other.
#### CV Coherence Score

One of the most popular coherence metrics is called CV. It creates content vectors of words using their co-occurrences and then calculates the score using normalized pointwise mutual information (NPMI) and cosine similarity. This metric is popular because it's the default metric in the Gensim topic coherence pipeline module, but it has known issues, discussed in this thread: https://github.com/dice-group/Palmetto/issues/13#issuecomment-371553052

#### UMass Coherence Score

It calculates how often two words, $w_i$ and $w_j$, appear together in the corpus, and it's defined as:

<img src="https://www.baeldung.com/wp-content/ql-cache/quicklatex.com-88c21c21c59dc5699d130bfeca00a5c7_l3.svg" />

where $D(w_i, w_j)$ is the number of documents in which words $w_i$ and $w_j$ appear together, and $D(w_i)$ is the number of documents in which word $w_i$ appears. The higher the value, the better the coherence. This measure is not symmetric: $C_{UMass}(w_i, w_j)$ is not equal to $C_{UMass}(w_j, w_i)$. We calculate the global coherence of a topic as the average of the pairwise coherence scores over the top $N$ words that describe the topic.
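The definition above can be sketched in a few lines of plain Python. This is a minimal illustration, not Gensim's implementation: it assumes the formula shown above is $\log\frac{D(w_i, w_j) + 1}{D(w_i)}$, that each document is represented as a set of tokens, and that every top word occurs in at least one document. The toy documents and the helper names `umass_pair` and `umass_topic` are hypothetical.

```python
import math
from itertools import combinations

def umass_pair(w_i, w_j, docs):
    """Assumed form: log((D(w_i, w_j) + 1) / D(w_i)), with document counts."""
    d_i = sum(1 for d in docs if w_i in d)                # documents containing w_i
    d_ij = sum(1 for d in docs if w_i in d and w_j in d)  # documents containing both
    return math.log((d_ij + 1) / d_i)

def umass_topic(top_words, docs):
    """Average pairwise score over the top words, kept in their ranked order."""
    scores = [umass_pair(a, b, docs) for a, b in combinations(top_words, 2)]
    return sum(scores) / len(scores)

# Toy corpus (hypothetical): each document is a set of tokens
toy_docs = [
    {"salmon", "farm", "puerto"},
    {"salmon", "farm"},
    {"salmon", "election"},
    {"election", "vote"},
]

print(umass_pair("salmon", "farm", toy_docs))  # denominator is D("salmon")
print(umass_pair("farm", "salmon", toy_docs))  # denominator is D("farm")
print(umass_topic(["salmon", "farm", "election"], toy_docs))
```

The two `umass_pair` calls return different values because the denominator changes with the argument order, which is the asymmetry described above.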

#### UCI Coherence Score
This coherence score is based on sliding windows and the pointwise mutual information (PMI) of all word pairs among the top $N$ words by occurrence. Instead of counting how often two words appear in the same document, we count word co-occurrences within a sliding window. For example, with a sliding window of size 10, for a particular word $w_i$ we consider only the 10 words before and after $w_i$.

Therefore, if words $w_{i}$ and $w_{j}$ both appear in a document but never within the same sliding window, we do not count them as appearing together. As with the UMass score, we define the UCI coherence between words $w_{i}$ and $w_{j}$ as

<img src ="https://www.baeldung.com/wp-content/ql-cache/quicklatex.com-02e9dd099b3bc039c06661397ddc3d0d_l3.svg"/>

where $P(w)$ is the probability of seeing word $w$ in the sliding window and $P(w_{i}, w_{j})$ is the probability of words $w_{i}$ and $w_{j}$ appearing together in the sliding window. In the original paper, these probabilities were estimated from the entire corpus of over two million English Wikipedia articles using a 10-word sliding window. We calculate the global coherence of the topic in the same way as for the UMass coherence.
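As a rough sketch of the sliding-window idea, the snippet below treats every window position as a small "virtual document" and estimates probabilities as the fraction of windows containing a word or a word pair. Several things here are assumptions rather than part of the definition above: the smoothing constant `eps` (the exact smoothing term in the formula may differ), the way windows are built, and the helper names `window_sets` and `uci_pair`. Both words are assumed to occur somewhere in the corpus.

```python
import math

def window_sets(tokens, size):
    """All sliding windows over a token list, each as a set of tokens."""
    if len(tokens) <= size:
        return [set(tokens)]
    return [set(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

def uci_pair(w_i, w_j, docs, size=10, eps=1e-12):
    """PMI of two words with probabilities estimated over sliding windows."""
    windows = [w for doc in docs for w in window_sets(doc, size)]
    n = len(windows)
    p_i = sum(w_i in w for w in windows) / n
    p_j = sum(w_j in w for w in windows) / n
    p_ij = sum(w_i in w and w_j in w for w in windows) / n
    return math.log((p_ij + eps) / (p_i * p_j))

# Toy document (hypothetical), already tokenized
toy_doc = "salmon farm closes near puerto montt".split()

# "salmon" and "farm" share a window of size 3; "salmon" and "montt" never do
print(uci_pair("salmon", "farm", [toy_doc], size=3))
print(uci_pair("salmon", "montt", [toy_doc], size=3))
```

The second call is strongly negative: both words occur in the document, but never inside the same window, so they are not counted as co-occurring.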

#### Word2vec Coherence Score

This score brings the semantics of the words into the measure. We want to measure coherence based on two criteria:
1. Intra-topic similarity - the similarity of words in the same topic.
2. Inter-topic similarity - the similarity of words across different topics.

The idea is pretty simple. We want to maximize intra-topic and minimize inter-topic similarity. Also, by similarity, we imply the cosine similarity between words represented by word2vec embedding.

We compute the intra-topic similarity of a topic as the average similarity between every possible pair of its top $N$ words. Similarly, we compute the inter-topic similarity between two topics as the average similarity between the top $N$ words of one topic and the top $N$ words of the other.

Finally, the word2vec coherence score between two topics, $t_i$ and $t_j$, is calculated as 

<img src="https://www.baeldung.com/wp-content/ql-cache/quicklatex.com-0912ea7a1eda042b202bff1cbbd68ee2_l3.svg" />
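The two ingredients of this score, intra-topic and inter-topic similarity, can be sketched as follows. The 2-dimensional vectors are made-up stand-ins for real word2vec embeddings, and the function names are hypothetical; the final combination of the two quantities is the one given by the formula above and is not reproduced here.

```python
import math
from itertools import combinations, product

def cosine(u, v):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def intra_topic(words, emb):
    """Average similarity over every pair of words within one topic."""
    pairs = list(combinations(words, 2))
    return sum(cosine(emb[a], emb[b]) for a, b in pairs) / len(pairs)

def inter_topic(words_i, words_j, emb):
    """Average similarity over every cross-topic pair of words."""
    pairs = list(product(words_i, words_j))
    return sum(cosine(emb[a], emb[b]) for a, b in pairs) / len(pairs)

# Made-up 2-d embeddings standing in for real word2vec vectors
toy_emb = {
    "salmon": (1.0, 0.1), "fishing": (0.9, 0.2),
    "vote": (0.1, 1.0), "election": (0.2, 0.9),
}
topic_a = ["salmon", "fishing"]
topic_b = ["vote", "election"]

print(intra_topic(topic_a, toy_emb))           # close to 1: words point the same way
print(inter_topic(topic_a, topic_b, toy_emb))  # much lower: topics are well separated
```

A good pair of topics shows exactly this pattern: high similarity inside each topic and low similarity across them.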

### Choosing the best coherence score

There is no single way to determine whether a coherence score is good or bad. The score depends on the data it's calculated from. For instance, a score of 0.5 might be good enough in one case but unacceptable in another. The only rule is that we want to **maximize** the score.

Usually, the coherence score increases with the number of topics, and the increase becomes smaller as the number of topics gets higher. The trade-off between the number of topics and the coherence score can be handled with the so-called elbow technique: we plot the coherence score as a function of the number of topics and use the elbow of the curve to select the number of topics.


The idea behind this method is to choose the point after which the diminishing gain in coherence score is no longer worth the additional topics.
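One simple way to turn the elbow idea into code is to stop at the first number of topics after which the gain in coherence falls below a threshold. This is only one of several possible heuristics; the function name `pick_elbow`, the `min_gain` threshold, and the scores below are all made up for illustration and are not results from this notebook.

```python
def pick_elbow(scores, min_gain=0.01):
    """Return the first k after which coherence improves by less than min_gain.

    scores: list of (number_of_topics, coherence) tuples sorted by number_of_topics.
    """
    for (k, c), (_, c_next) in zip(scores, scores[1:]):
        if c_next - c < min_gain:
            return k
    # Coherence kept improving by at least min_gain: fall back to the largest k
    return scores[-1][0]

# Illustrative values only (not measured on any dataset)
toy_scores = [(5, 0.40), (10, 0.50), (15, 0.56), (20, 0.565), (25, 0.567)]
print(pick_elbow(toy_scores))
```

In practice it is still worth plotting the curve, since a fixed threshold can be fooled by noisy scores.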

Also, the coherence score depends on the LDA hyperparameters, such as $\alpha , \beta$ and $K$. Because of that, we can use any machine learning hyperparameter tuning technique.

Finally, it's important to validate the results manually.

In [3]:
df = pd.read_csv("../../data/loslagos-comunas.csv")
df = cluster_by_month(clean_dataset_basedOn_media(df))
df.isna().any()

date               False
media_outlet       False
url                False
title              False
text               False
content            False
comuna              True
date_clustering    False
dtype: bool

In [4]:
df.date_clustering.value_counts()

2022-03    4617
2021-12    4492
2021-10    4370
2021-11    4326
2022-01    4182
2022-02    3950
2022-04    3384
Name: date_clustering, dtype: int64

In [5]:
# Get the labels from value_counts
months = df.date_clustering.value_counts().index.tolist()

# The analysis will be done on the first month
import datetime
dates = [datetime.datetime.strptime(ts, "%Y-%m") for ts in months]
dates.sort()
sorteddates = [datetime.datetime.strftime(ts, "%Y-%m") for ts in dates]

In [6]:
selected = df[df.date_clustering == sorteddates[0]]
docs = selected.content.tolist()

# [Gensim coherence score](https://radimrehurek.com/gensim/models/coherencemodel.html)

Gensim's `CoherenceModel` computes coherence with a four-stage pipeline:

- Segmentation

- Probability Estimation

- Confirmation Measure

- Aggregation
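To make the four stages concrete, here is a simplified sketch that walks through them for an NPMI-style measure. It is not Gensim's implementation: segmentation into unordered word pairs, boolean-document probability estimation, the handling of the degenerate cases in `npmi`, and all function names and toy documents are assumptions chosen to keep the example short.

```python
import math
from itertools import combinations

def segment(top_words):
    """Stage 1, segmentation: split the top words into pairs to compare."""
    return list(combinations(top_words, 2))

def estimate_probabilities(top_words, docs):
    """Stage 2, probability estimation: fraction of documents containing a word or pair."""
    n = len(docs)
    p = {w: sum(w in d for d in docs) / n for w in top_words}
    p_joint = {(a, b): sum(a in d and b in d for d in docs) / n
               for a, b in segment(top_words)}
    return p, p_joint

def npmi(p_a, p_b, p_ab):
    """Stage 3, confirmation measure: normalized pointwise mutual information."""
    if p_ab == 0:
        return -1.0  # limit value when the words never co-occur
    if p_ab == 1:
        return 0.0   # degenerate case (both words in every document); a choice of this sketch
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)

def coherence(top_words, docs):
    """Stage 4, aggregation: mean of the pairwise confirmation values."""
    p, p_joint = estimate_probabilities(top_words, docs)
    values = [npmi(p[a], p[b], p_joint[(a, b)]) for a, b in segment(top_words)]
    return sum(values) / len(values)

# Toy corpus (hypothetical): each document is a set of tokens
toy_corpus = [
    {"salmon", "farm"},
    {"salmon", "farm", "port"},
    {"vote", "election"},
    {"election", "salmon"},
]
print(coherence(["salmon", "farm"], toy_corpus))  # positive: the words co-occur often
print(coherence(["farm", "vote"], toy_corpus))    # -1.0: the words never co-occur
```

The different `coherence=` options used later in this notebook can be thought of as different choices for these stages.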



In [7]:
from bertopic import BERTopic
import gensim.corpora as corpora
from gensim.models.coherencemodel import CoherenceModel

topic_model = BERTopic(verbose=True,
                       calculate_probabilities=True,
                       n_gram_range=(1, 3),
                       language="spanish")
topics, _ = topic_model.fit_transform(docs)

# Preprocess Documents
documents = pd.DataFrame({"Document": docs,
                          "ID": range(len(docs)),
                          "Topic": topics})
documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
cleaned_docs = topic_model._preprocess_text(documents_per_topic.Document.values)

# Extract vectorizer and analyzer from BERTopic
vectorizer = topic_model.vectorizer_model
analyzer = vectorizer.build_analyzer()

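# Extract features for Topic Coherence evaluation
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
words = vectorizer.get_feature_names_out()
tokens = [analyzer(doc) for doc in cleaned_docs]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(token) for token in tokens]
# Top words per topic; the range assumes the outlier topic -1 exists and skips it
topic_words = [[word for word, _ in topic_model.get_topic(topic)]
               for topic in range(len(set(topics))-1)]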

Batches:   0%|          | 0/137 [00:02<?, ?it/s]

2022-09-29 18:20:29,253 - BERTopic - Transformed documents to Embeddings
2022-09-29 18:21:06,949 - BERTopic - Reduced dimensionality
2022-09-29 18:21:08,748 - BERTopic - Clustered reduced embeddings


In [8]:
# Evaluate: C_V coherence
cv_coherence_model = CoherenceModel(topics=topic_words, 
                                 texts=tokens, 
                                 corpus=corpus,
                                 dictionary=dictionary, 
                                 coherence='c_v')
coherence = cv_coherence_model.get_coherence()
coherence

0.6306890201920521

In [12]:
# Evaluate: UMass coherence
umass_coherence_model = CoherenceModel(topics=topic_words, 
                                 texts=tokens, 
                                 corpus=corpus,
                                 dictionary=dictionary, 
                                 coherence='u_mass')
coherence = umass_coherence_model.get_coherence()
coherence

-0.8477866365429185

In [13]:
# Evaluate: UCI coherence
c_uci_coherence_model = CoherenceModel(topics=topic_words, 
                                 texts=tokens, 
                                 corpus=corpus,
                                 dictionary=dictionary, 
                                 coherence='c_uci')
coherence = c_uci_coherence_model.get_coherence()
coherence

-1.5842301436171595

In [14]:
# Evaluate: NPMI coherence
c_npmi_coherence_model = CoherenceModel(topics=topic_words, 
                                 texts=tokens, 
                                 corpus=corpus,
                                 dictionary=dictionary, 
                                 coherence='c_npmi')
coherence = c_npmi_coherence_model.get_coherence()
coherence

0.07302562870265346

https://github.com/MaartenGr/BERTopic/issues/90

https://github.com/MIND-Lab/OCTIS