# Topic Modelling with BERT embeddings

This notebook explores how topic modelling can be done by combining BERT
embeddings with different clustering methods. The work is inspired by
BERTopic (https://github.com/MaartenGr/BERTopic).

In [1]:
# Imports of the functions/packages used. Note that not all clustering
# methods are used in the notebook: the results included are for the one
# that gave the best results according to the evaluation metrics.

from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import PCA
import umap
from sklearn.cluster import KMeans
from sklearn.cluster import Birch
from sklearn.cluster import SpectralClustering
from sklearn_extra.cluster import KMedoids

import hdbscan

from evaluation import eval_clustering

import numpy as np
import os
import settings

The 20 Newsgroups dataset is used; it consists of newsgroup posts on 20 different topics.

In [2]:
# Load the full dataset (train and test splits); headers, footers and quotes are kept.
news_dataset = fetch_20newsgroups(subset='all')  # , remove=('headers', 'footers', 'quotes'))
data = news_dataset['data']

In [3]:
news_dataset.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

Several different BERT embeddings have been precomputed; see Generate_Embeddings.ipynb.

In [4]:
emb_dir = os.path.join(settings.PROJECT_ROOT, "Embeddings")

In [5]:
embeddings = os.listdir(emb_dir)
embeddings

['bert-base-nli-stsb-mean-tokens.emb',
 'bert-large-nli-stsb-mean-tokens.emb',
 'roberta-base-nli-stsb-mean-tokens.emb',
 'roberta-large-nli-stsb-mean-tokens.emb',
 'sbert.net_models_distilbert-base-nli-stsb-mean-tokens.emb',
 'stsb-roberta-large.emb',
 'xlm-r-bert-base-nli-stsb-mean-tokens.emb']

Clustering methods do not work well in high dimensions, so after
calculating the embeddings, dimensionality reduction techniques (Principal
Component Analysis and Uniform Manifold Approximation and Projection) have been
evaluated.
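
One common explanation for why distance-based clustering struggles in high dimensions is that distances "concentrate": the nearest and farthest neighbours of a point end up almost equally far away. The sketch below illustrates this with uniformly random points rather than real BERT embeddings; the function name `relative_contrast`, the point count and the choice of 768 dimensions (the hidden size of BERT-base) are assumptions made for illustration only.

```python
import math
import random

def relative_contrast(n_points, n_dims, seed=0):
    """Return (d_max - d_min) / d_min for the Euclidean distances from one
    random query point to n_points random points in the unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

# A large value means near and far neighbours are easy to tell apart.
low = relative_contrast(n_points=500, n_dims=2)
high = relative_contrast(n_points=500, n_dims=768)
print(f"relative contrast in   2 dimensions: {low:.2f}")
print(f"relative contrast in 768 dimensions: {high:.2f}")
```

The contrast is much smaller in 768 dimensions than in 2, which is the motivation for reducing the embeddings to a handful of components before clustering.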

In [6]:
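def dim_reducer(embedding, r_type, prm_list=None):
    """Reduce the dimensionality of `embedding` with PCA or UMAP.

    An empty `r_type` returns the embedding unchanged.
    """
    prm_list = prm_list or []

    if r_type == "":
        return embedding

    if r_type == "PCA":
        # prm_list: [n_components]
        reducer = PCA(n_components=prm_list[0])
    elif r_type == "UMAP":
        # prm_list: [n_neighbors, n_components, metric]
        reducer = umap.UMAP(n_neighbors=prm_list[0],
                            n_components=prm_list[1],
                            metric=prm_list[2],
                            random_state=0)
    else:
        raise ValueError(f"Unknown reducer type: {r_type!r}")

    return reducer.fit_transform(embedding)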

The clusterer function contains the different clustering methods that have been tried out.

In [7]:
def clusterer(emb_reduced, c_type, prm_list=None):
    """Fit the clustering method named by `c_type` and return the fitted estimator."""
    prm_list = prm_list or []

    if c_type == "KMeans":
        # prm_list: [n_clusters, random_state]
        cluster_app = KMeans(n_clusters=prm_list[0],
                             random_state=prm_list[1])
    elif c_type == "Birch":
        cluster_app = Birch(n_clusters=prm_list[0])
    elif c_type == "SpectralClustering":
        cluster_app = SpectralClustering(n_clusters=prm_list[0])
    elif c_type == "KMedoids":
        cluster_app = KMedoids(n_clusters=prm_list[0])  # , method='pam')
    elif c_type == "HDBSCAN":
        # prm_list: [min_cluster_size]
        cluster_app = hdbscan.HDBSCAN(min_cluster_size=prm_list[0],
                                      metric='euclidean',
                                      cluster_selection_method='eom')
    else:
        raise ValueError(f"Unknown clustering method: {c_type!r}")

    return cluster_app.fit(emb_reduced)

In [8]:
dim_reducers = [#["PCA", [150]],
                ["UMAP", [15, 4, 'cosine']],
                ["UMAP", [15, 5, 'cosine']],
                ["UMAP", [15, 6, 'cosine']],
                ["UMAP", [15, 7, 'cosine']],
                ["UMAP", [15, 8, 'cosine']],

                #["UMAP", [15, 5, 'euclidean']]
                ]

clusterers = [["KMeans", [20, 0]],
              #["KMeans", [30, 0]],
              #["Birch", [20]],
              #["KMedoids", [20]],
              #["HDBSCAN", [15]],
             ]

The following loop goes through all the dimensionality reducers and clusterers defined
above and evaluates each clustering (the topics) by silhouette score, average distance within cluster
and Rand index (see the descriptions at https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics).
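
The `eval_clustering` helper comes from a local `evaluation` module that is not shown here, so its exact implementation is unknown. To make two of the metrics concrete, the following is a minimal pure-Python sketch of the Rand index and the mean silhouette coefficient on a toy example. The function names `rand_index` and `silhouette` are hypothetical; in practice `sklearn.metrics.rand_score` and `sklearn.metrics.silhouette_score` would normally be used instead.

```python
import math
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of point pairs on which the two labelings agree, i.e. both
    put the pair in the same group or both put it in different groups."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)

def silhouette(points, labels):
    """Mean silhouette coefficient (Euclidean); needs at least two clusters."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:  # a singleton cluster gets a score of 0 by convention
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
truth = [0, 0, 1, 1]
good = [1, 1, 0, 0]  # same grouping as truth, only the label names differ
bad = [0, 1, 0, 1]   # splits both true groups

print("good:", rand_index(truth, good), silhouette(points, good))
print("bad: ", rand_index(truth, bad), silhouette(points, bad))
```

The Rand index needs the ground-truth labels (here `news_dataset['target']`), whereas the silhouette score uses only the reduced embeddings and the predicted labels, so it remains available when no ground truth exists.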

In [9]:
for emb_file in embeddings[0:1]:
    
    embedding = np.loadtxt(os.path.join(emb_dir, emb_file))
    
    for dim_reducer_ in dim_reducers:

        # The reduction does not depend on the clusterer, so compute it once per reducer.
        emb_reduced = dim_reducer(embedding, dim_reducer_[0], dim_reducer_[1])

        for clusterer_ in clusterers:

            cluster = clusterer(emb_reduced, clusterer_[0], clusterer_[1])

            labels = cluster.labels_
            
            eval_scores = eval_clustering(emb_reduced, labels, news_dataset['target'])
            
            print(emb_file[0:15], dim_reducer_, clusterer_, eval_scores)

bert-base-nli-s ['UMAP', [15, 4, 'cosine']] ['KMeans', [20, 0]] [0.3805159, 0.7157429995421716, 0.9275942328721714]
bert-base-nli-s ['UMAP', [15, 5, 'cosine']] ['KMeans', [20, 0]] [0.37844843, 0.8120980582803152, 0.9248328360685921]
bert-base-nli-s ['UMAP', [15, 6, 'cosine']] ['KMeans', [20, 0]] [0.3707836, 0.7807619385278589, 0.9295196797930987]
bert-base-nli-s ['UMAP', [15, 7, 'cosine']] ['KMeans', [20, 0]] [0.3788405, 0.8239822104147985, 0.9203452699115172]
bert-base-nli-s ['UMAP', [15, 8, 'cosine']] ['KMeans', [20, 0]] [0.3849625, 0.8247981480757426, 0.9250155517538123]
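
The printed scores can be compared programmatically. The sketch below copies the values from the output above into a dictionary keyed by the UMAP `n_components` setting. It assumes that each score list is ordered as [silhouette score, average distance within cluster, Rand index], which matches the order in which the metrics are named in the text but is not confirmed by the `eval_clustering` source.

```python
# Scores copied from the printed output, keyed by UMAP n_components.
# Assumed order: [silhouette, avg. within-cluster distance, Rand index].
results = {
    4: [0.3805159, 0.7157429995421716, 0.9275942328721714],
    5: [0.37844843, 0.8120980582803152, 0.9248328360685921],
    6: [0.3707836, 0.7807619385278589, 0.9295196797930987],
    7: [0.3788405, 0.8239822104147985, 0.9203452699115172],
    8: [0.3849625, 0.8247981480757426, 0.9250155517538123],
}

# Higher is better for silhouette and Rand index, lower for within-cluster distance.
best_silhouette = max(results, key=lambda k: results[k][0])
lowest_distance = min(results, key=lambda k: results[k][1])
best_rand = max(results, key=lambda k: results[k][2])

print("best silhouette at n_components =", best_silhouette)
print("lowest within-cluster distance at n_components =", lowest_distance)
print("best Rand index at n_components =", best_rand)
```

Under that assumption the three metrics each favour a different setting, and the gaps between settings are small, so no single configuration is clearly best on this run.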


In this case we knew that the original number of topics was 20. For real-world problems, however,
different topic modelling approaches should be evaluated in the context of the problem.