**Table of contents**<a id='toc0_'></a>    
- [BERT](#toc1_)    
- [BERTopic](#toc2_)    
  - [Create sentence embeddings](#toc2_1_)    
  - [Reduce the dimensionality of the dataset](#toc2_2_)    
- [Resources](#toc3_)    
- [References](#toc4_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[BERT](#toc0_)

> Bidirectional Encoder Representations from Transformers (BERT) is a language model based on the transformer architecture, notable for its dramatic improvement over previous state of the art models. It was introduced in October 2018 by researchers at Google. [$^{[1]}$](https://en.wikipedia.org/wiki/BERT_(language_model))

The diagram below shows the transformer architecture. Note that the input and output embeddings are produced in a separate step and then fed to the model for training:  

![](https://upload.wikimedia.org/wikipedia/commons/8/8f/The-Transformer-model-architecture.png)    
(Source: [Transformer (machine learning model), Wikipedia](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)))  

> BERT is an "encoder-only" transformer architecture.

> On a high level, BERT consists of three modules:
> - embedding. This module converts an array of one-hot encoded tokens into an array of vectors representing the tokens.
> - a stack of encoders. These encoders are the Transformer encoders. They perform transformations over the array of representation vectors.
> - un-embedding. This module converts the final representation vectors into one-hot encoded tokens again. [$^{[1]}$](https://en.wikipedia.org/wiki/BERT_(language_model))
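
To make the embedding and un-embedding modules concrete, here is a minimal sketch in plain Python. The vocabulary, the 3-dimensional vectors, and the helper names (`one_hot`, `embed`, `unembed`) are all made up for illustration, and the encoder stack in the middle is skipped entirely. The sketch also assumes that un-embedding reuses the embedding matrix, which is a simplification rather than a description of BERT's actual weights.

```python
# Toy vocabulary and embedding matrix (values are invented for illustration).
vocab = ["[CLS]", "pens", "rule", "[SEP]"]
embedding_matrix = [
    [0.1, 0.0, 0.2],  # [CLS]
    [0.9, 0.3, 0.1],  # pens
    [0.2, 0.8, 0.5],  # rule
    [0.0, 0.1, 0.0],  # [SEP]
]


def one_hot(index, size):
    """Return a one-hot vector with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def embed(one_hot_token):
    """Multiply a one-hot vector by the embedding matrix (selects one row)."""
    dim = len(embedding_matrix[0])
    return [
        sum(one_hot_token[i] * embedding_matrix[i][d] for i in range(len(vocab)))
        for d in range(dim)
    ]


def unembed(vector):
    """Score the vector against every vocabulary row and one-hot the best match."""
    scores = [sum(v * w for v, w in zip(vector, row)) for row in embedding_matrix]
    best = scores.index(max(scores))
    return one_hot(best, len(vocab))


token = one_hot(vocab.index("pens"), len(vocab))
vector = embed(token)        # dense representation of "pens"
recovered = unembed(vector)  # back to a one-hot token
```

In the real model the encoders transform `vector` before un-embedding, so the recovered token can differ from the input token.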

# <a id='toc2_'></a>[BERTopic](#toc0_)

BERTopic was originally a model that combined [BERT embeddings](https://maartengr.github.io/BERTopic/api/bertopic.html) with c-TF-IDF "to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions". Today, the only difference is that [BERTopic](https://maartengr.github.io/BERTopic/) uses Hugging Face transformers together with c-TF-IDF for the same purpose. The Hugging Face library has gained popularity over the last couple of years as a platform where anyone can upload pre-trained and fine-tuned models.

BERTopic is a modular library, meaning you can mix and match different steps in the NLP pipeline depending on your needs:  

![](https://maartengr.github.io/BERTopic/algorithm/modularity.svg)  
(Source: [BERTopic documentation](https://maartengr.github.io/BERTopic/algorithm/algorithm.html#visual-overview))

In [1]:
# You know the drill
# !pip install bertopic
# !pip install sentence-transformers
# !pip install umap-learn  # the PyPI package `umap` is a different project
# !pip install hdbscan

In [1]:
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

In [4]:
print(len(docs))

18846


In [3]:
docs[0]

"\n\nI am sure some bashers of Pens fans are pretty confused about the lack\nof any kind of posts about the recent Pens massacre of the Devils. Actually,\nI am  bit puzzled too and a bit relieved. However, I am going to put an end\nto non-PIttsburghers' relief with a bit of praise for the Pens. Man, they\nare killing those Devils worse than I thought. Jagr just showed you why\nhe is much better than his regular season stats. He is also a lot\nfo fun to watch in the playoffs. Bowman should let JAgr have a lot of\nfun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final\nregular season game.          PENS RULE!!!\n\n"

## <a id='toc2_1_'></a>[Create sentence embeddings](#toc0_)

As opposed to the models we've seen so far in NLP, BERTopic uses sentence rather than word embeddings at the beginning of the clustering process.
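
The difference can be sketched with toy numbers. The example below averages word vectors into a single sentence vector (mean pooling) and compares sentences by cosine similarity. The word vectors and helper names are hypothetical, and mean pooling is only one common way to obtain a sentence vector; a real sentence-transformers model pools contextual token embeddings produced by a transformer rather than a static lookup table.

```python
from math import sqrt

# Invented 3-dimensional word vectors; real models use hundreds of dimensions.
word_vectors = {
    "pens":   [0.9, 0.1, 0.0],
    "win":    [0.7, 0.3, 0.1],
    "devils": [0.8, 0.2, 0.1],
    "lose":   [0.6, 0.4, 0.0],
    "stocks": [0.0, 0.2, 0.9],
    "fall":   [0.1, 0.3, 0.8],
}


def sentence_embedding(sentence):
    """Average the word vectors of a sentence into one vector (mean pooling)."""
    vectors = [word_vectors[word] for word in sentence.lower().split()]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


hockey_a = sentence_embedding("Pens win")
hockey_b = sentence_embedding("Devils lose")
finance = sentence_embedding("Stocks fall")

same_topic = cosine_similarity(hockey_a, hockey_b)
different_topic = cosine_similarity(hockey_a, finance)
```

With these toy vectors the two hockey sentences end up much closer to each other than to the finance sentence, which is the property the clustering step relies on.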

In [5]:
from sentence_transformers import SentenceTransformer

# Step 1 - Extract embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

In [9]:
embeddings = embedding_model.encode(docs)

## <a id='toc2_2_'></a>[Reduce the dimensionality of the dataset](#toc0_)

In [None]:
from umap import UMAP

# Step 2 - Reduce dimensionality
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')

## Clustering similar sentences

HDBSCAN is a density-based clustering technique:

> It can find clusters of different shapes and has the nice feature of identifying outliers where possible. As a result, we do not force documents into a cluster where they might not belong. This will improve the resulting topic representation as there is less noise to draw from.
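
The idea of leaving outliers unassigned can be illustrated with a small density-based clusterer. Note that the sketch below implements plain DBSCAN, not HDBSCAN: it uses a single fixed radius `eps`, whereas HDBSCAN builds a hierarchy over varying densities. The function name and the toy points are hypothetical. Noise points get the label `-1`, the same convention the `hdbscan` library uses for outliers.

```python
from math import dist


def dbscan(points, eps, min_samples):
    """Toy DBSCAN: return one cluster label per point, with -1 meaning noise."""
    labels = [None] * len(points)
    cluster = -1

    def neighbors(i):
        # Indices of all points within `eps` of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise point joins as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_samples:
                queue.extend(j_nbrs)  # j is a core point, so keep expanding
    return labels


toy_points = [
    (0, 0), (0, 1), (1, 0), (1, 1),          # dense group
    (10, 10), (10, 11), (11, 10), (11, 11),  # second dense group
    (50, 50),                                # isolated point
]
toy_labels = dbscan(toy_points, eps=1.5, min_samples=3)
```

The isolated point is not forced into either group, so it cannot dilute the words that describe those groups.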

In [None]:
from hdbscan import HDBSCAN

# Step 3 - Cluster reduced embeddings
hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

## Tokenize topics

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Step 4 - Tokenize topics
vectorizer_model = CountVectorizer(stop_words="english")
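
As a rough picture of what this step produces, the sketch below joins all documents assigned to the same topic and counts the remaining words after dropping stop words. It is a simplified stand-in, not `CountVectorizer` itself: the whitespace tokenizer, the tiny `STOP_WORDS` set, and the function name are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Tiny stand-in for a stop word list; the real English list is much longer.
STOP_WORDS = {"the", "a", "is", "are", "of", "to", "and"}


def topic_word_counts(documents, topic_labels):
    """Build one bag of words per topic from documents and their topic labels."""
    words_per_topic = defaultdict(list)
    for doc, topic in zip(documents, topic_labels):
        words_per_topic[topic].extend(
            word for word in doc.lower().split() if word not in STOP_WORDS
        )
    return {topic: Counter(words) for topic, words in words_per_topic.items()}


toy_docs = [
    "The Pens are killing the Devils",
    "Jagr is fun to watch",
    "The GPU driver is broken",
]
toy_topics = [0, 0, 1]
counts = topic_word_counts(toy_docs, toy_topics)
```

Each topic is now described by a single word-frequency table, which is the input the topic representation step works on.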

## Topic representation

In [None]:
from bertopic.vectorizers import ClassTfidfTransformer

# Step 5 - Create topic representation
ctfidf_model = ClassTfidfTransformer()
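
The BERTopic documentation describes the class-based TF-IDF weight of word $x$ in class $c$ as $W_{x,c} = tf_{x,c} \cdot \log(1 + A / f_x)$, where $tf_{x,c}$ is the frequency of the word in the class, $f_x$ is its frequency across all classes, and $A$ is the average number of words per class. The sketch below applies that formula to hypothetical word counts. It is not the library's implementation, which differs in details such as normalization, and the function and variable names are invented for illustration.

```python
from math import log


def c_tf_idf(counts):
    """Map {topic: {word: frequency}} to {topic: {word: c-TF-IDF weight}}."""
    # A: average number of words per topic.
    total_words = sum(sum(words.values()) for words in counts.values())
    avg_words_per_topic = total_words / len(counts)

    # f_x: frequency of each word across all topics.
    word_totals = {}
    for words in counts.values():
        for word, freq in words.items():
            word_totals[word] = word_totals.get(word, 0) + freq

    return {
        topic: {
            word: freq * log(1 + avg_words_per_topic / word_totals[word])
            for word, freq in words.items()
        }
        for topic, words in counts.items()
    }


toy_counts = {
    0: {"game": 4, "team": 3, "year": 1},
    1: {"card": 4, "driver": 3, "year": 1},
}
weights = c_tf_idf(toy_counts)
top_word = max(weights[0], key=weights[0].get)
```

Words that are frequent in one topic but rare elsewhere, such as `game` here, receive the highest weights, while a word shared across topics, such as `year`, is pushed down.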

In [None]:
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired

# Step 6 - (Optional) Fine-tune topic representations with
# a `bertopic.representation` model
representation_model = KeyBERTInspired()

## Building a pipeline

In [None]:
# All steps together
topic_model = BERTopic(
  embedding_model=embedding_model,          # Step 1 - Extract embeddings
  umap_model=umap_model,                    # Step 2 - Reduce dimensionality
  hdbscan_model=hdbscan_model,              # Step 3 - Cluster reduced embeddings
  vectorizer_model=vectorizer_model,        # Step 4 - Tokenize topics
  ctfidf_model=ctfidf_model,                # Step 5 - Extract topic words
  representation_model=representation_model # Step 6 - (Optional) Fine-tune topic representations
)

In [None]:
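# Fit the pipeline assembled above. Re-creating the model with BERTopic()
# here would discard the custom components and fall back to the defaults.
# Passing the precomputed embeddings avoids encoding the documents again.
topics, probs = topic_model.fit_transform(docs, embeddings)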

In [None]:
topic_model.get_topic_info()

In [None]:
topic_model.get_document_info(docs)

# <a id='toc3_'></a>[Resources](#toc0_)

- UMAP by StatQuest
    - [UMAP Main Ideas (19 min)](https://www.youtube.com/watch?v=eN0wFzBA4Sc)
    - [UMAP Mathematical Details (16 min)](https://www.youtube.com/watch?v=eN0wFzBA4Sc)

# <a id='toc4_'></a>[References](#toc0_)

[1] [BERT (language model), Wikipedia](https://en.wikipedia.org/wiki/BERT_(language_model))