# <center>Embedding, Flattening, and Clustering</center>

In [1]:
from sentence_transformers import SentenceTransformer
import umap
import hdbscan
import pandas as pd

Before we begin using advanced transformer-based topic modeling libraries, such as Top2Vec, we should have a good understanding of how they work. Top2Vec relies on three libraries: Sentence Transformers, UMAP, and HDBSCAN. Here, we will explore each of these so that the reader has a basic understanding of the theory and methods behind Top2Vec.

In [2]:
df = pd.read_csv("../data/trc.csv")
df.head(1)

Unnamed: 0,ObjectId,Last,First,Description,Place,Yr,Homeland,Province,Long,Lat,HRV,ORG
0,1,AARON,Thabo Simon,An ANCYL member who was shot and severely inju...,Bethulie,1991.0,,Orange Free State,25.97552,-30.50329,shoot|injure,ANC|ANCYL|Police|SAP


In [3]:
documents = df.Description.tolist()

## Embedding the Documents

The first step in using transformers in topic modeling is to convert the text into a vector. We met vectors when we explored LDA topic modeling in the previous chapter. The vectors for LDA topic modeling were rooted in a TF-IDF index. This index, while computationally light, did not retain semantic meaning or word order.
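To make the loss of word order concrete, here is a minimal sketch that uses raw word counts rather than TF-IDF weights (TF-IDF reweights such counts, so the same limitation applies). The helper `bag_of_words` and the two sentences are hypothetical and exist only for illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Count lowercase whitespace-separated tokens, ignoring their order."""
    return Counter(text.lower().split())

# Two hypothetical sentences: same words, opposite meanings.
a = "the dog bit the man"
b = "the man bit the dog"

# The count-based representations are identical, so the difference
# in meaning is invisible to any model built on them.
same_counts = bag_of_words(a) == bag_of_words(b)
print(same_counts)
```

Because both sentences reduce to the same counts, a count-based index cannot tell them apart, which is the gap that document embeddings aim to close.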

When we are working with transformers, we can create a vector for each document in our dataset. This vector is not an index of the words used; rather, it is an embedding of the entire document that captures how its words are used semantically. To a degree, it also preserves word order in the same vector space. This document vector is similar to the word vector that we met in Part Three of this textbook. Instead of embedding a single word, however, the entire document receives an embedding. This allows us to mathematically compare documents across an entire corpus.
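One common way to compare two vectors mathematically is cosine similarity. The sketch below assumes that measure (the text does not name a specific one) and uses toy three-dimensional vectors rather than real model output. The function name `cosine_similarity` and the values are illustrative.

```python
import numpy as np

def cosine_similarity(u, v):
    """Return the cosine of the angle between two vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy "document vectors" (hypothetical values, not model output).
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]  # points in the same direction as doc_a
doc_c = [0.0, 0.0, 3.0]  # orthogonal to doc_a

sim_ab = cosine_similarity(doc_a, doc_b)  # close to 1: very similar
sim_ac = cosine_similarity(doc_a, doc_c)  # close to 0: unrelated
print(round(sim_ab, 6), round(sim_ac, 6))
```

The same function can be applied to any pair of document embeddings once they have been computed, since each embedding is simply a longer vector of floats.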

To convert our documents into vectors, we first need a transformer model. Fortunately, the Sentence Transformers library from HuggingFace allows us to easily load robust pre-trained language models. In our case, we will be using the `all-MiniLM-L6-v2` model. We can load this model by instantiating the `SentenceTransformer` class from the `sentence_transformers` library.

In [4]:
model = SentenceTransformer('all-MiniLM-L6-v2')

Once our model object is created, we can use its `.encode()` method. This method will encode all the documents that we pass to it. In our case, this is the approximately 22,000 descriptions from the TRC dataset. The `.encode()` method takes a single mandatory argument, the list of texts to embed.

In [9]:
doc_embeddings = model.encode(documents)

Now that we have the vectors for each document, let's examine one.

In [22]:
doc_embeddings[0][:10]

array([-0.07123438,  0.00332592, -0.05571467,  0.08363085,  0.09066872,
        0.055036  ,  0.08029196,  0.0137071 ,  0.05912024,  0.06226279],
      dtype=float32)

As we can see, this looks remarkably similar to our word embeddings. While this representation is useful for mathematically comparing the similarity between documents, it can be difficult to parse this numerical data visually. For this reason, it is useful to flatten the data into 2 or 3 dimensions so that it can be graphed. In the previous chapter, we learned how to flatten data with PCA. In this chapter, we will meet a new dimensionality reduction algorithm, UMAP.

## Flattening the Data

UMAP has gained popularity in recent years as a quick, effective, and fairly accurate way to represent higher-dimensional data in lower dimensions. In Python, we can access the UMAP algorithm through the UMAP library, which can be installed with pip by typing the following command:

`pip install umap-learn`

Note the `-learn` after `umap`. This is very important as `umap` is an entirely different library.

Once you have installed UMAP correctly, you can access the `UMAP` class. This will take several parameters that can be adjusted to yield different results.

In [14]:
# n_neighbors: size of the local neighborhood; min_dist: how tightly points may pack
umap_proj = umap.UMAP(
    n_neighbors=10,
    min_dist=0.01,
    metric='correlation',
).fit_transform(doc_embeddings)

## Isolating Clusters with HDBSCAN

Once our data has been flattened, we can use the HDBSCAN algorithm to automatically identify the number of clusters within it and assign documents to each cluster.

In [15]:
hdbscan_labels = hdbscan.HDBSCAN(min_samples=2, min_cluster_size=3).fit_predict(umap_proj)
print(len(set(hdbscan_labels)))

2355
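
HDBSCAN gives the label `-1` to any point it cannot assign to a cluster and treats those points as noise. A count of unique labels like the one above therefore includes `-1` whenever some documents are left unclustered. The sketch below shows one way to separate the two numbers. It uses a small hypothetical array, `toy_labels`, in place of the real `hdbscan_labels`.

```python
import numpy as np

# Hypothetical labels standing in for hdbscan_labels; -1 marks noise.
toy_labels = np.array([-1, 0, 0, 1, -1, 2, 2, 2, 1])

unique_labels = set(toy_labels.tolist())
n_clusters = len(unique_labels - {-1})   # clusters, excluding the noise label
n_noise = int((toy_labels == -1).sum())  # documents left unclustered

print(n_clusters, n_noise)
```

Running the same two lines on the real labels would show how many documents HDBSCAN declined to place in any topic.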


In [16]:
df["x"] = umap_proj[:, 0]
df["y"] = umap_proj[:, 1]
df["topic"] = hdbscan_labels
df.head(1)

Unnamed: 0,ObjectId,Last,First,Description,Place,Yr,Homeland,Province,Long,Lat,HRV,ORG,x,y,topic
0,1,AARON,Thabo Simon,An ANCYL member who was shot and severely inju...,Bethulie,1991.0,,Orange Free State,25.97552,-30.50329,shoot|injure,ANC|ANCYL|Police|SAP,3.326068,6.641037,-1


## Analyzing the Labels

Now that we have the labels loaded into our DataFrame, we can use Pandas to interrogate that data. Let's grab a topic and examine it. Here, we will examine topic 100.

In [21]:
for d in df.loc[df.topic == 100].Description.tolist():
    print(d)
    print()

A Dikwankwetla National Party (DNP) supporter and member of the QwaQwa parliament, had his house burnt down in Botshabelo, near Bloemfontein, on 19 August 1987. His home was again petrol-bombed in June 1990.

A Dikwankwetla National Party (DNP) supporter who had her house burnt down by UDF supporters in Botshabelo, near Bloemfontein, on 16 July 1989, allegedly because of her support for the DNP.

A local councillor and member of the Dikwankwetla National Party (DNP) who had his house burnt down by unidentified persons in Winburg, OFS, on 30 March 1990, allegedly because of his support for the DNP.

A councilor and a member of the Dikwankwetla National Party (DNP) who was seriously injured when he was stabbed, beaten and burnt by ANC supporters in Botshabelo, near Bloemfontein, on 29 January 1990. DNP members were frequently attacked at this time, allegedly because of the party’s support for the QwaQwa homeland government.

A supporter of the Dikwankwetla National Party (DNP) who lost h

Notice that topic 100 has 6 different documents in it. They differ in their details but clearly match in a few ways. First, they all involve the Dikwankwetla National Party (DNP). Second, they all involve the concept of arson or burning in some way. In some cases, the target is physical property; in others, it is a person.
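
To see how many documents fall into each topic without printing them one by one, we can count the labels with Pandas. The following is a minimal sketch on a hypothetical DataFrame, `toy_df`, that mimics the `Description` and `topic` columns used above. The values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the labelled DataFrame built above.
toy_df = pd.DataFrame({
    "Description": ["doc a", "doc b", "doc c", "doc d", "doc e"],
    "topic": [0, 0, -1, 1, 0],
})

# Drop noise (-1), then count documents per topic, largest first.
topic_sizes = toy_df.loc[toy_df.topic != -1, "topic"].value_counts()
print(topic_sizes)

# Number of documents in one particular topic.
n_in_topic_0 = int((toy_df.topic == 0).sum())
print(n_in_topic_0)
```

Applied to the real DataFrame, the same pattern gives a quick overview of which topics are large enough to be worth reading closely.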