# BERTopic Example

This notebook contains an applied example of using BERTopic for topic modelling. Code is taken from the [BERTopic site](https://maartengr.github.io/BERTopic/getting_started/quickstart/quickstart.html).

Please note that topic models for the full dataset aren't generated here, because the runtime on CPU is too long. The topics generated are therefore not very good, as they are based on only a small subset of the data.

To run the same code on a GPU in Google Colab (faster runtime), use [this notebook](https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing#scrollTo=SNa-KtKDRnus).

---
**Load packages:**

In [None]:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

------

The 20 newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training (or development) and one for testing (or performance evaluation). The split between the train and test sets is based on whether messages were posted before or after a specific date. More information can be found here: https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html

**Fetch data:**

In [None]:
fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes')).keys()

In [None]:
fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['target_names']

In [None]:
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

In [None]:
docs[0]

In [None]:
docs[0:3]

In [None]:
len(docs)

------

### 1. Model training: Subset dataset across all categories

**Set up model:**

Note: this takes too long to run on AP because no GPU is available. We'll take just a subset of the docs so that the demo runs faster.

With 100 docs, it takes about 25s to run.
With 500 docs, it takes about 2m30s to run.

There are 18,846 docs in total in the dataset, so the whole thing would take approximately 1h30.
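
The full-dataset figure can be checked by scaling the two timings above. This is a minimal sketch that assumes runtime grows roughly linearly with the number of documents, which may not hold exactly in practice; the helper name `estimate_runtime_seconds` is hypothetical.

```python
# Timings quoted above: (number of docs, seconds to fit)
timings = [(100, 25), (500, 150)]
total_docs = 18846


def estimate_runtime_seconds(n_docs, seconds, target_docs):
    """Scale a measured runtime linearly to a larger document count."""
    return seconds / n_docs * target_docs


for n_docs, seconds in timings:
    est = estimate_runtime_seconds(n_docs, seconds, total_docs)
    print(f"From {n_docs} docs: ~{est / 60:.0f} min for {total_docs} docs")
```

Both estimates fall between roughly 1h20 and 1h35, which is consistent with the approximate 1h30 quoted for the whole dataset.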

In [None]:
model_docs = docs[0:500]

In [None]:
# Set random seed
from umap import UMAP
umap_model = UMAP(random_state=42)

In [None]:
topic_model = BERTopic(umap_model=umap_model, verbose = True)
topics, probs = topic_model.fit_transform(model_docs)

In [None]:
# Save model
# topic_model.save("models/my_model")

In [None]:
# Load model
#topic_model = BERTopic.load("models/my_model")

After generating topics, we can access the frequent topics that were generated:

In [None]:
topic_model.get_topic_info()

In [None]:
topic_model.get_topic(0)

In [None]:
topic_model.get_document_info(model_docs)

In [None]:
# Note: This doesn't work because we're using too few documents
#fig1 = topic_model.visualize_topics()
#fig1.write_html("./viz/topic_model.html")

------

### 2. Model training: Subset dataset by selecting a number of categories
To naturally reduce the number of documents and topics, we can restrict the dataset to a few categories of documents (this is only possible in this example because the data is labelled).

In [None]:
import pandas as pd

In [None]:
newsgroups = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))
categories = newsgroups.target_names
target_num = newsgroups.target
target = [categories[x] for x in target_num]

In [None]:
df = pd.DataFrame()
df['text'] = newsgroups.data
df['target'] = target
df.head()

Filter the dataset down to just one subcategory per category for our example:

In [None]:
df['category'] = df.target.str.split('.').str[0]

In [None]:
unq_targets = df.loc[:, ["category", "target"]].drop_duplicates()
chosen_targets = []
# Number of unique (category, target) pairs to consider; because several pairs
# can share a category, this may yield fewer than first_n_categories categories
first_n_categories = 4

unq_targets = unq_targets[0:first_n_categories]

# Keep the first subcategory (target) seen for each category
for i in unq_targets.category.unique():
    choice = unq_targets[unq_targets.category == i].target.iloc[0]
    chosen_targets.append(choice)
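
The selection logic in the cell above can be traced without pandas. The following is an illustrative sketch only: the order of the example labels is hypothetical (the real order depends on the dataset), and the helper `first_target_per_category` is not part of the notebook.

```python
def first_target_per_category(targets, first_n):
    """Mimic the cell above: take the first `first_n` unique targets,
    then keep the first target seen for each top-level category."""
    unique_targets = []
    for t in targets:
        if t not in unique_targets:
            unique_targets.append(t)

    chosen = {}
    for t in unique_targets[:first_n]:
        category = t.split(".")[0]  # e.g. "rec.sport.hockey" -> "rec"
        chosen.setdefault(category, t)
    return list(chosen.values())


# Hypothetical ordering of labels, for illustration only
example_targets = [
    "rec.sport.hockey",
    "comp.sys.ibm.pc.hardware",
    "rec.sport.hockey",
    "talk.politics.mideast",
    "comp.sys.mac.hardware",
    "sci.space",
]

chosen = first_target_per_category(example_targets, first_n=4)
print(chosen)
```

In this example the first four unique targets cover only three categories (`rec`, `comp`, `talk`), so three targets are chosen rather than four.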

In [None]:
df = df.loc[df.target.isin(chosen_targets),].reset_index(drop = True)
df

In [None]:
docs_subset = df.text.values.tolist()

Set up the model (takes ~15 mins to run on CPU). The training cell is commented out here; a previously saved model is loaded instead.

In [None]:
#topic_model_small = BERTopic(umap_model=umap_model, nr_topics = 20, verbose = True) # Set number of topics to 20
#topics, probs = topic_model_small.fit_transform(docs_subset)

In [None]:
# Save model
#topic_model_small.save("models/my_model_small")

In [None]:
# Load model
topic_model_small = BERTopic.load("models/my_model_small")

After generating topics, we can access the frequent topics that were generated:

In [None]:
topic_model_small.get_topic_info().head()

In [None]:
topic_model_small.get_topic(0)

In [None]:
fig2 = topic_model_small.visualize_topics()
fig2.write_html("./viz/topic_model_small.html")

In [None]:
fig2_hier = topic_model_small.visualize_hierarchy()
fig2_hier.write_html("./viz/topic_model_small_hier.html")

-----

**Test: Reduce the number of topics**

We can control the number of topics either by specifying it up front in `BERTopic()`, or afterwards using `.reduce_topics()`:

In [None]:
topic_model_small.reduce_topics(docs_subset, nr_topics=10)

In [None]:
topic_model_small.get_topic_info().head()

In [None]:
fig3 = topic_model_small.visualize_topics()
fig3.write_html("./viz/topic_model_small_reduced.html")

This reduction doesn't look like it works particularly well, so we'll abandon this approach.

--------

**Subsetted data by category: Compare actual topic to predicted topic**

In [None]:
# Load previous model
topic_model_small = BERTopic.load("models/my_model_small")

In [None]:
topic_predictions = topic_model_small.get_document_info(docs_subset)
topic_predictions

In [None]:
# Attach predicted topics onto original dataset
df_subset = df.loc[df.text.isin(docs_subset)]

df_pred_exp = df_subset.copy()
df_pred_exp["predicted_topic"] = topic_predictions["Topic"]
df_pred_exp["predicted_topic_prob"] = topic_predictions["Probability"]

df_pred_exp

**Compare actual categories to predicted topics:**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
conf_matrix = pd.crosstab(df_pred_exp['target'], df_pred_exp['predicted_topic'], rownames=['Actual'], colnames=['Predicted'])
print(conf_matrix)
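
To make clear what the cross-tabulation counts, here is a small standard-library sketch of the same idea. The labels are hypothetical and are not results from the model; `-1` is the id BERTopic uses for outlier documents.

```python
from collections import Counter

# Hypothetical labels, for illustration only (not model output)
actual = ["rec.autos", "rec.autos", "sci.space", "sci.space", "sci.space"]
predicted = [0, 0, 1, 1, -1]

# Count each (actual, predicted) pair, which is what a crosstab tabulates
pair_counts = Counter(zip(actual, predicted))

rows = sorted(set(actual))
cols = sorted(set(predicted))
table = {r: [pair_counts[(r, c)] for c in cols] for r in rows}

print("Predicted", cols)
for r in rows:
    print(r, table[r])
```

A category whose documents mostly fall under a single predicted topic shows up as one dominant cell in its row.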

In [None]:
#Plot confusion matrix heatmap
plt.figure(figsize=(10, 10))
sns.set(font_scale=1.5)

sns.heatmap(conf_matrix,
            cmap='coolwarm',
            annot=True,
            fmt='.5g',
            vmax=200)

plt.xlabel('Predicted',fontsize=22)
plt.ylabel('Actual',fontsize=22)

--------

**Manually merge topics**

Merge topics according to the labelled data:

In [None]:
topic_sets = [[1, 3], [0, 12], [2, 4, 5, 6, 8, 9, 13, 14, 15, 16, 17, 18]]
topic_model_small.merge_topics(docs = docs_subset, topics_to_merge = topic_sets)
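
Conceptually, merging collapses every topic in a group into a single topic. The sketch below illustrates that idea on hypothetical per-document assignments; mapping each group to its first id is an assumption made for illustration, and BERTopic may renumber topics after merging.

```python
from itertools import chain

topic_sets = [[1, 3], [0, 12], [2, 4, 5, 6, 8, 9, 13, 14, 15, 16, 17, 18]]

# Sanity check: each topic should appear in at most one merge group
all_topics = list(chain.from_iterable(topic_sets))
assert len(all_topics) == len(set(all_topics)), "topic sets overlap"

# Map every topic in a group to the group's first id (illustrative choice)
mapping = {t: group[0] for group in topic_sets for t in group}

# Hypothetical per-document topic assignments before merging
doc_topics = [3, 12, 18, 7, -1]
merged = [mapping.get(t, t) for t in doc_topics]
print(merged)
```

Topics outside every group (here 7 and the outlier topic -1) are left unchanged. Because the ids reported after `merge_topics` may differ from this sketch, check `get_topic_info()` rather than relying on the mapping above.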

In [None]:
topic_model_small.get_topic_info()

**Check what results look like**

In [None]:
topic_predictions = topic_model_small.get_document_info(docs_subset)
topic_predictions

In [None]:
# Attach predicted topics onto original dataset
df_subset = df.loc[df.text.isin(docs_subset)]

df_pred_exp = df_subset.copy()
df_pred_exp["predicted_topic"] = topic_predictions["Topic"]
df_pred_exp["predicted_topic_prob"] = topic_predictions["Probability"]

df_pred_exp

**Compare actual categories to predicted topics:**

In [None]:
conf_matrix = pd.crosstab(df_pred_exp['target'], df_pred_exp['predicted_topic'], rownames=['Actual'], colnames=['Predicted'])
print(conf_matrix)

In [None]:
#Plot confusion matrix heatmap
plt.figure(figsize=(10, 10))
sns.set(font_scale=1.5)

sns.heatmap(conf_matrix,
            cmap='coolwarm',
            annot=True,
            fmt='.5g',
            vmax=200)

plt.xlabel('Predicted',fontsize=22)
plt.ylabel('Actual',fontsize=22)

Look at topics:

In [None]:
topic_model_small.get_topic(2)

In [None]:
topic_model_small.get_topic(0)

In [None]:
topic_model_small.get_topic(1)

--------

**Manually merge topics**

Merge topics based on the topic hierarchy:

In [None]:
# Load previous model
topic_model_small2 = BERTopic.load("models/my_model_small")

In [None]:
topic_sets = [[11,15,16],[3,4,5],[0,2,9,1,7,12]]
topic_model_small2.merge_topics(docs = docs_subset, topics_to_merge = topic_sets)

In [None]:
topic_model_small2.get_topic_info()

**Check what results look like**

In [None]:
topic_predictions = topic_model_small2.get_document_info(docs_subset)
topic_predictions

In [None]:
# Attach predicted topics onto original dataset
df_subset = df.loc[df.text.isin(docs_subset)]

df_pred_exp = df_subset.copy()
df_pred_exp["predicted_topic"] = topic_predictions["Topic"]
df_pred_exp["predicted_topic_prob"] = topic_predictions["Probability"]

df_pred_exp

**Compare actual categories to predicted topics:**

In [None]:
conf_matrix = pd.crosstab(df_pred_exp['target'], df_pred_exp['predicted_topic'], rownames=['Actual'], colnames=['Predicted'])
print(conf_matrix)

In [None]:
#Plot confusion matrix heatmap
plt.figure(figsize=(10, 10))
sns.set(font_scale=1.5)

sns.heatmap(conf_matrix,
            cmap='coolwarm',
            annot=True,
            fmt='.5g',
            vmax=200)

plt.xlabel('Predicted',fontsize=22)
plt.ylabel('Actual',fontsize=22)

This doesn't give the expected results.