## Set up Colab
Mount Google Drive and activate the virtual environment that holds the required packages.

In [None]:
from google.colab import drive
drive.mount("/content/drive")

In [None]:
#!virtualenv /content/drive/MyDrive/NLP_Examples/venv-nlp-code-examples #Only needed when creating venv for the first time

In [None]:
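# Note: each `!` command runs in its own subshell, so this activation does not persist to later cells
!source /content/drive/MyDrive/NLP_Examples/venv-nlp-code-examples/bin/activate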

-----

# BERTopic Example

This notebook contains an applied example of using BERTopic for topic modelling. Code is taken from the [BERTopic site](https://maartengr.github.io/BERTopic/getting_started/quickstart/quickstart.html).

Please note that topic models for the full dataset aren't generated here because of the long runtime on CPU. The topics shown are therefore not very good, as they are based on only a small subset of the data.

To run the same code in Google Colab on a GPU (faster runtime), use [this notebook](https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing#scrollTo=SNa-KtKDRnus).

---
**Load packages:**

In [None]:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

------

The 20 newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training (or development) and one for testing (or performance evaluation). The split between the train and test sets is based on whether a message was posted before or after a specific date. More information can be found here: https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html

**Fetch data:**

In [None]:
fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes')).keys()

In [None]:
fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['target_names']

In [None]:
# Create a df of the input data with labelled category for use later

# Core data
newsgroups = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))
df = pd.DataFrame()
df['text'] = newsgroups.data
df['target'] = newsgroups.target
df["ref_number"] = df.index

# Category names
target_names = newsgroups.target_names  # reuse the dataset fetched above
df_names = pd.DataFrame()
df_names["target_name"] = target_names
df_names["target_num"] = df_names.index

# Attach category name on
df_labelled = (
    df.merge(df_names, left_on="target", right_on="target_num")
    .sort_values("ref_number")
    .reset_index(drop=True)
    .drop(columns="target_num")
)

In [None]:
df_labelled

In [None]:
docs = newsgroups.data  # same documents as the df above, without re-fetching

In [None]:
docs[0]

In [None]:
docs[0:3]

In [None]:
len(docs)

------

### 1. Model training: Full dataset, 50 topics

**Set up model:**

Note: this takes too long to run on AP because no GPU is available.

On AP:
With 100 docs takes about 25s to run.
With 500 docs takes about 2m30s to run.

On GPU (colab):
With 500 docs takes about 25s to run.
With 18k docs, approx 2 mins to run.

The dataset contains 18,846 docs in total. Fitting all of them takes approximately 1h30 on CPU.
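Given these runtimes, one way to keep CPU runs manageable is to fit on a fixed random subset of the documents. The following is a minimal sketch of that idea, not part of the original notebook: `sample_docs` is a hypothetical helper, and the toy list stands in for the real posts.

```python
import random


def sample_docs(docs, n, seed=42):
    """Return a reproducible random sample of at most n documents."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    n = min(n, len(docs))      # never ask for more documents than exist
    return rng.sample(docs, n)


# Toy stand-in for the 20 newsgroups posts (assumption for illustration)
toy_docs = [f"post {i}" for i in range(1000)]
subset = sample_docs(toy_docs, 100)
print(len(subset))
```

In the notebook, the result of such a helper could be assigned to `model_docs` in place of the full `docs` list.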

In [None]:
model_docs = docs

In [None]:
# Set random seed
from umap import UMAP
umap_model = UMAP(random_state=42)

In [None]:
topic_model = BERTopic(umap_model=umap_model, verbose=True, nr_topics=50)
# Fit on model_docs so the same documents are used here and in get_document_info() below
topics, probabilities = topic_model.fit_transform(model_docs)

In [None]:
# Save model
model_save_name = 'full_bertopic_model'
path = f"/content/drive/MyDrive/NLP_Examples/models/{model_save_name}"

In [None]:
#topic_model.save(path)

In [None]:
# Load model
#topic_model = BERTopic.load(path)

After fitting the model, we can inspect the most frequent topics that were generated:

In [None]:
topic_info = topic_model.get_topic_info()
topic_info.head()

In [None]:
topic_info.iloc[1].Representative_Docs[0]

In [None]:
topic_model.get_topic(0)

In [None]:
topic_model.get_document_info(model_docs)

In [None]:
fig1 = topic_model.visualize_topics()
fig1.write_html("/content/drive/MyDrive/NLP_Examples/topic_model_viz.html")
fig1

In [None]:
fig_hier = topic_model.visualize_hierarchy()
fig_hier

-----

**Test: Reduce the number of topics**

We can control the number of topics either by specifying `nr_topics` when creating `BERTopic()`, or afterwards using `.reduce_topics()`:

In [None]:
topic_model.reduce_topics(docs, nr_topics=20)

In [None]:
topic_model.get_topic_info().head()

In [None]:
topic_model.visualize_topics()

This reduction doesn't appear to work particularly well, so we'll abandon this approach.

--------

**Compare actual topic to predicted topic**

In [None]:
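# Load previous (un-reduced) model; reduce_topics() above changed topic_model in place,
# so uncomment this (after saving the model earlier) to undo the reduction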
#topic_model = BERTopic.load(path)

In [None]:
topic_predictions = topic_model.get_document_info(docs)
topic_predictions

In [None]:
# Attach predicted topics onto original dataset
df_pred_exp = df_labelled.copy()
df_pred_exp["predicted_topic"] = topic_predictions["Topic"]
df_pred_exp["predicted_topic_prob"] = topic_predictions["Probability"]

df_pred_exp

**Compare actual categories to predicted topics:**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Drop uncategorized rows
df_pred_exp_clean = df_pred_exp.loc[df_pred_exp.predicted_topic != -1]
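Before dropping the rows labelled `-1` (the label BERTopic uses for outliers), it can be worth checking how much of the data they represent. The sketch below is illustrative only: `outlier_share` is a hypothetical helper and the topic list is made up.

```python
def outlier_share(topics, outlier_label=-1):
    """Fraction of documents assigned to the outlier topic."""
    if not topics:
        return 0.0
    return sum(1 for t in topics if t == outlier_label) / len(topics)


# Toy predicted topics (assumption for illustration)
toy_topics = [0, -1, 2, 2, -1, 1, 0, -1]
print(f"{outlier_share(toy_topics):.1%} of documents are outliers")
```

With the notebook's data, the input would be something like `df_pred_exp["predicted_topic"].tolist()`.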

In [None]:
conf_matrix = pd.crosstab(
    df_pred_exp_clean['target'],
    df_pred_exp_clean['predicted_topic'],
    rownames=['Actual'],
    colnames=['Predicted'],
)
print(conf_matrix)

In [None]:
#Plot confusion matrix heatmap
plt.figure(figsize=(10, 10))
sns.set(font_scale=1.5)

sns.heatmap(conf_matrix,
            cmap='coolwarm',
            annot=True,
            fmt='.5g',
            vmax=200)

plt.xlabel('Predicted',fontsize=22)
plt.ylabel('Actual',fontsize=22)
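The heatmap can also be summarised numerically. As a rough sketch (not part of the original analysis), the hypothetical helper `topic_purity` below reports, for each predicted topic, the most common actual category and the share of that topic's documents it accounts for. The labels are toy data.

```python
from collections import Counter, defaultdict


def topic_purity(actual, predicted):
    """Map each predicted topic to (dominant actual label, share of its documents)."""
    by_topic = defaultdict(Counter)
    for a, p in zip(actual, predicted):
        by_topic[p][a] += 1  # count actual labels within each predicted topic

    result = {}
    for topic, counts in by_topic.items():
        label, n = counts.most_common(1)[0]
        result[topic] = (label, n / sum(counts.values()))
    return result


# Toy labels (assumption for illustration)
toy_actual = ["sci.space", "sci.space", "rec.autos", "rec.autos", "sci.space"]
toy_predicted = [0, 0, 1, 1, 1]

purity = topic_purity(toy_actual, toy_predicted)
for topic, (label, share) in sorted(purity.items()):
    print(topic, label, round(share, 2))
```

A topic with a share close to 1 lines up with a single newsgroup, while a low share suggests the topic mixes several categories.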

--------

**Manually merge topics**

According to labelled data

In [None]:
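import copy

topic_sets = [[8, 11], [1, 7], [1, 12]]
# Keep an independent copy of the unmerged model; plain assignment would only alias it
topic_model_unmerged = copy.deepcopy(topic_model)
topic_model.merge_topics(docs=docs, topics_to_merge=topic_sets)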
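The merge sets above list topic 1 in two groups, so the intent is presumably that topics 1, 7 and 12 end up together. The sketch below illustrates that effect on a list of document labels. It is an assumption-laden illustration, not BERTopic's internal implementation: `combine_overlapping` and `relabel` are hypothetical helpers, and BERTopic may also renumber topics after merging, which this sketch does not attempt.

```python
def combine_overlapping(topic_sets):
    """Union groups that share a topic, e.g. [[1, 7], [1, 12]] -> [[1, 7, 12]]."""
    combined = []
    for group in topic_sets:
        group = set(group)
        remaining = []
        for existing in combined:
            if existing & group:
                group |= existing  # absorb any earlier group that overlaps
            else:
                remaining.append(existing)
        remaining.append(group)
        combined = remaining
    return [sorted(g) for g in combined]


def relabel(topics, topic_sets):
    """Give every member of a merge group the group's smallest label."""
    mapping = {}
    for group in combine_overlapping(topic_sets):
        for t in group:
            mapping[t] = group[0]
    return [mapping.get(t, t) for t in topics]  # unlisted topics are left unchanged


toy_topics = [1, 7, 12, 8, 11, 3, -1]
print(combine_overlapping([[8, 11], [1, 7], [1, 12]]))
print(relabel(toy_topics, [[8, 11], [1, 7], [1, 12]]))
```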

In [None]:
topic_model.get_topic_info()

**Check what results look like**

In [None]:
topic_predictions = topic_model.get_document_info(docs)
topic_predictions

In [None]:
# Attach predicted topics onto original dataset
df_pred_exp = df_labelled.copy()
df_pred_exp["predicted_topic"] = topic_predictions["Topic"]
df_pred_exp["predicted_topic_prob"] = topic_predictions["Probability"]

df_pred_exp

**Compare actual categories to predicted topics:**

In [None]:
# Drop uncategorized rows
df_pred_exp_clean = df_pred_exp.loc[df_pred_exp.predicted_topic != -1]

conf_matrix = pd.crosstab(
    df_pred_exp_clean['target'],
    df_pred_exp_clean['predicted_topic'],
    rownames=['Actual'],
    colnames=['Predicted'],
)
print(conf_matrix)

In [None]:
#Plot confusion matrix heatmap
plt.figure(figsize=(10, 10))
sns.set(font_scale=1.5)

sns.heatmap(conf_matrix,
            cmap='coolwarm',
            annot=True,
            fmt='.5g',
            vmax=200)

plt.xlabel('Predicted',fontsize=22)
plt.ylabel('Actual',fontsize=22)

Look at topics:

In [None]:
topic_info = topic_model.get_topic_info()

In [None]:
chosen_topic = 7

In [None]:
rep_doc = topic_info.loc[topic_info.Topic == chosen_topic, "Representative_Docs"].reset_index(drop = True).iloc[0][0]
rep_doc

In [None]:
df_pred_exp_clean.loc[df_pred_exp_clean.predicted_topic == chosen_topic].target_name.value_counts()

--------

**Manually merge topics**

Based on topic hierarchy

In [None]:
# Copy previous model (deepcopy, so further merges leave topic_model untouched)
import copy
topic_model_merged = copy.deepcopy(topic_model)
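A note on copying: in Python, assignment binds a second name to the same object, whereas `copy.deepcopy` creates an independent one. The sketch below shows the difference using `ToyModel`, a hypothetical stand-in class that is not BERTopic.

```python
import copy


class ToyModel:
    """Hypothetical stand-in for a fitted model; not BERTopic."""

    def __init__(self, topics):
        self.topics = list(topics)

    def merge(self, keep, drop):
        # Relabel every document in `drop` as `keep`, modifying the model in place
        self.topics = [keep if t == drop else t for t in self.topics]


original = ToyModel([0, 1, 2, 1])
alias = original                    # same object under a second name
snapshot = copy.deepcopy(original)  # independent copy

alias.merge(keep=0, drop=1)

print(original.topics)  # changed, because alias and original are the same object
print(snapshot.topics)  # unchanged
```

Deep-copying a large fitted model can be slow and memory-hungry, so saving the model to disk before merging is a reasonable alternative.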

In [None]:
topic_sets = [[10, 3], [8, 15]]
topic_model_merged.merge_topics(docs = docs, topics_to_merge = topic_sets)

In [None]:
topic_model_merged.get_topic_info()

**Check what results look like**

In [None]:
# Use the merged model and the full document list
# (topic_model_small2 and docs_subset are not defined in this notebook)
topic_predictions = topic_model_merged.get_document_info(docs)
topic_predictions

In [None]:
# Attach predicted topics onto original dataset
df_pred_exp = df_labelled.copy()
df_pred_exp["predicted_topic"] = topic_predictions["Topic"]
df_pred_exp["predicted_topic_prob"] = topic_predictions["Probability"]

df_pred_exp

**Compare actual categories to predicted topics:**

In [None]:
conf_matrix = pd.crosstab(
    df_pred_exp['target'],
    df_pred_exp['predicted_topic'],
    rownames=['Actual'],
    colnames=['Predicted'],
)
print(conf_matrix)

In [None]:
#Plot confusion matrix heatmap
plt.figure(figsize=(10, 10))
sns.set(font_scale=1.5)

sns.heatmap(conf_matrix,
            cmap='coolwarm',
            annot=True,
            fmt='.5g',
            vmax=200)

plt.xlabel('Predicted',fontsize=22)
plt.ylabel('Actual',fontsize=22)