# BERTopic: Find the Optimal Number of Topics and Visualise the Results

This notebook runs BERTopic several times on the same dataset, records how many topics each run identifies, and visualizes how that number varies between runs. The most frequent topic count is used as the working estimate of the number of topics.

## Function: `run_bertopic_multiple_times`

This function performs topic modeling multiple times on a given dataset and records the number of unique topics identified in each run.

### Parameters

- `dataframe`: A `pandas.DataFrame` object containing the dataset. It must have a `text` column holding the documents.
- `num_runs`: An integer specifying the number of times to run BERTopic (default is 10).

### Returns

- An integer representing the most common number of topics identified across all runs.

### Process

1. Extracts the text column from the DataFrame.
2. Runs BERTopic the specified number of times, each time fitting the model on the documents and recording the number of unique topics (excluding outliers).
3. Saves the count of topics for each run to `topic_counts.csv`, with columns `run_number` and `topic_count`.
4. Determines the most common number of topics identified across all runs.
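
Step 4 can be sketched in isolation with the standard library. The topic counts below are hypothetical values chosen for illustration, not results from the dataset. The sketch shows that `Counter.most_common(1)` returns a single value even when two counts are tied, so it may be worth checking for ties before treating the result as the typical number of topics.

```python
from collections import Counter

# Hypothetical topic counts from six runs (illustrative values only)
topic_counts = [112, 113, 115, 113, 112, 110]

frequencies = Counter(topic_counts)

# most_common(1) returns one (value, frequency) pair; on a tie,
# the value that was encountered first wins
mode, mode_freq = frequencies.most_common(1)[0]

# Collect every value that shares the top frequency to make ties visible
tied_modes = sorted(v for v, f in frequencies.items() if f == mode_freq)

print(mode, mode_freq)  # 112 2
print(tied_modes)       # [112, 113]
```

Here both 112 and 113 occur twice, but only 112 is returned because it appears first in the list.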


In [1]:
import pandas as pd
from bertopic import BERTopic
import matplotlib.pyplot as plt
from collections import Counter
from tqdm import tqdm
import numpy as np

def run_bertopic_multiple_times(dataframe: pd.DataFrame, num_runs: int = 10) -> int:
    topic_counts = []

    # Extract the text column for BERTopic only once
    documents = dataframe['text'].tolist()

    # Use tqdm to add a progress bar
    for _ in tqdm(range(num_runs), desc="Running BERTopic"):
        # Apply BERTopic
        topic_model = BERTopic()
        # Probabilities are not needed here, only the topic assignments
        topics, _probabilities = topic_model.fit_transform(documents)

        # Get the number of unique topics (excluding outliers)
        num_topics = len(set(t for t in topics if t != -1))
        topic_counts.append(num_topics)

    # Save topic counts to a CSV file
    df_topic_counts = pd.DataFrame({'run_number': range(1, num_runs + 1), 'topic_count': topic_counts})
    df_topic_counts.to_csv('topic_counts.csv', index=False)
    
    # Find the most common number of topics
    most_common_num_topics = Counter(topic_counts).most_common(1)[0][0]
    
    return most_common_num_topics


In [2]:
dataframe = pd.read_csv('covid_df_20_01.csv')

In [3]:
run_bertopic_multiple_times(dataframe)

Running BERTopic: 100%|██████████| 10/10 [03:19<00:00, 19.97s/it]


113

## Visualization of Topic Variability

The second part of the code visualizes the distribution of topic counts across the BERTopic runs.

### Steps

1. Load the `topic_counts.csv` file containing the results from multiple BERTopic runs.
2. Calculate the frequency of each unique topic count and identify the counts with the highest frequency.
3. Assign colors to bars in a bar plot based on their frequency:
   - **Green** for the most frequent counts.
   - **Orange** for the second most frequent counts.
   - **Red** for all other counts.
4. Generate and save a bar plot showing the variability of identified topics across runs.
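
The coloring rule in step 3 can be sketched without pandas as a small helper. The function name `assign_bar_colors` and the example frequencies are hypothetical and used only for illustration. The helper ranks the distinct frequency values and maps the highest to green, the second highest to orange, and everything else to red.

```python
from typing import Dict

def assign_bar_colors(frequencies: Dict[int, int]) -> Dict[int, str]:
    """Map each topic count to a bar color by frequency rank."""
    # Distinct frequency values, highest first
    ranked = sorted(set(frequencies.values()), reverse=True)
    top = ranked[0] if ranked else None
    second = ranked[1] if len(ranked) > 1 else None

    colors = {}
    for topic_count, freq in frequencies.items():
        if freq == top:
            colors[topic_count] = "green"
        elif freq == second:
            colors[topic_count] = "orange"
        else:
            colors[topic_count] = "red"
    return colors

# Hypothetical input: topic count -> number of runs that produced it
example = {110: 1, 112: 3, 113: 3, 115: 2, 118: 1}
print(assign_bar_colors(example))
# {110: 'red', 112: 'green', 113: 'green', 115: 'orange', 118: 'red'}
```

Tied values receive the same color, so two topic counts that share the highest frequency are both green.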

### Key Features

- The visualization highlights the most common outcomes (in green) to easily identify the typical number of topics BERTopic finds in the dataset.
- This approach helps in understanding the consistency and variability of the BERTopic algorithm across multiple runs.
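
Alongside the plot, a few summary statistics can put a number on the variability. This is a minimal sketch that is not part of the original notebook. It uses only the standard `statistics` module, and the topic counts are hypothetical values for illustration.

```python
import statistics

# Hypothetical topic counts from ten runs (illustrative values only)
topic_counts = [110, 112, 112, 113, 113, 113, 115, 115, 118, 119]

mean_count = statistics.mean(topic_counts)
stdev_count = statistics.stdev(topic_counts)  # sample standard deviation
spread = max(topic_counts) - min(topic_counts)

# Share of runs that produced the most common topic count
mode_count = statistics.mode(topic_counts)
mode_share = topic_counts.count(mode_count) / len(topic_counts)

print(f"mean={mean_count:.1f}, stdev={stdev_count:.2f}, "
      f"range={spread}, mode={mode_count}, mode share={mode_share:.0%}")
```

A low mode share or a wide range suggests that the most common topic count is a weak summary of the runs.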


In [4]:
import pandas as pd
import matplotlib.pyplot as plt

# Color the bars by frequency rank. If several topic counts share the
# highest frequency, all of them are colored green.

df = pd.read_csv('topic_counts.csv')

# Count the frequency of each 'topic_count' value
topic_counts = df['topic_count'].value_counts().sort_index()

# Sort the frequencies in descending order
sorted_counts = topic_counts.sort_values(ascending=False)

# The highest frequency determines the 'green' bars
top_frequency = sorted_counts.iloc[0]

# Indices with the top frequency will be green
green_indices = sorted_counts[sorted_counts == top_frequency].index

# Indices with the second top frequency (and not green) will be orange
# We remove the green indices from consideration
remaining_counts = sorted_counts[~sorted_counts.index.isin(green_indices)]
if not remaining_counts.empty:
    second_top_frequency = remaining_counts.iloc[0]
    orange_indices = remaining_counts[remaining_counts == second_top_frequency].index
else:
    orange_indices = pd.Index([])

# All other indices will be red
red_indices = sorted_counts.index.difference(green_indices.union(orange_indices))

# Now we assign colors based on the indices
colors = ['green' if idx in green_indices else 'orange' if idx in orange_indices else 'red' for idx in topic_counts.index]

# Create the bar plot with the appropriate colors
plt.figure(figsize=(10, 6))  # Set the figure size for better readability
bars = plt.bar(topic_counts.index, topic_counts, color=colors, edgecolor='black')

# Set the labels and title
plt.xlabel('Number of Topics')
plt.ylabel('Frequency')
# Use the actual number of runs in the CSV rather than a hard-coded value
plt.title(f'Variability of Identified Topics in {len(df)} BERTopic Model Runs')
plt.xticks(topic_counts.index)

# Save the plot
plt.savefig('topic_distribution_histogram.png')
plt.close()