# Automated Topic Modelling with BERTopic

## Overview

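This Jupyter Notebook demonstrates topic modeling with the BERTopic library in six variants, from a single default run to iterative re-clustering of outliers with KeyBERT-based topic representations.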

## Process

1. **Environment Setup**:
    - Import the necessary libraries: BERTopic, NLTK (for stopwords), `os`, and Pandas.
    - Set the `TOKENIZERS_PARALLELISM` environment variable to `false` to disable parallelism in the Hugging Face tokenizers, which avoids fork-related deadlock warnings.

2. **Dataset Loading**:
    - Load the dataset from a CSV file into a Pandas dataframe.

3. **BERTopic Modeling**:
    - Version 1: Basic BERTopic
        - Instances of BERTopic execution: 1.
    - Version 2: Basic BERTopic with KeyBERT
        - Instances of BERTopic execution: 1.
        - Fit the model on the 'text' column, which assigns each document to a topic.
        - The representation layer is improved by using KeyBERT (`KeyBERTInspired`).
    - Version 3: Basic BERTopic with multiple iterations
        - Instances of BERTopic execution: n.
        - Iteration n+1 runs only on the outliers (topic -1) left over from iteration n.
        - After topic modeling, the dataframe is updated with two new columns, 'Topic' and 'Topic Name', and saved to a CSV file.
        - Save the 'Topic Name' column to a .npy file (NumPy file format).
    - Version 4: Basic BERTopic with multiple iterations with KeyBERT
        - Instances of BERTopic execution: n.
        - Iteration n+1 runs only on the outliers (topic -1) left over from iteration n.
        - Fit the model on the 'text' column, which assigns each document to a topic.
        - After topic modeling, the dataframe is updated with two new columns, 'Topic' and 'Topic Name', and saved to a CSV file.
        - Save the 'Topic Name' column to a .npy file (NumPy file format).
        - The representation layer is improved by using KeyBERT (`KeyBERTInspired`).
    - Version 5: Basic BERTopic with multiple iterations, fixed labelling
    - Version 6: Basic BERTopic with multiple iterations with KeyBERT, fixed labelling
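
The iterative scheme in Versions 3 to 6 (fit, set the outliers aside, fit again on the outliers only) can be sketched without BERTopic. In the minimal sketch below, `toy_fit_transform` is a hypothetical stand-in for `BERTopic.fit_transform` that groups documents sharing the most common word; it is not how BERTopic clusters, and only the control flow of the loop is the point.

```python
from collections import Counter
from typing import List, Tuple


def toy_fit_transform(docs: List[str], min_size: int = 2) -> List[int]:
    """Hypothetical stand-in for BERTopic.fit_transform.

    Documents containing the single most common word form topic 0;
    every other document is an outlier (-1).
    """
    counts = Counter(word for doc in docs for word in set(doc.lower().split()))
    if not counts:
        return [-1] * len(docs)
    word, freq = counts.most_common(1)[0]
    if freq < min_size:
        return [-1] * len(docs)
    return [0 if word in doc.lower().split() else -1 for doc in docs]


def iterate_on_outliers(docs: List[str], num_runs: int) -> List[Tuple[int, List[str], List[str]]]:
    """Run the clusterer up to num_runs times, feeding each run's outliers to the next."""
    history = []
    current = list(docs)
    for run_number in range(1, num_runs + 1):
        topics = toy_fit_transform(current)
        assigned = [d for d, t in zip(current, topics) if t != -1]
        outliers = [d for d, t in zip(current, topics) if t == -1]
        history.append((run_number, assigned, outliers))
        if not outliers:  # nothing left to re-cluster
            break
        current = outliers
    return history


docs = [
    "vaccine rollout",
    "vaccine side effects",
    "election results",
    "election fraud claim",
    "gas prices",
]
history = iterate_on_outliers(docs, num_runs=3)
for run_number, assigned, outliers in history:
    print(run_number, len(assigned), len(outliers))
```

Keeping the outliers in memory, as here, is one option; the notebook instead writes them to a CSV file per run and reads that file back at the start of the next run.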
    


        

In [3]:
import pandas as pd
import re
import os
from bertopic import BERTopic
from nltk.corpus import stopwords

2. **Dataset Loading**


In [4]:
# dataframe = pd.read_csv('biden_df_12_01.csv')
dataframe = pd.read_csv('covid_df_20_01.csv')
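
Before fitting, it can help to check that the 'text' column exists and contains only non-empty strings, since empty CSV cells load as `NaN` (a float) and an embedding model that expects strings may not handle them. The sketch below is an optional pre-processing step, not part of the original notebook; `prepare_text_column` and the in-memory sample are hypothetical.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real CSV file.
sample_csv = io.StringIO(
    "id,text\n"
    "1,Vaccines are safe\n"
    "2,\n"
    "3,   \n"
    "4,Election results certified\n"
)
df = pd.read_csv(sample_csv)


def prepare_text_column(df: pd.DataFrame, column: str = "text") -> pd.DataFrame:
    """Drop rows whose text is missing or blank and return a freshly indexed copy."""
    if column not in df.columns:
        raise KeyError(f"Expected a '{column}' column, found: {list(df.columns)}")
    cleaned = df.dropna(subset=[column]).copy()
    cleaned[column] = cleaned[column].astype(str).str.strip()
    cleaned = cleaned[cleaned[column] != ""]
    # A clean 0..n-1 index keeps positional and label-based row lookups in agreement.
    return cleaned.reset_index(drop=True)


prepared = prepare_text_column(df)
print(prepared)
```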

3. **BERTopic Modeling**

    - Version 1: Basic BERTopic



In [3]:
# Set the environment variable to disable tokenizers parallelism before any model is loaded
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Load English stopwords from NLTK
english_stopwords = stopwords.words('english')

# # Assuming you have a DataFrame named 'dataframe' with a 'Summary' column
# # Apply stopword removal to the 'Summary' column
# dataframe['cleaned_summary'] = dataframe['Summary'].astype(str).apply(
#     lambda x: ' '.join([word for word in x.split() if word.lower() not in english_stopwords])
# )

# Create BERTopic model with English language setting
topic_model_english = BERTopic(language="english", calculate_probabilities=True, verbose=True)

# Fit the model to the raw 'text' column (the stopword removal above is commented out)
topics, probs = topic_model_english.fit_transform(dataframe['text'])

# Get the topics and their names
topic_details = topic_model_english.get_topic_info()
topic_names = {row['Topic']: row['Name'] for index, row in topic_details.iterrows()}

# Assign topics to each row in the dataset
dataframe['Topic'] = topics
dataframe['Topic Name'] = dataframe['Topic'].apply(lambda topic_num: topic_names.get(topic_num, 'Unknown'))
dataframe['Topic Representation'] = dataframe['Topic'].apply(lambda topic_num: ', '.join(term for term, _ in topic_model_english.get_topic(topic_num)))

2024-02-06 20:47:25,259 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/50 [00:00<?, ?it/s]

2024-02-06 20:48:13,640 - BERTopic - Embedding - Completed ✓
2024-02-06 20:48:13,641 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-06 20:48:24,176 - BERTopic - Dimensionality - Completed ✓
2024-02-06 20:48:24,177 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-06 20:48:24,331 - BERTopic - Cluster - Completed ✓
2024-02-06 20:48:24,335 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-02-06 20:48:24,403 - BERTopic - Representation - Completed ✓


In [4]:
topic_model_english.get_topic_info().to_csv('TOPICS-Biden-context.csv')

    - Version 2: Basic BERTopic with KeyBERT 


In [5]:
import os
import pandas as pd
from nltk.corpus import stopwords
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import KeyBERTInspired

# Set the environment variable to disable parallelism
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Load English stopwords from NLTK
english_stopwords = stopwords.words('english')

# Step 1 - Extract embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Step 2 - Reduce dimensionality
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')

# Step 3 - Cluster reduced embeddings
hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

# Step 4 - Tokenize topics
vectorizer_model = CountVectorizer(stop_words="english")

# Step 5 - Create topic representation
ctfidf_model = ClassTfidfTransformer()

# Step 6 - (Optional) Fine-tune topic representations with a `bertopic.representation` model
representation_model = KeyBERTInspired()

# Create BERTopic model with custom settings
topic_model = BERTopic(
  embedding_model=embedding_model,          # Step 1 - Extract embeddings
  umap_model=umap_model,                    # Step 2 - Reduce dimensionality
  hdbscan_model=hdbscan_model,              # Step 3 - Cluster reduced embeddings
  vectorizer_model=vectorizer_model,        # Step 4 - Tokenize topics
  ctfidf_model=ctfidf_model,                # Step 5 - Extract topic words
  representation_model=representation_model # Step 6 - (Optional) Fine-tune topic representations
)

# Fit the model to the 'text' column
topics, probs = topic_model.fit_transform(dataframe['text'])

# Get the topics and their names
topic_details = topic_model.get_topic_info()
topic_names = {row['Topic']: row['Name'] for index, row in topic_details.iterrows()}

# Assign topics to each row in the dataset
dataframe['Topic'] = topics
dataframe['Topic Name'] = dataframe['Topic'].apply(lambda topic_num: topic_names.get(topic_num, 'Unknown'))
dataframe['Topic Representation'] = dataframe['Topic'].apply(lambda topic_num: ', '.join(term for term, _ in topic_model.get_topic(topic_num)))
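
To make Step 5 more concrete, the following is a simplified, pure-Python sketch of class-based TF-IDF as it is described in the BERTopic documentation: all documents of a topic are treated as one long document, term frequencies are normalised per topic, and each term is weighted by `log(1 + A / f_t)`, where `A` is the average number of words per topic and `f_t` is the term's frequency across all topics. The function `class_tfidf` and the toy input are hypothetical, and the real `ClassTfidfTransformer` works on sparse matrices and differs in detail.

```python
import math
from collections import Counter
from typing import Dict, List


def class_tfidf(docs_per_topic: Dict[int, List[str]]) -> Dict[int, Dict[str, float]]:
    """Score each term per topic with a simplified class-based TF-IDF."""
    # Term counts per topic: the documents of a topic are joined into one long document.
    tf = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_per_topic.items()
    }
    # f_t: frequency of each term across all topics.
    total: Counter = Counter()
    for counts in tf.values():
        total.update(counts)
    # A: average number of words per topic.
    avg_words = sum(sum(counts.values()) for counts in tf.values()) / len(tf)

    scores = {}
    for topic, counts in tf.items():
        n_words = sum(counts.values())
        scores[topic] = {
            term: (count / n_words) * math.log(1 + avg_words / total[term])
            for term, count in counts.items()
        }
    return scores


toy_topics = {
    0: ["vaccine dose", "vaccine trial"],
    1: ["election vote", "election vaccine"],
}
scores = class_tfidf(toy_topics)
top_terms = {topic: max(s, key=s.get) for topic, s in scores.items()}
print(top_terms)
```

In the toy input, "vaccine" scores low in topic 1 because it is also frequent in topic 0, which is the effect the weighting is meant to achieve.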


In [6]:
topic_model.get_topic_info().to_csv('TOPICS-Biden-context-keybert.csv')

In [8]:
topic_model.get_topic_info()

| Unnamed: 0 | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 538 | -1_biden_trump_donald_barack | [biden, trump, donald, barack, presidential, p... | [An image shared on Facebook purportedly shows... |
| 1 | 0 | 125 | 0_biden_image_photo_posing | [biden, image, photo, posing, photograph, pict... | [An image shared on Facebook purportedly shows... |
| 2 | 1 | 103 | 1_biden_votes_electors_electoral | [biden, votes, electors, electoral, voters, ba... | [Some 2,600 uncounted votes, a majority of whi... |
| 3 | 2 | 97 | 2_biden_video_joe_chanting | [biden, video, joe, chanting, presidential, ch... | [A YouTube video shared on Facebook purportedl... |
| 4 | 3 | 91 | 3_biden_oil_fuels_petroleum | [biden, oil, fuels, petroleum, gas, gasoline, ... | [Low gas prices in Russia, Kuwait and Saudi Ar... |
| 5 | 4 | 74 | 4_covid_vaccine_vaccines_vaccinations | [covid, vaccine, vaccines, vaccinations, vacci... | [A viral Facebook post shared over 18,000 time... |
| 6 | 5 | 63 | 5_pelosi_cruz_biden_ted | [pelosi, cruz, biden, ted, nancy, republican, ... | [A video shared on Facebook claims Republican ... |
| 7 | 6 | 56 | 6_biden_funding_obama_obamas | [biden, funding, obama, obamas, trillion, mill... | [President Joe Biden claimed in a forum Oct. 2... |
| 8 | 7 | 54 | 7_biden_pence_harris_pelosi | [biden, pence, harris, pelosi, presidential, c... | [A video shows U.S. President Joe Biden accide... |
| 9 | 8 | 54 | 8_immigration_deportations_immigrants_migrants | [immigration, deportations, immigrants, migran... | [Under U.S. President Joe Biden's administrati... |
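
A column named `Unnamed: 0`, as in the output above, is what pandas typically produces when a dataframe is written with the default `to_csv()` (which also writes the row index) and then read back with `read_csv()`. The sketch below reproduces this with a small in-memory frame built from the first three rows above and shows that `index=False` avoids the extra column; whether the table above was produced this way is an assumption.

```python
import io

import pandas as pd

topic_info = pd.DataFrame(
    {
        "Topic": [-1, 0, 1],
        "Count": [538, 125, 103],
        "Name": [
            "-1_biden_trump_donald_barack",
            "0_biden_image_photo_posing",
            "1_biden_votes_electors_electoral",
        ],
    }
)

# Default to_csv() writes the row index as an unnamed first column.
with_index = io.StringIO()
topic_info.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False leaves the row index out of the file.
without_index = io.StringIO()
topic_info.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```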


    - Version 3: Basic BERTopic with multiple iterations

In [9]:
import os
import pandas as pd
from nltk.corpus import stopwords
import numpy as np
from bertopic import BERTopic

# Set the environment variable to disable parallelism
os.environ["TOKENIZERS_PARALLELISM"] = "false"

def process_dataset(dataframe: pd.DataFrame, dataset_name: str, num_runs: int) -> None:
    # Load English stopwords from NLTK
    english_stopwords = stopwords.words('english')

    # # Initial cleaning of the 'text' column by removing stopwords
    # dataframe['text'] = dataframe['text'].astype(str).apply(
    #     lambda x: ' '.join([word for word in x.split() if word.lower() not in english_stopwords])
    # )

    for run_number in range(1, num_runs + 1):
        run_folder = f"BERTopic_run_{run_number}"
        os.makedirs(run_folder, exist_ok=True)

        if run_number > 1:
            # Load the outliers from the previous run's folder
            previous_run_folder = f"BERTopic_run_{run_number - 1}"
            outliers_filename = os.path.join(previous_run_folder, f"BERTopic_run_{run_number - 1}_Outliers.csv")
            try:
                dataframe = pd.read_csv(outliers_filename)
            except FileNotFoundError:
                print(f"File {outliers_filename} not found. Ending the process.")
                break

        # Create BERTopic model with default settings
        topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)

        topics, probabilities = topic_model.fit_transform(dataframe['text'])

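        # Add topic labels to the dataframe
        # NOTE: to_dict() keys the names by row index, not by topic id, so labels can be
        # misaligned when row 0 holds the outlier topic -1. Version 5 maps Topic -> Name instead.
        topic_names = topic_model.get_topic_info()['Name'].to_dict()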
        dataframe['Topic'] = topics
        dataframe['Topic Name'] = dataframe['Topic'].apply(lambda t: topic_names.get(t, 'Unknown'))

        # Save the dataframe with topic labels
        labeled_data_filename = os.path.join(run_folder, f"BERTopic_run_{run_number}_LabeledData.csv")
        dataframe.to_csv(labeled_data_filename, index=False)

        # Save 'Topic Name' column to a npy file
        npy_filename = os.path.join(run_folder, f"BERTopic_run_{run_number}_TopicNames.npy")
        np.save(npy_filename, dataframe['Topic Name'].values)

        # Output Generated Topics and Automatic Labelling
        topic_info = topic_model.get_topic_info()

        # Save BERTopic results for this run
        bertopic_results_filename = os.path.join(run_folder, f"BERTopic_run_{run_number}_Results.csv")
        topic_info.to_csv(bertopic_results_filename, index=False)

        # Save each topic's documents to separate CSV files named after the topic's generated name
        for topic in set(topics):
            if topic != -1:  # Exclude outlier topic
                topic_indices = [i for i, t in enumerate(topics) if t == topic]
                topic_documents = dataframe.iloc[topic_indices]
                topic_name = topic_info[topic_info['Topic'] == topic]['Name'].values[0]
                topic_name = topic_name.replace(" ", "_").replace("/", "_")
                topic_filename = os.path.join(run_folder, f"BERTopic_run_{run_number}_{topic_name}.csv")
                topic_documents.to_csv(topic_filename, index=False)

        # Identify Outliers
        outlier_indices = [i for i, topic in enumerate(topics) if topic == -1]

        # Save the outliers if any
        if len(outlier_indices) > 0:
            outliers_dataframe = dataframe.iloc[outlier_indices]
            outliers_filename = os.path.join(run_folder, f"BERTopic_run_{run_number}_Outliers.csv")
            outliers_dataframe.to_csv(outliers_filename, index=False)
        else:
            # No outliers, end the process
            break


In [10]:
# Usage: process_dataset(dataframe, dataset_name, num_runs); dataset_name is currently unused
process_dataset(dataframe, "BERT_Topics", 2)

2024-02-03 13:45:37,495 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/50 [00:00<?, ?it/s]

2024-02-03 13:45:38,110 - BERTopic - Embedding - Completed ✓
2024-02-03 13:45:38,111 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-03 13:45:43,738 - BERTopic - Dimensionality - Completed ✓
2024-02-03 13:45:43,740 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-03 13:45:43,896 - BERTopic - Cluster - Completed ✓
2024-02-03 13:45:43,900 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-02-03 13:45:43,973 - BERTopic - Representation - Completed ✓
2024-02-03 13:45:44,253 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/16 [00:00<?, ?it/s]

2024-02-03 13:45:44,554 - BERTopic - Embedding - Completed ✓
2024-02-03 13:45:44,554 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-03 13:45:46,903 - BERTopic - Dimensionality - Completed ✓
2024-02-03 13:45:46,904 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-03 13:45:46,926 - BERTopic - Cluster - Completed ✓
2024-02-03 13:45:46,929 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-02-03 13:45:46,953 - BERTopic - Representation - Completed ✓
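
The `_TopicNames.npy` files written by `process_dataset` hold strings. When the 'Topic Name' column has object dtype (the usual case for string columns in many pandas versions), NumPy stores the array via pickle, and `np.load` then refuses to read it unless `allow_pickle=True` is passed. The sketch below demonstrates this with a temporary file and two names taken from the topic table; the file name is a placeholder. Pickled files should only be loaded from trusted sources.

```python
import os
import tempfile

import numpy as np

# Object-dtype array of strings, similar to dataframe['Topic Name'].values.
topic_names = np.array(
    ["0_biden_image_photo_posing", "-1_biden_trump_donald_barack"], dtype=object
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "TopicNames.npy")  # placeholder file name
    np.save(path, topic_names)

    # Loading with the defaults fails because object arrays require pickle.
    try:
        np.load(path)
        default_load_failed = False
    except ValueError:
        default_load_failed = True

    # Explicitly allowing pickle restores the original strings.
    restored = np.load(path, allow_pickle=True)

print(default_load_failed, restored.tolist())
```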


    - Version 4: Basic BERTopic with multiple iterations with KeyBERT

In [6]:
import os
import pandas as pd
from nltk.corpus import stopwords
import numpy as np
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired  # Import the custom representation model

# Set the environment variable to disable parallelism
os.environ["TOKENIZERS_PARALLELISM"] = "false"

def process_dataset(dataframe: pd.DataFrame, dataset_name: str, num_runs: int) -> None:
    # Load English stopwords from NLTK
    english_stopwords = stopwords.words('english')
    # Initialize the representation model
    representation_model = KeyBERTInspired()

    for run_number in range(1, num_runs + 1):
        run_folder = f"BERTopic_run_{run_number}"
        os.makedirs(run_folder, exist_ok=True)

        if run_number > 1:
            # Load the outliers from the previous run's folder
            previous_run_folder = f"BERTopic_run_{run_number - 1}"
            outliers_filename = os.path.join(previous_run_folder, f"BERTopic_run_{run_number - 1}_Outliers.csv")
            try:
                dataframe = pd.read_csv(outliers_filename)
            except FileNotFoundError:
                print(f"File {outliers_filename} not found. Ending the process.")
                break

        # Create BERTopic model with custom representation model
        topic_model = BERTopic(
            language="english",
            calculate_probabilities=True,
            verbose=True,
            representation_model=representation_model  # Add the custom representation model
        )

        topics, probabilities = topic_model.fit_transform(dataframe['text'])

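        # Add topic labels to the dataframe
        # NOTE: to_dict() keys the names by row index, not by topic id, so labels can be
        # misaligned when row 0 holds the outlier topic -1. Version 6 maps Topic -> Name instead.
        topic_names = topic_model.get_topic_info()['Name'].to_dict()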
        dataframe['Topic'] = topics
        dataframe['Topic Name'] = dataframe['Topic'].apply(lambda t: topic_names.get(t, 'Unknown'))

        # Save the dataframe with topic labels
        labeled_data_filename = os.path.join(run_folder, f"BERTopic_run_{run_number}_LabeledData.csv")
        dataframe.to_csv(labeled_data_filename, index=False)

        # Save 'Topic Name' column to a npy file
        npy_filename = os.path.join(run_folder, f"BERTopic_run_{run_number}_TopicNames.npy")
        np.save(npy_filename, dataframe['Topic Name'].values)

        # Output Generated Topics and Automatic Labelling
        topic_info = topic_model.get_topic_info()

        # Save BERTopic results for this run
        bertopic_results_filename = os.path.join(run_folder, f"BERTopic_run_{run_number}_Results.csv")
        topic_info.to_csv(bertopic_results_filename, index=False)

        # Save each topic's documents to separate CSV files named after the topic's generated name
        for topic in set(topics):
            if topic != -1:  # Exclude outlier topic
                topic_indices = [i for i, t in enumerate(topics) if t == topic]
                topic_documents = dataframe.iloc[topic_indices]
                topic_name = topic_info[topic_info['Topic'] == topic]['Name'].values[0]
                topic_name = topic_name.replace(" ", "_").replace("/", "_")
                topic_filename = os.path.join(run_folder, f"BERTopic_run_{run_number}_{topic_name}.csv")
                topic_documents.to_csv(topic_filename, index=False)

        # Identify Outliers
        outlier_indices = [i for i, topic in enumerate(topics) if topic == -1]

        # Save the outliers if any
        if len(outlier_indices) > 0:
            outliers_dataframe = dataframe.iloc[outlier_indices]
            outliers_filename = os.path.join(run_folder, f"BERTopic_run_{run_number}_Outliers.csv")
            outliers_dataframe.to_csv(outliers_filename, index=False)
        else:
            # No outliers, end the process
            break


In [7]:
# Usage: process_dataset(dataframe, dataset_name, num_runs); dataset_name is currently unused
process_dataset(dataframe, "BERT_Topics", 2)

2024-02-05 16:53:40,496 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/50 [00:00<?, ?it/s]

2024-02-05 16:53:41,032 - BERTopic - Embedding - Completed ✓
2024-02-05 16:53:41,033 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-05 16:53:46,055 - BERTopic - Dimensionality - Completed ✓
2024-02-05 16:53:46,058 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-05 16:53:46,197 - BERTopic - Cluster - Completed ✓
2024-02-05 16:53:46,201 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-02-05 16:53:46,721 - BERTopic - Representation - Completed ✓
2024-02-05 16:53:46,998 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/16 [00:00<?, ?it/s]

2024-02-05 16:53:47,327 - BERTopic - Embedding - Completed ✓
2024-02-05 16:53:47,328 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-05 16:53:49,637 - BERTopic - Dimensionality - Completed ✓
2024-02-05 16:53:49,637 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-05 16:53:49,660 - BERTopic - Cluster - Completed ✓
2024-02-05 16:53:49,663 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-02-05 16:53:49,798 - BERTopic - Representation - Completed ✓
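
The per-topic CSV files are named after the generated topic name, with spaces and slashes replaced by underscores. Topic names built from arbitrary text can contain other characters that are awkward or invalid in file names, so a stricter sanitiser may be worth considering. The helper below, `safe_topic_filename`, is a hypothetical alternative and not part of the notebook.

```python
import re


def safe_topic_filename(topic_name: str, max_length: int = 100) -> str:
    """Reduce a topic name to letters, digits, '_', '.' and '-' for use in a file name."""
    # Replace every run of disallowed characters with a single underscore.
    cleaned = re.sub(r"[^A-Za-z0-9_.-]+", "_", topic_name)
    # Collapse repeated underscores and trim underscores/dots at the ends.
    cleaned = re.sub(r"_+", "_", cleaned).strip("_.")
    # Fall back to a generic name if nothing usable is left.
    return cleaned[:max_length] or "topic"


print(safe_topic_filename("3_biden_oil_fuels_petroleum"))
print(safe_topic_filename("5_covid:19_who?_a/b"))
```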


    - Version 5: Basic BERTopic with multiple iterations, fixing the labelling
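
The labelling fix comes down to how the topic-name lookup is built. The sketch below uses a small hand-made frame shaped like the output of `get_topic_info()` (names taken from the topic table earlier in the notebook) to compare a dictionary keyed by row index, as produced by `['Name'].to_dict()`, with one keyed by the 'Topic' column. The frame and the `topics` list are illustrative assumptions.

```python
import pandas as pd

# Hand-made frame shaped like get_topic_info(): row 0 holds the outlier topic -1.
topic_info = pd.DataFrame(
    {
        "Topic": [-1, 0, 1],
        "Name": [
            "-1_biden_trump_donald_barack",
            "0_biden_image_photo_posing",
            "1_biden_votes_electors_electoral",
        ],
    }
)

# Keyed by row index (0, 1, 2): topic ids are looked up against the wrong keys.
by_row_index = topic_info["Name"].to_dict()

# Keyed by the actual topic id (-1, 0, 1).
by_topic_id = dict(zip(topic_info["Topic"], topic_info["Name"]))

topics = [0, -1, 1]  # illustrative topic assignments for three documents
labels_by_row_index = [by_row_index.get(t, "Unknown") for t in topics]
labels_by_topic_id = [by_topic_id.get(t, "Unknown") for t in topics]

print(labels_by_row_index)
print(labels_by_topic_id)
```

With the row-index lookup, every label is shifted by one position and the outlier topic falls through to 'Unknown'; the topic-id lookup returns the intended names.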

In [13]:
import os
import pandas as pd
from nltk.corpus import stopwords
import numpy as np
from bertopic import BERTopic

# Set the environment variable to disable parallelism
os.environ["TOKENIZERS_PARALLELISM"] = "false"

def process_dataset(dataframe: pd.DataFrame, dataset_name: str, num_runs: int) -> None:
    # Stopword loading is disabled because the stopwords are not used below
    # english_stopwords = stopwords.words('english')

    for run_number in range(1, num_runs + 1):
        run_folder = f"BERTopic_run_{run_number}"
        os.makedirs(run_folder, exist_ok=True)

        if run_number > 1:
            # Correctly load outliers from the previous run
            outliers_filename = os.path.join(f"BERTopic_run_{run_number - 1}", f"BERTopic_run_{run_number - 1}_Outliers.csv")
            try:
                dataframe = pd.read_csv(outliers_filename)
            except FileNotFoundError:
                print(f"File {outliers_filename} not found. Ending the process.")
                break

        # Create BERTopic model with default settings
        topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)
        topics, probabilities = topic_model.fit_transform(dataframe['text'])

        # Add topic labels to the dataframe
        topic_info = topic_model.get_topic_info()  # Get topic information
        topic_names = {row['Topic']: row['Name'] for index, row in topic_info.iterrows()}
        dataframe['Topic'] = topics
        dataframe['Topic Name'] = dataframe['Topic'].apply(lambda t: topic_names.get(t, 'Unknown'))

        # Save the dataframe with topic labels
        labeled_data_filename = os.path.join(run_folder, f"BERTopic_run_{run_number}_TopicLabels.csv")
        dataframe.to_csv(labeled_data_filename, index=False)

        # Save 'Topic Name' column to a npy file
        npy_filename = os.path.join(run_folder, f"BERTopic_run_{run_number}_TopicNames.npy")
        np.save(npy_filename, dataframe['Topic Name'].values)

        # Save BERTopic results with consistent naming
        bertopic_results_filename = os.path.join(run_folder, f"BERTopic_run_{run_number}_Topics_Results.csv")
        topic_info.to_csv(bertopic_results_filename, index=False)

        # Save documents for each topic with the simplified naming convention
        for topic, name in topic_names.items():
            if topic != -1:  # Excluding outlier topic
                topic_indices = dataframe[dataframe['Topic'] == topic].index
                topic_dataframe = dataframe.loc[topic_indices]
                clean_name = name.replace(" ", "_").replace("/", "_")
                topic_filename = os.path.join(run_folder, f"BERTopic_run_{run_number}_Topic_{topic}_{clean_name}.csv")
                topic_dataframe.to_csv(topic_filename, index=False)

        # Handle outliers, saving them with consistent naming
        outlier_indices = dataframe[dataframe['Topic'] == -1].index
        if len(outlier_indices) > 0:
            outliers_dataframe = dataframe.loc[outlier_indices]
            outliers_filename = os.path.join(run_folder, f"BERTopic_run_{run_number}_Outliers.csv")
            outliers_dataframe.to_csv(outliers_filename, index=False)
        else:
            print("No outliers found in this run. Ending the process.")
            break


In [14]:
# Usage: process_dataset(dataframe, dataset_name, num_runs); dataset_name is currently unused
process_dataset(dataframe, "BERT_Topics", 2)

2024-02-24 11:53:41,984 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/50 [00:00<?, ?it/s]

2024-02-24 11:53:42,524 - BERTopic - Embedding - Completed ✓
2024-02-24 11:53:42,524 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-24 11:53:48,819 - BERTopic - Dimensionality - Completed ✓
2024-02-24 11:53:48,822 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-24 11:53:48,973 - BERTopic - Cluster - Completed ✓
2024-02-24 11:53:48,980 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-02-24 11:53:49,051 - BERTopic - Representation - Completed ✓
2024-02-24 11:53:49,368 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/15 [00:00<?, ?it/s]

2024-02-24 11:53:49,665 - BERTopic - Embedding - Completed ✓
2024-02-24 11:53:49,665 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-24 11:53:51,893 - BERTopic - Dimensionality - Completed ✓
2024-02-24 11:53:51,893 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-24 11:53:51,916 - BERTopic - Cluster - Completed ✓
2024-02-24 11:53:51,918 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-02-24 11:53:51,942 - BERTopic - Representation - Completed ✓
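
When testing whether a run produced any outliers, checking the length of the outlier index is safer than calling `.any()` on it. `Index.any()` tests whether any index label is truthy, so an index that contains only the label `0` counts as "no outliers". The toy frame below is an assumption made for illustration.

```python
import pandas as pd

# Toy frame in which the only outlier sits in the row labelled 0.
df = pd.DataFrame(
    {
        "text": ["gas prices", "vaccine rollout", "vaccine trial"],
        "Topic": [-1, 0, 0],
    }
)
outlier_indices = df[df["Topic"] == -1].index

# .any() looks at the labels themselves: the single label 0 is falsy.
any_reports_outliers = bool(outlier_indices.any())

# len() counts the labels, which is what the check actually needs.
len_reports_outliers = len(outlier_indices) > 0

print(any_reports_outliers, len_reports_outliers)
```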


    - Version 6: Basic BERTopic with multiple iterations with KeyBERT, fixing the labelling

In [15]:
import os
import pandas as pd  # required for the pd.DataFrame type hint and pd.read_csv below
from nltk.corpus import stopwords
import numpy as np
from bertopic import BERTopic

# Set the environment variable to disable parallelism
os.environ["TOKENIZERS_PARALLELISM"] = "false"

def process_dataset(dataframe: pd.DataFrame, dataset_name: str, num_runs: int) -> None:
    # Stopword loading is disabled because the stopwords are not used below
    # english_stopwords = stopwords.words('english')
    
    for run_number in range(1, num_runs + 1):
        run_folder = f"BERTopic_run_{run_number}"
        os.makedirs(run_folder, exist_ok=True)

        if run_number > 1:
            # Correctly load the outliers from the previous run
            outliers_filename = os.path.join(f"BERTopic_run_{run_number - 1}", f"BERTopic_run_{run_number - 1}_Outliers.csv")
            try:
                dataframe = pd.read_csv(outliers_filename)
            except FileNotFoundError:
                print(f"File {outliers_filename} not found. Ending the process.")
                break

        # Create BERTopic model with default settings (no KeyBERT representation in this cell;
        # the KeyBERT variant follows in the next cell)
        topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)
        topics, probabilities = topic_model.fit_transform(dataframe['text'])

        # Generate topic names and update the dataframe with these names
        topic_info = topic_model.get_topic_info()  # Get topic information
        topic_names = {row['Topic']: row['Name'] for index, row in topic_info.iterrows()}
        dataframe['Topic'] = topics
        dataframe['Topic Name'] = dataframe['Topic'].apply(lambda t: topic_names.get(t, 'Unknown'))

        # Save the dataframe with topic labels using the updated naming convention
        labeled_data_filename = os.path.join(run_folder, f"BERTopic_run_{run_number}_TopicLabels.csv")
        dataframe.to_csv(labeled_data_filename, index=False)

        # Save 'Topic Name' column to a npy file using the updated naming convention
        npy_filename = os.path.join(run_folder, f"BERTopic_run_{run_number}_TopicNames.npy")
        np.save(npy_filename, dataframe['Topic Name'].values)

        # Save BERTopic results with the updated naming convention
        bertopic_results_filename = os.path.join(run_folder, f"BERTopic_run_{run_number}_Topics_Results.csv")
        topic_info.to_csv(bertopic_results_filename, index=False)

        # Save documents for each topic using the simplified naming convention
        for topic, name in topic_names.items():
            if topic != -1:  # Excluding outlier topic
                topic_indices = dataframe[dataframe['Topic'] == topic].index
                topic_dataframe = dataframe.loc[topic_indices]
                clean_name = name.replace(" ", "_").replace("/", "_")
                topic_filename = os.path.join(run_folder, f"BERTopic_run_{run_number}_Topic_{topic}_{clean_name}.csv")
                topic_dataframe.to_csv(topic_filename, index=False)

        # Handle outliers, saving them with consistent naming
        outlier_indices = dataframe[dataframe['Topic'] == -1].index
        if len(outlier_indices) > 0:
            outliers_dataframe = dataframe.loc[outlier_indices]
            outliers_filename = os.path.join(run_folder, f"BERTopic_run_{run_number}_Outliers.csv")
            outliers_dataframe.to_csv(outliers_filename, index=False)
        else:
            print("No outliers found in this run. Ending the process.")
            break


In [16]:
# Usage: process_dataset(dataframe, dataset_name, num_runs); dataset_name is currently unused
process_dataset(dataframe, "BERT_Topics", 2)

2024-02-24 12:00:01,239 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/50 [00:00<?, ?it/s]

2024-02-24 12:00:02,396 - BERTopic - Embedding - Completed ✓
2024-02-24 12:00:02,396 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-24 12:00:07,886 - BERTopic - Dimensionality - Completed ✓
2024-02-24 12:00:07,888 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-24 12:00:08,035 - BERTopic - Cluster - Completed ✓
2024-02-24 12:00:08,038 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-02-24 12:00:08,105 - BERTopic - Representation - Completed ✓
2024-02-24 12:00:10,849 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/16 [00:00<?, ?it/s]

2024-02-24 12:00:11,483 - BERTopic - Embedding - Completed ✓
2024-02-24 12:00:11,483 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-24 12:00:13,767 - BERTopic - Dimensionality - Completed ✓
2024-02-24 12:00:13,768 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-24 12:00:13,792 - BERTopic - Cluster - Completed ✓
2024-02-24 12:00:13,796 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-02-24 12:00:13,817 - BERTopic - Representation - Completed ✓


No outliers found in this run. Ending the process.


In [14]:
import os
import pandas as pd
from nltk.corpus import stopwords
import numpy as np
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired  # Import the custom representation model

# Set the environment variable to disable parallelism
os.environ["TOKENIZERS_PARALLELISM"] = "false"

def process_dataset(dataframe: pd.DataFrame, dataset_name: str, num_runs: int) -> None:
    # Initialize the representation model
    representation_model = KeyBERTInspired()

    for run_number in range(1, num_runs + 1):
        run_folder = f"BERTopic_run_{run_number}"
        os.makedirs(run_folder, exist_ok=True)

        if run_number > 1:
            # Correctly load the outliers from the previous run
            outliers_filename = os.path.join(f"BERTopic_run_{run_number - 1}", f"BERTopic_run_{run_number - 1}_Outliers.csv")
            try:
                dataframe = pd.read_csv(outliers_filename)
            except FileNotFoundError:
                print(f"File {outliers_filename} not found. Ending the process.")
                break

        # Create BERTopic model with the custom representation model
        topic_model = BERTopic(
            language="english",
            calculate_probabilities=True,
            verbose=True,
            representation_model=representation_model  # Add the custom representation model
        )

        topics, probabilities = topic_model.fit_transform(dataframe['text'])

        # Generate topic names and update the dataframe with these names
        topic_info = topic_model.get_topic_info()  # Get topic information
        topic_names = {row['Topic']: row['Name'] for index, row in topic_info.iterrows()}
        dataframe['Topic'] = topics
        dataframe['Topic Name'] = dataframe['Topic'].apply(lambda t: topic_names.get(t, 'Unknown'))

        # Save the dataframe with topic labels using the updated naming convention
        labeled_data_filename = os.path.join(run_folder, f"BERTopic_run_{run_number}_TopicLabels.csv")
        dataframe.to_csv(labeled_data_filename, index=False)

        # Save 'Topic Name' column to a npy file using the updated naming convention
        npy_filename = os.path.join(run_folder, f"BERTopic_run_{run_number}_TopicNames.npy")
        np.save(npy_filename, dataframe['Topic Name'].values)

        # Save BERTopic results with the updated naming convention
        bertopic_results_filename = os.path.join(run_folder, f"BERTopic_run_{run_number}_Topics_Results.csv")
        topic_info.to_csv(bertopic_results_filename, index=False)

        # Save the documents of each topic to their own CSV file
        for topic, name in topic_names.items():
            if topic == -1:  # Skip the outlier topic; outliers are saved separately
                continue
            topic_dataframe = dataframe[dataframe['Topic'] == topic]
            clean_name = name.replace(" ", "_").replace("/", "_")
            topic_filename = os.path.join(run_folder, f"BERTopic_run_{run_number}_Topic_{topic}_{clean_name}.csv")
            topic_dataframe.to_csv(topic_filename, index=False)

        # Save the outliers (topic -1) so the next run can re-cluster them
        outliers_dataframe = dataframe[dataframe['Topic'] == -1]
        if outliers_dataframe.empty:
            print("No outliers found in this run. Ending the process.")
            break
        outliers_filename = os.path.join(run_folder, f"BERTopic_run_{run_number}_Outliers.csv")
        outliers_dataframe.to_csv(outliers_filename, index=False)
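
The per-topic file step above cleans a topic name by replacing only spaces and slashes. Keyword-based topic names may contain other characters that some filesystems reject, so a more defensive variant could look like the sketch below. The helper name `safe_topic_filename`, the allowed character set, and the sample topic name are assumptions for illustration and are not part of the notebook.

```python
import re


def safe_topic_filename(run_number: int, topic: int, name: str) -> str:
    """Build a CSV filename for one topic, keeping only filesystem-friendly characters."""
    # Collapse every run of characters other than letters, digits, '_' and '-'
    # into a single underscore, then trim underscores left at either end.
    clean_name = re.sub(r"[^A-Za-z0-9_-]+", "_", name).strip("_")
    return f"BERTopic_run_{run_number}_Topic_{topic}_{clean_name}.csv"


# Hypothetical topic name containing a space and a slash
print(safe_topic_filename(1, 0, "0_price/cost_new york_tax"))
```

The result can be passed to `os.path.join(run_folder, ...)` in the same way as the original `topic_filename`.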


In [15]:
# Usage: process_dataset(dataframe, dataset_name, num_runs)
process_dataset(dataframe, "BERT_Topics", 2)

2024-02-24 16:40:27,538 - BERTopic - Embedding - Transforming documents to embeddings.
2024-02-24 16:40:28,079 - BERTopic - Embedding - Completed ✓
2024-02-24 16:40:28,080 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-24 16:40:33,175 - BERTopic - Dimensionality - Completed ✓
2024-02-24 16:40:33,176 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-24 16:40:33,318 - BERTopic - Cluster - Completed ✓
2024-02-24 16:40:33,321 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-02-24 16:40:33,832 - BERTopic - Representation - Completed ✓
2024-02-24 16:40:34,151 - BERTopic - Embedding - Transforming documents to embeddings.
2024-02-24 16:40:34,419 - BERTopic - Embedding - Completed ✓
2024-02-24 16:40:34,420 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-24 16:40:37,206 - BERTopic - Dimensionality - Completed ✓
2024-02-24 16:40:37,208 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-24 16:40:37,228 - BERTopic - Cluster - Completed ✓
2024-02-24 16:40:37,232 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-02-24 16:40:37,343 - BERTopic - Representation - Completed ✓
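
The verbose log above records when each pipeline stage starts and when it completes, so the time spent per stage can be read off the timestamps. The sketch below does this with the standard library only. It assumes the `timestamp - BERTopic - Stage - message` layout shown in the output above, and the function name `stage_durations` is hypothetical.

```python
from datetime import datetime
from typing import List, Tuple


def stage_durations(log_text: str) -> List[Tuple[str, float]]:
    """Return (stage, seconds) pairs parsed from BERTopic verbose log lines."""
    started = {}
    durations = []
    for line in log_text.splitlines():
        parts = line.split(" - ", 3)
        if len(parts) != 4 or parts[1] != "BERTopic":
            continue  # Skip blank lines and anything that is not a BERTopic log line
        timestamp = datetime.strptime(parts[0], "%Y-%m-%d %H:%M:%S,%f")
        stage, message = parts[2], parts[3]
        if message.startswith("Completed"):
            if stage in started:
                elapsed = (timestamp - started.pop(stage)).total_seconds()
                durations.append((stage, elapsed))
        else:
            started[stage] = timestamp  # Any other message marks the start of a stage
    return durations


# Two stages copied from the first run's log output
sample_log = """\
2024-02-24 16:40:28,080 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-24 16:40:33,175 - BERTopic - Dimensionality - Completed ✓
2024-02-24 16:40:33,176 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-24 16:40:33,318 - BERTopic - Cluster - Completed ✓
"""

for stage, seconds in stage_durations(sample_log):
    print(f"{stage}: {seconds:.3f} s")
```

In these logs dimensionality reduction takes most of the time in each run, which is worth keeping in mind when choosing `num_runs` for larger datasets.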


In [5]:
import os

import numpy as np
import pandas as pd
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired  # KeyBERT-inspired representation model

# Set the environment variable to disable parallelism
os.environ["TOKENIZERS_PARALLELISM"] = "false"

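def process_dataset(dataframe: pd.DataFrame, dataset_name: str, num_runs: int) -> None:
    """Run BERTopic up to num_runs times, re-clustering the previous run's outliers each time.

    Note: dataset_name is accepted but not used in the body of this function.
    """
    # KeyBERT-inspired representation model used to refine the topic keywords
    representation_model = KeyBERTInspired()

    # Name of the embedding model to record alongside the saved BERTopic model
    embedding_model = "sentence-transformers/all-MiniLM-L6-v2"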

    for run_number in range(1, num_runs + 1):
        run_folder = f"BERTopic_run_{run_number}"
        os.makedirs(run_folder, exist_ok=True)

        if run_number > 1:
            # Load the outliers saved by the previous run
            outliers_filename = os.path.join(f"BERTopic_run_{run_number - 1}", f"BERTopic_run_{run_number - 1}_Outliers.csv")
            try:
                dataframe = pd.read_csv(outliers_filename)
            except FileNotFoundError:
                print(f"File {outliers_filename} not found. Ending the process.")
                break

        # Create the BERTopic model with the KeyBERT-inspired representation model
        topic_model = BERTopic(
            language="english",
            calculate_probabilities=True,
            verbose=True,
            representation_model=representation_model,  # Refines topic keywords after clustering
        )

        topics, probabilities = topic_model.fit_transform(dataframe['text'])

        # Map each topic id to its generated name and label every document
        topic_info = topic_model.get_topic_info()
        topic_names = dict(zip(topic_info['Topic'], topic_info['Name']))
        dataframe['Topic'] = topics
        dataframe['Topic Name'] = dataframe['Topic'].map(topic_names).fillna('Unknown')

        # Save the dataframe with topic labels using the updated naming convention
        labeled_data_filename = os.path.join(run_folder, f"BERTopic_run_{run_number}_TopicLabels.csv")
        dataframe.to_csv(labeled_data_filename, index=False)

        # Save 'Topic Name' column to a npy file using the updated naming convention
        npy_filename = os.path.join(run_folder, f"BERTopic_run_{run_number}_TopicNames.npy")
        np.save(npy_filename, dataframe['Topic Name'].values)

        # Save BERTopic results with the updated naming convention
        bertopic_results_filename = os.path.join(run_folder, f"BERTopic_run_{run_number}_Topics_Results.csv")
        topic_info.to_csv(bertopic_results_filename, index=False)

        # Save the documents of each topic to their own CSV file
        for topic, name in topic_names.items():
            if topic == -1:  # Skip the outlier topic; outliers are saved separately
                continue
            topic_dataframe = dataframe[dataframe['Topic'] == topic]
            clean_name = name.replace(" ", "_").replace("/", "_")
            topic_filename = os.path.join(run_folder, f"BERTopic_run_{run_number}_Topic_{topic}_{clean_name}.csv")
            topic_dataframe.to_csv(topic_filename, index=False)

        # Save the outliers (topic -1) so the next run can re-cluster them
        outliers_dataframe = dataframe[dataframe['Topic'] == -1]
        if not outliers_dataframe.empty:
            outliers_filename = os.path.join(run_folder, f"BERTopic_run_{run_number}_Outliers.csv")
            outliers_dataframe.to_csv(outliers_filename, index=False)

        # Save the BERTopic model using safetensors serialization.
        # This runs before the early exit below so that a run without outliers still saves its model.
        model_dir = os.path.join(run_folder, "model_dir")
        topic_model.save(model_dir, serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)

        print(f"Run {run_number}: Model and data saved successfully in {model_dir} using safetensors serialization.")

        if outliers_dataframe.empty:
            print("No outliers found in this run. Ending the process.")
            break


In [6]:
# Usage: process_dataset(dataframe, dataset_name, num_runs)
process_dataset(dataframe, "BERT_Topics", 1)

2024-02-25 21:55:06,301 - BERTopic - Embedding - Transforming documents to embeddings.
2024-02-25 21:55:10,349 - BERTopic - Embedding - Completed ✓
2024-02-25 21:55:10,350 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-25 21:55:38,024 - BERTopic - Dimensionality - Completed ✓
2024-02-25 21:55:38,028 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-25 21:55:41,371 - BERTopic - Cluster - Completed ✓
2024-02-25 21:55:41,378 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-02-25 21:55:43,287 - BERTopic - Representation - Completed ✓


Run 1: Model and data saved successfully in BERTopic_run_1/model_dir using safetensors serialization.
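
A model directory saved this way can be loaded again later, for example to inspect the topics without refitting. The following is a minimal sketch that assumes the folder layout produced above. The helper names `model_dir_for_run` and `load_run_model` are hypothetical; `BERTopic.load` is the library's loader. The import sits inside the function so that the path helper works even where `bertopic` is not installed.

```python
import os


def model_dir_for_run(run_number: int) -> str:
    """Return the model directory that process_dataset writes for a given run."""
    return os.path.join(f"BERTopic_run_{run_number}", "model_dir")


def load_run_model(run_number: int):
    """Load the BERTopic model saved for a run (requires bertopic to be installed)."""
    from bertopic import BERTopic  # Imported lazily to keep the path helper dependency-free

    return BERTopic.load(model_dir_for_run(run_number))


print(model_dir_for_run(1))
```

Calling `load_run_model(1).get_topic_info()` should then return the same topic table that was written to the run's `Topics_Results.csv`. If the embedding model is not restored automatically in your BERTopic version, it can be passed explicitly through the `embedding_model` argument of `BERTopic.load`.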
