### Overview
This notebook organizes a set of subtopics (topic clusters) into higher-level topic labels using hierarchical clustering and natural language processing (NLP). The main steps are embedding the subtopic labels, clustering the embeddings, and generating human-readable labels for the resulting higher-level topics.

### Main Components
1. **Data Loading and Preparation**:
   - The `load_and_merge_datasets` function loads multiple datasets and merges them into a single DataFrame, filtering out any records without a valid topic.

2. **Embeddings**:
   - The notebook utilizes the `SentenceTransformer` model to encode subtopics into vector representations (embeddings). This is handled by the `compute_embeddings` function.

3. **Clustering**:
   - **Hierarchical Clustering**:
     - The `perform_clustering` function performs hierarchical clustering on the topic embeddings using the Ward method. This results in a dendrogram that visualizes the relationships between subtopics.
   - **Cluster Assignment**:
     - The `assign_cluster_labels` function assigns each subtopic to a cluster based on a specified cut height in the dendrogram.

4. **Higher-Level Topic Generation**:
   - **Grouping Subtopics**:
     - `group_subtopics_by_cluster` aggregates subtopics within each cluster, preparing the data for generating higher-level topic labels.
   - **Label Generation with Together AI**:
     - The `generate_topic_label_together` function leverages a Together AI LLM to create concise and human-readable labels that represent each cluster.
   - **Finding Representative Documents**:
     - `find_representative_document` identifies a representative document for each cluster by finding the embedding that is most similar to the cluster centroid.

5. **Final Processing**:
   - The `process_clusters` function processes each cluster, generating a higher-level topic label and selecting a representative document for it.
   - The final results are saved to a CSV file using the `save_results` function.

6. **Main Workflow**:
   - The `main` function orchestrates the entire workflow, starting from initializing the API client and loading data to clustering, topic label generation, and saving results.
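
The clustering and cluster-assignment steps can be sketched as follows. This is a minimal illustration, not the notebook's own `perform_clustering` or `assign_cluster_labels`: the function name `cluster_embeddings` is hypothetical, and small 2-D toy points stand in for real sentence embeddings so that the example runs without downloading a model. It uses SciPy's `linkage` with the Ward method and `fcluster` with a distance cut.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def cluster_embeddings(embeddings, cut_height=1.2):
    """Ward-link the rows of `embeddings` and cut the tree at `cut_height`.

    Hypothetical helper for illustration only.
    """
    # linkage() returns the (n - 1) x 4 merge table that a dendrogram is drawn from.
    linkage_matrix = linkage(embeddings, method="ward")
    # criterion="distance" keeps two points in the same flat cluster when
    # their cophenetic distance is at most `cut_height`.
    labels = fcluster(linkage_matrix, t=cut_height, criterion="distance")
    return linkage_matrix, labels


# Toy stand-in for sentence embeddings: two tight groups of 2-D points.
toy_embeddings = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
])
linkage_matrix, labels = cluster_embeddings(toy_embeddings, cut_height=1.2)
print(labels)
```

A lower cut height yields more, smaller clusters and a higher one yields fewer, broader clusters, so the value is worth re-checking whenever the embedding model or the data changes.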

### Additional Details
- The notebook uses the `SentenceTransformer` model (`all-MiniLM-L6-v2`) for generating embeddings.
- It performs hierarchical clustering using the Ward method and assigns cluster labels by cutting the dendrogram at a specified height (1.2, chosen heuristically).
- The Together AI LLM is used to generate human-readable, higher-level topic labels.

This process aids in transforming subtopics into broader, more interpretable categories, making it easier to understand the content and structure of large datasets.
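
The representative-document step listed under Main Components (picking the embedding most similar to the cluster centroid) might look like the sketch below. The function name `find_representative_index` and the toy vectors are assumptions made for illustration; the notebook's own `find_representative_document` may differ in signature and details.

```python
import numpy as np


def find_representative_index(member_embeddings):
    """Return the row index whose embedding is most cosine-similar to the centroid.

    Hypothetical helper for illustration only.
    """
    member_embeddings = np.asarray(member_embeddings, dtype=float)
    centroid = member_embeddings.mean(axis=0)
    # Cosine similarity = dot product divided by the product of the norms.
    norms = np.linalg.norm(member_embeddings, axis=1) * np.linalg.norm(centroid)
    # Replace zero norms by 1 so the division never produces NaN.
    similarities = (member_embeddings @ centroid) / np.where(norms == 0, 1.0, norms)
    return int(np.argmax(similarities))


# Toy cluster: the middle vector points in the same direction as the centroid.
toy_cluster = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
toy_docs = ["doc a", "doc b", "doc c"]
representative = toy_docs[find_representative_index(toy_cluster)]
print(representative)
```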


1. **Data Loading and Preparation**:


**First Step - Second Level Topic Knowledge**

In [11]:
import pandas as pd

# Load the two datasets
first_step_df = pd.read_csv('First_Step_Topic_Clusters_with_human_readable_label.csv')
higher_level_topics_df = pd.read_csv('Hierchical_Topics_Second_Level-Topics (1).csv')

higher_level_topics_df.rename(columns={'Higher_Topic_Label': 'Second_Level_Topic_Label'}, inplace=True)
higher_level_topics_df

# Function to clean and split the topics
def clean_and_split_topics(topic_str):
    # Remove quotes and split by commas, then strip any extra spaces
    topics = [topic.strip().replace('"', '') for topic in topic_str.split(',')]
    return topics


# Apply the cleaning function to the 'Human_Readable_Topic' column
first_step_df['Human_Readable_Topic'] = first_step_df['Human_Readable_Topic'].str.replace('"', '', regex=False)
higher_level_topics_df['Human_Readable_Topic'] = higher_level_topics_df['Human_Readable_Topic'].apply(clean_and_split_topics)

# Explode the 'Human_Readable_Topic' to create a row for each topic
higher_level_topics_df = higher_level_topics_df.explode('Human_Readable_Topic')

first_step_df
higher_level_topics_df
# Merge the two datasets on the cleaned 'Human_Readable_Topic' column
merged_df = pd.merge(first_step_df, higher_level_topics_df[['Human_Readable_Topic', 'Second_Level_Topic_Label']], 
                     on='Human_Readable_Topic', how='left')
merged_df
# # Save the merged dataframe to a new CSV file
# merged_df.to_csv('BERTopic_First_Step_Second_Level_Topic_Knowledge.csv', index=False)

# print("Merging completed and the result is saved as 'BERTopic_First_Step_Second_Level_Topic_Knowledge.csv'")


Unnamed: 0.1,Unnamed: 0,Topic,Count,Name,Representation,Aspect1,Aspect2,Representative_Docs,Human_Readable_Topic,Second_Level_Topic_Label
0,0,-1,20415,-1_learning_knowledge_trained_tasks,"['learning', 'knowledge', 'trained', 'tasks', ...","['data', 'models', 'learning', 'model', 'langu...","['knowledge', 'trained', 'tasks', 'ai', 'model...","["" In meta reinforcement learning (meta RL), ...",Artificial Intelligence and Machine Learning M...,
1,1,0,1000,0_molecular_molecule_molecules_ligands,"['molecular', 'molecule', 'molecules', 'ligand...","['molecular', 'protein', 'drug', 'molecules', ...","['molecular', 'ligands', 'modeling', 'discover...",[' Generating molecules that bind to specific...,Molecular Generation and Modeling for Drug Dis...,
2,2,1,622,1_recommender_recommenders_personalized_recomm...,"['recommender', 'recommenders', 'personalized'...","['recommendation', 'recommender', 'item', 'use...","['recommenders', 'personalized', 'factorizatio...",[' Contemporary recommender systems predomina...,Personalized Recommendation Systems,
3,3,2,602,2_nlp_text_annotated_clinical,"['nlp', 'text', 'annotated', 'clinical', 'medi...","['medical', 'clinical', 'biomedical', 'patient...","['nlp', 'text', 'annotated', 'hospital', 'retr...",[' In studies that rely on data from electron...,Natural Language Processing in Clinical Text A...,
4,4,3,583,3_retrieval_search_relevance_recall,"['retrieval', 'search', 'relevance', 'recall',...","['retrieval', 'documents', 'query', 'document'...","['retrieval', 'recall', 'semantic', 'retriever...",[' Large Language Models (LLMs) excel in vari...,Improving Document Retrieval for Large Languag...,
...,...,...,...,...,...,...,...,...,...,...
518,517,516,10,516_nationalities_cultural_cultures_language,"['nationalities', 'cultural', 'cultures', 'lan...","['cultural', 'debiasing', 'nationality', 'coun...","['nationalities', 'culturally', 'discourses', ...","["" Large Language Models (LLMs) attempt to im...",Cultural Sensitivity in Large Language Models,
519,518,517,10,517_scheduling_prediction_predictions_queueing,"['scheduling', 'prediction', 'predictions', 'q...","['predictions', 'jobs', 'skip', 'queues', 'lis...","['scheduling', 'predictions', 'queueing', 'alg...","["" Online decision-makers often obtain predic...",Scheduling and Queueing with Predictions,
520,519,518,10,518_programming_solvers_solver_optimization,"['programming', 'solvers', 'solver', 'optimiza...","['problems', 'programming', 'program', 'mathem...","['programming', 'solvers', 'optimization', 'op...",[' Optimization problems are pervasive in sec...,Optimization Problem Solving with Large Langua...,Optimization and Efficiency of Large Language ...
521,520,519,10,519_modeling_industrial_deep_flow,"['modeling', 'industrial', 'deep', 'flow', 'st...","['soft', 'wells', 'sensor', 'sensing', 'sensor...","['modeling', 'industrial', 'flow', 'stochastic...",[' The modeling of multistage manufacturing s...,Industrial Process Modeling and Sensing with D...,
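
Most of the displayed rows above have an empty `Second_Level_Topic_Label`, which suggests that exact string matching misses many topics. One way to quantify this is a left merge with `indicator=True` on a normalized key. The sketch below uses toy data in place of the two CSV files; the `merge_key` column and the `normalize_key` helper are hypothetical additions, not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (hypothetical values).
first_step = pd.DataFrame({"Human_Readable_Topic": ["Topic A", "topic b ", "Topic C"]})
second_level = pd.DataFrame({
    "Human_Readable_Topic": ["Topic A", "Topic B"],
    "Second_Level_Topic_Label": ["Parent 1", "Parent 1"],
})


def normalize_key(series):
    """Trim, collapse internal whitespace, and lowercase so near-identical labels compare equal."""
    return series.str.strip().str.replace(r"\s+", " ", regex=True).str.lower()


first_step["merge_key"] = normalize_key(first_step["Human_Readable_Topic"])
second_level["merge_key"] = normalize_key(second_level["Human_Readable_Topic"])

merged = pd.merge(
    first_step,
    second_level[["merge_key", "Second_Level_Topic_Label"]],
    on="merge_key",
    how="left",
    indicator=True,  # adds a `_merge` column: "both" or "left_only"
)
unmatched = merged.loc[merged["_merge"] == "left_only", "Human_Readable_Topic"].tolist()
print(f"{len(unmatched)} of {len(merged)} topics have no second-level label: {unmatched}")
```

Listing the unmatched topics makes it easier to tell whether the misses come from formatting differences (case, spacing, quotes) or from topics that are genuinely absent from the second-level file.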


<!-- Fallback: use the TF-IDF similarity matching below if exact string matching does not work. -->

In [15]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load the two datasets
first_step_df = pd.read_csv('First_Step_Topic_Clusters_with_human_readable_label.csv')
higher_level_topics_df = pd.read_csv('Hierchical_Topics_Second_Level-Topics (1).csv')

# Rename column for consistency
higher_level_topics_df.rename(columns={'Higher_Topic_Label': 'Second_Level_Topic_Label'}, inplace=True)

# Function to clean and split the topics
def clean_and_split_topics(topic_str):
    # Remove quotes and split by commas, then strip any extra spaces
    topics = [topic.strip().replace('"', '') for topic in topic_str.split(',')]
    return topics

# Apply the cleaning function to the 'Human_Readable_Topic' column
first_step_df['Human_Readable_Topic'] = first_step_df['Human_Readable_Topic'].str.replace('"', '', regex=False)
higher_level_topics_df['Human_Readable_Topic'] = higher_level_topics_df['Human_Readable_Topic'].apply(clean_and_split_topics)

# Explode the 'Human_Readable_Topic' to create a row for each topic
higher_level_topics_df = higher_level_topics_df.explode('Human_Readable_Topic')

# Combine all human-readable topics for vectorization
all_topics = pd.concat([first_step_df['Human_Readable_Topic'], higher_level_topics_df['Human_Readable_Topic']])

# Vectorize the topics using TF-IDF
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(all_topics)

# Split the TF-IDF matrix back into the two parts
first_step_tfidf = tfidf_matrix[:len(first_step_df)]
higher_level_tfidf = tfidf_matrix[len(first_step_df):]

# Calculate cosine similarity between each topic in first_step_df and all topics in higher_level_topics_df
similarity_matrix = cosine_similarity(first_step_tfidf, higher_level_tfidf)

# Find the index of the highest similarity for each topic
best_match_indices = similarity_matrix.argmax(axis=1)

# Get the best matching topics and their corresponding labels
first_step_df['Matched_Topic'] = higher_level_topics_df.iloc[best_match_indices]['Human_Readable_Topic'].values
first_step_df['Second_Level_Topic_Label'] = higher_level_topics_df.iloc[best_match_indices]['Second_Level_Topic_Label'].values

# Save the merged dataframe to a new CSV file (if needed)
first_step_df.to_csv('BERTopic_First_Step_Second_Level_Topic_Knowledge.csv', index=False)
print("Merging completed and the result is saved as 'BERTopic_First_Step_Second_Level_Topic_Knowledge.csv'")

Merging completed and the result is saved as 'BERTopic_First_Step_Second_Level_Topic_Knowledge.csv'
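
Because `argmax` always returns an index, the cell above assigns a second-level label to every topic, even when the best similarity is close to zero. A possible safeguard is to keep the best score and discard matches below a threshold. The sketch below is an assumption-laden illustration: `best_matches` and the 0.5 threshold are hypothetical, and hand-written scores stand in for the output of `cosine_similarity`.

```python
import numpy as np
import pandas as pd


def best_matches(similarity_matrix, candidate_labels, min_similarity=0.5):
    """Pick the most similar candidate per row, or None if its score is below the threshold.

    Hypothetical helper; `min_similarity` is an arbitrary illustrative value.
    """
    similarity_matrix = np.asarray(similarity_matrix, dtype=float)
    best_idx = similarity_matrix.argmax(axis=1)
    best_score = similarity_matrix.max(axis=1)
    labels = [
        candidate_labels[i] if score >= min_similarity else None
        for i, score in zip(best_idx, best_score)
    ]
    return pd.DataFrame({"match": labels, "score": best_score})


# Rows are first-step topics, columns are candidate second-level labels.
toy_similarity = [[0.9, 0.1], [0.2, 0.3], [0.0, 0.0]]
result = best_matches(toy_similarity, ["Parent 1", "Parent 2"], min_similarity=0.5)
print(result)
```

Keeping the score column also allows the borderline matches to be reviewed by hand before the labels are written to disk.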


**First Step - Third Level Topic Knowledge**

In [16]:
import pandas as pd

# Load the datasets
first_step_df = pd.read_csv('BERTopic_First_Step_Second_Level_Topic_Knowledge.csv')
highest_level_topics_df = pd.read_csv('Hierchical_Topics_Third_Level-Topics.csv')

# Function to clean and split the 'Second_Level_Topic_Label' column
def clean_and_split_labels(label_str):
    # Split by semicolons and strip any extra spaces
    labels = [label.strip() for label in label_str.split(';')]
    return labels

# Apply the cleaning and splitting function to the 'Second_Level_Topic_Label' column
highest_level_topics_df['Second_Level_Topic_Label'] = highest_level_topics_df['Second_Level_Topic_Label'].apply(clean_and_split_labels)

# Explode 'Second_Level_Topic_Label' to create a row for each label
highest_level_topics_df = highest_level_topics_df.explode('Second_Level_Topic_Label')

# Merge the first step dataset with the highest level topics based on 'Second_Level_Topic_Label'
final_merged_df = pd.merge(first_step_df, highest_level_topics_df[['Second_Level_Topic_Label', 'Highest_Topic_Label']], 
                           on='Second_Level_Topic_Label', how='left')

# Save the final merged dataframe to a new CSV file
final_merged_df.to_csv('BERTopic_First_Step_Third_Level_Topic_Knowledge.csv', index=False)

print("Merging completed and the result is saved as BERTopic_First_Step_Third_Level_Topic_Knowledge.csv")


Merging completed and the result is saved as BERTopic_First_Step_Third_Level_Topic_Knowledge.csv


**First Step - Final Topic Knowledge**

In [None]:
import pandas as pd

# Load the datasets
first_step_final_df = pd.read_csv('First_Step_BERTopic_topic_info_labelled_with_Highest_Topic_Label_final.csv')
final_highest_level_topics_df = pd.read_csv('final_highest_level_topic_labels_with_representatives.csv')

# Function to split the 'Highest_Topic_Label' on commas, handling NaN values
def split_labels(label_str):
    if isinstance(label_str, str):  # Check if the label_str is a string
        # Split by comma and strip any extra spaces
        labels = [label.strip() for label in label_str.split(',')]
    else:
        labels = [None]  # Return a list with None if label_str is not a string
    return labels

# Apply the splitting function to the 'Highest_Topic_Label' in the first dataset
first_step_final_df['Highest_Topic_Label_Split'] = first_step_final_df['Highest_Topic_Label'].apply(split_labels)

# Explode the first dataframe to have one row per label
first_step_final_df = first_step_final_df.explode('Highest_Topic_Label_Split')

# Explode the final_highest_level_topics_df based on 'Highest_Topic_Label' for proper matching
final_highest_level_topics_df['Highest_Topic_Label_Split'] = final_highest_level_topics_df['Highest_Topic_Label'].apply(split_labels)
final_highest_level_topics_df = final_highest_level_topics_df.explode('Highest_Topic_Label_Split')

# Merge the exploded first dataset with the final highest level topics on 'Highest_Topic_Label_Split'
final_merged_df = pd.merge(first_step_final_df, final_highest_level_topics_df[['Highest_Topic_Label_Split', 'Final_Label']], 
                           left_on='Highest_Topic_Label_Split', right_on='Highest_Topic_Label_Split', how='left')

# Drop the auxiliary column used for merging
final_merged_df.drop(columns=['Highest_Topic_Label_Split'], inplace=True)

# Save the final merged dataframe to a new CSV file
final_merged_df.to_csv('First_Step_BERTopic_topic_info_labelled_with_Final_Label_Merged.csv', index=False)

print("Merging completed and the result is saved as 'First_Step_BERTopic_topic_info_labelled_with_Final_Label_Merged.csv'")
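
The `split_labels` helper in this notebook splits on a single delimiter, and different cells use different ones. If the label columns can contain both semicolons and commas as separators, one option is a regular-expression split. This is a sketch under that assumption; it would wrongly break up any label whose own text contains a comma, so it only fits data where both characters are always separators.

```python
import re


def split_labels(label_str):
    """Split on semicolons or commas; return [None] for missing (non-string) values."""
    if not isinstance(label_str, str):
        return [None]
    parts = [part.strip() for part in re.split(r"[;,]", label_str)]
    # Drop empty fragments left by trailing or doubled delimiters.
    return [part for part in parts if part]


print(split_labels("Robotics; Vision, NLP"))
print(split_labels(float("nan")))
```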


**Second Step - Topic Knowledge**

Second Step Hierarchical Topic Knowledge Assignment

In [13]:
import pandas as pd

# Load the two datasets
first_step_df = pd.read_csv('Second_Step_Topic_Clusters_with_human_readable_label.csv')
higher_level_topics_df = pd.read_csv('Hierchical_Topics_Second_Level-Topics.csv')

# Function to clean and split the topics
def clean_and_split_topics(topic_str):
    # Remove quotes and split by commas, then strip any extra spaces
    topics = [topic.strip().replace('"', '') for topic in topic_str.split(',')]
    return topics

# Apply the cleaning function to the 'Human_Readable_Topic' column
first_step_df['Human_Readable_Topic'] = first_step_df['Human_Readable_Topic'].str.replace('"', '', regex=False)
higher_level_topics_df['Human_Readable_Topic'] = higher_level_topics_df['Human_Readable_Topic'].apply(clean_and_split_topics)

# Explode the 'Human_Readable_Topic' to create a row for each topic
higher_level_topics_df = higher_level_topics_df.explode('Human_Readable_Topic')

# Merge the two datasets on the cleaned 'Human_Readable_Topic' column
merged_df = pd.merge(first_step_df, higher_level_topics_df[['Human_Readable_Topic', 'Second_Level_Topic_Label']], 
                     on='Human_Readable_Topic', how='left')

# Save the merged dataframe to a new CSV file
merged_df.to_csv('BERTopic_Second_Step_Second_Level_Topic_Knowledge.csv', index=False)

print("Merging completed and the result is saved as 'BERTopic_Second_Step_Second_Level_Topic_Knowledge.csv'")

Merging completed and the result is saved as 'BERTopic_Second_Step_Second_Level_Topic_Knowledge.csv'
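
The clean, split, explode, and merge sequence is repeated for each step and level, so it could be factored into one helper. The sketch below is a hypothetical refactoring, not code from the notebook: `attach_parent_labels` and the toy data are assumptions. Unlike the cells above, it also strips whitespace on the child side and drops duplicate keys on the parent side so that the left merge cannot multiply child rows.

```python
import pandas as pd


def attach_parent_labels(child_df, parent_df, key, label_col, sep=","):
    """Left-join `label_col` from `parent_df` onto `child_df` via a delimited `key` column.

    Hypothetical helper for illustration only.
    """
    parent = parent_df[[key, label_col]].dropna(subset=[key]).copy()
    # One row per child topic: remove quotes, split the delimited string, explode the lists.
    parent[key] = parent[key].str.replace('"', "", regex=False).str.split(sep)
    parent = parent.explode(key)
    parent[key] = parent[key].str.strip()
    # Keep one parent per child so the left merge cannot duplicate child rows.
    parent = parent.drop_duplicates(subset=[key])

    child = child_df.copy()
    child[key] = child[key].str.replace('"', "", regex=False).str.strip()
    return pd.merge(child, parent, on=key, how="left")


# Toy data standing in for the CSV files.
children = pd.DataFrame({"Human_Readable_Topic": ['"Topic A"', "Topic B", "Topic C"]})
parents = pd.DataFrame({
    "Human_Readable_Topic": ['"Topic A", "Topic B"'],
    "Second_Level_Topic_Label": ["Parent 1"],
})
labelled = attach_parent_labels(
    children, parents, "Human_Readable_Topic", "Second_Level_Topic_Label"
)
print(labelled)
```

The same helper could be called with `sep=";"` for the third-level files, which use semicolons between labels.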


In [14]:
import pandas as pd

# Load the datasets
first_step_df = pd.read_csv('BERTopic_Second_Step_Second_Level_Topic_Knowledge.csv')
highest_level_topics_df = pd.read_csv('Hierchical_Topics_Third_Level-Topics.csv')

# Function to clean and split the 'Second_Level_Topic_Label' column
def clean_and_split_labels(label_str):
    # Split by semicolons and strip any extra spaces
    labels = [label.strip() for label in label_str.split(';')]
    return labels

# Apply the cleaning and splitting function to the 'Second_Level_Topic_Label' column
highest_level_topics_df['Second_Level_Topic_Label'] = highest_level_topics_df['Second_Level_Topic_Label'].apply(clean_and_split_labels)

# Explode 'Second_Level_Topic_Label' to create a row for each label
highest_level_topics_df = highest_level_topics_df.explode('Second_Level_Topic_Label')

# Merge the second step dataset with the highest level topics based on 'Second_Level_Topic_Label'
final_merged_df = pd.merge(first_step_df, highest_level_topics_df[['Second_Level_Topic_Label', 'Highest_Topic_Label']], 
                           on='Second_Level_Topic_Label', how='left')

# Save the final merged dataframe to a new CSV file
final_merged_df.to_csv('BERTopic_Second_Step_Third_Level_Topic_Knowledge.csv', index=False)

print("Merging completed and the result is saved as BERTopic_Second_Step_Third_Level_Topic_Knowledge.csv")


ValueError: You are trying to merge on float64 and object columns for key 'Second_Level_Topic_Label'. If you wish to proceed you should use pd.concat
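
The error above says that one side's `Second_Level_Topic_Label` is `float64`. One plausible cause is that the column in the previously saved CSV contains only missing values (no topic matched in the preceding merge), in which case pandas reads it back as `float64`. Under that assumption, casting the key to `object` on both sides lets the merge run. The helper name `merge_on_text_key` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd


def merge_on_text_key(left, right, key, how="left"):
    """Merge after casting `key` to object dtype on both sides.

    Hypothetical helper for illustration only.
    """
    left = left.copy()
    right = right.copy()
    # An all-NaN column read from CSV is float64; object dtype lets it merge with strings.
    left[key] = left[key].astype(object)
    right[key] = right[key].astype(object)
    return pd.merge(left, right, on=key, how=how)


# Toy reproduction: the left key holds only NaN, so pandas stores it as float64.
left = pd.DataFrame({"Topic": [0, 1], "Second_Level_Topic_Label": [np.nan, np.nan]})
right = pd.DataFrame({
    "Second_Level_Topic_Label": ["Parent 1"],
    "Highest_Topic_Label": ["Root 1"],
})
print(left["Second_Level_Topic_Label"].dtype)
merged = merge_on_text_key(left, right, "Second_Level_Topic_Label")
print(merged)
```

The cast only removes the dtype error. If the column really is empty, every `Highest_Topic_Label` in the result will still be missing, so the earlier second-level merge is the step to investigate.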

In [None]:
import pandas as pd

# Load the datasets
first_step_final_df = pd.read_csv('BERTopic_Second_Step_Third_Level_Topic_Knowledge.csv')
final_highest_level_topics_df = pd.read_csv('final_highest_level_topic_labels_with_representatives.csv')

# Function to split the 'Highest_Topic_Label' on commas, handling NaN values
def split_labels(label_str):
    if isinstance(label_str, str):  # Check if the label_str is a string
        # Split by comma and strip any extra spaces
        labels = [label.strip() for label in label_str.split(',')]
    else:
        labels = [None]  # Return a list with None if label_str is not a string
    return labels

# Apply the splitting function to the 'Highest_Topic_Label' in the first dataset
first_step_final_df['Highest_Topic_Label_Split'] = first_step_final_df['Highest_Topic_Label'].apply(split_labels)

# Explode the first dataframe to have one row per label
first_step_final_df = first_step_final_df.explode('Highest_Topic_Label_Split')

# Explode the final_highest_level_topics_df based on 'Highest_Topic_Label' for proper matching
final_highest_level_topics_df['Highest_Topic_Label_Split'] = final_highest_level_topics_df['Highest_Topic_Label'].apply(split_labels)
final_highest_level_topics_df = final_highest_level_topics_df.explode('Highest_Topic_Label_Split')

# Merge the exploded first dataset with the final highest level topics on 'Highest_Topic_Label_Split'
final_merged_df = pd.merge(first_step_final_df, final_highest_level_topics_df[['Highest_Topic_Label_Split', 'Final_Label']], 
                           left_on='Highest_Topic_Label_Split', right_on='Highest_Topic_Label_Split', how='left')

# Drop the auxiliary column used for merging
final_merged_df.drop(columns=['Highest_Topic_Label_Split'], inplace=True)

# Save the final merged dataframe to a new CSV file
final_merged_df.to_csv('Second_Step_BERTopic_topic_info_labelled_with_Final_Label_Merged.csv', index=False)

print("Merging completed and the result is saved as 'Second_Step_BERTopic_topic_info_labelled_with_Final_Label_Merged.csv'")


In [2]:
import pandas as pd

# Load the two datasets
first_step_df = pd.read_csv('Second_step_clustering_results_labelled.csv')
higher_level_topics_df = pd.read_csv('higherr_level_topic_labels_with_representatives.csv')

# Function to clean and split the topics
def clean_and_split_topics(topic_str):
    # Remove quotes and split by commas, then strip any extra spaces
    topics = [topic.strip().replace('"', '') for topic in topic_str.split(',')]
    return topics

# Apply the cleaning function to the 'Human_Readable_Topic' column
first_step_df['Human_Readable_Topic'] = first_step_df['Human_Readable_Topic'].str.replace('"', '', regex=False)
higher_level_topics_df['Human_Readable_Topic'] = higher_level_topics_df['Human_Readable_Topic'].apply(clean_and_split_topics)

# Explode the 'Human_Readable_Topic' to create a row for each topic
higher_level_topics_df = higher_level_topics_df.explode('Human_Readable_Topic')

# Merge the two datasets on the cleaned 'Human_Readable_Topic' column
merged_df = pd.merge(first_step_df, higher_level_topics_df[['Human_Readable_Topic', 'Higher_Topic_Label']], 
                     on='Human_Readable_Topic', how='left')

# Save the merged dataframe to a new CSV file
merged_df.to_csv('Second_Step_BERTopic_topic_info_labelled_with_Higher_Topic_Label_cleaned.csv', index=False)

print("Merging completed and the result is saved as 'Second_Step_BERTopic_topic_info_labelled_with_Higher_Topic_Label_cleaned.csv'")


Merging completed and the result is saved as 'Second_Step_BERTopic_topic_info_labelled_with_Higher_Topic_Label_cleaned.csv'


Assign the Highest Topic Knowledge

In [4]:
import pandas as pd

# Load the datasets
first_step_df = pd.read_csv('Second_Step_BERTopic_topic_info_labelled_with_Higher_Topic_Label_cleaned.csv')
highest_level_topics_df = pd.read_csv('highest_level_topic_labels_with_representatives.csv')

# Function to clean and split the 'Higher_Topic_Label' column
def clean_and_split_labels(label_str):
    # Split by semicolons and strip any extra spaces
    labels = [label.strip() for label in label_str.split(';')]
    return labels

# Apply the cleaning and splitting function to the 'Higher_Topic_Label' column
highest_level_topics_df['Higher_Topic_Label'] = highest_level_topics_df['Higher_Topic_Label'].apply(clean_and_split_labels)

# Explode the 'Higher_Topic_Label' to create a row for each label
highest_level_topics_df = highest_level_topics_df.explode('Higher_Topic_Label')

# Merge the first step dataset with the highest level topics based on 'Higher_Topic_Label'
final_merged_df = pd.merge(first_step_df, highest_level_topics_df[['Higher_Topic_Label', 'Highest_Topic_Label']], 
                           on='Higher_Topic_Label', how='left')

# Save the final merged dataframe to a new CSV file
final_merged_df.to_csv('Second_Step_BERTopic_topic_info_labelled_with_Highest_Topic_Label_final.csv', index=False)

print("Merging completed and the result is saved as 'Second_Step_BERTopic_topic_info_labelled_with_Highest_Topic_Label_final.csv'")


Merging completed and the result is saved as 'Second_Step_BERTopic_topic_info_labelled_with_Highest_Topic_Label_final.csv'


Assign the Final Labels 

In [1]:
import pandas as pd

# Load the datasets
first_step_final_df = pd.read_csv('First_Step_BERTopic_topic_info_labelled_with_Highest_Topic_Label_final.csv')
final_highest_level_topics_df = pd.read_csv('final_highest_level_topic_labels_with_representatives.csv')

# Merge the first step dataset with the final highest level topics based on 'Highest_Topic_Label'
final_merged_df = pd.merge(first_step_final_df, final_highest_level_topics_df[['Highest_Topic_Label', 'Final_Label']], 
                           on='Highest_Topic_Label', how='left')

# Save the final merged dataframe to a new CSV file
final_merged_df.to_csv('First_Step_BERTopic_topic_info_labelled_with_Final_Label_Merged.csv', index=False)

print("Merging completed and the result is saved as 'First_Step_BERTopic_topic_info_labelled_with_Final_Label_Merged.csv'")


Merging completed and the result is saved as 'First_Step_BERTopic_topic_info_labelled_with_Final_Label_Merged.csv'


In [19]:
import pandas as pd

# Load the datasets
first_step_final_df = pd.read_csv('First_Step_BERTopic_topic_info_labelled_with_Highest_Topic_Label_final.csv')
final_highest_level_topics_df = pd.read_csv('final_highest_level_topic_labels_with_representatives.csv')

# Function to split the 'Highest_Topic_Label' on semicolons, handling NaN values
def split_labels(label_str):
    if isinstance(label_str, str):  # Check if the label_str is a string
        # Split by semicolon and strip any extra spaces
        labels = [label.strip() for label in label_str.split(';')]
    else:
        labels = [None]  # Return a list with None if label_str is not a string
    return labels

# Apply the splitting function to the 'Highest_Topic_Label' in the first dataset
first_step_final_df['Highest_Topic_Label_Split'] = first_step_final_df['Highest_Topic_Label'].apply(split_labels)

# Explode the first dataframe to have one row per label
first_step_final_df = first_step_final_df.explode('Highest_Topic_Label_Split')

# Merge the exploded first dataset with the final highest level topics on 'Highest_Topic_Label_Split'
final_merged_df = pd.merge(first_step_final_df, final_highest_level_topics_df[['Highest_Topic_Label', 'Final_Label']], 
                           left_on='Highest_Topic_Label_Split', right_on='Highest_Topic_Label', how='left')

# Drop the auxiliary column used for merging
final_merged_df.drop(columns=['Highest_Topic_Label_Split'], inplace=True)

# Save the final merged dataframe to a new CSV file
final_merged_df.to_csv('First_Step_BERTopic_topic_info_labelled_with_Final_Label_Merged.csv', index=False)

print("Merging completed and the result is saved as 'First_Step_BERTopic_topic_info_labelled_with_Final_Label_Merged.csv'")


Merging completed and the result is saved as 'First_Step_BERTopic_topic_info_labelled_with_Final_Label_Merged.csv'


In [21]:
import pandas as pd

# Load the datasets
first_step_final_df = pd.read_csv('First_Step_BERTopic_topic_info_labelled_with_Highest_Topic_Label_final.csv')
final_highest_level_topics_df = pd.read_csv('final_highest_level_topic_labels_with_representatives.csv')

# Function to split the 'Highest_Topic_Label' on commas, handling NaN values
def split_labels(label_str):
    if isinstance(label_str, str):  # Check if the label_str is a string
        # Split by comma and strip any extra spaces
        labels = [label.strip() for label in label_str.split(',')]
    else:
        labels = [None]  # Return a list with None if label_str is not a string
    return labels

# Apply the splitting function to the 'Highest_Topic_Label' in the first dataset
first_step_final_df['Highest_Topic_Label_Split'] = first_step_final_df['Highest_Topic_Label'].apply(split_labels)

# Explode the first dataframe to have one row per label
first_step_final_df = first_step_final_df.explode('Highest_Topic_Label_Split')

# Explode the final_highest_level_topics_df based on 'Highest_Topic_Label' for proper matching
final_highest_level_topics_df['Highest_Topic_Label_Split'] = final_highest_level_topics_df['Highest_Topic_Label'].apply(split_labels)
final_highest_level_topics_df = final_highest_level_topics_df.explode('Highest_Topic_Label_Split')

# Merge the exploded first dataset with the final highest level topics on 'Highest_Topic_Label_Split'
final_merged_df = pd.merge(first_step_final_df, final_highest_level_topics_df[['Highest_Topic_Label_Split', 'Final_Label']], 
                           left_on='Highest_Topic_Label_Split', right_on='Highest_Topic_Label_Split', how='left')

# Drop the auxiliary column used for merging
final_merged_df.drop(columns=['Highest_Topic_Label_Split'], inplace=True)

# Save the final merged dataframe to a new CSV file
final_merged_df.to_csv('First_Step_BERTopic_topic_info_labelled_with_Final_Label_Merged.csv', index=False)

print("Merging completed and the result is saved as 'First_Step_BERTopic_topic_info_labelled_with_Final_Label_Merged.csv'")


Merging completed and the result is saved as 'First_Step_BERTopic_topic_info_labelled_with_Final_Label_Merged.csv'


In [2]:
import pandas as pd

# Load the datasets
first_step_final_df = pd.read_csv('Second_Step_BERTopic_topic_info_labelled_with_Highest_Topic_Label_final.csv')
final_highest_level_topics_df = pd.read_csv('final_highest_level_topic_labels_with_representatives.csv')

# Function to split the 'Highest_Topic_Label' on commas, handling NaN values
def split_labels(label_str):
    if isinstance(label_str, str):  # Check if the label_str is a string
        # Split by comma and strip any extra spaces
        labels = [label.strip() for label in label_str.split(',')]
    else:
        labels = [None]  # Return a list with None if label_str is not a string
    return labels

# Apply the splitting function to the 'Highest_Topic_Label' in the first dataset
first_step_final_df['Highest_Topic_Label_Split'] = first_step_final_df['Highest_Topic_Label'].apply(split_labels)

# Explode the first dataframe to have one row per label
first_step_final_df = first_step_final_df.explode('Highest_Topic_Label_Split')

# Explode the final_highest_level_topics_df based on 'Highest_Topic_Label' for proper matching
final_highest_level_topics_df['Highest_Topic_Label_Split'] = final_highest_level_topics_df['Highest_Topic_Label'].apply(split_labels)
final_highest_level_topics_df = final_highest_level_topics_df.explode('Highest_Topic_Label_Split')

# Merge the exploded first dataset with the final highest level topics on 'Highest_Topic_Label_Split'
final_merged_df = pd.merge(first_step_final_df, final_highest_level_topics_df[['Highest_Topic_Label_Split', 'Final_Label']], 
                           left_on='Highest_Topic_Label_Split', right_on='Highest_Topic_Label_Split', how='left')

# Drop the auxiliary column used for merging
final_merged_df.drop(columns=['Highest_Topic_Label_Split'], inplace=True)

# Save the final merged dataframe to a new CSV file
final_merged_df.to_csv('Second_Step_BERTopic_topic_info_labelled_with_Final_Label_Merged.csv', index=False)

print("Merging completed and the result is saved as 'Second_Step_BERTopic_topic_info_labelled_with_Final_Label_Merged.csv'")


FileNotFoundError: [Errno 2] No such file or directory: 'Second_Step_BERTopic_topic_info_labelled_with_Highest_Topic_Label_final.csv'
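
The cells in this notebook depend on CSV files written by other cells, so running them out of order or from a different working directory can produce the `FileNotFoundError` above. A small pre-flight check, sketched below with the hypothetical helper `missing_inputs`, reports which inputs are absent before any loading starts.

```python
from pathlib import Path


def missing_inputs(paths, base_dir="."):
    """Return the entries of `paths` that are not existing files under `base_dir`.

    Hypothetical helper for illustration only.
    """
    base = Path(base_dir)
    return [p for p in paths if not (base / p).is_file()]


required = [
    "Second_Step_BERTopic_topic_info_labelled_with_Highest_Topic_Label_final.csv",
    "final_highest_level_topic_labels_with_representatives.csv",
]
missing = missing_inputs(required)
if missing:
    print("Run the cells that write these files first:", missing)
```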