# This notebook performs keyword extraction and topic coherence evaluation for the third hierarchical level of topic clustering.

## Steps:
1. **Preprocessing the text data**:
   - Loads the dataset `Concatenated_First_Second_Step_BERTopic_Result.csv`.
   - Lowercases and tokenizes the text with NLTK, then removes stopwords and non-alphabetic tokens to prepare the documents for keyword extraction.

2. **Keyword extraction with KeyBERT**:
   - Groups the documents by `Highest_Topic_Label` (third-level topic label).
   - Uses the KeyBERT model to extract the top 10 keywords for each topic at the third level.
   - The extracted keywords are stored in a dictionary and merged back into the original dataset.

3. **Topic diversity and coherence score calculation**:
   - Computes topic diversity as the ratio of unique keywords to total keywords across all third-level topics.
   - Creates a Gensim dictionary from the processed documents.
   - For each third-level topic, the c_v coherence score is calculated using Gensim's `CoherenceModel`.
   - The topic coherence scores, keywords, and topic labels are stored in a DataFrame and saved as `Third_Level_Topic_coherence.csv`.

4. **Results output**:
   - Displays the results, including third-level topic labels, keywords, and their coherence scores.
   - The average coherence score across all topics is calculated and displayed.
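
As a rough illustration of the preprocessing in step 1, the sketch below reproduces the lowercase, alphabetic-only, stopword-free filter in plain Python. It is an approximation under stated assumptions: the regex tokenizer and the tiny `STOP_WORDS` set are hypothetical stand-ins for NLTK's `word_tokenize` and English stopword list, so the exact tokens can differ from what the notebook produces.

```python
import re

# Hypothetical, deliberately tiny stopword set (the notebook uses NLTK's English list)
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "is", "for", "with", "we"}

def preprocess_sketch(text):
    """Lowercase the text, keep purely alphabetic tokens, and drop stopwords."""
    # Rough tokenizer: runs of letters, digits, apostrophes, and hyphens
    tokens = re.findall(r"[a-z0-9'-]+", text.lower())
    # isalpha() rejects numbers and hyphenated or mixed tokens such as "two-variable"
    return [tok for tok in tokens if tok.isalpha() and tok not in STOP_WORDS]

example = preprocess_sketch("We study the Two-Variable fragment of logic in 2009.")
print(example)
```

Joining the returned tokens with spaces gives the string form passed to KeyBERT, while the token list itself is the form Gensim expects.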


In [22]:
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk

# Ensure NLTK resources are downloaded
nltk.download('punkt')
nltk.download('stopwords')

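# Load the concatenated first- and second-step BERTopic results
file_path = "Concatenated_First_Second_Step_BERTopic_Result.csv"  # Replace with your file path
data = pd.read_csv(file_path)

# Drop exact duplicate rows; only the last expression in a cell is displayed
merged_second_step_df = data.drop_duplicates()
merged_second_step_df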

[nltk_data] Downloading package punkt to /home/yc656703/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/yc656703/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,id,submitter,authors,title,comments,journal-ref,doi,report-no,categories,license,...,Human_Readable_Topic,Higher_Topic_Label,Highest_Topic_Label,Unnamed: 0_x,Second_Step_Topic_Name,Second_Step_Topic_Keywords,Second_Step_Topic_Representation,Second_Step_Representative_Docs,Unnamed: 0_y,Final_Label
0,806.1636,Ian Pratt-Hartmann,Ian Pratt-Hartmann,Data-Complexity of the Two-Variable Fragment w...,,"Information and Computation, 207(8), 2009, pp....",10.1016/j.ic.2009.02.004,,cs.LO cs.AI cs.CC,http://arxiv.org/licenses/nonexclusive-distrib...,...,Formal Reasoning and Satisfiability in Logic,"Reasoning and Problem-Solving with Logic, Lang...",Artificial Intelligence and Reasoning Systems,,,,,,,
1,808.0521,Ian Pratt-Hartmann,Ian Pratt-Hartmann and Lawrence S. Moss,Logics for the Relational Syllogistic,,"Review of Symbolic Logic, 2(4), 2009, pp. 647-...",10.1017/S1755020309990086,,cs.LO cs.CC cs.CL,http://arxiv.org/licenses/nonexclusive-distrib...,...,Formal Reasoning and Satisfiability in Logic,"Reasoning and Problem-Solving with Logic, Lang...",Artificial Intelligence and Reasoning Systems,,,,,,,
2,905.3108,Ian Pratt-Hartmann,Yevgeny Kazakov and Ian Pratt-Hartmann,A Note on the Complexity of the Satisfiability...,Full proofs for paper presented at the IEEE Co...,"Proceedings, 24th Annual IEEE Symposium on Log...",10.1109/LICS.2009.17,,cs.LO cs.AI cs.CC,http://arxiv.org/licenses/nonexclusive-distrib...,...,Formal Reasoning and Satisfiability in Logic,"Reasoning and Problem-Solving with Logic, Lang...",Artificial Intelligence and Reasoning Systems,,,,,,,
3,1104.2444,Claus-Peter Wirth,Claus-Peter Wirth,A Simplified and Improved Free-Variable Framew...,ii + 114 pages,IfCoLog Journal of Logics and their Applicatio...,,SEKI Report SR-2011-01,cs.AI math.LO,http://arxiv.org/licenses/nonexclusive-distrib...,...,Formal Reasoning and Satisfiability in Logic,"Reasoning and Problem-Solving with Logic, Lang...",Artificial Intelligence and Reasoning Systems,,,,,,,
4,1301.387,Pierfrancesco La Mura,Pierfrancesco La Mura,Game Networks,Appears in Proceedings of the Sixteenth Confer...,,,UAI-P-2000-PG-335-342,cs.GT cs.AI,http://arxiv.org/licenses/nonexclusive-distrib...,...,Game Theory and Strategic Decision Making,Game Theory and Strategic Decision Making,Decision Making and Optimization under Uncerta...,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
60323,2408.08302,Bin Hu,"Usman Syed, Ethan Light, Xingang Guo, Huan Zha...",Benchmarking the Capabilities of Large Languag...,,,,,cs.AI cs.CL cs.LG,http://arxiv.org/licenses/nonexclusive-distrib...,...,Transportation Systems and Mobility Analysis,Transportation and Mobility Analysis,Transportation Systems and Environmental Analy...,20548.0,36_transportation_traffic_routes_ridesharing,"transportation, traffic, routes, ridesharing, ...","transportation (0.48), traffic (0.47), routes ...",Mobility service route design requires deman...,36.0,Transportation Systems and Environmental Analy...
60324,2408.08307,Ahmed Imtiaz Humayun,"Ahmed Imtiaz Humayun, Ibtihel Amara, Candice S...",Understanding the Local Geometry of Generative...,"Pre-print. 11 pages main, 8 pages app., 28 fig...",,,,cs.LG cs.CV,http://creativecommons.org/licenses/by-nc-sa/4.0/,...,Manifold Learning with Autoencoders,Geometric Deep Learning on Manifolds,Geometric and Equivariant Deep Learning,20549.0,291_autoencoders_autoencoder_embeddings_encoder,"autoencoders, autoencoder, embeddings, encoder...","autoencoders (0.58), autoencoder (0.52), embed...",Representing a manifold of very high-dimensi...,291.0,Geometric and Equivariant Deep Learning
60325,2408.08310,Ruihang Li,"Ruihang Li, Yixuan Wei, Miaosen Zhang, Nenghai...",ScalingFilter: Assessing Data Quality through ...,,,,,cs.CL,http://arxiv.org/licenses/nonexclusive-distrib...,...,Large Language Models and Dataset Generation,Advances in Large Language Models,Large Language Models,20550.0,146_tokenizers_language_datasets_nlp,"tokenizers, language, datasets, nlp, models, f...","tokenizers (0.42), language (0.35), datasets (...",The rapid advancement of large language mode...,146.0,Large Language Models
60326,cs/0701194,Andrij Rovenchak,Solomija Buk and Andrij Rovenchak,Menzerath-Altmann Law for Syntactic Structures...,8 pages; submitted to the Proceedings of the I...,"Glottotheory. Vol. 1, No. 1, pp 10-17 (2008)",10.1515/glot-2008-0002,,cs.CL,,...,Multilingual Language Processing and Linguistics,Multilingual Natural Language Processing,Natural Language Processing,20551.0,170_lingual_multilingual_corpus_translations,"lingual, multilingual, corpus, translations, l...","lingual (0.52), multilingual (0.52), corpus (0...","As Uzbek language is agglutinative, has many...",170.0,Natural Language Processing


In [23]:
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from keybert import KeyBERT
import nltk

# Ensure NLTK resources are downloaded
nltk.download('punkt')
nltk.download('stopwords')

# NOTE: `data_cleaned` is not defined in the cells shown here; it is assumed to be
# the cleaned dataset produced by an earlier step of the workflow
data = data_cleaned.copy()

# Raw text of each document
documents = data['text']

# Preprocess documents
stop_words = set(stopwords.words('english'))

def preprocess(text):
    tokens = word_tokenize(text.lower())
    tokens = [token for token in tokens if token.isalpha() and token not in stop_words]
    return ' '.join(tokens)  # Join tokens back to a single string for KeyBERT

processed_docs = [preprocess(doc) for doc in documents]

# Initialize KeyBERT model
kw_model = KeyBERT()

# Group documents by 'Highest_Topic_Label' (third-level topic) and concatenate their text
grouped_docs = data.groupby('Highest_Topic_Label')['text'].apply(lambda texts: ' '.join(texts))

# Extract keywords per topic
def extract_keywords(text, model, num_keywords=10):
    keywords = model.extract_keywords(text, stop_words='english', top_n=num_keywords)
    return [keyword[0] for keyword in keywords]

topic_keywords = {}
for topic_label, docs in grouped_docs.items():
    processed_text = preprocess(docs)
    keywords = extract_keywords(processed_text, kw_model)
    topic_keywords[topic_label] = keywords

# Convert topic_keywords dictionary to DataFrame
keywords_df = pd.DataFrame(list(topic_keywords.items()), columns=['Highest_Topic_Label', 'KeyBERT_Keywords'])

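# Merge the KeyBERT keywords back into the original dataset.
# If `data` already has a 'KeyBERT_Keywords' column, pandas keeps both and names them
# 'KeyBERT_Keywords_x' (existing) and 'KeyBERT_Keywords_y' (newly extracted)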
data_with_keywords = pd.merge(data, keywords_df, on='Highest_Topic_Label', how='left')

# Display the resulting dataset with the KeyBERT keywords added
data_with_keywords



Unnamed: 0,id,submitter,authors,title,comments,journal-ref,doi,report-no,categories,license,...,Aspect1,Count,Higher_Topic_Label,Highest_Topic_Label,Human_Readable_Topic,Name,Representation,Representative_Docs,KeyBERT_Keywords_x,KeyBERT_Keywords_y
0,806.1636,Ian Pratt-Hartmann,Ian Pratt-Hartmann,Data-Complexity of the Two-Variable Fragment w...,,"Information and Computation, 207(8), 2009, pp....",10.1016/j.ic.2009.02.004,,cs.LO cs.AI cs.CC,http://arxiv.org/licenses/nonexclusive-distrib...,...,"['logic', 'satisfiability', 'logics', 'clause'...",138,"Reasoning and Problem-Solving with Logic, Lang...",Artificial Intelligence and Reasoning Systems,Formal Reasoning and Satisfiability in Logic,57_satisfiability_semantics_unsatisfiable_logics,"['satisfiability', 'semantics', 'unsatisfiable...",[' We formulate discussion graph semantics of...,,"[satisfiability, formalizations, logics, forma..."
1,808.0521,Ian Pratt-Hartmann,Ian Pratt-Hartmann and Lawrence S. Moss,Logics for the Relational Syllogistic,,"Review of Symbolic Logic, 2(4), 2009, pp. 647-...",10.1017/S1755020309990086,,cs.LO cs.CC cs.CL,http://arxiv.org/licenses/nonexclusive-distrib...,...,"['logic', 'satisfiability', 'logics', 'clause'...",138,"Reasoning and Problem-Solving with Logic, Lang...",Artificial Intelligence and Reasoning Systems,Formal Reasoning and Satisfiability in Logic,57_satisfiability_semantics_unsatisfiable_logics,"['satisfiability', 'semantics', 'unsatisfiable...",[' We formulate discussion graph semantics of...,,"[satisfiability, formalizations, logics, forma..."
2,905.3108,Ian Pratt-Hartmann,Yevgeny Kazakov and Ian Pratt-Hartmann,A Note on the Complexity of the Satisfiability...,Full proofs for paper presented at the IEEE Co...,"Proceedings, 24th Annual IEEE Symposium on Log...",10.1109/LICS.2009.17,,cs.LO cs.AI cs.CC,http://arxiv.org/licenses/nonexclusive-distrib...,...,"['logic', 'satisfiability', 'logics', 'clause'...",138,"Reasoning and Problem-Solving with Logic, Lang...",Artificial Intelligence and Reasoning Systems,Formal Reasoning and Satisfiability in Logic,57_satisfiability_semantics_unsatisfiable_logics,"['satisfiability', 'semantics', 'unsatisfiable...",[' We formulate discussion graph semantics of...,,"[satisfiability, formalizations, logics, forma..."
3,1104.2444,Claus-Peter Wirth,Claus-Peter Wirth,A Simplified and Improved Free-Variable Framew...,ii + 114 pages,IfCoLog Journal of Logics and their Applicatio...,,SEKI Report SR-2011-01,cs.AI math.LO,http://arxiv.org/licenses/nonexclusive-distrib...,...,"['logic', 'satisfiability', 'logics', 'clause'...",138,"Reasoning and Problem-Solving with Logic, Lang...",Artificial Intelligence and Reasoning Systems,Formal Reasoning and Satisfiability in Logic,57_satisfiability_semantics_unsatisfiable_logics,"['satisfiability', 'semantics', 'unsatisfiable...",[' We formulate discussion graph semantics of...,,"[satisfiability, formalizations, logics, forma..."
4,1203.055,Afshin Rostamizadeh,"Corinna Cortes, Mehryar Mohri, Afshin Rostamiz...",Algorithms for Learning Kernels Based on Cente...,,Journal of Machine Learning Research 13 (2012)...,,,cs.LG cs.AI,http://arxiv.org/licenses/nonexclusive-distrib...,...,"['data', 'learning', 'models', 'model', 'train...",20553,,,Multimodal Models for Image-Text Analysis,-1_features_classification_learning_datasets,"['features', 'classification', 'learning', 'da...",[' With the significant advancements of Large...,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
54942,2408.08313,Weiyang Liu,"Zeju Qiu, Weiyang Liu, Haiwen Feng, Zhen Liu, ...",Can Large Language Models Understand Symbolic ...,"Technical Report v1 (44 pages, 23 figures, pro...",,,,cs.LG cs.AI cs.CL cs.CV,http://arxiv.org/licenses/nonexclusive-distrib...,...,"['visual', 'multimodal', 'reasoning', 'vision'...",376,Multimodal Large Language Models (MLLMs),Multimodal Learning and Vision-Language Models,Multimodal Large Language Models (MLLMs),13_multimodal_visual_captioning_mllm,"['multimodal', 'visual', 'captioning', 'mllm',...","["" In production, multi-modal large language ...",,"[utterances, captioners, captioner, captioning..."
54943,cs/0512102,Andrij Rovenchak,Solomija Buk and Andrij Rovenchak,"Statistical Parameters of the Novel ""Perekhres...",11 pages,Quantitative Linguistics 62: Exact methods in ...,10.1515/9783110894219.39,,cs.CL,,...,"['dependency', 'syntactic', 'parsing', 'morpho...",67,Natural Language Processing and Linguistics,Natural Language Processing,Syntactic Parsing and Language Change Analysis,131_parsers_treebanks_treebank_parsing,"['parsers', 'treebanks', 'treebank', 'parsing'...","["" Many studies have shown that human languag...",,"[corpus, lingual, multilinguals, linguistics, ..."
54944,cs/0701039,Ian Pratt-Hartmann,Ian Pratt-Hartmann,On the Complexity of the Numerically Definite ...,24 pages 1 figure,"Bulletin of Symbolic Logic, 14(1), 2008, pp. 1...",10.2178/bsl/1208358842,,cs.LO cs.AI cs.CC,,...,"['logic', 'satisfiability', 'logics', 'clause'...",138,"Reasoning and Problem-Solving with Logic, Lang...",Artificial Intelligence and Reasoning Systems,Formal Reasoning and Satisfiability in Logic,57_satisfiability_semantics_unsatisfiable_logics,"['satisfiability', 'semantics', 'unsatisfiable...",[' We formulate discussion graph semantics of...,,"[satisfiability, formalizations, logics, forma..."
54945,cs/0701194,Andrij Rovenchak,Solomija Buk and Andrij Rovenchak,Menzerath-Altmann Law for Syntactic Structures...,8 pages; submitted to the Proceedings of the I...,"Glottotheory. Vol. 1, No. 1, pp 10-17 (2008)",10.1515/glot-2008-0002,,cs.CL,,...,"['data', 'learning', 'models', 'model', 'train...",20553,,,Multimodal Models for Image-Text Analysis,-1_features_classification_learning_datasets,"['features', 'classification', 'learning', 'da...",[' With the significant advancements of Large...,,
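
The `KeyBERT_Keywords_x` and `KeyBERT_Keywords_y` columns in the output above are what `pd.merge` produces when both frames carry a column with the same name, which suggests the input already had a `KeyBERT_Keywords` column. The toy frames below are hypothetical and only illustrate that suffix behavior, along with one possible way to avoid it by dropping the stale column before merging.

```python
import pandas as pd

# Hypothetical stand-in for the dataset, which already has a (stale) keyword column
left = pd.DataFrame({
    "Highest_Topic_Label": ["A", "B"],
    "KeyBERT_Keywords": [None, None],
})

# Hypothetical stand-in for the freshly extracted per-topic keywords
right = pd.DataFrame({
    "Highest_Topic_Label": ["A"],
    "KeyBERT_Keywords": [["logic", "proof"]],
})

# Overlapping non-key columns receive the default suffixes "_x" (left) and "_y" (right)
merged = pd.merge(left, right, on="Highest_Topic_Label", how="left")
print(list(merged.columns))

# Dropping the stale column first keeps a single, unsuffixed keyword column
clean = pd.merge(
    left.drop(columns=["KeyBERT_Keywords"]),
    right,
    on="Highest_Topic_Label",
    how="left",
)
print(list(clean.columns))
```

With the stale column dropped, later cells could refer to `KeyBERT_Keywords` directly instead of `KeyBERT_Keywords_y`.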


In [24]:
# Work on the dataset that now carries the KeyBERT keywords
data = data_with_keywords.copy()
documents = data['text']

# Preprocess documents
stop_words = set(stopwords.words('english'))

def preprocess(text):
    tokens = word_tokenize(text.lower())
    tokens = [token for token in tokens if token.isalpha() and token not in stop_words]
    return tokens

processed_docs = [preprocess(doc) for doc in documents]

# Take one keyword list per third-level topic (every row of a topic holds the same list).
# 'KeyBERT_Keywords_y' already contains Python lists, so no string parsing is needed
topic_representations = data.groupby('Highest_Topic_Label')['KeyBERT_Keywords_y'].apply(lambda x: x.iloc[0])
topics = topic_representations.tolist()

# Calculate Topic Diversity
def calculate_topic_diversity(topics):
    unique_words = set()
    total_words = 0

    for topic in topics:
        unique_words.update(topic)  # Add words to the unique set
        total_words += len(topic)   # Count total words in all topics

    # Topic diversity is the proportion of unique words to total words
    topic_diversity = len(unique_words) / total_words if total_words > 0 else 0
    return topic_diversity

# Calculate topic diversity
topic_diversity = calculate_topic_diversity(topics)

# Display the topic diversity score
print(f"\nTopic Diversity: {topic_diversity}")



Topic Diversity: 0.69
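
To make the score concrete, here is a small worked example of the same ratio (unique keywords divided by total keywords) on two hypothetical toy topics. The function name `topic_diversity_sketch` and the keyword lists are illustrative only and do not come from the dataset.

```python
def topic_diversity_sketch(topics):
    """Share of distinct keywords among all keywords across topics."""
    total_words = sum(len(topic) for topic in topics)
    unique_words = {word for topic in topics for word in topic}
    return len(unique_words) / total_words if total_words > 0 else 0.0

# Two toy topics that share exactly one keyword ("solver")
toy_topics = [
    ["logic", "proof", "solver"],
    ["solver", "planning", "search"],
]

# 5 distinct keywords out of 6 in total
print(topic_diversity_sketch(toy_topics))
```

A value of 1.0 would mean no keyword is shared between topics, and lower values indicate more overlap between the keyword lists.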


## Topic Coherence

In [25]:
import os
from gensim.models import CoherenceModel
from gensim.corpora import Dictionary
import pandas as pd

# Disable Hugging Face tokenizers parallelism to avoid the warning
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Assuming 'processed_docs' is a list of tokenized documents from the previous code
# Create a Gensim dictionary from the processed documents
dictionary = Dictionary(processed_docs)

# Initialize a list to store the per-topic coherence c_v scores
per_topic_coherence_cv = []

# Calculate coherence for each topic in 'topics'
for topic in topics:
    # Create a list containing just the current topic
    current_topic = [topic]
    
    # Initialize the CoherenceModel for the current topic using 'c_v'
    coherence_model_cv = CoherenceModel(topics=current_topic, texts=processed_docs, dictionary=dictionary, coherence='c_v')
    
    # Compute the c_v coherence score
    coherence_cv = coherence_model_cv.get_coherence()
    print(f"Current topic: {current_topic}")
    print(f"Coherence c_v: {coherence_cv}")
    
    # Append the c_v score to the list
    per_topic_coherence_cv.append(coherence_cv)
    
# Create a DataFrame to display the results
results = pd.DataFrame({
    "Third_Level_Topic_Label": topic_representations.index,  # Use topic labels from topic_representations
    "Keywords": topic_representations.values,  # Keywords from earlier processing
    "Coherence c_v": per_topic_coherence_cv,  # Coherence scores
})

# Display the results
print(results)

# Save the results to a CSV file
results.to_csv('Third_Level_Topic_coherence.csv', index=False)


Current topic: [['crisistransformers', 'webnews', 'crises', 'disasters', 'tweets', 'twitter', 'crisisfacts', 'microblogs', 'semantic', 'thematic']]
Coherence c_v: 0.29306551466275194
Current topic: [['lasso', 'lassoglm', 'optimizations', 'optimizer', 'optimizers', 'penalized', 'penalizes', 'optimizes', 'regularization', 'penalize']]
Coherence c_v: 0.1462274503128631
Current topic: [['imaging', 'phases', 'microscopy', 'optics', 'photonics', 'nanobeamnn', 'optimizers', 'phase', 'microscope', 'reflectance']]
Coherence c_v: 0.22956213085342
Current topic: [['learns', 'memorization', 'attentional', 'classifiers', 'perceptrons', 'attention', 'attentions', 'recognition', 'softmax', 'neural']]
Coherence c_v: 0.28350375596069355
Current topic: [['adversarial', 'adversarially', 'dversarial', 'attackgnn', 'deepnet', 'attacks', 'imagenet', 'classifier', 'classifiers', 'badnet']]
Coherence c_v: 0.33027303611365977
Current topic: [['fisheries', 'fishing', 'sportfishing', 'fish', 'overfishing', 'lure
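
For intuition about what these scores reward, the sketch below computes a much simpler coherence measure: the average document-level NPMI over all keyword pairs of a topic. This is explicitly not Gensim's c_v, which is more involved (it relies on sliding-window co-occurrence counts and an indirect cosine comparison of NPMI vectors). The function name and the toy documents are hypothetical.

```python
import math
from itertools import combinations

def npmi_coherence_sketch(topic_words, docs):
    """Average document-level NPMI over all word pairs of one topic."""
    doc_sets = [set(doc) for doc in docs]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents containing all of the given words
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # the pair never co-occurs
        elif p12 == 1:
            scores.append(1.0)   # the pair co-occurs in every document
        else:
            scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical tokenized documents
toy_docs = [
    ["neural", "network", "training"],
    ["neural", "network", "attention"],
    ["fish", "fishing", "ocean"],
    ["fish", "ocean", "network"],
]

coherent = npmi_coherence_sketch(["neural", "network"], toy_docs)
mixed = npmi_coherence_sketch(["neural", "fish"], toy_docs)
print(coherent > mixed)
```

Keywords that tend to appear in the same documents score higher than keywords that never appear together, which is the general idea behind coherence measures such as c_v.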

In [26]:
file_path = 'Third_Level_Topic_coherence.csv'  # Replace with your file path
data = pd.read_csv(file_path)

# Average c_v coherence across all third-level topics
mean_score = data['Coherence c_v'].mean()
mean_score

0.35840321747998716