# import libraries

In [24]:
%matplotlib inline

%load_ext autoreload
%autoreload 2

from pandas_profiling import profile_report

import os
import xml.etree.ElementTree as ET
import pandas as pd
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
import openai
from dotenv import load_dotenv
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, OpenAI, PartOfSpeech

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

load_dotenv()

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload



numba.generated_jit is deprecated. Please see the documentation at: https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-generated-jit for more information and advice on a suitable replacement.


The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.


`import pandas_profiling` is going to be deprecated by April 1st. Please use `import ydata_profiling` instead.



True

# import data

In [43]:
# Initialize an empty list to store the data
data = []

# Specify the folder path
folder_path = '2020'

# Iterate over all files in the folder
for file_name in os.listdir(folder_path):
    if file_name.endswith('.xml'):
        file_path = os.path.join(folder_path, file_name)

        # Parse the XML file
        tree = ET.parse(file_path)
        root = tree.getroot()

        # Initialize an empty dictionary to store the data for this file
        file_data = {}

        # Iterate over all elements in the XML file
        for elem in root.iter():
            # Use the tag name as the column name and the text as the value
            file_data[elem.tag] = elem.text

        file_data['file_name']=file_name
        # Add the data for this file to the list
        data.append(file_data)

# Convert the list of dictionaries into a DataFrame
df = pd.DataFrame(data)

df['AbstractNarration']=df['AbstractNarration'].fillna('No abstract')

# Now 'df' is a DataFrame where each row represents one XML file and each column represents one XML tag
abstracts=df['AbstractNarration']

titles=df['AwardTitle']
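
The loop above flattens each XML file into one dictionary keyed by tag name. The following is a minimal sketch of that same flattening on an in-memory document. The sample XML and the helper name `flatten_xml` are made up for illustration; the real files contain many more tags. It shows one consequence worth knowing: when a tag repeats, only the last value survives.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature award file, used only to illustrate the flattening.
SAMPLE_XML = """<rootTag>
  <Award>
    <AwardTitle>Example award</AwardTitle>
    <AbstractNarration>Example abstract text.</AbstractNarration>
    <Investigator><FirstName>Ada</FirstName></Investigator>
    <Investigator><FirstName>Grace</FirstName></Investigator>
  </Award>
</rootTag>"""


def flatten_xml(xml_text):
    """Map every tag in the document to its text, as the import loop does."""
    root = ET.fromstring(xml_text)
    record = {}
    for elem in root.iter():
        # A later element with the same tag overwrites the earlier one.
        record[elem.tag] = elem.text
    return record


record = flatten_xml(SAMPLE_XML)
print(record["AwardTitle"])  # Example award
print(record["FirstName"])   # Grace (the first investigator was overwritten)
```

For topic modeling on `AbstractNarration` this is harmless as long as each file holds a single abstract, but columns built from repeated tags should be read with that limitation in mind.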

# EDA

In [25]:
profile=df.profile_report(
        title='Pandas profiling Report',
        correlations={'spearman':{'calculate':True},
        'pearson':{'calculate':True}}
        #,interactions={'targets':['taken'],'continuous':True}
        #,minimal=True
        #,explorative=True
        )

In [26]:
profile.to_file('abstract_profiling_.html')

Summarize dataset: 100%|██████████| 209/209 [15:02<00:00,  4.32s/it, Completed]                                        
Generate report structure: 100%|██████████| 1/1 [00:05<00:00,  5.77s/it]
Render HTML: 100%|██████████| 1/1 [00:41<00:00, 41.02s/it]
Export report to file: 100%|██████████| 1/1 [00:00<00:00,  4.20it/s]


This gives us an exploratory analysis of every field in the documents. For topic modeling, however, we will work only with the abstract text (`AbstractNarration`).

# Topic modeling

## Pre-calculate Embeddings

BERTopic works by converting documents into numerical values, called embeddings. This process can be very costly, especially if we want to iterate over parameters. Instead, we can calculate those embeddings once and feed them to BERTopic to skip calculating embeddings each time.

In [4]:
# Pre-calculate embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(abstracts, show_progress_bar=True)

Batches: 100%|██████████| 416/416 [05:39<00:00,  1.23it/s]
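
Since encoding took several minutes here, it can be worth caching the embeddings on disk between sessions. The sketch below is one possible way to do that with NumPy. The helper `load_or_compute_embeddings`, the file name, and the `fake_encode` stand-in are all hypothetical; in the notebook the compute function would be something like `lambda: embedding_model.encode(abstracts)`.

```python
import os
import tempfile

import numpy as np


def load_or_compute_embeddings(cache_path, compute_fn):
    """Return cached embeddings if the file exists, otherwise compute and cache them."""
    if os.path.exists(cache_path):
        return np.load(cache_path)
    embeddings = np.asarray(compute_fn())
    np.save(cache_path, embeddings)
    return embeddings


# Stand-in for the real encoder; it counts calls so the caching is visible.
calls = {"n": 0}


def fake_encode():
    calls["n"] += 1
    return np.ones((3, 4), dtype=np.float32)


cache_path = os.path.join(tempfile.mkdtemp(), "abstract_embeddings.npy")
first = load_or_compute_embeddings(cache_path, fake_encode)   # computes and saves
second = load_or_compute_embeddings(cache_path, fake_encode)  # loads from disk
print(calls["n"])  # 1
```

The cache is only valid while the documents and the embedding model stay the same, so the file should be deleted whenever either changes.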


## Preventing Stochastic Behavior

In BERTopic, we generally use a dimensionality reduction algorithm to reduce the size of the embeddings. This mitigates the curse of dimensionality to a certain degree.

As a default, this is done with **UMAP** which is an incredible algorithm for reducing dimensional space. However, by default, it shows stochastic behavior which creates different results each time you run it. To prevent that, we will need to set a `random_state` of the model before passing it to BERTopic.

As a result, we can now fully reproduce the results each time we run the model.

In [5]:
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

## Controlling Number of Topics

There is a parameter to control the number of topics, namely `nr_topics`. This parameter, however, merges topics **after** they have been created. It is useful when you need a fixed number of topics.

However, it is advised to control the number of topics through the cluster model, which is HDBSCAN by default. HDBSCAN has a parameter, `min_cluster_size`, that indirectly controls the number of topics that will be created. (BERTopic exposes the same setting as `min_topic_size` when no custom cluster model is passed.)

A higher `min_cluster_size` will generate fewer topics and a lower `min_cluster_size` will generate more topics.

Here, we will go with `min_cluster_size=150`, which on this dataset yields three topics plus an outlier group.

In [6]:
hdbscan_model = HDBSCAN(min_cluster_size=150, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

## Improving Default Representation

The default representation of topics is calculated through **c-TF-IDF**. However, c-TF-IDF is powered by the **CountVectorizer**, which converts text into tokens. Using the CountVectorizer, we can do a number of things:

* Remove stopwords
* Ignore infrequent words
* Increase the n-gram range

In other words, we can preprocess the topic representations **after** documents are assigned to topics. This will not influence the clustering process in any way.

Here, we will ignore English stopwords and infrequent words. Moreover, by increasing the n-gram range we will consider topic representations that are made up of one or two words.

In [7]:
vectorizer_model = CountVectorizer(stop_words="english", min_df=2, ngram_range=(1, 2))

## Additional Representations

Previously, we have tuned the default representation but there are quite a number of **other topic representations** in BERTopic that we can choose from. From **KeyBERTInspired** and **PartOfSpeech**, to **OpenAI's ChatGPT** and **open-source** alternatives, many representations are possible.

In BERTopic, you can model many different topic representations simultaneously to test them out and get different perspectives on the topic descriptions. This is called **multi-aspect** topic modeling.

Here, we will demonstrate a number of interesting and useful representations in BERTopic:

* KeyBERTInspired
  * A method that derives inspiration from how KeyBERT works
* PartOfSpeech
  * Using spaCy's POS tagging to extract words
* MaximalMarginalRelevance
  * Diversify the topic words
* OpenAI
  * Use ChatGPT to label our topics

In [8]:
# KeyBERT
keybert_model = KeyBERTInspired()
# Part-of-Speech
pos_model = PartOfSpeech("en_core_web_sm")

# MMR
mmr_model = MaximalMarginalRelevance(diversity=0.3)

# GPT-3.5
openai.api_key = os.getenv('OPENAI_API_KEY')
prompt = """
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]

Based on the information above, extract a short but highly descriptive topic label of at most 5 words. Make sure it is in the following format:
topic: <topic label>
"""
openai_model = OpenAI(model="gpt-3.5-turbo", exponential_backoff=True, chat=True, prompt=prompt)

## Training

Now that we have a set of best practices, we can use them in our training loop. Here, several different representations, keywords and labels for our topics will be created. If you want to iterate over the topic model it is advised to use the pre-calculated embeddings as that significantly speeds up training.

In [9]:
# All representation models
representation_model = {
    "KeyBERT": keybert_model,
    "OpenAI": openai_model,  # Comment out if you will not use OpenAI
    "MMR": mmr_model,
    "POS": pos_model
}

topic_model = BERTopic(

  # Pipeline models
  embedding_model=embedding_model,
  umap_model=umap_model,
  hdbscan_model=hdbscan_model,
  vectorizer_model=vectorizer_model,
  representation_model=representation_model,

  # Hyperparameters
  top_n_words=10,
  verbose=True
)

topics, probs = topic_model.fit_transform(abstracts, embeddings)

2023-09-07 12:08:09,844 - BERTopic - Reduced dimensionality
2023-09-07 12:08:10,431 - BERTopic - Clustered reduced embeddings


In [44]:
df['Topic']=topics

In [10]:
topic_model.get_topic_info()

| | Topic | Count | Name | Representation | KeyBERT | OpenAI | MMR | POS | Representative_Docs |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -1 | 46 | -1_graduate_education_stem_nsf graduate | [graduate, education, stem, nsf graduate, fell... | [nsf graduate, foundation nsf, science foundat... | [NSF GRFP STEM Fellowship Program] | [graduate, education, stem, nsf graduate, fell... | [graduate, education, graduate education, fell... | [The National Science Foundation (NSF) Graduat... |
| 1 | 0 | 12492 | 0_br_gt_lt_lt br | [br, gt, lt, lt br, br gt, project, research, ... | [researchers, scientific, research, science, p... | [Fish symbiotic microorganisms] | [br, gt, lt, lt br, br gt, project, research, ... | [project, research, students, data, support, b... | [A request is made to fund additional and back... |
| 2 | 1 | 591 | 1_gt_lt_br_br gt | [gt, lt, br, br gt, lt br, physics, gt lt, res... | [dark matter, cosmic, universe, neutrino, matt... | [Physics research using award] | [gt, lt, br, br gt, lt br, physics, gt lt, res... | [physics, research, award, stars, project, new... | [This award funds the research activities of P... |
| 3 | 2 | 171 | 2_frontera_meeting_pi meeting_abstract | [frontera, meeting, pi meeting, abstract, comp... | [leadership computing, computing center, compu... | [Advanced Computing Leadership Coordination] | [frontera, meeting, pi meeting, abstract, comp... | [meeting, abstract, computing, leadership, coo... | [For nearly four decades, the National Science... |


In [11]:
topic_model.get_topic(1, full=True)

{'Main': [('gt', 0.036716578334750165),
  ('lt', 0.0366536884363135),
  ('br', 0.036592803833351194),
  ('br gt', 0.0364874459045093),
  ('lt br', 0.0364874459045093),
  ('physics', 0.023310421592333333),
  ('gt lt', 0.022003266728724224),
  ('research', 0.01898919194094649),
  ('using', 0.017865086264252143),
  ('award', 0.01745246353245312)],
 'KeyBERT': [('dark matter', 0.6664233),
  ('cosmic', 0.45344865),
  ('universe', 0.4513024),
  ('neutrino', 0.44085652),
  ('matter', 0.4303727),
  ('astronomy', 0.42845023),
  ('particles', 0.40968922),
  ('astrophysics', 0.40772104),
  ('physics', 0.3836239),
  ('galaxies', 0.38101494)],
 'OpenAI': [('Physics research using award', 1)],
 'MMR': [('gt', 0.036716578334750165),
  ('lt', 0.0366536884363135),
  ('br', 0.036592803833351194),
  ('br gt', 0.0364874459045093),
  ('lt br', 0.0364874459045093),
  ('physics', 0.023310421592333333),
  ('gt lt', 0.022003266728724224),
  ('research', 0.01898919194094649),
  ('using', 0.017865086264252143),


## (Custom) Labels
The default label of each topic is the topic number followed by its top 4 words, joined with underscores.

This, of course, might not be the best label that you can think of for a certain topic. Instead, we can use `.set_topic_labels` to manually label all or certain topics.

We can also use `.set_topic_labels` to use one of the other topic representations that we had before, like `KeyBERTInspired` or even `OpenAI`.

In [12]:
chatgpt_topic_labels = {topic: " | ".join(list(zip(*values))[0]) for topic, values in topic_model.topic_aspects_["OpenAI"].items()}

topic_model.set_topic_labels(chatgpt_topic_labels)
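
The dictionary comprehension above is compact, so here is a small sketch of what it does, run on a hand-written stand-in for `topic_model.topic_aspects_["OpenAI"]`. The stand-in follows the `(label, score)` shape seen in the `get_topic(1, full=True)` output; topic `5` with two labels is a made-up case. The sketch also rebuilds a default name, assuming it is formed from the topic number and the first four words, which is consistent with the `Name` column.

```python
# Top words of topic 1, copied from the `Main` representation shown earlier.
top_words = ["gt", "lt", "br", "br gt", "lt br", "physics"]

# Default name: topic number plus the first four words, joined by underscores.
default_name = "_".join([str(1)] + top_words[:4])
print(default_name)  # 1_gt_lt_br_br gt

# Hypothetical stand-in for `topic_model.topic_aspects_["OpenAI"]`.
openai_aspect = {
    -1: [("NSF GRFP STEM Fellowship Program", 1)],
    1: [("Physics research using award", 1)],
    5: [("first label", 1), ("second label", 1)],  # made-up multi-label case
}

# zip(*values) separates labels from scores; [0] keeps the labels only.
labels = {
    topic: " | ".join(list(zip(*values))[0])
    for topic, values in openai_aspect.items()
}
print(labels[1])  # Physics research using award
print(labels[5])  # first label | second label
```

With a single label per topic, as in this notebook, the join has no visible effect and the custom label is simply the ChatGPT label.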

In [34]:
topic_summary=topic_model.get_topic_info()
topic_summary

| | Topic | Count | Name | CustomName | Representation | KeyBERT | OpenAI | MMR | POS | Representative_Docs |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1 | 46 | -1_graduate_education_stem_nsf graduate | NSF GRFP STEM Fellowship Program | [graduate, education, stem, nsf graduate, fell... | [nsf graduate, foundation nsf, science foundat... | [NSF GRFP STEM Fellowship Program] | [graduate, education, stem, nsf graduate, fell... | [graduate, education, graduate education, fell... | [The National Science Foundation (NSF) Graduat... |
| 1 | 0 | 12492 | 0_br_gt_lt_lt br | Fish symbiotic microorganisms | [br, gt, lt, lt br, br gt, project, research, ... | [researchers, scientific, research, science, p... | [Fish symbiotic microorganisms] | [br, gt, lt, lt br, br gt, project, research, ... | [project, research, students, data, support, b... | [A request is made to fund additional and back... |
| 2 | 1 | 591 | 1_gt_lt_br_br gt | Physics research using award | [gt, lt, br, br gt, lt br, physics, gt lt, res... | [dark matter, cosmic, universe, neutrino, matt... | [Physics research using award] | [gt, lt, br, br gt, lt br, physics, gt lt, res... | [physics, research, award, stars, project, new... | [This award funds the research activities of P... |
| 3 | 2 | 171 | 2_frontera_meeting_pi meeting_abstract | Advanced Computing Leadership Coordination | [frontera, meeting, pi meeting, abstract, comp... | [leadership computing, computing center, compu... | [Advanced Computing Leadership Coordination] | [frontera, meeting, pi meeting, abstract, comp... | [meeting, abstract, computing, leadership, coo... | [For nearly four decades, the National Science... |


In [33]:
df['Topic'].value_counts()

Topic
 0    12492
 1      591
 2      171
-1       46
Name: count, dtype: int64

In [46]:
df=df.merge(topic_summary[['Topic','CustomName']],how='left',on='Topic')

In [48]:
df.CustomName.value_counts(normalize=True)

CustomName
Fish symbiotic microorganisms                 0.939248
Physics research using award                  0.044436
Advanced Computing Leadership Coordination    0.012857
NSF GRFP STEM Fellowship Program              0.003459
Name: proportion, dtype: float64

Almost all abstracts (about 94%) are assigned to a single topic, labeled **Fish symbiotic microorganisms**.
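
The tokens `br`, `gt` and `lt` at the top of the default representations suggest that the abstracts contain HTML-escaped line breaks such as `&lt;br/&gt;`. This is an assumption, since the raw text is not inspected here. If it holds, cleaning the text before embedding may give more meaningful keywords. The helper `clean_abstract` below is a hypothetical sketch of such a cleaning step, using only the standard library.

```python
import html
import re


def clean_abstract(text):
    """Unescape HTML entities, drop tags such as <br/>, and collapse whitespace."""
    unescaped = html.unescape(text)
    without_tags = re.sub(r"<[^>]+>", " ", unescaped)
    return re.sub(r"\s+", " ", without_tags).strip()


# Made-up sample in the style the tokens above hint at.
sample = "This award funds physics research.&lt;br/&gt;&lt;br/&gt;The project trains students."
cleaned = clean_abstract(sample)
print(cleaned)
# This award funds physics research. The project trains students.
```

In the notebook this could be applied with `abstracts = abstracts.map(clean_abstract)` before computing the embeddings. Note that the tag pattern is deliberately simple and would also remove legitimate text placed between a `<` and a later `>`, so it is worth checking a few cleaned abstracts by eye.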

## Topic-Document Distribution

If using `calculate_probabilities=True` is not possible, then you can **approximate the topic-document distributions** using `.approximate_distribution`. It is a fast and flexible method for creating different topic-document distributions.

In [14]:
# `topic_distr` contains the distribution of topics in each document
topic_distr, _ = topic_model.approximate_distribution(abstracts, window=8, stride=4)

100%|██████████| 14/14 [00:10<00:00,  1.35it/s]
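
The `window` and `stride` arguments control how each document is cut into overlapping groups of tokens before topic similarities are computed. The sketch below illustrates that sliding-window idea with the same values (`window=8`, `stride=4`). The function `sliding_token_sets` is a simplified illustration and not BERTopic's own code, which may tokenize differently and treat short trailing windows differently.

```python
def sliding_token_sets(tokens, window=8, stride=4):
    """Split a token list into overlapping windows (illustrative only)."""
    token_sets = []
    for start in range(0, len(tokens), stride):
        token_sets.append(tokens[start:start + window])
        # Stop once a window has reached the end of the document.
        if start + window >= len(tokens):
            break
    return token_sets


tokens = "viral contamination of drinking water can cause disease although such".split()
token_sets = sliding_token_sets(tokens, window=8, stride=4)
for token_set in token_sets:
    print(token_set)
```

With 10 tokens this produces two windows: the first 8 tokens, and a shorter trailing window starting at the fifth token. A smaller stride gives more overlap and a smoother distribution at the cost of more computation.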


In [15]:
abstract_id = 10
print(abstracts[abstract_id])

Viral contamination of drinking water can cause disease. Although such waterborne diseases pose significant public health threats, current filtration technologies suitable for virus removal have high cost and energy requirements.  This prevents their widespread use and has led to the need for less expensive and sustainable alternatives for disinfecting drinking water. The goal of this project is to use computational biology tools to discover plant-based  peptides that can trap viruses to create low-cost and energy-efficient drinking water filters. The potential for scale up will be assessed to understand the impacts of common water constituents on virus-protein interactions to improve sustainable and effective filter operation. Creation of a large database of plant peptides will be broadly informative to other scientific disciplines and easily accessible via internet resource to be developed as part of this project. Successful development of plant-based water biofilters will have a ran

## Visualize Topics

With visualizations, we are moving into the realm of subjective "best practices". These are things that I generally do because I like the representations, but your experience might differ.

Having said that, there are two visualizations that are my go-to when visualizing the topics themselves:

* `topic_model.visualize_topics()`
* `topic_model.visualize_hierarchy()`

In [16]:
# Visualize the topic-document distribution for a single document
topic_model.visualize_distribution(topic_distr[abstract_id], custom_labels=True)

In [51]:
topic_model.visualize_barchart(custom_labels=True)

In [52]:
topic_model.visualize_heatmap()

In [18]:
topic_model.visualize_hierarchy(custom_labels=True)

## Visualize Documents

When visualizing documents, it helps to have embedded the documents beforehand to speed up computation. Fortunately, we have already done that as a "best practice".

Visualizing documents in 2-dimensional space helps in understanding the underlying structure of the documents and topics.

In [19]:
# Reduce dimensionality of embeddings; this step is optional but makes repeated plotting much faster.
# random_state is fixed so that the 2D layout is reproducible between runs.
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine', random_state=42).fit_transform(embeddings)

In [20]:
# We can also hide the annotations to get a clearer overview of the topics
topic_model.visualize_documents(titles, reduced_embeddings=reduced_embeddings, custom_labels=True, hide_annotations=True)

## Serialization

In [21]:
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
topic_model.save("bertopic_model", serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)

In [22]:
# to load
# Define embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Load model and add embedding model
loaded_model = BERTopic.load("bertopic_model", embedding_model=embedding_model)