BERTopic Model

Dataset: https://www.kaggle.com/datasets/Cornell-University/arxiv

In [1]:
import pandas as pd
import numpy as np

In [2]:
import json

In [3]:
data_file = r'arxiv-metadata-oai-snapshot.json'


def get_metadata():
    """Yield the snapshot one line (one JSON record) at a time.

    Reading lazily avoids loading the whole file into memory at once.
    """
    with open(data_file, 'r') as f:
        for line in f:
            yield line
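A variant of the loader can parse each line as it is read, so the caller receives dictionaries instead of raw strings. The sketch below assumes the file is in JSON Lines form (one object per line); `iter_records` and the two sample records are hypothetical and are not part of the notebook.

```python
import io
import json
from typing import Iterator, TextIO


def iter_records(handle: TextIO) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of the stream."""
    for line in handle:
        line = line.strip()
        if line:  # skip blank lines instead of failing on them
            yield json.loads(line)


# Small in-memory stand-in for the snapshot file (hypothetical records).
sample = io.StringIO(
    '{"id": "a1", "title": "First"}\n'
    '\n'
    '{"id": "a2", "title": "Second"}\n'
)
records = list(iter_records(sample))
print([r["id"] for r in records])
```

With the real file, the same function could be called as `iter_records(open(data_file, 'r'))`; only one record is held in memory at a time.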

In [4]:
metadata = get_metadata()
ids = []
titles = []
abstracts = []
categories = []

for paper in metadata:
    metaDict = json.loads(paper)
    journal_ref = metaDict.get('journal-ref')
    if not journal_ref:
        continue    # no journal reference, so no year to parse
    try:
        year = int(journal_ref[-4:])    # Example format: "Phys.Rev.D76:013009,2007"
    except ValueError:
        try:
            year = int(journal_ref[-5:-1])    # Example format: "Phys.Rev.D76:013009,(2007)"
        except ValueError:
            continue    # year is not in either expected position
    if year in (2018, 2019, 2020):
        ids.append(metaDict['id'])
        titles.append(metaDict['title'])
        abstracts.append(metaDict['abstract'])
        categories.append(metaDict['categories'])
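The year lookup above depends on the year sitting in the last four or five characters of `journal-ref`. A regular expression is one alternative that tolerates other positions. This is a different technique from the slicing used in the notebook, and `extract_year` is a hypothetical helper; it assumes the publication year is the last standalone four-digit number beginning with 19 or 20.

```python
import re
from typing import Optional

# A four-digit year that is not embedded in a longer run of digits.
_YEAR = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")


def extract_year(journal_ref: Optional[str]) -> Optional[int]:
    """Return the last standalone four-digit year in the reference, or None."""
    if not journal_ref:
        return None
    matches = _YEAR.findall(journal_ref)
    return int(matches[-1]) if matches else None


print(extract_year("Phys.Rev.D76:013009,2007"))    # 2007
print(extract_year("Phys.Rev.D76:013009,(2007)"))  # 2007
print(extract_year(None))                          # None
```

Because the rule is looser than the slicing, it may accept references that the loop above skips, so the resulting row count can differ.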

In [5]:
df = pd.DataFrame({'id' : ids,'title' : titles,'abstract' : abstracts, 'categories' : categories})


print(len(df))

90271


In [6]:
df.head()

Unnamed: 0,id,title,abstract,categories
0,708.007,Bohmian Mechanics at Space-Time Singularities....,We develop an extension of Bohmian mechanics...,quant-ph
1,709.1457,What happens to geometric phase when spin-orbi...,Spin-orbit interaction lifts accidental band...,cond-mat.other
2,710.1849,Regularity of solutions of the isoperimetric p...,In this work we consider a question in the c...,math.DG math.AP math.MG
3,712.1975,Reentrant spin glass transition in LuFe2O4,We have carried out a comprehensive investig...,cond-mat.str-el cond-mat.mtrl-sci
4,804.3104,"Teichm\""uller Structures and Dual Geometric Gi...",The Gibbs measure theory for smooth potentia...,math.DS math.CV


In [7]:
cat_list= df['categories'].unique()
print(*cat_list, sep = "\n")

quant-ph
cond-mat.other
math.DG math.AP math.MG
cond-mat.str-el cond-mat.mtrl-sci
math.DS math.CV
physics.gen-ph
math.NT
nucl-th
physics.atom-ph
cond-mat.stat-mech
gr-qc
cs.MA cs.AI q-bio.NC
math-ph math.MP nlin.SI quant-ph
physics.flu-dyn math.NA physics.comp-ph
physics.data-an physics.hist-ph physics.pop-ph
astro-ph.CO astro-ph.HE
q-bio.NC q-bio.QM
math-ph math.MP
hep-th astro-ph.CO gr-qc
physics.comp-ph
math.DG
quant-ph hep-th math-ph math.MP
astro-ph.IM astro-ph.EP
cond-mat.str-el cond-mat.mes-hall
cond-mat.stat-mech physics.atom-ph quant-ph
math.NT math.AG
math.AG math.KT
cond-mat.dis-nn cs.DM math.CO
math.DS
astro-ph.IM astro-ph.CO
astro-ph.IM astro-ph.CO cs.IT math.IT
cond-mat.supr-con
math.NT math.GM
cond-mat.stat-mech cond-mat.dis-nn quant-ph
hep-ph hep-lat hep-th
physics.gen-ph gr-qc
cond-mat.mes-hall
math.PR q-bio.QM stat.AP stat.ML
astro-ph.HE astro-ph.SR
stat.AP math.ST stat.TH
math.MG
physics.plasm-ph
math.PR
eess.SY cs.SY
physics.soc-ph cond-mat.dis-nn cs.SI
cond-mat.sof

In [8]:
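# Caution: "cs." is treated as a regular expression, so "." matches any character
# and categories such as "physics.gen-ph" are selected as well.
ml_df = df[df['categories'].str.contains("cs.")]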

sentencesList= ml_df['abstract'].tolist()
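The pattern `"cs."` is evaluated as a regular expression, which is why physics abstracts appear in the filtered set. The sketch below compares it with a stricter pattern that only accepts `cs.` at the start of a category token. It uses the standard `re` module on a few category strings taken from the listing above; `CS_PATTERN` is a hypothetical name.

```python
import re

# Accept "cs." only at the start of the string or right after whitespace.
CS_PATTERN = re.compile(r"(?:^|\s)cs\.")

categories = [
    "quant-ph",
    "physics.gen-ph",
    "cs.MA cs.AI q-bio.NC",
    "eess.SY cs.SY",
    "physics.soc-ph cond-mat.dis-nn cs.SI",
]

# Loose match: "." is a wildcard, so the end of "physics." also matches.
loose = [c for c in categories if re.search("cs.", c)]
# Strict match: only genuine computer-science categories.
strict = [c for c in categories if CS_PATTERN.search(c)]

print(len(loose), len(strict))
```

The same pattern string can be passed to `df['categories'].str.contains(...)`, since that method interprets its argument as a regular expression by default. Doing so would change the row count and every result that follows.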

In [9]:
print(len(ml_df))

27271


In [10]:
print(sentencesList[0])

  In this paper we leave the neighborhood of the singularity at the origin and
turn to the singularity at the horizon. Using nonlinear superdistributional
geometry and supergeneralized functions it seems possible to show that the
horizon singularity is not only a coordinate singularity without leaving
Schwarzschild coordinates. However the Tolman formula for the total energy $E$
of a static and asymptotically flat spacetime,gives $E=mc^2$, as it should be.
New class Colombeau solutions to Einstein field equations is obtained.New class
Colombeau solutions to Einstein field equations is obtained. The vacuum energy
density of free scalar quantum field ${\Phi}$ with a distributional background
spacetime also is considered.It has been widely believed that, except in very
extreme situations, the influence of acceleration on quantum fields should
amount to just small, sub-dominant contributions. Here we argue that this
belief is wrong by showing that in a Rindler distributional background
spa

In [11]:
sampleSentencesList = sentencesList[1:2500]    # indices 1 to 2499; the abstract printed above (index 0) is left out

In [12]:
print(len(sampleSentencesList))

2499


In [13]:
%pip install bertopic -q
from bertopic import BERTopic

Note: you may need to restart the kernel to use updated packages.


In [14]:
# calculate_probabilities=True also returns a topic probability vector for each document.
topic_model = BERTopic(calculate_probabilities=True)
topics, probabilities = topic_model.fit_transform(sampleSentencesList)

In [15]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,708,-1_the_of_and_to,"[the, of, and, to, in, is, we, for, that, with]","[ In empirical studies of random walks, conti..."
1,0,180,0_quantum_channel_the_of,"[quantum, channel, the, of, codes, we, for, is...",[ Recall the classical hypothesis testing set...
2,1,139,1_social_of_and_to,"[social, of, and, to, the, in, that, we, netwo...",[ Evolution and propagation of the world's la...
3,2,112,2_networks_network_nodes_of,"[networks, network, nodes, of, the, to, we, an...","[ In physics, biology and engineering, networ..."
4,3,84,3_cell_of_the_cells,"[cell, of, the, cells, and, in, we, model, tha...",[ Bacterial colonies are abundant on living a...
5,4,70,4_laser_electron_beam_the,"[laser, electron, beam, the, pulse, of, in, wi...",[ Laser wakefield accelerators (LWFA) hold gr...
6,5,66,5_quantum_photon_photons_optical,"[quantum, photon, photons, optical, cavity, th...",[ Controlling and swapping quantum informatio...
7,6,62,6_images_deep_image_features,"[images, deep, image, features, convolutional,...","[ During the recent years, correlation filter..."
8,7,61,7_graph_graphs_vertices_problem,"[graph, graphs, vertices, problem, we, of, num...",[ We consider the problem of finding a 1-plan...
9,8,45,8_turbulence_turbulent_the_flows,"[turbulence, turbulent, the, flows, of, in, ve...",[ The clustering of small heavy inertial part...
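Most topic names in this table are dominated by stop words such as "the", "of", and "and". One common remedy is to pass BERTopic a `CountVectorizer` that drops English stop words through the `vectorizer_model` argument. The sketch below assumes scikit-learn is installed; `fit_without_stop_words` is a hypothetical helper, and it imports BERTopic only when called because that dependency is heavy.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Drop common English stop words before topic keywords are scored.
vectorizer_model = CountVectorizer(stop_words="english")

# The analyzer shows which tokens survive the filter.
analyze = vectorizer_model.build_analyzer()
tokens = analyze("the quantum channel of the codes")
print(tokens)


def fit_without_stop_words(docs):
    """Fit BERTopic with the stop-word-aware vectorizer (hypothetical helper)."""
    from bertopic import BERTopic  # deferred import: heavy dependency

    model = BERTopic(vectorizer_model=vectorizer_model,
                     calculate_probabilities=True)
    topics, probabilities = model.fit_transform(docs)
    return model, topics, probabilities
```

The vectorizer only affects how topic keywords are extracted; the embeddings and clustering still see the full text. For a model that is already fitted, `topic_model.update_topics(docs, vectorizer_model=vectorizer_model)` should recompute the keywords without clustering again.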


In [16]:
topic_model.get_topic(0)

[('quantum', 0.03400472451141179),
 ('channel', 0.01656799993348172),
 ('the', 0.015832606015046772),
 ('of', 0.015272213814901096),
 ('codes', 0.015112485631827962),
 ('we', 0.014717490027879776),
 ('for', 0.013948528954726432),
 ('is', 0.013801849439146483),
 ('that', 0.013493277863104655),
 ('in', 0.013296081160536918)]

In [17]:
topic_model.get_topic(38)

[('contact', 0.0519243171239238),
 ('drop', 0.03987483067226221),
 ('liquid', 0.034088206139550443),
 ('slip', 0.03389360607142288),
 ('wetting', 0.023659076313110483),
 ('line', 0.02241701954968684),
 ('impact', 0.020969609513079836),
 ('the', 0.020947155561709233),
 ('pressure', 0.01974334943430179),
 ('droplet', 0.019562006551077704)]

In [18]:
topic_model.visualize_barchart(top_n_topics=10)

In [19]:
topic_model.visualize_topics()

In [20]:
topic_model.visualize_heatmap()

In [21]:
topic_model.visualize_hierarchy()

In [22]:
topic_model.visualize_distribution(probabilities[0])

In [23]:
topic_model.visualize_distribution(probabilities[1])

In [24]:
topic_model.visualize_distribution(probabilities[10])

In [25]:
topic_model.visualize_distribution(probabilities[1001])
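The distribution plots can be cross-checked numerically by ranking the entries of a single probability row. The sketch assumes that each row of `probabilities` holds one value per non-outlier topic, indexed by topic id; `top_topics` and the sample row are hypothetical.

```python
from typing import List, Sequence, Tuple


def top_topics(prob_row: Sequence[float], k: int = 3) -> List[Tuple[int, float]]:
    """Return the k (topic_id, probability) pairs with the highest probability."""
    ranked = sorted(enumerate(prob_row), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical probability row for one document over four topics.
row = [0.10, 0.60, 0.05, 0.25]
print(top_topics(row, k=2))  # [(1, 0.6), (3, 0.25)]
```

For the notebook's data the call would be, for example, `top_topics(probabilities[1001])`.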

In [26]:
%pip install threadpoolctl==3.1.0

Note: you may need to restart the kernel to use updated packages.


In [27]:

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Replace BERTopic's default UMAP + HDBSCAN steps with PCA + k-means.
dim_model = PCA(n_components=5)
cluster_model = KMeans(n_clusters=50)

topic_model = BERTopic(umap_model=dim_model, embedding_model="allenai-specter",
                       hdbscan_model=cluster_model, calculate_probabilities=True)
topics, probabilities = topic_model.fit_transform(sampleSentencesList)

In [28]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,94,0_networks_network_of_the,"[networks, network, of, the, to, we, model, in...","[ In the social, behavioral, and economic sci..."
1,1,89,1_of_data_we_to,"[of, data, we, to, and, in, the, on, that, is]",[ Recent advances in data collection have fac...
2,2,80,2_we_of_graph_agents,"[we, of, graph, agents, to, social, the, graph...",[ Graphs form a natural model for relationshi...
3,3,79,3_algorithm_we_problem_of,"[algorithm, we, problem, of, the, to, in, for,...",[ Covering problems are classical computation...
4,4,72,4_and_the_of_in,"[and, the, of, in, data, for, is, to, this, on]",[ Monte Carlo methods are essential tools for...
5,5,71,5_data_to_learning_and,"[data, to, learning, and, in, is, the, we, on,...",[ Building classification models is an intrin...
6,6,70,6_optical_photonic_quantum_of,"[optical, photonic, quantum, of, and, graphene...",[ Quantum light sources are characterized by ...
7,7,69,7_of_the_we_to,"[of, the, we, to, in, and, that, is, an, for]",[ The problem of learning structural equation...
8,8,69,8_image_deep_convolutional_images,"[image, deep, convolutional, images, learning,...",[ The science of solving clinical problems by...
9,9,67,9_networks_network_of_to,"[networks, network, of, to, and, the, that, in...",[ Multilayer networks allow one to represent ...
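This table has no `-1` row, whereas the default run assigned 708 of the 2,499 abstracts (about 28%) to the outlier topic. That is expected: k-means places every document in some cluster. The share of outliers can be computed directly from the `topics` list returned by `fit_transform`. In the sketch below, `outlier_share` and the two assignment lists are hypothetical.

```python
from collections import Counter
from typing import Sequence


def outlier_share(topics: Sequence[int]) -> float:
    """Fraction of documents assigned to the outlier topic (-1)."""
    if not topics:
        return 0.0
    return Counter(topics).get(-1, 0) / len(topics)


# Hypothetical assignments: a density-based run with outliers, then a k-means run.
hdbscan_like = [-1, 0, 0, 1, -1, 2, 0, 1]
kmeans_like = [0, 1, 2, 1]

print(outlier_share(hdbscan_like))  # 0.25
print(outlier_share(kmeans_like))   # 0.0
```

Forcing every abstract into a cluster removes the outlier bucket, but it can also make topics less focused, which may explain the generic keyword lists in several rows above.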


In [29]:
topic_model.visualize_barchart(top_n_topics=10)

In [30]:
topic_model.visualize_topics()

In [31]:
topic_model.visualize_heatmap()

In [32]:
topic_model.visualize_hierarchy()