To cluster the documents and annotate each cluster with its main topic, we can combine natural language
processing and machine learning techniques. Here is a general framework for the task:

1. Load the dataset into a pandas dataframe.
2. Clean the data by removing any null values or duplicates.
3. Preprocess the textual data (i.e., the abstracts) by removing stopwords, punctuation, and other noise. This can be done with libraries such as NLTK or spaCy.
4. Convert the preprocessed abstracts into a numerical representation using a technique such as TF-IDF or Doc2Vec.
5. Find the relevant documents (for example, for a search query).
6. To find the main topic of each document, use a topic modeling algorithm such as Latent Dirichlet Allocation (LDA) or Non-negative Matrix Factorization (NMF).
7. For clustering and visualization, use a dimensionality reduction technique such as Principal Component Analysis (PCA) or t-SNE, and a visualization library such as bokeh.
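
The cleaning and preprocessing steps are not shown as code in this notebook (the file loaded below is already named clean_dataset1.csv), so here is a minimal, dependency-free sketch of what they might look like. The names `STOPWORDS`, `preprocess`, and `clean_records` are hypothetical, and the tiny stopword set is only a stand-in for the NLTK stopword corpus that the notebook downloads.

```python
import re

# Assumption: a tiny illustrative stopword set. In practice the NLTK
# 'stopwords' corpus downloaded by the notebook would be used instead.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "is", "are", "this", "with"}


def preprocess(text):
    """Lowercase the text, keep alphabetic tokens only, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)


def clean_records(records):
    """Drop records with missing fields and records with duplicate abstracts."""
    seen = set()
    cleaned = []
    for title, abstract in records:
        if not title or not abstract:
            continue  # null value: skip the record
        key = preprocess(abstract)
        if key in seen:
            continue  # duplicate abstract: skip the record
        seen.add(key)
        cleaned.append((preprocess(title), key))
    return cleaned


# Made-up records, for illustration only.
records = [
    ("Simulation Modeling of Air Quality Control", "Simulation modeling is a major role in air quality."),
    ("Duplicate entry", "Simulation modeling is a major role in air quality."),
    ("Missing abstract", None),
]
cleaned = clean_records(records)
print(cleaned)
```

Duplicates are detected on the preprocessed abstract rather than the raw string, so records that differ only in case or punctuation also collapse into one. Whether that is desirable depends on the dataset.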

In [7]:
'''Packages for preprocessing'''
import nltk 
from nltk.tokenize import word_tokenize,sent_tokenize

'''Packages to load dataset'''
import pandas as pd

import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.decomposition import NMF
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

from bokeh.plotting import figure, show, output_file, ColumnDataSource
from bokeh.models import HoverTool


import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 


nltk.download('punkt')
nltk.download('stopwords')

from sklearn.metrics.pairwise import cosine_similarity



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sitas\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sitas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [8]:
''' Loading the dataset and subsetting it'''
data = pd.read_csv("clean_dataset1.csv",encoding='latin-1')
# # Prompt the user to enter one or more search queries separated by commas
# query_str = input('Enter one or more search queries separated by commas: ')
# queries = [q.strip() for q in query_str.split(',')]
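
The commented-out query prompt above, together with the `cosine_similarity` import, suggests a retrieval step ("find the relevant documents") that is never implemented in this notebook. The following is a sketch of one way it could work, written without dependencies so the arithmetic is visible. The helper names `tfidf_vectors`, `cosine`, and `rank` are hypothetical, and the three documents are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in tokenized]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, docs):
    """Return (score, document index) pairs, most similar document first."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms that never occur in the corpus carry no weight.
    q = {t: c * idf[t] for t, c in Counter(query.lower().split()).items() if t in idf}
    scores = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))


docs = [
    "fault tolerant system reliability",
    "web service qos composition",
    "test case coverage suite",
]
ranked = rank("web service", docs)
print(ranked)
```

With the notebook's own objects, the same idea would presumably be `cosine_similarity(vectorizer.transform(queries), word_vector)` once the vectorizer has been fitted, but that wiring is an assumption rather than something the notebook shows.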

In [9]:
"""In this block the text is vectorized with TF-IDF and topic modeling is then performed with NMF."""

vectorizer = TfidfVectorizer(stop_words='english')
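# Combine title and abstract; fillna('') stops a missing value in one column from turning the whole row into NaN
text = data['Document Title'].fillna('') + ' ' + data['Abstract'].fillna('')
word_vector = vectorizer.fit_transform(text)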

# Normalize the feature matrix (TfidfVectorizer already L2-normalizes each row by default, so this is only a safeguard)
word_vector = normalize(word_vector)


# Fit NMF model to the data
nmf_model = NMF(n_components=5, init='nndsvd')
nmf_model.fit(word_vector)

# Add topic column to dataframe
data['topic'] = nmf_model.transform(word_vector).argmax(axis=1)

# Get the top words for each topic
topic_words = []
n_words = 5
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(nmf_model.components_):
    top_words_idx = topic.argsort()[:-n_words - 1:-1]
    top_words = [feature_names[i] for i in top_words_idx]
    topic_words.append(', '.join(top_words))

data['topic_words'] = [topic_words[i] for i in data['topic']]
data

Unnamed: 0,Document Title,Year,Abstract,PDF Link,label,topic,topic_words
0,diagnostic maintenance technique using computer,1963,possible technique attending software need adv...,http://ieeexplore.ieee.org/stamp/stamp.jsp?arn...,0,3,"test, testing, case, coverage, suite"
1,high speed electro optical mechanical phototyp...,1968,adanced performance automated phototypesetting...,http://ieeexplore.ieee.org/stamp/stamp.jsp?arn...,0,1,"network, algorithm, performance, power, data"
2,recognition handprinted numeral two stage feat...,1970,optical character recognition system handprint...,http://ieeexplore.ieee.org/stamp/stamp.jsp?arn...,0,1,"network, algorithm, performance, power, data"
3,computer diagnosis using blocking gate approach,1971,previous paper author considered application g...,http://ieeexplore.ieee.org/stamp/stamp.jsp?arn...,0,2,"fault, reliability, failure, tolerant, detection"
4,simulation modeling air quality control,1971,simulation modeling major role air quality pro...,http://ieeexplore.ieee.org/stamp/stamp.jsp?arn...,0,0,"software, metric, model, quality, defect"
...,...,...,...,...,...,...,...
5995,research relationship curvature radius deflect...,2011,china current specification design asphalt pav...,http://ieeexplore.ieee.org/stamp/stamp.jsp?arn...,0,1,"network, algorithm, performance, power, data"
5996,sampling dmr practical low overhead permanent ...,2011,technology scaling manufacture time field perm...,http://ieeexplore.ieee.org/stamp/stamp.jsp?arn...,0,2,"fault, reliability, failure, tolerant, detection"
5997,qos aware multipath routing protocol delay sen...,2011,paper proposes qos multipath routing protocol ...,http://ieeexplore.ieee.org/stamp/stamp.jsp?arn...,0,1,"network, algorithm, performance, power, data"
5998,monitoring high performance data stream vertic...,2011,last several year monitoring high performance ...,http://ieeexplore.ieee.org/stamp/stamp.jsp?arn...,0,1,"network, algorithm, performance, power, data"
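
The two indexing idioms in the cell above are compact, so here is a small pure-Python sketch of what they compute. `top_terms` mirrors `topic.argsort()[:-n_words - 1:-1]` and `dominant_topic` mirrors `argmax(axis=1)` for a single row. Both function names and the toy weights are hypothetical and chosen only for illustration.

```python
def top_terms(weights, feature_names, n_words):
    """Return the n_words feature names with the largest weights, largest first."""
    # Pure-Python argsort: indices ordered by ascending weight.
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    # Same slice as in the notebook: walk backwards from the end and
    # stop after n_words items.
    top_idx = order[:-n_words - 1:-1]
    return [feature_names[i] for i in top_idx]


def dominant_topic(doc_topic_row):
    """Return the index of the largest topic weight (argmax for one document)."""
    return max(range(len(doc_topic_row)), key=lambda k: doc_topic_row[k])


# Toy values, not taken from the fitted model.
feature_names = ["case", "fault", "suite", "test", "web"]
topic_weights = [0.4, 0.0, 0.2, 0.9, 0.1]

words = top_terms(topic_weights, feature_names, 3)
topic = dominant_topic([0.1, 0.7, 0.2])
print(words, topic)
```

Assigning each document to its single highest-weight topic is a hard clustering. Documents whose weights are spread across several topics still end up in exactly one cluster.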


In [18]:
data.topic_words.unique()

array(['test, testing, case, coverage, suite',
       'network, algorithm, performance, power, data',
       'fault, reliability, failure, tolerant, detection',
       'software, metric, model, quality, defect',
       'service, web, qos, application, resource'], dtype=object)

In [19]:
# Group the dataframe by the 'topic' column and collect the unique 'topic_words' values per topic
grouped = data.groupby('topic')['topic_words'].unique()

# Print each topic index with its corresponding 'topic_words' value
for topic, words in grouped.items():
    print(f"Topic {topic}: {', '.join(words)}")

Topic 0: software, metric, model, quality, defect
Topic 1: network, algorithm, performance, power, data
Topic 2: fault, reliability, failure, tolerant, detection
Topic 3: test, testing, case, coverage, suite
Topic 4: service, web, qos, application, resource


In [20]:
# Perform dimensionality reduction using PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(word_vector.toarray())

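# Perform dimensionality reduction using t-SNE (the plot below uses these coordinates; X_pca is not plotted)
# Note: recent scikit-learn releases rename n_iter to max_iter
tsne = TSNE(n_components=2, perplexity=30, learning_rate=200, n_iter=1000, random_state=42)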
X_tsne = tsne.fit_transform(word_vector.toarray())


In [21]:
# Create Bokeh plot
source = ColumnDataSource(data=dict(
    x=X_tsne[:,0],
    y=X_tsne[:,1],
    color=data['topic'].map({0:'red', 1:'blue', 2:'green', 3:'purple', 4:'orange'}),
    topic=data['topic'],
    topics=data['topic_words'],
    title=data['Document Title'],
    url=data['PDF Link'],
    published=data['Year']
))
# Note: Bokeh 3.x uses width/height in place of plot_width/plot_height
p = figure(title='Topic Clustering of Documents', plot_width=800, plot_height=800, tools='hover,box_zoom,reset')
p.scatter(x='x', y='y', color='color', source=source, size=10, legend_group='topics')
p.legend.title = 'Topic'
p.hover.tooltips = [
    ('Title', '@title'),
    ('URL', '@url')
]
show(p)
