
In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from wordcloud import WordCloud
import matplotlib.pyplot as plt


In [None]:
from google.colab import files

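# Upload the file
uploaded = files.upload()

# Read the uploaded bytes directly. Colab may save a re-uploaded file under a new
# name (e.g. "name (3).csv"), so a hard-coded path can silently load an older copy.
import io

filename = next(iter(uploaded))
df = pd.read_csv(io.BytesIO(uploaded[filename]))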

# Display the first few rows of the dataframe
df.head()


Saving scopus_cleaned_no_link.csv to scopus_cleaned_no_link (3).csv


Unnamed: 0,authors,author_full_names,author(s)_id,title,year,source_title,cited_by,abstract,author_keywords
0,Smatov N.; Kalashnikov R.; Kartbayev A.,"Smatov, Nurmaganbet (59196078000); Kalashnikov...",59196078000; 58861145800; 56875410200,Development of Context-Based Sentiment Classif...,2024,Big Data and Cognitive Computing,0,This paper presents a novel approach to sentim...,deep learning; neural networks; sentiment anal...
1,Farhoudinia B.; Ozturkcan S.; Kasap N.,"Farhoudinia, Bahareh (58698192500); Ozturkcan,...",58698192500; 57193953273; 9436751700,Emotions unveiled: detecting COVID-19 fake new...,2024,Humanities and Social Sciences Communications,0,The COVID-19 pandemic has highlighted the pern...,No Keywords
2,Filippou I.; Taylor M.P.; Wang Z.,"Filippou, Ilias (57204036755); Taylor, Mark P....",57204036755; 7406239196; 58851492100,Media Sentiment and Currency Reversals,2024,Journal of Financial and Quantitative Analysis,0,Analyzing 48 foreign exchange (FX) rates and 1...,No Keywords
3,Supino E.; Tenucci A.; Di Nanna G.,"Supino, Enrico (56904312100); Tenucci, Andrea ...",56904312100; 55388280400; 59001274900,Sports failures and stock returns between rati...,2024,Research in International Business and Finance,0,The paper explores the relationships between s...,Abnormal return; Event Study Analysis; Financi...
4,Hou J.; Liu W.; Cao Y.; Wang S.; Tang O.,"Hou, Jiahe (57221687678); Liu, Weihua (5645676...",57221687678; 56456762900; 58249029800; 5764723...,Evaluating service quality of express logistic...,2024,Journal of Management Science and Engineering,0,The quality of logistics services affects the ...,Excellent rating; LDA model and service elemen...


In [None]:
# Inspect the column names
df.columns


Index(['authors', 'author_full_names', 'author(s)_id', 'title', 'year',
       'source_title', 'cited_by', 'abstract', 'author_keywords'],
      dtype='object')

In [None]:
# Load spaCy model
nlp = spacy.load('en_core_web_sm')

# Define stopwords
stopwords = list(STOP_WORDS)

# Function for text preprocessing and lemmatization
def preprocess_text(text):
    doc = nlp(text)
    tokens = [token.lemma_ for token in doc if token.is_alpha and token.lemma_ not in stopwords]
    return ' '.join(tokens)

# Apply preprocessing to the 'abstract' column
df['Processed_abstract'] = df['abstract'].apply(preprocess_text)

# Display the first few rows to see the new column
df[['abstract', 'Processed_abstract']].head()


Unnamed: 0,abstract,Processed_abstract
0,This paper presents a novel approach to sentim...,paper present novel approach sentiment analysi...
1,The COVID-19 pandemic has highlighted the pern...,pandemic highlight pernicious effect fake news...
2,Analyzing 48 foreign exchange (FX) rates and 1...,analyze foreign exchange FX rate million FX re...
3,The paper explores the relationships between s...,paper explore relationship sport performance f...
4,The quality of logistics services affects the ...,quality logistic service affect performance co...


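With the column names confirmed, the cell above preprocesses and lemmatizes the abstracts: it keeps alphabetic tokens only, drops stopwords, and stores the result in a new Processed_abstract column.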

In [None]:
# Inspect the column names
df.columns


Index(['authors', 'author_full_names', 'author(s)_id', 'title', 'year',
       'source_title', 'cited_by', 'abstract', 'author_keywords',
       'Processed_abstract'],
      dtype='object')

In [None]:
import re

# List of common publisher names to omit
publishers = ["China Science Publishing & Media Ltd", "Springer", "Elsevier", "Wiley", "Taylor & Francis"]

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

# Define stopwords
stopwords = list(STOP_WORDS)

# Function for text preprocessing, lemmatization, omitting publisher names, removing year at the end, and additional cleaning
def preprocess_text(text):
    # Remove publisher names
    for publisher in publishers:
        text = text.replace(publisher, "")

    # Remove special characters and extra spaces
    text = re.sub(r'[&@]', '', text)
    text = re.sub(r'\s+', ' ', text)

    # Remove year at the end of the abstract
    text = re.sub(r'\b(?:19|20)\d{2}\b$', '', text)

    # Remove the standalone word "ltd" in any case (IGNORECASE already covers "LTD")
    text = re.sub(r'\bltd\b', '', text, flags=re.IGNORECASE)

    # Lemmatize and remove stopwords
    doc = nlp(text)
    tokens = [token.lemma_ for token in doc if token.is_alpha and token.lemma_ not in stopwords]
    return ' '.join(tokens)

# Name of the column that holds the raw abstracts (lowercase 'abstract' in this dataset)
correct_column_name = 'abstract'

# Apply preprocessing to the abstract column
df['Processed_abstract'] = df[correct_column_name].apply(preprocess_text)

# Display the first few rows to see the new column
df[[correct_column_name, 'Processed_abstract']].head()


Unnamed: 0,abstract,Processed_abstract
0,This paper presents a novel approach to sentim...,paper present novel approach sentiment analysi...
1,The COVID-19 pandemic has highlighted the pern...,pandemic highlight pernicious effect fake news...
2,Analyzing 48 foreign exchange (FX) rates and 1...,analyze foreign exchange FX rate million FX re...
3,The paper explores the relationships between s...,paper explore relationship sport performance f...
4,The quality of logistics services affects the ...,quality logistic service affect performance co...
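
The regex cleaning in the cell above can be tried in isolation, without spaCy. The sketch below is a simplified variant, not the notebook's exact function: the publisher list and the sample string are made up for illustration, and whitespace is collapsed and stripped before the trailing-year check so that the year actually sits at the end of the string.

```python
import re

# Hypothetical publisher list and sample text, for illustration only
PUBLISHERS = ["Springer", "Elsevier"]

def clean_text(text):
    """Apply regex-based cleaning steps only (no lemmatization)."""
    for publisher in PUBLISHERS:
        text = text.replace(publisher, "")
    text = re.sub(r'[&@]', '', text)                           # drop stray symbols
    text = re.sub(r'\bltd\b', '', text, flags=re.IGNORECASE)   # drop "Ltd" in any case
    text = re.sub(r'\s+', ' ', text).strip()                   # collapse whitespace
    text = re.sub(r'\b(?:19|20)\d{2}\b$', '', text).strip()    # drop a trailing year
    return text

sample = "Investor sentiment & stock returns.  Elsevier Ltd 2024"
cleaned = clean_text(sample)
print(cleaned)
```

Because the later spaCy step keeps only tokens for which `token.is_alpha` is true, numeric tokens such as years are dropped there as well; the regex mainly matters if the cleaned text is used before lemmatization.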


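This step converts the processed abstracts into a document-term matrix. Its shape (documents × vocabulary terms) gives a quick check on corpus size; note that BERTopic in the next step is fitted on the processed text directly, not on this matrix.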

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer with English stopwords
vectorizer = CountVectorizer(stop_words='english')

# Fit and transform the 'Processed_abstract' column
X = vectorizer.fit_transform(df['Processed_abstract'])

# Display the shape of the resulting document-term matrix
print(X.shape)


(3022, 13758)


This code initializes the BERTopic model, fits it to the processed abstracts, and adds the resulting topics to the dataframe.

In [None]:
# Initialize and fit BERTopic
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(df['Processed_abstract'])

# Add topics to the dataframe
df['Topic'] = topics

# Display the first few rows to see the new 'Topic' column
df[['Processed_abstract', 'Topic']].head()


Unnamed: 0,Processed_abstract,Topic
0,paper present novel approach sentiment analysi...,13
1,pandemic highlight pernicious effect fake news...,10
2,analyze foreign exchange FX rate million FX re...,-1
3,paper explore relationship sport performance f...,28
4,quality logistic service affect performance co...,32


In the output above, each abstract now carries a topic number. Topic -1 is BERTopic's label for outliers, i.e. documents that were not assigned to any topic.
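
BERTopic uses the label -1 for documents it does not assign to any topic, so it is worth checking how large that group is before interpreting topic frequencies. The sketch below uses a made-up list of topic assignments standing in for df['Topic']; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical topic assignments, standing in for df['Topic']
topic_series = pd.Series([13, 10, -1, 28, 32, -1, 10, 13], name='Topic')

# Share of documents labelled as outliers (-1)
outlier_share = (topic_series == -1).mean()

# Topic sizes with the outlier group excluded
topic_counts = topic_series[topic_series != -1].value_counts()

print(f"Outlier share: {outlier_share:.0%}")
print(topic_counts)
```

On the real data, replacing `topic_series` with `df['Topic']` should give the corresponding figures.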

In [None]:
# Inspect the column names
df.columns


Index(['authors', 'author_full_names', 'author(s)_id', 'title', 'year',
       'source_title', 'cited_by', 'abstract', 'author_keywords',
       'Processed_abstract', 'Topic'],
      dtype='object')

In [None]:
# Replace with the correct column names
year_column = 'year'  # Replace with actual year column name
citation_column = 'cited_by'  # Replace with actual citation column name

# Identify and visualize citation distributions by year
citation_by_year = df.groupby(year_column)[citation_column].sum().reset_index()

# Plot the citation distribution by year using Plotly
fig = px.bar(citation_by_year, x=year_column, y=citation_column, title='Citation Distribution by Year')
fig.show()


Next, we identify and visualize the top 50 articles based on citation counts.

In [None]:
# Identify the top 50 articles based on citations
top_50_articles = df.nlargest(50, 'cited_by')  # Replace 'cited_by' if it's different

# Plot the top 50 articles by citation count
fig = px.bar(top_50_articles, x='title', y='cited_by', title='Top 50 Articles by Citation', orientation='v')
fig.update_layout(xaxis={'categoryorder':'total descending'}, xaxis_title='Article Title', yaxis_title='Citation Count')
fig.show()


This cell identifies and visualizes the 30 most frequent topics from the BERTopic model. It counts the documents per topic, labels each topic with its highest-weighted word, and plots the top 30 in a bar chart. Note that the outlier topic (-1) is included in these counts.

In [None]:
# Identify the top 30 frequent topics
top_30_topics = df['Topic'].value_counts().head(30).reset_index()
top_30_topics.columns = ['Topic', 'Frequency']

# Get topic labels from BERTopic
topic_labels = {topic: topic_model.get_topic(topic)[0][0] for topic in top_30_topics['Topic']}

# Replace topic numbers with labels
top_30_topics['Topic_Label'] = top_30_topics['Topic'].map(topic_labels)

# Plot the top 30 frequent topics with labels
fig = px.bar(top_30_topics, x='Topic_Label', y='Frequency', title='Top 30 Frequent Topics')
fig.update_layout(xaxis_title='Topic', yaxis_title='Frequency')
fig.show()
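
Labelling each topic by its single top word can give two topics the same label, in which case their bars share one x-axis position. A minimal sketch of a more distinctive label, built from the topic number plus its first few words, follows. The `topic_words` dictionary is a hypothetical stand-in for what `topic_model.get_topic(topic)` returns (a list of (word, score) pairs), and `make_label` is an illustrative helper, not part of BERTopic.

```python
import pandas as pd

# Hypothetical stand-in for topic_model.get_topic(topic): (word, score) pairs per topic
topic_words = {
    0: [("investor", 0.05), ("stock", 0.04), ("market", 0.03)],
    1: [("tourism", 0.06), ("tourist", 0.05), ("destination", 0.04)],
}

def make_label(topic_id, n_words=3):
    """Build a label such as '0: investor, stock, market'."""
    words = [word for word, _ in topic_words[topic_id][:n_words]]
    return f"{topic_id}: " + ", ".join(words)

# Hypothetical per-document topic assignments
counts = pd.Series([0, 0, 1, 0, 1]).value_counts().reset_index()
counts.columns = ['Topic', 'Frequency']
counts['Topic_Label'] = counts['Topic'].map(make_label)

print(counts)
```

Including the topic number keeps every label unique even when two topics share their leading words.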


Next, a few visualizations give a descriptive overview of the data and its attributes.

In [None]:
# Plot the distribution of articles by year
fig = px.histogram(df, x='year', title='Distribution of Articles by Year')
fig.update_layout(xaxis_title='Year', yaxis_title='Number of Articles')
fig.show()


In [None]:
# Plot the distribution of citations
fig = px.histogram(df, x='cited_by', title='Distribution of Citations')
fig.update_layout(xaxis_title='Number of Citations', yaxis_title='Number of Articles')
fig.show()


In [None]:
# Visualize the created topics
fig = topic_model.visualize_topics()
fig.show()


In [None]:
# Visualize the created topics with labels and a legend
fig = topic_model.visualize_topics()

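# Add a title and axis labels. D1 and D2 are the two dimensions of the reduced topic space;
# BERTopic reduces the topic representations with UMAP, so they are not principal components.
fig.update_layout(
    title='Intertopic Distance Map',
    xaxis_title='D1',
    yaxis_title='D2',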
    legend_title='Topic',
    legend=dict(
        title='Topic',
        title_font_size=16,
        title_font_family='Arial'
    )
)

fig.show()


In [None]:
# Get the topic words and sizes. A new name is used so that the per-document
# `topics` list returned by fit_transform is not overwritten.
topic_words = topic_model.get_topics()
topic_labels = {topic: " | ".join(word for word, _ in words) for topic, words in topic_words.items()}
topic_sizes = topic_model.get_topic_info()

# Create a DataFrame for topic labels and sizes
topics_df = pd.DataFrame(list(topic_labels.items()), columns=['Topic Number', 'Topic Words'])
topics_df['Size'] = topics_df['Topic Number'].map(topic_sizes.set_index('Topic')['Count'])

# Generate custom labels
topics_df['Custom Label'] = topics_df.apply(
    lambda row: f"Topic {row['Topic Number']} - {row['Topic Words']} | Size: {row['Size']}", axis=1
)

# Display the DataFrame with custom labels
topics_df[['Topic Number', 'Custom Label']].head(30)


Unnamed: 0,Topic Number,Custom Label
0,-1,Topic -1 - sentiment | analysis | use | study ...
1,0,Topic 0 - investor | stock | market | return |...
2,1,Topic 1 - tourism | tourist | destination | tr...
3,2,Topic 2 - domain | sentiment | classification ...
4,3,Topic 3 - hotel | review | customer | guest | ...
5,4,Topic 4 - aspect | sentence | model | word | a...
6,5,Topic 5 - public | government | policy | tax |...
7,6,Topic 6 - economic | indicator | confidence | ...
8,7,Topic 7 - disaster | vaccine | pandemic | twee...
9,8,Topic 8 - product | consumer | customer | onli...


The table above lists each topic with its top words and size. This snippet is handy to reuse.

Next, we examine how the topics relate to one another using BERTopic's hierarchical clustering of topics, which represents their similarities as a hierarchy. The code below generates a hierarchical clustering plot showing the relationships and similarities between topics.



In [None]:
# Visualize the similarity between clusters and hierarchical clustering
fig = topic_model.visualize_hierarchy()
fig.show()


Hierarchical Clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. It is often represented using a tree-like diagram called a dendrogram.

Key Concepts:

Agglomerative vs. Divisive:
- Agglomerative: Starts with each observation as its own cluster and iteratively merges the closest pairs of clusters. This is the most common approach.
- Divisive: Starts with all observations in one cluster and iteratively splits the clusters.

Distance Metrics:
- The choice of distance metric (e.g., Euclidean, Manhattan) affects how clusters are formed. The distance metric measures the dissimilarity between data points.

Linkage Criteria:
- Single Linkage: Distance between the closest points of the clusters.
- Complete Linkage: Distance between the farthest points of the clusters.
- Average Linkage: Average distance between all points of the clusters.
- Ward's Method: Minimizes the variance within each cluster.

Dendrogram:
- A tree-like diagram that records the sequences of merges or splits. The vertical axis represents the distance or dissimilarity between clusters.

Interpretation of Hierarchical Clustering:

Dendrogram Analysis:
- Height of Branches: The vertical height at which two clusters merge indicates the distance or dissimilarity between them. Taller branches imply greater dissimilarity.
- Cluster Formation: By cutting the dendrogram at a certain height, you can determine the number of clusters. The horizontal line at the cut point intersects the dendrogram, defining the clusters.

Cluster Similarity:
- Clusters that merge at lower heights are more similar to each other compared to those merging at higher heights.

Sub-clusters:
- Within each large cluster, you can observe sub-clusters at different levels of the dendrogram, indicating finer-grained groupings.

Hierarchical Clustering in Topic Modeling:
In the context of topic modeling using BERTopic, hierarchical clustering helps visualize the relationships and similarities between topics. Topics that are closely related (in terms of their word distributions) will be merged first, while more distinct topics merge later. Note that visualize_hierarchy() draws the dendrogram sideways by default, so the merge distance is read along the horizontal axis rather than the vertical one.
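
The concepts above (agglomerative merging, linkage, merge heights, cutting the tree) can be seen on a tiny example. This is a minimal sketch using SciPy on four made-up 2-D points standing in for topic vectors; it is not how BERTopic is called, only an illustration of the general technique.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 2-D "topic vectors": two tight pairs that lie far apart
points = np.array([
    [0.0, 0.0],
    [0.0, 1.0],
    [10.0, 0.0],
    [10.0, 1.0],
])

# Agglomerative clustering with Ward linkage on Euclidean distances.
# Each row of Z records one merge; column 2 holds the merge height.
Z = linkage(points, method='ward', metric='euclidean')
merge_heights = Z[:, 2]

# "Cut" the tree so that exactly two clusters remain
labels = fcluster(Z, t=2, criterion='maxclust')

print(merge_heights)
print(labels)
```

The two close pairs merge first at a low height, and the final merge joining the two distant groups happens at a much greater height, which is what a tall branch in the dendrogram conveys.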

Next is a visualization that shows each document (here, a processed abstract) as a dot, with similar documents positioned close to each other based on their embeddings. The code creates an interactive scatter plot in which documents that are close together are similar in meaning and topic.



In [None]:
# Visualize documents (processed abstracts) and their topics
fig = topic_model.visualize_documents(df['Processed_abstract'])
fig.show()


Explanation of Document-Topic Visualization
The document-topic visualization, created using topic_model.visualize_documents(), offers an interactive scatter plot that illustrates the relationships between documents based on their embeddings. Here is a brief explanation:

Documents as Dots:
- Each dot on the scatter plot represents a document (in this case, a processed abstract, since that is what was passed to the function).
- The position of each dot is determined by the document's embedding, reduced to a two-dimensional space.

Similarity in Meaning:
- Documents that are positioned close to each other in this plot are similar in meaning and topic.
- This proximity is based on the document embeddings used by BERTopic, which capture the semantic content of the documents.

Interactive Features:
- Hovering over a dot reveals the text of the document that was passed in (here, the processed abstract).
- The plot allows zooming and panning to explore different regions of the document space.

Applications:
- This visualization helps in identifying clusters of similar documents.
- It can be used to explore the distribution of topics across documents and to understand how documents relate to each other within the dataset.

Overall, this visualization is a useful tool for exploring the structure of the document collection and the distribution and relationships of topics within the data.
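
What "similar embeddings" means can be illustrated with cosine similarity, a common way to compare embedding vectors. The sketch below uses three made-up 3-dimensional vectors rather than real document embeddings, and the plot itself works on a 2-D reduction, so this only illustrates the underlying idea.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 = same direction, 0 = orthogonal)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings, for illustration only
doc_a = np.array([1.0, 0.0, 1.0])
doc_b = np.array([2.0, 0.0, 2.0])   # same direction as doc_a, different length
doc_c = np.array([0.0, 1.0, 0.0])   # orthogonal to doc_a

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)

print(sim_ab, sim_ac)
```

Documents whose embeddings point in nearly the same direction, like doc_a and doc_b here, would be expected to land near each other in the scatter plot.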

In [None]:
# Inspect the column names
df.columns


Index(['authors', 'author_full_names', 'author(s)_id', 'title', 'year',
       'source_title', 'cited_by', 'abstract', 'author_keywords',
       'Processed_abstract', 'Topic'],
      dtype='object')

In [None]:
# Verify the column names
df.columns

# Assign the correct column names
abstract_column = 'Processed_abstract'  # Replace with actual column name if different
year_column = 'year'  # Replace with actual column name if different
topic_column = 'Topic'  # Ensure 'Topic' is the column with topic numbers

# Prepare the 'topics_over_time' data
topics_over_time = topic_model.topics_over_time(
    docs=df[abstract_column].tolist(),
    topics=df[topic_column].tolist(),
    timestamps=df[year_column].tolist(),
)

# Visualize the topics over time
fig = topic_model.visualize_topics_over_time(topics_over_time)
fig.show()


The "Topics Over Time" view above is an analytical approach used in topic modeling to examine how the prevalence of specific topics evolves over a period. It helps in understanding the temporal dynamics of themes within a dataset.
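
The core quantity behind a topics-over-time view is the number of documents per topic in each time period. A minimal sketch with pandas on made-up data follows; the column names mirror the notebook's 'year' and 'Topic' columns, while the values are hypothetical.

```python
import pandas as pd

# Hypothetical documents with a publication year and an assigned topic
toy = pd.DataFrame({
    'year':  [2020, 2020, 2021, 2021, 2021, 2022],
    'Topic': [0, 1, 0, 0, 1, 1],
})

# Count documents per (year, topic) and pivot topics into columns;
# combinations that never occur are filled with 0
freq = toy.groupby(['year', 'Topic']).size().unstack(fill_value=0)

print(freq)
```

Each row of `freq` is one year and each column one topic, so reading down a column shows how that topic's prevalence changes over time.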



Next, let's visualize the trending or popular topics and those that are losing relevance over time.



In [None]:
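# Define a function to calculate the trend of each topic:
# frequency at the latest timestamp minus frequency at the earliest
def calculate_trends(topics_over_time):
    trends = {}
    for topic in topics_over_time['Topic'].unique():
        # Sort by time so that iloc[0] and iloc[-1] are the earliest and latest points
        topic_data = topics_over_time[topics_over_time['Topic'] == topic].sort_values('Timestamp')
        trend = topic_data['Frequency'].iloc[-1] - topic_data['Frequency'].iloc[0]
        trends[topic] = trend
    return trends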

# Calculate the trends
trends = calculate_trends(topics_over_time)

# Sort topics by their trends
trending_topics = sorted(trends.items(), key=lambda x: x[1], reverse=True)[:10]
losing_topics = sorted(trends.items(), key=lambda x: x[1])[:10]

# Prepare data for visualization
trending_df = pd.DataFrame(trending_topics, columns=['Topic', 'Trend'])
losing_df = pd.DataFrame(losing_topics, columns=['Topic', 'Trend'])

# Map topic numbers to labels
topic_labels = {topic: topic_model.get_topic(topic)[0][0] for topic in topic_model.get_topics().keys()}
trending_df['Topic_Label'] = trending_df['Topic'].map(topic_labels)
losing_df['Topic_Label'] = losing_df['Topic'].map(topic_labels)
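
The trend above compares only the first and last time points, so a single unusual year can dominate it. One possible alternative, sketched here on made-up data, is the slope of a least-squares line through all time points of a topic. The function name `slope_trends` is illustrative, and the 'Topic', 'Timestamp' and 'Frequency' columns are assumed to match those of the topics-over-time table.

```python
import numpy as np
import pandas as pd

# Hypothetical topics-over-time table, for illustration only
toy = pd.DataFrame({
    'Topic':     [0, 0, 0, 1, 1, 1],
    'Timestamp': [2020, 2021, 2022, 2020, 2021, 2022],
    'Frequency': [1, 3, 5, 6, 4, 2],
})

def slope_trends(frame):
    """Return {topic: least-squares slope of Frequency against Timestamp}."""
    slopes = {}
    for topic, group in frame.groupby('Topic'):
        group = group.sort_values('Timestamp')
        if len(group) < 2:
            continue  # a slope needs at least two time points
        x = group['Timestamp'].to_numpy(dtype=float)
        y = group['Frequency'].to_numpy(dtype=float)
        # Center x for numerical stability; this does not change the slope
        slopes[topic] = float(np.polyfit(x - x.mean(), y, 1)[0])
    return slopes

slopes = slope_trends(toy)
print(slopes)
```

A positive slope suggests a topic gaining frequency per time step and a negative slope one that is losing it; whether this is preferable to the first-versus-last difference depends on how noisy the yearly counts are.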


In [None]:
# Visualize trending topics with labels
fig = px.bar(trending_df, x='Topic_Label', y='Trend', title='Top 10 Trending Topics')
fig.update_layout(
    xaxis_title='Topic',
    yaxis_title='Trend (Change in Frequency)',
    legend_title='Trending Topics'
)
fig.show()


In [None]:
# Visualize losing topics with labels
fig = px.bar(losing_df, x='Topic_Label', y='Trend', title='Top 10 Losing Topics')
fig.update_layout(
    xaxis_title='Topic',
    yaxis_title='Trend (Change in Frequency)',
    legend_title='Losing Topics'
)
fig.show()
