<a href="https://colab.research.google.com/github/sherryyuon/ReCoDE_Analysing-Lit-using-BERT-RoBERTa/blob/main/ReCoDE_Analysis_of_environmental_literature_with_BERTopic_and_RoBERTa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Explosive Growth of Literature in Environmental and Sustainability Studies**

The field of **environmental and sustainability** studies has witnessed an explosive growth in literature over the past few decades, driven by the increasing global awareness and urgency surrounding environmental issues, climate change, and the need for sustainable practices.

This rapidly expanding body of literature is characterized by its **interdisciplinary nature**, encompassing a wide range of disciplines such as ecology, climate science, energy, economics, policy, sociology, and more. With a global focus and contributions from countries around the world, the literature base reflects **diverse cultural, socio-economic, and geographical contexts**, often in multiple languages. **Novel research areas and emerging topics**, such as circular economy, sustainable urban planning, environmental justice, biodiversity conservation, renewable energy technologies, and ecosystem services, continue to arise as environmental challenges evolve and our understanding deepens. The **development of environmental policies**, regulations, and international agreements, as well as increased public interest and awareness, have further fueled research and the demand for literature aimed at informing and engaging various stakeholders. **Technological advancements** in areas like remote sensing, environmental monitoring, and computational modeling have enabled new avenues of research and data-driven studies, contributing to the proliferation of literature. **The rise of open access publishing and digital platforms** has facilitated the dissemination and accessibility of this constantly evolving and interdisciplinary body of knowledge.

In summary, the rapid growth of literature across disciplines, geographic regions, languages, and emerging topics makes it hard to organize, synthesize, and extract insights from this body of knowledge. This is where **Natural Language Processing (NLP)** techniques such as **topic modeling** with BERTopic, together with pretrained language models such as RoBERTa, can help. They can process large volumes of text, identify semantic topics and patterns, cluster related documents, and (with multilingual models) handle several languages, which helps researchers, policymakers, and other stakeholders navigate the literature more effectively.


**For a STEMM PhD student at Imperial stepping into a new field such as sustainability, these NLP tools can make literature exploration and review considerably faster and make the move into interdisciplinary research smoother.**

# **The Potential of Topic Modeling**

Topic modeling is a technique in NLP and machine learning used to discover abstract "topics" that occur in a collection of documents. The key idea is that documents are made up of mixtures of topics, and that each topic is a probability distribution over words.

More specifically, topic modeling algorithms like Latent Dirichlet Allocation (LDA) work by:

1. Taking a set of text documents as input.
2. Learning the topics contained in those documents in an unsupervised way. Each topic is represented as a distribution over the words that describe that topic.
3. Assigning each document a mixture of topics with different weights/proportions.

For example, if you ran topic modeling on a set of news articles, it may discover topics like "politics", "sports", "technology", etc. The "politics" topic would be made up of words like "government", "election", "policy" with high probabilities. Each document would then be characterized as a mixture of different proportions of these topics.
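The two ideas above (a topic is a distribution over words, a document is a mixture of topics) can be made concrete with a small hand-written sketch. The topic names, word probabilities, and mixture weights below are invented for illustration; a real algorithm such as LDA would learn them from the corpus rather than take them as input.

```python
# Each topic is a probability distribution over words (values sum to 1).
# All numbers here are made up for illustration, not learned from data.
topics = {
    "politics": {"government": 0.4, "election": 0.3, "policy": 0.2, "win": 0.1},
    "sports": {"match": 0.5, "team": 0.3, "win": 0.2},
}

# Each document is a mixture of topics (weights sum to 1).
doc_mixture = {"politics": 0.7, "sports": 0.3}


def word_probability(word, doc_mixture, topics):
    """P(word | document) = sum over topics of P(topic | document) * P(word | topic)."""
    return sum(
        weight * topics[name].get(word, 0.0)
        for name, weight in doc_mixture.items()
    )


vocabulary = sorted({word for dist in topics.values() for word in dist})
doc_word_probs = {w: word_probability(w, doc_mixture, topics) for w in vocabulary}

# Print the words this document is most likely to contain, highest first.
for word, prob in sorted(doc_word_probs.items(), key=lambda kv: -kv[1]):
    print(f"{word:<12}{prob:.2f}")
```

A word shared by both topics ("win" here) receives probability from each of them, weighted by how much of the document each topic accounts for.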

The key benefits of topic modeling include:

1. Automatically discovering topics without need for labeled data
2. Understanding the themes/concepts contained in large document collections
3. Organizing, searching, and navigating over a document corpus by topics
4. Providing low-dimensional representations of documents based on their topics

Topic modeling has found applications in areas like **information retrieval, exploratory data analysis, document clustering and classification, recommendation systems**, and more. Popular implementations include Latent Dirichlet Allocation (LDA), Biterm Topic Model (BTM), and techniques leveraging neural embeddings like BERTopic.

# **BERTopic and RoBERTa**

**BERTopic** and **RoBERTa** both build on transformer-based NLP. Together they support unsupervised topic modeling, semantic understanding, and scalable analysis of large text corpora, which helps researchers navigate the fast-growing, multilingual environmental and sustainability literature. More specifically:

* Cross-lingual Topic Modeling: BERTopic can use multilingual language models to perform topic modeling on text from different languages simultaneously. Note that the original RoBERTa was pretrained on English text; its multilingual counterpart is XLM-RoBERTa. This can reveal common themes, research areas, and concepts across diverse linguistic and cultural backgrounds in the environmental domain.
* Identifying Language-specific Nuances: By analyzing literature in multiple languages separately or comparatively, BERTopic paired with a suitable language model can help uncover language-specific nuances, terminologies, and perspectives on environmental and sustainability issues.
* Facilitating Knowledge Transfer: Effective topic modeling and clustering of multilingual literature can bridge language barriers and facilitate knowledge transfer across different regions, enabling researchers and policymakers to access and understand diverse perspectives on environmental challenges.
* Monitoring Global Trends: With the increasing globalization of environmental research and policymaking, analyzing literature in multiple languages can provide a more comprehensive understanding of global trends, priorities, and emerging areas of interest in sustainability studies.

**BERTopic:**

BERTopic is a topic modeling technique that leverages transformer language models to perform unsupervised topic extraction and document clustering. Its key features include:

*   Unsupervised Learning: BERTopic does not require labeled training data, making it useful for exploring and understanding large unlabeled text corpora.
*   Semantic Topic Modeling: It uses contextualized word embeddings from BERT-like models to capture the semantic meaning of words, resulting in more coherent and meaningful topics.
*   Document Clustering: BERTopic clusters the document embeddings and derives topics from the resulting clusters, so each document is assigned to a topic, enabling exploration of document collections.
*   Dynamic Topic Modeling: It can track how topics change over time, and an online variant can update the topic model incrementally as new data becomes available, making it suitable for streaming or evolving text sources.

BERTopic has applications in various domains, such as text summarization, information retrieval, and content organization.
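To give a feel for how clusters of documents become readable topics, the sketch below follows the class-based TF-IDF (c-TF-IDF) weighting described in BERTopic's documentation: the documents in a cluster are treated as one long document, and each word is scored by its frequency in the cluster times `log(1 + A / f)`, where `A` is the average number of words per cluster and `f` is the word's frequency across all clusters. This is a simplified pure-Python illustration, not BERTopic's implementation; the toy clusters, the whitespace tokenization, and the helper names `c_tf_idf` and `top_words` are assumptions made for this example.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """Score words per cluster. `clusters` maps a cluster id to a list of documents."""
    # Treat every cluster as one long document and count its words.
    counts = {
        cid: Counter(" ".join(docs).lower().split())
        for cid, docs in clusters.items()
    }
    # Frequency of each word across all clusters.
    total = Counter()
    for cluster_counts in counts.values():
        total.update(cluster_counts)
    avg_words = sum(total.values()) / len(counts)  # A: average words per cluster

    scores = {}
    for cid, cluster_counts in counts.items():
        n_words = sum(cluster_counts.values())
        scores[cid] = {
            word: (freq / n_words) * math.log(1 + avg_words / total[word])
            for word, freq in cluster_counts.items()
        }
    return scores


def top_words(scores, k=3):
    """Return the k highest-scoring words per cluster (ties broken alphabetically)."""
    return {
        cid: [w for w, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for cid, s in scores.items()
    }


# Toy clusters standing in for the output of the embedding + clustering steps.
clusters = {
    0: ["solar power and wind power", "wind turbines and solar panels"],
    1: ["forest loss and species loss", "species decline and forest fires"],
}
scores = c_tf_idf(clusters)
top = top_words(scores)
print(top)
```

Words that are frequent in every cluster (such as "and") are down-weighted, so the top words of each cluster are the ones that set it apart from the others.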

**RoBERTa:**

RoBERTa (Robustly Optimized BERT Pretraining Approach) is a variant of the BERT language model, introduced by researchers at Facebook AI Research. Like BERT, it is pretrained on a large text corpus with the Masked Language Modeling objective, but it changes the pretraining recipe in several ways:


*   Larger Training Data: RoBERTa was trained on roughly ten times more text than BERT (about 160 GB versus 16 GB), leading to better performance on various NLP tasks.
*   Longer Training: RoBERTa was trained for more steps and with larger batches, which further improves downstream performance.
*   Dynamic Masking: Instead of static masking used in BERT, RoBERTa employs dynamic masking, which randomly masks different tokens during each training iteration.
*   Removal of Next Sentence Prediction: RoBERTa dropped the Next Sentence Prediction objective, focusing solely on the Masked Language Modeling task.
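
The difference between static and dynamic masking can be illustrated with a toy sketch. It is deliberately simplified and is not the actual pretraining code: it always substitutes `[MASK]`, whereas real implementations also sometimes keep the token or replace it with a random one. The example sentence and the helper name `mask_tokens` are assumptions for illustration.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, rng, mask_prob=0.15):
    """Replace a random subset of tokens (at least one) with the MASK symbol."""
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]


tokens = "renewable energy policy shapes long term emission targets".split()
rng = random.Random(0)  # fixed seed so the run is reproducible

# Static masking: the mask is chosen once during preprocessing
# and the same masked sequence is reused in every epoch.
static_example = mask_tokens(tokens, rng)
static_epochs = [static_example for _ in range(3)]

# Dynamic masking: a new mask is drawn every time the sequence
# is fed to the model, so the masked positions can change per epoch.
dynamic_epochs = [mask_tokens(tokens, rng) for _ in range(3)]

for epoch, (static, dynamic) in enumerate(zip(static_epochs, dynamic_epochs), start=1):
    print(f"epoch {epoch} static : {' '.join(static)}")
    print(f"epoch {epoch} dynamic: {' '.join(dynamic)}")
```

With dynamic masking the model is asked to predict different tokens of the same sentence over the course of training, rather than the same ones repeatedly.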

RoBERTa achieved state-of-the-art results on benchmarks such as GLUE, RACE, and SQuAD when it was released, and it has been widely adopted as a base model for fine-tuning on specific tasks, such as text classification, question answering, and named entity recognition.

By combining BERTopic's unsupervised topic extraction with RoBERTa-style language models, researchers can surface salient themes, trends, and interdisciplinary connections in the rapidly growing, multilingual corpus of environmental and sustainability literature. This combination can help transfer knowledge across domains, identify research gaps, and support data-driven decisions on pressing global environmental challenges.

# **A Step-by-Step Case Study using BERTopic to Analyze XXX Dataset**

As an example: XXXXXXXXXXXXXXXXX

In [None]:
pip install bertopic

Collecting bertopic
  Downloading bertopic-0.16.0-py2.py3-none-any.whl (154 kB)
Collecting hdbscan>=0.8.29 (from bertopic)
  Downloading hdbscan-0.8.33.tar.gz (5.2 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting umap-learn>=0.5.0 (from bertopic)
  Downloading umap-learn-0.5.5.tar.gz (90 kB)

In [None]:
pip install -U "tensorflow-text==2.15.*"

In [None]:
pip install "tf-models-official==2.15.*"

In [None]:
!pip install umap-learn hdbscan

In [None]:
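import os
import shutil

import tensorflow as tf  # required for tf.get_logger() below
import tensorflow_hub as hub
import tensorflow_text as text  # registers the text ops used by BERT preprocessing models
from official.nlp import optimization  # to create AdamW optimizer

import matplotlib.pyplot as plt

# Show only TensorFlow errors, hiding info and warning messages
tf.get_logger().setLevel('ERROR')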

In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv("california_housing_test.csv", encoding='utf-8')

# Preview the data
print(df.head())

# **Frequently Asked Questions**

**1. What is**


**2. When and why would we choose a specific BERT model?**


3.

# **Suggested Readings**

1. Natural Language Processing: A Textbook with Python Implementation (by Raymond S. T. Lee): https://www.amazon.co.uk/Natural-Language-Processing-Textbook-Implementation-ebook/dp/B0CBR29GV2

2. Speech and Language Processing (3rd ed. draft) (by Dan Jurafsky and James H. Martin): https://web.stanford.edu/~jurafsky/slp3/

