<a href="https://www.kaggle.com/code/sblaizer/in-depth-analysis-of-kaggle-and-arxiv-datasets?scriptVersionId=159089025" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Objective
This notebook is a follow-up to the [EDA Kaggle and Arxiv datasets](https://www.kaggle.com/code/sophieb/eda-kaggle-and-arxiv-datasets). The aim is to dive deeper into the following tasks:

* Integrate all text data-related competitions (9) from the past two years into the metadata analysis of the Kaggle write-ups. In the first EDA, we only referred to five competitions.
* Analyze the Arxiv dataset in greater detail to compare the insights gained in academia with those learned from text data write-ups reported in [A Journey Through Text Data Competitions](https://www.kaggle.com/code/sophieb/a-journey-through-text-data-competitions). 
* Take a step further by using the PKE toolkit to extract keywords from both the Kaggle write-ups and the Arxiv dataset, considering n-gram candidates and stopwords, and integrating a function to compute IDF weights.
* Present results using resources other than horizontal bar plots, such as stylecloud and n-gram plots.

In [None]:
# Installing Modules
!pip install git+https://github.com/boudinfl/pke.git
!pip install stylecloud wordcloud

In [None]:
# Library Definition
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt 

import re

import string
from string import punctuation
import pke
from pke import compute_document_frequency

import stylecloud
from IPython.display import Image  # Used to display the generated stylecloud PNG

import gc

---

# 1. Analyzing Kaggle Writeups Dataset

In [None]:
# Reading the Kaggle writeups dataset
writeup_df = pd.read_csv("/kaggle/input/2023-kaggle-ai-report/kaggle_writeups_20230510.csv", parse_dates=[0,3]) # Parse the columns at positions 0 and 3 as dates
writeup_df.head(3)

In [None]:
writeup_df.info(memory_usage='deep')

In [None]:
print("Number of writeups in the dataset: " + str(len(writeup_df)))

**The Kaggle writeups dataset contains a total of 3,127 writeups and uses 22 MB of memory**

## How many unique competitions does the Kaggle writeups dataset contain?

In [None]:
num_competitions = writeup_df["Title of Competition"].nunique()
print(f"The dataset has {num_competitions} competitions.")

**Answer: The Kaggle writeups dataset includes 310 competitions with a total of 3,127 writeups**

## What are the earliest and latest competitions?

In [None]:
early_comp = writeup_df["Competition Launch Date"].min().strftime('%Y-%m-%d')
late_comp = writeup_df["Competition Launch Date"].max().strftime('%Y-%m-%d')

print(f"The earliest competition is {early_comp} \nThe latest competition is {late_comp}")

**Answer: The dates of the Kaggle competitions range from <ins>2010-08-03</ins> to <ins>2023-02-23</ins>**

## How many competitions are from the past two years?
Let's consider competitions from January 2021 onwards.

In [None]:
# Keep competitions launched on or after 2021-01-01
writeup_past2years_df = writeup_df[writeup_df["Competition Launch Date"] >= "2021-01-01"]
numcomp_past2years = writeup_past2years_df['Title of Competition'].nunique()

print(f"Number of competitions from the past two years is {numcomp_past2years}")

In [None]:
len(writeup_past2years_df)

In [None]:
writeup_past2years_df["Title of Competition"].unique()

**Answer: There are 71 competitions held within the past two years (from January 2021 to February 2023) having 1,073 writeups.**

## Which competitions from the past two years have the most writeups?

In [None]:
writeup_past2years_df["Title of Competition"].value_counts().head()

## Which competitions from the past two years have the fewest writeups?


In [None]:
writeup_past2years_df["Title of Competition"].value_counts().tail()

**Answers: The Feedback Prize - English Language Learning and Jigsaw Rate Severity of Toxic Comments competitions take the lead (both are text-related competitions), whereas the Herbarium 2021 and Herbarium 2022 competitions have only one writeup each.**

## What is the number of writeups corresponding to text data competitions held in the past two years? 
For this analysis, we're going to use the following external dataset [Top 3 Kaggle Text Data Competitions (2021-2023)](https://www.kaggle.com/datasets/sblaizer/top-3-kaggle-text-data-competitions-2021-2023) that has identified nine competitions related to text data. 

In [None]:
textdata_df = pd.read_csv("/kaggle/input/top-3-kaggle-text-data-competitions-2021-2023/Summary_27write-ups_AIreport - Text Data Write-ups 27.csv")
textdata_df.head(3)

In [None]:
textdata_df["Competition"].unique()

In [None]:
# Turning unique competitions into a list 
list_textdata_comp = list(textdata_df["Competition"].unique())

In [None]:
# Correcting middle dash typos of the list
list_textdata_comp[0] = 'Feedback Prize - Predicting Effective Arguments'
list_textdata_comp[3] = 'Feedback Prize - Evaluating Student Writing'
list_textdata_comp[5] = 'chaii - Hindi and Tamil Question Answering'
list_textdata_comp[7] = 'Coleridge Initiative - Show US the Data'
list_textdata_comp[8] = 'NBME - Score Clinical Patient Notes'

list_textdata_comp

In [None]:
# Keeping only the writeups of text data competitions from the past two years
text_past2years_df = writeup_past2years_df[writeup_past2years_df["Title of Competition"].isin(list_textdata_comp)].copy()
text_past2years_df["Title of Competition"].unique()

In [None]:
print(f"Total writeups from the past two years: {len(writeup_past2years_df)}")
print(f"Total writeups related to text data competitions from the past two years: {len(text_past2years_df)}")

In [None]:
text_past2years_df = text_past2years_df.reset_index(drop=True)
text_past2years_df.sort_values(by='Competition Launch Date', ascending=True)

**Response: There are 9 competitions related to text data spanning from March 2021 to May 2022 having a total of 208 writeups.**

## What is the number of writeups per text data competition?

In [None]:
text_past2years_df["Title of Competition"].value_counts()

**Response: Jigsaw Rate Severity of Toxic Comments takes the lead with 33 writeups**

# 2. Extracting keywords from text data writeups
Now that we have identified 9 competitions related to text data and their 208 writeups, let's analyze the content of the writeups using the [PKE](https://boudinfl.github.io/pke/build/html/index.html) (Python Keyword Extraction) module. Before doing so, we need a text cleaning stage.

## Cleaning text data

Let's have a look at the format of a single writeup:

In [None]:
text_past2years_df["Writeup"][0]

In [None]:
# Creating a function that performs several text data cleaning steps 
def clean_text(df, col_to_clean):

    # Remove HTML tags
    df['cleaned_text'] = df[col_to_clean].apply(lambda x: re.sub('<[^<]+?>', ' ', x))
 
    # Remove brackets and apostrophes from Python lists
    df['cleaned_text'] = df['cleaned_text'].apply(lambda x: re.sub(r"[\[\]'\"]"," ", x))
    
    # Remove change of line characters 
    df['cleaned_text'] = df['cleaned_text'].str.replace("\n", " ", regex=True)
   
    # Remove special characters
    df['cleaned_text'] = df['cleaned_text'].str.replace("-", "", regex=False)
    df['cleaned_text'] = df['cleaned_text'].str.replace("[^a-zA-Z0-9 ]", "", regex=True)
     
    # Lowercase text
    df['cleaned_text'] = df['cleaned_text'].str.lower()
    
    return df

After applying the cleaning function, this is the outcome we obtained:

In [None]:
# Applying `clean_text()` function on writeups
text_past2years_df = clean_text(text_past2years_df, 'Writeup')
text_past2years_df['cleaned_text'][0]
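
To make the effect of each cleaning step concrete, the sketch below replays the same regular expressions on a single made-up string using only the standard library. The helper name `clean_one` and the sample text are illustrative assumptions, not part of the notebook.

```python
import re

def clean_one(text):
    """Apply the same cleaning steps as clean_text() to a single string."""
    text = re.sub(r'<[^<]+?>', ' ', text)      # HTML tags -> space
    text = re.sub(r"[\[\]'\"]", ' ', text)     # brackets and quotes -> space
    text = text.replace('\n', ' ')             # line breaks -> space
    text = text.replace('-', '')               # drop hyphens (joins hyphenated words)
    text = re.sub(r'[^a-zA-Z0-9 ]', '', text)  # drop remaining special characters
    return text.lower()

sample = "<p>Pseudo-labeling with ['BERT']\nworked!</p>"  # made-up example
cleaned = clean_one(sample)
print(cleaned.split())
```

Note that removing hyphens turns `pseudo-labeling` into the single token `pseudolabeling`, which is worth keeping in mind when searching the cleaned text for multi-word phrases later on.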

## Computing the frequency of keywords in writeups

In [None]:
# Creating a list containing all writeups
lst_writeups = text_past2years_df['cleaned_text'].to_list()

The `compute_document_frequency()` function counts, for each n-gram (up to 3-grams), the number of writeups in which it appears. On a CPU, this task takes around 7 minutes to complete.


In [None]:
#Reference1: https://github.com/boudinfl/pke/blob/master/examples/compute-df-counts.ipynb
#Reference2: https://boudinfl.github.io/pke/build/html/unsupervised.html
compute_document_frequency(
    documents=lst_writeups,     # List of writeups
    output_file='inspec.df.gz',
    language='en',              # language of the input files
    normalization='stemming',   # use porter stemmer
    stoplist=list(punctuation), # stoplist (punctuation marks)
    n=3                         # compute n-grams up to 3-grams
)
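
Conceptually, the document frequency (DF) of an n-gram is the number of documents in which it appears at least once. The following standard-library sketch illustrates the idea on three made-up documents. It does not reproduce pke's tokenization, stemming, or output file format, and the function names `ngrams` and `document_frequency` are assumptions made for illustration.

```python
from collections import Counter

def ngrams(tokens, max_n=3):
    """Yield every 1- to max_n-gram of a token list as a space-joined string."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def document_frequency(docs, max_n=3):
    """Count in how many documents each n-gram occurs (once per document)."""
    df = Counter()
    for doc in docs:
        df.update(set(ngrams(doc.split(), max_n)))
    return df

# Made-up toy corpus
toy_docs = ["deberta ensemble", "deberta pseudo labeling", "roberta ensemble"]
toy_df = document_frequency(toy_docs)
print(toy_df["deberta"], toy_df["pseudo labeling"])
```

Using a `set` per document is what makes this a document frequency rather than a raw term count.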

Let's have a look at the frequency of 20 keywords from the writeups collection 

In [None]:
from pke import load_document_frequency_file
dict_freq = load_document_frequency_file(input_file='inspec.df.gz')

count = 0  # Initialize a counter
for key, value in dict_freq.items():
    if count < 20:  # Limit to the first 20 key-value pairs
        print(f'{key}: {value}')
        count += 1
    else:
        break

In [None]:
# Deleting unused variables to free up memory
import gc

del writeup_df
del writeup_past2years_df 

# Freeing up memory 
gc.collect()

## Extracting keywords from writeups

The keyword extraction stage is based on the [TfIdf](https://boudinfl.github.io/pke/build/html/unsupervised.html) (Term Frequency-Inverse Document Frequency) method from PKE. *TfIdf is a popular and effective technique for identifying keyphrases in a collection of text documents.* We have created the `extract_keywords()` function to extract the top 5 keywords from each writeup. This function will process the 208 writeups and render a total of 1,040 keywords (208*5). 



In [None]:
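# Load the document frequencies once, rather than on every call
df_counts = load_document_frequency_file(input_file='inspec.df.gz')

def extract_keywords(text):
    # Stoplist: punctuation marks plus pke's English stopwords
    stoplist = list(string.punctuation)
    stoplist += pke.lang.stopwords.get('en')
    extractor = pke.unsupervised.TfIdf()
    # Note: the document frequencies were computed with normalization='stemming',
    # whereas candidates here are not stemmed, so some lookups may not match
    extractor.load_document(input=text,
                           language='en',
                           stoplist=stoplist,
                           normalization=None)
 
    extractor.candidate_selection(n=3)  # Select 1- to 3-gram candidates
    extractor.candidate_weighting(df=df_counts)  # Weight candidates using document frequencies
    keyphrases = extractor.get_n_best(n=10)
    
    # Extract top 5 keywords
    keywords = [keyword[0] for keyword in keyphrases[:5]]
    
    return keywords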
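
As a rough illustration of how TF-IDF weighting ranks candidates, the sketch below scores terms with a smoothed form, `tf * log2((N + 1) / (1 + df))`. This formula is an assumption chosen for the example; pke's exact weighting may differ, and `tfidf_scores` is a hypothetical helper.

```python
import math
from collections import Counter

def tfidf_scores(doc_tokens, df, n_docs):
    """Score each term of one document by term frequency times smoothed IDF."""
    tf = Counter(doc_tokens)
    return {
        term: count * math.log2((n_docs + 1) / (1 + df.get(term, 0)))
        for term, count in tf.items()
    }

# Made-up document frequencies: 'model' occurs in all 3 documents, 'awp' in none
toy_df = {"model": 3}
scores = tfidf_scores(["model", "model", "awp"], toy_df, n_docs=3)
print(scores)
```

A term that appears in every document gets a weight of zero no matter how often it is repeated, while a rare term is boosted. This is why the quality of the document-frequency table has a direct effect on which keywords come out on top.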

In [None]:
# Calculating a memory usage estimation of the collection of writeups to be processed.
text_past2years_df['cleaned_text'].info(memory_usage='deep')

In [None]:
# Creating a bar to track the keyword extraction progress
from tqdm import tqdm
with tqdm(total=len(text_past2years_df), desc="Processing") as pbar:
    def apply_with_progress(text):
        result = extract_keywords(text)
        pbar.update(1)  # Update the progress bar
        return result

    # Apply the function to the Series with progress tracking
    abstract_keywords = text_past2years_df['cleaned_text'].apply(apply_with_progress)

In [None]:
# Displaying the top 5 keywords of the first 10 writeups
abstract_keywords[:10]

The result is a pandas Series of lists that contains the top 5 keywords of each writeup. Let's store these results in a new `keywords_lst` column. 

In [None]:
text_past2years_df['keywords_lst'] = abstract_keywords
text_past2years_df.head(1)

In [None]:
# Freeing up memory 
gc.collect()

## Plotting extracted keywords from writeups
Let's count the most mentioned keywords in the collection of writeups

In [None]:
text_past2years_df_exploded = text_past2years_df.explode('keywords_lst')
text_past2years_df_exploded = text_past2years_df_exploded.reset_index(drop=True)
text_past2years_df_exploded['keywords_lst'].value_counts().head(30)
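
The explode-then-count pattern above amounts to flattening the per-writeup keyword lists and tallying each keyword. A minimal standard-library sketch of the same idea, using made-up keyword lists, could look like this:

```python
from collections import Counter
from itertools import chain

# Made-up keyword lists, one per writeup
keywords_per_writeup = [
    ["deberta", "ensemble"],
    ["ensemble", "pseudo labels"],
    ["ensemble", "deberta"],
]

# Flatten the lists (what explode() does) and count (what value_counts() does)
keyword_counts = Counter(chain.from_iterable(keywords_per_writeup))
top_two = keyword_counts.most_common(2)
print(top_two)
```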

In [None]:
# Converting the previous keyword list into a dataframe
keywords_count_serie = text_past2years_df_exploded['keywords_lst'].value_counts()
keywords_count_df = pd.DataFrame({'Keywords': keywords_count_serie.index,'Count': keywords_count_serie.values})
keywords_count_df.head(5)

In [None]:
# Selecting the top 50 words
keywords_count_50_df = keywords_count_df[:50]

In [None]:
import plotly.express as px

fig = px.bar(keywords_count_50_df, x='Count', y='Keywords', title='Top 50 keywords found in all Writeups by the TfIdf method', orientation='h', width=750, height=900, color='Keywords')
fig.show()

In [None]:
# Filtering out duplicated keywords
keywords_tree = text_past2years_df_exploded['keywords_lst'].to_list()
set_keywords_tree = set(keywords_tree)
lst_keywords_tree = list(set_keywords_tree)
print(f"Total keywords: {len(keywords_tree)} \nUnique keywords: {len(lst_keywords_tree)}")

In [None]:
# Creating a word cloud image using stylecloud
stylecloud.gen_stylecloud(
    text=' '.join(lst_keywords_tree), 
    icon_name='fas fa-tree',                     # 'fas fa-cloud'; 'fas fa-eye'; ''
    palette='cmocean.sequential.Matter_10',
    background_color='black',
    gradient='horizontal',
    size=1024
)
Image(filename="./stylecloud.png", width=1024, height=768)

---

# 3. Examining popular architectures, domains, and techniques used in Kaggle writeups based on word occurrences

In the previous section, we extracted the top 5 keywords of every writeup and counted their occurrences across all writeups. We identified common words, including 'models', 'competition', 'training', 'ensemble', and 'different'. However, these words offer little insight into the write-ups. In this section, we formulate specific questions and provide keywords that are more likely to reveal the techniques, text data domains, and architectures used in Kaggle's text data competitions.

## What are the main architectures used in the solutions of text data competitions?
We considered the following 16 architectures as keywords for this question. 

In [None]:
text_architectures_keywords = [
    "fasttext", "roberta", "bert", "gpt", "rnn", "cnn", "gru", "t5", "electra", "xlnet",
    "encoder", "decoder", "lstm", "transformer", "deberta", "codebert"
]

In [None]:
# Function that counts how often each string of a list occurs in a given column of a dataframe
def count_ocurrences_in_dataframe(df, column_name, strings_list):
    # Convert the strings_list input to a set to drop duplicates before building the pattern
    strings_set = set(strings_list)
    
    # Keep only the rows where 'column_name' contains any of the strings in 'strings_list'.
    # Joining the strings with '|' creates a regular expression pattern in which the pipe acts as an "OR" operator.
    filtered_df = df[df[column_name].str.contains('|'.join(strings_set))]
    
    # Create a dictionary to store the counting results
    results_dict = {'String': [], 'Occurrences':[]}
    
    # Iterate over the strings list
    for string in strings_list:
        # Add the string and its corresponding count to the dictionary
        results_dict['String'].append(string)
      
        # Count the actual occurrences in the filtered dataframe
        # (substring matches, so 'bert' also counts 'roberta', 'deberta', and 'codebert')
        actual_occurrences = filtered_df[column_name].str.count(string).sum()
        results_dict['Occurrences'].append(actual_occurrences)
    
    # Convert the dictionary to a dataframe
    counts_df = pd.DataFrame(results_dict)
    
    return counts_df
    

In [None]:
result = count_ocurrences_in_dataframe(text_past2years_df, 'cleaned_text', text_architectures_keywords)
sorted_result = result.sort_values('Occurrences', ascending=False)
sorted_result
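
Because the counts above are based on substring matches, a short keyword such as `bert` is also counted inside longer names. The sketch below contrasts substring counting with whole-word counting on two made-up sentences. The helper names are hypothetical, and the whole-word variant is offered as an alternative to consider, not as the method used in this notebook.

```python
import re

def count_substrings(texts, keyword):
    """Count every occurrence of keyword, including inside longer words."""
    return sum(text.count(keyword) for text in texts)

def count_whole_words(texts, keyword):
    """Count keyword only where it stands alone between word boundaries."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b")
    return sum(len(pattern.findall(text)) for text in texts)

# Made-up cleaned texts
toy_texts = ["we used deberta and roberta", "bert baseline plus codebert"]
substring_hits = count_substrings(toy_texts, "bert")
whole_word_hits = count_whole_words(toy_texts, "bert")
print(substring_hits, whole_word_hits)
```

The trade-off is that whole-word matching would miss variants written without a space, such as `gpt2` when searching for `gpt`, so the better choice depends on the question being asked.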

In [None]:
fig, ax= plt.subplots(1, 1, figsize=(8,4))
sns.barplot(x = sorted_result['Occurrences'], y = sorted_result['String'], palette='flare')

ax.set_ylabel('Architectures')
ax.set_xlabel('Occurrences')
ax.set_title('Architectures used in Kaggle text data competitions', fontsize=12)
#ax.set_limits([])
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.xticks(rotation=90)
plt.show()

**Answer: The top 3 architectures used in Kaggle text data competitions are BERT, DeBERTa, and RoBERTa.**

## Which of the following techniques is used most in the solutions of text data competitions?

In [None]:
techniques_keywords = [
    "pseudo labeling",
    "masked language modeling",
    "adversarial weight perturbation",
    "model ensembling",
    "model efficiency",
    "data augmentation"
]

In [None]:
result = count_ocurrences_in_dataframe(text_past2years_df, 'cleaned_text', techniques_keywords)
sorted_result = result.sort_values('Occurrences', ascending=False)
sorted_result

In [None]:
fig, ax= plt.subplots(1, 1, figsize=(8,4))
sns.barplot(x = sorted_result['Occurrences'], y = sorted_result['String'], palette='flare')

ax.set_ylabel('Techniques')
ax.set_xlabel('Occurrences')
ax.set_title('Trending NLP techniques found in Kaggle writeups')
#ax.set_limits([])
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.xticks(rotation=90)
plt.show()

**Answer: Pseudo labeling is the most frequently mentioned technique, along with data augmentation. Interestingly, Kagglers rarely, if ever, mention optimizing their models' efficiency.**

## Which of the following domains is mentioned most in the solutions of text data competitions?

In [None]:
domain_keywords = [
     "text mining",
     "text analytics",
     "text preprocessing",
     "text classification",
     "text clustering",
     "named entity recognition",
     "topic modeling",
     "information retrieval",
     "text summarization",
     "text generation",
     "text similarity",
     "word embeddings",
     "document classification",
     "text feature extraction",
     "text segmentation",
     "text normalization",
     "text corpora",
     "textual data analysis",
     "question answering",
     "sentiment analysis",
     "language modeling"    
]

In [None]:
result = count_ocurrences_in_dataframe(text_past2years_df, 'cleaned_text', domain_keywords)
sorted_result = result.sort_values('Occurrences', ascending=False)
sorted_result

In [None]:
fig, ax= plt.subplots(1, 1, figsize=(8,6))
sns.barplot(x = sorted_result['Occurrences'], y = sorted_result['String'], palette='flare')

ax.set_ylabel('Domains')
ax.set_xlabel('Occurrences')
ax.set_title('NLP domains found in Kaggle writeups')
ax.set_xlim([0, 12])
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.xticks(rotation=90)
plt.show()

**Answer: Question answering and text classification are the most frequently mentioned domains in the collection of writeups.**

---

In [None]:
# Freeing up memory 
gc.collect()

---

# 4. Analyzing the Arxiv Dataset 

As with the Kaggle writeups dataset, we're going to analyze the Arxiv dataset to gain more insight into the strategies followed in Academia for text data.

In [None]:
# Loading the Arxiv Dataset
df_arxiv = pd.read_json(
    '/kaggle/input/2023-kaggle-ai-report/arxiv_metadata_20230510.json',
    lines = True, 
    convert_dates = True, 
    chunksize = 100000
)

In [None]:
# Reading a single chunk from the Arxiv dataset 
for chunk in df_arxiv:
    break
len(chunk)

In [None]:
chunk.head(3)

In [None]:
chunk.info(memory_usage='deep')

In [None]:
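# Reading the remaining chunks and concatenating them into a single dataframe.
# Note: `df_arxiv` is a one-pass reader, so the chunk previewed above is not read again here;
# start the list with that chunk (chunks = [chunk] + chunks) to include it.
chunks = [c for c in df_arxiv]
arxiv_df = pd.concat(chunks, ignore_index=True)
del chunks
arxiv_df.head(5) 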
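
A chunked reader behaves like any one-pass Python iterator: a chunk that has been pulled out for a preview is not yielded again by a later loop. The sketch below illustrates this with a hypothetical stand-in generator, `read_in_chunks`, instead of the real pandas reader, and shows one way to keep the previewed chunk.

```python
def read_in_chunks(rows, chunksize):
    """Hypothetical stand-in for a chunked reader: yields lists of rows."""
    for start in range(0, len(rows), chunksize):
        yield rows[start:start + chunksize]

reader = read_in_chunks(list(range(10)), chunksize=4)

first = next(reader)                               # Preview: consumes the first chunk
rest = [row for chunk in reader for row in chunk]  # Only the remaining chunks are left
all_rows = first + rest                            # Keep the preview to lose nothing

print(len(first), len(rest), len(all_rows))
```

Collecting the chunks in a list and concatenating once at the end also avoids copying the growing result on every iteration.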

In [None]:
arxiv_df.info(memory_usage='deep')

**We found around 2.15 M papers in the Arxiv dataset. This dataframe has a memory usage of 4.0 GB.**

## How many distinct categories of papers are present in the ArXiv Dataset? 

In [None]:
print(f"The Arxiv Dataset has {arxiv_df['categories'].nunique()} unique categories")

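**Answer: The Arxiv dataset has around 76 k unique values in the `categories` column. Each value is a combination of one or more categories assigned to a paper, so this figure counts category combinations rather than individual categories.**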

## What are the earliest and latest papers in the Arxiv Dataset?

In [None]:
# Turning the 'update_date' column into a datetime format column 
arxiv_df['update_date'] = pd.to_datetime(arxiv_df['update_date'])
arxiv_df.info()

In [None]:
arxiv_date_min = arxiv_df['update_date'].min().strftime('%Y-%m-%d')
arxiv_date_max = arxiv_df['update_date'].max().strftime('%Y-%m-%d')
print(f"The Arxiv Dataset includes papers from {arxiv_date_min} to {arxiv_date_max}")

**Answer:  The Arxiv Dataset contains papers from <u>2007-05-23</u> to <u>2023-05-05</u>**

## How many papers from the past two years does the Arxiv Dataset contain?

In [None]:
# Keep papers updated on or after 2021-01-01
arxiv_2years_df = arxiv_df[arxiv_df['update_date'] >= "2021-01-01"].copy()
arxiv_2years_df = arxiv_2years_df.reset_index(drop=True)
arxiv_2years_df.head(3)

In [None]:
print(f"The Arxiv Dataset contains {len(arxiv_2years_df)} papers from January 2021 to May 2023")

**Answer: The Arxiv Dataset contains around 527 k papers from January 2021 to May 2023**

## What are the most common categories among papers from the past two years?

In [None]:
arxiv_2years_df['categories'].value_counts().head(20)

**Answer: Computer Vision takes the lead in number of papers, followed by Quantum Physics.**

## What are the least common categories among papers from the past two years?

In [None]:
arxiv_2years_df['categories'].value_counts().tail(20)

**Answer: Many rows concatenate several categories into a single string, which makes them hard to identify as unique categories.**

## What is the number of NLP-related papers from the past two years?

To identify the most popular categories in the NLP domain, we searched for the term **Natural Language Processing** on [arxiv](https://arxiv.org/) for the period from 2021-01 to 2023. We then ordered the results by **Announcement date (oldest first)** and by **Announcement date (newest first)** and identified the 19 categories that researchers used most often.

In [None]:
# Arxiv categories considered NLP-related in this analysis
nlp_categories_arxiv = [
    "cs.SE",      # Software Engineering
    "cs.CY",      # Computers and Society
    "cs.IR",      # Information Retrieval
    "cs.CL",      # Computation and Language
    "cs.LG",      # Machine Learning
    "cs.NE",      # Neural and Evolutionary Computing
    "cs.AI",      # Artificial Intelligence
    "cs.DL",      # Digital Libraries
    "cs.HC",      # Human Computer Interaction
    "cs.SI",      # Social and Information Networks
    "stat.ML",    # Machine Learning
    "cs.SD",      # Sound
    "cs.CR",      # Cryptography and Security
    "q-fin.ST",   # Statistical Finance
    "quant-ph",   # Quantum Physics
    "q-bio.OT",   # Other Quantitative Biology
    "physics.comp-ph", # Computational Physics
    "physics.data-an", # Data Analysis, Statistics, and Probability
    "cs.AR"            # Hardware Architecture
]

In [None]:
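# Keep papers whose 'categories' value is exactly one of the selected categories
# (papers listed under several categories at once are not matched by isin)
nlp_arxiv_2years_df = arxiv_2years_df[arxiv_2years_df.categories.isin(nlp_categories_arxiv)]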
nlp_arxiv_2years_df = nlp_arxiv_2years_df.reset_index(drop=True)
print(f"Number of papers from the past two years: {len(arxiv_2years_df)} \nNumber of NLP-related papers from the past two years: {len(nlp_arxiv_2years_df)}")
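
Since a paper's `categories` value can hold several space-separated labels, an exact match only selects single-category papers. The sketch below contrasts exact matching with an "any label matches" test on made-up values. `in_any_category` is a hypothetical helper, shown as an option to consider and not as what the notebook does.

```python
def in_any_category(categories, wanted):
    """True if the space-separated categories string shares a label with wanted."""
    return not wanted.isdisjoint(categories.split())

# Made-up examples of 'categories' values
wanted = {"cs.CL", "cs.IR"}
papers = ["cs.CL", "cs.CL cs.LG", "math.AG", "hep-th cs.IR"]

exact = [p in wanted for p in papers]                     # exact string match
any_match = [in_any_category(p, wanted) for p in papers]  # any shared label
print(exact, any_match)
```

With pandas, such a helper could be applied to the `categories` column through `Series.apply`, at the cost of selecting a larger and broader set of papers.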

In [None]:
nlp_arxiv_2years_df.sort_values('update_date', ascending=True)

**Answer: There is a total of 527 k papers from the past two years (from January 2021 to May 2023), and around 46 k of them correspond to the NLP domain.**

## What is the count of NLP-related papers per category from the past two years?

In [None]:
nlp_keywords_serie = nlp_arxiv_2years_df['categories'].value_counts()
nlp_keywords_serie

In [None]:
import plotly.express as px

keywords_count_df = pd.DataFrame({'Keywords': nlp_keywords_serie.index,'Count': nlp_keywords_serie.values})
fig = px.bar(keywords_count_df, x='Count', y='Keywords', title='NLP-related papers per category found in the Arxiv dataset', orientation='h', width=750, height=900, color='Keywords')
fig.show()

**Answer: Interestingly, Quantum Physics takes the lead in NLP-related papers followed by Computation and Language, and Machine Learning papers.**

---

# 5. Extracting keywords from Arxiv papers
In this section, we analyze the abstracts of the papers using the PKE (Python Keyword Extraction) module to identify their main keywords.  

## Cleaning the Arxiv Dataset 
Before doing so, we need a text cleaning stage. 

In [None]:
# Eliminating duplicates 
nlp_arxiv_2years_cleaned_df = nlp_arxiv_2years_df.drop_duplicates(subset=['title'])
len(nlp_arxiv_2years_df), len(nlp_arxiv_2years_cleaned_df)

In [None]:
# Printing a sample abstract 
nlp_arxiv_2years_cleaned_df['abstract'][0]

In [None]:
# Applying the `clean_text()` function on abstracts
nlp_arxiv_2years_copy_df = nlp_arxiv_2years_cleaned_df.copy()
nlp_arxiv_2years_copy_df = clean_text(nlp_arxiv_2years_copy_df, 'abstract')

# Printing a cleaned abstract
nlp_arxiv_2years_copy_df['cleaned_text'][0]

In [None]:
nlp_arxiv_2years_copy_df.head(1)

## Computing keyword extraction on the Arxiv Dataset
The keyword extraction stage is based on the [TfIdf](https://boudinfl.github.io/pke/build/html/unsupervised.html) (Term Frequency-Inverse Document Frequency) method from PKE. *TfIdf is a popular and effective technique for identifying keyphrases in a collection of text documents.* We have created the `extract_keywords()` function to extract the top 5 keywords from each abstract. 

In [None]:
# Cleaning up memory 
import gc

del df_arxiv
del chunk
del arxiv_df
del arxiv_2years_df
del nlp_arxiv_2years_df
del nlp_arxiv_2years_cleaned_df

gc.collect() 

In [None]:
!pip install git+https://github.com/boudinfl/pke.git

In [None]:
def extract_keywords(text):
    stoplist = list(string.punctuation)
    stoplist += pke.lang.stopwords.get('en')
    extractor = pke.unsupervised.TfIdf()
    extractor.load_document(input=text,
                           language='en',
                           stoplist=stoplist,
                           normalization=None)
 
    extractor.candidate_selection() # Select 1- to 3-gram candidates (default n=3)
    extractor.candidate_weighting() # No `df` is passed here, so pke falls back to its default document frequencies
    keyphrases = extractor.get_n_best(n=10)
    
    # Extract top 5 keywords
    keywords = [keyword[0] for keyword in keyphrases[:5]]
   
    return keywords

In [None]:
nlp_arxiv_2years_copy_df['cleaned_text'].info(memory_usage='deep')

**Important: The collection of text-related abstracts (52.5 MB) is about 250 times larger than the processed Kaggle writeups collection (208 KB). Running keyword extraction on the full ArXiv subset would therefore put heavy pressure on RAM under the standard CPU settings. For this reason, we selected only 5,000 abstracts to demonstrate that the keyword extraction process works, which yields 25,000 keywords (5*5,000).**

In [None]:
# Selecting only 5,000 abstracts to be processed
clean_abstract = nlp_arxiv_2years_copy_df['cleaned_text'][:5000]
clean_abstract.info(memory_usage='deep')

## Implementing a progress bar to track the keyword extraction process

In [None]:
!pip install tqdm

In [None]:
from tqdm import tqdm

In [None]:
# Creating a tqdm progress bar to track the keyword extraction process. It takes about 6 h
cleanup_interval = 20

with tqdm(total=len(clean_abstract), desc="Processing") as pbar:
    def apply_with_progress(text):
        result = extract_keywords(text)
        pbar.update(1)  # Update the progress bar
        # Check if it's time to clean up memory
        if pbar.n % cleanup_interval == 0:
            gc.collect()
        return result

    # Apply the function to the Series with progress tracking
    abstract_keywords = clean_abstract.apply(apply_with_progress)

In [None]:
abstract_keywords[100:]

## Plotting identified keywords from 5,000 papers

In [None]:
# Finding the most popular keywords from 5,000 papers
nlp_keywords_serie = abstract_keywords.explode()
nlp_keywords_count = nlp_keywords_serie.value_counts()
nlp_keywords_count_df = pd.DataFrame({'Keywords': nlp_keywords_count.index,'Count': nlp_keywords_count.values})
nlp_keywords_count_df.head(5)

In [None]:
# Selecting the top 50 words
nlp_keywords_count50_df = nlp_keywords_count_df[:50]

In [None]:
import plotly.express as px

fig = px.bar(nlp_keywords_count50_df, x='Count', y='Keywords', title='Top 50 keywords found in 5,000 abstracts by the TfIdf method', orientation='h', width=750, height=900, color='Keywords')
fig.show()

In [None]:
# Filtering out duplicated keywords
keywords_tree = nlp_keywords_serie.to_list()
set_keywords_tree = set(keywords_tree)
lst_keywords_tree = list(set_keywords_tree)
print(f"Total keywords: {len(keywords_tree)} \nUnique keywords: {len(lst_keywords_tree)}")

In [None]:
# Creating a word cloud image using stylecloud
stylecloud.gen_stylecloud(
    text=' '.join(lst_keywords_tree), 
    icon_name='fas fa-tree',                     # 'fas fa-cloud'; 'fas fa-eye'; ''
    palette='cmocean.sequential.Matter_10',
    background_color='black',
    gradient='horizontal',
    size=1024
)
Image(filename="./stylecloud.png", width=1024, height=768)

---

# 6. Examining popular architectures, domains, and techniques in the Arxiv dataset based on word occurrences

As in the analysis of the Kaggle writeups dataset, in this section we ask specific questions and provide keywords to narrow down our analysis. We focus on the techniques, text data domains, and architectures most commonly employed in the papers of the Arxiv dataset. 

## What are the main architectures used in Academia?
We considered the following 16 architectures as keywords for this question. 

In [None]:
text_architectures_keywords = [
    "fasttext", "roberta", "bert", "gpt", "rnn", "cnn", "gru", "t5", "electra", "xlnet",
    "encoder", "decoder", "lstm", "transformer", "deberta", "codebert"
]

In [None]:
# Function that counts how often each string of a list occurs in a given column of a dataframe
def count_ocurrences_in_dataframe(df, column_name, strings_list):
    # Convert the strings_list input to a set to drop duplicates before building the pattern
    strings_set = set(strings_list)
    
    # Keep only the rows where 'column_name' contains any of the strings in 'strings_list'.
    # Joining the strings with '|' creates a regular expression pattern in which the pipe acts as an "OR" operator.
    filtered_df = df[df[column_name].str.contains('|'.join(strings_set))]
    
    # Create a dictionary to store the counting results
    results_dict = {'String': [], 'Occurrences':[]}
    
    # Iterate over the strings list
    for string in strings_list:
        # Add the string and its corresponding count to the dictionary
        results_dict['String'].append(string)
      
        # Count the actual occurrences in the filtered dataframe
        # (substring matches, so 'bert' also counts 'roberta', 'deberta', and 'codebert')
        actual_occurrences = filtered_df[column_name].str.count(string).sum()
        results_dict['Occurrences'].append(actual_occurrences)
    
    # Convert the dictionary to a dataframe
    counts_df = pd.DataFrame(results_dict)
    
    return counts_df
    

In [None]:
result = count_ocurrences_in_dataframe(nlp_arxiv_2years_copy_df, 'cleaned_text', text_architectures_keywords)
sorted_result = result.sort_values('Occurrences', ascending=False)
sorted_result

In [None]:
fig, ax= plt.subplots(1, 1, figsize=(8,4))
sns.barplot(x = sorted_result['Occurrences'], y = sorted_result['String'], palette='flare')

ax.set_ylabel('Architectures')
ax.set_xlabel('Occurrences')
ax.set_title('Architectures used in the Arxiv Dataset', fontsize=12)
#ax.set_limits([])
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.xticks(rotation=90)
plt.show()

**Answer: BERT and encoder-based architectures appear to take the lead in Academia, along with transformers.**
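One caveat when reading these counts: `str.count` matches substrings, so "bert" is also counted inside "roberta", "deberta", and "codebert". The sketch below is a hypothetical whole-word variant of `count_ocurrences_in_dataframe`. The function name `count_whole_word_occurrences` and the toy dataframe are assumptions made for illustration and are not part of the original analysis. It wraps each keyword in `\b` word boundaries so that only standalone mentions are counted.

```python
import re

import pandas as pd


def count_whole_word_occurrences(df, column_name, strings_list):
    """Hypothetical variant: count each keyword only when it appears as a whole word."""
    # Treat missing texts as empty strings so the string methods do not fail
    text = df[column_name].fillna("").astype(str)

    rows = []
    for string in strings_list:
        # \b marks a word boundary; re.escape protects any regex metacharacters
        pattern = r"\b" + re.escape(string) + r"\b"
        rows.append({"String": string, "Occurrences": int(text.str.count(pattern).sum())})

    return pd.DataFrame(rows)


# Toy data (assumed, for illustration only)
toy_df = pd.DataFrame(
    {"cleaned_text": ["we fine tune roberta and bert", "deberta beats bert", None]}
)
toy_counts = count_whole_word_occurrences(
    toy_df, "cleaned_text", ["bert", "roberta", "deberta"]
)
print(toy_counts)  # bert: 2, roberta: 1, deberta: 1
```

With plain substring counting, "bert" would score 4 on this toy data instead of 2, so the gap between BERT and its derivatives may be smaller than the bar plot suggests.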

## Which of the following techniques is most used in Academia?

In [None]:
techniques_keywords = [
    "pseudo labeling",
    "masked language modeling",
    "adversarial weight perturbation",
    "model ensembling",
    "model efficiency",
    "data augmentation"
]

In [None]:
result = count_ocurrences_in_dataframe(nlp_arxiv_2years_copy_df, 'cleaned_text', techniques_keywords)
sorted_result = result.sort_values('Occurrences', ascending=False)
sorted_result

In [None]:
fig, ax= plt.subplots(1, 1, figsize=(8,4))
sns.barplot(x = sorted_result['Occurrences'], y = sorted_result['String'], palette='flare')

ax.set_ylabel('Techniques')
ax.set_xlabel('Occurrences')
ax.set_title('Trending NLP techniques found in Academia')
#ax.set_limits([])
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.xticks(rotation=90)
plt.show()

**Answer: Unlike the authors of the text-related Kaggle writeups, researchers appear more interested in data augmentation than in pseudo labeling.**
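Total occurrences can be dominated by a few papers that repeat a term many times. As a complementary view, the following sketch counts how many documents mention each keyword at least once. The function name `share_of_documents` and the toy texts are assumptions for illustration; this is not the method used for the plot above.

```python
import re

import pandas as pd


def share_of_documents(df, column_name, strings_list):
    """Hypothetical helper: number and share of documents mentioning each keyword."""
    # Treat missing texts as empty strings so the string methods do not fail
    text = df[column_name].fillna("").astype(str)

    rows = []
    for string in strings_list:
        # True for every document that contains the keyword at least once
        mentions = text.str.contains(re.escape(string), regex=True)
        rows.append(
            {
                "String": string,
                "Documents": int(mentions.sum()),
                "Share": float(mentions.mean()),
            }
        )

    return pd.DataFrame(rows)


# Toy data (assumed, for illustration only)
toy_papers = pd.DataFrame(
    {
        "cleaned_text": [
            "data augmentation data augmentation data augmentation",
            "pseudo labeling helps",
            "pseudo labeling and data augmentation",
            "no technique mentioned",
        ]
    }
)
toy_share = share_of_documents(
    toy_papers, "cleaned_text", ["data augmentation", "pseudo labeling"]
)
print(toy_share)
```

In this toy example "data augmentation" occurs four times and "pseudo labeling" twice, yet both appear in two of the four documents, which shows how the two measures can rank keywords differently.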

## Which of the following domains is most often referred to in Academia?

In [None]:
text_data_keywords = [
     "text mining",
     "text analytics",
     "text preprocessing",
     "text classification",
     "text clustering",
     "named entity recognition",
     "topic modeling",
     "information retrieval",
     "text summarization",
     "text generation",
     "text similarity",
     "word embeddings",
     "document classification",
     "text feature extraction",
     "text segmentation",
     "text normalization",
     "text corpora",
     "textual data analysis",
     "question answering",
     "sentiment analysis",
     "language modeling"    
]

In [None]:
result = count_ocurrences_in_dataframe(nlp_arxiv_2years_copy_df, 'cleaned_text', text_data_keywords)
sorted_result = result.sort_values('Occurrences', ascending=False)
sorted_result

In [None]:
fig, ax= plt.subplots(1, 1, figsize=(8,6))
sns.barplot(x = sorted_result['Occurrences'], y = sorted_result['String'], palette='flare')

ax.set_ylabel('Domains')
ax.set_xlabel('Occurrences')
ax.set_title('NLP domains found in Academia')
ax.set_xlim([0, 800])
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.xticks(rotation=90)
plt.show()

**Answer: Regarding NLP domains, the Kaggle community and Academia agree: both focus their efforts on question answering and text classification.**

---

# Conclusion

By considering 9 text-data-related competitions instead of 5, we identified 208 writeups (70 times more data to be analyzed than in our previous EDA). This helped us gain a better understanding of Kaggle text data competitions. We've also expanded our consideration to 19 categories that could contain NLP papers, as opposed to the previous 12, for the ArXiv dataset. Here's a general summary of our findings over the last two years:

* BERT and encoder-based architectures are the most popular in both the Kaggle community and academic contexts.
* Pseudo labeling was the most frequently referenced technique in text data writeups, while data augmentation was more prevalent in text-related papers. It appears that researchers prioritize model efficiency, whereas Kagglers might overlook it in their solutions.
* Both the Kaggle community and academia are increasingly focusing their efforts on the question answering (Q&A) and text classification domains.

---

# Appendix
Finally, here are some useful tips for processing large datasets:

1. Keep an eye on RAM usage at every stage of your dataset analysis. You can:

    * Assess the memory size of dataframes using the `df.info(memory_usage='deep')` command.
    * Remove dataframes that you no longer need with `del df`.
    * Free up memory whenever possible using the `gc.collect()` command.
    * Use the following commands to assess the memory usage of your variables:
    ```
        from __future__ import print_function  # for Python2
        import sys

        local_vars = list(locals().items())
        for var, obj in local_vars:
            print(var, sys.getsizeof(obj))
    ```
2. Implement a progress bar when executing long-running processes.
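The tips above can be combined as in the following minimal sketch. It assumes the third-party `tqdm` package for the progress bar and falls back to plain iteration when it is not installed. The dataframe `big_df` and its contents are placeholders for illustration.

```python
import gc

import pandas as pd

try:
    from tqdm import tqdm  # third-party progress bar (assumed to be installed)
except ImportError:
    def tqdm(iterable, **kwargs):
        # Fallback: iterate without displaying a progress bar
        return iterable

# Placeholder dataframe standing in for a large text dataset
big_df = pd.DataFrame({"cleaned_text": ["some abstract text"] * 10_000})

# Tip 1: measure the real memory footprint, including the string contents
bytes_used = int(big_df.memory_usage(deep=True).sum())
print(f"big_df uses about {bytes_used / 1e6:.2f} MB")

# Tip 2: wrap a long loop in a progress bar
lengths = []
for text in tqdm(big_df["cleaned_text"], desc="Measuring texts"):
    lengths.append(len(text))

# Tip 1 again: drop the dataframe once it is no longer needed and reclaim memory
del big_df
gc.collect()
```

`memory_usage(deep=True)` returns the per-column byte counts that `df.info(memory_usage='deep')` summarizes, which makes it convenient when the number is needed programmatically.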