# Paper Topic Recognition System

## Introduction

This Jupyter Notebook develops the Paper Topic Recognition System, an automated tool that classifies academic papers into predefined categories based on their text. It uses the arXiv metadata dataset, which contains over 2 million articles, and applies Natural Language Processing (NLP) and machine learning techniques to categorize papers, with the aim of making scholarly articles easier to manage and retrieve.

## Dataset

This project uses version 177 of the [arXiv Dataset](https://www.kaggle.com/datasets/Cornell-University/arxiv/data?select=arxiv-metadata-oai-snapshot.json) hosted on Kaggle.

## Objectives

- **Data Preparation**: Implement preprocessing techniques to clean the dataset, removing noise and standardizing text format for further analysis.
- **Feature Extraction**: Use Term Frequency-Inverse Document Frequency (TF-IDF) for converting textual data into a structured numerical format that facilitates effective machine learning model training.
- **Model Selection and Training**: Explore and evaluate various machine learning algorithms, including Naïve Bayes, K-Nearest Neighbors, Support Vector Machines, and more advanced neural network architectures, to determine the optimal model.
- **Performance Evaluation**: Assess the models using accuracy metrics and cross-validation techniques to ensure reliability and effectiveness in paper categorization.
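
To make the feature-extraction objective concrete, here is a minimal sketch of TF-IDF weighting written with the standard library only. It uses the plain `tf * log(N / df)` form on a made-up toy corpus; the function name `tfidf` and the corpus are illustrative assumptions, not part of the notebook. A library implementation such as scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers would differ.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (plain tf * log(N / df))."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Toy corpus (made up for illustration, not taken from the dataset).
toy_docs = [
    "quantum field theory",
    "graph theory algorithm",
    "quantum algorithm",
]
toy_weights = tfidf(toy_docs)
print(toy_weights[0])
```

Terms that occur in fewer documents (here `field`) receive a higher weight than terms shared across documents (here `theory`), which is the property the classifier relies on.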

## Step 1: Load and Explore the Data

In [1]:
%%time

import pandas as pd
from tqdm import tqdm
tqdm.pandas()

# Load JSON data into DataFrame
data = pd.read_json('arxiv-metadata-oai-snapshot.json', lines=True)

Wall time: 1min 34s
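
Loading all 14 columns takes a while and a fair amount of memory, although only three columns are used later. One possible alternative, sketched below under the assumption that the snapshot is a JSON Lines file (which `lines=True` implies), is to stream the file and keep only the needed fields. The helper name `iter_records` and the two sample lines are hypothetical.

```python
import io
import json

def iter_records(lines, fields):
    """Yield one dict per JSON line, keeping only the requested fields."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        yield {field: record.get(field) for field in fields}

# Stand-in for the snapshot file: two made-up JSON lines.
sample_file = io.StringIO(
    '{"id": "a1", "title": "T1", "abstract": "A1", "categories": "cs.CL", "doi": null}\n'
    '{"id": "a2", "title": "T2", "abstract": "A2", "categories": "math.CO cs.CG"}\n'
)
wanted = ["title", "abstract", "categories"]
records = list(iter_records(sample_file, wanted))
print(records)
```

With the real file, the same generator could be fed an open file handle and its output passed to `pd.DataFrame(...)`.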


In [2]:
# Display the DataFrame

data

Unnamed: 0,id,submitter,authors,title,comments,journal-ref,doi,report-no,categories,license,abstract,versions,update_date,authors_parsed
0,0704.0001,Pavel Nadolsky,"C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-...",Calculation of prompt diphoton production cros...,"37 pages, 15 figures; published version","Phys.Rev.D76:013009,2007",10.1103/PhysRevD.76.013009,ANL-HEP-PR-07-12,hep-ph,,A fully differential calculation in perturba...,"[{'version': 'v1', 'created': 'Mon, 2 Apr 2007...",2008-11-26,"[[Balázs, C., ], [Berger, E. L., ], [Nadolsky,..."
1,0704.0002,Louis Theran,Ileana Streinu and Louis Theran,Sparsity-certifying Graph Decompositions,To appear in Graphs and Combinatorics,,,,math.CO cs.CG,http://arxiv.org/licenses/nonexclusive-distrib...,"We describe a new algorithm, the $(k,\ell)$-...","[{'version': 'v1', 'created': 'Sat, 31 Mar 200...",2008-12-13,"[[Streinu, Ileana, ], [Theran, Louis, ]]"
2,0704.0003,Hongjun Pan,Hongjun Pan,The evolution of the Earth-Moon system based o...,"23 pages, 3 figures",,,,physics.gen-ph,,The evolution of Earth-Moon system is descri...,"[{'version': 'v1', 'created': 'Sun, 1 Apr 2007...",2008-01-13,"[[Pan, Hongjun, ]]"
3,0704.0004,David Callan,David Callan,A determinant of Stirling cycle numbers counts...,11 pages,,,,math.CO,,We show that a determinant of Stirling cycle...,"[{'version': 'v1', 'created': 'Sat, 31 Mar 200...",2007-05-23,"[[Callan, David, ]]"
4,0704.0005,Alberto Torchinsky,Wael Abu-Shammala and Alberto Torchinsky,From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...,,"Illinois J. Math. 52 (2008) no.2, 681-689",,,math.CA math.FA,,In this paper we show how to compute the $\L...,"[{'version': 'v1', 'created': 'Mon, 2 Apr 2007...",2013-10-15,"[[Abu-Shammala, Wael, ], [Torchinsky, Alberto, ]]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2468398,supr-con/9608008,Ruslan Prozorov,"R. Prozorov, M. Konczykowski, B. Schmidt, Y. Y...",On the origin of the irreversibility line in t...,"19 pages, LaTex, 6 PostScript figures; Author'...",,10.1103/PhysRevB.54.15530,,supr-con cond-mat.supr-con,,We report on measurements of the angular dep...,"[{'version': 'v1', 'created': 'Mon, 26 Aug 199...",2009-10-30,"[[Prozorov, R., ], [Konczykowski, M., ], [Schm..."
2468399,supr-con/9609001,Durga P. Choudhury,"Durga P. Choudhury, Balam A. Willemsen, John S...",Nonlinear Response of HTSC Thin Film Microwave...,"4 pages, LaTeX type, Uses IEEE style files, 60...",,10.1109/77.620744,,supr-con cond-mat.supr-con,,The non-linear microwave surface impedance o...,"[{'version': 'v1', 'created': 'Sat, 31 Aug 199...",2016-11-18,"[[Choudhury, Durga P., , Physics Department, N..."
2468400,supr-con/9609002,Durga P. Choudhury,"Balam A. Willemsen, J. S. Derov and S.Sridhar ...",Critical State Flux Penetration and Linear Mic...,"20 pages, LaTeX type, Uses REVTeX style files,...",,10.1103/PhysRevB.56.11989,,supr-con cond-mat.supr-con,,The vortex contribution to the dc field (H) ...,"[{'version': 'v1', 'created': 'Tue, 3 Sep 1996...",2009-10-30,"[[Willemsen, Balam A., , Physics Department,\n..."
2468401,supr-con/9609003,Hasegawa Yasumasa,Yasumasa Hasegawa (Himeji Institute of Technol...,Density of States and NMR Relaxation Rate in A...,"7 pages, 4 PostScript Figures, LaTeX, to appea...",,10.1143/JPSJ.65.3131,,supr-con cond-mat.supr-con,,We show that the density of states in an ani...,"[{'version': 'v1', 'created': 'Wed, 18 Sep 199...",2009-10-30,"[[Hasegawa, Yasumasa, , Himeji Institute of Te..."


In [3]:
# Print dimensions of DataFrame

print(data.shape)

(2468403, 14)


In [4]:
# Print all column names

print(data.columns)           

In [5]:
# Print first row

print(data.iloc[0])                

In [6]:
print(data.loc[0, 'title'])         # Print the value of column 'title' from first row
print()
print(data.loc[0, 'abstract'])      # Print the value of column 'abstract' from first row
print()
print(data.loc[0, 'categories'])    # Print the value of column 'categories' from first row

## Step 2: Data Cleaning and Preprocessing

In [7]:
filtered_data = data[['title', 'abstract', 'categories']]

In [2]:
# Download text data for stopwords

import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

print(len(stop_words))
print(stop_words)

179
{'under', "don't", 've', 'for', 'y', 'while', 'yourself', 'below', 'should', 'herself', "haven't", 'after', 'we', 'from', 'was', 'have', 'or', 'a', 'before', 'mightn', 'on', 'against', 'she', 'nor', 'where', 'hadn', 'more', 'these', 'by', 'if', 'you', 'until', 'so', 'theirs', "doesn't", 'it', 'there', 'its', "wouldn't", 'them', 't', 'doesn', 'only', "shan't", "hasn't", 'once', 'did', 'isn', 'didn', 'ma', "it's", 'been', 'hasn', 'which', 'but', "hadn't", 'myself', 'he', 'of', 'no', 'all', "shouldn't", 'that', 'what', 'how', 'wouldn', 'aren', "should've", "she's", 'in', 'can', 'same', 'down', 's', 'me', "needn't", 'over', 'don', 'off', 'the', 'into', 'any', 'd', 're', 'not', 'very', 'is', 'had', 'why', 'who', 'i', 'each', 'themselves', 'at', 'doing', 'other', 'needn', 'just', 'has', "you'll", 'his', 'through', 'as', 'shouldn', 'out', "weren't", 'between', 'our', 'does', 'shan', "wasn't", 'am', 'weren', 'm', 'do', 'ours', "you've", 'are', 'such', 'won', "you're", 'further', 'wasn', 'w

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\TeoDea\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [9]:
def clean_text(text):
    text = text.lower()                                                        # Convert to lowercase
    text = re.sub(r'\[.*?\]', '', text)                                        # Remove text inside square brackets
    text = re.sub(r'[^a-z0-9\s]', '', text)                                    # Remove non-alphanumeric characters
    text = ' '.join([word for word in text.split() if word not in stop_words]) # Remove stopwords
    return text
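
The effect of `clean_text` can be checked on a single string. The sketch below repeats the same three steps with a small stand-in stop-word set, so it runs without NLTK; `demo_stop_words`, `clean_text_demo`, and the sample sentence are illustrative assumptions.

```python
import re

# Small stand-in for the NLTK stop-word set (assumption: the real set is much larger).
demo_stop_words = {"the", "of", "is", "in", "a", "we", "by"}

def clean_text_demo(text, stop_words=demo_stop_words):
    text = text.lower()                          # lowercase
    text = re.sub(r'\[.*?\]', '', text)          # drop bracketed text such as [1]
    text = re.sub(r'[^a-z0-9\s]', '', text)      # drop everything except letters, digits, whitespace
    return ' '.join(w for w in text.split() if w not in stop_words)

sample = "The evolution of Earth-Moon system is described [1] by $\\Lambda_{\\alpha}$ norms."
cleaned = clean_text_demo(sample)
print(cleaned)  # evolution earthmoon system described lambdaalpha norms
```

Because punctuation is deleted rather than replaced by a space, hyphenated words and LaTeX fragments are fused into single tokens (`earthmoon`, `lambdaalpha`). Substituting a space instead would keep the parts separate; which behavior is preferable depends on the downstream features.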

In [10]:
tqdm.pandas()

# Create a copy of the filtered data to avoid SettingWithCopyWarning when modifying
filtered_data = data[['title', 'abstract', 'categories']].copy()

# Apply the clean_text function to create 'clean_abstract'
filtered_data['clean_abstract'] = filtered_data['abstract'].progress_apply(clean_text)

filtered_data

100%|██████████| 2468403/2468403 [01:48<00:00, 22847.88it/s]

Wall time: 1min 48s





In [12]:
"""
PROCESS FOR STEMMING >1HR
"""

tqdm.pandas()

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_text(text):
    words = text.split()
    stemmed_words = [stemmer.stem(word) for word in words]
    return ' '.join(stemmed_words)

# Apply the stemming function to the cleaned text
filtered_data['stemmed_clean_abstract'] = filtered_data['clean_abstract'].progress_apply(stem_text)

filtered_data

100%|██████████| 2468403/2468403 [1:12:21<00:00, 568.60it/s] 


Unnamed: 0,title,abstract,categories,clean_abstract,stemmed_clean_abstract
0,Calculation of prompt diphoton production cros...,A fully differential calculation in perturba...,hep-ph,fully differential calculation perturbative qu...,fulli differenti calcul perturb quantum chromo...
1,Sparsity-certifying Graph Decompositions,"We describe a new algorithm, the $(k,\ell)$-...",math.CO cs.CG,describe new algorithm kellpebble game colors ...,describ new algorithm kellpebbl game color use...
2,The evolution of the Earth-Moon system based o...,The evolution of Earth-Moon system is descri...,physics.gen-ph,evolution earthmoon system described dark matt...,evolut earthmoon system describ dark matter fi...
3,A determinant of Stirling cycle numbers counts...,We show that a determinant of Stirling cycle...,math.CO,show determinant stirling cycle numbers counts...,show determin stirl cycl number count unlabel ...
4,From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...,In this paper we show how to compute the $\L...,math.CA math.FA,paper show compute lambdaalpha norm alphage 0 ...,paper show comput lambdaalpha norm alphag 0 us...
...,...,...,...,...,...
2468398,On the origin of the irreversibility line in t...,We report on measurements of the angular dep...,supr-con cond-mat.supr-con,report measurements angular dependence irrever...,report measur angular depend irrevers temperat...
2468399,Nonlinear Response of HTSC Thin Film Microwave...,The non-linear microwave surface impedance o...,supr-con cond-mat.supr-con,nonlinear microwave surface impedance patterne...,nonlinear microwav surfac imped pattern ybco t...
2468400,Critical State Flux Penetration and Linear Mic...,The vortex contribution to the dc field (H) ...,supr-con cond-mat.supr-con,vortex contribution dc field h dependent micro...,vortex contribut dc field h depend microwav su...
2468401,Density of States and NMR Relaxation Rate in A...,We show that the density of states in an ani...,supr-con cond-mat.supr-con,show density states anisotropic superconductor...,show densiti state anisotrop superconductor in...
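
Stemming takes over an hour here, and abstracts reuse the same vocabulary heavily. One possible speed-up, sketched below, is to memoize the per-word call so that each distinct word is stemmed only once. This is an untested suggestion for this dataset, and `toy_stem` is a hypothetical stand-in for `stemmer.stem` so the example runs without NLTK.

```python
from functools import lru_cache

def toy_stem(word):
    """Hypothetical stand-in for stemmer.stem: strips a trailing 's' from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

@lru_cache(maxsize=None)
def cached_stem(word):
    # The underlying stemmer runs only on a cache miss.
    return toy_stem(word)

def stem_text_cached(text):
    return ' '.join(cached_stem(word) for word in text.split())

sample_texts = ["show density states", "density states show numbers"]
stemmed = [stem_text_cached(t) for t in sample_texts]
print(stemmed)
print(cached_stem.cache_info())
```

In the notebook, `cached_stem` would wrap `stemmer.stem` instead of `toy_stem`; the cache statistics show how many calls were avoided.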


In [15]:
%%time

# Saving data

# Save DataFrame to CSV
filtered_data.to_csv('processed_texts_stemming.csv', index=False)

Wall time: 1min 47s


In [16]:
"""
PROCESS FOR LEMMATIZATION >8HR (>20HR)
"""

tqdm.pandas()

import time

import spacy

# Load the small English spaCy pipeline (tokenizer, tagger, parser, lemmatizer, NER)
nlp = spacy.load("en_core_web_sm")

def lemmatize_text(text):
    doc = nlp(text)
    lemmatized_text = ' '.join([token.lemma_ for token in doc])
    return lemmatized_text

# Batch-processing timing test on a small sample.
# ASSUMPTION: `texts` was not defined in the original cell; the sample size here is a placeholder.
texts = filtered_data['clean_abstract'].head(100).tolist()

start_time = time.time()
docs = list(nlp.pipe(texts, batch_size=100))
results_batch = [' '.join([token.lemma_ for token in doc]) for doc in docs]
print("Batch processing time:", time.time() - start_time, "seconds")

# Apply the lemmatization function to the cleaned text
filtered_data['lemmatized_clean_abstract'] = filtered_data['clean_abstract'].progress_apply(lemmatize_text)

filtered_data

Batch processing time: 5.802279233932495 seconds


100%|██████████| 2468403/2468403 [20:12:45<00:00, 33.92it/s]        


Unnamed: 0,title,abstract,categories,clean_abstract,stemmed_clean_abstract,lemmatized_clean_abstract
0,Calculation of prompt diphoton production cros...,A fully differential calculation in perturba...,hep-ph,fully differential calculation perturbative qu...,fulli differenti calcul perturb quantum chromo...,fully differential calculation perturbative qu...
1,Sparsity-certifying Graph Decompositions,"We describe a new algorithm, the $(k,\ell)$-...",math.CO cs.CG,describe new algorithm kellpebble game colors ...,describ new algorithm kellpebbl game color use...,describe new algorithm kellpebble game color u...
2,The evolution of the Earth-Moon system based o...,The evolution of Earth-Moon system is descri...,physics.gen-ph,evolution earthmoon system described dark matt...,evolut earthmoon system describ dark matter fi...,evolution earthmoon system describe dark matte...
3,A determinant of Stirling cycle numbers counts...,We show that a determinant of Stirling cycle...,math.CO,show determinant stirling cycle numbers counts...,show determin stirl cycl number count unlabel ...,show determinant stirling cycle number count u...
4,From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...,In this paper we show how to compute the $\L...,math.CA math.FA,paper show compute lambdaalpha norm alphage 0 ...,paper show comput lambdaalpha norm alphag 0 us...,paper show compute lambdaalpha norm alphage 0 ...
...,...,...,...,...,...,...
2468398,On the origin of the irreversibility line in t...,We report on measurements of the angular dep...,supr-con cond-mat.supr-con,report measurements angular dependence irrever...,report measur angular depend irrevers temperat...,report measurement angular dependence irrevers...
2468399,Nonlinear Response of HTSC Thin Film Microwave...,The non-linear microwave surface impedance o...,supr-con cond-mat.supr-con,nonlinear microwave surface impedance patterne...,nonlinear microwav surfac imped pattern ybco t...,nonlinear microwave surface impedance pattern ...
2468400,Critical State Flux Penetration and Linear Mic...,The vortex contribution to the dc field (H) ...,supr-con cond-mat.supr-con,vortex contribution dc field h dependent micro...,vortex contribut dc field h depend microwav su...,vortex contribution dc field h dependent micro...
2468401,Density of States and NMR Relaxation Rate in A...,We show that the density of states in an ani...,supr-con cond-mat.supr-con,show density states anisotropic superconductor...,show densiti state anisotrop superconductor in...,show density state anisotropic superconductor ...


In [17]:
%%time

# Saving data

# Save DataFrame to CSV
filtered_data.to_csv('processed_texts_stemming_and_lemmatization.csv', index=False)

Wall time: 2min 34s


In [18]:
# Assume the first listed category is the primary one for papers with multiple categories

filtered_data['primary_category'] = filtered_data['categories'].str.split().str[0]

filtered_data

Unnamed: 0,title,abstract,categories,clean_abstract,stemmed_clean_abstract,lemmatized_clean_abstract,primary_category
0,Calculation of prompt diphoton production cros...,A fully differential calculation in perturba...,hep-ph,fully differential calculation perturbative qu...,fulli differenti calcul perturb quantum chromo...,fully differential calculation perturbative qu...,hep-ph
1,Sparsity-certifying Graph Decompositions,"We describe a new algorithm, the $(k,\ell)$-...",math.CO cs.CG,describe new algorithm kellpebble game colors ...,describ new algorithm kellpebbl game color use...,describe new algorithm kellpebble game color u...,math.CO
2,The evolution of the Earth-Moon system based o...,The evolution of Earth-Moon system is descri...,physics.gen-ph,evolution earthmoon system described dark matt...,evolut earthmoon system describ dark matter fi...,evolution earthmoon system describe dark matte...,physics.gen-ph
3,A determinant of Stirling cycle numbers counts...,We show that a determinant of Stirling cycle...,math.CO,show determinant stirling cycle numbers counts...,show determin stirl cycl number count unlabel ...,show determinant stirling cycle number count u...,math.CO
4,From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...,In this paper we show how to compute the $\L...,math.CA math.FA,paper show compute lambdaalpha norm alphage 0 ...,paper show comput lambdaalpha norm alphag 0 us...,paper show compute lambdaalpha norm alphage 0 ...,math.CA
...,...,...,...,...,...,...,...
2468398,On the origin of the irreversibility line in t...,We report on measurements of the angular dep...,supr-con cond-mat.supr-con,report measurements angular dependence irrever...,report measur angular depend irrevers temperat...,report measurement angular dependence irrevers...,supr-con
2468399,Nonlinear Response of HTSC Thin Film Microwave...,The non-linear microwave surface impedance o...,supr-con cond-mat.supr-con,nonlinear microwave surface impedance patterne...,nonlinear microwav surfac imped pattern ybco t...,nonlinear microwave surface impedance pattern ...,supr-con
2468400,Critical State Flux Penetration and Linear Mic...,The vortex contribution to the dc field (H) ...,supr-con cond-mat.supr-con,vortex contribution dc field h dependent micro...,vortex contribut dc field h depend microwav su...,vortex contribution dc field h dependent micro...,supr-con
2468401,Density of States and NMR Relaxation Rate in A...,We show that the density of states in an ani...,supr-con cond-mat.supr-con,show density states anisotropic superconductor...,show densiti state anisotrop superconductor in...,show density state anisotropic superconductor ...,supr-con


**Note:** `primary_category` is the first category listed for each paper. Because only this first entry is kept, categories that appear only in later positions never occur as a primary category. This affects how the model is trained and may reduce the representation of less frequent categories.

In [19]:
unique_primary_categories = filtered_data['primary_category'].unique()
num_unique_primary_categories = len(unique_primary_categories)
print("Number of unique primary categories:", num_unique_primary_categories)

all_categories = filtered_data['categories'].str.split().explode()
unique_categories = all_categories.unique()
num_unique_categories = len(unique_categories)
print("Number of unique categories from all entries:", num_unique_categories)

Number of unique primary categories: 172
Number of unique categories from all entries: 176
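
The gap between the two counts (176 versus 172) means that four categories occur only in secondary positions. A set difference identifies which ones; the sketch below shows the idea on a handful of made-up category strings in the same space-separated format, not on the real column.

```python
# Made-up category strings in the same space-separated format as the 'categories' column.
category_strings = [
    "hep-ph",
    "math.CO cs.CG",
    "cs.CG math.CO",
    "supr-con cond-mat.supr-con",
    "math.CA math.FA",
]

primary = {s.split()[0] for s in category_strings}
all_seen = {c for s in category_strings for c in s.split()}

# Categories that appear somewhere but never in first position.
never_primary = sorted(all_seen - primary)
print(len(primary), len(all_seen), never_primary)
```

On the notebook's data, the equivalent expression would be `set(unique_categories) - set(unique_primary_categories)`.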


**Note:** Some categories in this dataset do not appear in the current [arXiv category taxonomy](https://arxiv.org/category_taxonomy) because they come from [older classification policies](https://arxiv.org/archive/list). The next cell maps these legacy categories to their current equivalents.

In [20]:
# Define the mapping of old categories to new categories
old_to_new_categories = {
    'acc-phys': 'physics.acc-ph',
    'adap-org': 'nlin.AO',
    'alg-geom': 'math.AG',
    'ao-sci': 'physics.ao-ph',
    'atom-ph': 'physics.atom-ph',
    'bayes-an': 'physics.data-an',
    'chao-dyn': 'nlin.CD',
    'chem-ph': 'physics.chem-ph',
    'cmp-lg': 'cs.CL',
    'comp-gas': 'nlin.CG',
    'dg-ga': 'math.DG',
    'funct-an': 'math.FA',
    'mtrl-th': 'cond-mat.mtrl-sci',
    'patt-sol': 'nlin.PS',
    'plasm-ph': 'physics.plasm-ph',
    'q-alg': 'math.QA',
    'solv-int': 'nlin.SI',
    'supr-con': 'cond-mat.supr-con'
}

# Map the old categories to the new categories
filtered_data['primary_category'] = filtered_data['primary_category'].replace(old_to_new_categories)

In [21]:
category_to_main_group = {
    'cs': 'Computer Science',
    'cs.AI': 'Computer Science', 'cs.AR': 'Computer Science', 'cs.CC': 'Computer Science',
    'cs.CE': 'Computer Science', 'cs.CG': 'Computer Science', 'cs.CL': 'Computer Science',
    'cs.CR': 'Computer Science', 'cs.CV': 'Computer Science', 'cs.CY': 'Computer Science',
    'cs.DB': 'Computer Science', 'cs.DC': 'Computer Science', 'cs.DL': 'Computer Science',
    'cs.DM': 'Computer Science', 'cs.DS': 'Computer Science', 'cs.ET': 'Computer Science',
    'cs.FL': 'Computer Science', 'cs.GL': 'Computer Science', 'cs.GR': 'Computer Science',
    'cs.GT': 'Computer Science', 'cs.HC': 'Computer Science', 'cs.IR': 'Computer Science',
    'cs.IT': 'Computer Science', 'cs.LG': 'Computer Science', 'cs.LO': 'Computer Science',
    'cs.MA': 'Computer Science', 'cs.MM': 'Computer Science', 'cs.MS': 'Computer Science',
    'cs.NA': 'Computer Science', 'cs.NE': 'Computer Science', 'cs.NI': 'Computer Science',
    'cs.OH': 'Computer Science', 'cs.OS': 'Computer Science', 'cs.PF': 'Computer Science',
    'cs.PL': 'Computer Science', 'cs.RO': 'Computer Science', 'cs.SC': 'Computer Science',
    'cs.SD': 'Computer Science', 'cs.SE': 'Computer Science', 'cs.SI': 'Computer Science',
    'cs.SY': 'Computer Science',
    
    'econ': 'Economics',
    'econ.EM': 'Economics', 'econ.GN': 'Economics', 'econ.TH': 'Economics',
    
    'eess': 'Electrical Engineering and Systems Science',
    'eess.AS': 'Electrical Engineering and Systems Science', 'eess.IV': 'Electrical Engineering and Systems Science',
    'eess.SP': 'Electrical Engineering and Systems Science', 'eess.SY': 'Electrical Engineering and Systems Science',
    
    'math': 'Mathematics',
    'math.AC': 'Mathematics', 'math.AG': 'Mathematics', 'math.AP': 'Mathematics', 'math.AT': 'Mathematics',
    'math.CA': 'Mathematics', 'math.CO': 'Mathematics', 'math.CT': 'Mathematics', 'math.CV': 'Mathematics',
    'math.DG': 'Mathematics', 'math.DS': 'Mathematics', 'math.FA': 'Mathematics', 'math.GM': 'Mathematics',
    'math.GN': 'Mathematics', 'math.GR': 'Mathematics', 'math.GT': 'Mathematics', 'math.HO': 'Mathematics',
    'math.IT': 'Mathematics', 'math.KT': 'Mathematics', 'math.LO': 'Mathematics', 'math.MG': 'Mathematics',
    'math.MP': 'Mathematics', 'math.NA': 'Mathematics', 'math.NT': 'Mathematics', 'math.OA': 'Mathematics',
    'math.OC': 'Mathematics', 'math.PR': 'Mathematics', 'math.QA': 'Mathematics', 'math.RA': 'Mathematics',
    'math.RT': 'Mathematics', 'math.SG': 'Mathematics', 'math.SP': 'Mathematics', 'math.ST': 'Mathematics',
    
    'astro-ph': 'Physics',
    'astro-ph.CO': 'Physics', 'astro-ph.EP': 'Physics', 'astro-ph.GA': 'Physics', 'astro-ph.HE': 'Physics', 
    'astro-ph.IM': 'Physics', 'astro-ph.SR': 'Physics',
    'cond-mat': 'Physics',
    'cond-mat.dis-nn': 'Physics', 'cond-mat.mes-hall': 'Physics', 'cond-mat.mtrl-sci': 'Physics', 
    'cond-mat.other': 'Physics', 'cond-mat.quant-gas': 'Physics', 'cond-mat.soft': 'Physics', 
    'cond-mat.stat-mech': 'Physics', 'cond-mat.str-el': 'Physics', 'cond-mat.supr-con': 'Physics', 
    'gr-qc': 'Physics',
    'hep-ex': 'Physics', 
    'hep-lat': 'Physics',
    'hep-ph': 'Physics',
    'hep-th': 'Physics',
    'math-ph': 'Physics', 
    'nlin': 'Physics',
    'nlin.AO': 'Physics', 'nlin.CD': 'Physics', 'nlin.CG': 'Physics', 'nlin.PS': 'Physics', 'nlin.SI': 'Physics',
    'nucl-ex': 'Physics',
    'nucl-th': 'Physics',
    'physics': 'Physics',
    'physics.acc-ph': 'Physics', 'physics.ao-ph': 'Physics', 'physics.app-ph': 'Physics', 
    'physics.atm-clus': 'Physics', 'physics.atom-ph': 'Physics', 'physics.bio-ph': 'Physics', 
    'physics.chem-ph': 'Physics', 'physics.class-ph': 'Physics', 'physics.comp-ph': 'Physics', 
    'physics.data-an': 'Physics', 'physics.ed-ph': 'Physics', 'physics.flu-dyn': 'Physics', 
    'physics.gen-ph': 'Physics', 'physics.geo-ph': 'Physics', 'physics.hist-ph': 'Physics', 
    'physics.ins-det': 'Physics', 'physics.med-ph': 'Physics', 'physics.optics': 'Physics', 
    'physics.plasm-ph': 'Physics', 'physics.pop-ph': 'Physics', 'physics.soc-ph': 'Physics', 
    'physics.space-ph': 'Physics',
    'quant-ph': 'Physics',
    
    'q-bio': 'Quantitative Biology',
    'q-bio.BM': 'Quantitative Biology', 'q-bio.CB': 'Quantitative Biology', 'q-bio.GN': 'Quantitative Biology', 
    'q-bio.MN': 'Quantitative Biology', 'q-bio.NC': 'Quantitative Biology', 'q-bio.OT': 'Quantitative Biology', 
    'q-bio.PE': 'Quantitative Biology', 'q-bio.QM': 'Quantitative Biology', 'q-bio.SC': 'Quantitative Biology', 
    'q-bio.TO': 'Quantitative Biology', 
    
    'q-fin': 'Quantitative Finance',
    'q-fin.CP': 'Quantitative Finance', 'q-fin.EC': 'Quantitative Finance', 'q-fin.GN': 'Quantitative Finance', 
    'q-fin.MF': 'Quantitative Finance', 'q-fin.PM': 'Quantitative Finance', 'q-fin.PR': 'Quantitative Finance', 
    'q-fin.RM': 'Quantitative Finance', 'q-fin.ST': 'Quantitative Finance', 'q-fin.TR': 'Quantitative Finance', 
    
    'stat': 'Statistics',
    'stat.AP': 'Statistics', 'stat.CO': 'Statistics', 'stat.ME': 'Statistics', 'stat.ML': 'Statistics', 
    'stat.OT': 'Statistics', 'stat.TH': 'Statistics'
}

filtered_data['main_primary_category'] = filtered_data['primary_category'].map(category_to_main_group)

filtered_data

Unnamed: 0,title,abstract,categories,clean_abstract,stemmed_clean_abstract,lemmatized_clean_abstract,primary_category,main_primary_category
0,Calculation of prompt diphoton production cros...,A fully differential calculation in perturba...,hep-ph,fully differential calculation perturbative qu...,fulli differenti calcul perturb quantum chromo...,fully differential calculation perturbative qu...,hep-ph,Physics
1,Sparsity-certifying Graph Decompositions,"We describe a new algorithm, the $(k,\ell)$-...",math.CO cs.CG,describe new algorithm kellpebble game colors ...,describ new algorithm kellpebbl game color use...,describe new algorithm kellpebble game color u...,math.CO,Mathematics
2,The evolution of the Earth-Moon system based o...,The evolution of Earth-Moon system is descri...,physics.gen-ph,evolution earthmoon system described dark matt...,evolut earthmoon system describ dark matter fi...,evolution earthmoon system describe dark matte...,physics.gen-ph,Physics
3,A determinant of Stirling cycle numbers counts...,We show that a determinant of Stirling cycle...,math.CO,show determinant stirling cycle numbers counts...,show determin stirl cycl number count unlabel ...,show determinant stirling cycle number count u...,math.CO,Mathematics
4,From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...,In this paper we show how to compute the $\L...,math.CA math.FA,paper show compute lambdaalpha norm alphage 0 ...,paper show comput lambdaalpha norm alphag 0 us...,paper show compute lambdaalpha norm alphage 0 ...,math.CA,Mathematics
...,...,...,...,...,...,...,...,...
2468398,On the origin of the irreversibility line in t...,We report on measurements of the angular dep...,supr-con cond-mat.supr-con,report measurements angular dependence irrever...,report measur angular depend irrevers temperat...,report measurement angular dependence irrevers...,cond-mat.supr-con,Physics
2468399,Nonlinear Response of HTSC Thin Film Microwave...,The non-linear microwave surface impedance o...,supr-con cond-mat.supr-con,nonlinear microwave surface impedance patterne...,nonlinear microwav surfac imped pattern ybco t...,nonlinear microwave surface impedance pattern ...,cond-mat.supr-con,Physics
2468400,Critical State Flux Penetration and Linear Mic...,The vortex contribution to the dc field (H) ...,supr-con cond-mat.supr-con,vortex contribution dc field h dependent micro...,vortex contribut dc field h depend microwav su...,vortex contribution dc field h dependent micro...,cond-mat.supr-con,Physics
2468401,Density of States and NMR Relaxation Rate in A...,We show that the density of states in an ani...,supr-con cond-mat.supr-con,show density states anisotropic superconductor...,show densiti state anisotrop superconductor in...,show density state anisotropic superconductor ...,cond-mat.supr-con,Physics
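
The explicit dictionary above lists every category by hand. A more compact alternative, sketched here as an assumption rather than as the notebook's method, derives the main group from the archive prefix (the part before the first dot). The names `archive_to_group`, `physics_archives`, and `main_group` are hypothetical.

```python
# Archive prefix -> main group, following the groups used in category_to_main_group.
archive_to_group = {
    'cs': 'Computer Science',
    'econ': 'Economics',
    'eess': 'Electrical Engineering and Systems Science',
    'math': 'Mathematics',
    'q-bio': 'Quantitative Biology',
    'q-fin': 'Quantitative Finance',
    'stat': 'Statistics',
}
physics_archives = {
    'astro-ph', 'cond-mat', 'gr-qc', 'hep-ex', 'hep-lat', 'hep-ph', 'hep-th',
    'math-ph', 'nlin', 'nucl-ex', 'nucl-th', 'physics', 'quant-ph',
}
archive_to_group.update({archive: 'Physics' for archive in physics_archives})

def main_group(category):
    """Return the main group for a category such as 'cs.CG', or None if the archive is unknown."""
    archive = category.split('.')[0]
    return archive_to_group.get(archive)

print(main_group('math.CO'), main_group('hep-ph'), main_group('supr-con'))
```

Unlike the explicit dictionary, this version accepts any subcategory under a known archive, so a misspelled subcategory would not be caught. Legacy names such as `supr-con` still return `None` unless they are remapped first, which makes unmapped categories easy to spot.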


In [22]:
# Calculate the percentage of papers per category
category_counts = filtered_data['main_primary_category'].value_counts(normalize=True) * 100

print(category_counts)

Physics                                       53.800777
Computer Science                              20.423812
Mathematics                                   20.052560
Statistics                                     1.943200
Electrical Engineering and Systems Science     1.921364
Quantitative Biology                           1.130245
Quantitative Finance                           0.445268
Economics                                      0.282774
Name: main_primary_category, dtype: float64
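
The distribution above is strongly imbalanced: Physics alone covers more than half of the papers, while Economics stays below 0.3%. The notebook does not state how imbalance is handled; one common option is to weight classes inversely to their frequency. The sketch below applies the usual "balanced" heuristic, `n_samples / (n_classes * class_count)`, to the percentages printed above. The variable names are illustrative.

```python
# Percentages copied from the output above.
group_share = {
    'Physics': 53.800777,
    'Computer Science': 20.423812,
    'Mathematics': 20.052560,
    'Statistics': 1.943200,
    'Electrical Engineering and Systems Science': 1.921364,
    'Quantitative Biology': 1.130245,
    'Quantitative Finance': 0.445268,
    'Economics': 0.282774,
}

n_classes = len(group_share)
total = sum(group_share.values())

# "Balanced" heuristic, expressed with shares instead of raw counts.
class_weight = {g: total / (n_classes * share) for g, share in group_share.items()}

# Print from the most frequent (lowest weight) to the rarest (highest weight) group.
for group, weight in sorted(class_weight.items(), key=lambda kv: kv[1]):
    print(f"{group:45s} {weight:8.2f}")
```

A dictionary of this shape can typically be passed to classifiers that accept per-class weights, though whether weighting helps here would need to be checked during evaluation.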


In [23]:
physics_expanded = {
    'astro-ph': 'Astrophysics',
    'astro-ph.CO': 'Astrophysics', 'astro-ph.EP': 'Astrophysics', 'astro-ph.GA': 'Astrophysics', 'astro-ph.HE': 'Astrophysics', 
    'astro-ph.IM': 'Astrophysics', 'astro-ph.SR': 'Astrophysics',
    'cond-mat': 'Condensed Matter',
    'cond-mat.dis-nn': 'Condensed Matter', 'cond-mat.mes-hall': 'Condensed Matter', 'cond-mat.mtrl-sci': 'Condensed Matter', 
    'cond-mat.other': 'Condensed Matter', 'cond-mat.quant-gas': 'Condensed Matter', 'cond-mat.soft': 'Condensed Matter', 
    'cond-mat.stat-mech': 'Condensed Matter', 'cond-mat.str-el': 'Condensed Matter', 'cond-mat.supr-con': 'Condensed Matter', 
    'gr-qc': 'General Relativity and Quantum Cosmology',
    'hep-ex': 'High Energy Physics - Experiment', 
    'hep-lat': 'High Energy Physics - Lattice',
    'hep-ph': 'High Energy Physics - Phenomenology',
    'hep-th': 'High Energy Physics - Theory',
    'math-ph': 'Mathematical Physics', 
    'nlin': 'Nonlinear Sciences',
    'nlin.AO': 'Nonlinear Sciences', 'nlin.CD': 'Nonlinear Sciences', 'nlin.CG': 'Nonlinear Sciences',
    'nlin.PS': 'Nonlinear Sciences', 'nlin.SI': 'Nonlinear Sciences',
    'nucl-ex': 'Nuclear Experiment',
    'nucl-th': 'Nuclear Theory',
    'physics': 'Physics',
    'physics.acc-ph': 'Physics', 'physics.ao-ph': 'Physics', 'physics.app-ph': 'Physics', 
    'physics.atm-clus': 'Physics', 'physics.atom-ph': 'Physics', 'physics.bio-ph': 'Physics', 
    'physics.chem-ph': 'Physics', 'physics.class-ph': 'Physics', 'physics.comp-ph': 'Physics', 
    'physics.data-an': 'Physics', 'physics.ed-ph': 'Physics', 'physics.flu-dyn': 'Physics', 
    'physics.gen-ph': 'Physics', 'physics.geo-ph': 'Physics', 'physics.hist-ph': 'Physics', 
    'physics.ins-det': 'Physics', 'physics.med-ph': 'Physics', 'physics.optics': 'Physics', 
    'physics.plasm-ph': 'Physics', 'physics.pop-ph': 'Physics', 'physics.soc-ph': 'Physics', 
    'physics.space-ph': 'Physics',
    'quant-ph': 'Quantum Physics',
}

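# Use the specific physics archive where one is defined; otherwise keep the main group.
# (Assigning the result avoids an in-place update on a selected column, which is unreliable in recent pandas.)
filtered_data['main_primary_category_physics_expanded'] = (
    filtered_data['primary_category']
    .map(physics_expanded)
    .fillna(filtered_data['main_primary_category'])
)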

filtered_data

Unnamed: 0,title,abstract,categories,clean_abstract,stemmed_clean_abstract,lemmatized_clean_abstract,primary_category,main_primary_category,main_primary_category_physics_expanded
0,Calculation of prompt diphoton production cros...,A fully differential calculation in perturba...,hep-ph,fully differential calculation perturbative qu...,fulli differenti calcul perturb quantum chromo...,fully differential calculation perturbative qu...,hep-ph,Physics,High Energy Physics - Phenomenology
1,Sparsity-certifying Graph Decompositions,"We describe a new algorithm, the $(k,\ell)$-...",math.CO cs.CG,describe new algorithm kellpebble game colors ...,describ new algorithm kellpebbl game color use...,describe new algorithm kellpebble game color u...,math.CO,Mathematics,Mathematics
2,The evolution of the Earth-Moon system based o...,The evolution of Earth-Moon system is descri...,physics.gen-ph,evolution earthmoon system described dark matt...,evolut earthmoon system describ dark matter fi...,evolution earthmoon system describe dark matte...,physics.gen-ph,Physics,Physics
3,A determinant of Stirling cycle numbers counts...,We show that a determinant of Stirling cycle...,math.CO,show determinant stirling cycle numbers counts...,show determin stirl cycl number count unlabel ...,show determinant stirling cycle number count u...,math.CO,Mathematics,Mathematics
4,From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...,In this paper we show how to compute the $\L...,math.CA math.FA,paper show compute lambdaalpha norm alphage 0 ...,paper show comput lambdaalpha norm alphag 0 us...,paper show compute lambdaalpha norm alphage 0 ...,math.CA,Mathematics,Mathematics
...,...,...,...,...,...,...,...,...,...
2468398,On the origin of the irreversibility line in t...,We report on measurements of the angular dep...,supr-con cond-mat.supr-con,report measurements angular dependence irrever...,report measur angular depend irrevers temperat...,report measurement angular dependence irrevers...,cond-mat.supr-con,Physics,Condensed Matter
2468399,Nonlinear Response of HTSC Thin Film Microwave...,The non-linear microwave surface impedance o...,supr-con cond-mat.supr-con,nonlinear microwave surface impedance patterne...,nonlinear microwav surfac imped pattern ybco t...,nonlinear microwave surface impedance pattern ...,cond-mat.supr-con,Physics,Condensed Matter
2468400,Critical State Flux Penetration and Linear Mic...,The vortex contribution to the dc field (H) ...,supr-con cond-mat.supr-con,vortex contribution dc field h dependent micro...,vortex contribut dc field h depend microwav su...,vortex contribution dc field h dependent micro...,cond-mat.supr-con,Physics,Condensed Matter
2468401,Density of States and NMR Relaxation Rate in A...,We show that the density of states in an ani...,supr-con cond-mat.supr-con,show density states anisotropic superconductor...,show densiti state anisotrop superconductor in...,show density state anisotropic superconductor ...,cond-mat.supr-con,Physics,Condensed Matter


In [24]:
# Calculate the percentage of papers per category
category_counts = filtered_data['main_primary_category_physics_expanded'].value_counts(normalize=True) * 100

print(category_counts)

Computer Science                              20.423812
Mathematics                                   20.052560
Condensed Matter                              12.573555
Astrophysics                                  12.077647
Physics                                        7.128374
High Energy Physics - Phenomenology            5.309506
Quantum Physics                                4.327940
High Energy Physics - Theory                   4.228199
General Relativity and Quantum Cosmology       2.532690
Statistics                                     1.943200
Electrical Engineering and Systems Science     1.921364
Nuclear Theory                                 1.336856
Mathematical Physics                           1.269282
Quantitative Biology                           1.130245
High Energy Physics - Experiment               0.923877
Nonlinear Sciences                             0.906051
High Energy Physics - Lattice                  0.722127
Nuclear Experiment                             0

In [None]:
%%time

# Saving data

# Save DataFrame to CSV
filtered_data.to_csv('processed_texts_complete.csv', index=False)