# Paper Topic Recognition System

## Introduction

This Jupyter notebook develops the Paper Topic Recognition System, an automated tool that classifies academic papers into predefined categories based on their text. It uses the arXiv metadata dataset, which contains over 2 million articles, and applies Natural Language Processing (NLP) and machine learning techniques to categorize papers, with the aim of making scholarly articles easier to manage and retrieve.

## Dataset

This notebook uses Version 177 of the [arXiv Dataset](https://www.kaggle.com/datasets/Cornell-University/arxiv/data?select=arxiv-metadata-oai-snapshot.json) from Kaggle, loaded from the `arxiv-metadata-oai-snapshot.json` file.

## Objectives

- **Data Preparation**: Implement preprocessing techniques to clean the dataset, removing noise and standardizing text format for further analysis.
- **Feature Extraction**: Use Term Frequency-Inverse Document Frequency (TF-IDF) for converting textual data into a structured numerical format that facilitates effective machine learning model training.
- **Model Selection and Training**: Explore and evaluate various machine learning algorithms, including Naïve Bayes, K-Nearest Neighbors, Support Vector Machines, and more advanced neural network architectures, to determine the optimal model.
- **Performance Evaluation**: Assess the models using accuracy metrics and cross-validation techniques to ensure reliability and effectiveness in paper categorization.

## Step 1: Load and Explore the Data

In [1]:
%%time

import pandas as pd

# Load JSON data into DataFrame
data = pd.read_json('arxiv-metadata-oai-snapshot.json', lines=True)

Wall time: 1min 34s
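
The load above reads every column of the snapshot in one pass, although only `title`, `abstract`, and `categories` are used later. If memory is a constraint, a chunked read may help. The following is a minimal sketch under that assumption; the helper name `load_columns` and the chunk size are illustrative and not part of the notebook:

```python
import pandas as pd


def load_columns(path, columns, chunksize=100_000):
    """Read a JSON Lines file in chunks, keeping only the requested columns.

    `load_columns` and the default `chunksize` are hypothetical; tune the
    chunk size to the available memory.
    """
    chunks = []
    # With lines=True and a chunksize, read_json yields DataFrames lazily
    for chunk in pd.read_json(path, lines=True, chunksize=chunksize):
        chunks.append(chunk[columns])
    return pd.concat(chunks, ignore_index=True)
```

It could be called as `load_columns('arxiv-metadata-oai-snapshot.json', ['title', 'abstract', 'categories'])`. Whether this is faster than the single read is not something the notebook measures; the main expected benefit is a lower peak memory footprint.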


In [2]:
# Print dimensions of DataFrame

print(data.shape)

(2468403, 14)


In [3]:
# Print all column names

print(data.columns)           

Index(['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi',
       'report-no', 'categories', 'license', 'abstract', 'versions',
       'update_date', 'authors_parsed'],
      dtype='object')


In [4]:
# Print first row

print(data.iloc[0])                

id                                                        0704.0001
submitter                                            Pavel Nadolsky
authors           C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-...
title             Calculation of prompt diphoton production cros...
comments                    37 pages, 15 figures; published version
journal-ref                                Phys.Rev.D76:013009,2007
doi                                      10.1103/PhysRevD.76.013009
report-no                                          ANL-HEP-PR-07-12
categories                                                   hep-ph
license                                                        None
abstract            A fully differential calculation in perturba...
versions          [{'version': 'v1', 'created': 'Mon, 2 Apr 2007...
update_date                                              2008-11-26
authors_parsed    [[Balázs, C., ], [Berger, E. L., ], [Nadolsky,...
Name: 0, dtype: object


In [5]:
print(data.loc[0, 'title'])         # Print the value of column 'title' from first row
print()
print(data.loc[0, 'abstract'])      # Print the value of column 'abstract' from first row
print()
print(data.loc[0, 'categories'])    # Print the value of column 'categories' from first row

Calculation of prompt diphoton production cross sections at Tevatron and
  LHC energies

  A fully differential calculation in perturbative quantum chromodynamics is
presented for the production of massive photon pairs at hadron colliders. All
next-to-leading order perturbative contributions from quark-antiquark,
gluon-(anti)quark, and gluon-gluon subprocesses are included, as well as
all-orders resummation of initial-state gluon radiation valid at
next-to-next-to-leading logarithmic accuracy. The region of phase space is
specified in which the calculation is most reliable. Good agreement is
demonstrated with data from the Fermilab Tevatron, and predictions are made for
more detailed tests with CDF and DO data. Predictions are shown for
distributions of diphoton pairs produced at the energy of the Large Hadron
Collider (LHC). Distributions of the diphoton pairs from the decay of a Higgs
boson are contrasted with those produced from QCD processes at the LHC, showing
that enhanced sensit

## Step 2: Data Cleaning and Preprocessing

In [6]:
filtered_data = data[['title', 'abstract', 'categories']]

In [7]:
# Download text data for stopwords

import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\TeoDea\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [8]:
import re
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

print(len(stop_words))
print(stop_words)

179
{'under', 'ain', 'it', 'some', 'between', "she's", "wouldn't", 'from', 'before', 'more', 'wouldn', 've', 'just', 'the', 'their', 'with', 'hers', 'doesn', 'him', 'he', "you're", 'above', 'again', 'yourselves', "won't", 'after', 'does', 'why', 'ourselves', 'here', 'at', 'd', "doesn't", "shan't", 'have', 'has', 'hasn', 'below', 'who', 'but', 'needn', 'can', 'out', 'them', 'itself', 'being', 'its', "you'll", 'weren', 'own', "didn't", 'up', 'because', "don't", 'each', 'off', "weren't", 'yourself', 'won', 'her', 'during', "you've", 'aren', 'doing', 'then', "mustn't", 'theirs', 'nor', 'both', 'what', 'that', 'where', 'shouldn', 'this', "hasn't", "that'll", 'how', 're', 'while', 'those', 't', 'in', "hadn't", 'once', 'on', 'no', 'will', 'as', 'into', 'few', 'couldn', 'mustn', 'was', 'are', 'too', 'through', 'be', 'mightn', 'when', 'other', 'were', 'himself', 'i', 'my', 'which', 'such', 'only', 'about', 'm', 'am', 'shan', "mightn't", 'or', 'of', 'haven', 'by', 'had', 'you', "needn't", 'me', 

In [9]:
def clean_text(text):
    text = text.lower()                                                        # Convert to lowercase
    text = re.sub(r'\[.*?\]', '', text)                                        # Remove text inside square brackets
    text = re.sub(r'[^a-z0-9\s]', '', text)                                    # Remove non-alphanumeric characters
    text = ' '.join([word for word in text.split() if word not in stop_words]) # Remove stopwords
    return text
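
To see what `clean_text` does to a single sentence, the sketch below reruns the same four steps on a made-up example. It uses a tiny hypothetical stop-word set (`DEMO_STOP_WORDS`) instead of the NLTK list so that it runs without any download. It also shows one side effect: because punctuation is deleted rather than replaced, hyphenated words are fused into one token. The second function, `clean_text_keep_boundaries`, is an assumed alternative and not the notebook's method.

```python
import re

# Hypothetical miniature stop-word list; the notebook uses NLTK's English list
DEMO_STOP_WORDS = {"the", "of", "is", "by", "a"}


def clean_text(text, stop_words=DEMO_STOP_WORDS):
    """Same steps as the notebook's clean_text, with an injectable stop-word set."""
    text = text.lower()                          # Convert to lowercase
    text = re.sub(r'\[.*?\]', '', text)          # Remove text inside square brackets
    text = re.sub(r'[^a-z0-9\s]', '', text)      # Delete non-alphanumeric characters
    return ' '.join(word for word in text.split() if word not in stop_words)


def clean_text_keep_boundaries(text, stop_words=DEMO_STOP_WORDS):
    """Variant (an assumption): punctuation becomes a space, so words stay separate."""
    text = text.lower()
    text = re.sub(r'\[.*?\]', ' ', text)
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    return ' '.join(word for word in text.split() if word not in stop_words)


sample = "The evolution of the Earth-Moon system [1] is described by a model."
print(clean_text(sample))                  # evolution earthmoon system described model
print(clean_text_keep_boundaries(sample))  # evolution earth moon system described model
```

Which behavior is preferable depends on the vocabulary one wants: fusing keeps compound terms together but also creates tokens out of LaTeX fragments, while splitting produces more short tokens.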

In [10]:
%%time

# Create a copy of the filtered data to avoid SettingWithCopyWarning when modifying
filtered_data = data[['title', 'abstract', 'categories']].copy()

# Apply the clean_text function to create 'clean_abstract'
filtered_data['clean_abstract'] = filtered_data['abstract'].apply(clean_text)

filtered_data

Wall time: 1min 38s


| | title | abstract | categories | clean_abstract |
|---|---|---|---|---|
| 0 | Calculation of prompt diphoton production cros... | A fully differential calculation in perturba... | hep-ph | fully differential calculation perturbative qu... |
| 1 | Sparsity-certifying Graph Decompositions | We describe a new algorithm, the $(k,\ell)$-... | math.CO cs.CG | describe new algorithm kellpebble game colors ... |
| 2 | The evolution of the Earth-Moon system based o... | The evolution of Earth-Moon system is descri... | physics.gen-ph | evolution earthmoon system described dark matt... |
| 3 | A determinant of Stirling cycle numbers counts... | We show that a determinant of Stirling cycle... | math.CO | show determinant stirling cycle numbers counts... |
| 4 | From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a... | In this paper we show how to compute the $\L... | math.CA math.FA | paper show compute lambdaalpha norm alphage 0 ... |
| ... | ... | ... | ... | ... |
| 2468398 | On the origin of the irreversibility line in t... | We report on measurements of the angular dep... | supr-con cond-mat.supr-con | report measurements angular dependence irrever... |
| 2468399 | Nonlinear Response of HTSC Thin Film Microwave... | The non-linear microwave surface impedance o... | supr-con cond-mat.supr-con | nonlinear microwave surface impedance patterne... |
| 2468400 | Critical State Flux Penetration and Linear Mic... | The vortex contribution to the dc field (H) ... | supr-con cond-mat.supr-con | vortex contribution dc field h dependent micro... |
| 2468401 | Density of States and NMR Relaxation Rate in A... | We show that the density of states in an ani... | supr-con cond-mat.supr-con | show density states anisotropic superconductor... |


In [11]:
# Assume the first listed category is the primary category for papers with multiple categories

filtered_data['primary_category'] = filtered_data['categories'].str.split().str[0]

filtered_data

| | title | abstract | categories | clean_abstract | primary_category |
|---|---|---|---|---|---|
| 0 | Calculation of prompt diphoton production cros... | A fully differential calculation in perturba... | hep-ph | fully differential calculation perturbative qu... | hep-ph |
| 1 | Sparsity-certifying Graph Decompositions | We describe a new algorithm, the $(k,\ell)$-... | math.CO cs.CG | describe new algorithm kellpebble game colors ... | math.CO |
| 2 | The evolution of the Earth-Moon system based o... | The evolution of Earth-Moon system is descri... | physics.gen-ph | evolution earthmoon system described dark matt... | physics.gen-ph |
| 3 | A determinant of Stirling cycle numbers counts... | We show that a determinant of Stirling cycle... | math.CO | show determinant stirling cycle numbers counts... | math.CO |
| 4 | From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a... | In this paper we show how to compute the $\L... | math.CA math.FA | paper show compute lambdaalpha norm alphage 0 ... | math.CA |
| ... | ... | ... | ... | ... | ... |
| 2468398 | On the origin of the irreversibility line in t... | We report on measurements of the angular dep... | supr-con cond-mat.supr-con | report measurements angular dependence irrever... | supr-con |
| 2468399 | Nonlinear Response of HTSC Thin Film Microwave... | The non-linear microwave surface impedance o... | supr-con cond-mat.supr-con | nonlinear microwave surface impedance patterne... | supr-con |
| 2468400 | Critical State Flux Penetration and Linear Mic... | The vortex contribution to the dc field (H) ... | supr-con cond-mat.supr-con | vortex contribution dc field h dependent micro... | supr-con |
| 2468401 | Density of States and NMR Relaxation Rate in A... | We show that the density of states in an ani... | supr-con cond-mat.supr-con | show density states anisotropic superconductor... | supr-con |


**Note:** `primary_category` keeps only the first category listed for each paper, so any secondary categories are discarded and some categories never appear as a primary category. The model is therefore trained on a single label per paper, which may under-represent less frequent categories.
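
The effect of this choice can be checked on a few rows. The sketch below uses three category strings taken from the table above and adds a hypothetical helper column, `n_categories`, that counts how many labels each paper lists; any value above 1 means labels were dropped.

```python
import pandas as pd

demo = pd.DataFrame({
    'categories': ['hep-ph', 'math.CO cs.CG', 'supr-con cond-mat.supr-con']
})

# Same expression as in the notebook: first whitespace-separated token
demo['primary_category'] = demo['categories'].str.split().str[0]

# Hypothetical helper column: number of categories listed per paper
demo['n_categories'] = demo['categories'].str.split().str.len()

print(demo)
```

On the full data, `(filtered_data['categories'].str.split().str.len() > 1).mean()` would give the share of papers that carry more than one category.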

In [12]:
unique_primary_categories = filtered_data['primary_category'].unique()
num_unique_primary_categories = len(unique_primary_categories)
print("Number of unique primary categories:", num_unique_primary_categories)

all_categories = filtered_data['categories'].str.split().explode()
unique_categories = all_categories.unique()
num_unique_categories = len(unique_categories)
print("Number of unique categories from all entries:", num_unique_categories)

Number of unique primary categories: 172
Number of unique categories from all entries: 176


**Note:** Some categories in this dataset are not included in the current [arXiv category taxonomy](https://arxiv.org/category_taxonomy) because they are based on [older classification policies](https://arxiv.org/archive/list).

In [13]:
# Define the mapping of old categories to new categories
old_to_new_categories = {
    'acc-phys': 'physics.acc-ph',
    'adap-org': 'nlin.AO',
    'alg-geom': 'math.AG',
    'ao-sci': 'physics.ao-ph',
    'atom-ph': 'physics.atom-ph',
    'bayes-an': 'physics.data-an',
    'chao-dyn': 'nlin.CD',
    'chem-ph': 'physics.chem-ph',
    'cmp-lg': 'cs.CL',
    'comp-gas': 'nlin.CG',
    'dg-ga': 'math.DG',
    'funct-an': 'math.FA',
    'mtrl-th': 'cond-mat.mtrl-sci',
    'patt-sol': 'nlin.PS',
    'plasm-ph': 'physics.plasm-ph',
    'q-alg': 'math.QA',
    'solv-int': 'nlin.SI',
    'supr-con': 'cond-mat.supr-con'
}

# Map the old categories to the new categories
filtered_data['primary_category'] = filtered_data['primary_category'].replace(old_to_new_categories)
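
`replace` is a sensible choice for this step because values that are not keys of the dictionary pass through unchanged, whereas `map` would turn them into `NaN`. A small sketch of the difference, using two entries from the mapping above on a made-up Series:

```python
import pandas as pd

# Two entries copied from old_to_new_categories
legacy_map = {'supr-con': 'cond-mat.supr-con', 'q-alg': 'math.QA'}
primary = pd.Series(['supr-con', 'hep-ph', 'q-alg'])

replaced = primary.replace(legacy_map)  # 'hep-ph' is kept as-is
mapped = primary.map(legacy_map)        # 'hep-ph' becomes NaN

print(replaced.tolist())
print(mapped.tolist())
```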

In [14]:
category_to_main_group = {
    'cs': 'Computer Science',
    'cs.AI': 'Computer Science', 'cs.AR': 'Computer Science', 'cs.CC': 'Computer Science',
    'cs.CE': 'Computer Science', 'cs.CG': 'Computer Science', 'cs.CL': 'Computer Science',
    'cs.CR': 'Computer Science', 'cs.CV': 'Computer Science', 'cs.CY': 'Computer Science',
    'cs.DB': 'Computer Science', 'cs.DC': 'Computer Science', 'cs.DL': 'Computer Science',
    'cs.DM': 'Computer Science', 'cs.DS': 'Computer Science', 'cs.ET': 'Computer Science',
    'cs.FL': 'Computer Science', 'cs.GL': 'Computer Science', 'cs.GR': 'Computer Science',
    'cs.GT': 'Computer Science', 'cs.HC': 'Computer Science', 'cs.IR': 'Computer Science',
    'cs.IT': 'Computer Science', 'cs.LG': 'Computer Science', 'cs.LO': 'Computer Science',
    'cs.MA': 'Computer Science', 'cs.MM': 'Computer Science', 'cs.MS': 'Computer Science',
    'cs.NA': 'Computer Science', 'cs.NE': 'Computer Science', 'cs.NI': 'Computer Science',
    'cs.OH': 'Computer Science', 'cs.OS': 'Computer Science', 'cs.PF': 'Computer Science',
    'cs.PL': 'Computer Science', 'cs.RO': 'Computer Science', 'cs.SC': 'Computer Science',
    'cs.SD': 'Computer Science', 'cs.SE': 'Computer Science', 'cs.SI': 'Computer Science',
    'cs.SY': 'Computer Science',
    
    'econ': 'Economics',
    'econ.EM': 'Economics', 'econ.GN': 'Economics', 'econ.TH': 'Economics',
    
    'eess': 'Electrical Engineering and Systems Science',
    'eess.AS': 'Electrical Engineering and Systems Science', 'eess.IV': 'Electrical Engineering and Systems Science',
    'eess.SP': 'Electrical Engineering and Systems Science', 'eess.SY': 'Electrical Engineering and Systems Science',
    
    'math': 'Mathematics',
    'math.AC': 'Mathematics', 'math.AG': 'Mathematics', 'math.AP': 'Mathematics', 'math.AT': 'Mathematics',
    'math.CA': 'Mathematics', 'math.CO': 'Mathematics', 'math.CT': 'Mathematics', 'math.CV': 'Mathematics',
    'math.DG': 'Mathematics', 'math.DS': 'Mathematics', 'math.FA': 'Mathematics', 'math.GM': 'Mathematics',
    'math.GN': 'Mathematics', 'math.GR': 'Mathematics', 'math.GT': 'Mathematics', 'math.HO': 'Mathematics',
    'math.IT': 'Mathematics', 'math.KT': 'Mathematics', 'math.LO': 'Mathematics', 'math.MG': 'Mathematics',
    'math.MP': 'Mathematics', 'math.NA': 'Mathematics', 'math.NT': 'Mathematics', 'math.OA': 'Mathematics',
    'math.OC': 'Mathematics', 'math.PR': 'Mathematics', 'math.QA': 'Mathematics', 'math.RA': 'Mathematics',
    'math.RT': 'Mathematics', 'math.SG': 'Mathematics', 'math.SP': 'Mathematics', 'math.ST': 'Mathematics',
    
    'astro-ph': 'Physics',
    'astro-ph.CO': 'Physics', 'astro-ph.EP': 'Physics', 'astro-ph.GA': 'Physics', 'astro-ph.HE': 'Physics', 
    'astro-ph.IM': 'Physics', 'astro-ph.SR': 'Physics',
    'cond-mat': 'Physics',
    'cond-mat.dis-nn': 'Physics', 'cond-mat.mes-hall': 'Physics', 'cond-mat.mtrl-sci': 'Physics', 
    'cond-mat.other': 'Physics', 'cond-mat.quant-gas': 'Physics', 'cond-mat.soft': 'Physics', 
    'cond-mat.stat-mech': 'Physics', 'cond-mat.str-el': 'Physics', 'cond-mat.supr-con': 'Physics', 
    'gr-qc': 'Physics',
    'hep-ex': 'Physics', 
    'hep-lat': 'Physics',
    'hep-ph': 'Physics',
    'hep-th': 'Physics',
    'math-ph': 'Physics', 
    'nlin': 'Physics',
    'nlin.AO': 'Physics', 'nlin.CD': 'Physics', 'nlin.CG': 'Physics', 'nlin.PS': 'Physics', 'nlin.SI': 'Physics',
    'nucl-ex': 'Physics',
    'nucl-th': 'Physics',
    'physics': 'Physics',
    'physics.acc-ph': 'Physics', 'physics.ao-ph': 'Physics', 'physics.app-ph': 'Physics', 
    'physics.atm-clus': 'Physics', 'physics.atom-ph': 'Physics', 'physics.bio-ph': 'Physics', 
    'physics.chem-ph': 'Physics', 'physics.class-ph': 'Physics', 'physics.comp-ph': 'Physics', 
    'physics.data-an': 'Physics', 'physics.ed-ph': 'Physics', 'physics.flu-dyn': 'Physics', 
    'physics.gen-ph': 'Physics', 'physics.geo-ph': 'Physics', 'physics.hist-ph': 'Physics', 
    'physics.ins-det': 'Physics', 'physics.med-ph': 'Physics', 'physics.optics': 'Physics', 
    'physics.plasm-ph': 'Physics', 'physics.pop-ph': 'Physics', 'physics.soc-ph': 'Physics', 
    'physics.space-ph': 'Physics',
    'quant-ph': 'Physics',
    
    'q-bio': 'Quantitative Biology',
    'q-bio.BM': 'Quantitative Biology', 'q-bio.CB': 'Quantitative Biology', 'q-bio.GN': 'Quantitative Biology', 
    'q-bio.MN': 'Quantitative Biology', 'q-bio.NC': 'Quantitative Biology', 'q-bio.OT': 'Quantitative Biology', 
    'q-bio.PE': 'Quantitative Biology', 'q-bio.QM': 'Quantitative Biology', 'q-bio.SC': 'Quantitative Biology', 
    'q-bio.TO': 'Quantitative Biology', 
    
    'q-fin': 'Quantitative Finance',
    'q-fin.CP': 'Quantitative Finance', 'q-fin.EC': 'Quantitative Finance', 'q-fin.GN': 'Quantitative Finance', 
    'q-fin.MF': 'Quantitative Finance', 'q-fin.PM': 'Quantitative Finance', 'q-fin.PR': 'Quantitative Finance', 
    'q-fin.RM': 'Quantitative Finance', 'q-fin.ST': 'Quantitative Finance', 'q-fin.TR': 'Quantitative Finance', 
    
    'stat': 'Statistics',
    'stat.AP': 'Statistics', 'stat.CO': 'Statistics', 'stat.ME': 'Statistics', 'stat.ML': 'Statistics', 
    'stat.OT': 'Statistics', 'stat.TH': 'Statistics'
}

filtered_data['main_primary_category'] = filtered_data['primary_category'].map(category_to_main_group)

filtered_data

| | title | abstract | categories | clean_abstract | primary_category | main_primary_category |
|---|---|---|---|---|---|---|
| 0 | Calculation of prompt diphoton production cros... | A fully differential calculation in perturba... | hep-ph | fully differential calculation perturbative qu... | hep-ph | Physics |
| 1 | Sparsity-certifying Graph Decompositions | We describe a new algorithm, the $(k,\ell)$-... | math.CO cs.CG | describe new algorithm kellpebble game colors ... | math.CO | Mathematics |
| 2 | The evolution of the Earth-Moon system based o... | The evolution of Earth-Moon system is descri... | physics.gen-ph | evolution earthmoon system described dark matt... | physics.gen-ph | Physics |
| 3 | A determinant of Stirling cycle numbers counts... | We show that a determinant of Stirling cycle... | math.CO | show determinant stirling cycle numbers counts... | math.CO | Mathematics |
| 4 | From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a... | In this paper we show how to compute the $\L... | math.CA math.FA | paper show compute lambdaalpha norm alphage 0 ... | math.CA | Mathematics |
| ... | ... | ... | ... | ... | ... | ... |
| 2468398 | On the origin of the irreversibility line in t... | We report on measurements of the angular dep... | supr-con cond-mat.supr-con | report measurements angular dependence irrever... | cond-mat.supr-con | Physics |
| 2468399 | Nonlinear Response of HTSC Thin Film Microwave... | The non-linear microwave surface impedance o... | supr-con cond-mat.supr-con | nonlinear microwave surface impedance patterne... | cond-mat.supr-con | Physics |
| 2468400 | Critical State Flux Penetration and Linear Mic... | The vortex contribution to the dc field (H) ... | supr-con cond-mat.supr-con | vortex contribution dc field h dependent micro... | cond-mat.supr-con | Physics |
| 2468401 | Density of States and NMR Relaxation Rate in A... | We show that the density of states in an ani... | supr-con cond-mat.supr-con | show density states anisotropic superconductor... | cond-mat.supr-con | Physics |
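
Because `Series.map` returns `NaN` for any category missing from `category_to_main_group`, and `value_counts` drops `NaN` by default, an incomplete mapping would not show up in a percentage breakdown. A minimal sketch of a coverage check, on made-up data in which `'not-a-category'` is a deliberately invalid value:

```python
import pandas as pd

# Two entries copied from category_to_main_group
group_map = {'hep-ph': 'Physics', 'math.CO': 'Mathematics'}
primary = pd.Series(['hep-ph', 'math.CO', 'not-a-category'])  # last value is made up

groups = primary.map(group_map)

# Categories that received no main group, with their frequencies
unmapped = primary[groups.isna()].value_counts()
print(unmapped)

# Percentages still sum to 100 because the NaN row is silently ignored
shares = groups.value_counts(normalize=True) * 100
print(shares.sum())
```

On the notebook's data the equivalent check would be `filtered_data.loc[filtered_data['main_primary_category'].isna(), 'primary_category'].value_counts()`; an empty result means the mapping is complete.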


In [15]:
# Calculate the percentage of papers per category
category_counts = filtered_data['main_primary_category'].value_counts(normalize=True) * 100

print(category_counts)

Physics                                       53.800777
Computer Science                              20.423812
Mathematics                                   20.052560
Statistics                                     1.943200
Electrical Engineering and Systems Science     1.921364
Quantitative Biology                           1.130245
Quantitative Finance                           0.445268
Economics                                      0.282774
Name: main_primary_category, dtype: float64


In [16]:
physics_expanded = {
    'astro-ph': 'Astrophysics',
    'astro-ph.CO': 'Astrophysics', 'astro-ph.EP': 'Astrophysics', 'astro-ph.GA': 'Astrophysics', 'astro-ph.HE': 'Astrophysics', 
    'astro-ph.IM': 'Astrophysics', 'astro-ph.SR': 'Astrophysics',
    'cond-mat': 'Condensed Matter',
    'cond-mat.dis-nn': 'Condensed Matter', 'cond-mat.mes-hall': 'Condensed Matter', 'cond-mat.mtrl-sci': 'Condensed Matter', 
    'cond-mat.other': 'Condensed Matter', 'cond-mat.quant-gas': 'Condensed Matter', 'cond-mat.soft': 'Condensed Matter', 
    'cond-mat.stat-mech': 'Condensed Matter', 'cond-mat.str-el': 'Condensed Matter', 'cond-mat.supr-con': 'Condensed Matter', 
    'gr-qc': 'General Relativity and Quantum Cosmology',
    'hep-ex': 'High Energy Physics - Experiment', 
    'hep-lat': 'High Energy Physics - Lattice',
    'hep-ph': 'High Energy Physics - Phenomenology',
    'hep-th': 'High Energy Physics - Theory',
    'math-ph': 'Mathematical Physics', 
    'nlin': 'Nonlinear Sciences',
    'nlin.AO': 'Nonlinear Sciences', 'nlin.CD': 'Nonlinear Sciences', 'nlin.CG': 'Nonlinear Sciences',
    'nlin.PS': 'Nonlinear Sciences', 'nlin.SI': 'Nonlinear Sciences',
    'nucl-ex': 'Nuclear Experiment',
    'nucl-th': 'Nuclear Theory',
    'physics': 'Physics',
    'physics.acc-ph': 'Physics', 'physics.ao-ph': 'Physics', 'physics.app-ph': 'Physics', 
    'physics.atm-clus': 'Physics', 'physics.atom-ph': 'Physics', 'physics.bio-ph': 'Physics', 
    'physics.chem-ph': 'Physics', 'physics.class-ph': 'Physics', 'physics.comp-ph': 'Physics', 
    'physics.data-an': 'Physics', 'physics.ed-ph': 'Physics', 'physics.flu-dyn': 'Physics', 
    'physics.gen-ph': 'Physics', 'physics.geo-ph': 'Physics', 'physics.hist-ph': 'Physics', 
    'physics.ins-det': 'Physics', 'physics.med-ph': 'Physics', 'physics.optics': 'Physics', 
    'physics.plasm-ph': 'Physics', 'physics.pop-ph': 'Physics', 'physics.soc-ph': 'Physics', 
    'physics.space-ph': 'Physics',
    'quant-ph': 'Quantum Physics',
}

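# Use the physics sub-field where one is defined, otherwise fall back to the broad group.
# Assigning the column directly avoids the chained .update() call, which may not
# modify the DataFrame when pandas copy-on-write is enabled.
filtered_data['main_primary_category_physics_expanded'] = (
    filtered_data['primary_category']
    .map(physics_expanded)
    .fillna(filtered_data['main_primary_category'])
)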

filtered_data

| | title | abstract | categories | clean_abstract | primary_category | main_primary_category | main_primary_category_physics_expanded |
|---|---|---|---|---|---|---|---|
| 0 | Calculation of prompt diphoton production cros... | A fully differential calculation in perturba... | hep-ph | fully differential calculation perturbative qu... | hep-ph | Physics | High Energy Physics - Phenomenology |
| 1 | Sparsity-certifying Graph Decompositions | We describe a new algorithm, the $(k,\ell)$-... | math.CO cs.CG | describe new algorithm kellpebble game colors ... | math.CO | Mathematics | Mathematics |
| 2 | The evolution of the Earth-Moon system based o... | The evolution of Earth-Moon system is descri... | physics.gen-ph | evolution earthmoon system described dark matt... | physics.gen-ph | Physics | Physics |
| 3 | A determinant of Stirling cycle numbers counts... | We show that a determinant of Stirling cycle... | math.CO | show determinant stirling cycle numbers counts... | math.CO | Mathematics | Mathematics |
| 4 | From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a... | In this paper we show how to compute the $\L... | math.CA math.FA | paper show compute lambdaalpha norm alphage 0 ... | math.CA | Mathematics | Mathematics |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2468398 | On the origin of the irreversibility line in t... | We report on measurements of the angular dep... | supr-con cond-mat.supr-con | report measurements angular dependence irrever... | cond-mat.supr-con | Physics | Condensed Matter |
| 2468399 | Nonlinear Response of HTSC Thin Film Microwave... | The non-linear microwave surface impedance o... | supr-con cond-mat.supr-con | nonlinear microwave surface impedance patterne... | cond-mat.supr-con | Physics | Condensed Matter |
| 2468400 | Critical State Flux Penetration and Linear Mic... | The vortex contribution to the dc field (H) ... | supr-con cond-mat.supr-con | vortex contribution dc field h dependent micro... | cond-mat.supr-con | Physics | Condensed Matter |
| 2468401 | Density of States and NMR Relaxation Rate in A... | We show that the density of states in an ani... | supr-con cond-mat.supr-con | show density states anisotropic superconductor... | cond-mat.supr-con | Physics | Condensed Matter |


In [17]:
# Calculate the percentage of papers per category
category_counts = filtered_data['main_primary_category_physics_expanded'].value_counts(normalize=True) * 100

print(category_counts)

Computer Science                              20.423812
Mathematics                                   20.052560
Condensed Matter                              12.573555
Astrophysics                                  12.077647
Physics                                        7.128374
High Energy Physics - Phenomenology            5.309506
Quantum Physics                                4.327940
High Energy Physics - Theory                   4.228199
General Relativity and Quantum Cosmology       2.532690
Statistics                                     1.943200
Electrical Engineering and Systems Science     1.921364
Nuclear Theory                                 1.336856
Mathematical Physics                           1.269282
Quantitative Biology                           1.130245
High Energy Physics - Experiment               0.923877
Nonlinear Sciences                             0.906051
High Energy Physics - Lattice                  0.722127
Nuclear Experiment                             0

## Step 3: Feature Extraction

### Term Frequency-Inverse Document Frequency Algorithm

$$
\text{TF}(t,d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}
$$



$$
\text{IDF}(t,D) = \log \left(\frac{\text{Total number of documents in database } D}{\text{Number of documents containing term } t}\right)
$$



$$
\text{TF-IDF}(t,d,D) = \text{TF}(t,d) \times \text{IDF}(t,D)
$$
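
The three formulas can be evaluated by hand on a small example. The sketch below implements them literally in plain Python on a made-up three-document corpus; the function names `tf`, `idf`, and `tf_idf` are chosen here for illustration.

```python
import math

# Toy corpus (made up for illustration); each document is already tokenized
corpus = [
    ['quantum', 'field', 'theory'],
    ['quantum', 'computing'],
    ['graph', 'theory'],
]


def tf(term, doc):
    """Occurrences of the term divided by the number of terms in the document."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """Log of (number of documents / number of documents containing the term).

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


print(tf_idf('quantum', corpus[0], corpus))  # (1/3) * log(3/2)
print(tf_idf('field', corpus[0], corpus))    # (1/3) * log(3/1)
```

'field' scores higher than 'quantum' in the first document because it appears in only one of the three documents. Note that scikit-learn's `TfidfVectorizer` does not reproduce these numbers exactly: with its default settings it uses raw term counts for TF, a smoothed IDF of $\ln\frac{1 + n}{1 + \text{df}(t)} + 1$, and L2-normalizes each document vector.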

In [18]:
%%time

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()  # Optionally set max_features=n to keep only the n most frequent terms
X = vectorizer.fit_transform(filtered_data['clean_abstract'])

Wall time: 2min 54s


In [19]:
print(X.shape)
print()
print(X)

(2468403, 2533742)

  (0, 862214)	0.08293958103318441
  (0, 2053486)	0.08477614820253011
  (0, 1255217)	0.15425399375918483
  (0, 1700110)	0.05708105537141631
  (0, 2091813)	0.07774524903088945
  (0, 2062517)	0.08559299877567954
  (0, 833661)	0.0859757624682085
  (0, 2078509)	0.08317188656619424
  (0, 1871777)	0.07105973350003987
  (0, 1906474)	0.08678840190399148
  (0, 629785)	0.13927501475951487
  (0, 482962)	0.09599230270028729
  (0, 1094689)	0.09290685873532283
  (0, 690052)	0.0770407066309956
  (0, 1355661)	0.18858842144346227
  (0, 608215)	0.09943011567757727
  (0, 1329830)	0.05116499284727827
  (0, 831665)	0.05368928437291294
  (0, 1873028)	0.1648194589162823
  (0, 733405)	0.2935533199314207
  (0, 742313)	0.14997577890625052
  (0, 2078525)	0.06101527913274819
  (0, 532687)	0.1309657936327278
  (0, 2252260)	0.0906301440417239
  (0, 716578)	0.07994490981635656
  :	:
  (2468402, 954626)	0.09193228023753006
  (2468402, 1339921)	0.07848372512069199
  (2468402, 1666253)	0.100133888564
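
The matrix has 2,533,742 columns, which is more than the number of documents. One common way to shrink such a vocabulary is to drop terms that occur in very few documents. The sketch below shows the `min_df` parameter of `TfidfVectorizer` on a made-up corpus; the threshold of 2 is an illustrative assumption, not a value tuned for this dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    'quantum field theory',
    'quantum computing theory',
    'graph theory zzzxq',  # 'zzzxq' stands in for a one-off token
]

# min_df=2 drops every term that occurs in fewer than two documents
vectorizer_small = TfidfVectorizer(min_df=2)
X_small = vectorizer_small.fit_transform(docs)

print(sorted(vectorizer_small.vocabulary_))  # ['quantum', 'theory']
print(X_small.shape)                         # (3, 2)
```

Whether pruning helps or hurts accuracy here would need to be checked on the validation set.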

In [20]:
from sklearn.preprocessing import LabelEncoder

# Encode the target labels as integers
label_encoder = LabelEncoder()
Y = label_encoder.fit_transform(filtered_data['main_primary_category']) # Or use 'main_primary_category_physics_expanded' for the finer physics split
labels = label_encoder.classes_  # Class names, ordered by their integer code
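
`LabelEncoder` assigns integer codes in sorted order of the class names, which is what allows the class names to be recovered by position later. A small sketch on made-up labels:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(['Physics', 'Mathematics', 'Physics', 'Computer Science'])

print(list(codes))             # [2, 1, 2, 0]
print(list(encoder.classes_))  # ['Computer Science', 'Mathematics', 'Physics']

# Decoding the codes 0..n-1 returns the class names in the same order as classes_
decoded = encoder.inverse_transform([0, 1, 2])
print(list(decoded))
```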

## Step 4: Model Training and Evaluation

### Choose Data

The target vector `Y` and the class names `labels` were created in cell 20 by label-encoding `main_primary_category`. To train on the finer physics split instead, encode `main_primary_category_physics_expanded` in that cell.

### Splitting Data

In [21]:
from sklearn.model_selection import train_test_split

# Hold out 30% of the data; the remaining 70% is the training set
X_train, X_temp, Y_train, Y_temp = train_test_split(X, Y, test_size=0.30, random_state=220199)

# Split the held-out 30% evenly into validation and test sets (15% each)
X_val, X_test, Y_val, Y_test = train_test_split(X_temp, Y_temp, test_size=0.50, random_state=220199)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Validation set size: {X_val.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")

Training set size: 1727882 samples
Validation set size: 370260 samples
Testing set size: 370261 samples
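
The split above is purely random. Since the classes are strongly imbalanced (Economics accounts for roughly 0.28% of the papers), a stratified split is one option for keeping class proportions identical across the three sets. The sketch below shows the `stratify` argument of `train_test_split` on made-up data; at the notebook's sample size a random split is probably already close, so this is a possible refinement rather than a required fix.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 90 samples of class 0 and 10 of class 1
X_demo = np.arange(100).reshape(-1, 1)
Y_demo = np.array([0] * 90 + [1] * 10)

# 70% training, 30% held out, preserving the 90/10 class ratio
X_tr, X_tmp, Y_tr, Y_tmp = train_test_split(
    X_demo, Y_demo, test_size=0.30, random_state=0, stratify=Y_demo)

# Held-out part split evenly into validation and test, again stratified
X_va, X_te, Y_va, Y_te = train_test_split(
    X_tmp, Y_tmp, test_size=0.50, random_state=0, stratify=Y_tmp)

# Share of the minority class in each set
print(Y_tr.mean(), Y_va.mean(), Y_te.mean())
```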


### Naïve Bayes Classifier

In [22]:
%%time

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Train the classifier
nb_classifier = MultinomialNB(alpha=0.1) # alpha < 1 is Lidstone smoothing (alpha = 1 would be Laplace smoothing)
nb_classifier.fit(X_train, Y_train)

# Predict validation and test set
Y_val_pred = nb_classifier.predict(X_val)
Y_test_pred = nb_classifier.predict(X_test)

# Decode labels
Y_test_decoded = label_encoder.inverse_transform(Y_test)
Y_test_pred_decoded = label_encoder.inverse_transform(Y_test_pred)

# Calculate accuracy
val_accuracy = accuracy_score(Y_val, Y_val_pred)
test_accuracy = accuracy_score(Y_test, Y_test_pred)
print(f"Validation Accuracy: {val_accuracy:.2f}")
print(f"Test Accuracy: {test_accuracy:.2f}")

# Generate classification report
print(classification_report(Y_test_decoded, Y_test_pred_decoded, target_names=labels))

Validation Accuracy: 0.88
Test Accuracy: 0.88
                                            precision    recall  f1-score   support

                          Computer Science       0.74      0.94      0.83     75787
                                 Economics       0.00      0.00      0.00      1018
Electrical Engineering and Systems Science       0.58      0.03      0.06      7230
                               Mathematics       0.86      0.90      0.88     74433
                                   Physics       0.97      0.94      0.95    198731
                      Quantitative Biology       0.81      0.18      0.30      4183
                      Quantitative Finance       0.91      0.05      0.10      1648
                                Statistics       0.74      0.17      0.28      7231

                                  accuracy                           0.88    370261
                                 macro avg       0.70      0.40      0.43    370261
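
The gap between the 0.88 accuracy and the macro averages follows directly from how the two are computed. The sketch below redoes the arithmetic from the per-class rows of the report: the macro average weights all eight classes equally, while accuracy equals the support-weighted average of the per-class recalls and is therefore dominated by Physics, Computer Science, and Mathematics. The inputs are the rounded values printed above, so the results are approximate.

```python
# (precision, recall, f1, support) copied from the classification report
report = {
    'Computer Science': (0.74, 0.94, 0.83, 75787),
    'Economics': (0.00, 0.00, 0.00, 1018),
    'Electrical Engineering and Systems Science': (0.58, 0.03, 0.06, 7230),
    'Mathematics': (0.86, 0.90, 0.88, 74433),
    'Physics': (0.97, 0.94, 0.95, 198731),
    'Quantitative Biology': (0.81, 0.18, 0.30, 4183),
    'Quantitative Finance': (0.91, 0.05, 0.10, 1648),
    'Statistics': (0.74, 0.17, 0.28, 7231),
}

n_classes = len(report)
total_support = sum(s for _, _, _, s in report.values())

# Macro average: unweighted mean over classes
macro_precision = sum(p for p, _, _, _ in report.values()) / n_classes
macro_recall = sum(r for _, r, _, _ in report.values()) / n_classes
macro_f1 = sum(f for _, _, f, _ in report.values()) / n_classes

# Support-weighted recall, which equals overall accuracy
weighted_recall = sum(r * s for _, r, _, s in report.values()) / total_support

print(f"macro precision {macro_precision:.3f}, recall {macro_recall:.3f}, f1 {macro_f1:.3f}")
print(f"support-weighted recall {weighted_recall:.3f} over {total_support} samples")
```

The macro values come out at about 0.701, 0.401, and 0.425, matching the report's 0.70, 0.40, and 0.43, and the weighted recall is about 0.88. The five smaller classes have recalls between 0.00 and 0.18, which the accuracy figure alone hides.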
                            