## Retrieving arXiv Papers

In [1]:
!pip install feedparser



## Filtering the Search to Papers in the CS Category

In [47]:
# Harvest window: 2023-01-01 to 2024-12-31

import time
import requests
import xml.etree.ElementTree as ET
import pandas as pd

# Base OAI-PMH URL for arXiv
base_url = 'http://export.arxiv.org/oai2?verb=ListRecords'

# Search parameters for OAI-PMH (for Computer Science category and metadata format 'oai_dc')
metadata_format = 'oai_dc'
category = 'cs'  # Computer Science category
from_date = '2023-01-01'  # Start date (YYYY-MM-DD)
until_date = '2024-12-31'  # End date (YYYY-MM-DD)

# List to store the data
data = []

# Initial request URL (without a resumptionToken)
url = f'{base_url}&metadataPrefix={metadata_format}&from={from_date}&until={until_date}&set={category}'

while True:
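    # Fetch the data (time out rather than hang, and stop on HTTP errors
    # instead of trying to parse an error page as XML)
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    response_text = response.text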
    
    # Parse the response using ElementTree
    root = ET.fromstring(response_text)

    # Find all the 'record' elements in the XML
    records = root.findall('.//{http://www.openarchives.org/OAI/2.0/}record')
    
    # Iterate over each record and extract relevant fields
    for record in records:
        metadata = record.find('.//{http://www.openarchives.org/OAI/2.0/oai_dc/}dc')
        
        if metadata is not None:
            # findtext returns the default when the element is missing
            title = metadata.findtext('.//{http://purl.org/dc/elements/1.1/}title', default='N/A')
            summary = metadata.findtext('.//{http://purl.org/dc/elements/1.1/}description', default='N/A')
            updated = record.find('.//{http://www.openarchives.org/OAI/2.0/}datestamp').text
            # Only the first dc:subject is kept; it is named 'subject' so it does
            # not overwrite the 'category' set name defined above
            subject = metadata.findtext('.//{http://purl.org/dc/elements/1.1/}subject', default='N/A')
            
            # Append the data to the list
            data.append({
                'Title': title,
                'Summary': summary,
                'Updated': updated,
                'Category': subject
            })

    # Find resumptionToken for pagination (if available)
    resumption_token = root.find('.//{http://www.openarchives.org/OAI/2.0/}resumptionToken')

    # If there is no resumptionToken, we have retrieved all the data
    if resumption_token is None or resumption_token.text is None:
        break

    # Update the URL with the resumptionToken for the next batch of records
    token = resumption_token.text
    url = f'{base_url}&resumptionToken={token}'

    # Sleep for 3 seconds to avoid overwhelming the server
    print("Sleeping for 3 seconds before the next request...")
    time.sleep(3)

# Convert the list of entries into a pandas DataFrame
df = pd.DataFrame(data)

# Convert the 'Updated' column to datetime format
df['Updated'] = pd.to_datetime(df['Updated'])

# Display the DataFrame
display(df)


Sleeping for 3 seconds before the next request...


Unnamed: 0,Title,Summary,Updated,Category
0,Third Order Newton's Method for Zernike Polyno...,The Zernike radial polynomials are a system ...,2024-04-23,Mathematics - Numerical Analysis
1,A recursive linear time modular decomposition ...,A module of a graph G is a set of vertices t...,2024-07-15,Computer Science - Discrete Mathematics
2,"Tur\'an Graphs, Stability Number, and Fibonacc...",The Fibonacci index of a graph is the number...,2024-03-11,Computer Science - Discrete Mathematics
3,Projective Expected Utility,Motivated by several classic decision-theore...,2024-01-18,Quantum Physics
4,A Universal In-Place Reconfiguration Algorithm...,In the modular robot reconfiguration problem...,2024-03-15,Computer Science - Computational Geometry
...,...,...,...,...
217847,On Approximately Symmetric Informationally Com...,We address the problem of constructing posit...,2023-11-27,Quantum Physics
217848,Limitations of Quantum Coset States for Graph ...,It has been known for some time that graph i...,2023-11-27,Quantum Physics
217849,Computing Local Invariants of Qubit Systems,We investigate means to describe the non-loc...,2023-11-27,Quantum Physics
217850,Fast Quantum Fourier Transforms for a Class of...,An algorithm is presented allowing the const...,2023-11-27,Quantum Physics
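
The per-page parsing done inside the loop above can also be written as a small function that works on one response body at a time, which makes it possible to check the extraction logic without contacting the server. The following is a minimal sketch, not the notebook's own code: `parse_oai_page`, `NS`, and `SAMPLE` are hypothetical names, and the sample XML is a made-up response with placeholder values.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes for OAI-PMH responses in the oai_dc metadata format
NS = {
    'oai': 'http://www.openarchives.org/OAI/2.0/',
    'oai_dc': 'http://www.openarchives.org/OAI/2.0/oai_dc/',
    'dc': 'http://purl.org/dc/elements/1.1/',
}

def parse_oai_page(xml_text):
    """Return (rows, resumption_token) for one ListRecords response body."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.findall('.//oai:record', NS):
        metadata = record.find('.//oai_dc:dc', NS)
        if metadata is None:  # records without a metadata block are skipped
            continue
        rows.append({
            'Title': metadata.findtext('dc:title', default='N/A', namespaces=NS),
            'Summary': metadata.findtext('dc:description', default='N/A', namespaces=NS),
            'Updated': record.findtext('.//oai:datestamp', default='N/A', namespaces=NS),
            'Category': metadata.findtext('dc:subject', default='N/A', namespaces=NS),
        })
    # A missing or empty resumptionToken element means this was the last page
    token = root.findtext('.//oai:resumptionToken', default='', namespaces=NS)
    return rows, (token.strip() or None)

# Made-up response used only to exercise the parser (placeholder values)
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><datestamp>2023-01-24</datestamp></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example title</dc:title>
          <dc:subject>Computer Science - Machine Learning</dc:subject>
          <dc:description>Example abstract.</dc:description>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>example-token</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

rows, token = parse_oai_page(SAMPLE)
print(rows[0]['Title'], token)
```

With a function like this, the `while True` loop reduces to fetching a page, extending `data` with the returned rows, and stopping when the returned token is `None`.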


In [5]:
# 2023 (first page only, limited to 100 records)

import requests
import xml.etree.ElementTree as ET
import pandas as pd

# Base OAI-PMH URL for arXiv
base_url = 'http://export.arxiv.org/oai2?verb=ListRecords'

# Search parameters for OAI-PMH (for Computer Science category and metadata format 'oai_dc')
metadata_format = 'oai_dc'
category = 'cs'  # Computer Science category
from_date = '2023-01-01'  # Start date (YYYY-MM-DD)
until_date = '2023-12-31'  # End date (YYYY-MM-DD)

# List to store the data
data = []

# Initial request URL (without a resumptionToken)
url = f'{base_url}&metadataPrefix={metadata_format}&from={from_date}&until={until_date}&set={category}'

# Fetch the data (time out rather than hang, and stop on HTTP errors)
response = requests.get(url, timeout=60)
response.raise_for_status()
response_text = response.text

# Parse the response using ElementTree
root = ET.fromstring(response_text)

# Find all the 'record' elements in the XML
records = root.findall('.//{http://www.openarchives.org/OAI/2.0/}record')

# Iterate over each record and extract relevant fields (limit to the first 100)
for record in records[:100]:
    metadata = record.find('.//{http://www.openarchives.org/OAI/2.0/oai_dc/}dc')
    
    if metadata is not None:
        # findtext returns the default when the element is missing
        title = metadata.findtext('.//{http://purl.org/dc/elements/1.1/}title', default='N/A')
        summary = metadata.findtext('.//{http://purl.org/dc/elements/1.1/}description', default='N/A')
        updated = record.find('.//{http://www.openarchives.org/OAI/2.0/}datestamp').text
        # Only the first dc:subject is kept; it is named 'subject' so it does
        # not overwrite the 'category' set name defined above
        subject = metadata.findtext('.//{http://purl.org/dc/elements/1.1/}subject', default='N/A')
        
        # Extract the identifier (article link)
        identifier = metadata.findtext('.//{http://purl.org/dc/elements/1.1/}identifier', default='N/A')
        
        # Convert arXiv identifier to URL
        if identifier == 'N/A':
            link = 'N/A'  # No identifier, so no link can be built
        elif identifier.startswith('http://arxiv.org/abs/'):
            link = identifier
        elif identifier.startswith('arXiv:'):
            link = f'http://arxiv.org/abs/{identifier[6:]}'
        else:
            link = f'http://arxiv.org/abs/{identifier}'
        
        # Append the data to the list
        data.append({
            'Title': title,
            'Summary': summary,
            'Updated': updated,
            'Category': subject,
            'Link': link
        })

# Convert the list of entries into a pandas DataFrame
df = pd.DataFrame(data)

# Convert the 'Updated' column to datetime format
df['Updated'] = pd.to_datetime(df['Updated'])

# Display the DataFrame
display(df)

Unnamed: 0,Title,Summary,Updated,Category,Link
0,On the `Semantics' of Differential Privacy: A ...,"Differential privacy is a definition of ""pri...",2023-01-24,Computer Science - Cryptography and Security,http://arxiv.org/abs/0803.3946
1,On the Dual Formulation of Boosting Algorithms,We study boosting algorithms from a new pers...,2023-05-30,Computer Science - Machine Learning,http://arxiv.org/abs/0901.3590
2,Domain Adaptation: Learning Bounds and Algorithms,This paper addresses the general problem of ...,2023-12-04,Computer Science - Machine Learning,http://arxiv.org/abs/0902.3430
3,"Q-system Cluster Algebras, Paths and Total Pos...",We review the solution of the $A_r$ Q-system...,2023-07-12,Economics - General Economics,http://arxiv.org/abs/0906.3421
4,Optimization of Synthesis Oversampled Complex ...,An important issue with oversampled FIR anal...,2023-01-19,Computer Science - Information Theory,http://arxiv.org/abs/0907.3654
...,...,...,...,...,...
95,Graphs of Edge-Intersecting Non-Splitting Path...,Given a tree and a set P of non-trivial simp...,2023-06-22,Computer Science - Discrete Mathematics,http://arxiv.org/abs/1309.6471
96,Multiuser Diversity for Secrecy Communications...,"In this paper, we propose opportunistic jamm...",2023-07-19,Computer Science - Information Theory,http://arxiv.org/abs/1309.7451
97,When does a physical system compute?,Computing is a high-level process of a physi...,2023-04-21,Computer Science - Emerging Technologies,http://arxiv.org/abs/1309.7979
98,SVD Factorization for Tall-and-Fat Matrices on...,We demonstrate an implementation for an appr...,2023-04-04,"Computer Science - Distributed, Parallel, and ...",http://arxiv.org/abs/1310.4664
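
The identifier-to-link branching in the cell above can be isolated in a helper so its cases are easy to check one by one. This is a sketch under assumptions: `arxiv_link` is a hypothetical name, and returning `None` for a missing identifier (rather than a URL built from a placeholder) is a design choice made here, not something the notebook does.

```python
def arxiv_link(identifier):
    """Map a dc:identifier value to an arXiv abstract URL, or None if absent."""
    if not identifier or identifier == 'N/A':
        return None  # nothing to link to
    if identifier.startswith(('http://arxiv.org/abs/', 'https://arxiv.org/abs/')):
        return identifier  # already a full abstract URL
    if identifier.startswith('arXiv:'):
        identifier = identifier[len('arXiv:'):]  # strip the 'arXiv:' prefix
    return f'http://arxiv.org/abs/{identifier}'

print(arxiv_link('arXiv:0803.3946'))
print(arxiv_link('http://arxiv.org/abs/0901.3590'))
print(arxiv_link('N/A'))
```

A `None` link shows up as a missing value in the DataFrame, which can be easier to filter on later than a placeholder string.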


In [6]:
# Saving the DataFrame to a CSV file
df.to_csv('100_arxiv_cs_papers_2023_2023.csv', index=False)

In [3]:
# Saving the DataFrame to a CSV file
df.to_csv('arxiv_cs_papers_2022.csv', index=False)

In [4]:
display(df)

Unnamed: 0,Title,Summary,Updated,Category
0,Solution of moving-boundary problems by the sp...,This paper describes a novel numerical model...,2022-09-29,"Computer Science - Computational Engineering, ..."
1,A Polynomial Time Algorithm for Graph Isomorphism,We claimed that there is a polynomial algori...,2022-10-18,Computer Science - Computational Complexity
2,On the Frame Error Rate of Transmission Scheme...,It is known that the frame error rate of tur...,2022-03-08,Computer Science - Information Theory
3,Diversity-Integration Trade-offs in MIMO Detec...,"In this work, a MIMO detection problem is co...",2022-03-09,Computer Science - Other Computer Science
4,Perfect tag identification protocol in RFID ne...,Radio Frequency IDentification (RFID) system...,2022-07-04,Computer Science - Networking and Internet Arc...
...,...,...,...,...
82440,Bounded-Error Quantum State Identification and...,We consider the problem of bounded-error qua...,2022-03-29,Quantum Physics
82441,Strengths and Weaknesses of Quantum Fingerprin...,We study the power of quantum fingerprints i...,2022-03-29,Quantum Physics
82442,On the Role of Shared Entanglement,Despite the apparent similarity between shar...,2022-03-29,Quantum Physics
82443,Exponential Separation of Quantum and Classica...,We give an exponential separation between on...,2022-03-29,Quantum Physics


In [5]:
# Get the counts of all unique entries in the 'Category' column
category_counts = df['Category'].value_counts()

# Print the counts of all unique entries in the 'Category' column
print("\nCounts of all unique entries in column 'Category':")
print(category_counts)



Counts of all unique entries in column 'Category':
Category
Computer Science - Computer Vision and Pattern Recognition    14353
Computer Science - Machine Learning                           13769
Computer Science - Computation and Language                    7032
Computer Science - Robotics                                    3719
Mathematics - Numerical Analysis                               2928
                                                              ...  
Quantitative Biology - Cell Behavior                              1
Condensed Matter - Other Condensed Matter                         1
Nonlinear Sciences - Pattern Formation and Solitons               1
Physics - Atomic and Molecular Clusters                           1
Mathematics - Operator Algebras                                   1
Name: count, Length: 146, dtype: int64
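
The counts above include non-CS labels such as "Mathematics - Numerical Analysis" even though the harvest was restricted to the `cs` set. A likely reason is that the extraction keeps only the first `dc:subject` of each record, while a record may carry several. The sketch below collects every subject instead; `all_subjects`, `has_cs_subject`, and the sample element are hypothetical and only illustrate the idea.

```python
import xml.etree.ElementTree as ET

DC = '{http://purl.org/dc/elements/1.1/}'

def all_subjects(metadata):
    """Return the text of every dc:subject directly under a metadata element."""
    return [el.text.strip() for el in metadata.findall(f'{DC}subject') if el.text]

def has_cs_subject(subjects):
    """True if any subject label starts with 'Computer Science'."""
    return any(s.startswith('Computer Science') for s in subjects)

# Made-up metadata element with a non-CS first subject and a CS second subject
sample = ET.fromstring(
    '<dc xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<dc:subject>Quantum Physics</dc:subject>'
    '<dc:subject>Computer Science - Computational Complexity</dc:subject>'
    '</dc>'
)

subjects = all_subjects(sample)
print(subjects, has_cs_subject(subjects))
```

Storing the full list (for example joined with `'; '`) would let a later filter keep papers whose CS label is not listed first.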


In [6]:
# Filter entries that start with "Computer Science"
filtered_df = df[df['Category'].str.startswith('Computer Science')]

# Display the filtered DataFrame
print("\nFiltered DataFrame where 'Category' starts with 'Computer Science':")
display(filtered_df)


Filtered DataFrame where 'Category' starts with 'Computer Science':


Unnamed: 0,Title,Summary,Updated,Category
0,Solution of moving-boundary problems by the sp...,This paper describes a novel numerical model...,2022-09-29,"Computer Science - Computational Engineering, ..."
1,A Polynomial Time Algorithm for Graph Isomorphism,We claimed that there is a polynomial algori...,2022-10-18,Computer Science - Computational Complexity
2,On the Frame Error Rate of Transmission Scheme...,It is known that the frame error rate of tur...,2022-03-08,Computer Science - Information Theory
3,Diversity-Integration Trade-offs in MIMO Detec...,"In this work, a MIMO detection problem is co...",2022-03-09,Computer Science - Other Computer Science
4,Perfect tag identification protocol in RFID ne...,Radio Frequency IDentification (RFID) system...,2022-07-04,Computer Science - Networking and Internet Arc...
...,...,...,...,...
82430,Notes for Miscellaneous Lectures,Here I share a few notes I used in various c...,2022-07-20,Computer Science - Discrete Mathematics
82431,Throughput Optimal Distributed Control of Stoc...,This paper has been withdrawn by the author ...,2022-04-26,Computer Science - Networking and Internet Arc...
82432,On the intersection of additive perfect codes,The intersection problem for additive (exten...,2022-04-26,Computer Science - Information Theory
82433,A Union Bound Approximation for Rapid Performa...,"In this paper, we present a simple technique...",2022-03-08,Computer Science - Information Theory
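
One caveat with the `str.startswith` filter: if the DataFrame is reloaded from CSV, `pd.read_csv` by default reads the string `N/A` as a missing value, and a boolean mask containing missing values cannot be used for indexing. Passing `na=False` treats missing categories as non-matches. A small sketch on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Toy data: one CS row, one non-CS row, one row with a missing category
toy = pd.DataFrame({
    'Title': ['a', 'b', 'c'],
    'Category': ['Computer Science - Robotics', 'Quantum Physics', None],
})

# na=False makes missing categories evaluate to False instead of NaN
mask = toy['Category'].str.startswith('Computer Science', na=False)
cs_only = toy[mask]
print(cs_only)
```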


In [9]:
# Convert the filtered DataFrame to a CSV file
filtered_df.to_csv('arxiv_cs_papers_2022_clean.csv', index=False)

In [10]:
import pandas as pd

# Define the paths for the CSV files
path_2023_2024 = '/Users/bryansoh/Documents/NUS/PLP/Group Project/plp_practice_proj-1/arxiv/datasets/arxiv_cs_papers_2023_2024_clean.csv'
path_2022 = '/Users/bryansoh/Documents/NUS/PLP/Group Project/plp_practice_proj-1/arxiv/arxiv_cs_papers_2022_clean.csv'

# Load the first CSV file
df1 = pd.read_csv(path_2023_2024)

# Load the second CSV file
df2 = pd.read_csv(path_2022)

# Combine the two DataFrames
combined_df = pd.concat([df1, df2])

# Save the combined DataFrame to a new CSV file
combined_df.to_csv('arxiv_cs_papers_2022_2024_clean.csv', index=False)
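
Because `pd.concat` keeps each input's original index by default, the combined frame above has repeated index labels. If the same paper happens to appear in both files, it is also kept twice. The sketch below shows one way to handle both on made-up data; the frames `a` and `b` and the choice of `Title` plus `Updated` as the duplicate key are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for the two cleaned CSV files
a = pd.DataFrame({'Title': ['x', 'y'], 'Updated': ['2023-01-01', '2024-02-02']})
b = pd.DataFrame({'Title': ['y', 'z'], 'Updated': ['2024-02-02', '2022-03-03']})

# ignore_index=True renumbers rows 0..n-1 instead of repeating each input's index
combined = pd.concat([a, b], ignore_index=True)

# Drop rows that repeat the same (Title, Updated) pair, keeping the first
deduped = combined.drop_duplicates(subset=['Title', 'Updated']).reset_index(drop=True)
print(len(combined), len(deduped))
```

Since the CSV is written with `index=False`, the repeated index does not affect the saved file; it matters mainly for label-based lookups on `combined_df` in memory.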

In [11]:
display(combined_df)

Unnamed: 0,Title,Summary,Updated,Category
0,A recursive linear time modular decomposition ...,A module of a graph G is a set of vertices t...,2024-07-15,Computer Science - Discrete Mathematics
1,"Tur\'an Graphs, Stability Number, and Fibonacc...",The Fibonacci index of a graph is the number...,2024-03-11,Computer Science - Discrete Mathematics
2,A Universal In-Place Reconfiguration Algorithm...,In the modular robot reconfiguration problem...,2024-03-15,Computer Science - Computational Geometry
3,New probabilistic interest measures for associ...,Mining association rules is an important tec...,2024-01-01,Computer Science - Databases
4,On the `Semantics' of Differential Privacy: A ...,"Differential privacy is a definition of ""pri...",2023-01-24,Computer Science - Cryptography and Security
...,...,...,...,...
64669,Notes for Miscellaneous Lectures,Here I share a few notes I used in various c...,2022-07-20,Computer Science - Discrete Mathematics
64670,Throughput Optimal Distributed Control of Stoc...,This paper has been withdrawn by the author ...,2022-04-26,Computer Science - Networking and Internet Arc...
64671,On the intersection of additive perfect codes,The intersection problem for additive (exten...,2022-04-26,Computer Science - Information Theory
64672,A Union Bound Approximation for Rapid Performa...,"In this paper, we present a simple technique...",2022-03-08,Computer Science - Information Theory
