# Data collection process

This notebook scrapes the arXiv website for papers in the category "cs.CV" (Computer Vision), "stat.ML" / "cs.LG" (Machine Learning) and "cs.AI" (Artificial Intelligence). The papers are then saved in a csv file.

In [1]:
import arxiv
import pandas as pd

from tqdm import tqdm
from pathlib import Path

In [None]:
PATH_DATA_BASE = Path.cwd().parent / "data"

## Scraping the arXiv website

Let's start by defining a list of keywords that we will use to query the arXiv API.

In [2]:
query_keywords = [
    "\"image segmentation\"",
    "\"self-supervised learning\"",
    "\"representation learning\"",
    "\"image generation\"",
    "\"object detection\"",
    "\"transfer learning\"",
    "\"transformers\"",
    "\"adversarial training",
    "\"generative adversarial networks\"",
    "\"model compressions\"",
    "\"image segmentation\"",
    "\"few-shot learning\"",
    "\"natural language\"",
    "\"graph\"",
    "\"colorization\"",
    "\"depth estimation\"",
    "\"point cloud\"",
    "\"structured data\"",
    "\"optical flow\"",
    "\"reinforcement learning\"",
    "\"super resolution\"",
    "\"attention\"",
    "\"tabular\"",
    "\"unsupervised learning\"",
    "\"semi-supervised learning\"",
    "\"explainable\"",
    "\"radiance field\"",
    "\"decision tree\"",
    "\"time series\"",
    "\"molecule\"",
    "\"large language models\"",
    "\"llms\"",
    "\"language models\"",
    "\"image classification\"",
    "\"document image classification\"",
    "\"encoder\"",
    "\"decoder\"",
    "\"multimodal\"",
    "\"multimodal deep learning\"",
]

Afterwards, we define a function that creates a search object using the given query. It sets the maximum number of results for each category to 6000 and sorts them by the last updated date. 

In [3]:
client = arxiv.Client(num_retries=20, page_size=500)


def query_with_keywords(query) -> tuple:
    """
    Query the arXiv API for research papers based on a specific query and filter results by selected categories.
    
    Args:
        query (str): The search query to be used for fetching research papers from arXiv.
    
    Returns:
        tuple: A tuple containing three lists - terms, titles, and abstracts of the filtered research papers.
        
            terms (list): A list of lists, where each inner list contains the categories associated with a research paper.
            titles (list): A list of titles of the research papers.
            abstracts (list): A list of abstracts (summaries) of the research papers.
            urls (list): A list of URLs for the papers' detail page on the arXiv website.
    """
    
    # Create a search object with the query and sorting parameters.
    search = arxiv.Search(
        query=query,
        max_results=6000,
        sort_by=arxiv.SortCriterion.LastUpdatedDate
    )
    
    # Initialize empty lists for terms, titles, abstracts, and urls.
    terms = []
    titles = []
    abstracts = []
    urls = []

    # For each result in the search...
    for res in tqdm(client.results(search), desc=query):
        # Check if the primary category of the result is in the specified list.
        if res.primary_category in ["cs.CV", "stat.ML", "cs.LG", "cs.AI"]:
            # If it is, append the result's categories, title, summary, and url to their respective lists.
            terms.append(res.categories)
            titles.append(res.title)
            abstracts.append(res.summary)
            urls.append(res.entry_id)

    # Return the four lists.
    return terms, titles, abstracts, urls

In [4]:
all_titles = []
all_abstracts = []
all_terms = []
all_urls = []

for query in query_keywords:
    terms, titles, abstracts, urls = query_with_keywords(query)
    all_titles.extend(titles)
    all_abstracts.extend(abstracts)
    all_terms.extend(terms)
    all_urls.extend(urls)

"image segmentation": 3100it [00:49, 62.37it/s]
"self-supervised learning": 0it [00:03, ?it/s]
"representation learning": 6586it [01:48, 60.87it/s]
"image generation": 2286it [00:38, 59.80it/s]
"object detection": 7055it [02:06, 55.82it/s]
"transfer learning": 5346it [01:24, 63.51it/s]
"transformers": 30000it [09:13, 54.18it/s]
"adversarial training: 0it [00:03, ?it/s]
"generative adversarial networks": 5800it [01:42, 56.82it/s]
"model compressions": 761it [00:16, 45.18it/s]
"image segmentation": 3100it [00:47, 65.53it/s]
"few-shot learning": 0it [00:03, ?it/s]
"natural language": 14041it [04:16, 54.81it/s]
"graph": 30000it [08:41, 57.56it/s]
"colorization": 30000it [08:26, 59.21it/s]
"depth estimation": 1318it [00:20, 63.67it/s]
"point cloud": 4776it [01:11, 66.43it/s]
"structured data": 2020it [00:34, 59.02it/s]
"optical flow": 1596it [00:25, 61.72it/s]
"reinforcement learning": 17300it [04:49, 59.70it/s]
"super resolution": 3100it [00:48, 64.55it/s]
"attention": 30000it [08:50, 56.5

Now, we create a pandas.DataFrame object to store the results.

In [5]:
arxiv_data = pd.DataFrame({
    'titles': all_titles,
    'abstracts': all_abstracts,
    'terms': all_terms,
    'urls': all_urls
})

Finally, we export the DataFrame to a csv file.

In [6]:
arxiv_data.to_csv(PATH_DATA_BASE / 'data.csv', index=False)