# Example #1: Scraping papers from online archives

This notebook demonstrates how to use the `Scraper` class to scrape papers from online archives.

In [1]:
%load_ext autoreload
%autoreload 2

The warnings you see below are a result of loading the `paperscraper` package and can be safely ignored.

In [2]:
import os
os.chdir("..")

from src.scraper import Scraper



In order to scrape from online resources, you must specify the following:
* **Keywords to search for**: A list of search terms. This parameter also accepts lists of lists, so multiple keyword lists can be collected into a single list and submitted together.
* **Start and end dates for search**: The dates must be in YYYY-MM-DD format.
* **Archives**: A list of archives to scrape from. Currently, only PubMed and ArXiv are supported, but others may be added later.
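As a rough illustration of the inputs listed above, the sketch below parses both dates and shows a nested keyword list. The helper `check_dates` is hypothetical and not part of `Scraper`, and the keyword terms are illustrative only. How `Scraper` combines the inner lists is not described here, so no particular behavior is assumed.

```python
from datetime import date, datetime


def check_dates(start_date: str, end_date: str) -> tuple[date, date]:
    """Parse two date strings with the %Y-%m-%d pattern and check their order.

    Hypothetical helper, not part of the Scraper class.
    Raises ValueError if a string cannot be parsed or the range is reversed.
    """
    start = datetime.strptime(start_date, "%Y-%m-%d").date()
    end = datetime.strptime(end_date, "%Y-%m-%d").date()
    if start > end:
        raise ValueError(f"start_date {start_date} is after end_date {end_date}")
    return start, end


# A flat list with a single search term, as used later in this notebook.
single_term = ["active inference"]

# A list of lists: several keyword lists collected into one list.
# These terms are made up for illustration.
keyword_lists = [["active inference", "free energy principle"], ["robotics"]]

start, end = check_dates("2025-01-01", "2025-02-01")
print((end - start).days)  # length of the search window in days
```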

**WARNING**: The scrapers use each archive's API. Do not run concurrent requests or too many large requests at once. PubMed is relatively fast, but ArXiv can be very slow and may take 30 minutes or more for a large query. When testing this function, it is recommended that you use a small date range and a single search term. The following code should take at least 40 seconds to run to completion.
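One way to follow this advice is to break a long date range into short windows and scrape them one after another. The helper `split_date_range` below is a hypothetical sketch, not part of `Scraper`, and the 7-day window is an arbitrary choice. It treats both ends of each window as inclusive; whether `Scraper` treats `end_date` as inclusive is an assumption you would need to check.

```python
from datetime import date, timedelta


def split_date_range(start_date: str, end_date: str, days: int = 7) -> list[tuple[str, str]]:
    """Split [start_date, end_date] into consecutive windows of at most `days` days.

    Hypothetical helper; dates are YYYY-MM-DD strings and windows do not overlap.
    """
    if days < 1:
        raise ValueError("days must be at least 1")
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    windows = []
    current = start
    while current <= end:
        # Clamp the last window so it never runs past end_date.
        window_end = min(current + timedelta(days=days - 1), end)
        windows.append((current.isoformat(), window_end.isoformat()))
        current = window_end + timedelta(days=1)
    return windows


windows = split_date_range("2025-01-01", "2025-02-01", days=7)
for window_start, window_end in windows:
    # Each window could be passed as start_date/end_date to a Scraper in turn,
    # so that only one modest request is in flight at a time.
    print(window_start, window_end)
```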

After initializing the `Scraper` class, use the `run()` method to run the scraper on the specified terms. If the `return_results` parameter is set to `True`, the results are returned from the function as an object (shown below). If `return_results` is `False` and an `outpath` is provided, no object is returned and the results are exported to CSV at that path. If `return_results` is `False` and no `outpath` is specified, an export path in `data/tables/` is generated from a timestamp.
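The three cases above can be summarized in a small sketch. This is not the `Scraper` implementation: `resolve_export_path` and the timestamp file-name pattern are hypothetical, and the sketch assumes nothing is exported when `return_results` is `True`, which the description above does not state.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def resolve_export_path(
    return_results: bool,
    outpath: Optional[str] = None,
    now: Optional[datetime] = None,
) -> Optional[Path]:
    """Return the CSV path to export to, or None if results are returned instead.

    Hypothetical helper mirroring the three cases described in the text.
    """
    if return_results:
        # Results are handed back as an object (assumption: no export happens).
        return None
    if outpath is not None:
        # An explicit export path was given.
        return Path(outpath)
    # No path given: build one from a timestamp under data/tables/.
    # The file-name pattern here is an assumption.
    now = now or datetime.now()
    return Path("data/tables") / f"{now:%Y%m%d_%H%M%S}.csv"


print(resolve_export_path(True))
print(resolve_export_path(False, "out/results.csv"))
print(resolve_export_path(False))
```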

In [3]:
scraper = Scraper(
    keywords=["active inference"],   # List[Union[str, List[str]]]
    start_date="2025-01-01",         # str in YYYY-MM-DD format
    end_date="2025-02-01",           # str in YYYY-MM-DD format
    archives=["pubmed", "arxiv"],    # List[str]
)

df = scraper.run(return_results=True)
df

INFO:src.scraper:Scraping from PubMed...
INFO:src.scraper:Scraping from ArXiv...


Processing all:active inference AND submittedDate:[202501010000 TO 202502010000]: 1117it [00:54, 20.36it/s]


| | title | authors | where_published | year | doi |
|---|---|---|---|---|---|
| 0 | The role of the intraparietal sulcus in numera... | Erin Duricy, Corrine Durisko, Julie AFiez | Behavioural brain research | 2025 | 10.1016/j.bbr.2025.115453 |
| 1 | Premovement activity in the corticospinal trac... | Mehran Emadi Andani, Miriam Braga, Francesco D... | Social cognitive and affective neuroscience | 2025 | 10.1093/scan/nsaf014 |
| 2 | APBIO: bioactive profiling of air pollutants t... | Eva Viesi, Ugo Perricone, Patrick Aloy, Rosalb... | Journal of cheminformatics | 2025 | 10.1186/s13321-025-00961-1 |
| 3 | A deep learning model for assistive decision-m... | David Martínez-Pascual, José MCatalán, Luís DL... | Journal of neuroengineering and rehabilitation | 2025 | 10.1186/s12984-024-01517-4 |
| 4 | Phosphoproteomics delineates hepatocellular ca... | Ze Zhang, Zhenpeng Zhang, Yao Zhang, Yuan Li, ... | Hepatology (Baltimore, Md.) | 2025 | 10.1097/HEP.0000000000001250 |
| ... | ... | ... | ... | ... | ... |
| 1112 | The ENUBET monitored neutrino beam and its imp... | ENUBET collaboration, L. Halić, F. Acerbi, I. ... | arXiv | 2025 | 10.48550/arXiv.2501.04531 |
| 1113 | Measurements of the Temperature and E-mode Pol... | T. -L. Chou, P. A. R. Ade, A. J. Anderson, J. ... | arXiv | 2025 | 10.48550/arXiv.2501.06890 |
| 1114 | Discovery of Ancient Globular Cluster Candidat... | Katherine E. Whitaker, Sam E. Cutler, Rupali C... | arXiv | 2025 | 10.48550/arXiv.2501.07627 |
| 1115 | FAUST XX. The chemical structure and temperatu... | J. Frediani, M. De Simone, L. Testi, L. Podio,... | arXiv | 2025 | 10.48550/arXiv.2501.19188 |


Since we specified scraping from both PubMed and ArXiv, the resulting table contains the results from both archives concatenated together.

Note that the results are far from perfect: many of the returned papers have nothing to do with active inference. This can be cleaned up later, once the papers are loaded into the database.
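If a quick interim check is useful before the database step, one crude option is to keep only rows whose title mentions the search phrase. The sketch below is an assumption-laden illustration, not part of `Scraper`: `mentions_phrase` is a hypothetical helper, the sample titles are made up, and a title-only filter will also discard relevant papers that mention the phrase only in the abstract.

```python
import re


def mentions_phrase(text: str, phrase: str) -> bool:
    """Return True if `phrase` occurs in `text` as a whole phrase, ignoring case."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


# Made-up titles, for illustration only.
sample_titles = [
    "Active Inference for robot navigation",
    "A survey of inactive inference pipelines",
    "Measurements of neutrino beam polarization",
]

kept = [title for title in sample_titles if mentions_phrase(title, "active inference")]
print(kept)

# With the DataFrame returned by scraper.run(), the same check could be applied
# to the `title` column, for example:
# filtered = df[df["title"].map(lambda t: mentions_phrase(str(t), "active inference"))]
```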