In [None]:
from arxivester import Paper, arXiv, inSPIRE

# Harvest papers on arXiv

## Get papers in a specific date range 

To scrape arxiv.org, we first instantiate a `Paper` object that specifies the date range (via the `from_` and `to_` arguments) and the category (via the `set_` argument). For example:

In [None]:
paper = Paper(from_="2023-05-01",
              to_="2023-07-01",
              set_="physics:astro-ph",
             )

In [None]:
paper

Passing this object to an instance of `arXiv` starts harvesting the records (this is done implicitly via `arXiv`'s `.harvest()` method).

In [None]:
arxiv = arXiv(paper)

In [None]:
paper.pile.head()

Note that the `from_`-`to_` date range does not necessarily correspond to the creation or update dates of the papers. Compare the ranges of the `datestamp`, `created`, and `updated` columns:

In [None]:
print(f"'datestamps' range between: {paper.pile.datestamp.min()} and {paper.pile.datestamp.max()}")

In [None]:
print(f"'created' range between: {paper.pile.created.min()} and {paper.pile.created.max()}")

In [None]:
print(f"'updated' range between: {paper.pile.updated.min()} and {paper.pile.updated.max()}")
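If only papers created inside the requested window are wanted, one option is to filter on the `created` column after harvesting. The following is a minimal sketch, assuming `paper.pile` is a pandas DataFrame and that `created` holds date strings (neither is confirmed here). A toy frame with made-up rows stands in for the harvested data.

```python
import pandas as pd

# Toy stand-in for `paper.pile`; the column names mirror the ones used above,
# but the rows are invented for illustration.
pile = pd.DataFrame(
    {
        "id": ["a", "b", "c", "d"],
        "datestamp": ["2023-05-02", "2023-05-15", "2023-06-20", "2023-06-30"],
        "created": ["2019-11-03", "2023-05-14", "2023-06-19", "2022-01-08"],
    }
)

# Parse the strings into real dates so that comparisons are chronological.
pile["created"] = pd.to_datetime(pile["created"])

# Keep only the records created inside the requested window (bounds inclusive).
start, end = pd.Timestamp("2023-05-01"), pd.Timestamp("2023-07-01")
recent = pile[pile["created"].between(start, end)]

print(recent["id"].tolist())
```

In this toy frame, records `a` and `d` fall inside the `datestamp` window but were created earlier, so the filter drops them.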

## Save to file

We can easily save the harvested data using

In [None]:
paper.save_to_csv()

By default, a CSV file is saved in the current directory under the following name:

In [None]:
paper.get_file_name()
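The saved file can also be inspected outside of `arxivester`, for example with pandas. The sketch below is only an illustration: the file name `example_pile.csv` and the columns are hypothetical, and the real layout of the CSV written by `save_to_csv()` may differ. It shows one detail worth keeping in mind when reading identifier-like columns back from a CSV.

```python
import os
import tempfile

import pandas as pd

# Toy data; the identifiers are invented and only shaped like arXiv identifiers.
pile = pd.DataFrame(
    {"id": ["2305.00001", "2305.10000"], "title": ["First paper", "Second paper"]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "example_pile.csv")  # hypothetical file name
    pile.to_csv(csv_path, index=False)

    # dtype=str keeps identifiers such as "2305.10000" from being parsed as
    # floats, which would silently drop the trailing zeros.
    restored = pd.read_csv(csv_path, dtype=str)

print(restored["id"].tolist())
```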

## Get today's papers

If no dates are passed to `Paper`, the date range defaults to today's papers. The default category for `set_` is `physics:astro-ph`.

In [None]:
paper = Paper()
arxiv = arXiv(paper)

In [None]:
paper.pile.head()

# Harvest number of citations from inSPIRE

The number of citations can be scraped via the `inSPIRE` object. Let's load the papers that we previously scraped from arXiv.

In [None]:
paper = Paper(from_="2023-05-01",
              to_="2023-07-01",
              set_="physics:astro-ph",
             )
paper.load_from_csv()

By passing `paper` to an instance of `inSPIRE`, we start scraping the number of citations. The optional argument `n_chunks` parallelizes this process; by default, `n_chunks` is set to the value returned by `multiprocessing.cpu_count()`.
Let's slice the first 100 papers and look up how many times each has been cited so far according to inSPIRE.

In [None]:
paper.pile = paper.pile[:100]

In [None]:
inspire = inSPIRE(paper)

In [None]:
paper.pile["n_citations"].head()
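To picture what splitting the work into `n_chunks` pieces means, here is a rough sketch of chunked, concurrent processing of a DataFrame. It is not `inSPIRE`'s actual implementation: `count_citations` is a hypothetical stand-in that returns a dummy number instead of querying INSPIRE, and a thread pool is used in place of separate processes to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor
import multiprocessing

import numpy as np
import pandas as pd


def count_citations(chunk: pd.DataFrame) -> pd.Series:
    """Hypothetical stand-in for a citation lookup on one chunk of papers."""
    # A real lookup would query INSPIRE here; the title length is a dummy value.
    return chunk["title"].str.len()


# Toy stand-in for `paper.pile`.
pile = pd.DataFrame({"title": ["a", "bb", "ccc", "dddd", "eeeee"]})

# Never use more chunks than there are rows.
n_chunks = min(multiprocessing.cpu_count(), len(pile))

# Split the row positions into n_chunks nearly equal groups.
index_groups = np.array_split(np.arange(len(pile)), n_chunks)
chunks = [pile.iloc[group] for group in index_groups]

# Process the chunks concurrently; map() returns the results in chunk order.
with ThreadPoolExecutor(max_workers=n_chunks) as pool:
    results = list(pool.map(count_citations, chunks))

# Stitch the per-chunk results back together; pandas aligns them by index.
pile["n_citations"] = pd.concat(results)
print(pile["n_citations"].tolist())
```

Because each chunk keeps its original index, the combined result lines up with the rows of the full table regardless of how many chunks were used.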

**Voila!**