#### General Notes

This notebook performs *web scraping* of the arXiv.org website with Python, querying the site's advanced search form directly. It supports:
- search by keywords
- search by specific year

It also uses the **arxiv** Python package, which looks up a paper by its ID and returns information about it, such as the authors, title and publication date. The paper ID itself is obtained by scraping the arXiv.org search results.

Each search result consists of:
- title
- link to the paper info page
- link to download the PDF
- link to download the source code folder

The results are stored in a *.csv* file at the end of the search, to keep track of them. The notebook is interactive: the user chooses what to download and the order in which results are listed:
- relevance
- submission date (newest first)
- submission date (oldest first)

The notebook also provides a *download* function.
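
As a rough illustration of what ends up in the results file, the sketch below builds the three links for a paper ID using the same URL patterns as the notebook, and writes one row per paper as CSV text. The helper names `build_links` and `results_to_csv` are hypothetical, the ID and title are made up, and the standard-library `csv` module stands in for pandas here only to keep the example dependency-free.

```python
import csv
import io

ARXIV = "https://arxiv.org"

def build_links(paper_id):
    """Return the info, PDF and source links for one arXiv paper ID."""
    return {
        "Paper Info": f"{ARXIV}/abs/{paper_id}",
        "Link PDF": f"{ARXIV}/pdf/{paper_id}.pdf",
        "Link Source": f"{ARXIV}/e-print/{paper_id}",
    }

def results_to_csv(papers):
    """Serialize (paper_id, title) pairs to CSV text, one row per paper."""
    columns = ["Paper ID", "Title", "Paper Info", "Link PDF", "Link Source"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for paper_id, title in papers:
        writer.writerow({"Paper ID": paper_id, "Title": title, **build_links(paper_id)})
    return buffer.getvalue()

# made-up ID and title, for illustration only
print(results_to_csv([("0000.00000", "An example title")]))
```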

#### Updates 

The notebook has been updated (*v2*) and now includes:
- a class **Search** to keep track of all the main information about the current search
- a new type of search, **versions_search()**

Suppose you need papers with at most N versions whose latest version was published in the year YYYY: the *versions_search()* function looks for every paper that matches these criteria.

In this case, it is possible to download the latest version of each paper together with a previous version (by default, the first version).
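
The query typed into the search form for a versions search, and the ID of the previous version, can be sketched as two small helpers. This is only an illustration: `build_version_query` and `previous_version_id` are hypothetical names, the IDs are made up, and stripping the version suffix with a regular expression is an assumption rather than what the notebook itself does.

```python
import re

def build_version_query(year, version_max):
    """Build the paper-ID pattern YY*vN from a four-digit year and a version number."""
    return year[-2:] + "*v" + version_max

def previous_version_id(paper_id, prev_version="1"):
    """Replace (or append) the version suffix of an arXiv ID."""
    base_id = re.sub(r"v\d+$", "", paper_id)  # drop a trailing vN, if any
    return base_id + "v" + prev_version

print(build_version_query("2021", "3"))
print(previous_version_id("0000.00000v3"))
print(previous_version_id("0000.00000"))
```

Anchoring the pattern at the end of the string means a letter "v" elsewhere in the ID is left untouched.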

In [58]:
from selenium import webdriver
import sys
import os
import pandas as pd
import tarfile
from urllib import request
import arxiv
from dataclasses import dataclass

In [59]:
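# Search object that stores the information about the current search
from dataclasses import field  # needed for per-instance list defaults

@dataclass
class Search:
    # default_factory gives each Search its own lists instead of shared class-level ones
    papers_id: list = field(default_factory=list)
    titles: list = field(default_factory=list)
    links_pdf: list = field(default_factory=list)
    links_source: list = field(default_factory=list)
    links_info: list = field(default_factory=list)
    total_results: int = 0
    remaining_results: int = 0
    type: str = ""         # "keywords" or "versions", set by start()
    order: str = ""        # "1", "2" or "3", set by the search functions
    version_max: str = ""  # set by versions_search()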

# useful global variables

OPTION_NUM_PER_PAGE = 1 # index of the page-size option: 1 -> 25, 2 -> 50, 3 -> 100, 4 -> 200 (could be an enumeration)
MAX_NUM_PER_PAGE = 25   # must match the size selected by OPTION_NUM_PER_PAGE

CURRENT_SIZE = 0 # number of elements in current page
GIVE_MORE_RESULTS = True
PREV_VERSION = "1" # by default
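
The comment on `OPTION_NUM_PER_PAGE` hints that the page-size option could be an enumeration. A minimal sketch of that idea follows, assuming the drop-down offers 25, 50, 100 and 200 results per page in that order, as the comment states. `PageSize` and its members are hypothetical names that are not used elsewhere in the notebook.

```python
from enum import Enum

class PageSize(Enum):
    """Each member stores (option index in the drop-down, results per page)."""
    SIZE_25 = (1, 25)
    SIZE_50 = (2, 50)
    SIZE_100 = (3, 100)
    SIZE_200 = (4, 200)

    @property
    def option(self):
        return self.value[0]

    @property
    def per_page(self):
        return self.value[1]

# picking one member keeps the two globals consistent with each other
chosen = PageSize.SIZE_50
OPTION_NUM_PER_PAGE = chosen.option
MAX_NUM_PER_PAGE = chosen.per_page
print(OPTION_NUM_PER_PAGE, MAX_NUM_PER_PAGE)
```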

In [60]:
website = "https://arxiv.org/search/advanced"

# UNCOMMENT THE LINES BELOW IF YOU WANT THE WINDOW TO BE HIDDEN
#options = webdriver.FirefoxOptions()
#options.add_argument("--headless")
#driver = webdriver.Firefox(options = options)

driver = webdriver.Firefox() 
driver.get(website)

search = Search()

In [61]:
# functions to interact with the user and start the search process

def start():
    q = input("Choose a search type (1/2):\n1. Search by keywords\n2. Search for papers with multiple versions"
                + "\nInsert a number: ")
    if q == "1":
        search.type = "keywords"
        normal_search()
    elif q == "2":
        search.type = "versions"
        versions_search()
    else:
        driver.quit()
        sys.exit("Try again")

# search by keywords
def normal_search():
    terms = input("\nEnter keywords for your search: ")
    q1 = input("\nAre you looking for papers of a specific year? (y/n): ")
    if q1 == "y":
        year = input("Enter year (2007-YYYY): ")
        if year.isdigit():
            if int(year) >= 2007: 
                driver.find_element('xpath', '//input[@id="date-year"]').send_keys(year)
            else:
                driver.quit()
                sys.exit("Try again")
        else:
            driver.quit()
            sys.exit("Try again")
    else:
        driver.find_element('xpath', '//input[@id="date-filter_by-0"]').click()

    driver.find_element('xpath', '//input[@id="terms-0-term"]').send_keys(terms)
    driver.find_element('xpath', '//button[@class="button is-link is-medium"]').click()

    search.order = input("\nWhich order do you prefer? Choose an option: \n1. Relevance\n" + 
                    "2. Submission Date (oldest first)\n3. Submission Date (newest first) \n"
                    + "Insert a number: ")

# search for versions
def versions_search():
    year = input("\nEnter a year to start the search (YYYY): ")
    version_max = input("Look for papers with a max number of versions (2-9): ")
    
    query = year[len(year) - 2:] # take last two digits
    query += "*v" + version_max     # result: YY*vX 

    search.version_max = version_max

    driver.find_element('xpath', '//input[@id="terms-0-term"]').send_keys(query)
    driver.find_element('xpath', '//select[@id="terms-0-field"]/option[9]').click()
    driver.find_element('xpath', '//button[@class="button is-link is-medium"]').click()

    search.order = input("\nWhich order do you prefer? Choose an option: \n1. Relevance\n" + 
                    "2. Submission Date (oldest first)\n3. Submission Date (newest first) \n"
                    + "Insert a number: ")

# select the result order: relevance or submission date
def choose_order(select_order):
    # position of each choice in the "order" drop-down of the results page
    option_index = {'1': 5, '2': 4, '3': 3}
    if select_order not in option_index:
        driver.quit()
        sys.exit("Try again")
    return driver.find_element('xpath', '//select[@id="order"]/option[' + str(option_index[select_order]) + ']')

The following cell contains all the functions needed to manage result pages.
By default, a page shows at most 25 elements. If the whole search returns more than 25 elements, the user is asked whether more results are needed. If so, the next page is loaded; otherwise, all the results found so far are saved to the .csv file and the user is asked whether to download them.
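
The bookkeeping behind this paging can be sketched without a browser. The generator below is a simplified stand-in for the notebook's global counters: `page_sizes` is a hypothetical helper that yields how many results each successive page contributes.

```python
def page_sizes(total_results, per_page=25):
    """Yield the number of results on each page until none remain."""
    remaining = total_results
    while remaining > 0:
        current = min(per_page, remaining)
        remaining -= current
        yield current

print(list(page_sizes(60)))  # three pages: 25, 25 and 10 results
```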

In [62]:
# get the size of the first page and extract all the relevant info of papers in it
def first_page():
    list_size = (driver.find_element('xpath', '/html/body/main/div[1]/div[1]/h1').text).split(" ")[-2]
    list_size = list_size.replace(",", "")
    if list_size.isdigit():
        search.remaining_results = int(list_size) # the full list size
        choose_order(search.order).click()
        driver.find_element('xpath', '//button[@class="button is-small is-link"]').click()
        get_page_size()
        extract_search_results(CURRENT_SIZE) # CURRENT SIZE initialized by get_page_size()
    else:
        driver.quit()
        sys.exit("No results found! Try again.")

# utility function to get the size of the current page
def get_page_size():
    global CURRENT_SIZE, GIVE_MORE_RESULTS
    driver.find_element('xpath', '//select[@id="size"]/option[' + str(OPTION_NUM_PER_PAGE) + ']').click()
    driver.find_element('xpath', '//button[@class="button is-small is-link"]').click()
    if search.remaining_results > MAX_NUM_PER_PAGE: 
        CURRENT_SIZE = MAX_NUM_PER_PAGE
        search.remaining_results -= MAX_NUM_PER_PAGE
    else:
        CURRENT_SIZE = search.remaining_results
        # no more elements for next page
        search.remaining_results = 0
        GIVE_MORE_RESULTS = False

# manage next pages and extract search results
def next_page():
    driver.find_element('xpath', '/html/body/main/div[2]/nav[1]/a[2]').click()
    get_page_size()
    extract_search_results(CURRENT_SIZE)

# extract titles, link for info, link to download pdf, link to download source code      
def extract_search_results(size):
    search.total_results += size
    for i in range(1, size+1):
        url_xpath = '/html/body/main/div[2]/ol/li[' + str(i) +']/div/p/a'
        paper_id = (driver.find_elements('xpath', url_xpath)[0].text).split(":")[-1]
        search_paper = next(arxiv.Search(id_list=[paper_id]).results())        
        search.papers_id.append(paper_id)
        search.titles.append(search_paper.title)
        search.links_pdf.append("https://arxiv.org/pdf/" + paper_id + ".pdf")
        search.links_info.append("https://arxiv.org/abs/" + paper_id)
        search.links_source.append("https://arxiv.org/e-print/" + paper_id)
    print("\nDone! " + str(size) + " results found.")


def ask_for_more():
    global GIVE_MORE_RESULTS
    q = input("\nDo you want more results? (y/n): ") 
    if q == "y":
        next_page()
    else:
        GIVE_MORE_RESULTS = False


def create_db():
    results_db = pd.DataFrame({'Paper ID': search.papers_id, 'Title': search.titles, 'Paper Info': search.links_info, 
                    'Link PDF': search.links_pdf, 'Link Source': search.links_source})
    results_db.to_csv('results.csv')
    print(str(search.total_results) + " results saved to results.csv")
    driver.quit()

In [63]:
# ask what to download, then dispatch according to search.type (keywords or versions)
def ask_download():
    to_download = input("\nDo you want to download the results? (y/n): ") 
    if to_download != "y":
        print("Bye!")
        return
    to_print = input("\nHow many files do you want to download? (1-" + str(search.total_results) + "): ")
    if not to_print.isdigit() or not 1 <= int(to_print) <= search.total_results:
        sys.exit("Try again")
    q = input("\nWhat do you want to download? (1/2/3)\n" + "1. PDF version\n2." +
                " Source code folder\n3. Both\nInsert number: ")
    # the previous version is offered only for a versions search
    with_prev = False
    if search.type == "versions": 
        q2 = input("\nDo you also want to download version n." + PREV_VERSION + "? (y/n): ")
        with_prev = q2 == "y"
    if with_prev:
        downloads = {"1": download_both_pdf, "2": download_both_source, "3": download_prev_both}
    else:
        # also reached for a keywords search, which previously downloaded nothing
        downloads = {"1": download_pdf, "2": download_source, "3": download_both}
    if q not in downloads:
        sys.exit("Try again")
    downloads[q](to_print)

# functions to download in case of search.type == keywords
def download_pdf(to_print):
    for i in range(0, int(to_print)):
        request.urlretrieve(search.links_pdf[i], search.papers_id[i] + ".pdf")
    print("Download complete!")

def download_source(to_print):
    q = input("\nDo you also want to extract the folder? (y/n): ")
    if q == "y":
        for i in range(0, int(to_print)):
            source = request.urlretrieve(search.links_source[i], search.papers_id[i] + ".tar.gz")
            extract(source[0])
        print("Download and extraction complete!")
    else:
        for i in range(0, int(to_print)):
            request.urlretrieve(search.links_source[i], search.papers_id[i] + ".tar.gz")
        print("Download complete!")
    
def download_both(to_print):
    # start from index 0, like the other download functions
    for i in range(0, int(to_print)):
        request.urlretrieve(search.links_pdf[i], search.papers_id[i] + ".pdf")
        request.urlretrieve(search.links_source[i], search.papers_id[i] + ".tar.gz")
    print("Download complete!")


# functions to download in case of search.type == versions
# and download both previous and latest versions simultaneously
def download_both_pdf(to_print):
    for i in range(0, int(to_print)):
        prev_paper_id = get_prev_id(i)
        request.urlretrieve(search.links_pdf[i], search.papers_id[i] + ".pdf")
        request.urlretrieve("https://arxiv.org/pdf/" + prev_paper_id + ".pdf", prev_paper_id + ".pdf")
    print("Download complete!")

def download_both_source(to_print):
    q = input("\nDo you also want to extract the folder? (y/n): ")
    if q == "y":
        for i in range(0, int(to_print)):
            prev_paper_id = get_prev_id(i)
            source1 = request.urlretrieve(search.links_source[i], search.papers_id[i] + ".tar.gz")
            source2 = request.urlretrieve("https://arxiv.org/e-print/" + prev_paper_id, prev_paper_id + ".tar.gz")
            extract(source1[0])
            extract(source2[0])
        print("Download and extraction complete!")
    else:
        for i in range(0, int(to_print)):
            prev_paper_id = get_prev_id(i)
            request.urlretrieve(search.links_source[i], search.papers_id[i] + ".tar.gz")
            request.urlretrieve("https://arxiv.org/e-print/" + prev_paper_id, prev_paper_id + ".tar.gz")
        print("Download complete!")


def download_prev_both(to_print):
    for i in range(0, int(to_print)):
        prev_paper_id = get_prev_id(i)
        request.urlretrieve("https://arxiv.org/pdf/" + prev_paper_id + ".pdf", prev_paper_id + ".pdf")
        request.urlretrieve("https://arxiv.org/e-print/" + prev_paper_id, prev_paper_id + ".tar.gz")
        request.urlretrieve(search.links_pdf[i], search.papers_id[i] + ".pdf")
        request.urlretrieve(search.links_source[i], search.papers_id[i] + ".tar.gz")
    print("Download complete!")

# other utility functions
def extract(filename):
    folder_name = filename.split(".tar.gz")[0]
    with tarfile.open(filename, "r:gz") as tar:
        tar.extractall(path = os.path.join("../jupyter", folder_name))

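# build the ID of the previous version (PREV_VERSION) of the paper at the given index
def get_prev_id(index):
    paper_id = search.papers_id[index]
    if paper_id[len(paper_id) - 2:].isdigit():  # ends with two digits: no single-digit vN suffix
        prev_paper_id = paper_id + "v" + PREV_VERSION
        # side effect: the stored ID, used for file names, now carries the latest version
        search.papers_id[index] += "v" + search.version_max
    else:
        prev_paper_id = paper_id.replace("v" + search.version_max, "v" + PREV_VERSION)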
    return prev_paper_id
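
A downloaded source file may not always be a gzip-compressed tar archive, and archives fetched from the network can contain unexpected paths. The sketch below is an assumption about how `extract` could be hardened, not part of the notebook: it checks the file with `tarfile.is_tarfile` first and refuses members that would land outside the target folder. `safe_extract` is a hypothetical name, and the path check is basic (it does not inspect symbolic links inside the archive).

```python
import os
import tarfile

def safe_extract(filename, target_dir):
    """Extract a tar archive into target_dir; return False if it is not a tar file."""
    if not tarfile.is_tarfile(filename):
        return False
    root = os.path.realpath(target_dir)
    with tarfile.open(filename, "r:*") as tar:  # "r:*" auto-detects the compression
        for member in tar.getmembers():
            destination = os.path.realpath(os.path.join(root, member.name))
            # refuse members that would be written outside target_dir
            if os.path.commonpath([root, destination]) != root:
                raise ValueError("unsafe path in archive: " + member.name)
        tar.extractall(path=root)
    return True
```

A caller could fall back to keeping the raw file whenever `safe_extract` returns `False`.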

In [64]:
# main

if __name__ == "__main__":
    start()
    first_page()
    while search.remaining_results > 0 and GIVE_MORE_RESULTS:
        ask_for_more()
    create_db()
    ask_download()


Done! 25 results found.
25 results saved to results.csv


