# Tutorial notebook: using the downloaders module

In [1]:
from cord19_plus.downloadpdf.downloaders import Cord19Reader, Index, IndexRow, Status, OpenAlex, URLDownloader
import logging

logging.basicConfig(level=logging.INFO)
logging.getLogger("urllib3").setLevel(logging.CRITICAL)  # Don't clutter the logs with urllib3 messages.

## Information
### Classes

The downloaders module defines several classes that help you download PDFs from the CORD-19 dataset.

- Cord19Reader : Reads the metadata file of the CORD-19 dataset.  
- Index : Keeps track of which files have already been downloaded and their status.  
- OpenAlex : Finds open-access URLs via OpenAlex.  
- URLDownloader : Downloads a PDF from a direct link to that PDF.  


### Errors
- PDFNotAvailableError : Raised when a PDF cannot be downloaded via its direct link. This can occur when a website takes extra measures against automated scraping, or simply when OpenAlex has not indexed a direct link to the PDF.
- NotOpenAccessError : Raised when the requested PDF is not open access.


## Hands on

To download the PDFs, we first need to create an Index object. It ensures that we do not check the same PDF twice, and it additionally handles saving the PDFs. Entries in the Index are written to and loaded from a file called `index.jsonl`. We have to specify the path to this file beforehand, as well as the directory into which the PDFs will be downloaded. The Index is represented as a dictionary with DOIs as keys and IndexRow objects as values. An IndexRow consists of a DOI, a Status, a pdf_path, and a pdf_url.

In [None]:
# Create the index from the specified index.jsonl file and specify which directory to use for saving.
# If the file does not exist it will be created. Same goes for the download directory.
index = Index.from_jsonl("index.jsonl", "pdfs")
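
To make the on-disk layout more concrete, here is a minimal sketch of how a DOI-keyed index could round-trip through JSON Lines text. It is not the module's actual implementation: `SketchRow`, `dump_rows`, and `load_rows` are hypothetical names, the status is stored as a plain string, the DOIs and URLs are placeholders, and the exact field layout of `index.jsonl` is an assumption based on the description above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, Optional


@dataclass
class SketchRow:
    """Hypothetical stand-in for IndexRow with the four fields named above."""

    doi: str
    status: str  # assumption: the status is serialized as a plain string
    pdf_path: Optional[str]
    pdf_url: Optional[str]


def dump_rows(rows: Dict[str, SketchRow]) -> str:
    """Serialize the rows as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(row)) for row in rows.values())


def load_rows(text: str) -> Dict[str, SketchRow]:
    """Rebuild the DOI-keyed dictionary from JSON Lines text."""
    rows: Dict[str, SketchRow] = {}
    for line in text.splitlines():
        if line.strip():  # ignore empty lines
            row = SketchRow(**json.loads(line))
            rows[row.doi] = row
    return rows


# Placeholder entries, not real DOIs.
rows = {
    "10.1000/example.1": SketchRow(
        "10.1000/example.1", "downloaded", "pdfs/example1.pdf", "https://example.org/1.pdf"
    ),
    "10.1000/example.2": SketchRow("10.1000/example.2", "unavailable", None, None),
}
restored = load_rows(dump_rows(rows))
```

Because every entry occupies its own line, a file in this format can be extended by appending a line, without rewriting the entries that are already there.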

Create the reader object. Its `read_metadata` method returns a generator that yields the rows of the metadata.csv file one at a time. If a DOI from metadata.csv is already contained in the Index, that entry is skipped.

In [None]:
# Create the reader object.
reader = Cord19Reader("")  # fill in the location of the CORD-19 metadata
meta = reader.read_metadata(index)
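
The skipping behaviour described above can be sketched with a plain generator. This is only an illustration under stated assumptions, not the code inside `Cord19Reader`: `iter_new_rows` is a hypothetical helper, the set of known DOIs stands in for the Index, and the sample CSV is made up. Only the `doi` column name is taken from this notebook.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, Set


def iter_new_rows(lines: Iterable[str], known_dois: Set[str]) -> Iterator[Dict[str, str]]:
    """Yield CSV rows lazily, skipping rows whose DOI is already known."""
    for row in csv.DictReader(lines):
        if row["doi"] in known_dois:
            continue  # already tracked, so do not process it again
        yield row


# Made-up sample data; a real metadata file has many more columns.
sample = io.StringIO(
    "doi,title\n"
    "10.1000/example.1,First paper\n"
    "10.1000/example.2,Second paper\n"
    "10.1000/example.3,Third paper\n"
)
new_dois = [row["doi"] for row in iter_new_rows(sample, {"10.1000/example.2"})]
```

Since the rows are produced lazily, a large metadata file never has to be held in memory as a whole, and the consumer can stop early at any point.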

Instantiate the OpenAlex and URLDownloader classes to look up open-access URLs and download the PDFs behind them.

In [None]:
oa = OpenAlex(
    ""  # set e-mail address
)  # OpenAlex class used to retrieve open-access URLs from the OpenAlex API.
dl = URLDownloader()  # Downloads PDFs from a URL.

for idx, m in enumerate(meta):  # process only the first 10 entries of the metadata.csv file
    if idx >= 10:
        break
    oa_url = None  # reset so that a failed lookup does not reuse the previous entry's URL
    try:
        oa_url = oa.find_pdf_url(m["doi"])  # request open-access URL
        content = dl.request_pdf(oa_url)  # retrieve PDF bytes
        index.save_file(content, m["doi"], oa_url)  # write PDF bytes to the download directory and add to the Index
    except Exception:  # e.g. PDFNotAvailableError or NotOpenAccessError
        # If an error occurred while requesting, add the entry to the Index as unavailable.
        index.add(m["doi"], Status.UNAVAILABLE, None, oa_url)
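
If you want to tell the two documented errors apart instead of treating every failure the same way, the handling could look roughly like the sketch below. It is self-contained and does not use the real module: the two exception classes are local stand-ins that only share their names with the module's errors, `fetch`, `fake_find_url`, and `fake_download` are hypothetical, the outcome labels are invented for illustration, and which stub raises which error is an arbitrary choice.

```python
from typing import Callable, Dict, Optional, Tuple


class PDFNotAvailableError(Exception):
    """Local stand-in for the module's error of the same name."""


class NotOpenAccessError(Exception):
    """Local stand-in for the module's error of the same name."""


def fetch(
    doi: str,
    find_url: Callable[[str], str],
    download: Callable[[str], bytes],
) -> Tuple[str, Optional[str]]:
    """Return an outcome label and the URL found so far (None if the lookup failed)."""
    url = None  # stays None when the lookup itself raises
    try:
        url = find_url(doi)
        download(url)
        return "saved", url
    except NotOpenAccessError:
        return "not open access", url
    except PDFNotAvailableError:
        return "pdf not available", url


def fake_find_url(doi: str) -> str:
    """Stub lookup: pretends that the DOI 'closed' is not open access."""
    if doi == "closed":
        raise NotOpenAccessError(doi)
    return f"https://example.org/{doi}.pdf"


def fake_download(url: str) -> bytes:
    """Stub download: pretends that one host blocks automated requests."""
    if url.endswith("blocked.pdf"):
        raise PDFNotAvailableError(url)
    return b"%PDF-1.4"


results: Dict[str, Tuple[str, Optional[str]]] = {
    doi: fetch(doi, fake_find_url, fake_download) for doi in ["open", "closed", "blocked"]
}
```

Catching the specific error types also lets unexpected problems, such as a typo in the code, surface as ordinary tracebacks instead of being silently recorded as unavailable PDFs.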