# Playing with ePrints publications

This notebook is a space for getting a feel for ePrints repositories: how best to request data, how to parse it, and how to extract the information we want.
It is also a useful place to debug snippets of code that throw errors in the main scripts.

## Getting the publications

In [None]:
import requests

- Filtering for Item Types *Articles* and *Research Reports or Papers*
- XML output works for Southampton Uni (yay!)
- Based on Philly's code

In [None]:
repo = "eprints.soton.ac.uk"
date = "2022-"
pres_type = "paper"  # not named `type`: that would shadow the built-in type() called in later cells

request = f"https://{repo}/cgi/search/archive/advanced?screen=Search&" \
            "output=XML&" \
            "_action_export_redir=Export&" \
            "dataset=archive&" \
            "_action_search=Search&" \
            "documents_merge=ALL&" \
            "documents=&" \
            "eprintid=&" \
            "title_merge=ALL&" \
            "title=&" \
            "contributors_name_merge=ALL&" \
            "contributors_name=&" \
            "abstract_merge=ALL&" \
            "abstract=&" \
            f"date={date}&" \
            "keywords_merge=ALL&" \
            "keywords=&" \
            "divisions_merge=ANY&" \
            f"pres_type={pres_type}&" \
            "refereed=EITHER&" \
            "publication%2Fseries_name_merge=ALL&" \
            "publication%2Fseries_name=&" \
            "documents.date_embargo=&" \
            "lastmod=&" \
            "pure_uuid=&" \
            "contributors_id=&" \
            "satisfyall=ALL&" \
            "order=contributors_name%2F-date%2Ftitle"

In [None]:
print(request)
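
The hand-built query string above could probably also be assembled from a dictionary with `urllib.parse.urlencode`, which takes care of the percent-encoding (e.g. `/` becomes `%2F`). This is only a sketch: the helper name `build_search_url` is made up for illustration, and it assumes the repository still accepts the request when the empty fields (`title=`, `keywords=`, ...) are left out, which has not been checked here.

```python
from urllib.parse import urlencode


def build_search_url(repo, date, pres_type):
    """Build an ePrints advanced-search export URL (hypothetical helper)."""
    # Only the non-empty parameters from the cell above are kept (assumption:
    # the server does not require the empty ones).
    params = {
        "screen": "Search",
        "output": "XML",
        "_action_export_redir": "Export",
        "dataset": "archive",
        "_action_search": "Search",
        "date": date,
        "pres_type": pres_type,
        "refereed": "EITHER",
        "satisfyall": "ALL",
        "order": "contributors_name/-date/title",
    }
    # urlencode percent-encodes the values and joins the pairs with "&".
    return f"https://{repo}/cgi/search/archive/advanced?{urlencode(params)}"


url = build_search_url("eprints.soton.ac.uk", "2022-", "paper")
print(url)
```

If the server turns out to need the empty fields, they can be added to the dictionary with `""` as the value.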

In [None]:
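response = requests.get(request, timeout=60)
response.raise_for_status()  # stop here if the server returned an error page instead of XML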

In [None]:
with open("export.xml", "wb") as f:
    f.write(response.content)

## Parsing the publications

Useful links:
- [Pubmed Parser](https://github.com/titipata/pubmed_parser)
- [lxml](https://pypi.org/project/lxml/)
- [short intro to lxml](https://realpython.com/python-xml-parser/#lxml-use-elementtree-on-steroids)
- [lxml tutorial: parsing](https://lxml.de/tutorial.html)

Problem: the XML export contains only metadata (including links to the files), not the full text of the publications.

In [None]:
from lxml import etree

In [None]:
data = "export_soton_XML.xml"  # previously saved export; note the download cell above writes to "export.xml"
with open(data, "rb") as f:
    tree = etree.parse(f)
root = tree.getroot()  # holds list of eprints, tagged <eprints>...</eprints>

In [None]:
root_tag = etree.QName(root.tag)
print(root_tag.localname)
print(root_tag.namespace)

In [None]:
children = list(root)  # should be a list of entries <eprint>...</eprint>
print(len(children))

In [None]:
print(children[0].keys())
print(children[0].get("id"))

In [None]:
for c in list(children[0]):
    local_tag = etree.QName(c.tag).localname
    print(local_tag)

In [None]:
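def get_specific_fields_content(element, field_name):
    """Return the text of every direct child whose tag contains field_name."""
    print("Contents:")
    print(type(element))
    contents = []
    for child in list(element):
        # Substring match on the namespaced tag, so "file" also matches "files".
        if field_name in child.tag:
            contents.append(child.text)
            print(type(child.text))
    return contents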

In [None]:
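def get_specific_fields_elements(element, field_name):
    """Return every direct child element whose tag contains field_name."""
    print("Elements:")
    print(type(element))
    elements = []
    for child in list(element):
        # Substring match on the namespaced tag, so "file" also matches "files".
        if field_name in child.tag:
            elements.append(child)
            print(type(child))
    return elements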

In [None]:
file_w_download = children[5]
documents_holders = get_specific_fields_elements(file_w_download, "documents")
for documents_list in documents_holders:
    documents = get_specific_fields_elements(documents_list, "document")
    for document in documents:
        files_holders = get_specific_fields_elements(document, "files")
        for files_list in files_holders:
            files = get_specific_fields_elements(files_list, "file")
            for file in files:
                print(get_specific_fields_content(file, "url"))
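
The nested loops above walk `documents` > `document` > `files` > `file` > `url` one level at a time. The same walk can likely be written as a single path query with `findall` and a namespace map. The sketch below uses the standard library's `xml.etree.ElementTree` so that it runs on its own; `findall(path, namespaces)` is also available on lxml elements. The namespace URI and the sample XML are assumptions modelled on the loop above and should be checked against a real export (e.g. via `root_tag.namespace`); `get_file_urls` is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Assumed namespace; compare with root_tag.namespace from a real export.
NS = {"ep": "http://eprints.org/ep2/data/2.0"}

# Tiny hand-written sample mimicking the structure walked by the loops above.
SAMPLE_XML = """
<eprints xmlns="http://eprints.org/ep2/data/2.0">
  <eprint id="https://example.org/id/eprint/1">
    <documents>
      <document>
        <files>
          <file><url>https://example.org/1/1/paper.pdf</url></file>
          <file><url>https://example.org/1/2/data.zip</url></file>
        </files>
      </document>
    </documents>
  </eprint>
</eprints>
"""


def get_file_urls(eprint):
    """Return the text of every documents/document/files/file/url element."""
    path = "ep:documents/ep:document/ep:files/ep:file/ep:url"
    return [url.text for url in eprint.findall(path, NS)]


sample_root = ET.fromstring(SAMPLE_XML.strip())
for eprint in sample_root.findall("ep:eprint", NS):
    print(eprint.get("id"), get_file_urls(eprint))
```

Unlike the substring test in the helper functions, the path matches tag names exactly, so `file` cannot accidentally match `files`.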

## Getting URLs

Good example of a PDF containing a GitHub URL: https://eprints.soton.ac.uk/455168/1/MARINE2021_OC4_TUDelft_WavEC.pdf.pdf

In [None]:
import requests
from pdfminer.high_level import extract_text_to_fp
import re

Download the PDF.

In [None]:
sample_pdf = requests.get("https://eprints.soton.ac.uk/455168/1/MARINE2021_OC4_TUDelft_WavEC.pdf.pdf")

Extract text from PDF.

**Note:** It might be better to use `extract_pages` instead and check only the first page, since that is where we would expect the link to be.

In [None]:
from io import BytesIO
out = BytesIO()
extract_text_to_fp(BytesIO(sample_pdf.content), out, output_type="text")

In [None]:
text = out.getvalue().decode("utf-8")

In [None]:
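# Dots are escaped so they match literally; (?:...) keeps "www." out of the captured groups.
pattern = r"(?P<url>https?://(?:www\.)?github\.com[^\s]+)"
match = re.search(pattern, text)
result = match.group("url") if match else None  # None when the text has no GitHub URL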

In [None]:
print(type(result))

In [None]:
sample = "initial text https://www.github.com/abgs some other text http://github.com/username more other text"
matches = re.findall(pattern, sample)

In [None]:
for m in re.finditer(pattern, "Just some random text"):
    print(m.group("url"))

In [None]:
print(type(matches))

In [None]:
first_match = re.search(pattern, sample)
print(type(first_match))
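
The experiments above show the three `re` entry points: `search` returns a single match object or `None`, `findall` returns a list, and `finditer` yields match objects (and simply yields nothing when there is no match). A minimal sketch of how these could be wrapped into one helper follows. The helper name `find_github_urls`, the requirement of a `/` after the host, and the stripping of trailing punctuation are all assumptions added for illustration, not part of the main scripts.

```python
import re

# Escaped dots match literally; the non-capturing (?:...) means findall
# returns plain strings instead of tuples of groups.
GITHUB_URL = re.compile(r"https?://(?:www\.)?github\.com/[^\s]+")


def find_github_urls(text):
    """Return all GitHub URLs found in text, minus trailing punctuation."""
    return [url.rstrip(".,;:)]") for url in GITHUB_URL.findall(text)]


sample_text = ("initial text https://www.github.com/abgs some other text "
               "http://github.com/username more other text")
print(find_github_urls(sample_text))
print(find_github_urls("Just some random text"))
print(find_github_urls("Code: https://github.com/user/repo."))
```

Returning an empty list when nothing matches avoids the `AttributeError` that calling `.group()` on a failed `re.search` would raise.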

## Debugging Space

### Broken PDFs and access denied

In [None]:
url_access_denied = "https://eprints.soton.ac.uk/471291/1/2201.09919_1_.pdf"
response_breaks = requests.get(url_access_denied)
print(response_breaks.status_code)

In [None]:
response_ok = requests.get("https://eprints.soton.ac.uk/455168/1/MARINE2021_OC4_TUDelft_WavEC.pdf.pdf")
print(response_ok.status_code)
print(response_ok.headers.get("content-length"))  # .get avoids a KeyError if the header is absent
print(response_ok.headers.get("content-type"))

In [None]:
url_eof = "https://eprints.soton.ac.uk/474475/1/Unconfirmed_794153.crdownload"
response_eof = requests.get(url_eof)
print(response_eof.headers)
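
One way to guard against both failure modes explored above (a response that is not a successful download, and a file that is cut off) might be a cheap check on the status code and the raw bytes before handing them to pdfminer. This is a heuristic sketch, not a PDF validator: the function name `looks_like_pdf` is hypothetical, and it only assumes that a complete PDF starts with `%PDF-` and has an `%%EOF` marker near its end.

```python
def looks_like_pdf(status_code, content):
    """Heuristic check that a downloaded body is a complete PDF."""
    if status_code != 200:
        return False  # e.g. access denied or not found
    if not content.startswith(b"%PDF-"):
        return False  # e.g. an HTML error page or another file type
    # A complete PDF carries an %%EOF marker near the end; a truncated
    # download usually does not. Only the last 1 KiB is inspected.
    return b"%%EOF" in content[-1024:]


complete = b"%PDF-1.7\n1 0 obj\n<< >>\nendobj\n%%EOF\n"
truncated = b"%PDF-1.7\n1 0 obj\n<< >>"
print(looks_like_pdf(200, complete))
print(looks_like_pdf(200, truncated))
print(looks_like_pdf(403, complete))
```

With a `requests` response this would be called as `looks_like_pdf(response.status_code, response.content)`.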