# Playing with ePrints publications

This notebook is a space for getting a feel for ePrints repositories: how best to request data, how to parse it, and how to extract the information we want.
It is also a useful place to debug code snippets that throw errors in the main scripts.

## Getting the publications

In [12]:
import requests

- Filtering for Item Types *Articles* and *Research Reports or Papers*
- XML output works for Southampton Uni (yay!)
- based on Philly's Code

In [13]:
repo = "eprints.soton.ac.uk"
date = "2022-"  # NOTE: not interpolated below, so the query sends an empty date= filter
pres_type = "paper"  # named pres_type rather than type to avoid shadowing the builtin

request = (
    f"https://{repo}/cgi/search/archive/advanced?screen=Search&"
    "output=XML&"
    "_action_export_redir=Export&"
    "dataset=archive&"
    "_action_search=Search&"
    "documents_merge=ALL&"
    "documents=&"
    "eprintid=&"
    "title_merge=ALL&"
    "title=&"
    "contributors_name_merge=ALL&"
    "contributors_name=&"
    "abstract_merge=ALL&"
    "abstract=&"
    "date=&"
    "keywords_merge=ALL&"
    "keywords=&"
    "divisions_merge=ANY&"
    f"pres_type={pres_type}&"
    "refereed=EITHER&"
    "publication%2Fseries_name_merge=ALL&"
    "publication%2Fseries_name=&"
    "documents.date_embargo=&"
    "lastmod=&"
    "pure_uuid=&"
    "contributors_id=&"
    "satisfyall=ALL&"
    "order=contributors_name%2F-date%2Ftitle"
)

In [14]:
print(request)

https://eprints.soton.ac.uk/cgi/search/archive/advanced?screen=Search&output=XML&_action_export_redir=Export&dataset=archive&_action_search=Search&documents_merge=ALL&documents=&eprintid=&title_merge=ALL&title=&contributors_name_merge=ALL&contributors_name=&abstract_merge=ALL&abstract=&date=&keywords_merge=ALL&keywords=&divisions_merge=ANY&pres_type=paper&refereed=EITHER&publication%2Fseries_name_merge=ALL&publication%2Fseries_name=&documents.date_embargo=&lastmod=&pure_uuid=&contributors_id=&satisfyall=ALL&order=contributors_name%2F-date%2Ftitle
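
The long query string is easy to get wrong by hand. A minimal sketch of building it from a dictionary with `urllib.parse.urlencode` follows. The helper name `build_search_url` is hypothetical, only a subset of the parameters is shown, and it assumes that the date filter belongs in the `date` parameter. Whether the server accepts the request without the empty parameters is untested.

```python
from urllib.parse import urlencode


def build_search_url(repo, date="", pres_type=""):
    """Builds an ePrints advanced-search export URL (hypothetical helper)."""
    params = {
        "screen": "Search",
        "output": "XML",
        "_action_export_redir": "Export",
        "dataset": "archive",
        "_action_search": "Search",
        "date": date,  # assumption: this is where the date filter goes
        "pres_type": pres_type,
        "refereed": "EITHER",
        "satisfyall": "ALL",
        "order": "contributors_name/-date/title",  # urlencode escapes "/" as %2F
    }
    return f"https://{repo}/cgi/search/archive/advanced?{urlencode(params)}"


url = build_search_url("eprints.soton.ac.uk", date="2022-", pres_type="paper")
print(url)
```

Because `urlencode` does the percent-escaping, the parameter names and values can be written in readable form.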


In [15]:
response = requests.get(request)

In [16]:
with open("export.xml", "wb") as f:
    f.write(response.content)

## Parsing the publications

Useful links:
- [Pubmed Parser](https://github.com/titipata/pubmed_parser)
- [lxml](https://pypi.org/project/lxml/)
- [short intro to lxml](https://realpython.com/python-xml-parser/#lxml-use-elementtree-on-steroids)
- [lxml tutorial: parsing](https://lxml.de/tutorial.html)

Problem: the XML export does not contain the full text of the publications.

In [1]:
from lxml import etree

In [17]:
data = "export.xml"
with open(data, "rb") as f:
    tree = etree.parse(f)
root = tree.getroot()  # holds list of eprints, tagged <eprints>...</eprints>

In [18]:
root_tag = etree.QName(root.tag)
print(root_tag.localname)
print(root_tag.namespace)

eprints
http://eprints.org/ep2/data/2.0


In [19]:
children = list(root)  # should be list of entries <eprint>...</eprint>
print(len(children))

27366


In [5]:
def get_specific_fields_content(element, field_name):
    """Returns content of XML fields of a specific name of an element.

    Args:
        element (lxml.etree._Element): XML element to analyse
        field_name (str): name of field to look for

    Returns:
        list<str>: list of contents found in children of element of given name
    """
    contents = []
    for child in list(element):
        if field_name in child.tag:
            contents.append(child.text)
    return contents

def get_specific_fields_elements(element, field_name):
    """Returns XML subelements of the given element with the given name.

    Args:
        element (lxml.etree._Element): XML element to analyse
        field_name (str): name of field to look for

    Returns:
        list<lxml.etree._Element>: list of children found of the given name
    """
    elements = []
    for child in list(element):
        if field_name in child.tag:
            elements.append(child)
    return elements
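
The helpers above match with `field_name in child.tag`, which is a substring test: looking for `"document"` would also match a `documents` element. A minimal sketch of exact matching on the local tag name follows. It uses the standard library's `xml.etree.ElementTree` as a stand-in so that it runs without extra dependencies; with lxml, `etree.QName(child).localname` gives the same local name. The helper names and the `document_count` field in the sample are made up for illustration.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree

NS = "http://eprints.org/ep2/data/2.0"


def local_name(tag):
    """Strips the '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def children_named(element, field_name):
    """Returns direct children whose local tag name equals field_name exactly."""
    return [child for child in element if local_name(child.tag) == field_name]


# Hand-written sample, not real repository data.
sample = ET.fromstring(
    f'<eprint xmlns="{NS}"><documents/><document_count>1</document_count></eprint>'
)
substring_hits = [local_name(c.tag) for c in sample if "document" in c.tag]
exact_hits = [local_name(c.tag) for c in children_named(sample, "documents")]
print(substring_hits)  # both children contain "document"
print(exact_hits)      # only the exact match
```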


In [6]:
def parse_pdf_urls(path):
    """Extracts download URLs of PDFs from XML file.

    Args:
        path (str): path to XML file

    Yields:
        list[str]: download URLs of all files attached to one eprint
    """
    with open(path, "rb") as f:
        tree = etree.parse(f)
    root = tree.getroot()
    children = list(root)
    for c in children:
        urls = []
        documents_holders = get_specific_fields_elements(c, "documents")
        for documents_list in documents_holders:
            documents = get_specific_fields_elements(documents_list, "document")
            for document in documents:
                files_holders = get_specific_fields_elements(document, "files")
                for files_list in files_holders:
                    files = get_specific_fields_elements(files_list, "file")
                    for file in files:
                        urls += get_specific_fields_content(file, "url")
        if len(urls) > 0:  # NOTE: can sometimes include jpegs, docx etc.
            yield urls
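
The nested loops in `parse_pdf_urls` walk the path `documents/document/files/file/url`. A sketch of the same traversal as a single namespaced path query is below. It uses the standard library's `xml.etree.ElementTree`, whose `findall` accepts a prefix-to-namespace mapping in the same way as lxml's. The `ep` prefix, the function name `urls_per_eprint` and the sample XML are assumptions for illustration.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree

NS = {"ep": "http://eprints.org/ep2/data/2.0"}
URL_PATH = "ep:documents/ep:document/ep:files/ep:file/ep:url"


def urls_per_eprint(root):
    """Yields the non-empty list of file URLs for each <eprint> under root."""
    for eprint in root.findall("ep:eprint", NS):
        urls = [u.text for u in eprint.findall(URL_PATH, NS) if u.text]
        if urls:
            yield urls


# Hand-written sample that mirrors the nesting used in parse_pdf_urls.
sample = ET.fromstring("""
<eprints xmlns="http://eprints.org/ep2/data/2.0">
  <eprint id="a">
    <documents><document><files>
      <file><url>https://example.org/1/paper.pdf</url></file>
      <file><url>https://example.org/1/figure.jpg</url></file>
    </files></document></documents>
  </eprint>
  <eprint id="b"><title>No files attached</title></eprint>
</eprints>
""")
print(list(urls_per_eprint(sample)))
```

Unlike the substring-based helpers, the path query matches tag names exactly, so `document` and `documents` cannot be confused.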

In [7]:
cnt = 0  # number of eprints with at least one downloadable file
for pdf_urls in parse_pdf_urls("data/export_eprints.soton.ac.uk_2022-.xml"):
    cnt += 1  # the generator only yields non-empty URL lists

In [8]:
cnt

293
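
As the note in `parse_pdf_urls` says, the URL lists can include JPEGs, DOCX files and so on. A quick way to narrow them down before downloading is to look at the file extension in the URL path. This is only a heuristic sketch: `is_pdf_url` is a hypothetical helper, the URLs are made up, and a `.pdf` extension does not guarantee that the server actually returns a PDF.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def is_pdf_url(url):
    """Returns True if the URL path ends in .pdf (case-insensitive)."""
    return PurePosixPath(urlparse(url).path).suffix.lower() == ".pdf"


urls = [
    "https://example.org/1/paper.PDF",
    "https://example.org/1/figure.jpg",
    "https://example.org/1/notes.docx",
    "https://example.org/1/paper.pdf?download=1",
]
pdf_only = [u for u in urls if is_pdf_url(u)]
print(pdf_only)
```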

In [9]:
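import requests
from io import BytesIO
from pdfminer.high_level import extract_text_to_fp

def check_text_access(pdf_url):
    """Returns True if the URL serves a PDF from which text can be extracted."""
    pdf = requests.get(pdf_url)
    if pdf.status_code != 200 or "pdf" not in pdf.headers.get("content-type", ""):
        return False  # access denied, missing file, or not a PDF
    out = BytesIO()
    try:
        extract_text_to_fp(BytesIO(pdf.content), out, output_type="text")
        out.getvalue().decode("utf-8")  # make sure the extracted text decodes
        return True
    except Exception:  # broken or truncated PDFs make pdfminer raise
        return False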

In [10]:
cnt = 0
for pdf_urls in parse_pdf_urls("data/export_eprints.soton.ac.uk_2022-.xml"):
    for pdf_url in pdf_urls:
        if check_text_access(pdf_url):
            cnt += 1
            break

The PDF <_io.BytesIO object at 0x10e2dfd80> contains a metadata field indicating that it should not allow text extraction. Ignoring this field and proceeding. Use the check_extractable if you want to raise an error in this case


In [11]:
cnt

173

In [None]:
print(children[0].keys())
print(children[0].get("id"))

In [None]:
for c in list(children[0]):
    local_tag = etree.QName(c.tag).localname
    print(local_tag)

In [None]:
def get_specific_fields_content(element, field_name):
    print("Contents:")
    print(type(element))
    contents = []
    for child in list(element):
        if field_name in child.tag:
            contents.append(child.text)
            print(type(child.text))
    return contents

In [None]:
def get_specific_fields_elements(element, field_name):
    print("Elements:")
    print(type(element))
    elements = []
    for child in list(element):
        if field_name in child.tag:
            elements.append(child)
            print(type(child))
    return elements

In [None]:
file_w_download = children[5]
documents_holders = get_specific_fields_elements(file_w_download, "documents")
for documents_list in documents_holders:
    documents = get_specific_fields_elements(documents_list, "document")
    for document in documents:
        files_holders = get_specific_fields_elements(document, "files")
        for files_list in files_holders:
            files = get_specific_fields_elements(files_list, "file")
            for file in files:
                print(get_specific_fields_content(file, "url"))

## Getting URLs

Good example of a PDF containing a GitHub URL: https://eprints.soton.ac.uk/455168/1/MARINE2021_OC4_TUDelft_WavEC.pdf.pdf

In [None]:
import requests
from pdfminer.high_level import extract_text_to_fp
import re

Download PDF

In [None]:
sample_pdf = requests.get("https://eprints.soton.ac.uk/455168/1/MARINE2021_OC4_TUDelft_WavEC.pdf.pdf")

Extract text from PDF.

**Note:** It might be better to use `extract_pages` and check only the first page, since that is where we would expect the link to be.

In [None]:
from io import BytesIO
out = BytesIO()
extract_text_to_fp(BytesIO(sample_pdf.content), out, output_type="text")

In [None]:
text = out.getvalue().decode("utf-8")

In [None]:
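# Dots are escaped so they match literally; the "www." group is non-capturing
# so that re.findall returns plain strings instead of tuples.
pattern = r"(?P<url>https?://(?:www\.)?github\.com[^\s]+)"
match = re.search(pattern, text)
result = match.group("url") if match else None  # None if the PDF has no GitHub link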

In [None]:
print(type(result))

In [None]:
sample = "initial text https://www.github.com/abgs some other text http://github.com/username more other text"
matches = re.findall(pattern, sample)

In [None]:
for m in re.finditer(pattern, "Just some random text"):
    print(m.group("url"))

In [None]:
print(type(matches))

In [None]:
first_match = re.search(pattern, sample)
print(type(first_match))
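
The cells above poke at the return types of `re.search`, `re.findall` and `re.finditer`. The sketch below puts the three side by side. It uses a GitHub URL pattern with escaped dots and a single capturing group; with two capturing groups, `re.findall` would return tuples rather than strings.

```python
import re

pattern = r"(?P<url>https?://(?:www\.)?github\.com[^\s]+)"
sample = (
    "initial text https://www.github.com/abgs some other text "
    "http://github.com/username more other text"
)

first = re.search(pattern, sample)        # a re.Match for the first hit, or None
all_urls = re.findall(pattern, sample)    # list of str, one per hit
iter_urls = [m.group("url") for m in re.finditer(pattern, sample)]  # same, via Match objects
no_match = re.search(pattern, "Just some random text")  # None, so guard before .group()

print(first.group("url"))
print(all_urls)
print(no_match)
```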

## Debugging Space

### Broken PDFs and access denied

In [None]:
url_access_denied = "https://eprints.soton.ac.uk/471291/1/2201.09919_1_.pdf"
response_breaks = requests.get(url_access_denied)
print(response_breaks.status_code)

In [None]:
response_ok = requests.get("https://eprints.soton.ac.uk/455168/1/MARINE2021_OC4_TUDelft_WavEC.pdf.pdf")
print(response_ok.status_code)
print(response_ok.headers['content-length'])
print(response_ok.headers['content-type'])

In [None]:
url_eof = "https://eprints.soton.ac.uk/474475/1/Unconfirmed_794153.crdownload"
response_eof = requests.get(url_eof)
print(response_eof.headers)
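
The three responses above could be told apart before handing anything to pdfminer. A heuristic sketch, assuming that a usable file has status 200, a content type mentioning `pdf`, and the usual PDF markers (`%PDF-` at the start, `%%EOF` near the end). The function name `looks_like_pdf` is hypothetical, and the byte strings in the example are fakes rather than real downloads.

```python
def looks_like_pdf(status_code, headers, content):
    """Heuristic check that an HTTP response carries a complete-looking PDF."""
    if status_code != 200:
        return False
    if "pdf" not in headers.get("content-type", "").lower():
        return False
    # PDF files start with "%PDF-" and normally end with an "%%EOF" marker.
    return content.startswith(b"%PDF-") and b"%%EOF" in content[-1024:]


pdf_headers = {"content-type": "application/pdf"}
ok = looks_like_pdf(200, pdf_headers, b"%PDF-1.7\nbody\n%%EOF\n")
denied = looks_like_pdf(403, pdf_headers, b"")
html_page = looks_like_pdf(200, {"content-type": "text/html"}, b"<html></html>")
truncated = looks_like_pdf(200, pdf_headers, b"%PDF-1.7\nbody cut off")
print(ok, denied, html_page, truncated)
```

With `requests`, this could be called as `looks_like_pdf(r.status_code, r.headers, r.content)` for a response `r`.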

### Cleaning Links

In [14]:
import re

pattern = r"github\.com/[A-Za-z0-9-]+/[A-Za-z_\-]+"  # dot escaped so it matches literally
link = "https://github.com/stuartemiddleton/uos_clpsych.Table2showstheresultsofourmodelonthevalidationsetusingthestandardevaluationmetrics.7Pleasenotethatthedatasetisimbalancedandthereforeintu-itionsjustdrawnfromonlyaccuracyarenotcorrect.Table2:PerformanceoftheproposedmodelsonTaskAandTaskBusingthevalidationset.MomentsofChangeSuicidalRiskLevelsModelPRF1PRF1Multitask-attn-score0.6740.8000.7240.4150.3970.382Multitask-score0.6800.7600.7130.3550.3310.334Multitask0.5820.7170.6290.3520.3270.335Multitask-attn0.6630.6970.6760.4080.3780.388Here,theprecision,recallandF1scorevaluesob-tainedforeachclass(seeTable5intheappendix)havebeenmacro-averagedbycalculatingthearith-meticmeanofindividualclasses’precision,recallandF1scores.Wehaveusedthemacro-averagingscoretotreatalltheclassesequallyforevaluatingtheoverallperformanceoftheclassifierregard-lessoftheirsupportvalues(i.etheactualoccur-rencesoftheclassinthedataset).Here,weob-servethatMultitask-attn-scoremodelgivesmorepromisingresultsascomparedtootherenlistedmodelsonbothtasks.Thisbehaviourisreflectedintheclassificationresultsontestdatatoo(Table3),whereMultitask-attn-scorehasoutperformedtheremainingfeatureembeddingswiththeBi-LSTMmodelaswellasthebaselinestateoftheartre-sults(Tsakalidisetal.,2022a).FromthemodeloutcomesinTable2and3,onecouldalsoseetheimpactofintroducingattentionlayersintheBi-LSTMmodel.AddingattentionlayersinBi-LSTMmodelhashelpedaccuracyforboththetasks.GiventheclassimbalanceinthedatasetwithmajorityofpostinstancesbelongingtotheNone(0)classandminorityinstancestoEscalation(IE)andSwitch(IS)classes,weseetheperformanceiscom-promisedandbiasedtowardsthemajorityclass,i.e.theclassifierismoresensitivetodetectingthemajorityclass(None(0))patternspreciselybutlesssensitivetodetectingtheminorityclasspatterns{IE,IS}.SeeTable5intheAppendixtoobservetheprecision,recallandF1scoreofthemodelsforeachindividualclassintaskA.Thedatadistri-butionisskewedfortaskBtoo,thusinfluencingitsresultsformajorityandminorityclassesshowninTable6.Overall,onthevalidationset,thepro-posedmodelshaveshownbetterrecallratethanprecision,revealinglowfalsenegativesthanthefalsepositives.Table3andTable4showtheperformanceofourproposedapproachwithvariablefeatureen-codingschemesandattentionlayersinBi-LSTMonthetestsetprovidedbytheCLPsychSharedTask2022.Theentiretrainsetcomprisingof5143"

In [17]:
re.findall(pattern, link)

['github.com/stuartemiddleton/uos_clpsych']
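
The match above could be turned into a small cleaning step that returns a normalised repository URL. A minimal sketch follows; the helper name `clean_github_link` is hypothetical, and the character classes are a judgment call. Here digits are allowed in the repository name and dots are not, so the run-on text after `uos_clpsych` is cut off, but a repository whose name really contains a dot would be truncated too.

```python
import re

REPO_PATTERN = re.compile(
    r"github\.com/(?P<user>[A-Za-z0-9-]+)/(?P<repo>[A-Za-z0-9_-]+)"
)


def clean_github_link(raw):
    """Returns a normalised https://github.com/<user>/<repo> URL, or None."""
    match = REPO_PATTERN.search(raw)
    if match is None:
        return None
    return f"https://github.com/{match.group('user')}/{match.group('repo')}"


raw = "https://github.com/stuartemiddleton/uos_clpsych.Table2showstheresults"
cleaned = clean_github_link(raw)
print(cleaned)
print(clean_github_link("no link in this text"))
```

Allowing digits has a cost as well: if PDF extraction glues a footnote number straight onto the repository name, that digit ends up in the match.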

In [19]:
import requests
requests.get("https://github.com/fal025/product_hgcn")

<Response [200]>