# Crunching data from SciELO's services from DOI / PID

In [1]:
from io import BytesIO
import re
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen, Request

from bs4 import BeautifulSoup
from lxml import etree
import pandas as pd

## Querying by DOI in the internal ArticleMeta API

There's an internal API for looking up a DOI, as shown in
[this notebook](https://github.com/gustavofonseca/notebooks/blob/master/Receitas%20do%20ArticleMeta%20API.ipynb),
which is part of the [articlemetaapi]().
With it, we could write something like:

```python
import json
from articlemeta.client import ThriftClient

client = ThriftClient()

def get_internal_info_from_doi(doi):
    try:
        return next(client.documents(only_identifiers=True, extra_filter=json.dumps({"doi": doi})))
    except StopIteration:
        return None
```

However, that API is internal to SciELO
and can't be accessed from an external network.
The client will try to reconnect 10 times,
issuing warnings saying it couldn't connect to that host,
and raise a `ServerError` at the end.

Therefore we need another approach to publicly access data from DOI values.

## Searching by DOI in [search.scielo.org](https://search.scielo.org) + BeautifulSoup scraping

We can find an article using the SciELO search engine
(the one DuckDuckGo redirects to when using the `!scielo` bang).

In [2]:
def search_scielo_org_bsoup(search_string):
    try:
        query = urlencode({"q": search_string})
        resp = urlopen("https://search.scielo.org/?" + query)
        return BeautifulSoup(resp.read(), "lxml")
    except HTTPError:
        return None

We can't load that directly with `lxml.etree.parse`:
it breaks, complaining that the input isn't valid XML.
That's expected:
HTML syntax leaves some tags unclosed,
and elements like `<pre>` and `<script>`
might contain plain text with a tag-like syntax.
Tweaking the parser
(using the HTML parser instead of the XML parser,
 or using only parts of the input)
would help,
but BeautifulSoup is written for parsing HTML
(here using the lxml HTML parser)
and works directly with CSS selectors,
which should help when scraping the search results' HTML.
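
To illustrate the difference between strict XML parsing and lenient HTML parsing, here is a minimal sketch that uses only the standard library. It is an assumption made to keep the example dependency-free: `xml.etree.ElementTree` and `html.parser` stand in for `lxml` and BeautifulSoup, and the `snippet` string is a made-up sample, not a piece of the actual search page.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical HTML sample: <br> is never closed, which is fine in HTML
snippet = "<div class='item'><p>first line<br>second line</p></div>"


def is_well_formed_xml(text):
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Lenient HTML parser that records every opening tag it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(snippet)
xml_ok = is_well_formed_xml(snippet)
print(xml_ok, collector.tags)
```

The XML parser rejects the snippet because of the unclosed `<br>`, while the HTML parser walks through all three tags.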

Using `DOI: 10.----/--------` as the search string,
replacing the dashes with the actual DOI contents,
is enough to find the article.
Each result is a `<div>` element that has the `item` class.
The results are paginated ($15$ items per page), but we just need the first item.
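
As a small sketch of what the request looks like, the snippet below builds the search URL for a DOI without sending anything over the network. The `build_search_url` name is hypothetical; it only mirrors the `urlencode` call used in `search_scielo_org_bsoup`.

```python
from urllib.parse import urlencode


def build_search_url(doi, base="https://search.scielo.org/"):
    """Build the search URL for a DOI, prefixing it with 'DOI: '."""
    return base + "?" + urlencode({"q": "DOI: " + doi})


url = build_search_url("10.1590/rbz4720170186")
print(url)  # https://search.scielo.org/?q=DOI%3A+10.1590%2Frbz4720170186
```

The colon, the space and the slash are percent-encoded (the space as `+`), so the DOI can be passed as-is.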

In [3]:
full_bsoup = search_scielo_org_bsoup("DOI: 10.1590/rbz4720170186")
first_item = full_bsoup.select(".results .item")[0]
print(first_item.prettify())

<div class="item" id="S1516-35982018000100506-scl">
 <div class="col-md-1 col-sm-2 col-xs-1">
  <input class="checkbox my_selection" id="select_S1516-35982018000100506-scl" type="checkbox" value="S1516-35982018000100506-scl"/>
  <label class="checkbox" for="select_S1516-35982018000100506-scl">
   1.
  </label>
 </div>
 <div class="col-md-11 col-sm-10 col-xs-11">
  <!-- title -->
  <div class="line">
   <img class="openAccessIcon showTooltip" data-toggle="tooltip" src="https://search.scielo.org/static/image/open-access-icon.png" title="CC-BY/4.0"/>
   <a href="http://www.scielo.br/scielo.php?script=sci_arttext&amp;pid=S1516-35982018000100506&amp;lang=pt" target="_blank" title="Effects of 1,25-dihydroxycholecalciferol and reduced vitamin D3 level on broiler performance and bone quality">
    <strong class="title" id="title-S1516-35982018000100506">
     Effects of 1,25-dihydroxycholecalciferol and reduced vitamin D3 level on broiler performance and bone quality
    </strong>
   </a>
   <

That's still too much information.
Let's display just a part of it,
using indentation instead of angle brackets to show the nesting:

In [4]:
# These 2 functions are just helpers to display some selected HTML contents
def stringfy_attr_value(k, v):
    if k not in ['title', 'id', 'class']:
        return "..."
    if isinstance(v, str):
        return v
    return ",".join(v)

def tree_str(root, indent=2, max_length=79):

    def tree_gen(element, level=0):
        prefix = indent * level * ' '

        if element.name in ["div", "span", "a"]:
            attrs = "".join(f" {k}={stringfy_attr_value(k, v)}"
                            for k, v in element.attrs.items())
            yield prefix + element.name + attrs
            for child in element.children:
                yield from tree_gen(child, level + 1)

        elif not element.name: # Text node
            raw_text = element.strip()
            if raw_text and "author" in element.parent.attrs.get("class", []):
                yield prefix + raw_text
                
    lines = (line if len(line) <= max_length else line[:max_length - 3] + "..."
             for line in tree_gen(root))
    return "\n".join(lines)

In [5]:
print(tree_str(first_item))

div id=S1516-35982018000100506-scl class=item
  div class=col-md-1,col-sm-2,col-xs-1
  div class=col-md-11,col-sm-10,col-xs-11
    div class=line
      a href=... title=Effects of 1,25-dihydroxycholecalciferol and reduced ...
      span class=socialLinks,articleShareLink
        a href=...
          span class=st_email_custom st_title=... st_url=... st_image=... st...
        a href=... class=articleAction,shareFacebook data-toggle=... data-pl...
        a href=... class=articleAction,shareTwitter data-toggle=... data-pla...
        a href=... class=showTooltip,dropdown-toggle data-toggle=... data-pl...
          span class=glyphBtn,otherNetworks
    div class=line,authors
      a href=... target=... class=author
        Castro, Fernanda Lima de Souza
      a href=... target=... class=author
        Baião, Nelson Carneiro
      a href=... target=... class=author
        Ecco, Roselene
      a href=... target=... class=author
        Louzada, Mário Jefferson Quirino
      a href=... tar

The PID is in the root item element's `id` attribute,
which is written in a `"{pid}-{collection}"` format.
The DOI is part of the single `<span class="DOIResults">` element.
Some other information can be found in that HTML snippet,
like the article title (the `title` attribute of the link in the first `<div class="line">`)
and the authors' names (text in the `<a class="author">` elements).
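
The two string operations involved can be tried in isolation. This is only a sketch: `split_item_id` and `find_doi` are hypothetical helper names, and the `"DOI: ..."` text is an assumed stand-in for the text content of the `DOIResults` span.

```python
import re

# Raw string, with the dot after "10" escaped so it matches a literal dot only
DOI_REGEX = re.compile(r"[\s>](10\.[0-9.]+/[^< \n]+)")


def split_item_id(item_id):
    """Split '{pid}-{collection}' at the LAST hyphen (the PID has one too)."""
    pid, collection = item_id.rsplit("-", 1)
    return pid, collection


def find_doi(text):
    """Return the first DOI-like token in the text, or None."""
    match = DOI_REGEX.search(text)
    return match.group(1) if match else None


pid, collection = split_item_id("S1516-35982018000100506-scl")
doi = find_doi("DOI: 10.1590/rbz4720170186")  # assumed span text
print(pid, collection, doi)
```

Splitting from the right matters because the PID itself contains a hyphen (the one from the ISSN).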

It has more information (e.g. volume, ISSN and journal name)
which won't be scraped:

In [6]:
first_item.select(".source")[0]

<div class="line source">
<span style="margin: 0"><a class="openJournalInfo" data-collection="scl" data-issn="1516-3598" data-publisher="" href="#">Revista Brasileira de Zootecnia</a>, </span>
<span style="margin: 0">Abr</span>
<span style="margin: 0">2018, </span>
<small>Volume</small>
<span>47</span>
<small>elocation</small>
<span>e20170186</span>
</div>

A simple scraper function is:

In [7]:
def get_doi_meta(doi):
    bsoup = search_scielo_org_bsoup("DOI: " + doi)
    item = bsoup.select(".results .item")[0]
    doi_span = item.select(".DOIResults")
    doi_ctx = doi_span[0].text if doi_span else item.text
    pid, collection = item.attrs["id"].rsplit("-", 1)
    return {
        "doi": re.search(r"[\s>](10\.[0-9.]+/[^< \n]+)", doi_ctx).group(1),
        "pid": pid,
        "collection": collection,
        "authors": [author.text.strip() for author in item.select(".author")],
        "title": item.select(".line a:nth-of-type(1)")[0].attrs["title"],
    }

Scraping that DOI entry we were analyzing:

In [8]:
get_doi_meta("10.1590/rbz4720170186")

{'doi': '10.1590/rbz4720170186',
 'pid': 'S1516-35982018000100506',
 'collection': 'scl',
 'authors': ['Castro, Fernanda Lima de Souza',
  'Baião, Nelson Carneiro',
  'Ecco, Roselene',
  'Louzada, Mário Jefferson Quirino',
  'Melo, Érica de Faria',
  'Saldanha, Mariana Masseo',
  'Triginelli, Marcela Viana',
  'Lara, Leonardo José Camargos'],
 'title': 'Effects of 1,25-dihydroxycholecalciferol and reduced vitamin D3 level on broiler performance and bone quality'}

Some other examples:

In [9]:
get_doi_meta("10.1590/2446-4740.02618")

{'doi': '10.1590/2446-4740.02618',
 'pid': 'S2446-47402018000200157',
 'collection': 'scl',
 'authors': ['Kauati, Adriana',
  'Pereira, Wagner Coelho de Albuquerque',
  'Campos, Marcello Luiz Rodrigues'],
 'title': 'Mean scatterer space estimation from ultrasound signals combining singular spectral analysis and entropy'}

In [10]:
get_doi_meta("10.5028/jatm.v9i1.717")

{'doi': '10.5028/jatm.v9i1.717',
 'pid': 'S2175-91462017000100071',
 'collection': 'scl',
 'authors': ['Adami, Amirhossein', 'Mortazavi, Mahdi', 'Nosratollahi, Mehran'],
 'title': 'A New Approach to Multidisciplinary Design Optimization of Solid Propulsion System Including Heat Transfer and Ablative Cooling'}

In [11]:
get_doi_meta("10.1590/1980-57642015DN92000002")

{'doi': '10.1590/1980-57642015DN92000002',
 'pid': 'S1980-57642015000200093',
 'collection': 'scl',
 'authors': ['Jacinto, Alessandro Ferrari',
  'Leite, Ananda Ghelfi Raza',
  'Lima Neto, José Luiz de',
  'Vidal, Edison Iglesias de Oliveira',
  'Bôas, Paulo José Fortes Villas'],
 'title': 'O ensino de demência nas escolas médicas: uma breve revisão'}

In [12]:
get_doi_meta("10.4321/S1135-57272015000400009")

{'doi': '10.4321/S1135-57272015000400009',
 'pid': 'S1135-57272015000400009',
 'collection': 'esp',
 'authors': ['Montaño Remacha, Carmen',
  'Gallardo García, Virtudes',
  'Mochón Ochoa, M. Mar',
  'García Fernández, Marcelino',
  'Mayoral Cortés, José María',
  'Ruiz Fernández, Josefa'],
 'title': 'Outbreaks of Measles in Andalusia, Spain, during the Period 2010-2015'}

## Getting metadata information from PID using [scielo.br](http://scielo.br)

The [2018-07-23 experiments](experiments_2018-07-23.ipynb) showed
that we can get the DOI from the PID through the
[www.scielo.br](http://www.scielo.br) web page,
using the XML debug output for a `sci_arttext` result,
and that it works even with the older "ahead of print" PID value:

In [13]:
def get_scielo_php_etree_br(pid):
    query = urlencode({
        "script": "sci_arttext",
        "pid": pid,
        "debug": "xml",
    })
    try:
        return etree.parse("http://www.scielo.br/scielo.php?" + query)
    except Exception:  # e.g. OSError raised by lxml when the HTTP load fails
        return None

Let's test this for the PIDs we've seen in the previous section:

In [14]:
pids = [
    "S1135-57272015000400009",
    "S1980-57642015000200093",
    "S2175-91462017000100071",
    "S2446-47402018000200157",
    "S1516-35982018000100506",
]

In [15]:
pts_dict = {pid: get_scielo_php_etree_br(pid) for pid in pids}
pts_dict

{'S1135-57272015000400009': None,
 'S1980-57642015000200093': <lxml.etree._ElementTree at 0x7feccef51108>,
 'S2175-91462017000100071': <lxml.etree._ElementTree at 0x7fecceff5848>,
 'S2446-47402018000200157': <lxml.etree._ElementTree at 0x7fecceff57c8>,
 'S1516-35982018000100506': <lxml.etree._ElementTree at 0x7fecceff5748>}

The first PID wasn't found (you can try
[this link](http://www.scielo.br/scielo.php?script=sci_arttext&pid=S1135-57272015000400009&debug=xml)
to check that it was a 404 error).
Actually, this PID is correct, but it's not in the Brazilian collection,
so we should use [scielo.isciii.es](http://scielo.isciii.es) as the base URL.
An enhanced XML element tree getter is:

In [16]:
def get_scielo_php_etree(pid, base_url="http://www.scielo.br"):
    query = urlencode({
        "script": "sci_arttext",
        "pid": pid,
        "debug": "xml",
    })
    resp = urlopen(Request(
        url=base_url + "/scielo.php?" + query,
        headers={
            "User-Agent": "Normalizator!!!",
        },
    ))
    return etree.parse(BytesIO(resp.read().strip()))

In [17]:
def get_scielo_php_etree_404_none(*args, **kwargs):
    try:
        return get_scielo_php_etree(*args, **kwargs)
    except HTTPError as exc:
        if exc.code == 404:
            return None
        raise

*Info: we should use a modified `User-Agent` header
because the default `Python-urllib/3.6` value
crashes this server's request handler,
and we should strip the response and wrap it in `io.BytesIO` because
the [scielo.isciii.es](http://scielo.isciii.es) response
has leading whitespace that crashes the `lxml` parser.*
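
The whitespace issue can be reproduced offline. The sketch below is an assumption-laden illustration: it uses the standard library `xml.etree.ElementTree` instead of `lxml`, and the `payload` bytes (including the `10.0000/example` DOI) are made up to mimic a response that starts with blank lines before the XML declaration.

```python
import xml.etree.ElementTree as ET

# Hypothetical payload mimicking a response with leading whitespace
payload = (b"\n\n  <?xml version='1.0' encoding='utf-8'?>"
           b"<root><ARTICLE DOI='10.0000/example'/></root>")


def parse_stripped(raw):
    """Strip surrounding whitespace before handing the bytes to the parser."""
    return ET.fromstring(raw.strip())


# The XML declaration must be the very first thing in the document
try:
    ET.fromstring(payload)
    raw_parse_failed = False
except ET.ParseError:
    raw_parse_failed = True

root = parse_stripped(payload)
article_attrs = dict(root.find(".//ARTICLE").attrib)
print(raw_parse_failed, article_attrs)
```

Stripping the bytes first is what makes the declaration land at the start of the input, so the parser accepts it.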

In [18]:
pts_dict[pids[0]] = \
    get_scielo_php_etree_404_none(pids[0], base_url="http://scielo.isciii.es")
pts_dict

{'S1135-57272015000400009': <lxml.etree._ElementTree at 0x7fecceff5d48>,
 'S1980-57642015000200093': <lxml.etree._ElementTree at 0x7feccef51108>,
 'S2175-91462017000100071': <lxml.etree._ElementTree at 0x7fecceff5848>,
 'S2446-47402018000200157': <lxml.etree._ElementTree at 0x7fecceff57c8>,
 'S1516-35982018000100506': <lxml.etree._ElementTree at 0x7fecceff5748>}

Though these servers behave differently,
their `//ARTICLE` result structure is the same:

In [19]:
data_from_pids = pd.DataFrame(dict(tree.xpath("//ARTICLE")[0].attrib)
                              for tree in pts_dict.values())
data_from_pids.T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| AREASGEO | | | | | |
| CITED | | | | | |
| CLINICALTRIALS | | 0 | 0 | 0 | 0 |
| CURR_DATE | 20180802 | 20180802 | 20180802 | 20180802 | 20180802 |
| DOCTOPIC | oa | ra | oa | oa | |
| DOCTYPE | article | article | article | article | article |
| DOI | 10.4321/S1135-57272015000400009 | 10.1590/1980-57642015DN92000002 | 10.5028/jatm.v9i1.717 | 10.1590/2446-4740.02618 | 10.1590/rbz4720170186 |
| FPAGE | 407 | 93 | 71 | 157 | |
| LPAGE | 418 | 95 | 82 | 165 | |
| ORIGINALLANG | es | en | en | en | en |


That works, but for general use we would have to know
the collection of the given PID
and the URL of that collection.

The http://articlemeta.scielo.org/api/v1/collection/identifiers/
JSON has the collections' domains;
two of them have been used here.
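
A minimal sketch of that lookup, assuming the collection code is already known: the `COLLECTION_BASE_URLS` mapping and the `debug_xml_url` helper are hypothetical names, and the mapping holds only the two domains used so far instead of the full list from the ArticleMeta JSON.

```python
from urllib.parse import urlencode

# Only the two collections seen so far; a complete mapping would be
# built from the ArticleMeta collection identifiers JSON
COLLECTION_BASE_URLS = {
    "scl": "http://www.scielo.br",
    "esp": "http://scielo.isciii.es",
}


def debug_xml_url(pid, collection):
    """Build the sci_arttext debug XML URL for a PID in a known collection."""
    base_url = COLLECTION_BASE_URLS[collection]  # KeyError if unknown
    query = urlencode({"script": "sci_arttext", "pid": pid, "debug": "xml"})
    return f"{base_url}/scielo.php?{query}"


url_esp = debug_xml_url("S1135-57272015000400009", "esp")
print(url_esp)
```

An unknown collection code raises a `KeyError`, which is arguably better than silently querying the wrong domain.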

## Single-level crawling on [search.scielo.org](https://search.scielo.org)

There's another piece of information in the `first_item`
we've got from SciELO's search engine:
the link to the article.

In [20]:
title_link = first_item.select(".line a:nth-of-type(1)")[0]
title_link # We've already seen this to get the article title

<a href="http://www.scielo.br/scielo.php?script=sci_arttext&amp;pid=S1516-35982018000100506&amp;lang=pt" target="_blank" title="Effects of 1,25-dihydroxycholecalciferol and reduced vitamin D3 level on broiler performance and bone quality">
<strong class="title" id="title-S1516-35982018000100506">Effects of 1,25-dihydroxycholecalciferol and reduced vitamin D3 level on broiler performance and bone quality</strong>
</a>

It'll have the correct domain and we can use that link to easily build the debug XML URL:

In [21]:
title_link.attrs["href"] + "&debug=xml"

'http://www.scielo.br/scielo.php?script=sci_arttext&pid=S1516-35982018000100506&lang=pt&debug=xml'
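
Appending `"&debug=xml"` works as long as the link already has a query string. A slightly more defensive variant, shown here only as a sketch with a hypothetical `with_debug_xml` helper, rebuilds the query with `urllib.parse`:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_debug_xml(url):
    """Return the URL with a single debug=xml parameter in its query."""
    parts = urlsplit(url)
    # Drop any previous "debug" parameter, then append the one we want
    params = [(k, v) for k, v in parse_qsl(parts.query) if k != "debug"]
    params.append(("debug", "xml"))
    return urlunsplit(parts._replace(query=urlencode(params)))


debug_url = with_debug_xml(
    "http://www.scielo.br/scielo.php"
    "?script=sci_arttext&pid=S1516-35982018000100506&lang=pt"
)
print(debug_url)
```

This also handles a link without any query string, where plain concatenation would produce an invalid URL.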

Also, we can get the data from that search using the actual PID
(not the old one) by prefixing the search string with `PID:`.

In [22]:
def search_article(*, doi=None, pid=None):
    if bool(doi) == bool(pid):
        raise ValueError("The input should be either a DOI or a PID")
    search_string = "DOI: " + doi if doi else "PID: " + pid
    query = urlencode({"q": search_string})
    resp = urlopen("https://search.scielo.org/?" + query)
    bsoup = BeautifulSoup(resp.read(), "lxml")
    item = bsoup.select(".results .item")[0]
    doi_span = item.select(".DOIResults")
    doi_ctx = doi_span[0].text if doi_span else item.text
    pid, collection = item.attrs["id"].rsplit("-", 1)
    title_link = item.select(".line a:nth-of-type(1)")[0]
    return {
        "doi": re.search(r"[\s>](10\.[0-9.]+/[^< \n]+)", doi_ctx).group(1),
        "pid": pid,
        "collection": collection,
        "authors": [author.text.strip() for author in item.select(".author")],
        "title": title_link.attrs["title"],
        "url": title_link.attrs["href"],
    }

In [23]:
def get_article(*, doi=None, pid=None):
    try:
        art_info = search_article(doi=doi, pid=pid)
    except (HTTPError, IndexError):
        return None
    try:
        resp = urlopen(Request(url=art_info["url"] + "&debug=xml",
                               headers={"User-Agent": "Python"}))
        art_info["etree"] = etree.parse(BytesIO(resp.read().strip()))
        art_info["attr"] = dict(art_info["etree"].xpath("//ARTICLE")[0].attrib)
    except (HTTPError, etree.XMLSyntaxError):
        pass
    return art_info

This way we can get the same information from either the PID or the DOI:

In [24]:
get_article(pid="S1516-35982018000100506")

{'doi': '10.1590/rbz4720170186',
 'pid': 'S1516-35982018000100506',
 'collection': 'scl',
 'authors': ['Castro, Fernanda Lima de Souza',
  'Baião, Nelson Carneiro',
  'Ecco, Roselene',
  'Louzada, Mário Jefferson Quirino',
  'Melo, Érica de Faria',
  'Saldanha, Mariana Masseo',
  'Triginelli, Marcela Viana',
  'Lara, Leonardo José Camargos'],
 'title': 'Effects of 1,25-dihydroxycholecalciferol and reduced vitamin D3 level on broiler performance and bone quality',
 'url': 'http://www.scielo.br/scielo.php?script=sci_arttext&pid=S1516-35982018000100506&lang=pt',
 'etree': <lxml.etree._ElementTree at 0x7feccf0a6748>,
 'attr': {'TEXTLANG': 'en',
  'ORIGINALLANG': 'en',
  'PID': 'S1516-35982018000100506',
  'DOCTYPE': 'article',
  'RELATED': '',
  'CITED': '',
  'PROJFAPESP': '',
  'CLINICALTRIALS': ' 0',
  'AREASGEO': '',
  'PROCESSDATE': '20180417',
  'CURR_DATE': '20180802',
  'ahpdate': '20180423',
  'DOI': '10.1590/rbz4720170186',
  'PDF': '1'}}

In [25]:
get_article(doi="10.1590/rbz4720170186")

{'doi': '10.1590/rbz4720170186',
 'pid': 'S1516-35982018000100506',
 'collection': 'scl',
 'authors': ['Castro, Fernanda Lima de Souza',
  'Baião, Nelson Carneiro',
  'Ecco, Roselene',
  'Louzada, Mário Jefferson Quirino',
  'Melo, Érica de Faria',
  'Saldanha, Mariana Masseo',
  'Triginelli, Marcela Viana',
  'Lara, Leonardo José Camargos'],
 'title': 'Effects of 1,25-dihydroxycholecalciferol and reduced vitamin D3 level on broiler performance and bone quality',
 'url': 'http://www.scielo.br/scielo.php?script=sci_arttext&pid=S1516-35982018000100506&lang=pt',
 'etree': <lxml.etree._ElementTree at 0x7feccd9ed288>,
 'attr': {'TEXTLANG': 'en',
  'ORIGINALLANG': 'en',
  'PID': 'S1516-35982018000100506',
  'DOCTYPE': 'article',
  'RELATED': '',
  'CITED': '',
  'PROJFAPESP': '',
  'CLINICALTRIALS': ' 0',
  'AREASGEO': '',
  'PROCESSDATE': '20180417',
  'CURR_DATE': '20180802',
  'ahpdate': '20180423',
  'DOI': '10.1590/rbz4720170186',
  'PDF': '1'}}

However, the [www.scielosp.org](http://www.scielosp.org) source
ignores the `debug=xml` argument,
and that behavior will become the default for the links of the other collections.

In [26]:
get_article(pid="S1414-32832017000200349")

{'doi': '10.1590/1807-57622016.0103',
 'pid': 'S1414-32832017000200349',
 'collection': 'spa',
 'authors': ['Orlandin, Eduardo Antônio de Sousa',
  'Moscovici, Leonardo',
  'Franzon, Ana Carolina Arruda',
  'Passos, Afonso Dinis Costa',
  'Fabbro, Amaury Lelis Dal',
  'Vieira, Elisabeth Meloni',
  'Bellissimo-Rodrigues, Fernando',
  'Gusso, Gustavo Diniz Ferreira',
  'Ferreira, Janise Braga Barros',
  'Marques, João Mazzoncini de Azevedo',
  'Ribeiro, Luciana Cisoto',
  'Santos, Luciane Loures dos',
  'Demarzo, Marcelo Marcos Piva',
  'Fontão, Paulo Celso Nogueira',
  'Souza, João Paulo'],
 'title': 'A research agenda for Primary Health Care in the state of Sao Paulo, Brazil: the ELECT study',
 'url': 'http://www.scielosp.org/scielo.php?script=sci_arttext&pid=S1414-32832017000200349&lang=pt'}
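
The result has no `etree` or `attr` keys because the response is a regular HTML page, which the XML parser rejects. That fallback can be sketched offline as follows, under a few assumptions: the standard library `xml.etree.ElementTree` replaces `lxml`, `article_attrs_or_none` is a hypothetical helper, and both payloads are made-up samples.

```python
import xml.etree.ElementTree as ET


def article_attrs_or_none(raw):
    """Return the ARTICLE attributes, or None if this isn't the debug XML."""
    try:
        root = ET.fromstring(raw.strip())
    except ET.ParseError:
        return None  # Most likely an HTML page instead of XML
    article = root if root.tag == "ARTICLE" else root.find(".//ARTICLE")
    return dict(article.attrib) if article is not None else None


# Hypothetical payloads: a debug XML response and a plain HTML page
xml_payload = b"<root><ARTICLE PID='S0000-00000000000000000' PDF='1'/></root>"
html_payload = (b"<!DOCTYPE html><html><head><meta charset='utf-8'></head>"
                b"<body><p>Article page</body></html>")

from_xml = article_attrs_or_none(xml_payload)
from_html = article_attrs_or_none(html_payload)
print(from_xml, from_html)
```

Returning `None` keeps the caller's logic simple: the search metadata is still available even when the debug XML isn't.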

And the queried search engine isn't indexed by the ahead of print PID:

In [27]:
get_article(pid="S1414-32832016005024105")