A small library for extracting references used in scholarly communication.
$ pip install refextract
To get structured information from a publication reference:
>>> from refextract import extract_journal_reference
>>> reference = extract_journal_reference('J.Phys.,A39,13445')
>>> print(reference)
{
'extra_ibids': [],
'is_ibid': False,
'misc_txt': u'',
'page': u'13445',
'title': u'J. Phys.',
'type': 'JOURNAL',
'volume': u'A39',
'year': '',
}
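The returned dictionary can be turned back into a compact citation string. The sketch below is a hypothetical helper (`format_journal_reference` is not part of refextract) and assumes only the keys shown in the sample output above.

```python
def format_journal_reference(ref):
    """Build a short citation string from a parsed journal reference.

    Hypothetical helper, not part of refextract. It assumes the keys
    'title', 'volume', 'year' and 'page' seen in the sample output.
    """
    parts = [ref.get('title', ''), ref.get('volume', '')]
    year = ref.get('year', '')
    if year:
        # Put the year in parentheses, as in "Phys. Rev. Lett. 13 (1964) 321".
        parts.append('({})'.format(year))
    parts.append(ref.get('page', ''))
    # Drop empty fields so missing data does not leave stray spaces.
    return ' '.join(p for p in parts if p)


# Sample dictionary copied from the output above.
sample = {
    'extra_ibids': [],
    'is_ibid': False,
    'misc_txt': '',
    'page': '13445',
    'title': 'J. Phys.',
    'type': 'JOURNAL',
    'volume': 'A39',
    'year': '',
}
print(format_journal_reference(sample))
```

Empty fields, such as the missing year here, are skipped rather than rendered as blanks.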
To extract references from a PDF:
>>> from refextract import extract_references_from_file
>>> references = extract_references_from_file('1503.07589.pdf')
>>> print(references[0])
{
'author': [u'F. Englert and R. Brout'],
'doi': [u'doi:10.1103/PhysRevLett.13.321'],
'journal_page': [u'321'],
'journal_reference': [u'Phys. Rev. Lett. 13 (1964) 321'],
'journal_title': [u'Phys. Rev. Lett.'],
'journal_volume': [u'13'],
'journal_year': [u'1964'],
'linemarker': [u'1'],
'raw_ref': [u'[1] F. Englert and R. Brout, \u201cBroken symmetry and the mass of gauge vector mesons\u201d, Phys. Rev. Lett. 13 (1964) 321, doi:10.1103/PhysRevLett.13.321.'],
'texkey': [u'Englert:1964et'],
'year': [u'1964'],
}
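Every field in the sample output above is a list, even when it holds a single value. If one value per field is enough, a small flattening step can help. This is a sketch only: `first_values` is a hypothetical helper, and it assumes each value is a list as in the sample.

```python
def first_values(reference):
    """Keep only the first value of each list-valued field.

    Hypothetical helper, not part of refextract. Fields whose list is
    empty are dropped from the result.
    """
    return {key: values[0] for key, values in reference.items() if values}


# A subset of the sample reference shown above.
sample = {
    'author': ['F. Englert and R. Brout'],
    'journal_title': ['Phys. Rev. Lett.'],
    'journal_volume': ['13'],
    'journal_year': ['1964'],
    'journal_page': ['321'],
}
flat = first_values(sample)
print(flat['journal_title'], flat['journal_volume'], flat['journal_page'])
```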
To extract directly from a URL:
>>> from refextract import extract_references_from_url
>>> references = extract_references_from_url('https://arxiv.org/pdf/1503.07589.pdf')
>>> print(references[0])
{
'author': [u'F. Englert and R. Brout'],
'doi': [u'doi:10.1103/PhysRevLett.13.321'],
'journal_page': [u'321'],
'journal_reference': [u'Phys. Rev. Lett. 13 (1964) 321'],
'journal_title': [u'Phys. Rev. Lett.'],
'journal_volume': [u'13'],
'journal_year': [u'1964'],
'linemarker': [u'1'],
'raw_ref': [u'[1] F. Englert and R. Brout, \u201cBroken symmetry and the mass of gauge vector mesons\u201d, Phys. Rev. Lett. 13 (1964) 321, doi:10.1103/PhysRevLett.13.321.'],
'texkey': [u'Englert:1964et'],
'year': [u'1964'],
}
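The `doi` values in the sample output carry a `doi:` prefix. To turn them into resolvable links, one option is to strip the prefix and prepend the `https://doi.org/` resolver. The helper name `doi_urls` is an assumption for illustration, and the input is assumed to have the same shape as the references shown above.

```python
def doi_urls(references):
    """Collect https://doi.org/ links from a list of reference dicts.

    Hypothetical helper, not part of refextract. It assumes each
    reference may hold a 'doi' key with a list of strings, optionally
    prefixed with 'doi:'.
    """
    urls = []
    for reference in references:
        for doi in reference.get('doi', []):
            # Remove the 'doi:' prefix if present (case-insensitive).
            if doi.lower().startswith('doi:'):
                doi = doi[len('doi:'):]
            urls.append('https://doi.org/' + doi)
    return urls


# Sample input shaped like the output above.
sample_refs = [{'doi': ['doi:10.1103/PhysRevLett.13.321']}, {'author': ['No DOI here']}]
print(doi_urls(sample_refs))
```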
refextract depends on pdftotext.
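Since pdftotext is an external program, it can be useful to check that it is on the PATH before extracting from PDFs. The following is a minimal sketch using only the standard library. It assumes Python 3.3 or later (for `shutil.which`), and `has_pdftotext` is a hypothetical name, not a refextract function.

```python
import shutil


def has_pdftotext():
    """Return True if a `pdftotext` executable is found on the PATH."""
    return shutil.which('pdftotext') is not None


if not has_pdftotext():
    # pdftotext is distributed with Poppler and with Xpdf.
    print('pdftotext not found; install it before extracting from PDFs.')
```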
refextract is based on code and ideas from the following people, who
contributed to the docextract module in Invenio:
- Alessio Deiana
- Federico Poli
- Gerrit Rindermann
- Graham R. Armstrong
- Grzegorz Szpura
- Jan Aage Lavik
- Javier Martin Montull
- Micha Moskovic
- Samuele Kaplun
- Thorsten Schwander
- Tibor Simko
refextract is licensed under GPLv2.