<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#PyBibClean" data-toc-modified-id="PyBibClean-1" data-vivaldi-spatnav-clickable="1"><span class="toc-item-num">1&nbsp;&nbsp;</span>PyBibClean</a></span><ul class="toc-item"><li><span><a href="#Using-BibTexParser" data-toc-modified-id="Using-BibTexParser-1.1" data-vivaldi-spatnav-clickable="1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Using BibTexParser</a></span></li><li><span><a href="#Pandas-and-qgrid" data-toc-modified-id="Pandas-and-qgrid-1.2" data-vivaldi-spatnav-clickable="1"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Pandas and <code>qgrid</code></a></span></li><li><span><a href="#PyInspire" data-toc-modified-id="PyInspire-1.3" data-vivaldi-spatnav-clickable="1"><span class="toc-item-num">1.3&nbsp;&nbsp;</span><code>PyInspire</code></a></span></li><li><span><a href="#Combining-pyinspire-and-bibtextparser" data-toc-modified-id="Combining-pyinspire-and-bibtextparser-1.4" data-vivaldi-spatnav-clickable="1"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Combining <code>pyinspire</code> and <code>bibtextparser</code></a></span></li><li><span><a href="#BeautifulSoup" data-toc-modified-id="BeautifulSoup-1.5" data-vivaldi-spatnav-clickable="1"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>BeautifulSoup</a></span></li></ul></li><li><span><a href="#Python-Project-Area" data-toc-modified-id="Python-Project-Area-2" data-vivaldi-spatnav-clickable="1"><span class="toc-item-num">2&nbsp;&nbsp;</span>Python Project Area</a></span></li><li><span><a href="#Test-Area" data-toc-modified-id="Test-Area-3" data-vivaldi-spatnav-clickable="1"><span class="toc-item-num">3&nbsp;&nbsp;</span>Test Area</a></span></li><li><span><a href="#Using-PyBTex" data-toc-modified-id="Using-PyBTex-4" data-vivaldi-spatnav-clickable="1"><span class="toc-item-num">4&nbsp;&nbsp;</span>Using PyBTex</a></span></li></ul></div>

# PyBibClean

The objectives of this notebook are:

* To clean a given bibtex file
* To fix broken entries in a bibtex file
* To update entries in a bibtex file according to the latest bibliographic information available online

## Using BibTexParser

After trying both, I have decided that `bibtexparser` is simpler to use and more useful than `pybtex`.

In [1]:
import matplotlib as mpl

In [2]:
import pandas as pd

In [3]:
import bibtexparser
from bibtexparser.bparser import BibTexParser
from bibtexparser.customization import *

In [4]:
# Define a function to customize our entries.
# It takes a record and returns the customized record.
def customizations(record):
    """Use some functions delivered by the library

    :param record: a record
    :returns: -- customized record
    """
    # `type` is bibtexparser.customization.type (pulled in by the star import), not the builtin.
    record = type(record)
    record = author(record)
    record = editor(record)
    record = journal(record)
    record = keyword(record)
    record = link(record)
    record = page_double_hyphen(record)
    record = doi(record)
    return record

In [5]:
with open('bibtex.clean.bib') as bibtex_file:
    # common_strings=True lets the parser resolve standard string macros such as month names.
    # No customization is attached, so entries are returned as parsed
    # (for example, `author` stays a single string).
    parser = BibTexParser(common_strings=True)
    bib_database = bibtexparser.load(bibtex_file, parser=parser)
    print(bib_database.entries)


[{'bdsk-url-1': 'http://arxiv.org/abs/1002.1462', 'bdsk-file-2': 'YnBsaXN0MDDUAQIDBAUGJCVYJHZlcnNpb25YJG9iamVjdHNZJGFyY2hpdmVyVCR0b3ASAAGGoKgHCBMUFRYaIVUkbnVsbNMJCgsMDxJXTlMua2V5c1pOUy5vYmplY3RzViRjbGFzc6INDoACgAOiEBGABIAFgAdccmVsYXRpdmVQYXRoWWFsaWFzRGF0YV8QWC4uL2JpYmRlc2svVmFpZC5EX0VtYmVkZGluZyB0aGUgQmlsc29uLVRob21wc29uIG1vZGVsIGluIGFuIExRRy1saWtlIGZyYW1ld29ya18yMDEwYi5wZGbSFwsYGVdOUy5kYXRhTxEChAAAAAAChAACAAAMTWFjaW50b3NoIEhEAAAAAAAAAAAAAAAAAAAA0InpokgrAAAAFmuGH1ZhaWQuRF9FbWJlZGRpbmcgdGhlIzE2NzRCMi5wZGYAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAWdLLHwV5KUERGIAAAAAAAAQACAAAJIAAAAAAAAAAAAAAAAAAAAAdiaWJkZXNrAAAQAAgAANCJnEoAAAARAAgAAMfBEPIAAAABABgAFmuGAM7TSgDOzz4AFNSTAA8AAgACk9MAAgBeTWFjaW50b3NoIEhEOlVzZXJzOgBkZWVwYWs6AG93bkNsb3VkOgByb290OgByZXNlYXJjaDoAYmliZGVzazoAVmFpZC5EX0VtYmVkZGluZyB0aGUjMTY3NEIyLnBkZgAOAJwATQBWAGEAaQBkAC4ARABfAEUAbQBiAGUAZABkAGkAbgBnACAAdABoAGUAIABCAGkAbABzAG8AbgAtAFQAaABvAG0AcABzAG8AbgAgAG0AbwBkAGUAbAAgAGkAbgAgAGEAbgAgAEwAUQBHAC0AbABpAGsAZQAgAGYAcgBhAG0AZQB3AG8Acg

In [6]:
keySet = set()  # start from an empty set (a bare `()` is an empty tuple)
for entry in bib_database.entries:
    tempSet = set(entry.keys())
    keySet = tempSet.union(keySet)
print(keySet)

{'priority', 'eprint', 'date-modified', 'ID', 'bdsk-file-1', 'citeulike-article-id', 'abstract', 'posted-at', 'keywords', 'day', 'author', 'archiveprefix', 'year', 'url', 'date-added', 'month', 'ENTRYTYPE', 'title', 'citeulike-linkout-0', 'citeulike-linkout-1', 'bdsk-file-2', 'bdsk-url-1'}
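
The key set above mixes bibliographic fields with bookkeeping fields written by reference managers (`bdsk-*`, `citeulike-*`). A minimal sketch of one possible cleaning step follows. It assumes entries are plain dictionaries like those in `bib_database.entries`. The helper name `strip_bookkeeping`, the prefix list, and the sample `ID` are illustrative assumptions, not part of `bibtexparser`.

```python
# Hypothetical helper: drop reference-manager bookkeeping fields from parsed entries.
# The prefixes are an assumption based on the keys printed above.
BOOKKEEPING_PREFIXES = ("bdsk-", "citeulike-")


def strip_bookkeeping(entries, prefixes=BOOKKEEPING_PREFIXES):
    """Return new entry dicts without keys that start with any of `prefixes`."""
    return [
        {key: value for key, value in entry.items() if not key.startswith(prefixes)}
        for entry in entries
    ]


# Illustrative entry; the ID is a placeholder, not a key from the real file.
sample = [{
    "ID": "example-key",
    "ENTRYTYPE": "article",
    "eprint": "1002.1462",
    "bdsk-url-1": "http://arxiv.org/abs/1002.1462",
    "citeulike-article-id": "0",
}]

cleaned = strip_bookkeeping(sample)
print(cleaned)
```

The function builds new dictionaries rather than deleting keys in place, so the original entries stay available for comparison.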


In [7]:
no_journal = []

In [8]:
entry.keys()

dict_keys(['bdsk-url-1', 'bdsk-file-2', 'bdsk-file-1', 'year', 'url', 'title', 'priority', 'posted-at', 'month', 'keywords', 'eprint', 'day', 'date-modified', 'date-added', 'citeulike-linkout-1', 'citeulike-linkout-0', 'citeulike-article-id', 'author', 'archiveprefix', 'abstract', 'ENTRYTYPE', 'ID'])

In [9]:
for entry in bib_database.entries:
    if 'journal' not in entry.keys():
        no_journal.append(entry)

In [10]:
no_journal

[{'bdsk-url-1': 'http://arxiv.org/abs/1002.1462',
  'bdsk-file-2': 'YnBsaXN0MDDUAQIDBAUGJCVYJHZlcnNpb25YJG9iamVjdHNZJGFyY2hpdmVyVCR0b3ASAAGGoKgHCBMUFRYaIVUkbnVsbNMJCgsMDxJXTlMua2V5c1pOUy5vYmplY3RzViRjbGFzc6INDoACgAOiEBGABIAFgAdccmVsYXRpdmVQYXRoWWFsaWFzRGF0YV8QWC4uL2JpYmRlc2svVmFpZC5EX0VtYmVkZGluZyB0aGUgQmlsc29uLVRob21wc29uIG1vZGVsIGluIGFuIExRRy1saWtlIGZyYW1ld29ya18yMDEwYi5wZGbSFwsYGVdOUy5kYXRhTxEChAAAAAAChAACAAAMTWFjaW50b3NoIEhEAAAAAAAAAAAAAAAAAAAA0InpokgrAAAAFmuGH1ZhaWQuRF9FbWJlZGRpbmcgdGhlIzE2NzRCMi5wZGYAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAWdLLHwV5KUERGIAAAAAAAAQACAAAJIAAAAAAAAAAAAAAAAAAAAAdiaWJkZXNrAAAQAAgAANCJnEoAAAARAAgAAMfBEPIAAAABABgAFmuGAM7TSgDOzz4AFNSTAA8AAgACk9MAAgBeTWFjaW50b3NoIEhEOlVzZXJzOgBkZWVwYWs6AG93bkNsb3VkOgByb290OgByZXNlYXJjaDoAYmliZGVzazoAVmFpZC5EX0VtYmVkZGluZyB0aGUjMTY3NEIyLnBkZgAOAJwATQBWAGEAaQBkAC4ARABfAEUAbQBiAGUAZABkAGkAbgBnACAAdABoAGUAIABCAGkAbABzAG8AbgAtAFQAaABvAG0AcABzAG8AbgAgAG0AbwBkAGUAbAAgAGkAbgAgAGEAbgAgAEwAUQBHAC0AbABpAGsAZQAgAGYAcgBhAG0AZQB3AG8A

In [11]:
from pyinspire import *

In [12]:
for entry in no_journal:
    keylist = entry.keys()
    if 'eprint' in keylist:
        data = get_text_from_inspire('eprint ' + entry['eprint'], resultformat='bibtex')
        print(data)




@article{Vaid:2010ya,
      author         = "Vaid, Deepak",
      title          = "{Embedding the Bilson-Thompson Model in an LQG-Like
                        Framework}",
      year           = "2010",
      eprint         = "1002.1462",
      archivePrefix  = "arXiv",
      primaryClass   = "hep-th",
      SLACcitation   = "%%CITATION = ARXIV:1002.1462;%%"
}


@article{Vaid:2012pr,
      author         = "Vaid, Deepak",
      title          = "{Quantum Hall Effect and Black Hole Entropy in Loop
                        Quantum Gravity}",
      year           = "2012",
      eprint         = "1208.3335",
      archivePrefix  = "arXiv",
      primaryClass   = "gr-qc",
      SLACcitation   = "%%CITATION = ARXIV:1208.3335;%%"
}


@article{Vaid:2013qja,
      author         = "Vaid, Deepak",
      title          = "{Elementary Particles as Gates for Universal Quantum
                        Computation}",
      year           = "2013",
      eprint         = "1307.0096",
      archivePr

## Pandas and `qgrid`

In [16]:
import qgrid

In [14]:
bib_data = pd.DataFrame(bib_database.entries)

In [19]:
qgrid.show_grid(bib_data)

QgridWidget(grid_options={'fullWidthRows': True, 'syncColumnCellResize': True, 'forceFitColumns': True, 'defau…

In [18]:
bib_grid = qgrid.show_grid(bib_data)

## `PyInspire`

In [20]:
import pyinspire as spires

In [21]:
spires.query_inspire("Ahluwalia")

b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">\n<head>\n <style type="text/css">\n   @media print {\n    a[href]:after {\n    content: none !important;\n    }\n   }\n </style>\n <title>Ahluwalia - Search Results - INSPIRE-HEP</title>\n <link rev="made" href="mailto:admin@inspirehep.net" />\n <link rel="stylesheet" href="/img/invenio_inspire.css?59f927f5dd60d502535ca20cf9057784" type="text/css" />\n   <link rel="canonical" href="https://inspirehep.net/search?action_search=Search&amp;rg=100&amp;of=hb&amp;p=Ahluwalia&amp;jrec=0" />\n  <link rel="alternate" hreflang="x-default" href="https://inspirehep.net/search?action_search=Search&amp;rg=100&amp;of=hb&amp;p=Ahluwalia&amp;jrec=0" />\n  <link rel="alternate" hreflang="el" href="https://inspirehep.net/search?action_search=Search&amp;rg=100&amp;of=hb&amp;p=Ahluwalia&amp;jrec=0&amp;ln=el" />\n  <l

In [22]:
data = spires.get_text_from_inspire('Vaid',resultformat='bibtex')





In [23]:
spires.get_text_from_inspire('eprint 1208.3335',resultformat='bibtex')

'\n@article{Vaid:2012pr,\n      author         = "Vaid, Deepak",\n      title          = "{Quantum Hall Effect and Black Hole Entropy in Loop\n                        Quantum Gravity}",\n      year           = "2012",\n      eprint         = "1208.3335",\n      archivePrefix  = "arXiv",\n      primaryClass   = "gr-qc",\n      SLACcitation   = "%%CITATION = ARXIV:1208.3335;%%"\n}\n'

## Combining `pyinspire` and `bibtexparser`

In [26]:
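parser = BibTexParser()
# Only one customization callable can be attached; a second assignment replaces the first.
# `homogenize_latex_encoding` is the one in effect for the output below.
parser.customization = homogenize_latex_encoding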
bib_database = bibtexparser.loads(data, parser=parser)
print(bib_database.entries)

[{'slaccitation': '\\%\\%CITATION = ARXIV:1805.11053;\\%\\%', 'primaryclass': 'gr-qc', 'archiveprefix': 'arXiv', 'eprint': '1805.11053', 'year': '2018', 'title': '{J}oule-{T}homson expansion in {A}dS black hole with a global\nmonopole', 'author': 'Rizwan C. L., Ahmed and Kumara A., Naveena and Vaid,\nDeepak and Ajith, K. M.', 'ENTRYTYPE': 'article', 'ID': 'Rizwan:2018mpy'}, {'slaccitation': '\\%\\%CITATION = APCPC,1953,040026;\\%\\%', 'doi': '10.1063/1.5032646', 'pages': '040026', 'number': '1', 'year': '2018', 'volume': '1953', 'journal': 'AIP Conf. Proc.', 'booktitle': 'Proceedings, 2nd International Conference on Condensed\nMatter and Applied Physics (ICC 2017): Bikaner, India,\nNovember 24-25, 2017', 'title': '{S}econd order phase transition in thermodynamic geometry\nand holographic superconductivity in low-energy stringy\nblack holes', 'author': 'Ahmed Rizwan, C. L. and Vaid, Deepak', 'ENTRYTYPE': 'article', 'ID': 'Rizwan:2018cdy'}, {'slaccitation': '\\%\\%CITATION = ARXIV:1711.0
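
Once the local file and the INSPIRE response are both parsed into entry dictionaries, updating an entry reduces to a dictionary merge. The sketch below is one possible approach, not a `bibtexparser` or `pyinspire` feature. The function name `merge_entry`, its parameters, and the local `ID` are hypothetical. The remote values are taken from the INSPIRE record for eprint 1208.3335 shown earlier.

```python
def merge_entry(local, remote, protected=("ID",), overwrite=False):
    """Return a copy of `local` with fields filled in from `remote`.

    Keys listed in `protected` are never taken from `remote`, so the local
    citation key survives. Existing local fields are kept unless `overwrite`
    is true.
    """
    merged = dict(local)
    for key, value in remote.items():
        if key in protected:
            continue
        if overwrite or key not in merged:
            merged[key] = value
    return merged


# Illustrative local entry; the ID is a placeholder.
local = {"ID": "local-key", "ENTRYTYPE": "article", "eprint": "1208.3335"}
remote = {
    "ID": "Vaid:2012pr",
    "ENTRYTYPE": "article",
    "eprint": "1208.3335",
    "year": "2012",
    "archiveprefix": "arXiv",
    "primaryclass": "gr-qc",
}

merged = merge_entry(local, remote)
print(merged)
```

Keeping the local `ID` by default avoids breaking `\cite` commands in existing documents. Passing `protected=()` would adopt the INSPIRE key instead.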

## BeautifulSoup

This step is not needed, because the `pyinspire` module already imports `BeautifulSoup` and uses it to extract the text.

In [27]:
from bs4 import BeautifulSoup as bsoup

In [28]:
# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = bsoup(data, "html.parser")

In [29]:
soup.pre?

In [30]:
text = "\n".join([tag.text for tag in soup.find_all("pre")])

In [31]:
print(text)




# Python Project Area

In [32]:
%gui qt

In [33]:
from PyQt5 import QtCore, QtGui, QtWidgets
from PyQt5.QtWidgets import QGridLayout

In [34]:
# class Ui_Form(object):
#     def setupUi(self, Form):
#         Form.setObjectName("Form")
#         Form.resize(645, 497)
#         sizePolicy = QtWidgets.QSizePolicy(QtWidgets.QSizePolicy.MinimumExpanding, QtWidgets.QSizePolicy.Preferred)
#         sizePolicy.setHorizontalStretch(0)
#         sizePolicy.setVerticalStretch(0)
#         sizePolicy.setHeightForWidth(Form.sizePolicy().hasHeightForWidth())
#         Form.setSizePolicy(sizePolicy)
#         self.gridLayoutWidget = QtWidgets.QWidget(Form)
#         self.gridLayoutWidget.setGeometry(QtCore.QRect(-30, -20, 871, 621))
#         self.gridLayoutWidget.setObjectName("gridLayoutWidget")
#         self.gridLayout = QtWidgets.QGridLayout(self.gridLayoutWidget)
#         self.gridLayout.setContentsMargins(0, 0, 0, 0)
#         self.gridLayout.setObjectName("gridLayout")
#         self.lineEdit = QtWidgets.QLineEdit(self.gridLayoutWidget)
#         sizePolicy = QtWidgets.QSizePolicy(QtWidgets.QSizePolicy.Fixed, QtWidgets.QSizePolicy.Fixed)
#         sizePolicy.setHorizontalStretch(0)
#         sizePolicy.setVerticalStretch(0)
#         sizePolicy.setHeightForWidth(self.lineEdit.sizePolicy().hasHeightForWidth())
#         self.lineEdit.setSizePolicy(sizePolicy)
#         self.lineEdit.setObjectName("lineEdit")
#         self.gridLayout.addWidget(self.lineEdit, 0, 0, 1, 1)
#         self.pushButton = QtWidgets.QPushButton(self.gridLayoutWidget)
#         self.pushButton.setObjectName("pushButton")
#         self.gridLayout.addWidget(self.pushButton, 1, 0, 1, 1)

#         self.retranslateUi(Form)
#         QtCore.QMetaObject.connectSlotsByName(Form)

#     def retranslateUi(self, Form):
#         _translate = QtCore.QCoreApplication.translate
#         Form.setWindowTitle(_translate("Form", "Form"))
#         self.pushButton.setText(_translate("Form", "PushButton"))

In [35]:
class Ui_Form(QtWidgets.QWidget):
    
    def __init__(self):
        super().__init__()
        self.title = "PyBibClean"
        self.left = 10
        self.top = 10
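        # Named win_width/win_height so they do not shadow QWidget.width() and QWidget.height().
        self.win_width = 320
        self.win_height = 100
        self.initUI()
    
    def initUI(self):
        self.setWindowTitle(self.title)
        self.setGeometry(self.left, self.top, self.win_width, self.win_height)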
        
        self.lineEdit = QtWidgets.QLineEdit(self)
        self.pushButton = QtWidgets.QPushButton("Select File", self)
        self.textBrowser = QtWidgets.QTextBrowser(self)
        
        self.createGridLayout()
        
    def createGridLayout(self):
        layout = QGridLayout()
        
        layout.addWidget(self.lineEdit)
        layout.addWidget(self.pushButton)
        layout.addWidget(self.textBrowser)
        
        self.setLayout(layout)
        

In [36]:
from PyQt5.QtWidgets import QApplication, QWidget
from PyQt5.QtWidgets import QFileDialog
import sys

In [37]:
class MainWindow(Ui_Form):
    def __init__(self, parent=None):
        # Ui_Form.__init__ already calls initUI(); calling it again would build the widgets twice.
        super().__init__()
        self.pushButton.clicked.connect(self.selectFile)
        # self.pushButton.clicked.connect(self.pressed)

    def selectFile(self):
        file_name = QFileDialog.getOpenFileName(self, 'Open File')
        print(file_name[0])
        self.lineEdit.setText(file_name[0])

In [38]:
app = QApplication([])

In [39]:
view = MainWindow()

In [40]:
view.show()

In [41]:
view.close()

True

In [42]:
del view

In [43]:
app.exit()

In [44]:
del app

# Test Area

In [45]:
import sys
try:
    from urllib.request import urlopen
except ImportError:
    from urllib import urlopen
try:
    from urllib.parse import urlencode
except ImportError:
    from urllib import urlencode

from bs4 import BeautifulSoup
import optparse
import logging
import re

In [46]:
APIURL = "https://inspirehep.net/search?"
logging.basicConfig()
log = logging.getLogger("pyinspire")

In [47]:
def get_text_from_inspire(search="", resultformat="brief"):
    """Extract text from an INSPIRE search."""
    log.info("Search of INSPIRE started...")
    data = query_inspire(search, resultformat=resultformat)
    text = extract_from_data(data)
    return text

In [48]:
def inspire_url(search="", resultformat="brief", startrecord=0):
    """Construct the query string for INSPIRE"""
    formats = {"brief": "hb",
               "bibtex": "hx",
               "latexEU": "hlxe",
               "latexUS": "hlxu",
               "marcxml": "xm"}
    inspireoptions = dict(action_search="Search",
                          rg=100, #number of results to return in one page
                          of=formats[resultformat], # format of results 
                          ln="en", #language
                          p=search, # search string
                          jrec=startrecord, # record number to start at
                          )
    
    url = APIURL + urlencode(inspireoptions)
    return url
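
A quick way to sanity-check the generated query string is to decode it again with `urllib.parse`. The sketch below is a standalone illustration: it repeats the parameter dictionary from `inspire_url` with the `bibtex` format code (`hx`) and an example search string, instead of calling the function itself.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Standalone copy of the parameters built by inspire_url, for illustration only.
APIURL = "https://inspirehep.net/search?"
params = dict(action_search="Search", rg=100, of="hx", ln="en",
              p="eprint 1208.3335", jrec=0)
url = APIURL + urlencode(params)

# parse_qs reverses urlencode; every value comes back as a list of strings.
decoded = parse_qs(urlsplit(url).query)
print(url)
print(decoded["p"], decoded["of"])
```

`urlencode` encodes the space in the search string as `+`, and `parse_qs` turns it back into a space, so the round trip should return the original search text.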

In [49]:
def query_inspire(search="", resultformat="brief"):
    """Query the INSPIRE HEP database and return the entries.

    Parameters
    ----------
    search : string
             search string to use in query

    resultformat : string
             long hand name of format, one of ["brief", "bibtex", "latexEU", "latexUS", "marcxml"]

    """
    url = inspire_url(search, resultformat)
    log.debug("Query URL is %s", str(url))

    try:
        f = urlopen(url)
        log.debug("Starting to read data from %s.", str(url))
        data = f.read()
        log.debug("Data has been read: \n %s", str(data))
    except IOError as e:
        log.error("Error retrieving results: %s", str(e))
        raise
    return data

In [50]:
query_inspire("Ahluwalia")

b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">\n<head>\n <style type="text/css">\n   @media print {\n    a[href]:after {\n    content: none !important;\n    }\n   }\n </style>\n <title>Ahluwalia - Search Results - INSPIRE-HEP</title>\n <link rev="made" href="mailto:admin@inspirehep.net" />\n <link rel="stylesheet" href="/img/invenio_inspire.css?59f927f5dd60d502535ca20cf9057784" type="text/css" />\n   <link rel="canonical" href="https://inspirehep.net/search?action_search=Search&amp;rg=100&amp;of=hb&amp;p=Ahluwalia&amp;jrec=0" />\n  <link rel="alternate" hreflang="x-default" href="https://inspirehep.net/search?action_search=Search&amp;rg=100&amp;of=hb&amp;p=Ahluwalia&amp;jrec=0" />\n  <link rel="alternate" hreflang="el" href="https://inspirehep.net/search?action_search=Search&amp;rg=100&amp;of=hb&amp;p=Ahluwalia&amp;jrec=0&amp;ln=el" />\n  <l

# Using PyBTex

Reference: [Pybtex](https://pybtex.org/)

In [1]:
from pybtex.database import parse_file

ModuleNotFoundError: No module named 'pybtex'

In [None]:
bib_data = parse_file('bibtex.bib')
type(bib_data)

In [None]:
for x in bib_data.entries:
    print(x,bib_data.entries[x])

In [None]:
bib_data.entries[x]

In [None]:
type(_12)

In [None]:
bib_data.entries[x].fields.keys()