First, let's open up the data we pulled from CatHub and poke around to see what's going on.

In [1]:
import pickle


with open('cathub.pkl', 'rb') as file_handle:
    all_data = pickle.load(file_handle)

In [2]:
docs = [doc for docs in all_data.values() for doc in docs]
docs[0]

{'Equation': 'H2(g) + 2.0* -> 2.0H*',
 'coverages': None,
 'dftCode': 'VASP.5.4.4',
 'dftFunctional': 'SCAN',
 'username': 'yasheng.maimaiti@gmail.com',
 'pubId': 'SharadaAdsorption2019',
 'systems': [Atoms(symbols='Ni16H', pbc=True, cell=[4.892061697, 4.892061697, 19.188815], constraint=FixAtoms(indices=[0, 1, 2, 3, 4, 5, 6, 7]), calculator=SinglePointCalculator(...)),
  Atoms(symbols='H2', pbc=True, cell=[12.000000006, 12.000000006, 12.750645752], calculator=SinglePointCalculator(...)),
  Atoms(symbols='Ni16', pbc=True, cell=[4.892061697, 4.892061697, 17.188815], constraint=FixAtoms(indices=[0, 1, 2, 3, 4, 5, 6, 7]), calculator=SinglePointCalculator(...))],
 'energy': -1.2879526899999973}

In [3]:
reactions = {doc['Equation'] for doc in docs}
reactions

{'0.5H2(g) + * -> H*',
 '0.5N2(g) + * -> N*',
 '2.0H2O(g) + * -> OOH* + 1.5H2(g)',
 '2.0H2O(g) - 1.5H2(g) + * -> OOH*',
 'CH2CH2* + H2(g) + * -> CH3CH2* + H*',
 'CH4(g) - 2.0H2(g) + * -> C*',
 'CO(g) + * -> CO*',
 'H2(g) + 2.0* -> 2.0H*',
 'H2O(g) - 0.5H2(g) + * -> OH*',
 'H2O(g) - H2(g) + * -> O*'}

In [4]:
codes = {doc['dftCode'] for doc in docs}
codes

{'Quantum ESPRESSO 5.1',
 'VASP',
 'VASP 5.3.5',
 'VASP 5.4.1',
 'VASP 5.4.4',
 'VASP-5.4.4',
 'VASP.5.4.4'}
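
Three of the labels above ('VASP 5.4.4', 'VASP-5.4.4', and 'VASP.5.4.4') differ only in the separator between the name and the version, so they probably refer to the same release. If we wanted to pool them, something like the sketch below might work. It is not part of the original analysis: `normalize_code` is a hypothetical helper, and it assumes every label is a name optionally followed by a dotted version number.

```python
import re

# Labels copied from the output above.
codes = ['Quantum ESPRESSO 5.1', 'VASP', 'VASP 5.3.5', 'VASP 5.4.1',
         'VASP 5.4.4', 'VASP-5.4.4', 'VASP.5.4.4']

# Assumed label shape: a name, an optional run of separators, an optional version.
_LABEL = re.compile(r'^(?P<name>[A-Za-z][A-Za-z ]*?)[ .\-]*(?P<version>\d+(?:\.\d+)*)?$')


def normalize_code(label):
    """Rewrite 'NAME<sep>VERSION' as 'NAME VERSION'; leave other labels untouched."""
    match = _LABEL.match(label.strip())
    if match is None:
        return label
    name, version = match.group('name'), match.group('version')
    return '%s %s' % (name, version) if version else name


normalized = {normalize_code(code) for code in codes}
print(sorted(normalized))
```

Note that the bare 'VASP' label carries no version, so it stays in its own bucket.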

In [5]:
functionals = {doc['dftFunctional'] for doc in docs}
functionals

{'BEEF-vdW',
 'HSE06',
 'PBE',
 'PBE+U',
 'PBE+U-500eV',
 'PBE+U=1',
 'PBE+U=3.32',
 'RPBE',
 'SCAN'}

OK, let's count how many documents fall into each of the subcategories we know matter: reaction, DFT code, and functional.

In [6]:
for reaction in reactions:
    _docs = [doc for doc in docs if doc['Equation'] == reaction]
    print('%i documents for %s reaction' % (len(_docs), reaction))
print()

for code in codes:
    _docs = [doc for doc in docs if doc['dftCode'] == code]
    print('%i documents for %s code' % (len(_docs), code))
print()

for xc in functionals:
    _docs = [doc for doc in docs if doc['dftFunctional'] == xc]
    print('%i documents for %s functional' % (len(_docs), xc))

9300 documents for H2O(g) - H2(g) + * -> O* reaction
10074 documents for 0.5H2(g) + * -> H* reaction
9000 documents for 0.5N2(g) + * -> N* reaction
1400 documents for H2O(g) - 0.5H2(g) + * -> OH* reaction
219 documents for CH2CH2* + H2(g) + * -> CH3CH2* + H* reaction
450 documents for CO(g) + * -> CO* reaction
138 documents for 2.0H2O(g) - 1.5H2(g) + * -> OOH* reaction
12 documents for 2.0H2O(g) + * -> OOH* + 1.5H2(g) reaction
657 documents for H2(g) + 2.0* -> 2.0H* reaction
6800 documents for CH4(g) - 2.0H2(g) + * -> C* reaction

1860 documents for VASP 5.4.4 code
1374 documents for VASP-5.4.4 code
2709 documents for VASP code
31558 documents for Quantum ESPRESSO 5.1 code
186 documents for VASP 5.3.5 code
273 documents for VASP.5.4.4 code
90 documents for VASP 5.4.1 code

1302 documents for PBE+U-500eV functional
9 documents for PBE+U=3.32 functional
108 documents for RPBE functional
434 documents for PBE functional
6 documents for HSE06 functional
273 documents for 
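
The counts above filter the whole list once per category. A single pass with `collections.Counter` gives the same kind of tally, and also lets us count code/functional pairs directly. This is only a sketch: `tally` is a hypothetical helper, and the three-item `docs` list is made-up stand-in data so the snippet runs on its own (the real list comes from `cathub.pkl`).

```python
from collections import Counter

# Made-up stand-in for `docs`; only the keys mirror the real documents.
docs = [
    {'Equation': '0.5H2(g) + * -> H*', 'dftCode': 'Quantum ESPRESSO 5.1',
     'dftFunctional': 'BEEF-vdW'},
    {'Equation': '0.5N2(g) + * -> N*', 'dftCode': 'Quantum ESPRESSO 5.1',
     'dftFunctional': 'BEEF-vdW'},
    {'Equation': 'H2(g) + 2.0* -> 2.0H*', 'dftCode': 'VASP.5.4.4',
     'dftFunctional': 'SCAN'},
]


def tally(docs, *keys):
    """Count documents per unique combination of the given keys in one pass."""
    return Counter(tuple(doc[key] for key in keys) for doc in docs)


by_code = tally(docs, 'dftCode')
by_pair = tally(docs, 'dftCode', 'dftFunctional')

# most_common() sorts from the largest group to the smallest.
for (code, xc), count in by_pair.most_common():
    print('%i documents for %s with %s' % (count, code, xc))
```

Counting pairs shows directly which code and functional combination dominates, rather than inferring it from two separate lists.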

It looks like most of the data was calculated with Quantum ESPRESSO and BEEF-vdW. Let's take a closer look at this subset and profile its chemistries and source publications.

In [7]:
_docs = [doc for doc in docs if (doc['dftCode'] == 'Quantum ESPRESSO 5.1' and
                                 doc['dftFunctional'] == 'BEEF-vdW')]
print('%i QE-BEEF documents' % len(_docs))

for reaction in reactions:
    __docs = [doc for doc in _docs if doc['Equation'] == reaction]
    print('    %i documents for %s reaction' % (len(__docs), reaction))
print()

pubs = {doc['pubId'] for doc in _docs}
print('%i total publications' % len(pubs))
for pub in pubs:
    __docs = [doc for doc in _docs if doc['pubId'] == pub]
    print('    %i documents from %s' % (len(__docs), pub))

31456 QE-BEEF documents
    3534 documents for H2O(g) - H2(g) + * -> O* reaction
    10074 documents for 0.5H2(g) + * -> H* reaction
    9000 documents for 0.5N2(g) + * -> N* reaction
    1148 documents for H2O(g) - 0.5H2(g) + * -> OH* reaction
    219 documents for CH2CH2* + H2(g) + * -> CH3CH2* + H* reaction
    234 documents for CO(g) + * -> CO* reaction
    9 documents for 2.0H2O(g) - 1.5H2(g) + * -> OOH* reaction
    0 documents for 2.0H2O(g) + * -> OOH* + 1.5H2(g) reaction
    438 documents for H2(g) + 2.0* -> 2.0H* reaction
    6800 documents for CH4(g) - 2.0H2(g) + * -> C* reaction

6 total publications
    198 documents from SandbergStrongly2018
    793 documents from HansenFirst2018
    30420 documents from MamunHighT2019
    18 documents from Unpublished
    9 documents from Park2D2019
    18 documents from ChanMolybdenum2014


Whoa. It turns out that nearly all of this subset (30420 of 31456 documents) comes from one publication, MamunHighT2019. Let's see which reactions it focuses on.

In [8]:
_docs = [doc for doc in docs if (doc['pubId'] == 'MamunHighT2019' and
                                 doc['dftCode'] == 'Quantum ESPRESSO 5.1' and
                                 doc['dftFunctional'] == 'BEEF-vdW')]
print('%i Mamun documents' % len(_docs))

for reaction in reactions:
    __docs = [doc for doc in _docs if doc['Equation'] == reaction]
    print('    %i documents for %s reaction' % (len(__docs), reaction))

30420 Mamun documents
    3534 documents for H2O(g) - H2(g) + * -> O* reaction
    10074 documents for 0.5H2(g) + * -> H* reaction
    9000 documents for 0.5N2(g) + * -> N* reaction
    1148 documents for H2O(g) - 0.5H2(g) + * -> OH* reaction
    0 documents for CH2CH2* + H2(g) + * -> CH3CH2* + H* reaction
    0 documents for CO(g) + * -> CO* reaction
    0 documents for 2.0H2O(g) - 1.5H2(g) + * -> OOH* reaction
    0 documents for 2.0H2O(g) + * -> OOH* + 1.5H2(g) reaction
    0 documents for H2(g) + 2.0* -> 2.0H* reaction
    6664 documents for CH4(g) - 2.0H2(g) + * -> C* reaction


Looks like the Mamun subset gives us energies for C, N, H, O, and OH adsorption. Sounds good to me!
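
One way to double-check that list is to pull the adsorbate names straight out of the equation strings. The helper below is a sketch, not part of the original notebook: `adsorbed_products` is a hypothetical name, and the parsing assumes product-side terms are separated by ' + ' and that adsorbed species end in '*', as in the equations printed above.

```python
import re

# Equations with nonzero counts in the Mamun output above.
mamun_reactions = [
    'H2O(g) - H2(g) + * -> O*',
    '0.5H2(g) + * -> H*',
    '0.5N2(g) + * -> N*',
    'H2O(g) - 0.5H2(g) + * -> OH*',
    'CH4(g) - 2.0H2(g) + * -> C*',
]

# Optional numeric coefficient, then the species name, then the '*' site marker.
_ADSORBATE = re.compile(r'(?:\d+(?:\.\d+)?)?([A-Za-z][A-Za-z0-9]*)\*')


def adsorbed_products(equation):
    """Return the names of adsorbed species on the product side of an equation."""
    products = equation.split('->', 1)[1]
    names = []
    for term in products.split(' + '):
        match = _ADSORBATE.fullmatch(term.strip())
        if match:  # gas-phase terms and bare '*' sites do not match
            names.append(match.group(1))
    return names


found = {name for eq in mamun_reactions for name in adsorbed_products(eq)}
print(sorted(found))
```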