Our preliminary results show that the MMS + null model pairing works extraordinarily well. In fact, it works so well that we are having trouble believing it. We hypothesize that it does well because some bulks/surfaces in our dataset have very few sites. For example:
- Say that one surface has one site, while a different surface has 20. During the Active Learning portion of MMS, we pick the surface with more uncertainty. If no sites have been sampled yet, then we will probably pick the surface with one site because it will have a wider uncertainty.
- Now that the surface with one site is fully sampled, it contributes much less uncertainty to the value of the bulk. This bulk will now be much more certain than other, similar bulks.
- Similarly, bulks with fewer surfaces are likely to have wider uncertainties and are therefore less likely to be chosen during the Level Set Estimation portion of MMS.

This means that MMS may disproportionately choose to sample materials with fewer sites. This is not necessarily bad in practice, but in our hallucinations it makes MMS disproportionately good at filling out a wide search space very quickly.
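The bias described above can be illustrated with a toy calculation. This is only a sketch under loud assumptions: it treats a surface's value as the mean over its sites, gives every unsampled site an independent prior with the same standard deviation, and treats sampled sites as known exactly. None of this is claimed to be how MMS actually aggregates uncertainty; `mean_std` and `prior_std` are hypothetical names.

```python
import math


def mean_std(n_sites, n_sampled, prior_std=1.0):
    """Standard deviation of the mean over sites (toy model).

    Assumes sampled sites are known exactly and each unsampled site has an
    independent prior with standard deviation `prior_std`.
    """
    n_unsampled = n_sites - n_sampled
    return prior_std * math.sqrt(n_unsampled) / n_sites


for n_sites in (1, 20):
    before = mean_std(n_sites, n_sampled=0)
    after = mean_std(n_sites, n_sampled=1)
    print(f'{n_sites:2d} sites: std before = {before:.3f}, after one sample = {after:.3f}')
```

Under these assumptions the one-site surface starts with the wider uncertainty (1.000 versus about 0.224), so it gets picked first, and a single sample collapses its uncertainty to zero, while the 20-site surface barely changes (about 0.218).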

We hypothesize that removing bulks with very few adsorption sites from the dataset may temper this exceedingly good performance. To test this, we must first take out the bulks with few sites. Kirby did some offline testing and found that a minimum of 40 sites per bulk is a good cutoff to start with, so that's where we start.
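To get a feel for how a given cutoff changes the dataset, one could tabulate how many bulks and sites survive at several candidate values. The sketch below uses made-up site labels rather than the real databases; `retention_by_cutoff` and the `bulk-a`/`bulk-b`/`bulk-c` IDs are hypothetical and only there for illustration.

```python
from collections import Counter


def retention_by_cutoff(sites_per_bulk, cutoffs):
    """Map each cutoff to (bulks kept, sites kept) when bulks with fewer sites are dropped."""
    summary = {}
    for cutoff in cutoffs:
        kept = [n for n in sites_per_bulk.values() if n >= cutoff]
        summary[cutoff] = (len(kept), sum(kept))
    return summary


# Hypothetical data: one bulk ID per adsorption site (not from the real databases)
toy_site_labels = ['bulk-a'] * 3 + ['bulk-b'] * 45 + ['bulk-c'] * 120
sites_per_bulk = Counter(toy_site_labels)

summary = retention_by_cutoff(sites_per_bulk, cutoffs=[1, 40, 100])
for cutoff, (n_bulks, n_sites) in summary.items():
    print(f'cutoff {cutoff:3d}: {n_bulks} bulks, {n_sites} sites kept')
```

With the real data, the same counting could be fed from the per-MPID site counts, which would show how much of the search space a cutoff of 40 removes.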

In [1]:
adsorbates = {'CO', 'H', 'OH'}

In [12]:
import os
from collections import defaultdict
from tqdm.notebook import tqdm
import ase.db


min_sites = 40
dbs = {ads: ase.db.connect(ads + '/%s.db' % ads) for ads in adsorbates}

# Filter each adsorbate's database independently
for ads, db in dbs.items():
    sites_per_mpid = defaultdict(int)
    atoms_list = []
    docs = []

    # Count the number of sites per MPID
    for row in tqdm(db.select(), desc='Reading %s' % ads, total=db.count()):
        sites_per_mpid[row.data['mpid']] += 1

        # Keep the structure and its metadata in memory for the write pass
        atoms_list.append(row.toatoms())
        docs.append(row.data)

    # Re-save only the bulks with sufficient numbers of sites.
    # The output folder is created if it is missing. Note that re-running
    # this cell appends to an existing truncated database.
    good_mpids = {mpid for mpid, n_sites in sites_per_mpid.items() if n_sites >= min_sites}
    os.makedirs(ads + '_truncated', exist_ok=True)
    new_db = ase.db.connect(ads + '_truncated/%s.db' % ads)
    for atoms, doc in tqdm(zip(atoms_list, docs), desc='Writing %s' % ads, total=len(docs)):
        if doc['mpid'] in good_mpids:
            new_db.write(atoms, data=doc)

[Output: progress bars for the reading and writing passes over 19105 CO rows, 4193 OH rows, and 22694 H rows.]