# Mineral name pangrams

My mineral list is from [this 200+ page PDF table from IMA](http://nrmima.nrm.se//imalist.htm).

Some prelims:

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline

import pandas as pd
import requests
import io
import re

Read the data:

In [7]:
with open('./data/IMA_mineral_names.txt', 'r') as f:
    names = [i.strip() for i in f.readlines()]

## Finding pangrams

Finding pangrams in a list of words amounts to solving the classical [set cover problem](https://en.wikipedia.org/wiki/Set_cover_problem).

I've written about this on agilescientific.com.
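For context, the usual cheap heuristic for set cover is greedy: keep picking the word that covers the most letters still missing. The sketch below is only an illustration of that idea, not the brute-force search used in this notebook, and it does not guarantee the smallest collection. The function name `greedy_cover` and the tiny word list are made up for the example.

```python
import string

def greedy_cover(words, universe=None):
    """Greedy set-cover heuristic: returns (chosen words, letters still missing)."""
    if universe is None:
        universe = set(string.ascii_lowercase)
    missing = set(universe)
    chosen = []
    while missing:
        # Pick the word that covers the most letters we still need.
        best = max(words, key=lambda w: len(missing & set(w)))
        gain = missing & set(best)
        if not gain:
            break  # No word adds anything new, so a full cover is impossible.
        chosen.append(best)
        missing -= gain
    return chosen, missing

# Toy example with a reduced universe (hypothetical data, not the IMA list).
demo_words = ['quartz', 'gold', 'ice', 'beryl', 'topaz']
chosen, missing = greedy_cover(demo_words, universe=set('quartzgold'))
print(chosen, sorted(missing))
```

A greedy answer gives a quick upper bound on how many names are needed, which is handy for deciding where to start the exhaustive search.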

In [8]:
# from random import shuffle
# shuffle(names)

In [9]:
names = sorted(names, key=len)

In [10]:
names[:20]

['ice',
 'tin',
 'gold',
 'iron',
 'lead',
 'lime',
 'opal',
 'talc',
 'urea',
 'zinc',
 'beryl',
 'niter',
 'topaz',
 'trona',
 'tuite',
 'uvite',
 'abuite',
 'afmite',
 'agaite',
 'ajoite']

To reduce the problem to something my laptop can actually solve, I'm guessing that Quartz will be useful, having Q and Z, so let's move it to the front of the list.

In [4]:
names.pop(names.index('quartz'))
names = ['quartz'] + names

Make a set of all characters a to z to compare to.

In [11]:
alphaset = set(chr(65+i).lower() for i in range(26))
print(sorted(alphaset))

['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


We're using brute force. This will take a while: there are 4670+ items, so there are roughly 20 trillion ways to choose 4 of them (about 475 trillion if order mattered).
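To get a feel for the size of the search space, the counts can be computed directly with `math.comb` and `math.perm` (Python 3.8+). This is just a quick check; the corpus size of 4670 is taken from the text above and the real list may be a little longer.

```python
from math import comb, perm

n_names = 4670  # Corpus size quoted in the text; an approximation.

for k in range(1, 5):
    # comb: unordered selections; perm: ordered arrangements.
    print(f"k={k}: {comb(n_names, k):,} combinations, {perm(n_names, k):,} arrangements")
```

`itertools.combinations` walks unordered selections, so the `comb` figure is the one that matters for the brute-force loop.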

Rather than only finding pangrams, I'll record, for each number of distinct letters, the shortest combination (by total character count) that reaches it. So we'll get the shortest combination with exactly 15 unique letters of the alphabet, and so on.

In [12]:
from itertools import combinations

def find_pangrams(words, universe=None, seed=None, start_length=0, stop_length=27, min_length=0):
    """
    words: the corpus, S
    universe: the universe, defaults to the alphabet
    seed: a word you'd like to force it to use, e.g. quartz; it is appended to every combination.
    start_length: the smallest collection size to check (not counting the seed).
    stop_length: don't check any collections longer than this.
    min_length: make sure there are at least this many *letters* in the collection.
    """
    
    if universe is None:
        universe = set(chr(65+i).lower() for i in range(26))

    most = 0
    results = {n: [] for n in range(27)}
    shortest = {n:np.inf for n in range(27)}
    t = "{:2s}{}{} ({}, {})"
    
    for length in range(start_length, stop_length+1):  # This won't finish for some corpora.
        print('\nLength: {}'.format(length))
    
        for c in combinations(words, length):
            printed = False

            if seed is not None:
                c = list(c) + [seed]

            j = ''.join(c)
            all_letters = len(j)
            if all_letters < min_length: continue
            s = set(j)
            letters = len(s)

            if letters > most:
                most = letters
                print(t.format(str(most), '++++', c, all_letters,  ''.join(list(universe - s))))
                printed = True

            if all_letters < shortest[letters]:
                shortest[letters] = all_letters
                results[letters] = c
                if not printed:
                    print(t.format(str(letters), '....', c, all_letters, ''.join(list(universe - s))))

    return results

I'm fairly sure that we're looking for something with 4 mineral names. I have run through all the combinations of 3 names, and got a maximum of 24 letters of the alphabet, with `('hexamolybdenum', 'pizgrischite', 'kvanefjeldite')`, which lacks `q` and `w`.
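That claim about the three-name result is easy to check by hand. Here is a small helper for it; `missing_letters` is a name I've made up for this sketch, not something defined elsewhere in the notebook.

```python
import string

def missing_letters(words, universe=string.ascii_lowercase):
    """Return the sorted letters of `universe` that appear in none of `words`."""
    return sorted(set(universe) - set(''.join(words)))

trio = ('hexamolybdenum', 'pizgrischite', 'kvanefjeldite')
print(missing_letters(trio))  # ['q', 'w']
```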

When you're ready, run the next code block!

In [None]:
find_pangrams(names, start_length=4, stop_length=4, min_length=25)


Length: 4
14++++('ice', 'tin', 'gold', 'addibischoffite') (25, wvrzmkxqyjpu)
13....('ice', 'tin', 'gold', 'andreyivanovite') (25, bwzxmhkqfjspu)
12....('ice', 'tin', 'gold', 'arseniosiderite') (25, bwvzmhkxqfyjpu)
16++++('ice', 'tin', 'gold', 'barioperovskite') (25, wzxmhqfyju)
15....('ice', 'tin', 'gold', 'ferrisymplesite') (25, bvzwxhkqjau)
11....('ice', 'tin', 'gold', 'grootfonteinite') (25, bwvzmhkxqyjsapu)
17++++('ice', 'tin', 'gold', 'pushcharovskite') (25, bwzxmqfyj)
10....('ice', 'tin', 'iron', 'arseniosiderite') (25, bwvzmhkxqflyjpgu)
9 ....('ice', 'tin', 'iron', 'grootfonteinite') (25, bwvzmhkxqyjsdaplu)
8 ....('ice', 'tin', 'niter', 'henritermierite') (26, bwvzxkqflyjsdapgou)
18++++('ice', 'tin', 'topaz', 'hydrobasaluminite') (28, wvxkqfjg)
8 ....('ice', 'tin', 'tuite', 'uchucchacuaite') (25, bwrvmzkxqflyjsdpgo)
18....('ice', 'tin', 'davyne', 'plumbojarosite') (26, wzxkhqfg)
18....('ice', 'tin', 'gypsum', 'fluorowardite') (25, bvzxhkqj)
19++++('ice', 'tin', 'gypsum', 'hydro

There you have it. Run the block above if you dare... I have not let it complete yet. If it doesn't complete, note that the printouts are the only record you have. It's probably a good idea to modify the function to write a file as we go.
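One way to keep a record as the search runs is to append each result to a file as a JSON line. This is a rough sketch, not a change to `find_pangrams` itself: `log_result`, `read_results`, and the record fields are names I've invented here, and you would call `log_result` wherever the function currently prints.

```python
import json

def log_result(path, letters, words, total_length, missing):
    """Append one search result to `path` as a JSON line."""
    record = {
        'letters': letters,            # number of distinct letters covered
        'words': list(words),          # the combination itself
        'total_length': total_length,  # total characters across the words
        'missing': ''.join(sorted(missing)),
    }
    # Reopen in append mode on every call so each line is written out
    # and the file is closed before the search moves on.
    with open(path, 'a') as f:
        f.write(json.dumps(record) + '\n')

def read_results(path):
    """Read back every logged record as a list of dicts."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```

Reopening the file for every result is slow if results are frequent, but here they are rare compared with the number of combinations checked, so simplicity seems like the better trade.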

Some other ideas for things to try:

- Parallelize the search. It's not embarrassingly parallel, because each worker keeps its own running bests: the combined output would be much larger than a single-core run's, but each worker's share of the search would be a lot smaller, so it should be faster overall.
- Do a more rigorous search for the best seed word; I'm sure it's not quartz. Intuitively, I think you want the longest name with the most rare letters. I don't know what that is, but it would be easy to do some letter counts and rough searches.
- Er, that's all I can think of right now.
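As a starting point for the seed-word idea, here is one rough way to score names by how rare their letters are. The scoring rule (sum over distinct letters of the inverse of the fraction of names containing that letter) is my own assumption, as are the name `rank_seeds` and the toy list; other weightings may well work better.

```python
from collections import Counter

def rank_seeds(names, top=5):
    """Rank names by the summed rarity of their distinct letters, rarest-heavy first."""
    # Count how many names contain each letter (document frequency).
    letter_counts = Counter(ch for name in names for ch in set(name) if ch.isalpha())
    n = len(names)

    def score(name):
        # Each distinct letter contributes more the fewer names contain it.
        return sum(n / letter_counts[ch] for ch in set(name) if ch.isalpha())

    return sorted(names, key=score, reverse=True)[:top]

# Toy list only; run it on the full `names` list for a real answer.
sample = ['ice', 'tin', 'gold', 'quartz', 'topaz', 'beryl']
print(rank_seeds(sample, top=3))
```

Because the score sums over distinct letters, longer names with many unusual letters rise to the top, which matches the intuition above.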

<hr />

Undersampled Radio 2017