## Phoneword generation: explanation & usage examples

### Prep work

First, we'll need a list of valid English words, which we pull from the NLTK 'words' corpus. We can also filter out very short words like "I", "AM", "BE" by setting min_vocab_word_len above their length (the example below uses 3). Such short words rarely make useful phonewords. A helper function, find_valid_word(), extracts words from this vocabulary using regular expressions.

Finally, we'll specify the mapping between digits and letters on a phone keypad. For convenience, we'll also map each digit to itself.

In [1]:
import re
from nltk.corpus import words

def get_english_vocabulary(min_vocab_word_len):
    """ Use the NLTK 'words' corpus; words shorter than min_vocab_word_len are excluded. """
    try:
        vocab = set([word.upper() for word in words.words() if len(word) >= min_vocab_word_len])
    except LookupError:  # NLTK raises LookupError when the corpus has not been downloaded yet
        print("NLTK 'words' corpus not found. Downloading...")
        import nltk
        nltk.download('words')
        vocab = set([word.upper() for word in words.words() if len(word) >= min_vocab_word_len])
            
    return vocab


def find_valid_word(regexp, vocab, get_all_words = False):
    """  Output either a single valid English word or a set of all English words matching input regex string. """
    all_words = set()
    for word in vocab: 
        hit = re.match(regexp, word.upper())
        if hit is not None:
            if not get_all_words: return word.upper()
            all_words.add(word.upper())
    return all_words

mapping = {"1": ["1"],
           "2": ["A","B","C","2"],
           "3": ["D","E","F","3"],
           "4": ["G","H","I","4"],
           "5": ["J","K","L","5"],
           "6": ["M","N","O","6"],
           "7": ["P","Q","R","S","7"],
           "8": ["T","U","V", "8"],
           "9": ["W","X","Y","Z", "9"],
           "0": ["0"]}

vocabulary = get_english_vocabulary(min_vocab_word_len = 3)
words_starting_with_snow = find_valid_word('^SNOW.*', vocabulary, get_all_words = True)
print(words_starting_with_snow)

{'SNOWFOWL', 'SNOW', 'SNOWCAP', 'SNOWBREAK', 'SNOWLESS', 'SNOWBOUND', 'SNOWISH', 'SNOWBERRY', 'SNOWY', 'SNOWFLIGHT', 'SNOWSHOED', 'SNOWHOUSE', 'SNOWL', 'SNOWPLOW', 'SNOWBUSH', 'SNOWBERG', 'SNOWSHOE', 'SNOWSUIT', 'SNOWINESS', 'SNOWSLIDE', 'SNOWSHINE', 'SNOWBLINK', 'SNOWMOBILE', 'SNOWSHOER', 'SNOWBALL', 'SNOWMANSHIP', 'SNOWLIKE', 'SNOWSHED', 'SNOWDRIFT', 'SNOWILY', 'SNOWFALL', 'SNOWSTORM', 'SNOWDONIAN', 'SNOWSCAPE', 'SNOWSHOEING', 'SNOWCRAFT', 'SNOWSHADE', 'SNOWLAND', 'SNOWK', 'SNOWFLOWER', 'SNOWIE', 'SNOWSLIP', 'SNOWWORM', 'SNOWBANK', 'SNOWFLAKE', 'SNOWHAMMER', 'SNOWPROOF', 'SNOWBELL', 'SNOWDROP', 'SNOWBIRD'}


<b>words_to_number()</b> converts phonewords back to plain phone numbers. For each character, it looks up the digit whose entry in the mapping dict contains that character:

In [2]:
def words_to_number(wordified, mapping):
    """ Reverse of number_to_words(). """
    reversed_num = ""
    for ch in wordified.replace("-","").upper():
        reversed_num +=  next(key for key, val in mapping.items() if ch in val)
            
    return reversed_num[0] + '-' + reversed_num[1:4] + "-" + reversed_num[4:7] + "-" + reversed_num[7:]

test_strings = ["1-800-HAMSTER", "1-800-PAINTER"]
for string in test_strings:
    print(string, ' <-> ', words_to_number(string, mapping))

1-800-HAMSTER  <->  1-800-426-7837
1-800-PAINTER  <->  1-800-724-6837
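
words_to_number() scans the mapping for every character. As a minimal sketch of the same reverse lookup, the mapping can instead be inverted once into a character-to-digit dict. The names keypad, char_to_digit and letters_to_digits are illustrative and not part of the notebook; the sketch returns bare digits without the dash formatting.

```python
# Keypad layout, same content as the `mapping` dict above but stored as strings.
keypad = {
    "1": "1", "2": "ABC2", "3": "DEF3", "4": "GHI4", "5": "JKL5",
    "6": "MNO6", "7": "PQRS7", "8": "TUV8", "9": "WXYZ9", "0": "0",
}

# Invert once: every letter (and every digit) points to its keypad digit.
char_to_digit = {ch: digit for digit, chars in keypad.items() for ch in chars}


def letters_to_digits(phoneword):
    """Translate a phoneword into a bare digit string (hypothetical helper)."""
    cleaned = phoneword.replace("-", "").upper()
    return "".join(char_to_digit[ch] for ch in cleaned)


print(letters_to_digits("1-800-PAINTER"))  # 18007246837
```

The result agrees with the 1-800-724-6837 output above once the dashes are removed.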


A single phone number can be wordified in many different ways. <b>number_to_words()</b> returns only the first wordification it finds, as follows:

input string -> substrings -> regexes -> english_words (stop when found 1) -> wordified_string

The print statement near the bottom of the function can be uncommented to show these intermediate steps. 

Also note that min_word_len can be set higher than the minimum word length used to build the vocabulary, which restricts the search to longer words.

In [3]:
def number_to_words(num, vocab, mapping, min_word_len = 1):
    """ Convert str representing phone number to wordified str. If no valid wordifications exist - return original digits. """
    def find_usable_substrings(num_str, min_len = 1):
        """ Find substrings of num_str and further split them on chars that can't be converted to letters ('1' and '0'). """
        num_len, subs = len(num_str), set()
        
        for i in range(num_len):
            subs = subs.union([num_str[i:j + 1] for j in range(i + min_len - 1, num_len)])
        
        for el in subs.copy():
            if ("1" in el) or ("0" in el):
                subs.remove(el)
                # update() mutates the set in place; union() returns a new set that would be discarded
                subs.update(part for part in el.replace("1", "0").split("0") if len(part) >= min_len)
        return subs  # return only after every substring has been checked

    wordified = num.replace("-","")
    subs = find_usable_substrings(wordified, min_len = min_word_len)
    
    while len(subs)>0:
        substr = subs.pop()
        reg = "^" + "".join(["[" + "".join(mapping[ch]) + "]" for ch in substr]).replace("[-]","") + "$"
        word = find_valid_word(reg, vocab)
        if word == set(): word = None
        if word is not None: 
            wordified = wordified.replace(substr, word) #will replace all copies of substr if >1 present. Spec didn't specify desired behavior so leaving this as-is
            #print(substr.rjust(12),'<->', reg.rjust(50), '<->', str(word).rjust(12),'<->', wordified.rjust(12)) #uncomment to see intermediate steps
            break

    return wordified

test_strings = ["1-800-724-6837", "1-111-111-1111"]
for string in test_strings:
    print(string, ' <-> ', number_to_words(string, vocabulary, mapping, min_word_len = 4))


1-800-724-6837  <->  1800724OVER
1-111-111-1111  <->  11111111111
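
To make the "substrings -> regexes -> english_words" steps of number_to_words() concrete, here is a small sketch that builds the pattern for one digit substring and tests it against a toy word set. The names keypad, digits_to_regex and toy_vocab are illustrative; the toy set stands in for the NLTK vocabulary.

```python
import re

# Keypad layout, same content as the `mapping` dict above but stored as strings.
keypad = {
    "1": "1", "2": "ABC2", "3": "DEF3", "4": "GHI4", "5": "JKL5",
    "6": "MNO6", "7": "PQRS7", "8": "TUV8", "9": "WXYZ9", "0": "0",
}


def digits_to_regex(digits):
    """Build an anchored pattern with one character class per digit."""
    return "^" + "".join("[" + keypad[d] + "]" for d in digits) + "$"


# Illustrative stand-in for the real vocabulary.
toy_vocab = {"OVER", "MUD", "PAINTER", "SNOW"}

pattern = digits_to_regex("6837")
print(pattern)  # ^[MNO6][TUV8][DEF3][PQRS7]$
print(sorted(w for w in toy_vocab if re.match(pattern, w)))  # ['OVER']
```

Each digit becomes a character class of its letters plus the digit itself, and the anchors force the word to have exactly as many characters as the substring has digits.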


The following function returns all possible wordifications, including multiple wordifications per phone number.
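
The function starts by enumerating every way to split the digit string into contiguous pieces. To illustrate that step on its own, here is a sketch that picks cut positions with itertools.combinations rather than using the recursive generator in the cell below. The name partitions is hypothetical. A string of n characters has 2^(n-1) such partitions, so an 11-digit number has 1024.

```python
from itertools import combinations


def partitions(s):
    """Yield every way to cut s into contiguous, non-empty pieces."""
    n = len(s)
    for k in range(n):  # k is the number of cuts, from 0 to n - 1
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [s[a:b] for a, b in zip(bounds, bounds[1:])]


for p in partitions("683"):
    print(p)
# ['683']
# ['6', '83']
# ['68', '3']
# ['6', '8', '3']

print(len(list(partitions("18007246837"))))  # 1024
```

Because the number of partitions doubles with each extra digit, caching the word lookup per substring (as the function below does with subs_dict) avoids repeating the same vocabulary scan.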

In [4]:
from itertools import product

def all_wordifications(num, vocab, mapping):
    """ Output all possible combinations of numbers and English words in a phone number. """
    
    def gen_partitions(num_str):
        """ Generate all possible partitions of a string into constituent substrings. """
        for i in range(len(num_str)):
            if i == len(num_str) - 1:
                yield num_str

            first, rest = num_str[0:i+1], num_str[i+1:]

            for j in gen_partitions(rest):
                yield '~'.join([first, j])
                
    num = num.replace("-","")

    subs_dict, ans = dict(), set()
    for partition in gen_partitions(num):
        subs = partition.split('~')

        for sub in subs:
            #adding dict entries for previously not encountered substrings
            if sub not in subs_dict:
                if ("1" in sub) or ("0" in sub):
                    subs_dict[sub] = set()
                else:
                    reg = "^" + "".join(["[" + "".join(mapping[ch]) + "]" for ch in sub]) + "$" 
                    subs_dict[sub] = find_valid_word(reg, vocab, get_all_words = True)
                subs_dict[sub].add(sub)

        combos = product(*[subs_dict[el] for el in subs])
        
        for c in combos:
            ans.add("".join(c))
            
    return ans

test_string = "1-800-724-6837"
all_words = all_wordifications(test_string, vocabulary, mapping)
print("Done! Printing output (if minimal word size was set to 1 it may take some time):\n")
print(all_words)

Done! Printing output (if minimal word size was set to 1 it may take some time):

{'18007AHOUDS', '18007AINU37', '1800RAH6837', '1800PAH6UDS', '1800RAHMUD7', '1800SAIOVER', '1800SAIOUF7', '1800SAHOVER', '1800RAIN837', '1800PAHOUF7', '180072HOVER', '18007CINTER', '18007AGOUDS', '180072IMU37', '1800RAG6837', '1800SAGOUF7', '1800RAGMUD7', '1800724OVER', '1800RAHOUF7', '1800SAGO837', '1800SAHO837', '18007BINUDS', '1800SCHO837', '1800PAHOUDS', '1800SAG6837', '1800SAIMUDS', '1800PAH6837', '1800SAIM837', '18007CHO837', '1800PAINT37', '18007BIM837', '1800PAINTER', '18007AIMUDS', '18007AGO837', '180072GOTE7', '18007AIM837', '18007CHOUDS', '180072HOVE7', '1800SAGMUD7', '1800RAGOUF7', '1800724MUD7', '180072HOT37', '180072GNU37', '180072INTER', '1800SAHMUD7', '1800SAG6UDS', '1800SCIOT37', '1800RAHOVER', '1800PAHMUD7', '1800PAHO837', '1800SAH6837', '1800RAINUDS', '18007AINT37', '18007BINT37', '1800RAH6UDS', '1800SCHOUDS', '1800SAINT37', '1800724OUF7', '18007246UDS', '180072GOVE7', '18007CHOU37', '1

If some of these words don't look familiar, it is for one of two reasons:
1. Two (or more) neighboring wordifications look like one word (e.g. GNUMAD = GNU + MAD).
2. The words.words() corpus contains some unusual words. As a quick sanity check, we can verify that they are indeed valid words according to that dictionary:

In [5]:
unusual_words = ['GOVE', 'SAH', 'OUF',  'OBVIOUSLYFAKEWORD']
for word in unusual_words:
    print('word', word, 'in vocabulary:', word in vocabulary)

word GOVE in vocabulary: True
word SAH in vocabulary: True
word OUF in vocabulary: True
word OBVIOUSLYFAKEWORD in vocabulary: False


If necessary, the number of these unusual words can be reduced by increasing min_vocab_word_len and/or by using a different vocabulary.
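
As a minimal sketch of the first option, the filter below drops short entries from a toy vocabulary. The names toy_vocab and filter_by_length are illustrative and not part of the notebook. With the real data, the same effect comes from passing a larger min_vocab_word_len to get_english_vocabulary().

```python
def filter_by_length(vocab, min_len):
    """Keep only the words that have at least min_len characters."""
    return {word for word in vocab if len(word) >= min_len}


# Illustrative stand-in for the real vocabulary.
toy_vocab = {"SAH", "OUF", "GOVE", "OVER", "PAINTER"}

print(sorted(filter_by_length(toy_vocab, 4)))  # ['GOVE', 'OVER', 'PAINTER']
```

Note that a length filter removes short oddities such as SAH and OUF but keeps longer ones such as GOVE, which is where a different word list would help.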