# Dappity Dap

### Characteristics of Puns
* Converging Meanings 
* Sound 
* Association

_Things to try:_
* Split words by sound or by parsing to improve the accuracy of the converging-meanings hypothesis
    -  e.g. "The soundtrack for Blackfish was **orca**strated."
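
A minimal sketch of the word-splitting idea, assuming a simple substring scan against a word list. The function name `embedded_words` and the tiny vocabulary are illustrative assumptions; a real run would need a full dictionary and probably a phonetic comparison as well.

```python
def embedded_words(token, vocabulary, min_len=3):
    """Return vocabulary words hidden inside `token`, longest first."""
    token = token.lower()
    found = set()
    for start in range(len(token)):
        for end in range(start + min_len, len(token) + 1):
            piece = token[start:end]
            # Ignore the token itself; we only want words embedded in it
            if piece in vocabulary and piece != token:
                found.add(piece)
    return sorted(found, key=lambda w: (-len(w), w))


# Hypothetical mini-vocabulary, for illustration only
VOCAB = {"orca", "cast", "rated"}

print(embedded_words("orcastrated", VOCAB))
```

Each embedded word could then be passed through the same synonym lookup as an ordinary token.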

### Target: Converging Meanings

We have observed that puns often make use of words that have very similar meanings. For example:

'He said I was **average** - but he was just being **mean**.'

Here 'average' and 'mean' share a meaning (the arithmetic mean), while 'mean' also carries a second sense (unkind), and that double reading is what makes the pun work.

___

In order to test this, we will do the following:

* Step 1: Use Synset to list synonyms of tokens
* Step 2: Find common words in Synsets within a sentence
* Step 3: Determine correlation between converging meanings & whether a sentence is a pun or not

---

Import/Download relevant packages:

In [51]:
from textblob import Word
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\WaThone\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


For this method, we will use NLTK's WordNet corpus to find the synsets of each token in a sentence.

As an example, let's test it out on the word **'plant'** first:

In [52]:
word = Word('plant')
for i in range(3):
    print('Use Case ', i)
    print(word.synsets[i])
    print(word.definitions[i])
    print(word.synsets[i].lemma_names())
    print(' ')

Use Case  0
Synset('plant.n.01')
buildings for carrying on industrial labor
['plant', 'works', 'industrial_plant']
 
Use Case  1
Synset('plant.n.02')
(botany) a living organism lacking the power of locomotion
['plant', 'flora', 'plant_life']
 
Use Case  2
Synset('plant.n.03')
an actor situated in the audience whose acting is rehearsed but seems spontaneous to the audience
['plant']
 


Through WordNet, we can find the **use cases** (Synsets) of the input word "plant", as well as the **definition** and **synonyms** (lemma names) of each use case.

---
        
           
Let's first eyeball how relevant the lemmas of each significant word in a sentence are to determining whether the sentence is a pun.

**The example we will use is: "The past, the present and the future walked into a bar. It was tense."**



In [53]:
# First, importing relevant packages, etc

import codecs
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import PunktSentenceTokenizer,sent_tokenize, word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer, PorterStemmer
import re

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\WaThone\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\WaThone\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


We'll need to process the sentence, which includes lemmatizing, filtering out stop words, stripping punctuation and tokenizing the sentence.

In [54]:
def simpleFilter(sentence):
    
    '''This function strips punctuation from the input sentence, tokenizes
    it, removes stopwords, lemmatizes the remaining tokens, and returns a
    list of the filtered tokens.'''
    
    filtered_sent = []
    
    # Strip punctuation
    stripped = re.sub("[(.)',=!#@]", '', sentence)
        
    # Load the English stopword list
    stop_words = set(stopwords.words("english"))
    
    # Tokenize
    words = word_tokenize(stripped)
    
    # Lemmatize and Filter out Stopwords
    lemmatizer = WordNetLemmatizer()
    for w in words:
        if w not in stop_words:
            filtered_sent.append(lemmatizer.lemmatize(w))

    return filtered_sent
  
def printLemmas(word):
    
    '''This function prints out all synonyms of a given word.'''
    
    for ss in Word(word).synsets:
        print(ss.lemma_names())
        

# Print 

s = 'The past, the present and the future walked into a bar. It was tense.'

for word in simpleFilter(s):
    print("Filtered word: '" + word + "' and its lemmas:")
    printLemmas(word)
    print()


Filtered word: 'The' and its lemmas:

Filtered word: 'past' and its lemmas:
['past', 'past_times', 'yesteryear']
['past']
['past', 'past_tense']
['past']
['past', 'preceding', 'retiring']
['by', 'past']

Filtered word: 'present' and its lemmas:
['present', 'nowadays']
['present']
['present', 'present_tense']
['show', 'demo', 'exhibit', 'present', 'demonstrate']
['present', 'represent', 'lay_out']
['stage', 'present', 'represent']
['present', 'submit']
['present', 'pose']
['award', 'present']
['give', 'gift', 'present']
['deliver', 'present']
['introduce', 'present', 'acquaint']
['portray', 'present']
['confront', 'face', 'present']
['present']
['salute', 'present']
['present']
['present']

Filtered word: 'future' and its lemmas:
['future', 'hereafter', 'futurity', 'time_to_come']
['future', 'future_tense']
['future']
['future']
['future']
['future', 'next', 'succeeding']
['future']

Filtered word: 'walked' and its lemmas:
['walk']
['walk']
['walk']
['walk']
['walk']
['walk']
['walk']
[
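
Note that 'The' survives the filter in the output above: the NLTK stop-word list is lower-case, and `simpleFilter` compares tokens case-sensitively. A minimal sketch of a case-insensitive check follows. The small hand-written stop-word set stands in for the NLTK list, and the function name `drop_stop_words` is an illustrative assumption.

```python
# Tiny stand-in for NLTK's English stop-word list (which is all lower-case)
STOP_WORDS = {"the", "and", "into", "a", "it", "was"}


def drop_stop_words(tokens):
    """Keep only tokens whose lower-cased form is not a stop word."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


tokens = ["The", "past", "the", "present", "and", "the", "future",
          "walked", "into", "a", "bar", "It", "was", "tense"]
print(drop_stop_words(tokens))
```

Lower-casing only for the comparison keeps the original token intact for later steps such as lemmatization.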

---
## **Hypothesis 1: Converging Meaning Pun**

We observe that the word 'tense' appears inside a lemma of each of the words 'past', 'present', and 'future' ('past_tense', 'present_tense', 'future_tense'). Since we are exploring puns with converging meanings, **we hypothesise that we are more likely to find words with converging meanings in puns than in non-puns.**

---

To do this, we first produce a list of unique synonyms of a certain word, excluding the word itself.


Let's try this on the word "plant".

In [55]:
def create_lemmas(word):
    lemmas_list = []
    for ss in Word(word).synsets:
        lemmas_list.append(ss.lemma_names())
    return lemmas_list

def process_lemmas(lemmas_list, word):
    '''
    This function processes the lemma lists of all the definitions of a word
    and returns a list of the unique words associated with it, excluding
    the word itself.
    '''
    all_lemmas = []
    for each_list in lemmas_list:
        for lemma in each_list:
            if lemma != word and lemma not in all_lemmas:
                all_lemmas.append(lemma)
    return all_lemmas


print(process_lemmas(create_lemmas('plant'), 'plant'))

['works', 'industrial_plant', 'flora', 'plant_life', 'set', 'implant', 'engraft', 'embed', 'imbed', 'establish', 'found', 'constitute', 'institute']


Next, we have to find out if synonyms of any word in a sentence can be found in the rest of the sentence, and count the number of times this occurs.

In [56]:
def common_syn(s):
    
    '''
    This function takes in a sentence, processes and tokenizes it and
    prints each significant word and tests if its synonyms can be found
    in the rest of the sentence. It prints the pair and returns the
    number of pairs found.
    '''
    
    count = 0
    
    # Filter the sentence to remove filler words / stopwords
    filtered_words = simpleFilter(s)
    
    for index, word in enumerate(filtered_words):
        if word.isalpha():
            lemma_list_of_term = process_lemmas(create_lemmas(word),word)

            # test if any word in the rest of the sentence appears in the lemma list of current word
            for other_word in filtered_words[index+1:]:
                if other_word in ' '.join(lemma_list_of_term):
                    count += 1
                    print(word, other_word)
    return count
    
    
s = 'The past, the present and the future walked into a bar. It was tense.'
print('The number of synonym pairs in this sentence is',common_syn(s))

past tense
present tense
future tense
The number of synonym pairs in this sentence is 3
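
`common_syn` tests `other_word in ' '.join(...)`, which is a substring check. That is what lets 'tense' match 'past_tense', but it also lets a short fragment match any lemma that merely contains those letters. A hedged sketch of the difference between substring and whole-token matching follows; the helper names and the sample lemma list are illustrative and not part of the notebook.

```python
def matches_substring(word, lemma_names):
    """Mimic the check in common_syn: substring match on the joined lemmas."""
    return word in ' '.join(lemma_names)


def matches_token(word, lemma_names):
    """Match only whole tokens, splitting multi-word lemmas on underscores."""
    tokens = set()
    for name in lemma_names:
        tokens.update(name.lower().split('_'))
    return word.lower() in tokens


# Illustrative lemma list
sample = ['past_tense', 'table']

print(matches_substring('tense', sample), matches_token('tense', sample))
print(matches_substring('le', sample), matches_token('le', sample))
```

Splitting on underscores keeps the 'tense' match while rejecting fragments such as 'le'.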


To see whether this method works, we will test it on our list of pre-tagged puns and non-puns, where puns are tagged '1' and non-puns are tagged '0'.

We import the list and apply our function common_syn to each sentence, storing the result in a new column, 'Syn Count'.

In [57]:
import pandas as pd
df = pd.read_csv('puns_final.csv', encoding='latin-1')
# df = df.drop('Unnamed: 0', axis=1)

df['Syn Count'] = df['Sentence'].apply(common_syn)
df.head()

tuna fish
I le
bigger le
bed le
I le
pirate high
pirate sea
make hit
I one
Going sound
bed sleep
After ate
said ate
turned around
broke leg
I one
got one
cannibal eat
got make
paid make
reversing back
I one
got one
got back
one I
go work
Cheap u
Thrills u
want u
post office
wear wear
look see
whistle whistle
mad hare
Old go
die go
back second
call phone
Cell phone
mean egg
laying egg
I number
people wash
little light
seems see
door door
take make
fly fly
like like
metal met
I 5
mean end
went low
wardrobe closet
one I
punch punch
went last
Do get
know get
broth stock
cat sick
cheese cheese
Buffalo Bison
Make one
call one
right -
duck put
Thieves steal
dentist tooth
theatrical performance
pun play
pun word
play word
average mean
In I
past tense
present tense
future tense
soda soda
running go
Better go
present tense
past tense
saw ad
happens come
Id I
Id I
know get
alarm clock
Have eat
ever time
tried time
clock time
take make
seasoned veteran
remember back
boomerang back
I atom
error err

Unnamed: 0,Sentence,P/NP,Syn Count
0,"You can tune a guitar, but you can't tuna fish...",1,1
1,Two peanuts were walking in a tough neighborho...,1,0
2,If I buy a bigger bed will I have more or less...,1,4
3,The earth's rotation really makes my day.,1,0
4,I told my friend she drew her eyebrows too hig...,1,0


To find out if this method is accurate, we use the correlation between whether the sentence is a pun or not and the Syn Count. 

In [58]:
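# Restrict to numeric columns so the text column does not break corr() on newer pandas
corr = df.select_dtypes('number').corr()
corr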

Unnamed: 0,P/NP,Syn Count
P/NP,1.0,-0.24147
Syn Count,-0.24147,1.0


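Here Syn Count is only weakly (and negatively) correlated with the pun label, at about -0.24, so it is not a reliable indicator on its own. The printed pairs also show that the substring check in `common_syn` produces spurious matches such as 'I le' and 'metal met'.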

Perhaps we should try a different approach.

---

Other than the ability to find synonyms, WordNet can also find out a range of other details about a word.  

The functions below use WordNet to yield synonyms, hyponyms, antonyms, "similar to" words, and the words that WordNet records under "also see".

In [59]:
from nltk.corpus import wordnet as wn

def get_all_synsets(word, pos=None):
    # Lemma names of every synset of the word (optionally restricted by part of speech)
    for ss in wn.synsets(word, pos=pos):
        for lemma in ss.lemma_names():
            yield (lemma, ss.name())


def get_all_hyponyms(word, pos=None):
    # Lemma names of the more specific synsets below each synset of the word
    for ss in wn.synsets(word, pos=pos):
        for hyp in ss.hyponyms():
            for lemma in hyp.lemma_names():
                yield (lemma, hyp.name())


def get_all_similar_tos(word, pos=None):
    for ss in wn.synsets(word, pos=pos):
        for sim in ss.similar_tos():
            for lemma in sim.lemma_names():
                yield (lemma, sim.name())


def get_all_antonyms(word, pos=None):
    # Antonyms are defined on lemmas rather than on synsets
    for ss in wn.synsets(word, pos=pos):
        for sslema in ss.lemmas():
            for antlemma in sslema.antonyms():
                yield (antlemma.name(), antlemma.synset().name())


def get_all_also_sees(word, pos=None):
    for ss in wn.synsets(word, pos=pos):
        for also in ss.also_sees():
            for lemma in also.lemma_names():
                yield (lemma, also.name())


def get_all_synonyms(word, pos=None):
    for x in get_all_synsets(word, pos):
        yield (x[0], x[1], 'ss')
    for x in get_all_hyponyms(word, pos):
        yield (x[0], x[1], 'hyp')
    for x in get_all_similar_tos(word, pos):
        yield (x[0], x[1], 'sim')
    for x in get_all_antonyms(word, pos):
        yield (x[0], x[1], 'ant')
    for x in get_all_also_sees(word, pos):
        yield (x[0], x[1], 'also')
       

Let's use the words 'happy' and 'cutlery' to see what kind of details WordNet can figure out about a word.

In [60]:
print("The following are synonyms of 'happy':")
for x in get_all_synsets('happy'):
    print(x)
print()
print("The following are hyponyms (words that are more specific) of 'cutlery':")
for x in get_all_hyponyms('cutlery'):
    print(x)
print()
print("The following are similar to 'happy':")
for x in get_all_similar_tos('happy'):
    print(x)
print()
print("The following are antonyms (opposite) of 'happy':")
for x in get_all_antonyms('happy'):
    print(x)
print()
print("The following are words that should also be seen with 'happy':")
for x in get_all_also_sees('happy'):
    print(x)

The following are synonyms of 'happy':
('happy', 'happy.a.01')
('felicitous', 'felicitous.s.02')
('happy', 'felicitous.s.02')
('glad', 'glad.s.02')
('happy', 'glad.s.02')
('happy', 'happy.s.04')
('well-chosen', 'happy.s.04')

The following are hyponyms (words that are more specific) of 'cutlery':
('bolt_cutter', 'bolt_cutter.n.01')
('cigar_cutter', 'cigar_cutter.n.01')
('die', 'die.n.03')
('edge_tool', 'edge_tool.n.01')
('glass_cutter', 'glass_cutter.n.03')
('tile_cutter', 'tile_cutter.n.01')
('fork', 'fork.n.01')
('spoon', 'spoon.n.01')
('Spork', 'spork.n.01')
('table_knife', 'table_knife.n.01')

The following are similar to 'happy':
('blessed', 'blessed.s.06')
('blissful', 'blissful.s.01')
('bright', 'bright.s.09')
('golden', 'golden.s.02')
('halcyon', 'golden.s.02')
('prosperous', 'golden.s.02')
('laughing', 'laughing.s.01')
('riant', 'laughing.s.01')
('fortunate', 'fortunate.a.01')
('willing', 'willing.a.01')
('felicitous', 'felicitous.a.01')

The following are antonyms (opposite) 

In [61]:
for x in (get_all_synonyms('happy')):
    print(x)

('happy', 'happy.a.01', 'ss')
('felicitous', 'felicitous.s.02', 'ss')
('happy', 'felicitous.s.02', 'ss')
('glad', 'glad.s.02', 'ss')
('happy', 'glad.s.02', 'ss')
('happy', 'happy.s.04', 'ss')
('well-chosen', 'happy.s.04', 'ss')
('blessed', 'blessed.s.06', 'sim')
('blissful', 'blissful.s.01', 'sim')
('bright', 'bright.s.09', 'sim')
('golden', 'golden.s.02', 'sim')
('halcyon', 'golden.s.02', 'sim')
('prosperous', 'golden.s.02', 'sim')
('laughing', 'laughing.s.01', 'sim')
('riant', 'laughing.s.01', 'sim')
('fortunate', 'fortunate.a.01', 'sim')
('willing', 'willing.a.01', 'sim')
('felicitous', 'felicitous.a.01', 'sim')
('unhappy', 'unhappy.a.01', 'ant')
('cheerful', 'cheerful.a.01', 'also')
('contented', 'contented.a.01', 'also')
('content', 'contented.a.01', 'also')
('elated', 'elated.a.01', 'also')
('euphoric', 'euphoric.a.01', 'also')
('felicitous', 'felicitous.a.01', 'also')
('glad', 'glad.a.01', 'also')
('joyful', 'joyful.a.01', 'also')
('joyous', 'joyous.a.01', 'also')


Let's call all the words in the categories above words that are **related** to the main word.

Now, we want to do the same as we did for the synonym count and define some functions that will find the common related words - not just within the sentence, but also with the related words of the other words in the sentence. 

In [62]:
def related_list(word):
    lemma_list = []
    for x in get_all_synonyms(word):
        lemma_list.append(x)
    return list(set(lemma_list))

def common_related(s):
    filtered = simpleFilter(s)
    count = 0
    for index, word in enumerate(filtered):
        related = related_list(word)
        for r_set in related:
            if r_set[0] in filtered[index+1:]:
                count += 1
    return count
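
`common_related` only checks whether a related word appears verbatim later in the sentence. Comparing the related-word sets of two words with each other, as described above, could be sketched as follows. The function name `shared_related_pairs` and the toy related-word sets are assumptions written by hand for illustration; they are not actual WordNet output.

```python
from itertools import combinations


def shared_related_pairs(related):
    """Return (word1, word2, shared) for word pairs whose related sets overlap.

    `related` maps each word in the sentence to a set of related words.
    """
    pairs = []
    for (w1, r1), (w2, r2) in combinations(related.items(), 2):
        # Include each word itself so a direct match also counts
        shared = (r1 | {w1}) & (r2 | {w2})
        if shared:
            pairs.append((w1, w2, sorted(shared)))
    return pairs


# Hypothetical related-word sets, for illustration only
toy_related = {
    "belt": {"girdle", "sash", "waistband"},
    "watch": {"timepiece", "clock", "ticker"},
    "time": {"clock", "hour", "period"},
}

print(shared_related_pairs(toy_related))
```

In the notebook, the sets could be built from the first element of each tuple returned by `related_list`.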


**Example:**

'What do you call a belt with a watch on it? A waist of time.'

In [63]:
s = 'What do you call a belt with a watch on it? A waist of time.'

filtered = simpleFilter(s)
count = 0
print('Sentence:',s)
print('-----' *10)
print()
for index, word in enumerate(filtered):
    related = related_list(word)
    for r_set in related:
        if r_set[0] in filtered[index+1:]:
            print("The word '" + word + "' in the sentence is related to '" + r_set[0] + "' as", r_set, "to mean '" + wordnet.synset(r_set[1]).definition() +"'")
            print()
            count += 1
print('-----' * 10)
print('Number of Related pairs:', count)


Sentence: What do you call a belt with a watch on it? A waist of time.
--------------------------------------------------

--------------------------------------------------
Number of Related pairs: 0


Now we want to apply this to the rest of our data.

In [64]:
df['Length'] = df['Sentence'].apply(len) #added this because it's mysteriously missing, but need to filter the length next time
df['Related Count'] = df['Sentence'].apply(common_related)
df['Rel Count / Len'] = df['Related Count'] / df['Length']
df.sample(5)

Unnamed: 0,Sentence,P/NP,Syn Count,Length,Related Count,Rel Count / Len
321,All that is gold does not glitterNot all those...,0,2,287,0,0.0
41,Have you ever tried to milk a cow which has be...,1,0,76,0,0.0
50,"I was accused of being a plagiarist, their wor...",1,0,57,0,0.0
253,"I used to be afraid of hurdles, but I got over...",1,0,50,3,0.06
333,Yesterday is history tomorrow is a mystery tod...,0,1,101,2,0.019802


Here is a description of the values. 

In [65]:
import matplotlib.pyplot as plt
print(df.describe())  # summary statistics for the numeric columns

# pyplot has no `histfit` (a MATLAB function); `hist` draws the histogram without a fitted curve
r = df['Related Count']
plt.hist(r)
plt.show()

The code below computes the correlation between the variables in the data frame.

The correlation between whether a sentence is a pun and its number of related pairs is debatable.

We also divide the related count by the sentence length, since a longer sentence is likely to contain more related pairs.

In [None]:
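# Restrict to numeric columns so the text column does not break corr() on newer pandas
corr = df.select_dtypes('number').corr()
corr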
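
For reference, the correlation between a 0/1 pun label and a count is an ordinary Pearson correlation. A minimal pure-Python sketch follows; the labels and counts are hypothetical values for illustration and are not taken from `puns_final.csv`.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical labels (1 = pun, 0 = non-pun) and pair counts
labels = [1, 1, 1, 0, 0, 0]
counts = [0, 1, 0, 2, 1, 3]

print(round(pearson(labels, counts), 3))
```

A value near 0 means the count carries little information about the label; the sign shows whether puns tend to have more or fewer pairs.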

We'll try to turn this correlation into an actionable "algorithm" to predict if a sentence is a pun or not. 

The following is another data set with 60 puns and 100 non-puns.

In [None]:
test_df = pd.read_csv('puns_test.csv')
test_df.sample(5)

Let's now code the "algorithm".

In [None]:
common_related(s)
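
One simple way to turn a count into a prediction is a threshold rule. The sketch below assumes that a higher related count points towards a pun, which the correlations in this notebook do not establish. The function names and the toy data are illustrative assumptions.

```python
def predict_pun(count, threshold=1):
    """Label a sentence a pun (1) if it has at least `threshold` related pairs."""
    return 1 if count >= threshold else 0


def accuracy(counts, labels, threshold=1):
    """Fraction of sentences whose predicted label matches the true label."""
    hits = sum(predict_pun(c, threshold) == y for c, y in zip(counts, labels))
    return hits / len(labels)


def best_threshold(counts, labels, candidates=range(0, 6)):
    """Pick the candidate threshold with the highest accuracy."""
    return max(candidates, key=lambda t: accuracy(counts, labels, t))


# Hypothetical related counts and labels (1 = pun, 0 = non-pun)
toy_counts = [3, 2, 0, 0, 1]
toy_labels = [1, 1, 0, 0, 0]

t = best_threshold(toy_counts, toy_labels)
print(t, accuracy(toy_counts, toy_labels, t))
```

The threshold would be chosen on the training data and then evaluated on the separate test set to avoid an optimistic estimate.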

### Target: Similar Sounds

Other puns rely on homophones: words that sound alike but have different meanings. For example:

'The **pony** had a **raspy** voice. It was **hoarse**.'

where 'hoarse' means the same as 'raspy', but is also related to 'pony' as it sounds like 'horse'. 

___

In order to test this, we will do the following:

* Step 1: Use Synset to list synonyms of tokens
* Step 2: Find matching similar sounding words in Synsets within the sentence
* Step 3: 

Some inspirations:
* https://stackabuse.com/phonetic-similarity-of-words-a-vectorized-approach-in-python/
* https://pypi.org/project/phonetics/#usage
* https://pypi.org/project/jellyfish/
* https://github.com/mphilli/English-to-IPA

---


Import/Download relevant packages:

In [66]:
# from textblob import Word
# import nltk
# nltk.download('wordnet')
# from nltk.corpus import wordnet as wn


# import codecs
# nltk.download('stopwords')
# nltk.download('punkt')
# from nltk.tokenize import PunktSentenceTokenizer,sent_tokenize, word_tokenize
# from nltk.corpus import stopwords, wordnet
# from nltk.stem import WordNetLemmatizer, PorterStemmer
# import re

import phonetics
import jellyfish
import eng_to_ipa as ipa


Testing the libraries' different phonetic functions on two similar-sounding words, 'horse' and 'hoarse', and observing the result.

In [67]:
test_words = ['horse', 'hoarse']
print(test_words)
def print_phonetic_index(test_words):
    functions = (phonetics.soundex, phonetics.nysiis, phonetics.metaphone, phonetics.dmetaphone
                 , jellyfish.match_rating_codex, ipa.convert)
    for func in functions:
        print(f'{func.__name__}: ' , end='')
        for word in test_words:
            code = func(word)
            print(str(code) + ' ', end='')
        print()
print_phonetic_index(test_words)

['horse', 'hoarse']
soundex: h0620 h0620 
nysiis: HA HA 
metaphone: HRS HRS 
dmetaphone: ('HRS', '') ('HRS', '') 
match_rating_codex: HRS HRS 
convert: hɔrs hɔrs 


Testing with the words 'horse' and 'hoarse' gave identical phonetic codes from every package. Let's try this with another set of words!

Pun examples:
* A harp which sounds too good to be true is probably a lyre. (lie)
* Religious lions get down to their knees to prey. (pray)
* A big computerized dog needs a megabyte. (mega bite)
* Lions eat their prey fresh and roar. (raw)

In [68]:
test_words_2 = ['lyre', 'lie', 'prey', 'pray', 'roar', 'raw']
print_phonetic_index(test_words_2)

soundex: l060 l000 p600 p600 r060 r000 
nysiis: LA LA PA PA RA RA 
metaphone: LR L PR PR RR R 
dmetaphone: ('LR', '') ('L', '') ('PR', '') ('PR', '') ('RR', '') ('R', 'RF') 
match_rating_codex: LYR L PRY PRY RR RW 
convert: laɪr laɪ pre pre rɔr rɑ 
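
To read the output above more easily, the sketch below lists, for each homophone pair, which encoders produced identical codes. The codes are copied from the output above (`dmetaphone` is left out because it returns tuples); the helper name `agreeing_encoders` is an illustrative assumption.

```python
words = ["lyre", "lie", "prey", "pray", "roar", "raw"]

# Codes copied from the notebook output, one row per encoder
codes = {
    "soundex": ["l060", "l000", "p600", "p600", "r060", "r000"],
    "nysiis": ["LA", "LA", "PA", "PA", "RA", "RA"],
    "metaphone": ["LR", "L", "PR", "PR", "RR", "R"],
    "match_rating_codex": ["LYR", "L", "PRY", "PRY", "RR", "RW"],
    "ipa": ["laɪr", "laɪ", "pre", "pre", "rɔr", "rɑ"],
}


def agreeing_encoders(index_a, index_b):
    """Names of the encoders that give both words the same code."""
    return [name for name, row in codes.items() if row[index_a] == row[index_b]]


for a, b in [(0, 1), (2, 3), (4, 5)]:
    print(words[a], words[b], agreeing_encoders(a, b))
```

Only 'prey'/'pray' match under every encoder, which suggests that exact code equality is too strict for pairs such as 'lyre'/'lie' and 'roar'/'raw', and that a graded similarity measure is needed.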


In [69]:
# def related_list(word):
#     lemma_list = []
#     for x in get_all_synonyms(word):
#         lemma_list.append(x)
#     return list(set(lemma_list))

# def common_related(s):
#     filtered = simpleFilter(s)
#     count = 0
#     for index, word in enumerate(filtered):
#         related = related_list(word)
#         for r_set in related:
#             if r_set[0] in filtered[index+1:]:
#                 count += 1
#     return count


In [70]:
# word and desired word pair
word_pairs = [('harp', 'lyre'), ('religious', 'pray'), ('computerized', 'megabyte'), ('fresh', 'raw')]
for pair in word_pairs:
    related_words = related_list(pair[0])
    print(f'Related words of "{pair[0]}": {related_words}')
    # related_words holds (lemma, synset, relation) tuples, so compare against the lemma field
    found = any(pair[1] == related[0] for related in related_words)
    print(f'Desired word "{pair[1]}" found: {found}\n')

Related words of "harp": [('harp', 'harp.v.01', 'ss'), ('wind_harp', 'aeolian_harp.n.01', 'hyp'), ('aeolian_harp', 'aeolian_harp.n.01', 'hyp'), ('dwell', 'harp.v.01', 'ss'), ('aeolian_lyre', 'aeolian_harp.n.01', 'hyp'), ('mouth_organ', 'harmonica.n.01', 'ss'), ('harp', 'harp.n.02', 'ss'), ('harp', 'harmonica.n.01', 'ss'), ('harp', 'harp.v.02', 'ss'), ('harmonica', 'harmonica.n.01', 'ss'), ('harp', 'harp.n.01', 'ss'), ('lyre', 'lyre.n.01', 'hyp'), ('mouth_harp', 'harmonica.n.01', 'ss')]
Desired word "lyre" found: True

Related words of "religious": [('interfaith', 'interfaith.s.01', 'sim'), ('Benedictine', 'benedictine.n.01', 'hyp'), ('sacred', 'sacred.a.01', 'sim'), ('religious', 'religious.n.01', 'ss'), ('Jesuit', 'jesuit.n.01', 'hyp'), ('churchgoing', 'churchgoing.s.01', 'sim'), ('churchly', 'churchly.s.01', 'sim'), ('religious', 'religious.s.04', 'ss'), ('scrupulous', 'scrupulous.a.01', 'sim'), ('god-fearing', 'devout.s.01', 'sim'), ('pious', 'pious.a.01', 'also'), ('coenobite', 'c

---
The desired word is often missing from the related-word list, so we need a way to further expand our dictionary of related words before comparing their Soundex and IPA codes.
After that, we'll calculate the degree of similarity between the two pronunciations.


In [71]:
jellyfish.levenshtein_distance('jellyfish', 'smellyfish')


2

In [72]:
jellyfish.jaro_distance('jellyfish', 'smellyfish')

0.8962962962962964
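
The same kind of measure can be applied to the IPA strings from earlier to get a graded similarity instead of an exact match. The sketch below is a pure-Python Levenshtein distance, written as a stand-in for illustration (it is not the `jellyfish` implementation), plus a normalised similarity score whose definition is an assumption.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """1.0 for identical strings, falling towards 0.0 as they diverge."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


# IPA transcriptions taken from the earlier ipa.convert output
for x, y in [("laɪr", "laɪ"), ("pre", "pre"), ("rɔr", "rɑ")]:
    print(x, y, levenshtein(x, y), round(similarity(x, y), 2))
```

Working on IPA rather than spelling means the score reflects pronunciation, so 'prey' and 'pray' come out identical even though their spellings differ.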

In [73]:
sentence = 'How do mountains see? They peak'
words = simpleFilter(sentence)
related_words = {}

for word in words:
    related_words[word] = related_list(word)

print(related_words)

{'How': [], 'mountain': [('slew', 'batch.n.02', 'ss'), ('great_deal', 'batch.n.02', 'ss'), ('hatful', 'batch.n.02', 'ss'), ('volcano', 'volcano.n.02', 'hyp'), ('peck', 'batch.n.02', 'ss'), ('mint', 'batch.n.02', 'ss'), ('mickle', 'batch.n.02', 'ss'), ('stack', 'batch.n.02', 'ss'), ('sight', 'batch.n.02', 'ss'), ('ben', 'ben.n.01', 'hyp'), ('mess', 'batch.n.02', 'ss'), ('spate', 'batch.n.02', 'ss'), ('haymow', 'haymow.n.01', 'hyp'), ('alp', 'alp.n.01', 'hyp'), ('wad', 'batch.n.02', 'ss'), ('inundation', 'flood.n.02', 'hyp'), ('mountain', 'mountain.n.01', 'ss'), ('deluge', 'flood.n.02', 'hyp'), ('good_deal', 'batch.n.02', 'ss'), ('raft', 'batch.n.02', 'ss'), ('tidy_sum', 'batch.n.02', 'ss'), ('muckle', 'batch.n.02', 'ss'), ('heap', 'batch.n.02', 'ss'), ('batch', 'batch.n.02', 'ss'), ('flood', 'flood.n.02', 'hyp'), ('deal', 'batch.n.02', 'ss'), ('plenty', 'batch.n.02', 'ss'), ('flock', 'batch.n.02', 'ss'), ('pot', 'batch.n.02', 'ss'), ('quite_a_little', 'batch.n.02', 'ss'), ('torrent', 'f

Steps in plan

* Get the words in the dictionary (WordNet)
* Find the Soundex code of each word
* Build a dictionary with key = Soundex code, value = list of same-sounding words
* Write a function that returns the "fuzzy" (similar-sounding) words for a given word
* Write a function that checks the similarity of each (lemmatized) word in the sentence with a word (the original plus its fuzzy words)
* Write a function that returns the difference in the similarity spike between the original word and another word in the fuzzy set
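
The dictionary and lookup steps of this plan can be sketched without any external package. The block below uses a classic four-character Soundex as a stand-in, so its codes differ from those of `phonetics.soundex` shown earlier. The function names are illustrative assumptions.

```python
from collections import defaultdict

# Classic Soundex digit groups
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def classic_soundex(word):
    """Classic Soundex: first letter plus three digits, zero-padded."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so a repeated code is kept
    return (code + "000")[:4]


def build_sound_index(words):
    """Map each Soundex code to the list of words that share it."""
    index = defaultdict(list)
    for w in words:
        code = classic_soundex(w)
        if code and w not in index[code]:
            index[code].append(w)
    return index


def fuzzy_words(word, index):
    """Other words in the index that share the word's Soundex code."""
    return [w for w in index.get(classic_soundex(word), []) if w != word]


index = build_sound_index(["horse", "hoarse", "pray", "prey", "lyre", "lie"])
print(fuzzy_words("horse", index), fuzzy_words("prey", index), fuzzy_words("lie", index))
```

Keying the dictionary by code makes the lookup for a word's same-sounding candidates a single dictionary access rather than a scan over the whole vocabulary.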

In [74]:
# Run through the words in the dictionary (WordNet) and build a dictionary
# that maps each Soundex code to the list of words sharing that code.
# Some words cannot be given a Soundex code and raise an error;
# those words are skipped.
soundex_dict = {}
count = 15000  # stop once this many words have been added
for ss in wn.all_synsets():
    word = ss.lemma_names()[0]
    try:
        sound = phonetics.soundex(word)
    except Exception:
        # Discard the word and continue with the next synset
        continue

    if sound not in soundex_dict:
        soundex_dict[sound] = [word]
        count -= 1
    elif word not in soundex_dict[sound]:
        soundex_dict[sound].append(word)
        print("repeat", sound, soundex_dict[sound])
        count -= 1
    if count == 0:
        break


repeat a20302 ['ascetic', 'acidic']
repeat a32023010 ['adjustive', 'adjective']
repeat a20302 ['ascetic', 'acidic', 'aquatic']
repeat a5305060305 ['antemeridian', 'ante_meridiem']
repeat p601605304 ['preprandial', 'prefrontal']
repeat b020 ['busy', 'back']
repeat c030 ['cut', 'cute']
repeat u5010203 ['unabused', 'unabashed']
repeat a20203010 ['acquisitive', 'associative']
repeat a2020140 ['accessible', 'associable']
repeat a10203 ['abused', 'affixed']
repeat a1053053 ['abundant', 'appendant']
repeat u5010203 ['unabused', 'unabashed', 'unaffixed']
repeat b0530140 ['bondable', 'bindable']
repeat b020 ['busy', 'back', 'big']
repeat a030202 ['audacious', 'autoecious']
repeat a1040140 ['appealable', 'available']
repeat u501040140 ['unappealable', 'unavailable']
repeat u5020503 ['unashamed', 'unawakened']
repeat u5060 ['unwary', 'unaware']
repeat a104052 ['appealing', 'appalling']
repeat r060 ['rare', 'rear']
repeat b0203 ['backed', 'beaked']
repeat b02402 ['backless', 'beakless']
repeat b60

repeat a2053053 ['assentient', 'ascendant']
repeat h050 ['honey', 'hewn', 'hammy']
repeat s000 ['shy', 'showy']
repeat s3020 ['sticky', 'stagy']
repeat b060 ['bare', 'beery']
repeat p030 ['pat', 'petty', 'pithy', 'potty']
repeat s0106 ['super', 'sober']
repeat l01040 ['lovely', 'lively']
repeat b6020 ['brash', 'braky', 'brusque', 'breakaway', 'breezy']
repeat b014052 ['baffling', 'bubbling']
repeat a603 ['arrayed', 'arid']
repeat p6030 ['pretty', 'proto']
repeat m0340 ['motley', 'middle']
repeat m030 ['mute', 'made', 'mid']
repeat u506503 ['unarmed', 'unearned']
repeat n063023065 ['northwestern', 'northeastern']
repeat s03023065 ['southwestern', 'southeastern']
repeat i5030140 ['inaudible', 'immutable', 'inedible']
repeat u520403 ['unsoiled', 'unsullied', 'unschooled']
repeat o106053 ['overhand', 'operant']
repeat d050 ['downy', 'dim', 'dun', 'done', 'damn', 'down']
repeat u5106203 ['unpierced', 'unforced']
repeat b0520 ['bunchy', 'bouncy']
repeat c000 ['coy', 'chewy']
repeat e21053014

repeat d0204030 ['decollete', 'desolate']
repeat u160203 ['upraised', 'upright']
repeat s04053 ['silent', 'salient']
repeat b05303 ['banded', 'bounded', 'bended']
repeat c06502 ['corneous', 'cernuous']
repeat p6050 ['primo', 'prime', 'prone']
repeat s0303 ['shaded', 'suited', 'seated']
repeat b0306 ['better', 'bitter']
repeat f060 ['fair', 'far', 'faraway', 'fore', 'fiery']
repeat h0303 ['headed', 'heated']
repeat b402 ['black', 'bleak']
repeat h03402 ['heedless', 'headless', 'heatless']
repeat s01060 ['sapphire', 'severe', 'shivery']
repeat u50303 ['unheaded', 'unheated']
repeat s05060 ['shimmery', 'summery']
repeat f60203 ['fraught', 'frigid']
repeat d05050203 ['diminished', 'dehumanized']
repeat p030402 ['pathless', 'pitiless']
repeat w030 ['wide', 'white', 'woody', 'wet', 'witty']
repeat f050203 ['finished', 'famished']
repeat h0603 ['hired', 'horrid', 'hurried']
repeat r020 ['rose', 'rocky', 'rush']
repeat d020 ['dishy', 'dusky', 'doughy', 'dicky']
repeat u50510603 ['unhampered', 

repeat s3010 ['stubby', 'stuffy']
repeat u5240203 ['unglazed', 'unclogged']
repeat a106053 ['afferent', 'aperient', 'aberrant', 'apparent', 'abhorrent']
repeat s01060 ['sapphire', 'severe', 'shivery', 'savory']
repeat u5010204 ['unifacial', 'unofficial']
repeat d60503 ['drained', 'drumhead']
repeat g60103 ['grouped', 'grooved']
repeat n010 ['niffy', 'naive']
repeat h060 ['hairy', 'hoary']
repeat p03203 ['pitched', 'patched']
repeat r02052 ['reeking', 'raging', 'rising']
repeat u5302303 ['untoasted', 'untested']
repeat b020 ['busy', 'back', 'big', 'beige', 'buggy', 'bushy', 'base', 'bass', 'baggy', 'boyish']
repeat p01020 ['pappose', 'puppyish']
repeat o500 ['one', 'on']
repeat s030 ['shady', 'sooty', 'saute', 'shadowy', 'suety', 'sad', 'south', 'shut']
repeat c40203 ['clogged', 'closed']
repeat t0503 ['timid', 'tumid', 'tanned']
repeat u530503 ['undimmed', 'untanned']
repeat t0103 ['taped', 'tapped']
repeat o1063050 ['overdone', 'opportune']
repeat u5010203 ['unabused', 'unabashed', 'u

repeat u53030603 ['untethered', 'undeterred']
repeat d0106053 ['different', 'deferent']
repeat i5103053 ['impatient', 'impotent', 'impudent']
repeat m0203 ['masked', 'meshed']
repeat c4052052 ['clanking', 'clinking']
repeat h040 ['hale', 'holey', 'hollow']
repeat j0524052 ['jangling', 'jingling']
repeat t0524052 ['twinkling', 'tinkling']
repeat a50202 ['amnesic', 'anechoic']
repeat r0106053 ['referent', 'reverent']
repeat r0503 ['rhymed', 'renewed']
repeat u503 ['unwed', 'unawed']
repeat b05303 ['banded', 'bounded', 'bended', 'bountied']
repeat j020 ['joyous', 'juicy', 'jazzy']
repeat r0103 ['rapid', 'roofed', 'ribbed']
repeat r01402 ['roofless', 'ribless']
repeat b6020 ['brash', 'braky', 'brusque', 'breakaway', 'breezy', 'broke']
repeat m0503 ['maimed', 'manned', 'mined', 'moneyed']
repeat r0503 ['rhymed', 'renewed', 'rimmed']
repeat h05303 ['hunted', 'haunted', 'handed']
repeat h065402 ['harmless', 'hornless']
repeat b010 ['buff', 'beefy']
repeat h0630 ['hearty', 'hardy']
repeat d026

repeat e04020 ['eyelike', 'eellike']
repeat f042030 ['falsetto', 'falcate']
repeat r0206103 ['reserved', 'recurved']
repeat c0403 ['cowled', 'coiled']
repeat c04052 ['chilling', 'coiling']
repeat i5104030 ['impolite', 'involute']
repeat u520403 ['unsoiled', 'unsullied', 'unschooled', 'unsealed', 'unskilled', 'uncoiled']
repeat s5020 ['smoggy', 'snazzy', 'smoky', 'sneaky']
repeat s360203 ['straight', 'streaked', 'stressed']
repeat u50205303 ['unacquainted', 'unaccented']
repeat u50510302 ['unambitious', 'unemphatic']
repeat a30502 ['atomic', 'atonic']
repeat b6050 ['brainy', 'brawny']
repeat b04020 ['bullish', 'bullocky']
repeat b0420 ['bilgy', 'bulky', 'bolshy']
repeat b040303 ['belated', 'bullheaded']
repeat d0203 ['dashed', 'dazed', 'dished', 'dosed', 'discoid', 'decayed', 'dogged']
repeat r0403 ['rolled', 'ruled']
repeat b010 ['buff', 'beefy', 'boffo']
repeat h03402 ['heedless', 'headless', 'heatless', 'hatless', 'hitless']
repeat u5140203 ['unblessed', 'unbleached', 'unplaced']
rep

repeat n05030 ['ninety', 'nonwoody']
repeat b60303 ['breathed', 'braided']
repeat u50105 ['uneven', 'unwoven']
repeat k50303 ['knotted', 'knitted']
repeat w065 ['warm', 'worn']
repeat c4010303 ['clubfooted', 'clapped_out']
repeat f603 ['fried', 'frayed']
repeat m0520 ['manque', 'manky', 'mangy']
repeat r03403 ['riddled', 'raddled']
repeat s26010 ['scrappy', 'scruffy']
repeat t030603 ['tethered', 'tattered']
repeat t05103 ['tinpot', 'thumbed']
repeat h03602 ['hydrous', 'hydric']
repeat a20502 ['axenic', 'agamic', 'azonic']
repeat f020604 ['figural', 'fossorial']
repeat u50103 ['unavowed', 'unhoped', 'unmoved', 'unwebbed']
repeat u51020303 ['unbigoted', 'unfaceted']
repeat s0403 ['solid', 'sealed', 'soled', 'shelled']
repeat u520403 ['unsoiled', 'unsullied', 'unschooled', 'unsealed', 'unskilled', 'uncoiled', 'unshelled']
repeat j030 ['jade', 'jawed']
repeat j0402 ['joyless', 'jealous', 'jawless']
repeat a10202 ['aphasic', 'abasic']
repeat a10304 ['apodal', 'abbatial']
repeat a1053 ['abey

repeat b60304 ['brutal', 'bridal']
repeat c06302 ['courteous', 'cardiac']
repeat c061050 ['cervine', 'corvine']
repeat f0204 ['focal', 'facial', 'fossil', 'fiscal']
repeat i530620402302 ['interscholastic', 'intergalactic']
repeat a41050 ['alpine', 'alvine']
repeat h0502 ['humic', 'hemic']
repeat c04010602 ['cheliferous', 'chyliferous']
repeat i20502 ['ischemic', 'iconic']
repeat i60302 ['iritic', 'iridic']
repeat p02050 ['piscine', 'phocine']
repeat v0502 ['venous', 'vinous']
repeat p040506102 ['polymorphic', 'polymorphous']
repeat p060504 ['perennial', 'perianal']
repeat p060504 ['perennial', 'perianal', 'perineal']
repeat p060504 ['perennial', 'perianal', 'perineal', 'peroneal']
repeat a26050502 ['acrimonious', 'agronomic', 'acronymic']
repeat p0230604 ['pectoral', 'pastoral']
repeat p0230604 ['pectoral', 'pastoral', 'pictorial']
repeat v020604 ['vicarial', 'visceral']
repeat m040302 ['melodious', 'melodic']
repeat m020502 ['mechanic', 'messianic']
repeat m020204 ['musical', 'mucosal

repeat p06203 ['parched', 'pierced', 'parked', 'pursued']
repeat r052052 ['ranking', 'ranging']
repeat s402052 ['slashing', 'slowgoing', 'sluicing']
repeat s20203 ['schizoid', 'squashed']
repeat s30203 ['stocked', 'staged', 'stacked']
repeat s36052 ['strong', 'straying', 'strung']
repeat t060303 ['throated', 'threaded', 'thoriated']
repeat h06340 ['hardly_a', 'hardly']
repeat e510140 ['enviable', 'enviably']
repeat s05140 ['simple', 'simply']
repeat a620140 ['arguable', 'arguably']
repeat a16020140 ['approachable', 'appreciable', 'appreciably']
repeat w040 ['well', 'wooly', 'whole', 'wholly']
repeat p4020 ['plushy', 'please']
repeat f040 ['full', 'fallow', 'foul', 'fully']
repeat b0340 ['beetle', 'badly']
repeat q030 ['quiet', 'quite']
repeat l052020 ['longish', 'long_ago']
repeat c0510650140 ['confirmable', 'conformable', 'conformably']
repeat n030 ['needy', 'neat', 'net', 'nutty', 'not']
repeat n060 ['near', 'narrow', 'nary', 'nowhere']
repeat s05060 ['shimmery', 'summery', 'somewher

repeat i6010620140 ['irreversible', 'irreversibly']
repeat j0605240 ['jeeringly', 'jarringly']
repeat j040240 ['joylessly', 'jealously']
repeat m0306040 ['materially', 'maturely']
repeat l05040 ['lonely', 'lamely']
repeat l0306040 ['literally', 'laterally']
repeat l020140 ['legible', 'likable', 'legibly', 'laughably']
repeat l02040 ['likely', 'legally', 'locally', 'loosely', 'lazily']
repeat l0340 ['little', 'loudly', 'lewdly']
repeat l020340 ['lightly', 'lucidly']
repeat m05020140 ['manageable', 'manageably']
repeat u505020140 ['unmanageable', 'unmanageably']
repeat l05040 ['lonely', 'lamely', 'lineally']
repeat l023 ['last', 'lost', 'least', 'lowest']
repeat m0502040 ['minuscule', 'maniacally']
repeat m0202040 ['majuscule', 'magically', 'mawkishly']
repeat m0540 ['manly', 'meanly']
repeat i502060140 ['immeasurable', 'inexorably', 'immeasurably']
repeat m02060140 ['miserable', 'measurable', 'measurably']
repeat m03040 ['motile', 'mutely', 'medially']
repeat m05060140 ['memorable', 'me

Many sound codes are shared by several words, but the words grouped under one code often do not actually sound alike, e.g.:

l020 ['lousy', 'loose', 'leaky', 'lax', 'liege', 'lazy', 'lucky', 'lush', 'like', 'lossy', 'less', 'lacy']

r0203 ['russet', 'right', 'rugged', 'raised', 'ragged', 'rigid', 'rigged', 'rescued']

In [75]:
''' Runs through all the words in the dictionary (in WordNet)
and makes a dictionary mapping each IPA transcription to its words.
Words without a valid IPA transcription (marked with * by ipa.convert)
are counted in invalid_sound_count and are not added to ipa_dict.
'''
ipa_dict = {}
count = 1000
invalid_sound_count = 0
for ss in wn.all_synsets():
    word = ss.lemma_names()[0]
    sound = ipa.convert(word)
#     try:
#         sound = ipa.convert(word)
#     except:
#         # If got error then discard word, continue to next cycle
#         print("***Error", word)
#         continue
        
    if '-' in word:
#         print(word)
        continue
    if '*' in sound:
#         print(sound)
        invalid_sound_count += 1
        continue
    
#     print(word)
    if sound not in ipa_dict:
        ipa_dict[sound] = [word]
    else:
        if word not in ipa_dict[sound]:
            ipa_dict[sound].append(word)
            print("repeat", sound, ipa_dict[sound])
    count -= 1
    if count == 0:
        break
        
print(ipa_dict)
print(invalid_sound_count)

{'ˈebəl': ['able'], 'əˈnebəl': ['unable'], 'ˈnesənt': ['nascent'], 'ˈimərʤənt': ['emergent'], 'daɪɪŋ': ['dying'], 'ˈmɔrəbənd': ['moribund'], 'læst': ['last'], 'əˈbrɪʤd': ['abridged'], 'kət': ['cut'], 'ˈpɑtɪd': ['potted'], 'ˌənəˈbrɪʤd': ['unabridged'], 'ˈæbsəˌlut': ['absolute'], 'dɪˈrɛkt': ['direct'], 'ˌɪmˈplɪsət': ['implicit'], 'ˈɪnfənət': ['infinite'], 'ˈlɪvɪŋ': ['living'], 'ˈrɛlətɪv': ['relative'], 'riˈleʃənəl': ['relational'], 'əbˈzɔrbənt': ['absorbent'], 'əˈsɪməˌletɪŋ': ['assimilating'], 'rɪˈsɛptɪv': ['receptive'], 'ˈspənʤi': ['spongy'], 'ˈθərsti': ['thirsty'], 'rɪˈpɛlənt': ['repellent'], 'ˈæbstənənt': ['abstinent'], 'əˈsɛtɪk': ['ascetic'], 'ˈglətənəs': ['gluttonous'], 'ˈgridi': ['greedy'], 'ˈæbˌstrækt': ['abstract'], 'kənˈsɛpʧuəl': ['conceptual'], 'aɪˈdil': ['ideal'], 'ˌaɪdiəˈlɑʤɪkəl': ['ideological'], 'ˈkɑnkrit': ['concrete'], 'əˈbʤɛktɪv': ['objective'], 'ril': ['real'], 'əˈbəndənt': ['abundant'], 'əˈbaʊndɪŋ': ['abounding'], 'ˈæmpəl': ['ample'], 'ˈkoʊpiəs': ['copious'], 'ˈizi': [

IPA cannot be used because the converter has no transcription for many of the words in the dictionary, as indicated by the words marked with `*`. It also produces no repeated sounds, even with 10000 words as input.
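
The same loop is repeated for every encoder tried in this notebook, so it may be easier to compare them through one helper. The following is only a minimal sketch: `build_sound_dict`, its `is_valid` and `limit` parameters, and the toy encoder `consonant_skeleton` are hypothetical names introduced here for illustration, and the toy encoder is not a real phonetic algorithm. With the real libraries, the encoder would be passed in instead (for IPA, `ipa.convert` with `is_valid=lambda s: '*' not in s`).

```python
from collections import defaultdict


def build_sound_dict(words, encode, is_valid=lambda code: True, limit=None):
    """Group words by the code that `encode` assigns to them.

    Returns (sound_dict, invalid_count), where sound_dict maps each code
    to the list of distinct words that produced it.
    """
    sound_dict = defaultdict(list)
    invalid_count = 0
    kept = 0
    for word in words:
        word = word.lower()
        if '-' in word:              # skip hyphenated lemmas, as the cells do
            continue
        code = encode(word)
        if not is_valid(code):       # e.g. an IPA string containing '*'
            invalid_count += 1
            continue
        if word not in sound_dict[code]:
            sound_dict[code].append(word)
        kept += 1
        if limit is not None and kept >= limit:
            break
    return dict(sound_dict), invalid_count


def consonant_skeleton(word):
    """Toy stand-in encoder (an assumption, NOT a phonetic algorithm):
    first letter plus the remaining consonants."""
    if not word:
        return ''
    return word[0] + ''.join(c for c in word[1:] if c not in 'aeiouy')


toy_words = ['fried', 'frayed', 'knotted', 'knitted', 'well-off']
toy_dict, toy_invalid = build_sound_dict(toy_words, consonant_skeleton)
print(toy_dict, toy_invalid)
```

Passing the encoder as an argument keeps the hyphen filter, the validity check, and the word limit identical across runs, so differences in the resulting buckets come from the encoder alone.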

In [76]:
''' Runs through all the words in the dictionary (in WordNet)
and makes a dictionary mapping each Double Metaphone code to its words.
Codes that contain * are printed and are not added to dmeta_dict.
'''
dmeta_dict = {}
count = 15000
total = 0
for ss in wn.all_synsets():
    word = ss.lemma_names()[0].lower()
    sound = phonetics.dmetaphone(word)
  
    if '-' in word:
#         print(word)
        continue
    if '*' in sound:
        print(sound)
        continue
    
#     print(word)
    if sound not in dmeta_dict:
        dmeta_dict[sound] = [word]
    else:
        if word not in dmeta_dict[sound]:
            dmeta_dict[sound].append(word)
            if len(dmeta_dict[sound]) >= 10:
                print("repeat", sound, dmeta_dict[sound])
    count -= 1
    if count == 0:
        break

# print(dmeta_dict)

repeat ('PT', '') ['beady', 'pat', 'petty', 'bawdy', 'bad', 'bowed', 'potty', 'beta', 'bitty', 'boughed']
repeat ('KLT', '') ['cold', 'clawed', 'guilty', 'glued', 'quality', 'cloudy', 'cowled', 'quelled', 'glad', 'gold']
repeat ('PT', '') ['beady', 'pat', 'petty', 'bawdy', 'bad', 'bowed', 'potty', 'beta', 'bitty', 'boughed', 'paid']
repeat ('PRT', '') ['bright', 'pretty', 'broad', 'buried', 'bratty', 'proto', 'berried', 'barred', 'paired', 'powered']
repeat ('PRT', '') ['bright', 'pretty', 'broad', 'buried', 'bratty', 'proto', 'berried', 'barred', 'paired', 'powered', 'proud']
repeat ('FLT', '') ['faulty', 'flat', 'foliate', 'fluid', 'veiled', 'valid', 'fleet', 'filled', 'fueled', 'valued']
repeat ('PRT', '') ['bright', 'pretty', 'broad', 'buried', 'bratty', 'proto', 'berried', 'barred', 'paired', 'powered', 'proud', 'port']
repeat ('FLT', '') ['faulty', 'flat', 'foliate', 'fluid', 'veiled', 'valid', 'fleet', 'filled', 'fueled', 'valued', 'flighty']
repeat ('KLT', '') ['cold', 'clawed'

Double Metaphone examples:

('LS', '') ['lousy', 'loose', 'lazy', 'lossy']

('PRT', '') ['bright', 'pretty', 'broad', 'buried', 'bratty', 'proto', 'berried', 'barred']

('PT', '') ['beady', 'pat', 'petty', 'bawdy', 'bad', 'bowed', 'potty', 'beta', 'bitty', 'boughed', 'paid', 'peaty']
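
Because the cell prints a bucket every time it grows, the same list appears many times in the output. A minimal sketch of summarising the finished dictionary instead is given below; `largest_buckets` and its parameters are hypothetical names, and the sample dictionary simply reuses the three buckets listed above.

```python
def largest_buckets(sound_dict, n=5, min_size=2):
    """Return the n largest (code, words) buckets holding at least min_size words."""
    shared = [(code, words) for code, words in sound_dict.items()
              if len(words) >= min_size]
    # sorted() is stable, so buckets of equal size keep their insertion order
    shared.sort(key=lambda item: len(item[1]), reverse=True)
    return shared[:n]


# The three example buckets quoted above
sample = {
    ('LS', ''): ['lousy', 'loose', 'lazy', 'lossy'],
    ('PRT', ''): ['bright', 'pretty', 'broad', 'buried', 'bratty', 'proto',
                  'berried', 'barred'],
    ('PT', ''): ['beady', 'pat', 'petty', 'bawdy', 'bad', 'bowed', 'potty',
                 'beta', 'bitty', 'boughed', 'paid', 'peaty'],
}

for code, words in largest_buckets(sample, n=3):
    print(code, len(words), words)
```

Calling this once after the loop gives each bucket a single line of output, which makes the very broad codes easier to spot.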

In [77]:

''' Runs through all the words in the dictionary (in WordNet)
and makes a dictionary mapping each match rating codex to its words.
'''
match_rating_dict = {}
count = 15000
for ss in wn.all_synsets():
    word = ss.lemma_names()[0].lower()
    sound = jellyfish.match_rating_codex(word)
  
    if '-' in word:
#         print(word)
        continue
    if '*' in sound:
        print(sound)
        continue
    
#     print(word)
    if sound not in match_rating_dict:
        match_rating_dict[sound] = [word]
    else:
        if word not in match_rating_dict[sound]:
            match_rating_dict[sound].append(word)
            print("repeat", sound, match_rating_dict[sound])
    count -= 1
    if count == 0:
        break
        
# print(match_rating_dict)

repeat HYPCTV ['hyperactive', 'hypoactive']
repeat IDL ['ideal', 'idle']
repeat AFRD ['afraid', 'afeard']
repeat PRNTL ['prenatal', 'perinatal']
repeat AMPYLR ['amphiprostylar', 'amphistylar']
repeat CT ['cut', 'cute']
repeat BNDBL ['bondable', 'bindable']
repeat ADBL ['addable', 'audible']
repeat UNPSNG ['unprepossessing', 'unpromising']
repeat APLNG ['appealing', 'appalling']
repeat RR ['rare', 'rear']
repeat FLT ['flat', 'foliate']
repeat BLD ['billed', 'bald']
repeat PLS ['plus', 'pilous']
repeat WRY ['wary', 'wiry']
repeat BLD ['billed', 'bald', 'bellied']
repeat UNBSHD ['unabashed', 'unblemished']
repeat BLD ['billed', 'bald', 'bellied', 'bold']
repeat FTRD ['featured', 'fettered']
repeat FRLD ['frilled', 'furled']
repeat DMD ['doomed', 'dimmed']
repeat BRD ['broad', 'buried']
repeat BNY ['bonny', 'bony']
repeat BND ['bound', 'boned']
repeat BLWY ['billowy', 'blowy']
repeat PRDCS ['predaceous', 'predacious']
repeat CLN ['clean', 'cauline']
repeat CSL ['casual', 'causal']
repeat O

repeat NTRL ['neutral', 'natural']
repeat INTGBL ['intangible', 'intelligible']
repeat SLRD ['salaried', 'slurred']
repeat CSLS ['ceaseless', 'causeless']
repeat INTPCS ['interspecies', 'intraspecies']
repeat BRNG ['bearing', 'boring']
repeat INTMRL ['intramural', 'intermural']
repeat INTRSV ['introversive', 'intrusive']
repeat JTNG ['jetting', 'jutting']
repeat EXTRSV ['extroversive', 'extrusive']
repeat EXHTNG ['exhausting', 'exhilarating']
repeat INVTNG ['invigorating', 'inviting']
repeat UNNTNG ['uninteresting', 'uninviting']
repeat UNRND ['unearned', 'unironed']
repeat UNDRVD ['underived', 'undeserved']
repeat CHRTLS ['christless', 'chartless']
repeat UNDSTD ['unadjusted', 'undigested', 'understood']
repeat INTRTD ['integrated', 'interpreted']
repeat UNNSTD ['uninterested', 'ununderstood']
repeat UNCNDD ['uncompounded', 'uncomprehended']
repeat MCRCPC ['macroscopic', 'microscopic']
repeat MCR ['macro', 'micro']
repeat LWRD ['lowered', 'leeward']
repeat BND ['bound', 'boned', 'bann

repeat UNMNZD ['unmechanized', 'unmodernized']
repeat DGRSV ['digressive', 'degressive']
repeat PRDCTV ['predicative', 'productive', 'predictive']
repeat NNPCTV ['nonpsychoactive', 'nonproductive', 'nonprognosticative']
repeat UNPCTV ['unproductive', 'unappreciative', 'unpredictive']
repeat MTD ['matted', 'mated', 'moated']
repeat UNPCTD ['unappreciated', 'unprotected']
repeat UNPCTV ['unproductive', 'unappreciative', 'unpredictive', 'unprotective']
repeat PRD ['paired', 'proud']
repeat ARGNT ['argent', 'arrogant']
repeat CNCTD ['connected', 'conceited']
repeat UNVRFD ['unvitrified', 'unverified']
repeat IMPDNT ['imprudent', 'improvident']
repeat UNFTFL ['unfruitful', 'unforethoughtful']
repeat RSNG ['rising', 'rousing']
repeat UNPCTV ['unproductive', 'unappreciative', 'unpredictive', 'unprotective', 'unprovocative']
repeat RSH ['rush', 'rash']
repeat BLTD ['belted', 'belated']
repeat UNPSHD ['unpublished', 'unpolished', 'unpunished']
repeat UNDRTD ['undistorted', 'unadulterated']
repe

repeat SRD ['serried', 'seared', 'soured']
repeat UNSRD ['unassured', 'unsoured']
repeat UNSCTD ['unselected', 'unsophisticated', 'unsuspected']
repeat CRCT ['correct', 'cruciate']
repeat INTGBL ['intangible', 'intelligible', 'interchangeable']
repeat SRL ['sorrel', 'serial']
repeat CMPTRY ['complimentary', 'complementary']
repeat TMD ['timid', 'tumid', 'timed', 'tamed']
repeat FRL ['frail', 'feral']
repeat DLRS ['dolorous', 'delirious']
repeat UNDTTD ['undifferentiated', 'understated']
repeat FRTY ['forty', 'fruity']
repeat UNSLTD ['unstilted', 'unsalted']
repeat DTBL ['datable', 'dutiable']
repeat UNRTBL ['unreportable', 'unrepeatable', 'unrentable', 'unrespectable', 'unratable']
repeat OVRRNG ['overpowering', 'overstrung']
repeat PPRY ['peppery', 'papery']
repeat TNS ['tense', 'tenuous']
repeat GLTNS ['gluttonous', 'gelatinous']
repeat RPY ['ropey', 'ropy']
repeat SPY ['sappy', 'soupy']
repeat BRDNG ['breeding', 'brooding']
repeat UNRCTV ['unreactive', 'unrestrictive', 'unreflective

repeat MCRNMC ['macroeconomic', 'microeconomic']
repeat PRMTRY ['peremptory', 'paramilitary']
repeat MTC ['meiotic', 'miotic']
repeat MNCLNL ['monoclinal', 'monoclonal']
repeat NVL ['novel', 'naval']
repeat PRMDCL ['premedical', 'paramedical']
repeat PRS ['porous', 'porose', 'parous']
repeat PRTD ['parted', 'parotid']
repeat PLN ['plain', 'pauline']
repeat PTY ['petty', 'potty', 'peaty']
repeat PLGSTC ['plagiaristic', 'plagioclastic']
repeat PLR ['polar', 'pilar']
repeat INTTRY ['introductory', 'integumentary', 'interplanetary']
repeat EXTTRL ['extraterritorial', 'extraterrestrial']
repeat PLMNCS ['plumbaginaceous', 'polemoniaceous']
repeat PLTRCT ['politically_correct', 'politically_incorrect']
repeat PRNL ['perennial', 'preanal']
repeat RBD ['ribbed', 'rabid']
repeat RCLS ['recluse', 'recoilless']
repeat SPNS ['spinose', 'spinous', 'sapiens']
repeat SCRFY ['scruffy', 'scurfy']
repeat SMNL ['semiannual', 'seminal']
repeat SMPSTC ['simplistic', 'semiparasitic']
repeat SPLD ['spoiled', 

Jellyfish's match rating codex is not very accurate either, e.g.:

UNPNTD ['unprecedented', 'unpatented', 'unpainted', 'unparented', 'unplanted', 'unpigmented']

INTCLR ['intraventricular', 'intramuscular', 'intramolecular', 'intermolecular']
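
One rough way to put a number on how loose such a bucket is would be to average a string similarity over all pairs of words in it. The sketch below uses `difflib.SequenceMatcher` from the standard library, which compares spelling rather than sound, so it is only a proxy and an assumption on our part, not something the phonetic libraries provide. `mean_pairwise_similarity` is a hypothetical helper name.

```python
from difflib import SequenceMatcher
from itertools import combinations


def mean_pairwise_similarity(words):
    """Average SequenceMatcher ratio over all unordered pairs of words.

    Returns 1.0 for buckets with fewer than two words (nothing to compare).
    Note: this measures spelling overlap, not pronunciation.
    """
    pairs = list(combinations(words, 2))
    if not pairs:
        return 1.0
    total = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)


# The two buckets quoted above
unpntd = ['unprecedented', 'unpatented', 'unpainted', 'unparented',
          'unplanted', 'unpigmented']
intclr = ['intraventricular', 'intramuscular', 'intramolecular',
          'intermolecular']

print('UNPNTD', mean_pairwise_similarity(unpntd))
print('INTCLR', mean_pairwise_similarity(intclr))
```

A low score flags a bucket whose members share little beyond their code, which could be used to rank the encoders against each other on the same word list.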

In [78]:

''' Runs through all the words in the dictionary (in WordNet)
and makes a dictionary mapping each NYSIIS code to its words.
'''
nysiis_dict = {}
count = 1500
for ss in wn.all_synsets():
    word = ss.lemma_names()[0].lower()
    sound = phonetics.nysiis(word)
  
    if '-' in word:
#         print(word)
        continue
    if '*' in sound:
        print(sound)
        continue
    
#     print(word)
    if sound not in nysiis_dict:
        nysiis_dict[sound] = [word]
    else:
        if word not in nysiis_dict[sound]:
            nysiis_dict[sound].append(word)
            print("repeat", sound, nysiis_dict[sound])
    count -= 1
    if count == 0:
        break
        
print(nysiis_dict)

repeat A ['able', 'abaxial']
repeat A ['able', 'abaxial', 'adaxial']
repeat A ['able', 'abaxial', 'adaxial', 'acroscopic']
repeat A ['able', 'abaxial', 'adaxial', 'acroscopic', 'abducent']
repeat A ['able', 'abaxial', 'adaxial', 'acroscopic', 'abducent', 'adducent']
repeat DA ['dissilient', 'dying']
repeat A ['able', 'abaxial', 'adaxial', 'acroscopic', 'abducent', 'adducent', 'abridged']
repeat PA ['parturient', 'potted']
repeat UA ['unable', 'unabridged']
repeat A ['able', 'abaxial', 'adaxial', 'acroscopic', 'abducent', 'adducent', 'abridged', 'absolute']
repeat DA ['dissilient', 'dying', 'direct']
repeat LA ['last', 'living']
repeat RA ['relative', 'relational']
repeat A ['able', 'abaxial', 'adaxial', 'acroscopic', 'abducent', 'adducent', 'abridged', 'absolute', 'absorbent']
repeat A ['able', 'abaxial', 'adaxial', 'acroscopic', 'abducent', 'adducent', 'abridged', 'absolute', 'absorbent', 'absorbefacient']
repeat RA ['relative', 'relational', 'receptive']


IndexError: list index out of range

NYSIIS is not suitable for our method because its sound codes are too inaccurate, e.g.:

A ['able', 'abaxial', 'adaxial', 'acroscopic', 'abducent', 'adducent', 'abridged', 'absolute', 'absorbent', 'absorbefacient']

---

Thus, **soundex** and **double metaphone**, and maybe match rating codex (?), seem suitable. After building the dictionary, continue with:

* a function to get the fuzzy words (words sharing a sound code) for a given word
* a function to check the similarity of each (lemmatized) word in the sentence with the word (original + fuzzy words)
* a function to return the difference in the similarity spike between the original word and another word among the fuzzy words
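
The first of these functions could look roughly like the sketch below. It only covers the lookup step; the similarity functions depend on the synset work from earlier in the notebook and are not attempted here. `get_fuzzy_words` is a hypothetical name, and `toy_codes` is a hand-written table standing in for the real encoder (`phonetics.dmetaphone` in this notebook), filled with one bucket from the output above.

```python
def get_fuzzy_words(word, sound_dict, encode):
    """Return the other words that share `word`'s sound code.

    sound_dict maps a sound code to a list of words; encode maps a word
    to its sound code. Unknown codes yield an empty list.
    """
    word = word.lower()
    code = encode(word)
    return [w for w in sound_dict.get(code, []) if w != word]


# Hand-written stand-in for the real encoder (an assumption for illustration)
toy_codes = {
    'lousy': ('LS', ''),
    'loose': ('LS', ''),
    'lazy': ('LS', ''),
    'lossy': ('LS', ''),
}
toy_encode = toy_codes.get            # returns None for unknown words
toy_sound_dict = {('LS', ''): ['lousy', 'loose', 'lazy', 'lossy']}

print(get_fuzzy_words('lazy', toy_sound_dict, toy_encode))
```

Taking the encoder as a parameter means the same function works with soundex, Double Metaphone, or match rating codex dictionaries, whichever ends up being used.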

Below is the creation of dmeta_dict, a dictionary that maps each Double Metaphone code to the list of words that share it.

In [79]:
''' Runs through all the words in the dictionary (in WordNet)
and makes a dictionary mapping each Double Metaphone code to the words that share it.
'''
dmeta_dict = {}
# count = 15000
total = 0
for ss in wn.all_synsets():
    word = ss.lemma_names()[0].lower()
    sound = phonetics.dmetaphone(word)
  
    if '-' in word:
#         print(word)
        continue
    if '*' in sound:
        print(sound)
        continue
    
#     print(word)
    if sound not in dmeta_dict:
        dmeta_dict[sound] = [word]
    else:
        if word not in dmeta_dict[sound]:
            dmeta_dict[sound].append(word)
            if len(dmeta_dict[sound]) >= 30:
                print("repeat", sound, dmeta_dict[sound])
#     count -= 1
#     if count == 0:
#         break
    total += 1
    if total % 10000 == 0:
        print(f'printed words: {total}')

print(f'Done, {total} printed')
# print(dmeta_dict)

printed words: 10000
printed words: 20000
printed words: 30000
repeat ('PT', '') ['beady', 'pat', 'petty', 'bawdy', 'bad', 'bowed', 'potty', 'beta', 'bitty', 'boughed', 'paid', 'peaty', 'beat', 'bite', 'boot', 'battue', 'bat', 'bet', 'putt', 'bid', 'buyout', 'pet', 'padda', 'pitta', 'buteo', 'boidae', 'bowhead', 'bot', 'pad', 'potto']
repeat ('PT', '') ['beady', 'pat', 'petty', 'bawdy', 'bad', 'bowed', 'potty', 'beta', 'bitty', 'boughed', 'paid', 'peaty', 'beat', 'bite', 'boot', 'battue', 'bat', 'bet', 'putt', 'bid', 'buyout', 'pet', 'padda', 'pitta', 'buteo', 'boidae', 'bowhead', 'bot', 'pad', 'potto', 'bait']
repeat ('PT', '') ['beady', 'pat', 'petty', 'bawdy', 'bad', 'bowed', 'potty', 'beta', 'bitty', 'boughed', 'paid', 'peaty', 'beat', 'bite', 'boot', 'battue', 'bat', 'bet', 'putt', 'bid', 'buyout', 'pet', 'padda', 'pitta', 'buteo', 'boidae', 'bowhead', 'bot', 'pad', 'potto', 'bait', 'bead']
repeat ('PT', '') ['beady', 'pat', 'petty', 'bawdy', 'bad', 'bowed', 'potty', 'beta', 'bitt

repeat ('KT', '') ['cut', 'quiet', 'cute', 'good', 'keyed', 'gouty', 'kuwaiti', 'quite', 'c.o.d.', 'getaway', 'gait', 'kittee', 'get', 'kit', 'kite', 'coat', 'coot', 'coyote', 'cat', 'kitty', 'goat', 'kid', 'kudu', 'coati', 'cod', 'caddy', 'coatee', 'cot', 'cote', 'cowhide', 'cuddy', 'cutaway', 'gaddi', 'gat', 'gate', 'gateway', 'ghat', 'guide', 'kat']
repeat ('KT', '') ['cut', 'quiet', 'cute', 'good', 'keyed', 'gouty', 'kuwaiti', 'quite', 'c.o.d.', 'getaway', 'gait', 'kittee', 'get', 'kit', 'kite', 'coat', 'coot', 'coyote', 'cat', 'kitty', 'goat', 'kid', 'kudu', 'coati', 'cod', 'caddy', 'coatee', 'cot', 'cote', 'cowhide', 'cuddy', 'cutaway', 'gaddi', 'gat', 'gate', 'gateway', 'ghat', 'guide', 'kat', 'khadi']
repeat ('KT', '') ['cut', 'quiet', 'cute', 'good', 'keyed', 'gouty', 'kuwaiti', 'quite', 'c.o.d.', 'getaway', 'gait', 'kittee', 'get', 'kit', 'kite', 'coat', 'coot', 'coyote', 'cat', 'kitty', 'goat', 'kid', 'kudu', 'coati', 'cod', 'caddy', 'coatee', 'cot', 'cote', 'cowhide', 'cudd

repeat ('PLT', '') ['billed', 'bald', 'bellied', 'bloody', 'bold', 'boiled', 'polite', 'built', 'plowed', 'bullate', 'played', 'blot', 'bolt', 'ballet', 'polity', 'plataea', 'platy', 'pullet', 'blatta', 'peludo', 'bullhead', 'bolti', 'belt', 'beltway', 'billet', 'blade', 'bolo_tie', 'bullet', 'palette', 'pallet', 'pallette', 'plat', 'plate', 'pleat', 'plywood', 'blood']
repeat ('PT', '') ['beady', 'pat', 'petty', 'bawdy', 'bad', 'bowed', 'potty', 'beta', 'bitty', 'boughed', 'paid', 'peaty', 'beat', 'bite', 'boot', 'battue', 'bat', 'bet', 'putt', 'bid', 'buyout', 'pet', 'padda', 'pitta', 'buteo', 'boidae', 'bowhead', 'bot', 'pad', 'potto', 'bait', 'bead', 'bed', 'bight', 'bit', 'boat', 'body', 'bootee', 'bota', 'bow_tie', 'butt', 'patio', 'pieta', 'pit', 'pod', 'pot', 'puttee', 'beauty']
repeat ('PRT', '') ['bright', 'pretty', 'broad', 'buried', 'bratty', 'proto', 'berried', 'barred', 'paired', 'powered', 'proud', 'port', 'brut', 'bored', 'broody', 'by_heart', 'pirouette', 'parade', 'pa

repeat ('KRT', '') ['greedy', 'crude', 'cruddy', 'cured', 'great', 'grotty', 'carroty', 'guard', 'karate', 'carrot', 'court', 'caretta', 'krait', 'gruidae', 'coreidae', 'grade', 'card', 'cart', 'cord', 'crate', 'cruet', 'curette', 'garrote', 'gourd', 'grate', 'grid', 'kurta', 'quirt', 'greed', 'quarto', 'choroid', 'cardia', 'creed', 'key_word', 'caret']
repeat ('PRT', '') ['bright', 'pretty', 'broad', 'buried', 'bratty', 'proto', 'berried', 'barred', 'paired', 'powered', 'proud', 'port', 'brut', 'bored', 'broody', 'by_heart', 'pirouette', 'parade', 'parody', 'pride', 'bird', 'paridae', 'parrot', 'brit', 'pieridae', 'pierid', 'beard', 'bard', 'barrette', 'beret', 'biretta', 'board', 'brad', 'braid', 'burette', 'part', 'pirate', 'prod', 'purdah', 'parity', 'breadth', 'breed', 'period']
repeat ('KRT', '') ['greedy', 'crude', 'cruddy', 'cured', 'great', 'grotty', 'carroty', 'guard', 'karate', 'carrot', 'court', 'caretta', 'krait', 'gruidae', 'coreidae', 'grade', 'card', 'cart', 'cord', 'cr

repeat ('KL', '') ['gluey', 'clayey', 'cool', 'gaily', 'coyly', 'coolly', 'call', 'goal', 'kill', 'gala', 'quail', 'koala', 'quill', 'gull', 'collie', 'galloway', 'gayal', 'gulo', 'coil', 'cowl', 'cul', 'galley', 'keel', 'kohl', 'kylie', 'caul', 'cull', 'clue', 'kiliwa', 'kwela', 'kale']
repeat ('PT', '') ['beady', 'pat', 'petty', 'bawdy', 'bad', 'bowed', 'potty', 'beta', 'bitty', 'boughed', 'paid', 'peaty', 'beat', 'bite', 'boot', 'battue', 'bat', 'bet', 'putt', 'bid', 'buyout', 'pet', 'padda', 'pitta', 'buteo', 'boidae', 'bowhead', 'bot', 'pad', 'potto', 'bait', 'bead', 'bed', 'bight', 'bit', 'boat', 'body', 'bootee', 'bota', 'bow_tie', 'butt', 'patio', 'pieta', 'pit', 'pod', 'pot', 'puttee', 'beauty', 'piety', 'pate', 'pout', 'paiute', 'bade', 'bata', 'bayat', 'pity', 'bout', 'patty', 'pita', 'butty', 'beet']
repeat ('KRS', '') ['grassy', 'greasy', 'curious', 'crazy', 'gross', 'carious', 'cross', 'coarse', 'crass', 'course', 'caress', 'cruise', 'carouse', 'curacy', 'graze', 'crecy',

repeat ('KRT', '') ['greedy', 'crude', 'cruddy', 'cured', 'great', 'grotty', 'carroty', 'guard', 'karate', 'carrot', 'court', 'caretta', 'krait', 'gruidae', 'coreidae', 'grade', 'card', 'cart', 'cord', 'crate', 'cruet', 'curette', 'garrote', 'gourd', 'grate', 'grid', 'kurta', 'quirt', 'greed', 'quarto', 'choroid', 'cardia', 'creed', 'key_word', 'caret', 'chord', 'curd', 'crowd']
repeat ('ST', '') ['sooty', 'saute', 'suety', 'pseudo', 'sad', 'sewed', 'seedy', 'side', 'set', 'sideway', 'suttee', 'stay', 'seaweed', 'zooid', 'zeidae', 'sitta', 'seta', 'suidae', 'psetta', 'seat', 'settee', 'sty', 'suit', 'suite', 'sight', 'zeta', 'sadhe', 'suet', 'ziti', 'city']
repeat ('KT', '') ['cut', 'quiet', 'cute', 'good', 'keyed', 'gouty', 'kuwaiti', 'quite', 'c.o.d.', 'getaway', 'gait', 'kittee', 'get', 'kit', 'kite', 'coat', 'coot', 'coyote', 'cat', 'kitty', 'goat', 'kid', 'kudu', 'coati', 'cod', 'caddy', 'coatee', 'cot', 'cote', 'cowhide', 'cuddy', 'cutaway', 'gaddi', 'gat', 'gate', 'gateway', 'gh

repeat ('TK', '') ['twiggy', 'dicky', 'tacky', 'doggo', 'tug', 'dig', 'tag', 'tack', 'takeaway', 'toke', 'dekko', 'take', 'tick', 'duck', 'dog', 'dock', 'dug', 'teg', 'dacha', 'dachau', 'deck', 'dickey', 'toga', 'toque', 'tuck', 'tokay', 'taco', 'deco', 'togo', 'dhaka', 'tokyo', 'taegu']
repeat ('PRT', '') ['bright', 'pretty', 'broad', 'buried', 'bratty', 'proto', 'berried', 'barred', 'paired', 'powered', 'proud', 'port', 'brut', 'bored', 'broody', 'by_heart', 'pirouette', 'parade', 'parody', 'pride', 'bird', 'paridae', 'parrot', 'brit', 'pieridae', 'pierid', 'beard', 'bard', 'barrette', 'beret', 'biretta', 'board', 'brad', 'braid', 'burette', 'part', 'pirate', 'prod', 'purdah', 'parity', 'breadth', 'breed', 'period', 'prate', 'party', 'bread', 'burrito', 'bordeaux', 'brood', 'porte', 'bayrut']
repeat ('ST', '') ['sooty', 'saute', 'suety', 'pseudo', 'sad', 'sewed', 'seedy', 'side', 'set', 'sideway', 'suttee', 'stay', 'seaweed', 'zooid', 'zeidae', 'sitta', 'seta', 'suidae', 'psetta', 's

repeat ('PT', '') ['beady', 'pat', 'petty', 'bawdy', 'bad', 'bowed', 'potty', 'beta', 'bitty', 'boughed', 'paid', 'peaty', 'beat', 'bite', 'boot', 'battue', 'bat', 'bet', 'putt', 'bid', 'buyout', 'pet', 'padda', 'pitta', 'buteo', 'boidae', 'bowhead', 'bot', 'pad', 'potto', 'bait', 'bead', 'bed', 'bight', 'bit', 'boat', 'body', 'bootee', 'bota', 'bow_tie', 'butt', 'patio', 'pieta', 'pit', 'pod', 'pot', 'puttee', 'beauty', 'piety', 'pate', 'pout', 'paiute', 'bade', 'bata', 'bayat', 'pity', 'bout', 'patty', 'pita', 'butty', 'beet', 'paddy', 'biota', 'padua', 'butte', 'pee_dee']
repeat ('PLT', '') ['billed', 'bald', 'bellied', 'bloody', 'bold', 'boiled', 'polite', 'built', 'plowed', 'bullate', 'played', 'blot', 'bolt', 'ballet', 'polity', 'plataea', 'platy', 'pullet', 'blatta', 'peludo', 'bullhead', 'bolti', 'belt', 'beltway', 'billet', 'blade', 'bolo_tie', 'bullet', 'palette', 'pallet', 'pallette', 'plat', 'plate', 'pleat', 'plywood', 'blood', 'palate', 'plot', 'ballad', 'ballade', 'ballo

repeat ('PL', '') ['billowy', 'blowy', 'blue', 'blae', 'pale', 'bally', 'play', 'bull', 'ball', 'pull', 'ploy', 'polo', 'pool', 'bill', 'poll', 'plea', 'belly', 'billy', 'bailey', 'bale', 'bell', 'bola', 'bolo', 'boulle', 'bowl', 'bulla', 'paisley', 'pall', 'pawl', 'pile', 'pill', 'ply', 'pole', 'pulley', 'bile', 'bail', 'pali', 'bole', 'peal', 'paella', 'bialy', 'peel', 'bleu', 'bali', 'palau', 'bulawayo', 'belay', 'baal', 'bel', 'bailee']
repeat ('PRN', '') ['brown', 'barren', 'prone', 'brainy', 'born', 'brawny', 'bahraini', 'burn', 'prinia', 'prawn', 'bruin', 'bighorn', 'piranha', 'barn', 'bren', 'burin', 'brawn', 'brain', 'bourn', 'purana', 'bran', 'brownie', 'biryani', 'prune', 'pruno', 'brine', 'barony', 'brno', 'borneo', 'bahrain', 'brunei', 'bern', 'bryan', 'baryon', 'parana', 'prion', 'bairn']
repeat ('PRN', '') ['brown', 'barren', 'prone', 'brainy', 'born', 'brawny', 'bahraini', 'burn', 'prinia', 'prawn', 'bruin', 'bighorn', 'piranha', 'barn', 'bren', 'burin', 'brawn', 'brain

repeat ('MT', '') ['mute', 'made', 'mid', 'moody', 'mod', 'midi', 'meaty', 'mighty', 'moot', 'myoid', 'mot', 'midway', 'mite', 'mat', 'middy', 'moat', 'might', 'mode', 'mud', 'mayday', 'motto', 'meet', 'meat', 'mead', 'mate', 'moiety', 'medea', 'maidu', 'mahdi', 'maid']
repeat ('PRS', '') ['porous', 'porose', 'breezy', 'brassy', 'parous', 'press', 'piracy', 'browse', 'parus', 'parazoa', 'borzoi', 'pieris', 'bryozoa', 'bourse', 'brace', 'brass', 'brassie', 'poorhouse', 'purse', 'price', 'bursa', 'prose', 'praise', 'bercy', 'powerhouse', 'prussia', 'brescia', 'paris', 'purace', 'purus', 'boreas', 'parsee']
repeat ('PR', '') ['bare', 'pure', 'beery', 'poor', 'bowery', 'pro', 'parry', 'praya', 'beroe', 'bear', 'prey', 'burro', 'boar', 'parr', 'bar', 'bore', 'bur', 'burr', 'power', 'pore', 'bura', 'bray', 'purr', 'puree', 'berry', 'pear', 'brie', 'beer', 'perry', 'pair', 'pyre', 'barrio', 'praia', 'bari', 'beira', 'peru', 'peoria', 'pierre', 'brae', 'para', 'peri', 'peer', 'buyer', 'pawer']

repeat ('AL', '') ['ill', 'whole', 'all', 'awheel', 'oily', 'wholly', 'awhile', 'alee', 'aliyah', 'owl', 'whale', 'ala', 'eel', 'aisle', 'alley', 'awl', 'ell', 'oil', 'wheel', 'yawl', 'aioli', 'ale', 'ally', 'islay', 'isle', 'yalu', 'allah', 'elli', 'ull', 'awol', 'ailey']
repeat ('AL', '') ['ill', 'whole', 'all', 'awheel', 'oily', 'wholly', 'awhile', 'alee', 'aliyah', 'owl', 'whale', 'ala', 'eel', 'aisle', 'alley', 'awl', 'ell', 'oil', 'wheel', 'yawl', 'aioli', 'ale', 'ally', 'islay', 'isle', 'yalu', 'allah', 'elli', 'ull', 'awol', 'ailey', 'ali']
repeat ('PRK', '') ['braky', 'breakaway', 'baroque', 'broke', 'barky', 'baric', 'boric', 'pyrrhic', 'break', 'bear_hug', 'prick', 'perca', 'bark', 'barrack', 'brake', 'brick', 'brig', 'burqa', 'park', 'parka', 'periwig', 'brag', 'burgoo', 'pork', 'burgh', 'burg', 'prague', 'paraguay', 'braga', 'brook', 'parcae', 'berk', 'baraka']
repeat ('PRN', '') ['brown', 'barren', 'prone', 'brainy', 'born', 'brawny', 'bahraini', 'burn', 'prinia', 'prawn'

repeat ('MT', '') ['mute', 'made', 'mid', 'moody', 'mod', 'midi', 'meaty', 'mighty', 'moot', 'myoid', 'mot', 'midway', 'mite', 'mat', 'middy', 'moat', 'might', 'mode', 'mud', 'mayday', 'motto', 'meet', 'meat', 'mead', 'mate', 'moiety', 'medea', 'maidu', 'mahdi', 'maid', 'meade', 'mott']
repeat ('AR', '') ['aware', 'eerie', 'airy', 'aweary', 'awry', 'ara', 'uria', 'eira', 'area', 'areaway', 'array', 'oar', 'wherry', 'air', 'aura', 'ear', 'uighur', 'oriya', 'aria', 'whir', 'oreo', 'aerie', 'ayr', 'ur', 'erie', 'aare', 'aire', 'arroyo', 'eyre', "o'hara"]
repeat ('AR', '') ['aware', 'eerie', 'airy', 'aweary', 'awry', 'ara', 'uria', 'eira', 'area', 'areaway', 'array', 'oar', 'wherry', 'air', 'aura', 'ear', 'uighur', 'oriya', 'aria', 'whir', 'oreo', 'aerie', 'ayr', 'ur', 'erie', 'aare', 'aire', 'arroyo', 'eyre', "o'hara", 'orr']
repeat ('AT', '') ['white', 'odd', 'out', 'eyed', 'awed', 'eight', 'eighty', 'ad', 'yet', 'aid', 'whydah', 'uta', 'ao_dai', 'audio', 'etui', 'id', 'idea', 'ode', 'ad

repeat ('KNT', '') ['gowned', 'canty', 'quaint', 'kind', 'canned', 'connate', 'cuneate', 'khanate', 'count', 'gannet', 'canidae', 'ganoidei', 'ganoid', 'cunt', 'gonad', 'keynote', 'canto', 'khanty', 'kannada', 'gondi', 'cant', 'candy', 'chianti', 'county', 'kandy', 'canada', 'kent', 'kennedy', 'gond', 'canetti', 'canute', 'gandhi', 'gounod', 'kant', 'kaunda', 'kenyata', 'coontie']
repeat ('TN', '') ['downy', 'dun', 'tan', 'tawny', 'done', 'down', 'tinny', 'ten', 'deweyan', 'tune', 'taenia', 'tinea', 'tuna', 'den', 'tannoy', 'tin', 'tine', 'tun', 'tone', 'don', 'dona', 'tai_nuea', 'tai_yuan', 'dawn', 'town', 'taiyuan', 'taiwan', 'dune', 'tauon', 'twin', 'tyne', 'danu', 'diana', 'dane', 'dean', 'donna', 'doyenne', 'duenna', 'townee', 'townie', 'dayan', 'donne', 'taney', 'tawney', 'tunney', 'dioon']
repeat ('PN', '') ['bonny', 'bony', 'bone', 'boon', 'puny', 'piano', 'pawn', 'boyne', 'pen', 'bunny', 'pony', 'pan', 'beanie', 'bin', 'pane', 'peen', 'pin', 'pain', 'pun', 'pawnee', 'paean', '

repeat ('KL', '') ['gluey', 'clayey', 'cool', 'gaily', 'coyly', 'coolly', 'call', 'goal', 'kill', 'gala', 'quail', 'koala', 'quill', 'gull', 'collie', 'galloway', 'gayal', 'gulo', 'coil', 'cowl', 'cul', 'galley', 'keel', 'kohl', 'kylie', 'caul', 'cull', 'clue', 'kiliwa', 'kwela', 'kale', 'kahlua', 'cola', 'cali', 'galway', 'gaul', 'gulu', 'col', 'gully', 'gula', 'kali', 'ghoul', 'clio', 'coolie', 'gael', 'gal', 'clay', 'kelly', 'klee', 'gale', 'calla', 'guayule', 'glaux']
repeat ('TT', '') ['tight', 'dead', 'tied', 'tweedy', 'dowdy', 'toed', 'dud', 'tod', 'taut', 'tidy', 'to_a_t', 'today', 'tattoo', 'duty', 'diet', 'twit', 'tyto', 'teiidae', 'dodo', 'tody', 'tatouay', 'titi', 'dado', 'deadeye', 'dhoti', 'diode', 'teddy', 'tweed', 'deed', 'dot', 'toda', 'duet', 'ditty', 'toot', 'tweet', 'tide', 'date', 'data', 'dada', 'toyota', 'tideway', 'deity', 'dido', 'dad', 'ted', 'todd', 'tout', 'towhead', 'tutee', 'tate', 'tati', 'tito', 'tutu', 'dita', 'toetoe']
repeat ('P', '') ['bay', 'by', 'b

repeat ('TL', '') ['dull', 'tall', 'daily', 'dual', 'dully', 'duel', 'delay', 'deal', 'dole', 'teal', 'dhole', 'tail', 'dial', 'doily', 'doll', 'dolly', 'dowel', 'tile', 'tole', 'tool', 'towel', 'tuille', 'tulle', 'twill', 'tai_lue', 'tulu', 'dill', 'douala', 'delhi', 'dale', 'dell', 'tell', 'dali', 'dahlia', 'tilia', 'dalea', 'tolu']
repeat ('KR', '') ['gory', 'grey', 'care', 'carry', 'cowrie', 'cur', 'gaur', 'gar', 'car', 'core', 'curio', 'gharry', 'gore', 'gray', 'cree', 'khowar', 'kera', 'gur', 'cry', 'caraway', 'curry', 'curia', 'cairo', 'korea', 'gary', 'kura', 'quaoar', 'gauri', 'guru', 'curie', 'kauri', 'grewia', 'carya', 'guar', 'quira']
repeat ('KN', '') ['keen', 'gone', 'chian', 'con', 'gun', 'cannae', 'guan', 'cuon', 'queen', 'coney', 'can', 'cane', 'canoe', 'cone', 'gown', 'quoin', 'koan', 'khuen', 'koine', 'cayenne', 'kin', 'qin', 'kenya', 'ghana', 'guinea', 'guiana', 'guyana', 'kanawha', 'kaon', 'gwyn', 'cain', 'coon', 'conoy', 'khan', 'cohn', 'gawain', 'gonne', 'gwynn',

repeat ('FLT', '') ['faulty', 'flat', 'foliate', 'fluid', 'veiled', 'valid', 'fleet', 'filled', 'fueled', 'valued', 'flighty', 'vault', 'flit', 'flight', 'flood', 'fold', 'fault', 'fellatio', 'field', 'felidae', 'phyllidae', 'pholidae', 'fauld', 'felt', 'filet', 'fillet', 'float', 'flute', 'violet', 'velleity', 'veloute', 'veld', 'volta', 'flyweight', 'valet', 'fallot', 'pholiota']
repeat ('KL', '') ['gluey', 'clayey', 'cool', 'gaily', 'coyly', 'coolly', 'call', 'goal', 'kill', 'gala', 'quail', 'koala', 'quill', 'gull', 'collie', 'galloway', 'gayal', 'gulo', 'coil', 'cowl', 'cul', 'galley', 'keel', 'kohl', 'kylie', 'caul', 'cull', 'clue', 'kiliwa', 'kwela', 'kale', 'kahlua', 'cola', 'cali', 'galway', 'gaul', 'gulu', 'col', 'gully', 'gula', 'kali', 'ghoul', 'clio', 'coolie', 'gael', 'gal', 'clay', 'kelly', 'klee', 'gale', 'calla', 'guayule', 'glaux', 'kola', 'coyol', 'galea']
repeat ('PLT', '') ['billed', 'bald', 'bellied', 'bloody', 'bold', 'boiled', 'polite', 'built', 'plowed', 'bulla

printed words: 90000
repeat ('PNK', '') ['panicky', 'pink', 'punic', 'bionic', 'bang', 'buying', 'bowing', 'bank', 'bunny_hug', 'bunco', 'bankia', 'punkie', 'bongo', 'pongo', 'bhang', 'bung', 'bunk', 'pung', 'punkah', 'ponca', 'paiwanic', 'pengo', 'bong', 'ping', 'pang', 'panic', 'bannock', 'bangui', 'pangaea', 'poyang', 'pinko', 'pin_oak', 'punica', 'peeing']
repeat ('PT', '') ['beady', 'pat', 'petty', 'bawdy', 'bad', 'bowed', 'potty', 'beta', 'bitty', 'boughed', 'paid', 'peaty', 'beat', 'bite', 'boot', 'battue', 'bat', 'bet', 'putt', 'bid', 'buyout', 'pet', 'padda', 'pitta', 'buteo', 'boidae', 'bowhead', 'bot', 'pad', 'potto', 'bait', 'bead', 'bed', 'bight', 'bit', 'boat', 'body', 'bootee', 'bota', 'bow_tie', 'butt', 'patio', 'pieta', 'pit', 'pod', 'pot', 'puttee', 'beauty', 'piety', 'pate', 'pout', 'paiute', 'bade', 'bata', 'bayat', 'pity', 'bout', 'patty', 'pita', 'butty', 'beet', 'paddy', 'biota', 'padua', 'butte', 'pee_dee', 'ptah', 'buddha', 'buddy', 'poet', 'bede', 'pitt', 'bud

repeat ('TRT', '') ['dirty', 'dowered', 'true_to', 'dried', 'torrid', 'dirt', 'terete', 'tired', 'tiered', 'trot', 'trade', 'tort', 'drogheda', 'teredo', 'trout', 'dart', 'tie_rod', 'tread', 'triode', 'turret', 'trait', 'treaty', 'dard', 'trad', 'tirade', 'treat', 'tart', 'torte', 'tartu', 'dorado', 'dryad', 'druid', 'derrida', 'tourette', 'tarwood', 'tarweed', 'toroid']
repeat ('PLT', '') ['billed', 'bald', 'bellied', 'bloody', 'bold', 'boiled', 'polite', 'built', 'plowed', 'bullate', 'played', 'blot', 'bolt', 'ballet', 'polity', 'plataea', 'platy', 'pullet', 'blatta', 'peludo', 'bullhead', 'bolti', 'belt', 'beltway', 'billet', 'blade', 'bolo_tie', 'bullet', 'palette', 'pallet', 'pallette', 'plat', 'plate', 'pleat', 'plywood', 'blood', 'palate', 'plot', 'ballad', 'ballade', 'ballot', 'plight', 'blowout', 'bleat', 'poulette', 'bollywood', 'blida', 'blighty', 'platte', 'pluto', 'pill_head', 'pilot', 'palladio', 'pilate', 'plato', 'bolide', 'balata', 'blueweed', 'ballota', 'bolete', 'pel

repeat ('KLK', '') ['colicky', 'calico', 'coeliac', 'gallic', 'click', 'galago', 'calk', 'cloak', 'clock', 'clog', 'golliwog', 'gulag', 'cowlick', 'cloaca', 'calque', 'khalkha', 'gaelic', 'colloquy', 'clack', 'cluck', 'glogg', 'claque', 'clique', 'kaluga', 'kalki', 'colleague', 'gluck', 'kellogg', 'galega', 'colic']
repeat ('PK', '') ['back', 'pawky', 'big', 'buggy', 'peaky', 'baggy', 'pukka', 'boggy', 'bc', 'pick', 'peek', 'pica', 'beak', 'buck', 'pug', 'bug', 'pika', 'paca', 'puku', 'pike', 'pogge', 'backhoe', 'bag', 'big_h', 'book', 'pack', 'peg', 'pig', 'pique', 'puck', 'beck', 'bach', 'bock', 'peak', 'bioko', 'baku', 'bog', 'bhaga', 'puka', 'poke', 'buckeye', 'peck', 'pock']
repeat ('TK', '') ['twiggy', 'dicky', 'tacky', 'doggo', 'tug', 'dig', 'tag', 'tack', 'takeaway', 'toke', 'dekko', 'take', 'tick', 'duck', 'dog', 'dock', 'dug', 'teg', 'dacha', 'dachau', 'deck', 'dickey', 'toga', 'toque', 'tuck', 'tokay', 'taco', 'deco', 'togo', 'dhaka', 'tokyo', 'taegu', 'decoy', 'dick', 'duke

repeat ('KL', '') ['gluey', 'clayey', 'cool', 'gaily', 'coyly', 'coolly', 'call', 'goal', 'kill', 'gala', 'quail', 'koala', 'quill', 'gull', 'collie', 'galloway', 'gayal', 'gulo', 'coil', 'cowl', 'cul', 'galley', 'keel', 'kohl', 'kylie', 'caul', 'cull', 'clue', 'kiliwa', 'kwela', 'kale', 'kahlua', 'cola', 'cali', 'galway', 'gaul', 'gulu', 'col', 'gully', 'gula', 'kali', 'ghoul', 'clio', 'coolie', 'gael', 'gal', 'clay', 'kelly', 'klee', 'gale', 'calla', 'guayule', 'glaux', 'kola', 'coyol', 'galea', 'gall', 'glue', 'coal']
repeat ('TL', '') ['dull', 'tall', 'daily', 'dual', 'dully', 'duel', 'delay', 'deal', 'dole', 'teal', 'dhole', 'tail', 'dial', 'doily', 'doll', 'dolly', 'dowel', 'tile', 'tole', 'tool', 'towel', 'tuille', 'tulle', 'twill', 'tai_lue', 'tulu', 'dill', 'douala', 'delhi', 'dale', 'dell', 'tell', 'dali', 'dahlia', 'tilia', 'dalea', 'tolu', 'toll', 'tala', 'dol', 'tael', 'diol']
repeat ('SN', '') ['sane', 'soon', 'sunnah', 'sin', 'sciaena', 'sauna', 'scene', 'seine', 'cyan',

repeat ('AL', '') ['ill', 'whole', 'all', 'awheel', 'oily', 'wholly', 'awhile', 'alee', 'aliyah', 'owl', 'whale', 'ala', 'eel', 'aisle', 'alley', 'awl', 'ell', 'oil', 'wheel', 'yawl', 'aioli', 'ale', 'ally', 'islay', 'isle', 'yalu', 'allah', 'elli', 'ull', 'awol', 'ailey', 'ali', 'yale', 'olea', 'aloe', 'aalii', 'alloy', 'ola', 'while']
repeat ('AR', '') ['aware', 'eerie', 'airy', 'aweary', 'awry', 'ara', 'uria', 'eira', 'area', 'areaway', 'array', 'oar', 'wherry', 'air', 'aura', 'ear', 'uighur', 'oriya', 'aria', 'whir', 'oreo', 'aerie', 'ayr', 'ur', 'erie', 'aare', 'aire', 'arroyo', 'eyre', "o'hara", 'orr', 'urey', 'uriah', 'are', 'euro', 'ore', 'urea', 'yore', 'year', 'iyar', 'era']
repeat ('AK', '') ['awake', 'icky', 'ago', 'okay', 'yoga', 'egg', 'agua', 'uca', 'auk', 'yak', 'uke', 'yoke', 'ego', 'aga', 'o.k.', 'whack', 'ackee', 'iago', 'whig', 'eck', 'eyck', 'yukawa', 'yacca', 'oak', 'yucca', 'oca', 'akee', 'oka', 'ouguiya', 'ague', 'y2k']
repeat ('TS', '') ['dizzy', 'diazo', 'twic

repeat ('TN', '') ['downy', 'dun', 'tan', 'tawny', 'done', 'down', 'tinny', 'ten', 'deweyan', 'tune', 'taenia', 'tinea', 'tuna', 'den', 'tannoy', 'tin', 'tine', 'tun', 'tone', 'don', 'dona', 'tai_nuea', 'tai_yuan', 'dawn', 'town', 'taiyuan', 'taiwan', 'dune', 'tauon', 'twin', 'tyne', 'danu', 'diana', 'dane', 'dean', 'donna', 'doyenne', 'duenna', 'townee', 'townie', 'dayan', 'donne', 'taney', 'tawney', 'tunney', 'dioon', 'toyon', 'toona', 'dionaea', 'danaea', 'dyne', 'tiyin', 'tyiyn', 'din']
repeat ('AR', '') ['aware', 'eerie', 'airy', 'aweary', 'awry', 'ara', 'uria', 'eira', 'area', 'areaway', 'array', 'oar', 'wherry', 'air', 'aura', 'ear', 'uighur', 'oriya', 'aria', 'whir', 'oreo', 'aerie', 'ayr', 'ur', 'erie', 'aare', 'aire', 'arroyo', 'eyre', "o'hara", 'orr', 'urey', 'uriah', 'are', 'euro', 'ore', 'urea', 'yore', 'year', 'iyar', 'era', 'err']
repeat ('SK', '') ['sick', 'sec', 'zoic', 'sikh', 'sic', 'segue', 'soak', 'sack', 'skua', 'saiga', 'saki', 'sockeye', 'screw_eye', 'segway', '

repeat ('KL', '') ['gluey', 'clayey', 'cool', 'gaily', 'coyly', 'coolly', 'call', 'goal', 'kill', 'gala', 'quail', 'koala', 'quill', 'gull', 'collie', 'galloway', 'gayal', 'gulo', 'coil', 'cowl', 'cul', 'galley', 'keel', 'kohl', 'kylie', 'caul', 'cull', 'clue', 'kiliwa', 'kwela', 'kale', 'kahlua', 'cola', 'cali', 'galway', 'gaul', 'gulu', 'col', 'gully', 'gula', 'kali', 'ghoul', 'clio', 'coolie', 'gael', 'gal', 'clay', 'kelly', 'klee', 'gale', 'calla', 'guayule', 'glaux', 'kola', 'coyol', 'galea', 'gall', 'glue', 'coal', 'kwell', 'cloy']
repeat ('KL', '') ['gluey', 'clayey', 'cool', 'gaily', 'coyly', 'coolly', 'call', 'goal', 'kill', 'gala', 'quail', 'koala', 'quill', 'gull', 'collie', 'galloway', 'gayal', 'gulo', 'coil', 'cowl', 'cul', 'galley', 'keel', 'kohl', 'kylie', 'caul', 'cull', 'clue', 'kiliwa', 'kwela', 'kale', 'kahlua', 'cola', 'cali', 'galway', 'gaul', 'gulu', 'col', 'gully', 'gula', 'kali', 'ghoul', 'clio', 'coolie', 'gael', 'gal', 'clay', 'kelly', 'klee', 'gale', 'calla',

repeat ('TT', '') ['tight', 'dead', 'tied', 'tweedy', 'dowdy', 'toed', 'dud', 'tod', 'taut', 'tidy', 'to_a_t', 'today', 'tattoo', 'duty', 'diet', 'twit', 'tyto', 'teiidae', 'dodo', 'tody', 'tatouay', 'titi', 'dado', 'deadeye', 'dhoti', 'diode', 'teddy', 'tweed', 'deed', 'dot', 'toda', 'duet', 'ditty', 'toot', 'tweet', 'tide', 'date', 'data', 'dada', 'toyota', 'tideway', 'deity', 'dido', 'dad', 'ted', 'todd', 'tout', 'towhead', 'tutee', 'tate', 'tati', 'tito', 'tutu', 'dita', 'toetoe', 'todea', 'doodia', 'tad', 'tot', 'dyewood', 'tet', 'die_out', 'ditto', 'tat', 'dote']
repeat ('AN', '') ['one', 'in', 'on', 'own', 'any', "e'en", 'in_a_way', 'yawn', 'ani', 'unio', 'anoa', 'yin', 'ayin', 'yana', 'yuan', 'ana', 'ionia', 'ion', 'anu', 'eon', 'iowan', 'anne', 'ono', 'owen', 'awn', 'anna', 'yen', 'en', 'whine', 'annoy']
repeat ('ALT', '') ['allied', 'old', 'alto', 'alloyed', 'auld', 'oiled', 'wheeled', 'alate', 'a_lot', 'aloud', 'all_too', 'alauda', 'owlet', 'alouatta', 'eyelet', 'yield', 'ey

repeat ('KP', '') ['coup', 'kip', 'gape', 'cub', 'gobio', 'guppy', 'cob', 'coypu', 'kob', 'cobia', 'goby', 'cab', 'cap', 'cape', 'cope', 'copy', 'coupe', 'cubby', 'cube', 'cup', 'gap', 'kaaba', 'keep', 'kepi', 'quipu', 'kappa', 'kai_apple', 'cuppa', 'gob', 'cuba', 'kobe', 'gobi', 'capo', 'cowboy', 'kei_apple', 'cowpea', 'kobo', 'kibe', 'cow_pie', 'go_by', 'go_up', 'keep_away']
repeat ('MR', '') ['mere', 'more', 'moire', 'murre', 'mara', 'mare', 'moray', 'miri', 'maori', 'mwera', 'maar', 'mire', 'moor', 'murray', 'moirai', 'moro', 'mayor', 'mary', 'mary_i', 'mary_ii', 'mayer', 'meir', 'miro', 'moore', 'muir', 'maria', 'myrrh', 'mohria', 'mar', 'marry']
repeat ('KRS', '') ['grassy', 'greasy', 'curious', 'crazy', 'gross', 'carious', 'cross', 'coarse', 'crass', 'course', 'caress', 'cruise', 'carouse', 'curacy', 'graze', 'crecy', 'characeae', 'grouse', 'grus', 'guereza', 'crosse', 'cruse', 'cuirass', 'kris', 'grace', 'craze', 'crus', 'curiosa', 'curse', 'chorus', 'chorizo', 'cress', 'greece

In [208]:
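def get_fuzzy_words(word):
    '''Take a word and return a list of words that sound similar to it.

    Candidates come from the dmeta_dict bucket that shares the word's
    Double Metaphone key, keeping only those with the same first letter.
    '''
    first_ch = word[0]
    sound = phonetics.dmetaphone(word)
    # Use .get so a key that is missing from dmeta_dict yields an empty list
    # instead of raising KeyError
    return [similar_word
            for similar_word in dmeta_dict.get(sound, [])
            # Assume: the first character should be the same
            if similar_word[0] == first_ch]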

In [209]:
get_fuzzy_words('fire')

['fair',
 'far',
 'faraway',
 'free',
 'fore',
 'fiery',
 'fewer',
 'four',
 'faro',
 'feria',
 'foray',
 'fire',
 'ferry',
 'fur',
 'frau',
 'fury',
 'fear',
 'fare',
 'fairway',
 'fairy',
 'frey',
 'freya',
 'fry',
 'frye',
 'fir',
 'fray']
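
To make the first-letter assumption concrete, here is a minimal sketch of its effect on a single bucket. The bucket is copied from this notebook's `dmeta_dict` dump; `toy_dmeta_dict` and `same_first_letter` are hypothetical names used only for illustration, and the key is hard-coded so that the sketch does not need the `phonetics` package.

```python
# One bucket copied from this notebook's dmeta_dict dump; the key is the
# (primary, secondary) Double Metaphone code shared by all of its words.
toy_dmeta_dict = {
    ('FRM', ''): ['firm', 'frame', 'foram', 'farm', 'form', 'forum',
                  'vroom', 'fermi', 'viremia'],
}


def same_first_letter(word, candidates):
    """Keep only candidates that start with the same letter as `word`."""
    return [c for c in candidates if c and c[0] == word[0]]


# The key is hard-coded here (a shortcut for this sketch) instead of being
# computed with phonetics.dmetaphone.
bucket = toy_dmeta_dict[('FRM', '')]
kept = same_first_letter('form', bucket)
dropped = [c for c in bucket if c not in kept]

print(kept)     # candidates that survive the first-letter filter
print(dropped)  # sound-alikes that the filter discards
```

The filter trades recall for precision: 'vroom' and 'viremia' share the sound key but are discarded because they start with a different letter.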

Next, we explore libraries and write functions that will help to filter out candidate words whose number of syllables differs from that of the original word.

https://www.howmanysyllables.com/divideintosyllables
https://stackoverflow.com/questions/46759492/syllable-count-in-python

In [82]:
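import pyphen
# NOTE: 'nl_NL' loads Dutch hyphenation patterns, and the outputs below were
# produced with it; 'en_US' or 'en_GB' would be the matching choice for English words.
dic = pyphen.Pyphen(lang='nl_NL')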

In [83]:
dic.inserted('pseudoscorpiones')

'pseu-do-scor-pi-o-nes'

In [84]:
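def syllable_count(word):
    '''Estimate the number of syllables in a word by counting vowel groups.'''
    word = word.lower()
    if not word:
        return 0
    count = 0
    vowels = "aeiouy"
    if word[0] in vowels:
        count += 1
    for index in range(1, len(word)):
        # A vowel that follows a non-vowel starts a new vowel group
        if word[index] in vowels and word[index - 1] not in vowels:
            count += 1
    # Treat a trailing 'e' after a consonant as silent (e.g. 'fire');
    # the length check avoids an IndexError on one-letter words
    if len(word) > 1 and word.endswith("e") and word[-2] not in vowels:
        count -= 1
    if count == 0:
        count += 1
    return count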

In [85]:
test_words = [ 'pieridae', 'foliate', 'fluid', 'veiled', 'valid', 'fleet', 'piroutte', 'pseudoscorpiones', 'acromion', 'achiotes', ]

In [107]:
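for word in test_words:
    print(word, syllable_count(word))
    # count('-') is the number of hyphenation points, i.e. one less than
    # the number of chunks pyphen split the word into
    print(dic.inserted(word), dic.inserted(word).count('-'))
print(dic.inserted('megabyte'))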

pieridae 3
pie-ri-dae 2
foliate 2
fo-li-a-te 3
fluid 1
fluid 0
veiled 2
vei-led 1
valid 2
va-lid 1
fleet 1
fleet 0
piroutte 2
pi-rou-t-te 3
pseudoscorpiones 5
pseu-do-scor-pi-o-nes 5
acromion 3
acro-mi-on 2
achiotes 3
achi-o-tes 2
me-ga-by-te


The syllable counter function is slightly off in certain cases: words containing 'io', such as 'scorpion' and 'portion' (the 'io' is counted as a single vowel group), and words ending in 'es', e.g. 'achiotes'.
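
As a rough illustration of how such a counter could be used for filtering, the sketch below keeps only the sound-alike candidates whose estimated syllable count matches the original word. It repeats the same vowel-group heuristic so that it is self-contained; `same_syllable_candidates` is a hypothetical helper name, and the candidate list is a hand-picked subset of the `get_fuzzy_words('fire')` output.

```python
def syllable_count(word):
    """Rough vowel-group heuristic, same logic as the notebook's counter."""
    word = word.lower()
    if not word:
        return 0
    vowels = "aeiouy"
    count = 1 if word[0] in vowels else 0
    for index in range(1, len(word)):
        # A vowel that follows a non-vowel starts a new vowel group
        if word[index] in vowels and word[index - 1] not in vowels:
            count += 1
    # Treat a trailing 'e' after a consonant as silent
    if len(word) > 1 and word.endswith("e") and word[-2] not in vowels:
        count -= 1
    return max(count, 1)


def same_syllable_candidates(word, candidates):
    """Keep candidates whose estimated syllable count equals that of `word`."""
    target = syllable_count(word)
    return [c for c in candidates if syllable_count(c) == target]


# Hand-picked subset of the get_fuzzy_words('fire') output above
candidates = ['fair', 'faraway', 'fiery', 'fur', 'fairway', 'fry']
print(same_syllable_candidates('fire', candidates))
# → ['fair', 'fur', 'fry']
```

Because the heuristic itself is approximate, a filter like this inherits its mistakes; it is only meant to narrow down the candidate list, not to give exact syllable counts.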

---
Now, exploring the fuzziness of the sound keys in the dmeta_dict:

In [90]:
# Print every sound key together with the words that share it
for key, value in dmeta_dict.items():
    print(key, value)

('APL', '') ['able', 'appeal', 'eyeball', 'uppsala', 'apollo', 'abel', 'opel', 'abelia', 'abulia', 'opal']
('ANPL', '') ['unable', 'in_play', 'unhappily', 'enable', 'ennoble']
('APKSL', '') ['abaxial', 'abaxially']
('ATKSL', '') ['adaxial', 'adaxially']
('AKRSKPK', '') ['acroscopic']
('PSSKPK', '') ['basiscopic']
('APTSNT', '') ['abducent']
('ATSNT', '') ['adducent']
('NSNT', '') ['nascent', 'nocent', 'nescient', 'nissen_hut']
('AMRJNT', 'AMRKNT') ['emergent', 'emarginate']
('TSLNT', '') ['dissilient', 'desalinate']
('PRTRNT', '') ['parturient']
('TNK', '') ['dying', 'dinky', 'tonic', 'tannic', 'dunk', 'tying', 'dyeing', 'tango', 'tang', 'tinca', 'dingo', 'tunga', 'dinghy', 'tank', 'tanka', 'tongue', 'tunic', 'twang', 'tanakh', 'donkey', 'tonga', 'dinka', 'ding', 'ting', 'twinkie', 'dink', 'toyonaki', 'tanga', 'tanguy', 'dong', 'dengue', 'dung']
('MRPNT', '') ['moribund']
('LST', '') ['last', 'laced', 'lost', 'lowset', 'licit', 'least', 'lucid', 'lowest', 'lust', 'lawsuit', 'list', 'lu

('PTRKXNL', '') ['bidirectional']
('PFS', '') ['biface', 'pavis']
('TPLKS', '') ['duplex']
('ANTRKXNL', '') ['unidirectional', 'interactional']
('SMPLKS', '') ['simplex', 'symplocaceae', 'symplocus']
('ANFSL', 'ANFXL') ['unifacial', 'unofficial', 'unofficially']
('FTRT', '') ['featured', 'fettered', 'future_day', 'federate', 'futurity', 'feterita', 'foot_rot']
('FSJT', 'FSKT') ['visaged']
('FSLS', '') ['faceless', 'vesalius', 'phaseolus', 'physalis', 'fossilize', 'visualize']
('PPT', '') ['bibbed', 'popeyed', 'pipit', 'pipidae', 'bobwhite', 'biped', 'pipet', 'poppet', 'puppet', 'papeete', 'peabody', 'pea_pod', 'pupate', 'poop_out', 'pop_out', 'babbitt']
('PPLS', '') ['bibless', 'bibulous', 'populous', 'bubalus', 'peplos', 'peoples', 'populace', 'byblos', 'populus', 'payables']
('ANLTRL', '') ['unilateral', 'unilaterally']
('MLTLTRL', '') ['multilateral', 'multilaterally']
('PLTRL', '') ['bilateral', 'bilaterally', 'plate_rail']
('PPRTT', '') ['bipartite']
('JNT', 'ANT') ['joint', 'join

('AMPRTRPPL', '') ['imperturbable']
('KLKTT', '') ['collected']
('ANFLRT', '') ['unflurried']
('TSKMPST', '') ['discomposed']
('APXT', '') ['abashed']
('P0RT', 'PTRT') ['bothered']
('TSKMPPLTT', '') ['discombobulated']
('FLSTRT', '') ['flustered']
('ANSTRNK', '') ['unstrung', 'unstring']
('KMPRHNSPL', '') ['comprehensible']
('APHNSPL', '') ['apprehensible']
('FTMPL', '') ['fathomable']
('ANKMPRHNSPL', '') ['incomprehensible']
('ANFTMPL', '') ['unfathomable']
('AMPNTRPL', '') ['impenetrable', 'imponderable']
('ANTSFRPL', '') ['indecipherable']
('KNKF', '') ['concave', 'kung_fu']
('ASTPLR', '') ['acetabular']
('PKNKF', '') ['biconcave']
('PRSFRM', '') ['bursiform']
('KPLK', '') ['cuplike', 'cubelike']
('KPLR', '') ['cupular', 'capillary', 'copular', 'cobbler', 'gobbler', 'quibbler', 'kepler']
('PLNKNKF', '') ['planoconcave']
('AMPLKT', '') ['umbilicate', 'implicate']
('KNFKS', '') ['convex']
('PKNFKS', '') ['biconvex']
('KPS', 'JPS') ['gibbous', 'gyps', 'gypsy', 'gibbs']
('PLNKNFKS', '')

('STRMLNT', '') ['streamlined']
('ANFSNT', 'ANFXNT') ['inefficient', 'envisioned']
('ANKNMKL', '') ['uneconomical']
('FRSFL', '') ['forceful', 'forcefully']
('TRSTK', '') ['drastic']
('FRM', '') ['firm', 'frame', 'foram', 'farm', 'form', 'forum', 'vroom', 'fermi', 'viremia']
('FRSPL', '') ['forcible', 'foreseeable', 'forcibly']
('AMPLNT', '') ['impellent', 'ambulant', 'implant']
('FRSLS', '') ['forceless', 'pharsalus', 'versailles']
('AMPX', 'FMPX') ['wimpish']
('ALSTK', '') ['elastic']
('PNS', '') ['bouncy', 'pounce', 'bonasa', 'peneus', 'pineus', 'bones', 'panacea', 'buoyancy', 'bounce', 'pons', 'penis', 'bonus', 'banzai', 'banns', 'ponce', 'pinaceae', 'pinus', 'paeoniaceae', 'pansy', 'bonsai']
('ALSTSST', '') ['elasticized']
('FKTL', '') ['fictile', 'vegetal', 'factual', 'factually', 'victual']
('FLKSPL', '') ['flexible', 'flexibly']
('RPR', '') ['rubbery', 'rubber', 'repair', 'robbery', 'riparia', 'rob_roy', 'raper', 'rapper', 'ripper', 'robber', 'roper', 'reappear', 'rebury']
('SP

('ANTSTRLST', '') ['industrialized', 'industrialist']
('PSTNTSTRL', '') ['postindustrial']
('NNNTSTRL', '') ['nonindustrial']
('TFLPNK', '') ['developing']
('ANNTSTRLST', '') ['unindustrialized']
('ANFKTS', '') ['infectious']
('KNTJS', 'KNTKS') ['contagious']
('KRPTNK', '') ['corrupting']
('NNNFKTS', '') ['noninfectious']
('NNKMNKPL', '') ['noncommunicable']
('ANFRNL', '') ['infernal', 'infernally']
('K0NN', 'KTNN') ['chthonian']
('HTN', '') ['hadean', 'hidden', 'haydn', 'houghton', 'houdini', 'hutton', 'hottonia', 'houttuynia', 'heighten']
('STJN', 'STKN') ['stygian']
('SPRNL', '') ['supernal']
('ANFRMTF', '') ['informative']
('ATFSR', '') ['advisory', 'adviser']
('AKSMPLFNK', '') ['exemplifying']
('RFLNK', '') ['revealing', 'refilling', 'raveling']
('ANNFRMTF', '') ['uninformative']
('NSLS', '') ['newsless', 'noseless', 'noiseless', 'nasalis', 'nucellus', 'nasalize']
('NSTK', '') ['gnostic', 'nostoc', 'nest_egg']
('AKNSTK', 'AKSTK') ['agnostic']
('ATFST', '') ['advised', 'out_of_sigh

('ANPSPL', '') ['unopposable', 'unpeaceable']
('KNFLKTNK', '') ['conflicting']
('APTMSTK', '') ['optimistic']
('SNKN', '') ['sanguine', 'senecan']
('PSMSTK', '') ['pessimistic']
('TMRLST', '') ['demoralized']
('PKL', '') ['buccal', 'beagle', 'boucle', 'buckle', 'bugle', 'piccolo', 'biweekly', 'pickle', 'beckley', 'baikal', 'buchloe', 'buckleya', 'picul', 'boggle']
('APRL', '') ['aboral', 'whippoorwill', 'apparel', 'april']
('AKTNL', '') ['actinal']
('APKTNL', '') ['abactinal']
('ARTRL', '') ['orderly', 'arterial', 'arteriole']
('TSRTRL', '') ['disorderly']
('MPX', '') ['mobbish']
('ARTRT', '') ['ordered']
('KNSKTF', '') ['consecutive']
('TSRTRT', '') ['disordered']
('ARKNST', '') ['organized', 'organist']
('M0TKL', 'MTTKL') ['methodical', 'methodically']
('TSRKNST', '') ['disorganized']
('XTK', '') ['chaotic', 'shtik', 'shiitake']
('SKMPLT', '') ['scrambled']
('ANM0TKL', 'ANMTTKL') ['unmethodical']
('KNFKRT', '') ['configured']
('KRPRT', '') ['corporate', 'carport', 'garboard', 'carbur

('PNKRPT', '') ['bankrupt']
('RMLS', '') ['rimless', 'romulus']
('HNTLS', '') ['handless']
('HNTLT', '') ['handled', 'handhold']
('HNTLLS', '') ['handleless']
('AKPST', '') ['equipoised']
('ALTLN', '') ['oldline']
('RKXNR', '') ['reactionary']
('RTX', '') ['rightish', 'radicchio', 'radish', 'red_ash']
('FRLFT', '') ['far_left']
('LFTX', '') ['leftish']
('LFTST', '') ['leftist']
('SNTRST', '') ['centrist', 'centriscidae']
('RTMST', '') ['rightmost']
('STRPRT', '') ['starboard', 'strawboard']
('LFTMST', '') ['leftmost']
('HRNT', '') ['horned', 'hirundo', 'hornet', 'hairnet', 'hour_hand']
('ANTLRT', '') ['antlered']
('PKRN', '') ['bicorn', 'bukharin', 'pig_iron']
('HRN', '') ['horny', 'herein', 'horn', 'heron', 'heroin', 'heroine', 'horne', 'horney', 'hernia']
('HRNLS', '') ['hornless']
('KNTMNPL', '') ['condemnable']
('MSKTT', '') ['misguided', 'muscadet']
('ANRTS', '') ['unrighteous']
('SNFL', '') ['sinful', 'synovial', 'sun_valley']
('FRL', '') ['frail', 'virile', 'feral', 'freewill', 

('ANSTRNT', '') ['unstrained']
('TRTRL', '') ['territorial', 'territorially', 'deer_trail']
('JRSTKXNL', 'ARSTKXNL') ['jurisdictional']
('RJNL', 'RKNL') ['regional', 'regionally']
('SKXNL', '') ['sectional']
('AKSTRTRTRL', '') ['extraterritorial']
('NNTRTRL', '') ['nonterritorial']
('0RMPLSTK', 'TRMPLSTK') ['thermoplastic']
('0RMSTNK', 'TRMSTNK') ['thermosetting']
('0KNT', 'TKNT') ['thickened']
('TFNS', '') ['diaphanous', 'defense', 'defiance', 'deafness']
('FLMNTS', '') ['filamentous']
('HPRFN', '') ['hyperfine']
('PPR0N', 'PPRTN') ['paper_thin']
('RPNLK', '') ['ribbonlike']
('SLS', 'XLS') ['sleazy', 'slice', 'sialis', 'sluice']
('KKLPL', '') ['coagulable']
('KKLT', '') ['coagulate', 'cuculidae', 'cuckold']
('KLTNS', 'JLTNS') ['gelatinous', 'gelatinize']
('SRP', '') ['syrupy', 'serape', 'sorb', 'syrup', 'serbia']
('0NKPL', 'TNKPL') ['thinkable']
('KJTPL', 'KKTPL') ['cogitable']
('KNSFPL', '') ['conceivable', 'conceivably']
('PRSMPL', '') ['presumable', 'presumably']
('AN0NKPL', 'ANTNK

('PLPPRT', '') ['palpebrate']
('PPLFRM', '') ['papilliform']
('PRTKMTK', '') ['paradigmatic']
('PRKSSML', '') ['paroxysmal']
('PXL', '') ['paschal', 'boyishly', 'patchily', 'patchouli', 'bushel']
('PSRN', '') ['passerine', 'passerina', 'bass_horn']
('NNPSRN', '') ['nonpasserine']
('PSKPNK', '') ['peacekeeping']
('PRKNL', '') ['perigonal']
('PR0LL', 'PRTLL') ['perithelial']
('PKTNL', '') ['pectineal']
('PMFKS', '') ['pemphigous', 'pemphigus']
('PTLT', '') ['petaloid', 'patellidae', 'butt_weld', 'boatload', 'baddeleyite', 'butylate']
('FKSTK', '') ['phagocytic']
('FLNJL', 'FLNKL') ['phalangeal']
('FNSN', 'FNXN') ['phoenician']
('FNKRMK', '') ['phonogramic']
('FNLJKL', 'FNLKKL') ['phonological']
('FTMXNKL', 'FTMKNKL') ['photomechanical']
('FTMTRK', '') ['photometric']
('FTSN0TK', 'FTSNTTK') ['photosynthetic']
('NNFTSN0TK', 'NNFTSNTTK') ['nonphotosynthetic']
('FRTK', '') ['phreatic', 'foredeck', 'fried_egg']
('FRNLJKL', 'FRNLKKL') ['phrenological']
('PKTKRFK', '') ['pictographic']
('PLJKLS

('NTRJNS', 'NTRKNS') ['nitrogenous', 'nitrogenase']
('NNTRNSLXNL', '') ['nontranslational']
('NRMN', '') ['norman']
('ALMPK', '') ['olympic', 'alembic']
('ALMPN', '') ['olympian']
('SPJNKTF', '') ['subjunctive']
('AMPLKXNL', '') ['implicational']
('ARN0LJKL', 'ARNTLKKL') ['ornithological']
('AR0PTK', 'ARTPTK') ['orthopedic', 'orthoptic']
('AKSNN', '') ['oxonian']
('PKSTN', '') ['pakistani', 'pakistan', 'paxton']
('PPN', '') ['papuan', 'baboon', 'bioweapon', 'bobbin', 'pippin', 'pepin', 'papain', 'pipe_in', 'pop_in']
('PRNTRL', '') ['parenteral', 'parenterally']
('PR0N', 'PRTN') ['parthian', 'burthen', 'parathion']
('PRTSPL', '') ['participial', 'participle']
('PTRNMK', '') ['patronymic']
('PKTK', '') ['pectic']
('SKTL', '') ['scrotal', 'scuttle', 'schedule', 'skittle']
('PNNSLR', '') ['peninsular']
('PNTFLNT', '') ['pentavalent']
('PNTKSTL', '') ['pentecostal']
('FRMSTKL', '') ['pharmaceutical']
('FLSTN', '') ['philistine']
('FSFRS', '') ['phosphorous', 'vice_versa', 'phosphorus', 'pho

('F0LSL', 'FTLSL') ['faithlessly']
('FLSL', '') ['falsely', 'fuel_cell']
('FMLRL', '') ['familiarly']
('FMSL', '') ['famously']
('FSTTSL', '') ['fastidiously']
('FLTLSL', '') ['faultlessly']
('FRSML', '') ['fearsomely']
('FLNKL', '') ['feelingly']
('FLSTSL', '') ['felicitously']
('ANFLSTSL', '') ['infelicitously']
('FF0L', 'FFTL') ['fifthly', 'fifth_wheel']
('FKRTFL', '') ['figuratively']
('FRSTKLS', '') ['first_class']
('FKSTL', '') ['fixedly', 'foxtail', 'fixed_oil']
('FLKRNTL', '') ['flagrantly']
('FLMPNTL', '') ['flamboyantly']
('FLMSL', '') ['flimsily', 'flame_cell']
('FLPNTL', '') ['flippantly']
('FRPTNKL', '') ['forbiddingly']
('FRJFNKL', 'FRKFNKL') ['forgivingly']
('ANFRJFNKL', 'ANFRKFNKL') ['unforgivingly']
('FRLRNL', '') ['forlornly']
('FRMLSL', '') ['formlessly']
('FRFLT', '') ['fourfold', 'frivolity']
('MLNFLT', '') ['millionfold']
('FR0L', 'FRTL') ['fourthly', 'frothily']
('FRKTSL', '') ['fractiously']
('FRTLNTL', '') ['fraudulently']
('FRNSTL', '') ['frenziedly']
('FRKL',

('TSLXN', '') ['dissolution', 'tessellation']
('SPLTSFL', '') ['splitsville']
('AFR0R', 'AFRTRF') ['overthrow']
('SPFRSN', 'SPFRXN') ['subversion']
('ATJRNMNT', '') ['adjournment']
('TSMSL', '') ['dismissal']
('KNJ', 'KNK') ['conge', 'congee', 'coinage']
('RMFL', '') ['removal', 'roomful']
('PRJ', 'PRK') ['purge', 'barrage', 'porgy', 'barge', 'pirogi', 'borage', 'peerage', 'perigee', 'bragi', 'borgia']
('TSSTR', '') ['disaster']
('LNKST', '') ['laying_waste', 'langside', 'long_suit', 'linguist']
('ANHLXN', '') ['annihilation']
('TSMXN', '') ['decimation', 'tsimshian']
('ATMSXN', '') ['atomization']
('PLFRSXN', '') ['pulverization']
('FPRSXN', '') ['vaporization']
('T0PL', 'TTPLF') ['deathblow']
('A0NS', 'ATNX') ['euthanasia']
('HMST', '') ['homicide']
('HNRKLNK', '') ['honor_killing']
('MNSLFTR', '') ['manslaughter']
('ASSNXN', '') ['assassination']
('KNTRKTKLNK', '') ['contract_killing']
('MRTST', '') ['mariticide', 'mordacity']
('FRTRST', '') ['fratricide']
('AKSRST', '') ['uxoricide

('ARTRXP', 'FRTRXP') ['wardership']
('AMNT', 'FMNT') ['womanhood']
('TRTML', '') ['treadmill']
('PSNSLF', '') ['business_life']
('ARPLNMXNKS', 'ARPLNMKNKS') ['airplane_mechanics']
('ATMXNKS', 'ATMKNKS') ['auto_mechanics']
('PSKTR', '') ['basketry', 'bass_guitar']
('PKPNTNK', '') ['bookbinding']
('PRKLNK', '') ['bricklaying', 'prickling']
('KPNTRK', '') ['cabinetwork']
('KRPNTR', '') ['carpentry', 'carpenter', 'carpenteria']
('TRSMKNK', '') ['dressmaking']
('ALKTRKLRK', '') ['electrical_work']
('ANTRRTKRXN', '') ['interior_decoration']
('FRNXNK', '') ['furnishing']
('LMRNK', '') ['lumbering']
('AKLSM', '') ['oculism']
('PPRMKNK', '') ['papermaking']
('PRFSN', '') ['profession']
('MT', 'MTR') ['metier']
('LRNTPRFSN', '') ['learned_profession']
('LTRTR', '') ['literature']
('ARKTKTR', '') ['architecture']
('JRNLSM', 'ARNLSM') ['journalism']
('NSPPRNK', '') ['newspapering']
('PLTKS', '') ['politics', 'poll_tax']
('MTSN', '') ['medicine', 'mute_swan', 'metazoan', 'madison']
('PRFNTFMTSN', '

('PLJKLTFNS', 'PLKKLTFNS') ['biological_defense']
('KMKLTFNS', '') ['chemical_defense']
('RPLN', '') ['rebellion']
('SFLR', '') ['civil_war', 'safflower']
('RFLXN', '') ['revolution', 'reaffiliation', 'reevaluation', 'revelation', 'reflation']
('KNTRFLXN', '') ['counterrevolution']
('ANSRJNS', 'ANSRKNS') ['insurgency']
('ANTFT', '') ['intifada']
('PSFKXN', '') ['pacification']
('PSNTSRFLT', '') ["peasant's_revolt"]
('AKRSN', '') ['aggression']
('HSTLTS', '') ['hostilities']
('TRNXRFR', 'TRNKRFR') ['trench_warfare']
('MTKRNTR', '') ['meat_grinder']
('TMSTKFLNS', '') ['domestic_violence']
('PNTTR', '') ['banditry', 'pontederia']
('KMKLRFR', '') ['chemical_warfare']
('PLJKLRFR', 'PLKKLRFR') ['biological_warfare']
('PLJKLRFRTFNS', 'PLKKLRFRTFNS') ['biological_warfare_defense']
('FRSTKRST', '') ['first_crusade']
('SKNTKRST', '') ['second_crusade']
('0RTKRST', 'TRTKRST') ['third_crusade']
('FR0KRST', 'FRTKRST') ['fourth_crusade']
('FF0KRST', 'FFTKRST') ['fifth_crusade']
('SKS0KRST', 'SKSTKRS

('RNKRTPKTR', '') ['ring_rot_bacteria']
('STMNT', '') ['pseudomonad', 'sediment']
('SNTMNS', '') ['xanthomonas']
('SNTMNT', '') ['xanthomonad', 'sentiment']
('A0RTS', 'ATRTS') ['athiorhodaceae']
('NTRPKTRS', '') ['nitrobacteriaceae']
('NTRPKTR', '') ['nitrobacter']
('NTRKPKTR', '') ['nitric_bacteria']
('NTRSMNS', '') ['nitrosomonas']
('NTRSPKTR', '') ['nitrosobacteria']
('0PKTRS', 'TPKTRS') ['thiobacteriaceae']
('JNS0PSLS', 'KNSTPSLS') ['genus_thiobacillus']
('0PSLS', 'TPSLS') ['thiobacillus']
('0PKTR', 'TPKTR') ['thiobacteria']
('JNSSPRLM', 'KNSSPRLM') ['genus_spirillum']
('RTPTFFRPKTRM', '') ['ratbite_fever_bacterium']
('JNSFPR', 'KNSFPR') ['genus_vibrio']
('FPR', '') ['vibrio', 'viper', 'vipera', 'fiber', 'vapor']
('KMPSLS', '') ['comma_bacillus']
('FPRFTS', '') ['vibrio_fetus']
('PKTRTS', '') ['bacteroidaceae', 'bacteroides']
('KLMTPKTRM', '') ['calymmatobacterium']
('KLMTPKTRMKRNLMTS', '') ['calymmatobacterium_granulomatis']
('FRNSSL', '') ['francisella']
('FRNSSLTLRNSS', '') ['fr

('STKSR', '') ['stegosaur', 'psittacosaur']
('JNSNKLSRS', 'KNSNKLSRS') ['genus_ankylosaurus']
('MRJNSFL', 'MRKNSFL') ['marginocephalia']
('SPRTRPKSFLSRS', '') ['suborder_pachycephalosaurus']
('PKSFLSR', '') ['pachycephalosaur']
('SRTPS', 'SRTPX') ['ceratopsia']
('SRTPSN', 'SRTPXN') ['ceratopsian']
('SRTPST', '') ['ceratopsidae']
('JNSPRTSRTPS', 'KNSPRTSRTPS') ['genus_protoceratops']
('PRTSRTPS', '') ['protoceratops']
('JNSTRSRTPS', 'KNSTRSRTPS') ['genus_triceratops']
('TRSRTPS', '') ['triceratops']
('JNSPSTKSRS', 'KNSPSTKSRS') ['genus_psittacosaurus']
('ARN0PT', 'ARNTPT') ['euronithopoda', 'ornithopod']
('HTRSRT', '') ['hadrosauridae']
('HTRSR', '') ['hadrosaur']
('JNSNTTTN', 'KNSNTTTN') ['genus_anatotitan']
('ANTTTN', '') ['anatotitan']
('JNSKR0SRS', 'KNSKRTSRS') ['genus_corythosaurus']
('KR0SR', 'KRTSR') ['corythosaur']
('JNSTMNTSRS', 'KNSTMNTSRS') ['genus_edmontosaurus']
('ATMNTSRS', '') ['edmontosaurus']
('JNSTRKTN', 'KNSTRKTN') ['genus_trachodon']
('TRKTN', '') ['trachodon', 'tree

('FRKT', '') ['fregata', 'frigate', 'fur_coat', 'fire_code', 'farragut', 'variegate', 'forget', 'freak_out']
('FRKTPRT', '') ['frigate_bird']
('FLKRKRST', '') ['phalacrocoracidae']
('FLKRKRKS', '') ['phalacrocorax']
('KRMRNT', '') ['cormorant']
('JNSNNK', 'KNSNNK') ['genus_anhinga']
('SNKPRT', 'XNKPRT') ['snakebird']
('ATRTRK', 'FTRTRK') ['water_turkey']
('F0NTT', 'FTNTT') ['phaethontidae']
('F0N', 'FTN') ['phaethon']
('TRPKPRT', '') ['tropic_bird']
('SFNSFRMS', '') ['sphenisciformes']
('SFNST', '') ['spheniscidae']
('SFNSFRMSPRT', '') ['sphenisciform_seabird']
('PNKN', '') ['penguin', 'pannikin', 'pinecone', 'pinckneya']
('PKSLS', '') ['pygoscelis']
('APTNTTS', '') ['aptenodytes']
('KNKPNKN', '') ['king_penguin']
('AMPRRPNKN', '') ['emperor_penguin']
('SFNSKS', '') ['spheniscus']
('JKSPNKN', 'AKSPNKN') ['jackass_penguin']
('RKPR', '') ['rock_hopper', 'rock_opera']
('PRSLRFRMS', '') ['procellariiformes']
('PLJKPRT', 'PLKKPRT') ['pelagic_bird']
('PRSLRFRMSPRT', '') ['procellariiform_sea

('ALXPN', 'FLXPN') ['welsh_pony']
('AKSMR', '') ['exmoor', 'aqueous_humor']
('RSHRS', '') ['racehorse']
('SRPRTN', '') ['sir_barton', 'sauerbraten']
('KLNTFKS', '') ['gallant_fox']
('AMH', '') ['omaha']
('ARTMRL', 'FRTMRL') ['war_admiral']
('KNTFLT', '') ['count_fleet', 'canada_violet']
('SKRTRT', '') ['secretariat']
('STLSL', 'STLSLF') ['seattle_slew']
('AFRMT', '') ['affirmed']
('STPLXSR', 'STPLKSR') ['steeplechaser']
('FNXR', '') ['finisher', 'vanisher']
('TRKRS', '') ['dark_horse', 'trachurus', 'tower_cress', 'deer_grass']
('NNSTRTR', '') ['nonstarter']
('HRNSRS', '') ['harness_horse', 'harness_race']
('HKN', '') ['hackney', 'hogan', 'hokan', 'haganah']
('ARKRS', 'FRKRS') ['workhorse']
('TRFTRS', '') ['draft_horse']
('KR0RS', 'KRTRS') ['carthorse', 'carothers']
('KLTSTL', '') ['clydesdale']
('PRXRN', 'PRKRN') ['percheron']
('FRMRS', '') ['farm_horse']
('ALRS', '') ['wheel_horse', 'ailurus', 'yellow_race', 'yellow_iris']
('KXRS', 'KKRS') ['coach_horse']
('TRTNKRS', '') ['trotting_ho

('AR', 'ARF') ['arrow', 'yarrow']
('ARSNL', '') ['arsenal']
('ARTMSNTFSS', 'ARTMXNTFSS') ['artemision_at_ephesus']
('ARTRLRT', '') ['arterial_road']
('ARTRKRM', '') ['arteriogram']
('ARTSNL', '') ['artesian_well']
('AR0RKRM', 'ARTRKRM') ['arthrogram']
('ARTKLFKMRS', '') ['article_of_commerce']
('ARTKLTTLTR', '') ['articulated_ladder']
('ARTFSLFLR', 'ARTFXLFLR') ['artificial_flower']
('ARTFSLRT', 'ARTFXLRT') ['artificial_heart']
('ARTFSLRSN', 'ARTFXLRSN') ['artificial_horizon']
('ARTFSLJNT', 'ARTFXLJNT') ['artificial_joint']
('ARTFSLKTN', 'ARTFXLKTN') ['artificial_kidney']
('ARTFSLSKN', 'ARTFXLSKN') ['artificial_skin']
('ARTLR', '') ['artillery', 'eurodollar']
('ARTLRXL', '') ['artillery_shell']
('ARTSTSLFT', '') ["artist's_loft"]
('ARTSTSRKRM', '') ["artist's_workroom"]
('ARTSKL', '') ['art_school']
('ASKT', '') ['ascot', 'eye_socket', 'ask_out']
('AXKN', '') ['ashcan', 'ash_can']
('AXLR', '') ['ashlar']
('AXTR', '') ['ashtray', 'ishtar']
('ASPRJNS', 'ASPRKNS') ['asparaginase']
('ASPRK

('TTKNFRTR', '') ['data_converter']
('TTNPTTFS', '') ['data_input_device']
('TTMLTPLKSR', '') ['data_multiplexer']
('TTSSTM', '') ['data_system']
('TFNPRT', '') ['davenport']
('TFSKP', '') ['davis_cup']
('TPK', '') ['daybook', 'tab_key', 'tea_bag', 'tobacco', 'tupik', 'topic', 'tapioca', 'tepic', 'tobago', 'tabuk', 'dubuque', 'topeka', 'dybbuk', 'de_bakey', 'dieback', 'debug', 'die_back']
('TKMP', '') ['day_camp', 'decamp']
('TNRSR', '') ['day_nursery']
('TSKL', '') ['day_school']
('TTKSL', '') ['dead_axle']
('T0PT', 'TTPT') ['deathbed']
('T0KMP', 'TTKMP') ['death_camp']
('T0S', 'TTS') ['death_house', 'tethys']
('T0KNL', 'TTKNL') ['death_knell']
('T0MSK', 'TTMSK') ['death_mask']
('T0ST', 'TTST') ['death_seat', "death's_head"]
('T0TRP', 'TTTRP') ['deathtrap']
('TKXR', 'TKKR') ['deck_chair']
('TKLJ', '') ['deckle_edge']
('TKLNMTR', '') ['declinometer']
('TKLTJ', 'TKLTK') ['decolletage']
('TKNJSTNT', 'TKNKSTNT') ['decongestant']
('TTKTTFLSRFR', '') ['dedicated_file_server']
('TRSTLKR', ''

('HRKLS', '') ['hourglass', 'hercules']
('HSLTS', '') ['houselights']
('HSFKRTS', '') ['house_of_cards']
('HSFKRKXN', '') ['house_of_correction']
('HSPNT', '') ['house_paint', 'husband']
('HSTP', '') ['housetop']
('HSNK', '') ['housing']
('HFRKRFT', '') ['hovercraft']
('HRX', 'HRK') ['huarache']
('HPKP', '') ['hubcap']
('HLK', '') ['hulk', 'halakah', 'holy_week']
('HMRPRJ', '') ['humber_bridge']
('HMRLFL', '') ['humeral_veil']
('HMNKTP', '') ['humming_top']
('HMF', '') ['humvee', 'humify']
('HNTNKKNF', '') ['hunting_knife']
('HRKNTK', '') ['hurricane_deck']
('HRKNLMP', '') ['hurricane_lamp']
('HTMNT', '') ['hutment']
('HTNTN', '') ['hydantoin']
('HTRLSN', '') ['hydralazine']
('HTRNT', '') ['hydrant']
('HTRLKPRK', '') ['hydraulic_brake']
('HTRLKPRS', '') ['hydraulic_press']
('HTRLKPMP', '') ['hydraulic_pump']
('HTRLKSSTM', '') ['hydraulic_system']
('HTRLKTRNSMSN', '') ['hydraulic_transmission']
('HTRKLR0ST', 'HTRKLRTST') ['hydrochlorothiazide']
('HTRLKTRKTRPN', '') ['hydroelectric_turbi

('APNST', '') ['open_sight']
('APNF', '') ['open_weave']
('APNRK', '') ['openwork']
('APRKLK', '') ['opera_cloak']
('APRTNKMKRSKP', '') ['operating_microscope']
('APRTNKRM', '') ['operating_room']
('APRTNKTPL', '') ['operating_table']
('APMTN', '') ['opium_den']
('APTKLPNX', 'APTKLPNK') ['optical_bench']
('APTKLTFS', '') ['optical_device']
('APTKLTSK', '') ['optical_disk']
('APTKLFPR', '') ['optical_fiber']
('APTKLNSTRMNT', '') ['optical_instrument']
('APTKLPRMTR', '') ['optical_pyrometer']
('APTKLTLSKP', '') ['optical_telescope']
('ARNJKRF', 'ARNKKRF') ['orange_grove']
('ARPP', '') ['orb_web']
('ARKSTR', '') ['orchestra', 'uricaciduria']
('ARKSTRPT', '') ['orchestra_pit']
('ARSRKT', '') ['or_circuit']
('ARTRPK', '') ['order_book']
('ARKNLFT', '') ['organ_loft']
('ARKNPP', '') ['organ_pipe']
('ARKNSTP', '') ['organ_stop']
('ARKNS', '') ['organza', 'arrogance', 'organize']
('ARFLM', '') ['oriflamme', 'eriophyllum']
('ARLPTK', '') ['orlop_deck']
('ARFNJ', 'ARFNK') ['orphanage', 'irvingia

('SFTSP', '') ['soft_soap']
('SFTRPKJ', 'SFTRPKK') ['software_package']
('SLPP', '') ['soil_pipe', 'syllabub']
('SLRR', '') ['solar_array', 'solar_year']
('SLRSL', '') ['solar_cell', 'solresol']
('SLRTX', '') ['solar_dish']
('SLRTR', '') ['solar_heater']
('SLRTLSKP', '') ['solar_telescope']
('SLR0RMLSSTM', 'SLRTRMLSSTM') ['solar_thermal_system']
('SLTRNKRN', '') ['soldering_iron']
('SMPRR', '') ['sombrero', 'simperer']
('SNKTP0FNTR', 'SNKTPTFNTR') ['sonic_depth_finder']
('S0NKSRP', 'STNKSRP') ['soothing_syrup']
('SRTR', '') ['sorter', 'sirdar', 'sartre']
('SNTP', 'SNTPF') ['sound_bow']
('SNTKMR', '') ['sound_camera']
('SNTFLM', '') ['sound_film']
('SNTNKPRT', '') ['sounding_board']
('SNTNKLT', '') ['sounding_lead']
('SNTNKRKT', '') ['sounding_rocket']
('SNTRKRTNK', '') ['sound_recording']
('SNTSPKTRKRF', '') ['sound_spectrograph']
('SNTTRK', '') ['sound_truck']
('SPPL', '') ['soup_bowl']
('SPLTL', '') ['soup_ladle']
('SPPLT', '') ['soup_plate']
('SPSPN', '') ['soupspoon']
('SRSFLMNXN',

KeyboardInterrupt: 

In [127]:
# Look up the phonetic key whose word list contains 'pray'
for key, value in dmeta_dict.items():
    if 'pray' in value:
        print(key, value)

('PR', '') ['bare', 'pure', 'beery', 'poor', 'bowery', 'pro', 'parry', 'praya', 'beroe', 'bear', 'prey', 'burro', 'boar', 'parr', 'bar', 'bore', 'bur', 'burr', 'power', 'pore', 'bura', 'bray', 'purr', 'puree', 'berry', 'pear', 'brie', 'beer', 'perry', 'pair', 'pyre', 'barrio', 'praia', 'bari', 'beira', 'peru', 'peoria', 'pierre', 'brae', 'para', 'peri', 'peer', 'buyer', 'pawer', 'payer', 'barrie', 'beria', 'berra', 'bohr', 'peary', 'brya', 'par', 'birr', 'barye', 'pyorrhea', 'pyuria', 'pare', 'beware', 'pray', 'pry', 'bury', 'pour']


In [214]:
def starts_with_same_ch(word1, word2):
    return word1[0] == word2[0]

def has_same_syllable(word1, word2):
    # Allow +/- 1 syllable, since syllable_count is not very accurate
    count1 = syllable_count(word1)
    count2 = syllable_count(word2)
    
    # If the syllable counts differ by 2 or more, the words are not similar sounding
    return abs(count1 - count2) < 2
    

def get_fuzzy_words(word):
    '''Take a word and return a list of words that sound similar to it.'''
    sound = phonetics.dmetaphone(word)
    final_list = []
    # .get avoids a KeyError when no word in dmeta_dict shares this phonetic key
    for similar_word in dmeta_dict.get(sound, []):
        if similar_word != word:
            # Assumption: similar-sounding words share the first character
            if starts_with_same_ch(word, similar_word) and has_same_syllable(word, similar_word):
                final_list.append(similar_word)
    return final_list
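
To make the filtering step easier to follow in isolation, here is a minimal self-contained sketch. It assumes a tiny hand-made `toy_dmeta_dict` (its key imitates the style of the keys printed above but is written by hand, not computed by `phonetics.dmetaphone`) and a naive `naive_syllable_count` that counts vowel groups. Both are hypothetical stand-ins for the notebook's `dmeta_dict` and `syllable_count`, so the result will not match the real output exactly.

```python
import re

# Hypothetical stand-in for the notebook's dmeta_dict: phonetic key -> words.
# The key is written by hand here, not computed by phonetics.dmetaphone.
toy_dmeta_dict = {
    ('FR', ''): ['fire', 'fair', 'far', 'free', 'fiery', 'fairway', 'pharaoh'],
}

def naive_syllable_count(word):
    """Rough syllable estimate: count groups of consecutive vowels (assumption)."""
    return max(1, len(re.findall(r'[aeiouy]+', word.lower())))

def fuzzy_words_from(word, key, table):
    """Return words stored under `key` that share the first letter of `word`
    and differ from it by at most one (estimated) syllable."""
    target = naive_syllable_count(word)
    return [
        candidate
        for candidate in table.get(key, [])
        if candidate != word
        and candidate[0] == word[0]
        and abs(naive_syllable_count(candidate) - target) < 2
    ]

result = fuzzy_words_from('fire', ('FR', ''), toy_dmeta_dict)
print(result)
```

The first-letter rule is what drops 'pharaoh' here, even though it was placed under the same toy key.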

In [215]:
get_fuzzy_words('fire')

['fair',
 'far',
 'free',
 'fore',
 'fiery',
 'fewer',
 'four',
 'faro',
 'feria',
 'foray',
 'ferry',
 'fur',
 'frau',
 'fury',
 'fear',
 'fare',
 'fairway',
 'fairy',
 'frey',
 'freya',
 'fry',
 'frye',
 'fir',
 'fray']

* A harp which sounds too good to be true is probably a lyre. (lie)
* Religious lions get down to their knees to prey. (pray)
* A big computerized dog needs a megabyte. (mega bite)
* Lions eat their prey fresh and roar. (raw)

In [129]:
print(simpleFilter('Religious lions get down to their knees to prey.'))

['Religious', 'lion', 'get', 'knee', 'prey']


In [139]:
wn.synsets('religious')

[Synset('religious.n.01'),
 Synset('religious.s.01'),
 Synset('religious.a.02'),
 Synset('religious.a.03'),
 Synset('religious.s.04')]

In [140]:
print(Word('religious').definitions)

['a member of a religious order who is bound by vows of poverty and chastity and obedience', 'concerned with sacred matters or religion or the church', 'having or showing belief in and reverence for a deity', 'of or relating to clergy bound by monastic vows', 'extremely scrupulous and conscientious']


In [152]:
for i in Word('dog').synsets:
    print(i)

Synset('dog.n.01')
Synset('frump.n.01')
Synset('dog.n.03')
Synset('cad.n.01')
Synset('frank.n.02')
Synset('pawl.n.01')
Synset('andiron.n.01')
Synset('chase.v.01')


In [169]:
def highest_similarity(word1, word2):
    '''A word can have several synsets (senses), and WordNet similarity must be
    computed between two specific synsets. This function checks every combination
    of synsets of the two words and returns the highest Wu-Palmer similarity score.
    '''
    highest = 0
    for ss1 in Word(word1).synsets:
        for ss2 in Word(word2).synsets:
            similarity = ss1.wup_similarity(ss2)
            # wup_similarity returns None when the two synsets have no connecting path
            if similarity is not None:
                highest = max(highest, similarity)

    return highest

In [170]:
print(highest_similarity('religious', 'pray'))

0


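We tried all the similarity functions listed at http://www.nltk.org/howto/wordnet.html, but every one of them gives a score of 0 for 'religious' and 'pray'. A likely reason is that these measures follow WordNet's hypernym hierarchies, which exist mainly for nouns and verbs and are separate for each part of speech, so the senses of 'religious' (noun or adjective) and 'pray' (verb) have no connecting path.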

In [180]:
pair_list = [('lyre', 'harp'), ('lyre', 'true'), ('true', 'lie'), ('true', 'lyre'), 
             ('lion', 'prey'), ('lion', 'pray'), ('religious', 'pray'), ('religious', 'prey'), 
             ('dog', 'bite'),
             ('fresh', 'raw'), ('lion', 'roar')]
for pair in pair_list:
    print(pair, highest_similarity(pair[0], pair[1]))
    
print()
for pair in pair_list:
    print(pair, model.similarity(pair[0], pair[1]))

('lyre', 'harp') 0.9565217391304348
('lyre', 'true') 0.10526315789473684
('true', 'lie') 0.2857142857142857
('true', 'lyre') 0.1111111111111111
('lion', 'prey') 0.75
('lion', 'pray') 0
('religious', 'pray') 0
('religious', 'prey') 0.7058823529411765
('dog', 'bite') 0.375
('fresh', 'raw') 0
('lion', 'roar') 0.16666666666666666

('lyre', 'harp') 0.5839912
('lyre', 'true') 0.055185165
('true', 'lie') 0.31717193
('true', 'lyre') 0.055185165
('lion', 'prey') 0.3409069
('lion', 'pray') 0.18141624
('religious', 'pray') 0.106095985
('religious', 'prey') 0.01113707
('dog', 'bite') 0.54168963
('fresh', 'raw') 0.6488273
('lion', 'roar') 0.3275783




In [177]:
from gensim.models import word2vec
import logging
 
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentences = word2vec.Text8Corpus('C:/Users/WaThone/Downloads/text8/text8')
 
# Train 200-dimensional vectors (gensim < 4.0 API; gensim 4+ renames `size` to `vector_size`)
model = word2vec.Word2Vec(sentences, size=200)

2019-01-30 13:02:07,731 : INFO : collecting all words and their counts
2019-01-30 13:02:07,738 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-01-30 13:02:13,750 : INFO : collected 253854 word types from a corpus of 17005207 raw words and 1701 sentences
2019-01-30 13:02:13,751 : INFO : Loading a fresh vocabulary
2019-01-30 13:02:13,977 : INFO : effective_min_count=5 retains 71290 unique words (28% of original 253854, drops 182564)
2019-01-30 13:02:13,978 : INFO : effective_min_count=5 leaves 16718844 word corpus (98% of original 17005207, drops 286363)
2019-01-30 13:02:14,225 : INFO : deleting the raw counts dictionary of 253854 items
2019-01-30 13:02:14,235 : INFO : sample=0.001 downsamples 38 most-common words
2019-01-30 13:02:14,236 : INFO : downsampling leaves estimated 12506280 word corpus (74.8% of prior 16718844)
2019-01-30 13:02:14,557 : INFO : estimated required memory for 71290 words and 200 dimensions: 149709000 bytes
2019-01-30 13:02:14,557 :

2019-01-30 13:03:10,474 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-01-30 13:03:10,475 : INFO : EPOCH - 3 : training on 17005207 raw words (12506617 effective words) took 17.7s, 705792 effective words/s
2019-01-30 13:03:11,495 : INFO : EPOCH 4 - PROGRESS: at 5.76% examples, 706303 words/s, in_qsize 6, out_qsize 0
2019-01-30 13:03:12,497 : INFO : EPOCH 4 - PROGRESS: at 10.99% examples, 676014 words/s, in_qsize 6, out_qsize 0
2019-01-30 13:03:13,500 : INFO : EPOCH 4 - PROGRESS: at 16.64% examples, 684978 words/s, in_qsize 5, out_qsize 0
2019-01-30 13:03:14,508 : INFO : EPOCH 4 - PROGRESS: at 22.52% examples, 696104 words/s, in_qsize 6, out_qsize 0
2019-01-30 13:03:15,511 : INFO : EPOCH 4 - PROGRESS: at 27.51% examples, 683033 words/s, in_qsize 6, out_qsize 0
2019-01-30 13:03:16,515 : INFO : EPOCH 4 - PROGRESS: at 33.27% examples, 690482 words/s, in_qsize 5, out_qsize 0
2019-01-30 13:03:17,522 : INFO : EPOCH 4 - PROGRESS: at 39.15% examples, 696712 words/s, in_

In [181]:
print(simpleFilter('Religious lions get down to their knees to prey.'))

['Religious', 'lion', 'get', 'knee', 'prey']


- First, find the pair of words with the highest similarity.
- Then, find all the similar-sounding words for each word in that pair, and run a similarity check between them and the rest of the words in the sentence.
- Hypothesis: if any of the resulting pairs shows a sudden jump in similarity compared with the original pair, a double meaning has been detected.
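
The three steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: `SCORES` is a lookup table of word2vec scores copied from the outputs in this notebook and stands in for `model.similarity`, `SOUND_ALIKES` is a hypothetical stand-in for `get_fuzzy_words`, and `min_gain` is an arbitrary threshold. It also assumes one particular reading of "jump": the sound-alike is compared against the word it replaces, using the same context word.

```python
from itertools import combinations

# Similarity scores copied from the notebook's word2vec outputs; this lookup
# table is a stand-in for model.similarity so the sketch runs without gensim.
SCORES = {
    frozenset(('religious', 'lion')): -0.03765443,
    frozenset(('religious', 'prey')): 0.01113707,
    frozenset(('lion', 'prey')): 0.3409069,
    frozenset(('religious', 'pray')): 0.106095985,
}

# Hypothetical sound-alike table standing in for get_fuzzy_words.
SOUND_ALIKES = {'prey': ['pray'], 'lion': []}

def similarity(a, b):
    return SCORES[frozenset((a, b))]

def detect_double_meaning(words, min_gain=0.05):
    # Step 1: find the most similar pair in the sentence.
    best_pair = max(combinations(words, 2), key=lambda p: similarity(*p))
    context = [w for w in words if w not in best_pair]
    hits = []
    # Step 2: swap each word of the pair for its sound-alikes.
    for original in best_pair:
        for fuzzy in SOUND_ALIKES.get(original, []):
            for other in context:
                gain = similarity(fuzzy, other) - similarity(original, other)
                # Step 3: flag a jump in similarity (min_gain is an arbitrary threshold).
                if gain > min_gain:
                    hits.append((original, fuzzy, other))
    return best_pair, hits

best_pair, hits = detect_double_meaning(['religious', 'lion', 'prey'])
print(best_pair, hits)
```

Only three of the filtered words are used so that every score in the table comes from an output shown in this notebook.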

In [217]:
filtered_words = simpleFilter('Religious lions get down to their knees to prey.')
filtered_words = [word.lower() for word in filtered_words]

most_similar_pair = (filtered_words[0], filtered_words[1]) # dummy pair
hsscore = model.similarity(most_similar_pair[0], most_similar_pair[1])

for index, word in enumerate(filtered_words):
    for other_word in filtered_words[index+1:]:
        sscore = model.similarity(word, other_word)
        
        if sscore > hsscore:
            most_similar_pair = (word, other_word)
            print(sscore, most_similar_pair)
            hsscore = sscore
            
print('most similar pair', most_similar_pair, 'and their score:', hsscore)

0.01113707 ('religious', 'prey')
0.31679302 ('lion', 'knee')
0.3409069 ('lion', 'prey')
0.3425663 ('get', 'prey')
0.4300648 ('knee', 'prey')
most similar pair ('knee', 'prey') and their score: 0.4300648



Oddly, the most similar pair is 'knee' and 'prey', but the intended pair, 'lion' and 'prey', is at least the third most similar.
Using the intended pair, we will now test the similar-sounding words of each of its members
against the rest of the sentence.
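
Because the loop above only prints a pair when it beats the running maximum, the full ranking is not directly visible. Here is a minimal sketch of ranking every pair instead. The helper name `rank_pairs` is made up for this example, and the small `scores` table (three values copied from the output above) stands in for `model.similarity`.

```python
from itertools import combinations

# Scores copied from the output above; this table stands in for model.similarity.
scores = {
    ('lion', 'knee'): 0.31679302,
    ('lion', 'prey'): 0.3409069,
    ('knee', 'prey'): 0.4300648,
}

def rank_pairs(words, similarity):
    """Return every unordered word pair with its score, most similar first."""
    scored = [(similarity(a, b), (a, b)) for a, b in combinations(words, 2)]
    return sorted(scored, reverse=True)

ranking = rank_pairs(['lion', 'knee', 'prey'], lambda a, b: scores[(a, b)])
for score, pair in ranking:
    print(score, pair)
```

With the real model, the lambda would presumably be replaced by `model.similarity` and the word list by `filtered_words`.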

In [216]:
get_fuzzy_words('lion')

['lean',
 'lone',
 'line',
 'loon',
 'loin',
 'lanai',
 'lane',
 'luwian',
 'lawn',
 'leon',
 'lyon',
 'lena',
 'llano',
 'luna',
 'laney',
 'lin',
 'lyonia',
 'linnaea',
 'liana',
 'loan',
 'lien',
 'leone',
 'lie_in']

In [222]:
most_similar_pair = ('lion', 'prey')

for word1 in most_similar_pair:
    fuzzy_words = get_fuzzy_words(word1)
    
    highest_fuzzy_pair = (fuzzy_words[0], filtered_words[0])
    hfscore = model.similarity(fuzzy_words[0], filtered_words[0])
    
    for fuzzy in fuzzy_words:
        for word in filtered_words:
            # Don't want a repeated similarity comparison with the word in original pair
            if word not in most_similar_pair:
                try:
                    fscore = model.similarity(fuzzy, word)
                except KeyError:  # fuzzy word is not in the model vocabulary
                    continue
                if fscore > hfscore:
                    highest_fuzzy_pair = (fuzzy, word)
                    hfscore = fscore
                    print(hfscore, highest_fuzzy_pair)
                
print(model.similarity('pray', 'religious'))

0.29408664 ('lean', 'get')
0.40375918 ('lean', 'knee')
0.40892372 ('lone', 'knee')
0.20383403 ('poor', 'religious')
0.20526001 ('pro', 'religious')
0.309024 ('parry', 'knee')
0.32370543 ('pore', 'knee')
0.3767869 ('purr', 'knee')
0.40190464 ('pair', 'knee')
0.55285835 ('pyre', 'knee')
0.106095985




In [229]:
# fscore = similarity score between the fuzzy (similar-sounding) word and a word in the sentence
# oscore = similarity score between the original word in the pair and the same word in the sentence
# The two scores are compared to see whether there is a jump in similarity
most_similar_pair = ('lion', 'prey')

for word1 in most_similar_pair:
    fuzzy_words = get_fuzzy_words(word1)
    
    highest_fuzzy_pair = (fuzzy_words[0], filtered_words[0])
    hfscore = model.similarity(fuzzy_words[0], filtered_words[0])
    hoscore = model.similarity(word1, filtered_words[0])
    hratio = hfscore / hoscore
    
    for fuzzy in fuzzy_words:
        for word in filtered_words:
            # Don't want a repeated similarity comparison with the word in original pair
            if word not in most_similar_pair:
                try:
                    fscore = model.similarity(fuzzy, word)
                    oscore = model.similarity(word1, word)
                    ratio = fscore / oscore
                except KeyError:  # fuzzy word is not in the model vocabulary
                    continue
                if ratio > hratio:
                    highest_fuzzy_pair = (fuzzy, word)
                    hfscore = fscore
                    hoscore = oscore
                    hratio = ratio
                    print(f'Highest ratio jump: {hratio}. Fuzzy pair: {highest_fuzzy_pair}. Original word: {word1}. Fuzzy score: {hfscore}. Original score:{hoscore}')
                    print()

fscore = model.similarity('pray', 'religious')
oscore = model.similarity('prey', 'religious')
ratio = fscore / oscore
print(ratio, ('pray', 'religious'), ('prey', 'religious'), fscore, oscore)

Highest ratio jump: -3.653233289718628. Fuzzy pair: ('lean', 'get'). Original word: lion. Fuzzy score: 0.29408663511276245. Original score:-0.08050037175416946

Highest ratio jump: 1.2745203971862793. Fuzzy pair: ('lean', 'knee'). Original word: lion. Fuzzy score: 0.4037591814994812. Original score:0.3167930245399475

Highest ratio jump: 5.501311302185059. Fuzzy pair: ('lone', 'religious'). Original word: lion. Fuzzy score: -0.20714876055717468. Original score:-0.03765443339943886

Highest ratio jump: 6.273898601531982. Fuzzy pair: ('lane', 'religious'). Original word: lion. Fuzzy score: -0.23624008893966675. Original score:-0.03765443339943886

Highest ratio jump: 9.78561019897461. Fuzzy pair: ('luna', 'religious'). Original word: lion. Fuzzy score: -0.368471622467041. Original score:-0.03765443339943886

Highest ratio jump: 18.302303314208984. Fuzzy pair: ('poor', 'religious'). Original word: prey. Fuzzy score: 0.20383402705192566. Original score:0.011137070134282112

Highest ratio j
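
One thing to watch in the output above: when both scores are negative, the ratio comes out large and positive even though the sound-alike is actually less similar than the original word (for example `('lone', 'religious')`). A possible alternative, offered only as a sketch and not something this notebook runs, is to compare the difference of the two scores instead of their ratio. The function name `similarity_gain` is made up here; the numbers are copied from the output above.

```python
def similarity_gain(fuzzy_score, original_score):
    """Difference-based jump: positive only if the sound-alike is more similar."""
    return fuzzy_score - original_score

# (fuzzy pair, fuzzy score, original score), values copied from the output above.
cases = [
    (('lone', 'religious'), -0.20714876, -0.03765443),
    (('poor', 'religious'), 0.20383403, 0.01113707),
    (('pray', 'religious'), 0.106095985, 0.01113707),
]

for pair, fscore, oscore in cases:
    ratio = fscore / oscore
    gain = similarity_gain(fscore, oscore)
    print(pair, 'ratio:', round(ratio, 2), 'gain:', round(gain, 3))
```

Under this measure `('lone', 'religious')` gets a negative gain, while `('poor', 'religious')` and the intended `('pray', 'religious')` both stay positive, so separating the intended pair from unrelated sound-alikes would still need a further criterion.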

