# Dappity Dap

### Characteristics of Puns
* Converging Meanings 
* Sound 
* Association

_Things to try:_
* split words by sound / parsing to increase accuracy of converging meanings hypothesis
    -  e.g. "The soundtrack for Blackfish was **orca**strated."
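
A minimal sketch of the idea above, assuming we already have a vocabulary of known words to look for. The function name `embedded_words`, the `min_len` parameter and the tiny vocabulary are hypothetical and only serve as illustration; in practice the vocabulary could be built from WordNet lemma names.

```python
def embedded_words(token, vocabulary, min_len=3):
    """Return vocabulary words hidden inside `token` (excluding the token itself)."""
    token = token.lower()
    found = []
    for start in range(len(token)):
        for end in range(start + min_len, len(token) + 1):
            piece = token[start:end]
            if piece != token and piece in vocabulary and piece not in found:
                found.append(piece)
    return found


# Hypothetical mini-vocabulary, for illustration only
vocabulary = {'orca', 'rated', 'black', 'fish'}
print(embedded_words('orcastrated', vocabulary))  # ['orca', 'rated']
```

The embedded words could then be passed through the same synonym lookup as ordinary tokens.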

### Target: Converging Meanings

We have observed that puns often make use of words that have very similar meanings. For example:

'He said I was **average** - but he was just being **mean**.'

where 'average' and 'mean' share a meaning (the arithmetic mean), while 'mean' also carries a second sense ('unkind').

___

In order to test this, we will do the following:

* Step 1: Use Synset to list synonyms of tokens
* Step 2: Find common words in Synsets within a sentence
* Step 3: Determine correlation between converging meanings & whether a sentence is a pun or not

---

Import/Download relevant packages:

In [36]:
from textblob import Word
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\WaThone\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


For this method, we will use NLTK's WordNet corpus to find the synsets of each token in a sentence.

As an example, let's test it out on the word **'plant'** first:

In [37]:
word = Word('plant')
for i in range(3):
    print('Use Case ', i)
    print(word.synsets[i])
    print(word.definitions[i])
    print(word.synsets[i].lemma_names())
    print(' ')

Use Case  0
Synset('plant.n.01')
buildings for carrying on industrial labor
['plant', 'works', 'industrial_plant']
 
Use Case  1
Synset('plant.n.02')
(botany) a living organism lacking the power of locomotion
['plant', 'flora', 'plant_life']
 
Use Case  2
Synset('plant.n.03')
an actor situated in the audience whose acting is rehearsed but seems spontaneous to the audience
['plant']
 


WordNet returns the different **use cases** (Synsets) of the word "plant", together with the **definition** and **synonyms** (lemma names) of each.

---
        
           
Let's first eyeball how relevant the lemmas of each significant word in a sentence are to determining whether the sentence is a pun.

**The example we will use is: "The past, the present and the future walked into a bar. It was tense."**



In [38]:
# First, importing relevant packages, etc

import codecs
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import PunktSentenceTokenizer,sent_tokenize, word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer, PorterStemmer
import re

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\WaThone\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\WaThone\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


We'll need to process the sentence, which includes lemmatizing, filtering out stop words, stripping punctuation and tokenizing the sentence.

In [39]:
def simpleFilter(sentence):
    
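    '''This function strips punctuation from the input sentence, tokenizes it,
    filters out stopwords, lemmatizes the remaining tokens and returns a list
    of filtered tokens. Note that the stopword check is case-sensitive, so a
    capitalised stopword such as 'The' is kept.'''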
    
    filtered_sent = []
    
    # Strip punctuation
    stripped = re.sub("[(.)',=!#@]", '', sentence)
        
    # filter out stopwords 
    stop_words = set(stopwords.words("english"))
    
    # Tokenize
    words = word_tokenize(stripped)
    
    # Lemmatize and Filter out Stopwords
    lemmatizer = WordNetLemmatizer()
    for w in words:
        if w not in stop_words:
            filtered_sent.append(lemmatizer.lemmatize(w))

    return filtered_sent
  
def printLemmas(word):
    
    '''This function prints out all synonyms of a given word.'''
    
    for ss in Word(word).synsets:
        print(ss.lemma_names())
        

# Print 

s = 'The past, the present and the future walked into a bar. It was tense.'

for word in simpleFilter(s):
    print("Filtered word: '" + word + "' and its lemmas:")
    printLemmas(word)
    print()


Filtered word: 'The' and its lemmas:

Filtered word: 'past' and its lemmas:
['past', 'past_times', 'yesteryear']
['past']
['past', 'past_tense']
['past']
['past', 'preceding', 'retiring']
['by', 'past']

Filtered word: 'present' and its lemmas:
['present', 'nowadays']
['present']
['present', 'present_tense']
['show', 'demo', 'exhibit', 'present', 'demonstrate']
['present', 'represent', 'lay_out']
['stage', 'present', 'represent']
['present', 'submit']
['present', 'pose']
['award', 'present']
['give', 'gift', 'present']
['deliver', 'present']
['introduce', 'present', 'acquaint']
['portray', 'present']
['confront', 'face', 'present']
['present']
['salute', 'present']
['present']
['present']

Filtered word: 'future' and its lemmas:
['future', 'hereafter', 'futurity', 'time_to_come']
['future', 'future_tense']
['future']
['future']
['future']
['future', 'next', 'succeeding']
['future']

Filtered word: 'walked' and its lemmas:
['walk']
['walk']
['walk']
['walk']
['walk']
['walk']
['walk']
[

---
## **Hypothesis 1: Converging Meaning Pun**

We observe that 'tense' appears in the lemma lists of 'past', 'present' and 'future' (as 'past_tense', 'present_tense' and 'future_tense'). Since we are exploring puns with converging meanings, **we hypothesise that we are more likely to find words with converging meanings in puns than in non-puns.**

---

To do this, we first produce a list of unique synonyms of a certain word, excluding the word itself.


Let's try this on the word "plant".

In [40]:
def create_lemmas(word):
    lemmas_list = []
    for ss in Word(word).synsets:
        lemmas_list.append(ss.lemma_names())
    return lemmas_list

def process_lemmas(lemmas_list, word):
    '''
    This function processes the lemma lists of all the synsets of a word
    and returns a list of unique associated words, excluding the word itself
    '''
    all_lemmas = []
    for each_list in lemmas_list:
        for lemma in each_list:
            if lemma != word and lemma not in all_lemmas:
                all_lemmas.append(lemma)
    return all_lemmas


print(process_lemmas(create_lemmas('plant'), 'plant'))

['works', 'industrial_plant', 'flora', 'plant_life', 'set', 'implant', 'engraft', 'embed', 'imbed', 'establish', 'found', 'constitute', 'institute']


Next, we have to find out if synonyms of any word in a sentence can be found in the rest of the sentence, and count the number of times this occurs.

In [71]:
def common_syn(s):
    
    '''
    This function takes in a sentence, processes and tokenizes it and
    prints each significant word and tests if its synonyms can be found
    in the rest of the sentence. It prints the pair and returns the
    number of pairs found.
    '''
    
    count = 0
    
    # Filter the sentence to remove filler words / stopwords
    filtered_words = simpleFilter(s)
    
    for index, word in enumerate(filtered_words):
        if word.isalpha():
            lemma_list_of_term = process_lemmas(create_lemmas(word),word)

            # test if any word in the rest of the sentence appears in the lemma list of current word
            for other_word in filtered_words[index+1:]:
                if other_word in ' '.join(lemma_list_of_term):
                    count += 1
                    print(word, other_word)
    return count
    
    
s = 'The past, the present and the future walked into a bar. It was tense.'
print('The number of synonym pairs in this sentence is',common_syn(s))

past tense
present tense
future tense
The number of synonym pairs in this sentence is 3
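
Note that `common_syn` checks `other_word in ' '.join(lemma_list_of_term)`, which is a substring test: 'tense' is found because it occurs inside the lemma 'past_tense', but any short string that happens to occur inside a lemma would match too. A possible stricter variant, sketched here as an assumption rather than as the notebook's method, splits multi-word lemmas on underscores and compares whole tokens. The helper names are hypothetical; the lemma list is copied from the output for 'past' above.

```python
def substring_match(other_word, lemmas):
    """The test used in common_syn: substring search in the joined lemma string."""
    return other_word in ' '.join(lemmas)


def token_match(other_word, lemmas):
    """Stricter test: compare against whole tokens of each (multi-word) lemma."""
    tokens = set()
    for lemma in lemmas:
        tokens.update(lemma.lower().split('_'))
    return other_word.lower() in tokens


# Lemmas of 'past' (excluding the word itself), as printed earlier
past_lemmas = ['past_times', 'yesteryear', 'past_tense', 'preceding', 'retiring', 'by']

for candidate in ['tense', 'ten']:
    print(candidate, substring_match(candidate, past_lemmas), token_match(candidate, past_lemmas))
```

Both tests accept 'tense', but only the substring test accepts 'ten'.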


In order to see if this method works, we will test it on our list of pre-tagged puns and non-puns. Going by the rows displayed below, puns are tagged '1' and non-puns are tagged '0' in the 'P/NP' column.

We import the list, apply our function common_syn to each sentence, and store the result in a new column, 'Syn Count'.

In [42]:
import pandas as pd
df = pd.read_csv('puns_final.csv', encoding='latin-1')
# df = df.drop('Unnamed: 0', axis=1)

df['Syn Count'] = df['Sentence'].apply(common_syn)
df.head()

tuna fish
I le
bigger le
bed le
I le
pirate high
pirate sea
make hit
I one
Going sound
bed sleep
After ate
said ate
turned around
broke leg
I one
got one
cannibal eat
got make
paid make
reversing back
I one
got one
got back
one I
go work
Cheap u
Thrills u
want u
post office
wear wear
look see
whistle whistle
mad hare
Old go
die go
back second
call phone
Cell phone
mean egg
laying egg
I number
people wash
little light
seems see
door door
take make
fly fly
like like
metal met
I 5
mean end
went low
wardrobe closet
one I
punch punch
went last
Do get
know get
broth stock
cat sick
cheese cheese
Buffalo Bison
Make one
call one
right -
duck put
Thieves steal
dentist tooth
theatrical performance
pun play
pun word
play word
average mean
In I
past tense
present tense
future tense
soda soda
running go
Better go
present tense
past tense
saw ad
happens come
Id I
Id I
know get
alarm clock
Have eat
ever time
tried time
clock time
take make
seasoned veteran
remember back
boomerang back
I atom
error err

Unnamed: 0,Sentence,P/NP,Syn Count
0,"You can tune a guitar, but you can't tuna fish...",1,1
1,Two peanuts were walking in a tough neighborho...,1,0
2,If I buy a bigger bed will I have more or less...,1,4
3,The earth's rotation really makes my day.,1,0
4,I told my friend she drew her eyebrows too hig...,1,0


To find out if this method is accurate, we use the correlation between whether the sentence is a pun or not and the Syn Count. 

In [43]:
corr = df.corr()
corr

Unnamed: 0,P/NP,Syn Count
P/NP,1.0,-0.24147
Syn Count,-0.24147,1.0
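
By default `df.corr()` reports Pearson correlation coefficients. As a reminder of what that number measures, here is a minimal sketch of the computation in plain Python. The `pearson` helper is hypothetical and the label/count values are made-up toy numbers, not taken from `puns_final.csv`.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy data (made up): a binary label and a count per sentence
labels = [0, 0, 1, 1]
counts = [1, 2, 3, 4]
print(round(pearson(labels, counts), 3))  # 0.894
```

A value near 0 means the count carries little linear information about the label; the sign shows the direction of the relationship.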


The correlation is weak (about -0.24), so Syn Count on its own does not tell us much about whether a sentence is a pun.

Perhaps we should try a different approach.

---

Other than the ability to find synonyms, WordNet can also find out a range of other details about a word.  

The functions below use WordNet to yield synonyms, hyponyms, antonyms, "similar to" words, and words that WordNet links to through its "also see" relation.

In [44]:
from nltk.corpus import wordnet as wn

# Each generator yields (lemma, synset name) pairs.
# `pos` optionally restricts the lookup to one part of speech.
def get_all_synsets(word, pos=None):
    for ss in wn.synsets(word, pos=pos):
        for lemma in ss.lemma_names():
            yield (lemma, ss.name())


def get_all_hyponyms(word, pos=None):
    for ss in wn.synsets(word, pos=pos):
        for hyp in ss.hyponyms():
            for lemma in hyp.lemma_names():
                yield (lemma, hyp.name())


def get_all_similar_tos(word, pos=None):
    for ss in wn.synsets(word, pos=pos):
        for sim in ss.similar_tos():
            for lemma in sim.lemma_names():
                yield (lemma, sim.name())


def get_all_antonyms(word, pos=None):
    for ss in wn.synsets(word, pos=pos):
        for sslema in ss.lemmas():
            for antlemma in sslema.antonyms():
                yield (antlemma.name(), antlemma.synset().name())


def get_all_also_sees(word, pos=None):
    for ss in wn.synsets(word, pos=pos):
        for also in ss.also_sees():
            for lemma in also.lemma_names():
                yield (lemma, also.name())


def get_all_synonyms(word, pos=None):
    for x in get_all_synsets(word, pos):
        yield (x[0], x[1], 'ss')
    for x in get_all_hyponyms(word, pos):
        yield (x[0], x[1], 'hyp')
    for x in get_all_similar_tos(word, pos):
        yield (x[0], x[1], 'sim')
    for x in get_all_antonyms(word, pos):
        yield (x[0], x[1], 'ant')
    for x in get_all_also_sees(word, pos):
        yield (x[0], x[1], 'also')
       

Let's use the words 'happy' and 'cutlery' to see what kind of details WordNet can figure out about a word.

In [45]:
print("The following are synonyms of 'happy':")
for x in get_all_synsets('happy'):
    print(x)
print()
print("The following are hyponyms (words that are more specific) of 'cutlery':")
for x in get_all_hyponyms('cutlery'):
    print(x)
print()
print("The following are similar to 'happy':")
for x in get_all_similar_tos('happy'):
    print(x)
print()
print("The following are antonyms (opposite) of 'happy':")
for x in get_all_antonyms('happy'):
    print(x)
print()
print("The following are words that should also be seen with 'happy':")
for x in get_all_also_sees('happy'):
    print(x)

The following are synonyms of 'happy':
('happy', 'happy.a.01')
('felicitous', 'felicitous.s.02')
('happy', 'felicitous.s.02')
('glad', 'glad.s.02')
('happy', 'glad.s.02')
('happy', 'happy.s.04')
('well-chosen', 'happy.s.04')

The following are hyponyms (words that are more specific) of 'cutlery':
('bolt_cutter', 'bolt_cutter.n.01')
('cigar_cutter', 'cigar_cutter.n.01')
('die', 'die.n.03')
('edge_tool', 'edge_tool.n.01')
('glass_cutter', 'glass_cutter.n.03')
('tile_cutter', 'tile_cutter.n.01')
('fork', 'fork.n.01')
('spoon', 'spoon.n.01')
('Spork', 'spork.n.01')
('table_knife', 'table_knife.n.01')

The following are similar to 'happy':
('blessed', 'blessed.s.06')
('blissful', 'blissful.s.01')
('bright', 'bright.s.09')
('golden', 'golden.s.02')
('halcyon', 'golden.s.02')
('prosperous', 'golden.s.02')
('laughing', 'laughing.s.01')
('riant', 'laughing.s.01')
('fortunate', 'fortunate.a.01')
('willing', 'willing.a.01')
('felicitous', 'felicitous.a.01')

The following are antonyms (opposite) 

In [46]:
for x in (get_all_synonyms('happy')):
    print(x)

('happy', 'happy.a.01', 'ss')
('felicitous', 'felicitous.s.02', 'ss')
('happy', 'felicitous.s.02', 'ss')
('glad', 'glad.s.02', 'ss')
('happy', 'glad.s.02', 'ss')
('happy', 'happy.s.04', 'ss')
('well-chosen', 'happy.s.04', 'ss')
('blessed', 'blessed.s.06', 'sim')
('blissful', 'blissful.s.01', 'sim')
('bright', 'bright.s.09', 'sim')
('golden', 'golden.s.02', 'sim')
('halcyon', 'golden.s.02', 'sim')
('prosperous', 'golden.s.02', 'sim')
('laughing', 'laughing.s.01', 'sim')
('riant', 'laughing.s.01', 'sim')
('fortunate', 'fortunate.a.01', 'sim')
('willing', 'willing.a.01', 'sim')
('felicitous', 'felicitous.a.01', 'sim')
('unhappy', 'unhappy.a.01', 'ant')
('cheerful', 'cheerful.a.01', 'also')
('contented', 'contented.a.01', 'also')
('content', 'contented.a.01', 'also')
('elated', 'elated.a.01', 'also')
('euphoric', 'euphoric.a.01', 'also')
('felicitous', 'felicitous.a.01', 'also')
('glad', 'glad.a.01', 'also')
('joyful', 'joyful.a.01', 'also')
('joyous', 'joyous.a.01', 'also')


Let's call all the categories above words that are **related** to the main word. 

Now, we want to do the same as we did for the synonym count and define some functions that will find the common related words - not just within the sentence, but also with the related words of the other words in the sentence. 

In [47]:
def related_list(word):
    lemma_list = []
    for x in get_all_synonyms(word):
        lemma_list.append(x)
    return list(set(lemma_list))

def common_related(s):
    filtered = simpleFilter(s)
    count = 0
    for index, word in enumerate(filtered):
        related = related_list(word)
        for r_set in related:
            if r_set[0] in filtered[index+1:]:
                count += 1
    return count


**Example:**

'What do you call a belt with a watch on it? A waist of time.'

In [48]:
s = 'What do you call a belt with a watch on it? A waist of time.'

filtered = simpleFilter(s)
count = 0
print('Sentence:',s)
print('-----' *10)
print()
for index, word in enumerate(filtered):
    related = related_list(word)
    for r_set in related:
        if r_set[0] in filtered[index+1:]:
            print("The word '" + word + "' in the sentence is related to '" + r_set[0] + "' as", r_set, "to mean '" + wordnet.synset(r_set[1]).definition() +"'")
            print()
            count += 1
print('-----' * 10)
print('Number of Related pairs:', count)


Sentence: What do you call a belt with a watch on it? A waist of time.
--------------------------------------------------

--------------------------------------------------
Number of Related pairs: 0


Now we want to apply this to the rest of our data.

In [49]:
df['Length'] = df['Sentence'].apply(len) #added this because it's mysteriously missing, but need to filter the length next time
df['Related Count'] = df['Sentence'].apply(common_related)
df['Rel Count / Len'] = df['Related Count'] / df['Length']
df.sample(5)

Unnamed: 0,Sentence,P/NP,Syn Count,Length,Related Count,Rel Count / Len
67,I heard Donald Trump is going to ban shredded ...,1,0,83,0,0.0
69,People say i look better without glasses but i...,1,1,65,1,0.015385
178,"The duck said to the bartender, 'put it on my ...",1,1,52,0,0.0
49,My friends say they don't like skeleton puns....,1,0,84,0,0.0
308,Ive learned that people will forget what you s...,0,0,133,20,0.150376


Here is a description of the values. 

In [50]:
import matplotlib.pyplot as plt
df.describe()

r = df['Related Count']
# matplotlib has no histfit (that is a MATLAB function); plot a plain histogram instead
plt.hist(r)
plt.show()

The code below finds the correlation between the different variables in the data frame. 

As can be seen, the correlation between whether a sentence is a pun or not and the number of related count pairs is debatable.

We also divide the related count by the length of the sentence (in characters), since a longer sentence is more likely to contain more related pairs.

In [None]:
corr = df.corr()
corr

We'll try to turn this correlation into an actionable "algorithm" to predict if a sentence is a pun or not. 

The following is another data set with 60 puns and 100 non-puns.

In [None]:
test_df = pd.read_csv('puns_test.csv')
test_df.sample(5)

Let's now code the "algorithm".

In [None]:
common_related(s)
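
The notebook does not spell the "algorithm" out. One simple possibility, offered here only as an assumption, is a threshold rule on a single score such as the related count: scan candidate thresholds and keep the one with the best accuracy on labelled data. The function name `best_threshold` and the toy scores and labels are hypothetical, not taken from `puns_test.csv`.

```python
def best_threshold(scores, labels):
    """Find (threshold, label_for_high_scores, accuracy) maximising accuracy.

    A sentence is given `label_for_high_scores` when its score is >= threshold,
    and the opposite label otherwise. Labels are expected to be 0 or 1.
    """
    best = (None, None, -1.0)
    for threshold in sorted(set(scores)):
        for high_label in (1, 0):
            preds = [high_label if s >= threshold else 1 - high_label for s in scores]
            accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
            if accuracy > best[2]:
                best = (threshold, high_label, accuracy)
    return best


# Toy data (made up): one score and one 0/1 label per sentence
scores = [0, 1, 0, 3, 4, 5]
labels = [1, 1, 1, 0, 0, 0]
print(best_threshold(scores, labels))  # (3, 0, 1.0)
```

A threshold tuned on one data set should be checked on a separate one, which is what the second data set could be used for.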

### Target: Similar Sounds

Other puns rely on homophones: words that sound alike but have different meanings. For example:

'The **pony** had a **raspy** voice. It was **hoarse**.'

where 'hoarse' means the same as 'raspy', but is also related to 'pony' as it sounds like 'horse'. 

___

In order to test this, we will do the following:

* Step 1: Use Synset to list synonyms of tokens
* Step 2: Find matching similar sounding words in Synsets within the sentence
* Step 3: 

Some inspirations:
* https://stackabuse.com/phonetic-similarity-of-words-a-vectorized-approach-in-python/
* https://pypi.org/project/phonetics/#usage
* https://pypi.org/project/jellyfish/
* https://github.com/mphilli/English-to-IPA

---


Import/Download relevant packages:

In [51]:
import phonetics
import jellyfish
import eng_to_ipa as ipa

Testing the libraries' different phonetic functions on two similar-sounding words, 'horse' and 'hoarse', and observing the result.

In [52]:
test_words = ['horse', 'hoarse']
print(test_words)
def print_phonetic_index(test_words):
    functions = (phonetics.soundex, phonetics.nysiis, phonetics.metaphone, phonetics.dmetaphone
                 , jellyfish.match_rating_codex, ipa.convert)
    for func in functions:
        print(f'{func.__name__}: ' , end='')
        for word in test_words:
            code = func(word)
            print(str(code) + ' ', end='')
        print()
print_phonetic_index(test_words)

['horse', 'hoarse']
soundex: h0620 h0620 
nysiis: HA HA 
metaphone: HRS HRS 
dmetaphone: ('HRS', '') ('HRS', '') 
match_rating_codex: HRS HRS 
convert: hɔrs hɔrs 


All of the packages gave identical phonetic codes for 'horse' and 'hoarse'. Let's try this with another set of words!

Pun examples:
* A harp which sounds too good to be true is probably a lyre. (lie)
* Religious lions get down to their knees to prey. (pray)
* A big computerized dog needs a megabyte. (mega bite)
* Lions eat their prey fresh and roar. (raw)

In [53]:
test_words_2 = ['lyre', 'lie', 'prey', 'pray', 'roar', 'raw']
print_phonetic_index(test_words_2)

soundex: l060 l000 p600 p600 r060 r000 
nysiis: LA LA PA PA RA RA 
metaphone: LR L PR PR RR R 
dmetaphone: ('LR', '') ('L', '') ('PR', '') ('PR', '') ('RR', '') ('R', 'RF') 
match_rating_codex: LYR L PRY PRY RR RW 
convert: laɪr laɪ pre pre rɔr rɑ 
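
To make the output above easier to read, the sketch below counts, for each encoder, which of the three word pairs received identical codes. The codes are copied from the printed output (dmetaphone is left out because it returns tuples); the helper name `matching_pairs` is hypothetical.

```python
# Word pairs in the same order as test_words_2
pairs = [('lyre', 'lie'), ('prey', 'pray'), ('roar', 'raw')]

# Codes copied from the printed output above
codes = {
    'soundex': ['l060', 'l000', 'p600', 'p600', 'r060', 'r000'],
    'nysiis': ['LA', 'LA', 'PA', 'PA', 'RA', 'RA'],
    'metaphone': ['LR', 'L', 'PR', 'PR', 'RR', 'R'],
    'match_rating_codex': ['LYR', 'L', 'PRY', 'PRY', 'RR', 'RW'],
    'ipa': ['laɪr', 'laɪ', 'pre', 'pre', 'rɔr', 'rɑ'],
}


def matching_pairs(code_list):
    """Return the word pairs whose two codes are identical."""
    return [pair for i, pair in enumerate(pairs)
            if code_list[2 * i] == code_list[2 * i + 1]]


for name, code_list in codes.items():
    print(name, matching_pairs(code_list))
```

Only nysiis gives identical codes for all three pairs; the other encoders match 'prey'/'pray' only, which suggests that exact code equality is too strict for near-homophones such as 'roar' and 'raw'.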


In [54]:
# def related_list(word):
#     lemma_list = []
#     for x in get_all_synonyms(word):
#         lemma_list.append(x)
#     return list(set(lemma_list))

# def common_related(s):
#     filtered = simpleFilter(s)
#     count = 0
#     for index, word in enumerate(filtered):
#         related = related_list(word)
#         for r_set in related:
#             if r_set[0] in filtered[index+1:]:
#                 count += 1
#     return count


In [55]:
# word and desired word pair
word_pairs = [('harp', 'lyre'), ('religious', 'pray'), ('computerized', 'megabyte'), ('fresh', 'raw')]
for pair in word_pairs:
    related_words = related_list(pair[0])
    print(f'Related words of "{pair[0]}": {related_words}')
    # related_words holds (lemma, synset name, relation) tuples, so compare against the lemmas only
    related_lemmas = {r[0] for r in related_words}
    print(f'Desired word "{pair[1]}" found: {pair[1] in related_lemmas}\n')

Related words of "harp": [('harp', 'harp.v.01', 'ss'), ('harmonica', 'harmonica.n.01', 'ss'), ('harp', 'harmonica.n.01', 'ss'), ('dwell', 'harp.v.01', 'ss'), ('lyre', 'lyre.n.01', 'hyp'), ('aeolian_harp', 'aeolian_harp.n.01', 'hyp'), ('harp', 'harp.n.01', 'ss'), ('harp', 'harp.n.02', 'ss'), ('mouth_harp', 'harmonica.n.01', 'ss'), ('wind_harp', 'aeolian_harp.n.01', 'hyp'), ('aeolian_lyre', 'aeolian_harp.n.01', 'hyp'), ('mouth_organ', 'harmonica.n.01', 'ss'), ('harp', 'harp.v.02', 'ss')]
Desired word "lyre" found: True

Related words of "religious": [('eremite', 'eremite.n.01', 'hyp'), ('devout', 'devout.s.01', 'sim'), ('scrupulous', 'scrupulous.a.01', 'sim'), ('superior', 'superior.n.02', 'hyp'), ('monk', 'monk.n.01', 'hyp'), ('votary', 'votary.n.01', 'hyp'), ('religious', 'religious.s.04', 'ss'), ('monastic', 'monk.n.01', 'hyp'), ('religious', 'religious.n.01', 'ss'), ('churchly', 'churchly.s.01', 'sim'), ('secular', 'secular.a.04', 'ant'), ('religious', 'religious.s.01', 'ss'), ('men

---
In most cases, the desired word is not in the related word list, so we need a way to further expand our dictionary of related words before comparing their Soundex and IPA codes.
After that, we'll calculate the degree of similarity between the two pronunciations.
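
As a minimal sketch of such a similarity measure, assuming edit distance on the IPA strings is an acceptable proxy for how alike two words sound, the code below implements Levenshtein distance in plain Python and normalises it to a score between 0 and 1. The `similarity` helper and its normalisation are assumptions made for illustration; the IPA strings are the ones printed earlier.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """1.0 for identical strings, approaching 0.0 as they diverge."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


# IPA transcriptions as printed by ipa.convert earlier in the notebook
for first, second in [('laɪr', 'laɪ'), ('pre', 'pre'), ('rɔr', 'rɑ')]:
    print(first, second, levenshtein(first, second), round(similarity(first, second), 2))
```

With this measure 'lyre'/'lie' differ by one edit and 'roar'/'raw' by two, so a cut-off on the score would be needed to decide what counts as "sounding alike".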


In [56]:
jellyfish.levenshtein_distance('jellyfish', 'smellyfish')


2

In [57]:
jellyfish.jaro_distance('jellyfish', 'smellyfish')

0.8962962962962964

In [58]:
sentence = 'How do mountains see? They peak'
words = simpleFilter(sentence)
related_words = {}

for word in words:
    related_words[word] = related_list(word)

print(related_words)

{'How': [], 'mountain': [('hatful', 'batch.n.02', 'ss'), ('mountain', 'mountain.n.01', 'ss'), ('alp', 'alp.n.01', 'hyp'), ('mount', 'mountain.n.01', 'ss'), ('sight', 'batch.n.02', 'ss'), ('muckle', 'batch.n.02', 'ss'), ('plenty', 'batch.n.02', 'ss'), ('mint', 'batch.n.02', 'ss'), ('inundation', 'flood.n.02', 'hyp'), ('flock', 'batch.n.02', 'ss'), ('volcano', 'volcano.n.02', 'hyp'), ('batch', 'batch.n.02', 'ss'), ('peck', 'batch.n.02', 'ss'), ('mountain', 'batch.n.02', 'ss'), ('seamount', 'seamount.n.01', 'hyp'), ('deluge', 'flood.n.02', 'hyp'), ('mess', 'batch.n.02', 'ss'), ('torrent', 'flood.n.02', 'hyp'), ('pile', 'batch.n.02', 'ss'), ('slew', 'batch.n.02', 'ss'), ('great_deal', 'batch.n.02', 'ss'), ('pot', 'batch.n.02', 'ss'), ('lot', 'batch.n.02', 'ss'), ('stack', 'batch.n.02', 'ss'), ('heap', 'batch.n.02', 'ss'), ('haymow', 'haymow.n.01', 'hyp'), ('deal', 'batch.n.02', 'ss'), ('tidy_sum', 'batch.n.02', 'ss'), ('wad', 'batch.n.02', 'ss'), ('raft', 'batch.n.02', 'ss'), ('quite_a_lit

Steps in plan:

* get the words in the dictionary (WordNet)
* find the soundex of each word
* make a dictionary with key = soundex, value = list of same-sounding words
* function to get the fuzzy (similar-sounding) words for a given word
* function to check the similarity of each (lemmatized) word in the sentence with a word (the original + its fuzzy words)
* function to return the difference in the spike of similarity between the original word and another word among the fuzzy words
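
The dictionary and lookup steps of the plan above can be sketched as follows. This is a toy version under stated assumptions: it uses a hand-written word-to-code mapping (IPA strings copied from the earlier output) instead of running an encoder over WordNet, and the function names `build_sound_index` and `sound_alikes` are hypothetical.

```python
from collections import defaultdict


def build_sound_index(word_to_code):
    """Map each phonetic code to the list of words that share it."""
    index = defaultdict(list)
    for word, code in word_to_code.items():
        if word not in index[code]:
            index[code].append(word)
    return index


def sound_alikes(word, word_to_code, index):
    """Return the other words that have the same phonetic code as `word`."""
    code = word_to_code.get(word)
    if code is None:
        return []
    return [other for other in index[code] if other != word]


# Codes copied from the ipa.convert output earlier in the notebook
word_to_code = {
    'horse': 'hɔrs', 'hoarse': 'hɔrs',
    'prey': 'pre', 'pray': 'pre',
    'lyre': 'laɪr', 'lie': 'laɪ',
}
index = build_sound_index(word_to_code)
print(sound_alikes('hoarse', word_to_code, index))  # ['horse']
print(sound_alikes('lyre', word_to_code, index))    # []
```

Exact-code lookup finds 'horse' for 'hoarse' but nothing for 'lyre', which is why the plan also includes a fuzzy similarity step.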

In [60]:
# Run through all the synsets in WordNet and build a dictionary that maps
# each soundex code to the list of words sharing that code.
# phonetics.soundex raises an error for some words; those are skipped.
soundex_dict = {}
count = 15000  # stop after this many words have been added
for ss in wn.all_synsets():
    word = ss.lemma_names()[0]
    try:
        sound = phonetics.soundex(word)
    except Exception:
        # Discard words that cannot be encoded and continue with the next one
        continue

    if sound not in soundex_dict:
        soundex_dict[sound] = [word]
        count -= 1
    elif word not in soundex_dict[sound]:
        soundex_dict[sound].append(word)
        print("repeat", sound, soundex_dict[sound])
        count -= 1
    if count == 0:
        break


repeat a20302 ['ascetic', 'acidic']
repeat a32023010 ['adjustive', 'adjective']
repeat a20302 ['ascetic', 'acidic', 'aquatic']
repeat a5305060305 ['antemeridian', 'ante_meridiem']
repeat p601605304 ['preprandial', 'prefrontal']
repeat b020 ['busy', 'back']
repeat c030 ['cut', 'cute']
repeat u5010203 ['unabused', 'unabashed']
repeat a20203010 ['acquisitive', 'associative']
repeat a2020140 ['accessible', 'associable']
repeat a10203 ['abused', 'affixed']
repeat a1053053 ['abundant', 'appendant']
repeat u5010203 ['unabused', 'unabashed', 'unaffixed']
repeat b0530140 ['bondable', 'bindable']
repeat b020 ['busy', 'back', 'big']
repeat a030202 ['audacious', 'autoecious']
repeat a1040140 ['appealable', 'available']
repeat u501040140 ['unappealable', 'unavailable']
repeat u5020503 ['unashamed', 'unawakened']
repeat u5060 ['unwary', 'unaware']
repeat a104052 ['appealing', 'appalling']
repeat r060 ['rare', 'rear']
repeat b0203 ['backed', 'beaked']
repeat b02402 ['backless', 'beakless']
repeat b60

repeat s06052 ['searing', 'soaring']
repeat d010203 ['diffused', 'debased']
repeat p020 ['pawky', 'peachy', 'peaky']
repeat s26020 ['scraggy', 'screaky']
repeat b020 ['busy', 'back', 'big', 'beige', 'buggy', 'bushy', 'base', 'bass']
repeat m0502 ['minus', 'mimic']
repeat s30603 ['stirred', 'storied']
repeat h05040202 ['homologous', 'homologic']
repeat h01402 ['hapless', 'hipless']
repeat d0102052 ['diffusing', 'debasing']
repeat s010 ['safe', 'shabby']
repeat d0204030 ['decollete', 'desolate']
repeat u160203 ['upraised', 'upright']
repeat s04053 ['silent', 'salient']
repeat b05303 ['banded', 'bounded', 'bended']
repeat c06502 ['corneous', 'cernuous']
repeat p6050 ['primo', 'prime', 'prone']
repeat s0303 ['shaded', 'suited', 'seated']
repeat b0306 ['better', 'bitter']
repeat f060 ['fair', 'far', 'faraway', 'fore', 'fiery']
repeat h0303 ['headed', 'heated']
repeat b402 ['black', 'bleak']
repeat h03402 ['heedless', 'headless', 'heatless']
repeat s01060 ['sapphire', 'severe', 'shivery']
re

repeat p030 ['pat', 'petty', 'pithy', 'potty', 'paid']
repeat c0203 ['cased', 'cooked', 'choked', 'cashed']
repeat u5103 ['unbowed', 'unfit', 'unfed', 'unpaid']
repeat b020 ['busy', 'back', 'big', 'beige', 'buggy', 'bushy', 'base', 'bass', 'baggy', 'boyish', 'buckshee']
repeat a2052 ['aging', 'aching']
repeat b03052 ['budding', 'biting']
repeat c0103 ['capped', 'chopped', 'chafed']
repeat r02052 ['reeking', 'raging', 'rising', 'racking']
repeat u5105303 ['unbanded', 'unfunded', 'unpainted']
repeat r0203 ['russet', 'right', 'rugged', 'raised', 'ragged', 'rigid', 'rigged', 'rescued', 'rouged']
repeat u560203 ['unrigged', 'unrouged']
repeat b020 ['busy', 'back', 'big', 'beige', 'buggy', 'bushy', 'base', 'bass', 'baggy', 'boyish', 'buckshee', 'bias']
repeat i502020140 ['inaccessible', 'inexcusable']
repeat b0203 ['backed', 'beaked', 'boxed', 'baked', 'boughed', 'biased']
repeat u501020140 ['unopposable', 'unnavigable']
repeat c0104602 ['chivalrous', 'chivalric']
repeat t050 ['tan', 'tawny'

repeat d040 ['dull', 'daily', 'dual']
repeat t060 ['throwaway', 'three']
repeat f060 ['fair', 'far', 'faraway', 'fore', 'fiery', 'fewer', 'four']
repeat s020 ['sage', 'such', 'sick', 'swishy', 'sexy', 'six']
repeat s0105 ['shaven', 'seven']
repeat t050 ['tan', 'tawny', 'tame', 'thin', 'tinny', 'then', 'ten']
repeat f0630 ['forte', 'forty']
repeat f0630 ['forte', 'forty', 'fourth']
repeat f0130 ['fifty', 'fifth']
repeat s0230 ['sixty', 'sixth']
repeat s010530 ['seventy', 'seventh']
repeat e02030 ['eighty', 'eighth']
repeat t0530 ['twenty', 'tenth']
repeat s0520140 ['sensible', 'singable', 'sinkable']
repeat u520520140 ['unchangeable', 'unsinkable']
repeat l050 ['loamy', 'lean', 'lone']
repeat u5060 ['unwary', 'unaware', 'unary']
repeat s0130140 ['subduable', 'septuple']
repeat s0203 ['sexed', 'sized']
repeat s020 ['sage', 'such', 'sick', 'swishy', 'sexy', 'six', 'size']
repeat s06303 ['sordid', 'sorted']
repeat u520203 ['uncooked', 'unsized']
repeat u5206303 ['unguarded', 'unsorted']
re

repeat w0403 ['walleyed', 'wheeled']
repeat b050 ['bonny', 'bony', 'bone', 'boon', 'bum', 'beamy']
repeat w0203 ['washed', 'whacked', 'wicked', 'waxed', 'wigged']
repeat t0103 ['taped', 'tapped', 'tipped', 'topped', 'tubed', 'toupeed']
repeat d0210203 ['despised', 'disposed']
repeat l030 ['late', 'lewd', 'loud', 'laid', 'loath']
repeat b03052 ['budding', 'biting', 'batwing']
repeat v04053 ['valiant', 'violent', 'volant']
repeat b0203 ['backed', 'beaked', 'boxed', 'baked', 'boughed', 'biased', 'booked', 'based', 'bugged']
repeat w020 ['weak', 'wise']
repeat u5020 ['uneasy', 'unique', 'unwise']
repeat w0303 ['widowed', 'wooded']
repeat b020 ['busy', 'back', 'big', 'beige', 'buggy', 'bushy', 'base', 'bass', 'baggy', 'boyish', 'buckshee', 'bias', 'boss', 'boggy', 'bosky']
repeat r020 ['rose', 'rocky', 'rush', 'rich', 'racy', 'rough', 'rash', 'rushy']
repeat u50303 ['unheaded', 'unheated', 'unmated', 'unaided', 'united', 'unwooded']
repeat b0205 ['buxom', 'beechen']
repeat b0620 ['barky', '

repeat p40604 ['plural', 'pleural']
repeat b605204 ['branchial', 'bronchial']
repeat e010502 ['euphonious', 'euphonic']
repeat r030 ['ready', 'red', 'ratty', 'reedy', 'radio']
repeat t030502 ['titanic', 'totemic']
repeat R0205 ['Rousseauan', 'Russian']
repeat S020 ['Sikh', 'Swiss']
repeat A205 ['Achaean', 'Asian']
repeat c0502 ['conic', 'comic']
repeat a10204 ['abaxial', 'apical', 'affixal', 'abyssal']
repeat g050304 ['genital', 'gonadal']
repeat a2050304 ['azimuthal', 'agonadal']
repeat c06504 ['charnel', 'corneal', 'carnal']
repeat M050205 ['Manichaean', 'Monacan']
repeat G040205 ['Gallican', 'Galwegian']
repeat p06050204 ['paranasal', 'perinasal']
repeat r030504 ['rational', 'retinal']
repeat r040202 ['religious', 'rheologic']
repeat t0603 ['torrid', 'tired', 'tiered', 'thyroid']
repeat b0302 ['beauteous', 'biotic']
repeat c010406 ['cupular', 'cavalier', 'copular']
repeat h010306502 ['hypodermic', 'hypothermic']
repeat m03030504 ['matutinal', 'mutational']
repeat f0604 ['feral', 'fe

repeat d023040 ['ductile', 'distally']
repeat h01040 ['happily', 'heavily']
repeat w0240 ['wiggly', 'weekly', 'weakly']
repeat a5140 ['ample', 'amply']
repeat f02040 ['facile', 'fissile', 'fugally', 'focally', 'fiscally', 'facially', 'fussily']
repeat g05340 ['gentle', 'gently']
repeat i51020140 ['impassable', 'impossible', 'impeccable', 'invisible', 'impeccably']
repeat b405340 ['blindly', 'blandly']
repeat g601040 ['gravelly', 'gravely']
repeat s30140 ['stable', 'staple', 'stiffly']
repeat i51065040 ['infernally', 'informally']
repeat i6030140 ['irritable', 'irritably']
repeat n02040 ['nasally', 'nicely']
repeat m0202 ['mesic', 'mucous', 'much_as']
repeat s05030 ['sinuate', 'someday']
repeat c062040 ['churchly', 'coarsely']
repeat t0102040 ['typically', 'thievishly']
repeat v020140 ['visible', 'visibly']
repeat c052010140 ['conceivable', 'conceivably']
repeat a40106 ['allover', 'all_over']
repeat t023040 ['tactile', 'textile', 'tactually']
repeat i5304020140 ['intelligible', 'intelli

repeat t010260102040 ['topographically', 'typographically']
repeat u5020530140 ['unaccountable', 'unaccountably']
repeat u5043060140 ['unalterable', 'unalterably']
repeat u5302060140 ['undesirable', 'undesirably']
repeat u520536040140 ['uncontrollable', 'uncontrollably']
repeat u53050140 ['undeniable', 'undeniably']
repeat u104 ['uveal', 'uphill']
repeat v020240 ['viciously', 'vacuously']
repeat v0405340 ['violently', 'valiantly']
repeat v010340 ['vividly', 'vapidly']
repeat v060140 ['variable', 'variably']
repeat v06040 ['virile', 'verily']
repeat v02060240 ['vigorously', 'vicariously']
repeat v045060140 ['vulnerable', 'vulnerably']
repeat w02040 ['wisely', 'wheezily']
repeat w03040 ['widely', 'wittily']
repeat w063040 ['worthwhile', 'worthily']
repeat i6016020140 ['irrepressible', 'irreproachably']
repeat b05040 ['biennially', 'biannually', 'bonnily']
repeat c05202040 ['concisely', 'conjugally']
repeat d052040 ['downscale', 'densely', 'dingily']
repeat m063040 ['mortally', 'martially

This produces a lot of repeated codes, but the words that share a code often don't even sound similar, e.g.:

l020 ['lousy', 'loose', 'leaky', 'lax', 'liege', 'lazy', 'lucky', 'lush', 'like', 'lossy', 'less', 'lacy']

r0203 ['russet', 'right', 'rugged', 'raised', 'ragged', 'rigid', 'rigged', 'rescued']
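A quick way to spot over-broad codes is to list the largest groups first. The sketch below is only an illustration: `largest_buckets` is a hypothetical helper name, and `sample` is copied by hand from the output above rather than built from WordNet.

```python
def largest_buckets(sound_dict, n=3):
    """Return the n (code, words) pairs with the most words, largest first."""
    return sorted(sound_dict.items(), key=lambda kv: len(kv[1]), reverse=True)[:n]

# Hand-copied sample of the groups printed above (not the full dictionary)
sample = {
    'l020': ['lousy', 'loose', 'leaky', 'lax', 'liege', 'lazy', 'lucky',
             'lush', 'like', 'lossy', 'less', 'lacy'],
    'r0203': ['russet', 'right', 'rugged', 'raised', 'ragged', 'rigid',
              'rigged', 'rescued'],
    'g05340': ['gentle', 'gently'],
}

for code, words in largest_buckets(sample, n=2):
    print(code, len(words), words)
```

Groups with many members are the ones most likely to mix words that do not actually sound alike, so they are worth checking by eye first.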

In [68]:
''' Run through the words in the dictionary (WordNet), up to `count` valid words,
and build a dictionary that maps each IPA transcription to its words.
Words without a valid IPA transcription (returned with a *) are counted
in invalid_sound_count and are not added to ipa_dict.
'''
ipa_dict = {}
count = 10000  # stop after this many valid words
invalid_sound_count = 0
for ss in wn.all_synsets():
    word = ss.lemma_names()[0]
    # Skip hyphenated words before converting them
    if '-' in word:
        continue
    sound = ipa.convert(word)
    # A '*' marks a word that has no IPA transcription
    if '*' in sound:
        invalid_sound_count += 1
        continue

    if sound not in ipa_dict:
        ipa_dict[sound] = [word]
    elif word not in ipa_dict[sound]:
        # Another word with the same transcription
        ipa_dict[sound].append(word)
        print("repeat", sound, ipa_dict[sound])
    count -= 1
    if count == 0:
        break

print(ipa_dict)
print(invalid_sound_count)

repeat ˈkɑki ['cocky', 'khaki']
repeat kəmˈplesənt ['complaisant', 'complacent']
repeat dən ['dun', 'done']
repeat ˈmɑdərn ['modern', 'Modern']
repeat nu ['new', 'New']
repeat oʊld ['Old', 'old']
repeat wən ['one', 'won']
repeat rɪˈfɔrmd ['Reformed', 'reformed']
repeat blɛst ['blessed', 'Blessed']
repeat dɪˈskrit ['discreet', 'discrete']
repeat fɔr ['fore', 'four']
repeat ˈjunjən ['Union', 'union']
repeat stret ['straight', 'strait']
repeat ˌæbərˈɪʤənəl ['aboriginal', 'Aboriginal']
repeat ˈæˌlaɪd ['allied', 'Allied']
repeat əˈpɑkrəfəl ['apocryphal', 'Apocryphal']
repeat ˈɔrəl ['oral', 'aural']
repeat boʊˈhimiən ['bohemian', 'Bohemian']
repeat ˌdɛməˈkrætɪk ['democratic', 'Democratic']
repeat ˌaɪˈɑnɪk ['ionic', 'Ionic']
repeat pləˈtɑnɪk ['platonic', 'Platonic']
repeat mərˈkjʊriəl ['mercurial', 'Mercurial']
repeat ˈkɔrəl ['coral', 'choral']
repeat ˌkɑntəˈnɛnəl ['continental', 'Continental']
repeat ˈlɪtərəl ['literal', 'littoral']
repeat məˈsɑnɪk ['Masonic', 'masonic']
repeat ˈkæθlɪk ['cat

IPA cannot be used because it does not have a transcription for every word in the dictionary, as indicated by the words returned with a trailing `*`. It also yields few useful repeats even with 10000 words as input: many of the groups above differ only in capitalization.
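To put a number on the coverage problem, the words can be split into those with and without a transcription. This is a minimal sketch: `transcription_coverage` and `toy_convert` are hypothetical names, and `toy_convert` is a tiny stand-in that only mimics the "append `*` when unknown" behaviour seen above; in the notebook the real converter would be passed in instead.

```python
def transcription_coverage(words, convert):
    """Split words into (valid, missing) based on the '*' marker."""
    valid, missing = [], []
    for word in words:
        if '*' in convert(word):
            missing.append(word)
        else:
            valid.append(word)
    return valid, missing

# Hypothetical stand-in converter; the four entries are taken from the output above
TOY_IPA = {'cocky': 'ˈkɑki', 'khaki': 'ˈkɑki', 'dun': 'dən', 'done': 'dən'}

def toy_convert(word):
    # Unknown words come back unchanged with a trailing '*'
    return TOY_IPA.get(word, word + '*')

# 'notaword' is a made-up token that is deliberately absent from TOY_IPA
valid, missing = transcription_coverage(['cocky', 'khaki', 'notaword', 'done'], toy_convert)
print(len(missing), 'of', len(valid) + len(missing), 'words have no transcription')
```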

In [73]:
''' Run through the words in the dictionary (WordNet) and
build a dictionary of Double Metaphone codes.
Any code containing * is printed
and is not added to dmeta_dict.
'''
dmeta_dict = {}
count = 15000
for ss in wn.all_synsets():
    word = ss.lemma_names()[0].lower()
    sound = phonetics.dmetaphone(word)
  
    if '-' in word:
#         print(word)
        continue
    if '*' in sound:
        print(sound)
        continue
    
#     print(word)
    if sound not in dmeta_dict:
        dmeta_dict[sound] = [word]
    else:
        if word not in dmeta_dict[sound]:
            dmeta_dict[sound].append(word)
            print("repeat", sound, dmeta_dict[sound])
    count -= 1
    if count == 0:
        break
        
# print(dmeta_dict)

repeat ('ASTK', '') ['ascetic', 'acidic']
repeat ('TT', '') ['tight', 'dead']
repeat ('ATL', '') ['ideal', 'idle']
repeat ('FLT', '') ['faulty', 'flat']
repeat ('KT', '') ['cut', 'quiet']
repeat ('ATKTT', '') ['addicted', 'adequate_to']
repeat ('ATRNT', '') ['adherent', 'adorned']
repeat ('ATSS', 'ATXS') ['edacious', 'audacious']
repeat ('AFRT', '') ['afraid', 'afeard']
repeat ('AKRFPK', '') ['acrophobic', 'agoraphobic']
repeat ('FRTNT', '') ['verdant', 'frightened']
repeat ('ANRFT', '') ['inwrought', 'unnerved']
repeat ('STRT', '') ['straight', 'stirred']
repeat ('ANST', '') ['unused', 'unsweet']
repeat ('ANPL', '') ['unable', 'in_play']
repeat ('AMTRPK', '') ['ametropic', 'emmetropic']
repeat ('FL', '') ['fly', 'full']
repeat ('AFRLNT', '') ['avirulent', 'overland']
repeat ('PRNTL', '') ['prenatal', 'perinatal']
repeat ('ANTMRTM', '') ['antemortem', 'ante_meridiem']
repeat ('PSTMRTM', '') ['postmortem', 'post_meridiem']
repeat ('PT', '') ['beady', 'pat']
repeat ('T', '') ['d.o.a.', '

repeat ('ATRFT', '') ['adrift', 'atrophied']
repeat ('PLNT', '') ['blond', 'blunt']
repeat ('PNT', '') ['bound', 'boned', 'buoyant', 'pent']
repeat ('KRTT', '') ['guarded', 'crowded']
repeat ('ANKRTT', '') ['unguarded', 'uncrowded']
repeat ('AFNT', '') ['offhand', 'affined']
repeat ('KMTS', '') ['commodious', 'comatose']
repeat ('ARTNT', '') ['ardent', 'ordained']
repeat ('PRSTL', '') ['prostyle', 'priestly']
repeat ('FST', '') ['fuzzed', 'faced', 'fast', 'fusty']
repeat ('KRNK', '') ['chronic', 'caring', 'crying']
repeat ('KNSTNT', '') ['coincident', 'constant']
repeat ('ANFLNK', '') ['unfeeling', 'unfailing']
repeat ('FKL', '') ['fugly', 'focal', 'fickle']
repeat ('KMPLSNT', '') ['complaisant', 'complacent']
repeat ('SMK', 'XMK') ['smoggy', 'smug']
repeat ('KNTNT', '') ['continued', 'contained']
repeat ('RN', '') ['roan', 'runaway']
repeat ('PX', '') ['pitchy', 'bitchy', 'bushy']
repeat ('PKT', '') ['backed', 'beaked', 'pocked', 'packed', 'baked']
repeat ('PLT', '') ['billed', 'bald'

repeat ('TT', '') ['tight', 'dead', 'tied', 'tweedy', 'dowdy']
repeat ('KLS', '') ['close', 'glossy', 'classy']
repeat ('PLSTRNK', '') ['blustering', 'blistering']
repeat ('FLT', '') ['faulty', 'flat', 'foliate', 'fluid', 'veiled', 'valid', 'fleet']
repeat ('HRNK', '') ['hearing', 'hurrying']
repeat ('RPT', '') ['rippled', 'rapid']
repeat ('LS', '') ['lousy', 'loose', 'lazy']
repeat ('PRST', '') ['braised', 'pierced', 'presto']
repeat ('LNT', '') ['lanate', 'lined', 'lento']
repeat ('LRK', '') ['lyric', 'largo']
repeat ('XP', '') ['choppy', 'cheap', 'chubby']
repeat ('KRPLNT', '') ['crapulent', 'corpulent']
repeat ('FTX', '') ['faddish', 'fattish']
repeat ('FLX', '') ['flashy', 'flush', 'fleshy']
repeat ('KRS', '') ['grassy', 'greasy', 'curious', 'crazy', 'gross']
repeat ('PRTL', '') ['brittle', 'brutal', 'portly']
repeat ('LNK', '') ['long', 'lank']
repeat ('RT', '') ['ready', 'red', 'ratty', 'right', 'reedy']
repeat ('SPR', '') ['sapphire', 'super', 'sober', 'spare']
repeat ('FT', ''

repeat ('PRNK', '') ['bearing', 'boring']
repeat ('ANSPT', '') ['unswept', 'insipid']
repeat ('ANTRMRL', '') ['intramural', 'intermural']
repeat ('PTL', '') ['bodily', 'beetle']
repeat ('JTNK', 'ATNK') ['jetting', 'jutting']
repeat ('ARPTF', '') ['eruptive', 'irruptive']
repeat ('PRSNK', '') ['pursuing', 'pressing', 'bruising', 'bracing']
repeat ('RNNK', '') ['running', 'renewing']
repeat ('ARNT', '') ['errant', 'earned', 'ironed']
repeat ('PRST', '') ['braised', 'pierced', 'presto', 'pressed']
repeat ('ANRNT', '') ['unearned', 'unironed']
repeat ('RFTRT', '') ['raftered', 'roughdried']
repeat ('ANPRST', '') ['unpierced', 'unpressed']
repeat ('ST', '') ['sooty', 'saute', 'suety', 'pseudo', 'sad']
repeat ('SRFL', '') ['servile', 'sorrowful']
repeat ('JS', 'AS') ['joyous', 'juicy']
repeat ('KT', '') ['cut', 'quiet', 'cute', 'good', 'keyed']
repeat ('KLS', '') ['close', 'glossy', 'classy', 'glassy', 'keyless']
repeat ('ANKNT', '') ['unawakened', 'unkind']
repeat ('PRFRPL', '') ['preferabl

repeat ('AKSSPL', '') ['accessible', 'excusable']
repeat ('PRNTL', '') ['prenatal', 'perinatal', 'parental']
repeat ('PST', '') ['best', 'pasty', 'poised', 'past', 'biased']
repeat ('PSNT', '') ['passant', 'basined', 'passionate']
repeat ('PLTNK', '') ['balding', 'platonic']
repeat ('AK', '') ['awake', 'icky', 'ago']
repeat ('0N', 'TN') ['thin', 'then']
repeat ('PRSNT', '') ['pursuant', 'prescient', 'present']
repeat ('AKSSTNK', '') ['exhausting', 'existing']
repeat ('ANSTNT', '') ['unstained', 'instant']
repeat ('FTRT', '') ['featured', 'fettered', 'future_day']
repeat ('PRN', '') ['brown', 'barren', 'prone', 'brainy', 'born']
repeat ('ANXT', '') ['unwashed', 'unhatched']
repeat ('PRNTT', '') ['brain_dead', 'branded', 'parented']
repeat ('ANPRNTT', '') ['unbranded', 'unparented']
repeat ('MTRNL', '') ['matronly', 'maternal']
repeat ('PTNT', '') ['buttoned', 'patent', 'patient']
repeat ('ANTRNK', '') ['underhung', 'enduring']
repeat ('ARNK', '') ['ironic', 'erring', 'irenic']
repeat ('

repeat ('KNT', '') ['gowned', 'canty', 'quaint', 'kind', 'canned', 'connate', 'cuneate']
repeat ('TLTT', '') ['diluted', 'delighted', 'deltoid']
repeat ('LRT', '') ['layered', 'lurid', 'lowered', 'laureate', 'leeward', 'lyrate']
repeat ('APFT', '') ['approved', 'obovate']
repeat ('AFT', '') ['afoot', 'avowed', 'avid', 'aft', 'ivied', 'ovate']
repeat ('PLTT', '') ['belted', 'blighted', 'belated', 'peltate']
repeat ('KMPNT', '') ['combined', 'compound']
repeat ('PLPT', '') ['blebbed', 'bilabiate', 'bilobate']
repeat ('PNT', '') ['bound', 'boned', 'buoyant', 'pent', 'bent', 'pennate', 'bandy', 'banned', 'binate']
repeat ('KLFT', '') ['gloved', 'qualified', 'cleft']
repeat ('LPT', '') ['lipped', 'labiate', 'lobed']
repeat ('PLMT', '') ['plumate', 'plumed', 'palmate']
repeat ('PRTT', '') ['bearded', 'parted']
repeat ('PNT', '') ['bound', 'boned', 'buoyant', 'pent', 'bent', 'pennate', 'bandy', 'banned', 'binate', 'pinnate']
repeat ('TRNT', '') ['trendy', 'drained', 'truant', 'trained', 'tern

repeat ('ANS', '') ['uneasy', 'in_use']
repeat ('ATLST', '') ['idealized', 'utilized']
repeat ('FTL', '') ['vital', 'fatal', 'futile']
repeat ('ATPN', '') ['eightpenny', 'utopian']
repeat ('PNTNK', '') ['pending', 'binding']
repeat ('RSNT', '') ['recent', 'reasoned']
repeat ('FLPL', '') ['fallible', 'voluble', 'valuable']
repeat ('ANFLPL', '') ['unavailable', 'infallible', 'inviolable', 'invaluable']
repeat ('AR0', 'FRT') ['worthy', 'worth']
repeat ('XF', '') ['chief', 'chaffy']
repeat ('MNK', '') ['manque', 'manky']
repeat ('FLLS', '') ['flawless', 'valueless']
repeat ('PRTN', '') ['baritone', 'preteen', 'beardown', 'protean']
repeat ('AMNFRS', '') ['omnivorous', 'omnifarious']
repeat ('FRNK', '') ['frank', 'varying']
repeat ('ANFRT', '') ['unafraid', 'unfurrowed', 'unvaried']
repeat ('ANFLT', '') ['invalid', 'unfueled', 'unifoliate', 'involute', 'unveiled']
repeat ('ART', '') ['irate', 'aureate', 'arrayed', 'arid', 'eared', 'arty', 'aired']
repeat ('LFRT', '') ['liveried', 'louvered'

repeat ('0RTT', 'TRTT') ['throated', 'threaded']
repeat ('TPL', '') ['double', 'tibial']
repeat ('TTL', '') ['deadly', 'daedal', 'tidal']
repeat ('TRT', '') ['dirty', 'dowered', 'true_to', 'dried', 'torrid', 'dirt', 'terete', 'tired', 'tiered']
repeat ('KLNK', '') ['clunky', 'glowing', 'killing', 'coiling', 'cloying', 'colonic', 'clonic']
repeat ('TRTS', '') ['tortuous', 'tortious']
repeat ('TPRKLT', '') ['tuberculate', 'tuberculoid']
repeat ('TRPNT', '') ['turbaned', 'tripinnate', 'turbinate']
repeat ('ANT', '') ['undue', 'awned', 'owned', 'indie', 'unwed', 'annoyed', 'anti', 'enate', 'unawed', 'in_height', 'uniate']
repeat ('ANFLR', '') ['unifilar', 'uniovular']
repeat ('ARSLT', '') ['irresolute', 'urceolate']
repeat ('AFL', '') ['awful', 'afoul', 'evil', 'uveal']
repeat ('FKL', '') ['fugly', 'focal', 'fickle', 'vocal', 'vagal']
repeat ('FLNT', '') ['valiant', 'fluent', 'flinty', 'violent', 'flaunty', 'volant', 'valent']
repeat ('FLFT', '') ['velvet', 'valved']
repeat ('FKRL', '') ['

Double Metaphone, e.g.:

('LS', '') ['lousy', 'loose', 'lazy', 'lossy']

('PRT', '') ['bright', 'pretty', 'broad', 'buried', 'bratty', 'proto', 'berried', 'barred']

('PT', '') ['beady', 'pat', 'petty', 'bawdy', 'bad', 'bowed', 'potty', 'beta', 'bitty', 'boughed', 'paid', 'peaty']

In [76]:

''' Runs through all the words in the dictionary (in wordnet)
make dictionary of match rating codex
'''
match_rating_dict = {}
count = 15000
for ss in wn.all_synsets():
    word = ss.lemma_names()[0].lower()
    sound = jellyfish.match_rating_codex(word)
  
    if '-' in word:
#         print(word)
        continue
    if '*' in sound:
        print(sound)
        continue
    
#     print(word)
    if sound not in match_rating_dict:
        match_rating_dict[sound] = [word]
    else:
        if word not in match_rating_dict[sound]:
            match_rating_dict[sound].append(word)
            print("repeat", sound, match_rating_dict[sound])
    count -= 1
    if count == 0:
        break
        
# print(match_rating_dict)

repeat HYPCTV ['hyperactive', 'hypoactive']
repeat IDL ['ideal', 'idle']
repeat AFRD ['afraid', 'afeard']
repeat PRNTL ['prenatal', 'perinatal']
repeat AMPYLR ['amphiprostylar', 'amphistylar']
repeat CT ['cut', 'cute']
repeat BNDBL ['bondable', 'bindable']
repeat ADBL ['addable', 'audible']
repeat UNPSNG ['unprepossessing', 'unpromising']
repeat APLNG ['appealing', 'appalling']
repeat RR ['rare', 'rear']
repeat FLT ['flat', 'foliate']
repeat BLD ['billed', 'bald']
repeat PLS ['plus', 'pilous']
repeat WRY ['wary', 'wiry']
repeat BLD ['billed', 'bald', 'bellied']
repeat UNBSHD ['unabashed', 'unblemished']
repeat BLD ['billed', 'bald', 'bellied', 'bold']
repeat FTRD ['featured', 'fettered']
repeat FRLD ['frilled', 'furled']
repeat DMD ['doomed', 'dimmed']
repeat BRD ['broad', 'buried']
repeat BNY ['bonny', 'bony']
repeat BND ['bound', 'boned']
repeat BLWY ['billowy', 'blowy']
repeat PRDCS ['predaceous', 'predacious']
repeat CLN ['clean', 'cauline']
repeat CSL ['casual', 'causal']
repeat O

repeat STRNG ['strange', 'strong', 'stirring']
repeat UNSNLK ['unstatesmanlike', 'unseamanlike']
repeat INTTNL ['intentional', 'international']
repeat GLBL ['gullible', 'global']
repeat INTSTT ['interstate', 'intrastate']
repeat UNBCHD ['unbranched', 'unbleached']
repeat PNTD ['pointed', 'painted']
repeat NT ['neat', 'net']
repeat NRTC ['neritic', 'neurotic']
repeat MDL ['medial', 'middle', 'modal']
repeat HYPNSV ['hypertensive', 'hypotensive']
repeat NRTRLY ['northerly', 'northeasterly']
repeat NRTRLY ['northerly', 'northeasterly', 'northwesterly']
repeat NRTWRD ['northeastward', 'northwestward']
repeat STHRLY ['southerly', 'southeasterly']
repeat STHRLY ['southerly', 'southeasterly', 'southwesterly']
repeat STHWRD ['southeastward', 'southwestward']
repeat HRD ['hired', 'hard', 'horrid', 'hurried', 'heard']
repeat UNDCTD ['unaddicted', 'undedicated', 'uneducated', 'undereducated', 'undomesticated', 'undetected']
repeat CNFMBL ['confirmable', 'conformable']
repeat CNTRRY ['contemporary

repeat SLCK ['slack', 'slick']
repeat SLPNG ['sloping', 'slipping']
repeat UNLCTD ['unlocated', 'unlubricated']
repeat BRKY ['braky', 'barky']
repeat BLT ['built', 'bullate']
repeat CHPD ['chopped', 'chapped']
repeat VRCS ['veracious', 'varicose', 'verrucose']
repeat SNT ['sent', 'sinuate']
repeat UNTCHD ['unattached', 'untouched', 'unnotched']
repeat CRNT ['current', 'crenate']
repeat EMRGNT ['emergent', 'emarginate']
repeat CLTRL ['collateral', 'cultural']
repeat INTSNL ['intensional', 'interpersonal']
repeat TD ['tied', 'toed', 'tod']
repeat UNTNDD ['untended', 'unattended']
repeat CLSTRD ['cloistered', 'clustered']
repeat STNLS ['stainless', 'stoneless']
repeat SHTRD ['shattered', 'shuttered']
repeat UNSTRD ['unstirred', 'unstructured', 'unshuttered']
repeat UNSCBL ['unserviceable', 'unsociable']
repeat SLD ['solid', 'sealed', 'sold']
repeat UNSLD ['unsoiled', 'unsullied', 'unsealed', 'unsold']
repeat SLD ['solid', 'sealed', 'sold', 'soled']
repeat TBLR ['tabular', 'tubular']
repea

repeat PRVNCL ['provincial', 'provencal']
repeat PSTTRL ['postindustrial', 'postdoctoral']
repeat MCRNMC ['macroeconomic', 'microeconomic']
repeat PRMTRY ['peremptory', 'paramilitary']
repeat MTC ['meiotic', 'miotic']
repeat MNCLNL ['monoclinal', 'monoclonal']
repeat NVL ['novel', 'naval']
repeat PRMDCL ['premedical', 'paramedical']
repeat PRS ['porous', 'porose', 'parous']
repeat PRTD ['parted', 'parotid']
repeat PLN ['plain', 'pauline']
repeat PTY ['petty', 'potty', 'peaty']
repeat PLGSTC ['plagiaristic', 'plagioclastic']
repeat PLR ['polar', 'pilar']
repeat INTTRY ['introductory', 'integumentary', 'interplanetary']
repeat EXTTRL ['extraterritorial', 'extraterrestrial']
repeat PLMNCS ['plumbaginaceous', 'polemoniaceous']
repeat PLTRCT ['politically_correct', 'politically_incorrect']
repeat PRNL ['perennial', 'preanal']
repeat RBD ['ribbed', 'rabid']
repeat RCLS ['recluse', 'recoilless']
repeat SPNS ['spinose', 'spinous', 'sapiens']
repeat SCRFY ['scruffy', 'scurfy']
repeat SMNL ['sem

Jellyfish's match rating codex is not very accurate either, e.g.:

UNPNTD ['unprecedented', 'unpatented', 'unpainted', 'unparented', 'unplanted', 'unpigmented']

INTCLR ['intraventricular', 'intramuscular', 'intramolecular', 'intermolecular']
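One possible way to tighten such broad groups, offered here only as an assumption and not as part of the method above, is a second filter on spelling similarity. The sketch uses `difflib.SequenceMatcher` from the standard library; `filter_by_spelling` and the 0.85 threshold are hypothetical choices.

```python
from difflib import SequenceMatcher

def filter_by_spelling(word, bucket, threshold=0.85):
    """Keep bucket members whose spelling is close to `word`.

    Similarity is SequenceMatcher's ratio (0.0 to 1.0); the word itself
    is left out of the result.
    """
    return [
        other for other in bucket
        if other != word
        and SequenceMatcher(None, word, other).ratio() >= threshold
    ]

# The UNPNTD group from the output above
bucket = ['unprecedented', 'unpatented', 'unpainted',
          'unparented', 'unplanted', 'unpigmented']
print(filter_by_spelling('unpainted', bucket))
```

Spelling similarity is only a rough proxy for sound, so the threshold would need tuning against real puns.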

In [78]:

''' Run through the words in the dictionary (WordNet) and
build a dictionary of NYSIIS codes
'''
nysiis_dict = {}
count = 1500
for ss in wn.all_synsets():
    word = ss.lemma_names()[0].lower()
    sound = phonetics.nysiis(word)
  
    if '-' in word:
#         print(word)
        continue
    if '*' in sound:
        print(sound)
        continue
    
#     print(word)
    if sound not in nysiis_dict:
        nysiis_dict[sound] = [word]
    else:
        if word not in nysiis_dict[sound]:
            nysiis_dict[sound].append(word)
            print("repeat", sound, nysiis_dict[sound])
    count -= 1
    if count == 0:
        break
        
print(nysiis_dict)

repeat A ['able', 'abaxial']
repeat A ['able', 'abaxial', 'adaxial']
repeat A ['able', 'abaxial', 'adaxial', 'acroscopic']
repeat A ['able', 'abaxial', 'adaxial', 'acroscopic', 'abducent']
repeat A ['able', 'abaxial', 'adaxial', 'acroscopic', 'abducent', 'adducent']
repeat DA ['dissilient', 'dying']
repeat A ['able', 'abaxial', 'adaxial', 'acroscopic', 'abducent', 'adducent', 'abridged']
repeat PA ['parturient', 'potted']
repeat UA ['unable', 'unabridged']
repeat A ['able', 'abaxial', 'adaxial', 'acroscopic', 'abducent', 'adducent', 'abridged', 'absolute']
repeat DA ['dissilient', 'dying', 'direct']
repeat LA ['last', 'living']
repeat RA ['relative', 'relational']
repeat A ['able', 'abaxial', 'adaxial', 'acroscopic', 'abducent', 'adducent', 'abridged', 'absolute', 'absorbent']
repeat A ['able', 'abaxial', 'adaxial', 'acroscopic', 'abducent', 'adducent', 'abridged', 'absolute', 'absorbent', 'absorbefacient']
repeat RA ['relative', 'relational', 'receptive']


IndexError: list index out of range

NYSIIS is not suitable for our method because its sound codes are too inaccurate, e.g.:

A ['able', 'abaxial', 'adaxial', 'acroscopic', 'abducent', 'adducent', 'abridged', 'absolute', 'absorbent', 'absorbefacient']
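The `IndexError` above also stops the loop before it reaches the end of the dictionary. Whatever its cause, a guard around the encoder call lets the loop skip the offending word and carry on. This is a minimal sketch: `safe_encode` is a hypothetical helper, and `fragile_encoder` is a made-up encoder that fails on short input purely to demonstrate the guard.

```python
def safe_encode(encoder, word):
    """Return encoder(word), or None if the encoder fails on this word."""
    try:
        return encoder(word)
    except (IndexError, ValueError):
        return None

def fragile_encoder(word):
    # Made-up encoder: raises IndexError for words shorter than 3 letters
    return (word[0] + word[2]).upper()

for word in ['able', 'ab', 'direct']:
    code = safe_encode(fragile_encoder, word)
    if code is None:
        print('skipped', word)
        continue
    print(word, code)
```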

---

Thus, **soundex**, **double metaphone**, and maybe match rating codex (?) seem suitable. After building the dictionary, continue with:

* a function that uses the chosen algorithm to get fuzzy (similar-sounding) words
* a function to check the similarity of each (lemmatized) word in the sentence with the target word (original + fuzzy words)
* a function to return the difference in the similarity spike between the original word and another word among the fuzzy words
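The first of these functions could look roughly like the sketch below. Everything in it is an assumption for illustration: `soundex` is a hand-written American Soundex that stands in for whichever library encoder is finally chosen, and `build_sound_dict` and `get_fuzzy_words` are hypothetical names.

```python
def soundex(word):
    """Hand-written American Soundex: first letter plus three digits."""
    codes = {}
    for letters, digit in (('bfpv', '1'), ('cgjkqsxz', '2'), ('dt', '3'),
                           ('l', '4'), ('mn', '5'), ('r', '6')):
        for ch in letters:
            codes[ch] = digit
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ''
    result = letters[0].upper()
    prev = codes.get(letters[0], '')
    for ch in letters[1:]:
        if ch in 'hw':
            continue  # h and w do not separate letters with the same code
        code = codes.get(ch, '')
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so a repeated code is kept after them
    return (result + '000')[:4]

def build_sound_dict(words, encoder=soundex):
    """Map each phonetic code to the list of words that share it."""
    sound_dict = {}
    for word in words:
        bucket = sound_dict.setdefault(encoder(word), [])
        if word not in bucket:
            bucket.append(word)
    return sound_dict

def get_fuzzy_words(word, sound_dict, encoder=soundex):
    """Return the other words that share `word`'s phonetic code."""
    return [w for w in sound_dict.get(encoder(word), []) if w != word]

sound_dict = build_sound_dict(['ideal', 'idle', 'lousy', 'lazy', 'gentle'])
print(get_fuzzy_words('idle', sound_dict))
```

Passing the encoder as a parameter keeps the lookup independent of the algorithm, so Double Metaphone or the match rating codex could be swapped in without changing the other two functions.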