This is part of the research we conduct at [Vocapouch](https://vocapouch.com). Our service is dedicated to language learners. The results of the study contained in this notebook were described [on our blog](https://blog.vocapouch.com/which-word-does-rhyme-the-most-ebd66dedcce7).

# Which word has the most rhymes, and why is it "carburetion"?

Author: Roman Kierzkowski

## Cleaning up the data

In our pursuit of the word with the most rhymes, we start with the [ISLEX](http://isle.illinois.edu/sst/data/g2ps/) database of word pronunciations. We clean it up by removing all proper nouns. Then we use [pysle](https://github.com/timmahrt/pysle) to parse the resulting file.

In [1]:
from __future__ import print_function
from pysle import isletool
from itertools import groupby

import io

excluded_pos = {'nnp', 'nnps'} # exclude proper nouns

def extract_root_pos(pos):
    index = pos.find('_')
    return pos[:index] if index != -1 else pos

def filter_out_proper_nouns(source, dest):
    with io.open(source, "r", encoding='utf-8') as inp:
        with io.open(dest, "w", encoding='utf-8') as outp:
            for line in inp:
                s = line.find('(')
                e = line.find(')')
                pos = line[s+1:e].split(',') # extract and split
                pos = { extract_root_pos(p) for p in pos }
                if (not pos & excluded_pos) or (pos - excluded_pos): # not proper noun or proper noun that is also regular word like brown
                    outp.write(line)

filter_out_proper_nouns('ISLEdict.txt', 'ISLEdict_npn.txt')

isleDict = isletool.LexicalTool('ISLEdict_npn.txt')
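
To make the filtering rule concrete, here is a minimal, self-contained sketch of the same keep/drop decision applied to a few part-of-speech fields. The helper name `keep_entry` and the sample tag strings are made up for illustration; they are not taken from ISLEdict.

```python
EXCLUDED_POS = {'nnp', 'nnps'}  # proper-noun tags, as in the cell above


def extract_root_pos(pos):
    # Drop anything after an underscore: 'nnp_x' -> 'nnp'.
    index = pos.find('_')
    return pos[:index] if index != -1 else pos


def keep_entry(pos_field):
    # Hypothetical helper: pos_field is the comma-separated text found
    # between the parentheses of a dictionary line.
    tags = {extract_root_pos(p) for p in pos_field.split(',')}
    # Keep the entry unless every tag is a proper-noun tag.
    return (not tags & EXCLUDED_POS) or bool(tags - EXCLUDED_POS)


for field in ['nn', 'nnp', 'jj,nn,nnp', 'nnp_x,nnps']:
    print(field, '->', 'keep' if keep_entry(field) else 'drop')
```

An entry tagged only as a proper noun is dropped, while a word such as *brown*, which would carry both proper-noun and ordinary tags, is kept.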

For further processing, we keep only single words, skipping entries that contain a hyphen or an underscore. We build a list of tuples containing the word, its pronunciation, and the index of the accented vowel.

In [3]:
def more_than_one(word):
    return '-' in word or '_' in word

def flatten(syllables):
    return [ phoneme for syllable in syllables for phoneme in syllable ]

def parse_pronun(data):
    total = len(data.keys())
    single = 0
    accented = 0
    not_vowels = 0
    
    result = []

    words = sorted(data.keys())  # sorted() works whether keys() returns a list or a view

    for word in words:
        records = data[word]
        if not more_than_one(word):
            single+=1
            was_accented = False
            for record in records:
                parsed, accented_sylables, accented_vovel  = isletool._parsePronunciation(record[0])[0]
                if accented_sylables:
                    was_accented = True
                    accent_index = sum(len(x) for x in parsed[0:accented_sylables[0]]) + accented_vovel[0]
                    pronunc = flatten(parsed)
                    if pronunc[accent_index][1:] in isletool.vowelList:
                        item = (word, pronunc, accent_index)
                        result.append(item)
                    else:
                        not_vowels+=1
            if was_accented:
                accented+=1
    
    return (total, single, accented, not_vowels, result)

total, single, accented, not_vowels, pronun_records = parse_pronun(isleDict.data) 

print("Total %s words, single words %s, with accent %s, non-vowels accented %s." % (total, single, accented, not_vowels))

Total 206321 words, single words 126862, with accent 124958, non-vowels accented 0.
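
The accent index used above counts the phonemes in all syllables before the stressed one and then adds the position of the stressed vowel inside its own syllable. The following sketch shows that arithmetic on a hand-written syllabification. The phoneme lists are illustrative rather than copied from ISLEdict, and `flat_accent_index` is a hypothetical helper.

```python
def flatten(syllables):
    return [phoneme for syllable in syllables for phoneme in syllable]


def flat_accent_index(syllables, stressed_syllable, vowel_in_syllable):
    # Phonemes before the stressed syllable, plus the offset inside it.
    before = sum(len(s) for s in syllables[:stressed_syllable])
    return before + vowel_in_syllable


# Hand-written example: two syllables, stress on the second one.
above = [['ə'], ['b', 'ˈʌ', 'v']]
index = flat_accent_index(above, stressed_syllable=1, vowel_in_syllable=1)

print(index)                  # 2
print(flatten(above)[index])  # ˈʌ
```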


To make the data easier to use later on, we also create a lookup dictionary that maps each word to its list of pronunciations.

In [6]:
pronun_records.sort(key=lambda x: (x[0], ''.join(x[1])))
pronun_dict = dict((k, list(v)) for k, v in groupby(pronun_records, key=lambda x: x[0]))
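
Note that `itertools.groupby` only merges consecutive items with equal keys, which is why the records are sorted by word before grouping. A toy illustration with made-up records:

```python
from itertools import groupby

# Made-up (word, pronunciation, accent_index) records, deliberately unsorted.
records = [
    ('read', ['ɹ', 'ˈi', 'd'], 1),
    ('love', ['l', 'ˈʌ', 'v'], 1),
    ('read', ['ɹ', 'ˈɛ', 'd'], 1),
]


def by_word(record):
    return record[0]


# Without sorting, the two 'read' records end up in separate groups,
# and building a dict keeps only the last group for each key.
unsorted_lookup = dict((k, list(v)) for k, v in groupby(records, key=by_word))

# Sorting first makes equal words adjacent, so each word gets one group.
records.sort(key=lambda r: (r[0], ''.join(r[1])))
lookup = dict((k, list(v)) for k, v in groupby(records, key=by_word))

print(len(unsorted_lookup['read']))  # 1
print(len(lookup['read']))           # 2
```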

In [7]:
def present_record(r):
    return "%s => %s (with accent at %s. phonem)" % (r[0], ''.join(r[1]), r[2] + 1)
    
print(present_record(pronun_dict['love'][0]))

love => lˈʌv (with accent at 2. phonem)


We get some basic statistics about the data set.

In [8]:
def mean(numbers):
    return float(sum(numbers)) / max(len(numbers), 1)

unique_words = len(pronun_dict.keys())
average_pronun = mean([len(p) for p in pronun_dict.values()])

print("Unique words: {} Average pronunciations per word: {}".format(unique_words, average_pronun))

Unique words: 124958 Average pronunciations per word: 1.16413514941


## Matching the rhymes

According to [Wikipedia](https://en.wikipedia.org/wiki/Perfect_and_imperfect_rhymes):

"Perfect rhyme […] is a form of rhyme between two words or phrases, satisfying the following conditions:
- The stressed vowel sound in both words must be identical, as well as any subsequent sounds. […]
- The articulation that precedes the vowel in the words must differ. […]"

For example, *love* and *glove* are perfect rhymes because they satisfy both conditions, but *knight* and *night* are not, because they fail the second one: they are pronounced identically, so the sounds before the stressed vowel do not differ.

The following functions are used to check if two words are rhymes.

In [9]:
def same_ending(r1, r2):
    w1, p1, a1 = r1 # word, pronunciation, accent
    w2, p2, a2 = r2
    
    return p1[a1:] == p2[a2:]

def diffrent_begining(r1, r2):
    w1, p1, a1 = r1 # word, pronunciation, accent
    w2, p2, a2 = r2
    
    return p1[:a1] != p2[:a2]

def is_rhyme(r1, r2):
    w1, p1, a1 = r1 # word, pronunciation, accent
    w2, p2, a2 = r2
    
    return w1 != w2 and same_ending(r1, r2) and diffrent_begining(r1, r2)

Some tests:

In [10]:
love = pronun_dict['love'][0]
glove = pronun_dict['glove'][0]

assert(is_rhyme(love, glove))

uncurb = pronun_dict['uncurb'][0]
superb = pronun_dict['superb'][1]

assert(is_rhyme(uncurb, superb))

knight = pronun_dict['knight'][0]
night = pronun_dict['night'][0]

assert(not is_rhyme(knight, night))
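
The same two conditions can be tried without the dictionary. The sketch below restates the checks in a single function, here called `is_perfect_rhyme`, and runs them on hand-written phoneme lists. The transcriptions are simplified for illustration and are not taken from ISLEdict.

```python
def is_perfect_rhyme(r1, r2):
    w1, p1, a1 = r1  # word, phonemes, index of the stressed vowel
    w2, p2, a2 = r2
    same_ending = p1[a1:] == p2[a2:]      # stressed vowel and what follows
    different_onset = p1[:a1] != p2[:a2]  # what precedes the stressed vowel
    return w1 != w2 and same_ending and different_onset


love = ('love', ['l', 'ˈʌ', 'v'], 1)
glove = ('glove', ['g', 'l', 'ˈʌ', 'v'], 2)
knight = ('knight', ['n', 'ˈɑɪ', 't'], 1)
night = ('night', ['n', 'ˈɑɪ', 't'], 1)

print(is_perfect_rhyme(love, glove))    # True
print(is_perfect_rhyme(knight, night))  # False: identical onsets
print(is_perfect_rhyme(love, night))    # False: different endings
```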

All we have to do now is find out which words rhyme. Do we have to check each pair? That would mean 123970² = 15 368 560 900 checks, which would take a lot of time. Instead, we use a trick: we sort the list of records by reversed pronunciation. This gives a list in which words with the same ending are next to each other, so we only need to check pairs within those small groups of words that share an ending.

In [11]:
def find_rhymes(records):
    result = {}
    
    records.sort(key=lambda x: list(reversed(x[1])))
    for i, record in enumerate(records):
        j = i + 1
        while j < len(records) and same_ending(record, records[j]):
            r1 = record
            r2 = records[j]
            if is_rhyme(r1, r2):
                w1 = r1[0]
                w2 = r2[0]
                result.setdefault(w1, set({})).add(w2)
                result.setdefault(w2, set({})).add(w1)
            j+=1
    return result

rhymes = find_rhymes(pronun_records)
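
To see why sorting by the reversed pronunciation helps, consider a toy list. Comparing the phoneme lists from the last phoneme backwards puts words with a common ending next to each other. The words and simplified transcriptions below are made up for illustration.

```python
records = [
    ('cat', ['k', 'æ', 't']),
    ('dog', ['d', 'ɔ', 'g']),
    ('hat', ['h', 'æ', 't']),
    ('log', ['l', 'ɔ', 'g']),
]

# Sort on the phoneme list read back to front, as find_rhymes does.
records.sort(key=lambda r: list(reversed(r[1])))

ordered = [word for word, _ in records]
print(ordered)  # ['dog', 'log', 'hat', 'cat']
```

After the sort, each record only has to be compared with its immediate successors until the ending stops matching.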

It works!

In [197]:
rhymes['love']

{u'above',
 u'belove',
 u'deneuve',
 u'dove',
 u'glove',
 u'gov',
 u'hereof',
 u"o'glove",
 u'shove',
 u'thereof',
 u'whereof'}

And here are some tests:

In [199]:
assert('night'  in rhymes['height'])
assert('knight' in rhymes['height'])

assert('night'  not in rhymes['knight'])
assert('knight' not in rhymes['night'])

## The most popular word is *carburetion*. Why?

In [12]:
rhymes_counts = [ (word, len(word_rhymes)) for word, word_rhymes in rhymes.items() ]
rhymes_counts.sort(key=lambda x: -x[1])

rhymes_counts[0:10]

[(u'carburetion', 1400),
 (u'modernization', 1390),
 (u'obligation', 1382),
 (u'ration', 1381),
 (u'ventilation', 1380),
 (u'distillation', 1380),
 (u'ordination', 1378),
 (u'concatenation', 1378),
 (u'incoordination', 1378),
 (u'detonation', 1378)]

In [13]:
list(rhymes['carburetion'])[0:10]

[u'expostulation',
 u'activation',
 u'dotation',
 u'replication',
 u'appropriation',
 u'gratification',
 u'disorientation',
 u'reduplication',
 u'ovation',
 u'accentuation']

*Carburetion* has exactly 1400 rhymes. But why don't the other words that rhyme with carburetion have the same number of rhymes? Take, for example, the runner-up, modernization, with 1390 rhymes. The two words rhyme with each other, so why don't they have exactly the same number of rhymes?

The reason is that both of them have two pronunciations:

* carburetion: **kˌɑɹbəɹˈeiʃn̩** and **kˌɑɹbjɚˈiʃn̩**,
* modernization: **mˌɑd˺ɚnəzˈeiʃn̩** and **mˌɑd˺ɚnɑɪzˈeiʃə**.

The accented vowel is preceded by the stress mark ˈ, which looks like an apostrophe. We can see that carburetion and modernization share the ending ˈeiʃn̩, which brings each of them most of their rhymes: exactly 1371. Their other pronunciations bring them 29 and 19 rhymes respectively. That’s where the difference comes from.

In [14]:
def pronunc_hist(pronunciations, word):
    # For each pronunciation of `word`, collect the distinct words that rhyme with it.
    word_pronunciations = pronunciations[word]
    count = [ set() for _ in word_pronunciations ]

    for r in rhymes[word]:
        for i, lead in enumerate(word_pronunciations):
            for p in pronunciations[r]:
                if is_rhyme(lead, p):
                    count[i].add(r)

    return (word_pronunciations, [len(c) for c in count])
    
       
def print_pronunc_hist(pronunciations, word):        
    word_pronunciations, counts = pronunc_hist(pronunciations, word)

    for i, p in enumerate(word_pronunciations):
        print(present_record(p))
        print(counts[i])


print_pronunc_hist(pronun_dict, 'carburetion')
print('-------')
print_pronunc_hist(pronun_dict, 'modernization')

carburetion => kˌɑɹbjɚˈiʃn̩ (with accent at 7. phonem)
29
carburetion => kˌɑɹbəɹˈeiʃn̩ (with accent at 7. phonem)
1371
-------
modernization => mˌɑd˺ɚnɑɪzˈeiʃə (with accent at 8. phonem)
19
modernization => mˌɑd˺ɚnəzˈeiʃn̩ (with accent at 8. phonem)
1371
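
Put differently, a word's rhyme set is the union of the rhyme sets of its pronunciations. The counts above add up exactly to the totals (1371 + 29 = 1400 and 1371 + 19 = 1390), which means the per-pronunciation sets do not overlap. A toy sketch of that bookkeeping, with made-up rhyme sets rather than real data:

```python
# Made-up rhyme sets for a word with two pronunciations.
rhymes_by_pronunciation = [
    {'nation', 'ration', 'station'},  # via the first pronunciation
    {'completion'},                   # via the second pronunciation
]

all_rhymes = set().union(*rhymes_by_pronunciation)
counts = [len(s) for s in rhymes_by_pronunciation]

print(counts)           # [3, 1]
print(len(all_rhymes))  # 4
# The per-pronunciation counts sum to the total only when the sets are disjoint.
print(sum(counts) == len(all_rhymes))  # True
```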


## The most rhyming endings

Let's check which word ending (from the stressed vowel to the end of the word) is shared by the largest number of words.

In [15]:
def ending(record):
    _, pronunciation, accent = record
    
    return ''.join(pronunciation[accent:])

def rhyme_groups(records):
    result = {}
    records.sort(key=lambda x: list(reversed(x[1])))
     
    for i, record in enumerate(records):
        word = record[0]
        result.setdefault(ending(record), set({})).add(word)
        
    return result

groups = rhyme_groups(pronun_records)
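
The grouping idea can be seen on a handful of records. This sketch uses `collections.defaultdict` instead of `setdefault`, which is a stylistic substitution and not the notebook's own code. The records and their simplified transcriptions are made up for illustration.

```python
from collections import defaultdict

# Made-up (word, phonemes, accent_index) records.
toy_records = [
    ('nation', ['n', 'ˈei', 'ʃ', 'n̩'], 1),
    ('station', ['s', 't', 'ˈei', 'ʃ', 'n̩'], 2),
    ('bee', ['b', 'ˈi'], 1),
    ('ration', ['ɹ', 'ˈei', 'ʃ', 'n̩'], 1),
]

toy_groups = defaultdict(set)
for word, phonemes, accent in toy_records:
    # The key is everything from the stressed vowel to the end of the word.
    toy_groups[''.join(phonemes[accent:])].add(word)

# Largest groups first.
ranking = sorted(toy_groups.items(), key=lambda item: -len(item[1]))
for ending, words in ranking:
    print(ending, len(words), sorted(words))
```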

In [16]:
from random import shuffle

groups_list = [ (e, len(words)) for e, words in groups.items() ]
groups_list.sort(key=lambda x: -x[1])

for e, c in groups_list[:20]:
    sample = list(groups[e])
    shuffle(sample)
    print(u'{:<10} {:>5} {}'.format(e.strip(), c, sample[:5]))

ˈeiʃn̩      1372 [u'continuation', u'gustation', u'gestation', u'denuclearization', u'stagnation']
ˈi           480 [u'mammee', u'jinni', u'franchisee', u'dupee', u'bee']
ˈei          455 [u'bellay', u'coulee', u'allay', u'formee', u'ricochet']
ˈɑlədʒi      359 [u'patrology', u'paleontology', u'paleopedology', u'hypnology', u'amphibology']
ˈɛt          351 [u'barrette', u'bett', u'stet', u'marchette', u'revet']
ˈeiʃn̩z      291 [u'explorations', u'gyrations', u'strangulations', u'accusations', u'federations']
ˈu           288 [u'doo', u'timbuctoo', u'pew', u'linyu', u'detenu']
ˈin          277 [u'unclean', u'marine', u'squireen', u'imipramine', u'achene']
ˈoʊsɪs       267 [u'pollenosis', u'spirochaetosis', u'erythroblastosis', u'acidosis', u'osteoarthrosis']
ˈɪt˺ɪk       265 [u'osteitic', u'poliomyelitic', u'pisolitic', u'scleritic', u'dyophysitic']
ˈæt˺ɪk       252 [u'technocratic', u'suprahepatic', u'epistatic', u'apochromatic', u'thematic']
ˈoʊ          246 [u'inlow', u'tarot', u'go

## Summary

You may dispute that **carburetion** is the word with the most rhymes, and you may question the completeness of the dataset, but you have to admit that the **ˈeiʃn̩** ending is the leader among rhymes.