# Haiku

## Project 16: Counting Vowels
The objective of the first project is to write a Python program that counts the number of syllables in an English word or phrase. Every vowel sound is a syllable. Although some vowels are silent and some combine into a single longer sound, exhaustive corpora exist that catalog the vowel sounds we need. The primary steps of this program are:
1. Download a large corpus with syllable-count information.
2. Compare the syllable-count corpus to the haiku-training corpus and identify all the words missing from the syllable-count corpus.
3. Build a dictionary of the missing words and their syllable counts.
4. Write a program that uses both the syllable-count corpus and the missing-words dictionary to count syllables in the training corpus.
5. Write a program that checks the syllable-counting program against updates of the training corpus.

This program uses the [Natural Language Toolkit](https://www.nltk.org) to access the Carnegie Mellon University Pronouncing Dictionary (CMUdict).

In [None]:
import nltk
nltk.download() # This doesn't work very well in Jupyter, but I wanted to leave it in anyway. Use the Python interpreter for this.

In [4]:
from nltk.corpus import cmudict

CMUdict breaks words into sets of phonemes (perceptually distinct units of sound) and marks vowels for lexical stress using numbers (0, 1, and 2). You can use these numbers to identify the vowels in a word because every vowel is marked with one *and only one* of these numbers. Words with multiple pronunciations (e.g. "aged" versus "agèd") map to nested lists with one phoneme list per pronunciation, so they can yield more than one syllable count.
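To make the stress-digit rule concrete, here is a minimal sketch that counts syllables for each pronunciation of a word. The entries in `sample_entries` are hand-written to mimic CMUdict's format and are an assumption for illustration; real entries come from `cmudict.dict()`. The helper name `syllable_counts` is also hypothetical.

```python
# Hand-written entries that mimic CMUdict's format (an assumption for
# illustration; real entries come from nltk.corpus.cmudict.dict()).
sample_entries = {
    'aged': [['EY1', 'JH', 'D'], ['EY1', 'JH', 'IH0', 'D']],
    'haiku': [['HH', 'AY1', 'K', 'UW0']],
}

def syllable_counts(pronunciations):
    """Return one syllable count per pronunciation."""
    counts = []
    for phonemes in pronunciations:
        # A phoneme is a vowel sound if it ends in a stress digit (0, 1, or 2).
        counts.append(sum(1 for phoneme in phonemes if phoneme[-1].isdigit()))
    return counts

for word, pronunciations in sample_entries.items():
    print(word, syllable_counts(pronunciations))
```

With these sample entries, "aged" yields two counts (one per pronunciation) and "haiku" yields a single count.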

### Managing Missing Words
The problem with corpora is that they inevitably miss words that you may want to use. In this case, there will also be some Japanese romanizations misinterpreted as English words with a different syllable count, like *sake*. We must make sure the words in our haiku training corpus (`train.txt`, from the [book's GitHub repo](https://github.com/rlvaugh/Impractical_Python_Projects)) are also in CMUdict. (Download `train.txt` to the same directory as this Jupyter notebook.)

In [1]:
# missingWordsFinder.py
# This program checks train.txt entries for membership in CMUdict.
import sys
from string import punctuation
import pprint
import json
from nltk.corpus import cmudict

cmudict = cmudict.dict() # Carnegie Mellon University Pronouncing Dictionary

def main():
    haiku = load_haiku('train.txt')
    exceptions = cmudict_missing(haiku)
    build_dict = input("\nManually build an exceptions dictionary (y/n)? \n")
    if build_dict.lower() == 'n':
        sys.exit()
    else:
        missing_words_dict = make_exceptions_dict(exceptions)
        save_exceptions(missing_words_dict) # I originally forgot this line and had to diff with the source code from Github. :/

def load_haiku(filename):
    """Open and return training corpus of haiku as a set."""
    with open(filename) as in_file:
        haiku = set(in_file.read().replace('-', ' ').split()) # Load as a set to avoid repeats. Replace hyphens with spaces.
        return haiku

def cmudict_missing(word_set):
    """Find and return words in word set missing from cmudict."""
    exceptions = set() # Start an empty set to hold missing words.
    for word in word_set:
        word = word.lower().strip(punctuation)
        if word.endswith("'s") or word.endswith("’s"):
            word = word[:-2]
        if word not in cmudict:
            exceptions.add(word)
    print("\nexceptions:")
    print(*exceptions, sep='\n')
    print("\nNumber of unique words in haiku corpus = {}".format(len(word_set)))
    print("Number of words in corpus not in cmudict = {}".format(len(exceptions)))
    membership = (1 - (len(exceptions) / len(word_set))) * 100
    print("cmudict membership = {:.1f}{}".format(membership, '%'))
    return exceptions

def make_exceptions_dict(exceptions_set):
    """Return dictionary of words and syllable counts from a set of words."""
    missing_words = {} # Assign an empty dictionary to missing_words
    print("Input # syllables in word. Mistakes can be corrected at end. \n")
    for word in exceptions_set:
        while True:
            num_sylls = input("Enter number of syllables in {}: ".format(word))
            if num_sylls.isdigit():
                break
            else:
                print("                   Not a valid answer!", file=sys.stderr)
        missing_words[word] = int(num_sylls) # Record the count only after valid input ends the while loop.
    print()
    pprint.pprint(missing_words, width=1)

    print("\nMake Changes to Dictionary Before Saving?")
    print("""
    0 - Exit & Save
    1 - Add a Word or Change a Syllable Count
    2 - Remove a Word
    """)

    while True:
        choice = input("\nEnter choice: ")
        if choice == '0':
            break
        elif choice == '1':
            word = input("\nWord to add or change: ")
            missing_words[word] = int(input("Enter number of syllables in {}: ".format(word)))
        elif choice == '2':
            word = input("\nEnter word to delete: ")
            missing_words.pop(word, None)

    print("\nNew words or syllable changes:")
    pprint.pprint(missing_words, width=1)

    return missing_words

def save_exceptions(missing_words):
    """Save exceptions dictionary as json file."""
    json_string = json.dumps(missing_words) # Serialize the missing_words dictionary into a string. Serializing converts data into a format that can be stored or transmitted.
    with open('missing_words.json', 'w') as f: # The with statement closes the file automatically.
        f.write(json_string)
    print("\nFile saved as missing_words.json")

if __name__ == '__main__':
    main()


exceptions:
swordhand
colour
hibiscus
stretchings
spiritless
deepener
cloudbanks
tendrilled
samisen
beholders
asakura
samuri
battlers
persimmons
froglings
ridgelines
furue
yowl
windless
lichened
woodcutter
whippoorwill
nightingales
treeline
evenfall
wintery
cumulus
tendrils
archways
carven
dragonfly
treehouse
priestling
camellia
pattering
wisteria
fie
oranged
scatters
skims
storks
windblown
watersplash
inuyasha
cloudbank
dusky
nursemaid
bathwater
atsuta
moonrise
shadeless
paperweights
creepers
foregather
morningglory
petaled
dewdrop
mooing

Number of unique words in haiku corpus = 1523
Number of words in corpus not in cmudict = 58
cmudict membership = 96.2%


SystemExit: 

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)


Now that we have `missing_words.json`, we can write the code to count syllables.

In [2]:
# count_syllables.py

import sys
from string import punctuation
import json
from nltk.corpus import cmudict

# load dictionary of words in haiku corpus but not in cmudict
with open('missing_words.json') as f:
    missing_words = json.load(f)

cmudict = cmudict.dict() # Turns the CMUdict corpus into a dictionary.

def count_syllables(words):
    """Use corpora to count syllables in English word or phrase."""
    # prep words for cmudict corpus
    words = words.replace('-', ' ')
    words = words.lower().split()
    num_sylls = 0
    for word in words:
        word = word.strip(punctuation)
        if word.endswith("'s") or word.endswith("’s"):
            word = word[:-2]
        if word in missing_words:
            num_sylls += missing_words[word]
        else:
            for phoneme in cmudict[word][0]: # Use the first pronunciation in case the word has several.
                if phoneme[-1].isdigit(): # Vowel phonemes end in a stress digit (0, 1, or 2).
                    num_sylls += 1
    return num_sylls

def main():
    while True:
        print("Syllable Counter")
        word = input("Enter word or phrase; else press Enter to Exit: ")
        if word == '':
            sys.exit()
        try:
            num_syllables = count_syllables(word)
            print("The number of syllables in {} is: {}".format(word, num_syllables))
            print()
        except KeyError:
            print("Word not found. Try again.\n", file=sys.stderr)

if __name__ == '__main__':
    main()

Syllable Counter
The number of syllables in samuri is: 3

Syllable Counter
The number of syllables in moon is: 1

Syllable Counter
The number of syllables in sword is: 1

Syllable Counter


SystemExit: 

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)


Suppose you want to add a new haiku to the corpus, but want to be aware of any words that aren't in CMUdict or your exceptions dictionary. The program below will automatically count the syllables in each word in your training corpus and display any word(s) on which it failed.

In [6]:
# test_count_syllables_w_full_corpus.py
# Make sure to include a separate count_syllables.py file in the same directory as this notebook.

import sys
import count_syllables

with open('train.txt') as in_file:
    words = set(in_file.read().split())

missing = []

for word in words:
    try:
        num_syllables = count_syllables.count_syllables(word)
        ##print(word, num_syllables, end='\n') # Uncomment to see word counts
    except KeyError:
        missing.append(word)

print("Missing words: ", missing, file=sys.stderr)

Missing words:  []


## Project 17: Writing Haiku with Markov Chain Analysis

> When applied to letters in words, a Markov model is a mathematical model that calculates a letter's probability of occurrence based on the previous *k* consecutive letters, where *k* is an integer. A *model of order 2* means that the probability of a letter occurring depends on the two letters that precede it.

Markov models store every occurrence of a word as a separate duplicate value. You may get a list item like `'the': ['clouds', 'moon', 'moon']`: in this case, the odds of selecting `moon` versus `clouds` are 2:1. On the other hand, the model automatically screens rare or impossible combinations. A Markov model of order two generates lists that look more like `'the moon':  ['a', 'therefore']`. The size of *k* determines the novelty of the output: `0` results in random output based on the word's frequency in the corpus; `3` or more in the haiku corpus would result in plagiarism.
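The duplicate-value behavior is easy to see on a toy example. The sketch below builds order-1 and order-2 mappings from a short word list that was invented for illustration (it is not taken from `train.txt`); the names `order1` and `order2` are likewise hypothetical.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration; not taken from train.txt.
corpus = "the moon the clouds the moon a frog".split()

# Order 1: map each word to every word that follows it, keeping duplicates.
order1 = defaultdict(list)
for word, suffix in zip(corpus, corpus[1:]):
    order1[word].append(suffix)

# Order 2: map each pair of consecutive words to the word that follows.
order2 = defaultdict(list)
for first, second, suffix in zip(corpus, corpus[1:], corpus[2:]):
    order2[first + ' ' + second].append(suffix)

print(order1['the'])           # duplicates preserve the 2:1 odds
print(Counter(order1['the']))  # tallies how often each suffix occurs
print(order2['the moon'])
```

Because `random.choice()` picks uniformly from a list, keeping `moon` twice in `order1['the']` is what makes it twice as likely to be chosen as `clouds`.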

The program will seed the haiku with a random word from the corpus; use a Markov model of order 1 to select the second word; then use order-2 models for each subsequent word. The author uses *ghost prefixes* in the case of syllable overrun: the program appends the model-2 output of a random two-word prefix.
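The seed, order-1, order-2 sequence can be sketched as follows. This is a simplified illustration, not the book's implementation: it ignores syllable counts and ghost prefixes entirely, uses an invented toy corpus, and the function names `build_model` and `generate_words` are hypothetical. It also assumes `max_words` is at least 2.

```python
import random
from collections import defaultdict

# Toy corpus, invented for illustration; not taken from train.txt.
corpus = "old pond a frog jumps in the sound of water old pond still".split()

def build_model(words, order):
    """Map each run of `order` consecutive words to the words that follow it."""
    model = defaultdict(list)
    for i in range(len(words) - order):
        key = ' '.join(words[i:i + order])
        model[key].append(words[i + order])
    return model

def generate_words(words, max_words, rng):
    """Seed with a random word, then apply the order-1 and order-2 models."""
    model1 = build_model(words, 1)
    model2 = build_model(words, 2)
    seed = rng.choice(words[:-1])            # every word but the last has a successor
    line = [seed, rng.choice(model1[seed])]  # order 1 picks the second word
    while len(line) < max_words:
        suffixes = model2.get(' '.join(line[-2:]))  # .get avoids adding empty keys
        if not suffixes:
            break  # dead end: this pair only occurs at the very end of the corpus
        line.append(rng.choice(suffixes))
    return line

print(generate_words(corpus, 5, random.Random(0)))
```

Passing a `random.Random` instance keeps the sketch reproducible without touching the global random state.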

### Scaffolding and Debugging
*Scaffolding* is temporary code that you write to help develop a program and delete from the final version. One common form of scaffolding is a call to the `print()` function that checks the return value of a function or calculation. Other useful things to print include the type of a value or variable, the length of a dataset, and the results of incremental calculations. Scaffolding can be troublesome if you accidentally comment out or delete a `print()` call that you actually need in the end.

An alternative to scaffolding is using the `logging` module, which reports on what your program is doing and can write permanent logfiles.
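As a minimal sketch of the logfile idea, the snippet below attaches a `logging.FileHandler` to a named logger and reads the file back. The logger name `haiku_demo`, the file name `haiku_debug.log`, and the temporary directory are assumptions chosen for illustration.

```python
import logging
import os
import tempfile

# Hypothetical log location; a temporary directory keeps the demo tidy.
log_path = os.path.join(tempfile.mkdtemp(), 'haiku_debug.log')

logger = logging.getLogger('haiku_demo')  # hypothetical logger name
logger.setLevel(logging.DEBUG)

handler = logging.FileHandler(log_path)
handler.setFormatter(
    logging.Formatter('%(asctime)s - %(levelname)s - %(message)s'))
logger.addHandler(handler)

logger.debug('corpus size = %s', 1523)

# Flush and detach the handler so the file can be read back safely.
handler.close()
logger.removeHandler(handler)

with open(log_path) as f:
    log_text = f.read()
print(log_text)
```

Unlike a `print()` call, the message stays in the file after the program exits, and the `%(asctime)s` field records when it was written.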

In [8]:
import logging
logging.basicConfig(level=logging.DEBUG, format='%(levelname)s - %(message)s') # Set debugging information and format.

word = 'scarecrow'
VOWELS = 'aeiouy'
num_vowels = 0
for letter in word:
    if letter in VOWELS:
        num_vowels += 1
    logging.debug('letter & count = %s-%s', letter, num_vowels) # Convert nonstring objects to strings.

DEBUG - letter & count = s-0
DEBUG - letter & count = c-0
DEBUG - letter & count = a-1
DEBUG - letter & count = r-1
DEBUG - letter & count = e-2
DEBUG - letter & count = c-2
DEBUG - letter & count = r-2
DEBUG - letter & count = o-3
DEBUG - letter & count = w-3


The output in the code above uses string formatting `%s`. Date and time are shown using `format='%(asctime)s'`. To disable the `logging` messages, insert `logging.disable(logging.CRITICAL)` under `import logging` and comment/uncomment as desired. `logging.disable()` suppresses all messages at or below the designated level; `CRITICAL` is the highest level, so passing it to `logging.disable()` turns all messages off.
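The effect of `logging.disable()` can be checked with a small sketch that captures messages in memory. The logger name `disable_demo` is hypothetical, and the `io.StringIO` stream is used here only so the output can be inspected.

```python
import io
import logging

stream = io.StringIO()  # in-memory target so the output can be inspected

logger = logging.getLogger('disable_demo')  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.propagate = False  # keep the demo's messages away from the root logger
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter('%(levelname)s - %(message)s'))
logger.addHandler(handler)

logger.debug('before disable')

logging.disable(logging.CRITICAL)  # suppress CRITICAL and everything below it
logger.critical('suppressed')

logging.disable(logging.NOTSET)    # remove the override and restore logging
logger.debug('after re-enable')

captured = stream.getvalue()
print(captured)
```

Only the first and last messages reach the stream, which shows that passing `CRITICAL` silences every level and that `logging.disable(logging.NOTSET)` undoes it.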

In [None]:
# markov_haiku.py

# Setup
import sys
import logging
import random
from collections import defaultdict # defaultdict automatically creates a default value (here, an empty list) the first time a missing key is accessed
from count_syllables import count_syllables

logging.disable(logging.CRITICAL) # comment out to enable debugging messages
logging.basicConfig(level=logging.DEBUG, format='%(message)s')

def load_training_file(file):
    """Return text file as a string."""
    with open(file) as f:
        raw_haiku = f.read()
        return raw_haiku

def prep_training(raw_haiku):
    """Load string, remove newline, split words on spaces, and return list."""
    corpus = raw_haiku.replace('\n', ' ').split()
    return corpus

# Building the order-1 and order-2 Markov models.
def map_word_to_word(corpus):
    """Load list & use dictionary to map word to word that follows."""
    limit = len(corpus) - 1
    dict1_to_1 = defaultdict(list)
    for index, word in enumerate(corpus):
        if index < limit:
            suffix = corpus[index + 1]
            dict1_to_1[word].append(suffix)
    logging.debug("map_word_to_word results for \"sake\" = %s\n", dict1_to_1['sake'])
    return dict1_to_1

def map_2_words_to_word(corpus):
    """Load list & use dictionary to map word-pair to trailing word."""
    limit = len(corpus) - 2
    dict2_to_1 = defaultdict(list)
    for index, word in enumerate(corpus):
        if index < limit:
            key = word + ' ' + corpus[index + 1]
            suffix = corpus[index + 2]
            dict2_to_1[key].append(suffix)
    logging.debug("map_2_words_to_word results for \"sake jug\" = %s\n", dict2_to_1['sake jug'])
    return dict2_to_1

# Choosing a random word.

def random_word(corpus):
    """Return random word and syllable count from training corpus."""
    word = random.choice(corpus)
    num_syls = count_syllables(word)
    if num_syls > 4:
        return random_word(corpus) # Return the retry's result; without return, this branch yields None.
    else:
        logging.debug("random word & syllables = %s %s\n", word, num_syls)
        return (word, num_syls)

# Applying the Markov models
def word_after_single(prefix, suffix_map_1, current_syls, target_syls):
    """Return all acceptable words in a corpus that follow a single word."""
