# Set Up

## Load the desired packages

Possible modules to work with:
* `ipa_rhyming` - https://github.com/AlinaRechina/ipa_rhyming
* `pronouncing` - https://github.com/aparrish/pronouncingpy
* `fuzzywuzzy` - https://github.com/seatgeek/fuzzywuzzy

We will be working with `pronouncing`.

In [1]:
import os
from collections import Counter
import pronouncing
import fuzzywuzzy
import ipa_rhyming

In [2]:
# read in the functions
%run functions.ipynb

## Set up the corpora

In [3]:
SEUSS_FOLDER = '../data/seuss_corpus_files'
COMPARISON_FOLDER = '../data/comparison_corpus_files'

In [4]:
# List of Dr. Seuss texts
seuss_files = [fname for fname in os.listdir(SEUSS_FOLDER) if fname.endswith('.txt')]
print(f'There are {len(seuss_files)} Seuss files')

# List of comparison texts
comparison_files = [fname for fname in os.listdir(COMPARISON_FOLDER) if fname.endswith('.txt')]
print(f'There are {len(comparison_files)} comparison corpus files')

There are 25 Seuss files
There are 25 comparison corpus files


## Example Story: Hop on Pop
This example shows what our stories look like and what kind of analysis we can and will do.

In [5]:
with open(f'{SEUSS_FOLDER}/hop-on-pop.txt') as f:
    hop_on_pop = f.read()

# lowercase, remove basic punctuation, and split on whitespace
chars_to_strip = ',.?!'
rdict = str.maketrans('', '', chars_to_strip)
hop_tokens = hop_on_pop.lower().translate(rdict).split()
hop_types = sorted(set(hop_tokens))  # unique word forms


print(f'Hop on Pop has {len(hop_types)} types and {len(hop_tokens)} tokens.')

Hop on Pop has 142 types and 389 tokens.
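
To make the type/token distinction concrete, here is a minimal sketch on a made-up sentence (not text from the corpus). It applies the same cleaning steps as the cell above and uses `Counter`, which is already imported, to count how often each type occurs. The names `sample`, `sample_tokens`, `sample_types`, and `sample_counts` are illustrative only.

```python
from collections import Counter

# Made-up sentence for illustration; not taken from the corpus files.
sample = "The cat sat. The cat ran! Did the cat nap?"

# Same cleaning as above: lowercase, delete punctuation, split on whitespace.
strip_table = str.maketrans('', '', ',.?!')
sample_tokens = sample.lower().translate(strip_table).split()

# Tokens are all running words; types are the distinct word forms.
sample_counts = Counter(sample_tokens)
sample_types = sorted(sample_counts)

print(f'{len(sample_types)} types and {len(sample_tokens)} tokens')  # 6 types and 10 tokens
print(sample_counts['cat'])  # 3
```

Dividing the number of types by the number of tokens gives a rough measure of how repetitive a text is, which is one kind of comparison these counts make possible.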


In [6]:
pronouncing.rhymes("beaches")

["beach's", 'impeaches', 'peaches', 'reaches', 'speeches', 'teaches']

### Create a dictionary of rhymes for Hop on Pop
This dictionary maps each word type in the poem to the list of other words in that same poem that rhyme with it. Later, we will extend the `pronouncing` package to cover words made up by Dr. Seuss.

In [7]:
rhymes_dict = {}
hop_type_set = set(hop_types)
for htype in hop_types:
    # every word the pronouncing dictionary lists as rhyming with htype
    type_rhymes = set(pronouncing.rhymes(htype))
    # keep only the rhymes that also occur in Hop on Pop
    hop_rhymes = hop_type_set.intersection(type_rhymes)
    rhymes_dict[htype] = list(hop_rhymes)

print(rhymes_dict['pop'])
print(rhymes_dict['stop'])

['hop', 'top', 'stop']
['hop', 'pop', 'top']
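
The sketch below illustrates the general idea behind rhyme lookup on CMUdict-style pronunciations, and how a made-up word could be given a pronunciation so that it takes part in rhymes. It is a toy, not the `pronouncing` implementation: the dictionary `toy_phones` is hand-written, the helpers `toy_rhyming_part` and `toy_rhymes` are hypothetical names, and `zop` is an invented word with a guessed pronunciation.

```python
# Hand-written ARPAbet pronunciations in the CMUdict style. This small table is
# an assumption for illustration; it is not loaded from the pronouncing package.
toy_phones = {
    'hop': 'HH AA1 P',
    'pop': 'P AA1 P',
    'stop': 'S T AA1 P',
    'cup': 'K AH1 P',
    'pup': 'P AH1 P',
    'zop': 'Z AA1 P',  # hypothetical made-up word with a guessed pronunciation
}

def toy_rhyming_part(phones):
    """Return the phones from the last stressed vowel to the end of the word."""
    parts = phones.split()
    for i in range(len(parts) - 1, -1, -1):
        if parts[i][-1] in '12':  # a trailing 1 or 2 marks a stressed vowel
            return ' '.join(parts[i:])
    return phones  # no stressed vowel found; fall back to the whole word

def toy_rhymes(word):
    """Other words in toy_phones whose rhyming part matches that of `word`."""
    target = toy_rhyming_part(toy_phones[word])
    return sorted(w for w, p in toy_phones.items()
                  if w != word and toy_rhyming_part(p) == target)

print(toy_rhymes('pop'))  # ['hop', 'stop', 'zop']
print(toy_rhymes('cup'))  # ['pup']
```

Once the invented word has an entry, it rhymes with `pop` and `hop` like any dictionary word, which is the behavior we want for Dr. Seuss's coinages.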


---
# Tokenize All Texts
### Dr. Seuss Corpus Tokens List: `seuss_corpus_tokens`

In [8]:
# read in all seuss files
seuss_corpus = []
for file in seuss_files:
    path = SEUSS_FOLDER+'/'+file
    with open(path) as f:
        seuss_corpus.append(f.read())

In [9]:
# tokenize each file
seuss_corpus_tokens = []
chars_to_strip = ',.\xa0:-()\';$"/?][!`Ą@Ś§¨’–“”…ï‘>&\\%˝˘*'

for file in seuss_corpus:
    seuss_file_tokens = tokenize(file.lower(), stripchars=chars_to_strip)
    # seuss_file_types = sorted(list(set(seuss_file_tokens)))
    seuss_corpus_tokens.append(seuss_file_tokens)
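
The `tokenize` function used above is loaded from `functions.ipynb` and is not shown in this notebook. As a rough guide to what such a helper might do, here is a minimal sketch that assumes it mirrors the Hop on Pop cell: delete every character in `stripchars`, then split on whitespace. The name `tokenize_sketch` is hypothetical, and the real function may differ.

```python
def tokenize_sketch(text, stripchars=',.?!'):
    """Hypothetical stand-in: delete each character in stripchars, then split on whitespace."""
    strip_table = str.maketrans('', '', stripchars)
    return text.translate(strip_table).split()

print(tokenize_sketch('stop! you must not hop on pop.'))
# ['stop', 'you', 'must', 'not', 'hop', 'on', 'pop']

# Under this assumption, deleting a hyphen joins the two halves of a word.
print(tokenize_sketch('sing-song', stripchars='-'))
# ['singsong']
```

If the helper does work by deletion, characters that separate words, such as hyphens or non-breaking spaces, will merge their neighbors into one token, which is worth keeping in mind when reading the type counts.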

### Comparison Story/Poems Corpus Tokens List: `comp_corpus_tokens`

In [10]:
# read in all comp files
comp_corpus = []
for file in comparison_files:
    path = COMPARISON_FOLDER+'/'+file
    with open(path) as f:
        comp_corpus.append(f.read())

In [11]:
# tokenize each file
comp_corpus_tokens = []
chars_to_strip = ',.\xa0:-()\';$"/?][!`Ą@Ś§¨’–“”…ï‘>&\\%˝˘*'

for file in comp_corpus:
    comp_file_tokens = tokenize(file.lower(), stripchars=chars_to_strip)
    # comp_file_types = sorted(list(set(comp_file_tokens)))
    comp_corpus_tokens.append(comp_file_tokens)
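
The Seuss and comparison corpora go through the same read-then-tokenize steps, so the two pairs of cells could be folded into one helper. The sketch below is one possible shape for it, not the code used in this notebook: `load_corpus_tokens` is a hypothetical name, it inlines the deletion-based tokenization assumed from the Hop on Pop cell, and it assumes the files are UTF-8. It runs on a throwaway folder so it does not need the real corpus.

```python
import os
import tempfile

def load_corpus_tokens(folder, stripchars):
    """Read every .txt file in folder and return {filename: list of tokens}."""
    strip_table = str.maketrans('', '', stripchars)
    tokens_by_file = {}
    for fname in sorted(os.listdir(folder)):
        if not fname.endswith('.txt'):
            continue
        # UTF-8 is an assumption; the real corpus files may use another encoding.
        with open(os.path.join(folder, fname), encoding='utf-8') as f:
            text = f.read().lower()
        tokens_by_file[fname] = text.translate(strip_table).split()
    return tokens_by_file

# Demo on a temporary folder containing one small made-up file.
with tempfile.TemporaryDirectory() as demo_folder:
    with open(os.path.join(demo_folder, 'demo.txt'), 'w', encoding='utf-8') as f:
        f.write('A big dog, a small dog!')
    demo_tokens = load_corpus_tokens(demo_folder, ',.?!')

print(demo_tokens)
# {'demo.txt': ['a', 'big', 'dog', 'a', 'small', 'dog']}
```

Keying the result by filename keeps each token list tied to its story, which a plain list of lists does not.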

Now that we have set up the necessary material for our project, we can begin our analysis!