# Verbalizing Hillary

The goals of this notebook are:

- Construct an algorithm to verbalize the Hillary dataset.
- Understand the distribution of non-verbalized symbols such as numbers and special characters.

In [1]:
import re
import sys

# Set up the "PYTHONPATH" so that ``src`` can be imported
sys.path.insert(0, '../../')

from src.datasets.lj_speech import _iterate_and_replace

In [2]:
from src.datasets import hillary_dataset

data, _ = hillary_dataset(directory='../../data')
'Training rows: %s' % len(data)

No config for `hillary.hillary_dataset` (`src.datasets.hillary.hillary_dataset`)
100%|██████████| 10067/10067 [04:07<00:00, 40.73it/s]


'Training rows: 8054'

In [3]:
import random
import os

from IPython.display import Audio
from IPython.display import Markdown
from IPython.display import FileLink
from IPython.display import display

def find_examples(regex, display_n=5, load_audio=False, replace=True, group=1, context=50):
    """ Print ``display_n`` examples of ``regex`` in the Hillary dataset.

    This function is the workhorse of our data analysis: it lets us query the dataset with a regex
    and retrieve matching samples.

    Note: when ``replace`` is true, the matched characters are overwritten in ``data`` itself, so
    later calls no longer match text that an earlier call has already masked.

    Args:
        regex (str or compiled pattern): Pattern or compiled regex object.
        display_n (int or None, optional): Number of examples to display; ``None`` displays all.
        load_audio (bool, optional): Whether to load audio.
        replace (bool, optional): Whether to replace the matched characters with ``X`` characters.
        group (int, optional): Group to select in regex.
        context (int, optional): Number of characters to include on the left and right of the matched
            text as context.
        
    Returns:
        None
    """
    examples = []
    for row in data:
        matches = re.finditer(regex, row['text'])
        for match in matches:
            # Skip empty matches (for example, ``.*`` matching at the end of a string).
            if match.start(group) == match.end(group):
                continue
            
            text = row['text']    
            start = match.start(group)
            end = match.end(group)
            start_context = max(start - context, 0)
            end_context = min(end + context, len(text))
            
            if replace:
                row['text'] = '{}{}{}'.format(text[:start],
                                              'X' * (end - start), 
                                              text[end:])
                
            if start != 0 or end != len(row['text']):
                text = '{}**{}**{}'.format(text[start_context:start],
                                           match.group(group),
                                           text[end:end_context])
            if start_context != 0:
                text = '…' + text
            if end_context != len(row['text']):
                text = text + '…'
                
            examples.append({
                'text': text,
                'audio': row['wav_filename']
            })
            
    # Print Examples
    display(Markdown('### Examples Captured by Regex'))
    display(Markdown('**Regex:** ' + str(regex)))
    display(Markdown('**Number of Examples:** ' + str(len(examples))))
    
    random.shuffle(examples)
    if display_n is not None:
        examples = examples[:display_n]
    
    display(Markdown('**Number of Examples Shown:** ' + str(len(examples))))
    display(Markdown('\n\n ___'))
    
    for example in examples:
        display(Markdown('**Text:** "' + example['text'] + '"'))
        display(FileLink(example['audio']))
        if load_audio:
            display(Audio(filename=str(example['audio'])))
        display(Markdown('\n\n ___'))
        display()
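
The `replace=True` behaviour of `find_examples` can be illustrated in isolation. The sketch below is not part of the notebook's code; `mask_matches` is a hypothetical helper that overwrites each match with the same number of `X` characters, so the text keeps its length and a later, more general regex no longer picks up what a more specific one has already claimed.

```python
import re


def mask_matches(text, regex, group=1):
    """Overwrite every non-empty match of ``group`` with 'X' characters of equal length."""

    def _mask(match):
        start, end = match.span(group)
        if start == -1:  # The group did not take part in this match.
            return match.group(0)
        whole = match.group(0)
        # Offsets of the selected group relative to the whole match.
        rel_start = start - match.start(0)
        rel_end = end - match.start(0)
        return whole[:rel_start] + 'X' * (rel_end - rel_start) + whole[rel_end:]

    return re.sub(regex, _mask, text)


text = 'On October 12, 2006, the survey of 2008 was published.'
masked = mask_matches(text, r'(\b20[0-9]{2}s?\b)')
print(masked)  # On October 12, XXXX, the survey of XXXX was published.

# A generic number regex now only sees what the year regex left behind.
print(re.findall(r'\b[0-9]+\b', masked))  # ['12']
```

Because the mask has the same length as the match, character offsets computed before masking remain valid afterwards, which is why the `XXXX` runs appear in the outputs of later cells.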

## Sample of the Dataset

In [4]:
find_examples(r'(?s).*', display_n=100, replace=False, group=0, load_audio=True)

### Examples Captured by Regex

**Regex:** (?s).*

**Number of Examples:** 8054

**Number of Examples Shown:** 100



 ___

**Text:** "as a result of biased attributions about leaders."



 ___

**Text:** "that create mechanisms for reporting wrongdoing"



 ___

**Text:** "and hold gently for one minute."



 ___

**Text:** "Seal programs usually require implementation of fair information practices as determined by the certification program"



 ___

**Text:** "researchers began to investigate problem solving separately in different natural knowledge domains -"



 ___

**Text:** "of their ability to sell themselves. They have fewer positive qualities than the other orientations because they are essentially empty."



 ___

**Text:** "To be a whistleblower takes bravery. Barry Adams explains the options  as, "The list of negative consequences to whistleblowing seems endless:"



 ___

**Text:** "conducted a survey"



 ___

**Text:** "This emphasized language and alignment of people within an organization to a common vision of the future of the organization,"



 ___

**Text:** "is expected to behave impersonally in regard to relationships with clients"



 ___

**Text:** "law-abiding Americans"



 ___

**Text:** "was not a single character trait, but rather, the total character organization from where many single character traits follow."



 ___

**Text:** "by robbing them of their capacity to reason."



 ___

**Text:** "I stared at the empty frames with a peculiar feeling that some mystery was about to be solved."



 ___

**Text:** "Responsive evaluation provides a naturalistic and humanistic approach to program evaluation."



 ___

**Text:** "by a deflector mounted to the registration wall. When the envelope is fed up through the gap"



 ___

**Text:** "Compliance data is defined as all data belonging or pertaining to enterprise"



 ___

**Text:** "and leader emergence. For instance, leadership performance"



 ___

**Text:** "Constructive leadership based companies engage in helping individuals to grow,"



 ___

**Text:** "work well with others, perform well, and achieve their goals with complementing hard skills."



 ___

**Text:** "Economic risks is the likelihood that economic management will cause drastic changes in a country's business environment"



 ___

**Text:** "creative individuals generate unique ideas"



 ___

**Text:** "not made. Current studies have indicated that leadership is much more complex"



 ___

**Text:** "Product forms include unidirectional tape, woven fabrics, or continuous filaments which can be layered or wrapped in various orientations."



 ___

**Text:** "to the end user, who is the customer.Corporate social irresponsibility in the supply chain"



 ___

**Text:** "was something pathetic in the girl's attitude now."



 ___

**Text:** "For example,"



 ___

**Text:** "collective efforts, and competition of many individuals."



 ___

**Text:** "These caves are blue holes, liquid time capsules where the past stares right back at you."



 ___

**Text:** "by the acronym SWOT and was "a major step forward"



 ___

**Text:** "have remained a mystery to most of the world."



 ___

**Text:** "debates about whether and when whistleblowing is permissible,"



 ___

**Text:** "Constructed in the Medieval Revival architectural style,"



 ___

**Text:** "as well as black Federal Reserve Seals and District numbers. The presses at your left are overprinting this important information,"



 ___

**Text:** "rather than strict adherence to finely honed strategic plans."



 ___

**Text:** "and have simplified, advantageous, or slightly different tax treatment."



 ___

**Text:** "Once confirmed, click the "next" button, like this. Great, now let's move on to step-"



 ___

**Text:** "A person,"



 ___

**Text:** "Verbal communication is the spoken or written conveyance of a message."



 ___

**Text:** "lines were now very"



 ___

**Text:** "economic definition of "sacrificing profits,"



 ___

**Text:** "and that the good thinker necessarily aims for styles of examination and appraisal that are analytical, abstract,"



 ___

**Text:** "In his 1962 ground breaking work Strategy and Structure,"



 ___

**Text:** "and would not become immediately apparent until the Enlightenment. By the 18th century and the Age of Enlightenment,"



 ___

**Text:** "However, only some of these CSR activities"



 ___

**Text:** "and social welfare are interdependent."



 ___

**Text:** "of how to draft privacy policies.
The United States does not have a specific federal regulation"



 ___

**Text:** "called gamma-ray bursts, and there is nothing more powerful."



 ___

**Text:** "individuals have to include dividends in their income when they complete their personal tax returns,"



 ___

**Text:** "boost productivity in the manufacture of pins. While individuals could produce 200 pins per day,"



 ___

**Text:** "in the Tirukkural, a Tamil book"



 ___

**Text:** "and across industries renders their information search less costly than for clients."



 ___

**Text:** "However, traditional Aboriginal conversational interaction is "communal", broadcast to many people,"



 ___

**Text:** "One of the top honeymoon destinations in the word, Maui's alluring beaches and immaculate resorts"



 ___

**Text:** "and ask questions. These acts may take many forms,"



 ___

**Text:** "arise mysteriously from the unconscious mind while the conscious mind is occupied on other tasks."



 ___

**Text:** "Italian rancho was a bachelor establishment."



 ___

**Text:** "last one I knew was an overseer."



 ___

**Text:** "The rhythm of the rod carries your body, mind, and spirit to the water. Whether you catch a fish or not, the water will always give you"



 ___

**Text:** "in order simplify the work for line managers."



 ___

**Text:** "to draw conclusions about the quality of critical thinking."



 ___

**Text:** "Spectacle cobras can grow upto more than 7 feet in length. Their venom attacks the nerves"



 ___

**Text:** ""Non-verbal behaviours may form a universal language system.""



 ___

**Text:** "And while prying doors is still an important skill that every firefighter must master, the alarming rise in crime in this country,"



 ___

**Text:** "or simply for escaping the world"



 ___

**Text:** "offer opportunities to engage in continuous professional development, and foster an environment,"



 ___

**Text:** "September 15, 1963, the Sixteenth Street Baptist Church in Birmingham Alabama -"



 ___

**Text:** "and the person must then figure out and formulate consciously what the mindbrain has already solved."



 ___

**Text:** "as either consumer cooperatives or worker cooperatives. Cooperatives are fundamental to the ideology"



 ___

**Text:** "A functionalist would say that any mental state--sadness, ecstasy, guilt, boredom--"



 ___

**Text:** "So many diamonds were found that miners could swim in them like large ponds of water. Suddenly, the price of diamonds plunged."



 ___

**Text:** "that do not depend on acquired knowledge: they include common sense, the ability to deal with people,"



 ___

**Text:** "With respect to the latter and most severe ramification"



 ___

**Text:** "people with borderline personality disorder react."



 ___

**Text:** "Such documents often have standard formats that are particular to the organization issuing the policy."



 ___

**Text:** "Lambert Castle was built in 1893 as the home of Catholina Lambert,"



 ___

**Text:** "Nuclear energy is the energy that is produced by a nuclear reaction."



 ___

**Text:** "or has no effect on the amount and quality of critical thinking in a course. There is some evidence to suggest a fourth,"



 ___

**Text:** "In the 2010s, there has been an increase in online management education and training"



 ___

**Text:** "including a basin wrench, used in cramped spaces."



 ___

**Text:** "When you look up at the night sky, you are looking at the ultimate history book -"



 ___

**Text:** ""Three minutes." The bombers had issued their warning; the countdown has"



 ___

**Text:** "and  place some soft, absorbent material behind the subject you are recording to minimize reflections."



 ___

**Text:** "and are responses to strategic questions about how the organization will compete,"



 ___

**Text:** "within and between different family members or groups."



 ___

**Text:** "In addition, top-level managers play a significant role in the mobilization of outside resources."



 ___

**Text:** "are mathematical techniques"



 ___

**Text:** "are also selective in their processing speed Martindale argues that in the creative process,"



 ___

**Text:** "marketing measures effectiveness with sales; guerrilla marketing, and profits."



 ___

**Text:** "But Network News is different. You've got to have a strong sense of style, of who you are,"



 ___

**Text:** "and the marketing of corporations' ethics policies."



 ___

**Text:** "is to account for the tension between predicting the creative profile of an individual, as characterised by the psychometric approach,"



 ___

**Text:** "storage and retrieval of information, strategic choice, strategic outcome and feedback."



 ___

**Text:** "about 250 micrograms"



 ___

**Text:** "mass media companies such as cable television networks,"



 ___

**Text:** "Positive affect makes additional cognitive material available for processing,"



 ___

**Text:** "and spending time with the one you love.

Whether it's hiking to one of East Maui's spectacular waterfalls,"



 ___

**Text:** "of language are governed"



 ___

**Text:** "Paper-based indices involve one or more of a variety of methods of responding."



 ___

**Text:** "Close beside him gleamed the white fangs of the wolf-dog."



 ___

## Odd Characters

In [182]:
characters = set()
for row in data:
    characters.update(list(row['text']))
characters = sorted(list(characters))
characters

['\t',
 '\n',
 ' ',
 '!',
 '"',
 '#',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 'A',
 'B',
 'C',
 'D',
 'E',
 'F',
 'G',
 'H',
 'I',
 'J',
 'K',
 'L',
 'M',
 'N',
 'O',
 'P',
 'Q',
 'R',
 'S',
 'T',
 'U',
 'V',
 'W',
 'X',
 'Y',
 'Z',
 '[',
 ']',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z',
 '~',
 '\xa0',
 '®',
 '°',
 'Ü',
 'á',
 'ç',
 'é',
 'í',
 'ö',
 'ø',
 'ρ',
 '–',
 '—',
 '‘',
 '’',
 '“',
 '”',
 '•',
 '…',
 '€',
 '−']
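
The inventory above shows which characters occur, but not how often. A minimal sketch of a frequency count follows, assuming that ASCII letters and spaces need no verbalization; `symbol_counts` is a hypothetical helper, and the two sample strings are taken from outputs in this notebook rather than from the full dataset.

```python
from collections import Counter
import string


def symbol_counts(texts):
    """Count every character that is not an ASCII letter or a space."""
    spoken_as_is = set(string.ascii_letters + ' ')
    counts = Counter()
    for text in texts:
        counts.update(char for char in text if char not in spoken_as_is)
    return counts


# With the notebook's data this would be ``symbol_counts(row['text'] for row in data)``.
sample_texts = ['of 350 Mln. € and an export', 'about 250 micrograms']
counts = symbol_counts(sample_texts)
for char, count in counts.most_common():
    print(repr(char), count)
```

Sorting by frequency helps decide which symbols deserve a dedicated rule and which are rare enough to handle by hand.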

In [177]:
find_examples(r'€', display_n=10, replace=False, group=0)

### Examples Captured by Regex

**Regex:** €

**Number of Examples:** 1

**Number of Examples Shown:** 1



 ___

**Text:** "of 350 Mln. **€** and an export"



 ___

## Numbers

In [161]:
find_examples(r'\S*(\d+)\S*', display_n=10, replace=False, group=0, load_audio=True)

### Examples Captured by Regex

**Regex:** \S*(\d+)\S*

**Number of Examples:** 398

**Number of Examples Shown:** 10



 ___

**Text:** "both below and above IQ's of **120.** Preckel et al., investigating fluid intelligence …"



 ___

**Text:** "and incorporates the conventional **5WH** approach, with a systematic process of investigat…"



 ___

**Text:** "such as ISO/IEC **27002.**  The International Electrotechnical Commission"



 ___

**Text:** "only about 3,000--most in their 50s and **60s--remain,**"



 ___

**Text:** "… are asked to connect all 9 dots in the 3 rows of **3** dots"



 ___

**Text:** "…which is located at the end of a dusty road about **35** miles south of Reno."



 ___

**Text:** "Guerrilla marketing differs in twelve ways:

**1.** Traditional marketing uses as big a budget as pos…"



 ___

**Text:** "…imeters of rainfall annually and a record high of **134°F),** and"



 ___

**Text:** "Between 300,000 and **500,000** residents of this dangerously poised,"



 ___

**Text:** "…XXXX, John Kotter studied the daily activities of **15** executives"



 ___

## Bible Line Numbers

In [5]:
regex = r'\b([0-9]+:[0-9\-]+)\b'
find_examples(regex, replace=True)

### Examples Captured by Regex

**Regex:** \b([0-9]+:[0-9\-]+)\b

**Number of Examples:** 6

**Number of Examples Shown:** 5



 ___

**Text:** "…mon's wisdom in Proverbs **15:1** includes: "…"



 ___

**Text:** "…eness."; 1 Thessalonians **5:14** dictates: "…"



 ___

**Text:** "…ar in the Bible. 1 Peter **4:8-9** advises: "…"



 ___

**Text:** "…and in Galatians **6:2** encourages: "Bear one an…"



 ___

**Text:** "…imilar lines in Proverbs **16:21** includes: "…"



 ___

## Ordinals

In [6]:
regex = r'([0-9]+(st|nd|rd|th))'
find_examples(regex, replace=True)

### Examples Captured by Regex

**Regex:** ([0-9]+(st|nd|rd|th))

**Number of Examples:** 47

**Number of Examples Shown:** 5



 ___

**Text:** "…ariously from 300 BCE to **7th** century CE and attribute…"



 ___

**Text:** "…On July **2nd**, the Continental Congres…"



 ___

**Text:** "…Most theories in the **20th** century argued that grea…"



 ___

**Text:** "… English as early as the **14th** century,…"



 ___

**Text:** "…y that grew up under our **20th** century noses,…"



 ___

## Money (Dollars or Pounds) 

In [7]:
regex = r'(\S*([$£]{1}[0-9\,\.]+\b))'
find_examples(regex, replace=True)

### Examples Captured by Regex

**Regex:** (\S*([$£]{1}[0-9\,\.]+\b))

**Number of Examples:** 13

**Number of Examples Shown:** 5



 ___

**Text:** "… estimated that by 2010, **$1.2** trillion…"



 ___

**Text:** "…that provided USD **$1,500** in tax credits as well a…"



 ___

**Text:** "…which has cost nearly **$4** million…"



 ___

**Text:** "…of about **NZ$150** to NZ$180 Million…"



 ___

**Text:** "…have an annual bill of **$80** billion.  Aside from…"



 ___

## Year

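Here we experiment with verbalizing years. The patterns cover years and decades in the 2000s and the 1000s, plus numbers followed by an era marker (BCE, BC, AD, or CE).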

In [16]:
regexes = [r'(\b20[0-9]{2}s?\b)',
           r'(\b1[0-9]{3}s?\b)',
           r'\b([0-9]+)\b \b(?:BCE|BC|AD|CE)\b']
for regex in regexes:
    find_examples(regex, replace=True, display_n=20)

### Examples Captured by Regex

**Regex:** (\b20[0-9]{2}s?\b)

**Number of Examples:** 97

**Number of Examples Shown:** 20



 ___

**Text:** "…A **2009** study found that whistle…"



 ___

**Text:** "…A **2006** study found that the UK …"



 ___

**Text:** "… Blowers Protection Act, **2011** has received…"



 ___

**Text:** "…to unlock the key to a **2012** prophecy.…"



 ___

**Text:** "…g citizens of the EU. In **2001** the United States Depart…"



 ___

**Text:** "…or a Strong Economy Act, **2002** is…"



 ___

**Text:** "…incidents like the **2013** Savar building collapse,…"



 ___

**Text:** "…From **2008**, Assessment and Qualific…"



 ___

**Text:** "…nturies--all pointing to **2012** as…"



 ___

**Text:** "…omains, was described in **2005** and…"



 ___

**Text:** "…he horse meat scandal of **2013** in the United Kingdom…"



 ___

**Text:** "…Doris Sims, December **2009**. Research-based writing …"



 ___

**Text:** "…also modified theirs for **2008**. Many examinations for u…"



 ___

**Text:** "…owing policy in November **2015** that all NHS organizatio…"



 ___

**Text:** "…of actual trysts. And in **2005**, HBO viewers got an exte…"



 ___

**Text:** "…the spotlight during the **2008** presidential election.…"



 ___

**Text:** "…ess leaders worldwide in **2008**. The survey found out th…"



 ___

**Text:** "…As of **2009**, sixteen academic journa…"



 ___

**Text:** "…On October 12, **2006**, the U.S. Small Business…"



 ___

**Text:** "…ormation Technology Act, **2008** made significant changes…"



 ___

### Examples Captured by Regex

**Regex:** (\b1[0-9]{3}s?\b)

**Number of Examples:** 198

**Number of Examples Shown:** 20



 ___

**Text:** "…in **1968**, following initial work …"



 ___

**Text:** "… many Nevada counties in **1972**, the state has become a …"



 ___

**Text:** "…and then--August XXXX, **1968**. It is a sadly poetic cy…"



 ___

**Text:** "…until the **1980s**, when deregulation and a…"



 ___

**Text:** "…in the XXXXX and **1980s**. Lack of leadership is m…"



 ___

**Text:** "…in **1911**. In 1912 Yoichi Ueno int…"



 ___

**Text:** "…tive on strategy, as the **1970s** paradigm was the pursuit…"



 ___

**Text:** "…In **1909**, a man named Charles Her…"



 ___

**Text:** "…in the late **1980s** and early 1990s, possibl…"



 ___

**Text:** "…In **1978**, Derek F. Abell describe…"



 ___

**Text:** "…on March 9, **1731**.

The Alvarez Travieso a…"



 ___

**Text:** "…Administration degree in **1921**. People like Henri Fayol…"



 ___

**Text:** "…In **1973**, Mintzberg found that se…"



 ___

**Text:** "…chise from bankruptcy in **1999** and kept the team from r…"



 ___

**Text:** "…rabi dates back to about **1772** BC for example,…"



 ___

**Text:** "…The **1980s** were tragic years that s…"



 ___

**Text:** "…By mid-**1940**, the German Army had con…"



 ___

**Text:** "…exercise equipment since **1989**. Everyday, millions of p…"



 ___

**Text:** "…In **1938**, a Swiss chemist named A…"



 ___

**Text:** "…Additionally, during the **1980s** statistical advances all…"



 ___

### Examples Captured by Regex

**Regex:** \b([0-9]+)\b \b(?:BCE|BC|AD|CE)\b

**Number of Examples:** 4

**Number of Examples Shown:** 4



 ___

**Text:** "… on the XXXX of December **37** AD…"



 ___

**Text:** "…in **350** BC, and Alfarabi listed …"



 ___

**Text:** "…dated variously from **300** BCE to XXX century CE an…"



 ___

**Text:** "…anicus was sidelined. In **54** AD, Agrippina decided th…"



 ___
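
Once a year has been located, it still has to be spelled out. The sketch below shows one possible reading convention (pairs of digits, "oh" for a zero in the tens place, "two thousand nine" style for 2000 to 2009); the notebook does not prescribe a convention, so `two_digits` and `verbalize_year` are hypothetical helpers and the choices are assumptions.

```python
import re

ONES = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine',
        'ten', 'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen', 'sixteen',
        'seventeen', 'eighteen', 'nineteen']
TENS = ['', '', 'twenty', 'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety']

YEAR = r'(1[0-9]|20)([0-9]{2})(s?)'


def two_digits(n):
    """Spell out an integer in the range 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ('-' + ONES[ones] if ones else '')


def verbalize_year(token):
    """Spell out a year from 1000 to 2099, with an optional decade suffix such as '1980s'."""
    match = re.fullmatch(YEAR, token)
    if match is None:
        raise ValueError('Not a supported year: %r' % token)
    high, low, plural = int(match.group(1)), int(match.group(2)), match.group(3)
    if low == 0:
        # 2000 -> "two thousand", 1900 -> "nineteen hundred"
        words = (two_digits(high // 10) + ' thousand' if high % 10 == 0
                 else two_digits(high) + ' hundred')
    elif low < 10 and high % 10 == 0:
        words = two_digits(high // 10) + ' thousand ' + ONES[low]  # 2009
    elif low < 10:
        words = two_digits(high) + ' oh ' + ONES[low]  # 1909
    else:
        words = two_digits(high) + ' ' + two_digits(low)  # 1968, 2012
    if plural:
        # "eighty" -> "eighties", "hundred" -> "hundreds"
        words = words[:-1] + 'ies' if words.endswith('y') else words + 's'
    return words


sentence = re.sub(r'\b' + YEAR + r'\b',
                  lambda m: verbalize_year(m.group(0)),
                  'In 1978, Derek F. Abell')
print(sentence)  # In nineteen seventy-eight, Derek F. Abell
```

Years followed by an era marker, such as "350 BC", fall outside this pattern and would need a separate cardinal-number rule.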

## Fractions

In [62]:
find_examples(r'\b(\d?-?\d+/\d+)\b', replace=True, group=0)

### Examples Captured by Regex

**Regex:** \b(\d?-?\d+/\d+)\b

**Number of Examples:** 2

**Number of Examples Shown:** 2



 ___

**Text:** "…pport this by offering a **48/52** pay option…"



 ___

**Text:** "…example was the infamous **9/11** attacks were labeled as …"



 ___

## Percent

In [29]:
find_examples(r'\b(\d+\b\%)', replace=True)

### Examples Captured by Regex

**Regex:** \b(\d+\b\%)

**Number of Examples:** 34

**Number of Examples Shown:** 5



 ___

**Text:** "…suggest that only around **65%** of companies are fully c…"



 ___

**Text:** "…and only **25%** from technical skills. H…"



 ___

**Text:** "…of the overall turnover, **75%** within the EU and 25% ou…"



 ___

**Text:** "…written succession plan; **38%** have an informal, unwrit…"



 ___

**Text:** "…and the remaining **52%** do not have any successi…"



 ___

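## Other

Other numeric patterns found in the dataset, with the intended verbalization in parentheses where known: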

```
Article: "article 362(1)" (article three six two one)
Lists: "1),", "1).", "1." (one, one)
Phone Number: 1-888-Comcast (one eight eight eight comcast)
ISO: "ISO 19600" (nineteen six zero zero), "ISO/IEC 27002"
Volume: -6db (negative six db)
Contracted Year: '80s (eighties)
Plane: 747s (seven four seven)
Range: "10-14" (ten to fourteen)
Compound Word: "12-by-21", "16- to 18-year-olds", "1,500-year-old", "60s--remain,"
"5-HTTLPR", "F-18", "trans-2-methylcyclopentanol"
Succession: "3-2-1" (three two one)
Acronym & Section Number: 3DX (three DX), "34A", "Section #3:"
Money: $1.2 trillion
Line Number: "Proverbs 15:1"
Ordinal: "7th century CE"
Temperature: "134°F),"
Math: "r = .4", "~120"
Other: "AE1.5:", "321ater", "321asepties", "ADRA2b", "328(3)", "336(2)(d)", "22575-22579", "350-100-1", "AS 3806 - Compliance Program"
```
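
Several of these categories overlap, so the order in which patterns are tried matters: "4:8-9" should be read as a line number before a range rule sees "8-9". The sketch below tries the notebook's regexes in a fixed order and returns the first label that matches. The `RULES` table, its labels, and `classify` are assumptions made for illustration, not the notebook's algorithm.

```python
import re

# (label, pattern) pairs; earlier entries win. Patterns mirror the sections above.
RULES = [
    ('line_number', r'\b[0-9]+:[0-9\-]+\b'),
    ('ordinal', r'[0-9]+(?:st|nd|rd|th)'),
    ('money', r'[$£][0-9,.]+\b'),
    ('percent', r'\b\d+\b%'),
    ('fraction', r'\b\d?-?\d+/\d+\b'),
    ('year', r'\b(?:20[0-9]{2}|1[0-9]{3})s?\b'),
]


def classify(token):
    """Return the first matching label, 'other' for unmatched digits, else None."""
    for label, pattern in RULES:
        if re.search(pattern, token):
            return label
    return 'other' if re.search(r'\d', token) else None


for token in ['4:8-9', '7th', 'NZ$150', '65%', '48/52', '1980s', '401K', 'plan']:
    print(token, '->', classify(token))
```

Tokens that fall through to `'other'`, such as "401K", are the ones that still need a rule or manual review. Note that "9/11" would be labelled a fraction here, which shows that a label is only a first guess at the correct reading.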

In [136]:
find_examples(r'\S*(\d+)\S*', display_n=5, replace=False, group=0)

### Examples Captured by Regex

**Regex:** \S*(\d+)\S*

**Number of Examples:** 398

**Number of Examples Shown:** 5



 ___

**Text:** "…XXXX, one study estimated that XXX of the Fortune **500** companies"



 ___

**Text:** "… information on the funds available in the ALLTEL **401K** plan."



 ___

**Text:** "before they left Cuatitlan , November **8,** XXXX.  At Saltillo in the"



 ___

**Text:** "that 3- to **4-year-old** children can discern, to some extent, the differe…"



 ___

**Text:** "the **23-year-old** Brigadier General -- George Armstrong Custer."



 ___