# Verbalizing Hillary

The goals of this notebook are:

- Construct an algorithm to verbalize the Hillary dataset.
- Understand the distribution of non-verbalized symbols, such as numbers and special characters.

In [1]:
import re
import sys

# Add the parent directory to the module search path so that `src` can be imported
sys.path.insert(0, '../')

from src.datasets.lj_speech import _iterate_and_replace

In [2]:
from src.datasets import hillary_dataset

data, _ = hillary_dataset(directory='../data')

No config for `hillary.hillary_dataset` (`src.datasets.hillary.hillary_dataset`)
100%|██████████| 10528/10528 [00:00<00:00, 317987.36it/s]


In [3]:
import random
import os

from IPython.display import Audio
from IPython.display import Markdown
from IPython.display import FileLink
from IPython.display import display

def find_examples(regex, display_n=5, load_audio=False, replace=True, group=1, context=25):
    """ Display ``display_n`` examples of ``regex`` in the Hillary dataset (``data``).
    
    This is the workhorse function of our data analysis; it lets us query the dataset with a regex
    and retrieve matching samples.
    
    Args:
        regex (str): Pattern or compiled regex object.
        display_n (int or None, optional): Number of examples to display; ``None`` displays all.
        load_audio (bool, optional): If ``True``, embed an audio player for each example.
        replace (bool, optional): If ``True``, overwrite the matched characters in ``data`` with
            ``X`` characters.
        group (int, optional): Group to select in regex.
        context (int, optional): Number of characters to include on the left and right of the matched
            text as context.
        
    Returns:
        None
    """
    examples = []
    for row in data:
        matches = re.finditer(regex, row['text'])
        for match in matches:
            if match.start(group) == match.end(group):  # Skip empty matches
                continue
            
            text = row['text']    
            start = match.start(group)
            end = match.end(group)
            start_context = max(start - context, 0)
            end_context = min(end + context, len(text))
            
            if replace:
                row['text'] = '{}{}{}'.format(text[:start],
                                              'X' * (end - start), 
                                              text[end:])
                
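            # Highlight the match unless it spans the entire text
            if start != 0 or end != len(row['text']):
                text = '{}**{}**{}'.format(text[start_context:start],
                                           match.group(group),
                                           text[end:end_context])
            # Add an ellipsis only on the sides where the context was actually truncated
            if start_context != 0:
                text = '…' + text
            if end_context != len(row['text']):
                text = text + '…'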
                
            examples.append({
                'text': text,
                'audio': row['wav_filename']
            })
            
    # Print Examples
    display(Markdown('### Examples Captured by Regex'))
    display(Markdown('**Regex:** ' + str(regex)))
    display(Markdown('**Number of Examples:** ' + str(len(examples))))
    
    random.shuffle(examples)
    if display_n is not None:
        examples = examples[:display_n]
    
    display(Markdown('**Number of Examples Shown:** ' + str(len(examples))))
    display(Markdown('\n\n ___'))
    
    for example in examples:
        display(Markdown('**Text:** "' + example['text'] + '"'))
        display(FileLink(example['audio']))
        if load_audio:
            display(Audio(str(example['audio'])))
        display(Markdown('\n\n ___'))
        display()
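
The context-window logic in `find_examples` can be tried in isolation on a single string. The following is a minimal sketch, not part of the notebook: `highlight_match` is a hypothetical helper, and it assumes an ellipsis should appear only on the sides where the context was cut off.

```python
import re


def highlight_match(text, regex, group=0, context=25):
    """Return every non-empty match of ``regex`` in ``text``, wrapped in ``**``
    and surrounded by up to ``context`` characters on each side.

    Hypothetical helper for illustration; it mirrors the slicing in ``find_examples``.
    """
    snippets = []
    for match in re.finditer(regex, text):
        start, end = match.start(group), match.end(group)
        if start == end:  # Skip empty matches
            continue
        start_context = max(start - context, 0)
        end_context = min(end + context, len(text))
        snippet = '{}**{}**{}'.format(text[start_context:start],
                                      match.group(group),
                                      text[end:end_context])
        # Assumption: mark only the sides where text was actually truncated
        if start_context > 0:
            snippet = '…' + snippet
        if end_context < len(text):
            snippet = snippet + '…'
        snippets.append(snippet)
    return snippets


sample = 'Yet studies show that the energy efficiency of buildings could double by 2020,'
print(highlight_match(sample, r'\d+', context=10))
```

Unlike `find_examples`, this helper returns the snippets instead of displaying them and never modifies its input, which makes it easier to check a regex before running it over the whole dataset.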

## Sample of the Dataset

In [4]:
find_examples(r'(?s).*', display_n=100, replace=False, group=0, load_audio=False)

### Examples Captured by Regex

**Regex:** (?s).*

**Number of Examples:** 8422

**Number of Examples Shown:** 100



 ___

**Text:** "This includes government departments and agencies,"



 ___

**Text:** "Originally known as Pattaquonk Quarter, Chester was settled in"



 ___

**Text:** "And right there I saw and knew it all."



 ___

**Text:** "Their unusual looks and their untidy eating habits have given them a bad reputation."



 ___

**Text:** "at depths that our drills can reach, and from which we then derive commercial petroleum."



 ___

**Text:** "and 1970s saw the passage of the Fair Credit Reporting Act."



 ___

**Text:** "Critical thinking is the objective analysis of facts to form a judgment."



 ___

**Text:** "Agrippina decided the time was right to make a bid to put Nero"



 ___

**Text:** "creating a "unique and valuable  position""



 ___

**Text:** "mandates that websites collecting Personally Identifiable Information"



 ___

**Text:** "Active strategic management required active information gathering and active problem solving."



 ___

**Text:** "Policies can be understood as political, managerial, financial, and administrative mechanisms"



 ___

**Text:** "The mystery of their depths beckons the able and the foolhardy alike."



 ___

**Text:** "Solving problems sometimes involves dealing with pragmatics"



 ___

**Text:** "elements, that the whole edifice of business strategy was subsequently erected.""



 ___

**Text:** "Lastly, there are micro-domains. These are the specific tasks that reside within each domain"



 ___

**Text:** "she also discovers the empty
Tomb"



 ___

**Text:** "too little work has been done on what influences the quality of strategic decision making"



 ___

**Text:** "into certain trades, occupations or professions, that require special education or to raise revenue for local governments."



 ___

**Text:** "for the AS: "Credibility of Evidence" and "Assessing and Developing Argument". The full Advanced GCE is now available: in addition to the"



 ___

**Text:** "the outward and visible sign of the covenant America has made with the world ..."



 ___

**Text:** "suggested measuring the various parameters that encourage creativity and innovation:"



 ___

**Text:** "HBO viewers got an extended reservation to the Ranch"



 ___

**Text:** "but also how they feel about their relationship with the other individual."



 ___

**Text:** "11 sizzling editions.  Proving you can't get enough of a good thing, Cathouse Season Two returned in"



 ___

**Text:** "often as a way to promote better self-esteem, communication and social interaction."



 ___

**Text:** "Outwardly, he maintained a calm and smiling aspect."



 ___

**Text:** "Barriers to effective communication can retard or distort the message or intention of the message being conveyed."



 ___

**Text:** "Analogy: using a solution that solves an analogous problem"



 ___

**Text:** "Improper or inadequate caulking can cause several problems."



 ___

**Text:** "which impairs their social skills, and makes socialization unattractive."



 ___

**Text:** "the main one that must be made is deciding if you are ready to make a lifelong commitment.    When I say Lifelong…"



 ___

**Text:** "as set out in the book Three Laws of Performance."



 ___

**Text:** "If you're time-crunched and need to squeeze in your workouts, this program is for you."



 ___

**Text:** "Being aware of others' reactions and able to respond in an understanding manner."



 ___

**Text:** "4. Abusive Behavior."



 ___

**Text:** "Visual examination for flaws such as cracks, peeling, loose connections."



 ___

**Text:** "Incidentally, not all scientists agree on the allegedly peaceful nature of the bonobo"



 ___

**Text:** "First, let’s look at how to get the most from your virtual tour. See the 5 Main"



 ___

**Text:** "The discipline encompasses a range of topics,"



 ___

**Text:** "make it difficult to conduct international business." Moreover, it can be a risk for a company to operate in a country"



 ___

**Text:** "each of which resulted in a score for originality and fluency;"



 ___

**Text:** "in him should not perish, but have everlasting life.
17"



 ___

**Text:** "The Incans told story with dances and fire, aborigines told it with star and spear."



 ___

**Text:** "Point your microphone away from any noisy equipment,"



 ___

**Text:** "Contrariwise, more democratically inclined theorists have pointed to examples of"



 ___

**Text:** "The second half of the discussion will be about Pet Ownership."



 ___

**Text:** "they emphasize their open, fair, responsible, and pleasant communal qualities."



 ___

**Text:** "This era began the belief and support of self-regulation and free trade,"



 ___

**Text:** "According to Raymond Nickerson, one can see the consequences of confirmation bias in real-life situations,"



 ___

**Text:** "instantly into a sublimely romantic setting, complete with laughter, romance, the distant sounds of"



 ___

**Text:** "These dimensions are known as physical, informational, and cognitive."



 ___

**Text:** "Contrary to popular belief, Mother's Day was not conceived and fine-tuned in the boardroom of Hallmark."



 ___

**Text:** "Modular Selection and Identification for Control. New, hierarchically"



 ___

**Text:** "According to Barney, "formulation"



 ___

**Text:** "salmon swimming upstream against the current of power."



 ___

**Text:** "It would illuminate a"



 ___

**Text:** "could be nurtured by identifying young people"



 ___

**Text:** "While divergent thinking was associated with bilateral activation of the prefrontal cortex, schizotypal"



 ___

**Text:** "than the atoms in your right hand ... but you are literally a star child."



 ___

**Text:** "A purpose statement,"



 ___

**Text:** "may be considered to embody a rare premodern example of abstract theory of administration."



 ___

**Text:** "Readers unfortunately are left to conclude whether they comprise a redundancy."



 ___

**Text:** "Artists of the Romantic period thrived on emotion, imagination, and intuition."



 ___

**Text:** "is used in numerous disciplines, sometimes with different perspectives, visuals, and often with different terminologies."



 ___

**Text:** "After showing that the numbers of eminent relatives dropped off when his focus moved from first-degree to second-degree relatives,"



 ___

**Text:** "however, report misconduct to outside persons or entities."



 ___

**Text:** "It informs the client what specific information is collected,"



 ___

**Text:** "Over time, specialized peer-reviewed journals appeared,"



 ___

**Text:** "Actual or potential threat of adverse effects on living organisms and environment by effluents,"



 ___

**Text:** "do not change from situation to situation;"



 ___

**Text:** "Research and development refer to activities in connection with corporate or government innovation."



 ___

**Text:** "Kiechel wrote in 2010: "The experience curve was, simply, the most important concept in launching the strategy revolution..."



 ___

**Text:** "But this time it was Saxon who rebelled."



 ___

**Text:** "Specimens to be stuffed and mounted for the American Museum of Natural History."



 ___

**Text:** "that in the long term, the majority of workers must support management."



 ___

**Text:** "Mr. Lambert's dream was to build a home reminiscent of the castles in Great Britain"



 ___

**Text:** "Philip saw MacDougall soon after his short talk with Thorpe."



 ___

**Text:** "EI."



 ___

**Text:** "Yet studies show that the energy efficiency of buildings could double by 2020,"



 ___

**Text:** "Pros love Lipmix because it gives us the freedom to customize colour and texture."



 ___

**Text:** "And here is the script:"



 ___

**Text:** "Self-regulation – involves controlling or redirecting one's disruptive emotions and impulses"



 ___

**Text:** "Clubs and balls and cities grew to be only memories."



 ___

**Text:** "and knowledge economy-related sectors – especially information technology software and advertising."



 ___

**Text:** "Branches of management theory also exist relating to nonprofits and to government:"



 ___

**Text:** "focus effort, define or clarify the organization, and provide consistency or guidance in response to the environment,"



 ___

**Text:** "are concerns that the empathizer's own emotional background"



 ___

**Text:** "In the ‘second wave’ of critical thinking,"



 ___

**Text:** "An eye-popping, innovative new book series that brings you up close and personal
 with everything under the sun… and beyond the stars"



 ___

**Text:** "J. Duncan wrote the first college management-textbook in 1911."



 ___

**Text:** "It is also negatively correlated with poor health choices and behavior."



 ___

**Text:** "the understanding of the animal world in general, is a rapidly growing field, and even in the 21st"



 ___

**Text:** "Other online seal programs include the Trust Guard Privacy Verified program, eTrust,"



 ___

**Text:** "Firms engaging in international business"



 ___

**Text:** "for a good-faith report of a whistleblowing action"



 ___

**Text:** "preferably far away from noisy computers and other electrical equipment.  You will be surprised how much microphones can pick up."



 ___

**Text:** "from the motions of the stars to the behavior of nuclear particles;"



 ___

**Text:** "It was my reports from the north which chiefly induced people to buy."



 ___

**Text:** "and the Dodd-Frank Wall Street Reform and Consumer Protection Act."



 ___
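
As a first look at the second goal, the distribution of non-verbalized symbols can be approximated by counting digit runs and non-letter characters. The sketch below is illustrative only: `count_non_verbalized` is a hypothetical helper, the five texts are copied from the sample output above, and treating every character that is not an ASCII letter, digit, or whitespace as a "symbol" is an assumption rather than the notebook's definition.

```python
import re
from collections import Counter

# A handful of texts copied from the sample output above
samples = [
    'and 1970s saw the passage of the Fair Credit Reporting Act.',
    '4. Abusive Behavior.',
    'Yet studies show that the energy efficiency of buildings could double by 2020,',
    'J. Duncan wrote the first college management-textbook in 1911.',
    'the outward and visible sign of the covenant America has made with the world ...',
]


def count_non_verbalized(texts):
    """Count digit runs and special characters across ``texts``.

    Hypothetical helper; the character classes used here are assumptions.
    """
    counts = Counter()
    for text in texts:
        # Runs of digits, e.g. "1970" in "1970s", are tallied under one key
        counts['<number>'] += len(re.findall(r'\d+', text))
        # Every character that is not an ASCII letter, digit, or whitespace
        counts.update(re.findall(r'[^A-Za-z0-9\s]', text))
    return counts


counts = count_non_verbalized(samples)
for symbol, count in counts.most_common():
    print(repr(symbol), count)
```

On the real dataset, the same function could be applied to `row['text']` for every row in `data`. The resulting counts would suggest which symbols are frequent enough to deserve a dedicated verbalization rule.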