# Verbalizing Hillary

The goals of this notebook are:

- Construct an algorithm to verbalize the Hillary dataset.
- Understand the distribution of non-verbalized symbols such as numbers and special characters.
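
As a rough illustration of the second goal, the sketch below counts characters that would still need verbalizing. It is only a sketch under stated assumptions: `sample_rows`, `NON_VERBALIZED`, and `count_non_verbalized` are hypothetical names that are not part of this notebook, and the character class treats every character other than ASCII letters, whitespace, and common punctuation as non-verbalized.

```python
import re
from collections import Counter

# Hypothetical stand-in for the dataset rows; the real rows come from `hillary_dataset`.
sample_rows = [
    {'text': 'In 1982 the first single-authored books in the field appeared.'},
    {'text': 'advertising truthfulness and fairness in pricing & distribution.'},
    {'text': 'less than 4% of average household expenditure in the UK in 2010 was'},
]

# Assumption: anything that is not an ASCII letter, whitespace, or common
# punctuation counts as a symbol that still needs to be verbalized.
NON_VERBALIZED = re.compile(r"[^A-Za-z\s.,;:!?'\"-]")


def count_non_verbalized(rows):
    """Return a Counter of non-verbalized characters over all row texts."""
    counts = Counter()
    for row in rows:
        counts.update(NON_VERBALIZED.findall(row['text']))
    return counts


symbol_counts = count_non_verbalized(sample_rows)
print(symbol_counts.most_common())
```

Run against the full dataset, the most frequent entries would suggest which symbols to handle first.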

In [2]:
import re
import sys

# Set up the "PYTHONPATH"
sys.path.insert(0, '../../')

from src.datasets.lj_speech import _iterate_and_replace

In [4]:
from src.datasets import hillary_dataset

data, _ = hillary_dataset(directory='../../data')

No config for `hillary.hillary_dataset` (`src.datasets.hillary.hillary_dataset`)
100%|██████████| 9670/9670 [03:58<00:00, 33.86it/s]


In [5]:
import random
import os

from IPython.display import Audio
from IPython.display import Markdown
from IPython.display import FileLink
from IPython.display import display

def find_examples(regex, display_n=5, load_audio=False, replace=True, group=1, context=25):
    """ Print ``display_n`` examples of ``regex`` in the Hillary dataset (``data``).
    
    This is the bread and butter function for our data analysis, enabling us to use regex to query
    the dataset and retrieve samples.
    
    Args:
        regex (str or re.Pattern): Pattern string or compiled regex object.
        display_n (int or None, optional): Number of examples to display.
        load_audio (bool, optional): Whether to load audio.
        replace (bool, optional): Whether to overwrite the matched characters in ``data`` with ``X``
            characters.
        group (int, optional): Group to select in regex.
        context (int, optional): Number of characters to include on the left and right of the matched
            text as context.
        
    Returns:
        None
    """
    examples = []
    for row in data:
        matches = re.finditer(regex, row['text'])
        for match in matches:
            if match.start(group) - match.end(group) == 0:
                continue
            
            text = row['text']    
            start = match.start(group)
            end = match.end(group)
            start_context = max(start - context, 0)
            end_context = min(end + context, len(text))
            
            if replace:
                row['text'] = '{}{}{}'.format(text[:start],
                                              'X' * (end - start), 
                                              text[end:])
                
            if start != 0 or end != len(row['text']):
                text = '{}**{}**{}'.format(text[start_context:start],
                                           match.group(group),
                                           text[end:end_context])
            # Add an ellipsis only where the context window truncated the text.
            if start_context != 0:
                text = '…' + text
            if end_context != len(row['text']):
                text = text + '…'
                
            examples.append({
                'text': text,
                'audio': row['wav_filename']
            })
            
    # Print Examples
    display(Markdown('### Examples Captured by Regex'))
    display(Markdown('**Regex:** ' + str(regex)))
    display(Markdown('**Number of Examples:** ' + str(len(examples))))
    
    random.shuffle(examples)
    if display_n is not None:
        examples = examples[:display_n]
    
    display(Markdown('**Number of Examples Shown:** ' + str(len(examples))))
    display(Markdown('\n\n ___'))
    
    for example in examples:
        display(Markdown('**Text:** "' + example['text'] + '"'))
        display(FileLink(example['audio']))
        if load_audio:
            display(Audio(filename=str(example['audio'])))
        display(Markdown('\n\n ___'))
        display()
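
The context-window step inside `find_examples` can be tried in isolation. The sketch below is self-contained and runs on a toy string. `highlight_match` is a hypothetical helper, not part of this notebook. It assumes an ellipsis should appear only on a side where the window cut text off.

```python
import re


def highlight_match(text, start, end, context=25):
    """Bold ``text[start:end]`` and keep up to ``context`` characters on each side."""
    start_context = max(start - context, 0)
    end_context = min(end + context, len(text))
    snippet = '{}**{}**{}'.format(text[start_context:start],
                                  text[start:end],
                                  text[end:end_context])
    # Assumption: mark truncation only where the window actually cut text off.
    if start_context > 0:
        snippet = '…' + snippet
    if end_context < len(text):
        snippet = snippet + '…'
    return snippet


toy_text = 'In 1982 the first single-authored books in the field appeared.'
toy_match = re.search(r'\d+', toy_text)
print(highlight_match(toy_text, toy_match.start(), toy_match.end(), context=10))
```

With a small `context`, only the right-hand side is truncated here, so only a trailing ellipsis is added.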

## Sample of the Dataset

In [6]:
find_examples(r'(?s).*', display_n=100, replace=False, group=0, load_audio=True)

### Examples Captured by Regex

**Regex:** (?s).*

**Number of Examples:** 7736

**Number of Examples Shown:** 100



 ___

**Text:** "that begins with miles of quiet."



 ___

**Text:** "and relying on evidence and sound rationale."



 ___

**Text:** "This confuses the uniqueness that should be reserved for each by itself."



 ___

**Text:** "It was a miracle, and I owe you my life."



 ___

**Text:** "These dimensions are known as physical, informational, and cognitive."



 ___

**Text:** "it is important to stress what we do, and for who we do it for."



 ___

**Text:** "I may manage to freight a cargo back as well."



 ___

**Text:** "against TT. Wai et al. in a study using data from the longitudinal Study of Mathematically Precocious Youth –"



 ___

**Text:** "it is not another form or type of intelligence, but intelligence—"



 ___

**Text:** "This interactive DVD makes everything from assembly, maintenance, and usage a snap!"



 ___

**Text:** "All this day Gregson remained in the cabin."



 ___

**Text:** "Management consulting refers generally to the provision of business services,"



 ___

**Text:** "quantity, and appropriateness of participants responses to a variety of open-ended questions."



 ___

**Text:** "Whether you’re a senior, newcomer, pregnant, or prefer low impact,"



 ___

**Text:** "It was originally proposed as a measurement of a person's capacity to deal with people and social relationships."



 ___

**Text:** "However, Janssen and Vanhamme reported that less than 4% of average household expenditure in the UK in 2010 was"



 ___

**Text:** "transparency about environmental risks, transparency about product ingredients such as genetically modified organisms"



 ___

**Text:** "written by author"



 ___

**Text:** "Broken homes, childhood trauma, lack of parenting and many others factors can influence the connections in the brain"



 ___

**Text:** "Handsome, cocky, white,"



 ___

**Text:** "This action is usually framed by formal membership and form."



 ___

**Text:** "This test can be used when diagnosing autism spectrum disorders, including autism and Asperger syndrome."



 ___

**Text:** "the term became widely known with the publication of Goleman's book:"



 ___

**Text:** "offered in Critical Thinking in the UK, open to any A-level student regardless of whether they have the Critical Thinking A-level."



 ___

**Text:** "But, hey, go ahead and let everyone know that you have 2300 "friends." That's sure to convince people that you're really popular."



 ___

**Text:** "might only sustain high returns on their investment"



 ___

**Text:** "Creating shared value or CSV is based on the idea that corporate success"



 ___

**Text:** "gov which provides a single point of access to government services and information that help businesses comply with government regulations."



 ___

**Text:** "The dominant factors are usually identified as "the four Ps" — process, product, person, and place."



 ___

**Text:** "It is certainly difficult to understand an animal's intention behind an empathic response."



 ___

**Text:** "Understanding the manner of speaking within business in the local area to improve overall productivity"



 ___

**Text:** "The US Department of Labor, Occupational Health and Safety Administration was created by Congress"



 ___

**Text:** "Ethical marketing issues include marketing redundant or dangerous products/services"



 ___

**Text:** "When you choose"



 ___

**Text:** "The gene 5-HTTLPR seems to determine sensitivity to negative emotional information"



 ___

**Text:** "will show you how to decode those little black dots and, in a short time, you’ll be surprised how fluent you’ve become!"



 ___

**Text:** "Good leaders use their own inner mentors to energize their team and organizations"



 ___

**Text:** "Using essential oils for healing purposes is often called aromatherapy,"



 ___

**Text:** "and in Galatians 6:2 encourages: "Bear one another's burdens, and in this way you will fulfill the law of Christ.""



 ___

**Text:** "and Leadership Secrets of Attila the Hun by Wess Roberts."



 ___

**Text:** "which uses spoken and written words for expressing and transferring views and ideas."



 ___

**Text:** "They fought to belong, to survive, and to get ahead, as would all newcomers to America."



 ___

**Text:** "It is about "capturing what the manager learns from all sources"



 ___

**Text:** "but this may be understated.  While multinational consultancy firms provide advice on major projects"



 ___

**Text:** "Some research suggests that people are more able and willing to empathize with those most similar to themselves."



 ___

**Text:** "The Lighthouse Board"



 ___

**Text:** "advertising truthfulness and fairness in pricing & distribution."



 ___

**Text:** "As a result of their exposure to, and relationships"



 ___

**Text:** "Barbara Fredrickson in her broaden-and-build model suggests that positive emotions such as joy and love"



 ___

**Text:** "offers a plan and an online support system"



 ___

**Text:** "I have to be careful of them, as they tear very easily."



 ___

**Text:** "In 1982 the first single-authored books in the field appeared."



 ___

**Text:** "and found that 27 had no word which directly translated to 'creativity'. The principle of linguistic relativity,"



 ___

**Text:** "most certainly the most dramatic in NCAA Tournament history."



 ___

**Text:** "That's what this Fastbreak Basketball video is for."



 ___

**Text:** "or appear in marketing."



 ___

**Text:** "for decentralized decision-making"."



 ___

**Text:** "depending on the information's severity and nature, whistleblowers may report the misconduct to lawyers, the media,"



 ___

**Text:** "The objectives that an organization might wish to pursue"



 ___

**Text:** "This interface will let you negotiate a realistic visual world - without bumping into walls!"



 ___

**Text:** "For any organization, place, or function, large or small, safety is a normative concept."



 ___

**Text:** "In a simple model, often referred to as the transmission model or standard view of communication,"



 ___

**Text:** "advanced forms of empathy in humans"



 ___

**Text:** "knowing, feminist theory, subjectivity,"



 ___

**Text:** "to seeing the sunset as you cruise along the golden Kihei coastline,"



 ___

**Text:** "Proponents argue that politically liberal CEOs will envision the practice of CSR as beneficial"



 ___

**Text:** "While individuals with borderline personality disorder may show their emotions too much,"



 ___

**Text:** "The Bill was approved by the cabinet"



 ___

**Text:** "is sent in some form from an emisor/ sender/ encoder to a destination/ receiver/ decoder."



 ___

**Text:** "including sensitive personal data or information."



 ___

**Text:** "He wondered, too, where Roscoe was."



 ___

**Text:** "then click the subtopic you want."



 ___

**Text:** "A company or organization's policy on a particular topic.  For example, the equal opportunity policy of a company"



 ___

**Text:** "and coordinating the efforts of its employees to accomplish its objectives"



 ___

**Text:** "Cats can rotate their wrists, wrap their paws around prey, and unleash jackknife-like claws to hold on."



 ___

**Text:** "Understanding the time structure of an area."



 ___

**Text:** "and lead a team to achieve success."



 ___

**Text:** "owner you can be"



 ___

**Text:** "If not, let's say our prayers and go to bed."



 ___

**Text:** "Surrounding this nucleus are negatively charged particles, called electrons,"



 ___

**Text:** "then attach the concrete piers on top"



 ___

**Text:** "But leave enough room for unexpected rises in volume without 'clipping'. Clipping"



 ___

**Text:** "Each participant received a mild electric shock, then watched another go through the same pain."



 ___

**Text:** "can seem contrary to maintaining user privacy."



 ___

**Text:** "increase your metabolism, and to help you look and feel your best,"



 ___

**Text:** "are a combination of people skills, social skills, communication skills, character traits, attitudes, career attribute,"



 ___

**Text:** "This is a place full of attractions; they draw millions of tourists each year."



 ___

**Text:** "On the other side, the horse meat scandal of 2013 in the United Kingdom"



 ___

**Text:** "Companies that pursued the highest market share position to achieve cost advantages fit under Porter's cost leadership generic strategy,"



 ___

**Text:** "Responsive evaluation provides a naturalistic and humanistic approach to program evaluation."



 ___

**Text:** "On the other side, this opens additional danger for abuse from disreputable practitioners."



 ___

**Text:** "You were destroying my life."



 ___

**Text:** "while studying ergots, a type of fungus."



 ___

**Text:** "the nonprofit Nature Conservancy"



 ___

**Text:** "Bankers will love the long term planning idea and it builds trust, which is hard to purchase in any situation."



 ___

**Text:** "Philip thrust himself against it and entered."



 ___

**Text:** "the evolution of cells recognizable as eukaryotes of the modern type."



 ___

**Text:** "that Mary Magdalene was a very different woman to the one we think we know.
So who was the real Mary Magdalene?"



 ___

**Text:** "It was, of course, nature that dealt the most severe blows."



 ___

**Text:** "Most large corporations"



 ___