# Verbalizing Hillary

The goals of this notebook are:

- Construct an algorithm to verbalize the Hillary dataset.
- Understand the distribution of non-verbalized symbols, such as numbers and special characters.
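
As a rough illustration of the second goal, the sketch below counts characters that are neither letters, whitespace, nor common punctuation. The rows in `sample_rows`, the `NON_VERBALIZED` pattern, and the `count_symbols` helper are hypothetical and not part of the notebook; which characters count as "non-verbalized" is an assumption made here for illustration.

```python
import re
from collections import Counter

# Hypothetical stand-in rows; the notebook itself loads its rows via `hillary_dataset`.
sample_rows = [
    {'text': 'In 1980,'},
    {'text': 'out of every 3'},
    {'text': 'bullying and/or harassment,'},
]

# Assumption: letters, whitespace, and common punctuation are treated as already verbalized.
NON_VERBALIZED = re.compile(r"[^A-Za-z\s.,;:!?'\"-]")


def count_symbols(rows):
    """Count every character in ``rows`` that matches ``NON_VERBALIZED``."""
    counts = Counter()
    for row in rows:
        counts.update(NON_VERBALIZED.findall(row['text']))
    return counts


symbol_counts = count_symbols(sample_rows)
print(symbol_counts.most_common())
```

Running the same counter over the full dataset would give a first view of which symbols are frequent enough to need a verbalization rule.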

In [5]:
import re
import sys

# Add the repository root to `sys.path` so that `src` is importable
sys.path.insert(0, '../')

from src.datasets.lj_speech import _iterate_and_replace

In [6]:
from src.datasets import hillary_dataset

data, _ = hillary_dataset(directory='../data')

No config for `hillary.hillary_dataset` (`src.datasets.hillary.hillary_dataset`)
100%|██████████| 10528/10528 [00:00<00:00, 342004.99it/s]


In [7]:
import random
import os

from IPython.display import Audio
from IPython.display import Markdown
from IPython.display import FileLink
from IPython.display import display

def find_examples(regex, display_n=5, load_audio=False, replace=True, group=1, context=25):
    """ Print ``display_n`` examples of ``regex`` in the Hillary dataset (``data``).
    
    This function is the workhorse of our data analysis: it lets us query the dataset with a
    regex and retrieve matching samples.
    
    Args:
        regex (str): Pattern or compiled regex object.
        display_n (int or None, optional): Number of examples to display.
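        load_audio (bool, optional): Whether to load and display the audio for each example.
        replace (bool, optional): Whether to overwrite the matched characters in ``data`` with
            ``X`` (in place), so that later queries no longer match them.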
        group (int, optional): Group to select in regex.
        context (int, optional): Number of characters to include on the left and right of the matched
            text as context.
        
    Returns:
        None
    """
    examples = []
    for row in data:
        matches = re.finditer(regex, row['text'])
        for match in matches:
            # Skip empty matches
            if match.start(group) == match.end(group):
                continue
            
            text = row['text']    
            start = match.start(group)
            end = match.end(group)
            start_context = max(start - context, 0)
            end_context = min(end + context, len(text))
            
            if replace:
                row['text'] = '{}{}{}'.format(text[:start],
                                              'X' * (end - start), 
                                              text[end:])
                
            if start != 0 or end != len(row['text']):
                text = '{}**{}**{}'.format(text[start_context:start],
                                           match.group(group),
                                           text[end:end_context])
            if start != 0:
                text = '…' + text
            if end != len(row['text']):
                text = text + '…'
                
            examples.append({
                'text': text,
                'audio': row['wav_filename']
            })
            
    # Print Examples
    display(Markdown('### Examples Captured by Regex'))
    display(Markdown('**Regex:** ' + str(regex)))
    display(Markdown('**Number of Examples:** ' + str(len(examples))))
    
    random.shuffle(examples)
    if display_n is not None:
        examples = examples[:display_n]
    
    display(Markdown('**Number of Examples Shown:** ' + str(len(examples))))
    display(Markdown('\n\n ___'))
    
    for example in examples:
        display(Markdown('**Text:** "' + example['text'] + '"'))
        display(FileLink(example['audio']))
        if load_audio:
            display(Audio(str(example['audio'])))
        display(Markdown('\n\n ___'))
        display()
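
The context-window logic inside `find_examples` can be tried in isolation on a single string. The sketch below is a minimal, self-contained version; `highlight_match` is a hypothetical helper, not part of the notebook, and the sample sentence is shortened from one of the dataset rows. It assumes the ellipsis should appear only on a side where text was actually cut off.

```python
import re


def highlight_match(text, match, group=0, context=25):
    """Trim ``text`` to ``context`` characters around the match and bold the match."""
    start, end = match.start(group), match.end(group)
    start_context = max(start - context, 0)
    end_context = min(end + context, len(text))
    snippet = '{}**{}**{}'.format(text[start_context:start],
                                  match.group(group),
                                  text[end:end_context])
    # Assumption: mark truncation only on the side where characters were dropped.
    if start_context > 0:
        snippet = '…' + snippet
    if end_context < len(text):
        snippet = snippet + '…'
    return snippet


text = 'Their venom can kill a human in as little as 30 min.'
match = re.search(r'\d+', text)
snippet = highlight_match(text, match, context=10)
print(snippet)
```

Keeping this step in a small function makes it easy to check the slicing on short strings before running a query over the whole dataset.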

## Sample of the Dataset

In [9]:
find_examples(r'(?s).*', display_n=100, replace=False, group=0, load_audio=True)

### Examples Captured by Regex

**Regex:** (?s).*

**Number of Examples:** 8422

**Number of Examples Shown:** 100



 ___

**Text:** "and the allocation of resources necessary for carrying out these goals.""



 ___

**Text:** "but the receiver takes it in a different meaning. For example- ASAP, Rest room"



 ___

**Text:** "Some ethical issues of particular concern in today's evolving business market"



 ___

**Text:** "There is a small suction cup underneath the counting device, making sure that the count is accurate"



 ___

**Text:** "Kardashian served it to you personally."



 ___

**Text:** "and often get fooled into believing they understand their business if they have quantitative research to rely upon."



 ___

**Text:** "ONE OF THE MOST important investment decisions you will ever make is not whether to retire but when you chose to do it."



 ___

**Text:** "Render accurate judgments about specific things and qualities in everyday lifeIn"



 ___

**Text:** "did"



 ___

**Text:** "g. work-"



 ___

**Text:** "and producing products that go too far"



 ___

**Text:** "or those of other people involved in the experiment, as indirect ways of signaling their level of"



 ___

**Text:** "Every information exchange between living organisms —"



 ___

**Text:** "See our resources section in the manual that accompanied this program for a listing of suppliers."



 ___

**Text:** "is that it can overly constrain managerial discretion in a dynamic environment. ""



 ___

**Text:** "I am writing these lines in Honolulu, Hawaii."



 ___

**Text:** "There's something magical about fire dancing on water."



 ___

**Text:** "with Journal of Business Ethics and Business Ethics Quarterly considered the leaders."



 ___

**Text:** "'Supporting Local Communities'."



 ___

**Text:** "But, according to Hugo Conti, a self-taught Argentinian historian who leads a mysterious group called ""



 ___

**Text:** "in a near-unanimous vote."



 ___

**Text:** "Which skills and capabilities should be developed within the firm?"



 ___

**Text:** "are concerns that the empathizer's own emotional background"



 ___

**Text:** "The first looking at those problems that only have one solution which are grounded in"



 ___

**Text:** "This allowed society to assimilate the change before the next change arrived."



 ___

**Text:** "What if Jeanne failed him."



 ___

**Text:** "The man's knowledge of the can being served as purely an air freshener"



 ___

**Text:** "Meditation

Sit or lie comfortably. You may even want to invest in a meditation chair."



 ___

**Text:** "family relationships, quarrels, collaboration, reciprocity, and altruism,"



 ___

**Text:** "7 feet in length. Their venom attacks the nerves which can kill a human in as little as 30 min."



 ___

**Text:** "The other two seasonal points on the sun’s path are the solstices."



 ___

**Text:** "This course is designed to help you clean up your grammar and improve the way you communicate."



 ___

**Text:** "One primary study confirmed that patients with borderline personality disorder"



 ___

**Text:** "approach at one end, and a facilitative approach at the other."



 ___

**Text:** "His company’s name was DeBeers,"



 ___

**Text:** "Colleges and universities around the world"



 ___

**Text:** "such as wood, petroleum, natural gas, ores, plants or minerals."



 ___

**Text:** "On average, the prokaryotic cells of bacteria are much smaller, and much simpler in structure, than eukaryotic cells."



 ___

**Text:** "Language follows phonological rules ("



 ___

**Text:** "It is almost impossible to imagine life today without metals."



 ___

**Text:** "Cambridge International Examinations have an A-level in Thinking Skills."



 ___

**Text:** "such as formal and informal logic. This emphasized to students"



 ___

**Text:** "Anonymous reporting mechanisms, as mentioned previously,"



 ___

**Text:** "due to technology and supply chain process innovation."



 ___

**Text:** "Two American scientists, Edward Drinker Cope and"



 ___

**Text:** "Industrial manufacturers produce products, either from raw materials or from component parts, then export the finished products at a profit."



 ___

**Text:** "it identifies the intellectual capacity and the means "of judging", ""



 ___

**Text:** "In 1980,"



 ___

**Text:** "is the work by Allen Newell and Herbert A. Simon."



 ___

**Text:** "as emotional intelligence. ...emotional intelligence is the sine qua non of leadership"."



 ___

**Text:** "overbilling for days not worked, speed at the cost of quality,"



 ___

**Text:** "Fast, but endure."



 ___

**Text:** "Considerable progress has been made in automated scoring of divergent thinking tests using semantic approach."



 ___

**Text:** "began secretly buying up adjacent parcels of land in the Flatbush section of Brooklyn"



 ___

**Text:** "Extreme Cave Diving, right now on this Nova/National Geographic special."



 ___

**Text:** "and education."



 ___

**Text:** "and no longer than 60 days has lapsed since the employee has reported the incident to his employer, and
the employer has not addressed the irregularity"



 ___

**Text:** "Sometimes even a particular kinesic indicating something good in a country"



 ___

**Text:** "STUFF and see just how much of a change they have made."



 ___

**Text:** "At Lake Linderman I had one canoe, very good Peterborough canoe."



 ___

**Text:** "the sociologist Silvia Leal Martín, using the Innova"



 ___

**Text:** "and iterative. It is intentionally normative"



 ___

**Text:** "The popularity of matte lipstick skyrocketed"



 ___

**Text:** "hands of an even mightier adversary: Change."



 ___

**Text:** "out of every 3"



 ___

**Text:** "protons, which are positive particles, and neutrons, which have no charge."



 ___

**Text:** "from a blog post by Grantland's Sean McIndoe*"



 ___

**Text:** "Consultancies Associations study."



 ___

**Text:** "The terms pertinent to it are "feeling", "judgement", "sense", "proportion", "balance", "appropriateness"."



 ___

**Text:** "Recognize the existence of logical relationships between propositions"



 ___

**Text:** "consulting firms are typically aware of industry "best practices." However, the specific nature of situations under consideration"



 ___

**Text:** "Instead, one tends to think in terms of the various processes, tasks, and objects subject to management."



 ___

**Text:** "WLEIS did a bit better, and the Bar-On measure better still."



 ___

**Text:** "Due to its rich and abundant oyster beds"



 ___

**Text:** "Maier observed that participants were often unable to view the object in a way that strayed from its typical use,"



 ___

**Text:** "This area is affected by the currency exchange rate,"



 ___

**Text:** "His voice was passionately rebellious."



 ___

**Text:** "the ability to detect and decipher emotions in faces, pictures, voices, and cultural artifacts—"



 ___

**Text:** "Most of the Galapagos Islands have no permanent human settlements. Still, people have stopped to visit throughout history."



 ___

**Text:** "tabs across the top of your screen--Introduction, Trees, Shrubs, Plants, and Frequently Asked Questions?"



 ___

**Text:** "However, a close examination shows that most advice given today contains gaps and"



 ___

**Text:** "For instance, imagine the following situation:"



 ___

**Text:** "the separation being made between talent and genius."



 ___

**Text:** "moniker for many years, with some in the United States and other countries"



 ___

**Text:** "but, unlike controls,"



 ___

**Text:** "so omnipresent,"



 ___

**Text:** "Lack of people skills among upper echelons can result in bullying and/or harassment,"



 ___

**Text:** "funded team of researchers led by James C. Kaufman and Mark A. Runco"



 ___

**Text:** "have improved self-awareness, social-emotional adjustment and classroom behavior;"



 ___

**Text:** "It also, the authors argued, made a useful framework for analyzing creative processes in individuals."



 ___

**Text:** "whistleblowing in the public sector organization"



 ___

**Text:** "Taking these simple steps should go a long way to making sure you end up with a clean, professional sounding recording."



 ___

**Text:** "Plant roots communicate with rhizome bacteria, fungi, and insects within the soil."



 ___

**Text:** "at the intense pressures of the earth’s depths, must have been a common constituent of the"



 ___

**Text:** "From my earliest recollection my sleep was a period of terror."



 ___

**Text:** "the number increased by marriages to fifteen families and four single men,"



 ___

**Text:** "Fungi communicate with their own and related species"



 ___

**Text:** "but whose fields were infected through pollen drift."



 ___

**Text:** "Researcher Erin Hagen doesn’t have the answers. For now, she’s doing what she can to help just one:"



 ___

**Text:** "years with this new book and amazing 10-disc anthology"



 ___