# Beth Dataset Generator

This notebook processes the document files Beth sent across into CSV files.

Before running this notebook, we:

1. Validated that all of Beth's files follow a `{title}\n{paragraph}` format. Any file that did not was adapted to follow it.
2. Converted all `.doc` files to standard `.docx` files via: ``textutil -convert docx *.doc``
3. Converted all `.mp3` files to `.wav` files via: ``find . -name "*.mp3" -exec bash -c 'ffmpeg -i "{}" "${0/.mp3}.wav"' {} \;``
4. Removed files containing Spanish by searching the documents for common Spanish words such as "el" and "la": https://www.happyhourspanish.com/learning-efficiently-start-with-the-250-most-common-spanish-words/
5. Removed unverbalized copy that does not represent a section title.
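Step 4 was done by searching for common Spanish words. A minimal sketch of that kind of check follows. The word list, the `looks_spanish` name, and the `min_ratio` threshold are all assumptions made for illustration, not the procedure that was actually used.

```python
import re

# Hypothetical, deliberately tiny word list; the list actually used is not recorded here.
COMMON_SPANISH_WORDS = {"el", "la", "los", "las", "que", "de", "y", "es", "por", "para"}


def tokenize(text):
    """Lowercase the text and extract alphabetic tokens, including accented letters."""
    return re.findall(r"[a-záéíóúñü]+", text.lower())


def count_spanish_words(text):
    """Count the tokens in `text` that appear in COMMON_SPANISH_WORDS."""
    return sum(1 for token in tokenize(text) if token in COMMON_SPANISH_WORDS)


def looks_spanish(text, min_ratio=0.2):
    """Heuristic: flag text when at least `min_ratio` of its tokens are common Spanish words."""
    tokens = tokenize(text)
    if not tokens:
        return False
    return count_spanish_words(text) / len(tokens) >= min_ratio


print(looks_spanish("Gracias por llamar a la clínica de la ciudad"))
print(looks_spanish("Thank you for calling the clinic"))
```

A ratio is used rather than a single match because short words such as "la" can also occur in English copy. Flagged files would still need a manual look before removal.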

In [1]:
from pathlib import Path

destination = Path('./../../_data/02 Beth/') # Directory to write the processed dataset to
destination.mkdir(exist_ok=True)

source = Path('./../../_data/01 Beth/') # Directory containing the source files
source.is_dir()

True

## Check Invariants 

Check that all the files in `source` are accounted for with the below patterns.

In [2]:
all_files = set([path for path in source.glob('*') if path.is_file()])
accounted_for_files = set(list(source.glob('*.docx')) + list(source.glob('*.doc')) + 
                          list(source.glob('*.wav')) + list(source.glob('*.mp3')))
all_files.difference(accounted_for_files)

{PosixPath('../../_data/01 Beth/.DS_Store'),
 PosixPath('../../_data/01 Beth/~$ng_FloodlightCamMotionDetectionSettings_082817.rtf'),
 PosixPath('../../_data/01 Beth/~$ng_HowToInslallRingVideoDoorbellPro_111417.rtf'),
 PosixPath('../../_data/01 Beth/~$ng_HowToInstallSpotlightCamMount_110917.rtf'),
 PosixPath('../../_data/01 Beth/~$ng_HowToJoinRingNeighborhoodWithoutADevice_110917.rtf'),
 PosixPath('../../_data/01 Beth/~$ng_HowToSolvePowerProblemsWithRingVideoDoorbellPro_111417.rtf')}
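The leftovers above are a macOS `.DS_Store` file and `~$`-prefixed files, which Microsoft Office typically creates as temporary lock files. If the check needs to pass cleanly, one option is to ignore such names explicitly. The sketch below assumes that any name starting with `.` or `~$` is safe to skip. `is_ignorable` and the sample file names are hypothetical.

```python
from pathlib import Path


def is_ignorable(path):
    """Return True for files assumed not to be part of the dataset.

    Assumption: hidden files (such as .DS_Store) and '~$' lock files can be skipped.
    """
    return path.name.startswith('.') or path.name.startswith('~$')


# Hypothetical leftovers, standing in for the set difference computed above.
leftovers = [Path('.DS_Store'), Path('~$ng_Example.rtf'), Path('Script_01.pdf')]
unexpected = [p for p in leftovers if not is_ignorable(p)]
print([p.name for p in unexpected])
```

Anything left in `unexpected` is a file that the glob patterns do not account for and that deserves a manual look.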

Check that all `.docx` files have an associated `.wav` file. (FYI: we fixed any naming issues that caused this invariant to fail.)

In [4]:
text_file_stems = set([path.stem for path in source.glob('*.docx')])
audio_file_stems = set([path.stem for path in source.glob('*.wav')])
print(text_file_stems.difference(audio_file_stems))
print(audio_file_stems.difference(text_file_stems))

set()
set()
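The same pairing check can be wrapped in a small helper so it is easy to re-run after renaming files. This is a sketch only. `unpaired_stems` is a hypothetical name, and the demo runs against a temporary directory rather than the real `source` folder.

```python
import tempfile
from pathlib import Path


def unpaired_stems(directory, text_suffix='.docx', audio_suffix='.wav'):
    """Return (text_only, audio_only): sorted stems that lack a counterpart file."""
    directory = Path(directory)
    text_stems = {p.stem for p in directory.glob('*' + text_suffix)}
    audio_stems = {p.stem for p in directory.glob('*' + audio_suffix)}
    return sorted(text_stems - audio_stems), sorted(audio_stems - text_stems)


# Demo on throwaway files: 'a' is paired, 'b' has no audio, 'c' has no text.
with tempfile.TemporaryDirectory() as tmp:
    for name in ['a.docx', 'a.wav', 'b.docx', 'c.wav']:
        (Path(tmp) / name).touch()
    result = unpaired_stems(tmp)

print(result)  # (['b'], ['c'])
```

Two empty lists mean the invariant holds, matching the two empty sets printed above.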


## Generate Script

Finally, we preprocess all the text files into CSV files. 
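As a rough illustration of the target layout, the toy example below splits one raw line on a long run of spaces and groups the pieces into `Title` / `Content` rows. The `group_rows` helper and the simplified `is_title` stand-in (ends with a colon) are assumptions for this sketch. The cell below uses richer heuristics.

```python
import re


def group_rows(paragraphs, is_title):
    """Group paragraphs under the most recent title, dropping titles without content."""
    rows = []
    for paragraph in paragraphs:
        if is_title(paragraph):
            rows.append({'Title': paragraph, 'Content': []})
        elif rows:
            rows[-1]['Content'].append(paragraph)
        else:
            # Content that appears before any title gets an empty title.
            rows.append({'Title': '', 'Content': [paragraph]})
    return [{'Title': r['Title'], 'Content': '\n'.join(r['Content'])}
            for r in rows if r['Content']]


# Hypothetical input: a title and its content separated by ten spaces instead of a newline.
raw = 'Statement 2:' + ' ' * 10 + 'Thank you for holding.'
paragraphs = re.split(r'\s{7,}', raw.strip())
rows = group_rows(paragraphs, is_title=lambda text: text.endswith(':'))
print(rows)  # [{'Title': 'Statement 2:', 'Content': 'Thank you for holding.'}]
```

Each resulting dictionary corresponds to one row of the CSV file written for a document.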

In [5]:
from IPython.display import FileLink
from IPython.display import Markdown
from IPython.display import display
import json
import docx
import re
import pandas as pd
import shutil

def remove_parentheses(text):
    """ Remove parentheses and / or brackets, along with their contents, from text.
    
    Example: 
    
        >>> text = "Statement 7:  (repeating Statement 3)"
        >>> remove_parentheses(text)
        'Statement 7:'
        
    Args:
        text (str)

    Returns:
        str
    """
    # Raw string avoids invalid escape sequence warnings; `.*?` removes each bracketed span
    return re.sub(r"[\(\[].*?[\)\]]", "", text).strip()

def is_title(text):
    """ Return `True` if text is a "title".
    
    Example titles:
      Statement 2:
      For Spanish:
      Statement 4A:
      Day Greeting:
      Close message:
      Statement:  10:
      :15 Spot\u2014Ready?
      Phone Greeting:
      On Hold Program:
      After pressing 2:
      Holiday Greeting:
      Prompt2a\u2013Press3Menu
      Main Greeting Menu:
      Statement 4:  (Blue)
      Option 1 \u2013 No Answer:
      Testimonial 1: (BLUE)
      Callback Confirmation:
      Secondary IVR Greeting:
      Prompt12-FunshoVoicemail
      New Statement 2: (for post holiday program)_
      Prompt9-Press2MenuOpenNationalHolidayM-F9AM-8PM
    
    Args:
        text (str)

    Returns:
        bool
    """
    text = remove_parentheses(text)
    # `endswith` avoids an IndexError when the text is empty after removing parentheses
    if text.endswith(':') and len(text) < 50: 
        return True 
    # Match:
    # 6.
    # 4.
    # 5.
    if len(text) < 5:
        return True
    # Match
    # Ready?
    # Statement3
    # Statement 5
    # :15 Spot\u2014Ready?
    # Voicemail Greeting.
    # Prompt1\u2013ThankYouForCalling
    if len(text) < 25 and any([s in text for s in ['Ready', 'Statement', 'Greeting', 'Prompt']]):
        return True
    # Matches: 
    # Prompt10-Press2MenuOpenNationalHolidayM-F7AM-8PM
    # Prompt 5b- AssistingAnotherPatient-Hold Message
    # Prompt2b\u2013Press3Menu- Monarch Physiotherapy
    if re.match(r"prompt[\s]{0,1}[0-9]+.*", text, re.IGNORECASE):
        return True
    return False

def is_noise(text):
    """ Return `True` if text is "noise".
    
    Example noise:
      ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
      telephoneonhold.com\t1-888-321-8477
      PAGE 
      .
      
    Args:
        text (str)

    Returns:
        bool
    """
    if len(text) == 0:
        return True
    if not re.search(r"[a-zA-Z0-9]", text): # Has no letters or digits in the text
        return True
    if 'PAGE' in text: # Typically seen as the last text in a script
        return True
    if 'telephoneonhold.com\t1-888-321-8477' in text: # One off case
        return True
    if '(Dr. Jack (grin year))' in text: # One off case
        return True
    return False

def has_white_text(paragraph):
    """ Return `True` if paragraph contains white text. 
    
    Args:
        paragraph (docx.text.paragraph.Paragraph)

    Returns:
        bool
    """
    if 'FFFFFF' in set([str(r.font.color.rgb) for r in paragraph.runs]):
        display(Markdown('Whited Out Text: "%s"' % paragraph.text))
        return True
    return False
        
    
def strip_quotes(text):
    """ Strip quotes if ``text`` has quotes on both sides.
    
    Args:
        text (str)
        
    Returns:
        str
    """
    # The length check guards against empty or single-character text
    if len(text) >= 2 and text[0] in ('“', '"') and text[-1] in ('"', '”'):
        return text[1:-1]
    return text
    
# Flatten a 2-d list into a 1-d list
# Inspired by: https://stackoverflow.com/questions/952914/how-to-make-a-flat-list-out-of-list-of-lists
flatten = lambda l: [item for sublist in l for item in sublist]
    
text_files = [path for path in source.glob('*.docx')]
audio_files = [path for path in source.glob('*.wav')]

# For inspecting results across the entire dataset
all_content = []
all_title = []
all_noise = []

for text_file, audio_file in zip(sorted(text_files), sorted(audio_files)):
    display(FileLink(text_file))
    
    document = docx.Document(text_file)
    # Split by newline, filter white text, filter empty text
    paragraphs = flatten([p.text.split('\n') for p in document.paragraphs 
                          if len(p.text.strip()) > 0 and not has_white_text(p)])
    # Split by 7 spaces or more and strip (This number was set by trial and error)
    # NOTE: In the Beth dataset, instead of a newline there is a bunch of spaces sometimes.
    paragraphs = [re.split(r'\s{7,}', p.strip()) for p in paragraphs]
    for p in paragraphs:
        if len(p) > 1:
            display(Markdown('Split by Spaces: %s' % p))
    paragraphs = flatten(paragraphs)
    
    # Create table with title and content columns.
    rows = []
    for paragraph in paragraphs:
        paragraph = paragraph.strip()
        if is_noise(paragraph):
            display(Markdown('Skipped Paragraph: "%s"' % paragraph))
            all_noise.append(paragraph)
        elif is_title(paragraph):
            if len(paragraph) > 75: # Determined by trial and error
                display(Markdown('Long Title: "%s"' % paragraph))
                
            rows.append({'Title': paragraph, 'Content': []})
            all_title.append(paragraph)
        else:
            if len(paragraph) < 75: # Determined by trial and error
                display(Markdown('Short Content: "%s"' % paragraph))
                
            paragraph = strip_quotes(paragraph) 
            all_content.append(paragraph)
            if len(rows) == 0:
                rows = [{'Title': '', 'Content': [paragraph]}]
            else:   
                rows[-1]['Content'].append(paragraph)
    
    for row in rows:
        row['Content'] = '\n'.join(row['Content'])
    
    rows = [r for r in rows if len(r['Content']) > 0]
    
    df = pd.DataFrame(rows)
    df.to_csv(str(destination / (text_file.stem + '.csv')), index=False)
    shutil.copy(audio_file, str(destination / audio_file.name))
    display(Markdown('-' * 50))

Skipped Paragraph: "PAGE"

--------------------------------------------------

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: ":"

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: ""

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++"

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

--------------------------------------------------

Short Content: "Health Tip- Surviving Flu Season."

Short Content: "And now, Health Tips! Brought to you by AllWays Health Partners."

Short Content: "-January."

Short Content: "Health Tip- Healthy Eating in 2019."

Short Content: "And now, Health Tips! Brought to you by AllWays Health Partners."

Short Content: "Trying to eat healthier in 2019?"

Short Content: "Or just trying to work off how not- healthy you were during the holidays?"

--------------------------------------------------

Short Content: "Did you know the average glazed donut has 192 calories?"

Short Content: "That’s what you burn in 30 minutes of moderate exercise."

Short Content: "But did you know hugs are healthy, too?"

--------------------------------------------------

Skipped Paragraph: "(Dr. Jack (grin year))"

--------------------------------------------------

--------------------------------------------------

--------------------------------------------------

--------------------------------------------------

--------------------------------------------------

--------------------------------------------------

--------------------------------------------------

--------------------------------------------------

--------------------------------------------------

--------------------------------------------------

--------------------------------------------------

Short Content: "Celebrate the holiday season at the Museum of Fine Arts!"

--------------------------------------------------

Short Content: "The Museum of Fine Arts, a place for you and your sense of wonder"

--------------------------------------------------

--------------------------------------------------

--------------------------------------------------

--------------------------------------------------

--------------------------------------------------

--------------------------------------------------

--------------------------------------------------

--------------------------------------------------

--------------------------------------------------

--------------------------------------------------

--------------------------------------------------

--------------------------------------------------

--------------------------------------------------

--------------------------------------------------

--------------------------------------------------

--------------------------------------------------

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

--------------------------------------------------

Short Content: "Thank you for calling BlackGold Physical Therapy Clinic."

Short Content: "To book an appointment or to make changes to your appointment, Press 1."

Short Content: "For our address, directions and hours of operation, Press 2"

Short Content: "To reach one of our clinical Staff, Press 3"

Short Content: "OR simply stay on the line and one of our representatives will assist you"

Short Content: "At any time, you may press star to be returned to the main menu"

Short Content: "Thank you for calling BlackGold Physical Therapy Clinic."

Short Content: "To book an appointment or to make changes to your appointments, Press 1."

Short Content: "For our address, directions and hours of operation, Press 2"

Short Content: "For all insurance and billing related questions, Press 3"

Short Content: "Thank you for calling Monarch Physiotherapy Clinic."

Short Content: "To book an appointment or to make changes to your appointments, Press 1."

Short Content: "For our address, directions and hours of operation, Press 2"

Short Content: "For all insurance and billing related questions, Press 3"

Short Content: "OR simply stay on the line and one of our representatives will assist you"

Short Content: "will assist you."

Short Content: "At any time you may press star to be returned to the main menu"

Short Content: "Kindly Press 0 now if this is a billing or appointment related questions"

Short Content: "For Funsho, Press 1"

Short Content: "For Physiotherapist, Press 2"

Short Content: "For Massage Therapist, Press 3"

Short Content: "For the Chiropractor, Press 4"

Short Content: "For Bayo, Press 1"

Short Content: "For Physiotherapist, Press 2"

Short Content: "For Massage Therapist, Press 3"

Short Content: "For Chiropractor, Press 4"

Short Content: "For our address, directions and regular hours of operation, Press 2,"

Short Content: "To reach one of our clinical Staff Press 3"

Short Content: "OR simply stay on the line and one of our representatives will assist you."

Short Content: "To reach Funsho Press 1"

Short Content: "To reach Funsho at an alternate number Press 2"

Short Content: "To reach Bayo Press 1"

Short Content: "To reach Bayo at an alternate number Press 2"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Whited Out Text: "treatment that actually stimulates the body to heal from with"

--------------------------------------------------

Whited Out Text: "to heal from with"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "."

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Short Content: "Many of our LASIK patients use their refund for “All-Laser” LASIK,"

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++"

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

--------------------------------------------------

Skipped Paragraph: "PAGE"

Skipped Paragraph: "PAGE  1"

Skipped Paragraph: "telephoneonhold.com	1-888-321-8477"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Short Content: "For Spanish, please press the # button."

Short Content: "For sales and customer service, please press one."

Short Content: "For technical and installation support, please press two."

Short Content: "For energy savings and DLC rebate support, please press four."

Short Content: "For any other questions, please press five. (1-2 second pause)"

Short Content: "To repeat these menu options, please press nine."

Skipped Paragraph: "+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++"

--------------------------------------------------

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Short Content: "You’re next!   One of our team members will be right with you."

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++"

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "."

Skipped Paragraph: "PAGE"

--------------------------------------------------

--------------------------------------------------

--------------------------------------------------

--------------------------------------------------

--------------------------------------------------

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

--------------------------------------------------

Skipped Paragraph: "++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++"

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "++++++++++++++++++++++++++++++++++++++"

Short Content: "To leave a message for Debbie, press 1."

Short Content: "To leave a message for Terah, press 2."

Short Content: "To leave a message for John, press 3."

Short Content: "To leave a message for Kirk, press 4."

Short Content: "To leave a message for Ken, press 5."

Short Content: "For our general voicemail box, press 6."

Short Content: "To repeat these options, please press star."

Skipped Paragraph: "PAGE"

--------------------------------------------------

Split by Spaces: ['Statement 2:', "Your pet's well being is our top priority. Whether you use our grooming services or our self-wash room, we want to maintain your inner and outer care when you leave Healthy Pet, which is why we provide all the food, treats, and supplies you need at home. Featuring free food programs, price guarantee, free local delivery, and the best reviewed food and supplies, you can find what you want at Healthy Pet because at Healthy Pet, we love your dog as much as he loves you."]

Split by Spaces: ['Statement 3:', 'Research proves what all of us dog owners know -  dogs can tell when you are sad, and want to make it better. And cats can actually  lower your risk of cancer. Gravity may make the world go round, but the love of our animals makes us feel like their world. Healthy Pet wants to help foster that bond by providing the best advice and options from what you feed, what you treat, and what toys or litter you use. We post helpful pet ownership videos to Facebook, YouTube, and our Healthy Pet Aurora app to make your life a little easier. Just search Healthy Pet Aurora.']

Split by Spaces: ['Statement 4:', "Who wants another app? Your dog and cat do! Healthy Pet Aurora App lets you view in-store specials, in-store special events, track your free food, earn rewards, book your grooming appointments, order online for free local delivery, and more. If you've got questions we've got answers. We strive to collaborate with our pet community so submit questions through the app, and we will answer you directly or make a video to share with everyone."]

Split by Spaces: ['Statement 5:', 'Fromm, Pure Vita, Zignature, Primal, Weruva, these are just a few of the many high  quality brands of cat and dog foods that we carry for your favorite cat and dog.  Looking for a different brand or formula or just a question about why?? Come in to speak to a pet food expert on the best option for your pet and your family.\xa0Proudly answering your pet nutrition and behavior questions since 1998.']

Split by Spaces: ['Statement 6:', "What's the deal with eye goopies? How do you recycle pet food bags? What are the newest products for your pets? What can you do about your pet's ear infections, dry skin, or allergies? Check out our YouTube channel, Facebook, or our app to learn the answers and more! Links to the app are on our website, Facebook, and YouTube channel, just search Healthy Pet Aurora."]

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++"

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Short Content: "If you have an issue with a Promo Code not working, press 1."

Short Content: "To reach Supply Chain, press 2."

Short Content: "To reach Boutique Care, Press 3"

Short Content: "to reach Retail Ops, press 4"

Short Content: "to reach Recruiting, press 5"

Short Content: "To reach Payroll, press 6"

Short Content: "To reach Finance, press 7"

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++"

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "."

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

--------------------------------------------------

--------------------------------------------------

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Short Content: "Thank you for choosing The Pest Rangers. Check out our updated website at"

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

Skipped Paragraph: "PAGE  1"

Skipped Paragraph: "telephoneonhold.com	1-888-321-8477"

--------------------------------------------------

Skipped Paragraph: "."

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Short Content: "A diode is a small electrical component that looks like this:"

Short Content: "If so, then you probably have a mechanical bell."

Short Content: "If this is how your bell produces its sound, then it’s mechanical."

Short Content: "In this case, DO NOT use the diode. Doing so may damage your bell!"

Short Content: "If your bell plays a melody when it rings…"

Short Content: "…then that melody is most likely coming from a speaker inside your bell."

Short Content: "Next, take a look at your diode and notice this small marking on one end."

Short Content: "With Ring you’re always home."

--------------------------------------------------

Short Content: "Now you’ll install the provided bracket."

Short Content: "If your ground wire is long enough, proceed to install the bracket."

Skipped Paragraph: ""

Short Content: "You are now ready to wire your Floodlight Cam."

--------------------------------------------------

Short Content: "You’re almost done!"

--------------------------------------------------

Skipped Paragraph: ""

Short Content: "You are now ready to secure your Floodlight Cam."

--------------------------------------------------

Short Content: "Meet Your New Ring Floodlight Cam"

Skipped Paragraph: ""

Short Content: "This copper wire is a ground."

Short Content: "These small cap nuts will secure your Floodlight Cam to the bracket."

--------------------------------------------------

Short Content: "Now you’ll install the provided bracket."

Short Content: "If your ground wire is long enough, proceed to install the bracket."

Skipped Paragraph: ""

Short Content: "You are now ready to wire your Floodlight Cam."

--------------------------------------------------

Short Content: "Each type of event has its own color-code and icon, for easier navigation."

Short Content: "For more information, see our next video, “Introducing the Event Timeline”"

--------------------------------------------------

--------------------------------------------------

Short Content: "“It can also tell the difference between people and other moving things.”"

Short Content: "“To set up a Motion Zone, tap here.”"

Short Content: "“To add additional Motion Zones, tap here and repeat these steps.”"

--------------------------------------------------

Short Content: "This video shows you how to install your Ring Stick Up Cam Battery."

Short Content: "You’ll know it’s fully charged when only one of the LED lights is lit."

Short Content: "Then, insert the battery into its slot until you hear a click."

Short Content: "Next, bring your Stick Up Cam to the room that holds your Wi-Fi router."

Short Content: "Now, let’s look at how to mount your Stick Up Cam on a wall or ceiling."

Short Content: "Then, flip the base toward the rear of the camera."

Short Content: "Then, swivel the stand so the base is above the camera."

Short Content: "Finally, hold the base in place on your wall or ceiling, rubber side down."

Short Content: "Next, drive the three mounting screws."

--------------------------------------------------

Short Content: "This video shows you how to install your Ring Stick Up Cam Elite."

Short Content: "Plug the included USB power supply into an outlet..."

Short Content: "...then connect it to your Stick Up Cam Elite with the included USB cable."

Short Content: "Now, let’s look at how to mount your Stick Up Cam on a wall or ceiling."

Short Content: "Then, flip the base toward the rear of the camera."

Short Content: "Then, swivel the stand so the base is above the camera."

Short Content: "Finally, hold the base in place on your wall or ceiling, rubber side down."

Short Content: "Next, drive the three mounting screws."

Short Content: "Finally, snap the cover onto the base."

Short Content: "Now, let’s reconnect your Stick Up Cam."

--------------------------------------------------

Short Content: "“Then, unscrew the bracket from the wall.”"

--------------------------------------------------

Short Content: "“First, you’ll want to temporarily uninstall your Video Doorbell.”"

--------------------------------------------------

Short Content: "“This video shows you how to install Ring Spotlight Solar Panel.”"

Short Content: "<little pause>"

Short Content: "“...set your desired angle…”"

Short Content: "<little pause>"

Short Content: "“...then re-tighten the screw.”"

Short Content: "“Note that the mounting plate can be inserted in either direction.”"

Short Content: "“The charging port is just below that.”"

--------------------------------------------------

Short Content: "“First, let’s get your Spotlight Cam ready to install.”"

Short Content: "“Now, it’s time to close things up. You’re almost done!”"

--------------------------------------------------

Short Content: "“This video shows you how to install your Ring Video Doorbell Pro.”"

Short Content: "“Then, locate your internal doorbell, and remove the cover.”"

Short Content: "“Now, let’s head outside, to install your Ring Video Doorbell Pro.”"

Short Content: "“The first thing to do is take off the removable faceplate.”"

Short Content: "“If you’re installing on wood or siding, you can skip this step.”"

Short Content: "necessary.”"

Short Content: "“Now you’re ready to connect the wires.”"

Short Content: "“Loosen the screws on the back of your Ring Doorbell.”"

Short Content: "“Next, turn power back on for your doorbell at the breaker.”"

Short Content: "“Next, snap the faceplate of your choice onto your Ring Doorbell.”"

--------------------------------------------------

Short Content: "“This video shows you how to install your new Ring Spotlight Cam Mount.”"

Short Content: "“You’ll do the same when mounting on a ceiling or eaves.”"

Short Content: "“...and it’s ready to install.”"

--------------------------------------------------

--------------------------------------------------

Short Content: "“They’re available to help 24/7 at ring.com.”"

Short Content: "“Then, remove the cover from your internal doorbell.”"

Short Content: "“Tap Doorbell Kit Settings, then set Doorbell Type to None.”"

--------------------------------------------------

Short Content: "“First, let’s look at the things you can do in Live View.”"

Short Content: "“Tap the Plus button to access additional functions.”"

Short Content: "“In the timeline, Doorbell Ring events are marked with a bell.”"

Short Content: "“Motion events are marked with a “moving person.”"

Short Content: "“You can then navigate recorded events in the chosen day as normal.”"

Skipped Paragraph: ""

Short Content: "“You can also send a link to the event in an email or text message.”"

Short Content: "“Tap Delete to erase a recorded event.”"

--------------------------------------------------

Short Content: "First, let’s learn how to use Motion Snooze."

Short Content: "Then, tap Motion Snooze."

Short Content: "Now, select the length of time to snooze, then tap Save."

Short Content: "Next, let’s learn how to use Chime Snooze."

Short Content: "Now, select the length of time to snooze, then tap Save."

--------------------------------------------------

--------------------------------------------------

Short Content: "Welcome to Ring Protect. We’re glad you’re one of our Neighbors."

Short Content: "Welcome to our neighborhood. With Ring, you’re always home."

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Short Content: "Ask about our New Year Specials!"

Short Content: "Ask us about our New Year monthly specials"

Short Content: "Pricing valid through February 28th."

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

Skipped Paragraph: "PAGE  1"

Skipped Paragraph: "telephoneonhold.com	1-888-321-8477"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Short Content: "Our business hours are 8:00am to 5:00pm, Monday through Friday"

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++"

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Short Content: "For appointments, referrals, and prescription refills, please press 3."

Short Content: "For forms, please press 4."

Skipped Paragraph: ""

Skipped Paragraph: ""

Skipped Paragraph: ""

Short Content: "to leave a message for the staff."

Skipped Paragraph: ""

Skipped Paragraph: "PAGE"

--------------------------------------------------

Short Content: "Henning’s Hatch Pepper Cheddar is a creamy smooth cheddar that uses the"

Short Content: "Another delectable addition from our Dairy Division is Blackstone - a new"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Short Content: "Thank you for calling Truckers Permitting Services."

Short Content: "For registrations and permits press one."

Short Content: "For insurance press two."

Short Content: "We are located at 376 Duncan Avenue, Jersey City, New Jersey."

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Short Content: "For Aaron Tripp press 1."

Short Content: "For Brad Hall press 2."

Short Content: "For Casey Chase press 3."

Short Content: "For Mike Hopkins press 4."

Short Content: "For Scott Roberts press 5."

Short Content: "For Scott Stuart press 6."

Short Content: "For Tim Lambert press 7."

Short Content: "For Kelly in parts press 1."

Short Content: "For Rick in accessories press 2."

Short Content: "For Bill in parts press 3."

Short Content: "For Jay in parts press 4."

Short Content: "For Brittney in titles press 1."

Short Content: "For Amy in accounts payable press 2."

Short Content: "For Glenys the office manager and accounts receivable press 3."

Short Content: "For Weston in IT/Web press 4."

Short Content: "For Pete our new sales manager press 5."

Short Content: "For Troy our used sales manager press 6."

Short Content: "For Kevin our general manager press 7."

Short Content: "For Shawn our service manager press 8."

Short Content: "For Jay our parts manager press 9."

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

Split by Spaces: ['Statement 2:', 'Now is the time to experience the optimum in style and safety! The 2018 Jeep Renegade 4x4 delivers power and design with a 1.4 liter turbo engine featuring a six-speed manual transmission along with available my sky sunroof and standard backup camera, 16-inch steel wheels, remote keyless entry with panic alarm and a height adjustable rear cargo floor. All these features and more make the Renegade the pinnacle of performance and convenience. Or maybe you should check out the all new 2018 Jeep Compass, the all new SUV with style and performance to meet the snowy conditions the winter will bring.  This dream SUV has more than 70 available safety and security features.  The all new Jeep Compass is the most capable compact SUV in its segment.  To see these spectacular vehicles and all other exciting options, visit us online at White Plains Chrysler dot com.']

Split by Spaces: ['Statement 6:', 'Now that fall is here, it’s time to hit the road in style and comfort. If you’re planning a family tailgate event and need the ultimate in space, may we suggest the all  new 2018 Chrysler Pacifica, available with 8 passenger seating and a class exclusive hybrid model? Or maybe you’re looking for a go-anywhere, do-anything vehicle? The legendary Jeep Wrangler — whether 2-door or 4-door — is always in fashion. And did you know the Wrangler is the only convertible SUV made in America today? Come and visit our showroom for a test drive!']

Split by Spaces: ['Statement 8:', 'The experts at Consumer Guide Automotive have revealed their “Best Buys” for 2018 and we are pleased to announce that five of our models have made the cut! The all-new 2018 Chrysler Pacifica has arrived with unparalleled safety features and elegant styling that drives to impress.  The most popular SUV in America, the 2018 Jeep Grand Cherokee, is a winner of the National Highway Traffic Safety Administration’s “five star award.”  The Dodge Durango scored tops among large SUVs, while the Dodge Journey – available in both front or all-wheel drive — was singled out as the midsize crossover that you need to own!  Come and take a look at any of these vehicles and you’ll see just what all the buzz is about.']

Skipped Paragraph: "PAGE"

--------------------------------------------------

Split by Spaces: ['Statement 2:', 'Now is the time to experience the optimum in style and safety! The 2019 Jeep Cherokee limited 4 by 4 delivers power and design with a 3 point 2 liter V6 engine featuring a nine-speed automatic transmission along with standard backup camera and blind spot monitoring, apple card play/android auto, leather seating and 18 inch aluminum wheels.  All these features and more make the Cherokee limited the pinnacle of performance and convenience.  You will find comfort in knowing that the Cherokee limited has among the best safety features to better protect you and your family.  Or maybe you should check out the all new 2019 Jeep Compass, the all new SUV with style and performance to meet the snowy conditions the winter will bring.  This dream SUV has more than 70 available safety and security features.  The all new Jeep Compass is the most capable compact SUV in its segment.  To see these spectacular vehicles and all other exciting options, visit us online at White Plains Chrysler dot com.']

Split by Spaces: ['Statement 6:', 'Now that fall is around the corner, it’s time to hit the road in style and comfort. If you’re planning a family tailgate event and need the ultimate in space, may we suggest the all  new 2019 Chrysler Pacifica, available with 8 passenger seating and a class exclusive hybrid model? Or maybe you’re looking for a go-anywhere, do-anything vehicle? The legendary Jeep Wrangler — whether 2-door or 4-door — is always in fashion. And did you know the Wrangler is the only convertible SUV made in America today? The Wrangler is the ultimate versatility vehicle in American today.  Off road, sand dunes or taking the kids to school, it’s sure to make you proud and always certain to get the extra look.  Come and visit our showroom for a test drive!']

Split by Spaces: ['Statement 8:', 'The experts at Consumer Guide Automotive have revealed their “Best Buys” for 2018 and we are pleased to announce that three of our models have made the cut! The all-new 2019 Chrysler Pacifica has arrived with unparalleled safety features and elegant styling that drives to impress, or check out the all new Ram fifteen hundred, which has the pickup segment very excited with this uniquely superior vehicle.  It’s rugged and stylish, and will impress in any driveway.   Come and take a look at any of these vehicles and you’ll see just what all the buzz is about.']

Skipped Paragraph: "PAGE"

--------------------------------------------------

Skipped Paragraph: "PAGE"

Skipped Paragraph: "PAGE  1"

Skipped Paragraph: "telephoneonhold.com	1-888-321-8477"

--------------------------------------------------

Skipped Paragraph: "PAGE"

--------------------------------------------------

## QA Script

In [6]:
print(set(all_noise))

{'', 'PAGE  1', ':', '++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++', '++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++', '(Dr. Jack (grin year))', '+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++', '++++++++++++++++++++++++++++++++++++++', 'telephoneonhold.com\t1-888-321-8477', 'PAGE', '.'}
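
The cells that follow sort the unique strings by length only, so entries of equal length appear in arbitrary set order (for example "5.", "6.", "4." in the title listing). Below is a minimal sketch of a helper that adds an alphabetical tie-break so the listing is reproducible between runs. The name `unique_by_length` and the sample list are hypothetical and not part of the original notebook.

```python
def unique_by_length(strings):
    """Deduplicate `strings` and sort shortest first, breaking ties alphabetically."""
    # Sorting on (length, text) keeps short, suspicious entries at the top
    # while making the order of equal-length entries deterministic.
    return sorted(set(strings), key=lambda s: (len(s), s))


# Hypothetical sample resembling the titles collected above.
example_titles = ['Statement 2:', '5.', 'Hours:', '4.', 'Statement 2:', '6.']
print(unique_by_length(example_titles))
```

Assuming `json` is already imported, as in the cells here, the helper could be passed straight to `json.dumps(unique_by_length(all_title), indent=2)`.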


In [7]:
# Unique titles, shortest first, so malformed or truncated titles surface at the top.
print(json.dumps(sorted(set(all_title), key=len), indent=2))

[
  "5.",
  "6.",
  "4.",
  "Ready?",
  "Hours:",
  "Prompt:",
  "Address:",
  "Prompt 3:",
  "Prompt 2:",
  "Prompt 1:",
  "Location:",
  "Statement3",
  "Statement 6",
  "Statement 5",
  "Statement 7",
  "Statement 6:",
  "For Spanish:",
  "Statement 5:",
  "Statement 4:",
  "Statement 2:",
  "Statement 7:",
  "Statement 3:",
  "Statement 8:",
  "Statement 9:",
  "Statement 1:",
  "Statement 3a:",
  "Day Greeting:",
  "Statement 4a:",
  "Statement 10:",
  "Statement 2a:",
  "Statement 2b:",
  "Statement 12:",
  "Statement 4A:",
  "Statement 3A:",
  "Statement 13:",
  "Statement 14:",
  "Statement 1B:",
  "Statement 5a:",
  "Statement 6A:",
  "Statement 5A:",
  "Statement 2A:",
  "Statement 11:",
  "Statement 1a:",
  "Close message:",
  "Phone Greeting:",
  "Statement:  10:",
  ":15 Spot\u2014Ready?",
  "Night Greeting:",
  "Closed Greeting:",
  "On Hold Program:",
  "Welcome Greeting:",
  "After pressing 3:",
  "Daytime Greeting:",
  "Greeting Message:",
  "Holiday Greeting:",
  "Aft

In [8]:
# Unique content strings, shortest first, so fragments and stray markers surface at the top.
print(json.dumps(sorted(set(all_content), key=len), indent=2))

[
  "-January.",
  "necessary.\u201d",
  "<little pause>",
  "will assist you.",
  "For Bayo, Press 1",
  "For Funsho, Press 1",
  "You\u2019re almost done!",
  "To reach Bayo Press 1",
  "For Brad Hall press 2.",
  "To reach Funsho Press 1",
  "For Casey Chase press 3.",
  "For Aaron Tripp press 1.",
  "Then, tap Motion Snooze.",
  "For insurance press two.",
  "For Tim Lambert press 7.",
  "To reach Payroll, press 6",
  "For Chiropractor, Press 4",
  "To reach Finance, press 7",
  "For Mike Hopkins press 4.",
  "For Scott Stuart press 6.",
  "For Jay in parts press 4.",
  "For Scott Roberts press 5.",
  "For Bill in parts press 3.",
  "For forms, please press 4.",
  "...set your desired angle\u2026",
  "For Kelly in parts press 1.",
  "For Physiotherapist, Press 2",
  "to reach Recruiting, press 5",
  "to reach Retail Ops, press 4",
  "...then re-tighten the screw.",
  "For the Chiropractor, Press 4",
  "With Ring you\u2019re always home.",
  "For Weston in IT/Web press 4.",
  "...an