# Verbalizing LJ Speech

The goal of this notebook is to experiment with different methods for verbalizing LJ Speech symbols.

In [1]:
import re
import sys

sys.path.insert(0, '../')

from src.datasets.lj_speech import _iterate_and_replace
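
The helper `_iterate_and_replace` is imported from the project source and its body is not shown in this notebook. Judging only from how the cells below call it (a regex, a text, and a callback that receives the matched string), a minimal stand-in might look like the sketch below. The name `iterate_and_replace_sketch`, the `group` parameter, and the fact that it returns the rewritten text are assumptions for illustration, not the project's actual implementation.

```python
import re


def iterate_and_replace_sketch(regex, text, replace, group=1):
    """Hypothetical stand-in for ``_iterate_and_replace`` (not the project's code).

    Calls ``replace`` on ``group`` of every match and splices the result back in.
    """
    pieces = []
    last_end = 0
    for match in re.finditer(regex, text):
        # Keep everything between the previous match and this one unchanged.
        pieces.append(text[last_end:match.start(group)])
        pieces.append(replace(match.group(group)))
        last_end = match.end(group)
    pieces.append(text[last_end:])
    return ''.join(pieces)


# Spell each digit of a box number, leaving the surrounding text untouched.
digits = {'1': 'one', '2': 'two', '5': 'five', '9': 'nine'}
result = iterate_and_replace_sketch(
    r'[Bb]ox ([0-9]+)\b', 'P.O. Box 2915, Dallas',
    lambda matched: ' '.join(digits[c] for c in matched))
print(result)
```

Only the captured group is rewritten, which is why the regexes in this notebook wrap the part to verbalize in parentheses.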

In [2]:
import random
import os

from IPython.display import Audio
from IPython.display import Markdown
from IPython.display import FileLink

from src.datasets import lj_speech_dataset

data = lj_speech_dataset(directory='../data', verbalize=False)

def get_unique(examples, get_key):
    """ Get a unique list of ``examples`` based on ``get_key``.
    
    Args:
        examples (list): Examples to dedup.
        get_key (callable): Maps an example to the key used to dedup it.

    Returns:
        list: The first example seen for each key, in the original order.
    """
    seen = set() 
    filtered = []
    for example in examples:
        key = get_key(example)
        if key not in seen:
            seen.add(key)
            filtered.append(example)
    return filtered

def find_examples(regex, display_n=5, match_to_key=None, load_audio=False, replace=True, group=1):
    """ Print ``display_n`` examples of ``regex`` in ``lj_speech_dataset``.
    
    Args:
        regex (str): Pattern or compiled regex object.
        display_n (int or None, optional): Number of examples to display; ``None`` shows all.
        match_to_key (callable or None, optional): Maps a match to a key used to filter duplicates.
        load_audio (bool, optional): Whether to embed an audio player instead of a file link.
        replace (bool, optional): Whether to overwrite the matched characters with X's in ``data``
            (in place), so that later, more general regexes do not match them again.
        group (int, optional): Group to select in regex.
        
    Returns:
        None
    """
    examples = []
    for row in data:
        matches = re.finditer(regex, row['text'])
        for match in matches:
            start = max(match.start(group) - 25, 0)
            end = min(match.end(group) + 25, len(row['text']))
            if replace:
                row['text'] = (row['text'][:match.start(group)] + 
                               'X' * (match.end(group) - match.start(group)) + 
                               row['text'][match.end(group):])
            if match_to_key is not None:
                key = match_to_key(match.group(group))
            else:
                key = None
            text = (row['text'][start:match.start(group)] + '**' + match.group(group) +
                    '**' + row['text'][match.end(group):end])
            examples.append({
                'text': '…' + text + '…',
                'audio': os.path.join('../data/LJSpeech-1.1/', row['wav']),
                'key': key
            })
            
    # Print Examples
    display(Markdown('### Examples Captured by Regex'))
    display(Markdown('**Regex:** ' + str(regex)))
    display(Markdown('**Number of Examples:** ' + str(len(examples))))
    
    if match_to_key is not None:
        examples = get_unique(examples, lambda example: example['key'])
    random.shuffle(examples)
    if display_n is not None:
        examples = examples[:display_n]
    
    display(Markdown('**Number of Examples Shown:** ' + str(len(examples))))
    display(Markdown('\n\n ___'))
    
    for example in examples:
        display(Markdown('**Text:** "' + example['text'] + '"'))
        if load_audio:
            display(Audio(example['audio']))
        else:
            display(FileLink(example['audio']))
        display(Markdown('\n\n ___'))
        display()

## Sample of Words with a Number

In [3]:
find_examples(r'\S*(\d+)\S*', display_n=100, replace=False, group=0, load_audio=False)

### Examples Captured by Regex

**Regex:** \S*(\d+)\S*

**Number of Examples:** 2118

**Number of Examples Shown:** 100



 ___

**Text:** "…arrived at approximately **1** p.m. and left a few minu…"



 ___

**Text:** "…er 4. The Assassin: Part **6.**…"



 ___

**Text:** "…r husband in November of **1962,**…"



 ___

**Text:** "… March 2, 1963, to April **24,** 1963,…"



 ___

**Text:** "…s to the State since the **1960** Presidential campaign an…"



 ___

**Text:** "…From September **24,** 1963, when Marina Oswald…"



 ___

**Text:** "…But about **1850** the two sides were disti…"



 ___

**Text:** "…ained a passport on June **25,** 1963.…"



 ___

**Text:** "…of 1963 show a weight of **136** pounds.…"



 ___

**Text:** "…a operates at a speed of **18.3** frames per second,…"



 ___

**Text:** "…and actually had only **1,000** printed.…"



 ___

**Text:** "…the Trade Mart, measured **10** miles and could be drive…"



 ___

**Text:** "…with two hits, within **4.8** and 5.6 seconds.…"



 ___

**Text:** "…periments had shown that **24** hours was a likely maxim…"



 ___

**Text:** "…te male about 30, 5 foot **8,** black hair, slender, end…"



 ___

**Text:** "…aced to the right of the **240-foot** target…"



 ___

**Text:** "…Chapter **7.** Lee Harvey Oswald: Backg…"



 ___

**Text:** "… 1962 until November 23, **1963.**…"



 ___

**Text:** "…may not endorse him in **'64.**…"



 ___

**Text:** "…ctually shipped on March **20** by Railway Express.…"



 ___

**Text:** "…At approximately **1** p.m., after last rites w…"



 ___

**Text:** "…second test run required **1** minute and 15 seconds.…"



 ___

**Text:** "…n 2:30 p.m., on November **22,** and 11 a.m.,…"



 ___

**Text:** "…5 foot 10 inches, weight **165** pounds, end quote.…"



 ___

**Text:** "…man they saw on November **22,** 1963.…"



 ___

**Text:** "… on this spot was on the **3rd** December, 1783,…"



 ___

**Text:** "…e Commission in February **1964,**…"



 ___

**Text:** "… at 500 North Beckley at **12:45** p.m.…"



 ___

**Text:** "…but in the **27th** George II the right of p…"



 ___

**Text:** "…ich took place in May of **1945.**…"



 ___

**Text:** "…**(2)** photographs found among …"



 ___

**Text:** "…ime of the first shot as **12** to 15 miles per hour.…"



 ___

**Text:** "…idate Johnson during the **1960** campaign,…"



 ___

**Text:** "…ifiable fingerprints and **8** palmprints were develope…"



 ___

**Text:** "…nown to amass as much as **£40.**…"



 ___

**Text:** "…e interview lasted about **20** to 25 minutes. In respon…"



 ___

**Text:** "…Approximately **15** men worked in the wareho…"



 ___

**Text:** "…autopsy at approximately **7:35** p.m.…"



 ___

**Text:** "…l the bank paper; I have **£30,000** now, and the Bank of Eng…"



 ___

**Text:** "…adily maintained, and in **1803** the total rose to 710.…"



 ___

**Text:** "…S received approximately **9,000** items of information;…"



 ___

**Text:** "…he afternoon of November **23,** Officers H. M. Moore,…"



 ___

**Text:** "… vote for Mr. Kennedy in **1960,**…"



 ___

**Text:** "…In **1853** three men escaped in com…"



 ___

**Text:** "…nterrogation on November **22,** Fritz asked Oswald to ac…"



 ___

**Text:** "…**(1)** those awaiting trial;…"



 ___

**Text:** "…ollar bill, the trip was **95** cents.…"



 ___

**Text:** "…idence on November 1 and **5,** 1963,…"



 ___

**Text:** "…riving at San Antonio at **1:30** p.m., Eastern Standard T…"



 ___

**Text:** "…On September **20,** 1963, Mrs. Paine and her…"



 ___

**Text:** "…n before; as in the year **1849,** a year memorable for the…"



 ___

**Text:** "…d Commission Exhibit No. **162** as the light-colored jac…"



 ___

**Text:** "…ght, because that is the **500** block of North Beckley, …"



 ___

**Text:** "…In the **100** years since 1865…"



 ___

**Text:** "…ailed to return by 10 or **10:30** p.m., Marina Oswald went…"



 ___

**Text:** "… that address on October **14,** 1963.…"



 ___

**Text:** "…nce on November 1 and 5, **1963,**…"



 ___

**Text:** "…wald, prior to April 10, **1963,**…"



 ___

**Text:** "…as on November 20 to 21, **1963.**…"



 ___

**Text:** "… is presented in chapter **5** of this report.…"



 ___

**Text:** "…and 375 respectively, or **650** in all.…"



 ___

**Text:** "…Biology. Chapter **10.** Morphology and Embryolog…"



 ___

**Text:** "…erested about the age of **15.** From an ideological view…"



 ___

**Text:** "…y stated, I have between **25** and 40 cases assigned to…"



 ___

**Text:** "… an area of 2 inches and **5** inches respectively.…"



 ___

**Text:** "…**(7)** attempted, in April 1963…"



 ___

**Text:** "…losed that on January 3, **1963,**…"



 ___

**Text:** "… manifest which showed a **12** o'clock trip from Travis…"



 ___

**Text:** "…and John Lancaster for **1** shilling, 8 pence, with …"



 ___

**Text:** "… was first discovered in **1820,**…"



 ___

**Text:** "…On September 30, **1952,** Lee enrolled in P.S. 117…"



 ___

**Text:** "…he was told that **45** minutes had been allotte…"



 ___

**Text:** "…metime after January 27, **1963,**…"



 ___

**Text:** "…from 1832 to **1844** not a single person had …"



 ___

**Text:** "…eekend of November 16 to **17,** 1963, the weekend before…"



 ___

**Text:** "…In the **6** to 8 minute period befor…"



 ___

**Text:** "…served in a report dated **1820,**…"



 ___

**Text:** "…icles of Newgate, Volume **2.** By Arthur Griffiths. Sec…"



 ___

**Text:** "…atters went on after the **1865** Act much the same as the…"



 ___

**Text:** "…On November **4,** Hosty telephoned the Tex…"



 ___

**Text:** "…er 4. The Assassin: Part **8.**…"



 ___

**Text:** "…during the summer of **1963.**…"



 ___

**Text:** "…ss prison for debtors in **1815.**…"



 ___

**Text:** "…ues combined amounted to **£27,000.**…"



 ___

**Text:** "…rifle and the scope, and **$1.50** for postage and handling…"



 ___

**Text:** "…o FBI agents on February **19,** 1964,…"



 ___

**Text:** "…On November 23, **1963,**…"



 ___

**Text:** "…distance of 100 yards in **5.9,** 6.2,…"



 ___

**Text:** "…Mr. Gee had invested **£1200** of this, and was seeking…"



 ___

**Text:** "…ons, until the spring of **1948.**…"



 ___

**Text:** "…As reported in chapter **2,** when the special file wa…"



 ___

**Text:** "…robbery there were still **£6000** worth in the warehouse.…"



 ___

**Text:** "…**3.** I paid the house rent on…"



 ___

**Text:** "…**6.** Projects will be allocat…"



 ___

**Text:** "…**7.** Certain of my documents …"



 ___

**Text:** "…ffered on the 29th June, **1809,**…"



 ___

**Text:** "…there were approximately **400** persons throughout the c…"



 ___

**Text:** "…Texas on November 21 and **22,**…"



 ___

**Text:** "… Frazier at the range of **25** yards landed within an a…"



 ___

**Text:** "…. postal money order for **$21.45,**…"



 ___

## Special Cases

In [4]:
import os

from IPython.display import Markdown
from IPython.display import FileLink

lookup = {
    'LJ044-0055': ('544 Camp Street New', 'five four four Camp Street New'),
    'LJ028-0180': ('In the year 562', 'In the year five sixty-two'),
    'LJ047-0063': ('602 Elsbeth Street', 'six oh two Elsbeth Street'),
    'LJ047-0160': ('411 Elm Street', 'four one one Elm Street'),
    'LJ047-0069': ('214 Neely Street', 'two one four Neely Street'),
    'LJ040-0121': ('P.S. 117', 'P.S. one seventeen'),
    'LJ032-0036': ('No. 2,202,130,462', 'No. two two zero two one three zero four six two'),
    'LJ029-0193': ('100 extra off-duty', 'one hundred extra off-duty'),
}

def special_cases():
    for row in data:
        basename = os.path.basename(row['wav']).split('.')[0]
        if basename in lookup:
            original = row['text']
            row['text'] = row['text'].replace(*lookup[basename])
            display(Markdown(original + ' → ' + row['text']))

special_cases()

In the year 562, after a long reign of forty-three years, Nebuchadnezzar died. → In the year five sixty-two, after a long reign of forty-three years, Nebuchadnezzar died.

to call in 100 extra off-duty officers to help protect President Kennedy. → to call in one hundred extra off-duty officers to help protect President Kennedy.

purchased as No. 2,202,130,462 in Dallas, Texas, on March 12, 1963. → purchased as No. two two zero two one three zero four six two in Dallas, Texas, on March 12, 1963.

On September 30, 1952, Lee enrolled in P.S. 117 → On September 30, 1952, Lee enrolled in P.S. one seventeen

While the legend, quote, FPCC, 544 Camp Street New Orleans, Louisiana, end quote, → While the legend, quote, FPCC, five four four Camp Street New Orleans, Louisiana, end quote,

Agent Hosty was told by Mrs. M. F. Tobias, a former landlady of the Oswalds at 602 Elsbeth Street in Dallas, → Agent Hosty was told by Mrs. M. F. Tobias, a former landlady of the Oswalds at six oh two Elsbeth Street in Dallas,

that the Oswalds were living at 214 Neely Street in Dallas. → that the Oswalds were living at two one four Neely Street in Dallas.

found it to be 411 Elm Street. End quote. → found it to be four one one Elm Street. End quote.

## Time of Day

In [5]:
regex = r'([0-9]{1,2}:[0-9]{1,2})'
find_examples(regex)

### Examples Captured by Regex

**Regex:** ([0-9]{1,2}:[0-9]{1,2})

**Number of Examples:** 84

**Number of Examples Shown:** 5



 ___

**Text:** "…XXXXX p.m., and again at **12:55** p.m.…"



 ___

**Text:** "…have arrived there about **12:59** to 1 p.m.…"



 ___

**Text:** "…XX or XXXX a.m. to about **2:45** or 3 a.m.;…"



 ___

**Text:** "…ination at approximately **12:54** p.m.…"



 ___

**Text:** "…At approximately **12:45** p.m., Dr. Robert Shaw,…"



 ___

In [6]:
from IPython.display import Markdown
from functools import partial
from num2words import num2words

cases = [
    ('alone in the shop about 9:30', 'nine thirty'),
    ('San Antonio at 1:30 p.m.,', 'one thirty'),
    ('At 1:51 p.m., police car 2 report', 'one fifty-one'),
]

def replace(text, true):
    split = text.split(':')
    assert len(split) == 2
    words = [num2words(int(num)) for num in split]
    ret = ' '.join(words)
    display(Markdown(text + ' → ' + ret + ' (' + true + ')'))
    return ret

for text in cases:
    _iterate_and_replace(regex, text[0], partial(replace, true=text[1]))

9:30 → nine thirty (nine thirty)

1:30 → one thirty (one thirty)

1:51 → one fifty-one (one fifty-one)
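
The three cases above all have minutes of ten or more. With the approach in the cell, `int('05')` becomes `5`, so a time such as `10:05` would come out as "ten five". The sketch below shows one way to handle single-digit and zero minutes using only the standard library. The helper names (`two_digit_words`, `verbalize_time`) are hypothetical, and the "oh" and "o'clock" readings are assumptions about the desired output that have not been checked against the audio.

```python
ONES = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight',
        'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen',
        'sixteen', 'seventeen', 'eighteen', 'nineteen']
TENS = ['', '', 'twenty', 'thirty', 'forty', 'fifty']


def two_digit_words(n):
    """Spell an integer in the range 0..59."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else TENS[tens] + '-' + ONES[ones]


def verbalize_time(text):
    """Verbalize ``H:MM``; assumed readings for ``:00`` and ``:0M`` minutes."""
    hour, minute = (int(part) for part in text.split(':'))
    if minute == 0:
        return two_digit_words(hour) + " o'clock"
    if minute < 10:
        return two_digit_words(hour) + ' oh ' + two_digit_words(minute)
    return two_digit_words(hour) + ' ' + two_digit_words(minute)


for example in ['9:30', '1:51', '10:05', '2:00']:
    print(example, '→', verbalize_time(example))
```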

## Ordinals

In [7]:
regex = r'([0-9]+(st|nd|rd|th))'
find_examples(regex)

### Examples Captured by Regex

**Regex:** ([0-9]+(st|nd|rd|th))

**Number of Examples:** 71

**Number of Examples Shown:** 5



 ___

**Text:** "…d with his escort on the **17th** September the same year.…"



 ___

**Text:** "…ssed the intersection of **10th** and Patton, about eight …"



 ___

**Text:** "…He looked west on **10th** and saw a man running to…"



 ___

**Text:** "…he would have reached **10th** and Patton shortly after…"



 ___

**Text:** "…ses found on the lawn at **10th** Street and Patton Avenue…"



 ___

In [8]:
from IPython.display import Markdown
from functools import partial

from num2words import num2words

cases = [('shortly before Lee\'s 13th birthday', 'thirteenth'), 
         ('On October 23rd, I had attended a ultra', 'twenty-third'),
         ('between May 1st, 1827,', 'first'),
         ('and 30th April, 1831,', 'thirtieth')]

def replace(text, true):
    digit = ''.join([c for c in text if c.isdigit()])
    ret = num2words(int(digit), ordinal=True)
    display(Markdown(text + ' → ' + ret + ' (' + true + ')'))
    return ret


for text in cases:
    _iterate_and_replace(regex, text[0], partial(replace, true=text[1]))

13th → thirteenth (thirteenth)

23rd → twenty-third (twenty-third)

1st → first (first)

30th → thirtieth (thirtieth)

## Money (dollars or pounds)

In [9]:
regex = r'(\S*([$£]{1}[0-9\,\.]+\b))'
find_examples(regex)

### Examples Captured by Regex

**Regex:** (\S*([$£]{1}[0-9\,\.]+\b))

**Number of Examples:** 128

**Number of Examples Shown:** 5



 ___

**Text:** "… bought by the Crown for **£10,500**.…"



 ___

**Text:** "…XXXX on alterations, but **£60,000** would suffice to reconst…"



 ___

**Text:** "…nt of debts sued for was **£81,791**.…"



 ___

**Text:** "…warden, whose income was **£2372**.…"



 ___

**Text:** "…963, included an item of **$21.45**. Klein's shipping order …"



 ___

In [10]:
from IPython.display import Markdown

from functools import partial
from num2words import num2words

cases = [('rough diamonds valued at £4000.', 'four thousand pounds'), 
         ('inch BBL, unquote, cost $29.95.', 'twenty-nine dollars, ninety-five cents'),
         ('was indebted upwards of £50,000 subsequently stopped pay', 'fifty thousand pounds'),
         ('warden, whose income was £2372.', 'two thousand, three hundred seventy-two pounds'),
         ('plus $1.27', 'one dollar, twenty-seven cents'),
         ('$19.95,', 'nineteen dollars, ninety-five cents'),
         ('were out to the value of £367,800.', 'three hundred sixty-seven thousand and eight hundred pounds'),
         ('the offer of a reward of £1500 for the detection of the', 'fifteen hundred pounds'),
         ('of England notes for £1000 each,', 'one thousand pounds each'),
         ('of approximately $3,000,000 during that period', 'three million'),
         ('only afford to give £1750 for stones', 'one thousand seven-fifty pounds'),
         ('e surrender of the other £1200', 'one thousand, two hundred pounds')]

def replace(text, true):
    digit = text[1:].replace(',', '')
    ret = num2words(digit, to='currency', currency='USD')
    ret = ret.replace(', zero cents', '')
    ret = ret.replace('hundred and', 'hundred')
    if '£' in text:
        # num2words has bugs with GBP currency, so convert as USD and rename the units
        ret = ret.replace('dollar', 'pound')
        ret = ret.replace('cents', 'pence')
        ret = ret.replace('cent', 'penny')
    display(Markdown(text + ' → ' + ret + ' (' + true + ')'))
    return ret

for text in cases:
    _iterate_and_replace(regex, text[0], partial(replace, true=text[1]))

£4000 → four thousand pounds (four thousand pounds)

$29.95 → twenty-nine dollars, ninety-five cents (twenty-nine dollars, ninety-five cents)

£50,000 → fifty thousand pounds (fifty thousand pounds)

£2372 → two thousand, three hundred seventy-two pounds (two thousand, three hundred seventy-two pounds)

$1.27 → one dollar, twenty-seven cents (one dollar, twenty-seven cents)

$19.95 → nineteen dollars, ninety-five cents (nineteen dollars, ninety-five cents)

£367,800 → three hundred sixty-seven thousand, eight hundred pounds (three hundred sixty-seven thousand and eight hundred pounds)

£1500 → one thousand, five hundred pounds (fifteen hundred pounds)

£1000 → one thousand pounds (one thousand pounds each)

$3,000,000 → three million dollars (three million)

£1750 → one thousand, seven hundred fifty pounds (one thousand seven-fifty pounds)

£1200 → one thousand, two hundred pounds (one thousand, two hundred pounds)
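
Several rows above differ from the reading given in parentheses (for example £1500), and spotting them by eye gets harder as the case list grows. The sketch below collects only the mismatches. `check_cases` and the lookup-table verbalizer are hypothetical names used for illustration; the two sample rows are copied from the output above.

```python
def check_cases(cases, verbalize):
    """Return (input, produced, expected) for every case where they differ."""
    mismatches = []
    for text, expected in cases:
        produced = verbalize(text)
        if produced != expected:
            mismatches.append((text, produced, expected))
    return mismatches


# Hypothetical stand-in for a real verbalizer: a lookup of two outputs shown above.
toy_table = {
    '£1500': 'one thousand, five hundred pounds',
    '£4000': 'four thousand pounds',
}
sample_cases = [
    ('£1500', 'fifteen hundred pounds'),
    ('£4000', 'four thousand pounds'),
]

mismatches = check_cases(sample_cases, toy_table.get)
for text, produced, expected in mismatches:
    print(text, '→', produced, '(expected: ' + expected + ')')
```

Passing the verbalizer as a callable keeps the check independent of which section's `replace` function is under test.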

In [11]:
# Currency matches were already masked with X's by the earlier find_examples call, so none remain
find_examples(r'([$£])', replace=False)

### Examples Captured by Regex

**Regex:** ([$£])

**Number of Examples:** 0

**Number of Examples Shown:** 0



 ___

## PO Box Numbers & Serial Numbers

In [12]:
find_examples(r'([Bb]ox [0-9]+\b)')

### Examples Captured by Regex

**Regex:** ([Bb]ox [0-9]+\b)

**Number of Examples:** 14

**Number of Examples Shown:** 5



 ___

**Text:** "…inting, "A. Hidell, P.O. **Box 2915**, Dallas, Texas."…"



 ___

**Text:** "…e words "A. Hidell, P.O. **Box 2915** Dallas, Texas."…"



 ___

**Text:** "…. J. Hidell, Post Office **Box 2915**, Dallas, Texas.…"



 ___

**Text:** "…Post Office **Box 2915**, Dallas, Texas, on March…"



 ___

**Text:** "…d had rented post office **box 30061** in New Orleans on June 3…"



 ___

In [13]:
find_examples(r'(\b[A-Za-z]+[0-9]+\b)')

### Examples Captured by Regex

**Regex:** (\b[A-Za-z]+[0-9]+\b)

**Number of Examples:** 16

**Number of Examples Shown:** 5



 ___

**Text:** "…control number **VC836**, serial number C2766, wa…"



 ___

**Text:** "…son revolver, serial No. **V510210**,…"



 ___

**Text:** "… internal control number **VC836** on this rifle.…"



 ___

**Text:** "…ano rifle, serial number **C2766**,…"



 ___

**Text:** "…facturer's serial number **C2766**.…"



 ___

In [14]:
from IPython.display import Markdown

from functools import partial
from num2words import num2words

cases = [('Post Office Box 2915, Dallas, Texas, on March', 'two nine one five'), 
         ('Post Office Box 30016, New Orleans', 'three zero zero one six'),
         ('serial No. C2766, which was also found', 'C two seven six six'),
         ('control number VC836, serial number', 'V C eight three six'),
         ('Commando, serial number V510210, end quote', 'V five one zero two one zero')]

def replace(text, true):
    split = text.split(' ')
    ret = [num2words(int(t)) if t.isdigit() else t for t in list(split[-1])]
    ret = ' '.join(ret)
    if len(split) == 2:
        ret = split[0] + ' ' + ret
    display(Markdown(text + ' → ' + ret + ' (' + true + ')'))
    return ret

for regex in [r'([Bb]ox [0-9]+\b)', r'(\b[A-Za-z]+[0-9]+\b)']:
    for text in cases:
        _iterate_and_replace(regex, text[0], partial(replace, true=text[1]))

Box 2915 → Box two nine one five (two nine one five)

Box 30016 → Box three zero zero one six (three zero zero one six)

C2766 → C two seven six six (C two seven six six)

VC836 → V C eight three six (V C eight three six)

V510210 → V five one zero two one zero (V five one zero two one zero)

## Year

In [15]:
regexes = [r'(\b[0-9]{4}\b)', r'\b(?:in|In) ([0-9]{3})\b', r'\b([0-9]{3}) B\.C\b']
for regex in regexes:
    find_examples(regex)

### Examples Captured by Regex

**Regex:** (\b[0-9]{4}\b)

**Number of Examples:** 582

**Number of Examples Shown:** 5



 ___

**Text:** "…On August 21, **1963**, Bureau headquarters ins…"



 ___

**Text:** "…On August 17, **1963**, he appeared briefly on …"



 ___

**Text:** "…t address on October 14, **1963**.…"



 ___

**Text:** "…from **1832** to 1844 not a single per…"



 ___

**Text:** "… actions on November 22, **1963**.…"



 ___

### Examples Captured by Regex

**Regex:** \b(?:in|In) ([0-9]{3})\b

**Number of Examples:** 13

**Number of Examples Shown:** 5



 ___

**Text:** "…In **605**,…"



 ___

**Text:** "…Eleven years later, in **586**, he destroyed the sacred…"



 ___

**Text:** "…In **597**, when he sent his army t…"



 ___

**Text:** "…n was unavailing, and in **275** B.C., the inhabitants of…"



 ___

**Text:** "…In **529** Cyrus died.…"



 ___

### Examples Captured by Regex

**Regex:** \b([0-9]{3}) B\.C\b

**Number of Examples:** 2

**Number of Examples Shown:** 2



 ___

**Text:** "…t Babylon, writing about **250** B.C.,…"



 ___

**Text:** "…and there on June 13, **323** B.C., he met his death.…"



 ___

In [16]:
from IPython.display import Markdown

from functools import partial
from num2words import num2words

cases = [('dated April XXXX, 1787, describing an', 'seventeen eighty-seven'), 
         ('Newgate down to 1818,', 'eighteen eighteen'),
         ('It was about 2250 B.C., when the great', 'twenty-two fifty'),
         ('In 597, when he sent his army', 'five ninety-seven'),
         ('writing about 250 B.C.', 'two fifty'),
         ('In 606, Nineveh', 'six oh-six'),
         ('June 13, 323 B.C.,', 'three twenty-three')]

def replace(text, true):
    ret = num2words(int(text), lang='en', to='year')
    display(Markdown(text + ' → ' + ret + ' (' + true + ')'))
    return ret


for regex in regexes:
    for text in cases:
        _iterate_and_replace(regex, text[0], partial(replace, true=text[1]))


1787 → seventeen eighty-seven (seventeen eighty-seven)

1818 → eighteen eighteen (eighteen eighteen)

2250 → twenty-two fifty (twenty-two fifty)

597 → five ninety-seven (five ninety-seven)

606 → six oh-six (six oh-six)

250 → two fifty (two fifty)

323 → three twenty-three (three twenty-three)

## Numero (no.)

In [17]:
regex = r'(?:No|no)\. ([0-9]+)'
find_examples(regex)

### Examples Captured by Regex

**Regex:** (?:No|no)\. ([0-9]+)

**Number of Examples:** 29

**Number of Examples Shown:** 5



 ___

**Text:** "…ld was the man under No. **2**.…"



 ___

**Text:** "…. Horn, aged 18, was No. **1**;…"



 ___

**Text:** "…Lee Oswald was No. **3**;…"



 ___

**Text:** "…d Commission Exhibit No. **150** (the shirt taken from Os…"



 ___

**Text:** "… Lujan, aged 26, was No. **4**.…"



 ___

In [18]:
from IPython.display import Markdown

from functools import partial
from num2words import num2words

cases = [('Commission Exhibit No. 133-B,', 'one thirty-three'), 
         ('Commission Exhibit No. 162 as', 'one sixty-two')]

def replace(text, true):
    ret = num2words(int(text), lang='en', to='year')
    display(Markdown(text + ' → ' + ret + ' (' + true + ')'))
    return ret

for text in cases:
    _iterate_and_replace(regex, text[0], partial(replace, true=text[1]))

133 → one thirty-three (one thirty-three)

162 → one sixty-two (one sixty-two)

## Other Numbers

In [19]:
find_examples(r'(\b[0-9]{1}[0-9\.\,]{0,}\b)', display_n=50)

### Examples Captured by Regex

**Regex:** (\b[0-9]{1}[0-9\.\,]{0,}\b)

**Number of Examples:** 1171

**Number of Examples Shown:** 50



 ___

**Text:** "…orieties continued, part **2**.…"



 ___

**Text:** "…n XXXX p.m., on November **22**, and 11 a.m.,…"



 ___

**Text:** "…ca sack made on December **1**, XXXX,…"



 ___

**Text:** "… November XX to November **18**, when he was joined by A…"



 ___

**Text:** "…Chapter **8**. The Protection of the P…"



 ___

**Text:** "…and XX to **15** of these cases as highly…"



 ___

**Text:** "…On April **21**, XXXX, the FBI field off…"



 ___

**Text:** "…uched of XXX debtors and **182** felons, or 379 in all.…"



 ___

**Text:** "…at XX yards in X, X, and **9** seconds,…"



 ___

**Text:** "…ed "Friday November XX, '**63**" and was punched in two …"



 ___

**Text:** "…months prior to November **22**.…"



 ___

**Text:** "…On June **24**, XXXX, he applied for a …"



 ___

**Text:** "…IA to the FBI on October **10**,…"



 ___

**Text:** "…fle was shipped on March **20**, and the shooting occurr…"



 ___

**Text:** "…At least **12** persons saw the man with…"



 ___

**Text:** "…(**2**) took paper and tape fro…"



 ___

**Text:** "…lly discussed in chapter **6**, page 249.…"



 ___

**Text:** "…ly, the entry of January **4** to 31 of XXXX, quote,…"



 ___

**Text:** "…t his remarks of January **7** were intended by him mer…"



 ___

**Text:** "… quote, about XX, X foot **8** inches, black hair, slen…"



 ___

**Text:** "…ixth Floor Approximately **35** Minutes Before the Assas…"



 ___

**Text:** "…On December **2**, XXXX, Mrs. Ruth Paine t…"



 ___

**Text:** "…imately XX, well, almost **11** years old. End quote.…"



 ___

**Text:** "…On August **22**, it learned that Oswald …"



 ___

**Text:** "…man they saw on November **22**, XXXX.…"



 ___

**Text:** "… is presented in chapter **5** of this report.…"



 ___

**Text:** "… President to spend only **1** day in the State, making…"



 ___

**Text:** "…ret Service in the first **4** months of XXXX.…"



 ___

**Text:** "… circulation on November **21** of a handbill sharply cr…"



 ___

**Text:** "…XX George III. c. XX, s. **4** (XXXX)…"



 ___

**Text:** "…(**3**) firearm identification …"



 ___

**Text:** "…belonged to Oswald, and (**4**)…"



 ___

**Text:** "…er X. The Assassin: Part **3**.…"



 ___

**Text:** "…e Commission considered (**1**)…"



 ___

**Text:** "…e White House on October **4** to discuss the details o…"



 ___

**Text:** "…proximately XXX of these **400** cases as serious risks…"



 ___

**Text:** "…t regarded approximately **100** of these 400 cases as se…"



 ___

**Text:** "…ne residence on November **1** and 5, XXXX,…"



 ___

**Text:** "…ard, each dated November **22**, were for Scott-Foresman…"



 ___

**Text:** "…heeler. Biology. Chapter **8**.…"



 ___

**Text:** "…included in the group of **400**…"



 ___

**Text:** "…ory Building on November **22**, carrying a long and bul…"



 ___

**Text:** "…d measured approximately **5** inches (13 centimeters) …"



 ___

**Text:** "…news stories on November **19**, 20, and 22.…"



 ___

**Text:** "…n proceeded at a rate of **12** to 15 miles per hour…"



 ___

**Text:** "…th FBI agents on January **7**, XXXX,…"



 ___

**Text:** "…as been shown in chapter **3**, if the three shots were…"



 ___

**Text:** "…On November **4**, Gerald A. Behn, agent i…"



 ___

**Text:** "…weeks before, on October **7**, but she had asked him t…"



 ___

**Text:** "…, end quote, for October **21**, XXXX, reports, quote,…"



 ___

In [20]:
from IPython.display import Markdown

from functools import partial
from num2words import num2words

cases = [('Chapter 4. The Assassin:', 'four'), 
         ('the morning of November 22 prior to the motorcade', 'twenty-two'),
         ('was shipped on March 20, and the shooting', 'twenty'),
         ('Kennedy in the neck at 176.9', 'one hundred seventy-six point nine'), 
         ('distance of 265.3 feet was, quote', 'two hundred sixty-five point three'),
         ('ries they required XXXX, 6.45,', 'six point four five'),
         ('information on some 50,000 cases', 'fifty thousand'), 
         ('actually had only 1,000 printed.', 'one thousand'),
         ('PRS received items in 8,709 cases', 'eight thousand, seven hundred nine'),
         ('debtors and 182 felons,', 'one hundred eighty-two')]

def replace(text, true):
    text = text.replace(',', '')
    ret = num2words(float(text))
    ret = ret.replace('hundred and', 'hundred')
    display(Markdown(text + ' → ' + ret + ' (' + true + ')'))
    return ret

for text in cases:
    _iterate_and_replace(r'(\b[0-9\.\,]+\b)', text[0], partial(replace, true=text[1]))

4 → four (four)

22 → twenty-two (twenty-two)

20 → twenty (twenty)

176.9 → one hundred seventy-six point nine (one hundred seventy-six point nine)

265.3 → two hundred sixty-five point three (two hundred sixty-five point three)

6.45 → six point four five (six point four five)

50000 → fifty thousand (fifty thousand)

1000 → one thousand (one thousand)

8709 → eight thousand, seven hundred nine (eight thousand, seven hundred nine)

182 → one hundred eighty-two (one hundred eighty-two)

## Roman Numerals

In [32]:
find_examples(r'\b(?:George|Charles|Napoleon|Henry|Nebuchadnezzar|William) ([IV]+\.{0,})', display_n=50, replace=False)

### Examples Captured by Regex

**Regex:** \b(?:George|Charles|Napoleon|Henry|Nebuchadnezzar|William) ([IV]+\.{0,})

**Number of Examples:** 22

**Number of Examples Shown:** 22



 ___

**Text:** "…Again, the XX Charles **II.** c XX ordered the jailer …"



 ___

**Text:** "…tried to stab George **III.** as he was alighting from…"



 ___

**Text:** "…later act, the XX George **III.** c. XX (XXXX),…"



 ___

**Text:** "…irements of the X George **IV.**…"



 ___

**Text:** "…ew Jail Acts of X George **IV**…"



 ___

**Text:** "…force was the XXX George **IV.** cap. XX, which directed …"



 ___

**Text:** "…By the XX George **III.** c. XX, s. X,…"



 ___

**Text:** "…mber in the XXXX Charles **I.**…"



 ___

**Text:** "…as instituted by Charles **I.** in the sixth year of his…"



 ___

**Text:** "…but in the XXXX George **II** the right of presentatio…"



 ___

**Text:** "…William **IV.** was also the victim of a…"



 ___

**Text:** "…ck as the reign of Henry **VIII.** a new and most cruel pen…"



 ___

**Text:** "…id down by the XX George **III.** cap. XX,…"



 ___

**Text:** "…fired a pistol at George **III.** from the pit of Drury La…"



 ___

**Text:** "…assing of the XXX George **IV.** c. XX, any two justices …"



 ___

**Text:** "… as the reign of Charles **II.**, a law was passed declar…"



 ___

**Text:** "…others, that Napoleon **III.**, but recently proclaimed…"



 ___

**Text:** "…e long illness of George **III.**, as many as one hundred …"



 ___

**Text:** "…nd in XXX Nebuchadnezzar **III.**, a native Babylonian, wa…"



 ___

**Text:** "…wever, was the XX George **III.** c. XX, s. X (XXXX)…"



 ___

**Text:** "…cap. XX, and X George **IV.** cap. XX…"



 ___

**Text:** "…hich became the X George **IV.** cap. XX, said that he ha…"



 ___

In [54]:
from IPython.display import Markdown

from functools import partial
from num2words import num2words

cases = [('reign of Charles II., a law was passed', 'the second'), 
         ('William IV. was also the victim', 'the fourth'),
         ('the reign of Henry VIII. a new and most', 'the eighth')]

def replace(text, true):
    if text[-1] == '.':
        text = text[:-1]
        
    num = 0
    if 'V' not in text:
        num = len(text)
    elif 'IV' == text:
        num = 4
    else:
        num = 5 + len(text) - 1
        
    ret = 'the ' + num2words(int(num), to='ordinal')
    display(Markdown(text + ' → ' + ret + ' (' + true + ')'))
    return ret

for text in cases:
    _iterate_and_replace(r'\b(?:George|Charles|Napoleon|Henry|Nebuchadnezzar|William) ([IV]+\.{0,})',
                         text[0], partial(replace, true=text[1]))

II → the second (the second)

IV → the fourth (the fourth)

VIII → the eighth (the eighth)
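
The conversion in the cell above only covers numerals built from `I` and `V`, which matches what the regex `[IV]+` can capture. If numerals containing `X` or larger symbols ever needed handling, a general parser along the lines below could replace the length-based arithmetic. This is a sketch under that assumption: `roman_to_int` is a hypothetical helper, it does not validate malformed numerals, and whether LJ Speech contains any such names has not been checked here.

```python
ROMAN_VALUES = {'I': 1, 'V': 5, 'X': 10, 'L': 50, 'C': 100, 'D': 500, 'M': 1000}


def roman_to_int(numeral):
    """Convert a well-formed Roman numeral such as ``VIII`` to an integer."""
    total = 0
    for i, char in enumerate(numeral):
        value = ROMAN_VALUES[char]
        # A smaller symbol directly before a larger one is subtracted (IV = 4, IX = 9).
        if i + 1 < len(numeral) and value < ROMAN_VALUES[numeral[i + 1]]:
            total -= value
        else:
            total += value
    return total


for numeral in ['II', 'IV', 'VIII', 'XIV']:
    print(numeral, '→', roman_to_int(numeral))
```

The resulting integer can then be passed to the same ordinal conversion used in the cell above.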