# Verbalizing LJ Speech

The goals of this notebook are:

- Construct an algorithm to verbalize the LJ Speech dataset.
- Understand the distribution of non-verbalized symbols such as numbers.

In [None]:
import re
import sys

# Set up the "PYTHONPATH"
sys.path.insert(0, '../../')

from src.datasets.lj_speech import _iterate_and_replace
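
`_iterate_and_replace` is imported from `src.datasets.lj_speech`, and its source is not shown in this notebook. Judging from how the cells below call it, it appears to pass the text of the first regex group to a callback and substitute the callback's return value. A minimal sketch under that assumption follows; the name `iterate_and_replace_sketch` and the `group` parameter are hypothetical and are not the project's code.

```python
import re


def iterate_and_replace_sketch(regex, text, replace, group=1):
    """Replace every match of ``group`` in ``text`` with ``replace(matched_text)``.

    Hypothetical stand-in for ``_iterate_and_replace``; the real helper may differ.
    """
    pieces = []
    last = 0
    for match in re.finditer(regex, text):
        # Keep the text between the previous match and this one untouched.
        pieces.append(text[last:match.start(group)])
        # Substitute only the selected group, not the whole match.
        pieces.append(replace(match.group(group)))
        last = match.end(group)
    pieces.append(text[last:])
    return ''.join(pieces)
```

Replacing only the selected group, rather than the whole match, is what lets a pattern such as `(?:No|no)\. ([0-9]+)` keep its "No." prefix in the output.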

In [2]:
import random
import os

from IPython.display import Audio
from IPython.display import Markdown
from IPython.display import FileLink

from src.datasets import lj_speech_dataset

data = lj_speech_dataset(directory='../../data', verbalize=False)

def get_unique(examples, get_key):
    """ Get a unique list of ``examples`` based on the key returned by ``get_key``.
    
    Args:
        examples (list): Examples to dedup.
        get_key (callable): Get a key to dedup examples.
    """
    seen = set() 
    filtered = []
    for example in examples:
        key = get_key(example)
        if key not in seen:
            seen.add(key)
            filtered.append(example)
    return filtered

def find_examples(regex, display_n=5, match_to_key=None, load_audio=False, replace=True, group=1):
    """ Print ``display_n`` examples of ``regex`` in ``lj_speech_dataset``.
    
    This is the bread-and-butter function for our data analysis: it lets us query the dataset with
    a regex and retrieve samples.
    
    Args:
        regex (str or re.Pattern): Pattern string or compiled regex object.
        display_n (int or None, optional): Number of examples to display.
        match_to_key (callable or None, optional): Maps a match to a key used to filter duplicates.
        load_audio (bool, optional): Whether to load the audio inline instead of linking to it.
        replace (bool, optional): Whether to replace the matched characters with XXX...
        group (int, optional): Group to select in regex.
        
    Returns:
        None
    """
    examples = []
    for row in data:
        matches = re.finditer(regex, row['text'])
        for match in matches:
            start = max(match.start(group) - 25, 0)
            end = min(match.end(group) + 25, len(row['text']))
            if replace:
                row['text'] = (row['text'][:match.start(group)] + 
                               'X' * (match.end(group) - match.start(group)) + 
                               row['text'][match.end(group):])
            if match_to_key is not None:
                key = match_to_key(match.group(group))
            else:
                key = None
            text = (row['text'][start:match.start(group)] + '**' + match.group(group) +
                    '**' + row['text'][match.end(group):end])
            examples.append({
                'text': '…' + text + '…',
                'audio': os.path.join('../../', row['audio_path']),
                'key': key
            })
            
    # Print Examples
    display(Markdown('### Examples Captured by Regex'))
    display(Markdown('**Regex:** ' + str(regex)))
    display(Markdown('**Number of Examples:** ' + str(len(examples))))
    
    if match_to_key is not None:
        examples = get_unique(examples, lambda example: example['key'])
    random.shuffle(examples)
    if display_n is not None:
        examples = examples[:display_n]
    
    display(Markdown('**Number of Examples Shown:** ' + str(len(examples))))
    display(Markdown('\n\n ___'))
    
    for example in examples:
        display(Markdown('**Text:** "' + example['text'] + '"'))
        if load_audio:
            display(Audio(example['audio']))
        else:
            display(FileLink(example['audio']))
        display(Markdown('\n\n ___'))
        display()

## Sample of Phrases with a Number

Here we look at the overall distribution of numbers in the LJ Speech dataset.

In [3]:
find_examples(r'\S*(\d+)\S*', display_n=10, replace=False, group=0, load_audio=False)

### Examples Captured by Regex

**Regex:** \S*(\d+)\S*

**Number of Examples:** 2118

**Number of Examples Shown:** 10



 ___

**Text:** "… reported in the October **1,** 1963, issue of the Worke…"



 ___

**Text:** "… V. T. Lee, dated August **17,** 1963,…"



 ___

**Text:** "…ed his initial orders at **12:30** p.m.…"



 ___

**Text:** "…ursday morning, November **21,**…"



 ___

**Text:** "…Department shows that at **12:30** p.m. on November 22…"



 ___

**Text:** "…have obtained as much as **£40,000** by false and fraudulent …"



 ___

**Text:** "…imes-Herald of September **17**…"



 ___

**Text:** "…This was no earlier than **12:37** p.m. and may have been l…"



 ___

**Text:** "…rthur Griffiths. Section **16:** Newgate notorieties cont…"



 ___

**Text:** "…He took with him **$13.87** and the long brown packa…"



 ___

## Special Cases

Here we verbalize special phrases that the patterns below do not capture.

In [4]:
import os

from IPython.display import Markdown
from IPython.display import FileLink

lookup = {
    'LJ044-0055': ('544 Camp Street New', 'five four four Camp Street New'),
    'LJ028-0180': ('In the year 562', 'In the year five sixty-two'),
    'LJ047-0063': ('602 Elsbeth Street', 'six oh two Elsbeth Street'),
    'LJ047-0160': ('411 Elm Street', 'four one one Elm Street'),
    'LJ047-0069': ('214 Neely Street', 'two one four Neely Street'),
    'LJ040-0121': ('P.S. 117', 'P.S. one seventeen'),
    'LJ032-0036': ('No. 2,202,130,462', 'No. two two zero two one three zero four six two'),
    'LJ029-0193': ('100 extra off-duty', 'one hundred extra off-duty'),
}

def special_cases():
    for row in data:
        basename = os.path.basename(row['audio_path']).split('.')[0]
        if basename in lookup:
            original = row['text']
            row['text'] = row['text'].replace(*lookup[basename])
            display(Markdown(original + ' → ' + row['text']))

special_cases()

In the year 562, after a long reign of forty-three years, Nebuchadnezzar died. → In the year five sixty-two, after a long reign of forty-three years, Nebuchadnezzar died.

to call in 100 extra off-duty officers to help protect President Kennedy. → to call in one hundred extra off-duty officers to help protect President Kennedy.

purchased as No. 2,202,130,462 in Dallas, Texas, on March 12, 1963. → purchased as No. two two zero two one three zero four six two in Dallas, Texas, on March 12, 1963.

On September 30, 1952, Lee enrolled in P.S. 117 → On September 30, 1952, Lee enrolled in P.S. one seventeen

While the legend, quote, FPCC, 544 Camp Street New Orleans, Louisiana, end quote, → While the legend, quote, FPCC, five four four Camp Street New Orleans, Louisiana, end quote,

Agent Hosty was told by Mrs. M. F. Tobias, a former landlady of the Oswalds at 602 Elsbeth Street in Dallas, → Agent Hosty was told by Mrs. M. F. Tobias, a former landlady of the Oswalds at six oh two Elsbeth Street in Dallas,

that the Oswalds were living at 214 Neely Street in Dallas. → that the Oswalds were living at two one four Neely Street in Dallas.

found it to be 411 Elm Street. End quote. → found it to be four one one Elm Street. End quote.

## Time of the Day

Here we experiment with verbalizing the time of day.

In [5]:
regex = r'([0-9]{1,2}:[0-9]{1,2})'
find_examples(regex)

### Examples Captured by Regex

**Regex:** ([0-9]{1,2}:[0-9]{1,2})

**Number of Examples:** 84

**Number of Examples Shown:** 5



 ___

**Text:** "…showed the numerals **12:30** as the Vice-Presidential…"



 ___

**Text:** "…again at **12:48** p.m., and again at 12:55…"



 ___

**Text:** "…At **1:51** p.m., police car 2 repor…"



 ___

**Text:** "…chool Book Depository at **2:15** after a brief stop at th…"



 ___

**Text:** "…the bus at approximately **12:40** p.m. and left it at appr…"



 ___

In [6]:
from IPython.display import Markdown
from functools import partial
from num2words import num2words

cases = [
    ('alone in the shop about 9:30', 'nine thirty'),
    ('San Antonio at 1:30 p.m.,', 'one thirty'),
    ('At 1:51 p.m., police car 2 report', 'one fifty-one'),
]

def replace(text, true):
    split = text.split(':')
    assert len(split) == 2
    words = [num2words(int(num)) for num in split]
    ret = ' '.join(words)
    display(Markdown(text + ' → ' + ret + ' (' + true + ')'))
    return ret

for text in cases:
    _iterate_and_replace(regex, text[0], partial(replace, true=text[1]))

9:30 → nine thirty (nine thirty)

1:30 → one thirty (one thirty)

1:51 → one fifty-one (one fifty-one)
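
The three test cases above all have minutes of ten or more. Because `int('05')` is `5`, the same approach would presumably read "12:05" as "twelve five". The sketch below is one way to handle that case without `num2words`; the "oh" and "o'clock" readings are assumptions about how the speaker says these times, and `verbalize_time` is a hypothetical name.

```python
ONES = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine',
        'ten', 'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen', 'sixteen',
        'seventeen', 'eighteen', 'nineteen']
TENS = {2: 'twenty', 3: 'thirty', 4: 'forty', 5: 'fifty'}


def under_sixty(number):
    """Spell out an integer in the range 0-59."""
    if number < 20:
        return ONES[number]
    tens, ones = divmod(number, 10)
    return TENS[tens] + ('-' + ONES[ones] if ones else '')


def verbalize_time(text):
    """Verbalize a clock time such as ``12:05`` (assumed readings, see lead-in)."""
    hour, minute = (int(part) for part in text.split(':'))
    if minute == 0:
        return under_sixty(hour) + " o'clock"
    if minute < 10:
        # Assumption: single-digit minutes are read with a leading "oh".
        return under_sixty(hour) + ' oh ' + ONES[minute]
    return under_sixty(hour) + ' ' + under_sixty(minute)
```

Listening to a few recordings with single-digit minutes would confirm or refute the assumed readings before adopting them.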

## Ordinals

Here we experiment with verbalizing ordinals.

In [7]:
regex = r'([0-9]+(st|nd|rd|th))'
find_examples(regex)

### Examples Captured by Regex

**Regex:** ([0-9]+(st|nd|rd|th))

**Number of Examples:** 71

**Number of Examples Shown:** 5



 ___

**Text:** "…tween May XXX, 1827, and **30th** April, 1831,…"



 ___

**Text:** "… on this spot was on the **3rd** December, 1783,…"



 ___

**Text:** "… then lieutenant, in the **10th** Hussars.…"



 ___

**Text:** "…Up to the **21st** December, 1842,…"



 ___

**Text:** "…Thomas Dobson, on **22nd** August, 1799, for 1 shil…"



 ___

In [8]:
from IPython.display import Markdown
from functools import partial

from num2words import num2words

cases = [('shortly before Lee\'s 13th birthday', 'thirteenth'), 
         ('On October 23rd, I had attended a ultra', 'twenty-third'),
         ('between May 1st, 1827,', 'first'),
         ('and 30th April, 1831,', 'thirtieth')]

def replace(text, true):
    digit = ''.join([c for c in text if c.isdigit()])
    ret = num2words(int(digit), ordinal=True)
    display(Markdown(text + ' → ' + ret + ' (' + true + ')'))
    return ret


for text in cases:
    _iterate_and_replace(regex, text[0], partial(replace, true=text[1]))

13th → thirteenth (thirteenth)

23rd → twenty-third (twenty-third)

1st → first (first)

30th → thirtieth (thirtieth)
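
The ordinal regex accepts any of the four suffixes after any digits, so a malformed token such as "2th" would also match. If that ever matters, a stricter check could compare the suffix with the one the number calls for. This is a sketch only; `expected_suffix` and `find_ordinals` are hypothetical helpers, and the added word boundaries are a change from the pattern used above.

```python
import re

ORDINAL = re.compile(r'\b([0-9]+)(st|nd|rd|th)\b')


def expected_suffix(number):
    """Return the English ordinal suffix for ``number``."""
    if 10 <= number % 100 <= 20:
        # 11th, 12th and 13th (and the rest of the teens) always take "th".
        return 'th'
    return {1: 'st', 2: 'nd', 3: 'rd'}.get(number % 10, 'th')


def find_ordinals(text):
    """Return ``(token, is_well_formed)`` for every ordinal-looking token in ``text``."""
    return [(match.group(0), expected_suffix(int(match.group(1))) == match.group(2))
            for match in ORDINAL.finditer(text)]
```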

## Money (dollars or pounds)

Here we experiment with verbalizing currency.

In [9]:
regex = r'(\S*([$£]{1}[0-9\,\.]+\b))'
find_examples(regex)

### Examples Captured by Regex

**Regex:** (\S*([$£]{1}[0-9\,\.]+\b))

**Number of Examples:** 128

**Number of Examples Shown:** 5



 ___

**Text:** "…**$60** on the second of the mon…"



 ___

**Text:** "…s' Fund to the extent of **£90,000**.…"



 ___

**Text:** "…ment requests the sum of **$100,000** to conduct a detailed fe…"



 ___

**Text:** "…ed her for a full sum of **£2000**, after which the Wallace…"



 ___

**Text:** "…neral average being from **£120** to £130 per cell.…"



 ___

In [10]:
from IPython.display import Markdown

from functools import partial
from num2words import num2words

cases = [('rough diamonds valued at £4000.', 'four thousand pounds'), 
         ('inch BBL, unquote, cost $29.95.', 'twenty-nine dollars, ninety-five cents'),
         ('was indebted upwards of £50,000 subsequently stopped pay', 'fifty thousand pounds'),
         ('warden, whose income was £2372.', 'two thousand, three hundred seventy-two pounds'),
         ('plus $1.27', 'one dollar, twenty-seven cents'),
         ('$19.95,', 'nineteen dollars, ninety-five cents'),
         ('were out to the value of £367,800.', 'three hundred sixty-seven thousand and eight hundred pounds'),
         ('the offer of a reward of £1500 for the detection of the', 'fifteen hundred pounds'),
         ('of England notes for £1000 each,', 'one thousand pounds each'),
         ('of approximately $3,000,000 during that period', 'three million'),
         ('only afford to give £1750 for stones', 'one thousand seven-fifty pounds'),
         ('e surrender of the other £1200', 'one thousand, two hundred pounds')]

def replace(text, true):
    digit = text[1:].replace(',', '')
    ret = num2words(digit, to='currency', currency='USD')
    ret = ret.replace(', zero cents', '')
    ret = ret.replace('hundred and', 'hundred')
    if '£' in text:
        # num2words has bugs with its GBP currency, so convert the USD wording instead
        ret = ret.replace('dollar', 'pound')
        ret = ret.replace('cents', 'pence')
        ret = ret.replace('cent', 'penny')
    display(Markdown(text + ' → ' + ret + ' (' + true + ')'))
    return ret

for text in cases:
    _iterate_and_replace(regex, text[0], partial(replace, true=text[1]))

£4000 → four thousand pounds (four thousand pounds)

$29.95 → twenty-nine dollars, ninety-five cents (twenty-nine dollars, ninety-five cents)

£50,000 → fifty thousand pounds (fifty thousand pounds)

£2372 → two thousand, three hundred seventy-two pounds (two thousand, three hundred seventy-two pounds)

$1.27 → one dollar, twenty-seven cents (one dollar, twenty-seven cents)

$19.95 → nineteen dollars, ninety-five cents (nineteen dollars, ninety-five cents)

£367,800 → three hundred sixty-seven thousand, eight hundred pounds (three hundred sixty-seven thousand and eight hundred pounds)

£1500 → one thousand, five hundred pounds (fifteen hundred pounds)

£1000 → one thousand pounds (one thousand pounds each)

$3,000,000 → three million dollars (three million)

£1750 → one thousand, seven hundred fifty pounds (one thousand seven-fifty pounds)

£1200 → one thousand, two hundred pounds (one thousand, two hundred pounds)
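
The currency cell above leans on `num2words` for USD and then rewrites the unit names for pounds. An alternative is to parse the amount first and pick unit names afterwards. The sketch below covers only that parsing step, assuming the token starts with the currency symbol as `text[1:]` does above; `parse_currency` and the two lookup tables are hypothetical.

```python
from decimal import Decimal

# (major unit, minor unit) in singular form, keyed by currency symbol.
UNITS = {'$': ('dollar', 'cent'), '£': ('pound', 'penny')}
PLURALS = {'dollar': 'dollars', 'cent': 'cents', 'pound': 'pounds', 'penny': 'pence'}


def parse_currency(text):
    """Split a token such as ``$29.95`` into ``(major, major_name, minor, minor_name)``."""
    symbol = text[0]
    # Decimal keeps "29.95" exact, so the minor unit is not affected by float rounding.
    amount = Decimal(text[1:].replace(',', ''))
    major_name, minor_name = UNITS[symbol]
    major = int(amount)
    minor = int((amount - major) * 100)
    return (major,
            major_name if major == 1 else PLURALS[major_name],
            minor,
            minor_name if minor == 1 else PLURALS[minor_name])
```

The integers returned here could then be passed to whatever number verbalizer is in use, which would avoid the chain of string replacements for pounds.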

In [11]:
# Check that no currency symbols remain after the replacement above
find_examples(r'([$£])', replace=False)

### Examples Captured by Regex

**Regex:** ([$£])

**Number of Examples:** 0

**Number of Examples Shown:** 0



 ___

## PO Box Numbers & Serial Numbers


Here we experiment with verbalizing serial numbers and PO box numbers.

In [12]:
find_examples(r'([Bb]ox [0-9]+\b)')

### Examples Captured by Regex

**Regex:** ([Bb]ox [0-9]+\b)

**Number of Examples:** 14

**Number of Examples Shown:** 5



 ___

**Text:** "…Post Office **Box 2915**, Dallas, Texas, on March…"



 ___

**Text:** "…as listed on post office **box 30061**, New Orleans,…"



 ___

**Text:** "… address was Post Office **Box 2915**, Dallas, Texas.…"



 ___

**Text:** "…e had rented post office **box 2915**, Dallas,…"



 ___

**Text:** "…lication for post office **box 2915** listed "A. Hidell" as a …"



 ___

In [13]:
find_examples(r'(\b[A-Za-z]+[0-9]+\b)')

### Examples Captured by Regex

**Regex:** (\b[A-Za-z]+[0-9]+\b)

**Number of Examples:** 16

**Number of Examples Shown:** 5



 ___

**Text:** "… barrel near end of grip **C2766**, end quote,…"



 ___

**Text:** "…facturer's serial number **C2766**.…"



 ___

**Text:** "… Commando, serial number **V510210**, end quote,…"



 ___

**Text:** "…bearing serial number **C2766**.…"



 ___

**Text:** "…ano rifle, serial number **C2766**,…"



 ___

In [14]:
from IPython.display import Markdown

from functools import partial
from num2words import num2words

cases = [('Post Office Box 2915, Dallas, Texas, on March', 'two nine one five'), 
         ('Post Office Box 30016, New Orleans', 'three zero zero one six'),
         ('serial No. C2766, which was also found', 'C two seven six six'),
         ('control number VC836, serial number', 'V C eight three six'),
         ('Commando, serial number V510210, end quote', 'V five one zero two one zero')]

def replace(text, true):
    split = text.split(' ')
    ret = [num2words(int(t)) if t.isdigit() else t for t in list(split[-1])]
    ret = ' '.join(ret)
    if len(split) == 2:
        ret = split[0] + ' ' + ret
    display(Markdown(text + ' → ' + ret + ' (' + true + ')'))
    return ret

for regex in [r'([Bb]ox [0-9]+\b)', r'(\b[A-Za-z]+[0-9]+\b)']:
    for text in cases:
        _iterate_and_replace(regex, text[0], partial(replace, true=text[1]))

Box 2915 → Box two nine one five (two nine one five)

Box 30016 → Box three zero zero one six (three zero zero one six)

C2766 → C two seven six six (C two seven six six)

VC836 → V C eight three six (V C eight three six)

V510210 → V five one zero two one zero (V five one zero two one zero)
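
Since box and serial numbers are read one character at a time, the digit lookup does not strictly need `num2words`. A minimal dependency-free sketch of the same idea follows; `spell_out` is a hypothetical name.

```python
DIGIT_WORDS = ['zero', 'one', 'two', 'three', 'four',
               'five', 'six', 'seven', 'eight', 'nine']


def spell_out(token):
    """Read ``token`` character by character, spelling digits and keeping letters."""
    return ' '.join(DIGIT_WORDS[int(char)] if char in '0123456789' else char
                    for char in token)
```

The explicit `'0123456789'` test is used instead of `str.isdigit()` because `isdigit()` also accepts characters such as superscript digits that `int()` cannot convert.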

## Year

Here we experiment with verbalizing a year.

In [15]:
regexes = [r'(\b[0-9]{4}\b)', r'\b(?:in|In) ([0-9]{3})\b', r'\b([0-9]{3}) B\.C\b']
for regex in regexes:
    find_examples(regex)

### Examples Captured by Regex

**Regex:** (\b[0-9]{4}\b)

**Number of Examples:** 582

**Number of Examples Shown:** 5



 ___

**Text:** "…ighties," or at least by **1490**, printing in Venice had …"



 ___

**Text:** "…ain, as late as XXXX and **1812**, Execution Dock, on the …"



 ___

**Text:** "…as submitted in November **1963**…"



 ___

**Text:** "…This was in May **1842**.…"



 ___

**Text:** "…int where the dangers of **1929** are again becoming possi…"



 ___

### Examples Captured by Regex

**Regex:** \b(?:in|In) ([0-9]{3})\b

**Number of Examples:** 13

**Number of Examples Shown:** 5



 ___

**Text:** "…Late in his life, in **567**, he invaded Egypt.…"



 ___

**Text:** "…and in **521** Nebuchadnezzar III., a n…"



 ___

**Text:** "…The next year, in **605**, Nabopolassar died, and …"



 ___

**Text:** "…n was unavailing, and in **275** B.C., the inhabitants of…"



 ___

**Text:** "…Eleven years later, in **586**, he destroyed the sacred…"



 ___

### Examples Captured by Regex

**Regex:** \b([0-9]{3}) B\.C\b

**Number of Examples:** 2

**Number of Examples Shown:** 2



 ___

**Text:** "…and there on June 13, **323** B.C., he met his death.…"



 ___

**Text:** "…t Babylon, writing about **250** B.C.,…"



 ___

In [16]:
from IPython.display import Markdown

from functools import partial
from num2words import num2words

cases = [('dated April XXXX, 1787, describing an', 'seventeen eighty-seven'), 
         ('Newgate down to 1818,', 'eighteen eighteen'),
         ('It was about 2250 B.C., when the great', 'twenty-two fifty'),
         ('In 597, when he sent his army', 'five ninety-seven'),
         ('writing about 250 B.C.', 'two fifty'),
         ('In 606, Nineveh', 'six oh-six'),
         ('June 13, 323 B.C.,', 'three twenty-three')]

def replace(text, true):
    ret = num2words(int(text), lang='en', to='year')
    display(Markdown(text + ' → ' + ret + ' (' + true + ')'))
    return ret


for regex in regexes:
    for text in cases:
        _iterate_and_replace(regex, text[0], partial(replace, true=text[1]))


1787 → seventeen eighty-seven (seventeen eighty-seven)

1818 → eighteen eighteen (eighteen eighteen)

2250 → twenty-two fifty (twenty-two fifty)

597 → five ninety-seven (five ninety-seven)

606 → six oh-six (six oh-six)

250 → two fifty (two fifty)

323 → three twenty-three (three twenty-three)
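
The outputs above suggest a simple reading rule: split the year into its leading digits and its last two digits, and read a single-digit tail as "oh-". The sketch below reproduces that rule for three- and four-digit years. It is not how `num2words` is implemented, `verbalize_year` is a hypothetical name, and years such as 2005 that are usually read as "two thousand five" are deliberately not handled.

```python
ONES = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine',
        'ten', 'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen', 'sixteen',
        'seventeen', 'eighteen', 'nineteen']
TENS = ['', '', 'twenty', 'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety']


def under_hundred(number):
    """Spell out an integer in the range 0-99."""
    if number < 20:
        return ONES[number]
    tens, ones = divmod(number, 10)
    return TENS[tens] + ('-' + ONES[ones] if ones else '')


def verbalize_year(year):
    """Read a year from 100 to 9999 as two groups, e.g. 1787 -> seventeen eighty-seven."""
    high, low = divmod(year, 100)
    if not 1 <= high <= 99:
        raise ValueError('expected a three- or four-digit year')
    if low == 0:
        return under_hundred(high) + ' hundred'
    if low < 10:
        return under_hundred(high) + ' oh-' + ONES[low]
    return under_hundred(high) + ' ' + under_hundred(low)
```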

## Numero (no.)

Here we experiment with verbalizing numbers that follow the numero abbreviation ("No.").

In [17]:
regex = r'(?:No|no)\. ([0-9]+)'
find_examples(regex)

### Examples Captured by Regex

**Regex:** (?:No|no)\. ([0-9]+)

**Number of Examples:** 29

**Number of Examples Shown:** 5



 ___

**Text:** "…se pictures, Exhibit No. **133**-A, shows most of the rif…"



 ___

**Text:** "… lighting in Exhibit No. **133**-A.…"



 ___

**Text:** "… negative of Exhibit No. **133**-B was exposed in Oswald'…"



 ___

**Text:** "…tinctive colors with No. **10** prominently displayed on…"



 ___

**Text:** "…uld not test Exhibit No. **133**-A in the same way becaus…"



 ___

In [18]:
from IPython.display import Markdown

from functools import partial
from num2words import num2words

cases = [('Commission Exhibit No. 133-B,', 'one thirty-three'), 
         ('Commission Exhibit No. 162 as', 'one sixty-two')]

def replace(text, true):
    ret = num2words(int(text), lang='en', to='year')
    display(Markdown(text + ' → ' + ret + ' (' + true + ')'))
    return ret

for text in cases:
    _iterate_and_replace(regex, text[0], partial(replace, true=text[1]))

133 → one thirty-three (one thirty-three)

162 → one sixty-two (one sixty-two)

## Other Numbers

Here we experiment with verbalizing all other numbers.

In [19]:
find_examples(r'(\b[0-9]{1}[0-9\.\,]{0,}\b)', display_n=50)

### Examples Captured by Regex

**Regex:** (\b[0-9]{1}[0-9\.\,]{0,}\b)

**Number of Examples:** 1171

**Number of Examples Shown:** 50



 ___

**Text:** "…t, at a "fast walk" took **1** minute, 14 seconds.…"



 ___

**Text:** "…oor lunchroom was within **3** seconds of the time need…"



 ___

**Text:** "…tober the same year, for **2** shillings, with costs of…"



 ___

**Text:** "…On August **17**, XXXX, Oswald told Mr. W…"



 ___

**Text:** "…nd drove at speeds up to **25** to 30 miles an hour…"



 ___

**Text:** "…s therefore bearing only **3** and one half percent int…"



 ___

**Text:** "…On November **22** a Secret Service agent s…"



 ___

**Text:** "…which became the **4** George IV. cap. 64, said…"



 ___

**Text:** "…wgate down to XXXX, part **2**.…"



 ___

**Text:** "…Chapter **7**. Lee Harvey Oswald: Back…"



 ___

**Text:** "…erested about the age of **15**. From an ideological vie…"



 ___

**Text:** "… visit Texas on November **21** and 22,…"



 ___

**Text:** "…**300** debtors and 900 criminal…"



 ___

**Text:** "…t twice within a span of **4.6** and 5.15 seconds.…"



 ___

**Text:** "…th costs of X shillings, **6** pence.…"



 ___

**Text:** "…About **6** months later she also wi…"



 ___

**Text:** "…The master felons' side. **6**. The female felons' side…"



 ___

**Text:** "…and arrived on September **27**, XXXX.…"



 ___

**Text:** "…In the **6** to 8 minute period befor…"



 ___

**Text:** "…(**1**) positive identification…"



 ___

**Text:** "…yer as, quote, about XX, **5** foot 8 inches, black hai…"



 ___

**Text:** "…On November **18**,…"



 ___

**Text:** "…t his remarks of January **7** were intended by him mer…"



 ___

**Text:** "…y stated, I have between **25** and 40 cases assigned to…"



 ___

**Text:** "…turning home on November **21**,…"



 ___

**Text:** "…a operates at a speed of **18.3** frames per second,…"



 ___

**Text:** "…Quote, this **13** year old well built boy …"



 ___

**Text:** "…tol, a copy of the March **24**, XXXX, issue of the Work…"



 ___

**Text:** "…tification experts, and (**4**) the testimony of Marina…"



 ___

**Text:** "…Between the hours of **8** and 9 p.m. they were occ…"



 ___

**Text:** "…ation the names of about **100** persons were in this ind…"



 ___

**Text:** "…f X pence, with costs of **7** shillings, 6 pence.…"



 ___

**Text:** "…o the police by Oswald, (**7**)…"



 ___

**Text:** "…session a Smith & Wesson **38** caliber revolver…"



 ___

**Text:** "…tween the hours of X and **9** p.m. they were occupied …"



 ___

**Text:** "…Chapter **4**. The Assassin: Part 8.…"



 ___

**Text:** "…munism when he was about **15**.…"



 ___

**Text:** "…X, where he stayed until **3** days before he was sched…"



 ___

**Text:** "… the evening of November **22**, Benavides told them tha…"



 ___

**Text:** "…firing **50** rounds each day for five…"



 ___

**Text:** "…one agent was there from **2** until 5 a.m.…"



 ___

**Text:** "…e building approximately **3** minutes after the assass…"



 ___

**Text:** "…t Mexico City on October **2**, XXXX.…"



 ___

**Text:** "…ussed in chapter X, page **249**.…"



 ___

**Text:** "…rthur Griffiths. Section **11**: Executions, part one.…"



 ___

**Text:** "…On June **24**, Oswald applied in New O…"



 ___

**Text:** "…te there had been XXX or **800** frequently, and once, in…"



 ___

**Text:** "…he weapon at XX yards in **6**, 7, and 9 seconds,…"



 ___

**Text:** "…weeks before, on October **7**, but she had asked him t…"



 ___

**Text:** "…d on the sixth floor was **88** inches long.…"



 ___

In [20]:
from IPython.display import Markdown

from functools import partial
from num2words import num2words

cases = [('Chapter 4. The Assassin:', 'four'), 
         ('the morning of November 22 prior to the motorcade', 'twenty-two'),
         ('was shipped on March 20, and the shooting', 'twenty'),
         ('Kennedy in the neck at 176.9', 'one hundred seventy-six point nine'), 
         ('distance of 265.3 feet was, quote', 'two hundred sixty-five point three'),
         ('ries they required XXXX, 6.45,', 'six point four five'),
         ('information on some 50,000 cases', 'fifty thousand'), 
         ('actually had only 1,000 printed.', 'one thousand'),
         ('PRS received items in 8,709 cases', 'eight thousand, seven hundred nine'),
         ('debtors and 182 felons,', 'one hundred eighty-two')]

def replace(text, true):
    text = text.replace(',', '')
    ret = num2words(float(text))
    ret = ret.replace('hundred and', 'hundred')
    display(Markdown(text + ' → ' + ret + ' (' + true + ')'))
    return ret

for text in cases:
    _iterate_and_replace(r'(\b[0-9\.\,]+\b)', text[0], partial(replace, true=text[1]))

4 → four (four)

22 → twenty-two (twenty-two)

20 → twenty (twenty)

176.9 → one hundred seventy-six point nine (one hundred seventy-six point nine)

265.3 → two hundred sixty-five point three (two hundred sixty-five point three)

6.45 → six point four five (six point four five)

50000 → fifty thousand (fifty thousand)

1000 → one thousand (one thousand)

8709 → eight thousand, seven hundred nine (eight thousand, seven hundred nine)

182 → one hundred eighty-two (one hundred eighty-two)

## Roman Numerals

Here we experiment with verbalizing Roman numerals.

In [21]:
find_examples(r'\b(?:George|Charles|Napoleon|Henry|Nebuchadnezzar|William) ([IV]+\.{0,})')

### Examples Captured by Regex

**Regex:** \b(?:George|Charles|Napoleon|Henry|Nebuchadnezzar|William) ([IV]+\.{0,})

**Number of Examples:** 22

**Number of Examples Shown:** 5



 ___

**Text:** "…but in the XXXX George **II** the right of presentatio…"



 ___

**Text:** "…ck as the reign of Henry **VIII.** a new and most cruel pen…"



 ___

**Text:** "…fired a pistol at George **III.** from the pit of Drury La…"



 ___

**Text:** "…tried to stab George **III.** as he was alighting from…"



 ___

**Text:** "…ew Jail Acts of X George **IV**…"



 ___

In [22]:
from IPython.display import Markdown

from functools import partial
from num2words import num2words

cases = [('reign of Charles II., a law was passed', 'the second'), 
         ('William IV. was also the victim', 'the fourth'),
         ('the reign of Henry VIII. a new and most', 'the eighth')]

def replace(text, true):
    if text[-1] == '.':
        text = text[:-1]
        
    # Handles I through VIII only (numerals built from "I" and "V")
    num = 0
    if 'V' not in text:
        num = len(text)
    elif 'IV' == text:
        num = 4
    else:
        num = 5 + len(text) - 1
        
    ret = 'the ' + num2words(int(num), to='ordinal')
    display(Markdown(text + ' → ' + ret + ' (' + true + ')'))
    return ret

for text in cases:
    _iterate_and_replace(r'\b(?:George|Charles|Napoleon|Henry|Nebuchadnezzar|William) ([IV]+\.{0,})',
                         text[0], partial(replace, true=text[1]))

II → the second (the second)

IV → the fourth (the fourth)

VIII → the eighth (the eighth)
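
The conversion above relies on the numeral containing only "I" and "V", which limits it to I through VIII. Should a numeral such as IX or XIV ever appear, a general parser using the standard subtractive rule would be needed. The sketch below shows one; `roman_to_int` is a hypothetical helper and does not validate malformed numerals.

```python
ROMAN_VALUES = {'I': 1, 'V': 5, 'X': 10, 'L': 50, 'C': 100, 'D': 500, 'M': 1000}


def roman_to_int(numeral):
    """Convert a Roman numeral such as ``VIII`` or ``XIV.`` to an integer."""
    numeral = numeral.rstrip('.')  # The dataset writes some numerals with a trailing period.
    total = 0
    for index, char in enumerate(numeral):
        value = ROMAN_VALUES[char]
        # A smaller value placed before a larger one is subtracted (IV = 4, IX = 9).
        if index + 1 < len(numeral) and ROMAN_VALUES[numeral[index + 1]] > value:
            total -= value
        else:
            total += value
    return total
```

The integer it returns could be passed to `num2words(..., to='ordinal')` exactly as `num` is in the cell above.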