## String Manipulation

### Example

You're working on a project to analyze court filings.

* One member of your team is working on OCR (Optical Character Recognition) to convert scanned documents into text files.
* Another member of your team will be visualizing the data, and they need the counts of ten key terms in each document.
* Your job is to write a function that takes a word and a text file, and returns the number of times that word appears in the text file.

This seems like a simple task. You don't have real data yet, so we'll use a free text file of comparable size from Project Gutenberg instead.


In [None]:
# read the whole file into memory; the with-block closes the file afterwards
with open('shakespeare.txt') as f:
    all_text = f.read()

def count_word(word, text):
    counter = 0
    for w in text.split():
        if w == word:
            counter += 1
    return counter

How long does this take?


In [None]:
import timeit
number = 10*10000 # (10 words to search for) * (10000 documents to search)
timeit.timeit("count_word('Romeo', all_text)", globals=globals(), number=number)

In [None]:
import timeit
number = 1000
timeit.timeit("count_word('Romeo', all_text)", globals=globals(), number=number)

<https://docs.python.org/3/library/timeit.html>

If 1000 runs take 40 seconds, then our 100,000 runs (10 words × 10,000 documents) will take 4000 seconds, which is over an hour.
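The extrapolation above is simple proportional scaling, and it can be written out as a quick check. This is only a sketch: the 40-second figure is the one quoted in the text rather than a fresh measurement, and the variable names are made up for illustration.

```python
# Figures quoted in the text; the 40 s timing is not re-measured here.
seconds_per_batch = 40        # time reported for one timeit batch
calls_per_batch = 1000        # the `number` argument used for that batch
total_calls = 10 * 10_000     # (10 words to search for) * (10,000 documents)

# Scale the per-call cost up to the full workload.
seconds_per_call = seconds_per_batch / calls_per_batch
total_seconds = seconds_per_call * total_calls

print(f"{seconds_per_call * 1000:.0f} ms per call")
print(f"{total_seconds:.0f} s total, about {total_seconds / 3600:.2f} hours")
```

Plugging in your own `timeit` result for `seconds_per_batch` gives the estimate for your machine.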

It seems like we could do better, but an hour is acceptable for now, so we move on.

### Example 2

During code review, someone points out that the count needs to ignore case:

In [None]:
def count_word(word, text):
    counter = 0
    for w in text.split():
        if w.lower() == word.lower():
            counter += 1
    return counter

How long does this take?

In [None]:
import timeit
number = 1000
timeit.timeit("count_word('Romeo', all_text)", globals=globals(), number=number)

This made it take about twice as long to run: about two hours for all 100,000 runs (10 words × 10,000 documents).

Where did that time go?
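One way to look for the answer is to profile the function. The sketch below uses the standard library's `cProfile` and `pstats` modules on a small made-up string. `sample_text` is a hypothetical stand-in for `all_text`, so the absolute times will not match the real file; the call counts in the report are the part to look at.

```python
import cProfile
import io
import pstats


def count_word(word, text):
    counter = 0
    for w in text.split():
        if w.lower() == word.lower():
            counter += 1
    return counter


# Hypothetical stand-in for all_text: 5 words repeated 10,000 times.
sample_text = "Romeo and Juliet and romeo " * 10_000

# Run the function under the profiler and keep its return value.
profiler = cProfile.Profile()
result = profiler.runcall(count_word, "Romeo", sample_text)

# Collect the report in a string, sorted by time spent inside each function.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("tottime").print_stats()
report = stream.getvalue()
print(report)
```

Comparing the number of calls to `lower` with the number of words in the text is a reasonable place to start reading the report.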

**Why did it take twice as long?**

**There's one easy optimization that can shave about 40 ms off each iteration. What is it?**
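One candidate answer, offered as a sketch rather than the intended solution: `word.lower()` returns the same value on every pass through the loop, so it can be computed once before the loop starts. The name `target` is introduced here for illustration. How much time this saves on the real file is something to measure with `timeit`, not something this sketch claims.

```python
def count_word(word, text):
    # Lowercase the search term once instead of once per word in the text.
    target = word.lower()
    counter = 0
    for w in text.split():
        if w.lower() == target:
            counter += 1
    return counter


# "Romeo." keeps its trailing period, so only the first two tokens match.
print(count_word("Romeo", "romeo ROMEO Juliet Romeo."))  # prints 2
```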

### Example 3

Just as you wonder what will happen as the corpus of text grows, you hear that there are new requirements:

* Ignore punctuation
* Ignore plurals (for our purposes we can just ignore trailing s characters)
* Sometimes page numbers are showing up in the middle of scans, and we want to ignore those too, so strip all number characters.

In [None]:
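def count_word(word, text):
    counter = 0
    # lowercase the search term once, outside the loop
    word = word.lower()
    # use a separate loop variable so the `word` argument is not overwritten
    for token in text.split():
        # remove all numeric characters that might appear inside words
        w = "".join([c for c in token if c not in '0123456789'])
        w = w.lower()
        # remove leading/trailing punctuation (but not punctuation in the middle)
        w = w.strip('.,!?;:"\'')
        # match the word itself or its plural with a trailing "s"
        if w == word or w == word + "s":
            counter += 1
    return counter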

In [None]:
import timeit

number = 100
timeit.timeit("count_word('Romeo', all_text)", globals=globals(), number=number)

This made it take 7x as long to run: about 14 hours for the full corpus.

As more requirements come in, the code gets more and more complicated, and it takes longer and longer to run.

We know each time we add a new requirement, we have to iterate over each word in the text, and do some work on it. Is there any way around this?

What if we could process the whole text at once instead of word by word?

### Example 4

In [None]:
import re


def count_word(word, text):
    # remove digits and common punctuation characters wherever they appear
    text = re.sub(r'[\d.,!?;:"\']', '', text)
    # count matches of word between word boundaries (\b), with an optional trailing s;
    # re.escape keeps any special characters in word from being read as regex syntax
    pattern = r"\b" + re.escape(word) + r"s?\b"
    return len(re.findall(pattern, text, re.IGNORECASE))

In [None]:
import timeit

number = 100
timeit.timeit("count_word('Romeo', all_text)", globals=globals(), number=number)

This is about 66% faster than the prior version, with all the same features. That saves us about 9 hours on the full corpus.
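A claim like "all the same features" is worth checking on a small input before trusting the timings. The sketch below rests on several assumptions: `count_word_loop` and `count_word_regex` are hypothetical names for the loop-based and regex-based approaches, the regex variant uses `\b` word boundaries and `re.escape`, and the sample sentence is invented.

```python
import re

DIGITS = "0123456789"
PUNCTUATION = ".,!?;:\"'"


def count_word_loop(word, text):
    # Word-by-word approach: strip digits, lowercase, trim punctuation.
    target = word.lower()
    counter = 0
    for raw in text.split():
        w = "".join(c for c in raw if c not in DIGITS).lower()
        w = w.strip(PUNCTUATION)
        if w == target or w == target + "s":
            counter += 1
    return counter


def count_word_regex(word, text):
    # Drop digits and the listed punctuation, then count whole-word matches.
    text = re.sub(r'[\d.,!?;:"\']', '', text)
    pattern = r"\b" + re.escape(word) + r"s?\b"
    return len(re.findall(pattern, text, re.IGNORECASE))


# Invented sample covering case, punctuation, a plural, and a stray digit.
sample = "Romeo, Romeo! wherefore art thou romeo? 12 Romeos and Rom3eo."
print(count_word_loop("Romeo", sample), count_word_regex("Romeo", sample))
```

The two can still disagree on other inputs. The loop version strips punctuation only at the ends of a word, while the regex version removes it everywhere, so `Romeo's` is counted by the regex version but not by the loop version.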