# Example of searching a folder for variations on a set of terms

Suppose we want to find variations of the word "purpose" in a set of texts.

We have a folder where the texts are stored:

In [3]:
source = '/Users/tunder/Box Sync/NEHproject/dirtynarratives/'

And a list of words we want to get:

In [6]:
# making this a set because it's fast to check membership in a set

targets = {'purpose', 'purposive', 'purposed'}

We could start by getting a list of files to search. To keep the task small, I'll arbitrarily restrict it to files whose names begin with "9" or "8".

In [19]:
import glob
files2get = glob.glob(source + '9*.txt')
files2get.extend(glob.glob(source + '8*.txt'))
len(files2get)

17
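As an aside, the two `glob` calls above could probably be collapsed into one, since `glob` patterns accept character classes such as `[89]`. Here is a minimal sketch that uses a temporary folder and made-up file names (not the real corpus) to show what the pattern picks up:

```python
import glob
import os
import tempfile

# Hypothetical file names, created only for this demonstration.
demo_names = ['9611_0.txt', '8120_1.txt', '7001_0.txt', '9611_0.csv']

with tempfile.TemporaryDirectory() as demo_dir:
    for name in demo_names:
        # create an empty file with this name
        open(os.path.join(demo_dir, name), 'w').close()

    # [89] matches a single character that is either "8" or "9"
    pattern = os.path.join(demo_dir, '[89]*.txt')
    matched = sorted(os.path.basename(p) for p in glob.glob(pattern))

print(matched)
```

Only the `.txt` files starting with "8" or "9" should be matched; the "7" file and the `.csv` file are skipped.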

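Next, some modules that will come in handy: `SequenceMatcher` for fuzzy string comparison, and NLTK's tokenizer for splitting the text into words.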

In [20]:
from difflib import SequenceMatcher
from nltk import tokenize

In [21]:
snippets = []

for filename in files2get:
    with open(filename, encoding = 'utf-8') as f:
        filestring = f.read()
    wordsinfile = tokenize.word_tokenize(filestring)
    
    for idx, w in enumerate(wordsinfile):
        found = False
        w = w.lower()   # normalize case first, so "Purpose" is also caught
        if w in targets:
            found = True
        elif w.startswith('p') and len(w) > 4:
            # suspicious!
            # you might not have an easy way to filter candidates like this,
            # but in this case we do. Otherwise just proceed to the next step.
            
            for t in targets:
                matcher = SequenceMatcher(None, w, t)
                # real_quick_ratio() is a cheap upper bound on ratio(); checking
                # it first saves some time by blocking obvious fails
                if matcher.real_quick_ratio() > 0.5 and matcher.ratio() > 0.85:
                    found = True
                    break
        
        if found:
            snippetstart = idx - 20
            if snippetstart < 0:
                snippetstart = 0
            
            snippetend = idx + 20
            if snippetend > len(wordsinfile):
                snippetend = len(wordsinfile)
            
            snippets.append((filename, idx, ' '.join(wordsinfile[snippetstart : snippetend])))
            if len(snippets) % 10 == 1:
                print(len(snippets))

1
11
21
31
41
51
61
71


The counting from 1 to 71 above isn't important here, but in a long task it gives you a way to know the program is running and not broken.
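The window-clamping logic in the loop above could also be factored into a small helper. This is only a sketch: the name `context_window` and its `radius` parameter are my own inventions, not part of the original notebook. Note that, as in the loop, the slice end is exclusive, so the window holds up to `radius` words before the hit, the hit itself, and up to `radius - 1` words after it.

```python
def context_window(words, idx, radius=20):
    """Join the words around position idx, clamped to the list bounds.

    Hypothetical helper mirroring the snippetstart / snippetend logic.
    """
    start = max(idx - radius, 0)           # never go below the first word
    end = min(idx + radius, len(words))    # never run past the last word
    return ' '.join(words[start:end])


# Toy input: the "words" are just the numbers 0-99 as strings.
demo_words = [str(i) for i in range(100)]

middle = context_window(demo_words, 5, radius=3)
near_start = context_window(demo_words, 1, radius=3)
near_end = context_window(demo_words, 99, radius=3)
print(middle)
print(near_start)
print(near_end)
```

With this in place, the `snippets.append(...)` line could call `context_window(wordsinfile, idx)` instead of computing the two bounds inline.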

In the examples below, you'll see that our criteria are a bit over-generous: we got some examples of "propose," which is close enough to "purpose" to clear the 0.85 threshold. But if the overall number of hits is likely to be low, it might be wiser to be generous (and catch "purpoſe") than to be too stingy.
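To see why "propose" gets through, it can help to print the similarity scores directly. The sketch below compares a few candidates against the single target "purpose"; the candidate list is chosen just for illustration ("purple" is my own addition, not something found in the corpus).

```python
from difflib import SequenceMatcher

target = 'purpose'
candidates = ['purpoſe', 'propose', 'purple']   # illustrative only

scores = {}
for word in candidates:
    matcher = SequenceMatcher(None, word, target)
    # real_quick_ratio() depends only on the two lengths, so it is a
    # loose upper bound; ratio() does the actual character matching
    quick = matcher.real_quick_ratio()
    scores[word] = matcher.ratio()
    print(word, round(quick, 3), round(scores[word], 3))
```

`ratio()` is twice the number of matched characters divided by the combined length of the two strings. Both "purpoſe" and "propose" share six matched characters with "purpose", so both come out at 12/14, or about 0.857, just above the cutoff. A threshold strict enough to exclude "propose" would therefore also exclude the long-s spelling we want to keep.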

In [22]:
snippets

[('/Users/tunder/Box Sync/NEHproject/dirtynarratives/9611_4.txt',
  15174,
  "Phyſiognomiſt will . • rarely be deceived . I preſume you have never read the Story of Socrates to this purpoſe , and '' therefore I will tell it you . A certain Phyſiog- 'romiſt aſſerted of Socrates , that"),
 ('/Users/tunder/Box Sync/NEHproject/dirtynarratives/9611_2.txt',
  8239,
  "far off ? The Mug is out , ſhall I draw another ? '' Whilft he was gone for that purpoſe , a Stage- Coach drove up to the Door . The Coachman coming into the Houſe , was aſked"),
 ('/Users/tunder/Box Sync/NEHproject/dirtynarratives/9611_3.txt',
  5448,
  'he was no ſooneř arrived , than Bellar- mine brought him back to the Point ; but all to no purpoſe ; he made his Eſcape from that Subject in : in a Minute ; till at laſt the Lover'),
 ('/Users/tunder/Box Sync/NEHproject/dirtynarratives/9611_0.txt',
  7176,
  'Shillings por Annum would have accrued to the Rector : but he had not yet been able to accompliſh his purpoſe ; and had