OCR'd texts present special challenges to tokenization.  Consider this selection from an OCR'd version of Darwin's Origin of Species from the [Internet Archive](https://archive.org/download/originofspecies00darwuoft/originofspecies00darwuoft_djvu.txt):

```
the inhabitants of the surrounding districts will, also, be thus
prevented. Moritz Wagner has lately published an interest-
ing essay on this subject, and has shown that the service
rendered by isolation in preventing crosses between newly-
formed varieties is probably greater even than I supposed.
But from reasons already assigned I can by no means agree
with this naturalist, that migration and isolation are neces-
sary elements for the formation of new species. The im-
portance of isolation is likewise great in preventing, after
any physical change in the conditions such as of climate ele-
vation of the land, &c., the immigration of better adapted or-
ganisms; and thus new places in the natural economy of the
district will be left open to be filled up by the modification of
the old inhabitants. Lastly, isolation will give time for a
new variety to be improved at a slow rate ; and this may
```

Here the printing convention of line-break hyphenation would, under a standard tokenizer, generate incorrect tokens like `interest-ing` (or perhaps `interest-` and `ing`).  Design a better tokenizer (even just using pre- and post-processing) for these texts.  Note that the correct tokenization of `interest-ing` is `interesting`, but the correct tokenization of `newly-formed` is still `newly-formed`.
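
As one possible illustration of the pre-processing route, the sketch below rewrites the raw text before any tokenizer sees it. It is only a minimal sketch under stated assumptions: `dehyphenate`, `LINE_BREAK_HYPHEN`, and `TOY_VOCAB` are hypothetical names introduced here, and the tiny vocabulary stands in for a real word list. A fragment pair split across a line break is joined without the hyphen when the joined form is in the vocabulary; otherwise the hyphen is kept and the two halves are rejoined on one line.

```python
import re

# Hypothetical toy vocabulary (an assumption for illustration); a real run
# would use a full word list plus words observed elsewhere in the book.
TOY_VOCAB = {"interesting", "necessary", "importance", "elevation", "organisms"}

# A run of letters ending in "-" at the end of a line, the run of letters that
# starts the next line, and an optional single space following it.
LINE_BREAK_HYPHEN = re.compile(r"([A-Za-z]+)-\n([A-Za-z]+)[ ]?")

def dehyphenate(text, vocab):
    """Join words split across line breaks, dropping the hyphen only when
    the joined form is a known word."""
    def join(match):
        left, right = match.group(1), match.group(2)
        candidate = left + right
        if candidate.lower() in vocab:
            # Line-break artifact: emit the whole word, then the line break.
            return candidate + "\n"
        # Unknown joined form: treat the hyphen as real and keep it.
        return left + "-" + right + "\n"
    return LINE_BREAK_HYPHEN.sub(join, text)

sample = "an interest-\ning essay between newly-\nformed varieties"
print(dehyphenate(sample, TOY_VOCAB))
```

The cleaned string can then be passed to any standard tokenizer. The vocabulary lookup is a heuristic: a genuine compound whose joined form happens to be a dictionary word would lose its hyphen.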

For a more thorough library for handling OCR'd book data, see https://github.com/tedunderwood/DataMunging


In [1]:
import nltk, re

In [6]:
def read_text(filename):
    lines=[]
    with open(filename, encoding="utf-8") as file:
        for line in file:
            lines.append(line.rstrip())
    return lines

In [7]:
filename="../data/darwin_origin_ia.txt"

In [8]:
lines=read_text(filename)

In [9]:
testText="""the inhabitants of the surrounding districts will, also, be thus
prevented. Moritz Wagner has lately published an interest-
ing essay on this subject, and has shown that the service
rendered by isolation in preventing crosses between newly-
formed varieties is probably greater even than I supposed.
But from reasons already assigned I can by no means agree
with this naturalist, that migration and isolation are neces-
sary elements for the formation of new species. The im-
portance of isolation is likewise great in preventing, after
any physical change in the conditions such as of climate ele-
vation of the land, &c., the immigration of better adapted or-
ganisms; and thus new places in the natural economy of the
district will be left open to be filled up by the modification of
the old inhabitants. Lastly, isolation will give time for a
new variety to be improved at a slow rate ; and this may"""

In [10]:
# To distinguish words that are hyphenated *only* because they break across the end of a line
# from those that *should* be hyphenated, we look up the dehyphenated word in a vocabulary to see
# if it exists.  We build that vocabulary from an existing word list (e.g., /usr/share/dict/words)
# plus all of the tokens observed in the book that do not end in a hyphen.

vocab=set()

with open("/usr/share/dict/words") as file:
    for line in file:
        vocab.add(line.rstrip().lower())

for line in lines:
    words=nltk.word_tokenize(line, language="english")
    for word in words:
        if not word.endswith("-"):
            vocab.add(word.lower())

In [11]:
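# Tokenize the test passage line by line (a separate name avoids overwriting
# `lines`, which holds the full book used to build the vocabulary)
test_lines=testText.split("\n")
tokenized_lines=[]
for line in test_lines:
    tok_words=nltk.word_tokenize(line, language="english")
    tokenized_lines.append(tok_words)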
    
tokens=[]
previousLineHyphenMatch=False

for idx,words in enumerate(tokenized_lines):
    flag=False
    
    # if line ends in hyphen
    if len(words) > 0 and words[-1].endswith("-") and idx < len(tokenized_lines)-1:
        nextwords=tokenized_lines[idx+1]
        if len(nextwords) > 0:
            first=nextwords[0]
            candidate="%s%s" % (re.sub("-$", "", words[-1]), first)
            
            # check if candidate word exists in dictionary
            if candidate.lower() in vocab:
                # if so, replace the fragment with the full word
                words[-1]=candidate
                
                # and keep a flag so we know to drop the first word of the next line
                flag=True
           
    if previousLineHyphenMatch:
        tokens.append(words[1:])
    else:
        tokens.append(words)
    
    previousLineHyphenMatch = flag

    
print("Tokenized:\n")
for line in tokens:
    print(' '.join(line))
print("\nOriginal:\n")
print(testText)

Tokenized:

the inhabitants of the surrounding districts will , also , be thus
prevented . Moritz Wagner has lately published an interesting
essay on this subject , and has shown that the service
rendered by isolation in preventing crosses between newly-
formed varieties is probably greater even than I supposed .
But from reasons already assigned I can by no means agree
with this naturalist , that migration and isolation are necessary
elements for the formation of new species . The importance
of isolation is likewise great in preventing , after
any physical change in the conditions such as of climate elevation
of the land , & c. , the immigration of better adapted organisms
; and thus new places in the natural economy of the
district will be left open to be filled up by the modification of
the old inhabitants . Lastly , isolation will give time for a
new variety to be improved at a slow rate ; and this may

Original:

the inhabitants of the surrounding districts will, also, be thus
pre