We're going to pull the HTML from https://www.gutenberg.org/ebooks/3300.  
  
We use the HTML because its markup makes line and paragraph breaks easier to recover. The EPUB data would also work.

In [1]:
import requests
from bs4 import BeautifulSoup as Soup
book_html=requests.get("https://www.gutenberg.org/files/3300/3300-h/3300-h.htm").text
soup=Soup(book_html,"html.parser")
soup.text[:1000]

'\n\n\n\n\nThe Project Gutenberg eBook of An Inquiry into the Nature and Causes of the Wealth of Nations, by Adam Smith\n\n\n\nThe Project Gutenberg eBook of An Inquiry into the Nature and Causes of the Wealth of Nations, by Adam Smith\n\r\nThis eBook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world at no cost and with almost no restrictions\r\nwhatsoever. You may copy it, give it away or re-use it under the terms\r\nof the Project Gutenberg License included with this eBook or online\r\nat www.gutenberg.org. If you\r\nare not located in the United States, you will have to check the laws of the\r\ncountry where you are located before using this eBook.\r\n\nTitle: An Inquiry into the Nature and Causes of the Wealth of Nations\nAuthor: Adam Smith\nRelease Date: March 17, 2001 [eBook #3300]\r\n[Most recently updated: December 29, 2021]\nLanguage: English\nCharacter set encoding: UTF-8\nProduced by: Colin Muir and David Widger\n*** START OF THE PRO

In [2]:
# Inspecting the page shows that each chapter sits in a <div class="chapter">. Book I, Chapter I is at index 2.
chapters=soup.find_all("div", {"class": "chapter"})
chapters[2]

<div class="chapter">
<h2><a name="chap03"></a>CHAPTER I.<br/>
OF THE DIVISION OF LABOUR.</h2>
<p>
The greatest improvements in the productive powers of labour, and the greater
part of the skill, dexterity, and judgment, with which it is anywhere directed,
or applied, seem to have been the effects of the division of labour. The
effects of the division of labour, in the general business of society, will be
more easily understood, by considering in what manner it operates in some
particular manufactures. It is commonly supposed to be carried furthest in some
very trifling ones; not perhaps that it really is carried further in them than
in others of more importance: but in those trifling manufactures which are
destined to supply the small wants of but a small number of people, the whole
number of workmen must necessarily be small; and those employed in every
different branch of the work can often be collected into the same workhouse,
and placed at once under the view of the spectator.
</p

In [3]:
import glob
ocred_pages=[]
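# glob.glob returns files in arbitrary order, so sort them
# (assumes the filenames sort in page order, e.g. zero-padded page numbers)
for file in sorted(glob.glob("pages/*.txt")):
    with open(file, "r") as f:
        ocred_pages.append(f.read())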
ocred_text="\n\n".join(ocred_pages)
print(ocred_text)

BOOK I
OF THE CAUSES OF IMPROVEMENT IN THE PRODUCTIVE
POWERS OF LABOUR, AND OF THE ORDER ACCORDING TO
WHICH ITS PRODUCE IS NATURALLY DISTRIBUTED AMONG
THE DIFFERENT RANKS OF THE PEOPLE
CHAPTER I
OF THE DIVISION OF LABOUR
THE greatest improvement in the productive powers of labour,
and the greater part of the skill, dexterity, and judgment with
which it is anywhere directed, or applied, seem to have been the
effects of the division of labour.
The effects of the division of labour, in the general business of
society, will be more easily understood by considering in what
manner it operates in some particular manufactures. It is com-
monly supposed to be carried furthest in some very trifling ones;
not perhaps that it really is carried further in them than in others
of more importance: but in those trifling manufactures which are
destined to supply the small wants of but a small number of
people, the whole number of workmen must necessarily be small;
and those employed in every different b

In [4]:
gutenberg_text="\n".join([p.text for p in chapters[2].find_all("p")])
gutenberg_text

'\r\nThe greatest improvements in the productive powers of labour, and the greater\r\npart of the skill, dexterity, and judgment, with which it is anywhere directed,\r\nor applied, seem to have been the effects of the division of labour. The\r\neffects of the division of labour, in the general business of society, will be\r\nmore easily understood, by considering in what manner it operates in some\r\nparticular manufactures. It is commonly supposed to be carried furthest in some\r\nvery trifling ones; not perhaps that it really is carried further in them than\r\nin others of more importance: but in those trifling manufactures which are\r\ndestined to supply the small wants of but a small number of people, the whole\r\nnumber of workmen must necessarily be small; and those employed in every\r\ndifferent branch of the work can often be collected into the same workhouse,\r\nand placed at once under the view of the spectator.\r\n\n\r\nIn those great manufactures, on the contrary, which are

# Time to Compare Text
First we'll align both texts to the same starting point. Then we'll clean up some of the special characters and compute an accuracy score.

In [5]:
ocred_text[220:400]

'\nTHE greatest improvement in the productive powers of labour,\nand the greater part of the skill, dexterity, and judgment with\nwhich it is anywhere directed, or applied, seem to hav'

In [6]:
gutenberg_text[:100]

'\r\nThe greatest improvements in the productive powers of labour, and the greater\r\npart of the skill, '

Fronts are aligned
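
The offset 220 above was found by eye. A minimal sketch of locating it programmatically is below, assuming the first words of the chapter body are known. `find_body_start` and the sample string are hypothetical and only illustrate the idea.

```python
import re

def find_body_start(text: str, anchor: str) -> int:
    """Return the index of the first case-insensitive occurrence of `anchor`."""
    match = re.search(re.escape(anchor), text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"anchor {anchor!r} not found")
    return match.start()

# Hypothetical miniature of the OCR output: headings first, then the body.
sample_ocr = (
    "BOOK I\nCHAPTER I\nOF THE DIVISION OF LABOUR\n"
    "THE greatest improvement in the productive powers"
)
start = find_body_start(sample_ocr, "the greatest improvement")
body = sample_ocr[start:]
print(start, repr(body[:30]))
```

With the real data this would be called as `find_body_start(ocred_text, "the greatest improvement")` in place of the hardcoded slice.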

### Clean Text


In [19]:
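import re

def clean_text(text):
    text=text.lower()
    text=re.sub(r"-\n","",text)   # rejoin words hyphenated across line breaks
    text=re.sub(r"\s+"," ",text)  # collapse runs of whitespace (including \r\n) into one space
    text=re.sub(r"^ ","",text)    # drop the leading space
    return text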
test_text=clean_text(ocred_text[220:])
true_text=clean_text(gutenberg_text)

In [20]:
#Align the end of the text
pattern_index=true_text.index(test_text[-20:])
pattern_index,len(test_text)

(14627, 14681)

In [21]:
# The match index is close to len(test_text), so it is probably the right occurrence.

In [22]:
test_text[-100:]

'ovements have been made by the ingenuity of the makers of the machines, when to make them became the'

In [23]:
true_text[14527:pattern_index+20]

' machines. many improvements have been made by the ingenuity of the makers of the machines, when to make them became the'

In [24]:
#True text looks aligned at 14647
true_text=true_text[:pattern_index+20]
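
One caveat with this end alignment: `str.index` returns the first occurrence, so a short tail that also appears earlier in the chapter would trim the reference too soon. A hedged sketch of a safer variant follows; `trim_to_tail` is a hypothetical helper, not part of the notebook.

```python
def trim_to_tail(reference: str, hypothesis: str, tail_len: int = 20) -> str:
    """Cut `reference` right after the last `tail_len` characters of `hypothesis`.

    Raises if the tail is missing or ambiguous in the reference.
    """
    tail = hypothesis[-tail_len:]
    count = reference.count(tail)
    if count != 1:
        raise ValueError(
            f"tail occurs {count} times in reference; try a different tail_len"
        )
    end = reference.index(tail) + len(tail)
    return reference[:end]

# Hypothetical example: the reference runs past the end of the OCR text.
ref = "the pin is made. the pin is sold. the end of the chapter follows here"
hyp = "the pin is made. the pin is sold."
trimmed = trim_to_tail(ref, hyp, tail_len=12)
print(trimmed)
```

If the OCR text ends in a misread word, the tail will not be found at all, in which case a tail taken a little earlier in the text may work.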

# Error Metrics
We'll use the jiwer library, which is based on this paper: https://www.isca-speech.org/archive_v0/archive_papers/interspeech_2004/i04_2765.pdf. It includes a variety of metrics that should give us a good idea of OCR performance.

In [25]:
import jiwer

jiwer.compute_measures(true_text, test_text)

{'wer': 0.04217579818683485,
 'mer': 0.04169914263445051,
 'wil': 0.06789280093798333,
 'wip': 0.9321071990620167,
 'hits': 2459,
 'substitutions': 69,
 'deletions': 9,
 'insertions': 29}

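- WER: Word Error Rate, (substitutions + deletions + insertions) / number of words in the reference  
- MER: Match Error Rate, the same numerator divided by hits + substitutions + deletions + insertions
- WIL: Word Information Lost
- WIP: Word Information Preserved (1 - WIL)
- hits: words that match exactly
- substitutions: words replaced by a different word, e.g. bat -> cat
- deletions: reference words missing from the OCR text
- insertions: extra words in the OCR text that are not in the reference
- CER: Character Error Rate, the same idea counted over characters instead of words (computed in the next cell)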
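
As a sanity check, the word-level rates in the output above can be recomputed from the four counts. This is a small sketch using the standard definitions; `word_measures` is a hypothetical helper name, not a jiwer function.

```python
def word_measures(hits: int, substitutions: int, deletions: int, insertions: int) -> dict:
    """Recompute WER, MER, WIP and WIL from word-level alignment counts."""
    errors = substitutions + deletions + insertions
    ref_len = hits + substitutions + deletions   # words in the reference text
    hyp_len = hits + substitutions + insertions  # words in the OCR text
    wip = (hits / ref_len) * (hits / hyp_len)
    return {
        "wer": errors / ref_len,
        "mer": errors / (hits + errors),
        "wip": wip,
        "wil": 1 - wip,
    }

# Counts copied from the compute_measures output above.
m = word_measures(hits=2459, substitutions=69, deletions=9, insertions=29)
print(m)
```

Note that WER divides by the reference length only, so a text with many insertions can score above 1.0, while MER stays between 0 and 1.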

In [26]:
character_accuracy=1 - jiwer.cer(true_text,test_text)
character_accuracy

0.982180651327917

# Find Error Examples
98.2% character accuracy is pretty good, but we should still look at examples of the errors.
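
A full diff gets long quickly, so a tally of the edits can help show which kinds of errors dominate. The sketch below uses `difflib.SequenceMatcher`; `tally_word_edits` and the two short word lists are hypothetical stand-ins for `true_words` and `test_words`. difflib's alignment is not guaranteed to be the minimal edit distance, so its counts may differ slightly from jiwer's.

```python
from collections import Counter
from difflib import SequenceMatcher

def tally_word_edits(ref_words, hyp_words):
    """Count each (operation, reference words, OCR words) edit between two word lists."""
    counts = Counter()
    matcher = SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        key = (tag, " ".join(ref_words[i1:i2]), " ".join(hyp_words[j1:j2]))
        counts[key] += 1
    return counts

# Hypothetical miniature of the two texts.
ref = "the greatest improvements in the trade of a pin-maker".split()
hyp = "the greatest improvement in the trade of the pin-maker".split()
edits = tally_word_edits(ref, hyp)
for edit, n in edits.most_common():
    print(n, edit)
```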

In [27]:
import difflib

true_words=true_text.split()
test_words=test_text.split()
# Word-level diff: '-' lines come from the Gutenberg text, '+' lines from the OCR text
diff = difflib.unified_diff(true_words, test_words, lineterm='')
for line in diff:
    print(line)

--- 
+++ 
@@ -1,6 +1,6 @@
 the
 greatest
-improvements
+improvement
 in
 the
 productive
@@ -16,7 +16,7 @@
 skill,
 dexterity,
 and
-judgment,
+judgment
 with
 which
 it
@@ -53,7 +53,7 @@
 be
 more
 easily
-understood,
+understood
 by
 considering
 in
@@ -185,7 +185,7 @@
 a
 number
 of
-workmen,
+workmen
 that
 it
 is
@@ -230,7 +230,7 @@
 greater
 number
 of
-parts,
+parts
 than
 in
 those
@@ -253,6 +253,11 @@
 much
 less
 observed.
+the
+division
+of
+labour
+5
 to
 take
 an
@@ -262,7 +267,7 @@
 a
 very
 trifling
-manufacture,
+manufacture;
 but
 one
 in
@@ -281,8 +286,8 @@
 the
 trade
 of
-a
-pin-maker:
+the
+pin-maker;
 a
 workman
 not
@@ -388,18 +393,18 @@
 draws
 out
 the
-wire;
+wire,
 another
 straights
-it;
+it,
 a
 third
 cuts
-it;
+it,
 a
 fourth
 points
-it;
+it,
 a
 fifth
 grinds
@@ -428,7 +433,7 @@
 is
 a
 peculiar
-business;
+business,
 to
 whiten
 the
@@ -498,7 +503,7 @@
 manufactory
 of
 this
-kind,
+kind
 where
 ten
 men
@@ -573,7 +578,7 @@
 them
 upwards
 of
-forty-ei

Looks like a lot of punctuation errors. A comma vs. a semicolon isn't a huge deal, so let's strip punctuation and score again.

In [30]:
def clean_text(text):
    text=text.lower()
    text=re.sub(r"-\n","",text)
    text=re.sub(r"\s+"," ",text)
    text=re.sub(r"^ ","",text)
    # remove stray "nan" tokens together with their surrounding whitespace
    text=re.sub(r"\snan\s","",text)
    #NEW removal line below. It drops every character except letters, digits, periods and spaces
    text = re.sub(r"[^a-zA-Z\. 0-9]+", "", text)
    
    return text
test_text=clean_text(ocred_text[220:])
true_text=clean_text(gutenberg_text)
pattern_index=true_text.index(test_text[-20:])
pattern_index,len(test_text)

(14336, 14432)

In [31]:
true_text=true_text[:pattern_index+20]
true_text

'the greatest improvements in the productive powers of labour and the greater part of the skill dexterity and judgment with which it is anywhere directed or applied seem to have been the effects of the division of labour. the effects of the division of labour in the general business of society will be more easily understood by considering in what manner it operates in some particular manufactures. it is commonly supposed to be carried furthest in some very trifling ones not perhaps that it really is carried further in them than in others of more importance but in those trifling manufactures which are destined to supply the small wants of but a small number of people the whole number of workmen must necessarily be small and those employed in every different branch of the work can often be collected into the same workhouse and placed at once under the view of the spectator. in those great manufactures on the contrary which are destined to supply the great wants of the great body of the p

In [32]:

measures=jiwer.compute_measures(true_text, test_text)
measures['cer']=jiwer.cer(true_text,test_text)
measures['car']=1-measures['cer']
measures

{'wer': 0.01931415057154119,
 'mer': 0.019110764430577222,
 'wil': 0.023427036228873432,
 'wip': 0.9765729637711266,
 'hits': 2515,
 'substitutions': 11,
 'deletions': 11,
 'insertions': 27,
 'cer': 0.01337419894120925,
 'car': 0.9866258010587907}

Now we've got a 98.7% character accuracy rate. Great!

In [33]:
import difflib

true_words=true_text.split()
test_words=test_text.split()
# Word-level diff: '-' lines come from the Gutenberg text, '+' lines from the OCR text
diff = difflib.unified_diff(true_words, test_words, lineterm='')
for line in diff:
    print(line)

--- 
+++ 
@@ -1,6 +1,6 @@
 the
 greatest
-improvements
+improvement
 in
 the
 productive
@@ -253,6 +253,11 @@
 much
 less
 observed.
+the
+division
+of
+labour
+5
 to
 take
 an
@@ -281,7 +286,7 @@
 the
 trade
 of
-a
+the
 pinmaker
 a
 workman
@@ -593,8 +598,7 @@
 might
 be
 considered
-as
-making
+asmaking
 four
 thousand
 eight
@@ -715,6 +719,11 @@
 be
 so
 much
+the
+wealth
+of
+nations
+6
 subdivided
 nor
 reduced
@@ -1166,6 +1175,11 @@
 never
 so
 much
+the
+division
+of
+labour
+7
 more
 productive
 as
@@ -1233,7 +1247,8 @@
 is
 in
 the
-cornprovinces
+corn
+provinces
 fully
 as
 good
@@ -1442,7 +1457,7 @@
 this
 great
 increase
-in
+of
 the
 quantity
 of
@@ -1491,8 +1506,7 @@
 is
 commonly
 lost
-in
-passing
+inpassing
 from
 one
 species
@@ -1534,7 +1548,7 @@
 dexterity
 of
 the
-workmen
+workman
 necessarily
 increases
 the
@@ -1611,6 +1625,11 @@
 will
 scarce
 i
+the
+wealth
+of
+nations
+8
 am
 assured
 be
@@ -2096,6 +2115,11 @@
 of
 dexterity
 this
+the
+division
+of
+labour

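The biggest remaining errors are running page headers and page numbers picked up by the OCR (e.g. "the division of labour 5"), missed plural s's (improvements vs. improvement), words merged together (asmaking), and minor OCR slips like whenever vs. wherever. Overall not bad. I think it should suffice for an audiobook version. In the next notebook we'll create 2 audio files:  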
- One audio file  based on gutenberg text
- Another audio file based on ocred text
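
The running headers and page numbers that show up as insertions in the diff (e.g. "the division of labour 5") could be filtered out of each page before the pages are joined. The sketch below assumes the header sits on its own line in upper case, possibly with the page number on the same line or on a line by itself; the actual layout of the OCR pages is not shown here, so treat `RUNNING_HEADERS` and `strip_running_headers` as hypothetical.

```python
import re

# Assumed header texts, taken from the insertions seen in the diff.
RUNNING_HEADERS = ("THE WEALTH OF NATIONS", "THE DIVISION OF LABOUR")

_header_re = re.compile(
    r"^\s*(?:\d+\s+)?(?:"
    + "|".join(re.escape(h) for h in RUNNING_HEADERS)
    + r")(?:\s+\d+)?\s*$"
)
_page_number_re = re.compile(r"^\s*\d+\s*$")

def strip_running_headers(page: str) -> str:
    """Drop lines that are only a running header and/or a page number."""
    kept = [
        line
        for line in page.split("\n")
        if not _header_re.match(line) and not _page_number_re.match(line)
    ]
    return "\n".join(kept)

# Hypothetical page samples.
page_a = "THE DIVISION OF LABOUR 5\nto take an example, therefore, from a very trifling\nmanufacture;"
page_b = "THE WEALTH OF NATIONS\n6\nsubdivided, nor reduced"
print(strip_running_headers(page_a))
print(strip_running_headers(page_b))
```

The match is case-sensitive on purpose, so a body sentence that happens to contain "the division of labour" is left alone.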

In [58]:
#Save the text for use by the next notebook, audio file generation

#We'll use the text with punctuation and make sure paragraphs have extra spacing
import re

def clean_text_for_audio(text):
    text=re.sub(r"\n\n","<NEW PARAGRAPH>",text)  # mark paragraph breaks before collapsing whitespace
    text=re.sub(r"-\n","",text)                  # rejoin words hyphenated across line breaks
    text=re.sub(r"\s+"," ",text)
    text=re.sub(r"^ ","",text)
    text=re.sub(r"\s?<NEW PARAGRAPH>\s?","\n\n",text)  # restore paragraph breaks
    return text
test_text=clean_text_for_audio(ocred_text[220:])
true_text=clean_text_for_audio(gutenberg_text)
#Align the end of the text
pattern_index=true_text.index(test_text[-20:])
true_text=true_text[:pattern_index+20]
print(true_text)

The greatest improvements in the productive powers of labour, and the greater part of the skill, dexterity, and judgment, with which it is anywhere directed, or applied, seem to have been the effects of the division of labour. The effects of the division of labour, in the general business of society, will be more easily understood, by considering in what manner it operates in some particular manufactures. It is commonly supposed to be carried furthest in some very trifling ones; not perhaps that it really is carried further in them than in others of more importance: but in those trifling manufactures which are destined to supply the small wants of but a small number of people, the whole number of workmen must necessarily be small; and those employed in every different branch of the work can often be collected into the same workhouse, and placed at once under the view of the spectator.

In those great manufactures, on the contrary, which are destined to supply the great wants of the gre

In [59]:
print(test_text)

THE greatest improvement in the productive powers of labour, and the greater part of the skill, dexterity, and judgment with which it is anywhere directed, or applied, seem to have been the effects of the division of labour. The effects of the division of labour, in the general business of society, will be more easily understood by considering in what manner it operates in some particular manufactures. It is commonly supposed to be carried furthest in some very trifling ones; not perhaps that it really is carried further in them than in others of more importance: but in those trifling manufactures which are destined to supply the small wants of but a small number of people, the whole number of workmen must necessarily be small; and those employed in every different branch of the work can often be collected into the same workhouse, and placed at once under the view of the spectator. In those great manufactures, on the contrary, which are destined to supply the great wants of the great b

## Uh Oh - No Test Text Paragraphs
Hmm, the paragraph information is lost in the OCR text. Let's roll without it for now. It would need to be handled at the OCR stage, and Microsoft doesn't seem to provide it unless we do some basic geometry on the detected lines ourselves.
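
For a rough idea of what that "basic geometry" could look like, here is a sketch that starts a new paragraph whenever a line is indented relative to the page's usual left margin. It assumes the OCR result can be reduced to `(text, left_x)` pairs in reading order; that structure, `group_paragraphs` and `indent_threshold` are all hypothetical and not the format of any particular OCR API.

```python
from statistics import median

def group_paragraphs(lines, indent_threshold=15.0):
    """Group (text, left_x) lines into paragraphs using first-line indentation.

    A line whose left edge sits more than `indent_threshold` units to the right
    of the median left edge is treated as the start of a new paragraph.
    """
    if not lines:
        return []
    margin = median(left_x for _, left_x in lines)
    paragraphs = [[]]
    for text, left_x in lines:
        if left_x - margin > indent_threshold and paragraphs[-1]:
            paragraphs.append([])
        paragraphs[-1].append(text)
    return [" ".join(p) for p in paragraphs]

# Hypothetical line data: two indented first lines, three flush-left lines.
sample_lines = [
    ("THE greatest improvement in the", 120.0),
    ("productive powers of labour.", 100.0),
    ("The effects of the division", 121.0),
    ("of labour, in the general", 100.0),
    ("business of society", 100.0),
]
for paragraph in group_paragraphs(sample_lines):
    print(paragraph)
```

This ignores hyphenated line endings and paragraphs that start at the top of a page, so it is only a starting point.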

In [61]:
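# Note: these files land in pages/, so re-running the glob("pages/*.txt") cell above would read them back in as OCR pages
with open("pages/gutenberg.txt","w") as f:
    f.write(true_text)

with open("pages/ocr.txt","w") as f:
    f.write(test_text)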

# Further Work
We used Wealth of Nations because it is not under US copyright and Project Gutenberg already has a text transcription of it.  
  
In data science work it's best to have a labeled data set you can use to build your models and measure your accuracy.  
  
Should you want to work on a book OCR scanner more seriously, I'd recommend trying books from Gutenberg as a way to build your model and refine your methods. Also, because we have the Gutenberg text, we can evaluate our page splitting algorithms without actually having a labeled page split database!  
  
Ideally we would have that database, but with Gutenberg as ground truth we can run each candidate page through a reasonably good OCR pipeline and compare the result against the known text to see whether the page was identified correctly.
  
### Improve OCR Output
Many people also run OCR output through a text model to fix spelling and grammar. A few starting points:  
- https://www.pyimagesearch.com/2021/11/29/using-spellchecking-to-improve-tesseract-ocr-accuracy/
- https://www.statestitle.com/resource/using-nlp-bert-to-improve-ocr-accuracy/  
  
Hardcoded rules can also be useful, depending on what you are OCR'ing. For example, a token like a73bat is likely garbage unless you are parsing something like invoices whose IDs follow that format.
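
A minimal sketch of one such rule is below: it flags tokens that mix letters and digits, which are unlikely in running prose. `flag_suspect_tokens` is a hypothetical helper and the sample sentence is made up.

```python
import re

# A token made only of letters and digits that contains at least one of each.
_mixed_re = re.compile(r"^(?=.*[a-z])(?=.*\d)[a-z\d]+$", re.IGNORECASE)

def flag_suspect_tokens(text: str) -> list:
    """Return tokens that mix letters and digits, e.g. 'a73bat'."""
    suspects = []
    for token in text.split():
        core = token.strip(".,;:!?()\"'")  # ignore surrounding punctuation
        if _mixed_re.match(core):
            suspects.append(core)
    return suspects

sample = "a workman a73bat could make 20 pins, in 1776."
print(flag_suspect_tokens(sample))
```

Ordinals such as 2nd or 18th would also be flagged, so a real rule set would need an allowlist for the patterns your documents legitimately contain.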