In [15]:
from textblob import TextBlob
import string

# Input for the benchmarking test

These text files are the output of extracting text from a PDF with the PyMuPDF and pypdf libraries.

The source is ["A History of Rome to 565 A. D."](https://www.gutenberg.org/files/32624/32624-pdf.pdf), pages 23 to 26.


In [16]:
fitz = 'from_guttenberg_pymupdf.txt' # from PyMuPDF
pypdf = 'from_guttenberg_pypdf.txt' # from pypdf

# Reading the inputs
The first step in benchmarking how closely each library's extracted text matches the source is to read each file into a Python string.

In [17]:
with open(fitz, 'r', encoding = 'utf-8') as f:
    fitz_text = f.read().replace('\n', ' ')

with open(pypdf, 'r', encoding = 'utf-8') as p:
    pypdf_text = p.read().replace('\n', ' ')
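
Replacing each newline with a space is simple, but if the extracted text contains words hyphenated across line ends, the two halves stay apart and may later count as misspellings. Whether that applies to these files is not shown here, so the following is only a sketch of one way to handle it. `join_lines` is a hypothetical helper, not part of the notebook.

```python
import re


def join_lines(text):
    """Rejoin words split by a hyphen at a line end, then flatten whitespace."""
    # 'penin-\nsula' -> 'peninsula' (a hyphen directly followed by a newline)
    dehyphenated = re.sub(r'(?<=\w)-\n(?=\w)', '', text)
    # Collapse newlines and repeated spaces into single spaces.
    return ' '.join(dehyphenated.split())


raw = 'the southern penin-\nsula is slightly\nlarger'
print(join_lines(raw))
```

The trade-off is that a genuine compound broken at a line end (for example `well-` followed by `known`) would be merged into one word, so this may not suit every document.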

# Cleaning the inputs
The second step strips punctuation from the inputs and converts all letters to lowercase.

In [18]:
# PyMuPDF-result clean text
clean_fitz = fitz_text.translate(str.maketrans('', '', string.punctuation)).lower()
clean_fitz

'a history of rome to 565 ad  chapter i  the geography of italy  tly ribbed by the apennines girdled by the alps and th sea juts out ikea �long piethead� fom europe towards the northern coast of africa it includes two regions of widely differing physical characteristics dhe norther continental the southern peninsulat the peninsula is slightly larger than the continental portion together their reais about 91200 square miles  �continental italy the continental portion of kaly consist of the southem watershed ofthe alps and the northem watershed of the apennines with the intervening lowland plain drained for the most part by the river po and sts numerous tributaries on the north the alps extend in an irregular crescent of over 1200 mils rom the mediterranean to the adriatic they rise abruptly �on the italian side but their north slope is gradual with easy passes leading over the divide to the southern plain �thus they invite rather than deter immigration from central europe east and west 

In [19]:
# pypdf-result clean text
clean_pypdf = pypdf_text.translate(str.maketrans('', '', string.punctuation)).lower()
clean_pypdf

'4 a history of rome to 565 a d more than 125 miles in striking contrast to the plains of the po southern italy is traversed throughout by the parallel ridges of the apennines which give it an endless diversity of hill and valley the average height of these mountains which form a sort of vertebrate system for the peninsula  apennino dorso italia dividitur  livy xxxvi 15 is about 4000 feet and even their highest peaks 9500 feet are below the line of perpetual 4 snow the apennine chain is highest on its eastern side where it approaches closely to the adriatic leaving only a narrow strip of coast land intersected by numerous short mountain torrents on the west the mountains are lower and recede further from the sea leaving the wide lowland areas of etruria latium and campania on this side too are rivers of considerable length navigable for small craft the volturnus and liris the tiber and the arno whose valleys link the coast with the highlands of the interior the coastline in comparison 
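
One detail worth noting about this cleaning step: `string.punctuation` contains only ASCII punctuation, so typographic characters such as curly quotes pass through `translate` untouched. A minimal sketch of a Unicode-aware variant follows, using the standard `unicodedata` module. The helper name `strip_punctuation` and the sample string are made up for illustration.

```python
import string
import unicodedata


def strip_punctuation(text):
    """Drop every character whose Unicode category is punctuation (P*), then lowercase."""
    kept = [ch for ch in text if not unicodedata.category(ch).startswith('P')]
    return ''.join(kept).lower()


sample = 'A \u201cLong Pier-head\u201d, from Europe.'

# ASCII-only cleaning, as in the cells above: the curly quotes survive.
ascii_clean = sample.translate(str.maketrans('', '', string.punctuation)).lower()

# Unicode-aware cleaning: the curly quotes are removed as well.
unicode_clean = strip_punctuation(sample)

print(ascii_clean)
print(unicode_clean)
```

Symbols such as `$` or `+` belong to Unicode category S, not P, so this variant is not a strict superset of `string.punctuation`. Combining both filters is an option if symbols should go too.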

# Creating benchmarking function
I define two benchmarking functions.
The first counts the misspelled words in the result. The second calculates the spelling accuracy of the result.

In [20]:
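def mispelled_word_count(sentence):
    # Split on spaces and drop the empty strings left by repeated spaces
    wordlist = [word for word in sentence.split(' ') if word != '']

    count = 0
    for word in wordlist:
        # A word counts as misspelled when TextBlob's corrector would change it
        corrected = str(TextBlob(word).correct())
        if word != corrected:
            count += 1
    return count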
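
Because the same word often appears many times in a page of prose, one possible refinement is to check each distinct word once and weight it by its frequency. The sketch below assumes a hypothetical `is_misspelled` callable supplied by the caller (for instance a thin wrapper around `TextBlob(word).correct()`). The small vocabulary-based checker is only a stand-in so that the example runs without any third-party package.

```python
from collections import Counter


def count_misspelled(sentence, is_misspelled):
    """Count misspelled tokens, calling the checker once per distinct word."""
    frequencies = Counter(sentence.split())
    return sum(n for word, n in frequencies.items() if is_misspelled(word))


# Stand-in checker (an assumption for this example): a word is "misspelled"
# when it is missing from a small known vocabulary.
VOCABULARY = {'the', 'river', 'po', 'and', 'its', 'numerous', 'tributaries'}


def not_in_vocabulary(word):
    return word not in VOCABULARY


text = 'the river po and sts numerous tributaries and the rver'
print(count_misspelled(text, not_in_vocabulary))
```

Unlike `split(' ')`, the bare `split()` used here separates on any whitespace and never yields empty strings, so no extra filtering is needed.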

In [21]:
def spelling_accuracy(sentence):
    wordlist = [word for word in sentence.split(' ') if word != '']
    mispell = mispelled_word_count(sentence)
    # Share of misspelled words, as a percentage of all words
    error_percentage = (mispell / len(wordlist)) * 100

    return 100 - error_percentage
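
As a worked example of the formula above, accuracy is `100 - (misspelled / total) * 100`. The sketch below isolates that arithmetic in a hypothetical helper, `accuracy_from_counts`, which is not part of the notebook, and treats an empty input as an explicit error so that the division cannot fail silently. The counts are made up for illustration.

```python
def accuracy_from_counts(misspelled, total):
    """Return spelling accuracy in percent for the given word counts."""
    if total <= 0:
        raise ValueError('total must be positive')
    return 100 - (misspelled / total) * 100


# Hypothetical counts: 5 misspelled words out of 200.
print(f'{accuracy_from_counts(5, 200):.2f}%')  # prints 97.50%
```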

# Testing the extracted product
This is where the benchmarking begins.

In [22]:
miss_fitz = mispelled_word_count(clean_fitz)
miss_pypdf = mispelled_word_count(clean_pypdf)

acc_fitz = spelling_accuracy(clean_fitz)
acc_pypdf = spelling_accuracy(clean_pypdf)

# The result is ...

In [25]:
print(f'The result of PyMuPDF: {miss_fitz} words misspelled, {acc_fitz:.2f}% accuracy.')
print(f'The result of pypdf: {miss_pypdf} words misspelled, {acc_pypdf:.2f}% accuracy.')

The result of PyMuPDF: 121 words misspelled, 90.16% accuracy.
The result of pypdf: 40 words misspelled, 96.10% accuracy.


From the result above, the **accuracy of pypdf is higher than that of PyMuPDF**. The gap comes from how the text was obtained in each case. pypdf reads the text embedded in the PDF as-is, which also means that it cannot extract anything from a scanned document that has no text layer. The PyMuPDF text, on the other hand, is less accurate because the pages were converted to images and then run through Tesseract OCR, which introduces recognition errors. The advantage of the OCR route is that it works on any kind of PDF, whether scanned or saved directly from an application. Note that OCR is an optional path in PyMuPDF, which can also read an embedded text layer directly.