# Corpus Preprocessing — Performing OCR with PyTesseract 

In this notebook, we will run pytesseract to produce machine-readable text from:
* a single image
* a multi-page PDF
* a corpus of multi-page PDFs

## Importing tools

In [None]:
import pytesseract
from PIL import Image

## Processing one image

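This is how we can **perform OCR on the image** of a newspaper article ('*Die Grippe wütet weiter*') that you have already seen above. The `lang='frk'` argument selects Tesseract's Fraktur model, which has to be installed alongside Tesseract: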

In [None]:
ocr_output = pytesseract.image_to_string(Image.open('grippe.png'), lang='frk') 
print(ocr_output)
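
If the call above fails with a language-related error, the Fraktur model is probably not installed. The sketch below is one way to check this up front. The helper names `missing_languages` and `check_tesseract_languages` are hypothetical (they are not part of pytesseract); `pytesseract.get_languages` is the library call that lists the installed models, and it needs a working Tesseract installation.

```python
def missing_languages(requested, available):
    """Return the requested language codes that are not installed.

    `requested` uses Tesseract's 'frk' or 'frk+deu' notation;
    `available` is a list of installed language codes.
    """
    return [code for code in requested.split('+') if code not in available]


def check_tesseract_languages(requested):
    """Compare the requested codes with what the local Tesseract reports."""
    # Imported here so that missing_languages() can be used without pytesseract
    import pytesseract
    return missing_languages(requested, pytesseract.get_languages(config=''))
```

Calling `check_tesseract_languages('frk')` should return an empty list when the Fraktur model is available, and `['frk']` otherwise.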

## Processing a (multi-page) PDF

With a bit more Python code, we can also use pytesseract to OCR entire **PDF files with many pages**:

In [None]:
from pathlib import Path
from pdf2image import convert_from_path
from tqdm import tqdm

In [None]:
sample_pdf_path = Path('../data/pdf/SNP27112366-19181224-0-0-0-0.pdf')
recognized_pages = []
# Render each PDF page as a PIL image; tqdm shows a progress bar over the pages
converted_pdf = tqdm(convert_from_path(sample_pdf_path, use_cropbox=True))
for image in converted_pdf:
    # OCR the page with the Fraktur model and keep one text entry per page
    recognized = pytesseract.image_to_string(image, lang='frk')
    recognized_pages.append(recognized)

Let's look at the first page:

In [None]:
print(recognized_pages[0])

And the last page:

In [None]:
print(recognized_pages[-1])

Neither of these results looks very good, mostly because of the scan quality and the general challenges of working with old newspapers. In the next parts, we will learn how to:
* a) measure the OCR quality
* b) improve the quality at the OCR post-correction stage
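
As a preview of point a), one common way to quantify OCR quality is the character error rate (CER): the edit distance between the OCR output and a manually corrected reference, divided by the length of the reference. The sketch below is a minimal pure-Python illustration under that assumption; the function names are hypothetical, and the later parts may use a different metric or a dedicated library.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, and
    substitutions needed to turn string `a` into string `b`."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError('reference text must not be empty')
    return levenshtein(reference, hypothesis) / len(reference)
```

For example, `character_error_rate(ground_truth, recognized_pages[0])` would score the first page, assuming `ground_truth` holds a hand-corrected transcription of that page. Lower values are better, and the value can exceed 1 when the OCR output is much longer than the reference.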

## (Advanced) Processing the whole corpus of PDFs with the same OCR engine

The code below processes every file with the `.pdf` extension in the folder `'../data/pdf'` and writes the results to `'../data/txt'` (same filenames, but with the `.txt` extension instead of `.pdf`). **WARNING**: For a large number of PDFs (more than 5), this will take a long time.

In [None]:
pathpdf = Path('../data/pdf')
pathtxt = Path('../data/txt')

In [None]:
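# Create the output folder if it does not exist yet
pathtxt.mkdir(parents=True, exist_ok=True)
# Build the list first so that tqdm knows the total number of files
pdf_files = sorted(p for p in pathpdf.iterdir() if p.suffix == '.pdf')
for filename in tqdm(pdf_files):
    converted_pdf = convert_from_path(filename, use_cropbox=True)
    # Same filename as the PDF, but with a '.txt' extension
    output_path = pathtxt / filename.with_suffix('.txt').name
    # Write UTF-8 explicitly: the OCR output contains non-ASCII characters
    with output_path.open('w', encoding='utf-8') as output_txt:
        for image in converted_pdf:
            recognized = pytesseract.image_to_string(image, lang='frk')
            output_txt.write(recognized)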
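
Because a full run is slow, it can help to skip PDFs that already have a text file, so that an interrupted run can be resumed. The following is a minimal sketch of that idea; `txt_path_for` and `pending_pdfs` are hypothetical helper names, not part of pytesseract or pdf2image.

```python
from pathlib import Path


def txt_path_for(pdf_path, txt_dir):
    """Return the output path: same filename as the PDF, but ending in '.txt'."""
    return Path(txt_dir) / Path(pdf_path).with_suffix('.txt').name


def pending_pdfs(pdf_dir, txt_dir):
    """List (pdf, txt) path pairs for PDFs that have no text file yet."""
    pairs = []
    for pdf_path in sorted(Path(pdf_dir).iterdir()):
        if pdf_path.suffix != '.pdf':
            continue  # ignore anything that is not a PDF
        txt_path = txt_path_for(pdf_path, txt_dir)
        if not txt_path.exists():
            pairs.append((pdf_path, txt_path))
    return pairs
```

The corpus loop could then iterate over `pending_pdfs(pathpdf, pathtxt)` instead of over every file in the folder. Note that this check only looks at whether the text file exists: a file left half-written by an interrupted run would be skipped, so such files should be deleted before resuming.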