# Data Input

## 1. Types of input data for text corpora

Textual data might come in different forms. 

1. It could be **plain text**:

```
Die Grippe wütet weiter
Zunahme der schweren Fälle in Berlin. Die Zahl der Grippefälle ist in den letzten beiden Tagen auch in Groß-Berlin noch deutlich gestiegen. Die Warenhäuser und sonstigen Geschäfte, die Kriegs- und die privaten Betriebe klagen, dass übermäßig viele Angestellte krank melden müssen, und auch bei der Post und bei der Straßenbahn ist die Zahl der Grippekranken bedeutend gestiegen.
```

2. It could be **images** (PDF, JPG, etc.):

<img src="grippe.png" width=700>

(source: Berliner Morgenpost, October 15, 1918)

3. It could be some **structured markup** (XML/HTML):

```
<text>
    <head>
        Die Grippe wütet weiter
    </head>
    <p>
        <s>Zunahme der schweren Fälle in Berlin.</s> 
        <s>Die Zahl der Grippefälle ist in den letzten beiden Tagen auch in Groß-Berlin noch deutlich gestiegen.</s>
        <s>Die Warenhäuser und sonstigen Geschäfte, die Kriegs- und die privaten Betriebe klagen, dass übermäßig viele Angestellte krank melden müssen, und auch bei der Post und bei der Straßenbahn ist die Zahl der Grippekranken bedeutend gestiegen.</s>
    </p>
</text>
```

#### We have to be able to use all these formats and homogenise different sources into a unified corpus. 

In most cases, working with plain text is the simplest option (though sometimes you might actually want to *keep* structured XML/HTML markup and rely on that structure in your analysis). So we will convert everything to plain text.
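As a minimal sketch of that conversion for the markup case, the snippet below strips the tags from a shortened copy of the XML example above using Python's standard `xml.etree.ElementTree` module. The function name `xml_to_plain_text` and the layout of one output line per top-level element are illustrative assumptions, not a fixed convention of this tutorial.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the markup example shown earlier.
xml_source = """<text>
    <head>
        Die Grippe wütet weiter
    </head>
    <p>
        <s>Zunahme der schweren Fälle in Berlin.</s>
        <s>Die Zahl der Grippefälle ist in den letzten beiden Tagen auch in Groß-Berlin noch deutlich gestiegen.</s>
    </p>
</text>"""


def xml_to_plain_text(xml_string):
    """Return the text of each top-level element (headline, paragraph) on its own line."""
    root = ET.fromstring(xml_string)
    lines = []
    for element in root:
        # itertext() yields the text of the element and of all its children;
        # split() + join() collapses the indentation whitespace of the markup.
        text = " ".join("".join(element.itertext()).split())
        if text:
            lines.append(text)
    return "\n".join(lines)


plain_text = xml_to_plain_text(xml_source)
print(plain_text)
```

Real corpora usually need more care, for example namespaces or metadata elements that should be skipped, so treat this as a starting point only.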

<img src="homogenisationchart.png">

## 2. Images into digital text: OCR

To process images into digital text we need an **Optical Character Recognition (OCR)** tool. 

### 2.1. How OCR works

![](grippeocr.gif)

A modern OCR pipeline typically runs through the following stages:

1. **Preprocessing**: This initial step involves preparing the image for analysis and recognition. Common preprocessing tasks include:
   - **Noise Reduction**: Removing noise from the image to enhance the text's clarity, typically with filters such as a Gaussian blur or a median filter.
   - **Binarization**: Converting the image from grayscale or color to black-and-white, where text is typically represented as black pixels on a white background. This helps in distinguishing the text from the background.
   - **Normalization**: Standardizing the brightness and contrast of the image to reduce variability between different images.
   - **Dewarping**: Correcting any image distortions that result from curved surfaces or skewed scanning angles.

2. **Segmentation**: This step divides the image into parts that are easier to analyze. Segmentation levels can vary based on the complexity of the layout and the requirements of the application:
   - **Page Segmentation**: Identifying different blocks of text, images, or other elements on a page.
   - **Line Segmentation**: Breaking down text blocks into individual lines.
   - **Word Segmentation**: Further dividing lines into words.
   - **Character Segmentation**: The final step where words are broken down into individual characters.

3. **Feature Extraction**: In this stage, the algorithm extracts features from the segmented characters that are useful for recognition. This might include the basic shape, line endpoints, intersections, and other geometrical and topological characteristics.

4. **Character Recognition**: At this stage, each character image is analyzed and compared against a pre-trained model to identify the most likely corresponding textual character. Techniques used in this stage can vary:
   - **Pattern Recognition**: Using methods such as support vector machines or neural networks to recognize characters based on the features extracted.
   - **Template Matching**: Comparing character images to a set of predefined character templates to find the best match.

5. **Post-processing**: After characters are recognized, the algorithm performs corrections based on context and additional information:
   - **Spell Checking and Correction**: Identifying and correcting misspelled words using dictionaries and context-based algorithms.
   - **Language and Grammar Analysis**: Applying language-specific rules to improve the accuracy of the output text.

6. **Output Formatting**: The final text is formatted according to the desired output specifications, which may include maintaining the layout, fonts, and style of the original text.
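The preprocessing stage (step 1) can be sketched with Pillow, the imaging library used later in this tutorial. This is an illustration under stated assumptions rather than what any particular OCR engine does internally: the function name `preprocess_for_ocr`, the fixed global threshold of 128, and the synthetic test "page" are all made up for the example, and dewarping is left out.

```python
from PIL import Image, ImageFilter, ImageOps


def preprocess_for_ocr(image, threshold=128):
    """Grayscale -> contrast normalization -> median filter -> binarization."""
    gray = ImageOps.grayscale(image)           # drop colour information
    normalized = ImageOps.autocontrast(gray)   # stretch brightness to the full 0-255 range
    # A 3x3 median filter removes isolated specks (noise reduction)
    denoised = normalized.filter(ImageFilter.MedianFilter(size=3))
    # Fixed global threshold (an assumption; uneven scans often need an adaptive one)
    return denoised.point(lambda value: 255 if value >= threshold else 0)


# Synthetic stand-in for a scan: light grey "paper", a dark "ink" block, one black speck
page = Image.new("RGB", (40, 40), (200, 200, 200))
page.paste((40, 40, 40), (10, 10, 30, 30))
page.putpixel((2, 2), (0, 0, 0))

clean = preprocess_for_ocr(page)
print(clean.getpixel((20, 20)), clean.getpixel((2, 2)))  # ink pixel, former speck
```

The returned image can be passed to an OCR engine in place of the raw scan. Whether this helps depends on the material, since many engines already apply their own binarization.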

Deep learning can replace or improve most of these stages. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) learn features directly from pixel data, and line-based recognizers such as the LSTM engine introduced in Tesseract 4 read whole text lines without an explicit character segmentation step. These models usually outperform traditional methods, especially on complex layouts or noisy scans.
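The post-processing stage (step 5) can be illustrated with a toy dictionary-based corrector built on Python's standard `difflib` module. The word list `LEXICON`, the similarity cutoff of 0.8, and the function names are assumptions chosen for the example. A real system would use a large lexicon suited to the period and would also handle punctuation and context.

```python
import difflib

# Tiny illustrative word list (an assumption; not a real lexicon)
LEXICON = ["Die", "Grippe", "Berlin", "Zahl", "Fälle", "gestiegen",
           "Straßenbahn", "wütet", "weiter"]


def correct_token(token, lexicon=LEXICON, cutoff=0.8):
    """Replace a token with its closest lexicon entry if one is similar enough."""
    if token in lexicon:
        return token
    candidates = difflib.get_close_matches(token, lexicon, n=1, cutoff=cutoff)
    return candidates[0] if candidates else token


def correct_line(line):
    """Correct a whitespace-separated line token by token."""
    return " ".join(correct_token(token) for token in line.split())


print(correct_line("Die Grivpe wütet weiter"))
```

Tokens with no sufficiently similar lexicon entry are left unchanged, which keeps the corrector from replacing names and rare words with unrelated entries.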

### 2.2. How one performs OCR 

OCR is increasingly built into everyday software such as PDF viewers: macOS Preview and Adobe Acrobat both have it. These tools are not suited to bulk processing, however, so turning large quantities of images into a machine-readable corpus still requires specialized OCR software or programming packages.

### 2.2.1. What OCR tools there are

OCR tools are developing rapidly (like all other areas of text processing), so new tools keep challenging the established ones. As of 2024, the well-known products were:

* FineReader (commercial, has a desktop interface)
* Tesseract (open source, command-line interface, also some third-party desktop interfaces)
* OCR4all (open source, browser-based graphical interface, deployable locally via Docker)
* Kraken & eScriptorium (open source; Kraken is a command-line tool, eScriptorium provides a browser-based graphical interface)
* EasyOCR (open source, Python library)

### 2.2.2. Performing OCR with Tesseract

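We'll use Tesseract in this tutorial because it is free and open source. Specifically, we'll use the Python package `pytesseract`, a wrapper that calls the Tesseract program. Tesseract itself and the language models you need must therefore be installed separately.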

In [None]:
#!pip install pytesseract
#!pip install pillow

In [None]:
import pytesseract
from PIL import Image
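Because `pytesseract` only wraps the Tesseract program, the OCR calls in this tutorial fail if Tesseract or the `frk` language model is missing. The following sketch checks for both using only the standard library and Tesseract's `--list-langs` command-line option. The helper names `parse_list_langs` and `installed_languages` are made up for this example.

```python
import shutil
import subprocess


def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Language codes contain no spaces; the header line and any warnings do.
    return [line for line in lines if " " not in line]


def installed_languages():
    """Return the language codes of the local Tesseract, or [] if it is not on the PATH."""
    if shutil.which("tesseract") is None:
        return []
    result = subprocess.run(["tesseract", "--list-langs"],
                            capture_output=True, text=True, check=False)
    # Depending on the Tesseract version, the list goes to stdout or stderr.
    return parse_list_langs(result.stdout + "\n" + result.stderr)


languages = installed_languages()
if "frk" not in languages:
    print("Tesseract or its 'frk' model seems to be missing; found:", languages)
```

If the message appears, install Tesseract and the missing language data through your operating system's package manager or the Tesseract project's downloads, then rerun the check.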

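This is how we can perform OCR on the image of the newspaper article ('*Die Grippe wütet weiter*') that you have already seen above. The argument `lang='frk'` selects Tesseract's Fraktur model, which suits the blackletter type common in German newspapers of that period: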

In [None]:
ocr_output = pytesseract.image_to_string(Image.open('grippe.png'), lang='frk') 
print(ocr_output)
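Besides plain text, `pytesseract.image_to_data` returns a word-level table that includes a confidence value for each recognized word. The sketch below shows one possible way to list the words Tesseract itself is unsure about. The threshold of 60 and both helper names are arbitrary assumptions, and the `example` dictionary is hand-made, with invented values, to mimic the shape of the real output.

```python
def low_confidence_words(ocr_data, threshold=60.0):
    """Return (word, confidence) pairs whose confidence is below `threshold`."""
    flagged = []
    for word, conf in zip(ocr_data["text"], ocr_data["conf"]):
        conf = float(conf)  # some pytesseract versions return strings
        # A confidence of -1 marks layout boxes (blocks, lines), not recognized words
        if word.strip() and 0 <= conf < threshold:
            flagged.append((word, conf))
    return flagged


def ocr_with_confidences(image_path, lang="frk"):
    """Run Tesseract and return its word-level table as a dict of lists."""
    # Imported here so that the helper above also works without Tesseract installed
    import pytesseract
    from PIL import Image
    return pytesseract.image_to_data(Image.open(image_path), lang=lang,
                                     output_type=pytesseract.Output.DICT)


# Hand-made data in the same shape as the DICT output (values are invented)
example = {"text": ["", "Die", "Grivpe", "wütet"], "conf": ["-1", 96, 41.5, 88]}
print(low_confidence_words(example))
```

On a real scan, the call would be `low_confidence_words(ocr_with_confidences('grippe.png'))`. Note that engine confidence is only a rough hint and does not replace a proper quality measurement.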

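With a bit more Python code, we can also use pytesseract to process entire PDF files with many pages. This needs two more packages: `pdf2image`, which renders PDF pages as images and relies on the Poppler utilities being installed, and `tqdm`, which displays progress bars: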

In [None]:
from pathlib import Path
from pdf2image import convert_from_path
from tqdm import tqdm

In [None]:
sample_pdf_path = Path('../data/pdf/SNP27112366-19181224-0-0-0-0.pdf')
recognized_pages = []
# Render every PDF page as a PIL image (use_cropbox=True uses the crop box instead of the media box)
converted_pdf = convert_from_path(sample_pdf_path, use_cropbox=True)
# OCR the pages one by one; tqdm shows a progress bar
for image in tqdm(converted_pdf):
    recognized = pytesseract.image_to_string(image, lang='frk')
    recognized_pages.append(recognized)

First page:

In [None]:
print(recognized_pages[0])

Last page:

In [None]:
print(recognized_pages[-1])

Neither result looks very good, mostly because of the scan quality and the general difficulty of OCR on old newspapers. In the next parts we will learn how to:
* a) measure the OCR quality
* b) improve the quality at the OCR post-correction stage

#### P.S. Processing the whole corpus of PDFs with the same OCR engine

The code below processes every file with the extension `.pdf` in the folder `'../data/pdf'` and writes the results to the folder `'../data/txt'` (same filenames, but with `.txt` instead of `.pdf`). For more than about five PDFs this will take a long time.

In [None]:
pathpdf = Path('../data/pdf')
pathtxt = Path('../data/txt')

In [None]:
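pathtxt.mkdir(parents=True, exist_ok=True)  # create the output folder if it does not exist yet
# sorted() turns the directory iterator into a list, so tqdm can show the total
for filename in tqdm(sorted(pathpdf.iterdir())):
    if filename.suffix == '.pdf':
        converted_pdf = convert_from_path(filename, use_cropbox=True)
        output_path = pathtxt / (filename.stem + '.txt')
        # UTF-8 keeps umlauts and ß intact regardless of the platform's default encoding
        with output_path.open('w', encoding='utf-8') as output_txt:
            for image in converted_pdf:
                recognized = pytesseract.image_to_string(image, lang='frk')
                output_txt.write(recognized)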
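Since a full run can take a long time, it can help to skip PDFs that already have a text file, so that an interrupted run can be resumed. The helper below is a sketch of that idea using only `pathlib`. The name `pending_pdfs` is made up for this example, and the demo uses a temporary folder with empty placeholder files instead of real data.

```python
import tempfile
from pathlib import Path


def pending_pdfs(pdf_dir, txt_dir):
    """Return the PDFs in `pdf_dir` that have no matching .txt file in `txt_dir` yet."""
    pdf_dir, txt_dir = Path(pdf_dir), Path(txt_dir)
    todo = []
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        txt_path = txt_dir / (pdf_path.stem + ".txt")
        if not txt_path.exists():
            todo.append(pdf_path)
    return todo


# Demo with placeholder files: a.pdf is already processed, b.pdf is not
with tempfile.TemporaryDirectory() as tmp:
    demo_pdf, demo_txt = Path(tmp, "pdf"), Path(tmp, "txt")
    demo_pdf.mkdir()
    demo_txt.mkdir()
    (demo_pdf / "a.pdf").touch()
    (demo_pdf / "b.pdf").touch()
    (demo_txt / "a.txt").touch()
    remaining = [path.name for path in pending_pdfs(demo_pdf, demo_txt)]

print(remaining)
```

In the loop above, one would then iterate over `pending_pdfs(pathpdf, pathtxt)` instead of the whole folder. A text file left half-written by an interrupted run still counts as done under this check, so such a file should be deleted before resuming.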