# Data Input and Homogenisation

## 1. Types of input data for text corpora

Textual data might come in different forms. 

1. It could be **plain text**:

```
Die Grippe wütet weiter
Zunahme der schweren Fälle in Berlin. Die Zahl der Grippefälle ist in den letzten beiden Tagen auch in Groß-Berlin noch deutlich gestiegen. Die Warenhäuser und sonstigen Geschäfte, die Kriegs- und die privaten Betriebe klagen, dass übermäßig viele Angestellte krank melden müssen, und auch bei der Post und bei der Straßenbahn ist die Zahl der Grippekranken bedeutend gestiegen.
```

2. It could be **images** (PDF, JPG, etc.):

<img src="grippe1.png" width=700>

(source: Berliner Morgenpost, October 15, 1918)

3. It could be some **structured markup** (XML/HTML):

```
<text>
    <head>
        Die Grippe wütet weiter
    </head>
    <p>
        <s>Zunahme der schweren Fälle in Berlin.</s> 
        <s>Die Zahl der Grippefälle ist in den letzten beiden Tagen auch in Groß-Berlin noch deutlich gestiegen.</s>
        <s>Die Warenhäuser und sonstigen Geschäfte, die Kriegs- und die privaten Betriebe klagen, dass übermäßig viele Angestellte krank melden müssen, und auch bei der Post und bei der Straßenbahn ist die Zahl der Grippekranken bedeutend gestiegen.</s>
    </p>
</text>
```

#### We have to be able to use all these formats and homogenise different sources into a unified corpus. 

In most cases, working with plain text is the simplest option (though sometimes you might want to *keep* the structured XML/HTML markup and rely on that structure in your analysis). So we will convert everything to plain text.

<img src="homogenisationchart.png">
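
As a rough illustration of the first homogenisation step, the sketch below sorts files into the three input types listed above, based on their extension. The suffix table and the function name `detect_input_type` are assumptions made for this example, not part of the tutorial's code.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to the three input types described above.
SUFFIX_TO_TYPE = {
    '.txt': 'plain',
    '.pdf': 'image',
    '.jpg': 'image',
    '.jpeg': 'image',
    '.png': 'image',
    '.xml': 'markup',
    '.html': 'markup',
}


def detect_input_type(path):
    """Return 'plain', 'image' or 'markup' for a file path, or None if unknown."""
    return SUFFIX_TO_TYPE.get(Path(path).suffix.lower())


for name in ['grippe.txt', 'grippe1.png', 'grippe.XML', 'notes.docx']:
    print(name, '->', detect_input_type(name))
```

Files of type `image` would then go through OCR, files of type `markup` through a parser, and `plain` files can be used as they are.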

## 2. From images to digital text: OCR

To process images into digital text we need an **Optical Character Recognition (OCR)** tool. 

![](grippeocr.gif)

### How OCR works

A modern Optical Character Recognition (OCR) algorithm typically involves several stages. Here’s a breakdown of the key stages:

1. **Preprocessing**: This initial step involves preparing the image for analysis and recognition. Common preprocessing tasks include:
   - **Noise Reduction**: Removing noise from the image to enhance the text's clarity. This could involve filtering techniques like Gaussian blur or median filter.
   - **Binarization**: Converting the image from grayscale or color to black-and-white, where text is typically represented as black pixels on a white background. This helps in distinguishing the text from the background.
   - **Normalization**: Standardizing the brightness and contrast of the image to reduce variability between different images.
   - **Dewarping**: Correcting any image distortions that result from curved surfaces or skewed scanning angles.

2. **Segmentation**: This step divides the image into parts that are easier to analyze. Segmentation levels can vary based on the complexity of the layout and the requirements of the application:
   - **Page Segmentation**: Identifying different blocks of text, images, or other elements on a page.
   - **Line Segmentation**: Breaking down text blocks into individual lines.
   - **Word Segmentation**: Further dividing lines into words.
   - **Character Segmentation**: The final step where words are broken down into individual characters.

3. **Feature Extraction**: In this stage, the algorithm extracts features from the segmented characters that are useful for recognition. This might include the basic shape, line endpoints, intersections, and other geometrical and topological characteristics.

4. **Character Recognition**: At this stage, each character image is analyzed and compared against a pre-trained model to identify the most likely corresponding textual character. Techniques used in this stage can vary:
   - **Pattern Recognition**: Using methods such as support vector machines or neural networks to recognize characters based on the features extracted.
   - **Template Matching**: Comparing character images to a set of predefined character templates to find the best match.

5. **Post-processing**: After characters are recognized, the algorithm performs corrections based on context and additional information:
   - **Spell Checking and Correction**: Identifying and correcting misspelled words using dictionaries and context-based algorithms.
   - **Language and Grammar Analysis**: Applying language-specific rules to improve the accuracy of the output text.

6. **Output Formatting**: The final text is formatted according to the desired output specifications, which may include maintaining the layout, fonts, and style of the original text.

Each of these stages can be enhanced by deep learning techniques, especially convolutional neural networks (CNNs) and recurrent neural networks (RNNs), which can learn to handle many of the tasks automatically and often provide better accuracy than traditional methods, especially in complex or noisy environments.
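
To make the binarization step more concrete, here is a minimal sketch using Otsu's method, one common way to pick a global threshold. It is chosen here only for illustration; a given OCR engine may use a different or adaptive method. The image is a toy nested list of grayscale values (0 = black, 255 = white), and the names `otsu_threshold` and `binarize` are made up for this example.

```python
def otsu_threshold(pixels):
    """Return the grayscale level (0-255) that maximises between-class variance."""
    histogram = [0] * 256
    for row in pixels:
        for value in row:
            histogram[value] += 1

    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += histogram[level]      # pixels with value <= level
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg      # pixels with value > level
        if weight_fg == 0:
            break
        sum_bg += level * histogram[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


def binarize(pixels):
    """Map every pixel to 0 (ink) or 255 (paper) using the Otsu threshold."""
    threshold = otsu_threshold(pixels)
    return [[0 if value <= threshold else 255 for value in row] for row in pixels]


# Toy "scan": a dark 2x2 blob of ink on light, slightly uneven paper.
page = [
    [210, 200, 220, 210],
    [200,  40,  30, 220],
    [210,  30,  40, 200],
    [220, 210, 200, 210],
]

for row in binarize(page):
    print(row)
```

The uneven paper values (200 to 220) all become white and the ink values (30 to 40) all become black, which is the separation the later segmentation and recognition stages rely on.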


### What OCR tools are there

The field of making OCR tools is developing rapidly (together with all other fields of text processing), so there are always new tools challenging the old ones. But as of 2024, the well-known products were: 

* FineReader (commercial, has a desktop interface)
* Tesseract (open source, command-line interface)
* OCR4all (open source, has a browser-based interface that can be deployed locally with Docker)
* Kraken & e-Scriptorium (open source; Kraken is a command-line tool, e-Scriptorium provides a web interface for it)
* EasyOCR (open source, Python library)

We'll use Tesseract in this tutorial, since it is open and free. Specifically, we'll use the Python package pytesseract, a wrapper around the Tesseract engine (the engine itself has to be installed separately).

In [1]:
#!pip install pytesseract
#!pip install pillow

In [2]:
import pytesseract

In [3]:
from PIL import Image
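
Since pytesseract only calls the Tesseract program installed on the machine, it can help to check which language models that installation provides before running OCR with `lang='frk'`. The sketch below does this with the standard library only; the helper name `tesseract_languages` is made up for this example, and the parsing assumes the usual `tesseract --list-langs` output (a header line followed by one language code per line).

```python
import shutil
import subprocess


def tesseract_languages():
    """Return the language codes reported by the local Tesseract binary ([] if it is missing)."""
    if shutil.which('tesseract') is None:
        return []
    result = subprocess.run(['tesseract', '--list-langs'],
                            capture_output=True, text=True, check=False)
    lines = (result.stdout + result.stderr).splitlines()
    # Skip the "List of available languages ..." header line and blank lines.
    return [line.strip() for line in lines
            if line.strip() and not line.startswith('List of')]


languages = tesseract_languages()
if 'frk' not in languages:
    print("Fraktur model 'frk' not found; installed models:", languages)
```

If `frk` is missing, the corresponding trained data file has to be added to the Tesseract installation before the OCR calls in this notebook will work.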

### 2.1. Evaluate OCR engine quality

Before processing all files, we evaluate the OCR quality on a small sample cut from one page (source: [Deutsche Zeitung, Ausgaben am Montag, 23.12.1918](https://zefys.staatsbibliothek-berlin.de/kalender/auswahl/date/1918-12-23/30744015/)):
![sample.jpg](sample.jpg)

Let us OCR it:

In [7]:
ocr_output = pytesseract.image_to_string(Image.open('sample.jpg'), lang='frk')  # using German fraktur OCR model

In [8]:
print(ocr_output)

Die Lage anfdemKohlenmarkte gibt zu “en ſhlimm-
ſten Befürc<tungen Anlaß. Für Sachſen fehlten im Nov»mber
30 000 Wagen zu je 10 Tonnen und für Tezembex wird mit no<
größeren Ausfällen gere<net werden. E3 iſt mit einem völligen
Stillſtand der Induſtrie innerhalb vierzehn Tagen zu red<hnen,
wenn nicht eine erhebliche Steigerung der Belenſ<aften der Kot:en-
bergwerke oder ihrer Zah! geiingt. Weiter ſteht eine weſentliche
Erhöhung der Kohlenpreije bevor.



#### 2.1.1 Manually create the 'ground truth' to evaluate against

In [9]:
ground_truth = input('Please insert corrected string: ')

Please insert corrected string: Die Lage an dem Kohlenmarkte gibt zu den ſhlimmſten Befürchtungen Anlaß. Für Sachſen fehlten im November 30 000 Wagen zu je 10 Tonnen und für Dezember wird mit noch größeren Ausfällen gerechnet werden. Eſ iſt mit einem völligen Stillſtand der Induſtrie innerhalb vierzehn Tagen zu rechhnen, wenn nicht eine erhebliche Steigerung der Belenſchaften der Kohlenbergwerke oder ihrer Zahl gelingt. Weiter ſteht eine weſentliche Erhöhung der Kohlenpreiſe bevor.


In [10]:
print(ground_truth)

Die Lage an dem Kohlenmarkte gibt zu den ſhlimmſten Befürchtungen Anlaß. Für Sachſen fehlten im November 30 000 Wagen zu je 10 Tonnen und für Dezember wird mit noch größeren Ausfällen gerechnet werden. Eſ iſt mit einem völligen Stillſtand der Induſtrie innerhalb vierzehn Tagen zu rechhnen, wenn nicht eine erhebliche Steigerung der Belenſchaften der Kohlenbergwerke oder ihrer Zahl gelingt. Weiter ſteht eine weſentliche Erhöhung der Kohlenpreiſe bevor.


#### 2.1.2 Measure OCR precision, recall and F-measure

In the context of Optical Character Recognition (OCR), precision, recall, and F-measure are metrics used to evaluate the accuracy and efficiency of OCR systems in converting images of typed, handwritten, or printed text into machine-encoded text. These metrics help to understand how well an OCR system performs, especially in terms of correctly identifying characters, words, or specific information within documents. Here's how these metrics apply to OCR quality evaluation:

##### Precision in OCR
In OCR, precision measures the accuracy of the recognized text against the actual text in the document images. It calculates the proportion of correctly identified characters or words out of all the characters or words that the OCR system identified. High precision means that most of the text the OCR system identified as present in the document was actually correct, indicating fewer false positives (i.e., incorrectly identified as present).

![](precision.png)

##### Recall in OCR
Recall in the context of OCR measures the OCR system's ability to capture all the relevant characters or words from the document images. It is the ratio of the correctly identified characters or words to all the characters or words that are actually present in the documents. High recall indicates that the OCR system is able to identify most of the actual text present, minimizing false negatives (i.e., failing to recognize text that is there).

![](recall.png)

##### F-measure (F1 Score) in OCR
The F-measure or F1 score in OCR provides a single metric that combines both precision and recall to give a balanced view of the OCR system's overall performance. Since precision and recall have a trade-off (improving one can often lead to a reduction in the other), the F1 score helps to evaluate the OCR system's effectiveness at recognizing text accurately while minimizing both false positives and false negatives.

![](fmeasure.png)

These metrics are critical for assessing OCR systems, particularly in applications where the accuracy of text recognition directly impacts the outcome, such as document automation, data extraction from scanned documents, and automated processing of handwritten forms. A balance between high precision and high recall is often desired to ensure that the OCR system is both accurate and comprehensive in its text recognition capabilities.
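
A small worked example may help to see how the three numbers relate. The sketch below uses `difflib.SequenceMatcher` from the standard library to find the matching characters; this is a stand-in chosen for illustration, not the Levenshtein-based alignment used in the rest of this notebook, and the function name `toy_scores` is made up for this example.

```python
from difflib import SequenceMatcher


def toy_scores(ocr_output, ground_truth):
    """Character-level precision, recall and F1 from difflib's matching blocks."""
    matcher = SequenceMatcher(None, ocr_output, ground_truth, autojunk=False)
    # Characters that appear, in order, in both strings count as true positives.
    true_pos = sum(block.size for block in matcher.get_matching_blocks())
    precision = true_pos / len(ocr_output)    # share of OCR characters that are correct
    recall = true_pos / len(ground_truth)     # share of true characters that were found
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


precision, recall, f_score = toy_scores('no< größeren', 'noch größeren')
print(round(precision, 4), round(recall, 4), round(f_score, 4))
```

Here 11 characters match (`no` and ` größeren`). The OCR string has 12 characters and the true string has 13, so precision is 11/12, recall is 11/13, and the F1 score is 0.88.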

In [11]:
import Levenshtein as lev

In [12]:
def measure_quality(ocr_output, ground_truth):
    """
    Calculates precision, recall, and F1-score
    using the Levenshtein distance to align text from OCR with the ground truth data.

    :param ocr_output: A string containing the raw OCR results.
    :param ground_truth: A string containing the verified ground truth text.
    :return: A tuple (precision, recall, f_score).
    """

    # Blocks of characters that are identical in both strings after alignment
    matching_parts = lev.matching_blocks(lev.editops(ocr_output, ground_truth), ocr_output, ground_truth)
    # True positives: total number of characters inside the matching blocks
    true_pos = sum(block[2] for block in matching_parts)

    # Precision: share of the OCR output that is correct
    precision = true_pos / len(ocr_output)
    # Recall: share of the ground truth that was recognised
    recall = true_pos / len(ground_truth)
    f_score = 2 * ((precision * recall) / (precision + recall))

    return precision, recall, f_score

In [13]:
precision, recall, f_score = measure_quality(ocr_output, ground_truth)

In [14]:
print(f'Precision: {round(precision, 4)}\nRecall: {round(recall, 4)}\nF1-score: {round(f_score, 4)}')

Precision: 0.9407
Recall: 0.9427
F1-score: 0.9417


### 2.2 Process the whole corpus of PDFs with the same OCR engine

In [15]:
import os
from tqdm import tqdm
from pdf2image import convert_from_path

In [16]:
pathpdf = '../data/pdf'

In [None]:
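for filename in tqdm(os.listdir(pathpdf)):
    # Only process PDF files
    if filename.lower().endswith('.pdf'):
        thispath = os.path.join(pathpdf, filename)
        # Render every PDF page as a PIL image
        converted_pdf = convert_from_path(thispath, use_cropbox=True)
        # Write the recognised text next to the PDF, with a .txt extension
        txt_path = os.path.splitext(thispath)[0] + '.txt'
        with open(txt_path, 'w', encoding='utf-8') as output_txt:
            for image in converted_pdf:
                recognized = pytesseract.image_to_string(image, lang='frk')
                output_txt.write(recognized)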

#### After running this we have all our PDFs in plain-text form

### 2.3. OCR postprocessing

As we mentioned before, the last stage of the OCR process is post-processing the result. Some of it is done internally by the OCR engine. Other improvements can be applied separately afterwards.

#### 2.3.1. Rule-based OCR postprocessing

After doing OCR, one can often notice regular errors, e.g. the letter `c` turning into `<` or the letter `l` becoming `!`. In many cases we can fix them with search-and-replace patterns (e.g. take each `<` that has a letter or digit on both sides and convert it into `c`).

The standard way to express & implement such patterns on a computer would be regular expressions. For example, here is a regular expression that does the aforementioned context-aware transformation of `<` to `c`.  

In [21]:
import re

In [22]:
ocr_output

'Die Lage anfdemKohlenmarkte gibt zu “en ſhlimm-\nſten Befürc<tungen Anlaß. Für Sachſen fehlten im Nov»mber\n30 000 Wagen zu je 10 Tonnen und für Tezembex wird mit no<\ngrößeren Ausfällen gere<net werden. E3 iſt mit einem völligen\nStillſtand der Induſtrie innerhalb vierzehn Tagen zu red<hnen,\nwenn nicht eine erhebliche Steigerung der Belenſ<aften der Kot:en-\nbergwerke oder ihrer Zah! geiingt. Weiter ſteht eine weſentliche\nErhöhung der Kohlenpreije bevor.\n'

In [23]:
ocr_output_corr = re.sub(r'(\w)<(\w)', r'\1c\2', ocr_output)  # raw strings keep the backslashes intact

Let us see how the whole thing changed: 

In [24]:
ocr_output_corr

'Die Lage anfdemKohlenmarkte gibt zu “en ſhlimm-\nſten Befürcctungen Anlaß. Für Sachſen fehlten im Nov»mber\n30 000 Wagen zu je 10 Tonnen und für Tezembex wird mit no<\ngrößeren Ausfällen gerecnet werden. E3 iſt mit einem völligen\nStillſtand der Induſtrie innerhalb vierzehn Tagen zu redchnen,\nwenn nicht eine erhebliche Steigerung der Belenſcaften der Kot:en-\nbergwerke oder ihrer Zah! geiingt. Weiter ſteht eine weſentliche\nErhöhung der Kohlenpreije bevor.\n'

So the `<` is now gone in most cases, but not in all: the pattern only matches a `<` with a word character on both sides, so the `no<` at the end of a line is left untouched. You can learn more about regular expressions [here](https://www.w3schools.com/python/python_regex.asp).
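
Several such rules can be collected in one table and applied in sequence. The sketch below is only an illustration: the first rule is the `<` to `c` replacement from above, written with lookarounds instead of capture groups so that the neighbouring characters are not consumed, and the second rule (for `!` read instead of `l`) is a hypothetical heuristic that would need checking against real data. The names `RULES` and `apply_rules` are made up for this example.

```python
import re

# Illustrative rules: (pattern, replacement). The second rule is a hypothetical heuristic.
RULES = [
    # '<' between two word characters is read as 'c' (the rule used above).
    (re.compile(r'(?<=\w)<(?=\w)'), 'c'),
    # '!' directly after a letter and followed by a lowercase word is read as 'l'.
    (re.compile(r'(?<=\w)!(?=\s+[a-zäöüßſ])'), 'l'),
]


def apply_rules(text, rules=RULES):
    """Apply every (pattern, replacement) pair to the text, in order."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


sample = 'gere<net werden, ihrer Zah! geiingt. Halt! Wer da?'
print(apply_rules(sample))
```

The genuine exclamation mark in `Halt! Wer da?` is kept, because the next word starts with a capital letter. Each new rule should be checked against the ground truth, since an overly broad pattern can introduce more errors than it fixes.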

Let us see how that affected the OCR quality:

In [25]:
precision, recall, f_score = measure_quality(ocr_output_corr, ground_truth)

In [26]:
print(f'Precision: {round(precision, 4)}\nRecall: {round(recall, 4)}\nF1-score: {round(f_score, 4)}')

Precision: 0.9473
Recall: 0.9493
F1-score: 0.9483


So, our F-measure increased a bit, good!

#### 2.3.2. OCR postprocessing with large language models (LLMs)

* Here we intend to use Llama3, which works quite well for OCR post-correction of German text.
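
As a rough sketch of what this step could look like, the code below only builds a correction prompt and passes it to a model through a user-supplied function. Everything here is an assumption made for illustration: the prompt wording, the function names, and the `generate` callable, which stands for whatever interface is eventually used to run the model. No real model is called.

```python
def build_correction_prompt(ocr_text):
    """Build an instruction prompt asking a language model to fix OCR errors only."""
    return (
        'The following German newspaper text from 1918 was produced by OCR '
        'and contains recognition errors. Correct the OCR errors only; '
        'do not modernise spelling, translate, or add content. '
        'Return only the corrected text.\n\n'
        f'OCR text:\n{ocr_text}\n\nCorrected text:'
    )


def correct_with_llm(ocr_text, generate):
    """Send the prompt to `generate`, a user-supplied callable (prompt -> str)."""
    return generate(build_correction_prompt(ocr_text)).strip()


# Stand-in for a real model call, used only to show the data flow.
def fake_generate(prompt):
    return ' Die Lage an dem Kohlenmarkte '


print(correct_with_llm('Die Lage anfdemKohlenmarkte', fake_generate))
```

The corrected output can be scored with `measure_quality` in the same way as the rule-based corrections.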

## 3. Getting digital text from the structured markup (XML)

Unlike text in an image, XML/HTML is already machine-readable, so it is lower-hanging fruit. Still, we need a parser for such markup to get rid of the XML/HTML tags and some of the metadata.

In [17]:
from bs4 import BeautifulSoup

In [18]:
pathtoxmlfiles = '../data/xml'

In [None]:
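for filename in os.listdir(pathtoxmlfiles):
    if filename.endswith('.xml'):
        path2file = os.path.join(pathtoxmlfiles, filename)
        # Open in binary mode so the parser can detect the encoding itself
        with open(path2file, 'rb') as openxml:
            # Name the parser explicitly; 'xml' requires the lxml package
            soup = BeautifulSoup(openxml, 'xml')
        # Keep only the text inside <text>, dropping tags and metadata
        print(soup.find('text').text.strip())
        # Uncomment to write the extracted text to a .txt file next to the XML
        #with open(path2file.replace('.xml', '.txt'), 'w', encoding='utf-8') as output_txt:
        #    output_txt.write(soup.find('text').text.strip())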
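
For comparison, the same kind of extraction can be sketched with the standard library's `xml.etree.ElementTree`, here applied to a shortened version of the markup example from section 1. This is an alternative shown for illustration, not the approach used in this notebook. It also strips the indentation around each text fragment, which `.text.strip()` on the whole element would leave in place.

```python
import xml.etree.ElementTree as ET

sample_xml = """<text>
    <head>
        Die Grippe wütet weiter
    </head>
    <p>
        <s>Zunahme der schweren Fälle in Berlin.</s>
        <s>Die Zahl der Grippefälle ist in den letzten beiden Tagen auch in Groß-Berlin noch deutlich gestiegen.</s>
    </p>
</text>"""

root = ET.fromstring(sample_xml)
# itertext() yields every text fragment inside <text>, in document order.
fragments = (fragment.strip() for fragment in root.itertext())
# Drop the whitespace-only fragments that come from the indentation.
plain_text = '\n'.join(fragment for fragment in fragments if fragment)
print(plain_text)
```

The result has one line for the heading and one line per sentence, with all tags removed.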
            

#### After running this (with the write step enabled) we have all our XML files in plain-text form

## Now let's use all the data for processing and analysis (next notebook)