# Introduction to the USHMM Package

This notebook demonstrates the USHMM package, a suite of tools for working with data from the United States Holocaust Memorial Museum (USHMM). The package provides functions for processing and analyzing testimonies and related documents.

Key features of the USHMM package include:
1. Converting PDF files to images
2. Removing footers from images
3. Performing OCR (Optical Character Recognition) on images
4. Cleaning and processing extracted text
5. Generating structured HTML output from processed texts

The following cells walk through each step of the process, from converting a PDF testimony into images to producing the final, cleaned HTML output.


## Importing USHMM Functions

In this section, we'll import the necessary functions from the USHMM package and briefly explain their purposes:

1. `pdf_to_images`: Converts PDF files to images
2. `remove_footers`: Removes footers from images
3. `images_to_text`: Performs OCR on images to extract text
4. `clean_texts`: Processes and cleans the extracted text
5. `process_testimony_texts`: Generates structured HTML output from processed texts

These functions form the core of our testimony processing pipeline, allowing us to transform PDF testimonies into clean, structured HTML documents.


In [1]:
from ushmm import pdf_to_images, images_to_text, clean_texts, remove_footers, process_testimony_texts

## Workflow

In this section, we'll walk through the step-by-step process of using the USHMM package to convert a PDF testimony into a structured HTML document. Each step corresponds to a function we imported earlier:

1. Convert PDF to Images: We'll use `pdf_to_images` to convert our PDF file into a series of image files.
2. Remove Footers: We'll use `remove_footers` to crop footer information out of our images.
3. Perform OCR: We'll use `images_to_text` to extract text from our cropped images.
4. Clean Extracted Text: The `clean_texts` function will process and clean the extracted text.
5. Generate HTML: Finally, we'll use `process_testimony_texts` to create a structured HTML document from our cleaned text.

Let's go through each of these steps in detail:


### Step 1: Convert PDF to Images

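In this step, we use the `pdf_to_images` function to convert our PDF testimony into a series of image files, one per page. With `save=True`, each page is written to the output directory under a zero-padded, numbered filename (`0001.jpg`, `0002.jpg`, and so on). For this testimony, the log below lists 19 saved pages.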

In [2]:
images = pdf_to_images("data/RG-50.030.0047_trs_en.pdf", "data/RG-50.030.0047_trs_en/images", save=True)

Saved: data/RG-50.030.0047_trs_en/images/0001.jpg
Saved: data/RG-50.030.0047_trs_en/images/0002.jpg
Saved: data/RG-50.030.0047_trs_en/images/0003.jpg
Saved: data/RG-50.030.0047_trs_en/images/0004.jpg
Saved: data/RG-50.030.0047_trs_en/images/0005.jpg
Saved: data/RG-50.030.0047_trs_en/images/0006.jpg
Saved: data/RG-50.030.0047_trs_en/images/0007.jpg
Saved: data/RG-50.030.0047_trs_en/images/0008.jpg
Saved: data/RG-50.030.0047_trs_en/images/0009.jpg
Saved: data/RG-50.030.0047_trs_en/images/0010.jpg
Saved: data/RG-50.030.0047_trs_en/images/0011.jpg
Saved: data/RG-50.030.0047_trs_en/images/0012.jpg
Saved: data/RG-50.030.0047_trs_en/images/0013.jpg
Saved: data/RG-50.030.0047_trs_en/images/0014.jpg
Saved: data/RG-50.030.0047_trs_en/images/0015.jpg
Saved: data/RG-50.030.0047_trs_en/images/0016.jpg
Saved: data/RG-50.030.0047_trs_en/images/0017.jpg
Saved: data/RG-50.030.0047_trs_en/images/0018.jpg
Saved: data/RG-50.030.0047_trs_en/images/0019.jpg
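
Because every later step works page by page, it can be worth confirming that no page image is missing before moving on. The sketch below checks a directory against the zero-padded naming pattern seen in the log above. The helper `missing_pages` is hypothetical and not part of the USHMM package, and the expected page count has to be supplied by the caller.

```python
from pathlib import Path


def missing_pages(image_dir, page_count, ext=".jpg"):
    """Return expected page filenames that are absent from image_dir.

    Assumes the four-digit, one-based naming shown in the log
    (0001.jpg, 0002.jpg, ...). Hypothetical helper, not a USHMM function.
    """
    present = {p.name for p in Path(image_dir).iterdir() if p.is_file()}
    expected = [f"{i:04d}{ext}" for i in range(1, page_count + 1)]
    return [name for name in expected if name not in present]


# Example call for this testimony (19 pages according to the log):
# missing_pages("data/RG-50.030.0047_trs_en/images", page_count=19)
```

An empty list means every expected page image is present.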


### Step 2: Remove Footers

In this step, we use the `remove_footers` function to crop footer information out of our images. Removing irrelevant text from the bottom of each page before OCR keeps it out of the extracted text and helps the OCR step focus on the main content.

The `remove_footers` function takes the directory containing our images as input, processes each image to remove the footer, and saves the resulting images to a new directory. In the log below, a line such as `Footer Found at 1946` reports where a footer was detected (presumably a vertical pixel position); pages without that line are saved with no footer having been detected.

In [3]:
cropped_images = remove_footers('data/RG-50.030.0047_trs_en/images', output_directory='data/RG-50.030.0047_trs_en/images_cropped', save=True)

reading: 0012.jpg
image read
Footer Found at 1946. Cropping Image
Processed image saved at: data/RG-50.030.0047_trs_en/images_cropped/0012.jpg
reading: 0006.jpg
image read
Processed image saved at: data/RG-50.030.0047_trs_en/images_cropped/0006.jpg
reading: 0007.jpg
image read
Footer Found at 1946. Cropping Image
Processed image saved at: data/RG-50.030.0047_trs_en/images_cropped/0007.jpg
reading: 0013.jpg
image read
Processed image saved at: data/RG-50.030.0047_trs_en/images_cropped/0013.jpg
reading: 0005.jpg
image read
Footer Found at 1907. Cropping Image
Processed image saved at: data/RG-50.030.0047_trs_en/images_cropped/0005.jpg
reading: 0011.jpg
image read
Footer Found at 1907. Cropping Image
Processed image saved at: data/RG-50.030.0047_trs_en/images_cropped/0011.jpg
reading: 0010.jpg
image read
Processed image saved at: data/RG-50.030.0047_trs_en/images_cropped/0010.jpg
reading: 0004.jpg
image read
Footer Found at 1907. Cropping Image
Processed image saved at: data/RG-50.030.004
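
The notebook does not show how `remove_footers` locates a footer, so the following is only an illustrative sketch of one possible heuristic, not the package's method. It treats a page as a list of pixel rows (each a list of `0` for white and `1` for ink, to stay dependency-free) and looks in the bottom quarter of the page for a short block of ink separated from the body by a band of blank rows. The names `find_footer_start` and `crop_footer` and the parameters `search_fraction` and `min_gap` are all assumptions.

```python
def find_footer_start(rows, search_fraction=0.25, min_gap=3):
    """Return the row index where a footer seems to start, or None.

    Illustrative heuristic only (not the USHMM implementation): a footer
    is an ink block near the bottom with at least `min_gap` blank rows
    directly above it.
    """
    height = len(rows)
    limit = height - int(height * search_fraction)  # top of the search band
    has_ink = [any(row) for row in rows]

    y = height - 1
    # Skip the blank margin at the very bottom of the page.
    while y >= limit and not has_ink[y]:
        y -= 1
    if y < limit:
        return None  # the search band is entirely blank
    # Walk up through the candidate footer's ink rows.
    while y >= limit and has_ink[y]:
        y -= 1
    if y < limit:
        return None  # ink runs past the band, so treat it as body text
    footer_start = y + 1

    # Require a blank gap between the footer and the body text.
    gap = 0
    while y >= 0 and not has_ink[y] and gap < min_gap:
        gap += 1
        y -= 1
    return footer_start if gap >= min_gap else None


def crop_footer(rows):
    """Drop the footer rows if one is detected; otherwise return rows unchanged."""
    footer_start = find_footer_start(rows)
    return rows if footer_start is None else rows[:footer_start]
```

With a real scan, the rows would first have to be obtained by loading and binarizing the image; that step is omitted here.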

### Step 3: Perform OCR

In this step, we use the `images_to_text` function to extract text from our cropped images. This process, known as Optical Character Recognition (OCR), converts image data into machine-readable text. The function processes each image in the specified directory and, with `save=True`, saves the extracted text as one `.txt` file per page in `output_folder`. Note that the log below does not list the pages in page order.


In [4]:
# Run OCR on the cropped images and save one .txt file per page
texts = images_to_text("data/RG-50.030.0047_trs_en/images_cropped", save=True, output_folder="data/RG-50.030.0047_trs_en/text")

0012.jpg
Saved: data/RG-50.030.0047_trs_en/text/0012.txt
0006.jpg
Saved: data/RG-50.030.0047_trs_en/text/0006.txt
0007.jpg
Saved: data/RG-50.030.0047_trs_en/text/0007.txt
0013.jpg
Saved: data/RG-50.030.0047_trs_en/text/0013.txt
0005.jpg
Saved: data/RG-50.030.0047_trs_en/text/0005.txt
0011.jpg
Saved: data/RG-50.030.0047_trs_en/text/0011.txt
0010.jpg
Saved: data/RG-50.030.0047_trs_en/text/0010.txt
0004.jpg
Saved: data/RG-50.030.0047_trs_en/text/0004.txt
0014.jpg
Saved: data/RG-50.030.0047_trs_en/text/0014.txt
0015.jpg
Saved: data/RG-50.030.0047_trs_en/text/0015.txt
0001.jpg
Saved: data/RG-50.030.0047_trs_en/text/0001.txt
0017.jpg
Saved: data/RG-50.030.0047_trs_en/text/0017.txt
0003.jpg
Saved: data/RG-50.030.0047_trs_en/text/0003.txt
0002.jpg
Saved: data/RG-50.030.0047_trs_en/text/0002.txt
0016.jpg
Saved: data/RG-50.030.0047_trs_en/text/0016.txt
0018.jpg
Saved: data/RG-50.030.0047_trs_en/text/0018.txt
0019.jpg
Saved: data/RG-50.030.0047_trs_en/text/0019.txt
0009.jpg
Saved: data/RG-50.030.
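
The log above lists pages out of order (`0012`, `0006`, `0007`, and so on), and the notebook does not show whether the returned `texts` are sorted. If the per-page files are ever combined by hand, sorting by page number first avoids scrambled testimony. The sketch below is a stdlib-only illustration; `page_number` and `join_pages_in_order` are hypothetical helpers, not USHMM functions.

```python
import re
from pathlib import Path


def page_number(filename):
    """Extract the numeric page index from a name such as '0012.txt'."""
    match = re.fullmatch(r"(\d+)\.\w+", Path(filename).name)
    if match is None:
        raise ValueError(f"unexpected page filename: {filename!r}")
    return int(match.group(1))


def join_pages_in_order(text_dir):
    """Concatenate the per-page .txt files in ascending page order."""
    files = sorted(Path(text_dir).glob("*.txt"), key=lambda p: page_number(p.name))
    return "\n".join(p.read_text(encoding="utf-8") for p in files)
```

Zero-padded names also sort correctly as plain strings; the numeric key simply keeps the ordering right if the padding ever differs.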

### Step 4: Clean Extracted Text

In this step, we use the `clean_texts` function to process and clean the extracted text. This function takes the raw OCR output and applies several cleaning operations to improve the quality and consistency of the text. As the log below shows, these include detecting page headers and timestamps, normalizing characters (for example, curly quotes to straight quotes and `é` to `e`), normalizing quotes, and collapsing excessive line breaks. The cleaned text is then saved to a new directory, ready for the final step of our process.


In [5]:
texts = clean_texts("data/RG-50.030.0047_trs_en/text", save=True, output_directory="data/RG-50.030.0047_trs_en/clean_text")

0007.txt
Header Found: USHMM Archives RG-50.030*0047 5
Timestamp Found: 01:16:04
Normalizing characters...
Changed: ’ -> '
Normalizing quotes...
Changed: ' -> "
Normalized 1 instances of excessive line breaks.
0013.txt
Header Found: USHMM Archives RG-50.030*0047 1
Timestamp Found: 01:41:26
Timestamp Found: 01:39:00
Normalizing characters...
Changed: é -> e
Changed: ” -> "
Normalizing quotes...
Normalized 2 instances of excessive line breaks.
0012.txt
Header Found: USHMM Archives RG-50.030*0047 1
Timestamp Found: 01:33:29
Timestamp Found: 01:36:44
Normalizing characters...
Changed: ’ -> '
Changed: “ -> "
Changed: ” -> "
Normalizing quotes...
Changed: ' -> "
Normalized 2 instances of excessive line breaks.
0006.txt
Header Found: USHMM Archives RG-50.030*0047 4
Timestamp Found: 01:11:52
Timestamp Found: 01:13:36
Normalizing characters...
Normalizing quotes...
Normalized 2 instances of excessive line breaks.
0010.txt
Header Found: USHMM Archives RG-50.030*0047 8
Timestamp Found: 01:28:27
N
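
To make the operations named in the log more concrete, here is a minimal stdlib sketch of comparable cleaning rules. It is not the implementation of `clean_texts`: the regular expressions are guesses based on the log lines above, the function and constant names are hypothetical, and the log does not show whether the package removes headers and timestamps or only records them. This sketch removes the header line, keeps timestamps in the text, and does not reproduce the log's separate quote-normalization step.

```python
import re
import unicodedata

# Pattern guessed from log lines such as "USHMM Archives RG-50.030*0047 5".
HEADER_RE = re.compile(
    r"^USHMM Archives RG-\d+\.\d+\*\d+[ \t]+\d+[ \t]*$", re.MULTILINE
)
# Pattern guessed from log lines such as "Timestamp Found: 01:16:04".
TIMESTAMP_RE = re.compile(r"\b\d{2}:\d{2}:\d{2}\b")
# Curly quotes mapped to their straight ASCII equivalents.
PUNCT_MAP = str.maketrans(
    {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}
)


def strip_accents(text):
    """Replace accented letters with their base letters (e.g. é -> e)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_page(text):
    """Return (cleaned_text, report) for one page of raw OCR output."""
    report = {
        "headers": HEADER_RE.findall(text),
        "timestamps": TIMESTAMP_RE.findall(text),
    }
    text = HEADER_RE.sub("", text)       # drop the page header line
    text = text.translate(PUNCT_MAP)     # straighten curly quotes
    text = strip_accents(text)
    # Collapse runs of three or more newlines into a single blank line.
    text, report["line_breaks_normalized"] = re.subn(r"\n{3,}", "\n\n", text)
    return text.strip(), report
```

Returning a report alongside the text mirrors the kind of information the log prints, which makes it easier to spot pages where nothing was detected.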

### Step 5: Generate HTML

Finally, we use the `process_testimony_texts` function to create a structured HTML document from our cleaned text. This function takes the cleaned text files, processes them to identify different elements of the testimony (such as speakers, questions, and responses), and formats them into an HTML structure. This step turns the cleaned text into a more readable and navigable format, making the testimony easier to analyze and present.

In [6]:
html_result = process_testimony_texts('data/RG-50.030.0047_trs_en/clean_text', output_file="data/RG-50.030.0047_trs_en/RG-50.030.0047_trs_en.html", save=True)