# University of Waterloo History Workshop

Notebook v.0.3 (2025-11)

Jupyter notebook by: R. Antonio Muñoz Gómez. (Cataloguing and Metadata Librarian. University of Waterloo)

Workshop delivered by: Mike Chee (Research and Learning Librarian. University of Waterloo) and R. Antonio Muñoz Gómez. (Cataloguing and Metadata Librarian. University of Waterloo)

## Introduction

This notebook was created for a workshop aimed at PhD students in History.
It is meant to assist students in extracting the text layer from .pdf files, and then translating the extracted text.

There are multiple ways to perform these tasks.
The notebook and code below are meant to illustrate a computational approach to doing these tasks.

Students are encouraged to build on this knowledge, and make a copy of the notebook that they can modify as they test new code and methods that better suit their individual needs.
For example, the University of Waterloo's Centre for Education in Mathematics and Computing has an excellent [Python from scratch](https://open.cs.uwaterloo.ca/python-from-scratch/) course.

## Notebook structure

The notebook has three main sections:

- Part 1. OCR text extraction : This code opens a .pdf file and extracts the text layer (OCR or born-digital) onto a .txt file to be translated.
- Part 2. Translation : This code takes the .txt file from Part 1 as input, and translates it from French into English.
- Part 3. Full code : This section contains all the Python code as a single block.
Please note the various places where you changed variable data in the Jupyter notebook, as you'd need to make those changes in the Python code as well.
Also note that the pip install command needs to be done directly in your computer's terminal, and not as part of the Python code itself.

## Before we get started

You will need to add the test file (Louis-Riel-martyr.pdf) to the notebook.

### If you are working in Google Colab:

1. From the left-hand menu options, choose Files (folder icon)
2. Right click on the panel and choose the option to Upload
3. Select the .pdf file that we will work with
4. You may also upload an additional .pdf file if you want to do more testing

## Part 1: Text extraction.

### Step 1: Installing and importing libraries

In programming, 'libraries' refer to code that someone else has written and made available for re-use. Rather than having to write all that code from scratch, you can use it by 'importing' the libraries into your own code.

To successfully import a library, first you must make sure that it has been installed.

The code below will try to import the following library (follow the link if you want to learn more about this library and what it is used for):

* [PyPDF2](https://pypdf2.readthedocs.io/en/3.0.0/#)

If the library cannot be found in the system, the code will first install and then import it.

In [None]:
try:
  from PyPDF2 import PdfReader
except ImportError:
  !pip install PyPDF2
  from PyPDF2 import PdfReader
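
If you would like to check whether a library is already available before running the cell above, Python's standard library can look it up without importing it. The following is a minimal sketch; the helper name `is_installed` is illustrative and is not part of the workshop code. It checks PyPDF2 as well as deep_translator, which is used in Part 2.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can locate the named module, without importing it."""
    return importlib.util.find_spec(module_name) is not None

# Report on the two libraries used in this notebook.
for name in ["PyPDF2", "deep_translator"]:
    status = "installed" if is_installed(name) else "not installed"
    print(name, "is", status)
```

The printed result depends on what is installed in your own environment.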

### Step 2: Indicate the name of the file that you want to extract text from.

In [None]:
sourceFile = open('Louis-Riel-martyr.pdf','rb')

---

**NOTE:** You can choose a different .pdf file for testing. All you have to do is change the file name in the code above.

In order to change the file name, look for this code:

`sourceFile = open('Louis-Riel-martyr.pdf', 'rb')`

Change the text inside the first set of quotation marks to use your own .pdf file instead.

---
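
When you switch to your own file, a mistyped name is a common source of errors. The sketch below shows one way to check the name before opening the file. The helper name `find_pdf` is hypothetical (it is not part of the workshop code), and the sketch assumes the .pdf file sits in the notebook's working folder.

```python
from pathlib import Path

def find_pdf(file_name):
    """Return the path if it points to an existing .pdf file, otherwise raise an error."""
    path = Path(file_name)
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"{file_name} does not end in .pdf")
    if not path.is_file():
        raise FileNotFoundError(f"{file_name} was not found in {Path.cwd()}")
    return path

# Try the workshop file; the message depends on whether it has been uploaded.
try:
    pdf_path = find_pdf("Louis-Riel-martyr.pdf")
    print("Ready to open", pdf_path)
except (ValueError, FileNotFoundError) as problem:
    print("Check the file name:", problem)
```

If the check passes, the returned path can be used in place of the file name, for example `open(find_pdf('Louis-Riel-martyr.pdf'), 'rb')`.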

#### 2.1 Testing the file

The following code will count the number of pages in our document. For the document we are using in this workshop, the result should be 87 pages.

In [None]:
sourceFileReader = PdfReader(sourceFile)
x = len(sourceFileReader.pages)
print('The document contains', x, 'pages.')

The following code will iterate through each page in the document and do the following:
- For each page, it will extract the text and add it to a variable called 'fullText'.
- Once the text from every page has been added to the 'fullText' variable, it will print out all the text on the screen.

In [None]:
fullText= ""
for i in range(x):
    page=sourceFileReader.pages[i]
    fullText+= page.extract_text()
print(fullText)
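
The loop above glues each page directly onto the previous one, so the last word of one page can run into the first word of the next. One possible variation collects the text of each page in a list and joins the pages with a line break. The sketch below uses a hypothetical `StandInPage` class so that it runs without a .pdf file; with a real document you would pass `sourceFileReader.pages` to the function instead. The function name `join_pages` is also illustrative.

```python
class StandInPage:
    """Hypothetical stand-in for a PDF page, so the sketch runs without a .pdf file."""
    def __init__(self, text):
        self.text = text

    def extract_text(self):
        return self.text

def join_pages(pages):
    """Extract the text of every page and join the pages with a line break."""
    page_texts = []
    for page in pages:
        page_texts.append(page.extract_text())
    return "\n".join(page_texts)

samplePages = [StandInPage("Première page"), StandInPage("Deuxième page")]
print(join_pages(samplePages))
```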

### Step 3: Reading a single page

#### 3.1 Anticipating code output

The following code works in a similar way to the previous one. However, we will extract a single page, rather than the entire document.

**Before running the code:**

- Read it and try to guess which page we will extract from the document.
- How can you tell?

Now, run the code.

In [None]:
singlePage=PdfReader(sourceFile)
pageObj=singlePage.pages[5]
pageText=pageObj.extract_text()
print(pageText)

Look at the output and try to answer the following:

- Is the page extracted the same as the one you anticipated?
  - HINT: Look at the extracted page number at the top of the output
- Why do you think this happened?
- How would you change the code to extract the page that is numbered '8' in the article?

### Step 4: Saving the text.

In order to translate the text in part 2, we should save our output as a .txt file that we can use as the starting point.

Also, saving the file as .txt would allow you to do any additional cleanup tasks to enhance the quality of the translated output.

Text cleanup is beyond the scope of today's workshop, but you are encouraged to make any minor changes to the text file for testing purposes.

#### 4.1 Saving the full text obtained in step 2.1

In [None]:
f=open('Louis-Riel-full.txt', "a", encoding='utf-8')
f.writelines(fullText)
f.close()
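
The cell above opens the file in append mode ("a"), so running it more than once adds another copy of the text to the same file. The sketch below compares append mode with write mode ("w") using a temporary file and a short sample string; both are illustrative and are not part of the workshop files.

```python
import os
import tempfile

demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Running the save step twice in append mode ("a") stores the text twice.
for run in range(2):
    with open(demo_path, "a", encoding="utf-8") as f:
        f.write("Bonjour\n")
with open(demo_path, "r", encoding="utf-8") as f:
    appended = f.read()

# Running it twice in write mode ("w") keeps only the latest copy.
for run in range(2):
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("Bonjour\n")
with open(demo_path, "r", encoding="utf-8") as f:
    overwritten = f.read()

print(repr(appended))
print(repr(overwritten))
```

If you re-run the notebook several times while testing, you may prefer to delete the old .txt file first or to switch to write mode.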

#### 4.2 Saving the single-page obtained in step 3.1

The following code follows the same pattern as the code in step 4.1, with two changes:

- The name of the .txt file was changed to indicate that it contains a single page, instead of the full text.
- The second line of the code was changed as well. This time, it uses the variable 'pageText' that we defined above, instead of the 'fullText' variable.

In [None]:
f=open('Louis-Riel-page.txt', "a", encoding='utf-8')
f.writelines(pageText)
f.close()

## Part 2: Translation

### Step 1: Installing and importing libraries

The code below will try to import the following library (follow the links if you want to learn more about this library and what it is used for):

* [deep_translator](https://deep-translator.readthedocs.io/en/latest/README.html)

If the library cannot be found in the system, the code will first install and then import it.

In [None]:
try:
  from deep_translator import GoogleTranslator
except ImportError:
  !pip install deep_translator
  from deep_translator import GoogleTranslator

### Step 2: Setting up source and target languages

From within the deep_translator library, we will be using the GoogleTranslator.

The following code will show you which languages are available for translating to/from.

**Running this code is optional.**

In [None]:
langs_dict = GoogleTranslator().get_supported_languages(as_dict=True)
print(langs_dict)

The following code sets the source and target languages, using the language codes as shown above.

In [None]:
translator=GoogleTranslator(source='fr', target='en')

---

**Question:** How would you change the above code if you had a text in Italian to be translated into German?

---

### Step 3: Opening the files

In the following code, we indicate the files in source language that we wish to translate.

You can change the file names to match your own file(s).

In [None]:
sourceFull = 'Louis-Riel-full.txt'
sourcePage = 'Louis-Riel-page.txt'

#### Step 3.1 Opening the full-text file

In [None]:
with open(sourceFull, 'r', encoding='utf-8') as full:
    fullText=full.read()


#### Step 3.2 Opening the single-page file

In [None]:
with open(sourcePage, 'r', encoding='utf-8') as page:
    pageText=page.read()

### Step 4: Splitting up the text

The libraries that we are using for this exercise can only handle translations of up to 4500 characters.

For this reason, we break up the text into chunks of that size, and then translate each chunk.

Finally, we combine all the translated chunks together to create our final translated file.

#### Step 4.1: Splitting up the full text file

In [None]:
chunksF=[]
while len(fullText)>0:
    if len(fullText) <= 4500:
        chunksF.append(fullText)
        fullText= ''
    else:
        last_newline_index=fullText.rfind('\n', 0, 4500)
        if last_newline_index != -1:
            chunksF.append(fullText[:last_newline_index])
            fullText=fullText[last_newline_index+1:]
        else:
            chunksF.append(fullText[:4500])
            fullText=fullText[4500:]

#### Step 4.2: Splitting up the single-page file

In [None]:
chunksP=[]
while len(pageText)>0:
    if len(pageText) <= 4500:
        chunksP.append(pageText)
        pageText= ''
    else:
        last_newline_index=pageText.rfind('\n', 0, 4500)
        if last_newline_index != -1:
            chunksP.append(pageText[:last_newline_index])
            pageText=pageText[last_newline_index+1:]
        else:
            chunksP.append(pageText[:4500])
            pageText=pageText[4500:]
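
The two cells above repeat the same splitting logic for two different variables. One way to avoid the repetition is to wrap the logic in a function that can be reused for any text. This is a sketch rather than the workshop's own code: the function name `split_into_chunks` and its `limit` parameter are illustrative, and the sketch only splits at a line break found after the first character, so that it never produces an empty chunk.

```python
def split_into_chunks(text, limit=4500):
    """Split text into pieces of at most `limit` characters, preferring line breaks."""
    chunks = []
    while len(text) > 0:
        if len(text) <= limit:
            # What is left fits in one chunk.
            chunks.append(text)
            text = ''
        else:
            # Look for the last line break within the first `limit` characters.
            last_newline_index = text.rfind('\n', 0, limit)
            if last_newline_index > 0:
                chunks.append(text[:last_newline_index])
                text = text[last_newline_index + 1:]
            else:
                # No usable line break: cut at the limit instead.
                chunks.append(text[:limit])
                text = text[limit:]
    return chunks

# A small limit makes the behaviour easy to see on a short sample.
sample = "ligne un\nligne deux\nligne trois"
print(split_into_chunks(sample, limit=12))
```

With the notebook's variables, the same function could be called as `chunksF = split_into_chunks(fullText)` and `chunksP = split_into_chunks(pageText)`.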

### Step 5: Saving the file

#### Step 5.1: Saving the full-text translation file

In [None]:
with open('louis-riel-translation-full.txt', "a", encoding="utf-8") as full2:
    for chunkF in chunksF:
        result=translator.translate(chunkF)
        # print(result)
        full2.write(result + '\n')

print('The full-text translation file has been compiled and saved.')

#### Step 5.2: Saving the single-page translation file

In [None]:
with open('louis-riel-translation-page.txt', "a", encoding="utf-8") as page2:
    for chunkP in chunksP:
        result=translator.translate(chunkP)
        print(result)
        page2.write(result + '\n')

print('The single-page translation file has been compiled and saved.')
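
The translation loop can also be written as a function that accepts any translating function, which makes it possible to test the logic without an internet connection. In the sketch below, the name `translate_chunks` is illustrative, and `str.upper` is only a stand-in for a real translator: it changes the text to upper case and translates nothing. The sketch also skips chunks that contain only white space, on the assumption that there is nothing in them to translate.

```python
def translate_chunks(chunks, translate):
    """Apply the `translate` function to every non-empty chunk and join the results."""
    translated = []
    for chunk in chunks:
        if chunk.strip() == '':
            continue  # skip chunks that contain only white space
        translated.append(translate(chunk))
    return '\n'.join(translated)

# Stand-in "translation" so the sketch runs offline: it only upper-cases the text.
print(translate_chunks(['bonjour', '   ', 'au revoir'], str.upper))
```

With the notebook's objects, the call would look like `translate_chunks(chunksP, translator.translate)`, and the returned text could then be written to a file.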

## Part 3 : All the code in one block

**NOTE:** If you are running this code on your own computer, for example using [Python's IDLE](https://en.wikipedia.org/wiki/IDLE), you will need to install the libraries using the command line (for example, Command Prompt in Windows, Terminal in Mac, Command Line in Linux) before you can run the code.

In [None]:
# Part 1 : OCR layer extraction

# Install the library first from your computer's terminal: pip install PyPDF2
from PyPDF2 import PdfReader

sourceFile = open('Louis-Riel-martyr.pdf','rb') # You can change the .pdf file name.

sourceFileReader = PdfReader(sourceFile)
x = len(sourceFileReader.pages)

print('The document contains', x, 'pages.')

## Obtaining full text

fullText= ""

for i in range(x):
    page=sourceFileReader.pages[i]
    fullText+= page.extract_text()
print(fullText)

## Obtaining single page

singlePage=PdfReader(sourceFile)
pageObj=singlePage.pages[5] # You can change which page you want to extract.
pageText=pageObj.extract_text()
print(pageText)

## Saving extracted text

f=open('louis-riel-martyr-full.txt', "a", encoding='utf-8') # You can change the name of the output file.
f.writelines(fullText)
f.close()

f=open('louis-riel-martyr-page.txt', "a", encoding='utf-8') # You can change the name of the output file.
f.writelines(pageText)
f.close()

# Part 2: Translation

# Install the library first from your computer's terminal: pip install deep_translator
from deep_translator import GoogleTranslator

langs_dict = GoogleTranslator().get_supported_languages(as_dict=True)
print(langs_dict)

translator=GoogleTranslator(source='fr', target='en') # You can change the source and target languages.

# Opening files

sourceFull = 'louis-riel-martyr-full.txt' # You can change the name of this file.
sourcePage = 'louis-riel-martyr-page.txt' # You can change the name of this file.

# Translating the full text

with open(sourceFull, 'r', encoding='utf-8') as full:
    fullText=full.read()

chunksF=[]
while len(fullText)>0:
    if len(fullText) <= 4500:
        chunksF.append(fullText)
        fullText= ''
    else:
        last_newline_index=fullText.rfind('\n', 0, 4500)
        if last_newline_index != -1:
            chunksF.append(fullText[:last_newline_index])
            fullText=fullText[last_newline_index+1:]
        else:
            chunksF.append(fullText[:4500])
            fullText=fullText[4500:]

with open('louis-riel-translation-full.txt', "a", encoding="utf-8") as full2:
    for chunkF in chunksF:
        result=translator.translate(chunkF)
        #print(result)
        full2.write(result + '\n')

print('The full-text translation file has been compiled and saved.')

# Translating the single page text

with open(sourcePage, 'r', encoding='utf-8') as page:
    pageText=page.read()

chunksP=[]
while len(pageText)>0:
    if len(pageText) <= 4500:
        chunksP.append(pageText)
        pageText= ''
    else:
        last_newline_index=pageText.rfind('\n', 0, 4500)
        if last_newline_index != -1:
            chunksP.append(pageText[:last_newline_index])
            pageText=pageText[last_newline_index+1:]
        else:
            chunksP.append(pageText[:4500])
            pageText=pageText[4500:]

with open('louis-riel-translation-page.txt', "a", encoding="utf-8") as page2:
    for chunkP in chunksP:
        result=translator.translate(chunkP)
        print(result)
        page2.write(result + '\n')

print('The single-page translation file has been compiled and saved.')