# University of Waterloo History Workshop

Notebook prepared by:
- Mike Chee. Librarian, Information Services and Resources (Porter). University of Waterloo.
- R. Antonio Muñoz Gómez. Digital Scholarship Librarian. University of Waterloo.

## Introduction

This notebook was created for a workshop aimed at PhD students in History.
It is meant to help students perform basic Optical Character Recognition (OCR) text extraction on .pdf files, and then translate the extracted text.

There are multiple ways to perform these tasks.
The notebook and code below are meant to illustrate a computational approach to doing these tasks.

Students are encouraged to build on this knowledge, and make a copy of the notebook that they can modify as they test new code and methods that better suit their individual needs.
For those who want to learn more Python, the University of Waterloo's Centre for Education in Mathematics and Computing has an excellent [Python from scratch](https://open.cs.uwaterloo.ca/python-from-scratch/) course.

## Notebook structure

The notebook has two main sections:

- Part 1. Text extraction: This code opens a .pdf file, extracts its OCR text layer, and saves it to a .txt file to be translated.
- Part 2. Translation: This code takes the .txt file from Part 1 as input and translates it from French into English.



## Before we get started

You will need to add the test file (Louis-Riel-martyr.pdf) to the notebook.

### In Google Colab:

1. From the left-hand menu options, choose Files (folder icon)
2. Right click on the panel and choose the option to Upload
3. Select the .pdf file that we will work with
4. You may also upload an additional .pdf file if you want to do more testing

## FULL CODE AND EXPLANATION

## Part 1: Text extraction.

### Step 1: Installing libraries

In programming, 'libraries' refer to code that someone else has written and made available for re-use.
Rather than having to write all that code from scratch, you can use it by 'importing' the libraries into your own code.

To successfully import a library, first you must make sure that it has been installed.

The code below will install a library called '[PyPDF2](https://pypdf2.readthedocs.io/en/3.0.0/#)', which we will use for this workshop.

In [None]:
!pip install PyPDF2

### Step 2: Importing libraries

Now that the library has been installed, you can import it by running the code below.

In [None]:
from PyPDF2 import PdfReader

### Step 3: Indicate which file you are working with.

In [None]:
sourceFile = open('Louis-Riel-martyr.pdf','rb')

---

**NOTE:** You can choose a different .pdf file for testing. All you have to do is change the file name in the code above.

In order to change the file name, look for this code:

`sourceFile = open('Louis-Riel-martyr.pdf', 'rb')`

Change the text inside the first set of quotation marks to use your own .pdf file instead.

---
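If you switch to your own file, it can help to confirm that the upload worked before opening it. The following is a minimal sketch, not part of the original workshop code: the helper `file_is_available` and the variable `pdf_name` are hypothetical names introduced here for illustration, and the check only uses Python's standard `pathlib` module.

```python
from pathlib import Path

def file_is_available(file_name):
    """Return True if file_name exists in the notebook's working folder."""
    return Path(file_name).is_file()

# Hypothetical usage: replace the name below with your own .pdf file.
pdf_name = 'Louis-Riel-martyr.pdf'
if file_is_available(pdf_name):
    print(pdf_name, 'was found.')
else:
    print(pdf_name, 'was not found. Check the Files panel and the spelling of the name.')
```

If the message says the file was not found, the most likely causes are a typo in the file name or an upload that has not finished.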

#### 3.1 Testing the file

The following code will count the number of pages in our document. For the document we are using in this workshop, the result should be 87 pages.

In [None]:
sourceFileReader = PdfReader(sourceFile)
x = len(sourceFileReader.pages)
print('The document contains', x, 'pages.')

The document contains 87 pages.


The following code will iterate through each page in the document and do the following:
- For each page, it will extract the text and add it to a variable called 'fullText'.
- Once the text from every page has been added to 'fullText', it will print out all the text on the screen.

In [None]:
fullText= ""
for i in range(x):
    page=sourceFileReader.pages[i]
    fullText+= page.extract_text()
print(fullText)

^•s^^
..^^<
IMAGE EVALUATION
TESTTARGET (MT-3)O.^JSS'%V^i^.4^
1.0
l.llisâ1^
Uiâ12.2
usus12.0I
/â^.CIHM/ICMH
Microfiche
Séries.CIHM/ICMH
Collection de
microfiches.
Canadien Institute forHistorical Microreproductions /institut canadien demicroreproductions historiques
Tachnical andBibliographie Notas/Notas tachniquaa atbibliographiquaa
ThaInatituta hasattamptad toobtain thabaat
originai copy availabia forfilming. Faaturat ofthia
copywhich maybabibliographically uniqua,
whichmay aitaranyofthaimagaa intha
raproduction, orwhich may aignificantly changa
thauauaimathod offilming, arachacicad baiow.
D
D
DCoiourad covara/
Couvartura dacouiaur
I ICovara damagad/
Couvartura andommagéa
Covara raatorad and/or iaminatad/
Couvartura raatauréa at/ou pailiculéa
Covar titiamiaaing/
Latitradacouvartura manqua
Coiourad mapa/
Cartaa géographiquas ancouiaur
Coiourad init (i.a.othar than blua orblack)/
Encra dacouiaur (i.a.autraquablaua ounoira)
I ICoiourad piataa and/or iiluatrations/
Pianchaa at/ou iiluatr

### Step 4: Reading a single page

#### 4.1 Anticipating code output

The following code works in a similar way to the previous one. However, we will extract a single page, rather than the entire document.

**Before running the code:**

- Read it and try to guess which page we will extract from the document.
- How can you tell?

Now, run the code.

In [None]:
singlePage=PdfReader(sourceFile)
pageObj=singlePage.pages[5]
pageText=pageObj.extract_text()
print(pageText)

—4—
Alphonse XII,enremontant sursontrône, n'apaspour-
suivi lesrépublicains d'Espagne.
Enpendant Riel, legouvernement deSirJohn A.Macdo-
nald s'estmishors laloidespeuples civilisés.
11aimprimé unopprobre àsonnom etànotre histoire.
Cemeurtre, qu'on aàpeine pris lesoinderecouvrir d'un
fauxsemblant d'exécution juridique asoulevé dans lescœurs
honnêtes uneindignation d'autant plus irrésistible, que le
meurtre était enlaidi, s'ilest possilble, par lescalculs in-
avouables quisesont établis autour decegibet.
Chacun saitqu'on aimposé àRielunelongue agonie,
parce que legouvernement, entre lesmains duquel notre
constitution aremis cedroit redoutable quis'appelle ledroit
devieetdemort, n'apascesséunseulinstant deconsidérer
lavieoulamort deRiel,comme dépendant exclusivement
dupoint desavoir cequi,delavieoudelamort decemal-
heureux, serait leplus favorable àlafortune politique des
ministres.
Deshommes quisedisent chrétiens ontcahulé froide-
ment, pendant delongs mois, combien decomtés lapotence
de

Look at the output and try to answer the following:

- Is the page extracted the same as the one you anticipated?
  - HINT: Look at the extracted page number at the top of the output
- Why do you think this happened?
- How would you change the code to extract the page that is numbered '8' in the article?

### Step 5: Saving the text.

In order to translate the text in Part 2, we should save our output as a .txt file that we can use as the starting point.

Saving the text as a .txt file also allows you to do any additional cleanup tasks to enhance the quality of the translated output.

Text cleanup is beyond the scope of today's workshop, but you are encouraged to make any minor changes to the text file for testing purposes.
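As one example of the kind of cleanup you might try later, the extracted text contains words that were split by a hyphen at the end of a line (for instance 'pour-' followed by 'suivi'). The sketch below is an illustration only, with a hypothetical helper name `join_hyphenated_lines`. It assumes that a hyphen followed directly by a line break always marks a split word, which will occasionally be wrong for genuine hyphenated compounds.

```python
import re

def join_hyphenated_lines(text):
    """Rejoin words that were split by a hyphen at the end of a line."""
    # A hyphen directly followed by a line break, with a letter or digit
    # on each side, is treated as a word split across two lines.
    return re.sub(r'(\w)-\n(\w)', r'\1\2', text)

# Sample taken from the extracted page shown in Step 4.
sample = "n'apaspour-\nsuivi lesrépublicains d'Espagne."
print(join_hyphenated_lines(sample))
```

This does not fix the missing spaces between words, which is a separate and harder cleanup problem.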

#### 5.1 Saving the full text obtained in 3.1

In [None]:
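# Open in write mode ("w") so that re-running this cell overwrites the file instead of adding a second copy of the text.
f=open('louis-riel-martyr-full.txt', "w", encoding='utf-8')
f.write(fullText)
f.close()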

#### 5.2 Saving the single page obtained in 4.1

The following code follows the same pattern as the above, with two changes:

- The name of the .txt file indicates that it contains a single page, instead of the full text.
- The line that writes to the file uses the variable 'pageText' that we defined in 4.1, instead of the 'fullText' variable.

In [None]:
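# Open in write mode ("w") so that re-running this cell overwrites the file instead of adding a second copy of the page.
f=open('louis-riel-page.txt', "w", encoding='utf-8')
f.write(pageText)
f.close()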
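The second argument to `open()` controls what happens when the file already exists: `"a"` appends to the end of the file, while `"w"` replaces its contents. This matters in a notebook, where the same cell is often run more than once. The following is a small sketch of the difference, using a hypothetical scratch file in a temporary folder so that none of the workshop files are touched. The helper name `write_twice` is also an assumption made for this illustration.

```python
import os
import tempfile

# Hypothetical scratch file, created in a temporary folder.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

def write_twice(mode):
    """Write the same line twice using the given mode, then return the file's contents."""
    if os.path.exists(demo_path):
        os.remove(demo_path)  # start from a clean slate for each experiment
    for _ in range(2):
        with open(demo_path, mode, encoding='utf-8') as f:
            f.write('Louis Riel\n')
    with open(demo_path, 'r', encoding='utf-8') as f:
        return f.read()

appended = write_twice('a')     # each run adds another copy
overwritten = write_twice('w')  # each run replaces the previous contents

print(repr(appended))
print(repr(overwritten))
```

If a saved .txt file ever seems to contain the same text several times, the append mode combined with repeated runs is a likely explanation.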

## Part 2: Translation

### Step 1: Installing libraries

In [None]:
!pip install deep_translator

### Step 2: Importing libraries

In [None]:
from deep_translator import GoogleTranslator

**For advanced users:** Full documentation for the deep_translator library can be found on the [deep_translator](https://deep-translator.readthedocs.io/en/latest/index.html) website.

### Step 3: Setting up source and target languages

In [None]:
translator=GoogleTranslator(source='fr', target='en')

**Question:** How would you change the above code if you had a text in Italian to be translated into German?

### Step 4: Opening the file

In [None]:
with open('louis-riel-page.txt', 'r', encoding='utf-8') as f:
    text=f.read()


### Step 5: Splitting up the text

The library that we are using for this exercise limits how much text can be translated in a single request. To stay within that limit, we send at most 4500 characters at a time.

For this reason, we break up the text into chunks of that size, and then translate each chunk.

Finally, we combine all the translated chunks together to create our final translated file.

In [None]:
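chunks=[]
while len(text)>0:
    if len(text) <= 4500:
        # The remaining text fits in a single chunk.
        chunks.append(text)
        text= ''
    else:
        # Find the last line break within the first 4500 characters.
        last_newline_index=text.rfind('\n', 0, 4500)
        if last_newline_index > 0:
            # Cut at the line break so that no line is split in two.
            chunks.append(text[:last_newline_index])
            text=text[last_newline_index+1:]
        else:
            # No usable line break was found: cut at exactly 4500 characters.
            chunks.append(text[:4500])
            text=text[4500:]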
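If you expect to reuse this step with other files, the same splitting logic could be wrapped in a function. The sketch below is one possible way to do that, not the workshop's official code: the function name `split_into_chunks` and the parameter `max_chars` are hypothetical, and the tiny limit of 20 characters is used only so the result is easy to inspect.

```python
def split_into_chunks(text, max_chars=4500):
    """Split text into pieces of at most max_chars characters, preferring to break at line ends."""
    chunks = []
    while len(text) > max_chars:
        # Look for the last line break inside the allowed window.
        cut = text.rfind('\n', 0, max_chars)
        if cut > 0:
            chunks.append(text[:cut])
            text = text[cut + 1:]  # skip the line break itself
        else:
            # No usable line break: fall back to a hard cut.
            chunks.append(text[:max_chars])
            text = text[max_chars:]
    if text:
        chunks.append(text)
    return chunks

sample = "première ligne\ndeuxième ligne\ntroisième ligne"
pieces = split_into_chunks(sample, max_chars=20)
print(pieces)
```

With the real text, you would call `split_into_chunks(text)` and keep the default limit.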

### Step 6: Translating and saving the file

In [None]:
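# Write mode ("w") overwrites any earlier translation, so re-running this cell does not duplicate text.
with open('louis-riel-translation.txt', "w", encoding="utf-8") as f2:
    for chunk in chunks:
        # Translate one chunk at a time, show it on the screen, and write it to the file.
        result=translator.translate(chunk)
        print(result)
        f2.write(result + '\n')
print('The file has been compiled and saved.')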

—4—
Alphonse XII, returning to his throne, did not
followed the Republicans of Spain.
During Riel, the government of Sir John A. Macdo-
nald set himself apart from the law of civilized peoples.
He brought opprobrium to his name and to our history.
This murder, which we barely took care to cover with a
false pretense of legal execution raised in hearts
honest an indignation all the more irresistible, as the
murder was made ugly, if possible, by the calculations in-
avowables who are established around this gallows.
Everyone knows that Riel was subjected to a long agony,
because the government, in whose hands our
constitution aremis this formidable right which is called the right
life and death, did not stop for a single moment to consider
the life or death of Riel, as exclusively dependent
from the point of knowing what, about life or death of this mal-
happy, would be the most favorable to fortune a policy of
ministers.
Men who call themselves Christians have hid in cold
lies, for many