# Extract "List of papers" Pages


Supervisor: Prof. Henry Lovejoy for the Digital Slavery Research Lab

Author: Sushma Akoju (Sushma Anand Akoju)

### Reading PDF files which are Bitmap Encoded using Universal Document Converter
Reference: https://www.universal-document-converter.com/

## Getting acquainted with extracting pages from PDF documents

### Steps to install Universal Document Converter on Windows
- Download the installer from <a href="https://www.universal-document-converter.com/">the Universal Document Converter site</a> and follow the installation instructions.
- After installing, open an example PDF in Adobe Acrobat, for example Adobe Acrobat Reader DC.
- Click Print and verify that Universal Document Converter appears in your list of printers.
- Select the pages you would like to save as a separate PDF, then click Print and save.

### For Mac or Linux
#### For Mac users
- Open the PDF file in Preview.
- Click Print, select the pages you would like to print, and save.

#### For Linux users
- Open the PDF file in a PDF reader.
- Click Print, select the pages you would like to print, and save.

## Libraries to import for generating image files required for Transkribus
- We use pdf2jpg to generate an image for each page of the "List of papers" document for a given collection.
    - Install pdf2jpg from <a href="https://github.com/pankajr141/pdf2jpg"> pdf2jpg </a>
- We use pytesseract for optical character recognition (OCR). However, there is no guarantee that it reads these scans accurately.
    - Install tesseract from <a href="https://tesseract-ocr.github.io/tessdoc/Home.html#binaries"> tesseract </a>
    - Install pytesseract from <a href="https://pypi.org/project/pytesseract/"> pytesseract </a>

In [None]:
import re
import os

import pytesseract
from pdf2jpg import pdf2jpg

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'
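
The hard-coded path above only fits a default Windows install. As a minimal sketch, the executable could be located before `tesseract_cmd` is assigned. The helper name `find_tesseract` and the `.exe` fallback path are assumptions for illustration, not part of the original notebook.

```python
import os
import shutil

# Assumed default install location on Windows (adjust if installed elsewhere).
DEFAULT_TESSERACT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

def find_tesseract(default=DEFAULT_TESSERACT):
    """Return a path to the tesseract executable, or None if it cannot be found."""
    # Prefer whatever is on PATH; this also covers Mac and Linux installs.
    found = shutil.which("tesseract")
    if found:
        return found
    # Fall back to the assumed default location.
    return default if os.path.isfile(default) else None

tesseract_path = find_tesseract()
if tesseract_path is None:
    print("Tesseract not found; install it or set the path by hand.")
else:
    print("Using Tesseract at", tesseract_path)
```

If a path is found, it can be assigned to `pytesseract.pytesseract.tesseract_cmd` in place of the literal string.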

### Steps to generate the "List of papers" PDF from each collection PDF
- Download each collection in a volume, for example Slave Trade Volume 10 from http://ddsnext.crl.edu/titles/33509/items?terms=&page=0
    - Create a folder named "Slave Trade Volume 10".
    - Click on the item "Class A. Correspondence with the British commissioners at Sierra Leone, the Havannah, Rio de Janeiro, and Surinam, relating to the slave trade. 1824-1825" under Slave Trade Volume 10.
    - Click the Download button. This opens a new tab or page where you select all pages, from 01 to the last page number, and then click Download.
    - Once the document is downloaded, save it to the volume folder you created in the first sub-step.
    - Repeat this for all collections under "Slave Trade Volume 10".
- Now navigate to the volume folder we just downloaded into.
- Create a folder named "extracted", since we are going to extract the "List of papers" pages.
- Open the first collection PDF. We assume Windows here, so the document should open in Adobe Reader. (For Mac/Linux the steps are simpler and are listed in the earlier section on extracting pages from PDF documents.)
- Click "Print" in Reader.
- Select Universal Document Converter as the printer.
- Note down the starting and ending page numbers of the "List of papers".
- In the Print window, select the "Pages" option and enter 4-8, 4 being the starting page number and 8 the ending page number of the List of papers.
- Note that the collection "Class A. Correspondence with the British commissioners at Sierra Leone, the Havannah, Rio de Janeiro, and Surinam, relating to the slave trade. 1824-1825" PDF has its List of papers on pages 4 to 8. (These are the true PDF page numbers.)
- With the pages and Universal Document Converter selected, click Properties, right next to Universal Document Converter.
- In the File format section, make sure PDF document is selected as the first option and PDF Standard is Regular PDF, set Data Structure to Bitmapped PDF, leave the rest as default, and click OK.
- Now click Print in the Print window. You will be prompted twice: once to save a .prn file and once to save the .pdf file with the extracted pages. Note that these files are saved under the "Slave Trade Volume 10" -> "extracted" folder.
- Next, we run the following code to list all files and build a dictionary that maps each volume to its list of collections.

In [None]:
# Raw string so the backslashes in the Windows path are not treated as escapes.
root = r"E:\cu\summer2022\independent-study\lod-images"
assert os.path.exists(root), "Path does not exist %s" % root
# Map each volume folder name to its full path.
volumes = {folder: os.path.join(root, folder) for folder in os.listdir(root)}
volumes_dict = {}
for folder, volume in volumes.items():
    docs = []
    files = os.listdir(volume)
    assert len(files) > 0, "Empty volume folder %s" % volume
    for doc in files:
        path = os.path.join(volume, doc)
        assert os.path.exists(path), "File does not exist %s" % doc
        # Keep only PDF files; skip subfolders.
        if not os.path.isdir(path) and os.path.splitext(doc)[1] == '.pdf':
            docs.append(path)
    volumes_dict[folder] = docs
volumes_dict

{'Slave Trade Volume 10': ['E:\\cu\\summer2022\\independent-study\\lod-images\\Slave Trade Volume 10\\Class A. Correspondence with the British commissioners at Sierra Leone, the Havannah, Rio de Janeiro, and Surinam, relating to the slave trade. 1824-1825.pdf',
  'E:\\cu\\summer2022\\independent-study\\lod-images\\Slave Trade Volume 10\\Class A. Correspondence with the British commissioners at Sierra Leone, the Havannah, Rio de Janeiro, and Surinam, relating to the slave trade. 1825-1826.pdf',
  'E:\\cu\\summer2022\\independent-study\\lod-images\\Slave Trade Volume 10\\Class B. Correspondence with foreign powers, relating to the slave trade. 1824-1825.pdf',
  'E:\\cu\\summer2022\\independent-study\\lod-images\\Slave Trade Volume 10\\Class B. Correspondence with foreign powers, relating to the slave trade. 1825-1826.pdf'],
 'Slave Trade Volume 11': ['E:\\cu\\summer2022\\independent-study\\lod-images\\Slave Trade Volume 11\\Class A. Correspondence with the British commissioners at Sierra
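
The same listing step can be sketched with `pathlib`. This is a minimal variant for illustration. The function name `list_collection_pdfs` and the demo folder names are made up. Unlike the cell above, it skips stray files in the root folder, matches `.PDF` case-insensitively, and raises no assertion errors.

```python
import tempfile
from pathlib import Path

def list_collection_pdfs(root):
    """Map each volume folder name to the sorted list of PDF paths inside it."""
    result = {}
    for volume in sorted(Path(root).iterdir()):
        if volume.is_dir():
            result[volume.name] = sorted(
                str(p) for p in volume.iterdir()
                if p.is_file() and p.suffix.lower() == ".pdf"
            )
    return result

# Small demonstration on a throwaway folder tree (names are made up).
with tempfile.TemporaryDirectory() as tmp:
    vol = Path(tmp) / "Volume X"
    (vol / "images").mkdir(parents=True)   # subfolder, ignored
    (vol / "Class A.pdf").touch()          # kept
    (vol / "notes.txt").touch()            # not a PDF, ignored
    demo = list_collection_pdfs(tmp)
    print({name: [Path(p).name for p in pdfs] for name, pdfs in demo.items()})
```

Sorting the entries keeps the dictionary stable across runs, which makes it easier to compare outputs between machines.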

## Convert extracted "List of papers" for each collection into JPEG images
- First, we use the volumes dictionary to get each volume and its list of collections (which come from the "extracted" folder).
- <b> Create a folder named "images". </b>
- Then we use pdf2jpg to convert each "List of papers" PDF of a collection.
- The previous step creates, under the "images" folder from step 2 (i.e. "extracted\images"), one folder per PDF, named after the PDF. Inside each of these folders, every "List of papers" page is saved as a JPG file.
- Note that each JPG file is saved as pagenumber_collectionname.jpg.

In [None]:
for volume, colls in volumes_dict.items():
    for doc in colls:
        # Write the page images to an "images" folder next to the PDF.
        outputpath = os.path.join(os.path.split(doc)[0], 'images')
        print(os.path.split(doc)[0])
        result = pdf2jpg.convert_pdf2jpg(doc, outputpath, dpi=300, pages="ALL")

E:\cu\summer2022\independent-study\lod-images\Slave Trade Volume 10
E:\cu\summer2022\independent-study\lod-images\Slave Trade Volume 10
E:\cu\summer2022\independent-study\lod-images\Slave Trade Volume 10
E:\cu\summer2022\independent-study\lod-images\Slave Trade Volume 10
E:\cu\summer2022\independent-study\lod-images\Slave Trade Volume 11
E:\cu\summer2022\independent-study\lod-images\Slave Trade Volume 11
E:\cu\summer2022\independent-study\lod-images\Slave Trade Volume 11
E:\cu\summer2022\independent-study\lod-images\Slave Trade Volume 11
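
After the conversion, a quick sanity check is to count the JPG files in each collection folder and compare the counts with the page ranges noted earlier. A minimal sketch follows. `count_jpgs` is a hypothetical helper, and the demo folder and file names are made up. It relies only on the "images" folder holding one subfolder per PDF, not on how pdf2jpg names those subfolders.

```python
import os
import tempfile

def count_jpgs(images_root):
    """Return {subfolder name: number of .jpg files} for one "images" folder."""
    counts = {}
    for name in sorted(os.listdir(images_root)):
        folder = os.path.join(images_root, name)
        if os.path.isdir(folder):
            counts[name] = sum(
                1 for f in os.listdir(folder) if f.lower().endswith(".jpg")
            )
    return counts

# Demonstration with made-up folder and file names.
with tempfile.TemporaryDirectory() as images_root:
    coll = os.path.join(images_root, "collection one")
    os.mkdir(coll)
    for page in range(3):
        open(os.path.join(coll, "%d_collection one.pdf.jpg" % page), "w").close()
    demo_counts = count_jpgs(images_root)
    print(demo_counts)
```

A collection whose "List of papers" spans pages 4 to 8 should show five images. A count of zero points to a failed conversion.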


## Create Images Volume dictionary
- For each volume in the root folder, we get all collection folders listed under "volume name\extracted\images".
- For each collection folder, we get all image files and rename each one to pagenumber.jpg. Each JPG file is originally saved as pagenumber_collectionname.jpg, but the collection name is not needed in the individual page names because the pages already sit in a folder named after the collection.

In [None]:
# Raw string so the backslashes in the Windows path are not treated as escapes.
root = r"E:\cu\summer2022\independent-study\lod-images"
print(os.path.exists(root))
volumes_imgs = {folder: os.path.join(root, folder, 'images') for folder in os.listdir(root)}
volumes_img_dict = {}
print(volumes_imgs)
for folder, volume in volumes_imgs.items():
    colls = [os.path.join(volume, col) for col in os.listdir(volume)]
    for coll in colls:
        files = os.listdir(coll)
        print(files)
        # Rename the long pagenumber_collectionname.jpg files first.
        for file in files:
            prefix = file.split('_', 1)[0]
            is_jpg = not os.path.isdir(os.path.join(coll, file)) and os.path.splitext(file)[1] == '.jpg'
            # Take the page number from the filename prefix (0-based) rather than
            # from the listing position, because os.listdir order is arbitrary.
            if is_jpg and '_' in file and prefix.isdigit():
                dest = str(int(prefix) + 1) + ".jpg"
                if dest not in files:
                    os.rename(os.path.join(coll, file), os.path.join(coll, dest))  # comment this line out to keep the long filenames
                    print("renaming file", file, dest)
        # List the folder again so the dictionary holds the renamed paths.
        volumes_img_dict[coll] = [os.path.join(coll, file) for file in os.listdir(coll)]

True
{'Slave Trade Volume 10': 'E:\\cu\\summer2022\\independent-study\\lod-images\\Slave Trade Volume 10\\images', 'Slave Trade Volume 11': 'E:\\cu\\summer2022\\independent-study\\lod-images\\Slave Trade Volume 11\\images'}
['0_Class A. Correspondence with the British commissioners at Sierra Leone, the Havannah, Rio de Janeiro, and Surinam, relating to the slave trade. 1824-1825.pdf.jpg', '1_Class A. Correspondence with the British commissioners at Sierra Leone, the Havannah, Rio de Janeiro, and Surinam, relating to the slave trade. 1824-1825.pdf.jpg', '2_Class A. Correspondence with the British commissioners at Sierra Leone, the Havannah, Rio de Janeiro, and Surinam, relating to the slave trade. 1824-1825.pdf.jpg', '3_Class A. Correspondence with the British commissioners at Sierra Leone, the Havannah, Rio de Janeiro, and Surinam, relating to the slave trade. 1824-1825.pdf.jpg']
renaming file 0_Class A. Correspondence with the British commissioners at Sierra Leone, the Havannah, Rio d
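
Because renaming changes files on disk, it can help to preview the mapping first. This minimal sketch assumes the filenames follow the pagenumber_collectionname.jpg pattern shown in the output above. The function name `plan_renames` and the demo filenames are hypothetical. It reads the page number from the filename prefix, so the result does not depend on the order in which `os.listdir` returns the files.

```python
import re

def plan_renames(filenames):
    """Map 'N_name.jpg' to '<N+1>.jpg'; files without a numeric prefix are skipped."""
    pattern = re.compile(r"^(\d+)_.*\.jpg$", re.IGNORECASE)
    plan = {}
    for name in filenames:
        match = pattern.match(name)
        if match:
            # pdf2jpg output starts at 0, so page N becomes N+1.
            plan[name] = "%d.jpg" % (int(match.group(1)) + 1)
    return plan

# Lexicographic order puts "10_" before "2_"; the plan is unaffected by that.
names = ["0_demo.pdf.jpg", "10_demo.pdf.jpg", "2_demo.pdf.jpg", "1.jpg"]
plan = plan_renames(names)
for old, new in sorted(plan.items(), key=lambda item: int(item[1][:-4])):
    print(old, "->", new)
```

Only once the printed mapping looks right would each pair be passed to `os.rename`.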

## Convert Image to string using Tesseract
- The text extracted from the image is scattered: the original structure is NOT preserved, and some random special characters appear.
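
Even with the page structure lost, numbered entries can often be recovered line by line. This minimal sketch assumes entries look like "N. text". The sample lines are modelled on the Tesseract output printed further down, and `numbered_entries` is a hypothetical helper. It only filters lines and trims trailing filler; it does not correct misread characters.

```python
import re

# A few lines resembling Tesseract output for a "List of papers" page.
sample = """3. H. M's. Comm" to Mr. SecYCanning = -
4. E. Gregory, Esq. to Mr. Sec? Canning = -

> S eae
10. H. M's. Comm® to Mr. Sec? Canning"""

# An entry is a line that starts with a number followed by a period.
ENTRY = re.compile(r"^\s*(\d+)\.\s+(.*\S)\s*$")

def numbered_entries(text):
    """Return (number, description) pairs for lines of the form 'N. text'."""
    entries = []
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match:
            # Drop trailing filler such as '= -' or '~' left over from dot leaders.
            description = re.sub(r"[\s=~\-.]+$", "", match.group(2))
            entries.append((int(match.group(1)), description))
    return entries

for number, description in numbered_entries(sample):
    print(number, description)
```

Lines where the entry number itself was misread or displaced would be missed by this filter, so the result still needs manual review.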

In [None]:
key = list(volumes_img_dict.keys())[0]
file = volumes_img_dict[key][0]
print(pytesseract.image_to_string(file, timeout=2000))

Class A.

LIST

OF PAPERS

SIERRA LEONE.

No.

1, Mr. Sec’ Canning to H. M’s. Comm”

One Enclosure

Mr. Sec! Canning to H, M’s. Comm™
3. H. M’s. Comm" to Mr. SecYCanning = -
4. E. Gregory, Esq. to Mr. Sec? Canning = -
5. E. Gregory, Esq. to Mr. Sect Canning
6. E. Gregory, Esq. to Mr. See? Canning
7. E. Gregory, Esq. to Mr. Sect Canning
8. E. Gregory, Esq. to Mr. Sec! Canning
9. Mr. Sec! Canning to H. M’s, Comm? ~

2.

Four Enclosures

10. H. M’s. Comm® to Mr. Sec? Canning
11. D. M. Hamilton, Esq. to Mr. See Canning
12. Mr, Sec? Canning to D. M. Hamilton, Esq.
18. Mr, Sec! Canning to D. M. Hamilton, Esq.

> S eae

|

Bale & Receipt. *

DB

v BD Be PH PS FS BP =

(General.)

SUBJECT,
June sta Slaves ou board at time of detention

Juei8-— Papers laid before Parliament oe
Ky u—~ Annual Report . 2. . 2...
oat ¢—— Proceeds of Sales . . . .

oc — Mess. Worrall, Magnus, & Bidwell
se 3—~ Instruction to Mr. Con.-Gen. Clarke
NET— Receipt of Despatches

Ree i—— Receipt of No.2 2. 2...

Nov.20 © Co