<a href="https://colab.research.google.com/github/xiyueking/HPCPL/blob/main/Digitising_Maritime_History_Preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Digitising Maritime History: Preparation

## Summary of Lloyd's Register Datasets

### **Lloyd's Register of Ships**
The **Lloyd’s Register of Ships** is a historical record of maritime vessels spanning **1764 to 2000**, detailing ship characteristics and ownership. Formerly known as *Lloyd’s Register of Shipping*, it is accessible via **Internet Archive, Wikimedia Commons, and Google Books**.

- Link: https://hec.lrfoundation.org.uk/archive-library/lloyds-register-of-ships-online


#### **Table Categories (1 per Report)**
 - **Ship_Registry**


#### **Column Definitions (1 per Report)**
- **Vessel Name & Type** – Name and classification.
- **Master's Name** – Captain's name.
- **Tonnage (Burthen)** – Weight-carrying capacity.
- **Place & Year of Build** – Construction location and date.
- **Name of Owner** – Ship’s registered owner.
- **Draft in Feet (Loaded)** – Depth when fully loaded.
- **Port of Survey** – Last inspection location.
- **Classification** – Structural and safety rating.

Digitized **1930-1945** records are part of the **Plimsoll Ship Data Project**, and the **Heritage & Education Centre (HEC)** continues scanning the collection.

---

### **Lloyd's Register Casualty Returns**
The **Casualty Returns** (*Wreck Returns*) record **total losses** of ocean-going merchant ships **over 100 gross tonnes**. Published **quarterly and annually**, they are essential for maritime history research.

- Link 1: https://hec.lrfoundation.org.uk/archive-library/casualty-returns
- Link 2: https://archive.org/details/HECCR1890/1890/

#### **Table Categories (21 Total per Report)**
1. **Statistics Tables (3 total)**
   - **Steam Vessels**
   - **Sailing Vessels**
   - **Combined Steam & Sailing Vessels**


2. **Casualty Type Tables (18 total, 9 per vessel type)**
   - **Abandoned at Sea**
   - **Broken up, Condemned**
   - **Burnt**
   - **Collision**
   - **Foundered**
   - **Lost**
   - **Missing**
   - **Wrecked**

#### **Column Definitions**

There are two types of tables, Statistics and Casualty, and each type has its own set of columns.

- **Statistics Tables (Steam, Sailing, Combined)**
  - **Nationality (Flag)**
  - **Number of Vessels Lost**
  - **Net & Gross Tonnage**
  - **Number of Vessels Owned (Lloyd’s Register)**
  - **Percentage Lost** (relative to fleet size)

- **Casualty Type Tables**
  - **Registry Number** (Lloyd’s Register)
  - **Vessel Name**
  - **Tonnage (Net & Gross)**
  - **Nationality (Flag)**
  - **Ship Type** (e.g., Steamship, Schooner)
  - **Voyage Route**
  - **Cargo**
  - **Circumstances & Place of Loss**
  - **Date of Incident**

Each **casualty table** groups ships by cause of loss (e.g., Burnt, Collision, Foundered).

---

## **JSON Structure for Lloyd's Register Data**
This structure ensures:
- **Consistent metadata fields** across books in a collection.
- **Sequential page storage**, capturing all tables per page.
- **Standardized categorization** of vessels and incidents.

#### **Metadata Fields**
All books within the same collection share the following fields:
- **`collection`** → `"Lloyd's Register of Shipping"` or `"Lloyd's Register Casualty Returns"`
- **`publication_year`** → Year published (e.g., `"1799"`, `"1890"`)
- **`publisher`** → `"Lloyd's Register Foundation, Heritage & Education Centre"`
- **`metadata`**:
  - **topics** → List (e.g., `"maritime history", "classification"`)
  - **language** → `"English"`
  - **item_size** → File size (e.g., `"209.9M"`, `"4.8G"`)
  - **collection_type** → `"folkscanomy_history"`
- **`description`** → Document summary.
- **`guidance_link`** → URL for interpretation guidelines.

**Metadata fields are extracted once per collection** from **Internet Archive metadata pages**:

- **Lloyd's Register of Shipping (1799)**:  
  [https://archive.org/details/HECROS1799/page/n9/mode/2up](https://archive.org/details/HECROS1799/page/n9/mode/2up)

- **Lloyd's Register Casualty Returns (1890)**:  
  [https://archive.org/details/HECCR1890/1890/](https://archive.org/details/HECCR1890/1890/)

#### **Automated Extraction**
To extract metadata programmatically, we use:
- **`requests`** (to fetch HTML content).
- **`BeautifulSoup`** (to parse page source).
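
As a complement to scraping the HTML page, the sketch below reads similar information from the Internet Archive's JSON metadata endpoint (`https://archive.org/metadata/<identifier>`) with the standard library only. This is a swapped-in approach, not the `requests` + `BeautifulSoup` route named above. The response keys it reads (`subject`, `language`, `collection`, `description`, `publisher`, `item_size`) and the helper names are assumptions to check against a real response.

```python
import json
from urllib.request import urlopen


def fetch_ia_record(identifier):
    """Download the metadata record for one Internet Archive item (needs network)."""
    with urlopen(f"https://archive.org/metadata/{identifier}") as response:
        return json.load(response)


def human_size(num_bytes):
    """Format a byte count in the style used in this notebook, e.g. '4.8G'."""
    size = float(num_bytes)
    for unit in ("B", "K", "M", "G"):
        if size < 1024 or unit == "G":
            return f"{size:.1f}{unit}"
        size /= 1024


def as_list(value):
    """Normalise a field that may hold one string or a list of strings."""
    if value is None:
        return []
    return list(value) if isinstance(value, (list, tuple)) else [value]


def to_book_metadata(record, collection, publication_year):
    """Map a metadata record onto the shared fields described above."""
    meta = record.get("metadata", {})
    collections = as_list(meta.get("collection"))
    identifier = meta.get("identifier", "")
    return {
        "collection": collection,
        "publication_year": publication_year,
        "publisher": meta.get("publisher", ""),
        "metadata": {
            "topics": as_list(meta.get("subject")),
            "language": meta.get("language", ""),
            "item_size": human_size(record.get("item_size", 0)),
            "collection_type": collections[0] if collections else "",
        },
        "description": meta.get("description", ""),
        "guidance_link": f"https://archive.org/details/{identifier}" if identifier else "",
    }


# Hand-written record for illustration only; a real one would come from
# fetch_ia_record("HECROS1799").
sample = {
    "item_size": 5 * 1024 * 1024,
    "metadata": {
        "identifier": "HECROS1799",
        "subject": ["maritime history", "classification"],
        "language": "English",
        "collection": ["folkscanomy_history"],
    },
}
book = to_book_metadata(sample, "Lloyd's Register of Shipping", "1799")
print(book["metadata"])
```

Keeping the mapping in its own function means it can be checked on a small hand-written record before any network request is made.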

---




# 📖 **Explanation of Pages and Tables Fields in the JSON Structure**

## **1️⃣ Pages Field: Capturing Structured Information**
The **`pages`** field stores structured content **per page** in a Lloyd’s Register book. Each book contains multiple pages, but only **pages that include tabular data** are stored in the JSON. Pages without relevant tables are **ignored**.

Each page is represented as:

```json
{
  "page_number": 12,
  "tables": [ ... ]
}
```

📌 Key Attributes
- `page_number`: The page number in the book where the tables appear.
- `tables`: A list of tables found on that specific page.
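
A minimal sketch of the filtering rule described above, assuming the tables for each page have already been extracted into a dictionary keyed by page number. The function name `build_pages` and the input shape are illustrative assumptions.

```python
def build_pages(tables_by_page):
    """Return the `pages` list, keeping only pages that contain at least one table."""
    pages = []
    for page_number in sorted(tables_by_page):
        tables = tables_by_page[page_number]
        if not tables:  # pages without tables are ignored
            continue
        pages.append({"page_number": page_number, "tables": tables})
    return pages


# Made-up input: page 11 has no table, so it is left out of the result.
extracted = {
    12: [{"table_type": "Statistics", "vessel_type": "Steam", "caption": "", "rows": []}],
    11: [],
}
pages = build_pages(extracted)
print([page["page_number"] for page in pages])
```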


## **2️⃣ Tables Field: Storing Maritime Data in Structured Tables**

Each page may contain one or more tables, depending on the type of dataset. The tables field organizes data into structured categories, ensuring consistency.

Each table follows this structure:

```json
{
  "table_type": "Statistics",
  "vessel_type": "Steam",
  "caption": "Fleet Summary for Steam Vessels - 1890",
  "rows": [ ... ]
}
```

📌 Key Attributes
- `table_type`: The category of the table (e.g., `"Statistics"`, `"Casualty-Wrecked"`).
- `vessel_type`: The type of vessels listed in the table (`"Steam"`, `"Sailing"`, `"Combined"`).
- `caption`: A descriptive label summarizing the table's content.
- `rows`: A list of dictionary entries, where each entry represents one row in the table.
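
To illustrate the `rows` attribute, the sketch below turns a header row and a grid of cell values into a list of dictionaries and wraps them in a table entry. The helper name `make_table` and the sample values are hypothetical.

```python
def make_table(table_type, vessel_type, caption, header, grid):
    """Build one table entry whose rows are dictionaries keyed by column name."""
    rows = []
    for cells in grid:
        if len(cells) != len(header):
            raise ValueError(f"expected {len(header)} cells, got {len(cells)}")
        rows.append(dict(zip(header, cells)))
    return {
        "table_type": table_type,
        "vessel_type": vessel_type,
        "caption": caption,
        "rows": rows,
    }


# Illustrative values only, not taken from a real report.
table = make_table(
    "Statistics",
    "Steam",
    "Fleet Summary for Steam Vessels - 1890",
    ["Nationality", "Number of Vessels Lost"],
    [["British", 68], ["German", 30]],
)
print(table["rows"][0])
```

Raising an error on a row of the wrong length is a deliberate choice here: misaligned cells are common in OCR output and are easier to fix when they are caught early.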


### **Lloyd's Register of Shipping**
- **One table per page**: `"table_type": "Ship_Registry"`
- **Vessel type**: `"vessel_type": "NA"` (Not applicable)

### **Lloyd's Register Casualty Returns**
- **Multiple tables per page possible**.
- **Different pages contain different tables** (e.g., `"Casualty-Wrecked"`, `"Casualty-Burnt"`, `"Statistics"`).
- **Each table type has a distinct set of columns**.


----
## **JSON Schema**




In [1]:
{
  "collection": "",  # "Lloyd's Register of Shipping" or "Lloyd's Register Casualty Returns"
  "publication_year": "",  # e.g., "1799" or "1890"
  "publisher": "Lloyd's Register Foundation, Heritage & Education Centre",
  "metadata": {
    "topics": [],  # List of topics (e.g., "maritime history", "classification")
    "language": "English",
    "item_size": "",  # e.g., "209.9M", "4.8G"
    "collection_type": ""  # e.g., "folkscanomy_history"
  },
  "description": "",  # Additional description if available
  "guidance_link": "",  # URL for guidance, if applicable
  "pages": [
    {
      "page_number": "",  # Page number in the document (e.g., 12)
      "tables": [
        {
          "table_type": "",  # e.g., "Statistics", "Casualty-Abandoned at Sea", "Casualty-Broken up, Condemned", "Ship_Registry"
          "vessel_type": "",  # "Steam", "Sailing", "Combined" (or "NA" if not applicable)
          "caption": "",  # Caption for the table, if available
          "rows": [
            {}  # Row data (dictionaries with key-value pairs)
          ]
        }
      ]
    }
  ]
}

**Important**: Each book in each collection is stored as its own nested JSON file.
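
One possible way to write a book to its own file is sketched below. The file-naming scheme (collection name plus publication year) and the helper names `book_filename` and `save_book` are assumptions, not a convention fixed by this notebook.

```python
import json
import re
import tempfile
from pathlib import Path


def book_filename(book):
    """Derive a file name such as 'lloyds_register_casualty_returns_1890.json'."""
    name = book["collection"].lower().replace("'", "")
    slug = re.sub(r"[^a-z0-9]+", "_", name).strip("_")
    return f"{slug}_{book['publication_year']}.json"


def save_book(book, out_dir="."):
    """Write one book to its own JSON file and return the path."""
    path = Path(out_dir) / book_filename(book)
    path.write_text(json.dumps(book, indent=2, ensure_ascii=False), encoding="utf-8")
    return path


# Minimal made-up book, written to a temporary folder and read back.
book = {
    "collection": "Lloyd's Register Casualty Returns",
    "publication_year": "1890",
    "pages": [],
}
with tempfile.TemporaryDirectory() as tmp:
    saved = save_book(book, tmp)
    reloaded = json.loads(saved.read_text(encoding="utf-8"))
print(saved.name)
```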


## **Lloyd's Register Casualty Returns Table Types**
Each page may contain **one or more tables**, categorized as:


## **table_type Values**
| **table_type** | **Description** |
|---------------|----------------|
| `"Statistics"` | Fleet summaries by nationality and vessel type (Steam, Sailing, Combined). |
| `"Casualty-Abandoned at Sea"` | Ships abandoned during voyages. |
| `"Casualty-Broken up, Condemned"` | Ships decommissioned, scrapped, or condemned. |
| `"Casualty-Burnt"` | Ships destroyed by fire. |
| `"Casualty-Collision"` | Ships lost due to collisions. |
| `"Casualty-Foundered"` | Ships sunk due to structural failure or water ingress. |
| `"Casualty-Lost"` | General classification for ships lost at sea. |
| `"Casualty-Missing"` | Ships that disappeared with unknown fate. |
| `"Casualty-Wrecked"` | Ships destroyed due to grounding or hitting obstacles. |


## **vessel_type Values**
| **vessel_type** | **Description** |
|---------------|----------------|
| `"Steam"` | Steam-powered vessels |
| `"Sailing"` | Sailing vessels |
| `"Combined"` | Both steam & sailing vessels |
| `"NA"` | Not applicable (for ship registry) |

---


## **Lloyd's Register of Ships Table Types**
Each page contains **exactly one table**, categorized as **"Ship_Registry"**.



## **table_type Values**
| **table_type** | **Description** |
|---------------|----------------|
| `"Ship_Registry"` | Standardized table listing ship characteristics and ownership details. |


## **vessel_type Values**
| **vessel_type** | **Description** |
|---------------|----------------|
| `"NA"` | Not Applicable |
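
The allowed values in the tables above can be checked automatically before a book is saved. The sketch below is one way to do that; the constant names and the `check_table` helper are hypothetical.

```python
CASUALTY_CAUSES = [
    "Abandoned at Sea", "Broken up, Condemned", "Burnt", "Collision",
    "Foundered", "Lost", "Missing", "Wrecked",
]

# Allowed values per collection, copied from the tables above.
ALLOWED_TABLE_TYPES = {
    "Lloyd's Register of Shipping": {"Ship_Registry"},
    "Lloyd's Register Casualty Returns":
        {"Statistics"} | {f"Casualty-{cause}" for cause in CASUALTY_CAUSES},
}
ALLOWED_VESSEL_TYPES = {
    "Lloyd's Register of Shipping": {"NA"},
    "Lloyd's Register Casualty Returns": {"Steam", "Sailing", "Combined"},
}


def check_table(collection, table):
    """Return a list of problems found in one table entry (empty if none)."""
    problems = []
    if table.get("table_type") not in ALLOWED_TABLE_TYPES[collection]:
        problems.append(f"unexpected table_type: {table.get('table_type')!r}")
    if table.get("vessel_type") not in ALLOWED_VESSEL_TYPES[collection]:
        problems.append(f"unexpected vessel_type: {table.get('vessel_type')!r}")
    return problems


ok = check_table(
    "Lloyd's Register Casualty Returns",
    {"table_type": "Casualty-Wrecked", "vessel_type": "Steam"},
)
bad = check_table(
    "Lloyd's Register of Shipping",
    {"table_type": "Ship_Registry", "vessel_type": "Steam"},
)
print(ok, bad)
```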





### **Example of JSON File for a Book in the Lloyd's Register of Shipping**

This **imaginary book** represents a **digitized volume** from the **Lloyd’s Register of Shipping**, which records ship details, ownership, and classifications. The JSON structure captures metadata, page content, and ship registry tables.

Each page contains **one table**, categorized as `"Ship_Registry"`, which includes **vessel characteristics** such as name, tonnage, build details, and classification. This **imaginary book** has only 2 pages (page 10 and 11) with table information.


In [2]:
{
  "collection": "Lloyd's Register of Shipping",
  "publication_year": "1799",
  "publisher": "Lloyd's Register Foundation, Heritage & Education Centre",
  "metadata": {
    "topics": ["maritime history", "merchant ships", "ship design", "shipbuilding"],
    "language": "English",
    "item_size": "209.9M",
    "collection_type": "folkscanomy_history"
  },
  "description": "A historical record of maritime vessels, detailing ship characteristics and ownership from 1764 to 2000.",
  "guidance_link": "https://archive.org/details/HECROS1799/page/n9/mode/2up",
  "pages": [
    {
      "page_number": 10,
      "tables": [
        {
          "table_type": "Ship_Registry",
          "vessel_type": "NA",
          "caption": "Ship registry table listing vessel details such as name, tonnage, year of build, and ownership.",
          "rows": [
            {
              "Vessel Name & Type": "HMS Victory (Ship of the Line)",
              "Master's Name": "John Smith",
              "Tonnage (Burthen)": "2000",
              "Place of Build": "Portsmouth",
              "Year of Build": "1765",
              "Name of Owner": "Royal Navy",
              "Draft in Feet (Loaded)": "23",
              "Port of Survey": "London",
              "Classification": "First Rate"
            },
            {
              "Vessel Name & Type": "SS Britannia (Passenger Liner)",
              "Master's Name": "William Brown",
              "Tonnage (Burthen)": "3000",
              "Place of Build": "Liverpool",
              "Year of Build": "1840",
              "Name of Owner": "Cunard Line",
              "Draft in Feet (Loaded)": "26",
              "Port of Survey": "Liverpool",
              "Classification": "Ocean Liner"
            }
          ]
        }
      ]
    },
    {
      "page_number": 11,
      "tables": [
        {
          "table_type": "Ship_Registry",
          "vessel_type": "NA",
          "caption": "Ship registry table listing vessel details such as name, tonnage, year of build, and ownership.",
          "rows": [
            {
              "Vessel Name & Type": "SS Great Eastern (Iron Steamship)",
              "Master's Name": "James Anderson",
              "Tonnage (Burthen)": "18600",
              "Place of Build": "Millwall",
              "Year of Build": "1858",
              "Name of Owner": "Eastern Steam Navigation Company",
              "Draft in Feet (Loaded)": "30",
              "Port of Survey": "London",
              "Classification": "Steamship"
            }
          ]
        }
      ]
    }
  ]
}



### **Example of JSON File for a Book in the Lloyd's Register Casualty Returns**

This **imaginary book** represents a **digitized volume** from the **Lloyd’s Register Casualty Returns**, which records **total losses** of ocean-going merchant ships over **100 gross tonnes**. The JSON structure captures metadata, sequential page content, and **multiple tables per page**, depending on the **casualty type and statistics**.

Each page may contain **one or more tables**, including:
1. **"Statistics"** → Summarizes fleet losses by nationality and vessel type.
2. **Casualty Type Tables** → Records **ship losses** due to specific causes (e.g., Foundered, Collision, Burnt, etc.).



In [3]:
{
  "collection": "Lloyd's Register Casualty Returns",
  "publication_year": "1904",
  "publisher": "Lloyd's Register Foundation, Heritage & Education Centre",
  "metadata": {
    "topics": ["maritime history", "ship losses", "wreck records", "classification"],
    "language": "English",
    "item_size": "5.2G",
    "collection_type": "historical_reports"
  },
  "description": "A record of total losses of ocean-going merchant ships over 100 gross tonnes, categorized by cause of loss.",
  "guidance_link": "https://archive.org/details/HECCR1904/1904/",
  "pages": [
    {
      "page_number": 5,
      "tables": [
        {
          "table_type": "Statistics",
          "vessel_type": "Steam",
          "caption": "Summary of steam vessel losses by nationality, showing the number of ships lost, total tonnage affected, and percentage lost relative to total ownership.",
          "rows": [
            {
              "Nationality": "British",
              "Number of Vessels Lost": 68,
              "Net Tonnage": 150000,
              "Gross Tonnage": 160000,
              "Number of Vessels Owned": 7530,
              "Percentage Lost": "0.9%",
              "caption": "British steamship losses: 68 vessels lost, totaling 160,000 gross tonnes, representing 0.9% of the total British-owned fleet."
            },
            {
              "Nationality": "German",
              "Number of Vessels Lost": 30,
              "Net Tonnage": 70000,
              "Gross Tonnage": 74000,
              "Number of Vessels Owned": 1142,
              "Percentage Lost": "2.6%",
              "caption": "German steamship losses: 30 vessels lost, totaling 74,000 gross tonnes, accounting for 2.6% of the total German-owned fleet."
            }
          ]
        }
      ]
    },
    {
      "page_number": 7,
      "tables": [
        {
          "table_type": "Casualty-Wrecked",
          "vessel_type": "Steam",
          "caption": "Record of steam vessels wrecked, including details of ship names, registry numbers, routes, cargo, and circumstances of loss.",
          "rows": [
            {
              "Registry Number": "1876",
              "Vessel Name": "SS Afghanistan",
              "Tons (Net)": 1751,
              "Tons (Gross)": 2573,
              "Nationality": "British",
              "Ship Type": "Steamship",
              "Voyage": "Clyde - Karachi",
              "Cargo": "General Cargo",
              "Circumstances & Place of Loss": "Wrecked 40 miles N. of Suakin",
              "Date of Incident": "22 January 1904",
              "caption": "The SS Afghanistan, a British steamship, was wrecked 40 miles north of Suakin while transporting general cargo from Clyde to Karachi."
            }
          ]
        },
        {
          "table_type": "Casualty-Collision",
          "vessel_type": "Steam",
          "caption": "Record of steam vessel collisions, providing details on ship names, registry numbers, routes, cargo, and location of incidents.",
          "rows": [
            {
              "Registry Number": "726",
              "Vessel Name": "SS Amerique",
              "Tons (Net)": 2136,
              "Tons (Gross)": 3128,
              "Nationality": "French",
              "Ship Type": "Screw Steamer",
              "Voyage": "Marseilles - Pirams",
              "Cargo": "General Cargo",
              "Circumstances & Place of Loss": "Collision near Faro Point, Straits of Messina",
              "Date of Incident": "24 March 1904",
              "caption": "The SS Amerique, a French screw steamer, was involved in a fatal collision near Faro Point in the Straits of Messina while carrying general cargo from Marseilles to Pirams."
            }
          ]
        }
      ]
    }
  ]
}



# Steps to Extract Tables from the PDF - Casualty Returns

1. Download the PDF in Colab


In [4]:
!wget -O casualty_returns_1905.pdf "https://lloyds-production.s3.amazonaws.com/_file/general/1905-casualty-returns.pdf"


--2025-01-31 10:20:21--  https://lloyds-production.s3.amazonaws.com/_file/general/1905-casualty-returns.pdf
Resolving lloyds-production.s3.amazonaws.com (lloyds-production.s3.amazonaws.com)... 52.95.191.39, 52.95.149.199, 52.95.142.71, ...
Connecting to lloyds-production.s3.amazonaws.com (lloyds-production.s3.amazonaws.com)|52.95.191.39|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3329169 (3.2M) [application/pdf]
Saving to: ‘casualty_returns_1905.pdf’


2025-01-31 10:20:23 (2.19 MB/s) - ‘casualty_returns_1905.pdf’ saved [3329169/3329169]



2.  Install Required Python Packages
You need a library that can extract tables from PDFs. Common options are:

- pdfplumber → Extracts text and tables from PDFs that have a text layer; it is the library used in the cells below.
- PyMuPDF (fitz) → Parses text and images.
- camelot-py / pdfminer → Works well for structured tabular PDFs.

In [5]:
!pip install pdfplumber
!pip install pymupdf


Collecting pdfplumber
  Downloading pdfplumber-0.11.5-py3-none-any.whl.metadata (42 kB)
Collecting pdfminer.six==20231228 (from pdfplumber)
  Downloading pdfminer.six-20231228-py3-none-any.whl.metadata (4.2 kB)
Collecting pypdfium2>=4.18.0 (from pdfplumber)
  Downloading pypdfium2-4.30.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (48 kB)
Downloading pdfplumber-0.11.5-py3-none-any.whl (59 kB)
Downloading pdfminer.six-20231228-py3-none-any.whl (5.6 MB)

In [6]:
import pdfplumber

pdf_path = "casualty_returns_1905.pdf"

with pdfplumber.open(pdf_path) as pdf:
    for page_num, page in enumerate(pdf.pages, start=1):
        tables = page.extract_tables()

        if tables:
            print(f"\n--- Page {page_num} ---")
            for table in tables:
                for row in table:
                    print(row)  # Print each row of the table



--- Page 2 ---
['TOTAL.(e)', 'PER\nCENTAGE\nLo ST\n(STEAM\nVESSELS).(e)']

--- Page 4 ---
[None, None, 'BROKEN UP1\nABANDONED\nCONDEMNED (a), BURNT.\nAT SEA.\nETC.', None, None, None, None, 'COLLISION.', 'FOUNDERED.', 'LOST, ETC. (a)', 'MISSING. (a)', None]
['1904-1905.\nNo. Tons.\n9,236 15,391,350\n2,014 1,189,495\n*2,970 2,590,349\n290 585,156\n803 597,984\n496 687,529\n1,376 1,693,366\n1,935 3,369,807\n1,238 1,187,566\n591 668,360\n2,218 1,717,654\n1,370 840,515\n579 754,855\n1,517 751,533\nTotals ......', None, 'No. Tons.\n2 281\n2 1,168\n1 1,521\n1 253\n1,974\n4 3,590\n420\n! 12 9,207', None, 'No. Tons.', None, 'I No.', None, None, None, 'No.', 'Tons. I']
[None, '15,391,350\n1,189,495\n2,590,349\n585,156\n597,984\n687,529\n1,693,366\n3,369,807\n1,187,566\n668,360\n1,717,654\n840,515\n754,855\n751,533\nTotals ......', None, '281\n1,168\n1,521\n253\n1,974\n3,590\n420', '3\n3', '3,469\n889\n1,259', None, None, None, None, None, None]
[None, None, None, '9,207', '7', '5,617', None, N

## Let's save one of those tables into a DataFrame

In [7]:
pdf.pages[0:5]

[<Page:1>, <Page:2>, <Page:3>, <Page:4>, <Page:5>]

In [8]:
## Extract page 4 (index 3)
pdf.pages[3]

<Page:4>

In [9]:
import pandas as pd
page = pdf.pages[3]
tables = page.extract_tables()  # Extract tables

# Check if tables were detected
if tables:
    # A Page can have more than 1 table
    print(f"Number of tables found on page 4: {len(tables)}")
    # Convert the first detected table into a Pandas DataFrame
    df_table_1 = pd.DataFrame(tables[0])



Number of tables found on page 4: 1


In [10]:
df_table_1

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 |  |  | BROKEN UP1\nABANDONED\nCONDEMNED (a), BURNT.\n... |  |  |  |  | COLLISION. | FOUNDERED. | LOST, ETC. (a) | MISSING. (a) |  |
| 1 | 1904-1905.\nNo. Tons.\n9,236 15,391,350\n2,014... |  | No. Tons.\n2 281\n2 1,168\n1 1,521\n1 253\n1,9... |  | No. Tons. |  | I No. |  |  |  | No. | Tons. I |
| 2 |  | 15,391,350\n1,189,495\n2,590,349\n585,156\n597... |  | 281\n1,168\n1,521\n253\n1,974\n3,590\n420 | 3\n3 | 3,469\n889\n1,259 |  |  |  |  |  |  |
| 3 |  |  |  | 9207 | 7 | 5617 |  |  |  |  |  |  |
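
The output above shows that a single extracted cell often holds a whole column of values separated by newlines, with numbers written using thousands separators. The sketch below shows one way to unpack such a cell. The helper names `split_cell` and `parse_int` are hypothetical, and values damaged by OCR simply come back as `None`.

```python
def split_cell(cell):
    """Split a multi-line cell into its individual lines; None becomes an empty list."""
    if cell is None:
        return []
    return [line.strip() for line in str(cell).split("\n") if line.strip()]


def parse_int(text):
    """Parse a number such as '15,391,350'; return None if the text is not numeric."""
    digits = text.replace(",", "").strip()
    return int(digits) if digits.isdigit() else None


# Cell content copied from the extracted table shown above.
cell = "281\n1,168\n1,521\n253\n1,974\n3,590\n420"
values = [parse_int(part) for part in split_cell(cell)]
print(values)
```

Returning `None` instead of raising keeps the damaged entries visible, so they can be reviewed by hand against the scanned page.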


### Extracting the text of a Page

In [11]:
page = pdf.pages[3]  # Extract page 4 (index 3)
text = page.extract_text()  # Extract full text

# Print extracted text
if text:
    print(f"\n--- Full Text of Page 4 ---\n")
    print(text)
else:
    print("\n--- No Text Detected on Page 4 ---\n")


--- Full Text of Page 4 ---

·~~
TABLE No. 3:-Showing the number, tonnage and nationality of STEAM AND SAILING VESSELS totally lost, condemned, &c., during the quarter
ended 31st March, 1905, as reported up to the 4th September, 1905, and showing also the number and tonnage of Vessels owned in each
country. The tonnage given is gross for Steam Vessels and net for Sailing Vessels. (Vessels under 100 tons are not included in this return.)
PER CENTAGE
HOVV LOST.
STEAM AND SAILING LOST
I
VESSELS OWNED TOTAL (<r) <STEAM AND
ACCORDING TO LLOYD'S ABANDONED COB NR DO EK ME NN E DU P (a1 ), BURNT. COLLISION. FOUNDERED. LOST, ETC. (a) MISSING. (a) WRECKED. (a) SAILING VESSELS
FLAG. REGISTER BOOK AT SEA. IT OGETHER) (a)
ETC. I
1904-1905. I I To~: Ve~~-.el_s_\_T_o_n°n-fa_r_e
Tons. I No. Tons. No ..
No. Tons. No. Tons. No. Tons. I No. I Tons. No. Tons. No. Tons. No. I Tons. No. --- ---. ----- __Q:!_!!~h_ ~
1,587 7 g 709 3 l,90!i 10 9,812 29 44 ,5lfi 50 61,529 0·54 0•40
UNITED KINGDOM 9,236 15,391,

### Extracting the text of all pages

In [12]:
import pdfplumber

pdf_path = "casualty_returns_1905.pdf"


with pdfplumber.open(pdf_path) as pdf:
    for page_num, page in enumerate(pdf.pages, start=1):
        text = page.extract_text()  # Extract full text
        tables = page.extract_tables()  # Extract any tables

        print(f"\n--- Page {page_num} ---\n")

        # Print extracted text
        if text:
            print("---------- Extracted Text --------:\n")
            print(text)
        else:
            print("No text detected.\n")

        # Print extracted tables
        if tables:
            print("\n <<<<<<<< Extracted Tables >>>>>>>>>>>:")
            for table in tables:
                for row in table:
                    print(row)  # Print each row of the table
        else:
            print("\nNo tables detected.\n")



--- Page 1 ---

---------- Extracted Text --------:

J'
RETURNS
OF
VESSELS TOTALLY LOST, CONDEMNED, &c.
lst January to 31st March, 1905.
71, FENCHURCH STREET, LONDON, E.C.
September, 1905.

No tables detected.


--- Page 2 ---

---------- Extracted Text --------:

TABLE No. 1 :-Showing the number, net and gross tonnage and nationality of STEAM VESSELS totally lost, condemned, &c., during the
quarter ended 31st March, 1905, as reported up to the 4th September, 1905; and showing also the number and tonnage of steam
vessels owned in each country. (Vessels under 100 tons gross are not included in this return.)
HOW LOST. PER
STEAM VESSELS OWNED, CENTAGE
ACCORDING TO TOTAL.(e) Lo ST
BROKEN UP,
LLOYD'S REGISTER BooK, ABANDONED CONDEMNED("), BURNT. I COLLISION. I FOUNDERED. I LOST, ETC. (b) MISSING. (o) WRECKED. (fl) (STEAM
FLAG. 1904-1905. AT SEA. I ETC. I I I I I I I I I I I I IV ESSELS).(e)
·-----------
I I I
No.
I Tons. I No. I To In s. No. To In s. - No. Tons. No. To In s. No To In s. - 

**Important**: All the PDFs are available here: https://archive.org/details/HECCR1890/1890/

# Steps to Extract Tables from the PDF - Register of Shipping

1. Download the PDF in Colab

In [None]:
!wget -O registry_of_shipping_1799.pdf "https://ia903101.us.archive.org/10/items/HECROS1799/ROS1799.pdf"


--2025-01-31 10:21:46--  https://ia903101.us.archive.org/10/items/HECROS1799/ROS1799.pdf
Resolving ia903101.us.archive.org (ia903101.us.archive.org)... 207.241.232.141
Connecting to ia903101.us.archive.org (ia903101.us.archive.org)|207.241.232.141|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 63389108 (60M) [application/pdf]
Saving to: ‘registry_of_shipping_1799.pdf’


In [None]:
import pdfplumber

pdf_path = "registry_of_shipping_1799.pdf"

with pdfplumber.open(pdf_path) as pdf:
    for page_num, page in enumerate(pdf.pages, start=1):
        tables = page.extract_tables()

        if tables:
            print(f"\n--- Page {page_num} ---")
            for table in tables:
                for row in table:
                    print(row)  # Print each row of the table


**Bad news**: table extraction doesn't work well on this PDF!

Let's try extracting the plain text of each page instead. This is not ideal, but it returns something we can work with.

In [None]:
import pdfplumber


pdf_path = "registry_of_shipping_1799.pdf"

with pdfplumber.open(pdf_path) as pdf:
    # for page_num, page in enumerate(pdf.pages, start=1):  # all pages
    # Let's try with the first 20 pages only
    for page_num, page in enumerate(pdf.pages[:20], start=1):
        text = page.extract_text()  # Extract text from the page

        if text:
            print(f"\n--- Page {page_num} ---")
            print(text[:2000])  # Print first 2000 characters (avoid printing too much)
        else:
            print(f"\n--- Page {page_num} --- (No text detected)")


Note: You should compare this output with the OCRed text that the Internet Archive provides for the same volume, stored at https://archive.org/stream/HECROS1799/ROS1799_djvu.txt, and see which is better to work with.

In [None]:
!wget -O ROS1799_djvu.txt "https://archive.org/download/HECROS1799/ROS1799_djvu.txt"


In [None]:
# Read the downloaded text file
text_file_path = "ROS1799_djvu.txt"

with open(text_file_path, "r", encoding="utf-8") as file:
    text_content = file.readlines()  # Read all lines

# Print first 1000 lines to inspect structure
for line_num, line in enumerate(text_content[:1000], start=1):
    print(f"{line_num}: {line.strip()}")
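
To compare the two text sources in a rough, repeatable way, one option is to measure how many tokens look like ordinary words or numbers. The sketch below is a crude heuristic under that assumption: the function name `word_like_ratio` and the token pattern are hypothetical, single letters are counted as noise, and the score is no substitute for reading sample pages side by side.

```python
import re

# A token counts as "clean" if it is a word of two or more letters (optionally
# followed by one punctuation mark) or a number such as 1,587 or 0.54.
CLEAN_TOKEN = re.compile(r"[A-Za-z]{2,}[.,;:]?|\d[\d,.]*")


def word_like_ratio(text):
    """Share of whitespace-separated tokens that look like plain words or numbers."""
    tokens = text.split()
    if not tokens:
        return 0.0
    clean = [token for token in tokens if CLEAN_TOKEN.fullmatch(token)]
    return len(clean) / len(tokens)


# Two short fragments in the style of the outputs above.
readable = "VESSELS TOTALLY LOST 1905"
noisy = "I To~: Ve~~-.el_s_ 1,587"
print(word_like_ratio(readable), word_like_ratio(noisy))
```

Applied to the same page from both sources, the higher ratio suggests the cleaner text, although a page of dense tables will score differently from a page of prose.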
