### Import dependencies

In [1]:
%load_ext autotime

import cv2
from IPython.display import display_html, display
from PIL import Image as PILImage

from img2table.document import Image
from img2table.ocr import TesseractOCR, PaddleOCR

time: 2.77 s (started: 2023-03-28 22:43:59 +02:00)


### Borderless table extraction

img2table includes an algorithm that identifies and extracts borderless tables (i.e., tables that are not fully bordered). Detection may be less reliable than for bordered tables, especially when the table contains multi-line cells.

#### Image used
<img src="data/borderless_ocr.jpg" width="425" height="550">

In [2]:
img = Image("data/borderless_ocr.jpg")
tesseract = TesseractOCR()

# Extract tables with Tesseract, with borderless table detection enabled
tables = img.extract_tables(ocr=tesseract, borderless_tables=True)

tables[0].df

Unnamed: 0,0,1,2,3
0,General Disclosures\nStandard,Disclosure,Description,Location of information
1,Organization Profile,102-1,Name of the organization,Cognizant
2,,102-2,"Activities, brands, products, and\nservices",2020 10-K
3,,102-3,Location of headquarters,"Teaneck, New Jersey (U.S)"
4,,102-4,Location of operations,2020 Annual Report
5,,102-5,Ownership and legal form,Cognizant Technology Solutions Corp is listed ...
6,,102-6,Markets served,2020 10-K
7,,102-7,Scale of the organization,2020 10-K
8,,102-8,Information on employees and\nother workers,2020 ESG Report: Our global business (p.7)
9,,102-9,Supply chain,2020 Annual Report\n2020 ESG Report: Supply ch...


time: 3.09 s (started: 2023-03-28 22:44:02 +02:00)
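In the output above, the header row is returned as the first data row and multi-line cells keep their `\n` line breaks. The following is a minimal post-processing sketch using pandas only. The `raw` frame is a hand-built stand-in for `tables[0].df` with a few rows copied from the output, and `tidy_table` is a hypothetical helper, not part of img2table.

```python
import pandas as pd

# Hypothetical stand-in for tables[0].df (rows copied from the output above)
raw = pd.DataFrame([
    ["General Disclosures\nStandard", "Disclosure", "Description", "Location of information"],
    ["Organization Profile", "102-1", "Name of the organization", "Cognizant"],
    [None, "102-2", "Activities, brands, products, and\nservices", "2020 10-K"],
])


def tidy_table(df: pd.DataFrame) -> pd.DataFrame:
    """Join multi-line cells and promote the first row to column headers."""
    # Replace line breaks inside string cells with a single space
    cleaned = df.apply(
        lambda col: col.map(lambda v: v.replace("\n", " ") if isinstance(v, str) else v)
    )
    # Use the first row as the header, then drop it from the data
    header = cleaned.iloc[0].tolist()
    cleaned = cleaned.iloc[1:].reset_index(drop=True)
    cleaned.columns = header
    return cleaned


tidy = tidy_table(raw)
print(tidy)
```

With a real extraction, `tidy_table(tables[0].df)` would be the equivalent call, assuming the first extracted row really is the header.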


### Extract multiple tables

Multiple borderless tables can be extracted from the same image. The example below uses a document from an <a href="https://aws.amazon.com/fr/blogs/machine-learning/merge-cells-and-column-headers-in-amazon-textract-tables/">AWS blog post</a>:

#### Original image

<img src="data/borderless_aws.jpg" alt="Document with borderless tables from AWS blogpost" width="50%" height="50%">

In [3]:
img = Image("data/borderless_aws.jpg")
ocr = PaddleOCR()

extracted_tables = img.extract_tables(ocr=ocr,
                                      borderless_tables=True)

for idx, table in enumerate(extracted_tables):
    display_html(table.html_repr(title=f"Extracted table n°{idx + 1}"), raw=True)

Unnamed: 0,0,1
0,Beginning Balance:,$8000.00
1,Deposits,$3005.50
2,Other Subtractions,-1539.55
3,Checks,
4,0.00,
5,Service Fees,
6,0.00,


Unnamed: 0,0,1,2,3,4,5
0,Date,Description,Details,Credits,Debits,Balance
1,2/4/2022,Life Insurance Payments,Credit,,445,9500.45
2,,Property Management,Credit,,300,9945.45
3,,Retail Store4,Credit,,65.75,10245.45
4,2/3/2022,Electricity Bill,Credit,,245.45,10311.2
5,,Water Bill,Credit,,312.85,10556.65
6,,Rental Deposit,Credit,3000,,10869.5
7,2/2/2022,Retail Store 3,Credit,,125,7869.5
8,,Retail Store 2 Refund,Debit,5.5,,7994.5
9,,Retail Store1,Credit,,45.5,8000


time: 11.8 s (started: 2023-03-28 22:44:05 +02:00)
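The transaction table above leaves the `Date` cell empty on rows that appear to share the date of the row above them. The sketch below shows one way to normalize such a table with pandas. It rests on several assumptions: that blank dates really mean "same as above", that the cells arrive as strings, and that `raw` (a hand-built subset of the output) resembles what `extracted_tables[1].df` contains. `normalize_transactions` is a hypothetical helper, not part of img2table.

```python
import pandas as pd

# Hypothetical stand-in for extracted_tables[1].df (subset of the rows above)
raw = pd.DataFrame([
    ["Date", "Description", "Details", "Credits", "Debits", "Balance"],
    ["2/4/2022", "Life Insurance Payments", "Credit", None, "445", "9500.45"],
    [None, "Property Management", "Credit", None, "300", "9945.45"],
    ["2/3/2022", "Electricity Bill", "Credit", None, "245.45", "10311.2"],
    [None, "Rental Deposit", "Credit", "3000", None, "10869.5"],
])


def normalize_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Promote the header row, fill blank dates, and parse amounts."""
    out = df.iloc[1:].reset_index(drop=True)
    out.columns = df.iloc[0].tolist()
    # Assumption: an empty Date cell repeats the date of the previous row
    out["Date"] = out["Date"].ffill()
    # Convert amount columns to floats; unparseable or empty cells become NaN
    for col in ["Credits", "Debits", "Balance"]:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out


transactions = normalize_transactions(raw)
print(transactions)
```

Using `errors="coerce"` keeps the sketch tolerant of OCR noise in numeric cells, at the cost of silently turning misread values into `NaN`, so a check for unexpected missing values is advisable afterwards.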
