> The code below follows this pipeline:

    1- Text extraction from a PDF file (OCR with pytesseract)
    2- Sentence segmentation using spaCy
    3- Fuzzy matching of every sentence against each of our KPIs (to find the sentences belonging to a given KPI)
    4- NER or dependency parsing to extract the values (not implemented yet)
    

In [9]:
from fuzzywuzzy import fuzz

import warnings
warnings.filterwarnings("ignore")

# Extracting PDF Text using *pytesseract*

In [2]:
# Get PDF text
# ref: https://www.youtube.com/watch?v=bk5u3rZk8Vk&t=5s

from pdf2image import convert_from_path
from pytesseract import image_to_string


def convert_pdf_to_img(pdf_file):
    # Render each PDF page as a PIL image (pdf2image requires poppler)
    return convert_from_path(pdf_file)

def convert_image_to_text(file):
    # Run Tesseract OCR on a single page image
    return image_to_string(file)

def get_text_from_any_pdf(pdf_file):
    # OCR every page and concatenate the results in page order
    images = convert_pdf_to_img(pdf_file)
    return "".join(convert_image_to_text(img) for img in images)


path = 'Boskalis_Sustainability_Report_2020.pdf'
text = get_text_from_any_pdf(path)

In [3]:
print(text)

ti?

& Boskalis

SUSTAINABILITY
REPORT 2020

| SUSTAINABILITY
REPORT 2020

 

 
| KEY FIGURES

REVENUE BY SEGMENT (in EUR million)

Dredging & Inland Infra 7

HI Offshore Energy

HE Towage & Salvage 244
Eliminations (-30)

 

SUSTAINABILITY REPORT 2020 - BOSKALIS

 

KEY FIGURES

 

 

{in EUR million, unless stated otherwise] 2020 2019
Revenue 2,525 2,645
Order book 5,306 4,722
EBITDA 404 376
Net result from joint ventures and associates 19* 26
Depreciation and amortization 264 265
Operating result 140 28
Exceptional items (charges/income) -195 82
EBIT -56 111
Net operating profit 90 “1
Net profit (loss) -97 75
Net group profit (loss) -97 75
Cash flow 355* 340
Shareholders’ equity 2,283 2,491
RATIOS (IN PERCENTAGES)

EBIT as % of revenue 5.5* 4.2
Return on capital employed 3.9* 29
Return on equity 3.8* 3.0
Solvency 50.5 54.3
FIGURES PER SHARE (IN EUR)

Profit 0.69* 0.56
Dividend (proposal) 0.50 -
Cash flow 2.48* 2.55
NON-FINANCIAL INDICATORS

Employees including associated companies 

# Extracting Sentences using *spacy*

In [4]:
import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp(text)

# Each element is a spaCy Span; use str(sent) or sent.text to get the plain string
sent_list = list(doc.sents)

In [5]:
print(sent_list)

[ti?

& Boskalis

SUSTAINABILITY
REPORT 2020

| SUSTAINABILITY
REPORT 2020

 

 
| KEY FIGURES

REVENUE BY SEGMENT (in EUR million)

Dredging & Inland Infra 7

HI Offshore Energy

HE Towage & Salvage 244
Eliminations (-30)

 

SUSTAINABILITY REPORT 2020 - BOSKALIS

 

KEY FIGURES

 

 

{in EUR million, unless stated otherwise] 2020 2019
Revenue 2,525 2,645
Order book 5,306 4,722
EBITDA 404 376
Net result from joint ventures and associates 19* 26
Depreciation and amortization 264 265
Operating result 140 28
Exceptional items (charges/income) -195 82
EBIT -56, 111
Net operating profit 90 “1
Net profit (loss) -97, 75
Net group profit (loss) -97, 75
Cash flow 355* 340
Shareholders’ equity 2,283 2,491
RATIOS (IN PERCENTAGES), 

EBIT as % of revenue 5.5* 4.2
Return on capital employed 3.9* 29
Return on equity 3.8* 3.0
Solvency 50.5 54.3
FIGURES PER SHARE (IN EUR)

Profit 0.69* 0.56
Dividend (proposal) 0.50 -
Cash flow 2.48* 2.55, 
NON-FINANCIAL INDICATORS

Employees including associated co
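
Because the OCR output is mostly tabular, the segments printed above are very long and can span many table rows. One possible alternative, which is not part of the original notebook, is to treat each non-empty line as a candidate unit for matching. A minimal sketch, where `candidate_lines` and the sample string are hypothetical names introduced for illustration:

```python
def candidate_lines(text):
    """Split OCR text into stripped, non-empty lines (one table row per item)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical sample that mimics the OCR output above
sample = "KEY FIGURES\n\n \n\nRevenue 2,525 2,645\nOrder book 5,306 4,722\n"
for line in candidate_lines(sample):
    print(line)
```

Line-level units keep each table row separate, at the cost of breaking real prose sentences that wrap across lines, so the two approaches may need to be combined.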

# Fuzzy Name Matching (KPIs to Sentences)

In [6]:
# import KPIs
import pandas as pd
df_kpi = pd.read_excel('GRI KPI list.xlsx')
df_kpi

Unnamed: 0,Description of KPI,KPI,ESG,category,gri_disclosure_sub_code
0,"Scale of the organization, including: Total nu...",total number of employees,Governance,102.07,ai
1,"Scale of the organization, including: total nu...",total number of operations,Governance,102.07,aii
2,"Scale of the organization, including: Scale of...",net sales (for private sector),Governance,102.07,aiii
3,,net revenues (for public sector),Governance,102.07,aiii
4,"Scale of the organization, including: quantity...",quantity of products,Governance,102.07,av
...,...,...,...,...,...
157,,percentage of investment agreements that inclu...,Social,412.03,a
158,Percentage of new suppliers that were screened...,Percentage of new suppliers that were screened...,Social,414.01,a
159,Number of suppliers assessed for social impacts,Number of suppliers assessed for social impacts,Social,414.02,a
160,Number of suppliers identified as having signi...,Number of suppliers having significant actual ...,Social,414.02,b


In [7]:
df_kpi = df_kpi[['KPI']].copy()   # copy so that adding a column later does not write to a slice
df_kpi

Unnamed: 0,KPI
0,total number of employees
1,total number of operations
2,net sales (for private sector)
3,net revenues (for public sector)
4,quantity of products
...,...
157,percentage of investment agreements that inclu...
158,Percentage of new suppliers that were screened...
159,Number of suppliers assessed for social impacts
160,Number of suppliers having significant actual ...


In [10]:
# check for the 1st kpi
kpi = "total number of employees"

for sent in sent_list:
    # sent is a spaCy Span, so compare its plain string
    if fuzz.token_set_ratio(kpi, str(sent)) > 90:
        print(str(sent))
        print('--------------------------------')



NUMBER OF EMPLOYEES

 

 

 

 

 

2020 2019 NATIONALITIES 2020 2019
Boskalis 6,137 5,812 Number of different nationalities 84 79
Anglo Eastern 1,347 1,321
Subtotal 7/A84 7.133 \WOMEN/MEN RATIOS 2020 2019
Joint Ventures 2,429 2,471
TOTAL 9,913 9,604 Women 14% 14%
Men 86% 86%
TOTAL 100% 100%
COMPOSITION OF WORKFORCE
NUMBER OF EMPLOYEES 2020
--------------------------------


Scope of work
The scope of our work was limited to assurance over the following
information included within the Report for the period 1st January to
31st December 2020 (the ‘Selected Information’):
+ Direct greenhouse gas (GHG) emissions (Scope 1);
+ Fuel consumption of marine gas oil (MGO) and heavy fuel oil (HFO)
from the fleet;
» Number of employees broken down by:
- employment contract (permanent or temporary contract) and
by gender;
- employment type (parttime, full-time) and by gender;
- country and number of nationalities;
+ Inflow and outflow of employees broken down by age (<30, 30-50, >50)
and gender, a
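
Both matches above are long blocks rather than single sentences. This is expected with a token-set score: once every word of the KPI appears somewhere in a block, the score reaches 100 regardless of how much other text the block contains. The sketch below illustrates the idea with the standard library only. It is a simplified approximation written for this explanation, not fuzzywuzzy's own code, and `simple_token_set_ratio` is a hypothetical name; its scores may differ slightly from `fuzz.token_set_ratio`.

```python
import re
from difflib import SequenceMatcher


def tokens(s):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def simple_token_set_ratio(a, b):
    """Approximate token-set similarity on a 0-100 scale (illustrative only)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0
    common = " ".join(sorted(ta & tb))
    combined_a = (common + " " + " ".join(sorted(ta - tb))).strip()
    combined_b = (common + " " + " ".join(sorted(tb - ta))).strip()

    def ratio(x, y):
        return round(100 * SequenceMatcher(None, x, y).ratio())

    # The best of the three comparisons is the score
    return max(ratio(common, combined_a),
               ratio(common, combined_b),
               ratio(combined_a, combined_b))


kpi = "total number of employees"
block = "NUMBER OF EMPLOYEES 2020 2019 Boskalis 6,137 5,812 TOTAL 9,913 9,604"
print(simple_token_set_ratio(kpi, block))
print(simple_token_set_ratio("quantity of products", block))
```

Since all four KPI tokens occur in `block`, the shared-token string equals the KPI's own token string and the first comparison is a perfect match. A threshold close to 100 therefore selects blocks that contain all KPI words, not blocks that are about the KPI.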

In [11]:
from fuzzywuzzy import fuzz

# Scores range from 0 to 100. token_set_ratio returns 100 when every token of one
# string appears in the other, so 100 does not require an exact match.
token_set_ratio_threshold = 95

# Custom function to find sentences belonging to a given KPI
def sent_extraction_per_kpi(kpi, sentence_list):
    result = []
    for sent in sentence_list:
        if fuzz.token_set_ratio(kpi, str(sent)) > token_set_ratio_threshold:
            result.append(sent)
    return str(result)        # the list is stored in the df cell as its string representation

In [12]:
df_kpi['extracted_sent'] = df_kpi['KPI'].apply(lambda x: sent_extraction_per_kpi(x, sent_list))

with pd.option_context('display.max_rows', None,
                       'display.max_columns', None,
                       'display.max_colwidth', None,   # None displays the full text of a cell (-1 is deprecated since pandas 1.0)
                       'display.precision', 3,
                       'display.colheader_justify', 'left'):
    display(df_kpi)

Unnamed: 0,KPI,extracted_sent
0,total number of employees,"[\n\nNUMBER OF EMPLOYEES\n\n \n\n \n\n \n\n \n\n \n\n2020 2019 NATIONALITIES 2020 2019\nBoskalis 6,137 5,812 Number of different nationalities 84 79\nAnglo Eastern 1,347 1,321\nSubtotal 7/A84 7.133 \WOMEN/MEN RATIOS 2020 2019\nJoint Ventures 2,429 2,471\nTOTAL 9,913 9,604 Women 14% 14%\nMen 86% 86%\nTOTAL 100% 100%\nCOMPOSITION OF WORKFORCE\nNUMBER OF EMPLOYEES 2020, \n\nScope of work\nThe scope of our work was limited to assurance over the following\ninformation included within the Report for the period 1st January to\n31st December 2020 (the ‘Selected Information’):\n+ Direct greenhouse gas (GHG) emissions (Scope 1);\n+ Fuel consumption of marine gas oil (MGO) and heavy fuel oil (HFO)\nfrom the fleet;\n» Number of employees broken down by:\n- employment contract (permanent or temporary contract) and\nby gender;\n- employment type (parttime, full-time) and by gender;\n- country and number of nationalities;\n+ Inflow and outflow of employees broken down by age (<30, 30-50, >50)\nand gender, and outflow by reason;\n+ Percentage of employees covered by collective bargaining agreements\nbroken down by gender;\n+ Composition of workforce broken down by gender and by age (<30,\n30-50, <50);\n+ Number of training hours broken down by gender and by job category\n(management, office staff, project staff, crew/yard staff);\n+ Talent management and engagement;\n+ Lost Time Injury Frequency (LTIF) and Total Recordable Injury Rate (TRIR);\n+ Total number of Lost Time Injuries (LTIs) and fatalities;\n+ Prevention of occupational and other diseases;\n+ Spend represented by strategic suppliers; and\n+ Percentage of strategic suppliers who have signed the Boskalis Supplier\nCode of Conduct.]"
1,total number of operations,[]
2,net sales (for private sector),[]
3,net revenues (for public sector),[]
4,quantity of products,[]
5,quantity of services,[]
6,total number of permanent female employees,[]
7,total number of permanent male employees,[]
8,total number of temporary female employees,[]
9,total number of temporary male employees,[]
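
Step 4 of the pipeline (extracting the actual values) is not implemented in this notebook. As a rough placeholder for NER or dependency parsing, a plain regular expression can already pull the numbers out of a matched table row. The sketch below is only an illustration under stated assumptions: `extract_row_values` is a hypothetical helper, the row is assumed to start with its label, and the first two numbers are assumed to be the 2020 and 2019 columns.

```python
import re

# Integers with thousands separators, decimals, negatives and percentages,
# e.g. 9,913 or 50.5 or -56 or 14%
NUMBER = re.compile(r"-?\d{1,3}(?:,\d{3})+(?:\.\d+)?%?|-?\d+(?:\.\d+)?%?")


def extract_row_values(line, label, n_columns=2):
    """Return the first n_columns numbers that follow label at the start of line."""
    if not line.lower().startswith(label.lower()):
        return []
    found = NUMBER.findall(line[len(label):])[:n_columns]
    # Drop thousands separators and percent signs before converting
    return [float(tok.replace(",", "").rstrip("%")) for tok in found]


print(extract_row_values("TOTAL 9,913 9,604 Women 14% 14%", "total"))
print(extract_row_values("Solvency 50.5 54.3", "Solvency"))
```

OCR errors such as `7/A84` in the subtotal row would still yield wrong or missing values, so any extracted figure needs to be checked against the source PDF.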
