# Extract Financial Statement Values

Financial statement analysis is critical to every organization: it enables companies to make better economic decisions that yield more income in the future. For a given financial statement, a rulebook is followed to extract the values associated with accrual, audit status, balance sheet, measurement date and pension plans, as shown (refer to the rulebook & sample data here).

## Task:

Build an NLP model that analyzes each document and looks for the relevant financial terms (described below).

The NLP model should learn from the documents available under the Training data folder, using the rules listed here.

Apply the same model to the Test Data documents to extract the relevant financial information.

The results need to be filled in and uploaded in the sheet provided: ‘Results.csv’ (download dataset).

Note: No training labels are explicitly available for this problem statement. You will be able to test your model's accuracy by submitting values extracted for Test data.
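Since no labels are available, one plausible starting point is a purely rule-based extractor. The sketch below guesses the accounting basis by keyword matching. The patterns, the function name `guess_accounting_basis` and the default value are illustrative assumptions, not the contents of the actual rulebook.

```python
import re

# Labels come from the data description. The patterns are assumptions;
# "modified ..." phrases are checked first so that "modified accrual basis"
# is not mistaken for plain "accrual basis".
BASIS_PATTERNS = [
    ("Modified Accrual", r"modified\s+accrual"),
    ("Modified Cash", r"modified\s+cash"),
    ("Regulatory", r"regulatory\s+basis"),
    ("Accrual", r"accrual\s+basis"),
    ("Cash", r"cash\s+basis"),
]


def guess_accounting_basis(text, default="Accrual"):
    """Return the first label whose pattern occurs in the text, else the default."""
    lowered = text.lower()
    for label, pattern in BASIS_PATTERNS:
        if re.search(pattern, lowered):
            return label
    return default


print(guess_accounting_basis(
    "The statements are prepared using the modified accrual basis of accounting."
))
```

Real statements often mention more than one basis, so a usable rule would also need to look at where in the document the phrase occurs.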

## Data Description:

| Columns | Description of Values |
| --- | --- |
| Credit Name, State, Security ID, Org ID, FYE | Identifiers from the documents, already provided to you in “Results.csv” |
| Accounting Basis | Identify ‘Basis of Accounting’ as one of [‘Accrual’, ‘Cash’, ‘Modified Accrual’, ‘Modified Cash’, ‘Regulatory’] |
| Pension Plan 1 Name* | Pension Plan Identifier for the pension plan with the highest Total Pension Liability (0 if no date is specified) |
| Pension Plan 1 Measurement Date | Reporting date for Pension Plan 1 (0 if no date is specified) [DD/MM/YYYY] |
| Pension Plan 1 Total Pension Liability | Total pension plan liability for Pension Plan 1 (0 if no liability is specified) [int64] |
| Balance Sheet Cash | Total value of the ‘Cash & Cash Equivalents’ row of the Balance Sheet for governmental funds. This also includes all other row items that the rules highlight as cash and cash equivalents (0 if no balance sheet amount is specified) [int64] |
| Pension Plan 2 Name* | Pension Plan Identifier for the pension plan with the second highest Total Pension Liability (0 if no date is specified) |
| Pension Plan 2 Measurement Date | Reporting date for Pension Plan 2 (0 if no date is specified) [DD/MM/YYYY] |
| Pension Plan 2 Total Pension Liability | Total pension plan liability for Pension Plan 2 (0 if no liability is specified) [int64] |

Note: * Map pension plan names to industry standards as specified here.
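The formats above ([DD/MM/YYYY] for dates, [int64] for amounts, 0 for missing values) can be enforced with small helpers before writing the CSV. This is a minimal sketch: the helper names and the list of accepted input formats are assumptions about how values might appear in the documents.

```python
import re
from datetime import datetime

# Input formats are assumptions; extend the tuple as new styles turn up.
DATE_FORMATS = ("%B %d, %Y", "%b %d, %Y", "%m/%d/%Y", "%Y-%m-%d")


def to_ddmmyyyy(raw):
    """Return the date as DD/MM/YYYY, or 0 if it is missing or unparseable."""
    if not raw:
        return 0
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(str(raw).strip(), fmt).strftime("%d/%m/%Y")
        except ValueError:
            continue
    return 0


def to_int_amount(raw):
    """Drop currency symbols and thousands separators; return 0 if no number remains."""
    cleaned = re.sub(r"[^\d.]", "", str(raw))
    try:
        return int(float(cleaned))
    except ValueError:
        return 0


print(to_ddmmyyyy("June 30, 2017"))
print(to_int_amount("$ 1,234,567"))
```

Negative amounts and parenthesized figures are not handled here; they would need an extra rule.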

### Data Volume:

Train Data: 479 Documents, 1.5GB

Test Data: 98 Documents, 373.7MB

Submission Format: You need to update Results.csv provided here, with your predictions and upload it online.

Data Files: Download Dataset (~1.7GB)

Evaluation Metric:

[ML_Model] accuracy_score


[Offline] Source Code, Models/Logic used
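The metric name suggests an exact-match accuracy in the style of scikit-learn's `accuracy_score`: the fraction of predicted values that equal the expected ones. How the organizers apply it across columns is not stated here, so the sketch below only illustrates the basic computation on one hypothetical column.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the expected value."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        return 0.0
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Hypothetical expected and predicted values for one column
expected = ["Accrual", "Cash", "Accrual", "Cash"]
predicted = ["Accrual", "Accrual", "Accrual", "Cash"]
print(accuracy(expected, predicted))
```

Under an exact-match metric, a correct value in the wrong format (for example a date not written as DD/MM/YYYY) would count as a miss.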

In [1]:
import pandas as pd
import numpy as np

In [3]:
res=pd.read_csv('../Results5cf9666.csv')
res.head()

Unnamed: 0,For Lookup,Year,Credit Name,State,Security ID,Org ID,FYE,Accounting Basis,Pension Plan 1 Name,Pension Plan 1 Measurement Date,Pension Plan 1 Total Pension Liability,Balance Sheet Cash,Pension Plan 2 Name,Pension Plan 2 Measurement Date,Pension Plan 2 Total Pension Liability
0,1013737_2016,2016,Jeannette,PA,1013737,8612,31/12/2016,,,,,,,,
1,1028564_2016,2016,Dickinson,ND,1028564,12020,31/12/2016,,,,,,,,
2,711506_2016,2016,Rawlins County,KS,711506,494151,31/12/2016,,,,,,,,
3,702336_2016,2016,Stone County,MO,702336,326056,31/12/2016,,,,,,,,
4,1029684_2017,2017,Anchorage,AK,1029684,27613,31/12/2017,,,,,,,,


## Sample submission with hard coded values

In [5]:
p=['GA_Muni Asso',
'CO_Nonemer Emp Ret Sys',
'MI_Gen Ret Sys',
'MO_Polman & Firemen Pen Plan',
'TX_Firemen Reli & Ret Fund']

import random
random.seed(1994)
random.choice(p)

'MO_Polman & Firemen Pen Plan'

In [6]:
pen1=[]
pen2=[]
for j in range(98):
    pen1.append(random.choice(p))
    pen2.append(random.choice(p))


In [7]:
res.shape

(98, 15)

In [55]:
res['Accounting Basis']='Accrual'
res['Pension Plan 1 Total Pension Liability']=0
res['Pension Plan 2 Total Pension Liability']=0
res['Balance Sheet Cash']=0

res['Pension Plan 1 Name']=pen1
res['Pension Plan 2 Name']=pen2
res['Pension Plan 1 Measurement Date']='June 30, 2017'
res['Pension Plan 2 Measurement Date']='June 30, 2017'



In [56]:
res.head()

Unnamed: 0,For Lookup,Year,Credit Name,State,Security ID,Org ID,FYE,Accounting Basis,Pension Plan 1 Name,Pension Plan 1 Measurement Date,Pension Plan 1 Total Pension Liability,Balance Sheet Cash,Pension Plan 2 Name,Pension Plan 2 Measurement Date,Pension Plan 2 Total Pension Liability
0,1013737_2016,2016,Jeannette,PA,1013737,8612,31/12/2016,Accrual,MO_Polman & Firemen Pen Plan,"June 30, 2017",0,0,MI_Gen Ret Sys,"June 30, 2017",0
1,1028564_2016,2016,Dickinson,ND,1028564,12020,31/12/2016,Accrual,MO_Polman & Firemen Pen Plan,"June 30, 2017",0,0,MI_Gen Ret Sys,"June 30, 2017",0
2,711506_2016,2016,Rawlins County,KS,711506,494151,31/12/2016,Accrual,TX_Firemen Reli & Ret Fund,"June 30, 2017",0,0,CO_Nonemer Emp Ret Sys,"June 30, 2017",0
3,702336_2016,2016,Stone County,MO,702336,326056,31/12/2016,Accrual,GA_Muni Asso,"June 30, 2017",0,0,MI_Gen Ret Sys,"June 30, 2017",0
4,1029684_2017,2017,Anchorage,AK,1029684,27613,31/12/2017,Accrual,MI_Gen Ret Sys,"June 30, 2017",0,0,MI_Gen Ret Sys,"June 30, 2017",0


In [57]:
res.columns

Index(['For Lookup', 'Year', 'Credit Name', 'State', 'Security ID', 'Org ID',
       'FYE', 'Accounting Basis', 'Pension Plan 1 Name',
       'Pension Plan 1 Measurement Date',
       'Pension Plan 1 Total Pension Liability', 'Balance Sheet Cash',
       'Pension Plan 2 Name', 'Pension Plan 2 Measurement Date',
       'Pension Plan 2 Total Pension Liability'],
      dtype='object')

In [58]:
res.to_csv('newSub2.csv',index=False)

### This hard-coded submission scored around 12-14.XX

In [8]:
# Column names must match res.columns exactly
res.set_index(['For Lookup', 'Year', 'Credit Name', 'State', 'Security ID',
       'Org ID', 'FYE']).to_csv('withIndex2.csv')

In [15]:
# Keep only the submission columns, in the order used by Results.csv
res[['Credit Name', 'State', 'Security ID',
       'Org ID', 'FYE', 'Accounting Basis', 'Pension Plan 1 Name',
       'Pension Plan 1 Measurement Date',
       'Pension Plan 1 Total Pension Liability', 'Balance Sheet Cash',
       'Pension Plan 2 Name', 'Pension Plan 2 Measurement Date',
       'Pension Plan 2 Total Pension Liability']].to_csv('Resultsq2.csv',index=False)

In [1]:
### Installing packages

In [16]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading https://files.pythonhosted.org/packages/b4/01/68fcc0d43daf4c6bdbc6b33cc3f77bda531c86b174cac56ef0ffdb96faab/PyPDF2-1.26.0.tar.gz (77kB)
    100% |████████████████████████████████| 81kB 642kB/s
Building wheels for collected packages: PyPDF2
  Running setup.py bdist_wheel for PyPDF2 ... done
  Stored in directory: /Users/rajat.ranjan/Library/Caches/pip/wheels/53/84/19/35bc977c8bf5f0c23a8a011aa958acd4da4bbd7a229315c1b7
Successfully built PyPDF2
Installing collected packages: PyPDF2
Successfully installed PyPDF2-1.26.0


## All test files

In [14]:
import os
# List the test folder itself; os.listdir() on a single PDF path would raise an error
print(os.listdir('Test Data'))

['Minneapolis_KS_523347_G O Municipality & County_City _2017.pdf', 'Pima Cnty_AZ_1422_G O Municipality & County_County_2018.pdf', 'Audubon Cnty_IA_303832_G O Municipality & County_County_2017.pdf', 'Ipswich Twn_MA_340179_G O Municipality & County_Town_2018.pdf', 'Oro Vy_AZ_20742_G O Municipality & County_Town_2017.pdf', 'Jeannette_PA_8612_G O Municipality & County_City _2016.pdf', 'Yonkers_NY_3562_G O Municipality & County_City _2018.pdf', 'River Trails Pk Dist_IL_347152_G O Municipality & County_Special District_2017.pdf', 'South Florida Wtr Mgt Dist_FL_13376_G O Municipality & County_Special District_2017.pdf', 'Garfield Township_MI_521870_G O Municipality & County_Town_2017.pdf', 'Dodge City_KS_15059_G O Municipality & County_City _2017.pdf', 'Annapolis_MD_1784_G O Municipality & County_City_2018.pdf', 'Sussex Cnty_DE_13218_G O Municipality & County_County_2018.pdf', 'Talladega_AL_8999_G O Municipality & County_City_2017.pdf', 'Kitsap Cnty_WA_11962_G O Municipality & County_County_2

In [12]:
res[res.Year==2016]

Unnamed: 0,For Lookup,Year,Credit Name,State,Security ID,Org ID,FYE,Accounting Basis,Pension Plan 1 Name,Pension Plan 1 Measurement Date,Pension Plan 1 Total Pension Liability,Balance Sheet Cash,Pension Plan 2 Name,Pension Plan 2 Measurement Date,Pension Plan 2 Total Pension Liability
0,1013737_2016,2016,Jeannette,PA,1013737,8612,31/12/2016,Accrual,,,0,0,,,0
1,1028564_2016,2016,Dickinson,ND,1028564,12020,31/12/2016,Accrual,,,0,0,,,0
2,711506_2016,2016,Rawlins County,KS,711506,494151,31/12/2016,Accrual,,,0,0,,,0
3,702336_2016,2016,Stone County,MO,702336,326056,31/12/2016,Accrual,,,0,0,,,0
86,1022225_2016,2016,Buford,GA,1022225,14231,6/30/2016,Accrual,,,0,0,,,0
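The file names in the test folder appear to encode the same identifiers as Results.csv: for example, `Jeannette_PA_8612_..._2016.pdf` lines up with the row for Jeannette, PA, Org ID 8612, Year 2016. A minimal sketch for splitting a file name into those fields follows. The helper name `parse_pdf_name` is hypothetical, and it assumes the pattern `<Credit Name>_<State>_<Org ID>_..._<Year>.pdf` with no underscore inside the credit name.

```python
import os


def parse_pdf_name(filename):
    """Split '<Credit Name>_<State>_<Org ID>_..._<Year>.pdf' into its identifiers."""
    stem, _ = os.path.splitext(filename)
    parts = stem.split("_")
    return {
        "Credit Name": parts[0].strip(),
        "State": parts[1].strip(),
        "Org ID": int(parts[2]),
        "Year": int(parts[-1]),
    }


info = parse_pdf_name("Jeannette_PA_8612_G O Municipality & County_City _2016.pdf")
print(info)
```

Credit names in the file names are sometimes abbreviated (for example "Pima Cnty"), so matching on `Org ID` and `Year` is likely more reliable than matching on the name.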


In [23]:
!pip install pdfminer.six

Collecting pdfminer.six
  Downloading https://files.pythonhosted.org/packages/8a/fd/6e8746e6965d1a7ea8e97253e3d79e625da5547e8f376f88de5d024bacb9/pdfminer.six-20181108-py2.py3-none-any.whl (5.6MB)
    100% |████████████████████████████████| 5.6MB 302kB/s
Collecting pycryptodome (from pdfminer.six)
  Downloading https://files.pythonhosted.org/packages/b7/94/74a003a7dfbffed6064679e9f3c87a7b7866c652dc3c647ca2a01822d7ca/pycryptodome-3.8.2-cp37-cp37m-macosx_10_6_intel.whl (10.1MB)
    100% |████████████████████████████████| 10.1MB 945kB/s
Installing collected packages: pycryptodome, pdfminer.six
Successfully installed pdfminer.six-20181108 pycryptodome-3.8.2


### Installing packages

In [31]:
!pip install Pillow  # PIL is published on PyPI as "Pillow"; "pip install PIL" finds no matching distribution
!pip install pytesseract  # also needs the Tesseract OCR binary installed on the system
!pip install pdf2image  # also needs poppler installed on the system

Collecting pytesseract
  Downloading https://files.pythonhosted.org/packages/1d/40/3f72d13d0f347bf688ff189b6d6bb369125c0bed9ed4b15e7f20c65123a8/pytesseract-0.2.7.tar.gz (169kB)
    100% |████████████████████████████████| 174kB 1.6MB/s
Building wheels for collected packages: pytesseract
  Running setup.py bdist_wheel for pytesseract ... done
  Stored in directory: /Users/rajat.ranjan/Library/Caches/pip/wheels/cd/4a/30/998e01b892300ba0ccce7b806b6e889794605a384dac81a49a
Successfully built pytesseract
Installing collected packages: pytesseract
Successfully installed pytesseract-0.2.7
Collecting pdf2image
  Downloading https://files.pythonhosted.org/packages/04/21/64583455a8e41b2838a51789807889d947575b828f816a0ade1160c1e2c4/pdf2image-1.6.0.tar.gz
Building wheels for collected packages: pdf2image
  Running setu

In [8]:
import os
test_pdf=os.listdir('Test Data')
test_pdf

['Minneapolis_KS_523347_G O Municipality & County_City _2017.pdf',
 'Pima Cnty_AZ_1422_G O Municipality & County_County_2018.pdf',
 'Audubon Cnty_IA_303832_G O Municipality & County_County_2017.pdf',
 'Ipswich Twn_MA_340179_G O Municipality & County_Town_2018.pdf',
 'Oro Vy_AZ_20742_G O Municipality & County_Town_2017.pdf',
 'Jeannette_PA_8612_G O Municipality & County_City _2016.pdf',
 'Yonkers_NY_3562_G O Municipality & County_City _2018.pdf',
 'River Trails Pk Dist_IL_347152_G O Municipality & County_Special District_2017.pdf',
 'South Florida Wtr Mgt Dist_FL_13376_G O Municipality & County_Special District_2017.pdf',
 'Garfield Township_MI_521870_G O Municipality & County_Town_2017.pdf',
 'Dodge City_KS_15059_G O Municipality & County_City _2017.pdf',
 'Annapolis_MD_1784_G O Municipality & County_City_2018.pdf',
 '.DS_Store',
 'Sussex Cnty_DE_13218_G O Municipality & County_County_2018.pdf',
 'Talladega_AL_8999_G O Municipality & County_City_2017.pdf',
 'Kitsap Cnty_WA_11962_G O Mu

In [20]:
os.getcwd()
os.chdir('/Users/rajat.ranjan/Desktop/ML/CIRSIL/Crisil Challenge Data/')

## Importing the PDFs
## Converting them to images with pdf2image
## Using pytesseract to convert the images to strings and save them to a text file

In [21]:
from PIL import Image
import pytesseract
from pdf2image import convert_from_path
import os

# Working directory that contains the 'Test Data' folder
base_dir = os.getcwd()

for pdf in test_pdf:
    # Skip non-PDF entries such as '.DS_Store'
    if not pdf.lower().endswith('.pdf'):
        continue
    print('for ----', pdf)
    PDF_file = os.path.join(base_dir, "Test Data", pdf)
    # Render every page of the PDF as an image at 500 DPI
    pages = convert_from_path(PDF_file, 500)
    # One output directory per PDF, named after the part of the file name before '&'
    name = pdf.split('&')[0]
    path = os.path.join(base_dir, name)

    try:
        # Create target directory
        os.mkdir(path)
        print(" Created ")
    except FileExistsError:
        # The directory is left over from an earlier run
        print('error')

    # Part 1 - save each page as page_1.jpg, page_2.jpg, ..., page_n.jpg
    image_counter = 1
    for page in pages:
        filename = os.path.join(path, "page_" + str(image_counter) + ".jpg")
        page.save(filename, 'JPEG')
        image_counter = image_counter + 1

    print('Part 2')
    # Part 2 - recognizing text from the images using OCR
    # Total number of pages
    filelimit = image_counter - 1
    # Text file that collects the OCR output of all pages
    outfile = os.path.join(path, "out_text" + name + ".txt")

    # Open in append mode so the text of all pages goes into the same file;
    # the with-block closes the file once every page has been written
    with open(outfile, "a") as f:
        print('Part 3')
        for i in range(1, filelimit + 1):
            filename = os.path.join(path, "page_" + str(i) + ".jpg")
            # Recognize the text in the image using pytesseract
            text = str(pytesseract.image_to_string(Image.open(filename)))
            # Rejoin words that were hyphenated across a line break:
            # every '-\n' is replaced with ''
            text = text.replace('-\n', '')
            f.write(text)

for ---- Geneva County_AL_571349_G O Municipality & County_County_2017.pdf
error
Part 2
Part 3
for ---- La Paz Cnty_AZ_372467_G O Municipality & County_County_2017.pdf
 Created 
Part 2
Part 3
for ---- Angleton Danbury Hosp Dist_TX_15837_G O Municipality & County_Special District_2017.pdf
 Created 
Part 2
Part 3
for ---- Crestview_FL_2916_G O Municipality & County_City_2017.pdf
 Created 
Part 2
Part 3
for ---- Shelby Cnty_AL_21033_G O Municipality & County_County_2017.pdf
 Created 
Part 2
Part 3
for ---- Smithfield_RI_7948_G O Municipality & County_Town_2018.pdf
 Created 
Part 2
Part 3
for ---- Fond Du Lac_WI_6846_G O Municipality & County_City_2017.pdf
 Created 
Part 2
Part 3
for ---- Clay Twp_MI_345734_G O Municipality & County_Town_2018.pdf
 Created 
Part 2
Part 3
for ---- Hot Springs_SD_553948_G O Municipality & County_City _2017.pdf
 Created 
Part 2
Part 3
for ---- Centerville City_UT_467721_G O Municipality & County_City _2018.pdf
 Created 
Part 2
Part 3
for ---- Lake Benton_MN_23

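## This creates one directory per PDF, laid out as follows:

+PDF name

--images (page_1.jpg ... page_n.jpg)

--pdf.txt (the OCR text, written as out_text<PDF name>.txt)

+PDF name

--images

--pdf.txt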
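Once the OCR text files exist, individual values can be searched for with regular expressions. The sketch below looks for amounts that follow the phrase "total pension liability". The pattern, the function name and the choice to return the largest match are assumptions for illustration, not the rulebook's logic, and the sample text is made up.

```python
import re

# Assumed pattern: the phrase, up to 40 non-digit characters (spaces, "$",
# footnote markers), then a number that may contain thousands separators.
TPL_PATTERN = re.compile(
    r"total\s+pension\s+liability\D{0,40}?(\d[\d,]*)", re.IGNORECASE
)


def find_total_pension_liability(text):
    """Return the largest amount found after the phrase, or 0 if there is none."""
    amounts = [int(m.group(1).replace(",", "")) for m in TPL_PATTERN.finditer(text)]
    return max(amounts) if amounts else 0


# Hypothetical OCR output, not taken from any real document
sample = "Total pension liability $ 12,345,678\nPlan fiduciary net position 9,000,000"
print(find_total_pension_liability(sample))
```

OCR noise such as misread digits or broken table rows will defeat a pattern like this, so extracted values are worth spot-checking against the page images.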