# Qarik Project
## Author: Shuo Xu

### Step 1: Convert each pdf to a readable format so we can extract its content.

#### Get the pdf titles in the dataset

In [1]:
import os

mypath = r"\Users\xushu\Downloads\Qarik Project\World_Bank_Loans"
# Collect every file name in the dataset folder
pdfs = os.listdir(mypath)

In [2]:
print("There are",len(pdfs), "pdfs.")

There are 3205 pdfs.


#### Let's look at the title of a random pdf:

In [3]:
import random
# Pick one file name at random from the whole list
n = random.randrange(len(pdfs))
pdfs[n]

'1990_april_24_587321468019152780_conformed-copy--l3186--forestry-sector-project--loan-agreement.pdf'

#### The title has the structure "Year", "Month", "Day", "ID", "Description", separated by underscores. Let's make a file directory.

In [4]:
Year = []
Month = []
Day = []
ID = []
Description = []
for i in range(len(pdfs)):
    x = pdfs[i].split("_")
    Year.append(x[0])
    Month.append(x[1])
    Day.append(x[2])
    ID.append(x[3])
    Description.append(x[4])
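
The same five-part split can be wrapped in a small helper. This is a minimal sketch rather than the notebook's original code: `parse_pdf_name` is a hypothetical name, and it assumes every file name has the `Year_Month_Day_ID_Description.pdf` shape with no underscore inside the first four fields. Unlike the loop above, it drops the `.pdf` extension from the description and keeps any further underscores inside it.

```python
import os

def parse_pdf_name(filename):
    """Split 'Year_Month_Day_ID_Description.pdf' into a dict of its five parts."""
    stem, _ext = os.path.splitext(filename)   # drop the '.pdf' extension
    parts = stem.split("_", 4)                # at most 5 fields; the last keeps extra '_'
    if len(parts) != 5:
        raise ValueError(f"unexpected filename structure: {filename!r}")
    keys = ("Year", "Month", "Day", "ID", "Description")
    return dict(zip(keys, parts))

example = ("1990_april_24_587321468019152780_conformed-copy--l3186--"
           "forestry-sector-project--loan-agreement.pdf")
record = parse_pdf_name(example)
print(record["Year"], record["Month"], record["Day"], record["ID"])
```

Raising on an unexpected shape makes a malformed file name visible immediately instead of producing a misaligned row in the File Directory.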

In [5]:
FileDirectoryDictionary = {
    "Year": Year,
    "Month": Month,
    "Day": Day,
    "ID": ID,
    "Description": Description
}

In [6]:
import pandas as pd
FileDirectory = pd.DataFrame(FileDirectoryDictionary)

In [7]:
FileDirectory.head()

Unnamed: 0,Year,Month,Day,ID,Description
0,1990,april,24,587321468019152780,conformed-copy--l3186--forestry-sector-project...
1,1990,april,24,668811468165272290,conformed-copy--c2120--water-supply-project--l...
2,1990,april,25,904191468298750561,conformed-copy--l3190--environment-management-...
3,1990,april,30,410811468040573756,conformed-copy--l3180--rural-electrification-p...
4,1990,april,30,725911468042268845,conformed-copy--l3182--third-telecommunication...


In [8]:
FileDirectory.to_csv(r'C:\Users\xushu\Downloads\Qarik Project\FileDirectory.csv', index = False)

In [9]:
FileDirectory.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3205 entries, 0 to 3204
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Year         3205 non-null   object
 1   Month        3205 non-null   object
 2   Day          3205 non-null   object
 3   ID           3205 non-null   object
 4   Description  3205 non-null   object
dtypes: object(5)
memory usage: 125.3+ KB


In [10]:
print("It is",pd.Series(FileDirectory['ID']).is_unique, "that each loan document has a unique ID.")

It is True that each loan document has a unique ID.
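
If the uniqueness check ever came back `False`, it would help to see which IDs repeat. The sketch below uses only the standard library; `find_duplicates` is a hypothetical helper, and the sample list contains a deliberately repeated ID for illustration (it is not taken from the dataset, where all IDs are unique).

```python
from collections import Counter

def find_duplicates(ids):
    """Return the IDs that occur more than once, mapped to their counts."""
    counts = Counter(ids)
    return {value: n for value, n in counts.items() if n > 1}

# Illustrative input only: the first ID is repeated on purpose.
sample_ids = ["587321468019152780", "668811468165272290", "587321468019152780"]
duplicates = find_duplicates(sample_ids)
print(duplicates)
```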


In [11]:
print(FileDirectory['ID'][1])

668811468165272290


#### Every pdf's title starts with a digit (the first digit of the year; the Year column has no empty values). This is problematic when we OCR the pdfs on Windows (many other operating systems don't have this issue). To solve it, I rename the pdfs. The File Directory lets us trace each renamed file back to the original pdf.

#### In case something goes wrong, I made a copy of the original pdfs in a new folder so that the original dataset stays intact.

In [12]:
path_source = r"\Users\xushu\Downloads\Qarik Project\World_Bank_Loans"

In [13]:
path_destination = r"\Users\xushu\Downloads\Qarik Project\Renamed_PDFs"

In [14]:
# Uncomment this code if the files are not copied
#import shutil
#for f in pdfs:
#    # Build the full source path so the copy works from any working directory
#    shutil.copy(os.path.join(path_source, f), path_destination)

In [38]:
# Get the current working directory
cwd = os.getcwd()

# Print the current working directory
print("Current working directory: {0}".format(cwd))

Current working directory: C:\Users\xushu\Downloads\Qarik Project\Renamed_PDFs


In [16]:
os.chdir(path_destination)

# Print the current working directory
print("Current working directory: {0}".format(os.getcwd()))

Current working directory: C:\Users\xushu\Downloads\Qarik Project\Renamed_PDFs


#### Double-check we didn't lose any pdfs.

In [78]:
pdfs_copied = []
for count, filename in enumerate(os.listdir(path_destination)):
    pdfs_copied.append(filename)
print("There are",len(pdfs_copied), "pdfs.")

There are 3205 pdfs.


In [19]:
print(pdfs_copied[0],pdfs[0])

pdf100261468251185109.pdf 1990_april_24_587321468019152780_conformed-copy--l3186--forestry-sector-project--loan-agreement.pdf


#### Rename the pdfs. The format is 'pdf' + 'ID'.

In [21]:
oldname = pdfs[0]
print(oldname)
print(type(oldname))

1990_april_24_587321468019152780_conformed-copy--l3186--forestry-sector-project--loan-agreement.pdf
<class 'str'>


In [22]:
newname = 'pdf'+ str(FileDirectory['ID'][0]) + '.pdf'
print(newname)
print(type(newname))

pdf587321468019152780.pdf
<class 'str'>


In [23]:
for i in range(len(pdfs_copied)):
    oldname = pdfs[i]
    newname = 'pdf'+ FileDirectory['ID'][i] + '.pdf'
    os.rename(oldname,newname)

FileNotFoundError: [WinError 2] The system cannot find the file specified: '1990_april_24_587321468019152780_conformed-copy--l3186--forestry-sector-project--loan-agreement.pdf' -> 'pdf587321468019152780.pdf'
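
The `FileNotFoundError` above is what `os.rename` raises when the source name no longer exists in the working directory. That is consistent with the copied files having already been renamed in an earlier run (`pdfs_copied[0]` already shows a `pdf` + ID name). A rename loop that can safely be run more than once might look like the following sketch; `rename_to_id` is a hypothetical helper, and the demonstration uses a throwaway directory with an empty placeholder file rather than the real dataset.

```python
import os
import tempfile

def rename_to_id(folder, old_names, ids):
    """Rename each old name to 'pdf<ID>.pdf' inside folder; skip files already renamed."""
    renamed, skipped = 0, 0
    for old, doc_id in zip(old_names, ids):
        src = os.path.join(folder, old)
        dst = os.path.join(folder, "pdf" + doc_id + ".pdf")
        if os.path.exists(src) and not os.path.exists(dst):
            os.rename(src, dst)
            renamed += 1
        else:
            skipped += 1  # source missing (already renamed) or target already exists
    return renamed, skipped

# Demonstration in a temporary directory with one empty placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    old = "1990_april_24_587321468019152780_example.pdf"
    open(os.path.join(tmp, old), "w").close()
    first = rename_to_id(tmp, [old], ["587321468019152780"])
    second = rename_to_id(tmp, [old], ["587321468019152780"])  # repeating is harmless
    remaining = os.listdir(tmp)
print(first, second, remaining)
```

Using full paths also removes the dependency on the current working directory.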

In [20]:
renamed_pdfs = []
for count, filename in enumerate(os.listdir(path_destination)):
    renamed_pdfs.append(filename)

In [79]:
print(renamed_pdfs[0])

pdf100261468251185109.pdf


#### Then we can OCR the pdfs

#### First we need to know which documents require OCR. We create a flag for OCR: 0 -> Not required, 1 -> Required, 2 -> Empty document.

In [84]:
from tika import parser
import ocrmypdf

OCR_Status = []
for name in renamed_pdfs:
    # Extract whatever text Tika can read from the pdf
    text = parser.from_file(name)['content']
    txt_name = "txt" + name[3:-4] + ".txt"  # 'pdf<ID>.pdf' -> 'txt<ID>.txt'
    if text is None:
        # No content at all: flag as empty and write a placeholder file
        OCR_Status.append("2")
        with open(txt_name, "w", encoding='utf-8') as txt:
            txt.write('Empty')
    elif len(text) < 5000:
        # Fewer than 5000 characters: flag as requiring OCR, write nothing yet
        OCR_Status.append("1")
    else:
        # Enough text extracted: save it directly
        OCR_Status.append("0")
        with open(txt_name, "w", encoding='utf-8') as txt:
            txt.write(text)
print("Finished!")

Finished!
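
The three-way flag can be isolated in a small function so the threshold is easy to test and adjust. This is a sketch under the same rule as the loop above; `classify_text` and its `min_chars` parameter are hypothetical names, and 5000 is simply the cutoff used in this notebook.

```python
def classify_text(text, min_chars=5000):
    """Return the OCR flag: '0' not required, '1' required, '2' empty document."""
    if text is None:
        return "2"
    if len(text) < min_chars:
        return "1"
    return "0"

flags = [classify_text(None), classify_text("short"), classify_text("x" * 5000)]
print(flags)
```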


#### We can see that 2796 pdfs don't require OCR, 344 pdfs require OCR, and 65 are Empty pdfs.

In [85]:
Temp = pd.DataFrame({"ID":renamed_pdfs,"OCR_Status":OCR_Status})

In [87]:
Temp.OCR_Status.value_counts()

0    2796
1     344
2      65
Name: OCR_Status, dtype: int64

#### We list the pdfs that require OCR, and extract the contents.

In [90]:
OCR_List = [i for i, e in enumerate(Temp["OCR_Status"]) if e == "1"]

In [96]:
for i in OCR_List:
    file = renamed_pdfs[i]
    ocrfile = 'ocr' + file
    # Run OCR from the command line and write the result to 'ocr<original name>'
    cmd = 'ocrmypdf --optimize 0 --output-type pdf --force-ocr --deskew --fast-web-view 0 ' + file + ' ' + ocrfile
    os.system(cmd)
    # Extract the text from the OCRed copy
    text = parser.from_file(ocrfile)['content']
    if text is None:
        # Still no content after OCR: flag as empty and write a placeholder
        OCR_Status[i] = '2'
        text = 'Empty'
    with open("txt" + file[3:-4] + ".txt", "w", encoding='utf-8') as txt:
        txt.write(text)
print("Finished!")

Finished!
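
Building the command as an argument list, rather than one concatenated string, avoids quoting problems if a file name ever contains spaces. The sketch below reuses exactly the `ocrmypdf` options from the loop above; `build_ocr_command` and `run_ocr` are hypothetical helpers, and `run_ocr` assumes the `ocrmypdf` command-line tool is installed and on the PATH.

```python
import subprocess

def build_ocr_command(src, dst):
    """Return the ocrmypdf call used above as a list, one element per argument."""
    return [
        "ocrmypdf",
        "--optimize", "0",
        "--output-type", "pdf",
        "--force-ocr",
        "--deskew",
        "--fast-web-view", "0",
        src,
        dst,
    ]

def run_ocr(src, dst):
    """Run ocrmypdf and return True if it exited with status 0."""
    completed = subprocess.run(build_ocr_command(src, dst))
    return completed.returncode == 0

# Only the command is built here; nothing is executed.
cmd = build_ocr_command("pdf100681564776273055.pdf", "ocrpdf100681564776273055.pdf")
print(cmd)
```

Checking the return value of `run_ocr` would let the loop record which documents failed OCR instead of silently continuing.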


#### All pdfs are now converted to text documents. Each document's title has the format 'txt' + 'ID'. We record the OCR status in our File Directory for debugging and future reference.

In [98]:
OCR_Status[:5]

['0', '0', '1', '0', '0']

In [107]:
FileDirectory = FileDirectory.drop(['Data Type'], axis=1)

In [105]:
# Strip the 'pdf' prefix and '.pdf' suffix so that ID matches the File Directory
Temp["ID"] = Temp["ID"].str[3:-4]
Temp.head()

Unnamed: 0,ID,OCR_Status
0,100261468251185109,0
1,100271468223150252,0
2,100681564776273055,1
3,101071468074933425,0
4,101561468017966063,0
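
Slicing with `[3:-4]` silently produces a wrong ID if a name does not have the `pdf<ID>.pdf` shape, for example the `ocr`-prefixed copies written during OCR. A stricter variant is sketched below; `id_from_name` is a hypothetical helper, not part of the notebook.

```python
def id_from_name(name, prefix="pdf", suffix=".pdf"):
    """Return the ID inside 'pdf<ID>.pdf'; raise if the name has another shape."""
    if not (name.startswith(prefix) and name.endswith(suffix)):
        raise ValueError(f"unexpected file name: {name!r}")
    return name[len(prefix):len(name) - len(suffix)]

doc_id = id_from_name("pdf100261468251185109.pdf")
print(doc_id)
```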


In [108]:
FileDirectory.head()

Unnamed: 0,Year,Month,Day,ID,Description
0,1990,april,24,587321468019152780,conformed-copy--l3186--forestry-sector-project...
1,1990,april,24,668811468165272290,conformed-copy--c2120--water-supply-project--l...
2,1990,april,25,904191468298750561,conformed-copy--l3190--environment-management-...
3,1990,april,30,410811468040573756,conformed-copy--l3180--rural-electrification-p...
4,1990,april,30,725911468042268845,conformed-copy--l3182--third-telecommunication...


In [109]:
result = pd.merge(FileDirectory, Temp, how="inner", on=["ID"])
result.head()

Unnamed: 0,Year,Month,Day,ID,Description,OCR_Status
0,1990,april,24,587321468019152780,conformed-copy--l3186--forestry-sector-project...,0
1,1990,april,24,668811468165272290,conformed-copy--c2120--water-supply-project--l...,0
2,1990,april,25,904191468298750561,conformed-copy--l3190--environment-management-...,0
3,1990,april,30,410811468040573756,conformed-copy--l3180--rural-electrification-p...,0
4,1990,april,30,725911468042268845,conformed-copy--l3182--third-telecommunication...,0


In [110]:
# Save the merged table so that the CSV includes the OCR_Status column
result.to_csv(r'C:\Users\xushu\Downloads\Qarik Project\FileDirectory.csv', index = False)

### Step 2: Extract Information from the Documents.

### A loan document contains the following information: Borrower, Lender, Loan Amount, Start Date, End Date, 

#### Let's first extract the Borrower.

In [225]:
Project = []
Party1 = []
Party2 = []

for i in range(len(FileDirectory)):
    file = open("txt" + str(FileDirectory["ID"][i]) + ".txt", encoding='utf-8')
    data = file.read().upper()
    file.close()
    substring0 = '('
    length0 = len(substring0)
    index0 = data.find(substring0)
    substring1 = ')'
    length1 = len(substring1)
    index1 = data.find(substring1,index0+length0)
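    # NOTE: `data` is upper-cased above, so the mixed-case 'Between' is never found and
    # index2 is -1; 'BETWEEN' would be needed for this marker to match.
    substring2 = 'Between'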
    length2 = len(substring2)
    index2 = data.find(substring2,index1+length1)
    substring3 = 'AND'
    length3 = len(substring3)
    index3 = data.find(substring3,index2+length2)
    substring4 = 'DATED'
    length4 = len(substring4)
    index4 = data.find(substring4,index3+length3)
    Project_target = data[index0+length0:index1].replace("\n"," ").replace("("," ").replace(")"," ").strip()
    Party1_target = data[index2+length2:index3].replace("\n"," ").strip()
    Party2_target = data[index3+length3:index4].replace("\n"," ").strip()
    Project.append(Project_target)
    Party1.append(Party1_target)
    Party2.append(Party2_target)
print("Finished!")

Finished!


In [228]:
Party2

['INTERNATIONAL BANK FOR RECONSTRUCTION AND DEVELOPMENT',
 'INTERNATIONAL BANK FOR RECONSTRUCTION                           AND DEVELOPMENT',
 'AND                 INTERNATIONAL BANK FOR RECONSTRUCTION                           AND DEVELOPMENT',
 'INTERNATIONAL BANK FOR RECONSTRUCTION                     AND DEVELOPMENT',
 'INTERNATIONAL BANK FOR RECONSTRUCTION                     AND DEVELOPMENT',
 'POPULAR REPUBLIC OF ALGERIA  AND  INTERNATIONAL BANK FOR RECONSTRUCTION AND DEVELOPMENT',
 'TECHNOLOGY RESEARCH PROJECT)                                BETWEEN                           REPUBLIC OF KOREA                                  AND                 INTERNATIONAL BANK FOR RECONSTRUCTION                           AND DEVELOPMENT',
 'INTERNATIONAL BANK FOR RECONSTRUCTION                           AND DEVELOPMENT',
 'INTERNATIONAL BANK FOR RECONSTRUCTION                           AND DEVELOPMENT',
 'INTERNATIONAL BANK FOR RECONSTRUCTION                     AND DEVELOPMENT',
 'INTERNATI
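
Several entries above still carry leftover text (a leading `AND`, or part of the project title), which suggests that chained `find` calls are fragile for this header. One alternative is a single regular expression over the upper-cased, whitespace-collapsed text. The sketch below is an assumption-laden illustration, not the notebook's method: `HEADER_PATTERN` and `extract_parties` are hypothetical names, and the pattern assumes the second party's name begins with `INTERNATIONAL`, which lets it skip an `AND` that occurs inside the first party's name. The sample text is the header of the agreement printed in the next cells.

```python
import re

HEADER_PATTERN = re.compile(
    r"\((?P<project>[^)]*)\)\s*"                # project title in parentheses
    r"BETWEEN\s+(?P<party1>.+?)\s+"             # first party, up to the separating AND
    r"AND\s+(?P<party2>INTERNATIONAL\b.+?)\s+"  # second party (assumed to start with INTERNATIONAL)
    r"DATED"
)

def extract_parties(text):
    """Return (project, party1, party2) from an agreement header, or None if no match."""
    upper = " ".join(text.upper().split())  # upper-case and collapse whitespace runs
    match = HEADER_PATTERN.search(upper)
    if match is None:
        return None
    return match.group("project"), match.group("party1"), match.group("party2")

sample = """
Loan Agreement
(Essential Drugs Project)
between
FEDERAL REPUBLIC OF NIGERIA
and
INTERNATIONAL BANK FOR RECONSTRUCTION
AND DEVELOPMENT
Dated May 7, 1990
"""
print(extract_parties(sample))
```

Returning `None` when the pattern does not match makes it possible to list the documents whose headers need a different rule instead of storing garbage silently.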

In [217]:
file = open("txt" + str(FileDirectory["ID"][100]) + ".txt", encoding='utf-8')
data = file.read()
file.close()

In [218]:
print(data)


World Bank Document


                                                     LOAN NUMBER 3125 UNI

                              Loan Agreement

                         (Essential Drugs Project)

                                  between

                        FEDERAL REPUBLIC OF NIGERIA

                                    and

                   INTERNATIONAL BANK FOR RECONSTRUCTION
                              AND DEVELOPMENT

                             Dated May 7, 1990

                                                     LOAN NUMBER 3125 UNI

                              LOAN AGREEMENT

      AGREEMENT, dated May 7, 1990, between the FEDERAL REPUBLIC OF NIGERIA
(the Borrower) and INTERNATIONAL BANK FOR RECONSTRUCTION AND DEVELOPMENT
(the Bank).

      WHEREAS (A) the Borrower having satisfied itself as to the feasibility
and priority of the Project described in Schedule 2 to this Agreement, has
requested the Bank to assist in the f

In [234]:
teststring = '12345'
index = teststring.find('2',0,1)
print(index)

-1


In [186]:
substring2 = 'between'
length2 = len(substring2)
index2 = data.find(substring2,index1)
print(index2)

336


In [187]:
substring3 = '('
index3 = data.find(substring3,index2)
print(index3)

370


In [188]:
target = data[index2+length2:index3]
print(target)

 REPUBLIC OF COTE D’IVOIRE 
