# JPE references and affiliations extraction

## Table of Contents
TODO: add a table of contents for this Jupyter notebook.

## Merged dataset field description

This section describes the fields in the Merged dataset, which combines the JPE masterlist, the pivot list and pre-2016 Scopus data. The Merged dataset is stored in JPE_M_sco_du.xlsx.

    'URL' : JSTOR url for article 
    'urldate': date 
    'author' : Author names recorded by JSTOR generally in the form "last name, first name initial." with multiple authors joined by "and" or ","
    'author_split': The previous field split ['author 1', 'author 2', 'author 3'...]
    'reviewed-author' : If it is a review this is the field that will record the reviewed author(s)' name
    'title' : Title of article recorded by JSTOR citation files
    'title10' : Title previously scraped from JSTOR article pages. The review title is sometimes missing from the citation file, so this field can be used as a supplement.
    'abstract' : abstract recorded by JSTOR nb: this is not consistent
    'content_type' : Article type determined during cleaning. Includes MISC for miscellaneous, Reviews, Note, Comment, Rejoinder and Article categorizations
    'issue_url' : url of issue article belongs to on JSTOR, if this is from the original sources then 
    'pages' : pages as recorded by JSTOR
    'year' : Year of publication recorded by JSTOR
    'volume' : Volume of article recorded by JSTOR
    'number' : issue of article recorded by JSTOR
    'ISSN' : CHECK THIS, from JSTOR
    'journal' : journal name JSTOR
    'type' : Type of issue determined during cleaning. S for special issue. N for normal issue
    'authorsSCO' : Author names recorded by Scopus
    'titleSCO' : Title recorded by Scopus
    'journalSCO' : Journal name recorded by Scopus
    'DOI' : DOI recorded by scopus
    'affiliations' : affiliations of authors as recorded by scopus
    'abstractSCO' : abstract of article recorded by scopus
    'citations' : citations of article recorded by scopus
    'document type' : Article type recorded by scopus, may differ from that in cleaning
    'index keywords' : from scopus
    'author keywords' : from scopus

## The Tesseract library

I use Tesseract, a popular OCR engine, and its Python binding pytesseract to parse JPE documents in this section. I am following the code and technique from the article linked below on how to read a multi-column PDF. The fitz Python module is a lightweight PDF reader; it requires installing PyMuPDF. The OpenCV Python module (cv2) is used to apply Otsu's thresholding technique to determine paragraph edges.

https://towardsdatascience.com/read-a-multi-column-pdf-with-pytesseract-in-python-1d99015f887a
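
To make the thresholding step concrete, here is a minimal pure Python sketch of Otsu's method on a flat list of grey levels. It is only an illustration: the notebook itself relies on `cv2.threshold` with `cv2.THRESH_OTSU`, and the function name `otsu_threshold` and the toy pixel values are assumptions made for this example.

```python
def otsu_threshold(pixels):
    """Return the grey level t that best separates pixels <= t from pixels > t."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * n for level, n in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        w_fg = total - w_bg
        if w_bg == 0:
            continue          # nothing below the threshold yet
        if w_fg == 0:
            break             # everything is below the threshold
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # between-class variance (up to a constant factor)
        between_var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t

# toy "page": dark ink around 30, light paper around 220 (made-up values)
toy_pixels = [25, 30, 35] * 10 + [215, 220, 225] * 30
threshold = otsu_threshold(toy_pixels)
ink = [p for p in toy_pixels if p <= threshold]
print(threshold, len(ink))
```

The threshold lands between the two clusters, so every dark pixel is classed as ink without having to pick a cut-off by hand.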

Some Tesseract settings need to be configured to apply it effectively. These are:

Page segmentation mode (PSM): controls how Tesseract splits the page into blobs of text, which determines the order of the output (Tesseract binarises the image internally, by default with Otsu's method, before it looks for these blobs). There are 14 modes (0 to 13); the default is mode 3, which does fully automatic page segmentation. The two I switch between are 3 (great for automatically detecting columns) and 6 (for text in a single column). Although mode 3 can achieve similar results to mode 6, mode 3 looks for vertical blank space between columns to separate bodies of text. If something is in a list format, mode 3 may assume that the list numbers or bullet points are a separate column and read the text from the left column to the right column.

OEM (OCR engine mode): selects which recognition engine Tesseract uses. 0 is the legacy engine, 1 is the neural net LSTM engine, 2 combines both, and 3 (the default) uses whatever is available. (I still need to check the full definition of this.)
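
As a small illustration, both settings are passed to pytesseract as one config string. The helper below is a hypothetical convenience (`tesseract_config` is not part of pytesseract or of this notebook); it only builds and validates the string.

```python
def tesseract_config(psm=3, oem=3):
    """Build the config string for pytesseract (hypothetical helper)."""
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    if not 0 <= oem <= 3:
        raise ValueError("oem must be between 0 and 3")
    return f"--oem {oem} --psm {psm}"

# multi-column pages versus single-column pages
two_column_config = tesseract_config(psm=3, oem=1)
single_column_config = tesseract_config(psm=6, oem=1)
print(two_column_config)
print(single_column_config)
# usage, assuming an image is already loaded:
# text = pytesseract.image_to_string(image, config=two_column_config)
```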

In [2]:
import fitz

# for OCR using PyTesseract
import cv2                              # pre-processing images
import pytesseract                      # extracting text from images
import numpy as np
import matplotlib.pyplot as plt         # displaying output images
from PIL import Image
import re                               # used by generate_pngs()
import regex                            # supports fuzzy matching, used by generate_refs2()
import pandas as pd
import time
import os

Set the path to the folder containing the article PDFs.

In [3]:
base_path="/Users/sijiawu/Work/Refs Danae/Thesis/Data"
temp=base_path+'/PDFs/JPE/'

Read in the merged dataset containing JSTOR, Scopus and data dump metadata.

In [4]:
Merged=pd.read_excel(base_path+'/Combined/JPE_M_sco_du.xlsx')
Merged.loc[Merged['journal']=="Journal of Political Economy",'journal']='JPE'

Set the zoom factor used to render the PDF. This gives a higher resolution image. I have chosen 2x zoom both horizontally and vertically. This doesn't matter if we use the PNGs that were previously generated via the split_pdf script in part two.

In [5]:
zoom_x = 2.0 # horizontal zoom
zoom_y = 2.0 # vertical zoom
mat = fitz.Matrix(zoom_x, zoom_y)
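
For reference, PyMuPDF renders at 72 dpi when no matrix is given, so a zoom factor maps directly to a resolution. The two helper names below are assumptions introduced for this sketch.

```python
PDF_BASE_DPI = 72  # PDF user space is measured in points, 72 per inch

def dpi_for_zoom(zoom):
    """Resolution obtained when rendering with a given zoom factor."""
    return zoom * PDF_BASE_DPI

def zoom_for_dpi(target_dpi):
    """Zoom factor needed to render at a target resolution."""
    return target_dpi / PDF_BASE_DPI

print(dpi_for_zoom(2.0))   # resolution of the 2x zoom chosen above
print(zoom_for_dpi(300))   # zoom that would be needed for 300 dpi
# usage, assuming fitz is imported:
# mat = fitz.Matrix(zoom_for_dpi(300), zoom_for_dpi(300))
```

So the 2x zoom corresponds to 144 dpi, which is lower than the 300 dpi assumed for the pre-generated PNGs later in this notebook.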

## EDA
I print the first and last page of each paper within a given range of years. This is a quick way to look at the article content in the dataset and identify any changes in the layout and positioning of affiliations or references. Specifically, this enables me to:
1. Choose the most appropriate page segmentation mode (3 or 6) for each range of years as the formatting changes
2. Identify the year at which to stop looking for a reference list, because citations shift to the footnotes
3. Identify strange cases where an article starts on the same page as the previous article ends. The risk here is that, if the following article does not have references, my script may erroneously attribute the previous article's reference section to it. These cases should be rare, so excluding them will be simple.

In [6]:
investigate=Merged[(Merged.year<=1967)&(Merged.year>=1960)&(Merged.content_type!="MISC")&(Merged.content_type!="Review")]

In [None]:
# show the first and last page of each article; the assumption is these pngs already exist from the previous stage
# this for loop may also be modified to show every page
for i in investigate.index:
    row = Merged.loc[i]  # investigate.index holds index labels, so use .loc rather than .iloc
    article_id = row['URL'].split('/')[-1]
    filepath = base_path+'/PDFs/JPE/'+article_id+'.pdf'
    if os.path.exists(filepath):
        doc = fitz.open(filepath)
        print(row['year'])
        print(row['number'])
        print(row['volume'])
        print(row['author'])
        print(row['title'])
        png_stem = base_path+"/PDFs/JPE/png/" + article_id.split('.')[0] + '_wo_cover_page-'
        png = png_stem + '0.png'
        # the pngs exclude the cover page, so the last one is numbered page_count-2
        png2 = png_stem + str(doc.page_count-2) + '.png'
        print(png)
        print(png2)
        if os.path.exists(png) and os.path.exists(png2):
            for image_path in (png, png2):
                original_image = cv2.imread(image_path)

                # convert the image to grayscale
                gray_image = cv2.cvtColor(original_image, cv2.COLOR_BGR2GRAY)

                plt.figure(figsize=(15, 8))
                plt.imshow(gray_image, cmap='gray')
                plt.show()

In [12]:
exclude=['1832163'] # this has the problem described in point 3
separate=['1828842'] # this has references in its footnotes but no reference list, although the other articles in the issue have a reference list

### EDA Results

I will process the data by decade, or by shorter spans where the layout changes midway. Generally, I expect conventions (ie: reference list or no reference list) to be consistent within a single issue, but this is not always the case.

From 1971 to 2020 (inclusive), articles are in single column format with references at the end.

From 1968 to 1970 (inclusive), articles are in double column format and each article starts on a new page regardless of what type it is. No need to make special provisions to identify the start and end of an article.

From 1966 to 1967 (inclusive), articles are in double column format, and non-articles such as reviews, comments etc. begin on the same page as the previous article ends. The only article to pose an issue is 1832163: its first page carries the previous article's reference list, and the article itself has no references. Another exception is 1828842, which has references in its footnotes, contrary to the rest of the articles in the issue.

From 1940 to 1965 (inclusive), articles are in double column format and they do not have reference lists at the end. Citations are found in the footnotes.
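
The findings above can be condensed into a small lookup. This is a sketch only: the function name `layout_for_year` and the dictionary keys are assumptions, and the `psm` values simply follow the rule of thumb stated earlier (mode 3 for columns, mode 6 for a single column).

```python
def layout_for_year(year):
    """Return the layout conventions observed in the EDA for a publication year."""
    if 1971 <= year <= 2020:
        return {'columns': 1, 'psm': 6, 'references': 'end of article'}
    if 1966 <= year <= 1970:
        return {'columns': 2, 'psm': 3, 'references': 'end of article'}
    if 1940 <= year <= 1965:
        return {'columns': 2, 'psm': 3, 'references': 'footnotes'}
    return None  # outside the range examined in the EDA

for example_year in (1955, 1967, 1985):
    print(example_year, layout_for_year(example_year))
```

The 1966 to 1967 special cases (articles starting on the previous article's last page) are not captured here and still need the `exclude` and `separate` lists.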

### The converter() function
This function takes a string and replaces every character that is not an ASCII letter or a space with a placeholder. In regex, a placeholder is represented by a '.'. Specific to JPE, a lower case 'L' and an upper case 'I' look the same to tesseract because of the font. Hence, upper case 'I's are replaced with a placeholder. I also found that tesseract may mistake a middle name initial for something else, so the character before each '.' is replaced with a placeholder as well.

An alternative solution is to use fuzzy matching.

In [35]:

import string
def converter(teststring):
    chars = list(teststring)
    # a '.' marks an initial: replace the character before it with a placeholder
    # (start at 1 so a leading '.' does not wrap around to the end of the string)
    for i in range(1, len(chars)):
        if chars[i] == '.':
            chars[i-1] = '.'
    # upper case 'I' and anything that is not an ASCII letter or a space become placeholders
    return ''.join(c if (c in string.ascii_letters and c != 'I') or c == ' ' else '.' for c in chars)
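
To illustrate why the placeholders help, the sketch below matches the pattern that `converter('Michael D. Intriligator')` produces (upper-cased, as `generate_pngs()` does) against OCR output with typical misreads. The OCR strings are made-up examples.

```python
import re

# converter('Michael D. Intriligator').upper()
pattern = 'MICHAEL .. .NTRILIGATOR'

# made-up OCR outputs: 'I' read as lower case 'l', and the initial misread
ocr_outputs = [
    "Some Title\nMichael D. lntriligator\nSome University",
    "Some Title\nMichael O, Intriligator\nSome University",
]
matches = [re.search(pattern, text.upper()) is not None for text in ocr_outputs]
print(matches)

# a different surname should not match
no_match = re.search(pattern, "MICHAEL D. INTRILIGATION") is None
print(no_match)
```

Because '.' matches any single character, the pattern tolerates substitutions but not insertions or deletions, which is where fuzzy matching would be more forgiving.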


### The generate_pngs() function
This function looks for the block of text that contains the lead author's name, assuming that the block also contains the affiliations. If no block matches the author, the parsed text is returned instead. Only the first page of the article, and sometimes the JSTOR cover page, is parsed.

The function takes a PDF file path (SCANNED_FILE), the number of pages (pages), the zoom matrix (mat), the path to the PDF folder (path), a value for how tightly to draw the mask (k_val), and a string or regex pattern that matches the lead author's name (author). A higher k_val results in a mask that covers more of the page, ie: segments the page less. You can uncomment the plotting lines inside the function to observe the mask.
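
The effect of k_val can be seen on a one-dimensional toy example. This is a pure Python sketch of binary dilation along a single row of pixels, not the `cv2.dilate` call used in the function; the helper names and the toy row are assumptions, and the window width is kept odd for simplicity.

```python
def dilate_row(row, k):
    """1-D binary dilation: a pixel becomes 1 if any pixel within k//2 of it is 1."""
    radius = k // 2
    n = len(row)
    return [1 if any(row[max(0, i - radius):min(n, i + radius + 1)]) else 0
            for i in range(n)]

def count_blocks(row):
    """Count runs of consecutive 1s, i.e. separate blobs of text."""
    return sum(1 for i, v in enumerate(row) if v == 1 and (i == 0 or row[i - 1] == 0))

# 1 = ink, 0 = blank space between words or columns
ink = [1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1]
blocks_by_k = {k: count_blocks(dilate_row(ink, k)) for k in (1, 3, 7)}
print(blocks_by_k)
```

As the window grows, neighbouring blobs merge, so fewer and larger blocks are passed to tesseract.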

In [None]:
def generate_pngs(SCANNED_FILE, pages, mat, path, k_val, author):
    doc = fitz.open(SCANNED_FILE)
    parsed={}
    count=doc.page_count-pages
    if count<0:
        count=1
    for page in doc:
        if (page.number == count):
            png = path+"\\pages_png\\" + SCANNED_FILE.split('\\')[-1].split('.')[0] + '_page-%i.png' % page.number
            if os.path.exists(png)==False:
                pix = page.get_pixmap(matrix=mat)
                print(png)
                pix.save(png)

            parsed[page.number]=[]

            original_image = cv2.imread(png)
            # convert the image to grayscale
            gray_image = cv2.cvtColor(original_image, cv2.COLOR_BGR2GRAY)

            #plt.figure(figsize=(25, 15))
            #plt.imshow(gray_image, cmap='gray')
            #plt.show()

            # Performing OTSU threshold
            ret, threshold_image = cv2.threshold(gray_image, 0, 255, cv2.THRESH_OTSU | cv2.THRESH_BINARY_INV)

            #plt.figure(figsize=(25, 15))
            #plt.imshow(threshold_image, cmap='gray')
            #plt.show()

            rectangular_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (k_val, k_val))

            # Applying dilation on the threshold image
            dilated_image = cv2.dilate(threshold_image, rectangular_kernel, iterations = 1)

            #plt.figure(figsize=(25, 15))
            #plt.imshow(dilated_image)
            #plt.show()

            # Finding contours
            contours, hierarchy = cv2.findContours(dilated_image, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)

            # Creating a copy of the image
            copied_image = original_image.copy()

            mask = np.zeros(original_image.shape, np.uint8)
            i=1
            # Looping through the identified contours
            # Then rectangular part is cropped and passed on to pytesseract
            # pytesseract extracts the text inside each contours
            # Extracted text is then written into a text file
            for cnt in reversed(contours):
                x, y, w, h = cv2.boundingRect(cnt)
                print(i)
                # Cropping the text block for giving input to OCR
                cropped = copied_image[y:y + h, x:x + w]
                # Apply OCR on the cropped image
                text = pytesseract.image_to_string(cropped, lang='lat', config='--oem 3 --psm 1')
                print(text)
                parsed[page.number].append(text)
                print(re.search(author.upper(),text.upper()))
                # a block containing "Author(s)" means this is the JSTOR cover page: move on to the next page
                if re.search(r'AUTHOR\(S\)', text.upper()) is not None:
                    count+=1
                    break
                if re.search(author.upper(),text.upper()) is not None:
                    return {'found': text}
                #masked = cv2.drawContours(mask, [cnt], 0, (255, 255, 255), -1)
                print()
                i=i+1
            #plt.figure(figsize=(25, 15))
            #plt.imshow(masked, cmap='gray')
            #plt.show()
    return {'raw': parsed}


### Testing the generate_pngs() function
I use the getNumberofPages() and converter() functions as inputs.
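
getNumberofPages() is not defined in this section. The sketch below is a hypothetical stand-in, assuming it converts a JSTOR page range such as '339-354' into a page count and returns NaN when the range cannot be parsed (the extraction loop later checks the result with pd.isna).

```python
import math

def number_of_pages_sketch(page_range):
    """Hypothetical stand-in for getNumberofPages(): '339-354' -> 16."""
    if not isinstance(page_range, str):
        return math.nan                      # e.g. a missing value
    parts = page_range.replace('\u2013', '-').split('-')
    try:
        first, last = int(parts[0]), int(parts[-1])
    except ValueError:
        return math.nan                      # e.g. non-numeric page labels
    if last < first:
        return math.nan
    return last - first + 1

print(number_of_pages_sketch('339-354'))
print(number_of_pages_sketch('12'))
print(number_of_pages_sketch('S1-S20'))
```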

In [None]:
# replace with your own file
SCANNED_FILE = path+'\\1830926.pdf'

t0=time.time()        
affiliations=generate_pngs(SCANNED_FILE, getNumberofPages('339-354'), mat, path, 50, converter('Michael D. Intriligator'))
t1=time.time()
total=t1-t0
print(total)
affiliations


### Extracting affiliations from JPE
First create an empty dictionary

In [54]:
dict={}

#lower case all letters in both upper and lower
counts=Merged[(Merged['year']>1940) & (Merged['content_type']!='MISC') & (Merged['content_type']!='Review')]
counts.shape

(4430, 25)

This for loop, provided the content_type is not miscellaneous or a review, stores the metadata of each paper in the dictionary dict, keyed by JSTOR ID.

JSTOR_id: { 

    'affiliations': {'found': affiliations_text_if_found}, 
    'content_type': content_type, 
    'authors': [author1, author2, author3 ...], 
    'stable_url': stable_url
   }
   
Note: if affiliations are not found, the 'affiliations' field will contain a dictionary of the form:

'raw': {

    '0': [parsed_text_on_page_0 separated by commas], 
    '1': [parsed_text_on_page_1 separated by commas] ...
   }

In [None]:
t0=time.time()

for i in Merged[(Merged['year']>=1940) & (Merged['content_type']!='MISC') & (Merged['content_type']!='Review')].index:
    if pd.notna(Merged.iloc[i]['Jstor_authors']):  # NaN is not a defined name; use pandas to test for missing values
        if "Suggested by" not in Merged.iloc[i]['Jstor_authors']:
            authors=str(Merged.iloc[i]['Jstor_authors']).replace(' and ',', ').replace("  ",' ').split(',')
            filepath=path+'\\'+Merged.iloc[i]['stable_url'].split('/')[-1]+'.pdf'
            if os.path.exists(filepath)==True:
                print(Merged.iloc[i]['year'])
                first_author=converter(authors[0])
                print(first_author)
                n_pages=getNumberofPages(Merged.iloc[i]['pages'])
                if pd.isna(n_pages)==False:
                    affiliations=generate_pngs(filepath, n_pages, mat, path, 52, first_author.strip())
                    dict[Merged.iloc[i]['stable_url'].split('/')[-1]]={'affiliations':affiliations, 'content_type':Merged.iloc[i]['content_type'], 'authors':authors, 'stable_url': Merged.iloc[i]['stable_url']}
            else:
                dict[Merged.iloc[i]['stable_url'].split('/')[-1]]='PDF not available, download at '+ Merged.iloc[i]['stable_url']
t1=time.time()
total=t1-t0
print(total)
print(i)

Save the dictionary containing affiliations inside a json file.

In [65]:
import json
with open(path+'//JPE_affiliation_output_aff2.json','w') as fp:
    json.dump(dict, fp)

In [None]:
import json
# print pretty to view dictionary content
print(json.dumps(dict, sort_keys=False, indent=4))
print(len(dict.keys()))

### Extracting references
JPE has references at the end in a dedicated references section from 1966 onwards. Hence the generate_refs2 function looks for the keyword 'References' using fuzzy matching (at most 3 character errors). It reads the pages from the last one backwards and, once the keyword is found, returns the parsed text of every page read so far. If the keyword is not found, the function returns the parsed text of every page of the document.

TODO: MODIFICATION REQ TO CHANGE THIS.
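
The backwards scan can be sketched on plain strings. This example swaps in an exact match with the standard library `re` module instead of the fuzzy pattern, and takes already parsed page texts instead of calling tesseract; the function name and the sample pages are assumptions.

```python
import re

def scan_back_for_keyword(page_texts, pattern):
    """Read pages from last to first and stop at the first page matching pattern."""
    parsed = {}
    for page_no in range(len(page_texts) - 1, -1, -1):
        parsed[page_no] = [page_texts[page_no]]
        if re.search(pattern, page_texts[page_no].upper()) is not None:
            return {'found': parsed}
    return {'raw': parsed}

# made-up page texts standing in for tesseract output
pages = [
    "Introduction ...",
    "Model ...",
    "Conclusion ...\nReferences\nFirst entry ...",
    "Second entry ...",
]
result = scan_back_for_keyword(pages, r'(^|\n)REFERENCES(\n| )')
print(list(result['found'].keys()))

missing = scan_back_for_keyword(pages[:2], r'(^|\n)REFERENCES(\n| )')
print(list(missing.keys()))
```

Only the pages from the reference heading onwards are kept when the keyword is found, which is why the page holding the heading may also contain the tail of the article body.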

In [8]:
# This version skips contouring and sectioning out paragraphs. It feeds each page image directly to tesseract.
# Passing the image straight to tesseract seems to avoid the loss of resolution seen when going through the openCV library.
# SCANNED_FILE: the full path to the original pdf. We require this to get the number of pages.
#  The assumption is that the jstor (or other) cover page has been removed previously, so in our case the file name always contains wo_cover.
# path: the folder containing pre-generated pngs, assumed to be named <SCANNED_FILE name>_page-{page no}.png, one per page
# keyword: whatever regex pattern you wish to search for. This function uses the regex.search method from the regex library,
#  so it can take fuzzy match regex patterns
# custom_config: the tesseract configuration; this function defaults to '--oem 1 --psm 3'
#  psm 3 is automatic page segmentation, better for 2 column format pdfs; psm 6 assumes single column, top to bottom text and will preserve each line ending better.
def generate_refs2(SCANNED_FILE, path, keyword, custom_config = r'--oem 1 --psm 3'):
    try:
        doc = fitz.open(SCANNED_FILE)
    except Exception as err:
        print("could not open: "+SCANNED_FILE)
        raise Exception("this file is corrupt") from err
    file_stem = SCANNED_FILE.split('/')[-1].split('.')[0]
    if "wo_cover" not in SCANNED_FILE:
        print("warning, the file: "+SCANNED_FILE.split('/')[-1]+" does not have its cover page removed.\nThis function will continue. Assumed image file name convention is: "+file_stem + '_page-{number}.png')
    parsed={}
    # walk backwards from the last page and stop at the first page that matches the keyword
    for page in reversed(doc):
        png = path+"/" + file_stem + '_page-%i.png' % page.number
        parsed[page.number]=[]
        if os.path.exists(png):
            text = pytesseract.image_to_string(png, config=custom_config)
            parsed[page.number].append(text)
            keyword_search=regex.search(keyword,text.upper())
            if keyword_search is not None:
                print('found')
                return {'found': parsed, "pages":doc.page_count}
        else:
            print("error: this image does not exist, please generate png shards at 300 dpi in path: "+path)
    print("the keyword: "+keyword + " was not found. But this is the full tesseract output nonetheless.")
    return {'raw': parsed, "pages": doc.page_count}

### Testing the generate_refs2() function

In [27]:
# 1832164_wo_cover_page
filepath=temp+'wo_cover/'+'1829157_wo_cover.pdf'
op=generate_refs2(filepath, temp+"png", '(^|\n)R(EFERENCES){e<=3}(\n| )', r'--oem 1 --psm 3')

found


Run each article through the generate_refs2 function in a loop. The output for each article is saved to its own JSON file.

In [None]:
t0=time.time()
custom_config = r'--oem 1 --psm 3'
dict_ref={}  # records the articles whose PDF is missing

for i in Merged[(Merged['year']<=1967) & (Merged['year']>1965)& (Merged['content_type']!='MISC') & (Merged['content_type']!='Review')].index:
    article_id=Merged.iloc[i]['URL'].split('/')[-1]
    filepath=temp+'wo_cover/'+article_id+'_wo_cover.pdf'
    print(filepath)
    if os.path.exists(filepath):
        references=generate_refs2(filepath, temp+"png", '(^|\n)R(EFERENCES){e<=3}(\n| )', custom_config)
        o_file=base_path+'/'+article_id+ "_tesseract.json"
        with open(o_file,'w') as f:
            json.dump({article_id: {'references':references, 'URL': Merged.iloc[i]['URL']}}, f, indent=3)
    else:
        dict_ref[article_id]='PDF not available, download at '+ Merged.iloc[i]['URL']
        print("filepath not valid, file "+article_id+'_wo_cover.pdf'+ " did not get sharded")
t1=time.time()
total=t1-t0
print(total)
print(i)

This for loop, provided the content_type is not miscellaneous or a review, writes the references extracted via tesseract for each paper to its own JSON file, keyed by JSTOR ID.

JSTOR_id: {

    'references': {
        'found': {
            'page_no': [Text containing or following references keyword separated by commas],
            'page_no': [Text containing or following references keyword separated by commas] ...
            },
        'pages': number_of_pages_in_pdf
         }, 
    'URL': URL
    }

Note: if references are not found, the 'references' field will contain 'raw' in place of 'found', in the form:

'raw': {

    'page_no': [parsed_text_on_page_no separated by commas], 
    'page_no': [parsed_text_on_page_no separated by commas] ...
    }