# Merged dataset field description

This section describes the fields in the Merged dataset, which combines the JPE masterlist, the pivot list, and pre-2016 Scopus data. The Merged dataset is stored in JPE_M_sco_du.xlsx.

    'stable_url' : JSTOR URL of the article
    'Jstor_authors' : Author names recorded by JSTOR
    'Jstor_title' : Title of the article recorded by JSTOR
    'Jstor_abstract' : Abstract recorded by JSTOR (NB: currently blank)
    'content_type' : Article type determined during cleaning: MISC (miscellaneous), Reviews, Note, Comment, Rejoinder or Article
    'issue_url' : JSTOR URL of the issue the article belongs to
    'pages' : Pages recorded by JSTOR
    'year' : Year of publication recorded by JSTOR
    'volume' : Volume recorded by JSTOR
    'issue' : Issue recorded by JSTOR
    'Jstor_journal' : Journal name recorded by JSTOR
    'type' : Type of issue determined during cleaning: S for special issue, N for normal issue
    'scopus_authors' : Author names recorded by Scopus
    'scopus_title' : Title recorded by Scopus
    'scopus_journal' : Journal name recorded by Scopus
    'DOI' : DOI recorded by Scopus
    'affiliations' : Affiliations of the authors recorded by Scopus
    'scopus_abstract' : Abstract of the article recorded by Scopus
    'citations' : Citations of the article recorded by Scopus
    'document type' : Article type recorded by Scopus; may differ from the type assigned during cleaning
    'index keywords' : Index keywords recorded by Scopus
    'author keywords' : Author keywords recorded by Scopus
    'footnotes' : Footnotes scraped from the metadata panel
    'raw' : Raw text data scraped from the JSTOR metadata panel
    'references' : Citations scraped from the JSTOR metadata panel during data collection

In [None]:
import pandas as pd
Merged=pd.read_excel('C:\\Users\\sjwu1\\Journal_Data\\datadumps\\JPE_M_sco_du.xlsx')

## The Tesseract library

In this section I use Tesseract, a popular OCR engine, and its Python binding pytesseract to parse JPE documents. I follow the code and technique from the article linked below on how to read a multi-column PDF. The fitz Python module is a lightweight PDF reader; it requires installing PyMuPDF. The OpenCV Python module, cv2, is used to apply Otsu's thresholding technique and find the edges of paragraphs.

https://towardsdatascience.com/read-a-multi-column-pdf-with-pytesseract-in-python-1d99015f887a
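
To make the thresholding step less of a black box, here is a minimal, dependency-free sketch of the idea behind Otsu's method: pick the grey level that best separates dark ink from light paper by maximising the between-class variance. The notebook itself relies on `cv2.threshold` with `cv2.THRESH_OTSU`; the function `otsu_threshold` and the toy pixel values below are illustrative assumptions, not part of OpenCV or of the original code.

```python
def otsu_threshold(pixels):
    """Return the grey level t (0-255) that maximises between-class variance.

    Pixels with value <= t form one class (ink), the rest the other (paper).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * n for level, n in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels at or below the candidate threshold
    sum0 = 0    # sum of the grey levels of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so there is nothing to separate
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mu0 - mu1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


# Toy "page" (made-up values): dark ink near 30, light paper near 220
page = [30] * 200 + [35] * 100 + [215] * 100 + [220] * 600
t = otsu_threshold(page)
ink = [p for p in page if p <= t]
print(t, len(ink))
```

Any threshold between the two clusters separates them equally well here, so the sketch keeps the first such value. In the notebook, the inverted binary image produced by this step is then dilated so that neighbouring characters merge into paragraph-sized blobs.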

In [7]:
import fitz

# for OCR using PyTesseract
import cv2                              # pre-processing images
import pytesseract                      # extracting text from images
import numpy as np
import matplotlib.pyplot as plt         # displaying output images
from PIL import Image

Set the path to the folder that contains the article PDFs.

In [4]:
path='C:\\Users\\sjwu1\\Journal_Data\\JPE_data'
temp=path+'\\dummy.pdf'

Read in the merged dataset containing the JSTOR, Scopus and datadump metadata.

In [3]:
Merged=pd.read_excel('C:\\Users\\sjwu1\\Journal_Data\\datadumps\\JPE_M_sco_du.xlsx')

Set the zoom factor used when rendering each PDF page, which gives a higher-resolution image. I have chosen 2x zoom both horizontally and vertically.

In [8]:
zoom_x = 2.0 # horizontal zoom
zoom_y = 2.0 # vertical zoom
mat = fitz.Matrix(zoom_x, zoom_y)

### The getNumberofPages() function
Given the page-number string that JSTOR records for an article, this function returns the number of pages expected in the article. Some articles' PDFs were downloaded from Scopus and are sometimes missing a front page. On the other hand, JSTOR PDFs sometimes have a cover page that is not accounted for. The expected page count allows the first page of an article to be identified correctly.

In [34]:
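import re
import pandas as pd

def getNumberofPages(text):
    # Expect strings such as '20-40' or '2014-2016, 30'; anything else counts as one page
    if not pd.isna(text) and re.search(r'\d', text):
        pages=0
        temp=text.split(',')
        print(temp)  # debug: show the comma-separated page ranges
        for m in temp:
            if '-' in m:
                # a range contributes end - start + 1 pages
                t=str(m).split('-')
                pages+=int(re.sub(r'\D','',t[1]))-int(re.sub(r'\D','',t[0]))+1
            else:
                # a single page number
                pages+=1
        return pages
    return 1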


print(getNumberofPages('2014-2016, 30'))
print(getNumberofPages('20-40'))

['2014-2016', ' 30']
4
['20-40']
21


### The converter() function
This function turns an author name into a regex pattern by replacing every character that is not an ASCII letter or a space with a placeholder. The placeholder is '.', which in a regex matches any single character. Two substitutions are specific to JPE. A lower-case 'L' and an upper-case 'I' look the same to Tesseract because of the font, so every upper-case 'I' is replaced with a placeholder. Tesseract may also mistake a middle-name initial for another letter, so the letter before each period is replaced with a placeholder as well.

An alternative solution is to use fuzzy matching.
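
As a rough illustration of that alternative, the sketch below slides a window over the OCR text and accepts it when it is similar enough to the author name. It uses `difflib.SequenceMatcher` from the standard library; the helper name `fuzzy_contains`, the 0.85 cut-off and the sample OCR strings are assumptions made for this example and are not tuned on JPE data.

```python
from difflib import SequenceMatcher


def fuzzy_contains(name, text, min_ratio=0.85):
    """Return True if some window of `text` is close enough to `name`.

    Both strings are upper-cased first, as in the regex-based matching.
    """
    name, text = name.upper(), text.upper()
    n = len(name)
    if n == 0:
        return False
    # Slide a window of the same length as the name across the text
    for start in range(max(1, len(text) - n + 1)):
        window = text[start:start + n]
        if SequenceMatcher(None, name, window).ratio() >= min_ratio:
            return True
    return False


# Made-up OCR output with two misread characters ('O' for 'D', 'l' for 'I')
ocr_block = 'MICHAEL O. lNTRILIGATOR\nSome University'
print(fuzzy_contains('Michael D. Intriligator', ocr_block))
print(fuzzy_contains('Michael D. Intriligator', 'JOURNAL OF POLITICAL ECONOMY'))
```

Unlike the placeholder approach, this tolerates misread characters anywhere in the name, at the cost of one similarity computation per window.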

In [35]:

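import string

def converter(teststring):
    # Replace the character before each '.' (usually a middle initial) with the wildcard '.'
    for i in range(len(teststring)):
        if teststring[i] == '.' and i > 0:
            teststring=teststring[0:i-1]+'.'+teststring[i:]

    # Upper-case 'I' and lower-case 'l' look alike in the JPE font
    teststring=teststring.replace('I','.')

    # Replace everything that is not an ASCII letter or a space with the wildcard
    for ch in teststring:
        if ch not in string.ascii_letters and ch != ' ':
            teststring=teststring.replace(ch,'.')
    return teststring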
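
A short sketch of how the resulting pattern is meant to be used. The pattern string is what `converter('Michael D. Intriligator')` returns; the OCR text is made up for illustration, with the initial misread as 'O' and the upper-case 'I' read as a lower-case 'l'.

```python
import re

# Pattern returned by converter('Michael D. Intriligator')
pattern = 'Michael .. .ntriligator'

# Made-up OCR output containing two misread characters
ocr_text = 'MICHAEL O. lNTRILIGATOR\nSome University'

# Both sides are upper-cased before matching, as generate_pngs() does
match = re.search(pattern.upper(), ocr_text.upper())
print(match.group(0) if match else None)
```

Each '.' absorbs exactly one character, so the pattern still requires the rest of the name to be read correctly.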


### The generate_pngs() function
This function looks for the block of text that contains the author names, on the assumption that the same block also contains the affiliations. If no such block is found, all of the parsed text is returned instead. Only the first page of the article, and sometimes the JSTOR cover page, is parsed.

The function takes a PDF file path (SCANNED_FILE), the expected number of pages (pages), the zoom matrix (mat), the path to the PDF folder (path), a value that controls how tightly the mask is drawn (k_val), and a string or regex pattern that matches the lead author's name (author). A higher k_val results in a mask that covers more of the page, i.e. it segments the page less. To inspect the mask, uncomment the plotting lines inside the function.

In [None]:
import os
import re

def generate_pngs(SCANNED_FILE, pages, mat, path, k_val, author):
    doc = fitz.open(SCANNED_FILE)
    parsed={}
    count=doc.page_count-pages
    if count<0:
        count=1
    for page in doc:
        if (page.number == count):
            png = path+"\\pages_png\\" + SCANNED_FILE.split('\\')[-1].split('.')[0] + '_page-%i.png' % page.number
            if os.path.exists(png)==False:
                pix = page.get_pixmap(matrix=mat)
                print(png)
                pix.save(png)

            parsed[page.number]=[]

            original_image = cv2.imread(png)
            # convert the image to grayscale
            gray_image = cv2.cvtColor(original_image, cv2.COLOR_BGR2GRAY)

            #plt.figure(figsize=(25, 15))
            #plt.imshow(gray_image, cmap='gray')
            #plt.show()

            # Performing OTSU threshold
            ret, threshold_image = cv2.threshold(gray_image, 0, 255, cv2.THRESH_OTSU | cv2.THRESH_BINARY_INV)

            #plt.figure(figsize=(25, 15))
            #plt.imshow(threshold_image, cmap='gray')
            #plt.show()

            rectangular_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (k_val, k_val))

            # Applying dilation on the threshold image
            dilated_image = cv2.dilate(threshold_image, rectangular_kernel, iterations = 1)

            #plt.figure(figsize=(25, 15))
            #plt.imshow(dilated_image)
            #plt.show()

            # Finding contours
            contours, hierarchy = cv2.findContours(dilated_image, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)

            # Creating a copy of the image
            copied_image = original_image.copy()

            mask = np.zeros(original_image.shape, np.uint8)
            i=1
            # Loop through the identified contours in reverse order.
            # Each bounding rectangle is cropped and passed to pytesseract,
            # which extracts the text inside it.
            # The extracted text is appended to parsed[page.number].
            for cnt in reversed(contours):
                x, y, w, h = cv2.boundingRect(cnt)
                print(i)
                # Cropping the text block for giving input to OCR
                cropped = copied_image[y:y + h, x:x + w]
                # Apply OCR on the cropped image
                text = pytesseract.image_to_string(cropped, lang='lat', config='--oem 3 --psm 1')
                print(text)
                parsed[page.number].append(text)
                print(re.search(author.upper(),text.upper()))
                if re.search(r'AUTHOR\(S\)', text.upper()) is not None:
                    count+=1
                    break
                if re.search(author.upper(),text.upper()) is not None:
                    return {'found': text}
                #masked = cv2.drawContours(mask, [cnt], 0, (255, 255, 255), -1)
                print()
                i=i+1
            #plt.figure(figsize=(25, 15))
            #plt.imshow(masked, cmap='gray')
            #plt.show()
    return {'raw': parsed}
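
The page-selection rule at the top of generate_pngs() can be isolated as a small function, which makes it easier to see which page gets parsed for a given PDF. This is only a restatement of the logic above under a hypothetical name, `first_page_index`; it is not a separate part of the pipeline.

```python
def first_page_index(page_count, expected_pages):
    """Mirror the rule generate_pngs() uses to pick the page to parse."""
    index = page_count - expected_pages
    if index < 0:
        # The PDF is shorter than expected; the function falls back to index 1
        index = 1
    return index


print(first_page_index(17, 16))  # one extra page (a cover page): parse index 1
print(first_page_index(16, 16))  # page counts agree: parse index 0
print(first_page_index(15, 16))  # PDF shorter than expected: falls back to index 1
```

If the selected page turns out to be a JSTOR cover page (a block containing 'AUTHOR(S)'), generate_pngs() additionally moves on to the next page.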


### Testing the generate_pngs() function
The outputs of getNumberofPages() and converter() are passed as arguments.

In [None]:
import time

# replace with your own file
SCANNED_FILE = path+'\\1830926.pdf'

t0=time.time()        
affiliations=generate_pngs(SCANNED_FILE, getNumberofPages('339-354'), mat, path, 50, converter('Michael D. Intriligator'))
t1=time.time()
total=t1-t0
print(total)
affiliations


### Extracting affiliations from JPE
First create an empty dictionary

In [54]:
dict={}

# rows that qualify for parsing: after 1940, excluding miscellaneous items and reviews
counts=Merged[(Merged['year']>1940) & (Merged['content_type']!='MISC') & (Merged['content_type']!='Review')]
counts.shape

(4430, 25)

This for loop stores the metadata of each paper in the dictionary dict, keyed by JSTOR ID, provided the content_type is not miscellaneous or a review. Each entry has the form:

JSTOR_id: { 

    'affiliations': {'found': affiliations_text_if_found}, 
    'content_type': content_type, 
    'authors': [author1, author2, author3 ...], 
    'stable_url': stable_url
   }
   
Note: if affiliations are not found, the 'affiliations' field instead contains a dictionary of the form:

'raw': {

    '0': [parsed_text_on_page_0 separated by commas], 
    '1': [parsed_text_on_page_1 separated by commas] ...
   }

In [None]:
import os
import time

t0=time.time()

# Articles from 1940 onwards, excluding miscellaneous items and reviews
subset=Merged[(Merged['year']>=1940) & (Merged['content_type']!='MISC') & (Merged['content_type']!='Review')]
for i in subset.index:
    row=Merged.loc[i]
    # skip rows without authors and "Suggested by" placeholder entries
    if pd.isna(row['Jstor_authors']) or "Suggested by" in row['Jstor_authors']:
        continue
    authors=str(row['Jstor_authors']).replace(' and ',', ').replace("  ",' ').split(',')
    jstor_id=row['stable_url'].split('/')[-1]
    filepath=path+'\\'+jstor_id+'.pdf'
    if os.path.exists(filepath):
        print(row['year'])
        first_author=converter(authors[0])
        print(first_author)
        n_pages=getNumberofPages(row['pages'])
        affiliations=generate_pngs(filepath, n_pages, mat, path, 52, first_author.strip())
        dict[jstor_id]={'affiliations':affiliations, 'content_type':row['content_type'], 'authors':authors, 'stable_url': row['stable_url']}
    else:
        dict[jstor_id]='PDF not available, download at '+ row['stable_url']
t1=time.time()
total=t1-t0
print(total)
print(i)

Save the dictionary containing affiliations inside a json file.

In [65]:
import json
with open(path+'//JPE_affiliation_output_aff2.json','w') as fp:
    json.dump(dict, fp)
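
Because each entry can take one of three shapes (author block found, raw text only, or a plain string when the PDF is missing), it helps to tally them after reloading the file. The sketch below does this on a small made-up dictionary; the helper name `summarise` and the JSTOR IDs are hypothetical. It also shows that the integer page numbers used as keys under 'raw' come back as strings after a round trip through JSON.

```python
import json


def summarise(results):
    """Count entries by outcome: author block found, raw text only, PDF missing."""
    counts = {'found': 0, 'raw': 0, 'missing_pdf': 0}
    for jstor_id, entry in results.items():
        if isinstance(entry, str):
            # 'PDF not available, download at ...'
            counts['missing_pdf'] += 1
        elif 'found' in entry['affiliations']:
            counts['found'] += 1
        else:
            counts['raw'] += 1
    return counts


# Made-up entries in the three shapes described above
example = {
    '0000001': {'affiliations': {'found': 'AUTHOR NAME\nSome University\n'},
                'content_type': 'Article', 'authors': ['Author Name'],
                'stable_url': 'https://www.jstor.org/stable/0000001'},
    '0000002': {'affiliations': {'raw': {0: ['block one', 'block two']}},
                'content_type': 'Note', 'authors': ['Another Author'],
                'stable_url': 'https://www.jstor.org/stable/0000002'},
    '0000003': 'PDF not available, download at https://www.jstor.org/stable/0000003',
}

# Simulate saving to and loading from the JSON file
round_tripped = json.loads(json.dumps(example))
print(summarise(round_tripped))
print(list(round_tripped['0000002']['affiliations']['raw']))  # page keys are now strings
```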

In [None]:
import json
# print pretty to view dictionary content
print(json.dumps(dict, sort_keys=False, indent=4))
print(len(dict.keys()))

### Extracting references
JPE has a dedicated references section at the end of each article from 1966 onwards. The generate_refs function therefore looks for the keyword 'References' using fuzzy matching (at most 3 character errors). Pages are read from the last page backwards, so when the keyword is found the function returns the text parsed up to that point, which covers the references heading and the pages after it. If the keyword is not found, the function returns the parsed text of the last 5 pages of the document.

In [81]:
import os
import time
import regex
def generate_refs(SCANNED_FILE, mat, path, k_val, keyword):
    doc = fitz.open(SCANNED_FILE)
    parsed={}
    references={}
    found=0
    for page in reversed(doc):
        if (page.number >= doc.page_count-5):
            png = path+"\\pages_png\\" + SCANNED_FILE.split('\\')[-1].split('.')[0] + '_page-%i.png' % page.number
            if os.path.exists(png)==False:
                pix = page.get_pixmap(matrix=mat)
                print(png)
                pix.save(png)

            parsed[page.number]=[]
            references[page.number]=[]

            original_image = cv2.imread(png)
            # convert the image to grayscale
            gray_image = cv2.cvtColor(original_image, cv2.COLOR_BGR2GRAY)

            #plt.figure(figsize=(25, 15))
            #plt.imshow(gray_image, cmap='gray')
            #plt.show()

            # Performing OTSU threshold
            ret, threshold_image = cv2.threshold(gray_image, 0, 255, cv2.THRESH_OTSU | cv2.THRESH_BINARY_INV)

            #plt.figure(figsize=(25, 15))
            #plt.imshow(threshold_image, cmap='gray')
            #plt.show()

            rectangular_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (k_val, k_val))

            # Applying dilation on the threshold image
            dilated_image = cv2.dilate(threshold_image, rectangular_kernel, iterations = 1)

            #plt.figure(figsize=(25, 15))
            #plt.imshow(dilated_image)
            #plt.show()

            # Finding contours
            contours, hierarchy = cv2.findContours(dilated_image, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)

            # Creating a copy of the image
            copied_image = original_image.copy()

            mask = np.zeros(original_image.shape, np.uint8)
            i=1
            # Loop through the identified contours.
            # Each bounding rectangle is cropped and passed to pytesseract,
            # which extracts the text inside it.
            # The extracted text is appended to parsed[page.number].
            for cnt in contours:
                x, y, w, h = cv2.boundingRect(cnt)
                # Cropping the text block for giving input to OCR
                cropped = copied_image[y:y + h, x:x + w]
                # Apply OCR on the cropped image
                text = pytesseract.image_to_string(cropped, lang='lat', config='--oem 3 --psm 1')
                #print(i)
                #print(text)
                parsed[page.number].append(text)
                #print(re.search(keyword.upper(),text.upper()))
                if regex.search(keyword, text.upper()) is not None:
                    print('found')
                    return {'found': parsed}
                #masked = cv2.drawContours(mask, [cnt], 0, (255, 255, 255), -1)
                i=i+1
    return {'raw': parsed}
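
The fuzzy pattern `(REFERENCES){e<=3}` asks the regex module to accept up to three character errors (insertions, deletions or substitutions). For readers without that module, the same tolerance can be sketched with a plain edit-distance check. The helpers `edit_distance` and `is_references_heading` are illustrative names, and unlike the regex search this sketch compares a whole line rather than searching inside a block of text.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def is_references_heading(line, keyword='REFERENCES', max_errors=3):
    """Return True if the line is within max_errors edits of the keyword."""
    return edit_distance(line.strip().upper(), keyword) <= max_errors


print(is_references_heading('References'))     # exact match
print(is_references_heading('REFERENCFS\n'))   # one misread character
print(is_references_heading('INTRODUCTION'))   # unrelated heading
```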

### Testing the generate_refs() function

In [76]:
t0=time.time()        
refs=generate_refs(path+'\\26549911.pdf', mat, path, 40, '\n(REFERENCES){e<=3}\n')
t1=time.time()
total=t1-t0
print(total)
refs

found
4.572999000549316


{'found': {48: ['This content downloaded from.\n137.158.158.62 on Wed, 30 Mar 2022 16:36:52 UTC\nAll use subject to https;//about jstor.org/terms\n',
   '1562 JOURNAL OF POLITICAL ECONOMY\n\nMatthews, S. A., and A. Postlewaite. 1995. *On Modeling Cheap Talk in Bayesian\nGames." In The Economics of Informational Decentralization: Complexity, Efficiency,\nand Stability; Essays in Honor of Stanley Reiter, edited by J. O. Ledyard, 347-66.\nBoston: Kluwer.\n\nRosenberg, D., E. Solan, and N. Vieille. 2013. "Strategic Information Exchange."\nGames and Econ. Behavior 82:444—067.\n\nSpence, M. 1973. "Job Market Signaling." Q. E. 87:3:\n\n'],
  47: ['This content downloaded from.\n137.158.158.62 on Wed, 30 Mar 2022 16:36:52 UTC\nAll use subject to https;//about jstor.org/terms\n',
   'SELLING INFORMATION 1561\n\nNote, however, that, because /iis no steeper than u(f)/p.\n\nVp) 8 9 - [\' e tnap\n\nL\n\n(recall that u/ (5) *- w(p)/p), and hence, replacing (f) and rearranging,\n\n1 [^\nwp) s zl wu (

Create the empty dictionary.

In [None]:
dict_ref={}

This for loop stores the references extracted via Tesseract, together with the metadata of each paper, in the dictionary dict_ref, keyed by JSTOR ID, provided the content_type is not miscellaneous or a review. Each entry has the form:

JSTOR_id: {

    'references': {
        'found': {
            'page_no': [Text containing or following references keyword separated by commas],
            'page_no': [Text containing or following references keyword separated by commas] ...
            }
         }, 
    'content_type': content_type, 
    'authors': [author1, author2, author3 ...], 
    'stable_url': stable_url
    }

Note: if references are not found, the 'references' field instead contains a dictionary of the form:

'raw': {

    'page_no': [parsed_text_on_page_no separated by commas], 
    'page_no': [parsed_text_on_page_no separated by commas] ...
    }

In [77]:
import os

t0=time.time()

# Articles from 1966 to 1970, excluding miscellaneous items and reviews
subset=Merged[(Merged['year']>1965) & (Merged['year']<=1970) & (Merged['content_type']!='MISC') & (Merged['content_type']!='Review')]
for i in subset.index:
    row=Merged.loc[i]
    # skip rows without authors and "Suggested by" placeholder entries
    if pd.isna(row['Jstor_authors']) or "Suggested by" in row['Jstor_authors']:
        continue
    authors=str(row['Jstor_authors']).replace(' and ',', ').replace("  ",' ').split(',')
    jstor_id=row['stable_url'].split('/')[-1]
    filepath=path+'\\'+jstor_id+'.pdf'
    if os.path.exists(filepath):
        print(row[['year','issue','volume', 'stable_url']])
        n_pages=getNumberofPages(row['pages'])  # prints the page ranges; generate_refs() does not need the count
        references=generate_refs(filepath, mat, path, 52, '(REFERENCES){e<=3}\n')
        dict_ref[jstor_id]={'references':references, 'content_type':row['content_type'], 'authors':authors, 'stable_url': row['stable_url']}
    else:
        dict_ref[jstor_id]='PDF not available, download at '+ row['stable_url']
t1=time.time()
total=t1-t0
print(total)
print(i)

year                                          1970
issue                                            6
volume                                          78
stable_url    https://www.jstor.org/stable/1830621
Name: 3824, dtype: object
['1213-1227']
found
year                                          1970
issue                                            6
volume                                          78
stable_url    https://www.jstor.org/stable/1830622
Name: 3825, dtype: object
['1228-1263']
found
year                                          1970
issue                                            6
volume                                          78
stable_url    https://www.jstor.org/stable/1830623
Name: 3826, dtype: object
['1264-1291']
found
year                                          1970
issue                                            6
volume                                          78
stable_url    https://www.jstor.org/stable/1830624
Name: 3827, dtype: object
['1292-1309']
found


found
year                                          1970
issue                                            4
volume                                          78
stable_url    https://www.jstor.org/stable/1829815
Name: 3867, dtype: object
['890-905']
found
year                                          1970
issue                                            4
volume                                          78
stable_url    https://www.jstor.org/stable/1829816
Name: 3868, dtype: object
['906-947']
found
year                                          1970
issue                                            4
volume                                          78
stable_url    https://www.jstor.org/stable/1829817
Name: 3869, dtype: object
['948-965']
found
year                                          1970
issue                                            4
volume                                          78
stable_url    https://www.jstor.org/stable/1829818
Name: 3870, dtype: object
['966-1006']
found

found
year                                          1970
issue                                            2
volume                                          78
stable_url    https://www.jstor.org/stable/1830687
Name: 3916, dtype: object
['274-278']
found
year                                          1970
issue                                            2
volume                                          78
stable_url    https://www.jstor.org/stable/1830688
Name: 3917, dtype: object
['279-290']
found
year                                          1970
issue                                            2
volume                                          78
stable_url    https://www.jstor.org/stable/1830689
Name: 3918, dtype: object
['291-305']
found
year                                          1970
issue                                            2
volume                                          78
stable_url    https://www.jstor.org/stable/1830690
Name: 3919, dtype: object
['306-310']
found

found
year                                          1970
issue                                            1
volume                                          78
stable_url    https://www.jstor.org/stable/1829636
Name: 3956, dtype: object
['175-177']
found
year                                          1970
issue                                            1
volume                                          78
stable_url    https://www.jstor.org/stable/1829637
Name: 3957, dtype: object
['178-180']
found
year                                          1970
issue                                            1
volume                                          78
stable_url    https://www.jstor.org/stable/1829638
Name: 3958, dtype: object
['181-184']
year                                          1969
issue                                            6
volume                                          77
stable_url    https://www.jstor.org/stable/1837201
Name: 3965, dtype: object
['873-891']
found

found
year                                          1969
issue                                            4
volume                                          77
stable_url    https://www.jstor.org/stable/1829322
Name: 4010, dtype: object
['586-627']
found
year                                          1969
issue                                            4
volume                                          77
stable_url    https://www.jstor.org/stable/1829323
Name: 4011, dtype: object
['628-652']
found
year                                          1969
issue                                            4
volume                                          77
stable_url    https://www.jstor.org/stable/1829324
Name: 4012, dtype: object
['653-664']
found
year                                          1969
issue                                            4
volume                                          77
stable_url    https://www.jstor.org/stable/1829325
Name: 4013, dtype: object
['665-683']
found

found
year                                          1969
issue                                            2
volume                                          77
stable_url    https://www.jstor.org/stable/1829768
Name: 4061, dtype: object
['242-244']
found
year                                          1969
issue                                            2
volume                                          77
stable_url    https://www.jstor.org/stable/1829769
Name: 4062, dtype: object
['245-248']
found
year                                          1969
issue                                            2
volume                                          77
stable_url    https://www.jstor.org/stable/1829770
Name: 4063, dtype: object
['249-273']
found
year                                          1969
issue                                            2
volume                                          77
stable_url    https://www.jstor.org/stable/1829771
Name: 4064, dtype: object
['274-285']
found

found
year                                          1968
issue                                            5
volume                                          76
stable_url    https://www.jstor.org/stable/1830038
Name: 4111, dtype: object
['1069-1077']
found
year                                          1968
issue                                            5
volume                                          76
stable_url    https://www.jstor.org/stable/1830039
Name: 4112, dtype: object
['1078-1084']
found
year                                          1968
issue                                            5
volume                                          76
stable_url    https://www.jstor.org/stable/1830040
Name: 4113, dtype: object
['1085-1087']
found
year                                          1968
issue                                            5
volume                                          76
stable_url    https://www.jstor.org/stable/1830041
Name: 4114, dtype: object
['1087']
found

found
year                                          1968
issue                                            4
volume                                          76
stable_url    https://www.jstor.org/stable/1830051
Name: 4152, dtype: object
['583-600']
found
year                                          1968
issue                                            4
volume                                          76
stable_url    https://www.jstor.org/stable/1830052
Name: 4153, dtype: object
['601-614']
found
year                                          1968
issue                                            4
volume                                          76
stable_url    https://www.jstor.org/stable/1830053
Name: 4154, dtype: object
['615-634']
found
year                                          1968
issue                                            4
volume                                          76
stable_url    https://www.jstor.org/stable/1830054
Name: 4155, dtype: object
['635-644']
found

found
year                                          1968
issue                                            1
volume                                          76
stable_url    https://www.jstor.org/stable/1830724
Name: 4199, dtype: object
['38-43']
found
year                                          1968
issue                                            1
volume                                          76
stable_url    https://www.jstor.org/stable/1830725
Name: 4200, dtype: object
['44-52']
found
year                                          1968
issue                                            1
volume                                          76
stable_url    https://www.jstor.org/stable/1830726
Name: 4201, dtype: object
['53-67']
found
year                                          1968
issue                                            1
volume                                          76
stable_url    https://www.jstor.org/stable/1830727
Name: 4202, dtype: object
['68-77']
found

found
year                                          1967
issue                                            5
volume                                          75
stable_url    https://www.jstor.org/stable/1829088
Name: 4248, dtype: object
['750-754']
found
year                                          1967
issue                                            5
volume                                          75
stable_url    https://www.jstor.org/stable/1829089
Name: 4249, dtype: object
['755-760']
found
year                                          1967
issue                                            5
volume                                          75
stable_url    https://www.jstor.org/stable/1829090
Name: 4250, dtype: object
['761-762']
found
year                                          1967
issue                                            5
volume                                          75
stable_url    https://www.jstor.org/stable/1829091
Name: 4251, dtype: object
['763-764']
found

found
year                                          1967
issue                                            4
volume                                          75
stable_url    https://www.jstor.org/stable/1832173
Name: 4285, dtype: object
['651-654']
year                                          1967
issue                                            4
volume                                          75
stable_url    https://www.jstor.org/stable/1828594
Name: 4287, dtype: object
['321-334']
found
year                                          1967
issue                                            4
volume                                          75
stable_url    https://www.jstor.org/stable/1828595
Name: 4288, dtype: object
['335-351']
found
year                                          1967
issue                                            4
volume                                          75
stable_url    https://www.jstor.org/stable/1828596
Name: 4289, dtype: object
['352-365']
found

found
year                                          1967
issue                                            1
volume                                          75
stable_url    https://www.jstor.org/stable/1829545
Name: 4351, dtype: object
['49-62']
found
year                                          1967
issue                                            1
volume                                          75
stable_url    https://www.jstor.org/stable/1829546
Name: 4352, dtype: object
['63-70']
found
year                                          1967
issue                                            1
volume                                          75
stable_url    https://www.jstor.org/stable/1829547
Name: 4353, dtype: object
['71-76']
found
year                                          1967
issue                                            1
volume                                          75
stable_url    https://www.jstor.org/stable/1829548
Name: 4354, dtype: object
['77-85']
found

found
year                                          1966
issue                                            4
volume                                          74
stable_url    https://www.jstor.org/stable/1829155
Name: 4420, dtype: object
['396-400']
found
year                                          1966
issue                                            4
volume                                          74
stable_url    https://www.jstor.org/stable/1829156
Name: 4421, dtype: object
['401-402']
found
year                                          1966
issue                                            4
volume                                          74
stable_url    https://www.jstor.org/stable/1829157
Name: 4422, dtype: object
['403-405']
found
year                                          1966
issue                                            4
volume                                          74
stable_url    https://www.jstor.org/stable/1829158
Name: 4423, dtype: object
['406']

Save references as a json file.

In [79]:
import json
with open(path+'//JPE_refs_output_1966_1970.json','w') as fp:
    json.dump(dict_ref, fp)

In [27]:
inv=Merged[(Merged['year']<=1966) & (Merged['year']>=1940)&(Merged['content_type']!='MISC')&(Merged['content_type']!='Review')&(Merged['content_type']!='Discussion')]
inv.shape

(1139, 25)

### References for 1965 and earlier
References are expected to appear in the footnotes from 1940 to 1965 (inclusive). pdfminer.six can report the size of the font, for example through its XML output, so footnotes can be picked out with heuristics: references sit at the bottom of the page and are set in a different font type and size from the body text.
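
A minimal sketch of that heuristic, independent of pdfminer so that it runs on its own. It assumes each text box has already been reduced to its vertical extent, a representative font size and its text; the `TextBox` class, the `footnote_candidates` function, the 35% cut-off and the sample boxes are all hypothetical. Coordinates follow the PDF convention used by pdfminer, where y is measured from the bottom of the page.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TextBox:
    y0: float         # bottom edge, measured from the bottom of the page
    y1: float         # top edge
    font_size: float  # representative font size of the box
    text: str


def footnote_candidates(boxes, page_height, bottom_fraction=0.35):
    """Keep boxes that lie low on the page and use a smaller font than the body."""
    body_size = median(b.font_size for b in boxes)
    cutoff = page_height * bottom_fraction
    return [b for b in boxes if b.y1 <= cutoff and b.font_size < body_size]


# Made-up page: a small running header, three body paragraphs, two footnotes
boxes = [
    TextBox(595.0, 602.0, 8.0, 'RUNNING HEADER'),
    TextBox(475.0, 581.0, 10.0, 'body paragraph one'),
    TextBox(360.0, 470.0, 10.0, 'body paragraph two'),
    TextBox(240.0, 355.0, 10.0, 'body paragraph three'),
    TextBox(100.0, 120.0, 8.0, '1 first footnote'),
    TextBox(70.0, 90.0, 8.0, '2 second footnote'),
]
for box in footnote_candidates(boxes, page_height=650.0):
    print(box.text)
```

The position test is what keeps the running header out, since it shares the small font of the footnotes. Real pages would need the thresholds checked against a sample of articles.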

In [19]:
from io import StringIO
import logging

from pdfminer.high_level import extract_pages, extract_text, extract_text_to_fp
from pdfminer.layout import LAParams, LTChar, LTLine, LTTextContainer

# suppress pdfminer warnings
# https://stackoverflow.com/questions/29762706/warnings-on-pdfminer
logging.getLogger('pdfminer').setLevel(logging.ERROR)
logging.getLogger().setLevel(logging.ERROR)

In [35]:
output_string = StringIO()
# filepath = path + '\\2138780.pdf'  # earlier test document, overridden below
filepath = 'D:\\docs\\Masters\\Data\\ECTA_data\\1906861.pdf'
with open(filepath, 'rb') as fin:
    # Render the PDF as HTML into output_string.
    extract_text_to_fp(fin, output_string, laparams=LAParams(), output_type='html', codec=None)

In [36]:
for page_layout in extract_pages(filepath):
    for element in page_layout:
        print(element)


<LTTextBoxHorizontal(0) 50.000,621.788,514.684,713.788 'The Possibilities and Limitations of Objective Sampling in Strengthening Agricultural\nStatistics\nAuthor(s): Charles F. Sarle\nSource: Econometrica, Vol. 8, No. 1 (Jan., 1940), pp. 45-61\nPublished by: The Econometric Society\nStable URL: http://www.jstor.org/stable/1906861 .\nAccessed: 15/01/2015 07:49\n'>
<LTTextBoxHorizontal(1) 50.000,578.830,476.530,601.613 'Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at .\nhttp://www.jstor.org/page/info/about/policies/terms.jsp\n'>
<LTTextBoxHorizontal(2) 50.000,518.830,552.980,565.613 ' .\nJSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of\ncontent in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms\nof scholarship. For more information about JSTOR, please contact support@jstor.org.\n'>
<LTText

<LTTextBoxHorizontal(0) 149.880,595.141,239.033,602.261 'OBJECTIVE  SAMPLING \n'>
<LTTextBoxHorizontal(1) 342.960,594.250,355.460,604.250 '47 \n'>
<LTTextBoxHorizontal(2) 38.700,474.670,355.500,580.970 'tural  township  reporting  for  his  locality  in  addition  to  a  reporter  in \neach  county  reporting  for  the  entire  county.  The  original  plan  called \nfor several  subreporters  to  report to  the  county  reporter on  conditions \nin the  parts  of the  county  where they  lived-a  plan  that  never  seemed \nto  work  out  very  well  in  practice.  Furthermore,  it  was  impossible  to \npersuade  more than  50 to  possibly  75 per  cent  of the  county  reporters \nto  respond  in  any  one  month.  There  was  obviously  a  limit  to  the \namount  of  service  these  people  would  give  without  financial  com- \npensation. \n'>
<LTTextBoxHorizontal(3) 38.880,378.190,355.950,472.790 'The  part-time  state  agents  frequently  were men  on  the  agricultural \nstaff

<LTTextBoxHorizontal(0) 151.200,592.321,240.473,599.561 'OBJECTIVE  SAMPLING \n'>
<LTTextBoxHorizontal(1) 344.520,591.070,357.020,601.070 '51 \n'>
<LTTextBoxHorizontal(2) 39.780,508.390,356.300,578.870 'for 2 successive  years  he  obtained  data  that  would  furnish  an  indica- \ntion  of  change  in  acreage  of  the  various  crops  found  along  the  route. \nThese  were  objective  observations  and  although  the  routes  might  not \nbe representative,  he was able to  obtain  a sample of all farms along the \nroute-provided  of course he could identify  the  crops in  the  field  as he \nsped  along. \n'>
<LTTextBoxHorizontal(3) 39.300,387.550,356.230,506.870 'Later  the  automobile  was  substituted  for  the  train  and  telephone \npoles  along  the  highways  were  counted  or the  number  of  fields  in  the \nvarious  crops were  counted.  Later, about  1925,  one  of  the  field statis- \nticians,  D.  A.  McCandliss,  invented  the  "crop-meter"  for  measuring \nthe 

<LTTextBoxHorizontal(0) 149.280,592.921,238.493,599.981 'OBJECTIVE  SAMPLING \n'>
<LTTextBoxHorizontal(1) 342.360,592.630,354.860,602.630 '55 \n'>
<LTTextBoxHorizontal(2) 38.160,519.910,355.240,578.570 'normal"  obtained  on  the  first  of  each  month  from  the  regular  crop \ncorrespondents of the  Department.  Later in the growing season a report \non  the  probable  yield  per acre also is  obtained.  Both  of these  samples \nare judgment  inquiries  based  on the  mass  opinion  of the  reporters and \nthey  apply  to  the  locality  in  which  the  crop  reporter  lives. \n'>
<LTTextBoxHorizontal(3) 38.280,375.790,355.520,518.630 'The  condition  reports  are not  so likely  to  be  selective  as  are the  re- \nports  on  yield  per  acre  or  acreage.  The  condition  data  also  are  less \nvariable.  The  very  nature  of  the  condition  inquiry  leads  to  less  varia- \nbility  in  the  individual  observations.  Normal  yields  may  vary  con- \nsiderably  from one  a

<LTTextBoxHorizontal(0) 146.640,591.864,241.552,599.984 'OBJECTIVE  SAMPLING \n'>
<LTTextBoxHorizontal(1) 340.080,591.970,352.580,601.970 '59 \n'>
<LTTextBoxHorizontal(2) 35.760,555.310,351.770,578.150 'fruiting  habits,  like  cotton,  can  be  greatly  increased  even  late  in  the \nseason  by  unusually  favorable  weather. \n'>
<LTTextBoxHorizontal(3) 35.880,495.430,353.150,554.270 'The  earlier in  the  growing  season  the  forecast  is  made  the  greater  is \nthe  hazard of  subsequent  weather.  It  is  not  surprising,  therefore,  that \nearly  season  condition  usually  is  not  sufficiently  related  to  final yield \nto justify  its  use in forecasting.  For many  years,  the  condition  of cotton \nwas  obtained  as  of  May  25.  In  Figure  2 it  will  be  seen  that  the  rela- \n'>
<LTTextBoxHorizontal(4) 65.280,463.404,321.332,471.944 'RELATION  OF  MAY 25  UNITED  STATES  COTTON  CONDITION \n'>
<LTTextBoxHorizontal(5) 82.320,454.464,295.684,462.944 'TO  FINAL  

In [38]:
for page_layout in extract_pages(filepath):
    for element in page_layout:
        print(element)
        # Only text containers hold lines and characters.
        if isinstance(element, LTTextContainer):
            for text_line in element:
                print(text_line)
                for character in text_line:
                    # LTChar carries the font name and size of each glyph.
                    if isinstance(character, LTChar):
                        print(character.fontname)
                        print(character.size)
                        print(character)


<LTTextBoxHorizontal(0) 50.000,621.788,514.684,713.788 'The Possibilities and Limitations of Objective Sampling in Strengthening Agricultural\nStatistics\nAuthor(s): Charles F. Sarle\nSource: Econometrica, Vol. 8, No. 1 (Jan., 1940), pp. 45-61\nPublished by: The Econometric Society\nStable URL: http://www.jstor.org/stable/1906861 .\nAccessed: 15/01/2015 07:49\n'>
<LTTextLineHorizontal 50.000,702.788,514.684,713.788 'The Possibilities and Limitations of Objective Sampling in Strengthening Agricultural\n'>
TPZMNE+Code2000
11.0
<LTChar 50.000,702.788,56.952,713.788 matrix=[1.00,0.00,0.00,1.00, (50.00,706.00)] font='TPZMNE+Code2000' adv=6.952 text='T'>
TPZMNE+Code2000
11.0
<LTChar 56.952,702.788,64.102,713.788 matrix=[1.00,0.00,0.00,1.00, (56.95,706.00)] font='TPZMNE+Code2000' adv=7.15 text='h'>
TPZMNE+Code2000
11.0
<LTChar 64.102,702.788,69.338,713.788 matrix=[1.00,0.00,0.00,1.00, (64.10,706.00)] font='TPZMNE+Code2000' adv=5.236000000000001 text='e'>
TPZMNE+Code2000
11.0
<LTChar 69.338,70

<LTTextBoxHorizontal(0) 149.880,595.141,239.033,602.261 'OBJECTIVE  SAMPLING \n'>
<LTTextLineHorizontal 149.880,595.141,239.033,602.261 'OBJECTIVE  SAMPLING \n'>
Times-Bold
7.0
<LTChar 149.880,595.261,155.326,602.261 matrix=[1.00,0.00,0.00,1.00, (149.88,596.78)] font='Times-Bold' adv=5.446 text='O'>
Times-Bold
7.0
<LTChar 155.326,595.261,159.995,602.261 matrix=[1.00,0.00,0.00,1.00, (155.33,596.78)] font='Times-Bold' adv=4.6690000000000005 text='B'>
Times-Bold
7.0
<LTChar 159.995,595.261,163.495,602.261 matrix=[1.00,0.00,0.00,1.00, (160.00,596.78)] font='Times-Bold' adv=3.5 text='J'>
Times-Bold
7.0
<LTChar 163.495,595.261,168.164,602.261 matrix=[1.00,0.00,0.00,1.00, (163.50,596.78)] font='Times-Bold' adv=4.6690000000000005 text='E'>
Times-Bold
7.0
<LTChar 168.164,595.261,173.218,602.261 matrix=[1.00,0.00,0.00,1.00, (168.16,596.78)] font='Times-Bold' adv=5.054 text='C'>
Times-Bold
7.0
<LTChar 173.218,595.261,177.887,602.261 matrix=[1.00,0.00,0.00,1.00, (173.22,596.78)] font='Times-Bold' 


