# AER Cleaning

This notebook walks through how the records in the AER master list were sorted into articles and non-article content types.

## Load Libraries

In [88]:
from numpy import nan as NaN  # numpy.NaN was removed in NumPy 2.0; nan is the same value
import pandas as pd
from difflib import SequenceMatcher
from multiprocessing import Pool
import multiprocessing as mp
import time
import os  # os.path.exists is used further down; `path` is later reused as a folder string
import concurrent.futures
from multiprocessing import freeze_support
from pathlib import Path
from PyPDF2 import PdfFileReader, PdfFileWriter  # PyPDF2 < 3.0 names; 3.x renames them PdfReader/PdfWriter
import re
import slate3k as slate


## Load Files

Replace the file paths below with your local paths.

In [89]:
masters = pd.read_excel("C:\\Users\\sjwu1\\Journal_Data\\Master lists\\AER_master.xlsx")
pivots = pd.read_excel("C:\\Users\\sjwu1\\Journal_Data\\pivots\\AER_pivots.xlsx")
scopus = pd.read_excel("C:\\Users\\sjwu1\\Journal_Data\\Scopus\\AER_SCOPUS.xlsx")
datadump = pd.read_excel("C:\\Users\\sjwu1\\Journal_Data\\datadumps\\AER_datadump.xlsx")
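The hard-coded Windows paths above can be derived from a single base folder, so that only one line needs editing on another machine. The following is a minimal sketch, not the notebook's own code: `DATA_DIR` and `input_paths` are hypothetical names, and the folder layout is assumed to mirror the paths shown above.

```python
from pathlib import Path

# Hypothetical base folder; point this at wherever Journal_Data lives locally.
DATA_DIR = Path("Journal_Data")


def input_paths(journal, base=DATA_DIR):
    """Return the four input workbooks for one journal code, e.g. "AER"."""
    return {
        "masters": base / "Master lists" / f"{journal}_master.xlsx",
        "pivots": base / "pivots" / f"{journal}_pivots.xlsx",
        "scopus": base / "Scopus" / f"{journal}_SCOPUS.xlsx",
        "datadump": base / "datadumps" / f"{journal}_datadump.xlsx",
    }


paths = input_paths("AER")
for name, p in paths.items():
    print(name, p)
```

Each value can be passed directly to `pd.read_excel`, which accepts `Path` objects.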



## Create file names

In [90]:
authors="C:\\Users\\sjwu1\\Journal_Data\\datadumps\\AER_authors.xlsx"
non_auth="C:\\Users\\sjwu1\\Journal_Data\\datadumps\\AER_Nauthors.xlsx"
saveas="C:\\Users\\sjwu1\\Journal_Data\\datadumps\\AER_processed.xlsx"
reviews="C:\\Users\\sjwu1\\Journal_Data\\datadumps\\AER_reviews.xlsx"
misc="C:\\Users\\sjwu1\\Journal_Data\\datadumps\\AER_misc.xlsx"
conf="C:\\Users\\sjwu1\\Journal_Data\\datadumps\\AER_conf.xlsx"

## Some random checks on the masters list

My assumption is that every record without author names is a miscellaneous document, such as a committee report, a foreword or front matter. The goal of this notebook is to verify that all the documents without author names really are miscellaneous documents and then classify them as miscellaneous (MISC). Hence, we first count the records by title to see which generic, repeated titles can likely be removed.

Note: in both cases I've restricted the output to 20 rows for ease of viewing on GitHub, which has no scroll function for output.


In [171]:
pd.set_option('display.max_rows',masters.shape[0])
temp=masters['title'].str.lower().value_counts()
pd.DataFrame(temp[temp>1]).head(20)

title,count
new books,2015
front matter,553
discussion,542
back matter,444
notes,304
periodicals,204
volume information,112
titles of new books,107
"documents, reports, and legislation",89
report of the finance committee,66
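To make the counting step concrete, here is a small sketch on made-up data. The toy titles and authors are invented for illustration only; the calls mirror the two cells in this section.

```python
import pandas as pd

# Invented records, only to show how the counts behave.
toy = pd.DataFrame({
    "title": ["Front Matter", "front matter", "New Books", "Discussion", "A Theory of X"],
    "authors": [None, None, None, "J. Doe", "A. Smith"],
})

# Lower-casing first merges titles that differ only in capitalisation.
counts = toy["title"].str.lower().value_counts()
repeated = counts[counts > 1]

# The same count restricted to records without author names.
unauthored = toy[toy["authors"].isna()]["title"].str.lower().value_counts()

print(repeated)
print(unauthored)
```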


Some repetitions are due to multiple comments. Now consider the same list restricted to records without author names.

In [172]:
temp2=masters[masters['authors'].isna()]['title'].str.lower().value_counts()
pd.DataFrame(temp2).head(20)

title,count
new books,2009
front matter,553
back matter,443
notes,301
periodicals,204
volume information,112
titles of new books,107
"documents, reports, and legislation",72
report of the finance committee,63
report of the auditor,37


There are also many reports with unique titles because the year of the report is included in the title. Interestingly, discussions no longer appear once the table is restricted to records without author names, which indicates that they have authors and may be non-administrative documents.

The next block corrects for individual errors that were noted.

In [93]:
#Block for misspelling or renaming of data
masters.loc[8990,'title']="Back Matter"
masters.loc[10861,'title']="Back Matter"
masters.loc[16376,'title']="Foreword"
masters.loc[25807,'title']="Documents, Reports and Legislation"
masters.loc[25815,'authors']="Alexander Marx"
masters.loc[25720,'authors']="Review by: James Bonar"
masters.loc[6425,'content_type']="Discussion"
masters.loc[2284,'authors']="Victoria Ivashina and David Scharfstein"
masters.loc[503,'authors']="Jennifer L. Doleac and Benjamin Hansen"
masters.loc[22177,'authors']="Review by: W. L. Crum"
masters.loc[22176,'authors']="Review by: Gardiner C. Means"
masters.loc[24681,'authors']="Review by: Victor H. Pelz"
masters.loc[6073,'authors']='Haizhou Huang'
masters.loc[19384,'authors']='Review by: Anon'
masters.loc[6149,'content_type']="Discussion"
masters.loc[18729,'authors']='Anon'
masters.loc[14710,'authors']='Anon'
masters.loc[14710,'title']='Human Resources: The Wealth of a Nation by Eli Ginzberg: Erratum'
masters.loc[24876,'authors']='Review by: Henry Pratt Fairchild'
masters.loc[11919,'authors']='Review by: Anon'
masters.loc[23831,'authors']='Review by: Roy G. Blakey'
masters.loc[24620,'authors']='Review by: Ralph H. Blanchard'
masters.loc[27402,'authors']='Review by: Anon'
masters.loc[19927,'authors']='Anon'
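One caveat with label-based corrections like these: assigning through `.loc` with a row label that does not exist appends a new row instead of failing. The sketch below shows one way to guard against that by keeping the corrections in a list and checking each label first. It is an illustration only; the toy frame and its initial values are made up, and just the two corrected values are taken from the block above.

```python
import pandas as pd

# Made-up stand-in for two rows of the master list.
toy = pd.DataFrame(
    {"title": ["Bck Matter", "Some Paper"], "authors": [None, None]},
    index=[8990, 6073],
)

# (row label, column, corrected value)
corrections = [
    (8990, "title", "Back Matter"),
    (6073, "authors", "Haizhou Huang"),
]

for row, column, value in corrections:
    if row not in toy.index:
        # Without this check, .loc would silently add a new row.
        raise KeyError(f"row label {row} not found; refusing to append a new row")
    toy.loc[row, column] = value

print(toy)
```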

## Classifying Miscellaneous content

In [94]:
scopus.rename(columns = {'abstract':'abstract2', 'title':'title2', 'authors':'authors2'}, inplace = True)
scopus['pages2']=scopus['pages']
masters['pages']=masters['pages'].str.strip()  
masters.loc[masters.title.str.lower() == "back matter", 'pages'] = NaN  
pivots['type']=NaN

masters.loc[masters.title.str.lower() == "back matter", 'content_type'] = "MISC"  
masters.loc[masters.title.str.lower() == "front matter", 'content_type'] = "MISC"
masters.loc[masters.title.str.lower() == "volume matter", 'content_type'] = "MISC"
masters.loc[masters.title == "Announcements", 'content_type'] = "MISC"
masters.loc[masters.title == "Announcement", 'content_type'] = "MISC"
masters.loc[masters.title.str.lower() == "foreword", 'content_type'] = "MISC"
masters.loc[masters.title == "Periodicals", 'content_type'] = "MISC"
masters.loc[masters.title.str.lower() == "doctoral dissertations", 'content_type'] = "MISC"
masters.loc[masters.title == "Editorial Statement", 'content_type'] = "MISC"
masters.loc[masters.title.str.lower() == "list of members", 'content_type'] = "MISC" 
masters.loc[masters.title.str.lower() == "annual meetings", 'content_type'] = "MISC" 
masters.loc[masters.title.str.lower() == "biographical listing of members", 'content_type'] = "MISC" 
masters.loc[masters.title.str.lower() == "honorary members", 'content_type'] = "MISC" 
masters.loc[masters.title.str.lower().str.contains("preliminary announcement of the program"), 'content_type'] = "MISC"
masters.loc[masters["title"].str.contains("Distinguished Fellow"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains(r"\[photograph\]"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("volume information"),'content_type']="MISC"
masters.loc[masters['title'].str.contains("The John Bates Clark Award"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("new books"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("new book"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("the american economic association"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("in memoriam"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("in memorium"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("memorial:"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("list of doctoral dissertations"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("notes") & masters['authors'].isna(),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("documents, reports and legislation") & masters['authors'].isna(),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("documents, reports, and legislation") & masters['authors'].isna(),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("editor") & masters["title"].str.lower().str.contains("introduction"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("classification of members"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("aer survey of members"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("annual business meeting"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("auditor") & masters["title"].str.lower().str.contains("report"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("proceedings of the") & masters["title"].str.lower().str.contains("annual meeting"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("report of the") & masters['authors'].isna(),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("minutes of the") & masters['authors'].isna(),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("minutes of business meetings") & masters['authors'].isna(),'content_type']="MISC"
masters.loc[masters["title"].str.len()<3,'content_type']='MISC'
masters.loc[masters['title'].str.match(r'^Program.*')==True,'content_type']='MISC'
masters.loc[masters['title'].str.match(r'^Business Meeting.*')==True,'content_type']='MISC'
#masters[masters['title'].str.lower().str.contains("review")]['title']
masters.loc[masters['title'].str.lower().str.match(r'the committee on.*')==True,'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'^report of.* representative')==True,'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'^report of.*committee on')==True,'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'.*francis.*walker.*award')==True,'content_type']='MISC'
masters.loc[masters['authors'].isna() & masters['content_type'].isna(),'content_type']='MISC' 
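The long chain of assignments above follows one pattern: build a mask from the lower-cased title, optionally require missing authors, then set `content_type` to "MISC". A possible way to express that as data is sketched below. This is not the notebook's code; the rule list holds only three patterns (two lifted from the block above, one exact-match regex standing in for the front/back matter lines), and the toy rows are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Front Matter", "In Memoriam: J. Doe", "Notes", "Notes on Tariffs", "Growth and Trade"],
    "authors": [None, "A. Smith", None, "B. Jones", "C. Lee"],
    "content_type": [None] * 5,
})

title = toy["title"].str.lower()
no_author = toy["authors"].isna()

# (regex on the lower-cased title, rule applies only when authors are missing)
rules = [
    (r"^(?:front|back) matter$", False),
    (r"in memoriam", False),
    (r"notes", True),
]

for pattern, needs_no_author in rules:
    mask = title.str.contains(pattern, regex=True)
    if needs_no_author:
        mask &= no_author
    toy.loc[mask, "content_type"] = "MISC"

print(toy[["title", "content_type"]])
```

"Notes on Tariffs" stays unclassified here because it has an author, which is the same behaviour as the `& masters['authors'].isna()` conditions above.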


In [95]:
#masters[masters["title"].str.lower().str.contains('affiliation')][['title','stable_url']]
#masters[masters['title'].str.match(r'Business Meeting*')==True]
#masters[masters["title"].str.lower().str.contains("aer survey of members")][['title','stable_url']]

One last check. Note: I found that after removing most of the miscellaneous content, the remaining records without author names were not articles, so the final rule in the block above labels all of them MISC. The count below should therefore be 0.

In [96]:
print(masters[masters['authors'].isna() & masters['content_type'].isna()]['title'].shape[0])
pd.set_option('display.max_colwidth', None)
masters[masters['authors'].isna() & masters['content_type'].isna()][['title','stable_url']].sort_values('title')

0


Unnamed: 0,title,stable_url


In [101]:
pd.set_option('display.max_rows',masters.shape[0])
print(sum(masters['content_type']=='MISC'))
#pd.DataFrame(masters['title'][masters['content_type']=='MISC']).sort_values('title')


5732


In [102]:
masters[masters.title.str.lower().str.match(r'.*:.*') & masters.content_type.isna()].head()

Unnamed: 0,stable_url,authors,title,abstract,content_type,issue_url,pages,year,volume,issue
4,https://www.jstor.org/stable/26848481,"Tatyana Deryugina, Garth Heutel, Nolan H. Miller, David Molitor and Julian Reif",The Mortality and Medical Costs of Air Pollution: Evidence from Changes in Wind Direction,,,https://www.jstor.org/stable/10.2307/e26848476,4178-4219,2019,109,12
6,https://www.jstor.org/stable/26848483,Sharon Traiberman,Occupations and Import Competition: Evidence from Denmark,,,https://www.jstor.org/stable/10.2307/e26848476,4260-4301,2019,109,12
8,https://www.jstor.org/stable/26848485,Luigi Bocola and Alessandro Dovis,Self-Fulfilling Debt Crises: A Quantitative Analysis,,,https://www.jstor.org/stable/10.2307/e26848476,4343-4377,2019,109,12
9,https://www.jstor.org/stable/26848486,"Mathieu Couttenier, Veronica Petrencu, Dominic Rohner and Mathias Thoenig","The Violent Legacy of Conflict: Evidence on Asylum Seekers, Crime, and Public Policy in Switzerland",,,https://www.jstor.org/stable/10.2307/e26848476,4378-4425,2019,109,12
12,https://www.jstor.org/stable/26807866,"Judd B. Kessler, Corinne Low and Colin D. Sullivan",Incentivized Resume Rating: Eliciting Employer Preferences without Deception,,,https://www.jstor.org/stable/10.2307/e26807864,3713-3744,2019,109,11


## Separating out other types

In [103]:
masters.loc[~(masters['authors'].isna()) & masters['authors'].str.lower().str.match(r'.*review by:.*'),'content_type']='Review'
masters[masters.content_type=='Review'].shape[0]

7144

In [104]:
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*: (|a )comment(|.*)$')==True,'content_type']='Comment'
masters[masters['content_type']=='Comment'].shape[0]

809

In [105]:
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*(:|\?) (|a )reply(| to.*)$')==True,'content_type']="Reply"
masters[masters['content_type']=='Reply'].shape[0]

505

In [106]:
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*: (|a )discussion$')==True,'content_type']="Discussion"
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'(^|a )discussion(|.*)$')==True,'content_type']="Discussion"
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*:.*(|a )discussion(|s)$')==True,'content_type']='Discussion'
masters[masters['content_type']=='Discussion'].shape[0]

592

In [107]:
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*(:|\?) (|a )rejoinder.*$')==True,'content_type']="Rejoinder"
masters[masters['content_type']=='Rejoinder'].shape[0]

52

In [108]:
masters.loc[masters['content_type'].isna(),'content_type']="Article"
masters[masters['content_type']=='Article'].shape[0]

12584
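Because each rule above only touches rows whose `content_type` is still empty, the first matching rule wins and anything left over becomes an Article. The sketch below reproduces that ordering on single titles with plain `re.match`, which anchors at the start of the string in the same way as `Series.str.match`. The `classify` helper and the sample titles are hypothetical; the patterns are copied from the cells above.

```python
import re

# Rules in the order the notebook applies them; the first match wins.
RULES = [
    ("Comment", [r".*: (|a )comment(|.*)$"]),
    ("Reply", [r".*(:|\?) (|a )reply(| to.*)$"]),
    ("Discussion", [
        r".*: (|a )discussion$",
        r"(^|a )discussion(|.*)$",
        r".*:.*(|a )discussion(|s)$",
    ]),
    ("Rejoinder", [r".*(:|\?) (|a )rejoinder.*$"]),
]


def classify(title):
    """Return the content type for one title (authors are ignored in this sketch)."""
    lowered = title.lower()
    for label, patterns in RULES:
        if any(re.match(p, lowered) for p in patterns):
            return label
    return "Article"


samples = [
    "The Phillips Curve: Comment",
    "Is Growth Good? A Reply to Jones",
    "Discussion",
    "Tariffs and Trade: Rejoinder",
    "Monetary Policy: Reply and Extension",
]
for s in samples:
    print(f"{s!r} -> {classify(s)}")
```

The last sample falls through to "Article" because the Reply pattern only allows nothing or " to ..." after the word "reply", which is the kind of edge case the regex-testing cell is useful for.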

In [109]:
# block for testing regex strings
#masters[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*: (|a )discussion(|.*)$')==True] #false positive for discussion
#masters[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*: (|a )comment(|.*)$')==True] comments to specific people
#masters[masters.title.str.lower().str.match(r'^(|a )note.*')]

In [114]:
masters[masters.content_type=='Article'].shape[0] #articles in data set

12584

In [117]:
masters[(masters['content_type']=='Article') & (masters.year>1939)].shape[0] # all articles from 1940 onward

11203

In [118]:
masters[(masters['content_type']=='Article') & (masters.year>1939) & (masters.year<2011)].shape[0] # articles between 1940 and 2010 (inclusive)

9340
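The year windows above can also be written with `Series.between`, which is inclusive on both ends by default and so states the 1940 to 2010 range directly. A minimal sketch on invented rows:

```python
import pandas as pd

# Invented rows covering both edges of the window.
toy = pd.DataFrame({
    "content_type": ["Article", "Article", "Review", "Article"],
    "year": [1939, 1940, 1975, 2011],
})

in_window = toy["year"].between(1940, 2010)  # 1940 <= year <= 2010
n_articles = ((toy["content_type"] == "Article") & in_window).sum()
counts = toy.loc[in_window, "content_type"].value_counts()

print(n_articles)
print(counts)
```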

## Consider the pivots file

At times, conference papers are structured differently from normal articles. Hence, it may be necessary to distinguish conference papers from common articles. The next code block separates special issues (S) from normal issues (N).

In [119]:
pivots.loc[pivots.Jstor_issue_text.str.lower().str.match(r'(.*)(supplement|proceedings|annual meeting|survey)(.*)'),'type']="S"
pivots.loc[pivots.type.isna(),'type']='N'
pivots[pivots.type=='S'].head()

Unnamed: 0,year,month,volume,issue,issue_url,Jstor_issue_text,journal,pivot_url,no_docs,type
30,2017,MAY,107,5,https://www.jstor.org/stable/10.2307/i40178116,"No. 5 PAPERS AND PROCEEDINGS OF THE One Hundred Twenty-Ninth Annual Meeting OF THE AMERICAN ECONOMIC ASSOCIATION MAY 2017 pp. i-xii, 1-681",AER,https://www.jstor.org/stable/44250350,132,S
42,2016,MAY,106,5,https://www.jstor.org/stable/10.2307/i40158602,"No. 5 PAPERS AND PROCEEDINGS OF THE One Hundred Twenty-Eighth Annual Meeting OF THE AMERICAN ECONOMIC ASSOCIATION MAY 2016 pp. i-xiv, 1-683",AER,https://www.jstor.org/stable/43860977,128,S
54,2015,MAY,105,5,https://www.jstor.org/stable/10.2307/i40156735,"No. 5 PAPERS AND PROCEEDINGS OF THE One Hundred Twenty-Seventh Annual Meeting OF THE AMERICAN ECONOMIC ASSOCIATION MAY 2015 pp. i-xiv, 1-682",AER,https://www.jstor.org/stable/43821842,124,S
66,2014,MAY,104,5,https://www.jstor.org/stable/10.2307/i40112127,"No. 5 PAPERS AND PROCEEDINGS OF One Hundred Twenty-Sixth Annual Meeting OF THE AMERICAN ECONOMIC ASSOCIATION MAY 2014 pp. i-xiv, 1-608, i-ii",AER,https://www.jstor.org/stable/42920902,104,S
75,2013,MAY,103,3,https://www.jstor.org/stable/10.2307/i23469657,"No. 3 PAPERS AND PROCEEDINGS OF THE One Hundred Twenty-Fifth Annual Meeting OF THE AMERICAN ECONOMIC ASSOCIATION MAY 2013 pp. i-xiv, 1-683",AER,https://www.jstor.org/stable/23469697,116,S
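The special-issue rule can be tried on individual issue strings as follows. This is a sketch: `issue_type` is a hypothetical helper, the first string is the 2017 issue text from the table above, and the second is an invented example of a regular issue.

```python
import re

# Same keyword pattern as the cell above, applied to lower-cased issue text.
SPECIAL = re.compile(r"(.*)(supplement|proceedings|annual meeting|survey)(.*)")


def issue_type(issue_text):
    """Return "S" for special issues and "N" for normal ones."""
    return "S" if SPECIAL.match(issue_text.lower()) else "N"


proceedings = (
    "No. 5 PAPERS AND PROCEEDINGS OF THE One Hundred Twenty-Ninth Annual Meeting "
    "OF THE AMERICAN ECONOMIC ASSOCIATION MAY 2017 pp. i-xii, 1-681"
)
regular = "No. 12 DECEMBER 2019 pp. 4071-4425"  # invented example text

print(issue_type(proceedings))
print(issue_type(regular))
```

Note that any issue text containing one of the keywords is flagged, so a regular issue whose description happens to mention "survey" would also be labelled S.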


Merging the pivots with masters

In [120]:
result = pd.merge(masters, pivots[['issue_url','journal','type']], how="left", on="issue_url")

In [121]:
result.head()

Unnamed: 0,stable_url,authors,title,abstract,content_type,issue_url,pages,year,volume,issue,journal,type
0,https://www.jstor.org/stable/26848477,,Front Matter,,MISC,https://www.jstor.org/stable/10.2307/e26848476,,2019,109,12,AER,N
1,https://www.jstor.org/stable/26848478,"Marcella Alsan, Owen Garrick and Grant Graziani",Does Diversity Matter for Health? Experimental Evidence from Oakland,,Article,https://www.jstor.org/stable/10.2307/e26848476,4071-4111,2019,109,12,AER,N
2,https://www.jstor.org/stable/26848479,Drew Fudenberg and Annie Liang,Predicting and Understanding Initial Play,,Article,https://www.jstor.org/stable/10.2307/e26848476,4112-4141,2019,109,12,AER,N
3,https://www.jstor.org/stable/26848480,"Hanno Lustig, Andreas Stathopoulos and Adrien Verdelhan",The Term Structure of Currency Carry Trade Risk Premia,,Article,https://www.jstor.org/stable/10.2307/e26848476,4142-4177,2019,109,12,AER,N
4,https://www.jstor.org/stable/26848481,"Tatyana Deryugina, Garth Heutel, Nolan H. Miller, David Molitor and Julian Reif",The Mortality and Medical Costs of Air Pollution: Evidence from Changes in Wind Direction,,Article,https://www.jstor.org/stable/10.2307/e26848476,4178-4219,2019,109,12,AER,N
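A left merge keeps every master record, but it can silently duplicate rows if an `issue_url` appears twice in the pivots, and it leaves `journal` and `type` empty for issues missing from the pivots. The sketch below shows how the `validate` and `indicator` options of `pd.merge` can surface both problems. The toy frames and their values are made up.

```python
import pandas as pd

masters_toy = pd.DataFrame({
    "stable_url": ["u1", "u2", "u3"],
    "issue_url": ["i1", "i1", "i9"],
})
pivots_toy = pd.DataFrame({
    "issue_url": ["i1", "i2"],
    "journal": ["AER", "AER"],
    "type": ["N", "S"],
})

merged = pd.merge(
    masters_toy,
    pivots_toy,
    how="left",
    on="issue_url",
    validate="many_to_one",  # raises if issue_url is duplicated in the pivots
    indicator=True,          # adds a _merge column recording where each row came from
)

# Master records whose issue was not found in the pivots.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["stable_url", "issue_url"]])
```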


## Creating counts of the data

In [122]:
pd.DataFrame(masters['content_type'].value_counts()) # counts by content_type

content_type,count
Article,12584
Review,7144
MISC,5732
Comment,809
Discussion,592
Reply,505
Rejoinder,52


In [123]:
pd.DataFrame(masters[masters.year>1939].content_type.value_counts()) # counts after 1940 (inclusive)

content_type,count
Article,11203
Review,4571
MISC,2964
Comment,792
Discussion,545
Reply,489
Rejoinder,46


In [124]:
pd.DataFrame(masters[(masters.year>1939) & (masters.year<2011)].content_type.value_counts()) 
# counts between 1940 and 2010 (inclusive)

content_type,count
Article,9340
Review,4571
MISC,2714
Comment,748
Discussion,545
Reply,461
Rejoinder,46


In [None]:
result.to_excel(saveas, index=False)

## Splitting up pdfs into different sections in preparation for extraction
### Save the first page as a pdf
We first determine the number of pages in each PDF to see how many pages need to be scanned. We then separate out the first page, which for AER articles should carry the affiliations at the bottom. Note: this is done for every record regardless of content_type, in case other content_types share the layout of normal articles, so that this process won't need to be run twice.


In [128]:
# Note: use the master list with content_type filled in, i.e. the processed file that was saved previously
path='C:\\Users\\sjwu1\\Journal_Data\\AER_data'
cleaned=pd.read_excel('C:\\Users\\sjwu1\\Journal_Data\\datadumps\\processed\\AER_processed.xlsx')


Get all the IDs.

In [130]:
id_list=cleaned['stable_url'].str.split('https://www.jstor.org/stable/').str[-1]
id_list.head()

0    26848477
1    26848478
2    26848479
3    26848480
4    26848481
Name: stable_url, dtype: object

This for-loop iterates through all the article IDs and, if the PDF is on file, extracts the first page.

In [None]:
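import os  # os.path.exists is used below; `path` itself holds the data folder

for a in id_list:
    pdf_path=path+'\\'+a+'.pdf'
    print(pdf_path)
    if os.path.exists(pdf_path):
        with open(pdf_path, 'rb') as read_stream:
            pdf_reader = PdfFileReader(read_stream)
            numPages=pdf_reader.numPages
            if numPages < 2:
                continue # no page at index 1 to extract
            pdf_writer=PdfFileWriter()
            # getPage is 0-indexed, so index 1 is the second page of the file.
            # Assumption: index 0 is a cover page that precedes the article.
            pdf_writer.addPage(pdf_reader.getPage(1))
            out=Path(path+'\\page1\\'+a+'_1.pdf') # the page1 folder must already exist
            with open(out, 'wb') as data:
                pdf_writer.write(data)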

### Save the pages of the reference list as a pdf
We use the slate3k library to scan for the occurrence of the word 'references' in all caps and other variations. Slate3k is a wrapper for PDFMiner, the underlying text-extraction library (it reads the text layer embedded in a PDF rather than performing OCR). pdfminer.six is a version currently maintained by the community, and it is able to recognise carriage returns in text. pdfminer.six seems more powerful, but I have not managed to run it without errors. Because this step takes relatively long and I am only interested in articles with references, the miscellaneous content_types are excluded.

The rationale is that for newer articles which have a dedicated reference list, the reference list would be anything after the point that the heading appears.
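As a sketch of that rationale, the helper below takes the text of each page in order and returns the position of the first page on which REFERENCES appears as a heading on its own line. It is an illustration under assumptions, not the notebook's method: `find_reference_start` is a hypothetical name, the page texts are invented, and it assumes the extracted text keeps the heading on a separate line.

```python
import re

# A line consisting only of the all-caps heading, ignoring surrounding spaces.
HEADING = re.compile(r"^[ \t]*REFERENCES[ \t]*$", re.MULTILINE)


def find_reference_start(page_texts):
    """Return the list position of the first page with a standalone heading, or None."""
    for position, text in enumerate(page_texts):
        if HEADING.search(text):
            return position
    return None


# Invented page texts standing in for the per-page output of the extractor.
pages = [
    "1 Introduction\nWe review the references in Section 2.",
    "Results\nSee the online appendix.",
    "Conclusion\n\nREFERENCES\nDoe, J. 2001. A Made-Up Paper.",
]

start = find_reference_start(pages)
print(start)
```

Requiring the heading to stand alone skips in-sentence mentions such as the one on the first toy page. The returned position could fill a column like `Ref_start`, with `None` meaning the heading was not found.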

In [133]:
#let's add 2 columns to the preprocessed dataframe:
cleaned['Ref_code']=None #whether the keyword 'REFERENCES' was found or not in the article
cleaned['Ref_start']=None #page on which the references supposedly starts


In [154]:
#remove miscellaneous
Ex_mis=cleaned[cleaned.content_type!='MISC']
Ex_mis.loc[1]['stable_url'].split('https://www.jstor.org/stable/')

['', '26848478']

In [156]:
#Search strings
String = "references"
String1 = "References"
String2 = "REFERENCES"

In [159]:
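# suppress logs
import logging
# Raising the root logger's level is what silences the warnings;
# `logging.propagate = False` only set an unused attribute on the module.
logging.getLogger().setLevel(logging.ERROR)
#https://stackoverflow.com/questions/29762706/warnings-on-pdfminer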

Looking at 10 articles, the reference headings are all in caps. However, this may not hold for older articles, as these are all from 2019. The regex is overkill because a substring test would be enough, but it may come in handy later.

In [174]:
import os  # needed for os.path.exists below

for a in Ex_mis.index[:10]:
    print(a)
    # article_id avoids shadowing the built-in id()
    article_id=Ex_mis.loc[a]['stable_url'].split('https://www.jstor.org/stable/')[-1]
    print(article_id)
    print(Ex_mis.loc[a]['year'])
    pdf_path=path+'\\'+article_id+'.pdf'
    temp=path+'\\dummy.pdf' # scratch file that holds one page at a time
    print(pdf_path)
    if os.path.exists(pdf_path):
        with open(pdf_path, 'rb') as read_stream:
            pdf_reader = PdfFileReader(read_stream)
            numPages=pdf_reader.numPages
            # start at index 1, i.e. skip the first page of the file (index 0)
            for i in range(1, numPages):
                # write page i to the scratch file so slate reads it on its own
                pdf_writer2=PdfFileWriter()
                pdf_writer2.addPage(pdf_reader.getPage(i))
                with open(temp, 'wb') as x:
                    pdf_writer2.write(x)
                with open(temp, 'rb') as page_stream:
                    Text = slate.PDF(page_stream).text()
                #if re.search(r'This content(.*)Conditions',Text):
                #    print('t&c found')
                #if re.search(r'VOL(.*)NO(.*):\d+',Text):
                #    print('header found')

                # report every spelling variant found on this page
                if re.search(String,Text):
                    print(String + " pattern Found on Page: " + str(i))

                if re.search(String1,Text):
                    print(String1 + " pattern Found on Page: " + str(i))

                if re.search(String2,Text):
                    print(String2 + " pattern Found on Page: " + str(i))

1
26848478
2019
C:\Users\sjwu1\Journal_Data\AER_data\26848478.pdf
references pattern Found on Page: 26
references pattern Found on Page: 34
references pattern Found on Page: 38
REFERENCES pattern Found on Page: 38
2
26848479
2019
C:\Users\sjwu1\Journal_Data\AER_data\26848479.pdf
references pattern Found on Page: 29
REFERENCES pattern Found on Page: 29
3
26848480
2019
C:\Users\sjwu1\Journal_Data\AER_data\26848480.pdf
REFERENCES pattern Found on Page: 34
4
26848481
2019
C:\Users\sjwu1\Journal_Data\AER_data\26848481.pdf
REFERENCES pattern Found on Page: 40
5
26848482
2019
C:\Users\sjwu1\Journal_Data\AER_data\26848482.pdf
references pattern Found on Page: 2
references pattern Found on Page: 6
REFERENCES pattern Found on Page: 39
6
26848483
2019
C:\Users\sjwu1\Journal_Data\AER_data\26848483.pdf
references pattern Found on Page: 29
REFERENCES pattern Found on Page: 40
7
26848484
2019
C:\Users\sjwu1\Journal_Data\AER_data\26848484.pdf
REFERENCES pattern Found on Page: 40
8
26848485
2019
C:\Use

### Testing OCR extraction on some affiliations
AER articles indicate the affiliations with an asterisk in the footnotes of the first page, so the affiliations can potentially be extracted with some string operations. Tested on 5 articles, the performance is quite good. However, these are articles from 2019 and they are not in the 2-column format of most pre-2000 AER articles. Tesseract has been shown to be capable of extracting text from 2-column layout documents: https://towardsdatascience.com/read-a-multi-column-pdf-with-pytesseract-in-python-1d99015f887a. Tesseract is necessary because most articles are scanned. There is another library for searchable pdfs called

In [169]:
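import os  # needed for os.path.exists below

for a in id_list[:5]:
    pdf_path=path+'\\page1\\'+a+'_1.pdf'
    print(pdf_path)
    if os.path.exists(pdf_path):
        with open(pdf_path, 'rb') as page_stream:
            Text = slate.PDF(page_stream).text()
        if '*' in Text:
            # the affiliation footnote is the text after the last asterisk
            m=Text.split('*')
            print(m[-1])
        else:
            print('asterisk indicator for affiliations not found')
    print('\n\n')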

C:\Users\sjwu1\Journal_Data\AER_data\page1\26848477_1.pdf



C:\Users\sjwu1\Journal_Data\AER_data\page1\26848478_1.pdf
 Alsan: Harvard Kennedy School, 79 JFK Street, Cambridge, MA 02138 (email: marcella_alsan@hks.harvard. edu); Garrick: Bridge Clinical Research, 333 Hegenberger Road, Suite 208, Oakland, CA 94621 (email: owen. garrick@bridgeclinical.com); Graziani: University of California, Berkeley, Evans Hall, Berkeley, CA 94720 (email: gcgraziani@berkeley.edu). Esther Duflo was the coeditor for this article. We are grateful to an anonymous coed- itor and four anonymous referees. We thank Pascaline Dupas and the J-PAL Board and Reviewers who provided important feedback that improved the design and implementation of the experiment. We thank Ran Abramitzky, Ned Augenblick, Jeremy Bulow, Kate Casey, Arun Chandrasekhar, Raj Chetty, Stefano DellaVigna, Mark Duggan, Karen Eggleston, Erica Field, Matthew Gentzkow, Gopi Shah Goda, Susan Godlonton, Jessica Goldberg, Michael Greenstone, Guido I
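The footnote printed above follows a "Surname: affiliation (email: address)" layout, which suggests the string operations could be a single regular expression. The sketch below is a possible starting point, not a tested extractor: the pattern and the `parse_affiliations` name are assumptions, the layout has only been observed in this one 2019 example, and the sample string is a shortened copy of the output above. Spaces inside email addresses are treated as extraction line-wrap artifacts and removed.

```python
import re

# Surname(s), a colon, the affiliation, then the email in parentheses.
ENTRY = re.compile(r"([A-Z][\w'\-]+(?: [A-Z][\w'\-]+)*): (.+?) \(email: ([^)]+)\)")


def parse_affiliations(footnote):
    """Return (surname, affiliation, email) tuples found in one footnote."""
    entries = []
    for surname, affiliation, email in ENTRY.findall(footnote):
        # Drop whitespace that line wrapping left inside the address.
        entries.append((surname, affiliation, re.sub(r"\s+", "", email)))
    return entries


footnote = (
    " Alsan: Harvard Kennedy School, 79 JFK Street, Cambridge, MA 02138 "
    "(email: marcella_alsan@hks.harvard. edu); Garrick: Bridge Clinical Research, "
    "333 Hegenberger Road, Suite 208, Oakland, CA 94621 (email: owen. "
    "garrick@bridgeclinical.com); Graziani: University of California, Berkeley, "
    "Evans Hall, Berkeley, CA 94720 (email: gcgraziani@berkeley.edu). "
    "Esther Duflo was the coeditor for this article."
)

affiliations = parse_affiliations(footnote)
for entry in affiliations:
    print(entry)
```

Older articles, or footnotes without email addresses, would need a different pattern.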