# AER Cleaning

This notebook walks through how the AER records were sorted into articles and non-article categories.

## Load Libraries

In [2]:
from numpy import nan as NaN  # the numpy.NaN alias was removed in NumPy 2.0
import pandas as pd
import time
from os import path
from pathlib import Path
from PyPDF2 import PdfFileReader, PdfFileWriter
import re
import os
from difflib import SequenceMatcher

## Load Files

Replace the file paths below with the corresponding paths on your local machine.

In [3]:
masters = pd.read_excel("C:\\Users\\sjwu1\\Journal_Data\\Master lists\\AER_master.xlsx")
pivots = pd.read_excel("C:\\Users\\sjwu1\\Journal_Data\\pivots\\AER_pivots.xlsx")
scopus = pd.read_excel("C:\\Users\\sjwu1\\Journal_Data\\Scopus\\AER_SCOPUS.xlsx")
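
One way to make the paths easier to relocate is to build them from a single base directory. The sketch below is only an illustration: `BASE_DIR` and `journal_paths` are hypothetical names, and the folder layout is assumed to mirror the paths used above.

```python
from pathlib import Path

# Hypothetical base directory; point this at the local copy of Journal_Data.
BASE_DIR = Path("Journal_Data")


def journal_paths(base_dir: Path, journal: str = "AER") -> dict:
    """Build the three input workbook paths for one journal under base_dir."""
    return {
        "masters": base_dir / "Master lists" / f"{journal}_master.xlsx",
        "pivots": base_dir / "pivots" / f"{journal}_pivots.xlsx",
        "scopus": base_dir / "Scopus" / f"{journal}_SCOPUS.xlsx",
    }


paths = journal_paths(BASE_DIR)
# The workbooks could then be read with, e.g., pd.read_excel(paths["masters"]).
```

Only `BASE_DIR` then needs to change when the notebook is run on another machine.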

## Create file names

In [3]:
authors="C:\\Users\\sjwu1\\Journal_Data\\datadumps\\AER_authors.xlsx"
non_auth="C:\\Users\\sjwu1\\Journal_Data\\datadumps\\AER_Nauthors.xlsx"
saveas="C:\\Users\\sjwu1\\Journal_Data\\datadumps\\AER_processed.xlsx"
reviews="C:\\Users\\sjwu1\\Journal_Data\\datadumps\\AER_reviews.xlsx"
misc="C:\\Users\\sjwu1\\Journal_Data\\datadumps\\AER_misc.xlsx"
conf="C:\\Users\\sjwu1\\Journal_Data\\datadumps\\AER_conf.xlsx"

## Some random checks on the masters list

My assumption is that all records without author names are miscellaneous documents such as committee reports, forewords, and front matter. The goal of this notebook is to verify that every record without author names is in fact a miscellaneous document and then classify it as miscellaneous (MISC). Hence, we first group the data by title to see the repetitive general content that can likely be removed.

Note: in both cases I've restricted the output to 20 rows for the sake of viewing on GitHub, since there is no scroll function for output.


In [171]:
pd.set_option('display.max_rows',masters.shape[0])
temp=masters['title'].str.lower().value_counts()
pd.DataFrame(temp[temp>1]).head(20)

title,count
new books,2015
front matter,553
discussion,542
back matter,444
notes,304
periodicals,204
volume information,112
titles of new books,107
"documents, reports, and legislation",89
report of the finance committee,66


Some repetitions are due to multiple comments. Now consider the same list restricted to records without author names.

In [172]:
temp2=masters[masters['authors'].isna()]['title'].str.lower().value_counts()
pd.DataFrame(temp2).head(20)

title,count
new books,2009
front matter,553
back matter,443
notes,301
periodicals,204
volume information,112
titles of new books,107
"documents, reports, and legislation",72
report of the finance committee,63
report of the auditor,37
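
The two counts above can be reproduced on toy data. This is a minimal sketch with made-up rows; only the column names `title` and `authors` come from the masters list.

```python
import pandas as pd

# Made-up records for illustration only.
toy = pd.DataFrame({
    "title": ["Front Matter", "front matter", "Discussion", "A Theory of X", "Discussion"],
    "authors": [None, None, "A. Smith", "B. Jones", "C. Lee"],
})

# Count lower-cased titles and keep those that occur more than once.
counts = toy["title"].str.lower().value_counts()
repeated = counts[counts > 1]

# Same count, restricted to records without author names.
unauthored = toy.loc[toy["authors"].isna(), "title"].str.lower().value_counts()
```

In this toy example "discussion" is repeated overall but drops out of the unauthored count because every discussion row has an author.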


There are also many reports with unique titles because the year of the report is included in the title. Interestingly, discussions no longer appear once the table is restricted to records without author names, indicating that discussions are authored and may be non-administrative documents.

The next block corrects for individual errors that were noted.

In [93]:
#Block for misspelling or renaming of data
masters.loc[8990,'title']="Back Matter"
masters.loc[10861,'title']="Back Matter"
masters.loc[16376,'title']="Foreword"
masters.loc[25807,'title']="Documents, Reports and Legislation"
masters.loc[25815,'authors']="Alexander Marx"
masters.loc[25720,'authors']="Review by: James Bonar"
masters.loc[6425,'content_type']="Discussion"
masters.loc[2284,'authors']="Victoria Ivashina and David Scharfstein"
masters.loc[503,'authors']="Jennifer L. Doleac and Benjamin Hansen"
masters.loc[22177,'authors']="Review by: W. L. Crum"
masters.loc[22176,'authors']="Review by: Gardiner C. Means"
masters.loc[24681,'authors']="Review by: Victor H. Pelz"
masters.loc[6073,'authors']='Haizhou Huang'
masters.loc[19384,'authors']='Review by: Anon'
masters.loc[6149,'content_type']="Discussion"
masters.loc[18729,'authors']='Anon'
masters.loc[14710,'authors']='Anon'
masters.loc[14710,'title']='Human Resources: The Wealth of a Nation by Eli Ginzberg: Erratum'
masters.loc[24876,'authors']='Review by: Henry Pratt Fairchild'
masters.loc[11919,'authors']='Review by: Anon'
masters.loc[23831,'authors']='Review by: Roy G. Blakey'
masters.loc[24620,'authors']='Review by: Ralph H. Blanchard'
masters.loc[27402,'authors']='Review by: Anon'
masters.loc[19927,'authors']='Anon'

## Classifying Miscellaneous content

In [94]:
scopus.rename(columns = {'abstract':'abstract2', 'title':'title2', 'authors':'authors2'}, inplace = True)
scopus['pages2']=scopus['pages']
masters['pages']=masters['pages'].str.strip()  
masters.loc[masters.title.str.lower() == "back matter", 'pages'] = NaN  
pivots['type']=NaN

masters.loc[masters.title.str.lower() == "back matter", 'content_type'] = "MISC"  
masters.loc[masters.title.str.lower() == "front matter", 'content_type'] = "MISC"
masters.loc[masters.title.str.lower() == "volume matter", 'content_type'] = "MISC"
masters.loc[masters.title == "Announcements", 'content_type'] = "MISC"
masters.loc[masters.title == "Announcement", 'content_type'] = "MISC"
masters.loc[masters.title.str.lower() == "foreword", 'content_type'] = "MISC"
masters.loc[masters.title == "Periodicals", 'content_type'] = "MISC"
masters.loc[masters.title.str.lower() == "doctoral dissertations", 'content_type'] = "MISC"
masters.loc[masters.title == "Editorial Statement", 'content_type'] = "MISC"
masters.loc[masters.title.str.lower() == "list of members", 'content_type'] = "MISC" 
masters.loc[masters.title.str.lower() == "annual meetings", 'content_type'] = "MISC" 
masters.loc[masters.title.str.lower() == "biographical listing of members", 'content_type'] = "MISC" 
masters.loc[masters.title.str.lower() == "honorary members", 'content_type'] = "MISC" 
masters.loc[masters.title.str.lower().str.contains("preliminary announcement of the program"), 'content_type'] = "MISC"
masters.loc[masters["title"].str.contains("Distinguished Fellow"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains(r"\[photograph\]"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("volume information"),'content_type']="MISC"
masters.loc[masters['title'].str.contains("The John Bates Clark Award"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("new books"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("new book"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("the american economic association"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("in memoriam"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("in memorium"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("memorial:"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("list of doctoral dissertations"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("notes") & masters['authors'].isna(),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("documents, reports and legislation") & masters['authors'].isna(),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("documents, reports, and legislation") & masters['authors'].isna(),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("editor") & masters["title"].str.lower().str.contains("introduction"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("classification of members"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("aer survey of members"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("annual business meeting"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("auditor") & masters["title"].str.lower().str.contains("report"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("proceedings of the") & masters["title"].str.lower().str.contains("annual meeting"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("report of the") & masters['authors'].isna(),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("minutes of the") & masters['authors'].isna(),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("minutes of business meetings") & masters['authors'].isna(),'content_type']="MISC"
masters.loc[masters["title"].str.len()<3,'content_type']='MISC'
masters.loc[masters['title'].str.match(r'^Program.*')==True,'content_type']='MISC'
masters.loc[masters['title'].str.match(r'^Business Meeting.*')==True,'content_type']='MISC'
#masters[masters['title'].str.lower().str.contains("review")]['title']
masters.loc[masters['title'].str.lower().str.match(r'the committee on.*')==True,'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'^report of.* representative')==True,'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'^report of.*committee on')==True,'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'.*francis.*walker.*award')==True,'content_type']='MISC'
masters.loc[masters['authors'].isna() & masters['content_type'].isna(),'content_type']='MISC' 
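
The rules in the cell above fall into three kinds: exact title matches, substring matches, and substring matches that apply only when the author field is empty. The sketch below shows one rule of each kind on made-up rows; it is an illustration of the pattern, not a replacement for the full rule list.

```python
import pandas as pd

# Made-up records for illustration only.
toy = pd.DataFrame({
    "title": ["Front Matter", "Notes", "Notes on Tariffs", "In Memoriam: J. Doe", "Growth and Trade"],
    "authors": [None, None, "A. Smith", None, "B. Jones"],
    "content_type": [None] * 5,
})

lower = toy["title"].str.lower()
no_author = toy["authors"].isna()

# Exact title match.
toy.loc[lower.isin(["front matter", "back matter", "foreword"]), "content_type"] = "MISC"
# Substring match anywhere in the title.
toy.loc[lower.str.contains("in memoriam", regex=False), "content_type"] = "MISC"
# Substring match that only counts when there is no author.
toy.loc[lower.str.contains("notes", regex=False) & no_author, "content_type"] = "MISC"
```

The authored "Notes on Tariffs" row is left unclassified, which is why the author condition matters for ambiguous words such as "notes".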


In [95]:
#masters[masters["title"].str.lower().str.contains('affiliation')][['title','stable_url']]
#masters[masters['title'].str.match(r'Business Meeting*')==True]
#masters[masters["title"].str.lower().str.contains("aer survey of members")][['title','stable_url']]

One last check. Note: after removing most of the miscellaneous content, I found that the remaining records without author names were not articles.

In [96]:
print(masters[masters['authors'].isna() & masters['content_type'].isna()]['title'].shape[0])
pd.set_option('display.max_colwidth', None)
masters[masters['authors'].isna() & masters['content_type'].isna()][['title','stable_url']].sort_values('title')

0


Unnamed: 0,title,stable_url


In [101]:
pd.set_option('display.max_rows',masters.shape[0])
print(sum(masters['content_type']=='MISC'))
#pd.DataFrame(masters['title'][masters['content_type']=='MISC']).sort_values('title')


5732


In [102]:
masters[masters.title.str.lower().str.match(r'.*:.*') & masters.content_type.isna()].head()

Unnamed: 0,stable_url,authors,title,abstract,content_type,issue_url,pages,year,volume,issue
4,https://www.jstor.org/stable/26848481,"Tatyana Deryugina, Garth Heutel, Nolan H. Miller, David Molitor and Julian Reif",The Mortality and Medical Costs of Air Pollution: Evidence from Changes in Wind Direction,,,https://www.jstor.org/stable/10.2307/e26848476,4178-4219,2019,109,12
6,https://www.jstor.org/stable/26848483,Sharon Traiberman,Occupations and Import Competition: Evidence from Denmark,,,https://www.jstor.org/stable/10.2307/e26848476,4260-4301,2019,109,12
8,https://www.jstor.org/stable/26848485,Luigi Bocola and Alessandro Dovis,Self-Fulfilling Debt Crises: A Quantitative Analysis,,,https://www.jstor.org/stable/10.2307/e26848476,4343-4377,2019,109,12
9,https://www.jstor.org/stable/26848486,"Mathieu Couttenier, Veronica Petrencu, Dominic Rohner and Mathias Thoenig","The Violent Legacy of Conflict: Evidence on Asylum Seekers, Crime, and Public Policy in Switzerland",,,https://www.jstor.org/stable/10.2307/e26848476,4378-4425,2019,109,12
12,https://www.jstor.org/stable/26807866,"Judd B. Kessler, Corinne Low and Colin D. Sullivan",Incentivized Resume Rating: Eliciting Employer Preferences without Deception,,,https://www.jstor.org/stable/10.2307/e26807864,3713-3744,2019,109,11


## Separating out other types

In [103]:
masters.loc[~(masters['authors'].isna()) & masters['authors'].str.lower().str.match(r'.*review by:.*'),'content_type']='Review'
masters[masters.content_type=='Review'].shape[0]

7144

In [104]:
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*: (|a )comment(|.*)$')==True,'content_type']='Comment'
masters[masters['content_type']=='Comment'].shape[0]

809

In [105]:
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*(:|\?) (|a )reply(| to.*)$')==True,'content_type']="Reply"
masters[masters['content_type']=='Reply'].shape[0]

505

In [106]:
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*: (|a )discussion$')==True,'content_type']="Discussion"
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'(^|a )discussion(|.*)$')==True,'content_type']="Discussion"
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*:.*(|a )discussion(|s)$')==True,'content_type']='Discussion'
masters[masters['content_type']=='Discussion'].shape[0]

592

In [107]:
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*(:|\?) (|a )rejoinder.*$')==True,'content_type']="Rejoinder"
masters[masters['content_type']=='Rejoinder'].shape[0]

52

In [108]:
masters.loc[masters['content_type'].isna(),'content_type']="Article"
masters[masters['content_type']=='Article'].shape[0]

12584
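
The title patterns from the cells above can be tried on individual titles with the standard `re` module. This is a sketch under two assumptions: the titles are made up, and `classify_title` is a hypothetical helper that ignores the MISC and Review steps, which the notebook applies first.

```python
import re

# Patterns copied from the cells above, in the order the notebook applies them.
RULES = [
    ("Comment", [r".*: (|a )comment(|.*)$"]),
    ("Reply", [r".*(:|\?) (|a )reply(| to.*)$"]),
    ("Discussion", [
        r".*: (|a )discussion$",
        r"(^|a )discussion(|.*)$",
        r".*:.*(|a )discussion(|s)$",
    ]),
    ("Rejoinder", [r".*(:|\?) (|a )rejoinder.*$"]),
]


def classify_title(title: str) -> str:
    """Return the first label whose pattern matches the lower-cased title."""
    lowered = title.lower()
    for label, patterns in RULES:
        if any(re.match(pattern, lowered) for pattern in patterns):
            return label
    # Anything left over is treated as an article, as in the last cell above.
    return "Article"


labels = [classify_title(t) for t in [
    "The Phillips Curve: Comment",
    "Is Growth Exogenous? A Reply",
    "Discussion",
    "Money and Prices: Rejoinder",
    "Optimal Taxation",
]]
```

Returning on the first match mirrors the `content_type.isna()` guard in the notebook, which stops a later rule from overwriting an earlier label.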

In [109]:
# block for testing regex strings
#masters[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*: (|a )discussion(|.*)$')==True] #false positive for discussion
#masters[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*: (|a )comment(|.*)$')==True] comments to specific people
#masters[masters.title.str.lower().str.match(r'^(|a )note.*')]

In [114]:
masters[masters.content_type=='Article'].shape[0] #articles in data set

12584

In [117]:
masters[(masters['content_type']=='Article') & ((masters.year>1939) ==True)].shape[0] #all articles from 1940 onward

11203

In [118]:
masters[(masters['content_type']=='Article') & ((masters.year>1939) ==True) & ((masters.year<2011) ==True)].shape[0] #articles between 1940 and 2010

9340

## Consider the pivots file

At times, conference papers are structured differently from normal articles, so it may be necessary to distinguish them. The next code block separates special issues (S) from normal issues (N) based on keywords in the JSTOR issue text.

In [119]:
pivots.loc[pivots.Jstor_issue_text.str.lower().str.match(r'(.*)(supplement|proceedings|annual meeting|survey)(.*)'),'type']="S"
pivots.loc[pivots.type.isna(),'type']='N'
pivots[pivots.type=='S'].head()

Unnamed: 0,year,month,volume,issue,issue_url,Jstor_issue_text,journal,pivot_url,no_docs,type
30,2017,MAY,107,5,https://www.jstor.org/stable/10.2307/i40178116,"No. 5 PAPERS AND PROCEEDINGS OF THE One Hundred Twenty-Ninth Annual Meeting OF THE AMERICAN ECONOMIC ASSOCIATION MAY 2017 pp. i-xii, 1-681",AER,https://www.jstor.org/stable/44250350,132,S
42,2016,MAY,106,5,https://www.jstor.org/stable/10.2307/i40158602,"No. 5 PAPERS AND PROCEEDINGS OF THE One Hundred Twenty-Eighth Annual Meeting OF THE AMERICAN ECONOMIC ASSOCIATION MAY 2016 pp. i-xiv, 1-683",AER,https://www.jstor.org/stable/43860977,128,S
54,2015,MAY,105,5,https://www.jstor.org/stable/10.2307/i40156735,"No. 5 PAPERS AND PROCEEDINGS OF THE One Hundred Twenty-Seventh Annual Meeting OF THE AMERICAN ECONOMIC ASSOCIATION MAY 2015 pp. i-xiv, 1-682",AER,https://www.jstor.org/stable/43821842,124,S
66,2014,MAY,104,5,https://www.jstor.org/stable/10.2307/i40112127,"No. 5 PAPERS AND PROCEEDINGS OF One Hundred Twenty-Sixth Annual Meeting OF THE AMERICAN ECONOMIC ASSOCIATION MAY 2014 pp. i-xiv, 1-608, i-ii",AER,https://www.jstor.org/stable/42920902,104,S
75,2013,MAY,103,3,https://www.jstor.org/stable/10.2307/i23469657,"No. 3 PAPERS AND PROCEEDINGS OF THE One Hundred Twenty-Fifth Annual Meeting OF THE AMERICAN ECONOMIC ASSOCIATION MAY 2013 pp. i-xiv, 1-683",AER,https://www.jstor.org/stable/23469697,116,S
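
The keyword rule can be checked on a few issue descriptions. In this sketch the first string is taken from the output above and the other two are made up; `str.contains` with `case=False` is used as a shorter equivalent of lower-casing and matching with `.*` on both sides.

```python
import pandas as pd

toy_pivots = pd.DataFrame({
    "Jstor_issue_text": [
        "No. 5 PAPERS AND PROCEEDINGS OF THE One Hundred Twenty-Ninth Annual Meeting "
        "OF THE AMERICAN ECONOMIC ASSOCIATION MAY 2017 pp. i-xii, 1-681",
        "No. 12 DECEMBER 2019 pp. 4071-4472",  # made-up normal issue
        "No. 1 Supplement MARCH 1950",         # made-up supplement
    ]
})

# Flag an issue as special if any keyword appears anywhere in the issue text.
keywords = r"supplement|proceedings|annual meeting|survey"
is_special = toy_pivots["Jstor_issue_text"].str.contains(keywords, case=False, regex=True)
toy_pivots["type"] = is_special.map({True: "S", False: "N"})
```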


Merging the pivots with masters

In [120]:
result = pd.merge(masters, pivots[['issue_url','journal','type']], how="left", on="issue_url")  # join on the single shared key

In [121]:
result.head()

Unnamed: 0,stable_url,authors,title,abstract,content_type,issue_url,pages,year,volume,issue,journal,type
0,https://www.jstor.org/stable/26848477,,Front Matter,,MISC,https://www.jstor.org/stable/10.2307/e26848476,,2019,109,12,AER,N
1,https://www.jstor.org/stable/26848478,"Marcella Alsan, Owen Garrick and Grant Graziani",Does Diversity Matter for Health? Experimental Evidence from Oakland,,Article,https://www.jstor.org/stable/10.2307/e26848476,4071-4111,2019,109,12,AER,N
2,https://www.jstor.org/stable/26848479,Drew Fudenberg and Annie Liang,Predicting and Understanding Initial Play,,Article,https://www.jstor.org/stable/10.2307/e26848476,4112-4141,2019,109,12,AER,N
3,https://www.jstor.org/stable/26848480,"Hanno Lustig, Andreas Stathopoulos and Adrien Verdelhan",The Term Structure of Currency Carry Trade Risk Premia,,Article,https://www.jstor.org/stable/10.2307/e26848476,4142-4177,2019,109,12,AER,N
4,https://www.jstor.org/stable/26848481,"Tatyana Deryugina, Garth Heutel, Nolan H. Miller, David Molitor and Julian Reif",The Mortality and Medical Costs of Air Pollution: Evidence from Changes in Wind Direction,,Article,https://www.jstor.org/stable/10.2307/e26848476,4178-4219,2019,109,12,AER,N
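
A left merge like this silently duplicates rows if an `issue_url` appears more than once in the pivots file. The sketch below, on made-up data, shows how the `validate` and `indicator` options of `pd.merge` could be used to guard against that; whether the real pivots file has unique issue URLs is an assumption that has not been checked here.

```python
import pandas as pd

# Made-up stand-ins for the masters and pivots tables.
toy_masters = pd.DataFrame({
    "stable_url": ["a", "b", "c"],
    "issue_url": ["i1", "i1", "i2"],
})
toy_pivots = pd.DataFrame({
    "issue_url": ["i1", "i2"],
    "journal": ["AER", "AER"],
    "type": ["N", "S"],
})

toy_result = pd.merge(
    toy_masters,
    toy_pivots,
    how="left",
    on="issue_url",
    validate="many_to_one",  # raises MergeError if issue_url repeats in toy_pivots
    indicator=True,          # adds a _merge column: "both" or "left_only"
)
```

Rows marked `left_only` in `_merge` would be documents whose issue is missing from the pivots file.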


## Creating counts of the data

In [122]:
pd.DataFrame(masters['content_type'].value_counts()) # counts by content_type

content_type,count
Article,12584
Review,7144
MISC,5732
Comment,809
Discussion,592
Reply,505
Rejoinder,52


In [123]:
pd.DataFrame(masters[masters.year>1939].content_type.value_counts()) # counts after 1940 (inclusive)

content_type,count
Article,11203
Review,4571
MISC,2964
Comment,792
Discussion,545
Reply,489
Rejoinder,46


In [124]:
pd.DataFrame(masters[(masters.year>1939) & (masters.year<2011)].content_type.value_counts()) 
# counts between 1940 and 2010 (inclusive)

content_type,count
Article,9340
Review,4571
MISC,2714
Comment,748
Discussion,545
Reply,461
Rejoinder,46


In [None]:
result.to_excel(saveas, index=False)

## Matching Scopus records to JSTOR articles

If an article's affiliations, citations or abstract are recorded on Scopus, I want to exclude it from the set of PDFs that are sent to docParser. Matching up the Scopus data is also useful for comparing the textual accuracy of OCR parsers. To match articles, I use volume, issue, year and page numbers, which are common to both the Scopus data and the JSTOR metadata.

Then I use a sequence comparison between the article titles of each matched pair to decide whether the Scopus data has been matched correctly. If the match ratio is below 70%, the title is investigated and, if wrong, the Scopus data for that matched article is either corrected or discarded. If the Scopus data is missing all of the affiliations, abstract and citations fields, the match is also discarded.
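
As a minimal sketch of this title check, assuming the 0.7 cut-off used in this notebook: `title_similarity` and `needs_review` are hypothetical helper names, and the first pair of titles comes from the mismatches printed further down.

```python
from difflib import SequenceMatcher


def title_similarity(jstor_title: str, scopus_title: str) -> float:
    """Case-insensitive similarity ratio between two titles, from 0.0 to 1.0."""
    return SequenceMatcher(None, jstor_title.lower(), scopus_title.lower()).ratio()


def needs_review(jstor_title: str, scopus_title: str, threshold: float = 0.7) -> bool:
    """Flag a matched pair whose titles are less similar than the threshold."""
    return title_similarity(jstor_title, scopus_title) < threshold


# A pair matched on volume, issue, year and pages that is clearly wrong.
flagged = needs_review(
    "Ambiguity Aversion with Three or More Outcomes",
    "Hospital choices, hospital prices, and financial incentives to physicians?",
)
# A pair that differs only in capitalisation.
same = title_similarity("Human Capital and Growth", "human capital and growth")
```

Lower-casing both titles first matters because Scopus titles are stored in sentence case while JSTOR titles are in title case.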

Finally, if the document type of scopus is different to the classification done during the cleaning section, the article is reclassified according to the Scopus document type.
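
The discard and reclassification rules are not shown as code in this notebook, so the following is only a sketch of how they might be applied to the merged table. The rows are made up, the column names follow the merged output shown later, and running the discard step before the reclassification step is an assumption.

```python
import pandas as pd

# Made-up merged rows: content_type from the cleaning step, the rest from Scopus.
toy = pd.DataFrame({
    "content_type": ["Article", "Article", "Comment", "Reply"],
    "document type": ["Article", "Review", "Article", None],
    "affiliations": ["Univ. X", None, None, None],
    "abstract_y": [None, "Some abstract", None, None],
    "citations": [None, None, None, None],
})

# Discard matches where affiliations, abstract and citations are all missing.
scopus_fields = ["affiliations", "abstract_y", "citations"]
empty_match = toy[scopus_fields].isna().all(axis=1)
toy.loc[empty_match, "document type"] = None

# Where a usable Scopus document type disagrees, let it override the label.
differs = toy["document type"].notna() & (toy["document type"] != toy["content_type"])
toy.loc[differs, "content_type"] = toy.loc[differs, "document type"]
```

Here only the second row is reclassified; the third keeps its label because its Scopus match carries no usable fields.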

In [54]:
cleaned=pd.read_excel('C:\\Users\\sjwu1\\Journal_Data\\datadumps\\processed\\AER_processed.xlsx')
scopus['pages']=scopus['pages'].str.strip()
print(scopus.shape)

(5810, 14)


In [31]:
#Note that we only have data up to 2019 in the masterlists because of the moving wall on JSTOR
print(sum(scopus['year']<2020))

5573


In [57]:
Merged=pd.merge(cleaned, scopus, on=['year', 'issue','volume','pages'], how='left')

There are 47 non-MISC matches whose titles fall below the 0.7 similarity threshold.

In [56]:
sum(Merged['title_y'].isna()==False)

5399
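
The count above relies on how a left merge on the four shared keys behaves: unmatched JSTOR rows keep a missing `title_y`. The toy example below, with made-up records, illustrates this and shows why stray whitespace in `pages` has to be stripped before merging.

```python
import pandas as pd

# Made-up JSTOR and Scopus records for illustration only.
toy_jstor = pd.DataFrame({
    "title": ["Paper A", "Paper B"],
    "year": [2014, 2014],
    "volume": [104, 104],
    "issue": [12, 12],
    "pages": ["1-10", "11-20"],
})
toy_scopus = pd.DataFrame({
    "title": ["Paper a"],
    "year": [2014],
    "volume": [104],
    "issue": [12],
    "pages": [" 1-10 "],  # stray whitespace would prevent an exact match
})

toy_scopus["pages"] = toy_scopus["pages"].str.strip()
toy_merged = pd.merge(toy_jstor, toy_scopus, on=["year", "issue", "volume", "pages"], how="left")

# The shared "title" column is suffixed to title_x (JSTOR) and title_y (Scopus).
n_matched = int(toy_merged["title_y"].notna().sum())
```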

In [58]:
count=0
for m in Merged.index:
    
    if(pd.isna(Merged.iloc[m]['title_y'])==False):
        ratio=SequenceMatcher(None, Merged.iloc[m]['title_x'].lower(), Merged.iloc[m]['title_y'].lower()).ratio()

        if((ratio<0.7) & (Merged.iloc[m]['content_type']!='MISC')):
            print(Merged.iloc[m]['year'])
            print(ratio)
            count+=1
            A_ratio=SequenceMatcher(None, Merged.iloc[m]['authors_x'].lower(), Merged.iloc[m]['authors_y'].lower()).ratio()
            print(A_ratio)
            print(Merged.iloc[m]['stable_url'])
            print('vol: '+str(Merged.iloc[m]['volume']))
            print('issue: '+str(Merged.iloc[m]['issue']))
            print('pages: '+Merged.iloc[m]['pages'])
            print('jstor: '+Merged.iloc[m]['title_x'])
            print('scopus: '+Merged.iloc[m]['title_y'])
            print('jstor: '+Merged.iloc[m]['authors_x'])
            print('scopus: '+Merged.iloc[m]['authors_y'])
            print('scopus index: '+str(scopus[scopus['title']==Merged.iloc[m]['title_y']].index))
            print(m)
            print()
print(count)

2014
0.16666666666666666
0.23809523809523808
https://www.jstor.org/stable/43495358
vol: 104
issue: 12
pages: 3814-3840
jstor: Ambiguity Aversion with Three or More Outcomes
scopus: Hospital choices, hospital prices, and financial incentives to physicians?
jstor: Mark J. Machina
scopus: Ho, K.--a--
Pakes, A.--b-- 
scopus index: Int64Index([4331], dtype='int64')
1131

2014
0.30601092896174864
0.2727272727272727
https://www.jstor.org/stable/43495359
vol: 104
issue: 12
pages: 3841-3884
jstor: Hospital Choices, Hospital Prices, and Financial Incentives to Physicians
scopus: Consumption and debt response to unanticipated income shocks: Evidence from a natural experiment in Singapore?
jstor: Kate Ho and Ariel Pakes
scopus: Agarwal, S., Qian, W.
scopus index: Int64Index([4332], dtype='int64')
1132

2014
0.3522012578616352
0.19672131147540983
https://www.jstor.org/stable/43495360
vol: 104
issue: 12
pages: 3885-3920
jstor: Is It Whom You Know or What You Know? An Empirical Assessment of the Lobb

2001
0.36923076923076925
0.38461538461538464
https://www.jstor.org/stable/2677725
vol: 91
issue: 2
pages: 12-17
jstor: Human Capital and Growth
scopus: Human capital: Growth, history, and policy - A session to honor Stanley Engerman: Human capital and growth
jstor: Robert J. Barro
scopus: Barro, R.J.
scopus index: Int64Index([1722], dtype='int64')
4162

2001
0.5753424657534246
0.4411764705882353
https://www.jstor.org/stable/2677728
vol: 91
issue: 2
pages: 29-33
jstor: Input Trade and the Location of Production
scopus: Development and history: A session to honor Stanley Engerman: Input trade and the location of production
jstor: Ronald Findlay and Ronald W. Jones
scopus: Findlay, R.--a--
Jones, R.W.--b-- 
scopus index: Int64Index([1738], dtype='int64')
4165

2001
0.5714285714285714
0.49557522123893805
https://www.jstor.org/stable/2677763
vol: 91
issue: 2
pages: 219-225
jstor: Interest Rates and Inflation
scopus: Recent advances in monetary-policy rules: Interest rates and inflation
jsto

47


In [59]:
scopus.at[4331,'pages']='3841-3884'
scopus.at[4335,'pages']='3814-3840'
scopus.at[4340,'pages']='3885-3920'
scopus.at[4333,'pages']='3921-3955'
scopus.at[4337,'pages']='3956-3990'
scopus.at[4334,'pages']='3991-4026'
scopus.at[4330,'pages']='4027-4070'
scopus.at[4338,'pages']='4071-4103'
scopus.at[4328,'pages']='4104-4146'
scopus.at[4329,'pages']='4147-4183'
scopus.at[4336,'pages']='4184-4204'
scopus.at[4332,'pages']='4205-4230'
scopus.at[4339,'pages']='4231-4239'

scopus.at[4045,'title']="Erratum: Macroeconomic Effects of Financial Shocks"
scopus.at[2925,'title']="Erratum: When does coordination require centralization?"
scopus.at[2581,'title']="Erratum: Equilibrium incentives in oligopoly"
scopus.at[2722,'title']="Erratum: International protection of intellectual property"
scopus.at[2370, 'title']="Erratum: The savers-spenders theory of fiscal policy"
scopus.at[3178,'title']="Erratum: Women, wealth, and mobility"

In [60]:
Merged=pd.merge(cleaned, scopus, on=['year', 'issue','volume','pages'], how='left')

In [61]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.DataFrame(Merged[Merged['document type'].str.len()>100])


Unnamed: 0,stable_url,authors_x,title_x,abstract_x,content_type,issue_url,pages,year,volume,issue,journal_x,type,authors_y,title_y,journal_y,DOI,affiliations,abstract_y,citations,document type,index keywords,author keywords
1166,https://www.jstor.org/stable/43495314,"Matthew Gentzkow, Jesse M. Shapiro and Michael...",Competition and Ideological Diversity: Histori...,,Article,https://www.jstor.org/stable/10.2307/i40138653,3073-3114,2014,104,10,AER,N,"Gentzkow, M.--a-- --b--\nShapiro, J.M.--a-- --...",Competition and ideological diversity: Histori...,American Economic Review,10.1257/aer.104.10.3073,a--University of Chicago Booth School of Busin...,We study the competitive forces which shaped i...,"Ackerberg, D.A., Rysman, M.|Unobserved Product...",1850-1967(2006) The Historical Statistics of ...,,
2662,https://www.jstor.org/stable/29730164,Emir Kamenica,Contextual Inference in Markets: On the Inform...,,Article,https://www.jstor.org/stable/10.2307/i29730147,2127-2149,2008,98,5,AER,N,"Kamenica, E.",Contextual inference in markets: On the inform...,American Economic Review,10.1257/aer.98.5.2127,"Graduate School of Business, University of Chi...",Context can influence decisions. This malleabi...,"Anand, B.N., Shachar, R.|Brands as Beacons: A ...",When and Why Variety Backfires(2005) Marketin...,,
4696,https://www.jstor.org/stable/117017,Douglas A. Irwin,Changes in U.S. Tariffs: The Role of Import Pr...,,Article,https://www.jstor.org/stable/10.2307/i300823,1015-1026,1998,88,4,AER,N,"Irwin, D.A.",Changes in U.S. Tariffs: The Role of Import Pr...,American Economic Review,,"Department of Economics, Dartmouth College, Ha...",,"Anderson, J.E., Neary, J.P.|Measuring the Rest...","Percentage Distribution of Free, Specific, an...",,


In [62]:
Merged.at[4696,'document type']='Article'
Merged.at[2662,'document type']='Article'
Merged.at[1166,'document type']='Review'

In [40]:
pd.DataFrame(Merged[Merged['document type'].str.len()>100])

In [None]:
Merged.to_excel('C:\\Users\\sjwu1\\Journal_Data\\datadumps\\AER_M_sco_du.xlsx', index=False)