# QJE Cleaning
This notebook walks through how the QJE records were classified as articles or non-articles.

## Load Libraries

In [1]:
from numpy import nan as NaN  # numpy.NaN was removed in NumPy 2.0; numpy.nan works in every version
import pandas as pd
from difflib import SequenceMatcher
from multiprocessing import Pool
import multiprocessing as mp
import time

## Load Files
Change the file paths to your local paths, and comment out the reads for any file you do not have (for example, datadump).

In [140]:
masters = pd.read_excel("C:\\Users\\sjwu1\\Journal_Data\\Master lists\\QJE_master.xlsx")
pivots = pd.read_excel("C:\\Users\\sjwu1\\Journal_Data\\pivots\\QJE_pivots.xlsx")
scopus = pd.read_excel("C:\\Users\\sjwu1\\Journal_Data\\Scopus\\QJE_SCOPUS.xlsx")
datadump = pd.read_excel("C:\\Users\\sjwu1\\Journal_Data\\datadumps\\QJE_datadump.xlsx")

pd.set_option('display.max_colwidth', None)
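
If some of the input files are not available locally, the reads above can be guarded instead of commented out. The following is a minimal sketch and not part of the original notebook: `read_if_present` is a hypothetical helper, and the relative `Journal_Data` directory is a placeholder for your own layout.

```python
from pathlib import Path

import pandas as pd


def read_if_present(path):
    """Return the Excel file at `path` as a DataFrame, or None if it is missing."""
    path = Path(path)
    if not path.exists():
        print(f"Skipping missing file: {path}")
        return None
    return pd.read_excel(path)


# Hypothetical local layout; replace with your own directory.
data_dir = Path("Journal_Data")
datadump = read_if_present(data_dir / "datadumps" / "QJE_datadump.xlsx")
```

Later cells that use an optional file would then need to check for `None` before touching it.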

## Create file names
Paths for the output files.

In [141]:
authors="C:\\Users\\sjwu1\\Journal_Data\\datadumps\\QJE_authors.xlsx"
non_auth="C:\\Users\\sjwu1\\Journal_Data\\datadumps\\QJE_Nauthors.xlsx"
saveas="C:\\Users\\sjwu1\\Journal_Data\\datadumps\\QJE_processed.xlsx"
reviews="C:\\Users\\sjwu1\\Journal_Data\\datadumps\\QJE_reviews.xlsx"
misc="C:\\Users\\sjwu1\\Journal_Data\\datadumps\\QJE_misc.xlsx"
conf="C:\\Users\\sjwu1\\Journal_Data\\datadumps\\QJE_conf.xlsx"

## Some random checks on the masters list
My assumption is that every record without author names is a miscellaneous document, such as a committee report, a foreword, or front matter. The goal of this notebook is to verify that all records without author names really are miscellaneous documents and then to classify them as miscellaneous (MISC). Hence, we first group the data by title to see the repetitive general content that can likely be removed.

In [142]:
pd.set_option('display.max_rows',masters.shape[0])
temp=masters['title'].str.lower().value_counts()
pd.DataFrame(temp[temp>1])

title,count
front matter,412
back matter,409
volume information,178
recent publications,141
books received,100
recent publications upon economics,86
[notes and memoranda],39
the quarterly journal of economics,14
chapters on machinery and labor,4
[introduction],4


Some of the repetitions are due to multiple comments. Now consider the same counts for the rows that have no author names.

In [143]:
temp1=masters[masters['authors'].isna()]['title'].str.lower().value_counts()
pd.DataFrame(temp1)

title,count
front matter,412
back matter,409
volume information,178
recent publications,141
books received,100
recent publications upon economics,86
[notes and memoranda],36
the quarterly journal of economics,14
scientific publications of harvard university,3
[introduction],3


In [144]:
# block for testing regex patterns
#pd.DataFrame(masters[masters['content_type'].isna()]['title'].str.lower().value_counts())
#masters[masters['title'].str.lower().str.match(r'(^|: )report of the')]
#masters[masters['title'].str.lower().str.match(r'(^|.*: )report of the')]
#masters.loc[masters['title'].str.lower().str.match(r'^combined references(.*)')==True,'content_type']='MISC'

Unnamed: 0.1,Unnamed: 0,stable_url,authors,title,abstract,content_type,issue_url,pages
5996,5996,https://www.jstor.org/stable/1882418,A. Piatt Andrew,Report of the Mexican Currency Commission,,,https://www.jstor.org/stable/10.2307/i305257,585-587
6424,6424,https://www.jstor.org/stable/1882286,Edward Cummings,Report of the Connecticut Labor Bureau,,,https://www.jstor.org/stable/10.2307/i305223,480-487


According to the list above, any title that appears at least 5 times among the rows without authors seems to be miscellaneous, so the bulk of the miscellaneous content can be labelled with this rule.

In [146]:
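# Count lower-cased titles among rows that have neither a content type nor authors.
no_type_no_author = masters['content_type'].isna() & masters['authors'].isna()
temp2=masters[no_type_no_author]['title'].str.lower().value_counts()
# Titles that occur at least 5 times are treated as miscellaneous.
removal=list(temp2[temp2>=5].index)
# Note: the label is applied to every row with such a title, with or without authors.
masters.loc[masters.title.str.lower().isin(removal),'content_type']='MISC'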
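
The counts above show 39 rows titled "[notes and memoranda]" overall but only 36 without authors, so labelling by title alone also marks a few authored rows as MISC. If that is not wanted, a stricter variant can require a missing author as well. This is a sketch on invented toy data, not the rule used in this notebook; `toy` and its rows are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for `masters`; all rows are invented for illustration.
toy = pd.DataFrame({
    "title": ["Front Matter"] * 5
    + ["[Notes and Memoranda]"] * 5
    + ["[Notes and Memoranda]", "A Hypothetical Article"],
    "authors": [None] * 10 + ["Author A", "Author B"],
    "content_type": [None] * 12,
})

lower_title = toy["title"].str.lower()
unlabeled_no_author = toy["content_type"].isna() & toy["authors"].isna()

# Count titles among unlabeled, authorless rows and keep the frequent ones.
counts = lower_title[unlabeled_no_author].value_counts()
removal = list(counts[counts >= 5].index)

# Stricter than the cell above: label a row only if it also lacks an author.
toy.loc[lower_title.isin(removal) & toy["authors"].isna(), "content_type"] = "MISC"
print(toy["content_type"].value_counts(dropna=False))
```

In this toy frame the authored "[Notes and Memoranda]" row stays unlabeled and is left for the later classification steps.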

In [147]:
masters.loc[masters['title'].str.lower().str.match(r'\[introduction\]')==True,'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'schumpeter prize$')==True,'content_type']='MISC'

In [148]:

scopus.rename(columns = {'abstract':'abstract2', 'title':'title2', 'authors':'authors2'}, inplace = True)
scopus['pages2']=scopus['pages']
masters['pages']=masters['pages'].str.strip()  
masters.loc[masters.title.str.lower() == "back matter", 'pages'] = NaN  


## Classifying other content

In [149]:
sum(masters.content_type.isna())
#masters.shape[0]

5313

In [150]:
masters.loc[masters['authors'].str.lower().str.match(r'^review(ed|) by(.*)')==True,'content_type']='Review' #reviews
masters.loc[(masters['title'].str.lower().str.match(r'(.*) by (.*)')==True) & (masters.authors.isna()==True),'content_type']='Review2' 
#possible reviews that don't have author names
masters[(masters['content_type']=='Review2') | (masters['content_type']=='Review')].shape[0] #reviews

113

In [151]:
# na=False makes rows with a missing title count as non-matches.
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*(:|\?).*comment.*$', na=False),'content_type']='Comment'
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*comment$', na=False),'content_type']='Comment'
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'(a further|further) comment.*$', na=False),'content_type']='Comment'
masters[masters['content_type']=='Comment'].shape[0]
#.shape[0] 
#comments

Unnamed: 0.1,Unnamed: 0,stable_url,authors,title,abstract,content_type,issue_url,pages
157,157,https://www.jstor.org/stable/26372539,Gary Lyn and Andrés Rodríguez-Clare,EXTERNAL ECONOMIES AND INTERNATIONAL TRADE REDUX: COMMENT,,Comment,https://www.jstor.org/stable/10.2307/e26372527,1895-1905
448,448,https://www.jstor.org/stable/25098902,Christopher L. Foote and Christopher F. Goetz,The Impact of Legalized Abortion on Crime: Comment,,Comment,https://www.jstor.org/stable/10.2307/i25098891,407-423
839,839,https://www.jstor.org/stable/2587007,"Steven M. Fazzari, R. Glenn Hubbard and Bruce C. Petersen",Investment-Cash Flow Sensitivities are Useful: A Comment on Kaplan and Zingales,,Comment,https://www.jstor.org/stable/10.2307/i324120,695-705
953,953,https://www.jstor.org/stable/2586993,"Kevin Lee, M. Hashem Pesaran and Ron Smith",Growth Empirics: A Panel Data Approach -- A Comment,,Comment,https://www.jstor.org/stable/10.2307/i324111,319-323
1195,1195,https://www.jstor.org/stable/2118344,Kyoji Fukao and Roland Benabou,History Versus Expectations: A Comment,,Comment,https://www.jstor.org/stable/10.2307/i337095,535-542
1406,1406,https://www.jstor.org/stable/2937827,David M. Newbery,The Isolation Paradox and the Discount Rate for Benefit-Cost Analysis: A Comment,,Comment,https://www.jstor.org/stable/10.2307/i352301,235-238
1486,1486,https://www.jstor.org/stable/1885547,"Wilfrid W. Csaplar, Jr. and Edward Tower",Trade and Industrial Policy Under Oligopoly: Comment,,Comment,https://www.jstor.org/stable/10.2307/i332453,599-602
1567,1567,https://www.jstor.org/stable/1885073,Daniel J. Seidmann,Incentives for Information Production and Disclosure: Comment,,Comment,https://www.jstor.org/stable/10.2307/i332429,445-452
1606,1606,https://www.jstor.org/stable/1885700,Robert E. Kohn,The Limitations of Pigouvian Taxes as a Long-Run Remedy for Externalities: Comment,,Comment,https://www.jstor.org/stable/10.2307/i332475,625-630
1704,1704,https://www.jstor.org/stable/1885748,Robert Cameron Mitchell and Richard T. Carson,Option Value: Empirical Evidence From a Case Study of Recreation and Water Quality: Comment,,Comment,https://www.jstor.org/stable/10.2307/i332473,291-294
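
To see what the three Comment patterns accept, they can be tried on a few titles with the standard `re` module, since `Series.str.match` anchors at the start of the string in the same way as `re.match`. This is a small sketch; `looks_like_comment` is a hypothetical helper, and the last two sample titles are made up.

```python
import re

# The three Comment patterns from the cell above, tried in the same order.
COMMENT_PATTERNS = [
    r'.*(:|\?).*comment.*$',
    r'.*comment$',
    r'(a further|further) comment.*$',
]


def looks_like_comment(title):
    """Return True if the lower-cased title matches any Comment pattern."""
    lowered = title.lower()
    return any(re.match(p, lowered) for p in COMMENT_PATTERNS)


# The first three titles appear in the output above; the last two are invented.
samples = [
    "The Impact of Legalized Abortion on Crime: Comment",
    "History Versus Expectations: A Comment",
    "Investment-Cash Flow Sensitivities are Useful: A Comment on Kaplan and Zingales",
    "Further Comment on a Hypothetical Paper",
    "Commentary on Tariffs",
]
for title in samples:
    print(looks_like_comment(title), "|", title)
```

Because the first pattern only requires the substring "comment" somewhere after a colon or question mark, a title such as "Tariffs: A Commentary" would also be labelled Comment, so the matches are worth spot-checking.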


In [152]:
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*(:|\?) (|a )reply(| to.*)$', na=False),'content_type']="Reply"
masters[masters['content_type']=='Reply'].shape[0]

156
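
The Reply pattern is stricter than the Comment patterns: it needs a colon or question mark, a space, and then "reply", optionally preceded by "a " and optionally followed by " to ...". The sketch below checks this on invented titles with `re.match`; `is_reply` is a hypothetical helper, and whether titles like the last two occur in the data is not verified here.

```python
import re

# The Reply pattern from the cell above.
REPLY_PATTERN = r'.*(:|\?) (|a )reply(| to.*)$'


def is_reply(title):
    """Return True if the lower-cased title matches the Reply pattern."""
    return re.match(REPLY_PATTERN, title.lower()) is not None


# Invented titles covering the accepted and rejected shapes.
samples = [
    "A Hypothetical Paper: Reply",
    "A Hypothetical Paper: A Reply",
    "Is This a Hypothetical Paper? Reply to a Critic",
    "Reply",
    "A Hypothetical Paper: A Reply and Some Extensions",
]
for title in samples:
    print(is_reply(title), "|", title)
```

A title consisting only of "Reply", or one that continues after "reply" with anything other than " to ...", does not match and would fall through to the Article default further down.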

In [153]:
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*(:|\?|).*rejoinder.*$', na=False),'content_type']="Rejoinder"
masters[masters['content_type']=='Rejoinder'].shape[0]

Unnamed: 0.1,Unnamed: 0,stable_url,authors,title,abstract,content_type,issue_url,pages
2524,2524,https://www.jstor.org/stable/1882050,Assar Lindbeck,Rejoinder,,Rejoinder,https://www.jstor.org/stable/10.2307/i305245,665-683
2527,2527,https://www.jstor.org/stable/1882053,Orley Ashenfelter and John H. Pencavel,"American Trade Union Growth, 1900-1960: A Rejoinder",,Rejoinder,https://www.jstor.org/stable/10.2307/i305245,691-692
2542,2542,https://www.jstor.org/stable/1880808,Paul A. Samuelson,[The Consumer does Benefit from Feasible Price Stability]: Rejoinder,,Rejoinder,https://www.jstor.org/stable/10.2307/i332309,500-503
2562,2562,https://www.jstor.org/stable/1880572,Peter L. Swan,The Influence of Monopoly on Product Innovation: Rejoinder,,Rejoinder,https://www.jstor.org/stable/10.2307/i332308,346-349
2591,2591,https://www.jstor.org/stable/1882275,J. C. H. Fei,[The Marginalist Principle in a Discrete Production Model Under Uncertain Demand]: Rejoinder,,Rejoinder,https://www.jstor.org/stable/10.2307/i305244,710-711
2593,2593,https://www.jstor.org/stable/1882277,William Poole,[Optimal Choice of Monetary Policy Instruments in a Simple Stochastic Macro Model]: Rejoinder,,Rejoinder,https://www.jstor.org/stable/10.2307/i305244,716-717
2637,2637,https://www.jstor.org/stable/1881845,John F. Kain,"[A Note on John Kain's ""Housing Segregation, Negro Employment and Metropolitan Decentralization""]: Rejoinder",,Rejoinder,https://www.jstor.org/stable/10.2307/i305216,161-162
2689,2689,https://www.jstor.org/stable/1883018,James Tobin,[Comment on Tobin]: Rejoinder,,Rejoinder,https://www.jstor.org/stable/10.2307/i332354,328-329
3002,3002,https://www.jstor.org/stable/1880629,Sayre P. Schatz,Rejoinder,,Rejoinder,https://www.jstor.org/stable/10.2307/i332299,246-247
3611,3611,https://www.jstor.org/stable/1882154,Evsey D. Domar,Accelerated Depreciation: A Rejoinder,,Rejoinder,https://www.jstor.org/stable/10.2307/i305239,299-304


In [154]:
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*: (|a )discussion$', na=False),'content_type']="Discussion"
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'(^|a )discussion(|.*)$', na=False),'content_type']="Discussion"
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*:.*(|a )discussion(|s)$', na=False),'content_type']='Discussion'
masters[masters['content_type']=='Discussion'].shape[0]

4

In [155]:
masters.loc[masters['content_type'].isna(),'content_type']="Article"
masters[masters['content_type']=='Article'].shape[0]

4746
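
Because each cell only labels rows whose content_type is still missing, the title-based steps behave like an ordered rule list in which the first match wins and everything left over becomes an Article. The sketch below restates that logic for a single title using `re`; `RULES` and `classify_title` are hypothetical names, and the MISC and Review steps are left out because they depend on title frequency and on the authors column.

```python
import re

# (label, patterns) in the order the cells above apply them; first match wins.
RULES = [
    ("Comment", [r'.*(:|\?).*comment.*$',
                 r'.*comment$',
                 r'(a further|further) comment.*$']),
    ("Reply", [r'.*(:|\?) (|a )reply(| to.*)$']),
    ("Rejoinder", [r'.*(:|\?|).*rejoinder.*$']),
    ("Discussion", [r'.*: (|a )discussion$',
                    r'(^|a )discussion(|.*)$',
                    r'.*:.*(|a )discussion(|s)$']),
]


def classify_title(title):
    """Return the first matching label for a title, or 'Article' if none match."""
    lowered = title.lower()
    for label, patterns in RULES:
        if any(re.match(p, lowered) for p in patterns):
            return label
    return "Article"


# The first two titles appear in the outputs above; the third is invented.
for title in ["[Comment on Tobin]: Rejoinder",
              "History Versus Expectations: A Comment",
              "A Made-Up Title on Trade"]:
    print(classify_title(title), "|", title)
```

The order matters: a title is tested against the Comment patterns before the Rejoinder pattern, so "[Comment on Tobin]: Rejoinder" is labelled Rejoinder only because none of the Comment patterns match it.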

In [2]:
# code block for testing regex
#masters[masters['title'].str.lower().str.match(r'^washington notes$')==True]
#masters[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*(:|\?) (|a )rejoinder.*$')==True]
#masters[masters.content_type=='Discussion']

## Consider the pivots file
Conference papers are sometimes structured differently from regular articles, so it may be necessary to distinguish them. The next block separates special issues (S) from normal issues (N).

In [158]:
pivots.loc[pivots.Jstor_issue_text.str.lower().str.match(r'(.*)(supplement|proceedings|annual meeting|survey)(.*)'),'type']="S"
pivots.loc[pivots.type.isna(),'type']='N'
pivots.type.value_counts()

N    529
S      3
Name: type, dtype: int64
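
The mask in the cell above must not contain missing values, because `.loc` rejects a boolean mask with NaN in it. If Jstor_issue_text can be missing for some issues, one option is to pass `na=False` so those rows count as normal issues. This is a sketch on invented data: `toy_pivots` and its issue strings are assumptions, and `str.contains` is used here in place of the `(.*)...(.*)` wrapper around `str.match`.

```python
import pandas as pd

# Invented issue descriptions; the real format of Jstor_issue_text may differ.
toy_pivots = pd.DataFrame({
    "Jstor_issue_text": [
        "Vol. 1, No. 1 (hypothetical regular issue)",
        "Vol. 2, Supplement (hypothetical special issue)",
        None,
    ]
})

# True where the issue text mentions one of the special-issue keywords.
special = toy_pivots["Jstor_issue_text"].str.lower().str.contains(
    r"supplement|proceedings|annual meeting|survey", na=False
)
toy_pivots["type"] = special.map({True: "S", False: "N"})
print(toy_pivots["type"].value_counts())
```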

## Merging pivots and masters

In [159]:
result = pd.merge(masters, pivots[['issue_url','year','volume','issue','journal','type']], how="left", on="issue_url")
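
A left merge keeps every row of masters, but it silently duplicates rows if an issue_url appears more than once in pivots and leaves the year empty where no issue matches. Both cases can be checked with the `validate` and `indicator` arguments of `pd.merge`. The example below is a sketch on invented toy frames, not a check that was run in this notebook.

```python
import pandas as pd

# Invented stand-ins for `masters` and `pivots`.
toy_masters = pd.DataFrame({
    "title": ["Paper A", "Paper B", "Paper C"],
    "issue_url": ["issue-1", "issue-1", "issue-3"],
})
toy_pivots = pd.DataFrame({
    "issue_url": ["issue-1", "issue-2"],
    "year": [1950, 1951],
})

checked = pd.merge(
    toy_masters,
    toy_pivots,
    how="left",
    on="issue_url",
    validate="many_to_one",  # raises MergeError if an issue_url repeats in toy_pivots
    indicator=True,          # adds a `_merge` column: "both" or "left_only"
)

# Rows of the left frame that found no matching issue.
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["title", "issue_url"]])
```

Any rows reported as unmatched would have a missing year and would therefore drop out of the year-filtered summaries.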

## Summaries

In [160]:
pd.DataFrame(result.content_type.value_counts())

content_type,count
Article,4746
MISC,1383
Comment,263
Reply,156
Review,113
Rejoinder,31
Discussion,4


In [161]:
pd.DataFrame(result[result.year>1939].content_type.value_counts())

content_type,count
Article,3162
MISC,824
Comment,249
Reply,133
Rejoinder,15
Review,3


In [162]:
pd.DataFrame(result[(result.year>1939) & (result.year<2011)].content_type.value_counts())

content_type,count
Article,2914
MISC,778
Comment,248
Reply,132
Rejoinder,15
Review,3


In [163]:
result.to_excel(saveas, index=False)