# ECTA Cleaning

This notebook walks through how the ECTA (Econometrica) records were sorted into articles and non-article content such as reviews, comments, and miscellaneous front and back matter.

## Load Libraries

In [341]:
from numpy import nan as NaN  # the numpy.NaN alias was removed in NumPy 2.0
import pandas as pd
from difflib import SequenceMatcher
from multiprocessing import Pool
import multiprocessing as mp
import time

## Load Files
Replace these file paths with your local file paths.

In [342]:
masters = pd.read_excel("C:\\Users\\sjwu1\\Journal_Data\\Master lists\\ECONOMETRICA_master.xlsx")
pivots = pd.read_excel("C:\\Users\\sjwu1\\Journal_Data\\pivots\\ECONOMETRICA_pivots.xlsx")
scopus = pd.read_excel("C:\\Users\\sjwu1\\Journal_Data\\Scopus\\ECONOMETRICA_SCOPUS.xlsx")
#datadump = pd.read_excel("C:\\Users\\sjwu1\\Journal_Data\\datadumps\\ECONOMETRICA_datadump.xlsx")

pd.set_option('display.max_colwidth', None)
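
One way to make these paths portable is to build them from a single base directory. The sketch below is only an illustration: `DATA_DIR` and the helper `journal_file` are hypothetical names that do not appear in the original notebook, and only the subfolder and file names are taken from the cell above.

```python
from pathlib import Path

# Hypothetical base directory; point this at your own copy of the data.
DATA_DIR = Path("Journal_Data")


def journal_file(subfolder: str, filename: str, base: Path = DATA_DIR) -> Path:
    """Join base / subfolder / filename without hard-coded separators."""
    return base / subfolder / filename


master_path = journal_file("Master lists", "ECONOMETRICA_master.xlsx")
pivots_path = journal_file("pivots", "ECONOMETRICA_pivots.xlsx")
scopus_path = journal_file("Scopus", "ECONOMETRICA_SCOPUS.xlsx")
```

The resulting `Path` objects can be passed directly to `pd.read_excel`, so only `DATA_DIR` needs to change between machines.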

## Create File names

In [343]:
authors="C:\\Users\\sjwu1\\Journal_Data\\datadumps\\ECTA_authors.xlsx"
non_auth="C:\\Users\\sjwu1\\Journal_Data\\datadumps\\ECTA_Nauthors.xlsx"
saveas="C:\\Users\\sjwu1\\Journal_Data\\datadumps\\ECTA_processed.xlsx"
reviews="C:\\Users\\sjwu1\\Journal_Data\\datadumps\\ECTA_reviews.xlsx"
misc="C:\\Users\\sjwu1\\Journal_Data\\datadumps\\ECTA_misc.xlsx"
conf="C:\\Users\\sjwu1\\Journal_Data\\datadumps\\ECTA_conf.xlsx"

## Some random checks on the masters list

My assumption is that every record without author names is a miscellaneous document, such as a committee report, a foreword, or front matter. The goal of this notebook is to verify that assumption and then classify those records as miscellaneous (MISC). First, we count the records by title to see which recurring, generic titles can likely be removed.

In [344]:
pd.set_option('display.max_rows',masters.shape[0])
temp=masters['title'].str.lower().value_counts()
pd.DataFrame(temp[temp>1])

title,count
back matter,439
front matter,430
news notes,193
announcements,146
accepted manuscripts,116
volume information,80
submission of manuscripts to econometrica,49
forthcoming papers,36
report of the secretary,31
report of the treasurer,31


Some titles repeat because several comments were published under the same title. Now consider the same counts restricted to records without author names.

In [345]:
temp1=masters[masters['authors'].isna()]['title'].str.lower().value_counts()
pd.DataFrame(temp1)

title,count
back matter,439
front matter,430
news notes,193
announcements,134
accepted manuscripts,116
volume information,80
submission of manuscripts to econometrica,49
forthcoming papers,36
news note,25
fellows of the econometric society,25


In [346]:
# block for testing regex matching
#pd.DataFrame(masters[masters['content_type'].isna()]['title'].str.lower().value_counts())
#masters[masters['title'].str.lower().str.match(r'(^|: )report of the')]
#masters[masters['title'].str.lower().str.match(r'(^|.*: )report of the')]
#masters.loc[masters['title'].str.lower().str.match(r'^combined references(.*)')==True,'content_type']='MISC'
#masters[masters['title'].str.lower().str.match(r'.*(members|members and subscribers)$')]

Unnamed: 0.1,Unnamed: 0,stable_url,authors,title,abstract,content_type,issue_url,pages
80,80,https://www.jstor.org/stable/45172310,Enrique Sentana,THE ECONOMETRIC SOCIETY ANNUAL REPORTS: REPORT OF THE SECRETARY,,,https://www.jstor.org/stable/10.2307/i40222381,349-357
81,81,https://www.jstor.org/stable/45172311,Enrique Sentana,THE ECONOMETRIC SOCIETY ANNUAL REPORTS: REPORT OF THE TREASURER,,,https://www.jstor.org/stable/10.2307/i40222381,359-364
82,82,https://www.jstor.org/stable/45172312,"Joel Sobel, Dirk Bergemann, Itzhak Gilboa, Ulrich K. Müller, Aviv Nevo, Giovanni L. Violante and Fabrizio Zilibotti",THE ECONOMETRIC SOCIETY ANNUAL REPORTS: REPORT OF THE EDITORS 2017–2018,,,https://www.jstor.org/stable/10.2307/i40222381,365-367
84,84,https://www.jstor.org/stable/45172314,Donald Andrews and Jeffrey Ely,THE ECONOMETRIC SOCIETY ANNUAL REPORTS: REPORT OF THE EDITORS OF THE MONOGRAPH SERIES,,,https://www.jstor.org/stable/10.2307/i40222381,381-383
943,943,https://www.jstor.org/stable/40056534,Rafael Repullo,The Econometric Society Reports: Report of the Secretary,,,https://www.jstor.org/stable/10.2307/i40002377,327-333
944,944,https://www.jstor.org/stable/40056535,Rafael Repullo,The Econometric Society Reports: Report of the Treasurer,,,https://www.jstor.org/stable/10.2307/i40002377,335-340
945,945,https://www.jstor.org/stable/40056536,"Stephen Morris, Daron Acemoglu, Steve Berry, David Levine, Whitney Newey, Larry Samuelson and Harald Uhlig",The Econometric Society Annual Reports: Report of the Editors 2007-2008,,,https://www.jstor.org/stable/10.2307/i40002377,341-345
947,947,https://www.jstor.org/stable/40056538,Andrew Chesher and George Mailath,The Econometric Society Annual Reports: Report of the Editors of the Monograph Series,,,https://www.jstor.org/stable/10.2307/i40002377,357-359
976,976,https://www.jstor.org/stable/40056501,Lars Peter Hansen,"The Econometric Society Annual Reports, 2007: Report of the President",,,https://www.jstor.org/stable/10.2307/i40002375,1225-1226
1089,1089,https://www.jstor.org/stable/4123117,"Rafael Repullo, Eddie Dekel, David Levine, Costas Meghir, Whitney Newey and Larry Samuelson",The Econometric Society Annual Reports: Report of the Secretary,,,https://www.jstor.org/stable/10.2307/i383441,291-297+299-308


Judging from the above, any title that appears at least 5 times among records without a content type is miscellaneous. The next code block classifies those records as MISC.

In [None]:
# Count titles among records that have no content type yet
temp2=masters[masters['content_type'].isna()]['title'].str.lower().value_counts()
#pd.DataFrame(temp2)
# Titles that appear at least 5 times are treated as miscellaneous
removal=list(temp2[temp2>=5].index)
masters.loc[masters.title.str.lower().isin(removal),'content_type']='MISC'
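
To see the threshold rule in isolation, here is a minimal sketch on toy data. The DataFrame `toy` and its titles are made up for illustration and only stand in for `masters`.

```python
import pandas as pd

# Toy stand-in for `masters`; titles and counts are invented.
toy = pd.DataFrame({
    "title": ["Front Matter"] * 5 + ["Back Matter"] * 6 + ["A Note on Demand"] * 2,
    "content_type": [None] * 13,
})

# Count lower-cased titles among rows that are still unclassified.
counts = toy.loc[toy["content_type"].isna(), "title"].str.lower().value_counts()

# Titles seen at least 5 times are labelled MISC.
frequent = counts[counts >= 5].index
toy.loc[toy["title"].str.lower().isin(frequent), "content_type"] = "MISC"
```

Here the two generic titles are labelled MISC, while the title that appears only twice stays unclassified.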

## Classifying miscellaneous content

In [350]:
scopus.rename(columns = {'abstract':'abstract2', 'title':'title2', 'authors':'authors2'}, inplace = True)
scopus['pages2']=scopus['pages']
masters['pages']=masters['pages'].str.strip()
masters.loc[masters.title.str.lower() == "back matter", 'pages'] = NaN
# Raw string so that \[ and \] are regex escapes rather than invalid string escapes
masters[masters['title'].str.lower().str.contains(r'\[illustration\]')==True]

Unnamed: 0.1,Unnamed: 0,stable_url,authors,title,abstract,content_type,issue_url,pages
1,1,https://www.jstor.org/stable/45238022,,[Illustration],,,https://www.jstor.org/stable/10.2307/i40226149,
551,551,https://www.jstor.org/stable/23524986,,[Illustration],,,https://www.jstor.org/stable/10.2307/i23524124,
717,717,https://www.jstor.org/stable/41237778,,[Illustration],,,https://www.jstor.org/stable/10.2307/i40055874,


In [351]:
masters.loc[masters.apply(lambda k: SequenceMatcher(None, k['title'].lower(), 'front matter').ratio(), axis=1)>0.75,"content_type"]='MISC'
masters.loc[masters.apply(lambda k: SequenceMatcher(None, k['title'].lower(), 'back matter').ratio(), axis=1)>0.75,"content_type"]='MISC'
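# SequenceMatcher compares literal strings, so the regex-style 'news note(|s)' was never interpreted as a pattern; compare against the plain title instead
masters.loc[masters.apply(lambda k: SequenceMatcher(None, k['title'].lower(), 'news notes').ratio(), axis=1)>0.75,"content_type"]='MISC'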
masters.loc[masters.apply(lambda k: SequenceMatcher(None, k['title'].lower(), 'announcements').ratio(), axis=1)>0.75,"content_type"]='MISC'
masters.loc[masters.apply(lambda k: SequenceMatcher(None, k['title'].lower(), 'accepted manuscripts').ratio(), axis=1)>0.75,"content_type"]='MISC'
masters.loc[masters.apply(lambda k: SequenceMatcher(None, k['title'].lower(), 'volume information').ratio(), axis=1)>0.75,"content_type"]='MISC'
masters.loc[masters.apply(lambda k: SequenceMatcher(None, k['title'].lower(), 'submission of manuscripts to econometrica').ratio(), axis=1)>0.75,"content_type"]='MISC'
masters.loc[masters.apply(lambda k: SequenceMatcher(None, k['title'].lower(), 'forthcoming papers').ratio(), axis=1)>0.75,"content_type"]='MISC'
masters.loc[masters['title'].str.lower().str.match(r'(^|.*: )report of the'), 'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'.*report (of|on) the(.*)(editors|fellows)'), 'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'meeting of the econometric society'), 'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'(^|.*: )report of the.*')==True,'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.contains('econometric society')==True,'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.contains('econometrica')==True,'content_type']='MISC'
masters.loc[(masters['title'].str.lower().str.contains('report')==True) & (masters['authors'].isna()==True),'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.strip().str.match(r'treasurer(.*)report'),'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.strip().str.contains(r'report from the president'),'content_type']='MISC'
masters.loc[((masters['title'].str.lower().str.contains('announcement of the')==True)),'content_type']='MISC'
masters.loc[((masters['title'].str.lower().str.match(r'editor(.*)note')==True)),'content_type']='MISC'
masters.loc[((masters['title'].str.lower().str.match(r'(.*):program$')==True)),'content_type']='MISC'
masters.loc[((masters['title'].str.lower().str.strip().str.match(r'accountant(.*)opinion')==True)),'content_type']='MISC'
masters.loc[masters.apply(lambda k: SequenceMatcher(None, k['title'].lower(), 'unpublished research memoranda').ratio(), axis=1)>0.75,"content_type"]='MISC'
masters.loc[((masters['title'].str.lower().str.strip().str.match(r'^(obituary|death(s?) of members)$')==True)),'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'.*fellows$'),'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.contains('nomination of fellows'),'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.match(r'.*editorial$'),'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.match(r'(index of authors|summary of accounts)'),'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'.*(members|members and subscribers)$'),'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.contains(r'\[illustration\]'),'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.contains('abstracts of papers'),'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.contains('frisch medal award'),'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.match(r'^membership list'),'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.contains('additive preferences'),'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.contains('communications'),'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.contains('letters to the editor'),'content_type']="MISC"
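
The repeated SequenceMatcher lines above could be folded into one helper. The sketch below is an assumption about how that might look, not the notebook's own code: `fuzzy_mask` is a hypothetical function and the toy titles are made up. It keeps the same 0.75 threshold.

```python
from difflib import SequenceMatcher

import pandas as pd


def fuzzy_mask(titles: pd.Series, target: str, threshold: float = 0.75) -> pd.Series:
    """True where the lower-cased title is more than `threshold` similar to `target`."""
    return titles.str.lower().map(
        lambda t: SequenceMatcher(None, t, target).ratio() > threshold
    )


# Toy stand-in for `masters`; titles are invented.
toy = pd.DataFrame(
    {"title": ["Front Matter", "Front  Matter", "News Note", "Frontier Models"]}
)
toy["content_type"] = None

# Literal target titles (SequenceMatcher does not interpret regex syntax).
for target in ["front matter", "news notes"]:
    toy.loc[fuzzy_mask(toy["title"], target), "content_type"] = "MISC"
```

Near-duplicates such as a title with a doubled space or a missing plural clear the threshold, while an unrelated title that merely shares a prefix does not.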


## Classifying other content

In [352]:
sum(masters.content_type.isna())
#masters.shape[0]

6569

In [353]:
masters.loc[masters['authors'].str.lower().str.match(r'^review(ed|) by(.*)')==True,'content_type']='Review' #reviews
# possible reviews that don't have author names
masters.loc[(masters['title'].str.lower().str.match(r'(.*) by (.*)')==True) & (masters.authors.isna()==True),'content_type']='Review2'
masters[(masters['content_type']=='Review2') | (masters['content_type']=='Review')].shape[0] #reviews

935

In [354]:
# & binds tighter than ==, so parenthesize the regex test explicitly
masters.loc[masters.content_type.isna() & (masters.title.str.lower().str.match(r'.*: (|a )comment(|.*)$')==True),'content_type']='Comment'
masters[masters['content_type']=='Comment'].shape[0] #comments

70

In [355]:
# & binds tighter than ==, so parenthesize the regex test explicitly
masters.loc[masters.content_type.isna() & (masters.title.str.lower().str.match(r'.*(:|\?) (|a )reply(| to.*)$')==True),'content_type']="Reply"
masters[masters['content_type']=='Reply'].shape[0]

43

In [356]:
# & binds tighter than ==, so parenthesize the regex test explicitly
masters.loc[masters.content_type.isna() & (masters.title.str.lower().str.match(r'.*(:|\?) (|a )rejoinder.*$')==True),'content_type']="Rejoinder"
masters[masters['content_type']=='Rejoinder'].shape[0]

14

In [357]:
# & binds tighter than ==, so parenthesize each regex test explicitly
masters.loc[masters.content_type.isna() & (masters.title.str.lower().str.match(r'.*: (|a )discussion$')==True),'content_type']="Discussion"
masters.loc[masters.content_type.isna() & (masters.title.str.lower().str.match(r'(^|a )discussion(|.*)$')==True),'content_type']="Discussion"
masters.loc[masters.content_type.isna() & (masters.title.str.lower().str.match(r'.*:.*(|a )discussion(|s)$')==True),'content_type']='Discussion'
masters[masters['content_type']=='Discussion'].shape[0]

9

In [358]:
masters.loc[masters['content_type'].isna(),'content_type']="Article"
masters[masters['content_type']=='Article'].shape[0]

5501
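
The title rules in this section can also be read as an ordered list in which the first matching pattern wins and everything left over becomes an Article. The sketch below illustrates that reading on made-up titles. `RULES` and `classify_title` are hypothetical names, and only one of the three Discussion patterns is included for brevity.

```python
import re

# Ordered (label, pattern) pairs copied from the cells above; first match wins.
RULES = [
    ("Comment", re.compile(r".*: (|a )comment(|.*)$")),
    ("Reply", re.compile(r".*(:|\?) (|a )reply(| to.*)$")),
    ("Rejoinder", re.compile(r".*(:|\?) (|a )rejoinder.*$")),
    ("Discussion", re.compile(r".*: (|a )discussion$")),
]


def classify_title(title: str) -> str:
    """Return the first matching label, or 'Article' if no rule matches."""
    lowered = title.lower()
    for label, pattern in RULES:
        # re.match anchors at the start, like pandas' Series.str.match
        if pattern.match(lowered):
            return label
    return "Article"


# Invented titles, used only to exercise the rules.
samples = [
    "Testing for Unit Roots: A Comment",
    "Testing for Unit Roots: Reply",
    "Is Growth Exogenous? A Rejoinder",
    "Optimal Taxation: Discussion",
    "Optimal Taxation in Theory",
]
labels = [classify_title(t) for t in samples]
```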

## Summary Statistics

The counts below cover the full masters list. Further down, the masters list is merged with the pivots file so that records can be filtered by year.

In [360]:
pd.DataFrame(masters['content_type'].value_counts())

content_type,count
Article,5501
MISC,2658
Review,930
Comment,70
Reply,43
Rejoinder,14
Discussion,9
Review2,5


## Consider the pivots file
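Conference papers are sometimes structured differently from regular articles, so it may be necessary to distinguish them. The cell below flags issues whose JSTOR issue text mentions a supplement, proceedings, annual meeting, or survey as special ('S') and all other issues as normal ('N').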

In [361]:
pivots.loc[pivots.Jstor_issue_text.str.lower().str.match(r'(.*)(supplement|proceedings|annual meeting|survey)(.*)'),'type']="S"
pivots.loc[pivots.type.isna(),'type']='N'
pivots.type.value_counts()

N    443
S      6
Name: type, dtype: int64
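
As a rough illustration of the flagging step, the sketch below applies the same keywords to a toy frame. The issue texts are invented, and using `str.contains` with `na=False` is my own variation on the cell above, chosen so that a missing issue text falls through to 'N' instead of breaking the boolean mask.

```python
import pandas as pd

# Toy stand-in for `pivots`; the issue texts are made up.
toy_pivots = pd.DataFrame({
    "Jstor_issue_text": [
        "Vol. 20, No. 3, Jul., 1952",
        "Vol. 17, Supplement: Report of the Washington Meeting, Jul., 1949",
        None,
    ]
})

# Same keywords as the notebook cell, matched case-insensitively anywhere in the text.
pattern = r"supplement|proceedings|annual meeting|survey"
is_special = toy_pivots["Jstor_issue_text"].str.contains(pattern, case=False, na=False)

# 'S' for special issues, 'N' for everything else (including missing text).
toy_pivots["type"] = is_special.map({True: "S", False: "N"})
```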

Merge the masters list with the pivots file on issue_url, then count the content types.

In [362]:
# Left join on the single key issue_url (the key was previously listed twice)
result = pd.merge(masters, pivots[['issue_url','year','volume','issue','journal','type']], how="left", on="issue_url")
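
A left join like this can silently duplicate rows if an issue URL appears more than once in the pivots file, or leave years missing if an issue is absent from it. The sketch below shows one way to guard against both using pandas' `validate` and `indicator` options. The toy frames `left` and `right` are invented and are not the notebook's data.

```python
import pandas as pd

# Toy stand-ins for `masters` (left) and `pivots` (right).
left = pd.DataFrame({
    "issue_url": ["u1", "u1", "u2", "u3"],
    "title": ["a", "b", "c", "d"],
})
right = pd.DataFrame({"issue_url": ["u1", "u2"], "year": [1950, 1951]})

merged = left.merge(
    right,
    how="left",
    on="issue_url",
    validate="many_to_one",  # raises MergeError if issue_url repeats on the right
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Rows whose issue was not found in the right-hand frame.
unmatched = merged[merged["_merge"] == "left_only"]
```

Any rows in `unmatched` would have a missing year and so would drop out of the year-filtered counts.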

In [363]:
pd.DataFrame(result.content_type.value_counts())

content_type,count
Article,5501
MISC,2658
Review,930
Comment,70
Reply,43
Rejoinder,14
Discussion,9
Review2,5


In [364]:
pd.DataFrame(result[result.year>1939].content_type.value_counts())

content_type,count
Article,5290
MISC,2517
Review,930
Comment,70
Reply,41
Rejoinder,10
Discussion,9
Review2,5


In [365]:
pd.DataFrame(result[(result.year>1939) & (result.year<2011)].content_type.value_counts())

content_type,count
Article,4736
MISC,2302
Review,930
Comment,67
Reply,39
Rejoinder,10
Discussion,9
Review2,5
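
For integer years, the year window used above can also be written with `Series.between`, which is inclusive at both ends by default. This is a small sketch on invented data, with `toy` standing in for `result`.

```python
import pandas as pd

# Toy stand-in for `result`; years and types are made up.
toy = pd.DataFrame({
    "year": [1935, 1940, 1975, 2010, 2015],
    "content_type": ["Article", "Article", "MISC", "Article", "Article"],
})

# Equivalent to (year > 1939) & (year < 2011) when years are integers.
window = toy[toy["year"].between(1940, 2010)]
counts = window["content_type"].value_counts()
```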


In [366]:
result.to_excel(saveas, index=False)