# JPE cleaning
This notebook walks through how the JPE (Journal of Political Economy) records were sorted into articles and non-article content types.

## Loading libraries

In [2]:
from numpy import nan as NaN  # numpy.NaN was removed in NumPy 2.0; numpy.nan is the supported name
import pandas as pd
from difflib import SequenceMatcher
from multiprocessing import Pool
import multiprocessing as mp
import time

## Loading Files
Replace the file paths with your local paths and comment out anything that does not apply, e.g. datadump.

In [5]:
masters = pd.read_excel("C:\\Users\\sjwu1\\Journal_Data\\Master lists\\JPE_master.xlsx")
pivots = pd.read_excel("C:\\Users\\sjwu1\\Journal_Data\\pivots\\JPE_pivots.xlsx")
scopus = pd.read_excel("C:\\Users\\sjwu1\\Journal_Data\\Scopus\\JPE_SCOPUS.xlsx")
datadump = pd.read_excel("C:\\Users\\sjwu1\\Journal_Data\\datadumps\\JPE_datadump.xlsx")

pd.set_option('display.max_colwidth', None)

## Create File names
Again, replace these with local file paths.

In [6]:
authors="C:\\Users\\sjwu1\\Journal_Data\\datadumps\\JPE_authors.xlsx"
non_auth="C:\\Users\\sjwu1\\Journal_Data\\datadumps\\JPE_Nauthors.xlsx"
saveas="C:\\Users\\sjwu1\\Journal_Data\\datadumps\\JPE_processed.xlsx"
reviews="C:\\Users\\sjwu1\\Journal_Data\\datadumps\\JPE_reviews.xlsx"
misc="C:\\Users\\sjwu1\\Journal_Data\\datadumps\\JPE_misc.xlsx"
conf="C:\\Users\\sjwu1\\Journal_Data\\datadumps\\JPE_conf.xlsx"
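
One way to make the path replacement less error-prone is to derive every file from a single base directory. The following is a minimal sketch, assuming the same folder layout as above; `BASE_DIR`, `data_path`, `masters_file`, and `processed_file` are hypothetical names, and `Path.home() / "Journal_Data"` is only a placeholder location.

```python
from pathlib import Path

# Assumption: all journal data lives under one folder; change this single line locally.
BASE_DIR = Path.home() / "Journal_Data"

def data_path(*parts):
    """Build a path under BASE_DIR without hard-coding separators."""
    return BASE_DIR.joinpath(*parts)

masters_file = data_path("Master lists", "JPE_master.xlsx")
processed_file = data_path("datadumps", "JPE_processed.xlsx")

print(masters_file)
print(processed_file)
```

The resulting `Path` objects can be passed directly to `pd.read_excel` and `DataFrame.to_excel`.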

## Some random checks on the masters list
My working assumption is that every record without author names is a miscellaneous document, such as a committee report, a foreword, or front matter. The goal of this notebook is to verify that assumption and then classify those records as miscellaneous (MISC). As a first step, we group the data by title to find the repetitive, generic entries that can likely be removed.

In [7]:
pd.set_option('display.max_rows',masters.shape[0])
temp=masters['title'].str.lower().value_counts()
pd.DataFrame(temp[temp>1])

title,count
front matter,431
back matter,322
books received,248
volume information,137
washington notes,110
journal of political economy: acknowledges the assistance of:,74
new publications,50
journal of political economy,31
[notes],27
notices,18


Some repetitions arise because multiple comments share a title. Now consider the same counts restricted to records without author names.

In [9]:
temp2=masters[masters['authors'].isna()]['title'].str.lower().value_counts()
pd.DataFrame(temp2)

title,count
front matter,431
back matter,322
books received,248
volume information,137
washington notes,110
journal of political economy: acknowledges the assistance of:,74
new publications,50
journal of political economy,31
[notes],24
notices,18
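
Among the rows shown in the two tables, only `[notes]` differs (27 versus 24), which suggests that a few such entries do carry author names. A minimal sketch of how that comparison could be automated is given here, using a toy frame in place of `masters`; the rows and the names `toy`, `diff`, and `with_authors` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for `masters`; these rows are invented for illustration only.
toy = pd.DataFrame({
    "title": ["Front Matter", "Front Matter", "[Notes]", "[Notes]", "[Notes]"],
    "authors": [None, None, None, None, "A. Author"],
})

titles = toy["title"].str.lower()
all_counts = titles.value_counts()
no_author_counts = titles[toy["authors"].isna()].value_counts()

# Titles that appear more often overall than among author-less rows
# have at least one entry with an author and deserve a manual look.
diff = all_counts.sub(no_author_counts, fill_value=0)
with_authors = diff[diff > 0]
print(with_authors)
```

Any title that survives this filter is a candidate for manual inspection before it is labelled MISC.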


## Classifying miscellaneous documents

In [237]:
scopus.rename(columns = {'abstract':'abstract2', 'title':'title2', 'authors':'authors2'}, inplace = True)
scopus['pages2']=scopus['pages']
masters['pages']=masters['pages'].str.strip()
masters.loc[masters.title.str.lower() == "back matter", 'pages'] = NaN

# Fuzzy-match recurring boilerplate titles (similarity ratio above 0.75) and mark them as MISC
titles_lower = masters['title'].str.lower()
for target in ['front matter', 'back matter', 'volume information', 'books received', 'washington notes']:
    masters.loc[titles_lower.apply(lambda t: SequenceMatcher(None, t, target).ratio())>0.75,'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'(in )?memori(a|u)(m|l)')==True, 'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'^journal of political economy(.*)')==True,'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'^index to volume(.*)')==True,'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'^new publications')==True,'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'^(prefatory |\[)note(|s)(|\])$')==True,'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'^(|\[)questions and answers(\]|)$')==True,'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'^(|short )notice(|s)$')==True,'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'^back cover(.*)')==True,'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'^introduction(.*)')==True,'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'^combined references(.*)')==True,'content_type']='MISC'
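
To see what the 0.75 similarity threshold does in practice, here is a minimal sketch using `difflib.SequenceMatcher` from the standard library. The helper `is_close` is a hypothetical name and the titles are made up; only the threshold and the target strings come from the cell above.

```python
from difflib import SequenceMatcher

def is_close(title, target, threshold=0.75):
    """Return True when the similarity ratio exceeds the threshold."""
    return SequenceMatcher(None, title.lower(), target).ratio() > threshold

# Made-up titles for illustration.
exact = is_close("Front Matter", "front matter")          # identical after lowercasing
misspelled = is_close("Books Recieved", "books received")  # two letters swapped
unrelated = is_close("Monetary Policy", "front matter")    # different title

print(exact, misspelled, unrelated)  # → True True False
```

The ratio is twice the number of matched characters divided by the combined length of both strings, so small spelling variations in a title still clear the threshold while unrelated titles do not.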

## Classifying other content types

In [239]:
# check for how many articles are still unclassified
sum(masters.content_type.isna())
#masters.shape[0]

12407

In [240]:
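# Reviews: the author field starts with "review by" or "reviewed by".
# Note: these two rules have no content_type.isna() guard, so they can overwrite an earlier MISC label.
masters.loc[masters['authors'].str.lower().str.match(r'^review(ed|) by(.*)')==True,'content_type']='Review'
# Possible reviews without author names: the title contains " by "
masters.loc[(masters['title'].str.lower().str.match(r'(.*) by (.*)')==True) & (masters.authors.isna()==True),'content_type']='Review2'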


In [241]:
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*: (|a )comment(|.*)$')==True,'content_type']='Comment'
masters[masters['content_type']=='Comment'].shape[0] #comments

166

In [242]:
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*(:|\?) (|a )reply(| to.*)$')==True,'content_type']="Reply"
masters[masters['content_type']=='Reply'].shape[0]

114

In [243]:
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*(:|\?) (|a )rejoinder.*$')==True,'content_type']="Rejoinder"
masters[masters['content_type']=='Rejoinder'].shape[0]

46

In [244]:
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*: (|a )discussion$')==True,'content_type']="Discussion"
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'(^|a )discussion(|.*)$')==True,'content_type']="Discussion"
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*:.*(|a )discussion(|s)$')==True,'content_type']='Discussion'
masters[masters['content_type']=='Discussion'].shape[0]

17

In [245]:
masters.loc[masters['content_type'].isna(),'content_type']="Article"
masters[masters['content_type']=='Article'].shape[0]

5863
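
The cells above apply their rules in a fixed order, and the `content_type.isna()` guard means the first matching rule wins. A minimal sketch of that logic with the standard `re` module follows; `RULES` and `classify` are hypothetical names, the patterns are copied from the cells above (only the first Discussion pattern is included), and the example titles are invented.

```python
import re

# Patterns copied from the cells above; order matters because the first hit wins.
RULES = [
    ("Comment", r".*: (|a )comment(|.*)$"),
    ("Reply", r".*(:|\?) (|a )reply(| to.*)$"),
    ("Rejoinder", r".*(:|\?) (|a )rejoinder.*$"),
    ("Discussion", r".*: (|a )discussion$"),
]

def classify(title, default="Article"):
    """Return the label of the first rule whose pattern matches the lowercased title."""
    lowered = title.lower()
    for label, pattern in RULES:
        # re.match anchors at the start of the string, like pandas' str.match
        if re.match(pattern, lowered):
            return label
    return default

# Hypothetical titles, not taken from the JPE data.
examples = [
    "Wage Rigidity: A Comment",
    "Wage Rigidity: Reply",
    "Wage Rigidity: A Rejoinder",
    "Wage Rigidity and Unemployment",
]
labels = [classify(t) for t in examples]
print(labels)  # → ['Comment', 'Reply', 'Rejoinder', 'Article']
```

Anything that matches none of the rules falls through to the default label, which mirrors the final cell that assigns "Article" to every remaining record.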

In [1]:
# block for testing regex matches
#masters[masters['title'].str.lower().str.match(r'^washington notes$')==True]
#masters[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*(:|\?) (|a )rejoinder.*$')==True]
#masters[masters.content_type=='Discussion'].shape[0]


## Consider the pivots file
Conference papers are sometimes structured differently from regular articles, so it may be necessary to distinguish them. Here we separate special issues (S) from normal issues (N).

In [248]:
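# na=False treats a missing issue text as a non-match, so the boolean mask never contains NaN
pivots.loc[pivots['Jstor_issue_text'].str.lower().str.match(r'(.*)(supplement|proceedings|annual meeting|survey)(.*)', na=False),'type']="S"
pivots.loc[pivots['type'].isna(),'type']='N'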
pivots.type.value_counts()
pivots[pivots.type=="S"]

Unnamed: 0,year,month,volume,issue,issue_url,Jstor_issue_text,journal,pivot_url,no_docs,type
77,2004,February,112,S1,https://www.jstor.org/stable/10.1086/jpe.2004.112.issue-s1,No. S1 Papers in Honor of Sherwin Rosen: A Supplement to Volume 112 February 2004 pp. S1-S336,JPE,https://www.jstor.org/stable/10.1086/379940,13,S
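
The special-issue rule can be checked on individual strings before it is applied to the whole frame. This is a minimal sketch with the standard `re` module; `SPECIAL` and `issue_type` are hypothetical names, the first test string is the issue text from the row above, the second is invented, and treating a missing issue text as a normal issue is an assumption.

```python
import re

# Same keyword pattern as in the cell above.
SPECIAL = re.compile(r"(.*)(supplement|proceedings|annual meeting|survey)(.*)")

def issue_type(issue_text):
    """Return 'S' for special issues and 'N' otherwise; missing text counts as 'N'."""
    if not isinstance(issue_text, str):
        return "N"  # assumption: no issue text means a normal issue
    return "S" if SPECIAL.match(issue_text.lower()) else "N"

rosen = issue_type(
    "No. S1 Papers in Honor of Sherwin Rosen: A Supplement to Volume 112 "
    "February 2004 pp. S1-S336"
)
regular = issue_type("No. 1 February 2004 pp. 1-200")  # made-up regular issue
missing = issue_type(None)

print(rosen, regular, missing)  # → S N N
```

Because the keywords are matched anywhere in the text, a word such as "surveys" would also trigger the special-issue label.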


Merge the issue-level information from pivots into masters.

In [250]:
result = pd.merge(masters, pivots[['issue_url','year','volume','issue','journal','type']], how="left", on="issue_url")
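
A left merge silently duplicates rows if a key repeats on the right side, and leaves missing values where no key matches. A minimal sketch of how both cases could be checked with the `validate` and `indicator` parameters of `pd.merge` is shown here; the toy frames and all their values are invented stand-ins for `masters` and `pivots`.

```python
import pandas as pd

# Toy stand-ins for `masters` and `pivots`; all values are invented.
toy_masters = pd.DataFrame({
    "title": ["Paper A", "Paper B", "Front Matter"],
    "issue_url": ["u1", "u1", "u3"],
})
toy_pivots = pd.DataFrame({
    "issue_url": ["u1", "u2"],
    "year": [2004, 2005],
    "type": ["S", "N"],
})

merged = pd.merge(
    toy_masters,
    toy_pivots,
    how="left",
    on="issue_url",
    validate="many_to_one",  # raises MergeError if an issue_url repeats in toy_pivots
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)

# Rows from the left frame that found no matching issue
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
```

Rows flagged `left_only` have no `year` or `type`, so they would drop out of any later filter on `year`.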

## Summaries of content 

In [251]:
pd.DataFrame(result.content_type.value_counts())

content_type,count
Article,5863
Review,5452
MISC,1509
Review2,764
Comment,166
Reply,114
Rejoinder,46
Discussion,17


In [252]:
pd.DataFrame(result[result.year>1939].content_type.value_counts())

content_type,count
Article,4138
Review,2874
MISC,1053
Comment,161
Reply,101
Rejoinder,38
Review2,4
Discussion,1


In [253]:
pd.DataFrame(result[(result.year>1939) & (result.year<2011)].content_type.value_counts())

content_type,count
Article,3906
Review,2874
MISC,1001
Comment,160
Reply,99
Rejoinder,38
Review2,3
Discussion,1


In [254]:
result.to_excel(saveas, index=False)