# Author Name Recon

This notebook cleans the author names of every paper in the top 5 economics journals (AER, JPE, ECTA, RES, QJE), across all years, using JSTOR metadata. Running the cells produces a reconciliation file for the author names as they appear, in raw form, in the masterlists. For each author of each article it records what went in (the raw name), what came out (the cleaned name) and the possible aliases. I have not filtered on content_type or year.

## Notes

* Some of the JPE citation files acquired from the Chicago journals website have an error where special characters are not encoded properly. I have resolved these manually, but this should be fixed further up the pipeline. Only 33 names are affected, and the unidecode library cannot resolve them. The affected files are:

{'uchicago_jpe126_1.bib',
 'uchicago_jpe126_3.bib',
 'uchicago_jpe126_5.bib',
 'uchicago_jpe126_6.bib',
 'uchicago_jpe126_S1.bib',
 'uchicago_jpe127_1.bib',
 'uchicago_jpe127_2.bib',
 'uchicago_jpe127_3.bib',
 'uchicago_jpe127_4.bib',
 'uchicago_jpe127_5.bib',
 'uchicago_jpe128_1.bib',
 'uchicago_jpe128_10.bib',
 'uchicago_jpe128_11.bib',
 'uchicago_jpe128_12.bib',
 'uchicago_jpe128_2.bib',
 'uchicago_jpe128_4.bib',
 'uchicago_jpe128_5.bib',
 'uchicago_jpe128_7.bib',
 'uchicago_jpe128_8.bib',
 'uchicago_jpe128_9.bib'} 
* Resolving some special characters with unicodedata.normalize(), followed by ASCII encoding, produces an empty string. For example, the Danish ø (o with a slash) resolves to an empty string rather than to o. I have therefore written a function and an accompanying recon file to fix this.
* Some authors are recorded in a contracted form:
    * only their initials are given (even the last name is contracted);
    * only the last name is given, because the author is so well known;
    * there are so many authors that only the primary author is named and the others are collapsed into a catch-all term such as "other", "company" or "co.".
    
    These cases are few and easy to pick out, so I recommend resolving them here. Todo: manually compile a recon file to replace them.
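The empty-string behaviour described in the second note can be reproduced with the standard library alone. The sketch below is illustrative: `FALLBACK` is a hypothetical hand-written table standing in for the recon file, and `nfkd_ascii` and `to_ascii` are assumed helper names, not functions from this notebook.

```python
import unicodedata

# Hypothetical hand-maintained fallbacks for characters that have no
# Unicode decomposition (a stand-in for the recon file).
FALLBACK = {"ø": "o"}

def nfkd_ascii(ch):
    """Decompose a character and drop everything that is not ASCII."""
    return unicodedata.normalize("NFKD", ch).encode("ascii", "ignore").decode("ascii")

def to_ascii(text):
    """Resolve each non-ASCII character via NFKD, then via the fallback table."""
    out = []
    for ch in text:
        if ch.isascii():
            out.append(ch)
        else:
            # Keep the original character if neither step can resolve it.
            out.append(nfkd_ascii(ch) or FALLBACK.get(ch, ch))
    return "".join(out)

print(repr(nfkd_ascii("ö")))  # decomposes into "o" plus a combining mark
print(repr(nfkd_ascii("ø")))  # has no decomposition, so nothing survives
print(to_ascii("løken"))
```

Characters that neither step resolves are left untouched, so they can still be found afterwards with `str.isascii()`.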

**Input**: 
* each of the cleaned masterlists with file ending in \_M_sco_du.xlsx 

**Output**: 
* a JSON file of reconciled and cleaned author names and associated article URLs
* a list of non-ASCII characters and the ASCII equivalents to which they should resolve
* a list of author name corrections for the JPE errors
* TODO: a list of author name corrections for the contraction cases; this still needs article URL matching.
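For orientation, the JSON output maps each article URL to a record of roughly the following shape. The key names follow the processing loop later in this notebook; the URL and all values below are placeholders chosen for illustration, not entries from the data.

```python
import json

url = "https://www.jstor.org/stable/0000000"  # hypothetical URL

# Illustrative record only: placeholder values.
example = {
    url: {
        "authors": {
            0: {
                "raw": "Frank H. Knight",   # name as it appears in the masterlist
                "init": "frank h. knight",  # lowercased and whitespace-cleaned
                "a1": "frank h. knight",    # middle names contracted
                "a2": "f. h. knight",       # all given names contracted
                "a3": "f. knight",          # first initial + last name
            }
        },
        "year": 1900,                        # placeholder
        "content_type": "Article",
    }
}

print(json.dumps(example, indent=4))
```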

## Import libraries

In [447]:
import pandas as pd
from unidecode import unidecode
import re
from datetime import date
import json
import numpy as np
import string
import time
import unicodedata #used when generating the special character file
# set column options
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

## Read in input files

In [448]:
JOURNALS= ['AER', 'JPE', 'ECTA', 'RES', 'QJE']
#read in all processed masterlists
j_data=pd.DataFrame()
for i in JOURNALS:
    j_data=pd.concat([pd.read_excel('/Users/sijiawu/Work/Refs Danae/Thesis/Data/Combined/'+i+'_M_sco_du.xlsx'), j_data], ignore_index=True)
#Create a batch file

j_data=j_data[j_data.duplicated()==False].reset_index().drop('index', axis=1)

In [449]:
j_data.columns # verify headers

Index(['issue_url', 'ISSN', 'URL', 'journal', 'number', 'publisher', 'title',
       'urldate', 'volume', 'year', 'abstract', 'author', 'pages',
       'reviewed-author', 'uploaded', 'content_type', 'author_split',
       'title_10', 'type', 'authorsSCO', 'titleSCO', 'journalSCO', 'DOI',
       'affiliations', 'abstractSCO', 'citations', 'document type',
       'index keywords', 'author keywords', 'document_type'],
      dtype='object')

In [450]:
len(j_data) #number of articles

62262

In [451]:
j_data["content_type"].value_counts()

content_type
Article       32882
Review        13037
MISC          12466
Comment        1420
Reply           834
Review2         761
Discussion      709
Rejoinder       153
Name: count, dtype: int64

## Create a restricted character set

In [452]:
char_set="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ,'.-"
chars=[*char_set]
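How `char_set` is used is not shown in this section. One plausible use, sketched here as an assumption, is to flag names that contain characters outside the restricted set. `disallowed_chars` is a hypothetical helper name.

```python
char_set = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ,'.-"
allowed = set(char_set)

def disallowed_chars(name):
    """Return the sorted characters of `name` that fall outside the restricted set."""
    return sorted(set(name) - allowed)

print(disallowed_chars("Lt. (j.g.) Kenyon E. Poole"))
print(disallowed_chars("[Samuel B. Clarke]"))
print(disallowed_chars("Frank H. Knight"))
```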

## Constants

In [453]:
auth_ad=[
    "c.p.a.",
    "m.e.",
    "jr.", #junior
    "m.d.",
    "s.j.", #society of jesus
    "s. j.", #society of jesus
    "2nd",
    "3rd",
    "wm.",  #contraction of the given name William
    "yr." #misspelling of junior?
]
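These suffixes matter when a name contains a comma: "last, first" should be reordered, but "name, jr." should not. The sketch below illustrates that guard in isolation. `reorder_name` is a hypothetical helper, and it splits on the first comma with `str.partition`, which differs slightly from the `split(", ")` used in the processing loop.

```python
auth_ad = ["c.p.a.", "m.e.", "jr.", "m.d.", "s.j.", "s. j.",
           "2nd", "3rd", "wm.", "yr."]

def reorder_name(name):
    """Turn 'last, first' into 'first last' unless the comma introduces a suffix."""
    name = name.lower()
    if "," not in name:
        return name
    if any(", " + suffix in name for suffix in auth_ad):
        return name  # the comma belongs to a suffix such as ", jr."
    last, _, first = name.partition(",")
    return first.strip() + " " + last.strip()

print(reorder_name("Harstad, Bård"))
print(reorder_name("L. A. S., Jr."))
```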

## Resolve the JPE formatting issue

Generate the file using the code block at the end of the notebook (see "Resolving problem names for JPE") and resolve the errors manually. Rename the file to jpe_problem_names.json, then run the next two code blocks.

In [454]:
with open('jpe_problem_names.json') as f: 
    data = f.read() 
names_repl = json.loads(data) 

In [455]:
for i in j_data.index:
    if j_data.loc[i, "author"] in names_repl.keys():
         j_data.loc[i, "author"]=names_repl[j_data.loc[i, "author"]]
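The row-by-row replacement above could also be written as a single vectorised step. This is a sketch on hypothetical stand-ins (`df` for `j_data`, and a one-entry `names_repl`), not a change to the notebook's logic.

```python
import pandas as pd

# Hypothetical stand-ins for j_data and names_repl.
df = pd.DataFrame({"author": ["Harstad, B\\r{a}rd", "Frank H. Knight", None]})
names_repl = {"Harstad, B\\r{a}rd": "Harstad, Bård"}

# Look up each author string in the replacement table; strings without
# an entry come back as missing and are refilled from the original column.
df["author"] = df["author"].map(names_repl).fillna(df["author"])

print(df)
```

Rows with no author stay missing, as in the loop version.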

## Resolve special characters outside ascii

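As above, generate the file using the code at the end of the notebook (see "Resolving special characters") and rename it to spec_char_replacement.json, or copy in an existing version of the file.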

In [456]:
with open('spec_char_replacement.json') as f: 
    data = f.read() 
js = json.loads(data) 

def replace_spec_chars(str_in):
    hold=""
    for o in str_in:
        if o.lower() in js.keys():
            hold=hold+js[o.lower()]
        else:
            hold=hold+o
    return hold

In [499]:
for i in j_data.index:
    if pd.notna(j_data.loc[i, "author"]):
        if j_data.loc[i, "author"].isascii()==True:
            continue
        j_data.loc[i, "author"]=replace_spec_chars(j_data.loc[i,"author"])
        if j_data.loc[i, "author"].isascii()==False:
            print(j_data.loc[i, "author"])

Ayse İmrohoroglu and Selahattin İmrohoroglu and Douglas H. Joines
Luisa Fuster and Ayse İmrohoroglu and Selahattin İmrohoroglu
Mariagiovanna Baccara and Ayse İmrohoroglu and Alistair J. Wilson and Leeat Yariv
Kaiji Chen and Ayse İmrohoroglu and Selahattin İmrohoroglu
Ayse İmrohoroglu and Selahattin İmrohoroglu and Douglas H. Joines
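A possible explanation for the names still printed above (an assumption, not something the notebook confirms): in Python, lowercasing the capital dotted İ yields two code points, "i" followed by a combining dot, so a per-character lookup of `o.lower()` cannot match a table built from lowercased text. The sketch below demonstrates this and one way around it. `table` and `replace_after_decomposing` are hypothetical.

```python
import unicodedata

capital_i_dot = "\u0130"         # LATIN CAPITAL LETTER I WITH DOT ABOVE

lowered = capital_i_dot.lower()  # "i" followed by U+0307 (combining dot above)
print(len(lowered), [hex(ord(c)) for c in lowered])

# Hypothetical replacement table built from lowercased text: it can hold
# the combining dot as a key, but not the two-character string.
table = {"\u0307": ""}
print(lowered in table)

def replace_after_decomposing(text, table):
    """Decompose the text first, then look up each resulting code point."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(table.get(ch.lower(), ch) for ch in decomposed)

print(replace_after_decomposing("Ayse \u0130mrohoroglu", table))
```

Decomposing first also splits other accented letters into a base letter plus a combining mark, so the table would need entries for those marks as well.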


Create a new column where all author names are split by " and ".

In [459]:
j_data["test"]=j_data['author'].str.split(" and ")
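A small illustration of what the split produces, using hypothetical rows. The second row mirrors a firm name that appears in the flagged output further down, and shows that a name containing " and " is split like a list of authors.

```python
import pandas as pd

# Hypothetical rows, not taken from j_data.
df = pd.DataFrame({"author": ["Ayse Imrohoroglu and Douglas H. Joines",
                              "Touche Ross and Company",
                              None]})

# Each author string becomes a list; missing authors stay missing.
df["test"] = df["author"].str.split(" and ")

print(df["test"].tolist())
```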

## Process author names

In [513]:
#store the aliases in lists for comparison later
a1=[]
a2=[]
i1=[]
ln=[]


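#function for creating aliases
def proc_auth(auth_str):
    """Build the alias forms of a single lowercased author name.

    Returns [a1, a2, a3, last_name], where
      a1: first name + middle initials + last name
      a2: all given names as initials + last name
      a3: first initial + last name
    If every token contains a "." (initials only) or the name is a single
    token, the name is returned unchanged four times with a trailing 0,
    so the caller can spot these cases by the list length of 5.
    """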
    auth_split=auth_str.split(" ")
    check=0
    for j in auth_split:
        if "." in j:
            check=check+1
    if check==len(auth_split):
        return [auth_str, auth_str, auth_str, auth_str, 0]

    if len(auth_split)>1:
        init_auth=auth_split[0][0]+'. '+auth_split[-1]
        alt_auth=""
        alt_2_auth=auth_split[0]+" "
        if len(auth_split)>2:
            for k in auth_split[1:-1]:
                alt_2_auth+=k[0]+'. '
        alt_2_auth+=auth_split[-1]
        
        
        for k in auth_split[:-1]:
            alt_auth+=k[0]+'. '
        alt_auth+=auth_split[-1]
        ln.append(auth_split[-1])
        a1.append(alt_2_auth)
        a2.append(alt_auth)
        i1.append(init_auth)
        
        
        
        return [alt_2_auth, alt_auth, init_auth, auth_split[-1]]
    else:
        return [auth_str, auth_str, auth_str, auth_str, 0]
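To make the three alias types concrete, here is a compact, side-effect-free sketch of the same contraction rules. `make_aliases` is a hypothetical stand-in for `proc_auth`: it returns a 4-tuple in every case and does not append to the global lists.

```python
def make_aliases(name):
    """Return (a1, a2, a3, last_name) for a lowercased, space-separated name."""
    parts = name.split(" ")
    if len(parts) < 2 or all("." in p for p in parts):
        return (name, name, name, name)  # nothing to contract
    first, middle, last = parts[0], parts[1:-1], parts[-1]
    # a1: keep the first name, contract the middle names
    a1 = " ".join([first] + [m[0] + "." for m in middle] + [last])
    # a2: contract every given name
    a2 = " ".join([p[0] + "." for p in parts[:-1]] + [last])
    # a3: first initial + last name
    a3 = first[0] + ". " + last
    return (a1, a2, a3, last)

for n in ["frank h. knight", "j. laurence laughlin", "e. h. c."]:
    print(make_aliases(n))
```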


In [514]:

all_authors=[]
all_authors_a=[]
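autht=0 #running count of author mentions
a=0 #count of entries with no author, incremented in the else branch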
proc_auths_all={}
for i in j_data.index:
    authors=j_data.loc[i,"test"]
    proc_auths={"authors":{}, "year":j_data.loc[i,"year"], 'content_type':j_data.loc[i, "content_type"]}
    if isinstance(authors, list): #NaN (not a list) where the entry has no author
        if "suggested" in j_data.loc[i, "author"]:
            print(j_data.loc[i, "author"])
            continue
        for j in range(len(authors)):
            
            sa=authors[j].lower()
            all_authors_a.append(str(sa))
            sa=re.sub('\xa0'," ",sa)
            sa=re.sub("  "," ",sa)
            autht=autht+1
            
            if "," in sa:
                o=None
                for k in auth_ad:
                    p=", "+k
                    if p in sa:
                        o=", "+k
                if o is None:
                    reorg=sa.split(r", ")
                    sa=reorg[1].strip()+ " "+ reorg[0].strip()
           
            all_authors.append(sa)
            proc_auths["authors"][j]={}
            proc_auths["authors"][j]["raw"]=authors[j]
            proc_auths["authors"][j]["init"]=sa
            
            aliases=proc_auth(sa)
            if len(aliases)==5:
                print(sa)
                print(j_data.loc[i,"author"])
                print(j_data.loc[i, "URL"])
                print(j_data.loc[i, "year"])
                print(j_data.loc[i, "journal"])

            proc_auths["authors"][j]["a1"]=aliases[0]
            proc_auths["authors"][j]["a2"]=aliases[1]
            proc_auths["authors"][j]["a3"]=aliases[2]

            if ("[" in sa) or ("(" in sa):
                print(sa)
                print(j_data.loc[i,"author"])
                print(j_data.loc[i, "year"])
                print(j_data.loc[i, "journal"])
                print(j_data.loc[i, "URL"])

    else:
        a=a+1
    proc_auths_all[j_data.loc[i,"URL"]]=proc_auths

e. h. c.
E. H. C.
https://www.jstor.org/stable/1885441
1929
The Quarterly Journal of Economics
f. w. t.
F. W. T.
https://www.jstor.org/stable/1886061
1909
The Quarterly Journal of Economics
f. w. t.
F. W. T.
https://www.jstor.org/stable/1883356
1907
The Quarterly Journal of Economics
e. f. g.
E. F. G.
https://www.jstor.org/stable/1884868
1904
The Quarterly Journal of Economics
w. j. a.
Frances Gardiner Davenport and W. J. A.
https://www.jstor.org/stable/1882129
1897
The Quarterly Journal of Economics
w. j. a.
Henry W. Wolff and W. J. A.
https://www.jstor.org/stable/1882880
1893
The Quarterly Journal of Economics
j. g. b.
J. G. B.
https://www.jstor.org/stable/1882518
1892
The Quarterly Journal of Economics
[samuel b. clarke]
[Samuel B. Clarke]
1891
The Quarterly Journal of Economics
https://www.jstor.org/stable/1879614
jeong ho (john) kim
JUNEHYUK JUNG and JEONG HO (JOHN) KIM and FILIP MATeJKA and CHRISTOPHER A. SIMS
2019
The Review of Economic Studies
https://www.jstor.org/stable/26839

company
Touche Ross and Company
https://www.jstor.org/stable/1806876
1989
The American Economic Review
company
Touche Ross and Company
https://www.jstor.org/stable/1804119
1987
The American Economic Review
co.
Touche Ross and Co.
https://www.jstor.org/stable/1814832
1985
The American Economic Review
company
Touche Ross and Company
https://www.jstor.org/stable/1831567
1982
The American Economic Review
company
Arthur Andersen and Company
https://www.jstor.org/stable/1802802
1981
The American Economic Review
l. a. s., jr.
L. A. S., Jr.
https://www.jstor.org/stable/1811482
1945
The American Economic Review
others
H. R. Tolley and Others
https://www.jstor.org/stable/1818464
1945
The American Economic Review
lt. (j.g.) kenyon e. poole
Lt. (j.g.) Kenyon E. Poole
1944
The American Economic Review
https://www.jstor.org/stable/1910756
n. s. b. g.
N. S. B. G.
https://www.jstor.org/stable/1807681
1940
The American Economic Review
company
Arthur Andersen and Company
https://www.jstor.org/stable/180
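The flagged names above fall into a few recognisable groups, which could feed the manual recon file mentioned in the notes. The sketch below shows one way to label them. `CATCH_ALL` is an assumed list based on the terms named in the notes, and `classify_flag` is a hypothetical helper.

```python
# Assumed catch-all terms, taken from the notes; extend as needed.
CATCH_ALL = {"other", "others", "company", "co."}

def classify_flag(name):
    """Label a lowercased author string with the reason it needs manual review."""
    if "[" in name or "(" in name:
        return "bracketed"
    if name in CATCH_ALL:
        return "catch-all"
    tokens = name.replace(",", " ").split()
    # Same test as in the alias step: every token contains a full stop.
    if tokens and all("." in t for t in tokens):
        return "initials only"
    if len(tokens) == 1:
        return "single token"
    return "ok"

for n in ["e. h. c.", "company", "[samuel b. clarke]", "l. a. s., jr.", "frank h. knight"]:
    print(n, "->", classify_flag(n))
```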

## Save the output

In [496]:

with open("author_proc_"+str(time.time())+".json", "w") as outfile: 
    json.dump(proc_auths_all, outfile, indent=4, default=int)
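Two details of the JSON round trip are worth keeping in mind when reading the file back: integer dictionary keys (the author positions) come back as strings, and `default=int` is what lets numpy integers be written at all. The sketch below uses a hypothetical URL and placeholder values.

```python
import json
import numpy as np

url = "https://example.org/article"  # hypothetical URL
record = {"authors": {0: {"raw": "Frank H. Knight"}},
          "year": np.int64(1900)}    # placeholder year

# default=int is called for values json cannot serialise itself,
# such as numpy integers.
text = json.dumps({url: record}, default=int)
loaded = json.loads(text)

print(loaded[url]["authors"].keys())  # author positions are now strings
print(type(loaded[url]["year"]))
```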

## Compute some stats

In [515]:
auth_t=list(set(all_authors_a))
u_auth=list(set(all_authors))
u_a1=list(set(a1))
u_a2=list(set(a2))
u_a3=list(set(i1))
u_ln=list(set(ln))
print("The total number of authors: "+ str(len(all_authors)))
print("The number of unique author names without editing: "+ str(len(auth_t)))
print("The number of unique author names after initial processing: "+str(len(u_auth)))
print("Unique alias 1 type names (contracted middle names): "+str(len(u_a1)))
print("Unique alias 2 type names (contracted first names): "+str(len(u_a2)))
print("Unique alias 3 type names (contracted first name + last name): "+str(len(u_a3)))
print("Unique last names: "+str(len(u_ln)))


The total number of authors: 68450
The number of unique author names without editing: 23164
The number of unique author names after initial processing: 22559
Unique alias 1 type names (contracted middle names): 22115
Unique alias 2 type names (contracted first names): 20289
Unique alias 3 type names (contracted first name + last name): 17720
Unique last names: 12562


## Top 20 authors ranked by output, all time

In [516]:
# after initial process
pd.DataFrame(all_authors).value_counts()[:20]

frank h. knight         113
george j. stigler        95
a. b. wolfe              92
paul a. samuelson        91
daron acemoglu           85
f. w. taussig            84
william j. baumol        83
m. bronfenbrenner        81
j. m. clark              79
jean tirole              77
joseph e. stiglitz       75
paul h. douglas          71
h. parker willis         68
wesley c. mitchell       67
milton friedman          66
franklin m. fisher       64
harry g. johnson         63
t. n. carver             60
j. laurence laughlin     57
chester w. wright        57
Name: count, dtype: int64

In [517]:
# using a1
pd.DataFrame(a1).value_counts()[:20]

frank h. knight       113
george j. stigler      95
a. b. wolfe            92
paul a. samuelson      91
j. m. clark            88
daron acemoglu         85
f. w. taussig          84
william j. baumol      83
m. bronfenbrenner      81
jean tirole            77
joseph e. stiglitz     75
paul h. douglas        71
h. p. willis           70
wesley c. mitchell     67
milton friedman        66
franklin m. fisher     64
harry g. johnson       63
frank a. fetter        60
t. n. carver           60
j. l. laughlin         58
Name: count, dtype: int64

In [518]:
# using a2
pd.DataFrame(a2).value_counts()[:20]

f. h. knight         133
m. bronfenbrenner    101
j. m. clark           97
p. a. samuelson       96
g. j. stigler         95
a. b. wolfe           94
w. j. baumol          90
f. w. taussig         86
j. e. stiglitz        86
d. acemoglu           85
j. tirole             77
w. c. mitchell        75
h. p. willis          74
p. h. douglas         73
m. friedman           66
f. m. fisher          66
t. n. carver          64
a. p. lerner          64
h. g. johnson         63
f. a. fetter          63
Name: count, dtype: int64

In [519]:
# using a3
pd.DataFrame(i1).value_counts()[:20]

f. knight            133
j. clark             112
f. fetter            111
m. bronfenbrenner    101
r. gordon            101
j. stiglitz           98
p. samuelson          98
g. stigler            97
a. wolfe              94
w. baumol             92
f. taussig            86
d. acemoglu           85
w. mitchell           83
m. feldstein          78
j. tirole             77
v. smith              74
p. douglas            74
h. willis             74
j. robinson           73
c. wright             71
Name: count, dtype: int64
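The count for `f. fetter` under a3 (111) is well above the count for `frank a. fetter` under a1 (60), which suggests that the shortest alias may merge several name variants or several people. One way to list such cases is to group the longer aliases under each a3 key. The pairs below are hypothetical, not taken from the data.

```python
from collections import defaultdict

# Hypothetical (a1, a3) pairs of the kind produced by the alias step.
pairs = [("frank a. fetter", "f. fetter"),
         ("frank w. fetter", "f. fetter"),
         ("frank h. knight", "f. knight"),
         ("frank h. knight", "f. knight")]

variants = defaultdict(set)
for a1_name, a3_name in pairs:
    variants[a3_name].add(a1_name)

# Keep only the short aliases that cover more than one longer form.
ambiguous = {k: sorted(v) for k, v in variants.items() if len(v) > 1}
print(ambiguous)
```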

## Distribution of authors by scale of output, all time

In [119]:
# after initial processing
summary=[]
seq=[0,1,2,5,10,50,100,200,1000]
nms=pd.DataFrame(pd.DataFrame(all_authors).value_counts()).reset_index()
for i in range(len(seq)-1):
#     print("between "+str(seq[i])+" exclusive and "+str(seq[i+1])+ " inclusive")
#     print(nms[(nms["count"]<=seq[i+1])&(nms["count"]>seq[i])].shape[0])
    summary.append({"occurences":str(seq[i])+" < x <= "+str(seq[i+1]), "number":nms[(nms["count"]<=seq[i+1])&(nms["count"]>seq[i])].shape[0]})
    
pd.DataFrame(summary)

        occurences  number
0       0 < x <= 1   12873
1       1 < x <= 2    3610
2       2 < x <= 5    3439
3      5 < x <= 10    1610
4     10 < x <= 50    1166
5    50 < x <= 100      33
6   100 < x <= 200       1
7  200 < x <= 1000       0
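The manual loop over `seq` can also be expressed with `pd.cut`, whose default right-closed intervals match the "lo < x <= hi" bins used here. This is a sketch on hypothetical data rather than a replacement for the cells in this section.

```python
import pandas as pd

names = ["a", "a", "a", "b", "c", "c", "d"]  # hypothetical author mentions
counts = pd.Series(names).value_counts()     # mentions per author

seq = [0, 1, 2, 5, 10, 50, 100, 200, 1000]
# right=True (the default) gives intervals (lo, hi], i.e. lo < x <= hi.
binned = pd.cut(counts, bins=seq).value_counts().sort_index()

print(binned)
```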


In [120]:
# first initial + last name
summary=[]
seq=[0,1,2,5,10,50,100,200,1000]
nms=pd.DataFrame(pd.DataFrame(i1).value_counts()).reset_index()
for i in range(len(seq)-1):
#     print("between "+str(seq[i])+" exclusive and "+str(seq[i+1])+ " inclusive")
#     print(nms[(nms["count"]<=seq[i+1])&(nms["count"]>seq[i])].shape[0])
    summary.append({"occurences":str(seq[i])+" < x <= "+str(seq[i+1]), "number":nms[(nms["count"]<=seq[i+1])&(nms["count"]>seq[i])].shape[0]})
    
pd.DataFrame(summary)

        occurences  number
0       0 < x <= 1    8824
1       1 < x <= 2    2854
2       2 < x <= 5    3155
3      5 < x <= 10    1640
4     10 < x <= 50    1365
5    50 < x <= 100      44
6   100 < x <= 200       5
7  200 < x <= 1000       0


In [112]:
# Initials and last name
summary=[]
seq=[0,1,2,5,10,50,100,200,1000]
nms=pd.DataFrame(pd.DataFrame(a2).value_counts()).reset_index()
for i in range(len(seq)-1):
    print("between "+str(seq[i])+" exclusive and "+str(seq[i+1])+ " inclusive")
    print(nms[(nms["count"]<=seq[i+1])&(nms["count"]>seq[i])].shape[0])
    summary.append({"occurences":str(seq[i])+" < x <= "+str(seq[i+1]), "number":nms[(nms["count"]<=seq[i+1])&(nms["count"]>seq[i])].shape[0]})
    
pd.DataFrame(summary)

between 0 exclusive and 1 inclusive
10915
between 1 exclusive and 2 inclusive
3266
between 2 exclusive and 5 inclusive
3366
between 5 exclusive and 10 inclusive
1637
between 10 exclusive and 50 inclusive
1247
between 50 exclusive and 100 inclusive
39
between 100 exclusive and 200 inclusive
2
between 200 exclusive and 1000 inclusive
0


        occurences  number
0       0 < x <= 1   10915
1       1 < x <= 2    3266
2       2 < x <= 5    3366
3      5 < x <= 10    1637
4     10 < x <= 50    1247
5    50 < x <= 100      39
6   100 < x <= 200       2
7  200 < x <= 1000       0


In [121]:
# contracted middle names
summary=[]
seq=[0,1,2,5,10,50,100,200,1000]
nms=pd.DataFrame(pd.DataFrame(a1).value_counts()).reset_index()
for i in range(len(seq)-1):
#     print("between "+str(seq[i])+" exclusive and "+str(seq[i+1])+ " inclusive")
#     print(nms[(nms["count"]<=seq[i+1])&(nms["count"]>seq[i])].shape[0])
    summary.append({"occurences":str(seq[i])+" < x <= "+str(seq[i+1]), "number":nms[(nms["count"]<=seq[i+1])&(nms["count"]>seq[i])].shape[0]})
    
pd.DataFrame(summary)

        occurences  number
0       0 < x <= 1   12542
1       1 < x <= 2    3564
2       2 < x <= 5    3418
3      5 < x <= 10    1619
4     10 < x <= 50    1182
5    50 < x <= 100      34
6   100 < x <= 200       1
7  200 < x <= 1000       0


## Resolving problem names for JPE

The file of resolved JPE names is read in above; this section shows how to generate it. Run this cell on j_data before the replacement file has been applied. The problem names are written to a JSON file with a unique timestamp. Resolve each name manually, using the URLs printed below to check the author name on the article page where a name was cut off.

In [281]:
stuff={}
for i in j_data.index:
    authors=j_data.loc[i,"author"]
    if pd.notna(authors):
        sa=authors.lower()
        if ("{" in sa) or ("\\" in sa):
            print(j_data.loc[i,"author"])
            print(j_data.loc[i, "URL"])
            stuff[j_data.loc[i, "author"]]=j_data.loc[i, "author"]
            print()

with open("jpe_problem_names_"+str(time.time())+".json", "w") as outfile: 
    json.dump(stuff, outfile, indent=4)

Gr\"{o}nqvist, Hans and Nilsson, J. Peter and Robling, Per-Olof
https://doi.org/10.1086/708725

Foley-Fisher, Nathan and Narajabad, Borghan and Verani, St\'{e}phane
https://doi.org/10.1086/708817

Hansman, Christopher and Hjort, Jonas and Le\'{o}n-Ciliotta, Gianmarco and Teachout, Matthieu
https://doi.org/10.1086/708818

Baliga, Sandeep and Sj\"{o}str\"{o}m, Tomas
https://doi.org/10.1086/707767

Mourifi\'{e
https://doi.org/10.1086/708724

Harstad, B\r{a}rd
https://doi.org/10.1086/707024

Alm\r{a}s, Ingvild and Cappelen, Alexander W. and Tungodden, Bertil
https://doi.org/10.1086/705551

Bhuller, Manudeep and Dahl, Gordon B. and L\o{}ken, Katrine V. and Mogstad, Magne
https://doi.org/10.1086/705330

Kosse, Fabian and Deckers, Thomas and Pinger, Pia and Schildberg-H\"{o}risch, Hannah and Falk, Armin
https://doi.org/10.1086/704386

Herrera, Helios and Ordo\~{n}ez, Guillermo and Trebesch, Christoph
https://doi.org/10.1086/704544

Battaglini, Marco and Harstad, B\r{a}rd
https://doi.org/10.10
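The notebook resolves these names manually. As a possible first pass before the manual check, the LaTeX accent commands seen in the output above could be decoded automatically. This is a sketch under assumptions: `COMBINING` covers only a handful of accent commands, `decode_latex_accents` is a hypothetical helper, and truncated entries are left as they are for manual resolution.

```python
import re
import unicodedata

# Combining marks for a few LaTeX accent commands (not exhaustive).
COMBINING = {'"': "\u0308", "'": "\u0301", "~": "\u0303",
             "r": "\u030a", "`": "\u0300", "^": "\u0302"}
ACCENT = re.compile(r"""\\(["'~r`^])\{([A-Za-z])\}""")

def decode_latex_accents(text):
    """Replace LaTeX accent commands with the accented character."""
    # The slashed o is a letter of its own, not base letter + accent.
    text = text.replace("\\o{}", "\u00f8").replace("\\O{}", "\u00d8")
    # Put the combining mark after the base letter, then compose with NFC.
    text = ACCENT.sub(lambda m: m.group(2) + COMBINING[m.group(1)], text)
    return unicodedata.normalize("NFC", text)

print(decode_latex_accents('Gr\\"{o}nqvist, Hans'))
print(decode_latex_accents("Harstad, B\\r{a}rd"))
print(decode_latex_accents("Mourifi\\'{e"))  # truncated entry stays unchanged
```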

## Resolving special characters

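Todo: optimize this code.
As above, add replacements for the unresolvable characters (those printed below because they resolve to an empty string) directly in the generated file, and rename it to spec_char_replacement.json to match the file name read in above.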

In [406]:
spec_char_set={}

spec_chars=[]
def process_string(str_in):
    if str_in.isascii()==True:
        return str_in
    hold=""
    for o in str_in:
        if o.isascii()==False:
            spec_chars.append(o)
            print(str_in)

for i in j_data.index:
    authors=j_data.loc[i,"author"]
    if pd.notna(authors):
        sa=authors.lower()
        process_string(sa)
        
u_spec_chars=list(set(spec_chars))
u_spec_chars.sort()

for o in u_spec_chars:
    spec_char_set[o]=unicodedata.normalize("NFKD", o).encode('ascii', 'ignore').decode('utf-8')
    if len(spec_char_set[o])==0:
        print(o)
        
print("end")
        
with open("spec_char_replacement_"+str(time.time())+".json", "w") as outfile: 
    json.dump(spec_char_set, outfile, indent=4)

ayse i̇mrohoroglu and selahattin i̇mrohoroglu and douglas h. joines
ayse i̇mrohoroglu and selahattin i̇mrohoroglu and douglas h. joines
luisa fuster and ayse i̇mrohoroglu and selahattin i̇mrohoroglu
luisa fuster and ayse i̇mrohoroglu and selahattin i̇mrohoroglu
mariagiovanna baccara and ayse i̇mrohoroglu and alistair j. wilson and leeat yariv
kaiji chen and ayse i̇mrohoroglu and selahattin i̇mrohoroglu
kaiji chen and ayse i̇mrohoroglu and selahattin i̇mrohoroglu
ayse i̇mrohoroglu and selahattin i̇mrohoroglu and douglas h. joines
ayse i̇mrohoroglu and selahattin i̇mrohoroglu and douglas h. joines
̇
end
