In [1]:
import pandas as pd
import json
import numpy as np
import glob
import functools
import re
import regex

In [2]:
# read in the data, keeping only the columns of interest
columns = ["Corresponding author Addresses", "Author Full Names"]
data = pd.read_csv("Plt_sci_publications_updated_3.22.csv", header=0, sep = ",", low_memory=False, usecols = columns)

In [3]:
# 22 entries do not have corresp. author addresses, so drop them
data = data.dropna(subset=['Corresponding author Addresses'])

In [4]:
data = data.reset_index(drop = True)

## Extract corresponding author last name and use it to pull associated first name from list of coauthors 
This approach doesn't handle papers where two authors share a last name, but it matches well in other instances.

Results from running this:

~15k have duplicate last names.

~3k don't retrieve any name because of a spelling error.

~56k entries have only the first initial of the first name.
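
A minimal sketch of the matching step on a single record, assuming (as in the example address printed in the next cell) that the address begins with "Lastname, Initials". The function name `find_corresponding_author` is hypothetical, and the pattern is slightly stricter than the one used in this notebook: it only accepts the last name at the start of a co-author entry. The two author lists are taken from rows shown in the outputs further down (the first one shortened).

```python
import re

def find_corresponding_author(address, author_full_names):
    """Return every co-author entry whose last name matches the last name
    at the start of the corresponding-author address."""
    last_name = address.split(", ")[0]
    # re.escape keeps characters such as "." or "(" in a last name
    # from being read as regex syntax
    pattern = re.compile(r"(?:^|; )(" + re.escape(last_name) + r",[^;]*)")
    return pattern.findall(author_full_names)

# unique last name: one match
print(find_corresponding_author(
    "Abad, JP; Marin, I (corresponding author), Univ Autonoma Madrid",
    "Raho, Nicolas; Fraga, Santiago; Abad, Jose P."))

# shared last name: both co-authors are returned
print(find_corresponding_author(
    "Zhang, LB (corresponding author), Chinese Acad",
    "Zhang, Li-Bing; Zhang, Liang"))
```

When two co-authors share the last name, both come back, which is the duplicate case counted above.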

In [6]:
data.loc[0]['Corresponding author Addresses']

'Abad, JP; Marin, I (corresponding author), Univ Autonoma Madrid, Dept Biol Mol, E-28049 Madrid, Spain.'

In [6]:
# get the last name of the corresponding author from the address line: split the text on ', ' and keep the first piece
# each entry in the new column has exactly one author (the first one listed in the address)
data['Corresponding_author_last_name'] = data['Corresponding author Addresses'].str.split(', ').str[0]

In [7]:
data

Unnamed: 0,Author Full Names,Corresponding author Addresses,Corresponding_author_last_name
0,"Raho, Nicolas; Fraga, Santiago; Abad, Jose P.;...","Abad, JP; Marin, I (corresponding author), Uni...",Abad
1,"Abdel-Hafez, Sobhy I. I.; Abo-Elyousr, Kamal A...","Abdel-Rahim, IR (corresponding author), Assiut...",Abdel-Rahim
2,"Abdelahad, Nadia; Bolpagni, Rossano; Jona Lasi...","Abdelahad, N (corresponding author), Sapienza ...",Abdelahad
3,"Aboal, Marina; Chapuis, Iara; Paiano, Monica O...","Aboal, M (corresponding author), Fac Biol, Dep...",Aboal
4,"Aboal, Marina; Eugenia Garcia-Fernandez, Maria...","Aboal, M (corresponding author), Univ Murcia, ...",Aboal
...,...,...,...
292493,"Brennan, EB; Smith, RF","Brennan, EB (corresponding author), USDA ARS, ...",Brennan
292494,"Frihauf, John C.; Stahlman, Phillip W.; Geier,...","Frihauf, JC (corresponding author), Kansas Sta...",Frihauf
292495,"Wicks, GA; Nordquist, PT; Baenziger, PS; Klein...","Wicks, GA (corresponding author), Dept Agron &...",Wicks
292496,"Nelson, Kelly A.; Johnson, William G.; Wait, J...","Nelson, KA (corresponding author), Univ Missou...",Nelson


In [8]:
# replace missing author lists with the string "NaN" so the regex search below does not fail
data["Author Full Names"] = data["Author Full Names"].fillna("NaN")

In [9]:
# Find the full name of each corresponding author and write it to a new column, "first and last names"
# the regex captures everything from the last name up to (but not including) the next semicolon
# re.escape keeps special characters in a last name from being interpreted as regex syntax
data["first and last names"] = [re.findall(re.escape(last_name) + "[^;]*", full_names) for last_name, full_names in zip(data['Corresponding_author_last_name'], data['Author Full Names'])]

In [10]:
data

Unnamed: 0,Author Full Names,Corresponding author Addresses,Corresponding_author_last_name,first and last names
0,"Raho, Nicolas; Fraga, Santiago; Abad, Jose P.;...","Abad, JP; Marin, I (corresponding author), Uni...",Abad,"[Abad, Jose P.]"
1,"Abdel-Hafez, Sobhy I. I.; Abo-Elyousr, Kamal A...","Abdel-Rahim, IR (corresponding author), Assiut...",Abdel-Rahim,"[Abdel-Rahim, Ismail R.]"
2,"Abdelahad, Nadia; Bolpagni, Rossano; Jona Lasi...","Abdelahad, N (corresponding author), Sapienza ...",Abdelahad,"[Abdelahad, Nadia]"
3,"Aboal, Marina; Chapuis, Iara; Paiano, Monica O...","Aboal, M (corresponding author), Fac Biol, Dep...",Aboal,"[Aboal, Marina]"
4,"Aboal, Marina; Eugenia Garcia-Fernandez, Maria...","Aboal, M (corresponding author), Univ Murcia, ...",Aboal,"[Aboal, Marina]"
...,...,...,...,...
292493,"Brennan, EB; Smith, RF","Brennan, EB (corresponding author), USDA ARS, ...",Brennan,"[Brennan, EB]"
292494,"Frihauf, John C.; Stahlman, Phillip W.; Geier,...","Frihauf, JC (corresponding author), Kansas Sta...",Frihauf,"[Frihauf, John C.]"
292495,"Wicks, GA; Nordquist, PT; Baenziger, PS; Klein...","Wicks, GA (corresponding author), Dept Agron &...",Wicks,"[Wicks, GA]"
292496,"Nelson, Kelly A.; Johnson, William G.; Wait, J...","Nelson, KA (corresponding author), Univ Missou...",Nelson,"[Nelson, Kelly A.]"


## We have first and last names for most authors now. We work on the special cases below.

##### Johannes van Staden

Appears as "Van Staden, J.", "Van Staden, Johannes" and "van Staden, J", so we should consolidate these names. No other author in the top 100 appears to have this issue.

In [11]:
# possible approach for reducing the number of first-initial-only names
# out of curiosity, what are the most common names? who has published the most?
# this counts only the first matched name in each entry's name list
auth_frequency = data["first and last names"].str[0].value_counts()

In [12]:
auth_frequency

Van Staden, J.          144
Van Staden, Johannes    110
van Staden, J           107
Li                       98
Lin                      88
                       ... 
Guerrero, B.              1
Gueant, JL                1
Gu, Chunsun               1
Gronwald, J. W.           1
Brennan, EB               1
Name: first and last names, Length: 130283, dtype: int64

In [13]:
JVS = data[data['Corresponding_author_last_name'].str.contains('Van Staden', case = False)].index

In [14]:
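for i in JVS:
    # .at writes the list into the cell itself; chained indexing (data.loc[i][...]) may assign to a copy
    data.at[i, 'first and last names'] = ['Van Staden, Johannes']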

#####  Attila Molnar V

The first name is Attila. We caught this because the one-letter last name in the corresponding author address gave an error.

In [16]:
data.at[286932, 'first and last names'] = ['Molnar V, Attila']

#### Special case 0. Multiple corresponding authors
Nothing is done here beyond finding these entries. It's up to you whether to drop them or include both authors (either would not be too hard).

In [71]:
mult_first_authors = []

for i in data.index:
    if ';' in data['Corresponding author Addresses'][i]:
        mult_first_authors.append(i)
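
If both authors are to be kept, one possible sketch is to split the part of the address that precedes the "(corresponding author)" marker. The helper name `split_corresponding_authors` is hypothetical, and the sketch assumes the marker appears once, directly after the list of names, as in the example address printed earlier.

```python
def split_corresponding_authors(address):
    """Return the author names listed before '(corresponding author)'."""
    head, marker, _ = address.partition(" (corresponding author)")
    if not marker:
        # marker missing: return nothing and leave the entry for a manual check
        return []
    return [name.strip() for name in head.split(";") if name.strip()]

example = ("Abad, JP; Marin, I (corresponding author), Univ Autonoma Madrid, "
           "Dept Biol Mol, E-28049 Madrid, Spain.")
print(split_corresponding_authors(example))
```

Each returned name could then be matched against the co-author list in the same way as a single corresponding author.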

#### Special case 1. First initial only (MANY instances)
Note: look to see which journals and countries the all_initials people come from.

In [17]:
# gather all entries in which every author name has the form "Lastname, AB", "Lastname, A." or "Lastname, A. B."
all_initials = []
# dots are escaped so that two-letter given names such as "Li" are not mistaken for "L."
initials_only = re.compile(r', (?:[A-Z]+|[A-Z]\.|[A-Z]\. [A-Z]\.)$')

for i in data.index:
    author_list = data.loc[i]['Author Full Names'].split('; ')
    if all(initials_only.search(x) for x in author_list):
        all_initials.append(i)

# remove entries with names found above from the dataset
# (copy so that later assignments do not act on a view of data)
no_initials = data[~data.index.isin(all_initials)].copy()

#### Special case 2: Names do not have a comma separating first and last name

Basically all entries are researchers at Chinese or Vietnamese institutions. Chinese names are in the form "Familyname givenname" and Vietnamese names vary in form.

Need to re-extract full names because the code to extract names above was dependent on commas

In [21]:
# find all authors with no comma in name
name_order = []

for i in no_initials.index:
    if ',' not in no_initials.loc[i]['Author Full Names']:
        name_order.append(i)
        
name_order_df = no_initials.loc[name_order]

In [23]:
# Name order seems to vary, go through these manually, only 81 entries
Vietnamese_authors = []

# larger list, automated process to go through these is down below
Chinese_authors = []

# Very small list, go through these manually
Other_authors = []

for i in name_order_df.index:
    if 'Vietnam' in name_order_df.loc[i]['Corresponding author Addresses']:
        Vietnamese_authors.append(i)
    elif 'China' in name_order_df.loc[i]['Corresponding author Addresses']:
        Chinese_authors.append(i)
    else:
        Other_authors.append(i)

In [24]:
Chinese_df = no_initials.loc[Chinese_authors].copy()

In [25]:
# find last name and first initial of corresponding author
Chinese_df['Corresponding_author_last_name_init'] = Chinese_df['Corresponding author Addresses'].str.findall(r'^[^,]*,..').str[0]

In [26]:
# list all authors
Chinese_df['all_names'] = Chinese_df['Author Full Names'].str.split('; ')

In [27]:
# find corresponding author from list of all authors
Chinese_df["first and last names"]=[re.findall(Chinese_df.loc[i]['Corresponding_author_last_name']+" "+Chinese_df.loc[i]['Corresponding_author_last_name_init'][-1]+"[a-zA-Z ]*", '; '.join(Chinese_df.loc[i]['all_names'])) for i in Chinese_df.index]

In [28]:
# the cell above returns some entries with more than one matching author
# manually review these
# these are names from authors in China that didn't have commas that do not have a unique surname
to_review = [i for i in Chinese_df.index if len(Chinese_df.loc[i]['first and last names'])>1]

In [29]:
# remove the special cases from the cell above
Chinese_df = Chinese_df[~Chinese_df.index.isin(to_review)]

In [30]:
# reformat names to be "family name, given name"
comma_names = []

for i in list(Chinese_df['first and last names']):
    # split only on the first space so that multi-word given names get a single comma
    comma_names.append([", ".join(item.split(" ", 1)) for item in i])
    
Chinese_df['first and last names'] = comma_names

In [32]:
# replace names in larger dataframe with names that have comma in them
for i in Chinese_df.index:
    no_initials.at[i, 'first and last names'] = Chinese_df.at[i, 'first and last names']

#### Special case 3. Two or more authors with the same last name (~15k instances)
If a row has more than one name in "first and last names", go to "Corresponding author Addresses" and pull out the last name plus the first initial, then run a pattern match against the candidate names to isolate just the corresponding author and pull their first name.

In [33]:
# create a dataframe of all entries with two or more authors with the same last name

mult_authors = []

for i in no_initials.index:
    if len(no_initials['first and last names'][i])>1:
        mult_authors.append(i)
        
mult_author_df = no_initials.loc[mult_authors].copy()

In [35]:
# get last name and first initial of corresponding author
mult_author_df['Corresponding_author_last_name'] = mult_author_df['Corresponding author Addresses'].str.findall(r'^[^,]*,..').str[0]

In [36]:
# use previous cell to find first and last names of corresponding authors
author_matches = []

for i in mult_author_df.index:
    mylist = mult_author_df.loc[i]['first and last names']
    r = re.compile(str(mult_author_df.loc[i]['Corresponding_author_last_name'])+".*")
    newlist = list(filter(r.match, mylist))
    author_matches.append(newlist)
    
mult_author_df['first and last names'] = author_matches

In [38]:
mult_author_df

Unnamed: 0,Author Full Names,Corresponding author Addresses,Corresponding_author_last_name,first and last names
33,"Ruan, Kun; Duan, Jingbo; Bai, Fangwen; Lemaire...","Bai, LH (corresponding author), Sichuan Univ, ...","Bai, L","[Bai, Linhan]"
77,"Buchheim, Mark A.; Sutherland, Danica M.; Buch...","Buchheim, MA (corresponding author), Univ Tuls...","Buchheim, M","[Buchheim, Mark A.]"
162,"Diaz, Patricio A.; Molinet, Carlos; Seguel, Mi...","Diaz, PA (corresponding author), Univ Los Lago...","Diaz, P","[Diaz, Patricio A.]"
249,"Luo, Zhaohe; Mertens, Kenneth Neil; Nezan, Eli...","Gu, HF (corresponding author), Minist Nat Reso...","Gu, H","[Gu, Haifeng]"
251,"Luo, Zhaohe; Lim, Zhen Fei; Mertens, Kenneth N...","Gu, HF (corresponding author), SOA, Inst Ocean...","Gu, H","[Gu, Haifeng]"
...,...,...,...,...
291863,"Cong, Cong; Wang, Zhaozhen; Li, Rongrong; Li, ...","Wang, JX (corresponding author), Shandong Agr ...","Wang, J","[Wang, Jinxin]"
291923,"Ma, Xiaoyan; Yang, Jinyan; Wu, Hanwen; Jiang, ...","Ma, XY (corresponding author), Chinese Acad Ag...","Ma, X","[Ma, Xiaoyan]"
292097,"Young, Frank L.; Thorne, Mark E.; Young, Dougl...","Young, FL (corresponding author), Washington S...","Young, F","[Young, Frank L.]"
292180,"Harre, Nick T.; Duncan, Garth W.; Young, Julie...","Young, BG (corresponding author), Purdue Univ,...","Young, B","[Young, Bryan G.]"


In [39]:
# create df with all entries without exactly one author name per entry (this happens when multiple authors have 
# same last  name and first initial)
# 1539 such entries

# collect all entries without exactly one name in "first and last names"
problems_mult = []

for i in mult_author_df.index:
    if len(mult_author_df.loc[i]['first and last names']) != 1:
        problems_mult.append(i)

# write the filtered name lists back to the main data set
for i in mult_author_df.index:
    no_initials.at[i, 'first and last names'] = mult_author_df.at[i, 'first and last names']
    
mult_problems_df = mult_author_df.loc[problems_mult].copy()

It is probably impossible to determine who the corresponding author is when there is only one first initial. However, if there are two first initials, we can probably figure it out.

In [None]:
# step 0: isolate names with two first initials

In [58]:
mult_problems_df['first_initials'] = no_initials.loc[problems_mult]['Corresponding author Addresses'].str.split(', ').str[1]

In [65]:
mult_problems_df['first_initials'] = mult_problems_df['first_initials'].str.split().str[0]

In [66]:
mult_problems_df

Unnamed: 0,Author Full Names,Corresponding author Addresses,Corresponding_author_last_name,first and last names,first_initials
2512,"Kim, Myung Sook; Kim, Miryang; Terada, Ryuta; ...","Kim, MS (corresponding author), Pusan Natl Uni...","Kim, M","[Kim, Myung Sook, Kim, Miryang]",MS
3397,"Zhang, Li-Bing; Zhang, Liang","Zhang, LB (corresponding author), Chinese Acad...","Zhang, L","[Zhang, Li-Bing, Zhang, Liang]",LB
3400,"Zhang, Li-Bing; Zhang, Liang","Zhang, LB (corresponding author), Missouri Bot...","Zhang, L","[Zhang, Li-Bing, Zhang, Liang]",LB
4068,"Hanson, Nikki; Ross-Davis, Amy L.; Davis, Anth...","Davis, AS (corresponding author), Oregon State...","Davis, A","[Davis, Amy L., Davis, Anthony S.]",AS
5222,"Lopez-Aranda, Jose M.; Miranda, Luis; Medina, ...","Santos, BM (corresponding author), Univ Florid...","Santos, B","[Santos, Berta, Santos, Biclinski M.]",BM
...,...,...,...,...,...
289765,"Ali, Ahmad; Wang Jin-Da; Pan Yong-Bao; Deng Zu...","Chen, RK; Gao, SJ (corresponding author), Fuji...","Chen, R",[],RK;
289783,"Ibrahim, Aminu Kurawa; Xu, Yi; Niyitanga, Sylv...","Zhang, LW (corresponding author), Fujian Agr &...","Zhang, L","[Zhang, Lilan, Zhang, Liemei, Zhang, Liwu]",LW
291040,"Namdjoyan, Shahram; Namdjoyan, Shahrokh; Kerma...","Namdjoyan, S (corresponding author), Islamic A...","Namdjoyan, S","[Namdjoyan, Shahram, Namdjoyan, Shahrokh]",S
291208,Thanh Son Lo; Hoang Duc Le; Vu Thanh Thanh Ngu...,"Chu, HM (corresponding author), Thai Nguyen Un...","Chu, H",[],HM


In [None]:
# step 1: look for first names with multiple words and match initials

In [None]:
# step 2: look for entries where only one of the given names contain both first initials

In [None]:
# step 3: check the rest of the names manually
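
Steps 1 and 2 could be sketched as follows. This is only an outline under stated assumptions: the helper names `given_name_initials` and `match_by_initials` are hypothetical, candidate names are assumed to be in "Last, Given Names" form, and hyphenated given names are assumed to contribute one initial per part (so "Li-Bing" gives "LB"). The example names come from the table above.

```python
import re

def given_name_initials(full_name):
    """Return the upper-cased initials of the given-name part of 'Last, Given Names'."""
    _, _, given = full_name.partition(", ")
    # split on spaces and hyphens so that "Li-Bing" yields "LB"
    parts = [p for p in re.split(r"[\s\-]+", given) if p]
    return "".join(p[0].upper() for p in parts)

def match_by_initials(candidates, initials):
    """Keep only the candidates whose given-name initials equal `initials`."""
    # some extracted initials carry a trailing ";" from multi-author addresses
    initials = initials.strip(";").upper()
    return [name for name in candidates if given_name_initials(name) == initials]

print(match_by_initials(["Kim, Myung Sook", "Kim, Miryang"], "MS"))
print(match_by_initials(["Davis, Amy L.", "Davis, Anthony S."], "AS"))
print(match_by_initials(["Namdjoyan, Shahram", "Namdjoyan, Shahrokh"], "S"))
```

Entries that still return zero or several candidates, such as the last example, would be left for the manual check in step 3.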

#### Special case 4. last name does not match any author full name

Solution: Allow one insertion and one deletion. This is because many of these names are German names missing an e, or multi-part last names with inconsistent capitalization. This also fixes general typos.
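
For illustration only, the same idea can be tried with the standard library. Note that this swaps in `difflib` similarity scoring in place of the `regex` module's edit-count syntax used in this notebook, so the two will not always agree. The function name `closest_last_name` and the cutoff of 0.8 are assumptions. The example row is taken from the output table further down.

```python
import difflib

def closest_last_name(address_last_name, author_full_names, cutoff=0.8):
    """Return the co-author last name most similar to the address last name,
    or None when nothing is similar enough."""
    # collect the last-name part of every "Last, First" entry
    last_names = [entry.split(",")[0].strip() for entry in author_full_names.split(";")]
    matches = difflib.get_close_matches(address_last_name, last_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(closest_last_name("Kruger", "Krueger, Thomas; Oelmueller, Ralf; Luckas, Bernd"))
```

A similarity cutoff is looser to reason about than an explicit edit budget, so any matches found this way would still deserve a spot-check.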

In [36]:
# get df with all entries that did not find any author matching corresp. author last name
no_authors = []

for i in no_initials.index:
    if len(no_initials['first and last names'][i]) == 0:
        no_authors.append(i)
        
no_authors_df = no_initials.loc[no_authors].copy()

In [38]:
# the new last names are pulled from the list of all authors based on some regex replacement
new_last_names = []

for i in list(no_authors_df.index):
    old_last_name = no_authors_df.loc[i]['Corresponding_author_last_name']
    author_full_name = no_authors_df.loc[i]['Author Full Names']
    # if the old last name is within one insertion and one deletion of something in the author full names, then consider it a match
    new_last_name = regex.findall("("+old_last_name+"){i<=1,d<=1}", author_full_name, overlapped=True)
    # if there are still no matches, ignore for now
    if len(new_last_name) == 0:
        pass
    
    else:
        new_last_names.append([i,new_last_name[0]])

In [39]:
# replace last names in old dataframe with new last names (as spelled in the author list)
for idx, new_name in new_last_names:
    no_authors_df.at[idx, 'Corresponding_author_last_name'] = new_name

In [40]:
# look again to match the updated last names with the full author list
no_authors_df["first and last names"] = [re.findall(re.escape(last_name) + "[^;]*", full_names) for last_name, full_names in zip(no_authors_df['Corresponding_author_last_name'], no_authors_df['Author Full Names'])]

In [41]:
no_authors_df

Unnamed: 0,Author Full Names,Corresponding author Addresses,Corresponding_author_last_name,first and last names
200,"Floethe, Carla R.; Molis, Markus; Kruse, Inken...","Flothe, CR (corresponding author), Alfred Wege...",Floethe,"[Floethe, Carla R.]"
362,"Parys, Sabine; Kehraus, Stefan; Pete, Romain; ...","Konig, GM (corresponding author), Inst Pharmac...",Koenig,"[Koenig, Gabriele M.]"
376,"Krueger, Thomas; Oelmueller, Ralf; Luckas, Bernd","Kruger, T (corresponding author), Univ Jena, I...",Krueger,"[Krueger, Thomas]"
528,"Ni-Ni-Win; Hanyuda, Takeaki; Draisma, Stefano ...","Ni-Ni-Win (corresponding author), Kobe Univ, G...",Ni-Ni-Win (corresponding author),[]
603,"Demchenko, Eduard; Mikhailyuk, Tatiana; Colema...","Proschold, T (corresponding author), Brown Uni...",Proeschold,"[Proeschold, Thomas]"
...,...,...,...,...
290882,"Oeztuerk, Zahide Neslihan; Greiner, Steffen; R...","Ozturk, ZN (corresponding author), Heidelberg ...",ztuerk,"[ztuerk, Zahide Neslihan]"
290935,"SevgI, Tuba; DemIrkan, Elif","Sevgi, T (corresponding author), Bursa Uludag ...",SevgI,"[SevgI, Tuba]"
291047,"Trimanto; Hapsari, Lia; Budiharta, Sugeng","Trimanto (corresponding author), Indonesian In...",Trimanto (corresponding author),[]
291208,Thanh Son Lo; Hoang Duc Le; Vu Thanh Thanh Ngu...,"Chu, HM (corresponding author), Thai Nguyen Un...",Chu,"[ Chu, Chu]"


In [42]:
# Now look to see what's remaining after fuzzy matching
no_authors2 = []

for i in no_authors_df.index:
    if len(no_authors_df['first and last names'][i])<1:
        no_authors2.append(i)

In [43]:
no_authors_df2 = no_authors_df.loc[no_authors2]

In [44]:
no_authors_df2

Unnamed: 0,Author Full Names,Corresponding author Addresses,Corresponding_author_last_name,first and last names
528,"Ni-Ni-Win; Hanyuda, Takeaki; Draisma, Stefano ...","Ni-Ni-Win (corresponding author), Kobe Univ, G...",Ni-Ni-Win (corresponding author),[]
15094,"Moothoo-Padayachie, Anushka; Varghese, Boby; P...","Sershen (corresponding author), Univ KwaZulu N...",Sershen (corresponding author),[]
16553,"Nie; Zhao, Z. P.; Chen, G. P.; Zhang, B.; Ye, ...","Nie (corresponding author), Chongqing Univ, Bi...",Nie (corresponding author),[]
18015,Ikram-Ul-Haq,"Ikram-Ul-Haq (corresponding author), Natl Inst...",Ikram-Ul-Haq (corresponding author),[]
32514,"Govindjee; Khanna, Rita; Zilinskas, Barbara","Govindjee (corresponding author), Univ Illinoi...",Govindjee (corresponding author),[]
...,...,...,...,...
288417,"Ardiansyah; Nada, Annisa; Rahmawati, Nuraini T...","Ardiansyah (corresponding author), Univ Bakrie...",Ardiansyah (corresponding author),[]
288980,"Rudiyansyah; Masriani; Mudianta, I. Wayan; Gar...","Rudiyansyah (corresponding author), Univ Tanju...",Rudiyansyah (corresponding author),[]
289112,"Fatema-Tuz-Zohora; Muhit, Md. Abdul; Hasan, Ch...","Fatema-Tuz-Zohora (corresponding author), Univ...",Fatema-Tuz-Zohora (corresponding author),[]
290351,"JamjanMeeboon; Takamatsu, Susumu","JamjanMeeboon (corresponding author), Mie Univ...",JamjanMeeboon (corresponding author),[]


In [45]:
# the remaining 259 entries are either authors with only one name or institution names (e.g. a museum)
# Single names are more common in the global south, so worth going through manually
no_authors_df2['Corresponding_author_last_name'].value_counts()

Amanullah (corresponding author)          19
Govindjee (corresponding author)          15
Sodmergen (corresponding author)          13
Sershen (corresponding author)            10
Inderjit (corresponding author)            9
                                          ..
Gayacharan (corresponding author)          1
Veereshkumar (corresponding author)        1
Lee Kong Chian Nat Hist Museum             1
Thirthamallappa (corresponding author)     1
Trimanto (corresponding author)            1
Name: Corresponding_author_last_name, Length: 156, dtype: int64

## Misc code below
Some random data exploration: looking at the most common first names, adding names to the original dataframe, and writing to csv for GenderAPI.

In [None]:
# make an object with just the names: take the first matched name in each list, then split it at the first comma
names = data["first and last names"].str[0].str.split(', ', n=1, expand=True)


In [None]:
#Rename column. 
names = names.rename(columns={names.columns[0]: 'Last'})
names = names.rename(columns={names.columns[1]: 'First'})
names


In [None]:
name_freq=names['First'].value_counts()
name_freq

In [None]:
# add the new columns to the original df
data = pd.concat([data, names], axis=1)

In [None]:
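# write data to csv. this is the file that will be submitted to GenderAPI and NamSor. just save the nationality, first, and last name columns
# note: "Corresponding author Nation" is not among the columns read in at the top, so it must be added to `columns` there or this raises a KeyError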
data.to_csv("~/Desktop/Postdoc_work/Projects/Participation_in_PlantSci/names_gender/Corresponding_author_names1.csv", columns=["first and last names", 'Corresponding_author_last_name', "Author Full Names", "Corresponding author Nation"])

# General problems

##### Things to spot-check

In [None]:
data.loc[to_review]
data.loc[Vietnamese_authors]
data.loc[Other_authors]
no_authors_df2
mult_problems_df # check only after coding and running steps 1-3

##### Md honorific

Some South Asian authors have "Md." or "Md" as their first name, which is an honorific that signifies "Mohammed". GenderAPI does not recognize this as a masculine name. One solution is to replace the strings "Md." or "Md " with "Mohammed", noting that this is just for GenderAPI and that these authors do not actually use the name Mohammed. Another solution would be to train GenderAPI somehow so that it knows "Md." is masculine. A third is to go in and manually assign these names a masculinity score of 1.0.
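
The first option could look roughly like the sketch below. The names `MD_PATTERN` and `expand_md` are hypothetical, and the sketch assumes "Md" or "Md." appears as a separate token at the start of the given-name string, as in "Md. Abdul" from one of the author lists above.

```python
import re

# "Md" or "Md." at the start of the string, followed by whitespace or the end
MD_PATTERN = re.compile(r"^Md\.?(?=\s|$)")

def expand_md(first_name):
    """Replace a leading 'Md' / 'Md.' token with 'Mohammed' (for GenderAPI only)."""
    return MD_PATTERN.sub("Mohammed", first_name)

print(expand_md("Md. Abdul"))
print(expand_md("Md Abdul"))
```

Because the pattern requires whitespace or the end of the string after the token, given names that merely begin with the letters "Md" are left untouched. The expanded form should be kept in a separate column so the original names are preserved.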

##### Subproblem: Maiden names

I think the code is ok with parentheses now, but we should double-check that authors with maiden names are having their first names recognized.

##### Ignore this

In [86]:
testlist = [1,2,3,4]

In [87]:
for i in testlist:
    print(i)
    testlist.remove(i)

1
3


In [88]:
testlist

[2, 4]