# Install libraries

Make sure you [install `rapidfuzz`](https://github.com/rapidfuzz/RapidFuzz?tab=readme-ov-file) for fuzzy matching. Click the link for documentation.

If you have issues with it, you can always go with [thefuzz](https://github.com/seatgeek/thefuzz). In that case, do `from thefuzz import fuzz, process` instead.

In [5]:
%pip install rapidfuzz
%pip install openpyxl
# Standard Python
import os
import unicodedata
from glob import glob
from importlib import reload
import datetime

# Canon libraries
import pandas as pd
import numpy as np

# You'll need to install this one
from rapidfuzz import fuzz, process

# The utils file to keep the notebook short
import utils

Note: you may need to restart the kernel to use updated packages.
Collecting openpyxl
  Downloading openpyxl-3.1.5-py2.py3-none-any.whl.metadata (2.5 kB)
Collecting et-xmlfile (from openpyxl)
  Downloading et_xmlfile-2.0.0-py3-none-any.whl.metadata (2.7 kB)
Downloading openpyxl-3.1.5-py2.py3-none-any.whl (250 kB)
Downloading et_xmlfile-2.0.0-py3-none-any.whl (18 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-2.0.0 openpyxl-3.1.5

- The file `pnas.2217564120.sd01.xlsx` has a list of plant-specific academic journals. The base of the list comes [from here](https://www.pnas.org/doi/10.1073/pnas.2217564120). Then, based on the MU data, I added missing journals whose titles contain the words
```
['Plant', 'Botan', 'Phyto', 'Hort', 'Crop']
```

- I manually excluded the Crop-related journals whose scope also included topics closer to economics or robotics engineering
- The `MeSH_terms` file contains keywords that are mostly unique to plant biology research. 
- The terms are lowercased to have better chances of matching.

In [6]:
src = '..' + os.sep + 'raw' + os.sep
journals = pd.read_excel(src + 'pnas.2217564120.sd01.xlsx')
plantssns = pd.unique( journals.loc[:, ['ISSN','eISSN'] ].values.ravel() )
plantssns = plantssns[~pd.isna(plantssns)]
meshterms = np.char.lower(np.loadtxt(src + 'MeSH_terms.txt', dtype=str, delimiter=','))
meshterms = np.array([' ' + x for x in meshterms])



## Loading and preparing the data

- Load all the papers published since 2015 with at least one author affiliated with MU.
- Data obtained from [dimensions.ai](https://www.dimensions.ai/)
- In reality, `MU_Pubs_2026.xlsx` contains all the papers published under the *University of Missouri System*.
- To make name comparisons and diagnostics easier down the road, all the author names will be converted to `ascii` (standard English-language keyboard).

```python
# Example
i = 321
text = df.iloc[i]['Authors']
print(text)
text = unicodedata.normalize('NFD', text)
text = text.encode('ascii', 'ignore').decode("utf-8")
print('\n',text,sep='')
```
The commands above rewrite the names
```
Sanz, Amparo; Pike, Sharon; Khan, Mather A; Carrió-Seguí, Àngela; Mendoza-Cózatl, David G; Peñarrubia, Lola; Gassmann, Walter
```
as:
```
Sanz, Amparo; Pike, Sharon; Khan, Mather A; Carrio-Segui, Angela; Mendoza-Cozatl, David G; Penarrubia, Lola; Gassmann, Walter
```
- Notice that all the accents and non-English characters have been replaced.
- Also, part of the data preparation removes any `.` or `-` characters: `David M. Braun ---> David M Braun`
- The next cell saves a copy of the processed, compiled file of 42K published papers
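The two clean-up steps described above (ASCII conversion, then dropping `.` and `-`) can be sketched as one small helper. The name `clean_name` is hypothetical and used only for illustration; the real logic lives in `utils.prepare_dimensions_data`, which is not shown here and may differ.

```python
import unicodedata

def clean_name(text):
    """Hypothetical helper: strip accents, then drop '.' and '-' characters."""
    # NFD splits accented letters into a base letter plus combining marks
    decomposed = unicodedata.normalize('NFD', text)
    # Encoding to ASCII with 'ignore' discards the combining marks
    ascii_text = decomposed.encode('ascii', 'ignore').decode('ascii')
    # Remove periods and hyphens, as described in the notes above
    return ascii_text.replace('.', '').replace('-', '')

print(clean_name('Mendoza-Cózatl, David G.'))
print(clean_name('David M. Braun'))
```

One caveat of this approach: letters that have no Unicode decomposition (for example `ø`) are dropped entirely rather than transliterated, so a few names may lose characters.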

In [7]:
# Of the 54 fields available from Dimensions, these 29 might be relevant at some point
columns_to_keep = [
    'Publication ID', 'Title', 'Abstract', 'Source title', 'ISSN', 'Publisher', 'MeSH terms', 'PubYear',
    'Open Access', 'Publication Type', 'Document Type', 'Authors', 'Authors (Raw Affiliation)', 'Corresponding Authors',
    'Research Organizations - standardized', 'GRID IDs', 'City of standardized research organization',
    'State of standardized research organization', 'Country of standardized research organization', 'Funder',
    'Funder Group', 'Funder Country', 'Times cited', 'RCR', 'FCR', 'Altmetric', 'Fields of Research (ANZSRC 2020)',
    'Units of Assessment', 'Sustainable Development Goals'
]
institute = 'KState'
places = ['manhattan', 'kansas', '[see note]']
filename = src + institute+'_Pubs.csv'

if not os.path.isfile(filename):
    filenames = sorted(glob(src + institute + '*.xlsx'))
    df = utils.prepare_dimensions_data(filenames, columns_to_keep)
    df.to_csv(filename, index=False)
else:
    print('Found', filename, 'already computed')
    df = pd.read_csv(filename)
print('Loaded', len(df), 'publications')

../raw/KState2015.xlsx
../raw/KState2016-7.xlsx
../raw/KState2018-9.xlsx
../raw/KState2020.xlsx
../raw/KState2021.xlsx
../raw/KState2022-3.xlsx
../raw/KState2024.xlsx
../raw/KState2025-6.xlsx
Loaded 23580 publications


- Change the row indices to the Dimensions ID. This will help us discard duplicates later
- The `Raw Affiliation` column lists the actual author address as scraped by Dimensions
- We will only keep the papers whose raw address mentions the university name, city, or state
    - The places are lowercase to account for papers whose addresses are all uppercase
    - In MU's case, the university name and state are the same
- Also consider those papers with `Raw Affiliation` simply `[see note]`: this happens in papers where there are more than 100 authors (not uncommon in astrophysics or database-related publications)
- Example of a paper where [authors are listed as MU](https://app.dimensions.ai/details/publication/pub.1173010241) but the original paper shows this is not the case.
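The place filter described above boils down to a case-insensitive substring test. A minimal sketch follows; `mentions_place` is a hypothetical helper name, and treating missing affiliations as "no match" is an assumption rather than something the notebook states.

```python
def mentions_place(raw_affiliation, places):
    """Return True if any place name occurs in the lowercased raw affiliation."""
    # Missing affiliations (NaN in pandas) are not strings; treat them as no match
    if not isinstance(raw_affiliation, str):
        return False
    lowered = raw_affiliation.lower()
    return any(place in lowered for place in places)

places = ['manhattan', 'kansas', '[see note]']
print(mentions_place('KANSAS STATE UNIVERSITY, MANHATTAN, KS', places))
print(mentions_place('[see note]', places))
print(mentions_place('University of Girona, Spain', places))
```

With a DataFrame, the same test could be applied in one step, e.g. `df[df['Authors (Raw Affiliation)'].apply(lambda s: mentions_place(s, places))]`, instead of looping over the index.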

In [8]:
df = df.set_index(columns_to_keep[0])
isplace = np.zeros(len(df), dtype=bool)
for i,idx in enumerate(df.index):
    isplace[i] = any([place in df.loc[idx, 'Authors (Raw Affiliation)'].lower() for place in places])
df = df.iloc[isplace]
print('Only kept', len(df), 'publications')
df.head()

Only kept 23544 publications


Publication ID,Title,Abstract,Source title,ISSN,Publisher,MeSH terms,PubYear,Open Access,Publication Type,Document Type,...,Funder,Funder Group,Funder Country,Times cited,RCR,FCR,Altmetric,Fields of Research (ANZSRC 2020),Units of Assessment,Sustainable Development Goals
pub.1055116673,New Insights into Peptide–Silver Nanoparticle ...,We studied the interaction of four new pentape...,Langmuir,"0743-7463, 1520-5827",American Chemical Society (ACS),cysteine; lysine; metal nanoparticles; peptide...,2015,Closed,Article,Research Article,...,Swedish Research Council; Office of the Direct...,US Federal Funders; NSF - National Science Fou...,Sweden; United States; Canada; United States; ...,55,1.74,5.3,,32 Biomedical and Clinical Sciences; 3206 Medi...,B12 Engineering,
pub.1039661938,Treatment of the Bleaching Effluent from Sulfi...,Pulp and paper waste water is one of the major...,Membranes,"2077-0375, 2077-0375",MDPI,,2015,All OA; Gold,Article,Research Article,...,Deutsche Bundesstiftung Umwelt,,Germany,48,0.63,3.75,,40 Engineering; 4004 Chemical Engineering; 401...,B12 Engineering,12 Responsible Consumption and Production
pub.1032669301,Development of a sheep challenge model for Rif...,Rift Valley fever (RVF) is a zoonotic disease ...,Virology,"0042-6822, 1096-0341",Elsevier,"animals; antibodies, viral; disease models, an...",2015,Closed,Article,Research Article,...,United States Department of Homeland Security;...,US Federal Funders; USDA - United States Depar...,United States; United States,38,1.97,7.85,4.0,"30 Agricultural, Veterinary and Food Sciences;...","A01 Clinical Medicine; A06 Agriculture, Veteri...",3 Good Health and Well Being
pub.1055139813,Network from Dihydrocoumarin via Solvent-Free ...,The main challenge in converting polymerized e...,ACS Sustainable Chemistry & Engineering,2168-0485,American Chemical Society (ACS),,2015,Closed,Article,Research Article,...,National Institute of Food and Agriculture,US Federal Funders; USDA - United States Depar...,United States,29,,2.89,,34 Chemical Sciences; 3403 Macromolecular and ...,B12 Engineering,
pub.1055083730,Epitaxy of Boron Phosphide on Aluminum Nitride...,The boron phosphide (BP) semiconductor has man...,Crystal Growth & Design,"1528-7483, 1528-7505",American Chemical Society (ACS),,2015,Closed,Article,Research Article,...,Brookhaven National Laboratory; Office of Scie...,DoE - United States Department of Energy; US F...,United States; United States; United States; U...,74,,7.2,1.0,40 Engineering; 4016 Materials Engineering,B12 Engineering,


## Criteria to determine if a publication is plant-specific

A paper is considered plant-specific if it meets at least one of three criteria.

### 1. Published in a plant-specific journal

- Subset the papers that were published in plant-specific journals.
- Journals are matched by their unique numerical identifier (ISSN) instead of their names to avoid spelling inconsistencies.
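As an illustration of ISSN-based matching, the sketch below assumes the Dimensions `ISSN` field holds one or more comma-separated identifiers (as in `0743-7463, 1520-5827`). The helper `in_plant_journal` is hypothetical, and `utils.mask_plant_journals` may be implemented differently. The plant ISSN used here is a made-up placeholder.

```python
def in_plant_journal(issn_field, plant_issns):
    """Return True if any ISSN in a comma-separated field is in plant_issns."""
    # Papers without an ISSN (NaN) cannot match
    if not isinstance(issn_field, str):
        return False
    issns = [issn.strip() for issn in issn_field.split(',')]
    return any(issn in plant_issns for issn in issns)

# Placeholder identifier, not a real journal
plant_issns = {'1111-1111'}

print(in_plant_journal('1111-1111, 9999-9999', plant_issns))
print(in_plant_journal('0743-7463, 1520-5827', plant_issns))
```

Storing the identifiers in a `set` keeps each lookup cheap, which matters when the test runs once per publication.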

In [9]:
reload(utils)
isplantjournal = utils.mask_plant_journals(df, plantssns)
print('Plant-specific journal publications:\t',np.sum(isplantjournal))

Plant-specific journal publications:	 1188


### 2. Categorized as *Agriculture*, *Plant Biology*, *Soil*, or *Horticulture* according to the ANZSRC Fields of Research

- There are other related fields of research (e.g. Environmental Biotechnology), but the ones listed in the cell below are pretty much exclusive to plant research
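A minimal sketch of this criterion, assuming it is a plain substring test against the `Fields of Research (ANZSRC 2020)` string. `has_anzsrc_field` is a hypothetical name; the actual `utils.mask_plant_anzsrc` is not shown and may work differently.

```python
def has_anzsrc_field(fields, targets):
    """Return True if any target category appears in the ANZSRC string."""
    # Papers with no category information (NaN) cannot match
    if not isinstance(fields, str):
        return False
    return any(target in fields for target in targets)

ANZSRC = ['3108 Plant Biology', '3008 Horticultural', '4106 Soil']

print(has_anzsrc_field('31 Biological Sciences; 3103 Ecology; 3108 Plant Biology', ANZSRC))
print(has_anzsrc_field('40 Engineering; 4016 Materials Engineering', ANZSRC))
```

Including the four-digit code in each target keeps the match from firing on unrelated categories that merely share a word.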

In [10]:
ANZSRC = ['3108 Plant Biology','3008 Horticultural', '4106 Soil']
isplantanz = utils.mask_plant_anzsrc(df, ANZSRC)
print('Categorized as plant-specific:\t',np.sum(isplantanz))

Categorized as plant-specific:	 1028


### 3. It has at least 3 plant-related keywords

- The `MeSH_terms` file has a list of keywords that I deemed specific to plant biology and **not** to biology in general
- We require at least 3 terms to make sure the paper is truly focused on plants
- e.g. While *cellulose* or *lignin* are very much plant-exclusive terms, you have papers that discuss them in the context of material science
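The keyword count can be sketched as follows. This assumes each term carries a leading space (mirroring the `' ' + x` step when the terms are loaded) so that it only matches at the start of a word; that reading of the leading space is a guess. The helper `count_plant_terms` and the short term list are hypothetical, and `utils.count_plant_mesh` may differ.

```python
def count_plant_terms(mesh_string, terms):
    """Count how many space-prefixed terms occur in a '; '-separated MeSH string."""
    # Papers without MeSH terms (NaN) have zero matches
    if not isinstance(mesh_string, str):
        return 0
    # Pad the front so the first keyword is also preceded by a space
    padded = ' ' + mesh_string.lower()
    return sum(term in padded for term in terms)

# Hypothetical subset of terms, each with the leading space
terms = [' plant roots', ' abscisic acid', ' soil', ' lignin']
mesh = 'plant roots; water; abscisic acid; plant growth regulators; droughts; cell wall; soil; dehydration'

n = count_plant_terms(mesh, terms)
print(n, n >= 3)
```

A paper passes this criterion when the count reaches the chosen minimum (3 in this notebook).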

In [11]:
isplantmesh = utils.count_plant_mesh(df, meshterms)

---
## Update

We have decided that we are only going to use the first and third criteria.

---

## Getting plant-specific corresponding authors

- Take the union of the retained criteria (the first and third, per the update above) to determine the subset of plant-specific publications
- Discard those with no corresponding authors

In [12]:
min_kws = 3
#isplant = isplantanz + isplantjournal + (isplantmesh >= min_kws)
# For boolean masks, `+` acts as an element-wise logical OR (union of the criteria)
isplant = isplantjournal + (isplantmesh >= min_kws)

data = df.iloc[isplant][['Title', 'Source title', 'Authors', 'Corresponding Authors']]
data = data[~pd.isna(data['Corresponding Authors'])]
print(data.shape)
data.head()

(1276, 4)


Publication ID,Title,Source title,Authors,Corresponding Authors
pub.1038474155,Wheat streak mosaic virus resistance in eight ...,Plant Breeding,"Zhang, Xinzhong; Bai, Guihua; Xu, Rugen; Zhang...","Zhang, Guorong (Kansas State University)"
pub.1046982823,Wheat leaf lipids during heat stress: II. Lipi...,Plant Cell & Environment,"Narayanan, Sruthi; Prasad, P V Vara; Welti, Ruth","Narayanan, Sruthi (Kansas State University; Cl..."
pub.1000555342,A safety vs efficiency trade‐off identified in...,New Phytologist,"Ocheltree, Troy W; Nippert, Jesse B; Prasad, P...","Ocheltree, Troy W (Colorado State University F..."
pub.1009896183,"Fluctuating, warm temperatures decrease the ef...",New Phytologist,"Burghardt, Liana T; Runcie, Daniel E; Wilczek,...","Burghardt, Liana T (Brown University; Duke Uni..."
pub.1071120973,"Registration of OK05312, a High‐Yielding Hard ...",Journal of Plant Registrations,"Carver, Brett F; Smith, C Michael; Chuang, Wen...","Carver, Brett F (Oklahoma State University)"


- Subset only the papers that have at least one corresponding author in the list of `institutes`
- Some papers have multiple corresponding authors: these are separated by `);`
    - They appear as `Author 1 (University 1); Author 2 (University 2)`
    - The closing parenthesis `)` is important to separate multiple authors
    - Otherwise, you can get confused with authors with multiple affiliations: `Author 1 (University 1; University 2)`
- Once you have separated all the corresponding authors, separate their name from their affiliation
- Get the unique authors (remove the repetitions)
- Count how many papers they have associated to them
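The splitting and counting steps above can be sketched in plain Python. Both helpers (`split_corresponding`, `count_authors`) are hypothetical names for illustration; `utils.corresponding_authors_from_institute` is not shown here and may handle more edge cases.

```python
from collections import Counter

def split_corresponding(raw):
    """Split 'A (U1; U2); B (U3)' into [(name, affiliation), ...]."""
    pairs = []
    # Splitting on '); ' keeps multi-affiliation authors in one piece
    for entry in raw.split('); '):
        # Only the last entry still carries its closing parenthesis
        if entry.endswith(')'):
            entry = entry[:-1]
        name, _, affiliation = entry.partition(' (')
        pairs.append((name.strip(), affiliation.strip()))
    return pairs

def count_authors(raw_values, institutes):
    """Count papers per corresponding author affiliated with any institute."""
    counts = Counter()
    for raw in raw_values:
        for name, affiliation in split_corresponding(raw):
            if any(inst in affiliation for inst in institutes):
                counts[name] += 1
    return counts

raw_values = [
    'Author 1 (University 1; University 2); Author 2 (University 3)',
    'Author 1 (University 1)',
]
print(split_corresponding(raw_values[0]))
print(count_authors(raw_values, ['University 1']))
```

Splitting on `); ` rather than `; ` is what prevents `University 2` from being mistaken for a second author.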

In [13]:
reload(utils)
institutes = ['Kansas State University']
authors, pnum, uni_idx = utils.corresponding_authors_from_institute(data, institutes)
print('Found', len(authors),'different corresponding authors across', len(uni_idx), 'publications')

Found 236 different corresponding authors across 698 publications


Some Corresponding Author values are weird in the raw data. For example, one paper lists its corresponding author as
```
Ferrieri, Richard (University of Missouri-Columbia; Missouri Research Reactor Center, University of Missouri, Columbia, MO 65211, USA;, srstt9@mail.missouri.edu, (S.S.);, afbkhn@mail.missouri.edu, (A.H.);, garren.powell@mail.missouri.edu, (G.P.);, alanstaett@burnsmcd.com, (A.A.);, gerheart@msu.edu, (A.G.);, mvbenoit@mail.missouri.edu, (M.B.);, wildersl@missouri.edu, (S.W.);, schuellerm@missouri.edu, (M.S.
```
which gives
```
University of Missouri-Columbia; University of Missouri-Columbia)
```
as a corresponding author.

- This is obviously wrong, so we are going to remove from the list those names that are *too* long.
- *Too long* in this case means much longer than the 90% quantile of name lengths, in the spirit of how a [boxplot](https://en.wikipedia.org/wiki/Box_plot) flags outliers.
- Keep a Series with number of papers associated to each name
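The exact cutoff used by `utils.remove_long_corresponding` is not shown in this notebook. As an illustration only, the sketch below assumes a name is dropped when it is longer than `factor` times the 90% quantile of all name lengths; both the rule and the default `factor=1.5` are assumptions, and `drop_long_names` is a hypothetical name.

```python
from statistics import quantiles

def drop_long_names(names, factor=1.5):
    """Split names into (kept, dropped) using an assumed length cutoff."""
    lengths = [len(name) for name in names]
    # The last of the nine decile cut points is the 90% quantile
    q90 = quantiles(lengths, n=10, method='inclusive')[-1]
    cutoff = factor * q90  # assumed rule, not taken from utils
    kept = [name for name in names if len(name) <= cutoff]
    dropped = [name for name in names if len(name) > cutoff]
    return kept, dropped

names = [
    'Djanaguiraman, Maduraimuthu', 'Ciampitti, Ignacio Antonio', 'Bai, Guihua',
    'Poland, Jesse A', 'Prasad, P V Vara', 'Zhu, Kun Yan', 'Zukoff, Sarah N',
    'University of Missouri-Columbia; University of Missouri-Columbia)',
]
kept, dropped = drop_long_names(names)
print('Dropped:', dropped)
```

On a short list like this one the outlier itself pulls the quantile upward, so the rule is more reliable on the full author list than on a handful of names.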

In [14]:
authors, pnum = utils.remove_long_corresponding(authors, pnum)
print('Reduced to', len(authors),'corresponding authors')

Dropped:
[]
--
Reduced to 236 corresponding authors


### Fuzzy-match each name with everyone else in the list

- A score of 100 means perfect match
- **I have not fully verified, but I think the fuzzy match operations are not symmetric**
- (Which does not make sense to me, but oh well...)
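One way to settle the symmetry question is to score every pair in both directions and list the pairs that disagree. The sketch below is scorer-agnostic: the default `difflib_score` is only a standard-library stand-in so the example runs without extra packages, and it is **not** what `utils.fuzzy_matrix` uses (that function is not shown here). `asymmetric_pairs` is a hypothetical helper name.

```python
from difflib import SequenceMatcher
from itertools import combinations

def difflib_score(a, b):
    """Stand-in scorer on a 0-100 scale (standard library, not rapidfuzz)."""
    return 100 * SequenceMatcher(None, a, b).ratio()

def asymmetric_pairs(names, scorer=difflib_score, atol=1e-9):
    """Return (a, b, score_ab, score_ba) for pairs whose score depends on order."""
    found = []
    for a, b in combinations(names, 2):
        ab, ba = scorer(a, b), scorer(b, a)
        if abs(ab - ba) > atol:
            found.append((a, b, ab, ba))
    return found

names = ['Ciampitti, Ignacio Antonio', 'Ciampitti, Ignacio A', 'Bai, Guihua']
print(asymmetric_pairs(names))
```

Passing `scorer=fuzz.ratio` (or whichever scorer the matrix is actually built with) on the real author list would show whether any pair is scored differently depending on argument order.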

In [15]:
fz = utils.fuzzy_matrix(authors)
fz.iloc[:5, :5]

,"Djanaguiraman, Maduraimuthu","de Borja Reis, Andre Froes","Kouame, Koffi BadouJeremie","Suleria, Hafiz Ansar Rasul","Ciampitti, Ignacio Antonio"
"Djanaguiraman, Maduraimuthu",-1.0,-1.0,-1.0,-1.0,-1.0
"de Borja Reis, Andre Froes",-1.0,-1.0,-1.0,40.0,26.666667
"Kouame, Koffi BadouJeremie",-1.0,-1.0,-1.0,-1.0,-1.0
"Suleria, Hafiz Ansar Rasul",-1.0,40.0,-1.0,-1.0,40.0
"Ciampitti, Ignacio Antonio",-1.0,26.666667,-1.0,40.0,-1.0


**Re-order the remaining authors by the length of their names.**
- Make a copy of the list
- Remove those names that are deemed copies
- Add the papers of the matches (if the fuzzy match is higher than `tol`)
- Only remove names downstream:
    - E.g. Since the list is ordered by name length, *David Braun* will be removed because of *David M Braun* but not the other way around
    - That way, we always keep the longer version of the name (which I assume is the correct version)
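The "only remove downstream" rule can be sketched as below. This is a simplified illustration, not the logic of `utils.fuzzymatching_authors`: it uses a standard-library `difflib` score as a stand-in for the fuzzy score, and `merge_shorter_names` is a hypothetical name.

```python
from difflib import SequenceMatcher

def difflib_score(a, b):
    """Stand-in scorer on a 0-100 scale (standard library, not rapidfuzz)."""
    return 100 * SequenceMatcher(None, a.casefold(), b.casefold()).ratio()

def merge_shorter_names(paper_counts, tol=90, scorer=difflib_score):
    """Fold the paper counts of shorter near-duplicates into the longer name."""
    # Longest names first; ties are broken alphabetically for reproducibility
    names = sorted(paper_counts, key=lambda n: (-len(n), n))
    merged, removed = {}, set()
    for i, name in enumerate(names):
        if name in removed:
            continue
        total = paper_counts[name]
        # Only look downstream, so a longer name is never absorbed by a shorter one
        for other in names[i + 1:]:
            if other not in removed and scorer(name, other) >= tol:
                total += paper_counts[other]
                removed.add(other)
        merged[name] = total
    return merged

counts = {'Braun, David M': 3, 'Braun, David': 2, 'Stacey, Gary': 5}
print(merge_shorter_names(counts))
```

Here *Braun, David* is folded into *Braun, David M*, while *Stacey, Gary* is left alone because its score against the other names stays below `tol`.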

In [16]:
tol = 90
pnums = utils.fuzzymatching_authors(authors, pnum, fz, tol)

Started with:	 236 

Djanaguiraman, Maduraimuthu	-->	['Djanaguiraman, M']
Ciampitti, Ignacio Antonio	-->	['Ciampitti, Ignacio A' 'Ciampitti, Ignacio' 'Ciampitti, I A']
Smith, Charles Michael	-->	['Smith, C Michael' 'Michael Smith, C']
Little, Christopher R	-->	['Little, C R' 'Little, CR']
Demarco, Paula Andrea	-->	['Demarco, Paula A']
Jagadish, S V Krishna	-->	['Jagadish, Krishna S V' 'Jagadish, Krishna SV' 'Jagadish, SV Krishna']
Jagadish, Krishna S V	-->	['Jagadish, Krishna SV' 'Jagadish, SV Krishna']
Rupp, Jessica L Shoup	-->	['Shoup Rupp, Jessica L']
Schapaugh, William T	-->	['Schapaugh, William' 'Schapaugh, WT']
Jagadish, Krishna SV	-->	['Jagadish, SV Krishna']
Ciampitti, Ignacio A	-->	['Ciampitti, Ignacio' 'Ciampitti, I A']
Dille, Johanna Anita	-->	['Dille, J Anita']
Channa, B Rajashekar	-->	['C, B Rajashekar']
Holman, Johnathon D	-->	['Holman, Johnathan D' 'Holman, Johnathon']
Upadhyaya, Hari Deo	-->	['Upadhyaya, Hari D' 'Upadhyaya, H D']
Ciampitti, Ignacio	-->	['Ciampitti, I A'

____

# Computing the USDA file

- Some folks are both USDA and MU, but Dimensions registers them as USDA only.
- We'll repeat the steps above and see if we can add any papers to the list of MU authors we already have
- **We'll be adding only papers, not authors**
- The end goal is to generate a file with USDA-affiliated authors and the number of plant-specific publications they have as corresponding authors.

In [17]:
usda_institute = 'USDA'
filename = src + usda_institute+'_Pubs.csv'

if not os.path.isfile(filename):
    filenames = sorted(glob(src + usda_institute + '*.xlsx'))
    usda = utils.prepare_dimensions_data(filenames, columns_to_keep)
    usda.to_csv(filename, index=False)
else:
    usda = pd.read_csv(filename)

usda = usda.set_index(columns_to_keep[0])
print('Loaded', len(usda), 'publications')

../raw/USDA_Pubs_1.xlsx
../raw/USDA_Pubs_2.xlsx
../raw/USDA_Pubs_3.xlsx
../raw/USDA_Pubs_4.xlsx
../raw/USDA_Pubs_5.xlsx
../raw/USDA_Pubs_6.xlsx
../raw/USDA_Pubs_7.xlsx
../raw/USDA_Pubs_8.xlsx
Loaded 21908 publications


- Only consider the USDA papers where at least one author is geographically associated with the city/state of the university
- This way we discard people in USDA units outside the state that could be wrongly added later
    - E.g. someone who published as a PhD student/postdoc at MU but then moved to USDA in Florida

In [18]:
isplace = np.zeros(len(usda), dtype=bool)
for i,idx in enumerate(usda.index):
    isplace[i] = any([place in usda.loc[idx, 'Authors (Raw Affiliation)'].lower() for place in places])
usda = usda.iloc[isplace]
print('Only kept', len(usda), 'publications that are geographically associated to', places)
#usda.head()

Only kept 1473 publications that are geographically associated to ['manhattan', 'kansas', '[see note]']


- Apply the same criteria to only look at plant-specific papers
- Drop the papers with missing corresponding author information

In [19]:
isplantjournal = utils.mask_plant_journals(usda, plantssns)
print('Plant-specific journal publications:\t',np.sum(isplantjournal))
isplantanz = utils.mask_plant_anzsrc(usda, ANZSRC)
print('Categorized as plant-specific:\t',np.sum(isplantanz))
isplantmesh = utils.count_plant_mesh(usda, meshterms)

isplant = (isplantjournal + (isplantmesh >= min_kws))
usdata = usda.iloc[isplant][['Title', 'Source title', 'Authors', 'Corresponding Authors']]
usdata = usdata[~pd.isna(usdata['Corresponding Authors'])]
print('Only kept', len(usdata), 'papers with corresponding author')

Plant-specific journal publications:	 273
Categorized as plant-specific:	 208
Only kept 293 papers with corresponding author


- Count only those papers whose corresponding author is affiliated with USDA
- Exclude the papers where the corresponding author is also affiliated with the university, to avoid double counting
- Do fuzzy matching to drop duplicate names
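The include/exclude rule above can be written as a small predicate on a single affiliation string. This is a sketch under the assumption that `exclude_list` simply vetoes any affiliation mentioning the university; the helper name `usda_only` is hypothetical and the behaviour inside `utils` may differ.

```python
def usda_only(affiliation, include_list, exclude_list):
    """True if the affiliation mentions a USDA unit and no excluded institute."""
    included = any(inst in affiliation for inst in include_list)
    excluded = any(inst in affiliation for inst in exclude_list)
    return included and not excluded

usda_institutes = ['United States Department of Agriculture', 'Agricultural Research Service']
institutes = ['Kansas State University']

print(usda_only('Agricultural Research Service - Midwest Area', usda_institutes, institutes))
print(usda_only('Kansas State University; Agricultural Research Service', usda_institutes, institutes))
print(usda_only('Oklahoma State University', usda_institutes, institutes))
```

Papers vetoed by the exclusion are not lost: they were already counted in the university pass, which is exactly why they are skipped here.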

In [20]:
reload(utils)
usda_institutes = ['United States Department of Agriculture', 'Agricultural Research Service', 'Biological Control of Insects Research']
usauthors, uspnum, idx = utils.corresponding_authors_from_institute(usdata, usda_institutes, exclude_list=institutes)
print('Found', len(usauthors),'different corresponding authors in', len(idx), 'publications')

usauthors, uspnum = utils.remove_long_corresponding(usauthors, uspnum)
print('Reduced to', len(usauthors),'corresponding authors')
fz = utils.fuzzy_matrix(usauthors)
usda_pubs = utils.fuzzymatching_authors(usauthors, uspnum, fz, tol)

Found 47 different corresponding authors in 83 publications
Dropped:
[]
--
Reduced to 47 corresponding authors
Started with:	 47 


After matching:	47


- Add the USDA papers to authors that have at least one corresponding paper with the university

In [21]:
choices = usda_pubs.index.values
for i in range(len(pnums)):
    name = pnums.index[i]
    match, fscore, idx = process.extractOne(name, choices, scorer=fuzz.partial_ratio)
    if fscore >= 99:
        print(name, '--->', match, '[{:.2f}]'.format(fscore), sep='\t')
        pnums[name] += usda_pubs[match]

Bai, Guihua	--->	Bai, Guihua	[100.00]


## Determine which authors make the cut

- To be discussed with David
- How many papers are required to be considered an "IPG" member?
- **Update: We settled with 2 papers as minimum**
    - It is a compromise to include faculty that joined recently but exclude most of the false positives (students, postdocs, scientists)

In [22]:
foo = pnums.to_frame('N').reset_index(names='names').sort_values(by=['N','names'], ascending=[False,True]).set_index('names').squeeze()
foo.to_csv(institute + '_IPG.csv', index=True, index_label='Corresponding Authors', header=['Pubs Num'])
foo

names
Jagadish, S V Krishna         44
Ciampitti, Ignacio Antonio    41
Bai, Guihua                   39
Poland, Jesse A               33
Prasad, P V Vara              31
                              ..
Zhu, Kun Yan                   1
Zukoff, Sarah N                1
de Borja Reis, Andre Froes     1
de Oliveira Silva, Amanda      1
van Versendaal, Emmanuela      1
Name: N, Length: 197, dtype: int64

- List of "IPG" authors sorted by number of publications and then by alphabetical order
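Note that the series saved above still includes authors with a single paper, so the two-paper minimum would be applied as a separate step. A minimal sketch of that filter together with the sort order described above; `rank_authors` is a hypothetical helper operating on a plain dictionary rather than on `pnums`.

```python
def rank_authors(paper_counts, min_pubs=2):
    """Keep authors with at least min_pubs papers, sorted by count then name."""
    kept = {name: n for name, n in paper_counts.items() if n >= min_pubs}
    # Negative count gives descending order; the name breaks ties alphabetically
    return sorted(kept.items(), key=lambda item: (-item[1], item[0]))

counts = {
    'Bai, Guihua': 39,
    'Jagadish, S V Krishna': 44,
    'Ciampitti, Ignacio Antonio': 41,
    'Zhu, Kun Yan': 1,
}
for name, n in rank_authors(counts):
    print(name, n, sep='\t')
```

With a pandas Series the equivalent filter would be `pnums[pnums >= 2]`, applied before sorting and saving.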

----

# Ignore all below

In [21]:
print('Number of authors with at least N plant-specific papers as corresponding author:\n--')
for N in range(1,11):
    print(N, np.sum(pnums >= N), sep='\t')

Number of authors with at least N plant-specific papers as corresponding author:
--
1	191
2	88
3	62
4	45
5	35
6	33
7	29
8	23
9	19
10	16


In [96]:
iscorr = np.zeros(len(df), dtype=bool)
for i,idx in enumerate(df.index):
    iscorr[i] = any([place in df.loc[idx, 'Authors (Raw Affiliation)'].lower() for place in ['columbia', 'missouri']])
print(np.sum(~iscorr))
foo = df.iloc[~iscorr][['Title', 'Source title', 'MeSH terms', 'Fields of Research (ANZSRC 2020)', 'Authors', 'Corresponding Authors']]
foo = foo.drop(df[df['Authors (Raw Affiliation)'] == '[see note]'].index)
print(foo.shape)
foo.head(20)

134
(8, 6)


Publication ID,Title,Source title,MeSH terms,Fields of Research (ANZSRC 2020),Authors,Corresponding Authors
pub.1100176260,Inter-population plasticity in growth and repr...,Journal of Vertebrate Biology,,31 Biological Sciences; 3103 Ecology,"Mas, Guillem; Latorre, Daniel; Tarkan, Ali Ser...","Almeida, David (University of Girona)"
pub.1109929513,Genes and Dietary Fatty Acids in Regulation of...,Nutrients,"biomarkers; chromosomes, human, pair 11; delta...",32 Biomedical and Clinical Sciences; 3202 Clin...,"Lankinen, Maria; Uusitupa, Matti; Schwab, Ursula","Lankinen, Maria (University of Eastern Finland..."
pub.1107894701,Micronutrient Status in Sri Lanka: A Review,Nutrients,humans; malnutrition; micronutrients; nutritio...,32 Biomedical and Clinical Sciences; 3210 Nutr...,"Abeywickrama, Hansani Madushika; Koyama, Yu; U...","Abeywickrama, Hansani Madushika (Niigata Unive..."
pub.1120290964,Percolation models of pathogen spillover.,Philosophical Transactions of the Royal Societ...,"animals; communicable diseases, emerging; dise...",31 Biological Sciences; 32 Biomedical and Clin...,"Washburne, Alex D; Crowley, Daniel E; Becker, ...",
pub.1117292642,Nordic Diet and Inflammation—A Review of Obser...,Nutrients,"adult; c-reactive protein; cathepsins; diet, h...",32 Biomedical and Clinical Sciences; 3210 Nutr...,"Lankinen, Maria; Uusitupa, Matti; Schwab, Ursula","Lankinen, Maria (University of Eastern Finland..."
pub.1128365960,Quality of Life and Symptom Burden among Chron...,International Journal of Environmental Researc...,adult; aged; cross-sectional studies; female; ...,32 Biomedical and Clinical Sciences; 3202 Clin...,"Abeywickrama, Hansani Madushika; Wimalasiri, S...","Abeywickrama, Hansani Madushika (Niigata Unive..."
pub.1173010241,Impacts of Distribution-Level Joint Scheduling...,IEEE Transactions on Industry Applications,,40 Engineering,"Aktar, Abdullah Krat; Tackaraolu, Akn; Erdin, ...","Tackaraolu, Akn (University of MissouriColumbia)"
pub.1187462604,Eye2Heart : a validated lumped-parameter model...,arXiv,,32 Biomedical and Clinical Sciences; 3212 Opht...,"Sala, Lorenzo; Zaid, Mohamed; Hughes, Faith; S...",


In [91]:
df.loc['pub.1040985095', 'Authors (Raw Affiliation)']

'[see note]'

In [39]:
iscorr = np.zeros(len(df), dtype=bool)
for i,idx in enumerate(df.index):
    if not pd.isna(df.loc[idx, 'Corresponding Authors']):
        iscorr[i] = 'Gillman, Jason' in df.loc[idx, 'Corresponding Authors']
print(np.sum(iscorr))
foo = df.loc[iscorr, ['Title', 'Source title', 'MeSH terms', 'Fields of Research (ANZSRC 2020)', 'Authors', 'Corresponding Authors']]
foo.head(20)

7


Publication ID,Title,Source title,MeSH terms,Fields of Research (ANZSRC 2020),Authors,Corresponding Authors
pub.1043091355,Development of Rigorous Fatty Acid Near‐Infrar...,Journal of the American Oil Chemists' Society,,"30 Agricultural, Veterinary and Food Sciences;...","Karn, Avinash; Heim, Crystal; FlintGarcia, She...","Gillman, Jason (US Department of AgricultureAg..."
pub.1018220376,Impact of heat stress during seed development ...,Metabolomics,,32 Biomedical and Clinical Sciences; 3205 Medi...,"Chebrolu, Kranthi K; Fritschi, Felix B; Ye, So...","Gillman, Jason D (University of MissouriColumbia)"
pub.1079357170,Genotyping-by-Sequencing-Based Investigation o...,"G3: Genes, Genomes, Genetics","base sequence; genes, plant; genotype; mutatio...",31 Biological Sciences; 3105 Genetics,"Heim, Crystal B; Gillman, Jason D","Gillman, Jason D (University of MissouriColumb..."
pub.1120401495,A seed germination transcriptomic study contra...,BMC Research Notes,"adaptation, physiological; gene expression pro...",32 Biomedical and Clinical Sciences,"Gillman, Jason D; Biever, Jessica J; Ye, Songq...","Gillman, Jason D (University of MissouriColumbia)"
pub.1132771180,Quantitative trait locus mapping for resistanc...,Crop Science,,"30 Agricultural, Veterinary and Food Sciences;...","Gillman, Jason D; Chebrolu, Kranthi; Smith, Ja...","Gillman, Jason D (Agricultural Research Servic..."
pub.1145872581,Genomic prediction models for traits differing...,BMC Plant Biology,"genetic markers; genome, plant; linkage disequ...","30 Agricultural, Veterinary and Food Sciences;...","Kaler, Avjinder S; Purcell, Larry C; Beissinge...","Gillman, Jason D (University of MissouriColumbia)"
pub.1182796904,Association mapping for water use efficiency i...,Frontiers in Plant Science,,"30 Agricultural, Veterinary and Food Sciences;...","Chamarthi, Siva K; Purcell, Larry C; Fritschi,...","Gillman, Jason D (University of MissouriColumbia)"


In [129]:
pd.unique(foo['Fields of Research (ANZSRC 2020)'])

array(['31 Biological Sciences; 3108 Plant Biology; 37 Earth Sciences',
       '31 Biological Sciences; 3103 Ecology; 3108 Plant Biology',
       '30 Agricultural, Veterinary and Food Sciences; 3004 Crop and Pasture Production; 31 Biological Sciences; 3101 Biochemistry and Cell Biology; 3108 Plant Biology',
       '30 Agricultural, Veterinary and Food Sciences; 3007 Forestry Sciences; 31 Biological Sciences; 3103 Ecology'],
      dtype=object)

In [15]:
pd.unique(foo['MeSH terms'])

array([nan,
       'plant roots; water; abscisic acid; plant growth regulators; droughts; cell wall; soil; dehydration'],
      dtype=object)

In [40]:
iscorr = np.zeros(len(usda), dtype=bool)
for i, idx in enumerate(usda.index):
    if not pd.isna(usda.iloc[i]['Corresponding Authors']):
        iscorr[i] = 'Gillman, Jason' in usda.loc[idx, 'Corresponding Authors']
print(np.sum(iscorr))
foo = usda.loc[iscorr, ['Title', 'Source title', 'Authors', 'MeSH terms', 'Corresponding Authors']]
foo.head(20)

5


Publication ID,Title,Source title,Authors,MeSH terms,Corresponding Authors
pub.1132771180,Quantitative trait locus mapping for resistanc...,Crop Science,"Gillman, Jason D; Chebrolu, Kranthi; Smith, Ja...",,"Gillman, Jason D (Agricultural Research Servic..."
pub.1182796904,Association mapping for water use efficiency i...,Frontiers in Plant Science,"Chamarthi, Siva K; Purcell, Larry C; Fritschi,...",,"Gillman, Jason D (University of MissouriColumbia)"
pub.1120401495,A seed germination transcriptomic study contra...,BMC Research Notes,"Gillman, Jason D; Biever, Jessica J; Ye, Songq...","adaptation, physiological; gene expression pro...","Gillman, Jason D (University of MissouriColumbia)"
pub.1018220376,Impact of heat stress during seed development ...,Metabolomics,"Chebrolu, Kranthi K; Fritschi, Felix B; Ye, So...",,"Gillman, Jason D (University of MissouriColumbia)"
pub.1079357170,Genotyping-by-Sequencing-Based Investigation o...,"G3: Genes, Genomes, Genetics","Heim, Crystal B; Gillman, Jason D","base sequence; genes, plant; genotype; mutatio...","Gillman, Jason D (University of MissouriColumb..."


In [41]:
foo.loc['pub.1079357170', 'MeSH terms']

'base sequence; genes, plant; genotype; mutation, missense; phenotype; plant proteins; seeds; soybean oil; glycine max; stearic acids'

In [50]:
pd.unique(foo['Research Organizations - standardized'])

array(['Ateneo de Manila University; Agricultural Research Service - Midwest Area; University of Missouri-Columbia',
       'Iowa State University of Science and Technology; Agricultural Research Service - Midwest Area',
       'Agricultural Research Service - Midwest Area; Purdue University West Lafayette',
       'Agricultural Research Service - Midwest Area; University of Missouri-Columbia',
       'Agricultural Research Service - Midwest Area; Purdue University West Lafayette; Lewis Clark State College'],
      dtype=object)

In [175]:
jkw = ['Rhiz']
isjournal = np.zeros(len(df), dtype=bool)
for i,idx in enumerate(df.index):
    if not pd.isna(df.loc[idx, 'Source title']):
        foo = df.loc[idx, 'Source title']
        isjournal[i] = any([ kw in foo for kw in jkw ])

print(np.sum(isjournal))
uq = [ x.upper() for x in pd.unique(df.loc[isjournal, 'Source title']) ]
print(len(uq))
uq, idx = np.unique(df.loc[isjournal, 'Source title'].values, return_index=True)
idx = df.loc[isjournal, 'Source title'].index[idx].values
bar = df.loc[idx, ['Source title', 'ISSN']]
bar['Source title'] = bar['Source title'].str.upper()
bar = bar.set_index('Source title')
bar.loc[ np.setdiff1d(bar.index, journals['Journal']) ]

1
1


Source title,ISSN


In [177]:
df.loc[df['Source title'] == 'Crops', ['Title', 'Authors', 'MeSH terms', 'Corresponding Authors']]

Publication ID,Title,Authors,MeSH terms,Corresponding Authors
pub.1181215363,Application of Pyroligneous Acid as a Plant Gr...,"Noel, Randi; Schueller, Michael J; Guthrie, Ja...",,"Ferrieri, Richard A (University of MissouriCol..."


In [38]:
for i in range(len(df.columns)):
    print(i, df.columns[i], sep='\t')

0	Rank
1	Publication ID
2	DOI
3	PMID
4	PMCID
5	ISBN
6	Title
7	Abstract
8	Acknowledgements
9	Funding
10	Source title
11	Anthology title
12	Book editors
13	Publisher
14	ISSN
15	MeSH terms
16	Publication date
17	PubYear
18	Publication date (online)
19	Publication date (print)
20	Volume
21	Issue
22	Pagination
23	Open Access
24	Publication Type
25	Document Type
26	Authors
27	Authors (Raw Affiliation)
28	Corresponding Authors
29	Authors Affiliations
30	Research Organizations - standardized
31	GRID IDs
32	City of standardized research organization
33	State of standardized research organization
34	Country of standardized research organization
35	Funder
36	Funder Group
37	Funder Country
38	Grant IDs of Supporting Grants
39	Supporting Grants
40	Times cited
41	Recent citations
42	RCR
43	FCR
44	Altmetric
45	Source Linkout
46	Dimensions URL
47	Fields of Research (ANZSRC 2020)
48	RCDC Categories
49	HRCS HC Categories
50	HRCS RAC Categories
51	Cancer Types
52	CSO Categories
53	Units of Assessment
54	

In [46]:
meshs = df.loc[~isplant, 'MeSH terms']
meshs = meshs[~pd.isna(meshs)]
mesh = set()
for i in range(len(meshs)):
    mesh |= set(meshs.iloc[i].split('; '))
mesh = sorted(list(mesh))
#pd.Series(mesh).to_csv(src + 'mesh.txt', index=False, header=False, sep='\n')

In [109]:
meshs = df.loc[isplant, 'MeSH terms']
meshs = meshs[~pd.isna(meshs)]
mesh = set()
for i in range(len(meshs)):
    mesh |= set(meshs.iloc[i].split('; '))
mesh = sorted(list(mesh))
print(len(mesh))
pd.Series(mesh).to_csv(src + 'plant_MU_mesh.txt', index=False, header=False, sep='\n')

1274


In [158]:
foo = df.loc[~isplant]
bar = foo.iloc[ isplantmesh[~isplant] > 2 ][['Publication ID', 'MeSH terms', 'Title', 'Source title', 'Corresponding Authors']]
print(bar.shape)

(99, 5)


In [216]:
foo = pnum.to_frame(name='Pubs Num')
foo['Length'] = np.array(list(map(len,pnum.index.values)))
foo['Names'] = foo.index.values
foo = foo.sort_values(by=['Pubs Num', 'Length', 'Names'], ascending=[False, False, True])
foo

,Pubs Num,Length,Names
"Nguyen, Henry T",58,15,"Nguyen, Henry T"
"Mittler, Ron",51,12,"Mittler, Ron"
"Meyers, Blake C",36,15,"Meyers, Blake C"
"Stacey, Gary",34,12,"Stacey, Gary"
"Birchler, James A",28,17,"Birchler, James A"
...,...,...,...
"Li, Song",1,8,"Li, Song"
"Qin, Hua",1,8,"Qin, Hua"
"Song, Li",1,8,"Song, Li"
"Chen, P",1,7,"Chen, P"


In [85]:
name1, name2 = 'Stacey, Minviluz G', 'Stacey, Gary'

lname1, fname1 = name1.split(', ')
inits1 = [x[0] for x in fname1.split(' ')]

lname2, fname2 = name2.split(', ')
inits2 = [x[0] for x in fname2.split(' ')]

#If none of the initials match, then assume that names are not equal and move on
if not any([x in inits1 for x in inits2]):
    print('No initial matching')

fuzzscore = fuzz.ratio(name1.casefold(), name2.casefold())
if fuzzscore >= 98:
    print('Fuzzscore', fuzzscore)

# Add blank spaces for first names reduced to initials
# e.g. Riedell, WE --> Riedell, W E

fname1 = utils.add_blanks(fname1)
inits1 = [x[0] for x in fname1.split(' ')]

fname2 = utils.add_blanks(fname2)
inits2 = [x[0] for x in fname2.split(' ')]

print('Last name: ', lname1, ' --\tFirst name: ', fname1, ' --\tInitials: ',inits1, sep='')
print('Last comp: ', lname2, ' --\tFirst comp: ', fname2, ' --\tInitials: ',inits2, sep='')

Last name: Stacey --	First name: Minviluz G --	Initials: ['M', 'G']
Last comp: Stacey --	First comp: Minviluz G --	Initials: ['M', 'G']
