# Install libraries

Make sure you [install `rapidfuzz`](https://github.com/rapidfuzz/RapidFuzz?tab=readme-ov-file) for fuzzy matching. Click the link for documentation.

If you have issues with it, you can always go with [thefuzz](https://github.com/seatgeek/thefuzz). In that case, do `from thefuzz import fuzz, process` instead.

In [1]:
# Standard Python
import os
import json
import unicodedata
from glob import glob
from importlib import reload
import datetime

# Canon libraries
import pandas as pd
import numpy as np

# You'll need to install this one
from rapidfuzz import fuzz, process

# The utils file to keep the notebook short
import utils

- The file `pnas.2217564120.sd01.xlsx` has a list of plant-specific academic journals. The base of the list comes [from here](https://www.pnas.org/doi/10.1073/pnas.2217564120). Then based on MU, I added missing journals containing the words 
```
['Plant', 'Botan', 'Phyto', 'Hort', 'Crop']
```

- I manually excluded the Crop-related journals whose scope also included topics closer to economy or robot engineering
- The `MeSH_terms` file contains keywords that are mostly unique to plant biology research. 
- The terms are lowercased to have better chances of matching.

In [2]:
src = '..' + os.sep + 'raw' + os.sep
journals = pd.read_excel(src + 'pnas.2217564120.sd01.xlsx')
plantssns = pd.unique( journals.loc[:, ['ISSN','eISSN'] ].values.ravel() )
plantssns = plantssns[~pd.isna(plantssns)]
meshterms = np.char.lower(np.loadtxt(src + 'MeSH_terms.txt', dtype=str, delimiter=','))
meshterms = np.array([' ' + x for x in meshterms])

## Loading and preparing the data

- Load all the papers published with at least one author affiliated to MU since 2015.
- Data obtained from [dimensions.ai](https://www.dimensions.ai/)
- In reality `MU_Pubs_2026.xlsx` are all the papers published under *University of Missouri System*.
- To make name comparisons and diagnostics easier down the road, all the author names will be converted to `ascii` (standard English-language keyboard)

```python
# Example
text = unicodedata.normalize('NFD', text)
text = text.encode('ascii', 'ignore').decode("utf-8")
```
The comands above rewrite the names
```
Sanz, Amparo; Pike, Sharon; Khan, Mather A; Carrió-Seguí, Àngela; Mendoza-Cózatl, David G; Peñarrubia, Lola; Gassmann, Walter
```
as:
```
Sanz, Amparo; Pike, Sharon; Khan, Mather A; Carrio-Segui, Angela; Mendoza-Cozatl, David G; Penarrubia, Lola; Gassmann, Walter
```
- Notice that all the accents and non-English characters have been replaced.
- Also, part of the data preparation removes any `.` or `-` characters: `David M. Braun ---> David M Braun`

---

- The `Raw Affiliation` column lists the actual author address as scraped by Dimensions
- We will only keep the papers that the raw address mention either the university name, city, or state
    - The places are lowercase to account for papers with addresses all uppercase (e.g. `Missouri` vs `MISSOURI`)
    - In MU case, the university name and state are the same
- Also consider those papers with `Raw Affiliation` as `[see note]`: this happens in papers where there are more than 100 authors (not uncommon in astrophysics or database-compilation publications)
- Example of paper where [authors are listed as MU](https://app.dimensions.ai/details/publication/pub.1173010241) but if you go to the original paper, this is not the case.

---

### The generated `MU_Pubs.csv` file 

The script writes automatically a `MU_Pubs.csv` file so that you save time loading for future ocassions. The file is:
- A compilation of all the Dimension Excel spreadsheets
- Minus entries related to talks and old preprints
- Minus entries where no author is associated to the university/city/state
- Minus duplicates e.g. there are duplicate papers between those associated to MU vs those associated to the UM System
- With only 29/54 columns
- Where the author-related columns have no accents/hyphens/dots

---

- Generate or load `MU_Pubs.csv`
- Change the row indices to the Dimensions ID
- Slightly less that 42K publications for MU

In [3]:
# Of the 54 fields available from Dimensions, these 29 might be relevant at some point
columns_to_keep = [
    'Publication ID', 'Title', 'Abstract', 'Source title', 'ISSN', 'Publisher', 'MeSH terms', 'PubYear',
    'Open Access', 'Publication Type', 'Document Type', 'Authors', 'Authors (Raw Affiliation)', 'Corresponding Authors',
    'Research Organizations - standardized', 'GRID IDs', 'City of standardized research organization',
    'State of standardized research organization', 'Country of standardized research organization', 'Funder',
    'Funder Group', 'Funder Country', 'Times cited', 'RCR', 'FCR', 'Altmetric', 'Fields of Research (ANZSRC 2020)',
    'Units of Assessment', 'Sustainable Development Goals'
]

institute = 'MU'
places = ['columbia', 'missouri', '[see note]']
institutes = ['University of Missouri-Columbia', 'University of Missouri System']

filename = src + institute+'_Pubs.csv'

reload(utils)
if not os.path.isfile(filename):
    filenames = sorted(glob(src + institute + '*.xlsx'))
    df = utils.prepare_dimensions_data(filenames, columns_to_keep, places)
    df.to_csv(filename, index=False)
else:
    print('Found', filename, 'already computed')
    df = pd.read_csv(filename)

df = df.set_index(columns_to_keep[0])
print('Loaded', len(df), 'publications')

Found ../raw/MU_Pubs.csv already computed
Loaded 41909 publications


## Criteria to determine if a publication is plant-specific

To determine if a paper is plant-specific, it has to match at least one criteria (out of three).

### 1. Published in a plant-specific journal

- Subset the papers that were published in plant-specific journals.
- Looking at journals by their unique numerical identifier (ISSN) instead of their names to avoid spelling confusions.

In [4]:
isplantjournal = utils.mask_plant_journals(df, plantssns)
print('Plant-specific journal publications:\t',np.sum(isplantjournal))

Plant-specific journal publications:	 1297


### 2. Categorized as *Agriculture*, *Plant Biology*, *Soil*, or *Horticulture* according to the ANZSRC Fields of Research

- There are other releated fields or research (e.g. Environmental Biotechnology), but those two are the ones pretty much exclusive for plant research

In [5]:
ANZSRC = ['3108 Plant Biology','3008 Horticultural', '4106 Soil']
isplantanz = utils.mask_plant_anzsrc(df, ANZSRC)
print('Categorized as plant-specific:\t',np.sum(isplantanz))

Categorized as plant-specific:	 1175


### 3. It has at least 3 plant-related keywords

- The `MeSH_terms` file has a list of what I deemed plant-related keywords that are very much related to plant biology and **not** biology in general
- We require at least 3 terms to make sure the paper is truly focused on plants
- e.g. While *cellulose* or *lignin* are very much plant-exclusive terms, you have papers that discuss them in the context of material science

In [6]:
isplantmesh = utils.count_plant_mesh(df, meshterms)

---
## Update

We have decided that we are only going to use the first and third criteria.

---

## Getting plant-specific corresponding authors

- Find the union of those ~~three~~ two criteria to determine the subset of plant-specific publications
- Discard the publications with no corresponding authors

In [7]:
min_kws = 3
#isplant = isplantanz + isplantjournal + (isplantmesh >= min_kws)
isplant = isplantjournal + (isplantmesh >= min_kws)

data = df.iloc[isplant][['Title', 'Source title', 'Authors', 'Corresponding Authors']]
data = data[~pd.isna(data['Corresponding Authors'])]
print(data.shape)
data.head()

(1438, 4)


Unnamed: 0_level_0,Title,Source title,Authors,Corresponding Authors
Publication ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
pub.1071447931,Genome‐Wide Association Analysis of Diverse So...,The Plant Genome,"Dhanapal, Arun Prabhu; Ray, Jeffery D; Singh, ...","Fritschi, Felix B (University of MissouriColum..."
pub.1071447926,Phytic Acid and Inorganic Phosphate Compositio...,The Plant Genome,"Vincent, Jennifer A; Stacey, Minviluz; Stacey,...","Bilyeu, Kristin D (University of MissouriColum..."
pub.1071447895,Linkage Maps of a Mediterranean × Continental ...,The Plant Genome,"Dierking, Ryan; Azhaguvel, Perumal; Kallenbach...","Dierking, Ryan (Purdue University West Lafayette)"
pub.1059858594,A model for intracellular movement of Cauliflo...,Journal of Experimental Botany,"Schoelz, James E; Angel, Carlos A; Nelson, Ric...","Schoelz, James E (University of MissouriColumbia)"
pub.1059858561,"Core clock, SUB1, and ABAR genes mediate flood...",Journal of Experimental Botany,"Syed, Naeem H; Prince, Silvas J; Mutava, Raymo...","Syed, Naeem H (Canterbury Christ Church Univer..."


- Susbset only the papers that have at least one corresponding author in the list of `institutes`
- Some papers have multiple corresponding authors: these are separated by `);`
    - They appear as `Author 1 (University 1); Author 2 (University 2)`
    - The closing parenthesis `)` is important to separate multiple authors
    - Otherwise, you can get confused with authors with multiple affiliations: `Author 1 (University 1; University 2)`
- Once you have separated all the corresponding authors, separate their name from their affiliation
- Get the unique authors (remove the repetitions)
- Count how many papers they are associated to these unique authors

In [8]:
pnum, uni_idx = utils.corresponding_authors_from_institute(data, institutes)
print('Found', len(pnum),'different corresponding authors across', len(uni_idx), 'publications')

Found 219 different corresponding authors across 714 publications


Some Corresponding Author values are weird in the raw data. For example, one paper lists its corresponding author as
```
Ferrieri, Richard (University of Missouri-Columbia; Missouri Research Reactor Center, University of Missouri, Columbia, MO 65211, USA;, srstt9@mail.missouri.edu, (S.S.);, afbkhn@mail.missouri.edu, (A.H.);, garren.powell@mail.missouri.edu, (G.P.);, alanstaett@burnsmcd.com, (A.A.);, gerheart@msu.edu, (A.G.);, mvbenoit@mail.missouri.edu, (M.B.);, wildersl@missouri.edu, (S.W.);, schuellerm@missouri.edu, (M.S.
```
Which gives 
```
University of Missouri-Columbia; University of Missouri-Columbia)
```
as one of the corresponding authors (the other one is `Ferrieri, Richard`, which is also kept)

- This is obviously wrong, so we are going to remove from the list those names that are *too* long.
- *Too long* in this case means much larger than the 90% quantile. As in a [boxplot](https://en.wikipedia.org/wiki/Box_plot).
- Keep a Series with number of papers associated to each name
- **You may skip this step if the discarded names are actual people names**

In [9]:
pnum = utils.remove_long_corresponding(pnum)
print('Reduced to', len(pnum),'corresponding authors')

Dropped:
Index(['University of MissouriColumbia)',
       'University of MissouriColumbia; University of MissouriColumbia)',
       'University of MissouriColumbia; University of MissouriColumbia; University of MissouriColumbia)'],
      dtype='object')
--
Reduced to 216 corresponding authors


### Fuzzy-match each name with everyone else in the list

- A score of 100 means perfect match
- I have not fully verified, but I think the fuzzy match operations are not symmetric (which does not make sense to me, but oh well...)

In [10]:
fz = utils.fuzzy_matrix(pnum)
fz.iloc[:5, :5]

Unnamed: 0,"de Borja Reis, Andre Froes","Kalaitzandonakes, Nicholas","MgbechiEzeri, Josephine U","Mitchum, Melissa Goellner","Chhapekar, Sushil Satish"
"de Borja Reis, Andre Froes",-1.0,-1.0,-1.0,-1.0,-1.0
"Kalaitzandonakes, Nicholas",-1.0,-1.0,-1.0,-1.0,-1.0
"MgbechiEzeri, Josephine U",-1.0,-1.0,-1.0,-1.0,-1.0
"Mitchum, Melissa Goellner",-1.0,-1.0,-1.0,-1.0,-1.0
"Chhapekar, Sushil Satish",-1.0,-1.0,-1.0,-1.0,-1.0


**Re-order the remaining authors by the length of their names.** If two names have the same character length, preference is given to the name style with more associated papers. If two names share length and paper count, they are sorted alphabetically.

- Make a copy of the list
- Remove those names that are deemed copies
- Add the papers of the matches (if the fuzzy match is higher than `tol`)
- Only remove names downstream:
    - E.g. Since the list is ordered by name length, *David Braun* will be removed because of *David M Braun* but not the other way around
    - That way, we always keep the longer version of the name (which I assume is the correct version)
 
**Additionally, a dictionary with names and fuzzy alternatives is kept**: `fznames`

- This dictionary will later help us to quickly detect author names to eventually build quickly an adjacency matrix
- The dictionary is also enriched with initial-less variants for each name

In [27]:
tol = 90
pnums, fznames = utils.fuzzymatching_authors(pnum, fz, tol)

Started with:	 216 

Bilyeu, Kristin D	-->	['Bilyeu, Kristin']
Birchler, James A	-->	['Birchler, James']
Bish, Mandy D	-->	['Bish, Mandy']
Bradley, Kevin W	-->	['Bradley, Kevin']
Chen, Pengyin	-->	['Chen, P']
Cocroft, Reginald B	-->	['Cocroft, Reginald']
Ferrieri, Richard A	-->	['Ferrieri, Richard', 'Ferrieri, R A']
FlintGarcia, Sherry A	-->	['FlintGarcia, Sherry']
Fritschi, Felix B	-->	['Fritschi, F B']
Guo, Ya	-->	['Guo, Y']
Hibbard, Bruce E	-->	['Hibbard, B E']
Kallenbach, Robert L	-->	['Kallenbach, Robert']
Lall, Namrita	-->	['Lall, N']
Matthes, Michaela S	-->	['Matthes, Michaela']
Mitchum, Melissa Goellner	-->	['Mitchum, Melissa G']
Nguyen, Henry T	-->	['T Nguyen, Henry', 'Nguyen, H T']
Patharkar, Osric Rahul	-->	['Patharkar, O Rahul']
Scaboo, Andrew M	-->	['Scaboo, Andrew']
Shannon, J Grover	-->	['Shannon, Grover']
Shelby, Kent S	-->	['Shelby, KS']
Voothuluru, Priyamvada	-->	['Voothuluru, Priya']
Vuong, Tri D	-->	['Vuong, T D']
Xiong, Xi	-->	['Xi, Xiong']

After matching:	191


In [29]:
name = 'Fritschi, Felix B'
fznames[name]

['Fritschi, Felix B', 'Fritschi, F B', 'Fritschi, Felix']

____

# Computing the USDA file

- Some folks are both USDA and MU, but Dimensions registers them as USDA only.
- We'll repeat the steps above and see if we can add any papers to the list of MU authors we already have
- **We'll be adding only papers, not authors**
- The end goal is to generate a file with USDA-affiliated authors and the number of plant-specific publications they have as corresponding authors.

In [30]:
# Same pipeline as above

usda_institute = 'USDA'
filename = src + usda_institute+'_Pubs.csv'

if not os.path.isfile(filename):
    filenames = sorted(glob(src + usda_institute + '*.xlsx'))
    usda = utils.prepare_dimensions_data(filenames, columns_to_keep)
    usda.to_csv(filename, index=False)
else:
    usda = pd.read_csv(filename)

usda = usda.set_index(columns_to_keep[0])
print('Loaded', len(usda), 'publications')

Loaded 21908 publications


- Only consider the USDA papers where at least one author is geographically associated to the city/state of the university
- This way we discard people in USDA units outside the state that could be wrongly added later
    - E.g. someone who published as a PhD student/postdoc at MU but then moved to USDA in Florida

In [31]:
isplace = np.zeros(len(usda), dtype=bool)
for i,idx in enumerate(usda.index):
    isplace[i] = any([place in usda.loc[idx, 'Authors (Raw Affiliation)'].lower() for place in places])
usda = usda.iloc[isplace]
print('Only kept', len(usda), 'publications that are geographically associated to', places)
#usda.head()

Only kept 1308 publications that are geographically associated to ['columbia', 'missouri', '[see note]']


- Apply the same criteria to only look at plant-specific papers
- Drop the papers with missing corresponding author information

In [32]:
usda_isplantjournal = utils.mask_plant_journals(usda, plantssns)
print('Plant-specific journal publications:\t',np.sum(usda_isplantjournal))
usda_isplantanz = utils.mask_plant_anzsrc(usda, ANZSRC)
print('Categorized as plant-specific:\t',np.sum(usda_isplantanz))
usda_isplantmesh = utils.count_plant_mesh(usda, meshterms)

usda_isplant = (usda_isplantjournal + (usda_isplantmesh >= min_kws))
usdata = usda.iloc[usda_isplant][['Title', 'Source title', 'Authors', 'Corresponding Authors']]
usdata = usdata[~pd.isna(usdata['Corresponding Authors'])]
print('Only kept', len(usdata), 'papers with corresponding author')

Plant-specific journal publications:	 232
Categorized as plant-specific:	 200
Only kept 274 papers with corresponding author


- Count only those papers with corresponding author who is affiliated to USDA
- Exclude the papers were the corresponding author is also affiliated to the university: avoid double counting
- Do fuzzy matching to drop duplicate names
- Keep a dictionary with all the altername, fuzzy spellings of author names

In [33]:
usda_institutes = ['United States Department of Agriculture', 'Agricultural Research Service', 'Biological Control of Insects Research']
uspnum, _ = utils.corresponding_authors_from_institute(usdata, usda_institutes, exclude_list=institutes)
print('Found', len(uspnum),'different corresponding authors in', len(idx), 'publications')

uspnum = utils.remove_long_corresponding(uspnum)
print('Reduced to', len(uspnum),'corresponding authors')
usfz = utils.fuzzy_matrix(uspnum)
usda_pubs, usda_fzdict = utils.fuzzymatching_authors(uspnum, usfz, tol)

Found 45 different corresponding authors in 14 publications
Dropped:
Index([], dtype='object')
--
Reduced to 45 corresponding authors
Started with:	 45 

Best, Norman B	-->	['Best, Norman']

After matching:	44


- Add the USDA papers to authors that have at least one corresponding paper with the university
- Refine the fuzzy dictionary by only keeping authors (and their fuzzy spellings) associated to the university

In [34]:
usda_fznames = dict()
choices = usda_pubs.index.values
for name in pnums.index:
    match, fscore, _ = process.extractOne(name, choices, scorer=fuzz.partial_ratio)
    if fscore >= 99:
        print(name, '--->', match, '[{:.2f}]'.format(fscore), name==match, sep='\t')
        usda_fznames[name] = usda_fzdict[match]
        pnums[name] += usda_pubs[match]

Best, Norman B	--->	Best, Norman B	[100.00]	True
Bilyeu, Kristin D	--->	Bilyeu, Kristin	[100.00]	False
Das, Debatosh	--->	Das, Debatosh	[100.00]	True
Gillman, Jason D	--->	Gillman, Jason D	[100.00]	True
Hibbard, Bruce E	--->	Hibbard, Bruce E	[100.00]	True
Pereira, Adriano E	--->	Pereira, Adriano E	[100.00]	True
Shelby, Kent S	--->	Shelby, Kent S	[100.00]	True
Washburn, Jacob D	--->	Washburn, Jacob D	[100.00]	True


# Saving results

### Determine which authors make the cut

- How many papers are required to be considered an "IPG" member?
- **Update: We settled with 2 papers as minimum**
    - It is a compromise to include faculty that joined recently but exclude most of the false positives (students, postdocs, scientists)
- List of "IPG" authors sorted by number of publications and then by alphabetical order

In [35]:
ipg_list = pnums.to_frame('N').reset_index(names='names').sort_values(by=['N','names'], ascending=[False,True]).set_index('names').squeeze()
ipg_list = ipg_list[ipg_list >= 2]
ipg_list.to_csv(src + institute + '_IPG.csv', index=True, index_label='Corresponding Authors', header=['Pubs Num'])
ipg_list

names
Nguyen, Henry T           57
Mittler, Ron              43
Meyers, Blake C           33
Stacey, Gary              33
Birchler, James A         25
                          ..
Sharp, Robert E            2
Shergill, Lovreet S        2
Swain, Durga Madhab        2
Voothuluru, Priyamvada     2
Ye, Heng                   2
Name: N, Length: 88, dtype: int64

### Additional USDA papers that should be added to the analysis

- Going back to the list of all USDA papers that are geographically associated to the university location
- Take the subset of papers where at least one author is a joint USDA - IPG member (e.g. Norman B Best)
- Since we now know all the fuzzy spellings for every name, we don't need to repeat the fuzzy match
- We can match quickly and directly [with `.str.contains`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html#pandas.Series.str.contains) and having the name end in `;`.
- We also [use `.str.endswith`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.endswith.html#pandas.Series.str.endswith) for when the name is the last one (so it doesn't have a `;`)
- **Except** when the fuzzy spelling has a first name consisting of only initials
    - `Smith, D` could be David Smith or Diana Smith
    - In that case, we limit our possible matching to the plant-specific subset of papers
    - I assume that there is only one D. Smith who also is a plant scientist

In [47]:
usda_plant = usda.loc[usda_isplant]
usda_ids = []

for author in usda_fznames:
    for name in usda_fznames[author]:
        lname, fname = name.split(', ')

        # If the first name is NOT only initials
        # Match to the whole dataset
        if any([ L != 1 for L in map(len,fname.split(' ')) ]):
            papers = usda.loc[usda['Authors'].str.contains(name+';', regex=False)]
            papers = pd.concat( (papers, usda.loc[usda['Authors'].str.endswith(name)]), axis=0)
            usda_ids += papers.index.tolist()
        else:
            # Else, match only with plant-specific papers
            papers = usda_plant.loc[usda_plant['Authors'].str.contains(name+';', regex=False)]
            papers = pd.concat( (papers, usda_plant.loc[usda_plant['Authors'].str.endswith(name)]), axis=0)
            usda_ids += papers.index.tolist()

# Get unique IDs
usda_ids = list(set(usda_ids))
usda_extra = usda.loc[np.setdiff1d( usda_ids, df.index)]
print('There are', len(usda_extra), 'USDA papers that should be associated to', institute,'but are not')

filename = src + institute + '_plus_' + usda_institute + '_Pubs.csv'
print('These papers are listed in', institute + '_plus_' + usda_institute + '_Pubs.csv')
usda_extra.to_csv(filename, index=True, index_label=usda_extra.index.name)

There are 33 USDA papers that should be associated to MU but are not
These papers are listed in MU_plus_USDA_Pubs.csv


### Dictionary with fuzzy name spellings

- Add fuzzy spellings that were found in the USDA dataset
- Only keep fuzzy spellings from "IPG" members
- Save the collection of fuzzy spellings as a JSON file to be loaded later

In [38]:
filename = src + institute + '_fuzzy_names.json'
for name in usda_fznames:
    for alt in usda_fznames[name]:
        if alt not in fznames[name]:
            fznames[name].append(alt)

fznames = dict((name, fznames[name]) for name in ipg_list.index)
with open(filename, 'w') as f:
    json.dump(fznames, f,indent=1, separators=(',', ':'))


----

# Example on how to find all the papers published by one author

In [46]:
df_plant = df.loc[isplant]
ids = []

author = ipg_list.index[0]
print('Looking at',author)
print('Possible spellings:\t', fznames[author],'\n---')

for name in fznames[author]:
    lname, fname = name.split(', ')
    if any([ L != 1 for L in map(len,fname.split(' ')) ]):
        papers = df.loc[df['Authors'].str.contains(name+';', regex=False)]
        papers = pd.concat( (papers, df.loc[df['Authors'].str.endswith(name)]), axis=0)
        ids += papers.index.tolist()
        print('All:', name, len(papers), sep='\t')
    else:
        papers = df_plant.loc[df_plant['Authors'].str.contains(name+';', regex=False)]
        papers = pd.concat( (papers, df_plant.loc[df_plant['Authors'].str.endswith(name)]), axis=0)
        ids += papers.index.tolist()
        print('Plant:', name, len(papers), sep='\t')

# Remove duplicate IDs (if any)
ids = list(set(ids))
print('---\nTotal papers\t',len(ids))

Looking at Nguyen, Henry T
Possible spellings:	 ['Nguyen, Henry T', 'T Nguyen, Henry', 'Nguyen, H T', 'Nguyen, Henry'] 
---
All:	Nguyen, Henry T	169
All:	T Nguyen, Henry	1
Plant:	Nguyen, H T	13
All:	Nguyen, Henry	21
---
Total papers	 203


In [48]:
df.loc[ids].head()

Unnamed: 0_level_0,Title,Abstract,Source title,ISSN,Publisher,MeSH terms,PubYear,Open Access,Publication Type,Document Type,...,Funder,Funder Group,Funder Country,Times cited,RCR,FCR,Altmetric,Fields of Research (ANZSRC 2020),Units of Assessment,Sustainable Development Goals
Publication ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
pub.1122022308,Impacts of genomic research on soybean improve...,It has been commonly accepted that soybean dom...,Theoretical and Applied Genetics,"0040-5752, 1432-2242",Springer Nature,"asia, eastern; genome, plant; genomics; phenot...",2019,All OA; Hybrid,Article,Review Article,...,University Grants Committee,,China,61,1.99,11.49,8.0,"30 Agricultural, Veterinary and Food Sciences;...","A06 Agriculture, Veterinary and Food Science",2 Zero Hunger
pub.1033945056,Genetic variants in root architecture-related ...,BackgroundRoot system architecture is importan...,BMC Genomics,1471-2164,Springer Nature,"alleles; chromosome mapping; genome, plant; pl...",2015,All OA; Gold,Article,Research Article,...,Directorate for Biological Sciences; National ...,US Federal Funders; NSF - National Science Fou...,United States; United States,82,1.74,7.8,7.0,31 Biological Sciences; 3105 Genetics,"A06 Agriculture, Veterinary and Food Science",
pub.1135327031,Development of an automated plant phenotyping ...,Plant high-throughput phenotyping technology i...,Computers and Electronics in Agriculture,"0168-1699, 1872-7107",Elsevier,,2021,Closed,Article,Research Article,...,China Scholarship Council,,China,29,,7.77,,"30 Agricultural, Veterinary and Food Sciences;...","A06 Agriculture, Veterinary and Food Science",
pub.1130186383,Mapping Quantitative Trait Loci for Soybean Se...,Wild soybean species (Glycine soja Siebold & Z...,Frontiers in Plant Science,1664-462X,Frontiers,,2020,All OA; Gold,Article,Research Article,...,National Institute of Food and Agriculture,US Federal Funders; USDA - United States Depar...,United States,32,1.19,6.65,3.0,"30 Agricultural, Veterinary and Food Sciences;...","A06 Agriculture, Veterinary and Food Science",
pub.1041562973,Identification and Comparative Analysis of Dif...,Drought and flooding are two major causes of s...,Frontiers in Plant Science,1664-462X,Frontiers,,2016,All OA; Gold,Article,Research Article,...,Agricultural Research Service,US Federal Funders; USDA - United States Depar...,United States,148,4.46,14.53,5.0,31 Biological Sciences; 3102 Bioinformatics an...,A05 Biological Sciences,


---

# Ignore below

In [129]:
pd.unique(foo['Fields of Research (ANZSRC 2020)'])

array(['31 Biological Sciences; 3108 Plant Biology; 37 Earth Sciences',
       '31 Biological Sciences; 3103 Ecology; 3108 Plant Biology',
       '30 Agricultural, Veterinary and Food Sciences; 3004 Crop and Pasture Production; 31 Biological Sciences; 3101 Biochemistry and Cell Biology; 3108 Plant Biology',
       '30 Agricultural, Veterinary and Food Sciences; 3007 Forestry Sciences; 31 Biological Sciences; 3103 Ecology'],
      dtype=object)

In [15]:
pd.unique(foo['MeSH terms'])

array([nan,
       'plant roots; water; abscisic acid; plant growth regulators; droughts; cell wall; soil; dehydration'],
      dtype=object)

In [40]:
iscorr = np.zeros(len(usda), dtype=bool)
for i, idx in enumerate(usda.index):
    if not pd.isna(usda.iloc[i]['Corresponding Authors']):
        iscorr[i] = 'Gillman, Jason' in usda.loc[idx, 'Corresponding Authors']
print(np.sum(iscorr))
foo = usda.loc[iscorr, ['Title', 'Source title', 'Authors', 'MeSH terms', 'Corresponding Authors']]
foo.head(20)

5


Unnamed: 0_level_0,Title,Source title,Authors,MeSH terms,Corresponding Authors
Publication ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
pub.1132771180,Quantitative trait locus mapping for resistanc...,Crop Science,"Gillman, Jason D; Chebrolu, Kranthi; Smith, Ja...",,"Gillman, Jason D (Agricultural Research Servic..."
pub.1182796904,Association mapping for water use efficiency i...,Frontiers in Plant Science,"Chamarthi, Siva K; Purcell, Larry C; Fritschi,...",,"Gillman, Jason D (University of MissouriColumbia)"
pub.1120401495,A seed germination transcriptomic study contra...,BMC Research Notes,"Gillman, Jason D; Biever, Jessica J; Ye, Songq...","adaptation, physiological; gene expression pro...","Gillman, Jason D (University of MissouriColumbia)"
pub.1018220376,Impact of heat stress during seed development ...,Metabolomics,"Chebrolu, Kranthi K; Fritschi, Felix B; Ye, So...",,"Gillman, Jason D (University of MissouriColumbia)"
pub.1079357170,Genotyping-by-Sequencing-Based Investigation o...,"G3: Genes, Genomes, Genetics","Heim, Crystal B; Gillman, Jason D","base sequence; genes, plant; genotype; mutatio...","Gillman, Jason D (University of MissouriColumb..."


In [41]:
foo.loc['pub.1079357170', 'MeSH terms']

'base sequence; genes, plant; genotype; mutation, missense; phenotype; plant proteins; seeds; soybean oil; glycine max; stearic acids'

In [50]:
pd.unique(foo['Research Organizations - standardized'])

array(['Ateneo de Manila University; Agricultural Research Service - Midwest Area; University of Missouri-Columbia',
       'Iowa State University of Science and Technology; Agricultural Research Service - Midwest Area',
       'Agricultural Research Service - Midwest Area; Purdue University West Lafayette',
       'Agricultural Research Service - Midwest Area; University of Missouri-Columbia',
       'Agricultural Research Service - Midwest Area; Purdue University West Lafayette; Lewis Clark State College'],
      dtype=object)

In [175]:
jkw = ['Rhiz']
isjournal = np.zeros(len(df), dtype=bool)
for i,idx in enumerate(df.index):
    if not pd.isna(df.loc[idx, 'Source title']):
        foo = df.loc[idx, 'Source title']
        isjournal[i] = any([ kw in foo for kw in jkw ])

print(np.sum(isjournal))
uq = [ x.upper() for x in pd.unique(df.loc[isjournal, 'Source title']) ]
print(len(uq))
uq, idx = np.unique(df.loc[isjournal, 'Source title'].values, return_index=True)
idx = df.loc[isjournal, 'Source title'].index[idx].values
bar = df.loc[idx, ['Source title', 'ISSN']]
bar['Source title'] = bar['Source title'].str.upper()
bar = bar.set_index('Source title')
bar.loc[ np.setdiff1d(bar.index, journals['Journal']) ]

1
1


Unnamed: 0_level_0,ISSN
Source title,Unnamed: 1_level_1


In [177]:
df.loc[df['Source title'] == 'Crops', ['Title', 'Authors', 'MeSH terms', 'Corresponding Authors']]

Unnamed: 0_level_0,Title,Authors,MeSH terms,Corresponding Authors
Publication ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
pub.1181215363,Application of Pyroligneous Acid as a Plant Gr...,"Noel, Randi; Schueller, Michael J; Guthrie, Ja...",,"Ferrieri, Richard A (University of MissouriCol..."


In [38]:
for i in range(len(df.columns)):
    print(i, df.columns[i], sep='\t')

0	Rank
1	Publication ID
2	DOI
3	PMID
4	PMCID
5	ISBN
6	Title
7	Abstract
8	Acknowledgements
9	Funding
10	Source title
11	Anthology title
12	Book editors
13	Publisher
14	ISSN
15	MeSH terms
16	Publication date
17	PubYear
18	Publication date (online)
19	Publication date (print)
20	Volume
21	Issue
22	Pagination
23	Open Access
24	Publication Type
25	Document Type
26	Authors
27	Authors (Raw Affiliation)
28	Corresponding Authors
29	Authors Affiliations
30	Research Organizations - standardized
31	GRID IDs
32	City of standardized research organization
33	State of standardized research organization
34	Country of standardized research organization
35	Funder
36	Funder Group
37	Funder Country
38	Grant IDs of Supporting Grants
39	Supporting Grants
40	Times cited
41	Recent citations
42	RCR
43	FCR
44	Altmetric
45	Source Linkout
46	Dimensions URL
47	Fields of Research (ANZSRC 2020)
48	RCDC Categories
49	HRCS HC Categories
50	HRCS RAC Categories
51	Cancer Types
52	CSO Categories
53	Units of Assessment
54	

In [46]:
meshs = df.loc[~isplant, 'MeSH terms']
meshs = meshs[~pd.isna(meshs)]
mesh = set()
for i in range(len(meshs)):
    mesh |= set(meshs.iloc[i].split('; '))
mesh = sorted(list(mesh))
#pd.Series(mesh).to_csv(src + 'mesh.txt', index=False, header=False, sep='\n')

In [109]:
meshs = df.loc[isplant, 'MeSH terms']
meshs = meshs[~pd.isna(meshs)]
mesh = set()
for i in range(len(meshs)):
    mesh |= set(meshs.iloc[i].split('; '))
mesh = sorted(list(mesh))
print(len(mesh))
pd.Series(mesh).to_csv(src + 'plant_MU_mesh.txt', index=False, header=False, sep='\n')

1274


In [158]:
foo = df.loc[~isplant]
bar = foo.iloc[ isplantmesh[~isplant] > 2 ][['Publication ID', 'MeSH terms', 'Title', 'Source title', 'Corresponding Authors']]
print(bar.shape)

(99, 5)


In [216]:
foo = pnum.to_frame(name='Pubs Num')
foo['Length'] = np.array(list(map(len,pnum.index.values)))
foo['Names'] = foo.index.values
foo = foo.sort_values(by=['Pubs Num', 'Length', 'Names'], ascending=[False, False, True])
foo

Unnamed: 0,Pubs Num,Length,Names
"Nguyen, Henry T",58,15,"Nguyen, Henry T"
"Mittler, Ron",51,12,"Mittler, Ron"
"Meyers, Blake C",36,15,"Meyers, Blake C"
"Stacey, Gary",34,12,"Stacey, Gary"
"Birchler, James A",28,17,"Birchler, James A"
...,...,...,...
"Li, Song",1,8,"Li, Song"
"Qin, Hua",1,8,"Qin, Hua"
"Song, Li",1,8,"Song, Li"
"Chen, P",1,7,"Chen, P"


In [85]:
name1, name2 = 'Stacey, Minviluz G', 'Stacey, Gary'

lname1, fname1 = name1.split(', ')
inits1 = [x[0] for x in fname1.split(' ')]

lname2, fname2 = name2.split(', ')
inits2 = [x[0] for x in fname2.split(' ')]

#If none of the initials match, then assume that names are not equal and move on
if not any([x in inits1 for x in inits2]):
    print('No initial matching')

fuzzscore = fuzz.ratio(name1.casefold(), name2.casefold())
if fuzzscore >= 98:
    print('Fuzzscore', fuzzscore)

# Add blank spaces for first names reduced to intials
# e.g. Riedell, WE --> Riedell, W E
                
fname1 = utils.add_blanks(fname1)
inits1 = [x[0] for x in fname1.split(' ')]

fname2 = utils.add_blanks(fname2)
inits2 = [x[0] for x in fname2.split(' ')]

print('Last name: ', lname1, ' --\tFirst name: ', fname1, ' --\tInitials: ',inits1, sep='')
print('Last comp: ', lname2, ' --\tFirst comp: ', fname2, ' --\tInitials: ',inits2, sep='')

Last name: Stacey --	First name: Minviluz G --	Initials: ['M', 'G']
Last comp: Stacey --	First comp: Minviluz G --	Initials: ['M', 'G']
