# NSF COA author/affiliation tool

Inspired by [this awesome tool](https://github.com/ejfertig/NSFBiosketch) from Dr. Elana Fertig. I couldn't get it running in time because of a Java install problem with the xlsx package in my perpetually infuriating R environment, so I whipped up something similar for the Pythonistas.

This tool takes a list of PMIDs and returns each author's name and affiliation, along with the year of the most recent publication you share with them.

In [40]:
import pandas as pd
from pymed import PubMed
from time import sleep

## Import papers
Import a list of your publication PMIDs from a plaintext file with one PMID per line.

In [23]:
with open('PMID-export.txt', 'r') as f:
    # Keep one PMID per line, skipping any blank lines
    pmids = [line.strip() for line in f if line.strip()]

In [24]:
pmids

['31911491',
 '31629686',
 '31792218',
 '31672156',
 '31616586',
 '31154984',
 '31239396',
 '31132110',
 '30822347',
 '29995839',
 '30275573',
 '30059804',
 '30109092',
 '29982531',
 '29795328',
 '29921959',
 '29511180',
 '29482639',
 '29282061',
 '28977692',
 '29088705',
 '28985400',
 '28393425',
 '28026894',
 '28234579',
 '28230052',
 '27801992',
 '27481780',
 '27154284']

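PMIDs are assigned roughly in the order records enter PubMed, so we sort them numerically as a stand-in for chronological order. Later papers then overwrite earlier ones, which gives us the most recent conflict date per author.

In [86]:
pmids.sort(key=int)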
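
A plain `sort()` on strings happens to work for the list above because every PMID has eight digits. As a minimal sketch of why a numeric key is safer, here is what happens with PMIDs of different lengths (the IDs below are made up for illustration):

```python
# Hypothetical PMIDs of different lengths, chosen only to illustrate the point.
example_pmids = ['31911491', '9876543', '27154284']

# A plain string sort compares character by character, so '9876543' lands last.
as_strings = sorted(example_pmids)

# Sorting with key=int compares numeric values, so the 7-digit ID comes first.
as_numbers = sorted(example_pmids, key=int)

print(as_strings)
print(as_numbers)
```

If your publication list goes back far enough to include shorter PMIDs, the string sort would put your oldest papers last and record stale dates.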

In [31]:
# Create a PubMed client object to run queries with
# Note that the parameters are not required but kindly requested by PubMed Central
# https://www.ncbi.nlm.nih.gov/pmc/tools/developers/

pubmed = PubMed(tool="BioSketchify", email="my@email.address")


## Retrieve and parse PubMed entries

Query PubMed one publication at a time, and parse the author and affiliation list.

NCBI limits how often you can hit its API, so we pause for a second between queries.
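
A fixed `sleep(1)` is the simplest way to stay under the limit. If you have a long publication list and want to wait only as long as needed, a small throttle is one alternative. This is a sketch, not part of pymed: the `Throttle` class and its `min_interval` parameter are hypothetical names, and the short interval is only there to keep the demo quick. NCBI's E-utilities guidelines ask for no more than three requests per second without an API key, so pick an interval that respects that.

```python
import time


class Throttle:
    """Space out calls so they are at least `min_interval` seconds apart."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last_call = None

    def wait(self):
        # Sleep only for whatever part of the interval has not already elapsed.
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


# Demo with a short interval; the first call never waits.
throttle = Throttle(min_interval=0.05)
start = time.monotonic()
for _ in range(3):
    throttle.wait()
elapsed = time.monotonic() - start
print(round(elapsed, 2))
```

In the query loop you would call `throttle.wait()` right before each request instead of calling `sleep(1)` after it.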

In [87]:
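authors = {}

for pmid in pmids:
    results = pubmed.query(pmid, max_results=1)
    for article in results:
        year = article.publication_date.year
        for author in article.authors:
            name = '%s, %s' % (author['lastname'], author['firstname'])
            affiliation = author['affiliation']
            # Papers processed later (more recent) overwrite earlier entries
            authors[name] = (year, affiliation)
        # Print inside the loop so an empty result can't raise a NameError
        print(article.title)
    # Pause between requests to respect the API rate limit
    sleep(1)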

Cephaloticoccus gen. nov., a new genus of 'Verrucomicrobia' containing two novel species isolated from Cephalotes ant guts.
Dissecting host-associated communities with DNA barcodes.
Gut microbiota of dung beetles correspond to dietary specializations of adults and larvae.
By their own devices: invasive Argentine ants have shifted diet without clear aid from symbiotic microbes.
Unraveling the processes shaping mammalian gut microbiomes over evolutionary time.
Corrigendum: Cephaloticoccus gen. nov., a new genus of 'Verrucomicrobia' containing two novel species isolated from Cephalotes ant guts.
The structured diversity of specialized gut symbionts of the New World army ants.
Ant-plant mutualism: a dietary by-product of a tropical ant's macronutrient requirements.
Dramatic Differences in Gut Bacterial Densities Correlate with Diet and Habitat in Rainforest Ants.
A communal catalogue reveals Earth's multiscale microbial diversity.
The human microbiome in evolution.
Improving saliva shotgun
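
The loop relies on processing order to end up with the most recent year per author. Since PMID order only approximates publication date, one alternative is to compare years explicitly before overwriting. A minimal sketch, where `record_author` is a hypothetical helper and the names and affiliations are placeholders:

```python
def record_author(authors, name, year, affiliation):
    """Store (year, affiliation) for `name` unless a newer entry already exists."""
    previous = authors.get(name)
    if previous is None or year >= previous[0]:
        authors[name] = (year, affiliation)


# Placeholder data: the 2017 entry arrives after the 2019 one and is ignored.
authors_example = {}
record_author(authors_example, 'Doe, Jane', 2019, 'Example University')
record_author(authors_example, 'Doe, Jane', 2017, 'Old Institute')
record_author(authors_example, 'Roe, Sam', 2018, None)

print(authors_example['Doe, Jane'])
```

With this in place the result no longer depends on the order in which the PMIDs are queried.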

Make an author dataframe, with blank columns for "Organization" and "Department"

In [88]:
author_df = pd.DataFrame.from_dict(authors, orient='index', columns=['year','affiliation'])
author_df['Organization'] = ''
author_df['Department'] = ''

author_df.head()

| | year | affiliation | Organization | Department |
|---|---|---|---|---|
| Lin, Jonathan Y | 2017 | Department of Biology, Calvin College, Grand R... | | |
| Russell, Jacob A | 2018 | Department of Biology, Drexel University. | | |
| Sanders, Jon G | 2020 | Department of Pediatrics, School of Medicine, ... | | |
| Wertz, John T | 2018 | Department of Biology, Calvin College, Grand R... | | |
| Baker, Christopher C M | 2016 | Department of Organismic and Evolutionary Biol... | | |


## Split affiliation into department and organization

This step is optional, but PubMed stores each affiliation as a single string, while NSF asks for 'Organization' in its own column. The cell below loops over the author dataframe, shows you each comma-separated chunk of the 'affiliation' value, and prompts for input. Enter 1 to store the chunk in the 'Department' column, 2 to store it in the 'Organization' column, or anything else to skip to the next author.

It only prompts for authors whose required 'Organization' column is still empty, so if you skip one by mistake, re-running the cell picks up where you left off.
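
If you have a lot of co-authors, you could pre-fill a guess before going through the prompts. The sketch below is a rough keyword heuristic, not anything PubMed or NSF provides: `guess_split` and both hint lists are hypothetical, and the guesses still need checking by eye.

```python
# Hypothetical keyword lists; extend them to match your own co-authors.
DEPARTMENT_HINTS = ('department', 'school of', 'division', 'faculty', 'laboratory')
ORGANIZATION_HINTS = ('university', 'college', 'institute', 'museum', 'hospital')


def guess_split(affiliation):
    """Guess (department, organization) from a comma-separated affiliation."""
    department, organization = [], []
    for chunk in affiliation.split(','):
        chunk = chunk.strip().rstrip('.')
        lowered = chunk.lower()
        # Department hints are checked first, so 'School of ...' is not
        # mistaken for an organization.
        if any(hint in lowered for hint in DEPARTMENT_HINTS):
            department.append(chunk)
        elif any(hint in lowered for hint in ORGANIZATION_HINTS):
            organization.append(chunk)
        # Chunks matching neither list (cities, postcodes) are dropped.
    return ', '.join(department), ', '.join(organization)


guess = guess_split('Department of Biology, Drexel University.')
print(guess)
```

Affiliations that match neither list come back as empty strings, so the interactive cell will still prompt for them.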

In [82]:
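print("Enter 1 for Department, 2 for Organization, or nothing to skip rest")

for i, author in author_df.iterrows():
    # Skip authors that already have an Organization assigned
    if author['Organization'] != '':
        continue
    # Authors with no affiliation string have nothing to split
    if not isinstance(author['affiliation'], str):
        continue
    for bit in author['affiliation'].split(','):
        print(bit)
        choice = input("Input:")
        if choice == '1':
            # Append the chunk, dropping the leading space left by split(',')
            author_df.loc[i, 'Department'] = (author_df.loc[i, 'Department'] + bit).lstrip()
        elif choice == '2':
            author_df.loc[i, 'Organization'] = (author_df.loc[i, 'Organization'] + bit).lstrip()
        else:
            # Any other input moves on to the next author
            break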

Enter 1 for Department, 2 for Organization, or nothing to skip rest
Lajuma Research Centre
Input:2
 Louis Trichardt (Makhado)
Input:
Estacion Biologica Corrientes (MACN-BR) - CONICET
Input:2
 Corrientes
Input:


In [84]:
author_df.head()

| | year | affiliation | Organization | Department |
|---|---|---|---|---|
| Lin, Jonathan Y | 2017 | Department of Biology, Calvin College, Grand R... | Calvin College | Department of Biology |
| Russell, Jacob A | 2018 | Department of Biology, Drexel University. | Drexel University. | Department of Biology |
| Sanders, Jon G | 2020 | Department of Pediatrics, School of Medicine, ... | University of California San Diego | Department of Pediatrics School of Medicine |
| Wertz, John T | 2018 | Department of Biology, Calvin College, Grand R... | Calvin College | Department of Biology |
| Baker, Christopher C M | 2016 | Department of Organismic and Evolutionary Biol... | Harvard University | Department of Organismic and Evolutionary Biology |


## Export author dataframe to CSV file

You can now open this in your favorite spreadsheet program to clean it up and add it to the NSF workbook.

In [85]:
author_df.to_csv('authors_with_affiliations.csv')
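
Before pasting into the workbook, you may want to drop co-authors who fall outside the conflict window. NSF's COA instructions have asked for co-authors from the last 48 months, but treat that window as an assumption and check the current template. The sketch below works on made-up rows with only the standard library: `recent_rows` and `cutoff_year` are hypothetical names, and the first column is called `name` here, whereas the file written above has an empty header for that column.

```python
import csv
import io


def recent_rows(rows, cutoff_year):
    """Return only the rows whose 'year' value is at or after `cutoff_year`."""
    return [row for row in rows if int(row['year']) >= cutoff_year]


# Made-up stand-in for the exported file, using placeholder authors.
sample_csv = io.StringIO(
    'name,year,affiliation,Organization,Department\n'
    '"Doe, Jane",2019,"Department of Biology, Example University",'
    'Example University,Department of Biology\n'
    '"Roe, Sam",2014,Example Institute,Example Institute,\n'
)

rows = list(csv.DictReader(sample_csv))
kept = recent_rows(rows, cutoff_year=2016)
print([row['name'] for row in kept])
```

Because only the publication year is stored, the cutoff is approximate; it is safer to err on the side of keeping an author.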