This folder contains code for building a deduplicated list of NSF and NIH investigators. See the README for instructions on downloading the cleaned grant dataset from Amazon S3. Here, we assume that you have downloaded the dataset from S3 into the `data` folder (`data/grid`, `data/nih`, and `data/nsf`) and that you run the scripts from this folder.
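
Before running any of the scripts, it can help to confirm that the data folders are in place. The following is a minimal sketch, assuming the `data/grid`, `data/nih`, and `data/nsf` layout described above; the helper name `missing_data_folders` is hypothetical and is not part of this repository.

```python
from pathlib import Path

# Subfolders the scripts are assumed to read from, per the layout described above.
EXPECTED_SUBFOLDERS = ("grid", "nih", "nsf")


def missing_data_folders(data_root="data"):
    """Return the expected subfolders that are absent under data_root."""
    root = Path(data_root)
    return [name for name in EXPECTED_SUBFOLDERS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_data_folders()
    if missing:
        print("Missing data folders:", ", ".join(missing))
    else:
        print("All expected data folders are present.")
```

If any folder is reported missing, download the dataset from S3 again before continuing.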
- NSF investigators dedupe: the code for deduplicating NSF investigators is in `dedupe_nsf_investigator.py`. Run `python dedupe_nsf.py` to run the active learning part.
- Investigators linkage: record linkage between NIH and NSF investigators is in `link_investigator.py`. We assume every NIH investigator has a unique applicant ID, so NIH investigators do not need to be deduplicated.
- Affiliations dedupe: run `python dedupe_affiliation.py` to dedupe affiliations across NIH and NSF grants. The deduplication process has many tunable parameters; list them with `python dedupe_affiliation.py -h`. For example, to skip the console labeling step, run `python dedupe_affiliation.py --skiplabel`. This script produces two main files, `institutions_disambiguated.csv` and `application_vs_affiliation.csv` (plus a `json` file produced by `dedupe`).
- Unify NIH and NSF grants: run `python unify_grants.py`. This picks up the files produced by `dedupe_affiliation.py` in this folder and merges all the datasets together.
- Affiliations linkage: the record linkage script between the deduped NIH and NSF affiliations and the GRID database is in `link_affiliation.py`.
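
Because `unify_grants.py` reads the files that `dedupe_affiliation.py` writes to this folder, the two steps have a fixed order. The sketch below is one way to plan that order; it is not part of the repository. The helper names `missing_affiliation_outputs` and `plan_commands` are hypothetical, and the choice to skip the affiliation dedupe when both CSV files already exist is an assumption. The script only prints the commands rather than executing them.

```python
from pathlib import Path

# Output files that dedupe_affiliation.py is described as producing.
AFFILIATION_OUTPUTS = (
    "institutions_disambiguated.csv",
    "application_vs_affiliation.csv",
)


def missing_affiliation_outputs(folder="."):
    """Return the affiliation output files that are not present in folder."""
    base = Path(folder)
    return [name for name in AFFILIATION_OUTPUTS if not (base / name).is_file()]


def plan_commands(folder=".", skip_label=False):
    """Return the commands to run, in order, to reach the unified grants.

    Assumption: existing affiliation outputs are reused instead of regenerated.
    """
    commands = []
    if missing_affiliation_outputs(folder):
        dedupe_command = ["python", "dedupe_affiliation.py"]
        if skip_label:
            # --skiplabel skips the console labeling step, as noted above.
            dedupe_command.append("--skiplabel")
        commands.append(dedupe_command)
    commands.append(["python", "unify_grants.py"])
    return commands


if __name__ == "__main__":
    # Dry run: print the planned commands without executing them.
    for command in plan_commands():
        print(" ".join(command))
```

To execute the plan instead of printing it, each command list could be passed to `subprocess.run(command, check=True)` from this folder.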