Dedupe grant dataset

This folder contains code for creating a deduplicated list of NSF and NIH investigators. See the README for instructions on downloading the cleaned grant dataset from Amazon S3. Here, we assume that you have downloaded the dataset from S3 into the data folder (data/grid, data/nih and data/nsf) and that you run the scripts from this folder.
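Since the scripts assume this folder layout, it can help to check it before running anything. The following is a minimal sketch, not part of the repository: the function name `missing_data_dirs` and the default root `data` are hypothetical, and only the three subfolder names come from the paragraph above.

```python
from pathlib import Path

# Subfolders named in this README. The function name and the default
# root below are hypothetical; adjust them to your checkout.
EXPECTED_SUBDIRS = ("grid", "nih", "nsf")


def missing_data_dirs(data_root="data"):
    """Return the expected subfolders that do not exist under data_root."""
    root = Path(data_root)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]


missing = missing_data_dirs()
if missing:
    print("Missing data folders:", ", ".join(missing))
```

An empty list means all three folders are present; the check does not verify their contents.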

  • NSF investigators dedupe: the code for deduplicating NSF investigators is in dedupe_nsf_investigator.py. Run python dedupe_nsf.py to run the active learning part.

  • Investigators linkage: the record linkage between NIH and NSF investigators is in link_investigator.py. We assume every NIH investigator has a unique applicant ID, so NIH investigators do not need to be deduplicated.

  • Affiliations dedupe: run python dedupe_affiliation.py to dedupe affiliations across NIH and NSF grants. Many parameters of the deduplication process can be adjusted; list them with python dedupe_affiliation.py -h. For example, to skip the console labeling step, run python dedupe_affiliation.py --skiplabel. The script produces two main files, institutions_disambiguated.csv and application_vs_affiliation.csv, plus a JSON file produced by dedupe.

  • Unify NIH and NSF grants: run python unify_grants.py. This takes the files produced by dedupe_affiliation.py in this folder and merges the datasets together.

  • Affiliations linkage: the record linkage script between the deduped NIH and NSF affiliations and the GRID database is link_affiliation.py.
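The commands given in the list above can be chained from a small driver script. This is a hedged sketch rather than something shipped with the repository: `build_commands` and `run_pipeline` are hypothetical names, and `sys.executable` stands in for `python`. The only ordering stated above is that unify_grants.py uses the output of dedupe_affiliation.py; placing dedupe_nsf.py first simply follows the order of the list and is an assumption. link_investigator.py and link_affiliation.py are left out because no command line is given for them here.

```python
import subprocess
import sys


def build_commands(skip_label=False):
    """Build the command lines for the steps this README gives commands for."""
    affiliation_cmd = [sys.executable, "dedupe_affiliation.py"]
    if skip_label:
        # --skiplabel is the flag this README mentions for skipping console labeling.
        affiliation_cmd.append("--skiplabel")
    return [
        [sys.executable, "dedupe_nsf.py"],    # position in the sequence is an assumption
        affiliation_cmd,                      # must run before unify_grants.py
        [sys.executable, "unify_grants.py"],  # merges the files from the previous step
    ]


def run_pipeline(skip_label=False):
    """Run each step in turn from this folder, stopping at the first failure."""
    for cmd in build_commands(skip_label):
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)  # raises CalledProcessError on a non-zero exit
```

Calling `run_pipeline(skip_label=True)` from this folder would run the three steps without the console labeling prompt for affiliations; dedupe_nsf.py may still prompt for labels during its active learning part.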