
Getting identity data

paulalbert1 edited this page Feb 8, 2020 · 10 revisions

ReCiter's accuracy is tied to an institution's ability to load a range of identity data into the Identity table. This article offers some suggestions as to which offices and organizational units may have these data, and how one might go about getting them.

Collecting data from a variety of sources increases the likelihood that you pull in data consistent with what appears in a bibliographic record. Having evidence that a given researcher spells their name three ways in four systems is a good thing. The less of a track record someone has, the more challenging they are to disambiguate. (Incidentally, this is why disambiguation projects that limit themselves to long-lived principal investigators are making it a bit easier on themselves.)

Use institutional data

We generally advise against using bibliographic metadata itself as a source for identity data. In other words, don't harvest Prof. X's relationships by grabbing co-authors or her grant identifiers from her published papers or from RePORTER. Everything included among the identity data should be explicitly asserted or inferable on the institution side. This limits the likelihood of a spurious feedback loop. For example, it's entirely possible to pull a grant from RePORTER with which your target individual is not actually associated! Don't worry, though: data such as grant identifiers and co-authors do not go unused. ReCiter does leverage certain features such as grants for clustering, which itself affects the overall evidence score. If you have a sparse profile, you will be able to leverage these attributes once you accept a handful of articles.

As an aside, the ReCiter dev team is considering developing some machine learning technology that would pull out key attributes and ask for explicit feedback from the user. For example, "Did Prof. X ever work at University of Michigan? Did Prof. X ever use abc123@gmail.com?" That said, ReCiter should work at a scale of thousands without the need for feedback from users.

Pitching this effort

How do you get these data? First, see if you can figure out where these data live. Do a bit of scouting. Sometimes institutional reporting is an effective ally to this end, especially if you intimate that you would be happy to provide them with regular reports.

When you know which data you want, one approach is to try to get all the data first, one system owner at a time. This risks getting a clear "no" on the record, and then the people who gave you that no may feel compelled to justify their "totally sensible choice." After all, your initiative isn't even up and running. For a different project that involved collecting data, we at WCM got shot down pretty early, so I would recommend a second option.

Another approach is to stand up the application and run ReCiter for a couple of profiles, perhaps grabbing a full set of identity attributes from an available CV or two. Run ReCiter for these individuals and then present the Publication Manager interface and the results to a higher-up who can champion this effort. Show a sample report. (paa2013@med.cornell.edu can share.) Maybe use a stopwatch to show how long it takes to disambiguate a given person. Talk about how this effort solves specific problems and use cases, especially current awareness. This last use case is especially appealing to External Affairs and the Office of the Research Dean.

The goal of this meeting is to get your would-be champion to admit that this initiative is, in fact, highly valuable. If the higher-up agrees, s/he may ask, "What do you need to proceed?" Share your data wishlist and ask for an introduction to the relevant data owners.

When you talk to data owners, say that you don't want or need full system access. All you need is a view, and these data can be restricted from public view as per their requirements.

Sources

| Type of data | Sources WCM uses |
| --- | --- |
| Name | Any identity source system including Human Resources (HR) system, Office of Faculty Affairs (OFA) system, student management system, directory, clinical profile system, affiliated hospital system.... Basically any point where you're asking users or administrators to provide your scholar's name. Some institutions only have one identity system. (If this describes your institution, congratulations on being sane.) Names that are no longer used are also available. |
| Organizational unit | Any identity source system including Human Resources (HR) system, Office of Faculty Affairs (OFA) system, student management system, directory, clinical profile system, affiliated hospital system.... At WCM, we also include educational programs, as those are included in affiliations. |
| Email | WCM imports the following types of emails from all of the systems listed above: emails created by WCM, emails from prior affiliations (which OFA collects), personal emails, and inferred emails. WCM has some people who have up to 4 different emails in these sources. Even if an email isn't explicitly defined, WCM infers it by defining the domains in the application.properties file. For example, we will look for this pattern: our College-wide identifier + "@med.cornell.edu" OR "@mail.med.cornell.edu" OR "@nyp.org" OR "@tritdii.org". |
| Relationships | This may require some institutional savvy. These are the systems WCM uses: grants management system (A and B have a co-investigator relationship if they are both listed on the same grant); student management/mentoring system (A and B can have a mentor or mentee relationship); HR system (A and B can have an "HR" type relationship if they are in the same HR org unit, excluding cases where there are more than 100 people in a single unit). You're looking for any available system where two people have a connection of some kind. In all of the above examples, WCM will add A as a relationship to B in B's profile, and vice versa. |
| Institutions | WCM collects institutions from: OFA (has institutions of undergraduate and graduate degrees) and clinical profile system (has institutions where researchers did their internship or residency). Everyone automatically gets "Weill Cornell Medicine", "Weill Cornell Graduate School", and "Weill Cornell Medical College." Additionally, if someone is active in the affiliated hospital system, they get "NewYork-Presbyterian Hospital." |
| Grant identifiers | Grant management system. WCM has had two grants systems. Before they decommissioned the old one, we exported all the NIH grant identifiers to a flat file. So, now we have identifiers (and relationship data, see above) from two systems. Be wary of certain mega-grants that have 500+ people on them. You might want to exclude them from your import. |
| Degree year | At WCM, these data are explicitly defined in the OFA system. It is okay if you only have one of the two (bachelor's or doctoral degree year). We don't actually have a source of bachelor's degree data or an explicit expected graduation year for students, so we guess when they're going to get their doctoral degree. A third-year med student in spring of 2020 would probably get his MD credential in 2021. MD-PhD students require a bit more inference. At WCM, they have 2 years of MD, 3 years of PhD, and 2 years of MD. |
| Title | This is not used for disambiguation but instead for display in Publication Manager. |
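
The email inference described in the Email row above can be sketched in a few lines of Python. This is only an illustration of the pattern (identifier + known domain), not ReCiter's own implementation: the function name `candidate_emails` and the constant `INFERRED_DOMAINS` are hypothetical, and the domain list simply repeats the examples given in the table.

```python
# Hypothetical sketch: names here are illustrative and not part of ReCiter.
# The domains are the examples quoted in the Email row of the table.
INFERRED_DOMAINS = [
    "@med.cornell.edu",
    "@mail.med.cornell.edu",
    "@nyp.org",
    "@tritdii.org",
]


def candidate_emails(identifier, known_emails, domains=INFERRED_DOMAINS):
    """Combine explicitly recorded emails with emails inferred from an identifier."""
    # Normalize the explicit emails and drop empty values.
    emails = {e.strip().lower() for e in known_emails if e and e.strip()}
    # Add one inferred address per institutional domain.
    emails.update(f"{identifier.strip().lower()}{domain}" for domain in domains)
    return sorted(emails)


if __name__ == "__main__":
    for email in candidate_emails("abc1234", ["Prof.X@gmail.com"]):
        print(email)
```

Using a set means an address that is both explicitly recorded and inferred appears only once in the result.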
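
The relationship rules in the table (two people share a grant or an HR unit, relationships are recorded in both directions, and oversized groups are skipped) might look roughly like the following. This is a minimal sketch under assumptions: `build_relationships`, its parameters, and the relationship labels are hypothetical, and the input is assumed to be a simple mapping from a grant or org unit identifier to the people listed on it.

```python
from collections import defaultdict
from itertools import combinations


def build_relationships(groups, relationship_type, max_group_size=None):
    """Derive symmetric relationships from shared group membership.

    groups: mapping of group id (e.g. a grant or HR org unit) -> person ids.
    Groups with more than max_group_size members are skipped entirely.
    Returns a mapping of person id -> set of (other person id, relationship type).
    """
    relationships = defaultdict(set)
    for members in groups.values():
        members = sorted(set(members))
        if max_group_size is not None and len(members) > max_group_size:
            continue  # e.g. a mega-grant or a very large HR unit
        for a, b in combinations(members, 2):
            # Record the relationship on both profiles (A to B and B to A).
            relationships[a].add((b, relationship_type))
            relationships[b].add((a, relationship_type))
    return relationships


if __name__ == "__main__":
    grants = {"grant-1": ["A", "B"], "mega-grant": [f"p{i}" for i in range(500)]}
    # Skip grants with 500+ people; for HR units one might pass max_group_size=100.
    rel = build_relationships(grants, "co-investigator", max_group_size=499)
    print(sorted(rel["A"]))
```

Applying the size cutoff at import time keeps a single very large grant or org unit from linking hundreds of otherwise unrelated people.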
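
The degree-year guess described in the Degree year row can be written as simple arithmetic. The sketch below is illustrative only: the function and constant names are hypothetical, and it assumes a four-year MD program (implied by the third-year example) and that an MD-PhD student's doctoral degree year falls at the end of the full 2 + 3 + 2 year sequence.

```python
# Hypothetical sketch: program lengths are assumptions drawn from the
# examples in the table (4-year MD; MD-PhD as 2 + 3 + 2 years).
PROGRAM_LENGTH_YEARS = {
    "MD": 4,
    "MD-PhD": 2 + 3 + 2,
}


def expected_doctoral_year(program, year_in_program, spring_year):
    """Estimate the calendar year in which a current student finishes.

    year_in_program: 1 for a first-year student, 2 for a second-year, etc.
    spring_year: the calendar year of the spring term in which the student
    is in that program year.
    """
    length = PROGRAM_LENGTH_YEARS[program]
    if not 1 <= year_in_program <= length:
        raise ValueError(f"year_in_program must be between 1 and {length}")
    return spring_year + (length - year_in_program)


if __name__ == "__main__":
    # A third-year med student in spring of 2020.
    print(expected_doctoral_year("MD", 3, 2020))  # 2021
```

Institutions with different program structures would need to adjust the lengths accordingly.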