Create authorAffiliationScoringStrategy #47

Closed
michaelbales1 opened this issue Apr 24, 2015 · 4 comments

michaelbales1 commented Apr 24, 2015

Overview

With this scoring strategy, we're trying to account for the extent to which affiliation of all authors affects the likelihood a given targetAuthor authored an article.

To do this, we need to ask and answer several questions.

  1. Which sources are we using to make the match?
  • Scopus - performs institutional disambiguation; provides affiliations as numeric codes (e.g., 60007997)
  • PubMed - affiliations are just strings
  2. Which affiliation(s) are we considering?
  • targetAuthor
  • non-targetAuthor
  3. What type of match is this?
  • explicitly defined for the individual, e.g., Dr. X got an undergraduate degree from Georgetown University, did her residency at Montefiore, etc.
  • explicitly defined for the institution, e.g., Weill Cornell faculty frequently co-author papers with individuals from Hospital for Special Surgery
  • match was not attempted because there was no available affiliation data
  • match was attempted but failed

About Scopus data

The Identity table currently holds 276,666 institutional affiliation records, which represent 3,861 unique institutions. These records come from several sources, which use a controlled vocabulary.

We've looked up the Scopus Institution ID for the 1,786 institutions most often cited as a current or historical affiliation. Collectively, these account for 273,006 affiliations. In other words, ~99% of the time we can predict what the Scopus Institution ID could be. Note that a given institution such as Weill Cornell might have multiple institution IDs.

Values in application.properties

targetAuthor-institutionalAffiliation-matchType-positiveMatch-individual-score: 3
targetAuthor-institutionalAffiliation-matchType-positiveMatch-institution-score: 1.5
targetAuthor-institutionalAffiliation-matchType-null-score: 0
targetAuthor-institutionalAffiliation-matchType-noMatch-score: -2

nonTargetAuthor-institutionalAffiliation-collaboratingInstitution-weight: 0.5
nonTargetAuthor-institutionalAffiliation-maxScore: 3

homeInstitution-scopusInstitutionIDs: 60007997, 60019868, 60000247, 60072750, 60109878

homeInstitution-keywords: weill|cornell, weill|medicine, cornell|medicine, cornell|medical, weill|medical, weill|bugando, weill|graduate, cornell|presbyterian, weill|presbyterian, 10065|cornell, 10065|presbyterian, 10021|cornell, 10021|presbyterian, weill|qatar, cornell|qatar, @med.cornell.edu, @qatar-med.cornell.edu

institutionStopwords: of, the, for, and, to

collaboratingInstitutions-scopusInstitutionIDs: 60010570, 60025849, 60012732, 60018043, 60008981, 60022875, 60019970, 60025879, 60009343, 60009656, 60072743, 60072746, 60104769, 60012981, 60000764, 60004670, 60014933, 60022377, 60005705, 60003158, 60027954, 60003711, 60103484, 60029961, 60031841, 60005208, 60002388, 60024099, 60030304, 60029652, 60026273, 60024541, 60023247, 60007555, 60017027, 60002896, 60011605, 60027565

collaboratingInstitutions-keywords: new|york|presbyterian, HSS, hospital|special|surgery, North|Shore|hospital, Long|Island|Jewish, memorial|sloan, sloan|kettering, hamad, mount|sinai, methodist|houston, National|Institute|Mental|Health, beth israel, University|Pennsylvania|Medicine, Merck|Research, New|York|Medical|College, Medicine|Dentistry|New|Jersey, Montefiore, Lenox|Hill, Cold|Spring|Harbor, St|Luke|Roosevelt, New|York|University|Medicine, Langone, SUNY|Downstate, Albert|Einstein|Medicine, Yeshiva, UMDNJ, Icahn|Medicine, Mount|Sinai, columbia|medical, columbia|physicians

Desired output

Variables

targetAuthor-institutionalAffiliation-matchType: positiveMatch-individual
targetAuthor-institutionalAffiliation-matchType: positiveMatch-institution
targetAuthor-institutionalAffiliation-matchType: null
targetAuthor-institutionalAffiliation-matchType: noMatch

targetAuthor-institutionalAffiliation-source: Scopus
targetAuthor-institutionalAffiliation-source: PubMed

nonTargetAuthor-institutionalAffiliation-source: Scopus
nonTargetAuthor-institutionalAffiliation-source: PubMed

TargetAuthor

Case 1: Target author has affiliation statements in Scopus and PubMed

targetAuthorAffiliation
	Scopus
		1 
			targetAuthor-institutionalAffiliation-matchType: positiveMatch-individual
			targetAuthor-institutionalAffiliation-identity: "Weill Graduate School of Medical Sciences of Cornell University"
			targetAuthor-institutionalAffiliation-source: Scopus
			targetAuthor-institutionalAffiliation-article-scopusLabel: "Weill Cornell Medicine" 
			targetAuthor-institutionalAffiliation-article-scopusAffiliationID: "60007997"  
			targetAuthor-institutionalAffiliation-matchType-positiveMatch-individual-score: 3
		2
			targetAuthor-institutionalAffiliation-matchType: positiveMatch-institution
			targetAuthor-institutionalAffiliation-source: Scopus
			targetAuthor-institutionalAffiliation-identity: "Hospital for Special Surgery"
			targetAuthor-institutionalAffiliation-article-scopusLabel: "Hospital for Special Surgery"  
			targetAuthor-institutionalAffiliation-article-scopusAffiliationID: "61492421"  
			targetAuthor-institutionalAffiliation-matchType-positiveMatch-institution-score: 1.5
		3
			targetAuthor-institutionalAffiliation-matchType: noMatch
			targetAuthor-institutionalAffiliation-source: Scopus
			targetAuthor-institutionalAffiliation-article-scopusLabel: "University of Adelaide"  
			targetAuthor-institutionalAffiliation-article-scopusAffiliationID: "6999421"  
			targetAuthor-institutionalAffiliation-matchType-noMatch-score: -2
		etc...
	PubMed
			targetAuthor-institutionalAffiliation-article-pubmedLabel: "Weill Cornell Medicine, New York, NY 10065" 

Notes:

  • One target author can have N affiliations in Scopus (unlike in PubMed). Each of these affiliations is evaluated separately, and each match adds points.
  • We output the PubMed affiliation statement, but that's just for reference. We're not using it for scoring purposes.

Case 2: Target author has affiliation statements in Scopus only

targetAuthorAffiliation
	Scopus
		1 
			targetAuthor-institutionalAffiliation-matchType: positiveMatch-individual
			targetAuthor-institutionalAffiliation-identity: "Weill Graduate School of Medical Sciences of Cornell University"
			targetAuthor-institutionalAffiliation-source: Scopus
			targetAuthor-institutionalAffiliation-article-scopusLabel: "Weill Cornell Medicine" 
			targetAuthor-institutionalAffiliation-article-scopusAffiliationID: "60007997"  
			targetAuthor-institutionalAffiliation-matchType-positiveMatch-individual-score: 3
		2
			targetAuthor-institutionalAffiliation-matchType: noMatch
			targetAuthor-institutionalAffiliation-source: Scopus
			targetAuthor-institutionalAffiliation-article-scopusLabel: "University of Adelaide"  
			targetAuthor-institutionalAffiliation-article-scopusAffiliationID: "6999421"  
			targetAuthor-institutionalAffiliation-matchType-noMatch-score: -2

Case 3: Target author has affiliation statements only in PubMed

targetAuthorAffiliation
    PubMed
		targetAuthor-institutionalAffiliation-source: PubMed
		targetAuthor-institutionalAffiliation-article-pubmedLabel: "Weill Cornell Graduate School of Medical Sciences, New York, New York, USA."
		targetAuthor-institutionalAffiliation-identity: "Weill Graduate School of Medical Sciences of Cornell University" /* example */
		targetAuthor-institutionalAffiliation-matchType: positiveMatch-individual
		targetAuthor-institutionalAffiliation-matchType-positiveMatch-individual-score: 2

Non-target author

Case 4: Non-target author(s) have one or more affiliation statements in Scopus

nonTargetAuthorAffiliation
	Scopus
		nonTargetAuthor-institutionalAffiliation-source: Scopus
		nonTargetAuthor-institutionalAffiliation-match-knownInstitution: Weill Cornell Medicine, 60007997, 3
		nonTargetAuthor-institutionalAffiliation-match-knownInstitution: Weill Graduate School of Medical Sciences, 60000247, 2
		nonTargetAuthor-institutionalAffiliation-match-CollaboratingInstitution: Methodist Hospital System, 60008981, 2
		nonTargetAuthor-institutionalAffiliation-match-CollaboratingInstitution: The Burke Medical Research Institute, 60022377, 1
		nonTargetAuthor-institutionalAffiliation-match-CollaboratingInstitution: The Burke Rehabilitation Hospital, 60005705, 1
		nonTargetAuthor-institutionalAffiliation-matchType-match-score: 2.4  /* example */

Notes:

  • Don't worry about displaying PubMed affiliations in this case.

Case 5: Non-target author(s) have an affiliation statement in PubMed but not Scopus

We don't consider this case.

Pseudocode

Evaluate targetAuthor

Decide which source to use for scoring.

We generally prefer to use Scopus if it's available. If it's not, we still need to provide the option to use PubMed alone.

1. As set in application.properties, is use.scopus.articles=true?
  • If yes, go to 2
  • If no, go to 3
2. Does article have a Scopus affiliation for targetAuthor?
  • If no, go to 3
  • If yes, go to "Evaluate Scopus affiliation"
3. Does candidate article have a PubMed affiliation for targetAuthor?
  • If no, go to 4
  • If yes, go to "Evaluate PubMed Affiliation"
4. Return the following:
targetAuthor-institutionalAffiliation-matchType: null
targetAuthor-institutionalAffiliation-matchType-null-score: 0
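
The decision steps above can be sketched as a small function. This is only an illustration: the function name `choose_source` and its parameters are hypothetical, and it assumes the Scopus affiliation IDs and the PubMed affiliation string for the targetAuthor have already been extracted from the article.

```python
from typing import List, Optional


def choose_source(use_scopus_articles: bool,
                  scopus_affiliation_ids: List[str],
                  pubmed_affiliation: Optional[str]) -> Optional[str]:
    """Return "Scopus", "PubMed", or None (matchType null, score 0)."""
    # Steps 1-2: prefer Scopus when it is enabled and the article has an ID.
    if use_scopus_articles and scopus_affiliation_ids:
        return "Scopus"
    # Step 3: otherwise fall back to a non-empty PubMed affiliation string.
    if pubmed_affiliation and pubmed_affiliation.strip():
        return "PubMed"
    # Step 4: nothing to evaluate.
    return None
```

A `None` result corresponds to step 4, where the matchType is null and the score is 0.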

Evaluate Scopus affiliation

1. Get list of institutions (these are strings) from identity.Institution for target person. Also, get Scopus institution IDs from homeInstitution-scopusInstitutionIDs from application.properties.
2. Get any scopusInstitutionIDs (e.g., 60007997) from article.affiliation for targetAuthor.
3. Use values from identity.Institution to look up Scopus institutional identifiers in the InstitutionAfid table. For example, Weill Graduate School of Medical Sciences of Cornell University returns:
  "afids": [
    "60007997",
    "60019868",
    "60000247",
    "60072750",
    "60026978",
    "60025849",
    "105533257"
    ]
4. Attempt match between article and identity.

If there's a positive match between article and identity, output the following:

targetAuthor-institutionalAffiliation-source: Scopus

For EACH positive match between article and identity, output the following:

targetAuthor-institutionalAffiliation-identity: "Weill Graduate School of Medical Sciences of Cornell University" /* example */
targetAuthor-institutionalAffiliation-article-scopusLabel: "Weill Cornell Graduate School of Medical Sciences"  /* example */
targetAuthor-institutionalAffiliation-article-scopusAffiliationID: "60007997"  /* example */
targetAuthor-institutionalAffiliation-matchType: positiveMatch-individual
targetAuthor-institutionalAffiliation-matchType-positiveMatch-individual-score: 2  /* value stored in application.properties */

If match, go to 7.
If no match, go to 5.

5. Attempt match using collaborating institutions, which are defined at the institutional level. Grab values from collaboratingInstitutions-scopusInstitutionIDs (stored in application.properties). Look for overlap between the two.

If there's any one positive match between the article and the collaborating institutions, output the following for all matches:

targetAuthor-institutionalAffiliation-source: Scopus
targetAuthor-institutionalAffiliation-matchType: positiveMatch-institution
targetAuthor-institutionalAffiliation-matchType-positiveMatch-institution-score: 1  /* value stored in application.properties */
targetAuthor-institutionalAffiliation-article-scopusLabel: "Hospital for Special Surgery"  /* example */
targetAuthor-institutionalAffiliation-article-scopusAffiliationID: "61492421"  /* example */

While there can be multiple matches, the maximum score returned for this type of match should be 1.

If no match, go to 6.

6. There's no match. Output:
targetAuthor-institutionalAffiliation-source: Scopus
targetAuthor-institutionalAffiliation-article-scopusLabel: "Hospital for Sick Children"  /* example */
targetAuthor-institutionalAffiliation-matchType: noMatch
targetAuthor-institutionalAffiliation-matchType-noMatch-score: -2  /* value stored in application.properties */

Test case: meb7002 and 22667600

Go to 7.

7. If PubMed affiliation exists, output that (but don't score it):
targetAuthor-institutionalAffiliation-article-pubmedLabel: "Weill Cornell Medicine, New York, NY 10065" 
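
A minimal sketch of the Scopus cascade in steps 4 through 6, under stated assumptions. The function name and the returned dictionary are hypothetical. `identity_afids` is assumed to already combine the IDs looked up for the person's institutions with homeInstitution-scopusInstitutionIDs. The default scores are the ones listed under "Values in application.properties", and the sketch assumes each individual-level match adds the score while institution-level matches are capped at a single score.

```python
from typing import Dict, List, Set


def score_scopus_target_author(
    article_afids: List[str],
    identity_afids: Set[str],
    collaborating_afids: Set[str],
    individual_score: float = 3.0,
    institution_score: float = 1.5,
    no_match_score: float = -2.0,
) -> Dict[str, object]:
    """Cascade: individual match, then institution match, then noMatch."""
    # Step 4: article IDs also known for this person or the home institution.
    individual = [afid for afid in article_afids if afid in identity_afids]
    if individual:
        # Assumption: every individual-level match earns the score.
        return {"matchType": "positiveMatch-individual",
                "matches": individual,
                "score": individual_score * len(individual)}
    # Step 5: article IDs that belong to collaborating institutions.
    institution = [afid for afid in article_afids if afid in collaborating_afids]
    if institution:
        # All matches are reported, but the score is awarded only once.
        return {"matchType": "positiveMatch-institution",
                "matches": institution,
                "score": institution_score}
    # Step 6: affiliation IDs were present, but none were recognized.
    return {"matchType": "noMatch", "matches": [], "score": no_match_score}
```

The PubMed label from step 7 is output for reference only, so it plays no part in this function.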

Evaluate PubMed affiliation

1. Get list of institutions (these are strings) from identity.institutions for person under consideration.
2. Get article.affiliation for targetAuthor.
3. Preprocess.

Get list of stopwords from the institutionStopwords field in application.properties.

Remove stopwords, commas, and dashes from article.affiliation and identity.institutions.

Ignore any words inside parentheses. These are typically countries and are not included in affiliation statements.
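
The preprocessing step might look like the sketch below. The function name `preprocess` is hypothetical, the stopword list is the institutionStopwords value shown earlier, and lowercasing is an added assumption so that later comparisons are case-insensitive.

```python
import re
from typing import FrozenSet, List

# institutionStopwords as listed in application.properties above.
STOPWORDS = frozenset({"of", "the", "for", "and", "to"})


def preprocess(text: str, stopwords: FrozenSet[str] = STOPWORDS) -> List[str]:
    """Return the affiliation's words minus parentheticals and stopwords."""
    # Ignore anything inside parentheses (typically a country).
    text = re.sub(r"\([^)]*\)", " ", text)
    # Replace commas, hyphens, and en/em dashes with spaces.
    text = re.sub(r"[,\-\u2013\u2014]", " ", text)
    # Assumption: lowercase so that matching ignores case.
    tokens = text.lower().split()
    return [token for token in tokens if token not in stopwords]
```

Other punctuation, such as a trailing period, is left untouched here because the steps above don't mention it.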

4. Attempt match between article.affiliation and identity.institutions. The logic here is that all the keywords of an entry in identity.institutions appear somewhere in article.affiliation.

Here's how we do this match. Grab each affiliation and see if all the keywords are represented in a single affiliation. For example, suppose an author has a known affiliation in identity.institutions of "Weill Cornell Medical College". And, suppose the article affiliation is "Department of Pharmacology, Medical College of Weill Cornell." This would be a match because all the words in the identity affiliation are represented in the article affiliation.
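
As a rough sketch, this match is a subset test on word sets. The names `keywords` and `match_identity_institution` are hypothetical. The sketch also assumes words are split on any non-alphanumeric character, which goes slightly beyond the punctuation listed in the preprocessing step but handles the trailing period in the example.

```python
import re
from typing import List, Optional, Set

STOPWORDS = frozenset({"of", "the", "for", "and", "to"})


def keywords(text: str) -> Set[str]:
    """Lowercased words of an affiliation, minus parentheticals and stopwords."""
    text = re.sub(r"\([^)]*\)", " ", text)
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def match_identity_institution(article_affiliation: str,
                               identity_institutions: List[str]) -> Optional[str]:
    """Return the first identity institution whose words all appear in the article."""
    article_words = keywords(article_affiliation)
    for institution in identity_institutions:
        needed = keywords(institution)
        # Every identity keyword must be present; word order is irrelevant.
        if needed and needed <= article_words:
            return institution  # maximum of one match
    return None
```

With the example above, "Weill Cornell Medical College" is matched against "Department of Pharmacology, Medical College of Weill Cornell." because weill, cornell, medical, and college all occur in the article affiliation.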

If there's a match, output the following:

targetAuthor-institutionalAffiliation-source: PubMed
targetAuthor-institutionalAffiliation-article-pubmedLabel: "Weill Cornell Graduate School of Medical Sciences, New York, New York, USA." /* example */
targetAuthor-institutionalAffiliation-identity: "Weill Graduate School of Medical Sciences of Cornell University" /* example */
targetAuthor-institutionalAffiliation-matchType: positiveMatch-individual
targetAuthor-institutionalAffiliation-matchType-positiveMatch-individual-score: 2  /* value stored in application.properties */

Maximum of one match.

If there's no match, go to 5.

5. Attempt match against homeInstitution-keywords.

Get homeInstitution-keywords from application.properties.

Look for cases where a group of homeInstitution keywords is present in the affiliation string in any order. Here's how we do this. Take any group of terms from homeInstitution-keywords, e.g., "weill|cornell". In order for this to be a match, both terms must be present in any order, with any case.

  • These are matches: "Cornell Weill Medical College", "The Weill Medical School of Cornell University"
  • These are not matches: "Cornell University", "Cornell Med"
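
A minimal sketch of the keyword-group match, assuming groups are separated by commas and the terms inside a group by "|", as in homeInstitution-keywords. The function names are hypothetical, and terms are compared as case-insensitive substrings so that entries such as "@med.cornell.edu" and "10065" can match.

```python
from typing import List, Optional


def parse_keyword_groups(value: str) -> List[List[str]]:
    """Split 'weill|cornell, 10065|presbyterian' into lowercase term groups."""
    groups = []
    for group in value.split(","):
        terms = [term.strip().lower() for term in group.split("|") if term.strip()]
        if terms:
            groups.append(terms)
    return groups


def first_matching_group(affiliation: str,
                         groups: List[List[str]]) -> Optional[List[str]]:
    """Return the first group whose terms all occur in the affiliation string."""
    haystack = affiliation.lower()
    for group in groups:
        # Every term must be present; order and case don't matter.
        if all(term in haystack for term in group):
            return group
    return None
```

Because this is a substring test, very short terms (for example "HSS" in collaboratingInstitutions-keywords) could also match inside longer words; matching on whole words would be stricter.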

If there's a match, output the following:

targetAuthor-institutionalAffiliation-source: PubMed
targetAuthor-institutionalAffiliation-article-pubmedLabel: "Weill Cornell Graduate School of Medical Sciences, New York, New York, USA." /* example */
targetAuthor-institutionalAffiliation-identity: "Weill Graduate School of Medical Sciences of Cornell University" /* example */
targetAuthor-institutionalAffiliation-matchType: positiveMatch-individual
targetAuthor-institutionalAffiliation-matchType-positiveMatch-individual-score: 2  /* value stored in application.properties */
homeInstitution-Label: Weill Cornell Medicine / NewYork-Presbyterian Hospital

Maximum of one match.

If there's no match, go to 6.

6. Attempt match using collaborating institutions, which are defined at the institutional level. Grab values from collaboratingInstitutions-keywords (stored in application.properties). Look for overlap between the two.

If there's any one positive match between the article and the collaborating institutions, output the following for all matches:

targetAuthor-institutionalAffiliation-source: PubMed
targetAuthor-institutionalAffiliation-matchType: positiveMatch-institution
targetAuthor-institutionalAffiliation-matchType-positiveMatch-institution-score: 1  /* value stored in application.properties */
targetAuthor-institutionalAffiliation-article-pubmedLabel: "Hospital for Special Surgery, New York, NY 10021"  /* example */

While there can be multiple matches, the maximum score returned for this type of match should be 1.

If there's no match, go to 7.

7. There's no match. Output:
targetAuthor-institutionalAffiliation-source: PubMed
targetAuthor-institutionalAffiliation-article-pubmedLabel: "Hospital for Sick Children, Quebec City, Quebec, Canada YRV MX1"  /* example */
targetAuthor-institutionalAffiliation-matchType: noMatch
targetAuthor-institutionalAffiliation-matchType-noMatch-score: -2  /* value stored in application.properties */

Evaluate nonTargetAuthor

Decide which source to use

We generally prefer to use Scopus if it's available. If it's not, we still need to provide the option to use PubMed alone.

1. As set in application.properties, is use.scopus.articles=true?
  • If yes, go to 2
  • If no, go to 3
2. Does article have any Scopus affiliation for nonTargetAuthor?
  • If no, go to 3
  • If yes, go to "Evaluate Scopus affiliation"
3. Does candidate article have any PubMed affiliation for nonTargetAuthor?
  • If no, go to 4
  • If yes, go to "Evaluate PubMed Affiliation"
4. Return the following:
nonTargetAuthor-institutionalAffiliation-matchType: null
nonTargetAuthor-institutionalAffiliation-matchType-null-score: 0

Evaluate Scopus affiliation

1. Preprocessing

A. Create scopusIDsNonTargetAuthor-Article.

  • This contains all scopusInstitutionIDs (e.g., 60007997) from article.affiliation for all nonTargetAuthors.

B. Create scopusIDsNonTargetAuthor-Identity-KnownInstitutions.

  • This contains all Scopus Institution IDs from homeInstitution-scopusInstitutionIDs as stored in application.properties.
  • It also contains all Scopus Institution IDs for targetAuthor from identity.institutions; do this by matching against identity.institutionafids as described above.

C. Create scopusIDsNonTargetAuthor-Identity-CollaboratingInstitutions

  • This contains all Scopus Institution IDs from collaboratingInstitutions-scopusInstitutionIDs as stored in application.properties.
2. Determine overlap.

Compute the following:

  • countScopusIDNonTargetAuthor-Affiliations - non-unique count of all Scopus affiliation IDs for all nonTargetAuthors
  • countScopusIDsNonTargetAuthor-Article-KnownInstitution - count of cases where affiliation ID from scopusIDsNonTargetAuthor-Article is in scopusIDsNonTargetAuthor-Identity-KnownInstitutions
  • countScopusIDsNonTargetAuthor-Article-CollaboratingInstitution - count of cases where affiliation ID from scopusIDsNonTargetAuthor-Article is in scopusIDsNonTargetAuthor-Identity-CollaboratingInstitutions
  • countScopusIDsNonTargetAuthor-Article-NoMatch - count of cases in which none of the above are true
3. Compute overall score.

Get nonTargetAuthor-institutionalAffiliation-collaboratingInstitution-weight and nonTargetAuthor-institutionalAffiliation-maxScore from application.properties.

nonTargetAuthor-institutionalAffiliation-maxScore * (countScopusIDsNonTargetAuthor-Article-KnownInstitution + (countScopusIDsNonTargetAuthor-Article-CollaboratingInstitution * nonTargetAuthor-institutionalAffiliation-collaboratingInstitution-weight )) / countScopusIDNonTargetAuthor-Affiliations
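
The counts and the formula can be sketched together as follows. The function name and parameter names are hypothetical, and the defaults are the maxScore and weight values listed in application.properties above. Two assumptions are labeled in the code: an ID that appears in both the known and the collaborating sets is counted once, as known, and an article with no non-target affiliation IDs scores 0.

```python
from typing import Iterable, Set


def non_target_author_score(article_afids: Iterable[str],
                            known_afids: Set[str],
                            collaborating_afids: Set[str],
                            max_score: float = 3.0,
                            collaborating_weight: float = 0.5) -> float:
    """maxScore * (known + collaborating * weight) / total affiliation IDs."""
    ids = list(article_afids)  # non-unique: repeated IDs are counted each time
    if not ids:
        # Assumption: no affiliation IDs means nothing to score.
        return 0.0
    known = sum(1 for afid in ids if afid in known_afids)
    # Assumption: an ID in both sets counts as known, not collaborating.
    collaborating = sum(1 for afid in ids
                        if afid not in known_afids and afid in collaborating_afids)
    return max_score * (known + collaborating * collaborating_weight) / len(ids)
```

For instance, with 8 affiliation IDs of which 3 are known and 2 are collaborating, the score is 3 * (3 + 2 * 0.5) / 8 = 1.5.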
4. Output values
nonTargetAuthor-institutionalAffiliation-source: Scopus
nonTargetAuthor-institutionalAffiliation-matchType-match-score: 2.4  /* example */

/* Here we're outputting Scopus institution labels, identifiers, and counts for all matching institutions. */
nonTargetAuthor-institutionalAffiliation-match-knownInstitution: Weill Cornell Medicine, 60007997, 3
nonTargetAuthor-institutionalAffiliation-match-knownInstitution: Weill Graduate School of Medical Sciences, 60000247, 2
nonTargetAuthor-institutionalAffiliation-match-CollaboratingInstitution: Methodist Hospital System, 60008981, 2
nonTargetAuthor-institutionalAffiliation-match-CollaboratingInstitution: The Burke Medical Research Institute, 60022377, 1
nonTargetAuthor-institutionalAffiliation-match-CollaboratingInstitution: The Burke Rehabilitation Hospital, 60005705, 1

Evaluate PubMed affiliation

At this time, we're not evaluating PubMed affiliation for nonTargetAuthors.

@michaelbales1 michaelbales1 changed the title Leverage data on institutional affiliation to improve phase 1 matching Leverage data on institutional affiliation to improve phase two matching Apr 28, 2015
@michaelbales1 michaelbales1 changed the title Leverage data on institutional affiliation to improve phase two matching Leverage data on past institutional affiliation to improve phase two matching Jun 5, 2015
@jl987-Jie jl987-Jie self-assigned this Mar 6, 2017
@jl987-Jie

Added data from the provided file to MLab's MongoDB server.

@paulalbert1 paulalbert1 changed the title Leverage data on past institutional affiliation to improve phase two matching Output score when targetAuthor has institutional affiliation which matches affiliation in Identity table Jun 20, 2018
@paulalbert1 paulalbert1 changed the title Output score when targetAuthor has institutional affiliation which matches affiliation in Identity table Create targetAuthorAffiliationScore Jul 13, 2018
@paulalbert1 paulalbert1 changed the title Create targetAuthorAffiliationScore Create authorAffiliationScore Jul 15, 2018
@paulalbert1 paulalbert1 changed the title Create authorAffiliationScore Create authorAffiliationScoringStrategy Jul 16, 2018
@paulalbert1 paulalbert1 added this to In progress in ReCiter Development 2020-1 Jul 17, 2018
@paulalbert1 paulalbert1 moved this from In progress to Testing in ReCiter Development 2020-1 Jul 26, 2018
@paulalbert1

@sarbajitdutta - There's a bug for ses9022 and 16614246: the institutional affiliation in Scopus is null, so the score should be 0 rather than -3.

    "pmid": 16614246,
    "affiliationEvidence": {
      "scopusTargetAuthorAffiliation": [
        {
          "targetAuthorInstitutionalAffiliationSource": "SCOPUS",
          "targetAuthorInstitutionalAffiliationIdentity": null,
          "targetAuthorInstitutionalAffiliationArticleScopusLabel": null,
          "targetAuthorInstitutionalAffiliationArticleScopusAffiliationId": 0,
          "targetAuthorInstitutionalAffiliationMatchType": "NO_MATCH",
          "targetAuthorInstitutionalAffiliationMatchTypeScore": -3
        }
      ],

@paulalbert1 paulalbert1 moved this from Testing to In progress in ReCiter Development 2020-1 Aug 13, 2018
@paulalbert1

Also, we should match against all affiliations. We're currently matching only the first one. Finally, we should incorporate the home institution from application.properties.

@paulalbert1 paulalbert1 moved this from In progress to To do in ReCiter Development 2020-1 Aug 13, 2018
@paulalbert1 paulalbert1 moved this from To do to In progress in ReCiter Development 2020-1 Aug 13, 2018
@paulalbert1 paulalbert1 moved this from In progress to Testing in ReCiter Development 2020-1 Sep 5, 2018
@paulalbert1

I think this is fixed.

@sarbajitdutta sarbajitdutta moved this from Testing to Done in ReCiter Development 2020-1 Jan 7, 2019