Implement revised clustering strategy #217

Closed
paulalbert1 opened this issue May 26, 2018 · 2 comments

paulalbert1 commented May 26, 2018

This clustering strategy should exist in parallel with the existing strategy so that we can compare their performance and, eventually, improve on this one.

1. Retrieve candidate articles from esearchresult

  • Instructions
    • If mode=evidence, retrieve all articles from esearchresult
    • If mode=testing, retrieve all articles from esearchresult except those returned by goldStandardRetrievalStategy, and use only the remaining articles.

2. Assign each individual article to its own cluster.

3. "Tepid Clustering" - merge articles into the same clusters in cases where they share a certain proportion of common features

  • Theory: these features generally occur fewer than 100,000 times in a corpus of 30 million records. We will use them to merge articles into a single cluster, but only when the clusters share a sufficient proportion of them.
  • General instructions: compare clusters to each other using the below features. If their similarity exceeds some threshold, combine the clusters.
  • Rationale: Because these features may occur more often by chance, we do not automatically combine the clusters if they share the feature. Instead the cluster-cluster comparison needs to meet or exceed a scoring threshold (which is described below).

3a. Determine if name match is plausible. (ON HOLD)

When doing ANY clustering (including "definitive clustering" below), we should first check whether it's plausible that the same author wrote both papers. We're trying to make a simple determination: is a pair of articles eligible to be combined during clustering? The following logic should apply to both tepid and definitive clustering.

  1. Do both articles have targetAuthor=TRUE assigned for at least one of their authors?
  • If yes, go to 2.
  • If no, the pair of articles is eligible to cluster.
  2. Is length(article1.forename) = 2 characters, and length(article2.forename) = 2 characters?
  • If yes, go to 3.
  • If no, go to 4.
  3. Does article1.forename = article2.forename?
  • If yes, the pair of articles is eligible to cluster.
  • If no, the pair of articles is NOT eligible to cluster.
  4. Is length(article1.forename) > 3 characters, and length(article2.forename) > 3 characters?
  • If yes, go to 5.
  • If no, the pair of articles is eligible to cluster.
  5. Check whether any of the following conditions is true. (One will suffice.)
  • forename1 = forename2
  • 4 consecutive characters of forename1 overlap with forename2
  • Levenshtein distance of < 2 between forename1 and forename2 (e.g., AhRum vs. AhReum)
  6. Is at least one of the above conditions true?
  • If yes, the pair of articles is eligible to cluster.
  • If no, the pair of articles is NOT eligible to cluster.
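
The decision steps above can be sketched in Python as follows. This is only an illustration: the function names, the `both_have_target_author` flag, and the case-insensitive comparison are assumptions, not part of the existing code base.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def shares_run(a: str, b: str, run_length: int = 4) -> bool:
    """True if any run_length consecutive characters of a also appear in b."""
    return any(a[i:i + run_length] in b
               for i in range(len(a) - run_length + 1))


def eligible_to_cluster(forename1: str, forename2: str,
                        both_have_target_author: bool = True) -> bool:
    """Hypothetical helper following the numbered steps in this issue."""
    # Assumption: forenames are compared case-insensitively.
    f1, f2 = forename1.lower(), forename2.lower()
    if not both_have_target_author:        # step 1
        return True
    if len(f1) == 2 and len(f2) == 2:      # steps 2 and 3
        return f1 == f2
    if not (len(f1) > 3 and len(f2) > 3):  # step 4
        return True
    # Steps 5 and 6: any one condition is enough.
    return f1 == f2 or shares_run(f1, f2) or levenshtein(f1, f2) < 2
```

For example, `eligible_to_cluster("AhRum", "AhReum")` is true because the two forenames are one edit apart.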

Test case: mcr2004, 27741972 (M. Carrington) should not be clustered with 27631718 (Christopher M).

3b. Identify features of each cluster.

Feature: journal name

  • Identify journal title of all articles in a given cluster

Feature: co-author name

  • Identify the lastName, firstInitial of all authors in a given cluster where targetAuthor=FALSE
  • Exclude the following common author names (lastName, firstInitial):
    • Y. Wang
    • J. Wang
    • J. Smith
    • S. Kim
    • S. Lee
    • J. Lee

Feature: MeSH major term

  • Identify all MeSH major terms in a given cluster where count of that MeSH term is < 100,000 in MeSH table.

Feature: Scopus Affiliation ID for targetAuthor

  • Grab Scopus affiliation IDs.

3c. Create arrays for each cluster

{
	clusterId: 1;
	journals: Cell;
	MeSH major: Thalassemia, Sunlight, Tamoxifen, Tryptophan, Brain;
	coauthors: Marshall T, Michaels A;
	targetAuthorScopusAffiliationID: 6007997;
}

3d. Calculate the clusterClusterSimilarityScore between ALL pairs of different clusters

So, if you had 4 clusters (A, B, C, D), you would need to calculate the following cluster-cluster similarity scores. The order of comparison (A-B vs. B-A) doesn't matter, so each pair appears only once:

  • A-B
  • A-C
  • A-D
  • B-C
  • B-D
  • C-D
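
As a minimal sketch, the unordered pairs can be enumerated with `itertools.combinations`, which yields each pair exactly once, giving n(n-1)/2 comparisons for n clusters. The variable names here are illustrative only.

```python
from itertools import combinations

clusters = ["A", "B", "C", "D"]

# Each unordered pair appears once: A-B is produced, B-A is not.
pairs = list(combinations(clusters, 2))

for first, second in pairs:
    print(f"{first}-{second}")
```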

There are three variables for each clusterCluster comparison:

  • totalItemsCluster1
  • totalItemsCluster2
  • overlapCluster1Cluster2

Let's do a sample calculation:

{
	clusterId: 1;
	journals: Cell;
	MeSH major: Thalassemia, Sunlight, Tamoxifen, Tryptophan, Brain;
	coauthors: Marshall T, Michaels A;
	targetAuthorScopusAffiliationID: 6007997;
}

{
	clusterId: 2;
	journals: Cell;
	MeSH major: Thalassemia, Tamoxifen;
	coauthors: Marshall T, Michaels A, Johnson Q;
	targetAuthorScopusAffiliationID: 6007997, 342823053;
}

Compute those variables.

totalItemsCluster1 = 1 + 5 + 2 + 1 = 9
totalItemsCluster2 = 1 + 2 + 3 + 1 = 7
overlapCluster1Cluster2 = 6
/* Notes:

- Overlap is computed only within a feature type, i.e., a journal (e.g., Brain) can't match a MeSH major term (e.g., Brain). If one of the two clusters doesn't have a feature (e.g., MeSH major), neither cluster's values for that feature are included in the matching.
- Overlap between institutions counts a maximum of one point even if one target author has multiple affiliations and another has one affiliation.
*/

Compute the clusterClusterSimilarityScore as per this formula...

overlapCluster1Cluster2 ^ 2 / (totalItemsCluster1 * totalItemsCluster2)

Let's figure out clusterClusterSimilarityScore in this example:

(6^2) / (9 * 7) = 0.57
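
The sample calculation can be reproduced with a short Python sketch. The dictionary layout, key names, and function name are assumptions made for illustration. The sketch follows the notes above: features are compared only within their own type, a feature type missing from either cluster is skipped for both (one reading of the note), and affiliations contribute at most one item and one overlap point.

```python
FEATURE_TYPES = ("journals", "mesh_major", "coauthors", "affiliation_ids")


def cluster_cluster_similarity(cluster1: dict, cluster2: dict) -> float:
    """overlap^2 / (total1 * total2), counted per feature type."""
    total1 = total2 = overlap = 0
    for feature_type in FEATURE_TYPES:
        items1 = cluster1.get(feature_type, set())
        items2 = cluster2.get(feature_type, set())
        if not items1 or not items2:
            continue  # feature missing on one side: skip it for both
        shared = items1 & items2
        if feature_type == "affiliation_ids":
            # Affiliations count as a single item and at most one point.
            total1 += 1
            total2 += 1
            overlap += 1 if shared else 0
        else:
            total1 += len(items1)
            total2 += len(items2)
            overlap += len(shared)
    if total1 == 0 or total2 == 0:
        return 0.0
    return overlap ** 2 / (total1 * total2)


cluster_1 = {
    "journals": {"Cell"},
    "mesh_major": {"Thalassemia", "Sunlight", "Tamoxifen", "Tryptophan", "Brain"},
    "coauthors": {"Marshall T", "Michaels A"},
    "affiliation_ids": {"6007997"},
}
cluster_2 = {
    "journals": {"Cell"},
    "mesh_major": {"Thalassemia", "Tamoxifen"},
    "coauthors": {"Marshall T", "Michaels A", "Johnson Q"},
    "affiliation_ids": {"6007997", "342823053"},
}

score = cluster_cluster_similarity(cluster_1, cluster_2)
print(round(score, 2))  # 0.57
```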

3e. Compare against threshold

Set clusterClusterSimilarityScoreThreshold in application.properties to be 0.2. (We can change this if it's too aggressive. It's actually a bit high perhaps.)

If clusterClusterSimilarityScore > clusterClusterSimilarityScoreThreshold, merge clusters.

Example

Suppose there are five clusters. We want to measure similarity between all of these:

- Cluster 1 vs. Cluster 2 - score = 0.5
- Cluster 1 vs. Cluster 3 - score = 0.1
- Cluster 1 vs. Cluster 4 - score = 0.1
- Cluster 1 vs. Cluster 5 - score = 0.0
- Cluster 2 vs. Cluster 3 - score = 0.1
- Cluster 2 vs. Cluster 4 - score = 0.6
- Cluster 2 vs. Cluster 5 - score = 0.0
- Cluster 3 vs. Cluster 4 - score = 0.1
- Cluster 3 vs. Cluster 5 - score = 0.5
- Cluster 4 vs. Cluster 5 - score = 0.0

Identify the clusterClusterSimilarityScores that exceed our threshold:

{1,2}
{2,4}
{3,5}

Combine clusters until there is no overlap.

{1,2,4}
{3,5}
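
One way to "combine until there is no overlap" is a union-find pass over the pairs that exceed the threshold. The sketch below is an assumption about how this could be done, not the existing implementation: it uses the strict `>` comparison from the merge rule, and it merges on the original pairwise scores without recomputing scores for the merged clusters.

```python
def merge_clusters(cluster_ids, scores, threshold=0.2):
    """Group clusters linked, directly or transitively, by a score > threshold."""
    parent = {cluster_id: cluster_id for cluster_id in cluster_ids}

    def find(x):
        # Follow parent links to the group representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in scores.items():
        if score > threshold:
            parent[find(a)] = find(b)

    groups = {}
    for cluster_id in cluster_ids:
        groups.setdefault(find(cluster_id), set()).add(cluster_id)
    return sorted(groups.values(), key=min)


# Pairwise scores from the example above.
scores = {
    (1, 2): 0.5, (1, 3): 0.1, (1, 4): 0.1, (1, 5): 0.0,
    (2, 3): 0.1, (2, 4): 0.6, (2, 5): 0.0,
    (3, 4): 0.1, (3, 5): 0.5,
    (4, 5): 0.0,
}

merged = merge_clusters([1, 2, 3, 4, 5], scores)
print(merged)  # [{1, 2, 4}, {3, 5}]
```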

4. Definitive Clustering - merge articles into the same clusters in cases where they share certain features.

  • Theory: these features generally occur a few thousand times or fewer in a corpus of 30 million records. Because they occur so infrequently, we will use these to merge clusters whenever they occur.
  • Instructions: any article that shares any of these features with another article should be in the same cluster as that other article.

Feature: email

Feature: grant identifiers

  • Parse NIH grant identifiers into a standard format. Logic:
    • Find the first two consecutive letters. Track the letters.
    • There may be a space or a dash or no additional characters.
    • Now identify the first run of 4-6 consecutive digits afterward.
    • Stop looking for additional digits when:
      • a dash or space interrupts the digits
      • or the value ends
      • or we would exceed 6 digits
    • Track the digits.
    • This gives you a normalized version of a grant ID - "DA-01457"
  • Note that we’re ignoring British grants - G0902173 22927437, MOP2390941 25692343. Also, if there are multiple grants in a single grant ID, we're only selecting the second one.
  • Ignore cases where an article indexes more grants than clusteringGrants-threshold (see below) as recorded in application.properties. (e.g., amc2056, 22966490)
clusteringGrants-threshold: 12

We want these:

<GrantID>UL1TR000457</GrantID>
<GrantID>UL1-TR000457-06</GrantID>
<GrantID>GM55767</GrantID>
<GrantID>CA 59327</GrantID>
<GrantID>057559</GrantID>
<GrantID>R01 DK060933-01A2</GrantID>
<GrantID>R01 DK060933-02</GrantID>

We don't care about these:

<GrantID>PD/2008/1</GrantID>
<GrantID>79,533</GrantID>
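
The parsing rules can be approximated with a single regular expression. This is a sketch of one interpretation, and the function names and regex are assumptions: it takes the first pair of letters that is directly followed (after an optional space or dash) by 4-6 digits, so `UL1TR000457` normalizes to `TR-000457` rather than anything starting with `UL`. A letterless ID such as `057559` is not covered by the two-letter rule, so this sketch returns `None` for it.

```python
import re

# Two letters, an optional space or dash, then 4-6 digits (greedy, capped at 6).
GRANT_PATTERN = re.compile(r"([A-Za-z]{2})[ -]?(\d{4,6})")


def normalize_grant_id(raw: str):
    """Return e.g. 'DK-060933', or None if the value doesn't fit the pattern."""
    match = GRANT_PATTERN.search(raw)
    if match is None:
        return None
    letters, digits = match.groups()
    return f"{letters.upper()}-{digits}"


def clustering_grants(grant_ids, threshold=12):
    """Normalized grants for one article; empty if it indexes too many grants."""
    if len(grant_ids) > threshold:  # clusteringGrants-threshold
        return set()
    normalized = (normalize_grant_id(grant_id) for grant_id in grant_ids)
    return {grant for grant in normalized if grant is not None}


for raw in ["UL1TR000457", "UL1-TR000457-06", "GM55767", "CA 59327",
            "R01 DK060933-01A2", "R01 DK060933-02", "PD/2008/1", "79,533"]:
    print(raw, "->", normalize_grant_id(raw))
```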

Feature: cites or cited by

  • Identify cases where an article from one cluster cites an article from another cluster, or vice versa.
  • This code already exists.
  • Theoretically, we could also do this with data from Scopus, which has 3x the citation coverage.

Feature: MeSH major where global raw count in MeSH table < 4,000

  • Identify cases where an article from one cluster shares the same MeSH major as an article from another cluster, and that MeSH major has a global count of < 4,000.
  • Part of this code already exists.
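
Definitive clustering as described in step 4 can be sketched as a union-find over shared feature values. Everything in this example is hypothetical: the article dictionary layout, the PMIDs, and the MeSH counts. It also assumes grant IDs are already normalized and that MeSH terms with an unknown count are not treated as rare.

```python
from collections import defaultdict


def definitive_clusters(articles, mesh_counts, mesh_threshold=4000):
    """Merge articles sharing an email, a grant, a rare MeSH major, or a citation."""
    parent = {pmid: pmid for pmid in articles}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Inverted index: (feature type, value) -> articles carrying that value.
    index = defaultdict(list)
    for pmid, article in articles.items():
        for email in article.get("emails", ()):
            index[("email", email.lower())].append(pmid)
        for grant in article.get("grants", ()):
            index[("grant", grant)].append(pmid)
        for term in article.get("mesh_major", ()):
            # Unknown terms default to the threshold, i.e. not rare.
            if mesh_counts.get(term, mesh_threshold) < mesh_threshold:
                index[("mesh", term)].append(pmid)

    for pmids in index.values():
        for other in pmids[1:]:
            parent[find(other)] = find(pmids[0])

    # "Cites or cited by": link an article to any candidate article it cites.
    for pmid, article in articles.items():
        for cited in article.get("cites", ()):
            if cited in parent:
                parent[find(cited)] = find(pmid)

    groups = {}
    for pmid in articles:
        groups.setdefault(find(pmid), set()).add(pmid)
    return sorted(groups.values(), key=min)


articles = {  # hypothetical data
    1: {"emails": ["a@example.org"]},
    2: {"emails": ["A@example.org"], "grants": ["DK-060933"]},
    3: {"grants": ["DK-060933"]},
    4: {"mesh_major": ["Brain"]},
    5: {"mesh_major": ["Brain", "Thalassemia"]},
    6: {"mesh_major": ["Thalassemia"]},
    7: {"cites": [4]},
}
mesh_counts = {"Brain": 200000, "Thalassemia": 3000}  # hypothetical counts

result = definitive_clusters(articles, mesh_counts)
print(result)  # [{1, 2, 3}, {4, 7}, {5, 6}]
```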

paulalbert1 commented

Now being tested.

sarbajitdutta commented

Changing the feature count: instead of treating all Scopus affiliation IDs as a single feature, each affiliation ID now counts as its own feature.
