Implement revised clustering strategy #217

paulalbert1 · 2018-05-26T11:40:23Z

This clustering strategy should exist parallel to the existing strategy so that we can assess performance and maybe improve this one day.

1. Retrieve candidate articles from esearchresult

Instructions
- If mode=evidence, retrieve all articles from esearchresult
- If mode=testing, retrieve all articles from esearchresult except those returned by goldStandardRetrievalStategy. Only use those.

2. Assign each individual article to its own cluster.

3. "Tepid Clustering" - merge articles into the same clusters in cases where they share a certain proportion of common features

Theory: these features generally occur fewer than 100,000 times in a corpus of 30 million records. We will use these to merge articles into a single cluster, but only when they are shared a certain proportion of the time.
General instructions: compare clusters to each other using the below features. If their similarity exceeds some threshold, combine the clusters.
Rationale: Because these features may occur more often by chance, we do not automatically combine the clusters if they share the feature. Instead the cluster-cluster comparison needs to meet or exceed a scoring threshold (which is described below).

3a. Determine if name match is plausible. (ON HOLD)

When doing ANY clustering (including "definitive clustering" below), we should first check to see if it's plausible that the same author wrote both papers. We're trying to make a simple determination: is a pair of articles eligible to be combined during clustering? The following logic should apply to both tepid and definitive clustering.

Step 1. Do both articles have targetAuthor=TRUE assigned for at least one of their authors?

If yes, go to step 2.
If no, the pair of articles is eligible to cluster.

Step 2. Is length(article1.forename) = 2 characters, and length(article2.forename) = 2 characters?

If yes, go to step 3.
If no, go to step 4.

Step 3. Does article1.forename = article2.forename?

If yes, the pair of articles is eligible to cluster.
If no, the pair of articles is NOT eligible to cluster.

Step 4. Is length(article1.forename) > 3 characters, and length(article2.forename) > 3 characters?

If yes, go to step 5.
If no, the pair of articles is eligible to cluster.

Step 5. Check to see if any of the following conditions are true. (One will suffice.)

forename1 = forename2
4 consecutive characters of forename1 overlap with forename2
Levenshtein distance of < 2 between forename1 and forename2 (e.g., AhRum vs. AhReum)

Is one of the above conditions true?

If yes, the pair of articles is eligible to cluster.
If no, the pair of articles is NOT eligible to cluster.

Test case: mcr2004, 27741972 (M. Carrington) should not be clustered with 27631718 (Christopher M).
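
The eligibility check above could be sketched roughly as follows. This is an illustration only, not the ReCiter implementation: the function names are hypothetical, a forename of `None` is assumed to mean that no author on the article has targetAuthor=TRUE, and the case-insensitive comparison is an added assumption.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def shares_substring(a: str, b: str, n: int = 4) -> bool:
    """True if any run of n consecutive characters of a appears in b."""
    return any(a[i:i + n] in b for i in range(len(a) - n + 1))


def eligible_to_cluster(forename1: Optional[str], forename2: Optional[str]) -> bool:
    # Step 1: without a target author on both articles, the pair is eligible.
    if forename1 is None or forename2 is None:
        return True
    f1, f2 = forename1.strip().lower(), forename2.strip().lower()
    # Steps 2-3: two-character forenames must match exactly.
    if len(f1) == 2 and len(f2) == 2:
        return f1 == f2
    # Steps 4-5: longer forenames need one of the three conditions to hold.
    if len(f1) > 3 and len(f2) > 3:
        return f1 == f2 or shares_substring(f1, f2) or levenshtein(f1, f2) < 2
    # Step 4, "no" branch: too little information to rule the pair out.
    return True
```

With this sketch, `eligible_to_cluster("AhRum", "AhReum")` is true because the edit distance is 1, while two unrelated long forenames are rejected.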

3b. Identify features of each cluster.

Feature: journal name

Identify journal title of all articles in a given cluster

Feature: co-author name

Identify the lastName, firstInitial of all authors in a given cluster where targetAuthor=FALSE
Exclude the following common author names (lastName, firstInitial):
- Y. Wang
- J. Wang
- J. Smith
- S. Kim
- S. Lee
- J. Lee

Feature: MeSH major term

Identify all MeSH major terms in a given cluster where count of that MeSH term is < 100,000 in MeSH table.

Feature: Scopus Affiliation ID for targetAuthor

Grab Scopus affiliation IDs.

3c. Create arrays for each cluster

{
	clusterId: 1;
	journals: Cell;
	MeSH major: Thalassemia, Sunlight, Tamoxifen, Tryptophan, Brain;
	coauthors: Marshall T, Michaels A;
	targetAuthorScopusAffiliationID: 6007997;
}

3d. Calculate the clusterClusterSimilarityScore between ALL pairs of different clusters

So, if you had 4 clusters: A, B, C, D, you would need to calculate the following cluster-cluster similarity scores. (As you'll see, the order of the cluster comparisons, A-B vs. B-A, doesn't matter.):

A-B
A-C
A-D
B-C
B-D
C-D
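
As a small illustration of the unordered pairing above, Python's `itertools.combinations` yields each pair exactly once. The cluster labels are just the ones from the example.

```python
from itertools import combinations

cluster_ids = ["A", "B", "C", "D"]

# Each unordered pair appears once, so A-B is produced but B-A is not.
pairs = list(combinations(cluster_ids, 2))
print(pairs)
```

For n clusters this produces n(n-1)/2 comparisons, which is 6 for the four clusters here.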

There are three variables for each clusterCluster comparison:

totalItemsCluster1
totalItemsCluster2
overlapCluster1Cluster2

Let's do a sample calculation:

{
	clusterId: 1;
	journals: Cell;
	MeSH major: Thalassemia, Sunlight, Tamoxifen, Tryptophan, Brain;
	coauthors: Marshall T, Michaels A;
	targetAuthorScopusAffiliationID: 6007997;
}

{
	clusterId: 2;
	journals: Cell;
	MeSH major: Thalassemia, Tamoxifen;
	coauthors: Marshall T, Michaels A, Johnson Q;
	targetAuthorScopusAffiliationID: 6007997, 342823053;
}

Compute those variables.

totalItemsCluster1 = 1 + 5 + 2 + 1 = 9
totalItemsCluster2 = 1 + 2 + 3 + 1 = 7
overlapCluster1Cluster2 = 6
/* Notes:

- Overlap is done only between types, i.e., a journal (e.g., Brain) can't match with a MeSH major (e.g., Brain). If one of the two articles doesn't have a feature (e.g., MeSH major), neither article's features are included in the matching.
- Overlap between institutions counts a maximum of one point even if one target author has multiple affiliations and another has one affiliation.
*/

Compute the clusterClusterSimilarityScore as per this formula...

overlapCluster1Cluster2 ^ 2 / (totalItemsCluster1 * totalItemsCluster2)

Let's figure out clusterClusterSimilarityScore in this example:

(6^2) / (9 * 7) = 0.57
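
A minimal sketch of this calculation, assuming each cluster is held as a dict of feature type to set of values (the dict keys and function name are hypothetical). Following the notes and the sample totals above, affiliations contribute at most one point to each total and to the overlap, and a feature type missing from either cluster is skipped for both.

```python
FEATURES = ("journals", "mesh_major", "coauthors", "affiliation_ids")
SINGLE_POINT = {"affiliation_ids"}  # counted as one item regardless of how many IDs


def cluster_cluster_similarity(c1: dict, c2: dict) -> float:
    total1 = total2 = overlap = 0
    for feature in FEATURES:
        items1, items2 = c1.get(feature, set()), c2.get(feature, set())
        if not items1 or not items2:
            continue  # feature absent on one side: excluded for both
        shared = len(items1 & items2)  # overlap only within the same feature type
        if feature in SINGLE_POINT:
            total1 += 1
            total2 += 1
            overlap += min(shared, 1)
        else:
            total1 += len(items1)
            total2 += len(items2)
            overlap += shared
    if total1 == 0 or total2 == 0:
        return 0.0
    return overlap ** 2 / (total1 * total2)


cluster1 = {
    "journals": {"Cell"},
    "mesh_major": {"Thalassemia", "Sunlight", "Tamoxifen", "Tryptophan", "Brain"},
    "coauthors": {"Marshall T", "Michaels A"},
    "affiliation_ids": {"6007997"},
}
cluster2 = {
    "journals": {"Cell"},
    "mesh_major": {"Thalassemia", "Tamoxifen"},
    "coauthors": {"Marshall T", "Michaels A", "Johnson Q"},
    "affiliation_ids": {"6007997", "342823053"},
}
print(round(cluster_cluster_similarity(cluster1, cluster2), 2))  # 0.57
```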

3e. Compare against threshold

Set clusterClusterSimilarityScoreThreshold in application.properties to be 0.2. (We can change this if it's too aggressive. It's actually a bit high perhaps.)

If clusterClusterSimilarityScore > clusterClusterSimilarityScoreThreshold, merge clusters.

Example

Suppose there are five clusters. We want to measure similarity between all of these:

- Cluster 1 vs. Cluster 2 - score = 0.5
- Cluster 1 vs. Cluster 3 - score = 0.1
- Cluster 1 vs. Cluster 4 - score = 0.1
- Cluster 1 vs. Cluster 5 - score = 0.0
- Cluster 2 vs. Cluster 3 - score = 0.1
- Cluster 2 vs. Cluster 4 - score = 0.6
- Cluster 2 vs. Cluster 5 - score = 0.0
- Cluster 3 vs. Cluster 4 - score = 0.1
- Cluster 3 vs. Cluster 5 - score = 0.5
- Cluster 4 vs. Cluster 5 - score = 0.0

Identify the cluster pairs whose clusterClusterSimilarityScore exceeds our threshold:

{1,2}
{2,4}
{3,5}

Combine clusters until there is no overlap.

{1,2,4}
{3,5}
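
One way to "combine clusters until there is no overlap" is a union-find pass over the pairs that exceed the threshold. The sketch below is an assumption about how this could be done, not a description of the existing code: the function name is hypothetical, and scores are taken as given rather than recomputed after each merge, which the issue does not specify.

```python
def merge_clusters(cluster_ids, scores, threshold=0.2):
    """Group clusters transitively linked by scores above the threshold."""
    parent = {c: c for c in cluster_ids}

    def find(c):
        # Follow parent links to the group's representative, compressing the path.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for (a, b), score in scores.items():
        if score > threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = {}
    for c in cluster_ids:
        groups.setdefault(find(c), set()).add(c)
    return sorted(groups.values(), key=min)


scores = {
    (1, 2): 0.5, (1, 3): 0.1, (1, 4): 0.1, (1, 5): 0.0,
    (2, 3): 0.1, (2, 4): 0.6, (2, 5): 0.0,
    (3, 4): 0.1, (3, 5): 0.5, (4, 5): 0.0,
}
print(merge_clusters([1, 2, 3, 4, 5], scores))  # [{1, 2, 4}, {3, 5}]
```

Clusters 1 and 4 end up together even though their direct score is only 0.1, because both are linked to cluster 2.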

4. Definitive Clustering - merge articles into the same clusters in cases where they share certain features.

Theory: these features generally occur thousands or fewer times in a corpus of 30 million records. Because they occur so infrequently, we will use these to merge clusters whenever they occur.
Instructions: any article that shares any of these features with another article should be in the same cluster as that other article.

Feature: email

Parse email addresses of all authors including cases where targetAuthor=FALSE and targetAuthor=TRUE
Preprocess into standardized format:
- Set to lowercase
- Get rid of unnecessary characters:
  - Approach #1: substitute out: periods, dashes, commas, parentheses, <, >, ;
  - Approach #2: only allow letters, numbers, and @
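
Both preprocessing approaches can be sketched with a regular expression. The function names are hypothetical and the sample address is made up. Note that the two approaches differ for characters such as underscores, which the first keeps and the second removes.

```python
import re


def normalize_email_strip(raw: str) -> str:
    """Lowercase, then remove periods, dashes, commas, parentheses, <, >, and ;."""
    return re.sub(r"[.\-,()<>;]", "", raw.lower())


def normalize_email_allowlist(raw: str) -> str:
    """Lowercase, then keep only letters, numbers, and @."""
    return re.sub(r"[^a-z0-9@]", "", raw.lower())


print(normalize_email_strip("<Jane.Doe-Smith@Example.org>;"))  # janedoesmith@exampleorg
print(normalize_email_allowlist("j_doe@Example.org"))          # jdoe@exampleorg
```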

Feature: grant identifiers

Parse NIH grant identifiers into a standard format. Logic:
- Find the first two consecutive letters. Track the letters.
- There may be a space or a dash or no additional characters.
- Now identify the first 4-6 consecutive numbers afterward.
- Stop looking for additional numbers when:
  - a dash or space interrupts the numbers
  - or the value ends
  - or, we're exceeding 6 numbers
- Track the numbers.
- This gives you a normalized version of a grant ID - "DA-01457"
Note that we’re ignoring British grants - G0902173 22927437, MOP2390941 25692343. Also, if there are multiple grants in a single grant ID, we're only selecting the second one.
Ignore cases where article indexes more grants than clusteringGrants-threshold (see below) as recorded in application.properties. (e.g., amc2056, 22966490)

clusteringGrants-threshold: 12

We want these:

<GrantID>UL1TR000457</GrantID>
<GrantID>UL1-TR000457-06</GrantID>
<GrantID>GM55767</GrantID>
<GrantID>CA 59327</GrantID>
<GrantID>057559</GrantID>
<GrantID>R01 DK060933-01A2</GrantID>
<GrantID>R01 DK060933-02</GrantID>

We don't care about these:

<GrantID>PD/2008/1</GrantID>
<GrantID>79,533</GrantID>
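
The parsing rules above can be approximated with a single regular expression. This is one possible reading of the rules, not the existing implementation: the names are hypothetical, the pattern takes the first two-letter run that is followed (optionally after a space or dash) by 4 to 6 digits, and it keeps at most six digits. Under this reading `UL1TR000457` normalizes to `TR-000457`, and a bare number such as `057559` has no letter prefix and yields nothing; the issue does not state how that ID should be handled.

```python
import re
from typing import Iterable, Optional, Set

# Two letters, an optional space or dash, then 4 to 6 digits.
GRANT_PATTERN = re.compile(r"([A-Za-z]{2})[ -]?(\d{4,6})")


def normalize_grant_id(raw: str) -> Optional[str]:
    """Return a normalized ID such as 'CA-59327', or None if nothing matches."""
    match = GRANT_PATTERN.search(raw)
    if match is None:
        return None
    letters, digits = match.groups()
    return f"{letters.upper()}-{digits}"


def grants_for_clustering(grant_ids: Iterable[str], max_grants: int = 12) -> Set[str]:
    """Normalize an article's grants; ignore articles indexing more than max_grants."""
    grant_ids = list(grant_ids)
    if len(grant_ids) > max_grants:
        return set()
    return {g for g in map(normalize_grant_id, grant_ids) if g is not None}


for raw in ["GM55767", "CA 59327", "R01 DK060933-01A2", "UL1-TR000457-06", "PD/2008/1"]:
    print(raw, "->", normalize_grant_id(raw))
```

Here `max_grants` stands in for the clusteringGrants-threshold value from application.properties.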

Feature: cites or cited by

Identify cases where an article from one cluster cites an article from another cluster, or vice versa.
This code already exists.
Theoretically, we could also do this with data from Scopus, which contains 3x as much citation coverage.

Feature: MeSH major where global raw count in MeSH table < 4,000

Identify cases where an article from one cluster shares the same MeSH major as an article from another cluster, and that MeSH major has a global count of < 4,000.
Part of this code already exists.


paulalbert1 · 2018-06-20T23:02:08Z

Now being tested.

sarbajitdutta · 2019-01-10T16:08:17Z

Changing the feature count: instead of treating all Scopus affiliation IDs together as a single feature, each affiliation ID now counts as its own feature.

paulalbert1 assigned sarbajitdutta May 26, 2018

sarbajitdutta added the enhancement label May 31, 2018

sarbajitdutta added this to In progress in ReCiter Development 2020-1 May 31, 2018

paulalbert1 moved this from In progress to Testing in ReCiter Development 2020-1 Jun 20, 2018

paulalbert1 closed this as completed Jun 20, 2018

paulalbert1 mentioned this issue Jul 17, 2018

Use additional criteria for Phase One clustering #115

Closed

This was referenced Jul 26, 2018

Refine clustering by not clustering under certain circumstances #254

Closed

Output clustering relationships in feature generator API #256

Closed

sarbajitdutta moved this from Testing to Done in ReCiter Development 2020-1 Jan 7, 2019

sarbajitdutta reopened this Jan 10, 2019

sarbajitdutta moved this from Done to In progress in ReCiter Development 2020-1 Jan 10, 2019

sarbajitdutta mentioned this issue Jan 10, 2019

Add maven dependency from Maven and Cleanup #316

Merged

sarbajitdutta moved this from In progress to Done in ReCiter Development 2020-1 Jan 10, 2019

paulalbert1 closed this as completed Apr 18, 2019
