Refactor retrieval strategy #259

Closed
paulalbert1 opened this issue Jul 27, 2018 · 0 comments

paulalbert1 commented Jul 27, 2018

application.properties

Add these to application.properties. How they are used is described in steps 6 and 9 below.

searchStrategy-leninent-threshold: 2000
searchStrategy-strict-threshold: 1000

Define refreshFlag

0a. Set variable "dateFilter" equal to null.

0a. User selects one of these refreshFlags for retrieval (this should be an option in the Swagger UI):

  • If "All publications" - re-import all publications
  • If "Only newly added publications" - go to 0b
  • If "False" (default) - retrieve existing records from eSearchResults

0b. Construct date filter.

If "Only newly added publications" is selected, we do an incremental lookup. Grab latestRetrievalDate from esearchresults and construct the query using that date. Suppose the latest retrieval date was August 1, 2018. The modifier would look like this. Note that 3000 is PubMed's suggested maximum year, but it could be anything.

(("2018/08/01"[Date - Entrez] : "3000"[Date - Entrez]))

Store this as dateFilter.
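
As a rough illustration of step 0b, the date filter could be built along these lines. This is only a sketch: the function name `build_date_filter` and the `max_year` parameter are hypothetical, and returning None when there is no prior retrieval date is an assumption based on step 0a.

```python
from datetime import date
from typing import Optional


def build_date_filter(latest_retrieval_date: Optional[date],
                      max_year: str = "3000") -> Optional[str]:
    """Build the Entrez date modifier described in step 0b.

    Returns None when no previous retrieval date is known, which
    mirrors dateFilter being initialised to null (an assumption).
    """
    if latest_retrieval_date is None:
        return None
    # PubMed date fields use the YYYY/MM/DD format.
    start = latest_retrieval_date.strftime("%Y/%m/%d")
    return f'(("{start}"[Date - Entrez] : "{max_year}"[Date - Entrez]))'


if __name__ == "__main__":
    print(build_date_filter(date(2018, 8, 1)))
```

Calling it with August 1, 2018 reproduces the modifier shown above.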

goldStandardRetrievalStrategy

1. Is use.gold.standard.evidence=true?

  • If yes, use goldStandardRetrievalStrategy.
  • If no, do nothing.

Go to 2.

emailRetrievalStrategy

2. If user has one or more emails, run emailRetrievalStrategy.

Test case: mcr2004@med.cornell.edu

lastNameFirstInitialRetrievalStrategy

3. Retrieve all unique forms of identity.name.lastName and identity.name.firstInitial from Identity table for targetAuthor.

Look at both primaryName and alternateNames.

4. Derive additional names, if possible.

Logic:
A. For any name in primaryName or alternateNames, does targetAuthor have a surname that satisfies both of these conditions: it contains a space or dash, and breaking the name at the first space or dash yields two strings of four or more characters each?

  • If yes, go to B
  • If no, there is no need to derive name aliases.

B. Attempt to derive additional name aliases by breaking up any surnames, splitting on space or dash.

For example: ses9022 has a primaryName of Selin Somersan-Karakaya. This would translate into:

  • Somersan-Karakaya S
  • Somersan S
  • Karakaya S

Some users have multiple spaces (e.g., alg9037 - Gonzalez Della Valle). In such cases, split only on the first space or dash.

  • Gonzalez A
  • Della Valle A

Test cases (CWID, surname in Enterprise Directory, surname in article)

  • lveeck, Veeck Gosden, Veeck
  • nlt2002, Tottenham-Delafield, Tottenham
  • sinhana, Sinha Gregory, Gregory
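
The rules in A and B could be sketched roughly as follows. The function name `derive_surname_aliases` and the constant `MIN_PART_LENGTH` are hypothetical; the function returns only the derived aliases and assumes the original surname is searched separately.

```python
import re
from typing import List

# Each half of the split surname must have at least this many characters.
MIN_PART_LENGTH = 4


def derive_surname_aliases(surname: str, first_initial: str) -> List[str]:
    """Split a compound surname on its first space or dash only.

    Returns an empty list when the surname has no space or dash, or
    when either half is shorter than MIN_PART_LENGTH.
    """
    parts = re.split(r"[ -]", surname.strip(), maxsplit=1)
    if len(parts) != 2:
        return []
    left, right = parts[0].strip(), parts[1].strip()
    if len(left) < MIN_PART_LENGTH or len(right) < MIN_PART_LENGTH:
        return []
    return [f"{left} {first_initial}", f"{right} {first_initial}"]


if __name__ == "__main__":
    print(derive_surname_aliases("Somersan-Karakaya", "S"))
    print(derive_surname_aliases("Gonzalez Della Valle", "A"))
```

Because the split uses `maxsplit=1`, "Gonzalez Della Valle" yields "Gonzalez" and "Della Valle" rather than three pieces.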

C. Does the user have a first initial followed by a period or space in givenName for primaryName, like any of the following?

- W. Clay[firstName] Bracken[lastName]
- W.[firstName] Clay[middleName] Bracken[lastName]
- W Clay[firstName] Bracken[lastName]
- W[firstName] Clay[middleName] Bracken[lastName]
  • If yes, attempt to derive a new name alias for lookup. In all of the above cases, we derive the name "Bracken C". Let's call this lookup "abbreviatedFirstNameRetrievalStrategy".

use case: wcb2001
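
One possible way to express rule C in code is shown below. This is a sketch under assumptions: the function name and the three separate name fields are hypothetical, and it treats any single-letter first token (with or without a period) as the leading initial.

```python
from typing import Optional


def derive_abbreviated_first_name_alias(first_name: Optional[str],
                                        middle_name: Optional[str],
                                        last_name: str) -> Optional[str]:
    """Derive an alias such as "Bracken C" from a name like "W. Clay Bracken".

    Returns None when the given name does not start with a lone initial.
    """
    given = " ".join(part for part in (first_name, middle_name) if part)
    # Treat periods as separators so "W. Clay" and "W Clay" behave the same.
    tokens = given.replace(".", " ").split()
    if len(tokens) >= 2 and len(tokens[0]) == 1:
        return f"{last_name} {tokens[1][0].upper()}"
    return None


if __name__ == "__main__":
    print(derive_abbreviated_first_name_alias("W. Clay", None, "Bracken"))
    print(derive_abbreviated_first_name_alias("W", "Clay", "Bracken"))
```

All four input shapes listed above reduce to the same token list, so they produce the same alias.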

5. Sanitize

  • Deaccent all names.
  • Remove suffixes from all names, using the suffixes listed in nameScoringStrategy-excludedSuffixes, which is stored in application.properties.
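
A minimal sketch of the sanitizing step, using Python's standard `unicodedata` module. The helper names are hypothetical, and the suffix values in the usage lines ("Jr", "III") are illustrative only; the actual contents and format of nameScoringStrategy-excludedSuffixes are not shown here.

```python
import unicodedata
from typing import Iterable


def deaccent(name: str) -> str:
    """Strip combining accent marks, e.g. "González" becomes "Gonzalez"."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def remove_suffixes(name: str, excluded_suffixes: Iterable[str]) -> str:
    """Drop trailing tokens that match an excluded suffix (case-insensitive)."""
    excluded = {s.strip().lower().rstrip(".") for s in excluded_suffixes}
    tokens = name.replace(",", " ").split()
    # Only trailing tokens are removed, and at least one token is always kept.
    while len(tokens) > 1 and tokens[-1].lower().rstrip(".") in excluded:
        tokens.pop()
    return " ".join(tokens)


if __name__ == "__main__":
    print(deaccent("Gonz\u00e1lez"))
    print(remove_suffixes("Bracken Jr.", ["Jr", "III"]))
```

Letters that have no Unicode decomposition are left unchanged by this approach, so a fuller implementation might need an explicit mapping for them.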

6. Does this search return more results than the value set in searchStrategy-leninent-threshold? (Do not include derived names at this point.)

[lastName firstInitial for primaryName] OR [lastName firstInitial for alternateName1] OR [lastName firstInitial for alternateName2]...

  • If no, do the above search AND dateFilter. Return results. We're done with primaryName and alternateNames. Then go to strictRetrievalStrategy for any cases where name type=derived. If there are no such cases, stop.
  • If yes, we're going to look these names up using strictRetrievalStrategy-*.
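
The decision in step 6 might be sketched like this. Both function names are hypothetical, and the `[au]` field tag on each name is an assumption borrowed from the author examples elsewhere in this issue; the result count itself would come from a PubMed count query that is not shown.

```python
from typing import List


def build_name_query(names: List[str]) -> str:
    """OR together the lastName firstInitial forms of primary and alternate names."""
    return " OR ".join(f"{name}[au]" for name in names)


def needs_strict_retrieval(result_count: int, lenient_threshold: int) -> bool:
    """True when the count exceeds searchStrategy-leninent-threshold."""
    return result_count > lenient_threshold


if __name__ == "__main__":
    query = build_name_query(["Albert P", "Bales M"])
    print(query)
    print(needs_strict_retrieval(result_count=2500, lenient_threshold=2000))
```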

strictRetrievalStrategy

7. Preprocessing: we need to construct a series of parameters which limit our result set.

A. strictRetrievalStrategy-knownRelationships

  • Source: identity.knownRelationships where:
  • Example: "OR Albert P[au] OR Bales M[au]"
  • Test case: 23945227 for drw2004. Should match on did2005 (Delgado D).

B. strictRetrievalStrategy-fullName

  • Source: identity.fullName
  • Example: "OR Reid M. Carrington[au]"

C. strictRetrievalStrategy-grants

  • Source: grant identifiers from identity.grants
  • Example: for ajg9004: "OR TR000457 OR TR0000458"

D. strictRetrievalStrategy-institutions

  • Source: identity.institutions and homeInstitution-keywords
  • Instructions: remove any institutionStopwords. Here is what the homeInstitution-keywords attribute and its value look like, as stored in application.properties:
homeInstitution-keywords: weill|cornell, weill|medicine, cornell|medicine, cornell|medical, weill|medical, weill|bugando, weill|graduate, cornell|presbyterian, weill|presbyterian, 10065|cornell, 10065|presbyterian, 10021|cornell, 10021|presbyterian, weill|qatar, cornell|qatar, @med.cornell.edu, @qatar-med.cornell.edu

For preprocessing, take the above and:

  • Replace each , with OR
  • Replace each | with AND

The full output using the existing value in application.properties should look like this.

AND (weill AND cornell) OR (weill AND medicine) OR (cornell AND medicine) OR (cornell AND medical) OR (weill AND medical) OR (weill AND bugando) OR (weill AND graduate) OR (cornell AND presbyterian) OR (weill AND presbyterian) OR (10065 AND cornell) OR (10065 AND presbyterian) OR (10021 AND cornell) OR (10021 AND presbyterian) OR (weill AND qatar) OR (cornell AND qatar) OR @med.cornell.edu OR @qatar-med.cornell.edu
  • Example: "OR (University of Milwaukee)[affiliation] OR (University of Shanghai (China))[affiliation] OR (weill AND cornell) OR (weill AND medicine)..."
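
The comma and pipe substitution for homeInstitution-keywords could be sketched as below. The function name `keywords_to_query` is hypothetical, and the sketch returns only the keyword clause; prepending the leading AND or OR shown in the outputs above is left to the caller.

```python
def keywords_to_query(raw_value: str) -> str:
    """Convert a homeInstitution-keywords value into a boolean clause.

    Commas separate alternatives (OR) and pipes join required terms (AND).
    Multi-term groups are parenthesised; single terms are left bare.
    """
    groups = []
    for entry in raw_value.split(","):
        terms = [term.strip() for term in entry.split("|") if term.strip()]
        if not terms:
            continue
        if len(terms) == 1:
            groups.append(terms[0])
        else:
            groups.append("(" + " AND ".join(terms) + ")")
    return " OR ".join(groups)


if __name__ == "__main__":
    sample = "weill|cornell, weill|medicine, @med.cornell.edu"
    print(keywords_to_query(sample))
```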

E. strictRetrievalStrategy-departments

  • Source: identity.departments
  • Example: for ajg9004 - "OR Radiology[affiliation]"

F. strictRetrievalStrategy-secondInitial

  • Lookup by the first two capital letters in the user's first name or middle name. Examples:
Warren,James,David --> Warren JD
Choi,Augustine,M.K. --> Choi AM
Choi,Hyo Kyoung,NULL --> Choi HK
Moore,John,P --> Moore JP 
  • Do this across all primary and alternate names.
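
The second-initial rule could be sketched as follows, checked against the four examples above. The function name is hypothetical, and treating the literal string "NULL" as a missing middle name is an assumption based on the notation in the examples.

```python
from typing import Optional


def second_initial_name(last_name: str,
                        first_name: Optional[str],
                        middle_name: Optional[str]) -> Optional[str]:
    """Build "LastName XY" from the first two capital letters of the given names."""
    parts = [p for p in (first_name, middle_name) if p and p != "NULL"]
    capitals = [ch for ch in " ".join(parts) if ch.isupper()]
    if len(capitals) < 2:
        # Not enough capital letters to form a two-initial lookup.
        return None
    return f"{last_name} {''.join(capitals[:2])}"


if __name__ == "__main__":
    print(second_initial_name("Warren", "James", "David"))
    print(second_initial_name("Choi", "Augustine", "M.K."))
```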

G. ON HOLD: strictRetrievalStrategy-meshMajor

  • Source: meshMajor from any pub retrieved during goldStandardRetrievalStrategy or emailRetrievalStrategy, and that has meshTerm.count of 50,000 or less
  • Example: "OR cardiac arrest[Majr] OR Aneurysm, Dissecting[majr]"

8. Prepare searches

Strict retrieval strategy searches consist of two major pieces: the name(s), and the limiting parameters (one parameter from step 7 plus the dateFilter).

Suppose we've identified three distinct names that need to be retrieved using strict retrieval. Let's call them A, B, and C. Our searches would be:

  • (A OR B OR C) AND 7A AND dateFilter
  • (A OR B OR C) AND 7B AND dateFilter
  • (A OR B OR C) AND 7C AND dateFilter
  • (A OR B OR C) AND 7D AND dateFilter
  • (A OR B OR C) AND 7E AND dateFilter
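
Assembling these searches might look roughly like this. The function name and argument shapes are hypothetical, and the sketch assumes that an empty parameter is skipped and that a null dateFilter is simply left out of the query.

```python
from typing import List, Optional


def prepare_strict_searches(names: List[str],
                            parameters: List[str],
                            date_filter: Optional[str]) -> List[str]:
    """Build one search per step 7 parameter: (names) AND parameter AND dateFilter."""
    name_clause = "(" + " OR ".join(names) + ")"
    searches = []
    for parameter in parameters:
        if not parameter:
            # Nothing to constrain on for this identity; skip the search.
            continue
        query = f"{name_clause} AND {parameter}"
        if date_filter:
            query += f" AND {date_filter}"
        searches.append(query)
    return searches


if __name__ == "__main__":
    for search in prepare_strict_searches(["A", "B", "C"], ["7A", "7B"], "dateFilter"):
        print(search)
```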

9. Conduct searches

For each search, first get a count of results. If the result count is lower than the value set in searchStrategy-strict-threshold, proceed with storing the results. If it is not, skip over that search to the next one.
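
The count-then-store loop could be sketched as below. `count_results` and `fetch_results` are hypothetical callables standing in for whatever PubMed client is used; they are not part of any real library API.

```python
from typing import Callable, Dict, List


def conduct_searches(searches: List[str],
                     count_results: Callable[[str], int],
                     fetch_results: Callable[[str], List[str]],
                     strict_threshold: int) -> Dict[str, List[str]]:
    """Store results only for searches whose count is below the strict threshold."""
    stored = {}
    for query in searches:
        if count_results(query) < strict_threshold:
            stored[query] = fetch_results(query)
        # Otherwise skip this search and move on to the next one.
    return stored


if __name__ == "__main__":
    fake_counts = {"small": 12, "large": 5000}
    results = conduct_searches(
        ["small", "large"],
        count_results=lambda q: fake_counts[q],
        fetch_results=lambda q: [f"pmid-for-{q}"],
        strict_threshold=1000,
    )
    print(results)
```

Fetching only after the count check avoids pulling large result sets that would be discarded anyway.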

Test cases

  • For ajg9004: 24008170 21659479 22135127 24263697 24812015 24200901 21596805 21624991 11716577 23945228 23945227 24474262 25977478 26063003 26427831 26564432 26965465 27256856 27127002 27365331 25572949 26471747 23811973 24309123 29519791
  • For ses9022: 10903715 11673488 12496375 16614246 23012453 24197888 27144688
  • yiwang
  • For ccole: 30009991

Future development

  • Identify additional name aliases from targetAuthor in goldStandardRetrievalStrategy and in cases where known email is a match. We need to be storing data in the Analysis table before we can do this.