Configuring application.properties

paulalbert1 edited this page Jul 1, 2019 · 7 revisions

Most institution-specific configuration occurs within the application.properties file. This file, which contains a lot of inline documentation, is currently populated with a variety of settings including those that are Weill Cornell-specific.

It may be easy to become overwhelmed by the sheer number of options in application.properties. The purpose of this article is to help ReCiter users prioritize which values are the most important to configure. To understand how the application functions, please see How ReCiter works.

Changes to application.properties take effect only after you restart the application.

Local database configuration

You can run the DynamoDB database on Amazon AWS or locally. The latter is free. If you run it locally, you need to set aws.dynamoDb.local=true. The remaining fields are supplied through your environment variables.

Source of identity data

If the setting below is true, ReCiter looks for an "identity.json" file in /src/main/resources/files.

aws.dynamodb.settings.file.import=true

If it is false, ReCiter gets identity data from the DynamoDB connection.

S3 file storage

The S3 bucket name must be globally unique. If it isn't, you'll get an error during the build. Bucket names also cannot contain uppercase letters, so consider a name like "myinstitution-reciter-dynamodb".

aws.s3.dynamodb.bucketName=reciter-dynamodb
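
A quick local check can catch malformed bucket names before the build does. The sketch below covers only a simplified subset of the S3 naming rules (3 to 63 characters; lowercase letters, digits, dots, and hyphens; starting and ending with a letter or digit; no consecutive dots). The function name is hypothetical, and the check cannot tell you whether a name is already taken globally.

```python
import re

# Simplified subset of the S3 bucket naming rules (an assumption of this sketch,
# not the full rule set): 3-63 chars, lowercase letters, digits, dots, hyphens,
# and the name must start and end with a letter or digit.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def is_plausible_bucket_name(name: str) -> bool:
    """Return True if `name` passes the simplified syntax check above."""
    if ".." in name:  # consecutive dots are not allowed
        return False
    return _BUCKET_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("myinstitution-reciter-dynamodb", "MyInstitution-Reciter"):
        print(candidate, is_plausible_bucket_name(candidate))
```

Uniqueness still has to be confirmed against AWS itself; this only screens out names that would be rejected on syntax alone.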

Select evidence parameters and scores

Most of application.properties consists of attributes and scores. This section highlights the fields ReCiter users will most likely want to review and update.

The current scores were chosen through trial and error at one institution, Weill Cornell Medicine. To populate these fields most effectively, we suggest identifying a corpus of known publications and figuring out, for example, which email domains known authors use. In an indictment of indexing practices or our researchers' spelling, we learned that 242 papers have a PubMed affiliation statement for "Weil[sic] Cornell." We consider the values in this configuration file to be a first pass. There's certainly an opportunity for us to optimize these values using linear regression and other advanced techniques.

One of ReCiter's goals is to assign each article scores that meaningfully reflect how compelling a piece of evidence is. This matters not only for selecting a given article but also for clustering, where the score of one article affects the scores of other articles in its cluster. You could assign an evidence score of a million in cases where there's a match for email, but then all other scores in that cluster would automatically exceed most thresholds, even in cases where the clustering accidentally commingled candidate articles by different people. Or you could assign a score of -1,000 if an article has been rejected, but then other articles in that cluster would not be suggested no matter how compelling their evidence was.
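
To see why an extreme score swamps a cluster, here is a toy calculation. It assumes, purely for illustration, that a cluster's score is the arithmetic mean of its articles' scores; the actual cluster scoring is described in How ReCiter works and may differ. The threshold and all score values are made up.

```python
from statistics import mean

# Hypothetical acceptance threshold, for illustration only.
THRESHOLD = 10.0


def cluster_mean(scores: list[float]) -> float:
    """Toy cluster score: the plain average of the article scores."""
    return mean(scores)


# Three articles with modest evidence: the cluster stays below the threshold.
modest = [3.0, 4.0, 5.0]

# Same cluster, but one article was given a score of a million for an email match.
with_outlier = [1_000_000.0, 4.0, 5.0]

if __name__ == "__main__":
    print(cluster_mean(modest) > THRESHOLD)        # False
    print(cluster_mean(with_outlier) > THRESHOLD)  # True
```

Under this assumption, the two weak articles in the second cluster inherit a passing score solely because of their neighbor, which is the failure mode described above.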

Note that the article How ReCiter works describes how these fields are used in the context of the algorithm.

Affiliation evidence

  • authorAffiliationScoringStrategy.homeInstitution-keywords - Terms or combinations of terms appearing in PubMed affiliation statements that highly suggest someone from your institution is an author on a paper. This could include names, zip codes, and email address domains.
  • authorAffiliationScoringStrategy.homeInstitution-scopusInstitutionIDs - If you're using Scopus, include any Scopus Institution identifiers that correspond to your home institution. Scopus makes a number of splitting errors, so you may have a dozen or more.
  • authorAffiliationScoringStrategy.collaboratingInstitutions-scopusInstitutionIDs - If you're using Scopus, include any Scopus Institution identifiers that correspond to common institutional collaborators of your home institution. For example, Hospital for Special Surgery is in the same complex as Weill Cornell Medicine, and many Hospital for Special Surgery papers have a Weill Cornell co-author. So, if a paper lists HSS as an affiliation, a Weill Cornell person is commonly a co-author. We built this list by counting the most common affiliates and benchmarking each count against the total number of papers that institution has authored in Scopus. HSS has only 11,031 publications in Scopus, of which a whopping 2,888, or 26%, have a Weill Cornell co-author. Columbia University College of Physicians and Surgeons, which is only four miles away from Weill Cornell, has 49,360 publications but only 1,009 collaborations, or 2%. This field picks up not only common collaborations but also cases where individuals from one institution are more likely to move to a second institution.
  • strategy.authorAffiliationScoringStrategy.collaboratingInstitutions-keywords - Terms or combinations of terms appearing in PubMed affiliation statements that strongly suggest an author on a paper is at another institution that commonly collaborates with yours. This could include names, zip codes, and email address domains.
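
The benchmarking described for collaborating institutions is a simple ratio: papers co-authored with your institution divided by the other institution's total output. A minimal sketch using the two figures quoted above (the function and variable names are hypothetical):

```python
def collaboration_rate(shared_papers: int, total_papers: int) -> float:
    """Fraction of an institution's papers that include a home-institution co-author."""
    if total_papers <= 0:
        raise ValueError("total_papers must be positive")
    return shared_papers / total_papers


# (papers with a Weill Cornell co-author, total papers in Scopus), from the text above.
candidates = {
    "Hospital for Special Surgery": (2_888, 11_031),
    "Columbia University College of Physicians and Surgeons": (1_009, 49_360),
}

# Rank candidates by rate, highest first, rather than by raw collaboration count.
ranked = sorted(
    candidates.items(),
    key=lambda item: collaboration_rate(*item[1]),
    reverse=True,
)

if __name__ == "__main__":
    for name, (shared, total) in ranked:
        print(f"{name}: {collaboration_rate(shared, total):.0%}")
```

Ranking by rate rather than raw count keeps large institutions with many incidental collaborations from crowding out smaller, tightly linked ones.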

Email evidence

  • strategy.email.default.suffixes - any domains associated with the home institution; here we're looking to see if "uid" + [domain] is listed in the affiliation statement
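
As a rough illustration of the "uid" + [domain] check, the sketch below builds candidate addresses from a uid and a list of suffixes and looks for them in an affiliation string. It is not ReCiter's implementation; the function names, the uid, and the example.edu suffixes are placeholders.

```python
def candidate_emails(uid: str, suffixes: list[str]) -> list[str]:
    """Combine a uid with each configured suffix, e.g. 'abc2001' + '@example.edu'."""
    return [f"{uid}{suffix}".lower() for suffix in suffixes]


def affiliation_mentions_email(affiliation: str, uid: str, suffixes: list[str]) -> bool:
    """Case-insensitive substring test for any candidate address in the affiliation text."""
    text = affiliation.lower()
    return any(email in text for email in candidate_emails(uid, suffixes))


if __name__ == "__main__":
    suffixes = ["@example.edu", "@med.example.edu"]  # placeholder domains
    statement = "Department of Medicine, Example University. ABC2001@med.example.edu"
    print(affiliation_mentions_email(statement, "abc2001", suffixes))
```

A plain substring test can also match a longer address that merely ends with the candidate, so a stricter version would check word boundaries.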

Organizational unit evidence

  • orgUnitScoringStrategy.organizationalUnitSynonym - Despite the pleadings of External Affairs, your institution's authors can be inconsistent with how they refer to an organizational unit. Here you get a chance to designate certain terms as org unit synonyms. "Otolaryngology" = "Otorhinolaryngology" = "ENT", etc. This is used in several places including Organizational Unit scoring and Journal Category scoring.
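
Conceptually, org unit synonyms collapse several spellings into one canonical name before comparison. The sketch below shows that idea with the example group from the bullet above; the data structure and function names are hypothetical and say nothing about how the property value itself is formatted.

```python
# Each group lists interchangeable names; the first entry is treated as canonical.
SYNONYM_GROUPS = [
    ["Otolaryngology", "Otorhinolaryngology", "ENT"],
]


def build_lookup(groups: list[list[str]]) -> dict[str, str]:
    """Map every case-folded synonym to its group's canonical name."""
    lookup: dict[str, str] = {}
    for group in groups:
        canonical = group[0]
        for term in group:
            lookup[term.casefold()] = canonical
    return lookup


def canonical_org_unit(name: str, lookup: dict[str, str]) -> str:
    """Return the canonical name, or the trimmed input if no synonym is known."""
    cleaned = name.strip()
    return lookup.get(cleaned.casefold(), cleaned)


if __name__ == "__main__":
    lookup = build_lookup(SYNONYM_GROUPS)
    print(canonical_org_unit("ent", lookup))
    print(canonical_org_unit("Cardiology", lookup))
```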

Person type evidence

  • personTypeScoringStrategy.personTypeScore-academic-faculty-weillfulltime - You can increase or decrease the weighting of candidate articles for broad designations of people, where "academic-faculty-weillfulltime" is the person type as stored in identity.personTypes. At Weill Cornell, we've noticed that our full-time faculty are more likely to author papers than MD students. An analysis could also show that tenured faculty or individuals on the research/investigation pathway are more likely to author articles.

Average clustering evidence

  • strategy.acceptedRejectedScoringStrategy.feedbackScore-accepted, strategy.acceptedRejectedScoringStrategy.feedbackScore-rejected, strategy.acceptedRejectedScoringStrategy.feedbackScore-null - These scores are only used during cluster scoring. This strategy and these scores allow us to upweight articles in clusters that have already been accepted and downweight articles that have been rejected.

Scoring

  • standardizedScoreMapping - ReCiter outputs both a raw (non-standardized) score and a standardized score. To insulate users from changing raw scores and their non-intuitive range ("How good is 12.1?"), we map the raw score to a 1-10 scale. Articles with raw scores between the 1st and 2nd terms in standardizedScoreMapping have a standardized score of 1. Articles with raw scores between the 2nd and 3rd terms have a score of 2, and so on.
  • totalArticleScore-standardized-default - If an admin does not select any global score, the system defaults to this score.
  • reciter.minimumStorageThreshold - sets a lower limit for storing ReCiter's output in the "Analysis" DynamoDB table.
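
The standardizedScoreMapping lookup amounts to finding which pair of thresholds a raw score falls between. Here is a minimal sketch using binary search. The threshold values are invented, and the boundary handling (a score equal to a threshold goes to the higher bucket, scores below the first term clamp to 1, scores above the last term get the top score) is an assumption of this sketch, not documented ReCiter behavior.

```python
from bisect import bisect_right
from typing import Sequence

# Hypothetical mapping with ten ascending terms; not ReCiter's defaults.
HYPOTHETICAL_MAPPING = [-20.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0]


def standardize(raw: float, thresholds: Sequence[float]) -> int:
    """Map a raw score to 1..len(thresholds) by its position among the thresholds."""
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be in ascending order")
    # Number of thresholds <= raw: between the 1st and 2nd term this is 1, and so on.
    position = bisect_right(thresholds, raw)
    # Clamp so that scores below the first term still map to 1 (assumption).
    return max(1, min(position, len(thresholds)))


if __name__ == "__main__":
    for raw in (-50.0, 3.0, 12.1, 100.0):
        print(raw, standardize(raw, HYPOTHETICAL_MAPPING))
```

With these made-up terms, a raw score of 12.1 falls between the 7th and 8th terms and so maps to 7.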