Add identifiers to all FamPlex entries #87

cthoyt · 2019-07-21T12:24:14Z

Closes #86

This PR adds a script, famplex/reorganize_old_resources.py that migrates the old resource TSVs into the famplex/resources/ folder. It moves the code in the exports/ folder to famplex/export/ and refactors code necessary to cope with the new format changes.

Identifiers:

Identifiers will match \d{6}
With the addition of identifiers, it now no longer makes sense to maintain the entities.csv in a label-sorted order, but rather to use the identifiers to sort.

Format changes:

~~All CSV files were switched to TSV. I'm not dogmatic about this, but it makes it much easier to deal with funny characters~~ After reading the README one more time, I think it's better just to stick with CSV
A descriptions.csv file was introduced for now, but it will be jointly upgraded with entities.csv to famplex/resources/entities.tsv which has the identifier, name, description, and references.
grounding_map.csv has been upgraded to famplex/resources/grounding_map.tsv and will no longer be a wide CSV, but a tall and skinny TSV with four columns: text, database, database identifier, and name. This means one text reference may appear on many lines.
equivalence.csv has been upgraded to famplex/resources/equivalence.tsv and now has both labels and identifiers.
relations.csv has been upgraded to famplex/resources/relations.tsv and now has both identifiers and labels for subject and object

Features:

The check_references.py script now checks both that the HGNC identifier is valid and the associated names are up to date using indra.databases.hgnc_client
The check_references.py script now checks for text in the grounding map that has multiple grounds to different entities in the same database. After some investigation, I suppose this sort of makes sense because one textual reference could be disambiguated to multiple entities. @steppi and @bgyori you should take a look at this, and consider how the grounding map in this repository fits into your current entity disambiguation work, and whether it even makes sense to store this kind of information here (in the famplex repo) anymore. Here's the current (short) output:

WARNING "B1" has multiple UP groundings: UP:O43157 ! PLXB1_HUMAN, UP:P14635 ! CCNB1_HUMAN
WARNING "cell motility" has multiple GO groundings: GO:GO:0016477 ! cell migration, GO:GO:0048870 ! cell motility
WARNING "MA" has multiple PUBCHEM groundings: PUBCHEM:10836 ! , PUBCHEM:73659 ! 
WARNING "MA" has multiple CHEBI groundings: CHEBI:CHEBI:6809 ! methamphetamine, CHEBI:CHEBI:66682 ! maslinic acid
WARNING "OA" has multiple CHEBI groundings: CHEBI:CHEBI:37659 ! oleanolic acid, CHEBI:CHEBI:44658 ! okadaic acid
WARNING "OA" has multiple PUBCHEM groundings: PUBCHEM:10494 ! , PUBCHEM:446512 ! 
WARNING "p42" has multiple UP groundings: UP:P28482 ! MK01_HUMAN, UP:Q9UQ80 ! PA2G4_HUMAN

Exporter changes:

BEL Namespace: none
Relations graph: now shows both identifier and label for each entity
OBO: fix major problem with OBO export related to locally defined relationships (they need a [Typedef] stanza). Added identifiers where possible and used OBO syntax to keep showing names when available. Added description and references for each entity when available. Renamed root entity to "famplex".

To Do and Questions:

@bgyori is export/hgnc_ids.py still necessary?
Change all names of DBs that don't use Identifiers.org to use identifiers.org. This might be an issue for INDRA, so we need to discuss.
Update importer scripts. Which ones are still relevant?

- This adds identifiers and descriptions to all entities. - For now, the old files will stay until there's a good reason to commit to this new format. - TODO: The check_references.py script needs to be rewritten still for the new format

During this process I noticed I was writing the identifiers and names backwards. That's why we test!

[skip ci]

Still can't run interpro, reactome, or signor successfully

bgyori · 2019-07-30T03:37:56Z

Wow, that's a serious PR! I like a lot of things about this, a couple of points to discuss:

Several downstream dependencies pull FamPlex content from GitHub via URLs and the expected location of files. These would break if the resource files are moved into another folder so we'd need to track down all these places and update them.
Obviously the same is true for the breaking changes in the structure of each of the resource files - lots of downstream code, potentially by users we don't know about would need to be examined and updated.
While perhaps necessary for making it a proper Python package (though wouldn't a setup.py be able to move files from the top level to famplex/resources upon install?) it is nice to have the main, human-editable resource files at the top level of the repo so it's very easy to e.g, edit it manually on Github.
By changing to IDs, we would lose the human-readability of entries in the namespace
I assume that it's not recommended to redefine the existing identifiers.org entry for fplx (https://registry.identifiers.org/registry/fplx) which uses standard names. I wonder if we go with using numeric IDs, whether the right course of action is to register another name space like fplx.id in addition to the existing one.
I wouldn't want to flatten the grounding_map's structure because the point of having multiple entries per row is that each row corresponds to one "sense" of an entity. With the flattened structure it isn't possible to represent ambiguity among multiple senses.
The OBO export was designed (through some trial&error) to be compatible with TRIPS and BioPortal simultaneously (https://bioportal.bioontology.org/ontologies/FPLX). We'd have to make sure that the new format still works with these (I'm almost sure the changes won't work out of the box with TRIPS so again that would require some adaptation).

cthoyt · 2019-08-10T11:35:24Z

Wow, that's a serious PR! I like a lot of things about this, a couple of points to discuss:

Several downstream dependencies pull FamPlex content from GitHub via URLs and the expected location of files. These would break if the resource files are moved into another folder so we'd need to track down all these places and update them.

I agree this is a problem, it wouldn't be a bad idea to do an audit on what's using FamPlex and where. For example, it might make sense to import a FamPlex package that already is a "famplex_client" for use in INDRA and Gilda.

Obviously the same is true for the breaking changes in the structure of each of the resource files - lots of downstream code, potentially by users we don't know about would need to be examined and updated.

Martin always joked that a good way to determine how much a web service is used is to to turn it off and see how long it takes for someone to email and complain. Maybe that strategy would work here?

We could make a tag/release before making any big changes so users could refer to the repository by that hash in the mean time.

While perhaps necessary for making it a proper Python package (though wouldn't a setup.py be able to move files from the top level to famplex/resources upon install?) it is nice to have the main, human-editable resource files at the top level of the repo so it's very easy to e.g, edit it manually on Github.

I've spent an inordinate amount of time (several sessions spanning several hours of digging through the rabbit hole that is setuputils, distutils, etc.) on this and never come up with a solution. I would love to solve the problem how you propose! Maybe you have some ideas, or at least have a fresh pair of eyes.

By changing to IDs, we would lose the human-readability of entries in the namespace

I assume that it's not recommended to redefine the existing identifiers.org entry for fplx (https://registry.identifiers.org/registry/fplx) which uses standard names. I wonder if we go with using numeric IDs, whether the right course of action is to register another name space like fplx.id in addition to the existing one.

Good point. Again, I wonder how many users are currently depending on resolving through identifiers. Maybe we could get away with switching the current entry to something different

I wouldn't want to flatten the grounding_map's structure because the point of having multiple entries per row is that each row corresponds to one "sense" of an entity. With the flattened structure it isn't possible to represent ambiguity among multiple senses.

Okay! Will revert that.

The OBO export was designed (through some trial&error) to be compatible with TRIPS and BioPortal simultaneously (https://bioportal.bioontology.org/ontologies/FPLX). We'd have to make sure that the new format still works with these (I'm almost sure the changes won't work out of the box with TRIPS so again that would require some adaptation).

Sounds like this will be an ongoing issue... Currently the OBO isn't compatible with OLS.

Overall, do you think we should split this PR into a couple smaller ones? Which ones do you @bgyori @johnbachman think are the lowest hanging fruits that we can address first?

cthoyt added 3 commits July 19, 2019 21:06

Create .gitignore

edc1f80

Add descriptions file

7f8186c

Add upgrade script and rewrite exporters

6e2cd0c

- This adds identifiers and descriptions to all entities. - For now, the old files will stay until there's a good reason to commit to this new format. - TODO: The check_references.py script needs to be rewritten still for the new format

cthoyt changed the title ~~Add identifiers~~ Add identifiers to all FamPlex entries Jul 21, 2019

cthoyt added 16 commits July 21, 2019 17:56

Update check_references script

45fcd17

During this process I noticed I was writing the identifiers and names backwards. That's why we test!

Delete relations_graph.pdf

13b36b8

Update .gitignore

7089836

Delete __init__.py

9729dda

Update hgnc_ids.py

24746f6

Update .travis.yml

b1375ca

Update .travis.yml

93ea5a2

Remove duplicates from original grounding map

60a74c6

Switch over to CSVs for everything

6fc3b7a

Fix filename and update README

fff8849

Add check for text with multiple groundings to the same database

06a6dc7

Update formatting of output

0d64363

[skip ci]

Check for existence of cache files

3e945f4

Update hgnc_ids.py

d9c7775

Improve output

864baef

Refactor code for import

ee417e4

Still can't run interpro, reactome, or signor successfully

bgyori mentioned this pull request Apr 15, 2021

Gene grounding curation #172

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add identifiers to all FamPlex entries #87

Add identifiers to all FamPlex entries #87

cthoyt commented Jul 21, 2019 •

edited

Loading

bgyori commented Jul 30, 2019 •

edited

Loading

cthoyt commented Aug 10, 2019

Add identifiers to all FamPlex entries #87

Are you sure you want to change the base?

Add identifiers to all FamPlex entries #87

Conversation

cthoyt commented Jul 21, 2019 • edited Loading

bgyori commented Jul 30, 2019 • edited Loading

cthoyt commented Aug 10, 2019

cthoyt commented Jul 21, 2019 •

edited

Loading

bgyori commented Jul 30, 2019 •

edited

Loading