Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Drush command to update chado instance to match Tripal Term expectations #1895

Conversation

laceysanderson
Copy link
Member

@laceysanderson laceysanderson commented Jun 11, 2024

Tripal 4 Core Dev Task

Issue #1698

Tripal Version: 4

Description

This PR adds a drush command that will check a specific chado instance to ensure it matches the expectations set out by Tripal 4. This is needed because chado vocabularies and cvterms often become a bit eccentric over the years ;-p Right now, we are finding that running the prepare step in Tripal 4 on a chado exported from a Tripal 3 site fails due to changes in term expectations, definitions, etc.

More specifically, the drush command drush trp-check-terms --chado_schema=chado would run a number of checks on the cv, db, cvterm, and dbxref tables to ensure the values associated with key Tripal terms match those defined in the Chado Content Terms YAML. This command also offers the option in some cases to fix any deviations from what is expected.

Specific cases which are checked:

  • checks if the vocabulary exists in the cv table where cv.name = vocab.label
    • if the cv exists then compares the definition and warns if they differ
  • checks if the ID Spaces for this vocabulary exist in the db table where db.name = idSpace.name
    • if the db exists then compares the description, urlprefix and url and warns if they differ.
  • checks if the term exists as a unique cvterm/dbxref combo and reports success if yes.
    • reports errors for terms if:

      1. cvterm exists (cv.name = vocab.name + cvterm.name = term.name) but is attached to the wrong dbxref.
      2. cvterm exists with the term name but wrong cv; however, it is connected to the right dbxref
      3. dbxref exists but is connected to 1+ cvterms that do not match the term we are looking for.
      4. cvterm exists but is connected to a dbxref where the accession matches but it's in a different id space.
    • warns if the cvterm definition doesn't match what we expected.

    • reports but not an error if:

      1. dbxref exists but it is not connected to any cvterms
      2. neither cvterm nor dbxref exist
  • checks if the same term is defined in YAML configuration more than once

Testing?

I started automated testing for this!

Manual Testing via Docker

  1. Start a fresh docker on this branch.
git clone https://github.com/tripal/tripal trpCore-1895-checkTerms
cd trpCore-1895-checkTerms
git checkout tv4g2-issue1698-unable-to-prepare-an-existing-chado-database-from-tripal-3
docker build --tag=tripaldocker:tripal1698 --build-arg drupalversion=10.2.x-dev --build-arg postgresqlversion=16 --file tripaldocker/Dockerfile-php8.3 ./
docker run --publish=80:80 -tid --name=tripal1698 --volume=/Users/las166/Docker/trpCore-1698-prepareChado:/var/www/drupal/web/modules/contrib/tripal tripaldocker:tripal1698
docker exec tripal1698 service postgresql restart
  1. Run the drush command on the freshly installed chado. We expect to see the following output which indicates there are no errors and everything matches our expectations. There should not be any red/yellow in the summary table.
Screenshot 2024-06-17 at 3 15 23 PM Screenshot 2024-06-17 at 3 15 40 PM 4. Now you want to make some changes which will cause warnings:
UPDATE chado.cv SET definition='CHANGED FOR TESTING' WHERE cv.name = 'hydra';
UPDATE chado.db SET description='CHANGED FOR TESTING' WHERE db.name='schema'; 
UPDATE chado.db SET url='not/a/url', urlprefix='something/completely/different?{db}:{accession}'  WHERE db.name='TAXRANK';
UPDATE chado.cvterm SET definition='CHANGED FOR TESTING' WHERE name='Physical map';
  1. Now run the drush command and check that you get warnings for all terms that are in the hydra vocabulary or the schema id space. There should also be a warning for the 'Physical map' term.
Screenshot 2024-06-17 at 3 24 43 PM Screenshot 2024-06-17 at 3 25 00 PM It should prompt asking if you want to see a summary of the terms, choose yes and confirm you see the following summaries. After each summary it also asks if you want us to fix it. Also select yes. Screenshot 2024-06-17 at 3 27 22 PM 6. Now run the command again and confirm the warnings no longer show up. You can also query the database to ensure the fix is done right. 7. Now we want to trigger some of the errors. This can be done with the following commands:
  • Correct cvterm, dbxref found but db is wrong
INSERT INTO chado.db (name) VALUES ('DATA');
UPDATE chado.dbxref SET db_id=42 WHERE dbxref_id=8;
  • cvterm exists, dbxref exists but connection is broken (i.e. cvterm connected to different dbxref).
INSERT INTO chado.dbxref (db_id, accession) VALUES (42, 'NOTREAL');
UPDATE chado.cvterm SET dbxref_id=3496 WHERE dbxref_id=13;
  • cvterm exists but is connected to wrong dbxref and the correct one doesn't exist.
UPDATE chado.dbxref SET accession='UPDATEEXISTING', db_id=42 WHERE dbxref_id=24;
  • cvterm doesn't exist but dbxref does and it's connected to the wrong cvterm
DELETE FROM chado.cvterm WHERE cvterm_id=215;
UPDATE chado.cvterm SET dbxref_id=215 WHERE cvterm_id=2133;
  • cvterm with that name exists and is attached to the right dbxref but cv is wrong
INSERT INTO chado.cv (name) VALUES ('NEWCV');
UPDATE chado.cvterm SET cv_id=34 WHERE cvterm_id=219;
  1. Now run the command and make sure it catches all these problems.
docker exec -it tripal1698 drush trp-check-terms --chado_schema=chado
Screenshot 2024-06-17 at 5 07 46 PM Screenshot 2024-06-17 at 5 09 36 PM

@laceysanderson laceysanderson linked an issue Jun 11, 2024 that may be closed by this pull request
@laceysanderson laceysanderson added Tripal 4 Any issue or pull request focused on Tripal 4 Group 2 - Data Storage | Tripal DBX | Chado Any issue relating to biological data storage, Tripal DBX and Chado integration, Materialized Views labels Jun 11, 2024
@laceysanderson laceysanderson self-assigned this Jun 11, 2024
@dsenalik
Copy link
Contributor

dsenalik commented Jun 19, 2024

🙃 Tripal 3 defined two terms in the tripal_contact vocabulary in tripal_chado/files/tcontact.obo:

[Term]
id: TCONTACT:0000004
name: Organization
is_a: TCONTACT:0000001 ! Contact Type

[Term]
id: TCONTACT:0000018
name: Organization
def: A generic term for any organization.
relationship: part_of TCONTACT:0000015 ! Affiliation

In the Tripal 3 docker, when the OBO was imported, the second one became Organization(TCONTACT:0000018)
In my case, for whatever reason, the first one became Organization (TCONTACT:0000004)

Likewise with the "Department" term, it is also in there twice.

Anyway, when I import my Chado into Tripal 4, and try to prepare Chado, I am seeing
...

 [notice] Step 4: Loading terms...
 [notice] Percent complete: 0%. Memory: 38,689,072 bytes.
 [notice] Percent complete: 0%. Memory: 38,689,072 bytes.
 [notice] Percent complete: 3.57 %. Memory: 38,689,104 bytes.
 [notice] Percent complete: 7.14 %. Memory: 38,689,136 bytes.
 [notice] Percent complete: 10.71 %. Memory: 38,689,168 bytes.
 [error]  ERROR: SQLSTATE[23505]: Unique violation: 7 ERROR:  duplicate key value violates unique constraint "cvterm_c1"
DETAIL:  Key (name, cv_id, is_obsolete)=(Organization, 9, 0) already exists.:
  UPDATE "chado"."cvterm" SET "name"=:db_update_placeholder_0, "definition"=:db_update_placeholder_1, "is_obsolete"=:db_update_placeholder_2, "is_relationshiptype"=:db_update_placeholder_3
WHERE "cvterm_id" = :db_condition_placeholder_0; Array
(
)

9 is the tripal_contact cv_id

So what is happening is that the Chado prepare wants to rename the "TCONTACT:0000004" term from "Organization (TCONTACT:0000004)" to "Organization", then rename the "TCONTACT:0000018" term from "Organization" to "Organization(TCONTACT:0000018)".
But because of the unique constraint, having them both be named "Organization" at the intermediate step fails, and this errror occurs.
The workaround is simple, rename "Organization" to "Organization(TCONTACT:0000018)", or even just to "Organization2" and likewise for "Department", then chado prepare completes successfully.

My question is, can we somehow detect this situation with this tool (Update we can after commit 8654b1c), or else modify the OBO importer to handle this case. Or third option, fix tcontact.obo! We shouldn't have two terms with identical names. Issue #1906
Note that the change from #1906 won't fix my particular case since the second one is renamed in the YAML but that is processed second.

@dsenalik
Copy link
Contributor

dsenalik commented Jun 19, 2024

Something that should be added, check for our standard entity record identifier term. If it was not present in the Tripal 3 site you may get this error
Exception: Cannot create a StorageProperty object for entity type "tripal_entity" field type "chado_string_type_default", key "record_id" as accession for the property term is not recognized: "SIO:000729" in Drupal\tripal\TripalStorage\StoragePropertyBase->__construct() (line 77 of modules/contrib/tripal/tripal/src/TripalStorage/StoragePropertyBase.php).

I see this if I get everything ready for Chado prepare, make the imported chado the default, and at that point make a database backup. If I restore from this backup, I see the error.

* okay, thinking about this, it is just that you can't make the imported chado the default chado until after it is prepared, if you want things to work correctly.

@laceysanderson laceysanderson added the waiting on changes Added on a PR when further review is waiting on the requested changes to be made. label Jun 24, 2024
@dsenalik
Copy link
Contributor

dsenalik commented Jun 29, 2024

Additional test procedure to reproduce the situation causing the error I saw, and see if this is properly flagged after commit 2e9af7b. We will need a new docker so that these terms get checked, because this commit modified some YAML.

UPDATE cvterm SET name='Organization (TCONTACT:0000004)' WHERE name='Organization' AND cv_id=(SELECT cv_id FROM cv WHERE name='tripal_contact');
UPDATE 1
UPDATE cvterm SET name='Organization' WHERE name='Affiliated Organization' AND cv_id=(SELECT cv_id FROM cv WHERE name='tripal_contact');
UPDATE 1
UPDATE cvterm SET name='Department (TCONTACT:0000009)' WHERE name='Department' AND cv_id=(SELECT cv_id FROM cv WHERE name='tripal_contact');
UPDATE 1
UPDATE cvterm SET name='Department' WHERE name='Affiliated Department' AND cv_id=(SELECT cv_id FROM cv WHERE name='tripal_contact');
UPDATE 1
SELECT T.name, X.accession from cvterm T left join dbxref X on T.dbxref_id=X.dbxref_id where cv_id=(SELECT cv_id FROM cv WHERE name='tripal_contact') order by cvterm_id;
              name               | accession 
---------------------------------+-----------
 Organization (TCONTACT:0000004) | 0000004
 Department (TCONTACT:0000009)   | 0000009
 Department                      | 0000016
 Organization                    | 0000018

So that reproduces the situation that I had. Running the validation:

...
  Organization (TCONTACT:0000004)                                           26   27    205     **202**     
  Department (TCONTACT:0000009)                                             26   27    204     **203**     
...
 Would you like more details regarding the errors we found? (yes/no) [yes]:
 > 

Term (cvterm/dbxref) Issues.
----------------------------

We have detected 2 Term(s) with a key deviation from what is expected. Specifically:
+---------------------------------+-------------------------------------------+------------------+------------------+------------------+
| TERM                            | MESSAGE                                   | COLUMN           | EXPECTED         | YOURS            |
+---------------------------------+-------------------------------------------+------------------+------------------+------------------+
| Organization (TCONTACT:0000004) | Broken Connection between cvterm + dbxref | cvterm.dbxref_id | TCONTACT:0000004 | TCONTACT:0000018 |
| Department (TCONTACT:0000009)   | Broken Connection between cvterm + dbxref | cvterm.dbxref_id | TCONTACT:0000009 | TCONTACT:0000016 |
+---------------------------------+-------------------------------------------+------------------+------------------+------------------+

This does flag the situation that will violate the unique constraint when Chado is prepared, though maybe not in a super-obvious way...

Copy link
Contributor

@dsenalik dsenalik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything is now working for me, I would approve except one change requested 😁. Need to validate that the chado schema exists in a nice way. To reproduce, misspell your chado schema name:
2024-06-29_splunge

  • Update: This is fixed with commit 9697044

@Ferrisx4
Copy link
Member

I've been testing this both in the Docker (works as intended), as well as on an imported Chado schema from the i5k database. I'm running into a few issues:

[ERROR] We missed a case with the dbxref for reference annotation (SBO:0000552). These are the dbxrefs we have to work 
         with: Array                                                                                                    
         (                                                                                                              
             [1006] => stdClass Object                                                                                  
                 (                                                                                                      
                     [dbxref_id] => 1006                                                                                
                     [accession] => 0000552                                                                             
                     [db_name] => SO                                                                                    
                 )                                                                                                      
                                                                                                                        
         ) 

This is then followed by

Error: Object of class stdClass could not be converted to string in /var/www/t4/web/modules/contrib/tripal/tripal_chado/src/Commands/ChadoCheckTermsAgainstYaml.php on line 637

The object in question is indeed not a string, but an array:

stdClass Object
(
    [dbxref_id] => 113276
    [accession] => 0000004
    [db_name] => TContact
)

I'm still a little fuzzy on all of this term checking, and all of the different // CASE:s that are checked, but have you seen any issues like this? I think

chado vocabularies and cvterms often become a bit eccentric over the years ;-p

is a bit of an understatement.

@Ferrisx4
Copy link
Member

My second error above was just caused by a missing reference to the actual dbxref_id value. Under the standard T4 Docker, this section of code doesn't get called, so it probably slipped under the radar, but i5k's DB triggered it.

For the case that is missed, indicated by the first error, I can provide values from the database that may help cover this, but I'm not sure what would be pertinent.

Also, now that the second error above isn't causing the check function to crash entirely, I run into an error on line 651 where $term_info['db_name'] doesn't exist (another case of the code never being run on a more ideal database state). Working on a fix for this now...

@Ferrisx4
Copy link
Member

Ferrisx4 commented Jul 12, 2024

@laceysanderson In the chadoCheckTerms_reportProblem_eccentricCVTerm() function, do we want to default to non-destructive like the other fixes? This argument is missing when calling $this->askOrRespectOptions()

@dsenalik
Copy link
Contributor

@laceysanderson In the chadoCheckTerms_reportProblem_eccentricCVTerm() function, do we want to default to non-destructive like the other fixes? This argument is missing when calling $this->askOrRespectOptions()

I definitely vote non-destructive ( !💣 )

@dsenalik
Copy link
Contributor

dsenalik commented Jul 12, 2024

I can trigger the code at line 651 to be executed with
update chado.dbxref set db_id=2 where dbxref_id=(select dbxref_id from chado.cvterm where name='accession');
(This is the very first term checked.)
and it generates

 [warning] Undefined array key "db_name" ChadoCheckTermsAgainstYaml.php:651

...

The following table summarizes the terms.
 ------------------------------------------------------------------------- ------ ------ -------- -------- 
  YAML Term                                                                 CV     DB     CVTERM   DBXREF  
 ------------------------------------------------------------------------- ------ ------ -------- -------- 
  accession (CO_010:0000044)                                                12      20    2811      3117   <-- 3117 is red

...

Term (cvterm/dbxref) Issues.
----------------------------

We have detected 1 Term(s) with a key deviation from what is expected. Specifically:
+----------------------------+-----------------------------------------------+--------------+----------+-------+
| TERM                       | MESSAGE                                       | COLUMN       | EXPECTED | YOURS |
+----------------------------+-----------------------------------------------+--------------+----------+-------+
| accession (CO_010:0000044) | Wrong db but dbxref connected to right cvterm | dbxref.db_id |          | OBI   |
+----------------------------+-----------------------------------------------+--------------+----------+-------+

Commit 425ec4d fixes this. Now the output will be

We have detected 1 Term(s) with a key deviation from what is expected. Specifically:
+----------------------------+-----------------------------------------------+--------------+----------+-------+
| TERM                       | MESSAGE                                       | COLUMN       | EXPECTED | YOURS |
+----------------------------+-----------------------------------------------+--------------+----------+-------+
| accession (CO_010:0000044) | Wrong db but dbxref connected to right cvterm | dbxref.db_id | CO_010   | OBI   |
+----------------------------+-----------------------------------------------+--------------+----------+-------+

@dsenalik
Copy link
Contributor

dsenalik commented Jul 13, 2024

Well Lacey is busy telling people about Tripal, so I went ahead and fixed the ugly stuff that happens when the chado schema does not exist. With commit 9697044 you will now see this instead
2024-07-13_invalid-schema

But I didn't like still seeing the red X for cognitive complexity, so with commit ddad6f5 I went ahead and split the main function into two parts to reduce the cognitive complexity (it was a pretty long function 🙃) since it did two separate things, 1. find, then 2. report

@dsenalik dsenalik removed the waiting on changes Added on a PR when further review is waiting on the requested changes to be made. label Jul 13, 2024
@dsenalik
Copy link
Contributor

dsenalik commented Jul 15, 2024

Sean's error:
2024-07-15_tbd

The cv term for SO:0000552 is Shine_Dalgarno_sequence | A region in the 5' UTR that pairs with the 16S rRNA during formation of the preinitiation complex. [SO:jh]
SBO:0000552 https://www.ebi.ac.uk/ols4/search?q=SBO%3A0000552
is "reference annotation"

I see this because I don't have a "reference annotation" defined in my Tripal 3 DB
reference annotation (SBO:0000552) 50 37 - -

Okay, the reason that I am not seeing this error is because I have more than one dbxref, and Sean only has one. The code that does this is

    // If we never did find any meaningful connections with the multiple
    // dbxrefs we found then they were false positives.
    if ($summary_dbxref == ' ? ' AND count($dbxrefs) > 1) {

I am pretty sure this should be >= 1 so I will push such a commit and Lacey can tell me later if I was wrong 🙃

@laceysanderson
Copy link
Member Author

Thanks for all the fixes @dsenalik and @Ferrisx4, these look great :-)

@laceysanderson laceysanderson merged commit f64996e into 4.x Jul 22, 2024
15 checks passed
@laceysanderson laceysanderson deleted the tv4g2-issue1698-unable-to-prepare-an-existing-chado-database-from-tripal-3 branch July 22, 2024 16:23
@Ferrisx4
Copy link
Member

Ferrisx4 commented Jul 23, 2024

I know this is closed but for convenience and organization sake, I'll put this here:

I just realized the issue I'm having as described in my last comment on the tripal doc issue #82 was already brought up by @dsenalik up above: #1895 (comment).

In that comment, the workaround is a simple renaming. I'd like to also point out that, at least in the database I'm using, there are two "TContact"-like entries in the db table:

select * from chado.db where name ilike '%contact%';
 db_id |   name   |                                                description                                                |           urlprefix            |        url         
-------+----------+-----------------------------------------------------------------------------------------------------------+--------------------------------+--------------------
   116 | TContact |                                                                                                           |                                | 
   400 | TCONTACT | Tripal Contact Ontology. A temporary ontology until a more formal appropriate ontology can be identified. | cv/lookup/TCONTACT/{accession} | cv/lookup/TCONTACT

In our db, we have a number of dbxrefs that point to the TContact one, with the empty columns, and none that point to the proper one (TCONTACT).

Interestingly, the Chado term checker flags two rows as having the dbxref in red:

The following table summarizes the terms.
 ------------------------------------------------------------------------- ----- ----- -------- ---------- 
  YAML Term                                                                 CV    DB    CVTERM   DBXREF    
 ------------------------------------------------------------------------- ----- ----- -------- ---------- 
  Organization (TCONTACT:0000004)                                           24    400   39488     113276   
  Department (TCONTACT:0000009)                                             24    400   39493     113281   

and the summary:

Term (cvterm/dbxref) Issues.
----------------------------

We have detected 2 Term(s) with a key deviation from what is expected. Specifically:
+---------------------------------+-----------------------------------------------+--------------+----------+---------+
| TERM                            | MESSAGE                                       | COLUMN       | EXPECTED | YOURS   |
+---------------------------------+-----------------------------------------------+--------------+----------+---------+
| Organization (TCONTACT:0000004) | Wrong db but dbxref connected to right cvterm | dbxref.db_id | TCONTACT | OBO_REL |
| Department (TCONTACT:0000009)   | Wrong db but dbxref connected to right cvterm | dbxref.db_id | TCONTACT | OBO_REL |
+---------------------------------+-----------------------------------------------+--------------+----------+---------+

In the database, the two dbxrefs that were flagged in red, 113276 and 113281, both point to db_id 116, the erroneous TContact from earlier. Changing this value from 116 to 400 satisfies the Chado term check.

Since the functionality was mostly there, and this was subsequently merged, I can open a new issue if we think this case needs to be handled differently, at least with different language, as it wasn't immediately clear what the solution was, or if the solution doesn't necessarily work with Doug's suggestion above, which I will test.

Edit:
At any rate, in the summary table above, the YOURS column indicates that these two terms are currently linked to the OBO_REL database, but are actually linked to the TContact, which might have been where my confusion came from. There's every possibility I'm interpreting the table wrong though.

@dsenalik
Copy link
Contributor

dsenalik commented Jul 24, 2024

One thing I would like to see added is a case-insensitive check for more that one DB or CV having the same name but different case. I think I had something like that a few years back that I cleaned up, it would just be a nice warning that would be simple to add I think.

That OBO_REL is odd, it would be good to know how to reproduce it because it seems suspicious, there could be a bug in there.

Edit - I still have one! I have eco and ECO

carrotomics=> SELECT LOWER(name) FROM db GROUP BY LOWER(name) HAVING COUNT(LOWER(name))>1;
 lower 
-------
(0 rows)

carrotomics=> SELECT LOWER(name) FROM cv GROUP BY LOWER(name) HAVING COUNT(LOWER(name))>1;
 lower 
-------
 eco
(1 row)

carrotomics=> select count(*) from cvterm T left join cv C on T.cv_id=C.cv_id where C.name='eco';
 count 
-------
  1825
(1 row)

carrotomics=> select count(*) from cvterm T left join cv C on T.cv_id=C.cv_id where C.name='ECO';
 count 
-------
     0
(1 row)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Group 2 - Data Storage | Tripal DBX | Chado Any issue relating to biological data storage, Tripal DBX and Chado integration, Materialized Views Tripal 4 Any issue or pull request focused on Tripal 4
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unable to prepare an existing chado database from Tripal 3
4 participants