Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: sample IDs that look like scientific notation are treated as numbers #131

Merged
merged 23 commits into from
May 2, 2024

Conversation

colinbrislawn
Copy link
Contributor

@colinbrislawn colinbrislawn commented Jan 31, 2024

Trying to close #130

More testing is needed, especially a regression test for sampleIDs like this.

@lizgehret
Copy link
Member

Thanks for working on this @colinbrislawn! Lmk when it's ready for a review and I'll take a look 🙂

@colinbrislawn
Copy link
Contributor Author

This seems to work:

t(read.delim(inp_abundances_path,
  check.names = TRUE, # change to true to so the empty first column becomes .$X
  row.names = 1,
  colClasses = c(X = "character") # Make the .$X column a character, leave other columns
))

Any idea why check.names = F originally?

For the regression test, maybe import this file and make sure the sample names match?
table-SI-names.tsv

Could I just run test_ids_in_table_not_in_md() again, with a new set of files?

@lizgehret lizgehret self-assigned this Feb 1, 2024
@lizgehret
Copy link
Member

This seems to work:

t(read.delim(inp_abundances_path,
  check.names = TRUE, # change to true to so the empty first column becomes .$X
  row.names = 1,
  colClasses = c(X = "character") # Make the .$X column a character, leave other columns
))

Any idea why check.names = F originally?

There wasn't a strong reason for that being set to False in the initial implementation - I hadn't run into this issue while developing, so I didn't realize this would be an edge case we'd need to consider. This looks reasonable to me though!

For the regression test, maybe import this file and make sure the sample names match? table-SI-names.tsv

Could I just run test_ids_in_table_not_in_md() again, with a new set of files?

Yeah that should work! I think as long as we continue to use the original files/tests, and then just create additional tests with your new set of files to make sure the behavior is the same, that should work well.

@colinbrislawn
Copy link
Contributor Author

colinbrislawn commented Feb 1, 2024

Hey Liz, I think I'm ready for your help here.

I've got a new test_ids_in_table_with_es testing function that should fail because I've reverted the ancom import code... but it works fine. I'm also unfamiliar with the dataloaf format

EDIT:

and then just create additional tests with your new set of files to make sure the behavior is the same, that should work well

Yes! I've added those files. Can you help me write that?

@lizgehret
Copy link
Member

Yes, happy to help! Hung up with a few other things this afternoon, but I'll take a closer look tomorrow 🙂

@lizgehret
Copy link
Member

lizgehret commented Feb 2, 2024

Okay just adding some notes here for myself as I'm investigating:

  • Forum user who reported this issue was using QIIME 2 2023.9 and R 4.2.2.
  • I replicated the error with the same package versions using their dataset.
  • I created a dummy table/md file inline in a test to replicate the behavior using the same sample IDs that the forum user did to make sure the error would catch (which it did).
  • I added two tests (that both fail) for uppercase E and lowercase E (to ensure that R doesn't care about casing).

Adding just these two tests (not finalized) in a commit so you can take a look @colinbrislawn!

EDIT: Let me know if you want me to finish this up, or if you'd rather take it across the finish line - I'm happy either way, just don't want to steal your thunder!

@lizgehret
Copy link
Member

Another thing I decided to do @colinbrislawn was to just create the table/MD files inline instead of using your newly created files - sorry for flip flopping on that, I realized that would be easier for me to see/iterate on the table as I was testing!

@colinbrislawn
Copy link
Contributor Author

Okay, I'll take a look. I'm excited to see how you set up testing.

Let me know if you want me to finish this up, or if you'd rather take it across the finish line

I would like more experience with Python, so I'll see if I can get this finished. This may be more work for you, but if you are willing to help me out I would very much appreciate it.

@lizgehret
Copy link
Member

I would like more experience with Python, so I'll see if I can get this finished. This may be more work for you, but if you are willing to help me out I would very much appreciate it.

Happy to help, and in full support of you learning! You just let me know what questions you have 🙂

@colinbrislawn
Copy link
Contributor Author

colinbrislawn commented Feb 3, 2024

I've run into issues around metadata, so I've updated both import functions.
(Good thing we tested!)

In the metadata importe, I reference the sampleID column with the key sample-id.
Is this first column always called that in this plugin or should I support other names?

Zooming out a little, Tidyverse packages like readr may give us better ways to handle this.
We could start the tidyverse retooling, or save it for the ancom-bc2 function!

@colinbrislawn colinbrislawn marked this pull request as ready for review February 5, 2024 19:34
@lizgehret
Copy link
Member

Hey @colinbrislawn, sample-id is only one of the supported names for metadata identifiers (that's just the one I happened to use in the original test data); the full list can be found here. I'm a bit wary of hard-coding all of these into the import function unless absolutely necessary - but definitely open to any improvements that might be available within tidyverse, since that's already a dependency!

Re: ancom-bc2 - we were waiting for that paper to be published before starting development on this method, but it looks like that's finally happened as of a little over a month ago! I'll chat with the rest of the eng team in this week's meeting about this and see what thoughts are there.

@colinbrislawn
Copy link
Contributor Author

I thought I remembered a bunch of allowed sample-ids!

Here's the options I can see:

  • Use the python code to make sure that column is called sample-id before passing it to the R code I have in this PR
  • Switch to the tidyverse readr::read_delim() and use their better API.

What do you suggest?

@lizgehret
Copy link
Member

I thought I remembered a bunch of allowed sample-ids!

Here's the options I can see:

* Use the python code to make sure that column is called `sample-id` before passing it to the R code I have in this PR

* Switch to the tidyverse `readr::read_delim()` and use their better API.

What do you suggest?

Let's see what can be done with readr::read_delim() - I was reviewing the available parameters for this method in the R docs, and I'm wondering if a good solution would be to utilize the col_types param and set that to character for the first column (i.e. the index column containing the sample IDs). I haven't investigated this in detail, so not sure if this will actually work like we want it to, but it seems like a good option to explore!

@colinbrislawn
Copy link
Contributor Author

Checks pass on my machine!

I've got to remove my own notes and code before we merge...

@colinbrislawn
Copy link
Contributor Author

colinbrislawn commented Mar 15, 2024

@lizgehret, I've got the R code working, including a patch to _ancombc.py to always output sample-id

When you have a chance, could you take a look at the unit test?

Thanks!

@lizgehret
Copy link
Member

Hey @colinbrislawn!

Sorry for the delay, still getting caught up on things after vacation and travel. Will give this a proper review shortly!

Copy link
Member

@lizgehret lizgehret left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @colinbrislawn,

Sorry (again) for the delay on this! Overall this looks reasonable to me, thanks again for working on this! I think the remaining task here is (just as you mention in your test comment) to add assertions into the contents of the dataloaf itself for these new test cases.

One additional (small) comment is that we may want to re-name IDE with something a bit more verbose (like IDs-with-Es, or something like that). Only reason I bring that up is because I was initially thinking of IDE in terms of interactive development environment and that may cause confusion in the future.

@colinbrislawn
Copy link
Contributor Author

I was initially thinking of IDE in terms of interactive development environment

Good idea. I noticed that too when I wrote it. This I can fix.

add assertions into the contents of the dataloaf itself for these new test cases.

I'm unfamiliar with the dataloaf format. Would you like me to try this first or is this something you could wrap up for me?

@lizgehret
Copy link
Member

I'm unfamiliar with the dataloaf format. Would you like me to try this first or is this something you could wrap up for me?

I can finish that up, thanks @colinbrislawn! I'll wait for your name change updates re: IDE and then take it from there. 🙂

@lizgehret lizgehret assigned lizgehret and unassigned colinbrislawn May 1, 2024
@lizgehret lizgehret changed the title SampleIDs as number BUG: sample IDs that look like scientific notation are treated as numbers May 2, 2024
@lizgehret
Copy link
Member

lizgehret commented May 2, 2024

okay @colinbrislawn i've updated these tests to be more comprehensive - they now assert that the contents of the resultant IDs in each slice of the dataloaf are what we'd expect.

i also apologize for asking you to update the names of the files you added that contained IDE - i realized after the fact that we didn't need those files after all 🤦‍♀️

i'm going to get someone else to give this a final review (@colinvwood TIA!) since i'm no longer an unbiased reviewer 😉 but this should be g2g!

@lizgehret lizgehret requested a review from colinvwood May 2, 2024 00:12
@lizgehret lizgehret assigned colinvwood and unassigned lizgehret May 2, 2024
Co-authored-by: colinvwood <68213641+colinvwood@users.noreply.github.com>
@lizgehret lizgehret merged commit 33e8a9e into qiime2:dev May 2, 2024
4 checks passed
@colinbrislawn colinbrislawn deleted the ancomSamples branch May 3, 2024 19:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Status: Completed
Development

Successfully merging this pull request may close these issues.

BUG: ANCOM-BC Fails if sample IDs look too much like exponent
4 participants