Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unify feature type detection #724

Merged
merged 36 commits into from
May 19, 2024
Merged

Conversation

Lilly-May
Copy link
Collaborator

@Lilly-May Lilly-May commented May 12, 2024

PR Checklist

  • This comment contains a description of changes (with reason)
  • Referenced issue is linked (closes Harmonize feature type detection #701)
  • If you've fixed a bug or added code that should be tested, add tests!
  • Documentation in docs is updated

Description of changes

  • ehrapy now fully relies on ep.ad.infer_feature_types whenever a distinction between categorical and numerical features is needed.
  • The feature types are inferred automatically when not done manually by the user.
  • Feature/Encoding types stored previously in adata.var["ehrapy_column_type"] were removed.
  • Improved date detection in the ep.ad.infer_feature_types method: All dates stored in any ISO-format as a String are not automatically detected as dates.

Discussion points

  1. The encoding method stores data in adata.uns. With the new feature type detection, we could remove that now. The only reason why I kept it is because it is used to find out if the adata is already encoded or not, and, depending on that, the behavior of the encoding method changes. I could manually test that instead by just checking that all features are numerical. However, that would be more computationally expensive than simply checking if a specific key in adata.uns is present.
  2. When loading data via the dataloader, we have the option to automatically encode them (encoded=True). Because the encoding relies on feature types, those will automatically be inferred whenever data are loaded. I think automatically encoding them is a very convenient feature, so I guess we want to keep this behavior, but I still want to confirm that you are aware of and fine with this behavior.
  3. Related to the point above, we could also pre-define the correct feature types for the dataset offered by ehrapy, which would be really convenient for the users but not so pleasant for us to do. Just wanted to raise this as an idea and hear what you think about that.
  4. Should we offer a convenient function to correct feature type annotations? Something like ep.ad.set_feature_type(["feature1", "feature2", "feature3"], "categorical"), where three features that were detected incorrectly would be corrected to the feature type "categorical"?

ToDos

  • Add more specific warnings for tricky cases when inferring feature types
  • Resolve ToDos in the code
  • Don't save feature type information several times (specifically the encoding information in adata.uns, as described above)
  • Store everything encoding related currently stored in adata.uns in adata.var or remove it
  • Pre-safe all feature types in all ehrapy dataloader
  • Add method to switch feature types easily and reference it in the warning

@github-actions github-actions bot added the enhancement New feature or request label May 13, 2024
@Lilly-May Lilly-May marked this pull request as ready for review May 13, 2024 11:23
@Lilly-May Lilly-May requested a review from Zethson May 16, 2024 10:21
Copy link
Member

@Zethson Zethson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so so much for your hard work on this!

undo_encoding

Does this also need to be removed from the docs?

For the release notes, it'd be great if we had a function which takes the old feature annotations and moves them to the right places. We would tell people to run that when moving versions. Is this even necessary when it will run our new functions anyways? It might just reannotate the features with the right slots then, right?

ehrapy/anndata/_feature_specifications.py Outdated Show resolved Hide resolved
ehrapy/anndata/_feature_specifications.py Outdated Show resolved Hide resolved
ehrapy/anndata/_feature_specifications.py Outdated Show resolved Hide resolved
ehrapy/anndata/_feature_specifications.py Outdated Show resolved Hide resolved
ehrapy/anndata/_feature_specifications.py Outdated Show resolved Hide resolved
ehrapy/preprocessing/_encoding.py Outdated Show resolved Hide resolved
ehrapy/preprocessing/_encoding.py Outdated Show resolved Hide resolved
ehrapy/preprocessing/_encoding.py Outdated Show resolved Hide resolved
ehrapy/preprocessing/_encoding.py Outdated Show resolved Hide resolved
tests/tools/feature_ranking/test_rank_features_groups.py Outdated Show resolved Hide resolved
@Zethson Zethson requested a review from eroell May 16, 2024 20:34
@Lilly-May
Copy link
Collaborator Author

undo_encoding Does this also need to be removed from the docs?

I removed it from usage.md and looked for other usages. Is there another place I need to delete it from?

For the release notes, it'd be great if we had a function which takes the old feature annotations and moves them to the right places. We would tell people to run that when moving versions. Is this even necessary when it will run our new functions anyways? It might just reannotate the features with the right slots then, right?

The changes shouldn't cause any function to fail, as ep.ad.infer_feature_types will run automatically when feature types are not present. The only noticeable change might be that certain features are detected differently now. For instance, previously, a column with 0/1 values would usually be detected as numeric, whereas now it is categorical. Consequently, analysis results could differ with the new release. Still, I would vote against having a method that transfers the old feature type annotations, as they are not more reliable than the new ones. Also, several methods previously handled this differently and stored the annotations inconsistently, so there isn't really just one old feature type annotation.

@Zethson
Copy link
Member

Zethson commented May 18, 2024

I removed it from usage.md and looked for other usages. Is there another place I need to delete it from?

No, I just probably missed it. Thanks!

OK concerning the "transfer method"

@github-actions github-actions bot added the chore label May 19, 2024
@Zethson Zethson removed the chore label May 19, 2024
@Zethson Zethson merged commit 6331db2 into main May 19, 2024
16 of 17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Harmonize feature type detection
3 participants