Unify feature type detection #724

Lilly-May · 2024-05-12T15:11:10Z

PR Checklist

This comment contains a description of changes (with reason)
Referenced issue is linked (closes Harmonize feature type detection #701)
If you've fixed a bug or added code that should be tested, add tests!
Documentation in docs is updated

Description of changes

ehrapy now fully relies on ep.ad.infer_feature_types whenever a distinction between categorical and numerical features is needed.
The feature types are inferred automatically when not done manually by the user.
Feature/Encoding types stored previously in adata.var["ehrapy_column_type"] were removed.
Improved date detection in the ep.ad.infer_feature_types method: all dates stored as strings in any ISO format are now automatically detected as dates.
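
As an illustration of the date detection described above, here is a minimal sketch of how ISO-formatted strings could be recognized. It is not the code in ep.ad.infer_feature_types: the helper names `looks_like_iso_date` and `is_date_column` are hypothetical, and it uses the standard library's `datetime.fromisoformat`, which accepts a narrower set of ISO 8601 variants on Python versions before 3.11.

```python
from datetime import datetime


def looks_like_iso_date(value: object) -> bool:
    """Return True if value is a string that parses as an ISO date or datetime."""
    if not isinstance(value, str):
        return False
    try:
        datetime.fromisoformat(value)
    except ValueError:
        return False
    return True


def is_date_column(values: list) -> bool:
    """Treat a column as a date column only if every non-missing entry parses."""
    present = [v for v in values if v is not None]
    # An empty or all-missing column carries no evidence, so it is not a date column.
    return bool(present) and all(looks_like_iso_date(v) for v in present)


print(is_date_column(["2024-05-12", "2024-05-18T09:02:57", None]))  # True
print(is_date_column(["2024-05-12", "not a date"]))  # False
```

Requiring every non-missing value to parse is a deliberately strict choice in this sketch; a single free-text entry keeps the column from being treated as a date.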

Discussion points

The encoding method stores data in adata.uns. With the new feature type detection, we could remove that now. The only reason why I kept it is because it is used to find out if the adata is already encoded or not, and, depending on that, the behavior of the encoding method changes. I could manually test that instead by just checking that all features are numerical. However, that would be more computationally expensive than simply checking if a specific key in adata.uns is present.
When loading data via the dataloader, we have the option to automatically encode them (encoded=True). Because the encoding relies on feature types, those will automatically be inferred whenever data are loaded. I think automatically encoding them is a very convenient feature, so I guess we want to keep this behavior, but I still want to confirm that you are aware of and fine with this behavior.
Related to the point above, we could also pre-define the correct feature types for the datasets offered by ehrapy, which would be really convenient for users but not so pleasant for us to do. Just wanted to raise this as an idea and hear what you think about it.
Should we offer a convenient function to correct feature type annotations? Something like ep.ad.set_feature_type(["feature1", "feature2", "feature3"], "categorical"), where three features that were detected incorrectly would be corrected to the feature type "categorical"?
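
To make the last discussion point concrete, here is a minimal sketch of what such a correction helper could look like. Everything in it is hypothetical: the function does not exist in ehrapy at this point, a plain dict stands in for the annotations that would live on the AnnData object, and the set of valid type names is an assumption.

```python
from __future__ import annotations

# Assumed set of feature type names; the real constants may differ.
VALID_FEATURE_TYPES = {"categorical", "numeric", "date"}


def set_feature_type(
    feature_types: dict[str, str], features: list[str], corrected_type: str
) -> None:
    """Overwrite the stored type of the given features in place (hypothetical helper)."""
    if corrected_type not in VALID_FEATURE_TYPES:
        raise ValueError(f"Unknown feature type: {corrected_type!r}")
    unknown = [name for name in features if name not in feature_types]
    if unknown:
        # Validate before mutating so a typo does not leave a half-applied change.
        raise KeyError(f"Unknown features: {unknown}")
    for name in features:
        feature_types[name] = corrected_type


inferred = {"feature1": "numeric", "feature2": "numeric", "feature3": "date"}
set_feature_type(inferred, ["feature1", "feature2"], "categorical")
print(inferred)
```

Validating both the type name and the feature names up front means a failed call leaves the annotations untouched, which seems preferable for a function users would call to fix mistakes.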

ToDos

Add more specific warnings for tricky cases when inferring feature types
Resolve ToDos in the code
Don't save feature type information several times (specifically the encoding information in adata.uns, as described above)
Store everything encoding related currently stored in adata.uns in adata.var or remove it
Pre-save all feature types in all ehrapy dataloaders
Add method to switch feature types easily and reference it in the warning

# Conflicts: # pyproject.toml

tests/anndata/test_anndata_ext.py

ehrapy/preprocessing/_encoding.py

Zethson

Thank you so so much for your hard work on this!

undo_encoding

Does this also need to be removed from the docs?

For the release notes, it'd be great if we had a function which takes the old feature annotations and moves them to the right places. We would tell people to run that when moving versions. Is this even necessary when it will run our new functions anyways? It might just reannotate the features with the right slots then, right?

ehrapy/anndata/_feature_specifications.py

ehrapy/preprocessing/_encoding.py

tests/tools/feature_ranking/test_rank_features_groups.py

Lilly-May · 2024-05-18T08:15:49Z

> undo_encoding: Does this also need to be removed from the docs?

I removed it from usage.md and looked for other usages. Is there another place I need to delete it from?

> For the release notes, it'd be great if we had a function which takes the old feature annotations and moves them to the right places. We would tell people to run that when moving versions. Is this even necessary when it will run our new functions anyways? It might just reannotate the features with the right slots then, right?

The changes shouldn't cause any function to fail, as ep.ad.infer_feature_types will run automatically when feature types are not present. The only noticeable change might be that certain features are detected differently now. For instance, previously, a column with 0/1 values would usually be detected as numeric, whereas now it is categorical. Consequently, analysis results could differ with the new release. Still, I would vote against having a method that transfers the old feature type annotations, as they are not more reliable than the new ones. Also, several methods previously handled this differently and stored the annotations inconsistently, so there isn't really just one old feature type annotation.
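
To illustrate why a 0/1 column can now end up categorical, here is a minimal sketch of one possible heuristic. It is not the rule ep.ad.infer_feature_types applies: the function name `guess_feature_type` and the `max_categories` threshold are assumptions made only for this example.

```python
def guess_feature_type(values: list, max_categories: int = 10) -> str:
    """Guess 'categorical' or 'numeric' for one column (illustrative heuristic only)."""
    present = [v for v in values if v is not None]
    is_number = [isinstance(v, (int, float)) and not isinstance(v, bool) for v in present]
    if not all(is_number):
        # Strings, booleans, and mixed content are treated as categorical.
        return "categorical"
    distinct = set(present)
    integer_valued = all(float(v).is_integer() for v in distinct)
    # Few distinct integer codes (such as 0/1) look like categories stored numerically.
    if integer_valued and len(distinct) <= max_categories:
        return "categorical"
    return "numeric"


print(guess_feature_type([0, 1, 1, 0, 1]))  # categorical
print(guess_feature_type([36.6, 37.2, 38.9]))  # numeric
```

A threshold like this also misclassifies short numeric columns with few distinct integer values, which is one reason a way to correct the inferred types manually is useful.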

Zethson · 2024-05-18T09:02:57Z

> I removed it from usage.md and looked for other usages. Is there another place I need to delete it from?

No, I just probably missed it. Thanks!

OK concerning the "transfer method"

Lilly-May added 13 commits May 3, 2024 10:14

Updated cohort tracker to new feature type detection

ad2a022

Improved date detection

a3b8f8c

Started updating rank_features_groups

47cb0ee

Updated rank_features_groups to use infer_feature_types

fef097a

Updated anndata_ext and rank_features_groups to use infer_feature_types

ae7be10

Merge branch 'main' into feature/feature_types_harmonization

5bbb16d

# Conflicts: # pyproject.toml

Updated encoding

7fd3413

Encode only non-numerical categorical features

8c245c6

Remove old encoding constants

d7b5f02

Rename CONTINUOUS_TAG to NUMERIC_TAG

917df6d

Updated dateutil dependency

6e09cfb

Fixed detection of categorical columns stored numerically

61e06b3

Resolved Code ToDos

e9d9905

Lilly-May commented May 13, 2024

tests/anndata/test_anndata_ext.py

Merge branch 'main' into feature/feature_types_harmonization

733e37c

github-actions bot added the enhancement label May 13, 2024

Lilly-May added 4 commits May 13, 2024 13:12

Merge branch 'main' into feature/feature_types_harmonization

145f473

Removed unused fixture

9635c4c

Specific warnings

82f461e

Use warning instead of info

4c2ed8e

Lilly-May marked this pull request as ready for review May 13, 2024 11:23

Lilly-May added 5 commits May 14, 2024 11:37

Remove multi-column encoding

3759427

Remove storing things in uns during encoding

67cd1d8

Looked through datasets until chronic_kidney_disease

e417ce3

Fixed _reorder_encodings

2164c88

Allow for partial encoding with autodetect

3d7ace1

Lilly-May commented May 15, 2024

ehrapy/preprocessing/_encoding.py

Lilly-May added 3 commits May 15, 2024 15:01

Dataloader modified until synthea_1k_sample

93fce73

Updated synthea_1k_sample dataloader

3d3db46

Show encoding mode in feature_type_overview

8da6d56

Lilly-May added 3 commits May 16, 2024 11:14

Updated encoding so that it doesn't save unencoded data in uns

5b8fe68

Updated QC to new encoding functionalities

976ce52

Added examples

3275450

Lilly-May requested a review from Zethson May 16, 2024 10:21

Zethson reviewed May 16, 2024

Zethson requested a review from eroell May 16, 2024 20:34

eroell reviewed May 17, 2024

ehrapy/anndata/_feature_specifications.py