Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve move_to_x #661

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

improve move_to_x #661

wants to merge 3 commits into from

Conversation

xinyuejohn
Copy link
Collaborator

PR Checklist

  • This comment contains a description of changes (with reason)
  • Referenced issue is linked
  • If you've fixed a bug or added code that should be tested, add tests!
  • Documentation in docs is updated

Description of changes
Add three arguments to move_to_x: copy_uns, copy_obsm, and copy_varm and they are all set to be True and optional.

Technical details
Tests are already been added.

Additional context

@xinyuejohn xinyuejohn self-assigned this Mar 4, 2024
@xinyuejohn xinyuejohn linked an issue Mar 4, 2024 that may be closed by this pull request
@xinyuejohn xinyuejohn requested a review from Zethson March 4, 2024 14:01
@eroell
Copy link
Contributor

eroell commented Mar 5, 2024

A few points on this:

  • I wonder if it is a good idea to store computed results (e.g. adata.obsm["X_pca"]) together with altered data that would actually yield different results if one would recompute. Is there a striking example justifying to do that?
  • If .obsm is copied, .obsp might be as well following this logic I think?
  • Copying .varm should be omitted in any case my opinion; This contains per-feature computed results. By doing move_to_x, a new feature is actually added. In addition to the thoughts about changed data above, here the dimensionality of the data does not match anymore. Example below of how this pretty quickly fails :)
import ehrapy as ep
adata = ep.dt.diabetes_130(columns_obs_only=["race", "gender", "age"])
adata_prep = adata[
    :,
    (adata.var.index == "time_in_hospital_days")
    | (adata.var.index == "num_lab_procedures")
    | (adata.var.index == "num_procedures")
    | (adata.var.index == "num_medications")
    | (adata.var.index == "number_diagnoses"),
]
ep.pp.pca(adata_prep, n_comps=2)
adata_prep.varm["PCs"]
adata_prep_2 = ep.ad.move_to_x(adata_prep, "gender", copy_varm=True)
ValueError: Value passed for key 'PCs' is of incorrect shape. Values of varm must match dimensions ('var',) of parent. Value had shape (5,) while it should have had (6,).

Considering that move_to_x creates somewhat a new dataset, it might be justified and/or necessary to (re)compute things?



def test_move_to_x_copy_varm(adata_move_obs_mix):
move_to_obs(adata_move_obs_mix, ["name"], copy_obs=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing copy_obs=True here means that "name" is kept as variable; so the move_to_x doesn't actually move it back; hence the failing example in my comment was a bit hidden :)

@Zethson
Copy link
Member

Zethson commented Mar 5, 2024

@xinyuejohn could you also outline for us again for what exactly you need this feature, please? I do totally see @eroell points

@xinyuejohn
Copy link
Collaborator Author

A few points on this:

  • I wonder if it is a good idea to store computed results (e.g. adata.obsm["X_pca"]) together with altered data that would actually yield different results if one would recompute. Is there a striking example justifying to do that?
  • If .obsm is copied, .obsp might be as well following this logic I think?
  • Copying .varm should be omitted in any case my opinion; This contains per-feature computed results. By doing move_to_x, a new feature is actually added. In addition to the thoughts about changed data above, here the dimensionality of the data does not match anymore. Example below of how this pretty quickly fails :)
import ehrapy as ep
adata = ep.dt.diabetes_130(columns_obs_only=["race", "gender", "age"])
adata_prep = adata[
    :,
    (adata.var.index == "time_in_hospital_days")
    | (adata.var.index == "num_lab_procedures")
    | (adata.var.index == "num_procedures")
    | (adata.var.index == "num_medications")
    | (adata.var.index == "number_diagnoses"),
]
ep.pp.pca(adata_prep, n_comps=2)
adata_prep.varm["PCs"]
adata_prep_2 = ep.ad.move_to_x(adata_prep, "gender", copy_varm=True)
ValueError: Value passed for key 'PCs' is of incorrect shape. Values of varm must match dimensions ('var',) of parent. Value had shape (5,) while it should have had (6,).

Considering that move_to_x creates somewhat a new dataset, it might be justified and/or necessary to (re)compute things?

Thanks for your reply and I totally got your points!

Here's my use case:

adata.obs = pd.merge(adata.obs, df_statistics, how="left", left_index=True, right_index=True)
adata = ep.ad.move_to_x(adata, list(df_statistics.columns))
adata.obsm = obsm

In .obsm, there are some awkward arrays which still valid as long as the length of observations keep same.

@eroell
Copy link
Contributor

eroell commented Mar 8, 2024

Thanks for the example - so to my understanding you add some (maybe externally computed) statistics to the .obs field, and then move these to the .X field.

Quick clarification, what is in the awkward arrays (I guess not a PCA embedding as in my comment above), which would be in the .obm field that is still valid?

@xinyuejohn
Copy link
Collaborator Author

Thanks for the example - so to my understanding you add some (maybe externally computed) statistics to the .obs field, and then move these to the .X field.

Quick clarification, what is in the awkward arrays (I guess not a PCA embedding as in my comment above), which would be in the .obm field that is still valid?

I think this diagram will help you understand what I want to do. Please kindly have a look at it.
image

In adata, each row represents a patient's visit.
In .obs, there are all the episode level features. And they have only a single value and without any time information, like height/age/gender.
In .obsm, there are all time series features stored in awkward array. They have multiple records and have both time information and values. For example, adata.obsm['heart rate'][0] = awkward_array(time: [0, 1, 2, 4, 5, 7], value: [99, 80, 78, 91, 70, 90])

In this settings, I can easily move data from .obs to .X using move_to_x we are talking about and I can also move .obsm to .X using some aggregations. And if I change .X, .obsm will still be valid as it stores data irrelevant to .X

@eroell
Copy link
Contributor

eroell commented Mar 18, 2024

Thanks for the diagram, I think I get the point here - it is data in the .obsm in your case, not a computed result.
The .obsm field in the scRNA-seq setting is often used for computed results, e.g. most things that produce an embedding of some sort. In ehrapy, also many tools compute an embedding and store that in the .obsm field.

It probably makes sense to not mix things computed from .X with raw data together in .obsm?
Although expert users like you @xinyuejohn ofc can use this structure if helpful ;)

We have yet to include time series support in a meaningful way - a data field alike your use of .obsm is one of the candidates for sure.

Whatever this might look like, copying all the -data- as you suggest here should happen indeed.

I think I would refrain from copying .obsm when doing move_to_x, as in many cases this field will be of results computed from .X

What do you think @xinyuejohn ?

@Zethson
Copy link
Member

Zethson commented Mar 18, 2024

I also think that it's weird if it happens automatically, but maybe we can just parameterize this and well document the use-case?

@eroell
Copy link
Contributor

eroell commented Mar 18, 2024

That would be .obsm, .uns, .obsp you'd consider taking in?

@Zethson
Copy link
Member

Zethson commented Mar 19, 2024

Think so ya? Or just obms and uns?!?

@Zethson
Copy link
Member

Zethson commented Apr 18, 2024

So what do we do with this PR?

@eroell
Copy link
Contributor

eroell commented Apr 19, 2024

I'd be in favor of not having these values copied now, but storing this data later more consistently.

But I'm also fine having options for copying .obsm, .uns, .obsp (while setting the default to False)

@Zethson
Copy link
Member

Zethson commented May 14, 2024

Okay, then let's make these options if it makes it easier for ehrdata. That can also depend on an evaluation of @eroell

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

current move_to_x() doesn't copy original .uns/.obsm/.varm ...
3 participants