Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement FlowSample on top of AxisArrays #15

Merged
merged 25 commits into from
Nov 2, 2022
Merged

Conversation

lgrozinger
Copy link
Contributor

@lgrozinger lgrozinger commented Sep 30, 2022

AxisArrays is supposed to be an efficient implemented of the kind of data structure that FlowSample.data is --- a matrix of values with named rows.

So I have reorganised FlowSample to use an AxisArray instead of a Dict for its data. I think this has a bunch of advantages, especially because it opens up all new ways to index into FlowSample. Here are a few:

The old channel access with String key still works:

julia> fr = load("test/testdata/BD-FACS-Aria-II.fcs");

julia> fr["SSC-A"]
1-dimensional AxisArray{Float32,1,...} with axes:
    :col, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  99991, 99992, 99993, 99994, 99995, 99996, 99997, 99998, 99999, 100000]
And data, a 100000-element Vector{Float32}:
  930.2
  253.87006
 1904.8274
  291.35733
...
julia> fr["SSC-A"][3]
1904.8274f0

Multiple channels can be taken also in a sensible way:

julia> fr[["SSC-A", "FSC-A"]]
2-dimensional AxisArray{Float32,2,...} with axes:
    :row, ["FSC-A", "SSC-A"]
    :col, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  99991, 99992, 99993, 99994, 99995, 99996, 99997, 99998, 99999, 100000]
And data, a 2×100000 Matrix{Float32}:
 159360.0  32128.0   187456.0   39616.0   71040.0    70976.0    …  70912.0    50304.0    68992.0    42176.0
    930.2    253.87    1904.83   1119.81    291.357    433.057       423.428    243.389    447.913    624.075

julia> fr[["SSC-A", "FSC-A"]][:, 3]
1-dimensional AxisArray{Float32,1,...} with axes:
    :row, ["FSC-A", "SSC-A"]
And data, a 2-element Vector{Float32}:
 187456.0
   1904.8274

Issue #12 is solved in a way:

julia> collect(fr[:, [1, 2, 4, 8, 16, 32]])
14×6 Matrix{Float32}:
 159360.0    32128.0      39616.0    52992.0     95808.0    66944.0
  72832.0    32128.0      38976.0    48704.0     84672.0    61824.0
    930.2      253.87      1119.81    1462.55      901.88     396.92
   7428.47     -38.3903     426.463    461.381     115.432   3872.81
    293.172     33.4145     411.433    516.568     124.547     70.5605
...

As well as combinations of these indexing strategies. There are also more tests added in this PR, for these new features, and to check that the old indexing works still as expected.

The GigaSOM test suite still passes just fine with these changes, so I hope it shouldn't have an impact on GigaSOM users, though that assumption is only as good as the GigaSOM test suite.

More detail can be found in the commit messages.

lgrozinger and others added 7 commits September 29, 2022 16:30
The existing tests have been rewritten to take their test data from
local files instead of attempting to download them --- the urls were
broken.

The large file test has been deleted.

New tests have been added for:
  - the key-based indexing of channels, which is existing
  functionality
  - the key-based indexing of multiple channels, which is to be
  implemented
  - the integer indexing of individual samples, which is to be
  implemented
  - the collection of integer indexing of multiple samples, which is
  to be implemented
Multiple channels can be indexed using a Vector of Strings
Individual samples can be indexed using an Integer
Multiple samples can be indexed using a Vector of Integers
I realised I was just rewriting the implementation of AxisArray. This
commit introduces the tests for a new FlowSample struct which is a
simple wrapper for AxisArray with some params attached.

AxisArray performance is supposed to be good, the interface doesn't
have to change, and they can easily be converted to matrices as needed
by GigaSOM.
These changes add lots of indexing options for FlowSamples.

I have tried to keep backwards compatibility with the old FlowSample,
I think we have it, but cannot be absolutely certain.
@tlnagy
Copy link
Owner

tlnagy commented Oct 1, 2022

Awesome work @lgrozinger! Thanks for taking the time to do this. I left my inline notes above, I think the main things other than that are to update the documentation in the README.

Project.toml Show resolved Hide resolved
src/parse.jl Outdated
end

data = AxisArray(datamatrix, rows, collect(1:size(datamatrix)[2]))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use size(x, 2) here instead of size(x)[2] here and elsewhere

src/type.jl Outdated Show resolved Hide resolved
test/testdata/testFCS2.fcs Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
Project.toml Outdated Show resolved Hide resolved
src/type.jl Outdated Show resolved Hide resolved
@lgrozinger
Copy link
Contributor Author

lgrozinger commented Oct 1, 2022

Ok I will make those changes you requested, thanks for your review and a great package

@tlnagy
Copy link
Owner

tlnagy commented Oct 1, 2022

Once I merge #16, do you mind rebasing this PR on master and use the new testing framework? The datasets are now downloaded from the https://github.com/tlnagy/fcsexamples repo when you run the tests.

…0-01-00-09-49-374-1964714061

CompatHelper: add new compat entry for "AxisArrays" at version "0.4"
The axes of the FlowSample data are now called :param and :event

size calls refactored to include dimension

version bumped

Base.axes redefined to call out to AxisArrays versions

Tests for large file added back in
Iterate just forwards the method to AxisArrays
@lgrozinger
Copy link
Contributor Author

So now I just need to update the documentation in order to reflect these changes.

Since we are bumping up to 0.2.0, maybe this is an opportunity to tighten up the interface a little. I just wanted to get your thoughts @tlnagy before doing so.

I propose we make a getter for params and deprecate direct access to params and data. This should discourage users depending on the underlying FlowSample struct and allow more changes without having to worry about breaking user code.

Access to the data can be done using indexing, or by conversion to Array if they really want the matrix.

I suggest a get_params or simply params method which provides access to the FlowSample.params. A cool feature might be a Base.get_property definition which allows access to the parameters with the dot syntax.

@tlnagy
Copy link
Owner

tlnagy commented Oct 3, 2022

Since we are bumping up to 0.2.0, maybe this is an opportunity to tighten up the interface a little.

Great idea

I propose we make a getter for params and deprecate direct access to params and data.

This is a good idea. We can throw a deprecation warning for data and params in 0.2.0 and remove access in 0.3.0.

I suggest a get_params or simply params method which provides access to the FlowSample.params. A cool feature might be a Base.get_property definition which allows access to the parameters with the dot syntax.

I think the Base.getproperty overload approach is a great idea, there'll just need to be some sanitization of the parameters to make sure they are compatible with Symbol (i.e. remove whitespace whatever). This would avoid the problem below:

My one concern about adding a params function is that then the user would have to actually add FCSFiles and run using FCSFiles to access that key info versus currently the workflow is accessible from just using FileIO. I wonder if there is a Base function that we can overload so that we can avoid that extra step. EDIT: What about overloading Base.axes for FlowSample to return a vector of all parameters?

Also, now that I merged #13, do you mind rebasing again?

@lgrozinger
Copy link
Contributor Author

Ok I'll get to work on the first two, the deprecation and the Base.get_property.

I had not considered that final point. The thing is I'm pretty sure that Base.axes is used at some point for indexing with slicing, or maybe its for the begin and end keywords.

Another idea --- if we let Base.get_property handle it again, and 'simulate' access to the member params. This is kind of best of both worlds, the user won't notice that anything about params changed (no need to throw the deprecation warning), and we still have control over the actual implementation of params inside the FlowSample struct.

The downside is that we cannot handle a parameter in the FCS files called "$params", "params" (or probably "$PARAMS", "PARAMS" too). I don't see anything in the FCS3 spec called that so thats probably alright (unless they decide to add that in at some point in the future)

@tlnagy
Copy link
Owner

tlnagy commented Oct 4, 2022

I had not considered that final point. The thing is I'm pretty sure that Base.axes is used at some point for indexing with slicing, or maybe its for the begin and end keywords.

I think it is probably fine to abuse the overloading of axes. Technically, we're not subtyping AbstractArray so we're not guaranteeing that FlowSample behaves like one in all cases. If someone needs that they can always collect or whatever to convert it to an Array. We should make sure those common conversions work.

I like axes since it's a common function and it's technically accurate, the parameters are independent axes of data.

Actually do we even need to offer anything like params at all? The combination of the Base.getproperty + printing out the axes in the Base.show function would have good discovery and then people will index and slice using AxisArray fancy getindex functions.

I imagine a workflow something like this

julia> fcs = FileIO.load("test/fcsexamples/Day 3.fcs")
FlowSample{Float32, UnitRange{Int64}}
    Machine: DVSSCIENCES-CYTOF-6.0.626
    Begin Time: 09:59:46
    End Time: 10:34:14
    Date: 26-Jun-2014
    Axes:
        Time
        Event_length
        Cd108Di (Cd108Di)
        Cd110Di (Cd110Di)
        Cd111Di (Cd111Di)
        Cd112Di (Cd112Di)
        Cd113Di (Cd113Di)
        Cd114Di (Cd114Di)
        Cd116Di (Cd116Di)
        Xe131Di
--- clipped ---
--- the display here is too long as is---

julia> fcs["Time", :]
1-dimensional AxisArray{Float32,1,...} with axes:
    :event, 1:283706
And data, a 283706-element Vector{Float32}:
   5.013
   9.023

julia> fcs[["Time", "Xe131Di"], 1:10]
2-dimensional AxisArray{Float32,2,...} with axes:
    :param, ["Time", "Xe131Di"]
    :event, 1:10
And data, a 2×10 Matrix{Float32}:
 5.013     9.023    11.172    12.409    15.221  21.706    30.82      53.385  60.091  68.424
 0.984885  4.68169   5.07743   1.65703   0.0     2.20896   0.084262   0.0     0.0     1.95139

julia> Array(fcs[["Time", "Xe131Di"], 1:10])
2×10 Matrix{Float32}:
 5.013     9.023    11.172    12.409    15.221  21.706    30.82      53.385  60.091  68.424
 0.984885  4.68169   5.07743   1.65703   0.0     2.20896   0.084262   0.0     0.0     1.95139

But maybe for ease of use we can add support for all the basic AxisArray functions? Like axisnames etc.

Also is there any particular reason you put the parameters as the rows? I feel like the more intuitive thing, for me at least, is to have the events as the rows and the parameters as the columns. Because then we could add support for https://tables.juliadata.org/stable/#Implementing-the-Interface-(i.e.-becoming-a-Tables.jl-source) which make it super easy to do something like

julia> DataFrame(fcs) # convert to DataFrames

@lgrozinger
Copy link
Contributor Author

Actually do we even need to offer anything like params at all? The combination of the Base.getproperty + printing out the axes in the Base.show function would have good discovery and then people will index and slice using AxisArray fancy getindex functions.

I would prefer to get rid of params, yes. The only reason I mentioned it is in case somebodies code relies on it being there. Well, anyway I don't use it at all... If you are ok to get rid of params we can add a deprecation warning pointing out the new Base.get_property behaviour, should be good enough.

Also is there any particular reason you put the parameters as the rows? I feel like the more intuitive thing, for me at least, is to have the events as the rows and the parameters as the columns.

I was weighing up the choice, but landed on this way round. It depends on what people are doing with the data but it might make a difference depending on how users are iterating over it (column or row order). In the end the difference might be really small, and I'm not sure which is going to work better.

Because then we could add support for https://tables.juliadata.org/stable/#Implementing-the-Interface-(i.e.-becoming-a-Tables.jl-source)

I mean we can support that interface regardless, whatever order the axes go. It would be a really good feature I think. We can add it to this pull request if you like, or we can chop up these features and setup a "release" branch. Which we can merge these into separately. That way we can prepare a 0.2.0 and release when all these features have been decided and implemented.

@tlnagy
Copy link
Owner

tlnagy commented Oct 4, 2022

I would prefer to get rid of params, yes. The only reason I mentioned it is in case somebodies code relies on it being there. Well, anyway I don't use it at all... If you are ok to get rid of params we can add a deprecation warning pointing out the new Base.get_property behaviour, should be good enough.

Okay, so then add an if statement in getproperty to throw a depreciation warning whenever someone tries to use .params and then we can delete that later.

In the end the difference might be really small, and I'm not sure which is going to work better.

That's true. Julia is column major so indexing is always faster column-wise:

julia> strides(Array(fcs))
(1, 50)

It's a question of if people are gonna going to index per-event or per-column. Honestly, it's probably fine as is. Per-event seems more likely, right?

I mean we can support that interface regardless, whatever order the axes go. It would be a really good feature I think. We can add it to this pull request if you like, or we can chop up these features and setup a "release" branch. Which we can merge these into separately. That way we can prepare a 0.2.0 and release when all these features have been decided and implemented.

That's fair. We can add that as a separate PR for version 0.2.1 or whatever.

A warning is emitted when the user tries to access params or data
directly.

Tests for these warnings are also added.
FCS TEXT segment keywords can now be accessed using the dot
syntax. E.g. `flowrun.par` retrieves the "\$PAR" keyword value.

Dot syntax access to `data` and `params` throws deprecation warnings.

`Base.propertynames` produces a list of the keywords for the
`FlowSample`.

This commit adds tests for these features also.
@tlnagy
Copy link
Owner

tlnagy commented Oct 13, 2022

Hey @lgrozinger, what's the status of this PR?

@lgrozinger
Copy link
Contributor Author

Hey @lgrozinger, what's the status of this PR?

Hey sorry I was away for a while.

Everything should be implemented as we discussed, with the exception of the Tables interface, which I think I will do in a new PR. The documentation however, is not updated to include the new indexing styles. The example in the documentation works, but I think it would be good to mention the new indexing capabilities there.

I will try to get the docs done soon.

@lgrozinger
Copy link
Contributor Author

Ok so I have added some docs in the readme.

@tlnagy
Copy link
Owner

tlnagy commented Nov 2, 2022

Sorry for the delay, but this looks great! I think it's ready to be merged and tagged.

@tlnagy
Copy link
Owner

tlnagy commented Nov 2, 2022

Looks like the version number won't work. Do you mind changing it to v0.2.0 and then I can merge.

@lgrozinger
Copy link
Contributor Author

Looks like the version number won't work. Do you mind changing it to v0.2.0 and then I can merge.

ah sorry

@tlnagy tlnagy merged commit 5c28a71 into tlnagy:master Nov 2, 2022
@tlnagy
Copy link
Owner

tlnagy commented Nov 2, 2022

Thanks for getting this done!

@lgrozinger
Copy link
Contributor Author

Absolute pleasure, since I find this package very useful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants