get_transform_fxn signature #146

Closed
JS3xton opened this issue Nov 23, 2015 · 2 comments

Comments

@JS3xton
Contributor

JS3xton commented Nov 23, 2015

In #141, I was confused by the MEF section in process_beads_table, and some of that confusion seemed to originate from the get_transform_fxn signature.

Relevant parts of the original thread, brought over from #141:

JS:

While I do think it's generally the case that beads should have the same number of peaks, I think it's an unnecessary assumption and possibly detrimental.

  • The user may not have put the known MEF values in that way. If this is a required assumption, I think it should be detected and an error should be raised if that assumption is violated. Would this get caught under the current implementation? Or would underspecified known MEF values just get padded with zeros upon creation of the mef_values numpy array?
    • Looks like mef_values would not get padded; instead, you appear to get a numpy object array where each element is a list (see the quick check after this list). That may still work? I still think the more explicit dictionary data structure would be better there if an error is not warranted.
  • It's pretty easy to write this in a manner where each channel is independent. A {channel -> known MEF values} dictionary is intuitive and allows more general specification of known MEF values.
  • Some devil's advocate arguments for variable numbers of fluorophore concentrations:
    • It's not outside the realm of possibility that you could get beads with different numbers of fluorophore concentrations for different fluorophores, e.g. 8 MEF but 6 MECY. It seems unlikely, though, and I'm not aware of a product that currently exists like this.
    • I could mix two different non-rainbow beads in the same sample, maybe to get FL1 and FL3 coverage, but the beads may have different numbers of peaks.
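A quick check of the padding question above, using arbitrary example values (a sketch, not FlowCal code): NumPy does not pad the shorter list with zeros; it falls back to a 1D object array whose elements are the original Python lists, and recent NumPy versions refuse to build such an array at all unless dtype=object is passed explicitly.

    import numpy as np

    # Arbitrary example values; the shorter list is NOT padded with zeros.
    mef_values = [[0, 100, 1000, 10000],
                  [0, 100, 1000]]

    arr = np.array(mef_values, dtype=object)  # recent NumPy requires dtype=object here
    arr.shape    # (2,) -- a 1D object array, not a padded 2D array
    arr[0]       # [0, 100, 1000, 10000] -- each element is still a plain Python list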

SMC:

  • All the MEF values from a single beads sample should have the same number of values, because a single sample has the same number of subpopulations.
  • It seems that you want to change the signature of get_transform_fxn, which does not belong here. You might want to open an issue for this. You might also want to reconsider this given my previous point.

JS:

  • Might consider promoting # Process MEF values to an emphasized comment.
    • I also had to re-read this section several times to understand it. The purpose of the mef_values and mef_channels data structures isn't terribly clear from the description or their names. Is mef_values referring to known MEF values? It's a list of lists that you convert to a numpy array? That seems dangerous if the specified known MEF values are not all of the same length (different beads have different numbers of peaks; it just seems like an unnecessary assumption). Looks like this is supposed to be the input to mef.get_transform_fxn, which expects a {channel -> known MEF} mapping. I'd almost prefer something more explicit like a dictionary or a list of numpy arrays rather than relying on indexing numpy columns? That would also more generally support different specifications of known MEF values.
    • Your comment refers to mef_channels which hasn't been defined yet. Very confusing way to start off the comment. Might explain that you first make mef_channels (and explain what that is).
    • What does if '{} MEF Values'.format(fl_channel) in beads_row: mean? A column exists with fl_channel (e.g. "FL1-H", "FL3", etc.) + "MEF Values" in the Beads table. When is this not true? If the instrument reads a fluorescence channel, but the user doesn't care about it for their analysis (e.g. our FL2?).
    • Is mef_channels a subset of fl_channels? Again, I had to look at the mef.get_transform_fxn documentation, which apparently says it's the channels that want MEF transformation functions. OK, that makes sense. If this really is just a subset of fl_channels, it would help to mention that. When is this ever not fl_channels? Again, if the instrument measures a channel that the user doesn't care about? Something like a very simple "subset of fl_channels requesting MEF transformations" would help clarify that. I also still feel like it would be much more natural to have a {channel -> known MEF values} dictionary to capture all of this (a sketch of that follows this list).
    • I'm still confused about what this is doing. OK, I think I got it?
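A minimal sketch of the dictionary-based alternative suggested above, with hypothetical channel names and made-up MEF values (not FlowCal's actual interface):

    import numpy as np

    # Hypothetical known MEF values per channel (numbers made up for illustration).
    # Each channel carries its own array, so lengths are free to differ and nothing
    # relies on indexing columns of a shared 2D array.
    known_mef = {
        'FL1-H': np.array([0, 646, 1704, 4827, 15991, 47609, 135896, 273006]),
        'FL3-H': np.array([0, 1614, 4035, 12025, 31896, 95682]),
    }

    # mef_channels would then simply be the dictionary's keys:
    mef_channels = list(known_mef.keys())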
@JS3xton
Contributor Author

JS3xton commented Dec 16, 2015

This is now a bug. If the MEF values for two different channels for the same bead row do not match in number of elements, I get the following error:

Processing Beads table (4 entries)
==================================

Beads ID B1...
Loading file "OR20140627/Data.002"...
Performing gating...
Plotting density plot and histogram...
Calculating standard curve for channels FL1-H, FL3-H...
Traceback (most recent call last):
  File "C:\Users\John\Anaconda2\envs\FlowCalv1.0.0b1\lib\runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "C:\Users\John\Anaconda2\envs\FlowCalv1.0.0b1\lib\runpy.py", line 72, in _run_code
    exec code in run_globals
  File "build\bdist.win-amd64\egg\FlowCal\excel_ui.py", line 830, in <module>
  File "build\bdist.win-amd64\egg\FlowCal\excel_ui.py", line 789, in run
  File "build\bdist.win-amd64\egg\FlowCal\excel_ui.py", line 333, in process_beads_table
  File "build\bdist.win-amd64\egg\FlowCal\mef.py", line 546, in get_transform_fxn
IndexError: tuple index out of range

The reason I do not specify the same number of MEF values for a given bead row is that I have different gain settings for different channels such that I can only resolve a subset of the peaks for one of the channels. I manually pruned the MEF Values entries for that channel and got this bug.

This bug seems to stem from the fact that you call mef_values = np.array(mef_values).copy() at the top of mef.get_transform_fxn(). mef_values is a list of lists containing the parsed MEF Values. When I tried to construct a numpy array from a list of lists where the inner lists were not of equal length, numpy gave me a 1D array of dtype=object (where each element is a list, instead of the 2D numpy array of numbers originally intended here), which fails a few lines later when you try to access mef_values.shape[1].
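The failing step can be reproduced in isolation (arbitrary values; this mirrors the description above rather than the exact FlowCal code path):

    import numpy as np

    # Equal-length lists give the intended 2D numeric array:
    np.array([[0, 100, 1000], [0, 200, 2000]]).shape[1]   # 3, works fine

    # Unequal-length lists become a 1D object array instead (on NumPy 1.10 this
    # happened silently; recent NumPy raises an error unless dtype=object is given),
    # so shape is just (2,) and shape[1] fails exactly as in the traceback:
    ragged = np.array([[0, 100, 1000], [0, 200]], dtype=object)
    ragged.shape[1]   # IndexError: tuple index out of range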

This may be specific to my current version of numpy (1.10.1)? While enforcing numpy requirements might fix this problem, I think a much better solution would be to redo the mef_values/mef_channels data structure.

@castillohair
Collaborator

There are two things here:

One, a complaint about the impossibility of having a different number of MEF values in different channels. This is not a bug. We intentionally force these to be the same, because the beads that we have seen so far contain the same number of peaks in all channels. The clustering algorithm relies on this assumption, and clustering in N dimensions when the number of subpopulations along one axis is different from another doesn't make a lot of sense.

To address some of your "edge cases":

The user may not have put the known MEF values in that way

Then the user needs to read the documentation and use the package correctly.

If this is a required assumption, I think it should be detected and an error should be raised if that assumption is violated

A more explicit error message is not a bad idea.

it's not outside the realm of possibility that you could get beads with different numbers of fluorophore concentrations for different fluorophores. e.g. 8 MEF but 6 MECY. seems unlikely, though, and I'm not aware of a product which currently exists like this.

I have no idea how this would make sense if you look at it in a 2D scatter plot. I can think of a few convoluted ways in which this would be possible, but they are very complicated and would require extensive revisions of the clustering algorithm and the MEF algorithm in general. I'd rather not deal with this unless we confirm that such a product exists.

I could mix two different non-rainbow beads in the same sample, maybe to get FL1 and FL3 coverage, but the beads may have different numbers of peaks.

Or you could use beads properly, and run two separate samples. Even if you really wanted to do this, and even if beads in both channels had the same number of peaks, the clustering algorithm would break. Let's say that you're using beads with a green fluorophore (detected in FL1) and beads with a red fluorophore (FL3) and you mixed them in the same tube. If you assume no bleedthrough (best case scenario), then you would see that, in FL1, all the red beads would form their own group, probably with the same measured fluorescence as the blank peak of the green beads. Therefore, you would have a blank peak that has a much greater number of events than the other peaks. This would break the GMM clustering algorithm, because it is currently set up to find clusters of roughly the same size.

In summary, handling these edge cases requires way more than allowing different numbers of peak values. Since, to the best of my knowledge, there are no beads that require such behavior, and no scenario that can't be fixed by having the user use beads properly, this is not something that should be worked on. It may be a good idea, however, to have a more explicit error message when the user tries to enter a different number of MEF values into get_transform_fxn(). Feel free to open a new issue on this if you're interested.
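As a rough idea of what such an explicit check might look like near the top of get_transform_fxn() (a sketch only; the helper name and error message are hypothetical, not FlowCal code):

    import numpy as np

    def _check_mef_values(mef_values, mef_channels):
        # Hypothetical helper: mef_values holds one list of known MEF values per
        # channel in mef_channels. Raise a clear error if the lengths differ.
        lengths = [len(v) for v in mef_values]
        if len(set(lengths)) > 1:
            detail = ', '.join('{}: {} values'.format(c, n)
                               for c, n in zip(mef_channels, lengths))
            raise ValueError('all channels must specify the same number of known '
                             'MEF values ({})'.format(detail))
        return np.array(mef_values)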

The other issue is about mef_values being a dictionary, instead of a numpy array or a list of lists. Given that we don't want to support entering different numbers of MEF values for different channels (see above), the switch to a dictionary becomes more of a personal preference, in my opinion, and I am against changing a parameter for personal preference. Given the time passed since this issue's submission, I will assume that interest in this stylistic change has faded, and close. Feel free to open a more focused issue, but be aware that I will be against changing the datatype of a parameter if there is no clear advantage.
