Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mef.get_transform_fxn() outputs should be OrderedDict instead of lists to carry around channel ID #232

Closed
JS3xton opened this issue Feb 1, 2017 · 7 comments

Comments

@JS3xton
Copy link
Contributor

JS3xton commented Feb 1, 2017

mef.get_transform_fxn() loops through specified MEF channels and builds a functools.partial function which transforms RFI values to MEF values based on specified bead data. If you ask for full_output, this function accumulates useful intermediate data structures and information for each specified MEF channel (specified via mef_channels input) in a series of lists and returns these lists to the user. E.g. from the mef.get_transform_fxn() documentation -

    fitting : dict, only if ``full_output==True``
        Results of the model fitting step, containing the following fields:
        std_crv : list
            Functions encoding the standard curves, for each channel in
            `mef_channels`.
        beads_model : list
            Functions encoding the fluorescence model of the
            calibration beads, for each channel in `mef_channels`.
        beads_params : list
            Fitted parameters of the bead fluorescence model, for each
            channel in `mef_chanels`.
        beads_model_str : list
            String representation of the bead models used, for each channel
            in `mef_channels`.
        beads_params_names : list
            Names of the parameters given in `beads_params`, for each
            channel in `mef_channels`.

I think it would be more useful if these accumulation data structures were OrderedDicts keyed on the mef_channels values. This seems like a natural way to carry around the channel ID for use in subsequent computations, and I believe the original list can be recovered easily via OrderedDict.values().

My use case is that I'm using the excel_ui.process_beads_table() function from the Python API, and I want to print the detector voltages of the channels I've chosen to MEF.

excel_ui.process_beads_table() returns:

    beads_samples : list of FCSData objects
        A list of processed, gated, and transformed samples, as specified
        in `beads_table`, in the order of ``beads_table.index``.
    mef_transform_fxns : OrderedDict
        A dictionary of MEF transformation functions, indexed by
        ``beads_table.index``.
    mef_outputs : list
        A list with intermediate results of the generation of the MEF
        transformation functions, indexed by ``beads_table.index``. Only
        included if `full_output` is True.

so I can extract useful information (e.g. the bead fitting parameters) by accessing mef_outputs via the numerical list index which corresponds to the Beads table entry I am interested in. (I will note here that mef_outputs and beads_samples should also maybe be an OrderedDict to be consistent with mef_transform_fxns. In general, I've found it very useful to loop through the Beads Table UIDs (DataFrame index) and access these structures via the UID instead of having to additionally carry around the list index (e.g. via enumerate) to access a bunch of lists).

From this context, though, I don't have access to the mef_channels that excel_ui.process_beads_table() called mef.get_transform_fxn() with without going back to the Excel file and re-parsing/extracting them myself, which seems woefully inelegant. I don't think it's possible in the general case with the current implementation to determine the channel ID without consulting the Excel file because the user can specify any combination of valid channels to MEF.

Problematic example code:

beads_samples, mef_transform_fxns, mef_outputs = fc.excel_ui.process_beads_table(
    beads_table=beads_table,
    instruments_table=instruments_table,
    full_output=True)

print('Bead Fluorescence Model Parameters:')
for idx,uid in enumerate(beads_table.index):        # <-- idx feels unnecessary
    print('-Bead Table ID: {}'.format(uid))
    for ch_idx, ch_params in enumerate(mef_outputs[idx].fitting['beads_params']):
        print('--Channel Index: {} (voltage={}), Params: {}'.format(
            ch_idx,
            beads_samples[idx].detector_voltage(channels=???),
            ch_params))

I would prefer:

beads_samples, mef_transform_fxns, mef_outputs = fc.excel_ui.process_beads_table(
    beads_table=beads_table,
    instruments_table=instruments_table,
    full_output=True)

print('Bead Fluorescence Model Parameters:')
for uid in beads_table.index:
    print('-Bead Table ID: {}'.format(uid))
    for ch, ch_params in mef_outputs[uid].fitting['beads_params']:
        print('--Channel: {} (voltage={}), Params: {}'.format(
            ch,
            beads_samples[uid].detector_voltage(channels=ch),
            ch_params))
@castillohair
Copy link
Collaborator

There are two separate issues here. I'll tackle the one about process_beads_table (and by extension, the outputs of process_samples_table) first. I think that having their outputs being OrderedDict would be slightly better, but not enough to justify breaking user's code. OrderedDicts do not cleanly replace a list, unfortunately:

In [1]: import collections

In [2]: d = collections.OrderedDict()

In [3]: d['k1'] = 1

In [4]: d['k2'] = 2

In [5]: for di in d:
   ...:     print di
   ...:     
k1
k2

In addition, I think the main problem is that beads_samples is a separate structure from beads_table. I've found it annoying to keep track of both, and to update one when the other changes. Changing the datatype of beads_samples will not solve this. The cleanest solution, in my opinion, would be to have a column in beads_table pointing to processed FCSData objects, but I haven't found a way to do that.

The other issue that you present is the output of mef.get_transform_fxn(), particularly that the channels are not exposed outside the Excel UI. This can solved in many ways:

  1. Have process_beads_table pack the MEF channels in mef_outputs
  2. Have process_beads_table pack the MEF channels as another output argument.
  3. Change all the lists in the extended output of mef.get_transform_fxn() to OrderedDicts, as you mentioned.

I would prefer the first option. It seems to me that this is not a problem of poor design in mef_transform_fxn(), but of the Excel UI not giving full information to the caller.

In general, I have some inertia towards changing things to OrderedDicts because I prefer to use native types when possible, and because user's code will break.

@castillohair
Copy link
Collaborator

Also, the description of mef_outputs in the documentation could be improved. It took me a while to find exactly what the outputs were (and I had to look at the code).

@JS3xton
Copy link
Contributor Author

JS3xton commented May 7, 2017

Also, the description of mef_outputs in the documentation could be improved. It took me a while to find exactly what the outputs were (and I had to look at the code).

Agreed.

Issue 1 - beads_samples and mef_outputs Data Types

In addition, I think the main problem is that beads_samples is a separate structure from beads_table. I've found it annoying to keep track of both, and to update one when the other changes. Changing the datatype of beads_samples will not solve this.

I guess I disagree. I think making beads_table (and mef_outputs) OrderedDicts makes updating all data structures easier and more consistent (accessing them all with the same key/index). (I won't call it "solved", but I would argue that it's better.). I guess you could argue the other way, too (i.e. revert mef_transform_fxns from an OrderedDict back to a list, but I think the OrderedDicts are much more elegant in the way they allow you to key on the DataFrame index (and therefore only have to carry around 1 indexing variable), which the list doesn't allow you to do). These accumulator data structure update statements really should all be of the same form (in excel_ui.process_beads_table()):

# If no errors were found, store results
beads_samples.append(beads_sample_gated)
mef_transform_fxns[beads_id] = mef_transform_fxn
if full_output:
    mef_outputs.append(mef_output)

The cleanest solution, in my opinion, would be to have a column in beads_table pointing to processed FCSData objects, but I haven't found a way to do that.

I agree. In lieu of that, though, I think OrderedDict is better than what currently exists. I acknowledge that this could be an annoying API change for users, though. Perhaps it could be targeted for a larger release (e.g. when Compensation support rolls out?)

In general, I have some inertia towards changing things to OrderedDicts because I prefer to use native types when possible, and because user's code will break.

I think whatever data structure gets the job done best (most elegantly / simply) should be embraced (see above on superiority of accessing a sequence-type by DataFrame index instead of being restricted to numerical index). OrderedDict is about as close to a native type as you can get because it's in collections (which is already imported and in use in excel_ui and mef). We've also embraced other data structures from collections (namedtuple).

Issue 2 - Exposing mef_channels

OK, it looks like I found a way to solve this with the current data structures. Not sure why I didn't figure this out originally (I'm gonna point to a need for better documentation again, though... XD)

It looks like the mef_channels are packed with mef_outputs already, so to fix the example code, I should have done the following:

beads_samples, mef_transform_fxns, mef_outputs = fc.excel_ui.process_beads_table(
    beads_table=beads_table,
    instruments_table=instruments_table,
    full_output=True)

print('Bead Fluorescence Model Parameters:')
for idx,uid in enumerate(beads_table.index):
    print('-Bead Table ID: {}'.format(uid))
    for ch,ch_params in zip(mef_outputs[idx].mef_values, mef_outputs[idx].fitting['beads_params']):
        print('--Channel Index: {} (voltage={}), Params: {}'.format(
            ch,
            beads_samples[idx].detector_voltage(channels=ch),
            ch_params))

@JS3xton
Copy link
Contributor Author

JS3xton commented May 16, 2017

Ditto "Issue 1" for samples returned from FlowCal.excel_ui.process_samples_table(). In particular, I want to access some intermediate rows from the "Samples" sheet of my input XLSX file, and I and would like to simply index samples via the corresponding value in the "ID" column of the "Samples" sheet. Instead, I have to go figure out what index the intermediate entries are.

@castillohair
Copy link
Collaborator

So issue 2 is not an issue anymore, besides documentation?

I'm becoming more convinced that making samples and beads_samples be OrderedDict would be useful. It would require a second digit version change. Is this something you're currently working on?

@JS3xton
Copy link
Contributor Author

JS3xton commented Jun 12, 2017

So issue 2 is not an issue anymore, besides documentation?

Yeah, I think so.

I'm becoming more convinced that making samples and beads_samples be OrderedDict would be useful. It would require a second digit version change. Is this something you're currently working on?

I agree this should probably be a 2nd digit version change. As mentioned above, possibly appropriate to roll out with Compensation (but generally, whenever the issue has been sufficiently fixed and can be bundled with other more important changes that warrant a 2nd digit version change).

Are you omitting mef_outputs from your list of data structures which should be changed on purpose?

I am not currently working on this, but I think the foundational change should be pretty straightforward. Specifically, updating the accumulator data structure update statements:

# If no errors were found, store results
beads_samples.append(beads_sample_gated)
mef_transform_fxns[beads_id] = mef_transform_fxn
if full_output:
    mef_outputs.append(mef_output)

and propagating those changes to any code that uses those structures (I haven't looked recently at how far-reaching those propagations would be).

If you're on board with updating beads_samples and mef_outputs from excel_ui.process_beads_table() and samples from excel_ui.process_samples_table(), I can take this issue off the wishlist and start working on the code changes when I have time. (not committing to clarifying the documentation right now, though)

@castillohair
Copy link
Collaborator

Are you omitting mef_outputs from your list of data structures which should be changed on purpose?

No, I forgot about it. mef_outputs should reflect beads_samples.

If you're on board with updating beads_samples and mef_outputs from excel_ui.process_beads_table() and samples from excel_ui.process_samples_table(), I can take this issue off the wishlist and start working on the code changes when I have time. (not committing to clarifying the documentation right now, though)

Yes. I'll open another issue for the documentation and work on it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants