
New signal class and tools for in-situ 4D-STEM data analysis #900

Merged (18 commits) on Jun 10, 2023

Conversation

@SyHuang19 (Contributor) commented Jan 17, 2023


name: New signal class and tools for in-situ 4D-STEM data analysis
about: This PR adds support for in-situ data analysis, specifically for 4D-STEM time series.


Checklist

  • Add general tools to deal with in-situ data sets
    • Time series reconstruction with arbitrary virtual detectors
    • Specimen drift registration using said reconstruction
  • Specific analysis functionalities
    • General time correlation calculation

What does this PR do? Please describe and/or link to an open issue.
This PR builds upon my own code to analyze five-dimensional datasets from my experiments (an early version of the analysis code can be found in 5D_ECM_analysis). I think it would be beneficial for pyxem to have a separate signal class for some general and specific analyses of in-situ 4D/5D data. I started building general tools such as time series reconstruction and drift correction, and I also included time correlation calculation as a specialty analysis function, the math of which can be found here.

There is more work to be done, and I'm open to suggestions of general tools and functions that this signal class could use.

Are there any known issues? Do you need help?
I'm not completely settled on the axes structure of this signal class. Currently I'm treating the time axis as the third navigation dimension, so the signal looks like this: (N_x, N_y, N_t | N_kx, N_ky). This has been working out pretty well for me in both of the following cases:

  • If a function is mapped over the signal axes, everything works the same as for a Signal2D.
  • If a function is to be mapped over the time axis, the time axis is first transposed into a signal axis before the function is applied. In this case, the navigation shape changes, or a further transpose is needed to retain it, depending on what further analysis you want to do with the result.

But I don't know if there is a cleverer way to do this. The current structure handles these functions well, but that might change as more tools and functions are added.
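As a concrete illustration of the bookkeeping above, here is a minimal numpy-only sketch of moving a time axis in and out of the signal dimensions (shapes only; the actual class uses hyperspy's transpose, and the sizes here are made up):

```python
import numpy as np

# Hypothetical 5D stack with shape (N_x, N_y, N_t, N_kx, N_ky):
# navigation axes (N_x, N_y, N_t), signal axes (N_kx, N_ky).
data = np.zeros((4, 5, 10, 8, 8))

# Mapping over the signal axes leaves the layout untouched,
# exactly as for a plain Signal2D.

# To map over the time axis instead, move it into the signal
# dimensions first (the transpose step described above) ...
transposed = np.moveaxis(data, 2, -1)   # (N_x, N_y, N_kx, N_ky, N_t)
print(transposed.shape)                  # (4, 5, 8, 8, 10)

# ... and move it back afterwards to restore the navigation shape.
restored = np.moveaxis(transposed, -1, 2)
print(restored.shape)                    # (4, 5, 10, 8, 8)
```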

Work in progress?
Potentially new general tools. @CSSFrancis, since you have been working on 5D datasets as well, can you take a look at this and see if you can think of any general operations that might be useful?
Some specific functionalities to be added for time correlation (ECM) analysis:

  • Different k averaging scheme for $g_2(t)$
  • Time resampling for $g_2(t)$
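For context, a minimal numpy sketch of one common form of the intensity autocorrelation at a single k point, g2(t) = ⟨I(t′)I(t′+t)⟩ / ⟨I⟩². The normalization convention and the k-averaging schemes listed above are assumptions on my part and may differ from what the PR implements:

```python
import numpy as np

def g2_single_k(intensity):
    """Intensity autocorrelation g2(dt) for a 1D time trace I(t')
    at a single k point, normalized by <I>**2 (one common convention;
    the PR's exact normalization and k-averaging may differ)."""
    intensity = np.asarray(intensity, dtype=float)
    n = len(intensity)
    norm = intensity.mean() ** 2
    return np.array([
        np.mean(intensity[: n - dt] * intensity[dt:]) / norm
        for dt in range(n)
    ])

# A constant intensity trace is fully correlated at all delays:
print(g2_single_k(np.ones(6)))  # [1. 1. 1. 1. 1. 1.]
```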

Other todo items

  • Add lazy signal class
  • Add tests for $g_2(t)$ calculations

@coveralls commented Jan 17, 2023

Coverage Status

Coverage: 95.986% (-0.01%) from 95.997% when pulling ba0f363 on SyHuang19:InSitu4DSTEM into 348c392 on pyxem:main.

@coveralls

Coverage: 95.651% (-0.4%) from 96.017% when pulling 11689ce on SyHuang19:InSitu4DSTEM into 5456873 on pyxem:main.

@CSSFrancis (Member)

@SyHuang19 very cool. Some initial thoughts.

Drift Correction

There are potentially three different ways this can be done:

  1. Non-rigid registration: This is the most complicated and requires that a displacement is given for each (x, y) real-space point. The image is then reconstructed by interpolating the displacements.
  2. Rigid registration with interpolation: In this case the same registration is applied to every single (x, y) pixel, and the data is reconstructed from the interpolation. This is almost as complicated as 1, as it still requires interpolation.
  3. Rigid registration without interpolation: This is the easiest case, and where the resolution requirements aren't terribly high it is the right thing to do. It can also be done much more efficiently lazily, as you only have to load at most four chunks at a time.
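Option 3 could be sketched roughly like this (hypothetical drift values; nearest-neighbour shift via scipy, no sub-pixel interpolation):

```python
import numpy as np
from scipy import ndimage

# One frame of a hypothetical reconstructed time series.
frame = np.zeros((6, 6))
frame[2, 3] = 1.0

# Rigid registration without interpolation: round the measured
# drift to whole pixels and shift with nearest-neighbour (order=0).
drift = (0.7, -1.2)                       # hypothetical (dy, dx) drift
shift = tuple(-round(d) for d in drift)   # correct by the opposite shift
corrected = ndimage.shift(frame, shift, order=0, cval=0.0)

print(np.argwhere(corrected == 1.0))      # feature moved from (2, 3) to (1, 4)
```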

In all cases this is hard. I wonder if you have considered adding rechunker as an optional dependency?

In that case, what should probably happen is rechunking in the navigation dimension so that the navigation dimension is spanned. Then drift correction can be easily applied. The data is saved/loaded four times in the process (once to save a partial result, once to load it for rechunking, once to save the rechunked result, and once to load the rechunked data).

Maybe something like:

def rechunk_offline(self, nav_chunks=None, sig_chunks=-1, **kwargs):
    from rechunker import rechunk  # raises ImportError if rechunker is not installed
    if not isinstance(sig_chunks, tuple):
        sig_chunks = (sig_chunks,) * len(self.axes_manager.signal_shape)
    if not isinstance(nav_chunks, tuple):
        nav_chunks = (nav_chunks,) * len(self.axes_manager.navigation_shape)
    new_chunks = nav_chunks + sig_chunks

    new_data = rechunk(self.data, new_chunks, **kwargs)
    self.data = new_data

Then the code could look like this:

def register(self, method="rigid interpolation"):
    self.rechunk_offline(nav_chunks=(1, -1, -1), sig_chunks="auto")
    transposed = self.transpose(signal_axes=(1, 2))
    registered = transposed.map(method_dict[method])
    registered = registered.transpose(signal_axes=(1, 2))
    return registered

Part of me really wants to avoid the da.overlap function unless it is completely necessary. It might be worth benchmarking this, as you already have the hard version coded. It is possible that using the rechunker function is just as slow.

This stuff is hard because a general solution is hard: being chunk-agnostic is the wrong thing to do in many cases, and yet we more or less have to be here.

Now that I think of it, maybe we should include a map_axes function which applies a function to arbitrary axes. Then there is the option to rechunk the data using the rechunker package or by rechunking in dask. In our case this might be really helpful, as we can temporarily store the data in a scratch directory. Does that sound reasonable? You've used the rechunker package more than I have, so I'm not sure how fast it is.

That would help things out significantly and maybe lower the barrier to writing one of these functions that operate on different chunk schemes.

@SyHuang19 (Contributor, Author)

@CSSFrancis Thanks for the input.

There are potentially three different ways this can be done

What I've implemented at the moment is your option 2: rigid registration with interpolation. I don't think the level of experimental accuracy, in my experiments at least, warrants non-rigid registration (in fact, even registration without interpolation produced similar result quality in my early attempts). But that could be a topic for another day.

I wonder if you have considered adding rechunker as an optional dependency?

I did, but as you've described, using rechunker requires saving/loading several intermediate files on disk. I don't know if that level of autonomy is good for a single class method (even if you know what you are doing). As a user, I would rather have the ability to control the chunking of my signal than run one line of code that may or may not use up gigabytes of disk space in the background. I like the idea of adding the rechunk_offline you suggest, though. Maybe that can be added as a utility function.

It might be worth bench marking this as you already have the hard version coded. It is possible that using the rechunker function is just as slow.

Yes, I can try benchmarking it after we get our computing hardware back online. I used da.overlap initially because I didn't want the hassle of using rechunker every step of the way, but the latter could very possibly be faster.

Now that I think of it maybe we should include a map_axes function which applies a function to arbitrary axes. Then there is the option to rechunk the data using the rechunker package or rechunking in dask. In our case this might be really helpful as we can temporarily store the data in a scratch directory. Does that sound reasonable? You've used the rechunker package more than I have so I'm not sure how fast it is.

rechunker is fast, and one thing I like about it is that the number of tasks depends only on your target chunks. So even with a lot of cross-rechunking it doesn't create too much overhead for the scheduler. But as mentioned above, I feel somewhat iffy about allowing rechunk_offline to run in the background of any function.

Maybe we can write a map_axes function that only works with the correct chunking? But then all it would do, beyond what map already does, is handle the axes and transposition.

I'll benchmark the speed of rechunker vs. da.overlap; I anticipate rechunker to be faster. Then perhaps I can restructure the correct_real_space_drift function. A rechunk_offline that uses rechunker as a dependency can perhaps go into lazy_tools as a standalone utility function, and we can discuss further whether to use it in signal class methods.

@CSSFrancis CSSFrancis mentioned this pull request Feb 1, 2023
@SyHuang19 SyHuang19 marked this pull request as ready for review March 2, 2023 23:25
@CSSFrancis (Member) left a comment

@SyHuang19 Looks pretty good. Just some minor comments and then we can probably merge this pretty soon!


"""
roi = kwargs.pop("roi", None)
ref = self.get_time_series(roi=roi, time_axis=time_axis)
Member

I was thinking about this today. It might also be useful to be able to pass multiple ROIs, return multiple shifts, and take the average. This is more useful in my case, where I don't really have distinct features but can still align on the glass features.
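Averaging shift estimates from several ROIs could look roughly like this (hypothetical values and array layout, not the PR's actual API):

```python
import numpy as np

# Hypothetical per-ROI shift estimates: one (dy, dx) pair per
# time step, for each of three ROIs -> shape (n_rois, n_t, 2).
shifts_per_roi = np.array([
    [[1.0, 0.0], [2.0, 0.5]],   # ROI 1
    [[1.2, 0.2], [2.2, 0.3]],   # ROI 2
    [[0.8, -0.2], [1.8, 0.7]],  # ROI 3
])

# Average over the ROI axis to get one robust shift per time step.
mean_shifts = shifts_per_roi.mean(axis=0)
print(mean_shifts)  # [[1.  0. ] [2.  0.5]]
```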

roi = kwargs.pop("roi", None)
ref = self.get_time_series(roi=roi, time_axis=time_axis)

shift_reference = kwargs.pop("reference", "cascade")
Member

It might be better to just pass these directly rather than as keyword arguments. That at least makes them more visible in the docstring and accomplishes the same thing.


return shift_vectors

def correct_real_space_drift(self, shifts=None, time_axis=2, lazy_result=True):
Member

I'm not super comfortable with this function but I think it probably stays in there for the time being. It's useful in a pinch but not really an ideal way to handle things.

We should try to include an example of how to use the rechunker package in the documentation.


xdrift = shifts.data[:, 0]
ydrift = shifts.data[:, 1]
xs = Signal1D(
Member

So I think that this is overkill.

The map function will pass the arrays to every function call, so you should just be able to do:

s_transposed.map(
    _register_drift_2d,
    shift1=xdrift,
    shift2=ydrift,
    inplace=False,
)

without repeating the data. The added benefit is that it will use less memory by not duplicating the shift array.

Contributor Author

The xdrift/ydrift parameters are different for different time coordinates, and time is a navigation axis. Therefore I need to pass them as signals with the same navigation dimensions as s_transposed.


registered_data_t = registered_data.transpose(navigation_axes=[-2, -1, -3])
registered_data_t.set_signal_type("insitu_diffraction")
if self._lazy and not lazy_result:
Member

This should be handled by the map function. (see above)

@@ -0,0 +1,144 @@
# -*- coding: utf-8 -*-
# Copyright 2016-2022 The pyXem developers
Member

Suggested change
# Copyright 2016-2022 The pyXem developers
# Copyright 2016-2023 The pyXem developers

@@ -0,0 +1,339 @@
# -*- coding: utf-8 -*-
# Copyright 2016-2022 The pyXem developers
Member

Suggested change
# Copyright 2016-2022 The pyXem developers
# Copyright 2016-2023 The pyXem developers

@@ -0,0 +1,169 @@
# -*- coding: utf-8 -*-
# Copyright 2016-2022 The pyXem developers
Member

Suggested change
# Copyright 2016-2022 The pyXem developers
# Copyright 2016-2023 The pyXem developers

import scipy.signal as ss


def _register_drift_5d(data, shifts1, shifts2):
Member

It might be worth being able to set your order here. I think there are a couple of places where order=0 would be faster.
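Assuming the underlying call is scipy.ndimage.shift, the difference is just the interpolation order: order=0 snaps to the nearest pixel (fast, no new intensity values), while order=1 interpolates linearly:

```python
import numpy as np
from scipy import ndimage

row = np.array([0.0, 1.0, 0.0, 0.0])

# order=0: nearest-neighbour lookup, no new intensity values, fastest.
nearest = ndimage.shift(row, 0.25, order=0)
# order=1: linear interpolation, sub-pixel accurate but smooths the data.
linear = ndimage.shift(row, 0.25, order=1)

print(nearest)  # [0. 1. 0. 0.]
print(linear)   # [0.   0.75 0.25 0.  ]
```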

return data_t


def _register_drift_2d(data, shift1, shift2):
Member

Same as above...

@CSSFrancis CSSFrancis mentioned this pull request Jun 5, 2023
@CSSFrancis
Copy link
Member

@SyHuang19 it would also be a good idea to have the insitu_utils.py file completely covered by tests.

The code style failure is just because you haven't run black on the changes. We can always merge the PR, and then it will automatically make a new PR to run black, so that isn't a big deal.

@CSSFrancis CSSFrancis merged commit 54e1ac7 into pyxem:main Jun 10, 2023
8 of 10 checks passed