
Possible to keep things in the memory? #19

Closed
wjlei1990 opened this issue Jan 16, 2016 · 4 comments

@wjlei1990

Hi Lion,

We are now integrating the whole workflow. For example, we want to combine signal processing (for both observed and synthetic data) and window selection. During the processing we don't want any intermediate I/O. To be more specific, we want the workflow to read in the raw data (one observed and one synthetic ASDF file) at the very beginning, process both, and then select windows. After window selection, only the windows are written out.

Currently I am using the process() method in asdf_data_set.py. However, there is one argument called output_filename:

def process(self, process_function, output_filename, tag_map,
                traceback_limit=3, **kwargs):

meaning the current implementation requires writing the processed files out. If possible, though, my preferred way would be to keep things in memory. I am guessing that goes against the basic design of ASDF, right? When you open a file for reading, it does not even read the whole thing into memory, so there is no such thing as "keeping everything in memory".
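
To make the I/O concern concrete, here is a rough sketch of how I call process() at the moment (the filenames, tags, and the processing function are just placeholders; as I understand it, the processing function receives an ObsPy Stream and Inventory, and the returned stream is written into output_filename):

import pyasdf

def detrend_and_taper(stream, inventory):
    # Called once per waveform; the returned Stream is written to the
    # output file, which is exactly the disk I/O we would like to avoid.
    stream.detrend("linear")
    stream.taper(max_percentage=0.05)
    return stream

ds = pyasdf.ASDFDataSet("observed.h5")  # placeholder input file
ds.process(detrend_and_taper, "observed_processed.h5",
           tag_map={"raw_observed": "processed_observed"})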

Another option I am thinking about would be to modify the process function, so that it takes one observed and one synthetic stream and walks all the way down to window selection. I think that might be the right and feasible way.


In case my words are confusing, let me illustrate a little more:
For example, you have two files, one raw observed ASDF file and one raw synthetic ASDF file. You want to process them and select windows. There are two ways of doing so:

  1. Process the whole observed ASDF file (but keep everything in memory), process the whole synthetic ASDF file (again keeping everything in memory), and then select windows for the traces in memory. The advantage is that my code stays modular, so I can simply assemble the different parts together. The disadvantage is that this method might not fit the current ASDF implementation.
  2. Modify the process_function so it incorporates all the procedures. I think this is possible to implement, but the disadvantage is that it would make the process_function very big and not very user-friendly.

Sorry to bring this up so late. I looked through the code and found that it might involve a lot of changes.

@krischer
Member

Hi Wenjie,

you should be able to do this with the already existing process_two_files_without_parallel_output() method.

http://seismicdata.github.io/pyasdf/parallel_processing.html#process-two-files-without-parallel-output-method

No need to add anything else I think.
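
Roughly, it could look like this (just a sketch: the waveform tags and the select_windows helper are placeholders for whatever you actually use; the function is called once per station with the matching station group from each file, and everything it returns is collected in memory):

import pyasdf

def process(obsd_station_group, synt_station_group):
    # Placeholder waveform tags; use whatever tags your files contain.
    obsd = obsd_station_group.raw_observed
    synt = synt_station_group.synthetic
    # No output file: just return the windows and collect them in memory.
    return select_windows(obsd, synt)

obsd_ds = pyasdf.ASDFDataSet("observed.h5")
synt_ds = pyasdf.ASDFDataSet("synthetic.h5")
results = obsd_ds.process_two_files_without_parallel_output(synt_ds, process)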

> 2. Modify the process_function so it incorporates all the procedures. I think this is possible to implement, but the disadvantage is that it would make the process_function very big and not very user-friendly.

I guess my suggestion also applies to that fear of yours: just split it up into a couple of functions and you should be more than fine.

Let me know if there is some roadblock in that approach and we can certainly add something else as well if required.

Cheers!

Lion

@wjlei1990
Author

Hi Lion,

I assembled an example for the preprocessing workflow based on your suggestion, using the process_two_files_without_parallel_output() method.

This example includes:

  1. observed signal processing
  2. synthetic signal processing
  3. window selection
  4. adjoint source constructor

I uploaded the file here (this is just an example and it is not yet 100% complete), but it delivers what we have discussed.

One concern about this approach is that we have too many in-place function definitions. The major reason is that we need to define some variables outside the functions, since the argument list of the process function passed into process_two_files_without_parallel_output() is quite limited.

Sorry, I named it proc_combo.py.txt for uploading. Please rename it to *.py.
proc_combo.py.txt

@krischer
Member

Hi Wenjie,

Yeah, that is actually an intended design choice. Otherwise the process()/process_two_files_without_parallel_output() methods would have very awkward function signatures.

Python has reasonable support for functional programming, in this case closures and function currying. You are using closures: the function definitions you make bind the outside variables, and thus the functions can see them.
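
As a tiny illustration of the closure pattern (all names are made up):

def make_window_selector(window_params):
    # window_params is defined outside process() but is bound by the
    # closure, so process() can use it without an extra argument.
    def process(obsd_station_group, synt_station_group):
        print("selecting windows with", window_params)
    return process

process_fn = make_window_selector({"min_period": 27.0, "max_period": 60.0})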

An alternative would be to use functools.partial for function currying/partial application in Python. Your example could look like this:

from functools import partial

def combo_func(obsd_station_group, synt_station_group,
               process_obsd_function, process_synt_function,
               window_function, adjoint_source):
    ...


# Bind the four processing functions up front; pyasdf then only needs to
# call combo_func with the two station groups.
results = obsd_ds.process_two_files_without_parallel_output(
    synt_ds, partial(combo_func,
                     process_obsd_function=process_obsd_function,
                     process_synt_function=process_synt_function,
                     window_function=window_function,
                     adjoint_source=adjoint_source))

Using either approach allows you to write a well-structured program.

@wjlei1990
Author

Thanks for the suggestion. I will use functools.partial.
👍
