
Extend use of bz2 compression for input files for seviri_l1b_hrit #1796

Closed
pdebuyl opened this issue Aug 19, 2021 · 9 comments · Fixed by #1798
Labels
enhancement code enhancements, features, improvements

Comments

@pdebuyl (Contributor) commented Aug 19, 2021

Feature Request

Is your feature request related to a problem? Please describe.

I use MSG SEVIRI data in HRIT format. For some reason, our archive stores the epilogue and prologue files compressed with bzip2 on disk. Currently, it is not possible to pass bzip2-compressed files to the seviri_l1b_hrit reader.

Describe the solution you'd like

Add .bz2 to the list of supported extensions for prologue and epilogue files and use the unzip_file helper to decompress them as needed.

Describe any changes to existing user workflow

This does not remove any existing feature and does not change how one would use satpy.

Additional context

In principle, I could provide decompressed files to satpy but it is of course more convenient if the process is automatic.

I fully understand if this extra convenience is not added because it is too specific. Otherwise, I would be willing to draft a PR using another bz2-aware reader as an example.

@mraspaud added the enhancement (code enhancements, features, improvements) label Aug 19, 2021
@mraspaud (Member) commented Aug 19, 2021

@pdebuyl Interesting! Is your archive accessible to all or is it only for internal usage?
Anyway, the impact is small, so I would be fine with this being implemented if it makes your life easier :)

The unzip_file helper would indeed work great, but I was wondering if we could make it even better by making it a context manager? That way, it would be possible to do:

with unzip_file(somefilename) as non_compressed_file:
    with open(non_compressed_file) as fobj:
        ...

where unzip_file would yield the decompressed filename if the file was compressed in the first place, or the original filename otherwise.
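A minimal sketch of how such a context manager could look (the `.bz2` suffix check and the temporary-file cleanup here are assumptions for illustration, not satpy's actual implementation):

```python
import bz2
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def unzip_context(filename):
    """Yield a decompressed copy of *filename* if it ends in .bz2,
    otherwise yield the original filename unchanged.

    Hypothetical context-manager form of the unzip_file helper.
    """
    if filename.endswith(".bz2"):
        fd, tmp_path = tempfile.mkstemp()
        try:
            with bz2.open(filename, "rb") as src, os.fdopen(fd, "wb") as dst:
                shutil.copyfileobj(src, dst)
            yield tmp_path
        finally:
            # The decompressed copy only lives for the duration of the block.
            os.remove(tmp_path)
    else:
        yield filename
```

With this shape, both compressed and plain files go through the same `with` block, and cleanup happens automatically when the block exits.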

What do you think?

@pdebuyl (Contributor, author) commented Aug 20, 2021

Hi @mraspaud thank you for replying!

Our archive is internal. It is a straight dump from EUMETCast though, with the prologue, epilogue, and cloud mask files bzip2-compressed. The HRIT files use the internal compression (extension C_).

Regarding the use of a context manager: does satpy rely on lazy reading in some of the unzip_file-using readers? If so, the context would have to extend over the whole lifetime of the xarray dataset, which is less trivial. In other words: should the uncompressed file survive after xr.open_dataset is called?

@mraspaud (Member)

Some of the readers do indeed read the data lazily. However, this is not the case for the prologue and epilogue files as far as I know, and when we do read lazily, we use np.memmap, which should in principle hold on to the file until closed (so even if the file is removed, the data will still be available):

In [15]: arr = np.ones((10000, 10000))

In [16]: arr.tofile("ones", "")

In [19]: data = np.memmap("ones", dtype=float)

In [20]: data
Out[20]: memmap([1., 1., 1., ..., 1., 1., 1.])

In [21]: os.remove("ones")

In [22]: data
Out[22]: memmap([1., 1., 1., ..., 1., 1., 1.])

In [23]: data * 5
Out[23]: array([5., 5., 5., ..., 5., 5., 5.])

@pdebuyl (Contributor, author) commented Aug 20, 2021

I have started to test an update. I tried first at the level of HRITFileHandler.__init__, but this calls _get_hd to obtain header information, which means that the file is already open at that point.

Current approach: modify self.filename with the temporary unzipped file before reaching _get_hd.

It seems to work but needs some cleaning up. Also, would you like me to replace the current uses of unzip_file with the context manager?

I'll give an update next week.
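The approach described above (swapping self.filename for a temporary decompressed copy before _get_hd runs) could be sketched roughly like this. These are toy stand-ins, not satpy's actual HRITFileHandler or unzip_file; only the ordering of the swap relative to header parsing is the point:

```python
import bz2
import os
import tempfile


def unzip_file(filename):
    """Simplified stand-in for satpy's unzip_file helper: decompress a
    .bz2 file to a temporary file and return its path, or return None
    if the file is not compressed."""
    if not filename.endswith(".bz2"):
        return None
    fd, tmp_path = tempfile.mkstemp()
    with bz2.open(filename, "rb") as src, os.fdopen(fd, "wb") as dst:
        dst.write(src.read())
    return tmp_path


class HRITFileHandler:
    """Toy stand-in for the real handler: _get_hd opens self.filename."""

    def __init__(self, filename):
        # Swap in the decompressed copy *before* _get_hd runs, so the
        # header parsing never sees the bz2 file.
        unzipped = unzip_file(filename)
        self.filename = unzipped if unzipped else filename
        self.header = self._get_hd()

    def _get_hd(self):
        # Stand-in for header parsing: read the first few bytes.
        with open(self.filename, "rb") as fobj:
            return fobj.read(4)
```

Note that with a plain (non-context-manager) unzip_file like this, the temporary file is never cleaned up, which matches the clean-up problem discussed below.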

@mraspaud (Member)

I was thinking about using unzip_file here: https://github.com/pytroll/satpy/blob/main/satpy/readers/seviri_l1b_hrit.py#L229-L240
and here: https://github.com/pytroll/satpy/blob/main/satpy/readers/seviri_l1b_hrit.py#L301-L306

Regarding replacing the current usages, if it's possible, I definitely think we should use the context syntax :)

@pdebuyl (Contributor, author) commented Aug 23, 2021

I found that you need to enter the context before this point: https://github.com/pytroll/satpy/blob/main/satpy/readers/hrit_base.py#L160

Otherwise, the _get_hd routine opens the zipped file.

I have the following as a WIP: https://github.com/pdebuyl/satpy/tree/bzip2_PRO_EPI

I don't remove the temporary files yet, as with this solution they are removed too early.

@mraspaud (Member)

OK, I understand. So indeed, _get_hd needs to use the unzipping context too.

@pdebuyl (Contributor, author) commented Aug 23, 2021

I have not yet looked at the image file handling, but for the prologue and epilogue, the files are opened and closed in _get_hd and re-opened as necessary.

Another solution: open the context before the __init__ of HRITMSGPrologueFileHandler and keep it open until read_prologue (and read_epilogue) is called.

I pushed this other solution to the same branch; if it is accepted, I can make another branch or squash it.
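One way to express that second solution, assuming the context-manager-style unzip helper discussed earlier: enter the context explicitly in __init__ (e.g. via contextlib.ExitStack) and close it once read_prologue has consumed the file. This is a hypothetical sketch with stand-in names, not satpy's actual code:

```python
import bz2
import contextlib
import os
import tempfile


@contextlib.contextmanager
def unzip_context(filename):
    # Hypothetical helper: yield a decompressed path for .bz2 input,
    # otherwise yield the filename untouched; clean up on exit.
    if filename.endswith(".bz2"):
        fd, tmp_path = tempfile.mkstemp()
        try:
            with bz2.open(filename, "rb") as src, os.fdopen(fd, "wb") as dst:
                dst.write(src.read())
            yield tmp_path
        finally:
            os.remove(tmp_path)
    else:
        yield filename


class PrologueHandler:
    """Toy stand-in: keep the unzip context open from __init__ until
    read_prologue has read the data, then release the temp file."""

    def __init__(self, filename):
        self._stack = contextlib.ExitStack()
        self.filename = self._stack.enter_context(unzip_context(filename))

    def read_prologue(self):
        try:
            with open(self.filename, "rb") as fobj:
                return fobj.read()
        finally:
            # Temp file no longer needed once the prologue is in memory.
            self._stack.close()
```

ExitStack lets a with-style context span method boundaries, which is what "keep it until read_prologue is called" requires.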

@mraspaud (Member)

The easiest would be if you made a PR with your current branch, so we can review it properly. If you feel there are too many commits, we can do a squash merge when merging, so don't worry about that :)
