
Clean up of the IO code #1921

Merged
merged 13 commits into sherpa:main from move-rmf-processing
Aug 13, 2024

Conversation

DougBurke
Contributor

@DougBurke DougBurke commented Oct 4, 2023

Summary

Rework the astro data input and output code so that it takes advantage of the new data representation in sherpa.astro.io.types (these classes provide a generic representation of FITS-like concepts), rather than passing data around as a dictionary. This moves the generic logic for deconstructing or constructing FITS-like data out of the backend code and into the generic IO layer, reducing the chance of the backends differing in their support for special cases.

Details

Follow the suggestion of #1913 and move the decoding logic from the backend code into the generic layer, taking advantage of the new types in the sherpa.astro.io.types module. There is also a lot of code cleanup, in particular (but not limited to) of the crates backend.

The idea is that we can use these "new" types (which are actually based on the work done in the 4.16 release for the XSPEC table model case) to better model the data we want (compared to just using a dictionary that is then sent to a DataXXX class). In doing so we can move some of the decoding logic from the backend into the frontend, and so avoid having cases where "the file can be read by backend a but not by backend b".
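
To make this concrete, here is a minimal, self-contained sketch of the pattern. The class names (HeaderItem, Column, TableBlock) follow the description of sherpa.astro.io.types above, but the fields and methods shown are illustrative rather than the module's actual API:

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class HeaderItem:
    """A FITS-like keyword: a name, a value, and an optional description."""
    name: str
    value: object
    desc: str | None = None


@dataclass
class Column:
    """A named column of data from a FITS-like table."""
    name: str
    values: np.ndarray
    unit: str | None = None


@dataclass
class TableBlock:
    """A FITS-like table: a header plus a set of columns."""
    name: str
    header: list[HeaderItem] = field(default_factory=list)
    columns: list[Column] = field(default_factory=list)

    def get_column(self, name: str) -> Column:
        for col in self.columns:
            if col.name.upper() == name.upper():
                return col
        raise ValueError(f"column {name} not found in block {self.name}")


# A backend only has to fill in this structure; the FITS-specific
# decoding then happens once, in the generic layer, identically for
# both the crates and pyfits backends.
arf = TableBlock(name="SPECRESP",
                 header=[HeaderItem("EXPOSURE", 1000.0)],
                 columns=[Column("ENERG_LO", np.asarray([0.1, 0.2])),
                          Column("ENERG_HI", np.asarray([0.2, 0.3])),
                          Column("SPECRESP", np.asarray([250.0, 330.0]))])
print(arf.get_column("energ_lo").values)
```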

Rather than do this all in one go, I have tried to change one area at a time:

  • RMF
  • ARF
  • PHA
  • generic table support
  • image access

Then, after going through all of this, I add specialized types for PHA/responses, which reworks the code I've just changed. So we could decide not to take the "add HEASARC-specific table blocks" commit if we feel it's not worth it.

So, when reviewing this, you may see comments indicating "we have introduced this new function to replace this old function but haven't completed the transition yet". Some of these commits are annoying to review - in particular the RMF code - because code is moved around, sometimes between files, and slightly changed at the same time. These changes rely in part on the large number of tests we have for I/O on these files (some of which were added in previous PRs in this series in anticipation of this work).

Unfortunately the use of types doesn't always lead to completely type-safe interfaces, in part because the typing infrastructure is still relatively new: some pain points are that we don't track the shape and data type of ndarrays, and that values like np.bool_/np.floating do not always appear to behave the same as bool/float, depending on the type checker.
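
As a concrete illustration of the scalar mismatch (this is standard NumPy behaviour; how much a given type checker complains about it varies):

```python
import numpy as np

# np.bool_ is not a subclass of Python's bool, so a function annotated
# as returning bool is flagged when it returns a NumPy comparison result.
print(isinstance(np.bool_(True), bool))     # False
print(isinstance(np.float64(1.0), float))   # True - the float case differs

# Comparing an ndarray element produces a NumPy boolean scalar, not bool:
x = np.asarray([1.0, 2.0])
print(type(x[0] > 1.0))                     # a NumPy boolean scalar
```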

One interesting decision is whether we stick all our types into a single place - e.g. sherpa.utils.types - or have a "per domain" location - so sherpa.stats.types, sherpa.io.types, .... We are currently in an in-between state, as most types are being put into sherpa.utils.types but the I/O types are in their own location (and astro-specific). I have tried to point out that all the type code is somewhat experimental, so we do not have to have a strategy set in stone.

The aim of this PR is to make it easier to support multi-matrix RMF files, but that change is done in a follow-up PR (although the new types are set up to make this change a lot easier). See

One thing I should do is note down the problem cases I identified here, so we can track them better.

@DougBurke DougBurke marked this pull request as draft October 4, 2023 20:42
@codecov

codecov bot commented Oct 4, 2023

Codecov Report

Attention: Patch coverage is 72.91297% with 305 lines in your changes missing coverage. Please review.

Project coverage is 87.39%. Comparing base (52b00d0) to head (0b62e11).

Files Patch % Lines
sherpa/astro/io/crates_backend.py 0.00% 244 Missing ⚠️
sherpa/astro/io/types.py 87.03% 21 Missing ⚠️
sherpa/astro/io/__init__.py 95.94% 16 Missing ⚠️
sherpa/astro/io/pyfits_backend.py 93.89% 16 Missing ⚠️
sherpa/astro/xspec/__init__.py 71.42% 4 Missing and 4 partials ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1921      +/-   ##
==========================================
+ Coverage   86.88%   87.39%   +0.50%     
==========================================
  Files          84       84              
  Lines       29770    29586     -184     
  Branches     4422     4446      +24     
==========================================
- Hits        25866    25857       -9     
+ Misses       3795     3617     -178     
- Partials      109      112       +3     
Files Coverage Δ
sherpa/astro/data.py 93.08% <100.00%> (-0.01%) ⬇️
sherpa/astro/instrument.py 90.09% <100.00%> (ø)
sherpa/astro/io/dummy_backend.py 62.96% <100.00%> (-0.68%) ⬇️
sherpa/astro/io/xstable.py 98.24% <100.00%> (-0.27%) ⬇️
sherpa/astro/ui/utils.py 98.16% <100.00%> (+<0.01%) ⬆️
sherpa/astro/utils/__init__.py 83.52% <ø> (ø)
sherpa/astro/xspec/__init__.py 98.39% <71.42%> (-0.09%) ⬇️
sherpa/astro/io/__init__.py 96.28% <95.94%> (-1.76%) ⬇️
sherpa/astro/io/pyfits_backend.py 94.36% <93.89%> (+0.61%) ⬆️
sherpa/astro/io/types.py 87.71% <87.03%> (-12.29%) ⬇️
... and 1 more

... and 1 file with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@DougBurke DougBurke changed the title Move rmf processing Clean up of the IO code Oct 6, 2023
DougBurke added a commit to DougBurke/sherpa that referenced this pull request Oct 26, 2023
Ensure that a problem case found during development of sherpa#1921 is
explicitly tested (it is used by other tests but it is hard to
track down thanks to the cascading error handling that we use for
the "load a generic data file" commands, so ensure we test it
explicitly).
@DougBurke
Contributor Author

Rebasing as I have just lost a day due to an overly-ambitious refactor which I've now pulled out into #1929 (although the ambitious part hasn't been included). I want to make sure I have a copy of "working" code before I break it all again.

@DougBurke DougBurke force-pushed the move-rmf-processing branch 2 times, most recently from 7849007 to 9dfc2d2 on November 2, 2023 20:21
@DougBurke DougBurke mentioned this pull request Jan 3, 2024
@DougBurke
Contributor Author

Rebased to pick up #1938

DougBurke and others added 12 commits August 12, 2024 13:20
Using the types module we can set up the data we want to both read in
and write out in the IO layer and not the backend layer.  The backend
types for get_rmf_data (return value) and pack_rmf_data/set_rmf_data
(input argument) have changed to use "structured" types.

One aim is to provide a unified interface for RMF processing
so that we no longer have the case of "backend a cannot read
in this RMF but backend b can" (modulo those cases where a
backend cannot read a particular file at all).

One minor change is that we now add a CREATOR keyword to the two
table blocks of an RMF (set to "sherpa <version>"), and this is
only done if the CREATOR keyword is not already set.

There is some logic here that could be moved into the DataRMF
class, as it looks like we may slow down the rmf_fold code if the
data types do not match (i.e. we need to make sure we convert to
the data types expected by the C++ rmf_fold code).

RMF files are now required to follow the OGIP standard **much
more** strongly than before. For example, we now **require** an
EBOUNDS block where it was optional before. This might affect
some "interesting" use cases, but I honestly don't understand the
current support for this and it is completely untested (as shown
by the fact that making EBOUNDS a required block does not change
any existing test).

As part of this, we now read in all the MATRIX blocks of an RMF
and then warn the user that only the first block is used (and this
warning is done in the generic layer rather than repeated in both
backends).
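
A rough sketch of the two behaviours described above, as they might look in the generic layer (the helper names and warning text are illustrative, not the actual code):

```python
import logging

warn = logging.getLogger("sherpa.astro.io").warning


def add_creator(header: dict, version: str) -> None:
    """Add a CREATOR keyword, but only if the file did not set one."""
    header.setdefault("CREATOR", f"sherpa {version}")


def select_matrix(matrix_blocks: list) -> object:
    """Pick the MATRIX block to use, warning when there are several."""
    if len(matrix_blocks) > 1:
        warn("RMF contains %d MATRIX blocks but only the first is used",
             len(matrix_blocks))
    return matrix_blocks[0]
```
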
The resp_init routine was only used by the crates backend, not
pyfits. With the rework of the I/O layer it doesn't appear to
be needed (and was likely better handled in Python rather than
C++ scope as it isn't a time-critical routine), so remove it.
The ARF case is simpler than the RMF case, but we need to convert the
generic table code before we can really simplify things.

The backend routines get_arf_data (return value) and the pair of
pack_arf_data and set_arf_data (number and types of arguments)
have changed to use more-structured types.
The backend layer now just reads in the data, with the logic for
handling it (e.g. is it type I or type II, do we need to convert
from RATE to "counts") now in the generic layer. The TableHDU
class is also used for writing out the data. Thanks to how the
code was written, this also changes set_table_data to use TableHDU
for passing data from the generic layer to the backend code.
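
As an example of the decode logic that now lives in one place, here is a sketch of the RATE-to-counts conversion (the get_column/get_key helpers are hypothetical names used for illustration; the column and keyword names follow OGIP usage):

```python
import numpy as np


def get_counts(block) -> np.ndarray:
    """Return the channel counts, converting a RATE column if needed."""
    try:
        return block.get_column("COUNTS").values
    except ValueError:
        # RATE is counts per second, so scale by the exposure time.
        rate = block.get_column("RATE").values
        exposure = block.get_key("EXPOSURE")
        return rate * exposure
```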

The backend get_pha_data (return value) and the pack_pha_data,
set_pha_data (number of arguments and the types) calls have
been changed.

There is some interesting behaviour - e.g. we don't fully handle
the "columns can also be given as a scalar value if it is a
constant" case, in particular with the GROUPING and QUALITY arrays.
We have at least one test file with them set to "0" and if we read
this in as a column then a number of tests fail. It is likely that
these failures can be treated as regression tests (i.e. the tests
can be updated) but that is not done here.

The BACKSCAL and AREASCAL values are now written out as 4-byte
reals when given as a column, rather than 8-byte reals (so "E"
rather than "D" format), to match the OGIP standard.

It is also a bit clearer (having done this) where the code now
exists that removes keywords like ANCRFILE from the header - see
issue sherpa#1885. However, it's still not entirely clear and the existing
code has some interesting behaviour (i.e. when writing out a PHA
the *FILE keywords have any paths stripped from them).

The pyfits backend had code that treated SYS_ERR as a fractional
error (i.e. it gets multiplied by the COUNTS column) but the crates
backend does not do this. So for now we drop this behaviour, even
though OGIP Memo OGIP/92-007 does say

  Sys_err, a 4-byte REAL scalar giving the fractional systematic
    error to be applied to the data in this row (channel).
  The FITS column name is SYS_ERR.

We do not have good testing of this column!
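
To show what the dropped pyfits-only behaviour actually did (a sketch based on the memo's definition, not the removed code):

```python
import numpy as np

counts = np.asarray([100.0, 40.0, 9.0])
sys_err = np.asarray([0.05, 0.05, 0.10])  # fractional, per OGIP/92-007

# The old pyfits-backend behaviour: turn the fractional value into an
# absolute systematic error by multiplying by the COUNTS column.
abs_sys_err = sys_err * counts
print(abs_sys_err)  # [5.  2.  0.9]
```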

The code makes choices about columns and keywords it sets when
writing out PHA files. It's not entirely obvious all of these make
sense but the idea here is not to change the existing behaviour.
Any changes can come later.
With these changes we can remove a lot of the previous code to
read in tabular data in the backend layers.

The backend get_table_data, get_header_data, get_ascii_data, and
read_table_blocks routines have their return types changed, and
the pack_table_data and set_table_data routines have their
arguments changed to use more-structured types.
Although this removes some code it's not ideal, as we basically
repeat the WCS code in the two backends. The write routine should
probably go via set_hdus as well, at least for the FITS output.

We can at least finally remove some of the routines, like
_remove_structural_keywords, for which we had added replacements.
This change adds specialized TableBlock classes for the
PHA, ARF, and RMF file cases. These specialized classes are
meant to help avoid repeating checks we know should pass (e.g.
that a PHA dataset has a CHANNEL column) but it doesn't actually
add much type safety. It does add some level of documentation
to the backend code (i.e. point out what structures we expect,
rather than just a generic "table").

One issue is that these classes raise a generic ValueError
rather than a DataErr (since, in part, we don't know the filename
when throwing the error, because the data could have been
generated in memory). I am assuming that this is okay, since in
most cases such errors should already have been caught in the
backend code.

An alternative would be to add special-case classes with
direct access to the columns we require (e.g. as a dataclass),
but this way we allow for extra information (e.g. other columns)
to be sent around.
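
A minimal, self-contained sketch of the specialization idea (the real classes in this PR differ in detail):

```python
import numpy as np


class SpecrespBlock:
    """An ARF block: construction fails unless the OGIP columns exist."""

    REQUIRED = ("ENERG_LO", "ENERG_HI", "SPECRESP")

    def __init__(self, name: str, columns: dict[str, np.ndarray]):
        missing = [n for n in self.REQUIRED if n not in columns]
        if missing:
            # A plain ValueError rather than DataErr: there may be no
            # filename to report, as the data could be in-memory.
            raise ValueError(f"Block {name} is missing: {missing}")
        self.name = name
        self.columns = columns

    # Later code can now access the columns without re-checking them.
    def specresp(self) -> np.ndarray:
        return self.columns["SPECRESP"]
```
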
Try to improve coverage of corner cases. There are a number of
error cases that are hard to test, as recent changes should make
them impossible to trigger, but leaving the checks in does make
it easier to see what matters in the code.
I have just been using this branch to test some simulations in
both Sherpa and XSPEC. This showed two problems:

- the EXPOSURE keyword was not set in a simulated dataset
  (this suggests some issue in how fake_pha creates a DataPHA
  object, but it's also an issue here, so we set a default
  value)

- the TLMIN1/TLMAX1 keywords could be written out as floating-point
  values for an integer column, which XSPEC doesn't like; it's
  an easy fix
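
A sketch of the kind of fix the second point needs (illustrative, not the PR's actual code): make the TLMIN1/TLMAX1 values match the column's data type.

```python
import numpy as np


def channel_limits(channel: np.ndarray) -> tuple:
    """Return TLMIN1/TLMAX1 values matching the column data type."""
    lo, hi = channel.min(), channel.max()
    if np.issubdtype(channel.dtype, np.integer):
        # XSPEC rejects floating-point TLMIN/TLMAX on an integer column.
        return int(lo), int(hi)
    return float(lo), float(hi)
```
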
Several minor tweaks found when using the output of save_pha
from this PR:

- ensure the EXPOSURE value is a float (it could have been
  written out as an integer)

- ensure consistency in the output data type for several related
  fields (e.g. a float and not an int when it defaults to 0)

- ensure consistency in the description label for several related
  fields
@DougBurke
Contributor Author

For reference, I've pushed the rebased version to #1944 as e8c7778

Once this passes - famous last words - I'll be pushing the rest of that branch but I could also use it here.

@wmclaugh
Contributor

@DougBurke - personally, if it isn't too much effort I would prefer the rebase, to preserve the history and simplify diff'ing. We could take the rebased changes here, since the PR is already approved, or wait until #1944 is changed from draft and reviewed...

@DougBurke
Contributor Author

@wmclaugh - I've rebased it.

@wmclaugh wmclaugh merged commit 899cdc7 into sherpa:main Aug 13, 2024
31 checks passed
@DougBurke DougBurke deleted the move-rmf-processing branch August 13, 2024 14:00
DougBurke added a commit to DougBurke/sherpa that referenced this pull request Sep 3, 2024
We had inconsistent handling of the STAT_ERR and SYS_ERR values,
both recently (changes in PR sherpa#1921 to move the deconstruction of
the PHA data from the backend into the generic layer), and also
in the original, per-backend code.

We have

- values can be set as a column or a header value (a scalar)
- the use of 0 to indicate "we don't have a value calculated"

Now we check

- first for the column
- then the keyword

and then check if all values are 0, and if so we remove it
(i.e. do not send it to DataPHA).

This means that the code now behaves slightly differently,
since you may not get a message about "errors being present
but unused" when the error values are all 0, and the column
will no longer be created when it is effectively unset. This
should **NOT** lead to differences when fitting the data.
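
A sketch of that lookup order (the block helpers are hypothetical names used for illustration):

```python
import numpy as np


def read_error(block, name: str):
    """Return the STAT_ERR or SYS_ERR values, or None if unset.

    Check the column first, then the header keyword, and treat an
    all-zero result as "no value calculated".
    """
    try:
        values = block.get_column(name).values
    except ValueError:
        key = block.get_key(name)  # the scalar (keyword) fallback
        if key is None:
            return None
        values = np.full(block.nrows, key)
    if np.all(values == 0):
        return None  # do not send it to DataPHA
    return values
```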