Clean up of the IO code #1921
Conversation
Codecov Report

Additional details and impacted files:

|          |   main | #1921  |  +/-   |
|----------|--------|--------|--------|
| Coverage | 86.88% | 87.39% | +0.50% |
| Files    |     84 |     84 |        |
| Lines    |  29770 |  29586 |   -184 |
| Branches |   4422 |   4446 |    +24 |
| Hits     |  25866 |  25857 |     -9 |
| Misses   |   3795 |   3617 |   -178 |
| Partials |    109 |    112 |     +3 |

... and 1 file with indirect coverage changes. Continue to review the full report in Codecov by Sentry.
Ensure that a problem case found during the development of sherpa#1921 is explicitly tested. It is exercised by other tests, but it is hard to track down thanks to the cascading error handling we use for the "load a generic data file" commands, so test it explicitly.
Force-pushed 79eb155 to eaa0e72
Rebasing as I have just lost a day due to an overly-ambitious refactor, which I've now pulled out into #1929 (although the ambitious part hasn't been included). I want to make sure I have a copy of "working" code before I break it all again.
Force-pushed 7849007 to 9dfc2d2
Force-pushed 141816d to f9fe689
Force-pushed f9fe689 to 209b1b6
Rebased to pick up #1938
Force-pushed 209b1b6 to 1451276
Using the types module we can set up the data we want to read in and write out in the IO layer rather than in the backend layer. The backend types for get_rmf_data (return value) and pack_rmf_data/set_rmf_data (input argument) have changed to use "structured" types. One aim is to provide a unified interface for RMF processing so that we no longer have the case of "backend a can not read in this RMF but backend b can" (modulo those cases where a backend can not read in a particular file at all).

One minor change is that we now add a CREATOR keyword (set to "sherpa <version>") to the two table blocks of an RMF, and only when the CREATOR keyword is not already set. There is some logic here that could be moved into the DataRMF class, since it looks like we may slow down the rmf_fold code if the data types do not match (i.e. we should make sure we convert to the data types needed by the C++ rmf_fold code).

RMF files are now required to follow the OGIP standard **much more** strictly than before. For example, we now **require** an EBOUNDS block when it was optional before. This might affect some "interesting" use cases, but I honestly don't understand the current support for them, and it is completely untested (as shown by the fact that making EBOUNDS a required block does not change any existing test). As part of this, we now read in all the MATRIX blocks of an RMF and then warn the user that we only use the first block (this warning is done in the generic layer rather than repeated in both backends).
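To give an idea of what a "structured" type could look like, here is a minimal sketch; the class and field names below are assumptions for illustration only, not the actual sherpa.astro.io.types API:

```python
# Illustrative sketch only: names are assumptions, not the real
# sherpa.astro.io.types classes.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Column:
    """A named data column (e.g. ENERG_LO in a MATRIX block)."""
    name: str
    values: np.ndarray

@dataclass
class TableBlock:
    """A FITS-like table: a header (keyword/value pairs) plus columns."""
    name: str
    header: dict
    columns: list[Column] = field(default_factory=list)

@dataclass
class RMFData:
    """The sort of thing get_rmf_data could return."""
    matrix: list[TableBlock]   # all MATRIX blocks are read; only the first is used
    ebounds: TableBlock        # now a required block, per the OGIP standard
```

The backends would then only fill in or consume structures like these, leaving the interpretation of the data to the generic layer.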
The resp_init routine was only used by the crates backend, not pyfits. With the rework of the I/O layer it doesn't appear to be needed (and was likely better handled in Python rather than in C++ anyway, as it isn't a time-critical routine), so remove it.
The ARF case is simpler than the RMF case, but we need to convert the generic table code before we can really simplify things. The backend routines get_arf_data (return value) and the pair of pack_arf_data and set_arf_data (number and types of arguments) have changed to use more-structured types.
The backend layer now just reads in the data, with the logic for handling it (e.g. is it type I vs type II, do we need to convert from RATE to "counts") now in the generic layer. The TableHDU process is also used for writing out the data. Thanks to how the code was written this also changes set_table_data to use TableHDU for passing data from the generic layer to the backend code. The backend get_pha_data (return value) and the pack_pha_data and set_pha_data (number of arguments and the types) calls have been changed.

There is some interesting behaviour - e.g. we don't fully handle the "a column can also be given as a scalar value if it is a constant" case, in particular with the GROUPING and QUALITY arrays. We have at least one test file with them set to "0", and if we read this in as a column then a number of tests fail. It is likely that these failures can be treated as regression tests (i.e. the tests can be updated) but that is not done here.

The BACKSCAL and AREASCAL values are now written out as 4-byte reals when given as a column, rather than 8-byte reals (so "E" rather than "D" format), to match the OGIP standard.

It is also a bit clearer (having done this) where the code now exists that removes keywords like ANCRFILE from the header - see issue sherpa#1885. However, it's still not entirely clear, and the existing code has some interesting behaviour (i.e. when writing out a PHA the *FILE keywords have any paths stripped from them).

The pyfits backend had code that treated SYS_ERR as a fractional error (i.e. it gets multiplied by the COUNTS column) but the crates backend does not do this. So for now we drop this behaviour, even though OGIP Memo OGIP/92-007 does say "Sys_err, a 4-byte REAL scalar giving the fractional systematic error to be applied to the data in this row (channel). The FITS column name is SYS_ERR." We do not have good testing of this column!

The code makes choices about the columns and keywords it sets when writing out PHA files. It's not entirely obvious that all of these make sense, but the idea here is not to change the existing behaviour. Any changes can come later.
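As a rough sketch (the helper names here are made up, not actual Sherpa routines) of the kind of decoding that now happens in the generic layer rather than in each backend:

```python
import numpy as np

def rate_to_counts(rate, exposure):
    """OGIP PHA files may store RATE (counts/s) but DataPHA wants counts."""
    return np.asarray(rate) * exposure

def expand_scalar(value, nchan):
    """Columns like GROUPING or QUALITY may be given as a scalar constant."""
    value = np.asarray(value)
    if value.ndim == 0:
        return np.full(nchan, value)
    return value
```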
With these changes we can remove a lot of the previous code to read in tabular data in the backend layers. The backend get_table_data, get_header_data, get_ascii_data, and read_table_blocks routines have their return types changed, and the pack_table_data and set_table_data routines have their arguments changed to use more structured types.
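Building on the TableBlock/Column sketch above (again, illustrative names only), the generic layer can construct a structured block and hand it to whichever backend is in use, instead of passing around a dictionary plus a separate list of column names:

```python
import numpy as np

def make_block(name, header, **cols):
    """Bundle header and columns into a single structured object."""
    columns = [Column(cname, np.asarray(cvals)) for cname, cvals in cols.items()]
    return TableBlock(name=name, header=header, columns=columns)

tbl = make_block("TABLE", {"CREATOR": "example"},
                 X=[1, 2, 3], Y=[0.5, 1.5, 2.5])
```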
Although this removes some code it's not ideal, as we basically repeat the WCS code in the two backends. The write routine should probably go via set_hdus as well, at least for the FITS output. We can at least finally remove some of the routines, like _remove_structural_keywords, for which we had added replacements.
This change adds specialized TableBlock classes for the PHA, ARF, and RMF file cases. These specialized classes are meant to help avoid repeating checks we know should pass (e.g. that a PHA dataset has a CHANNEL column), but it doesn't actually add much type safety. It does add some level of documentation to the backend code (i.e. it points out what structures we expect, rather than just a generic "table"). One issue is that these classes raise a generic ValueError rather than a DataErr (since, in part, we don't know the filename when throwing the error, because the blocks could be generated from in-memory data). I am assuming that this is okay, since in most cases such errors should already have been caught in the backend code. An alternative would be to add special-case classes with direct access to the columns we require (e.g. as a dataclass), but this way we allow for extra information (e.g. other columns) to be sent around.
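A minimal sketch (again building on the hypothetical TableBlock above; the real classes are more involved) of how a specialized block can validate its required columns up front:

```python
class SpecBlock(TableBlock):
    """A PHA SPECTRUM-like block which must contain a CHANNEL column."""

    def __init__(self, name, header, columns):
        super().__init__(name=name, header=header, columns=columns)
        # A plain ValueError, not DataErr, since the filename may not be
        # known (the data could be purely in-memory).
        if not any(col.name == "CHANNEL" for col in columns):
            raise ValueError(f"Block '{name}' has no CHANNEL column")
```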
Try to improve coverage of corner cases. There are a number of error cases that are hard to exercise, since recent changes should make them impossible to trigger, but leaving in the checks does make it easier to see what is important in the code and what is not.
I have just been using this branch to test some simulations in both Sherpa and XSPEC. This showed two problems:
- the EXPOSURE keyword was not set in a simulated dataset (this suggests some issue in how fake_pha creates a DataPHA object, but it's also an issue here, so we set a default value)
- the TLMIN1/TLMAX1 keywords could be written out as floating-point values for an integer column, which XSPEC doesn't like; it's an easy fix (see the sketch below)
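A sketch of the second fix (the helper is hypothetical): keywords describing an integer column, such as TLMIN1/TLMAX1 for CHANNEL, need to be written out as Python integers rather than NumPy floating-point values.

```python
import numpy as np

def channel_limits(channel):
    """Return the TLMIN1/TLMAX1 values as plain ints for the CHANNEL column."""
    chans = np.asarray(channel)
    return {"TLMIN1": int(chans.min()), "TLMAX1": int(chans.max())}
```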
Several minor tweaks found when using the output of save_pha from this PR:
- ensure the EXPOSURE value is a float (it could have been written out as an integer)
- ensure consistency in the output data type for several related fields (e.g. a float and not an int when it defaults to 0)
- ensure consistency in the description label for several related fields
@DougBurke - personally, if it isn't too much effort I would prefer the rebase, to preserve the history and simplify diff'ing. We could take the rebased changes here since the PR is already approved, or wait until #1944 is changed from draft and reviewed...
Force-pushed 0b62e11 to e8c7778
@wmclaugh - I've rebased it.
We had inconsistent handling of the STAT_ERR and SYS_ERR values, both recently (changes in PR sherpa#1921 to move the deconstruction of the PHA data from the backend into the generic layer) and in the original, per-backend code. We have
- values that can be set as a column or a header value (a scalar)
- the use of 0 to indicate "we don't have a value calculated"

Now we check
- first for the column
- then the keyword

and then check if all the values are 0; if so we remove the value (i.e. do not send it to DataPHA). This means that the code now behaves slightly differently, since you may not get a message about "errors being present but unused" when the error values are all 0, and the column will no longer be created when it is effectively unset. This should **NOT** lead to differences when fitting the data.
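A sketch of that resolution order (function and argument names are illustrative only):

```python
import numpy as np

def get_error_values(columns, header, name, nchan):
    """Prefer the column, fall back to the scalar keyword, and drop the
    values entirely when they are all zero (i.e. "not calculated")."""
    if name in columns:
        vals = np.asarray(columns[name])
    elif name in header:
        vals = np.full(nchan, header[name])
    else:
        return None
    if np.all(vals == 0):
        return None   # do not send an all-zero error column to DataPHA
    return vals
```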
Summary
Rework the astro data input and output code so that it takes advantage of the new data representation in sherpa.astro.io.types (these classes provide a generic representation of FITS-like concepts), rather than passing the data around as a dictionary. This moves the generic logic for deconstructing or constructing FITS-like data from the backend code into the generic IO layer, reducing the chance for differences in the backend support for special cases.
Details
Follow the suggestion of #1913 and move the decoding logic from the backend code into the generic layer, taking advantage of the new types in the `sherpa.astro.io.types` module. There is also a lot of code clean up, in particular (but not limited to) in the crates backend.

The idea is that we can use these "new" types (which are actually based on the work done in the 4.16 release for the XSPEC table model case) to better model the data we want (compared to just using a dictionary that is then sent to a `DataXXX` class). In doing so we can move some of the decoding logic from the backend into the frontend, and so avoid having cases where "the file can be read by backend a but not by backend b". Rather than do this all in one go, I have tried to change one area at a time.

And then, after going through all this, I add specialized types for PHA/responses, which reworks the code I've just changed. So we could decide not to take the "add HEASARC-specific table blocks" commit if we feel it's not worth it.

Reviewing this you may see comments indicating "we have introduced this new function to replace this old function but haven't completed the transition yet". Some of these commits are annoying to review - in particular the RMF code - because code is moved around, sometimes between files, and slightly changed at the same time. These changes rely in part on the large number of tests we have for I/O on these files (some of which were added in previous PRs in this series because of this eventuality).
Unfortunately the use of types doesn't always lead to completely type-safe interfaces, in part because the typing infrastructure is still relatively new: some pain points are that we don't track shape and datatype information of ndarrays, and that np.bool_/np.floating values do not always appear to behave the same as bool/float, depending on the type checker.
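A small example of the bool/np.bool_ friction mentioned above:

```python
import numpy as np

def is_monotonic(values: np.ndarray) -> bool:
    # np.all returns np.bool_, which some type checkers do not treat as a
    # plain bool, so an explicit conversion keeps the annotation honest.
    return bool(np.all(np.diff(values) > 0))
```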
One interesting decision is whether we just stick all our types into a single place - e.g. `sherpa.utils.types` - or whether we have a "per domain" location - so `sherpa.stats.types`, `sherpa.io.types`, .... We are currently in an "in between" case, as most types are getting put into `sherpa.utils.types` but the I/O types are in their own location (and are astro-specific). I have tried to point out that all the type code is somewhat experimental, so we do not have to have a strategy set in stone.

The aim of this PR is to make it easier to support multi-matrix RMF files, but that change is done in a follow-up PR (although the new types are set up to make it a lot easier). See
One thing I should do is note down the problem cases I identified here, so we can track them better.