
Clean up of the IO code #1921

Merged
merged 13 commits into sherpa:main from move-rmf-processing
Aug 13, 2024

Conversation

DougBurke
Contributor

@DougBurke DougBurke commented Oct 4, 2023

Summary

Rework the astro data input and output code so that it takes advantage of the new data representation in sherpa.astro.io.types (these classes provide a generic representation of FITS-like concepts), rather than passing data around as a dictionary. This moves the generic logic for deconstructing or constructing FITS-like data out of the backend code and into the generic IO layer, reducing the chance of the backends differing in their support for special cases.

Details

Follow the suggestion of #1913 and move the decoding logic from the backend code into the generic layer, taking advantage of the new types in the sherpa.astro.io.types module. There is also a lot of code cleanup, in particular (but not limited to) of the crates backend.

The idea is that we can use these "new" types (which are actually based on the work done in the 4.16 release for the XSPEC table model case) to better model the data we want (compared to just using a dictionary that is then sent to a DataXXX class). In doing so we can move some of the decoding logic from the backend into the frontend, and so avoid having cases where "the file can be read by backend a but not by backend b".
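
To make this concrete, here is a minimal, self-contained sketch of the pattern. The class names (HeaderItem, Column, TableBlock) follow the description of sherpa.astro.io.types above, but the fields and methods shown are illustrative rather than the module's actual API:

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class HeaderItem:
    """A FITS-like keyword: a name, a value, and an optional description."""
    name: str
    value: object
    desc: str | None = None


@dataclass
class Column:
    """A named column of data from a FITS-like table."""
    name: str
    values: np.ndarray
    unit: str | None = None


@dataclass
class TableBlock:
    """A FITS-like table: a header plus a set of columns."""
    name: str
    header: list[HeaderItem] = field(default_factory=list)
    columns: list[Column] = field(default_factory=list)

    def get_column(self, name: str) -> Column:
        for col in self.columns:
            if col.name.upper() == name.upper():
                return col
        raise ValueError(f"column {name} not found in block {self.name}")


# A backend only has to fill in this structure; the FITS-specific
# decoding then happens once, in the generic layer, identically for
# both the crates and pyfits backends.
arf = TableBlock(name="SPECRESP",
                 header=[HeaderItem("EXPOSURE", 1000.0)],
                 columns=[Column("ENERG_LO", np.asarray([0.1, 0.2])),
                          Column("ENERG_HI", np.asarray([0.2, 0.3])),
                          Column("SPECRESP", np.asarray([250.0, 330.0]))])
print(arf.get_column("energ_lo").values)
```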

Rather than do this all in one go, I have tried to change one area at a time:

  • RMF
  • ARF
  • PHA
  • generic table support
  • image access

Then, after going through all of this, I add specialized types for PHA/responses, which reworks the code I've just changed. So we could decide not to take the "add HEASARC-specific table blocks" commit if we feel it's not worth it.

So, when reviewing this, you may see comments indicating "we have introduced this new function to replace this old function but haven't completed the transition yet". Some of these commits are annoying to review - in particular the RMF code - because code is moved around, sometimes between files, and slightly changed at the same time. These changes rely in part on the large number of tests we have for I/O on these files (some of which were added in previous PRs in this series in anticipation of this work).

Unfortunately the use of types doesn't always lead to completely type-safe interfaces, in part because the typing infrastructure is still relatively new: some pain points are that we don't track the shape and data type of ndarrays, and that values like np.bool_/np.floating do not always appear to behave the same as bool/float, depending on the type checker.
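
As a concrete illustration of the scalar mismatch (this is standard NumPy behaviour; how much a given type checker complains about it varies):

```python
import numpy as np

# np.bool_ is not a subclass of Python's bool, so a function annotated
# as returning bool is flagged when it returns a NumPy comparison result.
print(isinstance(np.bool_(True), bool))     # False
print(isinstance(np.float64(1.0), float))   # True - the float case differs

# Comparing an ndarray element produces a NumPy boolean scalar, not bool:
x = np.asarray([1.0, 2.0])
print(type(x[0] > 1.0))                     # a NumPy boolean scalar
```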

One interesting decision is whether we stick all our types into a single place - e.g. sherpa.utils.types - or have a "per domain" location - so sherpa.stats.types, sherpa.io.types, .... We are currently in an in-between state, as most types are being put into sherpa.utils.types but the I/O types are in their own location (and astro-specific). I have tried to point out that all the type code is somewhat experimental, so we do not have to have a strategy set in stone.

The aim of this PR is to make it easier to support multi-matrix RMF files, but that change is done in a follow-up PR (although the new types are set up to make this change a lot easier). See

One thing I should do is note down the problem cases I identified here, so we can track them better.

@DougBurke DougBurke marked this pull request as draft October 4, 2023 20:42
@codecov

codecov bot commented Oct 4, 2023

Codecov Report

Attention: Patch coverage is 72.91297% with 305 lines in your changes missing coverage. Please review.

Project coverage is 87.39%. Comparing base (52b00d0) to head (0b62e11).

Files Patch % Lines
sherpa/astro/io/crates_backend.py 0.00% 244 Missing ⚠️
sherpa/astro/io/types.py 87.03% 21 Missing ⚠️
sherpa/astro/io/__init__.py 95.94% 16 Missing ⚠️
sherpa/astro/io/pyfits_backend.py 93.89% 16 Missing ⚠️
sherpa/astro/xspec/__init__.py 71.42% 4 Missing and 4 partials ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1921      +/-   ##
==========================================
+ Coverage   86.88%   87.39%   +0.50%     
==========================================
  Files          84       84              
  Lines       29770    29586     -184     
  Branches     4422     4446      +24     
==========================================
- Hits        25866    25857       -9     
+ Misses       3795     3617     -178     
- Partials      109      112       +3     
Files Coverage Δ
sherpa/astro/data.py 93.08% <100.00%> (-0.01%) ⬇️
sherpa/astro/instrument.py 90.09% <100.00%> (ø)
sherpa/astro/io/dummy_backend.py 62.96% <100.00%> (-0.68%) ⬇️
sherpa/astro/io/xstable.py 98.24% <100.00%> (-0.27%) ⬇️
sherpa/astro/ui/utils.py 98.16% <100.00%> (+<0.01%) ⬆️
sherpa/astro/utils/__init__.py 83.52% <ø> (ø)
sherpa/astro/xspec/__init__.py 98.39% <71.42%> (-0.09%) ⬇️
sherpa/astro/io/__init__.py 96.28% <95.94%> (-1.76%) ⬇️
sherpa/astro/io/pyfits_backend.py 94.36% <93.89%> (+0.61%) ⬆️
sherpa/astro/io/types.py 87.71% <87.03%> (-12.29%) ⬇️
... and 1 more

... and 1 file with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@DougBurke DougBurke changed the title Move rmf processing Clean up of the IO code Oct 6, 2023
DougBurke added a commit to DougBurke/sherpa that referenced this pull request Oct 26, 2023
Ensure that a problem case found during development of sherpa#1921 is
explicitly tested (it is used by other tests but it is hard to
track down thanks to the cascading error handling that we use for
the "load a generic data file" commands, so ensure we test it
explicitly).
@DougBurke
Contributor Author

Rebasing as I have just lost a day due to an overly-ambitious refactor which I've now pulled out into #1929 (although the ambitious part hasn't been included). I want to make sure I have a copy of "working" code before I break it all again.

@DougBurke DougBurke force-pushed the move-rmf-processing branch 2 times, most recently from 7849007 to 9dfc2d2 on November 2, 2023 20:21
@DougBurke DougBurke mentioned this pull request Jan 3, 2024
@DougBurke
Contributor Author

Rebased to pick up #1938

DougBurke and others added 12 commits August 12, 2024 13:20
Using the types module we can set up the data we want to both read in
and write out in the IO layer and not the backend layer.  The backend
types for get_rmf_data (return value) and pack_rmf_data/set_rmf_data
(input argument) have changed to use "structured" types.

One aim is to provide a unified interface for RMF processing
so that we no longer have the case of "backend a cannot read
in this RMF but backend b can" (modulo those cases where a
backend cannot read a particular file at all).

One minor change is that we now add a CREATOR keyword to the two
table blocks of an RMF (set to "sherpa <version>"), and this is
only done if the CREATOR keyword is not already set.

There is some logic here that could be moved into the DataRMF
class, as it looks like we may slow down the rmf_fold code if the
data types do not match (i.e. we need to make sure we convert to
the data types expected by the C++ rmf_fold code).

RMF files are now required to follow the OGIP standard **much
more** strongly than before. For example, we now **require** an
EBOUNDS block where it was optional before. This might affect
some "interesting" use cases, but I honestly don't understand the
current support for this and it is completely untested (as shown
by the fact that making EBOUNDS a required block does not change
any existing test).

As part of this, we now read in all the MATRIX blocks of an RMF
and then warn the user that only the first block is used (and this
warning is done in the generic layer rather than repeated in both
backends).
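
A rough sketch of the two behaviours described above, as they might look in the generic layer (the helper names and warning text are illustrative, not the actual code):

```python
import logging

warn = logging.getLogger("sherpa.astro.io").warning


def add_creator(header: dict, version: str) -> None:
    """Add a CREATOR keyword, but only if the file did not set one."""
    header.setdefault("CREATOR", f"sherpa {version}")


def select_matrix(matrix_blocks: list) -> object:
    """Pick the MATRIX block to use, warning when there are several."""
    if len(matrix_blocks) > 1:
        warn("RMF contains %d MATRIX blocks but only the first is used",
             len(matrix_blocks))
    return matrix_blocks[0]
```
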
The resp_init routine was only used by the crates backend, not
pyfits. With the rework of the I/O layer it doesn't appear to
be needed (and was likely better handled in Python rather than
C++ scope as it isn't a time-critical routine), so remove it.
The ARF case is simpler than the RMF case, but we need to convert the
generic table code before we can really simplify things.

The backend routines get_arf_data (return value) and the pair of
pack_arf_data and set_arf_data (number and types of arguments)
have changed to use more-structured types.
The backend layer now just reads in the data, with the logic for
handling it (e.g. is it type I or type II, do we need to convert
from RATE to "counts") now in the generic layer. The TableHDU
class is also used for writing out the data. Thanks to how the
code was written, this also changes set_table_data to use TableHDU
for passing data from the generic layer to the backend code.
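
As an example of the decode logic that now lives in one place, here is a sketch of the RATE-to-counts conversion (the get_column/get_key helpers are hypothetical names used for illustration; the column and keyword names follow OGIP usage):

```python
import numpy as np


def get_counts(block) -> np.ndarray:
    """Return the channel counts, converting a RATE column if needed."""
    try:
        return block.get_column("COUNTS").values
    except ValueError:
        # RATE is counts per second, so scale by the exposure time.
        rate = block.get_column("RATE").values
        exposure = block.get_key("EXPOSURE")
        return rate * exposure
```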

The backend get_pha_data (return value) and the pack_pha_data,
set_pha_data (number of arguments and the types) calls have
been changed.

There is some interesting behaviour - e.g. we don't fully handle
the "columns can also be given as a scalar value if it is a
constant" case, in particular with the GROUPING and QUALITY arrays.
We have at least one test file with them set to "0" and if we read
this in as a column then a number of tests fail. It is likely that
these failures can be treated as regression tests (i.e. the tests
can be updated) but that is not done here.

The BACKSCAL and AREASCAL values are now written out as 4-byte
reals when given as a column, rather than 8-byte reals (so "E"
rather than "D" format), to match the OGIP standard.

It is also a bit clearer (having done this) where the code now
exists that removes keywords like ANCRFILE from the header - see
issue sherpa#1885. However, it's still not entirely clear and the existing
code has some interesting behaviour (i.e. when writing out a PHA
the *FILE keywords have any paths stripped from them).

The pyfits backend had code that treated SYS_ERR as a fractional
error (i.e. it gets multiplied by the COUNTS column) but the crates
backend does not do this. So for now we drop this behaviour, even
though OGIP Memo OGIP/92-007 does say

  Sys_err, a 4-byte REAL scalar giving the fractional systematic
    error to be applied to the data in this row (channel).
  The FITS column name is SYS_ERR.

We do not have good testing of this column!
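
To show what the dropped pyfits-only behaviour actually did (a sketch based on the memo's definition, not the removed code):

```python
import numpy as np

counts = np.asarray([100.0, 40.0, 9.0])
sys_err = np.asarray([0.05, 0.05, 0.10])  # fractional, per OGIP/92-007

# The old pyfits-backend behaviour: turn the fractional value into an
# absolute systematic error by multiplying by the COUNTS column.
abs_sys_err = sys_err * counts
print(abs_sys_err)  # [5.  2.  0.9]
```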

The code makes choices about columns and keywords it sets when
writing out PHA files. It's not entirely obvious all of these make
sense but the idea here is not to change the existing behaviour.
Any changes can come later.
With these changes we can remove a lot of the previous code to
read in tabular data in the backend layers.

The backend get_table_data, get_header_data, get_ascii_data, and
read_table_blocks routines have their return types changed, and
the pack_table_data and set_table_data routines have their
arguments changed to use more-structured types.
Although this removes some code it's not ideal, as we basically
repeat the WCS code in the two backends. The write routine should
probably go via set_hdus as well, at least for the FITS output.

We can at least finally remove some of the routines, like
_remove_structural_keywords, for which we had added replacements.
This change adds specialized TableBlock classes for the
PHA, ARF, and RMF file cases. These specialized classes are
meant to help avoid repeating checks we know should pass (e.g.
that a PHA dataset has a CHANNEL column) but it doesn't actually
add much type safety. It does add some level of documentation
to the backend code (i.e. point out what structures we expect,
rather than just a generic "table").

One issue is that these classes raise a generic ValueError
rather than a DataErr (since, in part, we don't know the filename
when throwing the error, because the data could have been
generated in memory). I am assuming that this is okay, since in
most cases such errors should already have been caught in the
backend code.

An alternative would be to add special-case classes with
direct access to the columns we require (e.g. as a dataclass),
but this way we allow for extra information (e.g. other columns)
to be sent around.
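
A minimal, self-contained sketch of the specialization idea (the real classes in this PR differ in detail):

```python
import numpy as np


class SpecrespBlock:
    """An ARF block: construction fails unless the OGIP columns exist."""

    REQUIRED = ("ENERG_LO", "ENERG_HI", "SPECRESP")

    def __init__(self, name: str, columns: dict[str, np.ndarray]):
        missing = [n for n in self.REQUIRED if n not in columns]
        if missing:
            # A plain ValueError rather than DataErr: there may be no
            # filename to report, as the data could be in-memory.
            raise ValueError(f"Block {name} is missing: {missing}")
        self.name = name
        self.columns = columns

    # Later code can now access the columns without re-checking them.
    def specresp(self) -> np.ndarray:
        return self.columns["SPECRESP"]
```
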
Try to improve coverage of corner cases. There are a number of
error cases that are hard to test, as recent changes should make
them impossible to trigger, but leaving the checks in does make
it easier to see what matters in the code.
I have just been using this branch to test some simulations in
both Sherpa and XSPEC. This showed two problems:

- the EXPOSURE keyword was not set in a simulated dataset
  (this suggests some issue in how fake_pha creates a DataPHA
  object, but it's also an issue here, so we set a default
  value)

- the TLMIN1/TLMAX1 keywords could be written out as floating-point
  values for an integer column, which XSPEC doesn't like; it's
  an easy fix
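
A sketch of the kind of fix the second point needs (illustrative, not the PR's actual code): make the TLMIN1/TLMAX1 values match the column's data type.

```python
import numpy as np


def channel_limits(channel: np.ndarray) -> tuple:
    """Return TLMIN1/TLMAX1 values matching the column data type."""
    lo, hi = channel.min(), channel.max()
    if np.issubdtype(channel.dtype, np.integer):
        # XSPEC rejects floating-point TLMIN/TLMAX on an integer column.
        return int(lo), int(hi)
    return float(lo), float(hi)
```
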
Several minor tweaks found when using the output of save_pha
from this PR:

- ensure the EXPOSURE value is a float (it could have been
  written out as an integer)

- ensure consistency in the output data type for several related
  fields (e.g. a float and not an int when it defaults to 0)

- ensure consistency in the description label for several related
  fields
@DougBurke
Contributor Author

For reference, I've pushed the rebased version to #1944 as e8c7778

Once this passes - famous last words - I'll be pushing the rest of that branch but I could also use it here.

@wmclaugh
Contributor

@DougBurke - personally, if it isn't too much effort I would prefer the rebase, to preserve the history and simplify diff'ing. We could take the rebased changes here, since the PR is already approved, or wait until #1944 is changed from draft and reviewed...

@DougBurke
Contributor Author

@wmclaugh - I've rebased it.

@wmclaugh wmclaugh merged commit 899cdc7 into sherpa:main Aug 13, 2024
31 checks passed
@DougBurke DougBurke deleted the move-rmf-processing branch August 13, 2024 14:00
DougBurke added a commit to DougBurke/sherpa that referenced this pull request Sep 3, 2024
We had inconsistent handling of the STAT_ERR and SYS_ERR values,
both recently (changes in PR sherpa#1921 to move the deconstruction of
the PHA data from the backend into the generic layer), and also
in the original, per-backend code.

We have

- values can be set as a column or a header value (a scalar)
- the use of 0 to indicate "we don't have a value calculated"

Now we check

- first for the column
- then the keyword

and then check if all values are 0, and if so we remove it
(i.e. do not send it to DataPHA).

This means that the code now behaves slightly differently,
since you may not get a message about "errors being present
but unused" when the error values are all 0, and the column
will no longer be created when it is effectively unset. This
should **NOT** lead to differences when fitting the data.
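
A sketch of that lookup order (the block helpers are hypothetical names used for illustration):

```python
import numpy as np


def read_error(block, name: str):
    """Return the STAT_ERR or SYS_ERR values, or None if unset.

    Check the column first, then the header keyword, and treat an
    all-zero result as "no value calculated".
    """
    try:
        values = block.get_column(name).values
    except ValueError:
        key = block.get_key(name)  # the scalar (keyword) fallback
        if key is None:
            return None
        values = np.full(block.nrows, key)
    if np.all(values == 0):
        return None  # do not send it to DataPHA
    return values
```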