Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read_cdf() crashes if time-dependent variable doesn't provide 'UNITS' #5907

Closed
jgieseler opened this issue Feb 22, 2022 · 9 comments · Fixed by #5909
Closed

read_cdf() crashes if time-dependent variable doesn't provide 'UNITS' #5907

jgieseler opened this issue Feb 22, 2022 · 9 comments · Fixed by #5909
Labels
Bug Probably a bug Effort Low Requires a small time investment Good First Issue The best issues for new people to tackle! io/CDF Issue with CDF files Package Novice Requires little knowledge of the internal structure of SunPy Priority Low Slow action required timeseries Affects the timeseries submodule

Comments

@jgieseler
Copy link
Member

jgieseler commented Feb 22, 2022

Describe the bug

I try to obtain STEREO in-situ particle data of the LET instrument from CDAWeb via Sunpy's new CDAWeb client (#5558). The actual data files are CDF's, so internally read_cdf() from sunpy.io.cdf (#5435) is used (both by @dstansby).

The download of the CDF files is working, but when I try to read the cdf files, read_cdf() produces a KeyError because it tries to read the attribute 'UNITS' that is not provided in the CDF file for this specific variable.

To Reproduce

>>> from sunpy.net import Fido
>>> from sunpy.net import attrs as a
>>> from sunpy.io.cdf import read_cdf

>>> trange = a.Time('2021/07/01', '2021/07/02')
>>> dataset = a.cdaweb.Dataset('STA_L1_LET')
>>> result = Fido.search(trange, dataset)
>>> downloaded_files = Fido.fetch(result)

>>> data = read_cdf(downloaded_files[0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gieseler/anaconda3/envs/test310/lib/python3.10/site-packages/sunpy/io/cdf.py", line 74, in read_cdf
    unit_str = attrs['UNITS']
KeyError: 'UNITS'

What happened?

The problem is that read_cdf() addresses the variable attribute 'UNITS' without checking if it exists:

unit_str = attrs['UNITS']

So a quick workaround would be this:

# Get units
try:
    unit_str = attrs['UNITS']
    try:
        unit = u.Unit(unit_str)
    except ValueError:
        if unit_str in _known_units:
            unit = _known_units[unit_str]
        else:
            warn_user(f'astropy did not recognize units of "{unit_str}". '
                      'Assigning dimensionless units. '
                      'If you think this unit should not be dimensionless, '
                      'please raise an issue at https://github.com/sunpy/sunpy/issues')
            unit = u.dimensionless_unscaled
except KeyError as keyerr:
    if keyerr.args[0] == 'UNITS':
        warn_user('No units provided. '
                  'Assigning dimensionless units. '
                  'If you think there should be units found in the file, '
                  'please raise an issue at https://github.com/sunpy/sunpy/issues')
        unit = u.dimensionless_unscaled
        pass
    else:
        raise

I can submit that as a pull request, but before I wanted to check that this is an acceptable approach. Or whether the general opinion is that the problem lies in the ill-defined CDF file and that it should be fixed there.

By the way, the (first) variable causing the KeyError in this case actually is dimensionless, while the other physical measurements do provide 'UNITS':

>>> import cdflib
>>> cdf = cdflib.CDF(downloaded_files[0])
>>> cdf.varattsget("CodeOK")
{'CATDESC': "LET's internal code check, 1=good, 0=bad",
 'DEPEND_0': 'Epoch',
 'FIELDNAM': 'LET Code OK',
 'FILLVAL': array([255], dtype=uint8),
 'FORMAT': 'I1',
 'LABLAXIS': 'CodeOK',
 'VALIDMIN': array([0], dtype=uint8),
 'VALIDMAX': array([2], dtype=uint8),
 'VAR_TYPE': 'support_data'}

>>> cdf.varattsget("Al_summed_flux")
{'CATDESC': 'Al unsectored flux in 3 energy bins',
 'DEPEND_0': 'Epoch',
 'DEPEND_1': 'Bins6to27',
 'DISPLAY_TYPE': 'time_series',
 'FIELDNAM': 'Al summed flux',
 'FILLVAL': array([-1.e+31]),
 'FORMAT': 'E12.2',
 'LABL_PTR_1': 'Bins6to27',
 'UNITS': '#/(cm^2*s*sr*Mev/nucleon)',
 'VALIDMIN': array([0., 0., 0.]),
 'VALIDMAX': array([1.e+10, 1.e+10, 1.e+10]),
 'VAR_TYPE': 'data',
 'SCALETYP': 'linear'}

Expected behavior

Return (a list of) GenericTimeSeries

Screenshots

No response

System Details

>>> import sunpy; sunpy.util.system_info()
==============================
sunpy Installation Information
==============================

General
#######
OS: Ubuntu (20.04, Linux 5.13.0-30-generic)
Arch: 64bit, (x86_64)
sunpy: 3.1.3
Installation path: /home/gieseler/anaconda3/envs/test310/lib/python3.10/site-packages

Required Dependencies
#####################
parfive: 1.5.1
packaging: 21.3
numpy: 1.22.1
astropy: 5.0

Optional Dependencies
#####################
asdf: Missing, need asdf>=2.6.0; extra == "all"
beautifulsoup4: 4.10.0
cdflib: 0.3.20
dask: 2022.1.0
drms: 0.6.2
glymur: 0.9.7.post1
h5netcdf: 0.13.1
h5py: 3.6.0
matplotlib: 3.5.1
mpl-animators: 1.0.0
pandas: 1.4.0
python-dateutil: 2.8.2
reproject: Missing, need reproject; extra == "all"
scikit-image: 0.18.3
scipy: 1.7.3
sqlalchemy: 1.4.31
tqdm: 4.62.3
zeep: 4.1.0

Installation method

conda

@dstansby
Copy link
Member

According to the ISTP guide on CDF files, which is the standard we support, UNITS or UNITS_PTR is a required field - https://spdf.gsfc.nasa.gov/istp_guide/variables.html. So if there are neither in the CDF file, I think we should be throwing an error. We should catch the KeyError though, and raise a nicer error message ourselves.

@dstansby
Copy link
Member

Thinking a bit more about this, maybe the pragmatic thing to do is to raise a warning, and either:

  1. Not read a variable in if it doesn't have units
  2. Read in the variable and assign dimensionless units

This way it's still possible to read the rest of the CDF file. I think I'd choose 1. to be on the safe side (instead of assuming untis), @jgieseler do you have any thoughts on which would be preferable?

@jgieseler
Copy link
Member Author

Yeah, I already guessed that the files might not follow the standard (they started 2006).

I think a pragmatic approach is worthwhile in this case, and I actually would favor option 2. Because right now also dimensionless units are assigned if the provided unit string is not understood. We wouldn't provide "wrong" information anyhow. And we don't know if some CDF file might contain some crucial variable that would get omitted. I think for the users, it's better to get a variable without units so that they really notice it and have the option to do something about it, rather than just drop the variables and provide a small warning that might easily get overlooked.

@dstansby
Copy link
Member

Good point about what we currently do for not understood unit strings 👍 Should be a pretty simple fix then - would you be up for opening a pull request with it?

@dstansby dstansby added the timeseries Affects the timeseries submodule label Feb 22, 2022
@jgieseler
Copy link
Member Author

Yes, I'll take care of it.

@nabobalis nabobalis added Bug Probably a bug Effort Low Requires a small time investment Good First Issue The best issues for new people to tackle! Priority Low Slow action required Package Novice Requires little knowledge of the internal structure of SunPy labels Feb 22, 2022
@wafels
Copy link
Member

wafels commented Feb 24, 2022

If I understand this correctly, there is an error in the CDF file. Do you know if the SPDF is aware of this error?

@jgieseler
Copy link
Member Author

jgieseler commented Feb 24, 2022

I'm not sure if this is a real error in the CDF files or just "not following the (current) standard". And I don't know if SPDF is aware.

Because the mission is already quite old (launched in 2006), I'm afraid there won't be too many people left working on the data. 😕 But as I just realize, the problem seems to affect the whole Level-1 dataset of the IMPACT suite of instruments of STEREO. Usually people would just use the Level-2 data anyhow. But one problem is that for some data only Level-1 is available (e.g. the 1-min STEREO/LET data from the beginning of this issue). And another problem is that CDAWeb for most IMPACT instruments only provides the Level-1 data. I guess because for most (all?) instruments only Level-1 data was released as CDF files...

So long story short, I probably should inform SPDF and some part of the IMPACT team (I think I remember who created some of these CDF files back in the day).

@wafels
Copy link
Member

wafels commented Feb 24, 2022

Yes, please contact the SPDF and the IMPACT team. I think they would be interested in this, if they don't already know.

As a separate matter I wonder if we should have a couple of new labels to point out issues with source CDF and FITS files. If SunPy users are having problems with the source data then the data providers should be made aware of them. Having separate GitHub labels would make it easy to identify these in the SunPy issue list.

@candeynasa
Copy link

I think the proposed fix of issuing a warning and setting UNITS to dimensionless is the way to go. While the ISTP Guidelines calls for the UNITS (or UNIT_PTR for multiple units attribute, and set to a blank space if dimensionless) as required, it's not uncommon to miss an attribute sometimes, especially for unit-less variables.

SPDF/CDAWeb uses a set of metadata-only CDF called master CDFs to over-ride and add metadata in the data files, in lieu of modifying the archived data files themselves. It might not be feasible, but the SunPy read_cdf routine could check https://spdf.gsfc.nasa.gov/pub/software/cdawlib/0MASTERS/ for a matching master and open it first to get the metadata. For attributes with a blank in the Master CD, the values for that attribute should be taken from the data CDF instead if available. Alternatively, one could call our Python cdawsws library https://cdaweb.gsfc.nasa.gov/WebServices/REST/py/cdasws/ and ask for the specific variables and time range and get back a generated CDF that will already have the better metadata.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Probably a bug Effort Low Requires a small time investment Good First Issue The best issues for new people to tackle! io/CDF Issue with CDF files Package Novice Requires little knowledge of the internal structure of SunPy Priority Low Slow action required timeseries Affects the timeseries submodule
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants