Roman Data Models
===================

Pipeline Architecture Overview
--------------------------------------

**Based on lessons learned from JWST, in turn based on lessons learned from HST.**

- HST pipelines tended to be monolithic code with little modularity.
- Common algorithms often cut and pasted, and eventually diverged, leading to maintenance headaches.
- Inordinate dependencies on FITS keywords in the code itself; changes in input file metadata led to difficulties maintaining code depending on such details.
- WCS needed a better solution.
- Reference file management system led to many problems and an error-prone system.

**JWST architecture designed to address these shortcomings**

- Modularize pipelines into steps, with common step functionality put into a separate library shared by all steps (`stpipe`), which handles:
  - Obtaining the proper reference files.
  - Input/output options.
  - Logging.
  - Configuration and handling of options for processing.
- Adopt Python Data Models to isolate dependencies on input file details to the I/O components of the pipeline; calibration code requires fewer changes when the file details change (required to support FITS).
- Develop ASDF as an alternate format to FITS, and use in all cases for WCS information for unresampled data, layered on a more capable WCS library (Generalized WCS).
- Validation checks that files match expectations (useful to detect mismatched file/pipeline problems).

**Changes based on JWST experience**

- Extracting calibration code common between JWST and Roman into stcal.
- Merge Calibration and Data Management System (DMS) schemas to keep Calibration pipelines in sync with what the DMS is producing for Level 1 files.
  - DMS uses the schemas to indicate sources of the information, and where the metadata ends up in the archive catalogs.
- FITS support seriously complicates the JWST data model machinery; dropped for Roman.
- Python Data Model mostly driven by ASDF schema files, mostly using tags (JWST uses few ASDF tags).
- Previous schema system resulted in kitchen sink schemas that had many irrelevant attributes for any specific data model. New data model machinery avoids this.

**Components of the Roman Pipeline Architecture**

- CRDS (for reference file management)
- stpipe for Step infrastructure
- stcal for generic calibration code
- Schemas based on tags
- Data Models defined by Schemas

**Repositories related to the architecture**

- Schemas located in the [spacetelescope/rad](https://github.com/spacetelescope/rad) repository.
- Roman Data Models machinery located in [spacetelescope/roman_datamodels](https://github.com/spacetelescope/roman_datamodels) repository.
- Most Data Model elements derived from the schemas.
  - Those that need customizations have those applied in roman_datamodels.
  - Only top-level data models require entries in roman_datamodels.
- Top level calibration code located in [spacetelescope/romancal](https://github.com/spacetelescope/romancal) repository.
- Generic shared calibration code located in [spacetelescope/stcal](https://github.com/spacetelescope/stcal) repository.

**Data Models and Schemas**

- Roman datamodels consist of a nested set of nodes, most of which are associated with a tag, and thus have a Python object type.
- A node's type is not implicitly defined by the attribute it is found under; it has a type independent of where in the ASDF tree it is found.
- The attributes a node is expected to have are defined by the schema associated with the node's tag.
- The association of tags and schemas is defined in a manifest file located in the RAD repository.
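
As a rough illustration of the idea that a node's type follows from its tag rather than from its position in the tree, the sketch below uses a plain dictionary as a stand-in for the tag lookup. The classes, `TAG_TO_CLASS`, and `build_node` are hypothetical names invented for this example; they are not the `roman_datamodels` implementation.

```python
# Hypothetical stand-ins for tagged node classes; not the roman_datamodels API.
class Photometry(dict):
    pass


class WfiMode(dict):
    pass


# A tag URI maps to exactly one Python type, wherever the node appears.
TAG_TO_CLASS = {
    "asdf://stsci.edu/datamodels/roman/tags/photometry-1.0.0": Photometry,
    "asdf://stsci.edu/datamodels/roman/tags/wfi_mode-1.0.0": WfiMode,
}


def build_node(tag, content):
    """Wrap raw content in the class registered for its tag."""
    return TAG_TO_CLASS[tag](content)


mode_tag = "asdf://stsci.edu/datamodels/roman/tags/wfi_mode-1.0.0"

# The same tag yields the same type under two different attribute names.
tree = {
    "instrument": build_node(mode_tag, {"name": "WFI"}),
    "some_other_key": build_node(mode_tag, {"name": "WFI"}),
}
print(type(tree["instrument"]).__name__)      # WfiMode
print(type(tree["some_other_key"]).__name__)  # WfiMode
```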

The following illustrates a few of the components of an ASDF file as defined by the schemas. We will look at those for the Level 2 product.

**Note on location of schemas discussed below.** These are found in the [spacetelescope/rad](https://github.com/spacetelescope/rad) repository in the following directory: `src/rad/resources/schemas`. Those for reference files are located in the `reference_files` subdirectory.

**Level 2 top level schema** (`wfi_image-1.0.0.yaml`)

```
%YAML 1.1
---
$schema: asdf://stsci.edu/datamodels/roman/schemas/rad_schema-1.0.0
id: asdf://stsci.edu/datamodels/roman/schemas/wfi_image-1.0.0

title: |
  The schema for WFI Level 2 images.

type: object
properties:
  meta:
    allOf:
      - $ref: common-1.0.0
      - type: object
        properties:
          photometry:
            tag: asdf://stsci.edu/datamodels/roman/tags/photometry-1.0.0
        required: [photometry]
```

The general structure of the data model is to place all major arrays at the top level, and all metadata under the `meta` attribute. Since the content of `meta` varies with the type of data, it itself has no tag, but instead uses an include mechanism to merge different content. The schema pointed to by `$ref`, `common-1.0.0`, contains all metadata expected for all Roman data. In this case one more attribute is added to that with the name `photometry`, which has a tag specifying the content expected. The content associated with `common-1.0.0` will be examined after this schema.
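
The include mechanism can be pictured with a small sketch. Strictly, `allOf` requires a value to satisfy every listed subschema; for object schemas like these, the effect is similar to merging their `properties` and `required` lists. The helper `merge_all_of` is hypothetical, and the `common` dictionary is an abbreviated stand-in for the resolved content of `common-1.0.0`, not the real schema.

```python
def merge_all_of(subschemas):
    """Combine the properties and required lists of allOf members."""
    properties, required = {}, []
    for sub in subschemas:
        properties.update(sub.get("properties", {}))
        required.extend(sub.get("required", []))
    return {"type": "object", "properties": properties, "required": required}


# Abbreviated stand-in for the resolved content of common-1.0.0.
common = {
    "properties": {"instrument": {}, "exposure": {}},
    "required": ["instrument", "exposure"],
}

# The extra block that wfi_image adds on top of the common metadata.
extra = {
    "properties": {
        "photometry": {
            "tag": "asdf://stsci.edu/datamodels/roman/tags/photometry-1.0.0"
        }
    },
    "required": ["photometry"],
}

meta_schema = merge_all_of([common, extra])
print(sorted(meta_schema["properties"]))  # ['exposure', 'instrument', 'photometry']
print(meta_schema["required"])            # ['instrument', 'exposure', 'photometry']
```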

Following the specification of what is expected in `meta` is the description of the expected data arrays. As can be seen, they are all tagged as array types, with requirements on the expected element type and the number of dimensions. Some have associated information such as `title`, which is a short description of the attribute.

**Top level schema for Level 2 products**

  
  ```
  data:
    title: Science data, excluding border reference pixels.
    tag: tag:stsci.edu:asdf/core/ndarray-1.0.0
    datatype: float32
    ndim: 2
  dq:
    tag: tag:stsci.edu:asdf/core/ndarray-1.0.0
    datatype: uint32
    ndim: 2
  err:
    tag: tag:stsci.edu:asdf/core/ndarray-1.0.0
    datatype: float32
    ndim: 2
  var_poisson:
    tag: tag:stsci.edu:asdf/core/ndarray-1.0.0
    datatype: float32
    ndim: 2
  var_rnoise:
    tag: tag:stsci.edu:asdf/core/ndarray-1.0.0
    datatype: float32
    ndim: 2
  var_flat:
    tag: tag:stsci.edu:asdf/core/ndarray-1.0.0
    datatype: float32
    ndim: 2
  amp33:
    title: Amp 33 reference pixel data
    tag: tag:stsci.edu:asdf/core/ndarray-1.0.0
    datatype: uint16
    ndim: 3
  border_ref_pix_left:
    title: Original border reference pixels, on left (from viewers perspective).
    tag: tag:stsci.edu:asdf/core/ndarray-1.0.0
    datatype: float32
    ndim: 3
  border_ref_pix_right:
    title: Original border reference pixels, on right (from viewers perspective).
    tag: tag:stsci.edu:asdf/core/ndarray-1.0.0
    datatype: float32
    ndim: 3
  border_ref_pix_top:
    title: Original border reference pixels, on top.
    tag: tag:stsci.edu:asdf/core/ndarray-1.0.0
    datatype: float32
    ndim: 3
  border_ref_pix_bottom:
    title: Original border reference pixels, on bottom.
    tag: tag:stsci.edu:asdf/core/ndarray-1.0.0
    datatype: float32
    ndim: 3
  dq_border_ref_pix_left:
    title: DQ for border reference pixels, on left (from viewers perspective).
    tag: tag:stsci.edu:asdf/core/ndarray-1.0.0
    datatype: uint32
    ndim: 2
  dq_border_ref_pix_right:
    title: DQ for border reference pixels, on right (from viewers perspective).
    tag: tag:stsci.edu:asdf/core/ndarray-1.0.0
    datatype: uint32
    ndim: 2
  dq_border_ref_pix_top:
    title: DQ for border reference pixels, on top.
    tag: tag:stsci.edu:asdf/core/ndarray-1.0.0
    datatype: uint32
    ndim: 2
  dq_border_ref_pix_bottom:
    title: DQ for border reference pixels, on bottom.
    tag: tag:stsci.edu:asdf/core/ndarray-1.0.0
    datatype: uint32
    ndim: 2
  cal_logs:
    tag: asdf://stsci.edu/datamodels/roman/tags/cal_logs-1.0.0
  ```

The final part of the schema indicates the order in which the attributes should appear in the YAML content. This is mainly for visual consistency, so that attributes do not appear in unpredictable locations.

`flowStyle` indicates which of YAML's two formatting options to use; `block` indicates that indented formatting is to be used.

Finally, `required` indicates which attributes must be present (in this case, all of them except `var_flat`).

```
propertyOrder: [meta, data, dq, err, var_poisson, var_rnoise, var_flat,
                amp33, border_ref_pix_left, border_ref_pix_right,
                border_ref_pix_top, border_ref_pix_bottom,
                dq_border_ref_pix_left, dq_border_ref_pix_right,
                dq_border_ref_pix_top, dq_border_ref_pix_bottom, cal_logs]
flowStyle: block
required: [meta, data, dq, err, var_poisson, var_rnoise, amp33,
           border_ref_pix_left, border_ref_pix_right, border_ref_pix_top,
           border_ref_pix_bottom, dq_border_ref_pix_left,
           dq_border_ref_pix_right, dq_border_ref_pix_top,
           dq_border_ref_pix_bottom, cal_logs]
...
```
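
To make the relationship between `propertyOrder` and `required` concrete, here is a minimal sketch that compares the two lists, copied from the schema fragment above. The helper names `optional_attributes` and `missing_required` are hypothetical, and the check is only a simplified stand-in for what schema validation does.

```python
# Lists copied from the wfi_image schema fragment above.
property_order = [
    "meta", "data", "dq", "err", "var_poisson", "var_rnoise", "var_flat",
    "amp33", "border_ref_pix_left", "border_ref_pix_right",
    "border_ref_pix_top", "border_ref_pix_bottom",
    "dq_border_ref_pix_left", "dq_border_ref_pix_right",
    "dq_border_ref_pix_top", "dq_border_ref_pix_bottom", "cal_logs",
]
required = [
    "meta", "data", "dq", "err", "var_poisson", "var_rnoise", "amp33",
    "border_ref_pix_left", "border_ref_pix_right", "border_ref_pix_top",
    "border_ref_pix_bottom", "dq_border_ref_pix_left",
    "dq_border_ref_pix_right", "dq_border_ref_pix_top",
    "dq_border_ref_pix_bottom", "cal_logs",
]


def optional_attributes():
    """Attributes that are ordered in the schema but not required."""
    return [name for name in property_order if name not in required]


def missing_required(tree):
    """Required attributes that a candidate tree does not yet define."""
    return [name for name in required if name not in tree]


print(optional_attributes())  # ['var_flat']

# A partially populated tree is reported as incomplete.
partial_tree = {"meta": {}, "err": None}
print(missing_required(partial_tree)[:3])  # ['data', 'dq', 'var_poisson']
```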

What follows is the schema for the common elements in the metadata expected in all Roman data products.

**common-1.0.0.yaml**

```
%YAML 1.1
---
$schema: asdf://stsci.edu/datamodels/roman/schemas/rad_schema-1.0.0
id: asdf://stsci.edu/datamodels/roman/schemas/common-1.0.0

title: Common metadata properties

allOf:
# Meta Variables
- $ref: asdf://stsci.edu/datamodels/roman/schemas/basic-1.0.0
- type: object
  properties:
    # Meta Objects
    aperture:
      tag: asdf://stsci.edu/datamodels/roman/tags/aperture-1.0.0
    cal_step:
      tag: asdf://stsci.edu/datamodels/roman/tags/cal_step-1.0.0
    coordinates:
      tag: asdf://stsci.edu/datamodels/roman/tags/coordinates-1.0.0
    ephemeris:
      tag: asdf://stsci.edu/datamodels/roman/tags/ephemeris-1.0.0
    exposure:
      tag: asdf://stsci.edu/datamodels/roman/tags/exposure-1.0.0
    guidestar:
      tag: asdf://stsci.edu/datamodels/roman/tags/guidestar-1.0.0
    instrument:
      tag: asdf://stsci.edu/datamodels/roman/tags/wfi_mode-1.0.0
    observation:
      tag: asdf://stsci.edu/datamodels/roman/tags/observation-1.0.0
    pointing:
      tag: asdf://stsci.edu/datamodels/roman/tags/pointing-1.0.0
    program:
      tag: asdf://stsci.edu/datamodels/roman/tags/program-1.0.0
    ref_file:
      tag: asdf://stsci.edu/datamodels/roman/tags/ref_file-1.0.0
    target:
      tag: asdf://stsci.edu/datamodels/roman/tags/target-1.0.0
    velocity_aberration:
      tag: asdf://stsci.edu/datamodels/roman/tags/velocity_aberration-1.0.0
    visit:
      tag: asdf://stsci.edu/datamodels/roman/tags/visit-1.0.0
    wcsinfo:
      tag: asdf://stsci.edu/datamodels/roman/tags/wcsinfo-1.0.0
  required: [aperture, cal_step, coordinates, ephemeris, exposure, guidestar,
             instrument, observation, pointing, program, ref_file,
             target, velocity_aberration, visit, wcsinfo]
...
```

Notice that all the attributes in this schema reference tags, and that all are required. We will take a look at one of these, `instrument`, below.

**wfi_mode-1.0.0.yaml**

```
%YAML 1.1
---
$schema: asdf://stsci.edu/datamodels/roman/schemas/rad_schema-1.0.0
id: asdf://stsci.edu/datamodels/roman/schemas/wfi_mode-1.0.0


title: |
  WFI observing configuration
type: object
properties:
  name:
    title: Instrument used to acquire the data
    type: string
    enum: [WFI]
    sdf:
      special_processing: VALUE_REQUIRED
      source:
        origin: TBD
    archive_catalog:
      datatype: nvarchar(5)
      destination: [ScienceCommon.instrument_name]
  detector:
    $ref: wfi_detector-1.0.0
    sdf:
      special_processing: VALUE_REQUIRED
      source:
        origin: TBD
    archive_catalog:
      datatype: nvarchar(10)
      destination: [ScienceCommon.detector]
  optical_element:
    $ref: wfi_optical_element-1.0.0
    sdf:
      special_processing: VALUE_REQUIRED
      source:
        origin: TBD
    archive_catalog:
      datatype: nvarchar(20)
      destination: [ScienceCommon.optical_element]
propertyOrder: [detector, optical_element, name]
flowStyle: block
required: [detector, optical_element, name]
...
```

We can see that the instrument attribute has the attributes: `name`, `detector`, and `optical_element`. The `optical_element` attribute refers to yet another schema that specifies what values are acceptable (which we will not examine).

Notice that the special Data Management System information is present for these attributes, in particular `sdf` and `archive_catalog`, which are used to indicate to the DMS how these values are populated and where they will end up in the archive catalog.
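
As a sketch of how a tool might read this information, the snippet below collects the `archive_catalog` destinations from a dictionary that mirrors the `properties` section of the schema above. The helper `archive_destinations` is hypothetical and is not how the DMS itself consumes the schemas.

```python
# Mirrors the archive_catalog entries in wfi_mode-1.0.0 shown above.
wfi_mode_properties = {
    "name": {
        "archive_catalog": {
            "datatype": "nvarchar(5)",
            "destination": ["ScienceCommon.instrument_name"],
        }
    },
    "detector": {
        "archive_catalog": {
            "datatype": "nvarchar(10)",
            "destination": ["ScienceCommon.detector"],
        }
    },
    "optical_element": {
        "archive_catalog": {
            "datatype": "nvarchar(20)",
            "destination": ["ScienceCommon.optical_element"],
        }
    },
}


def archive_destinations(properties):
    """Map each attribute name to its archive catalog destinations."""
    result = {}
    for name, spec in properties.items():
        catalog = spec.get("archive_catalog")
        if catalog is not None:
            result[name] = catalog["destination"]
    return result


for name, destinations in archive_destinations(wfi_mode_properties).items():
    print(name, "->", destinations)
```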

Relation to Roman Data Models
------------------------------------------

Roman data files can be opened in two different ways: directly in ASDF using `asdf.open`, or through `roman_datamodels` using its `open` function. The `roman_datamodels` interface provides more functionality and conveniences, namely returning a datamodels object and allowing use of dotted attribute notation for accessing items in the data set (as illustrated in tutorial (xxx link)). For example, in the above case the following would be relevant:

```python
import roman_datamodels as rdm

# Open a Level 2 file as a data model and inspect a few items.
dm = rdm.open('../data/r0000101001001001001_01101_0001_WFI01_cal.asdf')
print('type of data model: ', type(dm))
print('shape of error array: ', dm.err.shape)
print('type of dm.meta.instrument: ', type(dm.meta.instrument))
print('optical_element = ', dm.meta.instrument.optical_element)
```

In general, the Roman datamodel object classes are generated automatically from the schemas with a few exceptions, one of which is `WfiMode`. In that case an attribute `filter` is added to return the value of `optical_element` to provide consistency with JWST calibration code that is generic enough to share with Roman (and which is located in the stcal repository). In most cases each roman_datamodel node has an object type, but is basically a structure which has attributes as defined in the schema (which of course presumes that the attribute names are legal Python variable names). 
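
The kind of customization described for `WfiMode` can be pictured as a read-only alias. The class below is a hypothetical stand-in written for this illustration; it is not the actual `roman_datamodels` code.

```python
class WfiModeSketch(dict):
    """Hypothetical stand-in for a customized node class."""

    @property
    def filter(self):
        # Alias that returns the value stored under optical_element.
        return self["optical_element"]


mode = WfiModeSketch(name="WFI", detector="WFI01", optical_element="GRISM")
print(mode.filter)  # GRISM
```

Because the alias is computed on access, it stays consistent if `optical_element` is changed later.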

Managing changes to schemas and data models
-----------------------------------------------------------

Normally such changes are made as follows:

- First, make the change in [spacetelescope/rad](https://github.com/spacetelescope/rad) by adding the appropriate schema in the correct place.
- Then make a corresponding change to the manifest to link the tag with the schema.
- The manifest file is located in the `spacetelescope/rad` repository at this path: `src/rad/resources/manifests/datamodels-1.0.yaml`. Part of the existing manifest file is shown below to indicate the typical contents.

```
%YAML 1.1
---
id: asdf://stsci.edu/datamodels/roman/manifests/datamodels-1.0
extension_uri: asdf://stsci.edu/datamodels/roman/extensions/datamodels-1.0
title: Datamodels extension 1.0
description: |-
  A set of tags for serializing STScI Roman datamodels.
asdf_standard_requirement:
  gte: 1.1.0
tags:
# Object Modules
- tag_uri: asdf://stsci.edu/datamodels/roman/tags/guidewindow-1.0.0
  schema_uri: asdf://stsci.edu/datamodels/roman/schemas/guidewindow-1.0.0
  title: Guide window schema
  description: |-
    Guide window schema
- tag_uri: asdf://stsci.edu/datamodels/roman/tags/ramp-1.0.0
  schema_uri: asdf://stsci.edu/datamodels/roman/schemas/ramp-1.0.0
  title: Ramp schema
  description: |-
    Ramp schema
- tag_uri: asdf://stsci.edu/datamodels/roman/tags/ramp_fit_output-1.0.0
  schema_uri: asdf://stsci.edu/datamodels/roman/schemas/ramp_fit_output-1.0.0
  title: Ramp fit output schema
  description: |-
    Ramp fit output schema
- tag_uri: asdf://stsci.edu/datamodels/roman/tags/wfi_science_raw-1.0.0
  schema_uri: asdf://stsci.edu/datamodels/roman/schemas/wfi_science_raw-1.0.0
  title: Roman WFI Raw Science Data datamodel
  description: |-
    Basic Roman Raw Science
- tag_uri: asdf://stsci.edu/datamodels/roman/tags/wfi_image-1.0.0
  schema_uri: asdf://stsci.edu/datamodels/roman/schemas/wfi_image-1.0.0
  title: Wfi level 2 image information
  description: |-
    Wfi level 2 image information
< rest of file not shown>
```

The last shown item associates the `wfi_image` tag with the corresponding schema file. The machinery that uses these values determines where in the source tree the schema file is actually located.

Once the schema repository is updated and released, any changes needed in the `roman_datamodels` code can be made and a corresponding release made. In most cases this may only involve updating the version of the manifest file referred to. Unlike most other version handling, however, it is not necessary to change the version of the manifest file so long as previous tag/schema associations are not changed. In that case no changes are needed to `roman_datamodels`, provided the new datamodel objects need no customizations to support the new schemas.

Such customizations are only needed if special processing is needed to add extra attributes or methods, or modify the values found in the file (e.g., changing the units of values or some other transformation).
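
The condition that previous tag/schema associations are not changed can be expressed as a simple check. This is only a sketch under stated assumptions: the manifest entries are reduced to plain dictionaries, the function names are hypothetical, and `ramp-9.9.9` is a made-up schema version used purely to show a failing case.

```python
def associations(manifest_tags):
    """Map each tag URI to its schema URI."""
    return {entry["tag_uri"]: entry["schema_uri"] for entry in manifest_tags}


def is_compatible_update(old_tags, new_tags):
    """True if every old tag is still present and points to the same schema."""
    old, new = associations(old_tags), associations(new_tags)
    return all(new.get(tag) == schema for tag, schema in old.items())


BASE = "asdf://stsci.edu/datamodels/roman"
old_manifest = [
    {"tag_uri": f"{BASE}/tags/ramp-1.0.0",
     "schema_uri": f"{BASE}/schemas/ramp-1.0.0"},
]

# Adding a new tag leaves the existing association untouched.
added = old_manifest + [
    {"tag_uri": f"{BASE}/tags/wfi_image-1.0.0",
     "schema_uri": f"{BASE}/schemas/wfi_image-1.0.0"},
]

# Repointing an existing tag changes a previous association.
# "ramp-9.9.9" is a made-up version used only for this illustration.
repointed = [
    {"tag_uri": f"{BASE}/tags/ramp-1.0.0",
     "schema_uri": f"{BASE}/schemas/ramp-9.9.9"},
]

print(is_compatible_update(old_manifest, added))      # True
print(is_compatible_update(old_manifest, repointed))  # False
```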

stcal
--------

[stcal](https://github.com/spacetelescope/stcal) is probably not relevant to SSC purposes unless there is a need to use generic code that STScI uses for both the Roman and JWST calibration pipelines. It doesn't hurt to know what exists in it, to avoid reinventing the wheel. If something there is almost what is needed, and a new option would make it useful for SSC purposes, coordinating changes there would save much work.


Building a Roman Data Model from scratch
-----------------------------------------------

Typically one would create the nodes from the bottom up, using the node types expected by the schemas. For example, to create the instrument value, of class type `WfiMode`, one would do the following:

```python
import asdf
import numpy as np
import roman_datamodels.stnode as rdnode

# Build the instrument node (class WfiMode) using dictionary notation.
instr = rdnode.WfiMode()
instr['name'] = 'WFI'
instr['detector'] = 'WFI01'
instr['optical_element'] = 'GRISM'

# Attach it to the metadata of a Level 2 image node.
meta = {}
meta['instrument'] = instr
wfi_image = rdnode.WfiImage()
wfi_image['meta'] = meta
wfi_image['err'] = np.zeros((4088, 4088), dtype=np.float32)

af = asdf.AsdfFile()
# af.tree = {'roman': wfi_image}  # This will not work since not all
#                                 # required elements have been defined.
```

This only shows population of the items described previously, but the same approach would be used for almost everything else. Note that the object attributes are defined using dictionary notation, since the node machinery prevents creation of new attributes using the usual dotted notation. The checks against the schemas are only performed when creating the AsdfFile object.
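
The restriction on creating attributes can be imitated with a small class. `NodeSketch` below is a hypothetical illustration of the behavior described above (new items via dictionary notation only, existing items reachable with dotted notation); it is not the actual node machinery.

```python
class NodeSketch:
    """Hypothetical node: new keys only via [], existing keys via dots."""

    def __init__(self):
        # Bypass our own __setattr__ to create the backing store.
        object.__setattr__(self, "_data", {})

    def __getattr__(self, name):
        # Called only when normal lookup fails: fall back to stored keys.
        data = self.__dict__.get("_data", {})
        if name in data:
            return data[name]
        raise AttributeError(name)

    def __setattr__(self, name, value):
        # Dotted assignment may update an existing key but not create one.
        if name not in self._data:
            raise AttributeError(
                f"no such attribute {name!r}; use node[{name!r}] = value"
            )
        self._data[name] = value

    def __setitem__(self, name, value):
        # Dictionary notation is the only way to add a new key.
        self._data[name] = value


node = NodeSketch()
node["detector"] = "WFI01"   # create with dictionary notation
node.detector = "WFI02"      # update an existing attribute with dots

rejected = False
try:
    node.optical_element = "GRISM"  # not created yet, so this is refused
except AttributeError as err:
    rejected = True
    print("rejected:", err)

print(node.detector)  # WFI02
```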

Building a Roman Reference File Data Model from scratch
---------------------------------------------------------------------

We will build a gain reference data model and save it.

```python
# Continues from the previous cells: uses rdm, rdnode, np, asdf,
# and the WfiMode node `instr` defined above.
import astropy.time as atime

meta = {}
# Populate common part.
meta['reftype'] = 'GAIN'
meta['pedigree'] = 'GROUND'
meta['description'] = 'Gain reference file'
meta['author'] = 'Stephen King'
meta['useafter'] = atime.Time('2022-01-01T11:11:11.111')
meta['telescope'] = 'ROMAN'
# STSCI is the ONLY valid origin at this time
meta['origin'] = 'STSCI'
instr['name'] = 'WFI'
instr['detector'] = 'WFI01'
meta['instrument'] = instr

# Now create the top level node of the right type
gainref = rdnode.GainRef()
gainref['meta'] = meta
# Using small array since schema isn't specific about size
gainref['data'] = np.ones((100, 100), dtype=np.float32)

# Write the file, then read it back as a data model.
af = asdf.AsdfFile()
af.tree = {'roman': gainref}
af.write_to('gainref.asdf')
dm = rdm.open('gainref.asdf')
print(type(dm))
print(dm.meta.instrument.detector)
```

Exercises
-----------

1. Open the example data file and extract the value of the exposure start time.
2. Change the exposure start time.
3. Change the exposure start time to "hello there". What happens?
4. Go to the rad repository and, using the gain example above as a guide together with the schemas for the flat field reference file, create a reference file for a flat field (yes, all reference files have corresponding data models). Note that there is a fairly straightforward algorithm that maps the schema name into a class name: underscores are removed and the elements thus separated are capitalized before joining with "Ref" (e.g., gain-1.0.0.yaml becomes GainRef). You will need to follow all the chained references for the flat schema to get all the necessary details (e.g., it refers to ref_common-1.0.0 and ref_optical_elements-1.0.0, and the latter refers to wfi_optical_element-1.0.0.yaml).
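
The name mapping described in exercise 4 can be sketched as a short function. The function name `ref_class_name` is hypothetical, and `some_name-1.0.0.yaml` is an invented file name used only to show how underscores are handled; neither is taken from the rad repository.

```python
def ref_class_name(schema_filename):
    """Map a reference schema file name to a class name, per the rule above."""
    # Drop the version and extension: 'gain-1.0.0.yaml' -> 'gain'.
    base = schema_filename.split("-", 1)[0]
    # Capitalize each underscore-separated element, join, and append 'Ref'.
    return "".join(part.capitalize() for part in base.split("_")) + "Ref"


print(ref_class_name("gain-1.0.0.yaml"))       # GainRef
print(ref_class_name("some_name-1.0.0.yaml"))  # SomeNameRef (invented name)
```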