[Enhancement]: Add weight threshold option for averaging operations #531
Comments
@pochedls thank you for posting this. I think this would be very important, especially when there is time-varying missing data in different locations, which is not uncommon in observation datasets. @gleckler1 handles observation datasets using xcdat for obs4MIPs, and this subject is going to be very relevant to his work. Datasets from obs4MIPs are used as reference datasets in PMP, so this issue is also related to PMP.
@pochedls @lee1043 Thanks for bringing this up! It would be great to have some sort of `weight_threshold` ... is the thinking that if the threshold were not met, a NaN would be given or an error would be raised? It would be great to have this for both time and space, but my first choice would be time. Thanks again for thinking of this.
Here's an exchange I had with Chris Golaz related to this in 2018, which addresses the question of an annual mean. It requires one to form monthly means; then, if at least one month of data was available during each season, an annual mean would be calculated. If all 3 months were missing in one or more seasons, the annual mean would be considered missing. I'll also try to find my notes on using a centroid criterion for means of cyclical data (like the annual cycle). It gave the user more control over how much data could be missing in a situation like that.

Chris, I think the seasonal climatology assumes the first month in the time series is January, so that part needs to be generalized to handle different starting months. happy coding,

On 5/23/18 3:05 PM, Chris Golaz wrote:

Thanks for that. This algorithm makes perfect sense to me, with reasonable choices for propagating missing values from monthly to seasonal and annual. Maybe I should code it up in Fortran to see how much faster than Python it can be... As far as I can tell, the extra complication in cdutil comes from the added flexibility on how missing values propagate up in the average (with the option of specifying a threshold and centroid). -Chris

On 05/23/2018 02:43 PM, Karl Taylor wrote:

-------- Forwarded Message --------

Hi Chris, I can't find notes from 2001, but below I've copied a suggestion I made for computing climatologies for use with the metrics package. The pseudo-code can handle missing months. It computes climatological monthly means first, then from those it computes the seasonal means and the annual means. If this has been implemented, it was probably implemented on top of the fundamental CDMS functions. Hope this helps a little,

-------- Forwarded Message --------

Hi Charles and all, here's a proposal for how to compute climatological means, starting with the monthly means. Let's consider a seasonal mean missing if the climatological monthly means for all three of its months are missing. Then I suggest the following algorithm:

Let f(x, m, y) be the value at grid-cell x, year y, and month m, and let A(x, m) be the weight associated with month m.

*** First compute monthly climatologies for each grid cell [C(x, m)]:

    loop over grid cells (x)
        ...
    end x loop

*** Now compute seasonal mean climatologies [Cs(x, s)]:

    loop over grid cells (x)
        Cs(x, s) = ( C(x, m1)*A(x, m1) + C(x, m2)*A(x, m2) + C(x, m3)*A(x, m3) ) / ( A(x, m1) + A(x, m2) + A(x, m3) )
    end x loop

*** Now compute annual mean climatology [Ca(x)]:

    loop over grid cells (x)
        ...
    end x loop

Notes: ...

that's all,
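For concreteness, here is a minimal numpy sketch of the algorithm in that email for a single grid cell, assuming the monthly climatologies C (with NaN for missing months) have already been computed and using month lengths as the weights A; the season definitions, sample values, and the choice to weight seasons by their available weight are illustrative assumptions, not xCDAT behavior:

```python
import numpy as np

# Climatological monthly means C(x, m) for one grid cell (NaN = missing)
# and weights A (here, month lengths); both are illustrative stand-ins.
C = np.array([2.1, 2.3, 5.0, np.nan, np.nan, np.nan,
              20.1, 19.8, 15.2, 9.9, 5.4, 2.8])
A = np.array([31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31], dtype=float)

seasons = {"DJF": [11, 0, 1], "MAM": [2, 3, 4],
           "JJA": [5, 6, 7], "SON": [8, 9, 10]}

Cs, As = {}, {}
for name, months in seasons.items():
    valid = [m for m in months if not np.isnan(C[m])]
    if not valid:
        # All three climatological monthly means missing -> season missing.
        Cs[name], As[name] = np.nan, 0.0
    else:
        Cs[name] = np.sum(C[valid] * A[valid]) / np.sum(A[valid])
        As[name] = np.sum(A[valid])

# Annual mean climatology Ca: missing if any seasonal mean is missing.
# (Weighting seasons by their available weight is one plausible choice.)
if any(np.isnan(v) for v in Cs.values()):
    Ca = np.nan
else:
    Ca = sum(Cs[s] * As[s] for s in Cs) / sum(As.values())
print(Cs, Ca)
```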
I can't find my notes on this, but there is some description of how the centroid method I came up with is applied in https://cdat.llnl.gov/documentation/utilities/utilities-1.html under "temporal averaging". In general two criteria are set: a minimum coverage of the time period (threshold), and a constraint requiring data to be near the centroid of the times sampled. For a simple time series (assumed not to be quasi-cyclic), the centroid is calculated as a simple mean of all times with data. This may be useful in deciding whether you can calculate a meaningful mean over an interval that includes a trend: if the mean of the sampling times is too close to one end of the interval, then you'll get a non-representative time mean.

For quasi-cyclic data like the diurnal cycle or the annual cycle, the centroid is calculated as for a two-dimensional field. For an annual mean to be calculated from monthly values, for example, you can specify that the centroid lie near the point calculated when all months are available. You basically treat each month as a point on an analog clock face and, leaving out the missing months, calculate the centroid of the remaining months. Assume you've centered the clock on a polar coordinate system: if the radial distance to the centroid is less than some threshold, then the mean of the monthly values will give a reasonable annual mean. You might also, of course, set a minimum number of months as a second criterion.
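A rough illustration of that clock-face test (the angle convention and the 0.3 threshold are assumptions made for the example, not values from cdutil):

```python
import numpy as np

# Clock-face centroid test for an annual mean from monthly values.
# Each available month is a point on a unit circle; a centroid far from
# the origin means sampling is bunched in one part of the year.
available_months = [11, 0, 1, 2]  # winter-heavy sampling (Dec-Mar)
angles = 2 * np.pi * np.array(available_months) / 12.0
centroid_x = np.cos(angles).mean()
centroid_y = np.sin(angles).mean()
radial_distance = np.hypot(centroid_x, centroid_y)  # ~0.84 here

# Threshold 0.3 is an arbitrary illustrative value, not the cdutil default.
annual_mean_ok = radial_distance < 0.3
print(radial_distance, annual_mean_ok)
```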
Hi @taylor13, thank you for your input! We are planning to carefully review your notes about weight thresholds and the "centroid" function. Here are more related comments:
You also sent us a helpful email about this on 7/14/21:
thanks for linking in the earlier input. I think the xcdat strategy should probably be to implement something rather simple like the "seasons" approach suggested in #531 (comment). Two cases might be commonly encountered. The first is monthly mean data covering multiple years: here, if there are no big trends, the multi-year calendar months could be averaged to form a single annual cycle. The other is to compute a time series of annual means. For this case an annual mean could be calculated when at least one month of data was available for each of the 4 seasons. Seasonal means would be calculated from the months available within each season, and then the annual mean would be calculated from all 4 seasons. (For seasonal means, the months should be weighted by the length of each month, and for annual means the seasons should be weighted by the length of each season.) There are a number of other simple options one might adopt, so perhaps others can weigh in.
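A minimal sketch of that second case for a single year of monthly data, following the month-length/season-length weighting described above (DJF is simplified to use December of the same calendar year, and the sample values are made up):

```python
import numpy as np

days = np.array([31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31], dtype=float)
seasons = {"DJF": [11, 0, 1], "MAM": [2, 3, 4],
           "JJA": [5, 6, 7], "SON": [8, 9, 10]}

def annual_mean(monthly):
    """Annual mean of 12 monthly values (NaN = missing), or NaN if any
    season has no data. Months weighted by length; seasons by full length."""
    s_means, s_weights = [], []
    for month_idx in seasons.values():
        idx = np.array(month_idx)
        valid = idx[~np.isnan(monthly[idx])]
        if valid.size == 0:
            return np.nan  # an entire season is missing -> no annual mean
        s_means.append(np.average(monthly[valid], weights=days[valid]))
        s_weights.append(days[idx].sum())
    return np.average(s_means, weights=s_weights)

# Example: Apr-Jun missing, but every season still has at least one month,
# so an annual mean is returned; missing all of MAM would yield NaN.
x = np.array([2.0, 3.0, 6.0, np.nan, np.nan, np.nan,
              21.0, 20.0, 15.0, 10.0, 6.0, 3.0])
print(annual_mean(x))
```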
Our last xcdat meeting made me think about this issue a little bit. I think something along the lines of the following could work for the xcdat internal spatial averager; see the sketch below.

Other things to think about:
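Picking up the averager idea above, a hypothetical sketch (this is not the actual xCDAT helper; `min_weight` is an assumed argument, and the intermediate names anticipate the `weight_sum_all`/`weight_sum_masked` quantities discussed in the next comment):

```python
import xarray as xr

def averager(da: xr.DataArray, weights: xr.DataArray, dim, min_weight=0.0):
    """Weighted mean over `dim`, masked where the fraction of weight
    backed by valid data falls below `min_weight` (hypothetical API)."""
    # Total possible weight vs. weight actually covered by valid data.
    weight_sum_all = weights.sum(dim=dim)
    weight_sum_masked = weights.where(da.notnull()).sum(dim=dim)

    mean = da.weighted(weights.fillna(0)).mean(dim=dim)

    # Return NaN wherever coverage is insufficient.
    coverage = weight_sum_masked / weight_sum_all
    return mean.where(coverage >= min_weight)
```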
You say the weights must be a data array. Are there other requirements? For time, is it assumed they have been set proportional to the length of each month? If you compute a monthly mean from daily means, with some daily means missing, do the weights for the monthly means get set proportional to the number of days in the month with valid data?

Conceptually, I find it easiest to keep track of masking by enabling the analysis package to compute the full weights (assuming no missing values) based on the coordinate bounds (but with an option for the user to provide these in difficult cases like weird grids). Then, in addition, I would like to make it easy for the user to associate a data array containing "fractional weights", which would indicate when a data value should be down-weighted from its full value. The actual weights are then the product of the full weights array and the fractional weights array. For simple binary masks, the fractions would be either 0 or 1; for unmasked data the fractions would all be 1.

As an example, consider a lat x lon array with data only available over land. Suppose we want to regrid it. On the original grid the analysis package would compute areas for each grid cell based on the lat and lon coordinate bounds. The user would indicate which grid cells had valid data by, for example, setting the fractional weights to 0 for ocean cells and 1 for the land cells. The regridder could use this information and conservatively map the data to some target grid. The user would expect to have returned (along with the regridded field) the fraction of each target cell with contributions from the source cells (i.e., the fractional weights); as for the original grid, the full weights could be inferred from the lat and lon bounds of the target grid (which the regridder would presumably have calculated).

The above "averager" function would calculate weight_sum_all directly from the array containing full weights and would calculate weight_sum_masked from the product of the full weights array and the fractional weights array. Maybe in effect that is what is already being done (or something essentially equivalent), in which case this comment can be ignored.
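A tiny sketch of that full-weights times fractional-weights separation (the names and values are illustrative, not an xCDAT interface):

```python
import numpy as np
import xarray as xr

lat = xr.DataArray([-45.0, 0.0, 45.0], dims="lat", name="lat")
data = xr.DataArray([10.0, 25.0, 12.0], coords={"lat": lat}, dims="lat")

# Full weights from the grid geometry (here a cos(lat) proxy for cell area);
# fractional weights mark how much of each cell holds valid data (0 = ocean).
full_weights = np.cos(np.deg2rad(lat))
frac_weights = xr.DataArray([1.0, 0.0, 1.0], dims="lat")

# The effective weight is the product of the two arrays.
mean = data.weighted(full_weights * frac_weights).mean(dim="lat")
```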
@taylor13 - thanks for looking this over and for the comments. This might be more efficient as an in-person discussion (please send me an email if you want to chat more). You have a lot of nice suggestions in this thread. I think some comments might be better placed in separate discussions and, given limited resources, it would be helpful to differentiate ideas across a spectrum from a) nice to have but not frequently used, to b) very important and commonly used.
My comment was regarding spatial averaging operations (I updated the comment to clarify this), not temporal averaging. The spatial weight matrix is calculated internally (using the coordinate bounds), or it can be supplied via the public averaging interface [note that the function above is an internal helper function]. The comment was meant to serve as a technical pointer to where we might add code to address this issue (adding a weight threshold for averaging operations). I think we should focus this thread on the issue of applying weight thresholds. Discussion / questions / suggestions on how to generate weights might be better placed in a new discussion or a separate issue. The readthedocs page provides information on how these features work in the existing xcdat package.
The user can supply their own weights via the public averaging interface.
This issue was motivated by dealing with missing values and a particular problem: you have valid data in a small part of your temporal or spatial domain and missing values elsewhere. This can lead to a situation where a very small fraction of the data (e.g., one day in a yearly average or one grid cell in a spatial domain) gives you an unrepresentative average. This is currently how the average works (not ideal), though there is a separate issue that has the effect of masking data that has any missing values (also not ideal, but easy to implement). This issue/thread aims to give a little more control by allowing you to specify how much of the data (i.e., the fractional weight) must exist to return a value (otherwise a missing value would be returned).

I think the land/ocean scenario is a little different, because the user wants to bring in their own constraints (in this case a separate matrix describing the land/ocean fraction). This would probably be best handled by the user with xcdat tools (e.g., by supplying their own weights).
@pochedls @taylor13 thanks for the discussion and for sharing the insights and knowledge. I believe this is important for more precise weighting. I had a chance to chat about this in brief with @taylor13 today and am leaving this note to share and to remind myself.

Regarding the temporal average bounds, I learned that CDAT has code that handles this. I also acknowledge that spatial averages could be calculated more precisely.

Just for your interest, below is a prototype testing code that I wrote for obs4MIPs, which I think takes a similar approach to the one in @pochedls's comment above: count_NaN_values.pdf (Note: the first half uses toy dummy data, the second half real-world data.) My prototype code does not yet have the sophisticated weighting capability that @taylor13 explained. I think I will need to give more thought to that.
Example 6 in https://xcdat.readthedocs.io/en/latest/examples/spatial-average.html indicates that xcdat already gives the user control over providing a time-varying "weights" array along with the data to do spatial averaging. So, most of my input above is already provided for.
Is your feature request related to a problem?
When xCDAT performs an averaging operation (spatial or temporal) with missing data, it assigns missing values a weight of zero (which is correct). But this can be misleading if some, or even the majority, of the data is missing. Imagine that data were missing in spring, summer, and fall (leaving only winter data): xCDAT would take the annual average and report a very cold annual average temperature. A user is usually aware of this kind of thing, but may miss it if a small number of grid cells are missing part of the dataset (or are missing anomaly values, which would be harder to recognize as "weird").
See the example below:
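For illustration, a small synthetic reconstruction of the scenario (assumed data, not the original snippet):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Monthly temperatures where only the winter months survived; the other
# nine months are missing.
time = pd.date_range("2000-01-01", periods=12, freq="MS")
temps = np.array([-5.0, -3.0] + [np.nan] * 9 + [-4.0])  # Jan, Feb, Dec only
ts = xr.DataArray(temps, coords={"time": time}, dims="time", name="ts")

# The "annual mean" is ~-4.0: a winter-only value reported with no hint
# that 9 of 12 months contributed nothing.
print(ts.mean(dim="time").item())
```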
A similar situation can arise if a timestep of observations is missing part of the field during spatial averaging operations (e.g., missing the tropics, leading to too cold global temperatures).
Describe the solution you'd like
This would need to be mapped out more, but it might be useful to have a `weight_threshold` optional argument that allows the user to specify the minimum temporal/spatial weight needed to produce a value. For example, `weight_threshold=0.9` would require 90% of the data within the spatial or temporal averaging window to be present.

CDAT had similar functionality for temporal averaging calculations (see Specifying Data Coverage Criteria). I'm not sure if there was any similar functionality for spatial averaging.
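A hypothetical usage sketch (the `weight_threshold` keyword does not currently exist; `spatial.average` and `temporal.group_average` are the existing xCDAT accessor methods this proposal would extend, and the file path is made up):

```python
# Hypothetical usage of the proposed keyword; weight_threshold is NOT an
# existing xCDAT argument.
import xcdat as xc

ds = xc.open_dataset("ts_Amon_example.nc")  # illustrative path

# Spatial mean returned only where >= 90% of the area weight has valid data.
global_mean = ds.spatial.average("ts", weight_threshold=0.9)

# Annual means returned only where >= 90% of each year's time weight is valid.
annual_mean = ds.temporal.group_average("ts", freq="year", weight_threshold=0.9)
```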
Describe alternatives you've considered
No response
Additional context
No response