Example of a big-ish phenomics dataset, for feedback #20
A simulated dataset
The intent is to begin to get feedback on some rough sketches of what some data products might look like.
To this end, I have simulated the type of data that a sensor might observe, along with some of the underlying environmental drivers and physiological traits.
Note that there will be numerical artifacts, quasi-meaningful error terms, and liberal re-application of core concepts for the purposes of developing these datasets.
All of these simulated datasets are released CC-BY-0: do with them as you please. They are not production quality; the aim is just to meet demand and begin getting feedback.
A note on variable names
I have used the names currently in BETYdb.org/variables, along with names inspired by the more standardized Climate and Forecast (CF) naming conventions. However, this is a very early pre-release; how such data should be formatted and accessed can be discussed in issue #18.
Design of the Simulation Experiment
227 lines were grown at each of three sites along a N-S transect in Illinois over five years (2021-2025). Two years were dry, two were wet, and one was average.
These are historic data, but the years have been changed to emphasize the point that these are not real data.
These are approximate locations used to query the meteorological and soil data used in the simulations.
Each site has four replicate fields: A, B, C, D. This simulated dataset assumes the fields within a site have similar but not identical meteorology (e.g., as if they were all in the same county).
Two hundred twenty-seven lines were grown at each site. Each is uniquely identified by an integer in the range [9915:10141].
The phenotypes associated with each genotype are in the file
These 'phenotypes' are used as input parameters to the simulation model. We often refer to these as 'traits' (as opposed to biomass or growth rates, which are states and processes). In this example, we assume that 'phenotypes' are time-invariant.
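To make the size of the design concrete, here is a minimal Python sketch that enumerates the site x field x year x genotype combinations described above. The site labels (`site_1`, etc.) are hypothetical placeholders, since the sites are not named here, and the sketch assumes every line is grown in every replicate field in every year, which may not match how the simulation was actually run.

```python
from itertools import product

# Hypothetical labels: the three sites are not named in this post.
SITES = ["site_1", "site_2", "site_3"]
FIELDS = ["A", "B", "C", "D"]        # four replicate fields per site
YEARS = range(2021, 2026)            # 2021-2025 inclusive
GENOTYPES = range(9915, 10142)       # integer IDs 9915..10141 inclusive


def build_design():
    """Return one record per (site, field, year, genotype) combination.

    Assumes a fully crossed design (every line in every field, every year).
    """
    return [
        {"site": s, "field": f, "year": y, "genotype": g}
        for s, f, y, g in product(SITES, FIELDS, YEARS, GENOTYPES)
    ]


design = build_design()
print(len(GENOTYPES))  # 227 lines
print(len(design))     # 3 sites * 4 fields * 5 years * 227 lines = 13620
```

Under that fully crossed assumption there are 13,620 genotype-by-field-by-year units, each of which would carry a daily time series during the growing season.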
Simulated Sensor Data
This dataset includes what a sensor might observe, daily for five years during the growing season.
Note: A sensor won't observe roots or rhizomes. Furthermore, Sorghum doesn't have rhizomes. The simulated biology is a little different.
How to obtain data and give feedback:
If you do something cool, please send comments and figures!
Data are located on Box: https://uofi.box.com/sorghum-simulation
This looks fantastic and I think it's going to be more than enough to get us started on developing the GP prediction code. A couple of quick questions:
I think we can get some initial demo plots soon!
Yes there is lots of simulated noise. I hope it's not too much - it's not
There is an expected decline in biomass at the end of the season - leaves
No I don't think it will be too much noise, and thanks for the explanation of the biomass downturns! Makes perfect sense. It's actually somewhat relieving because I had been a bit worried about putting monotonically increasing constraints into the GP predictions for biomass (it's possible but not straightforward); we won't need to do so!