Data API design questions #119
Comments
|
A change in the direction described above would also allow time-series only analyses without the need to create a covariates table. So, if the dataset is just an aggregation to time-series and counts (a common format), the current API requires splitting this into two tables and then cross-tabbing one of them. The alternative design would just consume the table as is and use the named columns directly, e.g.,
r_LDATS <- LDA_TS(data, time_col = newmoon, word_col = abundance,
                  topics = 2:5, nseeds = 2, changepoints = 1,
                  formulas = ~time + sin(time) + cos(time)) |
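To make the "consume the table as is" idea concrete, here is a minimal base R sketch of the cross-tab step that a long-format API would have to perform internally. The helper `long_to_dtm()`, its argument names, and the toy data are assumptions for illustration only; none of them are part of LDATS.

```r
# Hypothetical long-format input: one row per (time step, species) count.
long <- data.frame(
  newmoon   = c(1, 1, 2, 2, 3, 3),
  species   = c("DM", "PP", "DM", "PP", "DM", "PP"),
  abundance = c(5, 2, 3, 4, 0, 6)
)

# Hypothetical helper (not part of LDATS): cross-tab a long table into a
# document-term matrix with one row per time step and one column per word.
long_to_dtm <- function(data, time_col, word_col, count_col) {
  # Builds the formula count_col ~ time_col + word_col from column names.
  f <- stats::reformulate(c(time_col, word_col), response = count_col)
  tab <- stats::xtabs(f, data = data)
  dtm <- unclass(tab)          # plain numeric matrix with dimnames
  attr(dtm, "call") <- NULL    # drop the xtabs bookkeeping attribute
  dtm
}

dtm <- long_to_dtm(long, "newmoon", "species", "abundance")
```

Because the column names are passed explicitly, the function never has to guess which dimension holds documents and which holds words.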
|
After chatting with @juniperlsimonis last week, I thought I'd add my thoughts re: the From my perspective, the main thing is that we've adopted the The other thing is that capacity for taking a table of multiple communities and subsetting them within |
|
I'd be happy to set up some time to chat through the API with anyone who's interested and see if we can come up with the best solution. Since neither MATSS nor LDATS is at 1.0 or being used outside the group, now is definitely the time to figure out what the core API should look like. I'm fine with LDATS only handling one community at a time, but we want to make sure that it's easy to do that with the most common data formats. That said, we may also want to include a convenience wrapper function at some point that handles the multi-community version to make it as easy as possible for end users. |
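A convenience wrapper of the kind mentioned above could be as small as a split-and-apply step. The sketch below uses base R only; `fit_one_community()` is a hypothetical stand-in for a single-community analysis such as LDA_TS, and `fit_by_community()`, its arguments, and the toy data are likewise assumptions, not existing LDATS or MATSS functions.

```r
# Hypothetical stand-in for a single-community analysis; here it only
# reports the number of distinct time steps and the total count.
fit_one_community <- function(data) {
  list(n_times = length(unique(data$year)), total = sum(data$count))
}

# Hypothetical wrapper: split a multi-community table on a grouping column
# and apply the single-community function to each piece.
fit_by_community <- function(data, site_col, fit_fun = fit_one_community, ...) {
  pieces <- split(data, data[[site_col]])
  lapply(pieces, fit_fun, ...)
}

multi <- data.frame(
  site  = c("A", "A", "A", "B", "B"),
  year  = c(2001, 2001, 2002, 2001, 2002),
  count = c(3, 1, 4, 2, 7)
)
fits <- fit_by_community(multi, site_col = "site")
```

Keeping the core function single-community and layering the split on top means the multi-site case adds no complexity to the core API.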
|
I'd definitely be game to chat through data options. |
|
If
|
|
If we consume wide data this would be an option for the multi-site situation:
And |
|
currently, the API is
inputs topics, nseeds, formulas, nchangepoints allow for expansion |
|
inputs in
where the two data sets are managed as [1] response variable in the formula for the term table and [2] |
|
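Abundance input as long-form multi-site data, reshaped into wide-form for LDA_TS:
library(dplyr)  # filter(), group_split(), %>%
library(tidyr)  # spread()
ts_data = read.csv('ts_data.csv')
covar_data = read.csv('covar_data.csv')
ts_data %>%
  filter(species != 'bob') %>%     # drop an unwanted species
  spread(species, abundance) %>%   # long to wide: one column per species
  group_split(site) %>%            # one table per site
  purrr::map(LDA_TS, covariates = covar_data) |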
|
Abundance and covariates for a single-site, cross-tab format as a combined data structure in MATSS format:
ts_data <- read.csv("ts_data.csv")
covar_data <- read.csv("covar_data.csv")
dat <- list(abundance = ts_data,
            covariates = covar_data)
LDA_TS(dat) # or maybe LDA_TS(data = dat) |
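If the two tables travel together in one list, a function consuming it can check up front that they are compatible. The following is a rough sketch of such a check; `check_combined_data()` is a hypothetical helper (not an LDATS or MATSS function), and the element names `abundance` and `covariates` are taken from the example above.

```r
# Hypothetical validator: checks that a combined data object holds both
# tables and that they have one row per document in each.
check_combined_data <- function(dat) {
  stopifnot(is.list(dat),
            all(c("abundance", "covariates") %in% names(dat)))
  if (nrow(dat$abundance) != nrow(dat$covariates)) {
    stop("abundance and covariates must have the same number of rows")
  }
  invisible(dat)
}

# Toy data for illustration only.
dat <- list(
  abundance  = data.frame(sp1 = c(3, 0, 5), sp2 = c(1, 4, 2)),
  covariates = data.frame(time = 1:3, precip = c(10, 20, 30))
)
check_combined_data(dat)
```

A row-count check catches mismatched tables but not mismatched row order, so it complements rather than replaces matching on a shared time column.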
|
take home from the conversation: the API for *the two data structures might be combined into a single object/list |
|
The currently drafted v0.2.0 now has the API for |
The package currently consumes two data objects:
Assembling these two objects for commonly formatted data would require some potentially fragile work and so I'm wondering if it's worth having a conversation about the data API at least for the top-level LDA_TS function.

I'm envisioning most users having long data in the general form of year, species, count and year, covariate_value (often with a site variable for both as well which in concept can be grouped by). To use with the current API this would require cross-tabbing the first table and if the sorts on the two tables aren't the same for some reason this will produce the wrong answer (hence my concern about this being a bit fragile).

My first thought was that the data should be long in both cases. I can see why this wasn't the initial implementation because it comes with its own set of issues, specifically that using long data would require passing the names of the "words" and "documents" columns so that the LDA step understands what it is supposed to work with. That said, I think this is more robust than assuming that the rows are the documents and the columns are the words since it's easy enough to mess up the cross-tabbing and get the components switched around. Given that this package is explicitly temporal, we could also use the opportunity to codify that in the API and outputs by passing the timename directly rather than via control = TS_controls_list(timename = "time_column").

I definitely don't fully understand the under-the-hood stuff that might make this change more complicated and think this is probably a pretty in-depth discussion, so I'd be happy to set up some time with whoever is interested to talk through the optimal API design (which could end up being what's already here).
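The sort-order concern can be reproduced with a few lines of base R. The sketch below uses made-up data and column names (`year`, `species`, `count`, `precip`) purely as assumptions for illustration. It shows that cross-tabbing reorders the documents, and that matching on the shared time key is one way to keep the covariates aligned.

```r
# Hypothetical long-format counts; rows are deliberately not sorted by year.
counts <- data.frame(
  year    = c(2003, 2003, 2001, 2001, 2002, 2002),
  species = c("a", "b", "a", "b", "a", "b"),
  count   = c(9, 1, 4, 6, 2, 8)
)
# Hypothetical covariate table, in yet another row order.
covars <- data.frame(
  year   = c(2002, 2003, 2001),
  precip = c(20, 30, 10)
)

# Cross-tabbing sorts documents by year: rows come out as 2001, 2002, 2003.
dtm <- unclass(xtabs(count ~ year + species, data = counts))

# Pairing by row position would put 2001 counts next to 2002 precipitation.
# Matching on the shared key makes the alignment explicit instead.
idx <- match(rownames(dtm), as.character(covars$year))
stopifnot(!anyNA(idx))  # every document needs a covariate row
covars_aligned <- covars[idx, , drop = FALSE]
```

An API that takes the time column name directly could do this matching internally, which removes the silent failure mode described above.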