Modify all summary functions to take into account design information (i.e., weights). #3

Closed
krivit opened this issue May 20, 2017 · 26 comments

Comments

@krivit
Collaborator

krivit commented May 20, 2017

The simplest way to do that is to use weights(egor) to get the list of ego weights according to the design. Note that the weights are not normalized: they do not necessarily sum to 1.

A more advanced use is svymean(x, ego.design(egor)), which takes a vector or a matrix x and evaluates its sample mean in accordance with the survey design.
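
As a rough illustration of the two approaches above, here is a minimal sketch that uses the survey package directly. The toy data frame, its column names, and the svydesign() call standing in for ego.design(egor) are assumptions made for the example, not egor's actual internals.

```r
library(survey)

# Toy ego-level data: one row per ego, with a hypothetical sampling weight.
egos <- data.frame(
  egoID   = 1:4,
  netsize = c(3, 5, 2, 6),
  w       = c(1, 2, 1, 4)   # unnormalized: the weights sum to 8, not 1
)

# Stand-in for ego.design(egor): independent sampling with unequal weights.
design <- svydesign(ids = ~1, weights = ~w, data = egos)

# Design weights, analogous to weights(egor).
wts <- weights(design)

# Design-based mean of network size, with a standard error.
m <- svymean(~netsize, design)

# The point estimate agrees with a plain weighted mean of the same column.
weighted_mean <- weighted.mean(egos$netsize, egos$w)
```

The advantage of svymean() over weighted.mean() is that it also reports a design-based standard error rather than only the point estimate.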

@krivit
Collaborator Author

krivit commented May 20, 2017

@tilltnet, @raffaelevacca, this one might have to be you: I only understand the data structure and survey design parts of egor.

tilltnet added a commit that referenced this issue May 20, 2017
…lating average netzsize and average density.
@tilltnet
Owner

tilltnet commented May 20, 2017

svymean(x, ego.design(egor))

I included this in summary.egor.

All other analysis functions calculate per-ego values from alters/alter_ties, so I think there is no need to include weights there, but I should include examples using the weights for summarising, e.g. compositional results with weighted.mean(). EDIT: or the svymean() thing, of course!

@tilltnet tilltnet self-assigned this May 20, 2017
@krivit
Collaborator Author

krivit commented Jan 15, 2020

I've changed the code so that egor$ego is now a tbl_svy object with all that that entails. Do the functions that call for weights still work?

Edit: The changes are in the ego_srvyr branch.

@krivit
Collaborator Author

krivit commented Sep 10, 2020

As far as I can gather, we have two kinds of summary functions:

  1. Those that calculate some statistic for each ego and return a tibble with the egoID and the statistic.
  2. Those that calculate the summary over all egos together.

I am not actually sure which, except for the dplyr functions, are of the second type.

Based on this, I think that in order to properly incorporate sampling weights, we should probably have functions of the first type return a tbl_svy object instead of a tibble, so that the user could then use sampling-aware summaries. Any thoughts?

@tilltnet
Owner

  1. I think the first type is the main kind we have to take care of. I think it'd be easiest to have an internal function, inserted at the end of each of these summary/analytics functions, that checks whether there is an ego.design and, if there is, converts the result into a tbl_svy that includes the ego.design.
  2. Currently there really is only one function that provides summary stats for the whole data set: summary.egor(). It computes the mean network size and the mean density. I can update this so that it takes the ego.design into account again.

@krivit
Collaborator Author

krivit commented Sep 15, 2020

With #53 being resolved, we can now modify the first type of functions to return data of the appropriate type. I am increasingly of the opinion that we should not output different formats depending on whether the underlying egor has ego design information, because, unfortunately, tbl_svy does not provide a very transparent interface to tibble; in particular, $ indexes the underlying data structure, not the table. For this reason, I ultimately decided to have as_tibble and as_alters_df always return a tibble, as_survey and as_alters_survey always a tbl_svy, and so on.

In light of this, we need some way to control the output format for the first type, such as EI. Two additional arguments come to mind:

  • survey=logical: TRUE, FALSE, or NA. If TRUE, always return a tbl_svy (even if the underlying egor does not have design information), if FALSE, always drop it, and if NA, inherit.
  • output=c("tibble","df","survey","inherit"): "df" is an alias for "tibble". Others are self-explanatory.
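
To make the second proposal concrete, here is a minimal sketch of how such an output argument might be resolved inside an ego-level function. The helper name resolve_output() and its has_design argument are hypothetical and only illustrate the dispatch logic; nothing here is egor's actual code.

```r
# Hypothetical helper: decide the return format of an ego-level summary.
# `has_design` says whether the egor object carries ego design information.
resolve_output <- function(output = c("tibble", "df", "survey", "inherit"),
                           has_design) {
  output <- match.arg(output)

  # "df" is just an alias for "tibble".
  if (output == "df") output <- "tibble"

  # "inherit" follows the input: a survey object only if a design exists.
  if (output == "inherit") {
    output <- if (has_design) "survey" else "tibble"
  }

  output
}

resolve_output("inherit", has_design = TRUE)   # resolves to "survey"
resolve_output("df", has_design = TRUE)        # resolves to "tibble"
```

With match.arg(), the first element of the choices vector is the default, so the order of the choices decides what happens when the user does not pass the argument at all.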

Any thoughts?

@tilltnet
Owner

tilltnet commented Sep 15, 2020

Both options are good I think. I have a slight preference for the first. I would use TRUE/FALSE/NULL not NA. Following your argument the default would be FALSE?

Stepping back a bit though, I am not sure if this offers the right workflow to people who have an ego.design. If I have an ego.design I probably want to use it for all of my results, so I'd have to always be explicit about that, choosing specific functions and setting arguments.

The way I teach it in the workshop is that ego-level results should be joined to the $ego tibble. If the same could be easily done with the results that come from an egor object with an ego.design, a summarization of the results could then easily incorporate the ego.design. Since srvyr doesn't allow joins, maybe we could have a join_results() function that works like a left_join(), has the egor object as the first argument and the results as the second.

Long story short, I think if we go in this direction we should also consider implementing a left_join()-like function, as you suggested in #53, that works with tbl_svy objects.

@krivit
Collaborator Author

krivit commented Sep 16, 2020

> Both options are good I think. I have a slight preference for the first. I would use TRUE/FALSE/NULL not NA. Following your argument the default would be FALSE?

Either works.

> Stepping back a bit though, I am not sure if this offers the right workflow to people who have an ego.design. If I have an ego.design I probably want to use it for all of my results, so I'd have to always be explicit about that, choosing specific functions and setting arguments.

By that argument, should the default be NA/NULL?

> The way I teach it in the workshop is that ego-level results should be joined to the $ego tibble. If the same could be easily done with the results that come from an egor object with an ego.design, a summarization of the results could then easily incorporate the ego.design. Since srvyr doesn't allow joins, maybe we could have a join_results() function that works like a left_join(), has the egor object as the first argument and the results as the second.

Do you mean that they should return an egor object with the egor$ego tibble augmented with additional columns? Then we already have methods for that. Suppose that x is our original egor object, and res is a tibble with an .egoID column and whatever ego-level index we've computed, and there are no duplicated .egoIDs. Then,

x %>% activate("ego") %>% left_join(res, ".egoID")

should get the desired result.

If we just want a survey with two variables, the ego ID and the result, it becomes

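# x is the egor object from above; x$ego is its tbl_svy
out <- x$ego
# keep only the ego ID, then attach the computed result
out$variables <- left_join(out$variables[".egoID"], res, ".egoID")
out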

> Long story short, I think if we go in this direction we should also consider implementing a left_join()-like function, as you suggested in #53, that works with tbl_svy objects.

I opened a ticket on srvyr back in January (gergness/srvyr#65). In principle, it should be possible to implement the left_join() and the inner_join() methods for tbl_svy since the underlying survey package handles indexing intelligently.

@tilltnet
Owner

Ok, I looked into it and joining results to the egor object works fine, as demonstrated by your example above.

Given this I think we should add the survey argument to all summary/analysis functions that return ego level results and the options would be TRUE/FALSE/NULL. Defaulting to FALSE. Or would you prefer another default?

@krivit
Collaborator Author

krivit commented Oct 12, 2020

Apologies for the slow reply... I like the neutral (NULL/NA) as the default better in principle, but in practice, I suspect most people will be expecting FALSE.

@krivit
Collaborator Author

krivit commented Oct 12, 2020

Actually, could we make it an options() option?

@tilltnet
Owner

For the maintenance of the package using a global option with options() sounds like a good idea to me. So we could have a global option that is called maybe "egor.return.results.with.design". And we could handle this the same as described above. When not set or NULL we inherit, TRUE returns the results as srvyr object, and when FALSE the design is discarded when returning the results. As for the default value, I am a bit unsure currently, but I think we can finalize the decision later on. I'll start working on this today and for now, I think I'll go with 'NULL' as the default.
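
The three-way behavior described above could look roughly like the following sketch. The helper name keep_design() and its has_design argument are hypothetical; only the option name and the NULL/TRUE/FALSE semantics come from the comment itself.

```r
# Hypothetical helper: should an ego-level result carry the survey design?
# `has_design` says whether the egor object has an ego design at all.
keep_design <- function(has_design) {
  opt <- getOption("egor.return.results.with.design")

  if (is.null(opt)) {
    # Option unset (or NULL): inherit from the egor object.
    has_design
  } else {
    # TRUE: return a survey object; FALSE: discard the design.
    isTRUE(opt)
  }
}

# Unset option: the result follows the input object.
options(egor.return.results.with.design = NULL)
keep_design(has_design = TRUE)

# Explicit FALSE: the design is dropped even if the object has one.
options(egor.return.results.with.design = FALSE)
keep_design(has_design = TRUE)
```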

tilltnet added a commit that referenced this issue Oct 22, 2020
@mbojan
Collaborator

mbojan commented Oct 22, 2020

Just a quick note: NULL is equivalent to an option being absent:

options( foobleh = NULL )
o <- options()
"foobleh" %in% names(o)
## FALSE
getOption("foobleh")
## NULL

@krivit
Collaborator Author

krivit commented Oct 22, 2020

Good point. That's a part of why I prefer NA.
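
For contrast with the NULL case, a small sketch of what happens with NA, reusing the made-up option name foobleh from the example above: an option set to NA stays in the options list, so "explicitly neutral" can be told apart from "never set".

```r
options(foobleh = NA)
"foobleh" %in% names(options())
## TRUE
getOption("foobleh")
## NA

# Clean up: setting the option to NULL removes it again.
options(foobleh = NULL)
```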

@mbojan
Collaborator

mbojan commented Oct 22, 2020

If the option is needed in a (hopefully) single place in the code, then simply getOption("egor.return_with_design", FALSE) can be used to set the default. Otherwise, to have the defaults stored in a single place, we can have a non-exported list, say egor_option_defaults <- list(egor.return_with_design = FALSE), and in the functions use something like getOption("egor.return_with_design", egor_option_defaults$egor.return_with_design)...
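
A minimal sketch of the second variant, with the defaults kept in one list and read through a small accessor. The accessor name egor_option() is hypothetical; the list and the getOption() fallback follow the suggestion above.

```r
# Non-exported list holding every package default in one place.
egor_option_defaults <- list(egor.return_with_design = FALSE)

# Hypothetical accessor: the user's setting wins, otherwise the package default.
egor_option <- function(name) {
  getOption(name, default = egor_option_defaults[[name]])
}

# Nothing set by the user: falls back to the package default.
options(egor.return_with_design = NULL)
egor_option("egor.return_with_design")

# A user setting (e.g. from an Rprofile) takes precedence.
options(egor.return_with_design = TRUE)
egor_option("egor.return_with_design")
```

Because the default is looked up at call time, nothing is ever written into the user's options, so there is nothing to overwrite.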

@tilltnet
Owner

> Just a quick note: NULL is equivalent to an option being absent:

I think that is an advantage in this case. If we want 'inherit' to be the standard, we don't need to set the option anywhere, and only the user will have to set it if they want something other than the default.

The implementation I pushed in a666c93 differs a bit from what was said above. The options now are to inherit the tbl_svy or not. Returning a tbl_svy when no ego_design is present seemed useless to me, but I might be wrong?!?

Currently I am not setting the global option and the default is 'inherit'. But as I said above, I am not completely sure what the default should be.

To summarize pros and cons from the previous discussion:

Default = inherit

Pros:

  • summary stats on results can be fed into functions that take weights into account

Cons:

  • egor workflow of binding results to the ego dataframe is a bit more complicated (currently requires copy = TRUE) but that could be smoothed out
  • results are harder to inspect, as the tbl_svy does not print the actual values but rather only the sampling design info

Default = plain tibble

Pros:

  • results can be easily joined to ego data

Cons:

  • sampling design not present in results at all

Is there anything else?

@tilltnet
Owner

> If the option is needed in a (hopefully) single place in the code, then simply getOption("egor.return_with_design", FALSE) can be used to set the default. Otherwise, to have the defaults stored in a single place, we can have a non-exported list, say egor_option_defaults <- list(egor.return_with_design = FALSE), and in the functions use something like getOption("egor.return_with_design", egor_option_defaults$egor.return_with_design)...

The option is only accessed in one place currently. I was thinking of using .onLoad or .onAttach if we want to set it. But we could just change the behavior for when the option is NULL to whatever we want to be the default.

@mbojan
Collaborator

mbojan commented Oct 23, 2020

Using the hooks for setting options is a bad idea. .onAttach will not set it if the package is just imported (e.g., by ergm.ego). Both will override potential user settings in, e.g., Rprofile.

@krivit
Collaborator Author

krivit commented Oct 23, 2020

The options set in .onLoad() will work even if the package is not attached.

It's possible to set the options without clobbering existing ones using some code along the following lines (currently used in ergm):

  OPTIONS <- list(ergm.eval.loglik=TRUE,
                  ergm.loglik.warn_dyads=TRUE,
                  ergm.cluster.retries=5)
  current <- names(options())
  for(opt in names(OPTIONS)){
    if(! opt%in%current){
      do.call(options, OPTIONS[opt])
    }
  }

@mbojan
Collaborator

mbojan commented Oct 23, 2020

If the user sets the option in say Rprofile (as is common to set options per user or per project), .onLoad will overwrite it.

Edit: OK, your code will not overwrite it. Still, I think having the defaults in the package namespace, as I showed earlier, is much cleaner.

@krivit
Collaborator Author

krivit commented Oct 23, 2020

The defaults can go anywhere .onLoad() can see them.

By the way, here's an even simpler implementation, assuming PKGOPTIONS is set somewhere:

do.call(options, PKGOPTIONS[setdiff(names(PKGOPTIONS), names(options()))])

@krivit
Collaborator Author

krivit commented Oct 23, 2020

I've just implemented a function statnet.common::default_options() which wraps options() to avoid overwriting existing settings. It might make sense to import that, or maybe copy to egor.
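
For readers who have not seen such a wrapper, here is a rough sketch of the idea together with an .onLoad() hook that uses it. The function set_default_options() and the option names are hypothetical stand-ins written for this example; this is not the actual statnet.common::default_options() implementation.

```r
# Hypothetical stand-in for a default_options()-style wrapper:
# set only those options that the user has not already set.
set_default_options <- function(...) {
  defaults <- list(...)
  unset <- setdiff(names(defaults), names(options()))
  if (length(unset) > 0) do.call(options, defaults[unset])
  invisible(unset)   # names of the options that were actually set
}

# In a package, the hook would live in the package namespace.
# .onLoad() runs whenever the namespace is loaded, including on import.
.onLoad <- function(libname, pkgname) {
  set_default_options(examplepkg.some_option = TRUE)
}

# A value set beforehand (e.g. in an Rprofile) survives the call.
options(examplepkg.user_choice = "user")
set_default_options(examplepkg.user_choice = "package default")
getOption("examplepkg.user_choice")
```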

@tilltnet
Owner

tilltnet commented Nov 3, 2020

  • Make all summary functions return the results with return_results
    • composition
    • EI
    • comp_ei
    • ego_density
    • comp_ply
    • diversity
    • count dyads
    • ego_constraint
  • make summary.egor use ego_design for means of size and density

@tilltnet
Owner

tilltnet commented Nov 3, 2020

> I've just implemented a function statnet.common::default_options() which wraps options() to avoid overwriting existing settings. It might make sense to import that, or maybe copy to egor.

Thanks, I used that as inspiration.

The defaults are now set with .onLoad() and I added additional options to influence the behavior of print.egor() (see #54).

The default for egor.return.results.with.design currently is TRUE. But I feel like setting it to FALSE makes more sense until srvyr prints tbl_svys in a way where the values are visible.

@krivit
Collaborator Author

krivit commented Nov 10, 2020

Thanks! The name seems a bit verbose. Also, I would suggest replacing the dots after the first with underscores.

@tilltnet
Owner

egor.results_with_design

Would that be better? Since these names are not typed out regularly, but at most once at the beginning of a session, the name should be as verbose as necessary to get across what the option does.
