
Implement a generalisation for nodematch() that tests if the two nodes have any attributes or properties in common or counts them? #481

Open
krivit opened this issue Aug 12, 2022 · 2 comments


@krivit
Member

krivit commented Aug 12, 2022

Term description

This stems from this question on Stack Overflow: generalising it, suppose that each node i has some set A[i] of properties (I am avoiding "attributes", since we use that term elsewhere). We wish to specify a dyadic predictor that, in pseudocode, can be written as x[i,j] = length(intersect(A[i], A[j])) (the number of properties i and j have in common) or x[i,j] = length(intersect(A[i], A[j])) > 0 (whether i and j have any properties in common).

Some examples:

  • A[i] is the set of languages i speaks, and we wish to use an indicator of whether i and j speak at least one common language as a predictor of their interaction. (This is from Stack Overflow.)
  • A[i] is a list of i's hobbies, and we wish to use the number of hobbies i and j have in common to predict acquaintance.
  • A[i] is a list of places i visited over the course of a day (e.g., from a contact diary), and we wish to use the number of common areas visited by i and j to predict whether they had a contact.

This seems like something that can be useful in a variety of circumstances.
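In pseudocode terms, the two statistics are simple to compute once the property sets are in hand. A minimal illustrative sketch in C, with each node's property set stored as an array of integer IDs (all names here are hypothetical, not ergm API):

```c
#include <stddef.h>

/* |A[i] ∩ A[j]|: count the properties two nodes share.  a and b are
 * arrays of distinct property IDs; a naive O(na*nb) scan, which is
 * fine for small property sets. */
static int n_common(const int *a, size_t na, const int *b, size_t nb) {
    int count = 0;
    for (size_t p = 0; p < na; p++)
        for (size_t q = 0; q < nb; q++)
            if (a[p] == b[q]) count++;
    return count;
}

/* Indicator version: do i and j share any property at all? */
static int any_common(const int *a, size_t na, const int *b, size_t nb) {
    return n_common(a, na, b, nb) > 0;
}
```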

A further generalisation of this concept is to make A[i] a mapping that maps property k to some value (e.g., proficiency in a language) so that, e.g., x[i,j] = max[k](min(A[i][k], A[j][k])) (or some other "interacting" and "combining" functions in place of min() and max[k](), respectively). In the language example, this predictor represents the proficiency of the less-proficient actor in the two actors' best common language (where "best common language" is the language in which the less-proficient actor has the highest proficiency).
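With the mapping stored densely (one value per property, 0 for a property the node lacks), the max-over-min statistic reduces to a single pass. A sketch under that assumption (illustrative names, not ergm code):

```c
#include <stddef.h>

/* x[i,j] = max_k min(A[i][k], A[j][k]): min() "interacts" the two
 * nodes' values within a property, and max_k() "combines" across
 * properties.  vi and vj hold one value per property (0 if absent). */
static double max_min(const double *vi, const double *vj, size_t nprop) {
    double best = 0.0;
    for (size_t k = 0; k < nprop; k++) {
        double m = vi[k] < vj[k] ? vi[k] : vj[k];
        if (m > best) best = m;
    }
    return best;
}
```

In the language example, vi[k] would be i's proficiency in language k, and the result is the less-proficient actor's proficiency in the pair's best common language.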

In all cases, this would be a dyad-independent term, so in principle representable with edgecov().

Questions

  1. How broadly useful would this be? I suspect @CarterButts and @mbojan might have some applications I hadn't thought about.
  2. Would the generalisation to a mapping be useful? What "interacting" and "combining" functions would be useful?
  3. What would be an efficient way to implement these?
  4. What kind of a user interface (required data format and syntax) would we want for this term?
@mbojan
Member

mbojan commented Aug 12, 2022

Nice idea. It sounds useful to me, especially if there were some freedom of choice w.r.t. the "similarity function" f() in

f( A[i], A[j] )

where f() could include, next to your examples:

  • Cosine similarity
  • Jaccard coefficient (especially useful if A[i] is a set of binary variables)
  • some of those described in Choi, S.-S., Cha, S.-H., & Tappert, C. C. (2010). A survey of binary similarity and distance measures. Journal of Systemics, Cybernetics and Informatics, 8(1), 43–48.
  • or any R function of two vectors returning a numeric scalar
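For instance, with A[i] a vector of binary indicators, the Jaccard coefficient |A ∩ B| / |A ∪ B| amounts to the following (a C sketch for concreteness; in practice f() would presumably just be an arbitrary R function supplied by the user):

```c
#include <stddef.h>

/* Jaccard coefficient |A ∩ B| / |A ∪ B| for binary property
 * indicators; returns 0 when both sets are empty. */
static double jaccard(const int *a, const int *b, size_t n) {
    int inter = 0, uni = 0;
    for (size_t k = 0; k < n; k++) {
        if (a[k] && b[k]) inter++;
        if (a[k] || b[k]) uni++;
    }
    return uni ? (double)inter / uni : 0.0;
}
```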

Implementation-wise, does such a term essentially have to construct a matrix for edgecov() at the R level, or are there computational "shortcuts" to exploit at the lower level?

@krivit
Member Author

krivit commented Aug 12, 2022

Implementation-wise, does such a term essentially have to construct a matrix for edgecov() at the R level, or are there computational "shortcuts" to exploit at the lower level?

That remains to be seen. If the operation is on a set rather than a mapping, there are a number of ways to represent it, each with different advantages and disadvantages:

  1. Each int can encode set membership for up to 32 properties. Then intersections can be computed with bitwise AND (&) and unions with bitwise OR (|). If there are more than 32 properties, one can use multiple ints per node, though the storage and computational costs grow linearly in the number of properties.
  2. If there are no more than 2^32 distinct properties, each node can store a sorted array of its property IDs; then an algorithm can step through the two nodes' arrays in tandem, testing for common members à la the merge step of merge sort. This method is insensitive to the total number of distinct properties but scales with the average number of properties a node has.

There are probably others.
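For concreteness, the two representations above might look something like this (an illustrative sketch; function names are made up):

```c
#include <stdint.h>
#include <stddef.h>

/* Method 1: membership packed into a 32-bit word, one bit per
 * property.  Intersection is bitwise AND, union is bitwise OR; the
 * popcount of the AND gives |A[i] ∩ A[j]|. */
static int popcount32(uint32_t x) {
    int c = 0;
    while (x) { x &= x - 1; c++; }  /* Kernighan: clear lowest set bit */
    return c;
}
static int n_common_bits(uint32_t a, uint32_t b) {
    return popcount32(a & b);
}

/* Method 2: each node keeps a sorted array of its property IDs; a
 * merge-style tandem scan counts common members in O(na + nb). */
static int n_common_sorted(const uint32_t *a, size_t na,
                           const uint32_t *b, size_t nb) {
    int count = 0;
    size_t p = 0, q = 0;
    while (p < na && q < nb) {
        if (a[p] < b[q]) p++;
        else if (a[p] > b[q]) q++;
        else { count++; p++; q++; }
    }
    return count;
}
```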

For a mapping, Method 2 can be used, with the array of property IDs serving as keys and a parallel array holding the values. (One could also store the values in a dense vector with one element per property for each node, analogously to Method 1, but then one loses both the compactness of the one-bit-per-property representation and the speed of bitwise operations.)
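A sketch of that mapping variant of Method 2, with sorted key arrays and parallel value arrays (illustrative names; min() and max() stand in for the interacting and combining functions):

```c
#include <stddef.h>

/* x[i,j] = max_k min(A[i][k], A[j][k]) over properties present in
 * both mappings.  ka/kb are sorted ascending property IDs; va/vb are
 * the parallel value arrays (e.g. proficiencies).  Returns 0 if the
 * two nodes share no property. */
static double best_common(const int *ka, const double *va, size_t na,
                          const int *kb, const double *vb, size_t nb) {
    double best = 0.0;
    size_t p = 0, q = 0;
    while (p < na && q < nb) {
        if (ka[p] < kb[q]) p++;
        else if (ka[p] > kb[q]) q++;
        else {  /* common property: take the lesser value */
            double m = va[p] < vb[q] ? va[p] : vb[q];
            if (m > best) best = m;
            p++; q++;
        }
    }
    return best;
}
```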
