Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New Algorithm for Top-K Closeness Centralities #19031

Closed
sagetrac-borassi mannequin opened this issue Aug 14, 2015 · 40 comments
Closed

New Algorithm for Top-K Closeness Centralities #19031

sagetrac-borassi mannequin opened this issue Aug 14, 2015 · 40 comments

Comments

@sagetrac-borassi
Copy link
Mannequin

sagetrac-borassi mannequin commented Aug 14, 2015

Implement the algorithm in http://arxiv.org/abs/1507.01490 for computing the k most central vertices according to closeness centrality.

Depends on #19007
Depends on #19014

CC: @nathanncohen @dcoudert

Component: graph theory

Keywords: Closeness Centrality, top-k

Author: Michele Borassi

Branch/Commit: 1367f00

Reviewer: David Coudert

Issue created by migration from https://trac.sagemath.org/ticket/19031

@sagetrac-borassi sagetrac-borassi mannequin added this to the sage-6.9 milestone Aug 14, 2015
@sagetrac-borassi sagetrac-borassi mannequin added the p: major / 3 label Aug 14, 2015
@sagetrac-borassi
Copy link
Mannequin Author

sagetrac-borassi mannequin commented Aug 14, 2015

Changed keywords from none to Closeness Centrality, top-k

@sagetrac-borassi

This comment has been minimized.

@sagetrac-borassi
Copy link
Mannequin Author

sagetrac-borassi mannequin commented Aug 14, 2015

Author: Michele Borassi

@sagetrac-borassi
Copy link
Mannequin Author

sagetrac-borassi mannequin commented Aug 14, 2015

Dependencies: #19007,#19014

@sagetrac-borassi
Copy link
Mannequin Author

sagetrac-borassi mannequin commented Aug 17, 2015

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Aug 17, 2015

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

48514adMoved centrality_closeness to centrality.pyx
cf9cf87Trailing whitespaces
4174764Reviewer's suggestions
e9c1c60Merged with #19007 and #19014
148d26cTemp
09d67a9First draft

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Aug 17, 2015

Commit: 09d67a9

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Aug 17, 2015

Changed commit from 09d67a9 to a83cdf0

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Aug 17, 2015

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

131c421Temp
2693f52First draft of Tarjan's SCC algorithm
3238039Reviewer's suggestions
61bca8aReviewer's suggestions and small improvements
af07ceeReviewer's suggestions
a83cdf0First draft

@sagetrac-borassi
Copy link
Mannequin Author

sagetrac-borassi mannequin commented Aug 17, 2015

comment:5

Hello!

This is the first draft of this new closeness centrality algorithm. It is able to conclude in reasonable time even on very big graphs, with millions of nodes and billions of edges. The new algorithm is in the last commit (while all previous commits simply merge with dependent tickets).

See you,

Michele

@dcoudert
Copy link
Contributor

comment:6

A first round of comments.

In method _estimate_reachable_vertices_dir

  • The lower bound is stored in reachableL[v] -> The lower bound is stored in reachL[v]
  • Although the method will not be tested, could you please add an INPUT section specifying the type and expected size of arrays. You could also briefly document the algorithm implemented in the method.
  • please improve the presentation of the block with all variable declarations. It is hard to read since it constains lots of important calls to various methods. For instance avoid cdef int i, nscc = tarjan_strongly_connected_components_Cand prefer cdef int nscc = tarjan_strongly_connected_components_C
  • What's the need for using type uint64_t instead of int? the corresponding variables are later cast to int.
  • You could use maxscc = max(scc_sizes[i] for i in range(nscc)

For method _compute_reachable_vertices_undir, you should add some documentation at the beginning of the method.

Method counting sort: no comment, but in the future we should certainly create a method somewhere.

In method def centrality_closeness_top_k(G, int k=1, int verb=0)

  • verb -> verbose
  • the tests for the validity of the results are excessive. There is no need for the loop range(1000). Testing for one randomly chosen graph with say n=20 nodes and random.randint(1, n*(n-1) / 2) edges is sufficient.
  • we can do that for x in sorted_vert[:n]: when sorted_vert is an array of ints ???
  • if directed: gamma = 0 else: gamma = 1 -> gamma = 0 if directed else 1
  • when stopped == True you don't quit the for j in range(layer_current_beginning,layer_current_end):. May be you should replace this loop with a while.
  • you set gamma = 0. So you need gamma==1 only once for undirected graphs. Correct?

David.

@sagetrac-borassi
Copy link
Mannequin Author

sagetrac-borassi mannequin commented Aug 18, 2015

comment:7

Replying to @dcoudert:

A first round of comments.

In method _estimate_reachable_vertices_dir

  • The lower bound is stored in reachableL[v] -> The lower bound is stored in reachL[v]

Done!

  • Although the method will not be tested, could you please add an INPUT section specifying the type and expected size of arrays. You could also briefly document the algorithm implemented in the method.

Done!

  • please improve the presentation of the block with all variable declarations. It is hard to read since it constains lots of important calls to various methods. For instance avoid cdef int i, nscc = tarjan_strongly_connected_components_Cand prefer cdef int nscc = tarjan_strongly_connected_components_C

Done!

  • What's the need for using type uint64_t instead of int? the corresponding variables are later cast to int.

Well, it's a bit tricky. The point is that in the recursion for reachU_scc I might find values that are much bigger than g.n. Since I did not want to test at each step if this occurs, I used uint64_t in order to avoid overflow (these values are smaller than g.n^2). Then, only at the end, I use min (so that now the values do not overflow), and I set reachU[i] = <int> min(reachU_scc[scc[i]], g.n). However, you are right with reachL, and I modified it.

  • You could use maxscc = max(scc_sizes[i] for i in range(nscc)

I don't think so, maxscc is not the maximum size of a SCC, but it is the SCC with maximum size. In other words, your instruction outputs scc_sizes[maxscc], not maxscc. Am I missing something?

For method _compute_reachable_vertices_undir, you should add some documentation at the beginning of the method.

Done

Method counting sort: no comment, but in the future we should certainly create a method somewhere.

Agree!

In method def centrality_closeness_top_k(G, int k=1, int verb=0)

  • verb -> verbose

Done!

  • the tests for the validity of the results are excessive. There is no need for the loop range(1000). Testing for one randomly chosen graph with say n=20 nodes and random.randint(1, n*(n-1) / 2) edges is sufficient.

Done! I left it because I had to test it 100,000 times to find the last bug...

  • we can do that for x in sorted_vert[:n]: when sorted_vert is an array of ints ???

Yep: !http://docs.cython.org/src/reference/language_basics.html

  • if directed: gamma = 0 else: gamma = 1 -> gamma = 0 if directed else 1

Done!

  • when stopped == True you don't quit the for j in range(layer_current_beginning,layer_current_end):. May be you should replace this loop with a while.

I added an 'if' at the end of the loop.

  • you set gamma = 0. So you need gamma==1 only once for undirected graphs. Correct?

Hmm, correct, but I realized that I could do a little better by slightly changing the implementation. So this does not apply anymore.

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Aug 18, 2015

Branch pushed to git repo; I updated commit sha1. New commits:

a0c1b14Reviewer's suggestions
2c73801Small improvements

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Aug 18, 2015

Changed commit from a83cdf0 to 2c73801

@dcoudert
Copy link
Contributor

comment:9
  • What's the need for using type uint64_t instead of int? the corresponding variables are later cast to int.

Well, it's a bit tricky. The point is that in the recursion for reachU_scc I might find values that are much bigger than g.n. Since I did not want to test at each step if this occurs, I used uint64_t in order to avoid overflow (these values are smaller than g.n^2). Then, only at the end, I use min (so that now the values do not overflow), and I set reachU[i] = <int> min(reachU_scc[scc[i]], g.n). However, you are right with reachL, and I modified it.

Is it mentioned somewhere?

  • You could use maxscc = max(scc_sizes[i] for i in range(nscc)

I don't think so, maxscc is not the maximum size of a SCC, but it is the SCC with maximum size. In other words, your instruction outputs scc_sizes[maxscc], not maxscc. Am I missing something?

Right, it should be maxscc = max((scc_sizes[i],i) for i in range(nscc))[1], but this is ugly.

  • the tests for the validity of the results are excessive. There is no need for the loop range(1000). Testing for one randomly chosen graph with say n=20 nodes and random.randint(1, n*(n-1) / 2) edges is sufficient.

Done! I left it because I had to test it 100,000 times to find the last bug...

That's a lot!
However, I don't like the n = random.randint(2,20). You could safely let n=20, no?

  • we can do that for x in sorted_vert[:n]: when sorted_vert is an array of ints ???

Yep: !http://docs.cython.org/src/reference/language_basics.html

That's funny.

Can we always assume that farness[v]>0? This is for the return sorted([(1.0/farness[v],....

Instead of G.vertices()[v], you should add a line like cdef list V = G.vertices() before the return statement and then use V[v].

About the behavior when k>=n. What you propose is in fact to return the centrality closeness (as a list of couples (centrality,vertex) ).

  • Is your method faster than centrality_closeness in this case?
  • This behavior should be documented for the parameter k

That's all for now ;)

David.

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Aug 19, 2015

Changed commit from 2c73801 to 06d2c00

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Aug 19, 2015

Branch pushed to git repo; I updated commit sha1. New commits:

06d2c00Other reviewer's suggestions

@sagetrac-borassi
Copy link
Mannequin Author

sagetrac-borassi mannequin commented Aug 19, 2015

comment:11
  • What's the need for using type uint64_t instead of int? the corresponding variables are later cast to int.

Well, it's a bit tricky. The point is that in the recursion for reachU_scc I might find values that are much bigger than g.n. Since I did not want to test at each step if this occurs, I used uint64_t in order to avoid overflow (these values are smaller than g.n^2). Then, only at the end, I use min (so that now the values do not overflow), and I set reachU[i] = <int> min(reachU_scc[scc[i]], g.n). However, you are right with reachL, and I modified it.

Is it mentioned somewhere?

I added two inline comments to explain it.

  • the tests for the validity of the results are excessive. There is no need for the loop range(1000). Testing for one randomly chosen graph with say n=20 nodes and random.randint(1, n*(n-1) / 2) edges is sufficient.

Done! I left it because I had to test it 100,000 times to find the last bug...

That's a lot!
However, I don't like the n = random.randint(2,20). You could safely let n=20, no?

Done!

Can we always assume that farness[v]>0? This is for the return sorted([(1.0/farness[v],....

Yep.

The only instructions that modify the farness are farness[x] = n (if a visit is stopped) and farness[x] = ((<double> f) * (n-1)) / (<double>(nd-1) * (nd-1)), where f is a (positive) sum of distances, nd-1 is positive because nd is the number of reachable vertices from a vertex x. If nd==1, it means that the out-degree of x is 0, and we skip the analysis of x, so this case never occurs. Then, n>=nd, and hence also n-1 is positive.

Instead of G.vertices()[v], you should add a line like cdef list V = G.vertices() before the return statement and then use V[v].

Done!

About the behavior when k>=n. What you propose is in fact to return the centrality closeness (as a list of couples (centrality,vertex) ).

  • Is your method faster than centrality_closeness in this case?

No, it should be the same (actually, a bit slower because we make some computations during the BFSes).

  • This behavior should be documented for the parameter k

Done!

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Aug 19, 2015

Changed commit from 06d2c00 to 84031c3

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Aug 19, 2015

Branch pushed to git repo; I updated commit sha1. New commits:

84031c3Added documentation for k

@dcoudert
Copy link
Contributor

comment:13

About the behavior when k>=n. What you propose is in fact to return the centrality closeness (as a list of couples (centrality,vertex) ).

  • Is your method faster than centrality_closeness in this case?

No, it should be the same (actually, a bit slower because we make some computations during the BFSes).

So you should add a test if k>=n at the beginning of the method and if True call the other method and format the result. You could also add a test n==0 and/or n==1 where the result is obvious.

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Aug 19, 2015

Branch pushed to git repo; I updated commit sha1. This was a forced push. Last 10 new commits:

0f855c6Reviewer's suggestions
1ec3f44Merged with beta2
cbe780cReviewer's suggestions and small improvements
4174764Reviewer's suggestions
0c19024Use check_allocarray
9167071Used limits instead of stdint to find maximum int value
83b8057Small corrections
1551c6fUndo commit moving closeness_centrality to centrality.pyx. Corrected links.
fd0580aMerged with #19007
5c1905fNew algorithm for closeness centrality

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Aug 19, 2015

Changed commit from 84031c3 to 5c1905f

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Aug 19, 2015

Changed commit from 5c1905f to ced875f

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Aug 19, 2015

Branch pushed to git repo; I updated commit sha1. New commits:

ced875fReviewer's suggestions / added a reference

@sagetrac-borassi
Copy link
Mannequin Author

sagetrac-borassi mannequin commented Aug 19, 2015

comment:16

Helloooooooo!

After we moved centrality_closeness back to generic_graph, I had to rebase this ticket. The new algorithm is still in file centrality.pyx, but it is not available in generic_graph (that is, g.centrality_closeness_top_k() does not work anymore). However, I don't think this is a serious issue, because this algorithm in used in very specialized cases, and the standard Sage user does not need it, in my opinion.

Cheers,

Michele

@dcoudert
Copy link
Contributor

Reviewer: David Coudert

@dcoudert
Copy link
Contributor

comment:17

Is not such a big deal to do

sage: from sage.graphs.centrality import centrality_closeness_top_k
sage: centrality_closeness_top_k(graphs.PetersenGraph(), 3)
[(0.6, 2), (0.6, 1), (0.6, 0)]

and it's documented. Other methods such as hyperbolicity, pathwidth, etc. must be imported for similar reasons. As soon as we will find a reasonable solution for lazy import, we may add it in another ticket.

I set this patch to positive review.

David.

@vbraun
Copy link
Member

vbraun commented Aug 20, 2015

comment:18

On 32-bit Linux:

sage -t --long src/sage/graphs/centrality.pyx
**********************************************************************
File "src/sage/graphs/centrality.pyx", line 588, in sage.graphs.centrality.centrality_closeness_top_k
Failed example:
    centrality_closeness_top_k(g, 5, 1)
Expected:
    Final performance ratio: 0.766666666667
    [(0.36, 5),
     (0.36, 4),
     (0.3333333333333333, 6),
     (0.3333333333333333, 3),
     (0.29032258064516125, 2)]
Got:
    Final performance ratio: 0.805555555556
    [(0.36, 5),
     (0.36, 4),
     (0.3333333333333333, 6),
     (0.3333333333333333, 3),
     (0.29032258064516125, 7)]
**********************************************************************
1 item had failures:
   1 of  40 in sage.graphs.centrality.centrality_closeness_top_k
    [51 tests, 1 failure, 0.16 s]

@dcoudert
Copy link
Contributor

comment:19

Michele,

in this tests, nodes 2 and 7 are both correct answers since we have:

  • on a 64bit machine (i.e. my laptop)
sage: from sage.graphs.centrality import centrality_closeness_top_k
sage: g = graphs.PathGraph(10)
sage: centrality_closeness_top_k(g, 5, 1)
Final performance ratio: 0.766666666667
[(0.36, 5),
 (0.36, 4),
 (0.3333333333333333, 6),
 (0.3333333333333333, 3),
 (0.29032258064516125, 2)]
sage: g.centrality_closeness()
{0: 0.2,
 1: 0.24324324324324326,
 2: 0.2903225806451613,
 3: 0.3333333333333333,
 4: 0.36,
 5: 0.36,
 6: 0.3333333333333333,
 7: 0.2903225806451613,
 8: 0.24324324324324326,
 9: 0.2}
  • on a 32 bits computer (old desktop)
sage: from sage.graphs.centrality import centrality_closeness_top_k
sage: g = graphs.PathGraph(10)
sage: centrality_closeness_top_k(g, 5, 1)
Final performance ratio: 0.805555555556
[(0.36, 5),
 (0.36, 4),
 (0.3333333333333333, 6),
 (0.3333333333333333, 3),
 (0.29032258064516125, 7)]
sage: g.centrality_closeness()
{0: 0.2,
 1: 0.24324324324324326,
 2: 0.2903225806451613,
 3: 0.3333333333333333,
 4: 0.36,
 5: 0.36,
 6: 0.3333333333333333,
 7: 0.2903225806451613,
 8: 0.24324324324324326,
 9: 0.2}

The good point is that the numbers are the same on both machines ;)

One part of the solution is to call centrality_closeness_top_k(g, 4, 1).

However, for the performance ratio it is more difficult.
In fact, we have visited=138 on my mac, and visited=145 on the 32 bits computer. So we have to understand what could change the number of visited nodes depending on 32-64 bits.

David.

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Aug 21, 2015

Branch pushed to git repo; I updated commit sha1. New commits:

5941f86Removed memset with -1 (32bit problem)

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Aug 21, 2015

Changed commit from ced875f to 5941f86

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Aug 21, 2015

Branch pushed to git repo; I updated commit sha1. New commits:

681d482Corrected doctest with 32bit

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Aug 21, 2015

Changed commit from 5941f86 to 681d482

@sagetrac-borassi
Copy link
Mannequin Author

sagetrac-borassi mannequin commented Aug 21, 2015

comment:22

Hello!

I have installed Sage on a 32bit Virtual Machine: the problem was a comparison between (equal) doubles, and probably due to roundings the error popped out. Indeed, both results were correct.

By asking only top-4 in the doctest, we avoid such comparisons, and everything works.

Thank you very much!

Michele

@dcoudert
Copy link
Contributor

comment:24

The patch passes all tests on both my mac (64bits) and a 32bits linux PC.
For me the patch is ok, but is is certainly better to rebase on 9.3.
David.

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Aug 22, 2015

Changed commit from 681d482 to 1367f00

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Aug 22, 2015

Branch pushed to git repo; I updated commit sha1. New commits:

1367f00Merged with 6.9.beta3

@dcoudert
Copy link
Contributor

comment:26

Thanks.

@vbraun
Copy link
Member

vbraun commented Aug 23, 2015

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants