New Algorithm for Top-K Closeness Centralities #19031

sagetrac-borassi · 2015-08-14T12:26:21Z

Implement the algorithm in http://arxiv.org/abs/1507.01490 for computing the k most central vertices according to closeness centrality.

Depends on #19007
Depends on #19014

CC: @nathanncohen @dcoudert

Component: graph theory

Keywords: Closeness Centrality, top-k

Author: Michele Borassi

Branch/Commit: 1367f00

Reviewer: David Coudert

Issue created by migration from https://trac.sagemath.org/ticket/19031

The text was updated successfully, but these errors were encountered:

sagetrac-borassi · 2015-08-14T12:29:35Z

Changed keywords from none to Closeness Centrality, top-k

sagetrac-borassi · 2015-08-14T12:29:35Z

Author: Michele Borassi

sagetrac-borassi · 2015-08-14T12:29:35Z

Dependencies: #19007,#19014

sagetrac-borassi · 2015-08-17T11:15:32Z

Branch: u/borassi/new_algorithm_for_top_k_closeness_centralities

sagetrac-git · 2015-08-17T16:37:47Z

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

`48514ad`	`Moved centrality_closeness to centrality.pyx`
`cf9cf87`	`Trailing whitespaces`
`4174764`	`Reviewer's suggestions`
`e9c1c60`	`Merged with #19007 and #19014`
`148d26c`	`Temp`
`09d67a9`	`First draft`

sagetrac-git · 2015-08-17T16:37:47Z

Commit: 09d67a9

sagetrac-git · 2015-08-17T16:40:14Z

Changed commit from 09d67a9 to a83cdf0

sagetrac-git · 2015-08-17T16:40:14Z

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

`131c421`	`Temp`
`2693f52`	`First draft of Tarjan's SCC algorithm`
`3238039`	`Reviewer's suggestions`
`61bca8a`	`Reviewer's suggestions and small improvements`
`af07cee`	`Reviewer's suggestions`
`a83cdf0`	`First draft`

sagetrac-borassi · 2015-08-17T16:42:34Z

comment:5

Hello!

This is the first draft of this new closeness centrality algorithm. It is able to conclude in reasonable time even on very big graphs, with millions of nodes and billions of edges. The new algorithm is in the last commit (while all previous commits simply merge with dependent tickets).

See you,

Michele

dcoudert · 2015-08-18T12:16:15Z

comment:6

A first round of comments.

In method _estimate_reachable_vertices_dir

The lower bound is stored in reachableL[v] -> The lower bound is stored in reachL[v]
Although the method will not be tested, could you please add an INPUT section specifying the type and expected size of arrays. You could also briefly document the algorithm implemented in the method.
please improve the presentation of the block with all variable declarations. It is hard to read since it constains lots of important calls to various methods. For instance avoid cdef int i, nscc = tarjan_strongly_connected_components_Cand prefer cdef int nscc = tarjan_strongly_connected_components_C
What's the need for using type uint64_t instead of int? the corresponding variables are later cast to int.
You could use maxscc = max(scc_sizes[i] for i in range(nscc)

For method _compute_reachable_vertices_undir, you should add some documentation at the beginning of the method.

Method counting sort: no comment, but in the future we should certainly create a method somewhere.

In method def centrality_closeness_top_k(G, int k=1, int verb=0)

verb -> verbose
the tests for the validity of the results are excessive. There is no need for the loop range(1000). Testing for one randomly chosen graph with say n=20 nodes and random.randint(1, n*(n-1) / 2) edges is sufficient.
we can do that for x in sorted_vert[:n]: when sorted_vert is an array of ints ???
if directed: gamma = 0 else: gamma = 1 -> gamma = 0 if directed else 1
when stopped == True you don't quit the for j in range(layer_current_beginning,layer_current_end):. May be you should replace this loop with a while.
you set gamma = 0. So you need gamma==1 only once for undirected graphs. Correct?

David.

sagetrac-borassi · 2015-08-18T15:46:06Z

comment:7

Replying to @dcoudert:

A first round of comments.

In method _estimate_reachable_vertices_dir

The lower bound is stored in reachableL[v] -> The lower bound is stored in reachL[v]

Done!

Although the method will not be tested, could you please add an INPUT section specifying the type and expected size of arrays. You could also briefly document the algorithm implemented in the method.

Done!

please improve the presentation of the block with all variable declarations. It is hard to read since it constains lots of important calls to various methods. For instance avoid cdef int i, nscc = tarjan_strongly_connected_components_Cand prefer cdef int nscc = tarjan_strongly_connected_components_C

Done!

What's the need for using type uint64_t instead of int? the corresponding variables are later cast to int.

Well, it's a bit tricky. The point is that in the recursion for reachU_scc I might find values that are much bigger than g.n. Since I did not want to test at each step if this occurs, I used uint64_t in order to avoid overflow (these values are smaller than g.n^2). Then, only at the end, I use min (so that now the values do not overflow), and I set reachU[i] = <int> min(reachU_scc[scc[i]], g.n). However, you are right with reachL, and I modified it.

You could use maxscc = max(scc_sizes[i] for i in range(nscc)

I don't think so, maxscc is not the maximum size of a SCC, but it is the SCC with maximum size. In other words, your instruction outputs scc_sizes[maxscc], not maxscc. Am I missing something?

For method _compute_reachable_vertices_undir, you should add some documentation at the beginning of the method.

Done

Method counting sort: no comment, but in the future we should certainly create a method somewhere.

Agree!

In method def centrality_closeness_top_k(G, int k=1, int verb=0)

verb -> verbose

Done!

the tests for the validity of the results are excessive. There is no need for the loop range(1000). Testing for one randomly chosen graph with say n=20 nodes and random.randint(1, n*(n-1) / 2) edges is sufficient.

Done! I left it because I had to test it 100,000 times to find the last bug...

we can do that for x in sorted_vert[:n]: when sorted_vert is an array of ints ???

Yep: !http://docs.cython.org/src/reference/language_basics.html

if directed: gamma = 0 else: gamma = 1 -> gamma = 0 if directed else 1

Done!

when stopped == True you don't quit the for j in range(layer_current_beginning,layer_current_end):. May be you should replace this loop with a while.

I added an 'if' at the end of the loop.

you set gamma = 0. So you need gamma==1 only once for undirected graphs. Correct?

Hmm, correct, but I realized that I could do a little better by slightly changing the implementation. So this does not apply anymore.

sagetrac-git · 2015-08-18T15:48:41Z

Branch pushed to git repo; I updated commit sha1. New commits:

`a0c1b14`	`Reviewer's suggestions`
`2c73801`	`Small improvements`

sagetrac-git · 2015-08-18T15:48:41Z

Changed commit from a83cdf0 to 2c73801

dcoudert · 2015-08-18T16:57:16Z

comment:9

What's the need for using type uint64_t instead of int? the corresponding variables are later cast to int.

Well, it's a bit tricky. The point is that in the recursion for reachU_scc I might find values that are much bigger than g.n. Since I did not want to test at each step if this occurs, I used uint64_t in order to avoid overflow (these values are smaller than g.n^2). Then, only at the end, I use min (so that now the values do not overflow), and I set reachU[i] = <int> min(reachU_scc[scc[i]], g.n). However, you are right with reachL, and I modified it.

Is it mentioned somewhere?

You could use maxscc = max(scc_sizes[i] for i in range(nscc)

I don't think so, maxscc is not the maximum size of a SCC, but it is the SCC with maximum size. In other words, your instruction outputs scc_sizes[maxscc], not maxscc. Am I missing something?

Right, it should be maxscc = max((scc_sizes[i],i) for i in range(nscc))[1], but this is ugly.

the tests for the validity of the results are excessive. There is no need for the loop range(1000). Testing for one randomly chosen graph with say n=20 nodes and random.randint(1, n*(n-1) / 2) edges is sufficient.

Done! I left it because I had to test it 100,000 times to find the last bug...

That's a lot!
However, I don't like the n = random.randint(2,20). You could safely let n=20, no?

we can do that for x in sorted_vert[:n]: when sorted_vert is an array of ints ???

Yep: !http://docs.cython.org/src/reference/language_basics.html

That's funny.

Can we always assume that farness[v]>0? This is for the return sorted([(1.0/farness[v],....

Instead of G.vertices()[v], you should add a line like cdef list V = G.vertices() before the return statement and then use V[v].

About the behavior when k>=n. What you propose is in fact to return the centrality closeness (as a list of couples (centrality,vertex) ).

Is your method faster than centrality_closeness in this case?
This behavior should be documented for the parameter k

That's all for now ;)

David.

sagetrac-git · 2015-08-19T09:26:40Z

Changed commit from 2c73801 to 06d2c00

sagetrac-git · 2015-08-19T09:26:40Z

Branch pushed to git repo; I updated commit sha1. New commits:

`06d2c00`	`Other reviewer's suggestions`

sagetrac-borassi · 2015-08-19T09:37:59Z

comment:11

What's the need for using type uint64_t instead of int? the corresponding variables are later cast to int.

Well, it's a bit tricky. The point is that in the recursion for reachU_scc I might find values that are much bigger than g.n. Since I did not want to test at each step if this occurs, I used uint64_t in order to avoid overflow (these values are smaller than g.n^2). Then, only at the end, I use min (so that now the values do not overflow), and I set reachU[i] = <int> min(reachU_scc[scc[i]], g.n). However, you are right with reachL, and I modified it.

Is it mentioned somewhere?

I added two inline comments to explain it.

the tests for the validity of the results are excessive. There is no need for the loop range(1000). Testing for one randomly chosen graph with say n=20 nodes and random.randint(1, n*(n-1) / 2) edges is sufficient.

Done! I left it because I had to test it 100,000 times to find the last bug...

That's a lot!
However, I don't like the n = random.randint(2,20). You could safely let n=20, no?

Done!

Can we always assume that farness[v]>0? This is for the return sorted([(1.0/farness[v],....

Yep.

The only instructions that modify the farness are farness[x] = n (if a visit is stopped) and farness[x] = ((<double> f) * (n-1)) / (<double>(nd-1) * (nd-1)), where f is a (positive) sum of distances, nd-1 is positive because nd is the number of reachable vertices from a vertex x. If nd==1, it means that the out-degree of x is 0, and we skip the analysis of x, so this case never occurs. Then, n>=nd, and hence also n-1 is positive.

Instead of G.vertices()[v], you should add a line like cdef list V = G.vertices() before the return statement and then use V[v].

Done!

About the behavior when k>=n. What you propose is in fact to return the centrality closeness (as a list of couples (centrality,vertex) ).

Is your method faster than centrality_closeness in this case?

No, it should be the same (actually, a bit slower because we make some computations during the BFSes).

This behavior should be documented for the parameter k

Done!

sagetrac-git · 2015-08-19T09:43:11Z

Changed commit from 06d2c00 to 84031c3

sagetrac-git · 2015-08-19T09:43:11Z

Branch pushed to git repo; I updated commit sha1. New commits:

`84031c3`	`Added documentation for k`

dcoudert · 2015-08-19T11:29:52Z

comment:13

About the behavior when k>=n. What you propose is in fact to return the centrality closeness (as a list of couples (centrality,vertex) ).

Is your method faster than centrality_closeness in this case?

No, it should be the same (actually, a bit slower because we make some computations during the BFSes).

So you should add a test if k>=n at the beginning of the method and if True call the other method and format the result. You could also add a test n==0 and/or n==1 where the result is obvious.

sagetrac-git · 2015-08-19T13:43:03Z

Branch pushed to git repo; I updated commit sha1. This was a forced push. Last 10 new commits:

`0f855c6`	`Reviewer's suggestions`
`1ec3f44`	`Merged with beta2`
`cbe780c`	`Reviewer's suggestions and small improvements`
`4174764`	`Reviewer's suggestions`
`0c19024`	`Use check_allocarray`
`9167071`	`Used limits instead of stdint to find maximum int value`
`83b8057`	`Small corrections`
`1551c6f`	`Undo commit moving closeness_centrality to centrality.pyx. Corrected links.`
`fd0580a`	`Merged with #19007`
`5c1905f`	`New algorithm for closeness centrality`

sagetrac-git · 2015-08-19T13:43:03Z

Changed commit from 84031c3 to 5c1905f

sagetrac-git · 2015-08-19T14:09:26Z

Changed commit from 5c1905f to ced875f

sagetrac-git · 2015-08-19T14:09:26Z

Branch pushed to git repo; I updated commit sha1. New commits:

`ced875f`	`Reviewer's suggestions / added a reference`

sagetrac-borassi · 2015-08-19T14:12:37Z

comment:16

Helloooooooo!

After we moved centrality_closeness back to generic_graph, I had to rebase this ticket. The new algorithm is still in file centrality.pyx, but it is not available in generic_graph (that is, g.centrality_closeness_top_k() does not work anymore). However, I don't think this is a serious issue, because this algorithm in used in very specialized cases, and the standard Sage user does not need it, in my opinion.

Cheers,

Michele

dcoudert · 2015-08-20T06:41:53Z

Reviewer: David Coudert

dcoudert · 2015-08-20T06:41:53Z

comment:17

Is not such a big deal to do

sage: from sage.graphs.centrality import centrality_closeness_top_k
sage: centrality_closeness_top_k(graphs.PetersenGraph(), 3)
[(0.6, 2), (0.6, 1), (0.6, 0)]

and it's documented. Other methods such as hyperbolicity, pathwidth, etc. must be imported for similar reasons. As soon as we will find a reasonable solution for lazy import, we may add it in another ticket.

I set this patch to positive review.

David.

vbraun · 2015-08-20T17:01:11Z

comment:18

On 32-bit Linux:

sage -t --long src/sage/graphs/centrality.pyx
**********************************************************************
File "src/sage/graphs/centrality.pyx", line 588, in sage.graphs.centrality.centrality_closeness_top_k
Failed example:
    centrality_closeness_top_k(g, 5, 1)
Expected:
    Final performance ratio: 0.766666666667
    [(0.36, 5),
     (0.36, 4),
     (0.3333333333333333, 6),
     (0.3333333333333333, 3),
     (0.29032258064516125, 2)]
Got:
    Final performance ratio: 0.805555555556
    [(0.36, 5),
     (0.36, 4),
     (0.3333333333333333, 6),
     (0.3333333333333333, 3),
     (0.29032258064516125, 7)]
**********************************************************************
1 item had failures:
   1 of  40 in sage.graphs.centrality.centrality_closeness_top_k
    [51 tests, 1 failure, 0.16 s]

dcoudert · 2015-08-20T22:40:43Z

comment:19

Michele,

in this tests, nodes 2 and 7 are both correct answers since we have:

on a 64bit machine (i.e. my laptop)

sage: from sage.graphs.centrality import centrality_closeness_top_k
sage: g = graphs.PathGraph(10)
sage: centrality_closeness_top_k(g, 5, 1)
Final performance ratio: 0.766666666667
[(0.36, 5),
 (0.36, 4),
 (0.3333333333333333, 6),
 (0.3333333333333333, 3),
 (0.29032258064516125, 2)]
sage: g.centrality_closeness()
{0: 0.2,
 1: 0.24324324324324326,
 2: 0.2903225806451613,
 3: 0.3333333333333333,
 4: 0.36,
 5: 0.36,
 6: 0.3333333333333333,
 7: 0.2903225806451613,
 8: 0.24324324324324326,
 9: 0.2}

on a 32 bits computer (old desktop)

sage: from sage.graphs.centrality import centrality_closeness_top_k
sage: g = graphs.PathGraph(10)
sage: centrality_closeness_top_k(g, 5, 1)
Final performance ratio: 0.805555555556
[(0.36, 5),
 (0.36, 4),
 (0.3333333333333333, 6),
 (0.3333333333333333, 3),
 (0.29032258064516125, 7)]
sage: g.centrality_closeness()
{0: 0.2,
 1: 0.24324324324324326,
 2: 0.2903225806451613,
 3: 0.3333333333333333,
 4: 0.36,
 5: 0.36,
 6: 0.3333333333333333,
 7: 0.2903225806451613,
 8: 0.24324324324324326,
 9: 0.2}

The good point is that the numbers are the same on both machines ;)

One part of the solution is to call centrality_closeness_top_k(g, 4, 1).

However, for the performance ratio it is more difficult.
In fact, we have visited=138 on my mac, and visited=145 on the 32 bits computer. So we have to understand what could change the number of visited nodes depending on 32-64 bits.

David.

sagetrac-git · 2015-08-21T19:01:19Z

Branch pushed to git repo; I updated commit sha1. New commits:

`5941f86`	`Removed memset with -1 (32bit problem)`

sagetrac-git · 2015-08-21T19:01:19Z

Changed commit from ced875f to 5941f86

sagetrac-git · 2015-08-21T20:01:21Z

Branch pushed to git repo; I updated commit sha1. New commits:

`681d482`	`Corrected doctest with 32bit`

sagetrac-git · 2015-08-21T20:01:21Z

Changed commit from 5941f86 to 681d482

sagetrac-borassi · 2015-08-21T20:04:54Z

comment:22

Hello!

I have installed Sage on a 32bit Virtual Machine: the problem was a comparison between (equal) doubles, and probably due to roundings the error popped out. Indeed, both results were correct.

By asking only top-4 in the doctest, we avoid such comparisons, and everything works.

Thank you very much!

Michele

dcoudert · 2015-08-21T20:27:09Z

comment:24

The patch passes all tests on both my mac (64bits) and a 32bits linux PC.
For me the patch is ok, but is is certainly better to rebase on 9.3.
David.

sagetrac-git · 2015-08-22T17:11:37Z

Changed commit from 681d482 to 1367f00

sagetrac-git · 2015-08-22T17:11:37Z

Branch pushed to git repo; I updated commit sha1. New commits:

`1367f00`	`Merged with 6.9.beta3`

dcoudert · 2015-08-22T17:22:51Z

comment:26

Thanks.

vbraun · 2015-08-23T23:31:12Z

Changed branch from u/borassi/new_algorithm_for_top_k_closeness_centralities to 1367f00

sagetrac-borassi mannequin added this to the sage-6.9 milestone Aug 14, 2015

sagetrac-borassi mannequin added the p: major / 3 label Aug 14, 2015

This comment has been minimized.

Sign in to view

sagetrac-borassi mannequin added t: enhancement labels Aug 14, 2015

sagetrac-borassi mannequin added the s: needs review label Aug 17, 2015

dcoudert added s: positive review and removed s: needs review labels Aug 20, 2015

vbraun added s: needs work and removed s: positive review labels Aug 20, 2015

sagetrac-borassi mannequin added s: needs review and removed s: needs work labels Aug 21, 2015

dcoudert added s: positive review and removed s: needs review labels Aug 22, 2015

vbraun removed the s: positive review label Aug 23, 2015

vbraun closed this as completed in feb86b7 Aug 23, 2015

New Algorithm for Top-K Closeness Centralities #19031

New Algorithm for Top-K Closeness Centralities #19031

Comments

sagetrac-borassi mannequin commented Aug 14, 2015

sagetrac-borassi mannequin commented Aug 14, 2015

This comment has been minimized.

sagetrac-borassi mannequin commented Aug 14, 2015

sagetrac-borassi mannequin commented Aug 14, 2015

sagetrac-borassi mannequin commented Aug 17, 2015

sagetrac-git mannequin commented Aug 17, 2015

sagetrac-git mannequin commented Aug 17, 2015

sagetrac-git mannequin commented Aug 17, 2015

sagetrac-git mannequin commented Aug 17, 2015

sagetrac-borassi mannequin commented Aug 17, 2015

dcoudert commented Aug 18, 2015

sagetrac-borassi mannequin commented Aug 18, 2015

sagetrac-git mannequin commented Aug 18, 2015

sagetrac-git mannequin commented Aug 18, 2015

dcoudert commented Aug 18, 2015

sagetrac-git mannequin commented Aug 19, 2015

sagetrac-git mannequin commented Aug 19, 2015

sagetrac-borassi mannequin commented Aug 19, 2015

sagetrac-git mannequin commented Aug 19, 2015

sagetrac-git mannequin commented Aug 19, 2015

dcoudert commented Aug 19, 2015

sagetrac-git mannequin commented Aug 19, 2015

sagetrac-git mannequin commented Aug 19, 2015

sagetrac-git mannequin commented Aug 19, 2015

sagetrac-git mannequin commented Aug 19, 2015

sagetrac-borassi mannequin commented Aug 19, 2015

dcoudert commented Aug 20, 2015

dcoudert commented Aug 20, 2015

vbraun commented Aug 20, 2015

dcoudert commented Aug 20, 2015

sagetrac-git mannequin commented Aug 21, 2015

sagetrac-git mannequin commented Aug 21, 2015

sagetrac-git mannequin commented Aug 21, 2015

sagetrac-git mannequin commented Aug 21, 2015

sagetrac-borassi mannequin commented Aug 21, 2015

dcoudert commented Aug 21, 2015

sagetrac-git mannequin commented Aug 22, 2015

sagetrac-git mannequin commented Aug 22, 2015

dcoudert commented Aug 22, 2015

vbraun commented Aug 23, 2015