Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvement of GNP generators for graphs and digraphs #12362

Closed
dcoudert opened this issue Jan 26, 2012 · 47 comments
Closed

Improvement of GNP generators for graphs and digraphs #12362

dcoudert opened this issue Jan 26, 2012 · 47 comments

Comments

@dcoudert
Copy link
Contributor

The RandomDirectedGNP generator is quit slow because it is written in python. Also I propose to rewrite it in Cython.

With this ticket, I propose to create a file graph_generators_pyx.pyx containing graphs and digraphs generators in cython and to provide a faster implementation of the GNP generator for both graphs and digraphs.

APPLY:

CC: @nathanncohen

Component: graph theory

Keywords: directed graphs, generators

Author: David Coudert

Reviewer: Nathann Cohen

Merged: sage-5.0.beta9

Issue created by migration from https://trac.sagemath.org/ticket/12362

@dcoudert
Copy link
Contributor Author

comment:1

The patchs are not working yet :(

Here the modifications are not taken into account by sage -b.
Something is missing but I don't know what.
I'm still a beginner with sage...

Any help, advise,... is more than welcome !

D.

@dcoudert

This comment has been minimized.

@dcoudert dcoudert changed the title Modification of the digraph generator to cython and improvement of the GNP genrator Improvement of GNP generators for graphs and digraphs Jan 31, 2012
@dcoudert

This comment has been minimized.

@dcoudert
Copy link
Contributor Author

comment:4

Another proposal for RandomDirectedGNP.

The function is generic and also works for RandomGNP, but I haven't modified the graph_generators.py file because further benchmark are required: networkx implementation versus this implementation.

How can I remove useless attachments of this ticket ??

Thanks,

D.

@dcoudert
Copy link
Contributor Author

dcoudert commented Feb 7, 2012

comment:5

I have changed the status from new to need_review.

Without this patch

sage: %timeit digraphs.RandomDirectedGNP(1000,0.001)
5 loops, best of 3: 2.58 s per loop
sage: %timeit networkx.random_graphs.directed_gnp_random_graph(1000,0.001)
5 loops, best of 3: 2.08 s per loop

With this patch

sage: %timeit digraphs.RandomDirectedGNP(1000,0.001)
5 loops, best of 3: 26.6 ms per loop

This patch could be used for undirected graphs, but the networkx implementation is competitive.

Waiting for your feedback.

D.

@nathanncohen
Copy link
Mannequin

nathanncohen mannequin commented Feb 9, 2012

comment:6

Helloooooo !!!

Nice speedup :-D

During the Sage week we also found a similar speedup in a Poset code, by replacing a m[i][j] by a m[i,j] :-P

About the length of the code : it can be divided by two at no cost by removing the "else" on the "loops" argument. If you try to build an edge between i and i in a graph that that has been built with loops = False the edge is silently not created. And there should be no noticeable difference in the runtime if the code with loops is applied to the case where loops are forbidden.

It is also possible to write the two codes

for 0 <= i < n:                                                                                                                                                                                                                                                 
   for 0 <= j < n:                                                                                                                                                                                                                                             
      if random() < pp:                                                                                                                                                                                                                                       
         G.add_edge(i,j)   

and

# Standard GNP generator for graphs                                                                                                                                                                                                                                 
for 0 <= i < n-1:                                                                                                                                                                                                                                                   
   for i+1 <= j <n:                                                                                                                                                                                                                                                
       if random() < pp:                                                                                                                                                                                                                                           
          G.add_edge(i,j)   

as

for 0 <= i < n-1:                                                                                                                                                                                                                                                   
   for (i+1 if directed else 0) <= j <n:                                                                                                                                                                                                                                                
      if random() < pp:                                                                                                                                                                                                                                           
         G.add_edge(i,j)   

Same thing here, a "if" in a loop (with branch prediction !) should not weigh much :-)

Nathann

@dcoudert
Copy link
Contributor Author

dcoudert commented Feb 9, 2012

comment:7

Thank you Nathann for the good advices.

I have changed accordingly the standard GNP algorithm.
I have also contracted the fast algorithm but it is a bit more difficult to read now. The extra cost for N = 10000 and p=0.0001 is 5ms which is acceptable ;-)

Some running time. The fast algorithm is better for large sparse graphs. Otherwise, the standard GNP algorithm is now fast enough.

sage: %timeit G = digraphs.RandomDirectedGNP(1000,0.001,loops=True,fast=True)
5 loops, best of 3: 55.8 ms per loop
sage: %timeit G = digraphs.RandomDirectedGNP(1000,0.001,loops=True)
25 loops, best of 3: 15.8 ms per loop
sage: %timeit G = digraphs.RandomDirectedGNP(10000,0.0001,loops=True,fast=True)
5 loops, best of 3: 558 ms per loop
sage: %timeit G = digraphs.RandomDirectedGNP(10000,0.0001,loops=True)
5 loops, best of 3: 1.07 s per loop
sage: %timeit G = digraphs.RandomDirectedGNP(10000,0.001,loops=True,fast=True)
5 loops, best of 3: 5.31 s per loop
sage: %timeit G = digraphs.RandomDirectedGNP(10000,0.001,loops=True)
5 loops, best of 3: 1.5 s per loop

D.

PS: concerning m[i][j] versus m[i,j], is it for python or cython ? which kind of data structure ?

PS2: in general in C, m[i][j] is faster than m[iN+j], so if you have a NxN matrix encoded in a 1D array A, it is interesting to create a **B array of size N and to assign B[i]=A+iN. Then, we have B[i,j]==A[i*N+j]. OK, I had real C course.

@nathanncohen
Copy link
Mannequin

nathanncohen mannequin commented Feb 10, 2012

comment:9

Hellooooooo !!!

Some running time. The fast algorithm is better for large sparse graphs.

Nice ! Could you advertise it in the doc ? Something like "the fast algorithm is only faster for LARGE instances (try it to know whether it is useful for you)" ?

I tried to improve it a bit because I still find weird that networkx could beat us at this game, but found nothing yet... I'll try again later :-p

This being said, I was thinking of two things. The first of the two you reported yourself some time ago : the means seem to be below what is expected...

sage: mean([digraphs.RandomDirectedGNP(100,.1, fast = True).density()+.0 for i in range(50)]) 
0.0990303030303030
sage: mean([digraphs.RandomDirectedGNP(100,.1, fast = True).density()+.0 for i in range(50)])
0.0985575757575757
sage: mean([digraphs.RandomDirectedGNP(100,.1, fast = True).density()+.0 for i in range(50)])
0.0989050505050506
sage: mean([digraphs.RandomDirectedGNP(100,.1, fast = True).density()+.0 for i in range(50)])
0.0993535353535353

And there is also this "detail" :-p

sage: mean([digraphs.RandomDirectedGNP(100,.1, fast = False).density()+.0 for i in range(50)])
0.0498989898989899
sage: mean([digraphs.RandomDirectedGNP(100,.1, fast = True).density()+.0 for i in range(50)]) 
0.0987434343434344

Could you also add the file to the documentation (the files in devel/sage-yourbranch/doc/en/reference/) and make the method available (through an optional argument) in graphs.RandomGNP ? Even though it can not be the default method, as networkx is faster (why the hell should it be ? O_o), it would be nice to have around.

And about the matrix : it was not about C arrays but about Sage matrices, and this is why the speedup was so huge (#12476). It is because a field was accessed in Python as m[x][y] which translates as "build a copy of the x^th row of m, and return its y^th element". A bit more than what was necessary :-P

Nathann

@dcoudert
Copy link
Contributor Author

comment:10

I have addressed all your remarks. The "detail" was because I have stupidly copy/paste your (i+1 if directed else 0). It should be (0 if directed else i+1) ;)

For graphs, I have decided to let the networkx method as default.

sage: %timeit networkx.gnp_random_graph(1000,0.01)
5 loops, best of 3: 1.03 s per loop
sage: %timeit graphs.RandomGNP(1000,0.01,method='sage')
5 loops, best of 3: 233 ms per loop
sage: %timeit graphs.RandomGNP(1000,0.01,method='networkx')
5 loops, best of 3: 66.5 ms per loop
sage: %timeit networkx.gnp_random_graph(100,0.05,method='sage')
25 loops, best of 3: 10.9 ms per loop
sage: %timeit graphs.RandomGNP(100,0.05,method='sage')
25 loops, best of 3: 11.4 ms per loop
sage: %timeit graphs.RandomGNP(100,0.05,method='networkx')
125 loops, best of 3: 3.45 ms per loop

Concerning the average density, it is seems correct with larger sample.

sage: mean([digraphs.RandomDirectedGNP(100,.1, loops = True).density()+.0 for i in range(500)])
0.100094400000000
sage: mean([digraphs.RandomDirectedGNP(100,.1, loops = False).density()+.0 for i in range(500)]) 
0.100155555555556
sage: mean([digraphs.RandomDirectedGNP(100,.1, loops = False, fast = True).density()+.0 for i in range(500)]) 
0.0989579797979798
sage: mean([digraphs.RandomDirectedGNP(100,.1, loops = True, fast = True).density()+.0 for i in range(500)]) 
0.0988469999999999
sage: mean([graphs.RandomGNP(100,.1, fast = False, method='sage').density()+.0 for i in range(500)]) 
0.100131313131313
sage: mean([graphs.RandomGNP(100,.1, fast = False, method='networkx').density()+.0 for i in range(500)]) 
0.100190707070707
sage: mean([graphs.RandomGNP(100,.1, fast = True, method='networkx').density()+.0 for i in range(500)]) 
0.100002424242424
sage: mean([graphs.RandomGNP(100,.1, fast = True, method='sage').density()+.0 for i in range(500)]) 
0.104045165676905

A possible improvement for RandomDirectedGNP is to return a complete digraph (with or without loops) when p=1. However, no such function exists and it is not relevant in this patch. I let it for another patch.

D.

@dcoudert
Copy link
Contributor Author

comment:11

I have added some set_random_seed(0) before the tests to ensure that the results are always the same. I have also removed annoying TABs and spaces.
The patch should now pass all tests (at least it is the case on my computer).

However, I don't know how to set the seed inside a cython function. This could be a nice improvement to this patch. I have checked the sage.misc.randstate documentation, but I don't understand how to proceed. Any help is more than welcome.

D.

@dcoudert
Copy link
Contributor Author

Attachment: trac_12362_random_gnp.patch.gz

@nathanncohen
Copy link
Mannequin

nathanncohen mannequin commented Feb 23, 2012

comment:12

Hellooooooo David !

Well, I don't think not being able to define the seed as a parameter of the function is so bad. As it works fine when you set it globally for the doctests anyway....

Now, here is a small patch with some modifications. It adds a few tests, check that the given method is "networkx" or "Sage" but nothing else (that's how we do it usually). I also added a "(Cython)" to the title of the new file you create. It makes more sense this way when you look at the documentation. Otherwise, you would have:

  • Digraph generators
  • Graph Generators
  • Digraph and Graph generators
    Now there is a "Cython" at the end of the third one, and there's some logic to it :-)

This being said, I added one test which does not pass... Here it is :

sage: from sage.graphs.graph_generators_pyx import RandomGNP
sage: abs(mean([RandomGNP(150,.2, fast = True).density() for i in range(30)])-.2) < .001
False 

It looks like there is some deviation with the "fast" algorithm, which does not occur with the usual one :

sage: sage: abs(mean([RandomGNP(200,.2).density() for i in range(30)])-.2)
0.000443886097152429
sage: sage: abs(mean([RandomGNP(200,.2, fast = True).density() for i in range(30)])-.2)
0.0163283739008691

Well, that's already 1% of edges too much. I don't think the deviation with the first algorithm is very bad (though for some reason it is always positive on my tests, and I do not like that), but I'm no big fan of the deviation for the second implementation
The "fast" algorithm is also.... MUCH SLOWER on most cases... Is it better than the other one in some interesting situations ?

Oh, and I also noticed that the graphs returned did not necessarily have n vertices when p was too low. I fixed that in the patch too :-)

Nathann

@dcoudert
Copy link
Contributor Author

comment:13

Attachment: trac_12362-rev.patch.gz

Hello Nathann,

The fast method is expected to be faster for large and sparse graphs, e.g., N=10000 and p=0.0001. But even for such values, the standard method is already fast. Otherwise, the standard method is sufficient.
Since I don't know how to solve the deviation problem you have pointed out (I'm not the designer of the algorithm) and that it is not so useful, I propose to simply remove it from the Cython RandomGNP function and from the RandomDirectedGNP function. For the graphs.RandomGNP function, the fast method of networkx has no deviation so we can let it.

We should also remove the tests with cputime. They are quite old and will differ from one computer to another.

If you agree with these proposition, I will prepare a rev-rev patch ;-)

D.

@dcoudert
Copy link
Contributor Author

dcoudert commented Mar 4, 2012

comment:27

That's much better ! For me the patch is ready to do :))

D.

@jdemeyer
Copy link

comment:28

On hawk, this yields

sage -t  --long -force_lib devel/sage/sage/graphs/generic_graph.py
**********************************************************************
File "/export/home/buildbot/build/sage/hawk-1/hawk_full/build/sage-5.0.beta8/devel/sage-main/sage/graphs/generic_graph.py", line 5830:
    sage: g.flow(0,1, method="FF") == g.flow(0,1,method="LP")
Expected:
    True
Got:
    False
**********************************************************************

This is probably a bug uncovered by this patch, rather than caused by this patch. But anyway, it should be looked at.

@nathanncohen
Copy link
Mannequin

nathanncohen mannequin commented Mar 14, 2012

comment:29

Updated ! It must be because of some numerical noise on that machine. Actually, this doctest breaks when the CBC package is installed too, for the same reason. Let it be fixed with the updated version of trac_12362-edge_disjoint_spanning_tree.patch.

Nathann

@nathanncohen
Copy link
Mannequin

nathanncohen mannequin commented Mar 14, 2012

@dcoudert
Copy link
Contributor Author

comment:30

The patch works perfectly with sage-5.0.beta7 on my computer. Tests are also OK.

compote:~/Soft/sage-5.0.beta7> ./sage -t --long -force_lib generic_graph.py
sage -t --long -force_lib "devel/sage-myclone/sage/graphs/generic_graph.py"
         [74.6 s]
 
----------------------------------------------------------------------
All tests passed!
Total time for all tests: 74.6 seconds

The long tests are also OK for all .py and .pyx files of sage/graphs/.

Thanks Nathann for last update.

D.

@dcoudert
Copy link
Contributor Author

comment:31

Long tests are also OK on a fresh new install of sage-5.0.beta8.

compote:~/sage-5.0.beta8> ./sage -t --long -force_lib devel/sage-myclone/sage/graphs/generic_graph.py
sage -t --long -force_lib "devel/sage-myclone/sage/graphs/generic_graph.py"
         [75.5 s]
 
----------------------------------------------------------------------
All tests passed!
Total time for all tests: 75.5 seconds

@dcoudert
Copy link
Contributor Author

comment:32

Nathann,

I don't know if it is a requirement, but there is no ticket number inside files trac_12362-rev.patch and trac_12362-edge_disjoint_spanning_tree.patch

# HG changeset patch
# User Nathann Cohen <nathann.cohen@gmail.com>
# Date 1329999818 -3600
# Node ID 283e6da5d8af82c7d8b8ae19c16572d55ca06e08
# Parent  34d93e382b071f0f17bb633c6168c1950d901fde
Improvement of GNP generators for graphs and digraphs -- Cython

diff --git a/module_list.py b/module_list.py

Best,
D.

@nathanncohen
Copy link
Mannequin

nathanncohen mannequin commented Mar 17, 2012

comment:33

I don't know if it is a requirement, but there is no ticket number inside files trac_12362-rev.patch and trac_12362-edge_disjoint_spanning_tree.patch

That's a very controversial point in Sage development :-D

If I did not miss the latest updates, the moral is : it used to be a requirement, and it is now done automatically with scripts. If you put it anyway Jeroen will remove it himself, but you spare him that if you do not add it. But these things changes faster than lightning :-)

By the way, have you ever played the role of mentor in a "Google Summer of Code" project ? Looks like two people asked to participate in one for Sage : the first about Graphs, the other with LP... I should really work on my application right now, but I still must think f that too O_o;;;

Nathann

@jdemeyer
Copy link

comment:34

Replying to @nathanncohen:

it used to be a requirement, and it is now done automatically with scripts.

Exactly.

If you put it anyway Jeroen will remove it himself

That's false. If you put it anyway, you end up with a commit message like

Trac #12362: #12362: foo bar

@dcoudert
Copy link
Contributor Author

comment:35

OK, so let me know if I have something to do.

Best,

David.

@jdemeyer
Copy link

Merged: sage-5.0.beta9

@jdemeyer
Copy link

comment:37

This patch is bad.

Before:

sage: timeit('graphs.RandomGNP(2000, .01)')
5 loops, best of 3: 263 ms per loop

After:

sage: timeit('graphs.RandomGNP(2000, .01)')
5 loops, best of 3: 3.73 s per loop

@nathanncohen
Copy link
Mannequin

nathanncohen mannequin commented Apr 28, 2012

comment:38

OOch O_O;;;;

David ? You know where that comes from ? O_o;;;;

Nathann

@nathanncohen
Copy link
Mannequin

nathanncohen mannequin commented Apr 28, 2012

comment:39

Well, actually our own code is still faster....

sage: %timeit graphs.RandomGNP(2000,.01, method="Sage")
5 loops, best of 3: 157 ms per loop

But did something change with networkx since ? O_o

@dcoudert
Copy link
Contributor Author

comment:40

Hello,

the default method is networkx (as before this patch) and so when used with default parameters this patch has no impact on the running time. When I wrote this patch, we decided to let networkx as default algorithm because the running time was faster with new method only for large graphs.

Something has changed with networkx. It is now surprisingly slow...

sage: %timeit networkx.gnp_random_graph(2000,0.01)
5 loops, best of 3: 4.31 s per loop
sage: %timeit graphs.RandomGNP(2000, .01, method='networkx')
5 loops, best of 3: 4.47 s per loop
sage: %timeit Graph(networkx.gnp_random_graph(2000,0.01))
5 loops, best of 3: 4.49 s per loop
sage: %timeit graphs.RandomGNP(2000, .01, method='Sage')
5 loops, best of 3: 173 ms per loop
sage:
sage: %timeit graphs.RandomGNP(100, .01, method='networkx')
25 loops, best of 3: 11.2 ms per loop
sage: %timeit graphs.RandomGNP(100, .01, method='Sage')
625 loops, best of 3: 659 µs per loop

If we are unable to find why networkx is now slow, we can open a ticket to put Sage as default algorithm.

For directed graphs, our implementation was always faster than networkx, and so we removed the networkx one.

@nathanncohen
Copy link
Mannequin

nathanncohen mannequin commented Apr 28, 2012

comment:41

Yeah, something must have changed in networkx, but the last update seems to be 1.2 a veeeeeeery long time ago. Where the hell does this come from ? O_o

@nathanncohen
Copy link
Mannequin

nathanncohen mannequin commented Apr 28, 2012

comment:42

Here's the code of RandomGNP in beta8

        if seed is None:
            seed = current_randstate().long_seed()
        if p == 1:
            return graphs.CompleteGraph(n)
        import networkx
        if fast:
            G = networkx.fast_gnp_random_graph(n, p, seed=seed)
        else:
            G = networkx.gnp_random_graph(n, p, seed=seed)
        return graph.Graph(G)

and in beta14

        if n < 0:
            raise ValueError("The number of nodes must be positive or null.")
        if 0.0 > p or 1.0 < p:
            raise ValueError("The probability p must be in [0..1].")

        if seed is None:
            seed = current_randstate().long_seed()
        if p == 1:
            return graphs.CompleteGraph(n)

        if method == 'networkx':
            import networkx
            if fast:
                G = networkx.fast_gnp_random_graph(n, p, seed=seed)
            else:
                G = networkx.gnp_random_graph(n, p, seed=seed)
            return graph.Graph(G)
        elif method == "Sage":
            # We use the Sage generator
            from sage.graphs.graph_generators_pyx import RandomGNP as sageGNP
            return sageGNP(n, p)
        else:
            raise ValueError("'method' must be equal to 'networkx' or to 'Sage'.")

@nathanncohen
Copy link
Mannequin

nathanncohen mannequin commented Apr 28, 2012

comment:43

GOT IT !!!!

The flag "fast" used to be "True", it is now "False". We set it to False because our "fast" algorithm is slower than the usual one, so for the Sage algorithm it is better to set it to False.
For networkx, however, it is much better to use the "Fast" version, and so the default value has to be set to "True" again, as the default method is networkx !

Nathann

@dcoudert
Copy link
Contributor Author

comment:44

Bravo !

sage: %timeit graphs.RandomGNP(2000, .01, method='networkx', fast = True)
5 loops, best of 3: 234 ms per loop
sage: %timeit graphs.RandomGNP(2000, .01, method='Sage')
5 loops, best of 3: 163 ms per loop
sage: %timeit graphs.RandomGNP(2000, .001, method='networkx', fast = True)
25 loops, best of 3: 27.7 ms per loop
sage: %timeit graphs.RandomGNP(2000, .001, method='Sage')
5 loops, best of 3: 50.2 ms per loop
sage: %timeit graphs.RandomGNP(2000, .1, method='networkx', fast = True)
5 loops, best of 3: 2.38 s per loop
sage: %timeit graphs.RandomGNP(2000, .1, method='Sage')
5 loops, best of 3: 1.33 s per loop

I will open a new ticket to correct this. Sorry.

A remaining question is which should be the default algorithm. Apparently, the Sage implementation is often faster than the networkx one, except for large sparse graphs

sage: %timeit graphs.RandomGNP(10000, 0.001, method='Sage')
5 loops, best of 3: 1.21 s per loop
sage: %timeit graphs.RandomGNP(10000, 0.001, method='networkx', fast = True)
5 loops, best of 3: 612 ms per loop
sage: %timeit graphs.RandomGNP(10000, 0.01, method='Sage')
5 loops, best of 3: 4.17 s per loop
sage: %timeit graphs.RandomGNP(10000, 0.01, method='networkx', fast = True)
5 loops, best of 3: 6.09 s per loop

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants