Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

De Bruijn Sequence construction for combinat #10530

Closed
eviatarbach opened this issue Dec 29, 2010 · 59 comments
Closed

De Bruijn Sequence construction for combinat #10530

eviatarbach opened this issue Dec 29, 2010 · 59 comments

Comments

@eviatarbach
Copy link

I have written an implementation of De Bruijn sequences for combinat. It can currently generate one sequence for an alphabet (or arity) and substring length. It can also calculate cardinality.

For the human:

Apply :

for the bot:

Apply 10530.patch

CC: @sagetrac-sage-combinat @sagetrac-tmonteil

Component: combinatorics

Author: Eviatar Bach

Reviewer: Nicolas M. Thiéry, Nathann Cohen

Merged: sage-4.7.2.alpha2

Issue created by migration from https://trac.sagemath.org/ticket/10530

@eviatarbach
Copy link
Author

Attachment: 15057.patch.gz

@nthiery
Copy link
Contributor

nthiery commented Dec 29, 2010

comment:2

Hi!

I just had a quick look at your patch. This sounds good!

A couple micro comments (before an actual review):

  • CombinatorialClass is deprecated. Please use the
    FiniteEnumeratedSets category. For an example, see:

    sage: C = FiniteEnumeratedSets().example()
    sage: C??
    File:		/opt/sage/local/lib/python2.6/site-packages/sage/categories/examples/finite_enumerated_sets.py
    

    (sorry, we still have a lot of old code giving bad hints to the
    newcomers)

  • What about just implementing an_element? (no __iter__, no first).

    Actually at this point DeBruijnSequences(alphabet, n) should just
    be in the category FiniteSets(). However the later does not exist
    yet, so Sets() will do.

  • Rather than defining an attribute self.k_is_alphabet, I would setup
    two attributes self.k and self.alphabet once for all in the
    init. Then no more branching needs to be done later on.

    Maybe rename the parameter k to alphabet (which is more explicit)

  • DeBruijnSequence_evaluation -> DeBruijnSequences

    No need for the DeBruijnSequence function at this point.

  • The length of the sequence is known in advance, right? Then it
    should be faster to allocate a bit list right away, and then fill
    it up, rather than using append.

  • In the long run, we might want to provide an iterator over the
    sequence (once cython will support iterators)

  • debruijn.py -> debruijn_sequence.py

  • The speed difference is impressive :-) I don't know if we want to
    mention it in the doc though (this may change in the future).

Cheers,
Nicolas

@nthiery
Copy link
Contributor

nthiery commented Dec 29, 2010

Reviewer: Nicolas M. Thiéry

@eviatarbach
Copy link
Author

comment:4

Thank you for the comments. I have a few questions:

Rather than defining an attribute self.k_is_alphabet, I would setup two attributes self.k and self.alphabet once for all in the init. Then no more branching needs to be done later on.

I know it looks ugly. The point of having k_is_alphabet is to check whether the input k is an alphabet or an arity, as the processing is different in either case. I was thinking of removing the alphabet functionality; maybe just the arity would suffice? One could always map values onto the digits after.

The length of the sequence is known in advance, right? Then it should be faster to allocate a bit list right away, and then fill it up, rather than using append.

I'm not clear on this. Are you suggesting I compute the length of the array in advance, create sequence full of zeros to that length, and then use indeces to change the values? Is append that much slower that creating a long list and then filling it would be faster?

Or are you suggesting that I declare sequence as an int array in C, then fill it? This was my initial idea, but I couldn't find a way to insert arguments into the declaration. This returns a traceback:

cdef sequence[k^n]

Thanks again!

@nthiery
Copy link
Contributor

nthiery commented Dec 30, 2010

comment:5

Replying to @eviatarbach:

Thank you for the comments. I have a few questions:

Rather than defining an attribute self.k_is_alphabet, I would setup two attributes self.k and self.alphabet once for all in the init. Then no more branching needs to be done later on.

I know it looks ugly. The point of having k_is_alphabet is to check whether the input k is an alphabet or an arity, as the processing is different in either case.

It feels only marginaly different. For example, if alphabet is set to
[0,...,k-1], then the last if becomes unnecessary; granted, there
is an overhead of an indexed lookup in an array, but I assume it
should be negligible (to be checked with timeit though!).

I was thinking of removing the alphabet functionality; maybe just the arity would suffice? One could always map values onto the digits after.

As you feel like; that functionality can indeed be added later.

The length of the sequence is known in advance, right? Then it should be faster to allocate a bit list right away, and then fill it up, rather than using append.

I'm not clear on this. Are you suggesting I compute the length of the array in advance, create sequence full of zeros to that length, and then use indeces to change the values?

Yes.

Is append that much slower that creating a long list and then filling it would be faster?

With append, Python needs to regularly reallocate the list in memory,
requiring a copy of it each time. In principle, Python uses
"exponential" reallocation so the theoretical complexity of append is
"amortized O(1)", but there still is an overhead.

I recommend trying both, and benchmarking with timeit (and post here
the result for our instruction).

Or are you suggesting that I declare sequence as an int array in C,
then fill it? This was my initial idea, but I couldn't find a way to
insert arguments into the declaration. This returns a traceback:

cdef sequence[k^n]

I don't know what's the canonical way to allocate C arrays in
Cython. Probably via malloc or something. It will be in the Cython
doc. Anyway, since the result will be returned as a list, I assume
that copying back the result into a list would ruin the benefit of
having used a C array in the first place.

Ah, by the way: please add tests for all the base cases: k=0,
alphabet=0, ...

Also please add tests for the direct call to debruijn_cython.
Speaking of which I suggest renaming it to debruijn_sequence, and
adding a link back to DeBruijnSequences.

Cheers,

@eviatarbach
Copy link
Author

comment:6

Append is marginally faster (by about one microsecond), probably because for the other I had to declare a global index variable in Python.

Let me get this straight:

Python

Filename: debruijn_sequence.py
Class name: DeBruijnSequence (DeBruijnSequences wouldn't make sense since only one is generated)

Cython

Filename: debruijn_sequence.pyx
Function name: debruijn_sequence

Is this correct?

Also, for some reason, when I use underscores in debruijn_sequence.py, I get this error:

ImportError: dynamic module does not define init function (initdebruijn_sequence)

Thank you!

@nthiery
Copy link
Contributor

nthiery commented Jan 2, 2011

comment:7

Replying to @eviatarbach:

Append is marginally faster (by about one microsecond), probably
because for the other I had to declare a global index variable
in Python.

Interesting to know! Have you tried with large output? Say, a sequence of length 10^5 or 10^6?

For the following, see also the discussion on:

http://groups.google.com/group/sage-combinat-devel/browse_thread/thread/bf20e88681bc557b

Let me get this straight:

Python

Filename: debruijn_sequence.py
Class name: DeBruijnSequence (DeBruijnSequences wouldn't make sense since only one is generated)

Class name: DeBruijnSequences (it models the set of all such sequences)
Filename: debruijn_sequences.py (for consistency with the name of the class; granted this convention is not yet followed consistently across combinat; that should eventually be fixed)

Cython

Filename: debruijn_sequence.pyx
Function name: debruijn_sequence

+1

Possibly debruijn_sequences_cython.pyx in case you expect other
functionalities on de Bruijn sequences to be implemented in cython.

Also, for some reason, when I use underscores in debruijn_sequence.py, I get this error:

ImportError: dynamic module does not define init function (initdebruijn_sequence)

Hmm, as far as I know this should work. Please double check all
references to the f,ile and that everything was recompiled
properly. Also, there might be some old files that have not been
cleaned up and that cython tries to recompile.

Cheers,
Nicolas

@eviatarbach
Copy link
Author

comment:9

Attachment: 15454.patch.gz

Sorry for taking so long, but here is the updated patch. The changes are as follows:

  • sanity-checks the input; corresponding tests added in the docstring
  • filenames changed
  • option for alphabet removed; not really useful and having an additional keyword argument may confuse

I tried to allocate the length of the sequence length beforehand and tested with larger inputs, as you suggested, but it is still slower. I don't know much about algorithmic efficiency, but that's what I observed in my tests. For one input, I got around 50 ms and 120 ms, respectively.

I decided to not have the sequences as words, to leave the option open for generation of all possible sequences open.

Thanks in advance!

@eviatarbach
Copy link
Author

comment:10

By the way, it was built on Sage 4.6.2.alpha4.

@sagetrac-ylchapuy
Copy link
Mannequin

sagetrac-ylchapuy mannequin commented Feb 12, 2011

comment:11

Hi,
sight remark: are you sure about the cardinality part? I think a factorial is missing...
( see e.g. your reference or http://en.wikipedia.org/wiki/De_Bruijn_sequence )

@eviatarbach
Copy link
Author

comment:12

Yes, you're correct. It seems that the formula I was looking at only applies when k=2. I uploaded a new patch.

@eviatarbach
Copy link
Author

To be applied on top of previous patch

@sagetrac-ylchapuy
Copy link
Mannequin

sagetrac-ylchapuy mannequin commented Feb 15, 2011

comment:13

Attachment: 15455.patch.gz

Another slight remark:
in debruijn_sequence, in the inner function gen, you should "cdef int j" for a nice speedup.

@eviatarbach
Copy link
Author

Attachment: 15456.patch.gz

To be applied on top of previous.

@eviatarbach
Copy link
Author

comment:14

That is a significant speedup! Thank you!

New patch uploaded.

@nathanncohen
Copy link
Mannequin

nathanncohen mannequin commented May 8, 2011

comment:15

Hello !

Several superficial remarks :

  • As in your other patch, the lines should be kept to less than 80 characters

  • Some functions are not documented

       ```
       /home/ncohen/sage/devel/sage-1/sage/combinat$ sage -coverage debruijn_sequence*
       ----------------------------------------------------------------------
       debruijn_sequence.pyx
       SCORE debruijn_sequence.pyx: 0% (0 of 2)
       
       Missing documentation:
       	 * gen(int t, int p):
       
       
       Missing doctests:
       	 * debruijn_sequence(int k, int n):
       
       ----------------------------------------------------------------------
       
       ----------------------------------------------------------------------
       debruijn_sequences.py
       ERROR: Please add a `TestSuite(s).run()` doctest.
       SCORE debruijn_sequences.py: 75% (3 of 4)
       
       Missing documentation:
       	 * __init__(self, k,n):
       
       ----------------------------------------------------------------------
       
       /home/ncohen/sage/devel/sage-1/sage/combinat$ sage -coverage debruijn_sequence.pyx 
       ----------------------------------------------------------------------
       debruijn_sequence.pyx
       SCORE debruijn_sequence.pyx: 0% (0 of 2)
       
       Missing documentation:
       	 * gen(int t, int p):
       
       
       Missing doctests:
       	 * debruijn_sequence(int k, int n):
       
       ----------------------------------------------------------------------
    
    Let's keep this to 100% lest we write other patches later to improve it `:-)`
    
  • The "NOTE::" environment is misused. Look at other places in the source code where it is used. You can see it doesn't display properly when building the documentation

    ```
    sage -docbuild reference html
    
    
    
  • Same for the "reference" environment and citations.The best is to look at other places in the souce code :-)

  • The "cardinality" method, for instance, is not doctested.

  • It would be good to add a reference to the book "Combinatorial Generation" in the file containing the actual implementaion of the algorithm

  • There are 4 patches on this ticket.

    • If the ticket's name is a number, let it be the ticket's number
    • In this kind of situation, it's also very nice to produce "flattened" patched. Sometimes it is better to update the current patch than to create another one to apply on top. You may need to use "Mercurial Queue" if you're not used to it.

Short of all this, thank you for this patch ! As usual, there's much more work in dealing with the programming standards than with the actual math content, but it quickly becomes natural. Thank you very much also, because of you I spend yesterday reading random parts of Combinatorial Generation, and learned very nice things :-)

(send me an email if I can be of any help)

Nathann

@fchapoton
Copy link
Contributor

comment:37

Attachment: trac_10530-reviewer.patch.gz

for the bot:

Apply 10530.2.patch trac_10530-reviewer.patch

@fchapoton

This comment has been minimized.

@eviatarbach
Copy link
Author

comment:38

Sorry for taking so long with the patch, but I'm having a hard time learning Mercurial queues.

@eviatarbach
Copy link
Author

comment:39

Merged the two files, other fixes. Only one patch to apply now!

@eviatarbach

This comment has been minimized.

@eviatarbach
Copy link
Author

Changed reviewer from Nicolas M. Thiéry to Nicolas M. Thiéry, Nathann Cohen

@nathanncohen
Copy link
Mannequin

nathanncohen mannequin commented Jul 1, 2011

comment:42

Hello !!

First, it is my mistake as it appeared in my previous patch, but clearly the second doctest should be changed :-)

if not True:

Something like

D.an_element() in D

Would be much better !

Then, do you get a warning when you build the documentation ? Here's what I get :

/home/ncohen/sage/devel/sage-4$ sage -docbuild reference html
...
reading sources... [100%] sage/combinat/debruijn_sequence                                                                                                                                                                                                                        
docstring of sage.combinat.debruijn_sequence:35: (WARNING/2) Inline interpreted text or phrase reference start-string without end-string.

But sometimes this kind of bug just comes from having built the documentation many times while adding/removing files from it... O_o

Nathann

@eviatarbach
Copy link
Author

comment:44

I can never get it to build the documentation properly. It ignores debruijn_sequences.pyx for some reason.

Do you have any idea what that warning could mean?

@nathanncohen
Copy link
Mannequin

nathanncohen mannequin commented Jul 2, 2011

comment:45

No idea :-/

I've never been friends with Sphinx. When this kind of things happen, I usually swear a lot and reinstall Sage from scratch ^^;

Nathann

@nthiery
Copy link
Contributor

nthiery commented Jul 2, 2011

comment:46

Replying to @nathanncohen:

No idea :-/

I've never been friends with Sphinx. When this kind of things happen, I usually swear a lot and reinstall Sage from scratch ^^;

Deleting doc/output (or sometimes just doc/output/doctrees is still quite brute force but usually works.

Just to make sure: debruijn_sequences.pyx is referenced in doc/en/reference/combinat/index.rst?

@eviatarbach
Copy link
Author

comment:47

Nathann, I can relate. That's what I'm going to do now. Unfortunately my computer takes ages to build the documentation.

debruijn_sequence.pyx is referenced in index.rst.

@eviatarbach
Copy link
Author

comment:48

Hello Nathann,

I tried building the documentation, and I'm getting the same warning. Do you think it would be okay to submit it anyways?

What's the point of this test?

sage: for n in range(1, 7):

... for k in range(1, 7):
... D = DeBruijnSequences(k, n)
... if not True:
... print "Something's dead wrong (n = ", n, "k =", k, ") !"
... break

I don't see how this can fail.

@nathanncohen
Copy link
Mannequin

nathanncohen mannequin commented Jul 8, 2011

comment:49

Yooooo !

I tried building the documentation, and I'm getting the same warning. Do you think it would be okay to submit it anyways?

No, it has to be fixed ! The release manager will set it back to needs/work otherwise anyway ;-)

What's the point of this test?

As I trold you during my latest review, the

if not True:

Should be repaced by

D.an_element() in D

The aim is to check that all the small DeBruijnSequences that the code return are indeed valid :-)

Nathann

@nthiery
Copy link
Contributor

nthiery commented Jul 11, 2011

comment:50

Replying to @nathanncohen:

D.an_element() in D

For the record: this is among the generic sanity checks that TestSuite(D).run() does.

@eviatarbach
Copy link
Author

comment:51

I fixed the docbuild warning (had to quarantine successive parts of the documentation in order to isolate the problem!). It was due to the whitespace before the URL for the Wikipedia link.

I also fixed the doctest.

@eviatarbach
Copy link
Author

Attachment: 10530.patch.gz

@nathanncohen
Copy link
Mannequin

nathanncohen mannequin commented Jul 12, 2011

comment:52

yeahhhhhhhhhhhhhh !

Positive review to this patch :-)

Nice job debugging the Sphinx warning. I'll keep that in mind for my future patch : I very often spend 10 minutes on each of those ;-)

Nathann

@eviatarbach
Copy link
Author

comment:53

Yes! Finally :)

@eviatarbach
Copy link
Author

comment:54

Thank you Nathann and Nicolas for reviewing!

@jdemeyer jdemeyer modified the milestones: sage-4.7.1, sage-4.7.2 Jul 13, 2011
@fchapoton
Copy link
Contributor

comment:56

for the bot

Apply 10530.patch

@jdemeyer
Copy link

Merged: sage-4.7.2.alpha2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants