Large number of sites overflows in parser #8

Closed
jasondk opened this issue May 11, 2015 · 6 comments

Comments

@jasondk

jasondk commented May 11, 2015

In axml.h, the rawdata->sites variable is defined as type int. Attempting to compress an alignment with about 3B positions is resulting in a "too few sites" error, presumably because we are overflowing the int. We will also have more than 32k site patterns after compression, and some of these will occur more than 32k times in the dataset - so we will still be causing overflows in the site/alias indexes even after changing sites to long long int. Is there a quick fix for this? Thanks!
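For context, here is a minimal C sketch of the failure mode described above, assuming a platform where `int` is 32 bits wide. The helper names `parse_site_count` and `fits_in_int` are hypothetical and are not taken from the ExaML sources. The sketch only shows that the reported site count cannot be held in an `int` and has to be read into a 64-bit type.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns 1 if the value can be stored in a plain int. */
static int fits_in_int(int64_t sites)
{
  return sites >= INT_MIN && sites <= INT_MAX;
}

/* Hypothetical helper: parse a site count from text into a 64-bit integer.
   Returns 0 on success, -1 if the text is not a positive number in range. */
static int parse_site_count(const char *text, int64_t *out)
{
  char *end;
  long long value;

  assert(text != NULL && out != NULL);

  errno = 0;
  value = strtoll(text, &end, 10);

  /* Reject empty input, out-of-range input, and non-positive counts. */
  if (end == text || errno == ERANGE || value <= 0)
    return -1;

  *out = (int64_t)value;
  return 0;
}
```

With the 3,036,303,846 sites of this dataset, `parse_site_count` succeeds, and `fits_in_int` returns 0 on such a platform because `INT_MAX` is 2,147,483,647 there.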

@stamatak
Owner

stamatak commented May 20, 2015

I'll try to fix this soon. I don't think the fix is easy if you
don't know the ExaML code well. In the future, please report bugs via
the RAxML Google group so that all users are aware of potential
problems.

Alexis

Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University
of Arizona at Tucson

www.exelixis-lab.org

@stamatak
Owner

stamatak commented May 26, 2015

Dear Jason,

I think that I have fixed it but I need access to the dataset for testing.

Cheers,

Alexis

@jasondk
Author

jasondk commented May 26, 2015

Hey Alexis,

Thanks so much and sorry for the delay in responding. You can download the compressed dataset (5GB, sorry!) here http://hyperion.ucalgary.ca/example.phy.bz2. I’ll leave the link up for a couple of days. If you have a problem downloading it, you could just simulate a similar dataset. The dimensions are 7 OTUs and 3,036,303,846 sites with very little divergence (most of this will compress out if indexing site patterns).

Best wishes,

  • Jason

A.P. Jason de Koning, Ph.D.

Assistant Professor
University of Calgary, Faculty of Medicine
and Alberta Children's Hospital Research Institute for Child and Maternal Health
Dept. of Biochemistry and Molecular Biology
Dept. of Medical Genetics

Health Sciences Centre 1150 Suite
3330 Hospital Drive N.W.
Calgary, Alberta T2N 4N1 Canada

Office: 403-210-7638 | Fax: 403-270-8928
Email: jason.dekoning@ucalgary.ca
Web: http://lab.jasondk.io
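
Indexing site patterns, as mentioned in the comment above, can be sketched as follows. This is a simplified illustration and not ExaML's parser: the function `compress_patterns`, its column-by-column memory layout, and the linear search are assumptions chosen for brevity (a real parser would more likely hash or sort the columns). The point is that per-pattern counts are kept in 64-bit integers, because with roughly 3 billion sites and few distinct patterns a single count can exceed what a narrower integer type holds.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: collapse identical alignment columns into patterns.
   cols         : nsites columns stored back to back, ntaxa characters each
   first_site[] : receives the index of the first site showing each pattern
   weight[]     : receives how many sites show each pattern (64-bit counts)
   Returns the number of unique patterns, or -1 if max_patterns is exceeded. */
static int64_t compress_patterns(const char *cols, int64_t nsites, size_t ntaxa,
                                 int64_t *first_site, int64_t *weight,
                                 int64_t max_patterns)
{
  int64_t npat = 0;

  assert(cols != NULL && first_site != NULL && weight != NULL && ntaxa > 0);

  for (int64_t s = 0; s < nsites; s++) {
    const char *col = cols + (size_t)s * ntaxa;
    int64_t p;

    /* Linear search over the patterns seen so far. */
    for (p = 0; p < npat; p++)
      if (memcmp(col, cols + (size_t)first_site[p] * ntaxa, ntaxa) == 0)
        break;

    if (p == npat) {
      /* New pattern: register it, unless the caller's buffers are full. */
      if (npat == max_patterns)
        return -1;
      first_site[npat] = s;
      weight[npat] = 0;
      npat++;
    }

    weight[p]++;
  }

  return npat;
}
```

For a toy alignment of 7 taxa and 4 sites in which three columns are identical, the function reports 2 patterns with weights 3 and 1.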

@stamatak
Owner

stamatak commented May 28, 2015

Hi Jason,

The modified parser works now. How quickly do you need the fix?

I am in the middle of a larger re-design, so the code with the fixed
parser is not ready for release yet.

Below is the output of the parser. Does that look right? It looks rather
weird to me.

Alexis

Pattern compression: ON

Alignment has 200630281 completely undetermined sites that will be
automatically removed from the binary alignment file

Your alignment has 5956 unique patterns

Under CAT the memory required by ExaML for storing CLVs and tip vectors
will be
1375836 bytes
1343 kiloBytes
1 MegaBytes
0 GigaBytes

Under GAMMA the memory required by ExaML for storing CLVs and tip
vectors will be
5378268 bytes
5252 kiloBytes
5 MegaBytes
0 GigaBytes

Please note that, these are just the memory requirements for doing
likelihood calculations!
To be on the safe side, we recommend that you execute ExaML on a system
with twice that memory.

Binary and compressed alignment file written to file HUGE.binary

Parsing completed, exiting now ...

@jasondk
Author

jasondk commented May 29, 2015

Hey Alexis, this looks approximately correct to me. We’d previously run just the variable sites from this dataset and had similar results. Can you possibly make the binary output of the parser for this dataset available to us for download? Or allow us access to the revised parser? This is for the last piece of a student project that is otherwise complete. Thanks! Jason

@stamatak
Owner

Just sent the code to your university email.

alexis
