Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IGoR sequence generation - random number generation issue #1

Closed
sinkovit opened this issue Feb 17, 2018 · 2 comments
Closed

IGoR sequence generation - random number generation issue #1

sinkovit opened this issue Feb 17, 2018 · 2 comments

Comments

@sinkovit
Copy link

I’ve been using the IGoR software to generate simulated T cell beta chain repertoires and ran into an issue that I believe is related to the random number generation. My colleagues at Vanderbilt University Medical Center, Cinque Soto and James Crowe, suggested I get in touch with you.

We need to generate very large repertoires to compare with the results obtained from deep immunome sequencing being carried out at Vanderbilt. When I was processing IGoR output, I noticed that the same results are generated every 52,895,649 sequences. We discovered this when calculating the number of unique clonotypes as a function of the number of sequences – after 52M+ reads, the number of clonotypes was unchanged.

Here’s the igor command line that we’ve been using. Note that I tried this with two different seeds and get the same behavior and same cycle length (52,895,649 sequences).

igor -threads -1 -set_wd $PWD -batch tcr_beta -species human -chain beta
-generate 1100000000 --CDR3 --seed 12345678 --noerr

For reasons that I don’t understand, the first few sequences are not repeated, but after a short time the random number generator settles into a predictable cycle.

The example grep output below shows how the 750th sequence is repeated. As you probably know, the “-n” option to grep prefixes the output with the line number for the matching line. You’ll notice that the difference in line numbers is always a multiple of 52,895,649 (e.g. 1,057,913,732 - 752 = 52,895,649 x 20)

$ grep -n 'TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT' \ generated_seqs_noerr_CDR3_info.csv

752:750,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0
52896401:52896399,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0
105792050:105792048,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0
158687699:158687697,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0
211583348:211583346,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0
264478997:264478995,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0
317374646:317374644,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0
370270295:370270293,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0
423165944:423165942,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0
476061593:476061591,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0
528957242:528957240,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0
581852891:581852889,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0
634748540:634748538,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0
687644189:687644187,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0
740539838:740539836,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0
793435487:793435485,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0
846331136:846331134,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0
899226785:899226783,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0
952122434:952122432,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0
1005018083:1005018081,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0
1057913732:1057913730,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0

$ grep -n '(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2)' \ generated_realizations_noerr.csv

752:750;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2)
52896401:52896399;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2)
105792050:105792048;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2)
158687699:158687697;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2)
211583348:211583346;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2)
264478997:264478995;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2)
317374646:317374644;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2)
370270295:370270293;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2)
423165944:423165942;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2)
476061593:476061591;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2)
528957242:528957240;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2)
581852891:581852889;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2)
634748540:634748538;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2)
687644189:687644187;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2)
740539838:740539836;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2)
793435487:793435485;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2)
846331136:846331134;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2)
899226785:899226783;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2)
952122434:952122432;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2)
1005018083:1005018081;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2)
1057913732:1057913730;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2)

Any help would be greatly appreciated. Let me add that this is fantastic software. We’ve been looking for a rigorous way to generate simulated repertoires and we’re convinced that IGoR is the right way to do this.

@qmarcou
Copy link
Owner

qmarcou commented Feb 19, 2018

Hi @sinkovit ,
Thank your very much for these positive comments!
Thank you for this detailed bug report: This indeed seems to be related to the random engine that was used by IGoR. The C++ standard library default_random_engine is a 32bits linear congruential engine whose state space was indeed too small (thus creating a periodic sequence generation) for large sequence simulations (as generating a sequence already requires drawing ~30 random numbers).
I have now updated it to a 64 bits mersenne twister generator which period is ~10^6000, this should leave us some time before having hard disks able to accept such large datasets! After a few tests, this changes even seem to improve (very slightly though as hard disk writing is most likely the bottleneck) IGoR's sequence generation time.
I have pushed the modifications in the dev branch and leave this issue open until I incorporate further polishing changes (such as creation of a good quality seed) from: http://www0.cs.ucl.ac.uk/staff/D.Jones/GoodPracticeRNG.pdf
This changes will be incorporated in the next release. Please tell me if this does not solve your problem.

PS: although I guess you did it for debuging purposes, one should probably refrain to chose actually chose the seed (in your case 12345678) as it will most likely not exploit all available 64bits. Note that is is however mostly a problem if you're running many different simulations.

@qmarcou qmarcou self-assigned this Feb 19, 2018
@qmarcou
Copy link
Owner

qmarcou commented Apr 18, 2018

This issue is now fully fixed:

  • the used random engine is now a 64 bits mersenne twister, which period is gigantic as stated above
  • proper seeding via 64 bits seed is done using systems random device (e.g /dev/urandom) via the C++ random_device object
  • in case such random device is not available, I have added an alternative quite chaotic random seed constructor (probably of much lesser quality) by combining date, some random runtime, process PID and some somewhat random bit levels operation

All this is already available on the dev branch and will constitute part of IGoR's v1.2.0 next release.

@qmarcou qmarcou closed this as completed Apr 18, 2018
alfaceor added a commit to alfaceor/IGoR that referenced this issue Dec 15, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants