Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fermi assembler issue #16

Open
Ahhgust opened this issue Mar 1, 2017 · 1 comment
Open

Fermi assembler issue #16

Ahhgust opened this issue Mar 1, 2017 · 1 comment

Comments

@Ahhgust
Copy link

Ahhgust commented Mar 1, 2017

I'm having some with the Fermi assembler. I've tried various versions of the attached code; none manage to produce any contigs. Further, calling GetSequences() after calling PerformAssembly produces mangled strings (e.g., ascii value of 4 for an 'A'), which is probably a bad sign!

Here's the (very rough) source:
#include
#include
#include "SeqLib/BamReader.h"
#include "SeqLib/FastqReader.h"
#include "SeqLib/BWAWrapper.h"
#include "SeqLib/FermiAssembler.h"
#include "SeqLib/UnalignedSequence.h"

using namespace SeqLib;
using namespace std;

const string REFGENOME = "chrAll.standardChroms.fa";

int
main(int argc, char **argv) {

/* Mapping w/ BWA
BWAWrapper bwa;

bwa.LoadIndex(REFGENOME);

bwa.SetGapOpen(4);
bwa.SetGapExtension(1);
bwa.Set3primeClippingPenalty(1000);
bwa.Set5primeClippingPenalty(1000);

BamRecordVector results;

for (string line; getline(cin, line) ; ) {
bwa.AlignSequence(line, "m", results, false, 0.9, 0);
}
*/

if (argc < 2) {
cerr << "I need a fastq file to work!\n" << endl;
return 1;
}

FermiAssembler f;
FastqReader q(argv[1]);
UnalignedSequence s;

while (q.GetNextSequence(s) )
f.AddRead(s);

UnalignedSequenceVector seqs = f.GetSequences();
cout << seqs.size() << " also is the size!\n";
for (unsigned i = 0; i < seqs.size(); ++i)
cout << seqs[i].Seq << endl;

f.SetMinOverlap(10);
f.SetKmerMinThreshold(1);
f.SetKmerMaxThreshold(1000);

f.PerformAssembly();

// retrieve the contigs
std::vectorstd::string contigs = f.GetContigs();
cerr << contigs.size() << " contigs made! " << endl;

seqs = f.GetSequences();
const char *seq = seqs[0].Seq.c_str();
unsigned len = seqs[0].Seq.length();

cout << len << " is the length...? " << ((int)seq[0]) << " is the first char" << endl;

return 0;
}

Here's the fastq file (please forgive the copy pasting!):
@M01451:39:000000000-AGNHG:1:1101:18315:2050 1:N:0:25
ATTGATCTATCTGTCTGTCTGTCTGTCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCCATCTATCCATCCATCCTATGTATT

  • F vWA
    GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFDFFGGGGEEFGFGGGGGGGGGGGGGGGGGGGGGFF
    @M01451:39:000000000-AGNHG:1:1101:15489:2067 1:N:0:25
    ATTGATCTATCTGTCTGTCTGTCTGTCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCCATCTATCCATCCATCCTATGTATT
  • F vWA
    CEGGCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
    @M01451:39:000000000-AGNHG:1:1101:11753:2476 1:N:0:25
    ATTGATCTATCTGTCTGTCTGTCTGTCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCCATCTATCCATCCATCCTATGTATT
  • F vWA
    GGGGFFGGGGFGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGFGGGGGGGGGFECFGGGFGGFFGGGGGGGGGFGGGGGGGGGGGG
    @M01451:39:000000000-AGNHG:1:1101:8601:2804 1:N:0:25
    ATTGATCTATCTGTCTGTCTGTCTGTCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCCATCTATCCATCCATCCTATGTATT
  • F vWA
    GCGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGFGGGFFGFGGGGGGGGGGGGGGFGGGGGGFG
    @M01451:39:000000000-AGNHG:1:1101:19616:3182 1:N:0:25
    ATTGATCTATCTGTCTGTCTGTCTGTCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCCATCTATCCATCCATCCTATGTATT
  • F vWA
    GGGGGDGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGFGGGGGGGGGGGEFFGGGGEGG
    @M01451:39:000000000-AGNHG:1:1101:12281:3408 1:N:0:25
    ATTGATCTATCTGTCTGTCTGTCTGTCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCCATCTATCCATCCATCCTATGTATT
  • F vWA
    GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGFGFGGGGGGGGG
    @M01451:39:000000000-AGNHG:1:1101:9468:3637 1:N:0:25
    ATTGATCTATCTGTCTGTCTGTCTGTCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCCATCTATCCATCCATCCTATGTATT
  • F vWA
    FFFGFGGGGGGEFE<EGGGGGGFEGGGGGGGFGGGFFGGDFCFFAFGGGGDEEFGGFFGGGGGGFFFCFFCFDFGCG@EFFDFGGFGGGGGGGGGG
    @M01451:39:000000000-AGNHG:1:1101:15793:4032 1:N:0:25
    ATTGATCTATCTGTCTGTCTGTCTGTCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCCATCTATCCATCCATCCTATGTATT
  • F vWA
    C6C@,CFFEFGGFFEFEGGGCFEF<EGGCGFFGGDFFFFGFGGGGGGGGFFGFGFGCFGGGGGGBFFFGFGGGGFFGGGGGGGGGGGGFGGGGGGGGGFCFFFG
@walaj
Copy link
Owner

walaj commented Mar 3, 2017

Thank you for the detailed report. I was able to recreate the issue, and see two things to address.

The first is easy enough. fermi-lite actually deallocates the input sequences after assembly. This is why calling GetSequences afterward is giving garbled answers. I need to 1) make this clear in the documentation and 2) have GetSequences return a zero-length vector if the sequences have been deallocated from the FermiAssembler object.

The second, the failure to assemble contigs in this example, is more baffling to me. I think this is more in the scope of figuring out the right parameter settings in fermi-lite for this assembly. I'll look into this more, but this will likely take longer to address.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants