Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add hp encoding for proteins #758

Merged
merged 50 commits into from
Nov 22, 2019
Merged
Show file tree
Hide file tree
Changes from 49 commits
Commits
Show all changes
50 commits
Select commit Hold shift + click to select a range
3fedf34
add check sq false
pranathivemuri Oct 10, 2019
f9211e8
Update tenx.py
pranathivemuri Oct 10, 2019
f761b54
Add whole sequence by delimiter
pranathivemuri Oct 11, 2019
9e19dc6
add delimiter if not present
pranathivemuri Oct 11, 2019
cc5d793
edit a comment
pranathivemuri Oct 11, 2019
d67968e
update the comment merge conflict
pranathivemuri Oct 11, 2019
89013a0
save_fastas flag
pranathivemuri Oct 11, 2019
fc41944
save_fasta
pranathivemuri Oct 11, 2019
5bed129
remove check seq false
pranathivemuri Oct 11, 2019
8f58071
add test
pranathivemuri Oct 11, 2019
edeee41
fix the comma in the test
pranathivemuri Oct 11, 2019
ee93292
delimiter already exists
pranathivemuri Oct 11, 2019
c99acb4
missing barcode name
pranathivemuri Oct 11, 2019
f652f41
add a test, correct the screed fasta path
pranathivemuri Oct 11, 2019
4c4309e
remove umi > than symbols in the sequence output due to missing next …
pranathivemuri Oct 11, 2019
722732d
fix test
pranathivemuri Oct 11, 2019
050cfc6
fix test
pranathivemuri Oct 11, 2019
52df53c
take mins generator out
pranathivemuri Oct 11, 2019
228b6c7
Write a note, and comment out the test line that randomly fails
pranathivemuri Oct 11, 2019
bfb7461
save_fastas dir, not true, false flag
pranathivemuri Oct 14, 2019
6afc0a3
Merge branch 'master' into master
olgabot Oct 21, 2019
5247f6c
convert fasta file saving to per barcode_umi
pranathivemuri Oct 21, 2019
aa61b08
add more tests
pranathivemuri Oct 21, 2019
a0fe290
Merge branch 'master' of https://github.com/czbiohub/sourmash
pranathivemuri Oct 21, 2019
55d6d1f
Add more tests
pranathivemuri Oct 21, 2019
2f3bbd4
fix test
pranathivemuri Oct 21, 2019
ebf2b30
Merge branch 'master' into master
pranathivemuri Oct 21, 2019
e8a5860
make tests python 2.7 compatible
pranathivemuri Oct 21, 2019
4c294ea
Merge branch 'master' of https://github.com/czbiohub/sourmash
pranathivemuri Oct 21, 2019
a8a4452
Merge branch 'master' into master
luizirber Oct 22, 2019
e9f8032
add hp encoding
pranathivemuri Oct 22, 2019
01134ee
Merge remote-tracking branch 'upstream/master'
pranathivemuri Oct 22, 2019
86689e6
remove double tranlate to codon
pranathivemuri Oct 22, 2019
920e547
missing comma
pranathivemuri Oct 22, 2019
185860e
Merge remote-tracking branch 'upstream/master'
pranathivemuri Oct 22, 2019
31abf84
fix tests
pranathivemuri Oct 22, 2019
6d00bdd
add a test
pranathivemuri Oct 22, 2019
759682b
Merge branch 'master' into master
pranathivemuri Oct 23, 2019
69396fc
Merge branch 'master' into master
luizirber Oct 23, 2019
5f1386e
Assert correctly
pranathivemuri Oct 23, 2019
37d67a8
Merge branch 'master' of https://github.com/czbiohub/sourmash
pranathivemuri Oct 23, 2019
7a75c9f
add paper ref for hp encoding
pranathivemuri Oct 24, 2019
f9bd45f
Merge branch 'master' into master
pranathivemuri Nov 2, 2019
a3550d3
Define 'n' before using it
olgabot Nov 2, 2019
859bf2e
add bam2fasta for requirements
olgabot Nov 2, 2019
7d5bec2
Merge pull request #4 from czbiohub/olgabot-patch-1
pranathivemuri Nov 4, 2019
ba1530f
Merge pull request #5 from czbiohub/olgabot-patch-2
pranathivemuri Nov 4, 2019
72d1f75
Edit citation, add explanation for hp
pranathivemuri Nov 5, 2019
bc9a050
Merge branch 'master' of https://github.com/czbiohub/sourmash
pranathivemuri Nov 5, 2019
c6b993a
Merge branch 'master' into master
luizirber Nov 22, 2019
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions requirements.txt
Original file line number Diff line number Diff line change
Expand Up @@ -12,3 +12,4 @@ sphinxcontrib-napoleon
setuptools_scm
setuptools_scm_git_archive
nbsphinx
bam2fasta
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe I should remove this, should we still keep bam2fasta optional? @olgabot @luizirber

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll leave this up to @luizirber. I'd personally prefer to keep it there since the functionality will break if it is not installed, but not everyone needs the bam to fasta conversion.

7 changes: 5 additions & 2 deletions sourmash/_minhash.pxd
Original file line number Diff line number Diff line change
Expand Up @@ -26,17 +26,19 @@ cdef extern from "kmer_min_hash.hh":
const unsigned int ksize;
const bool is_protein;
const bool dayhoff;
const bool hp;
const HashIntoType max_hash;
CMinHashType mins;

KmerMinHash(unsigned int, unsigned int, bool, bool, uint32_t, HashIntoType)
KmerMinHash(unsigned int, unsigned int, bool, bool, bool, uint32_t, HashIntoType)
void add_hash(HashIntoType) except +ValueError
void remove_hash(HashIntoType) except +ValueError
void add_word(const string& word) except +ValueError
void add_word(const char * word) except +ValueError
void add_sequence(const string&, bool) except +ValueError
void merge(const KmerMinHash&) except +ValueError
string aa_to_dayhoff(string aa) except +ValueError
string aa_to_hp(string aa) except +ValueError
string translate_codon(string codon) except +ValueError
unsigned int count_common(const KmerMinHash&) except +ValueError
unsigned long size()
Expand All @@ -45,7 +47,7 @@ cdef extern from "kmer_min_hash.hh":
cdef cppclass KmerMinAbundance(KmerMinHash):
CMinHashType abunds;

KmerMinAbundance(unsigned int, unsigned int, bool, bool, uint32_t, HashIntoType)
KmerMinAbundance(unsigned int, unsigned int, bool, bool, bool, uint32_t, HashIntoType)
void add_hash(HashIntoType) except +ValueError
void remove_hash(HashIntoType) except +ValueError
void add_word(string word) except +ValueError
Expand All @@ -54,6 +56,7 @@ cdef extern from "kmer_min_hash.hh":
void merge(const KmerMinAbundance&) except +ValueError
void merge(const KmerMinHash&) except +ValueError
string aa_to_dayhoff(string aa) except +ValueError
string aa_to_hp(string aa) except +ValueError
string translate_codon(string codon) except +ValueError
unsigned int count_common(const KmerMinAbundance&) except +ValueError
unsigned long size()
Expand Down
44 changes: 35 additions & 9 deletions sourmash/_minhash.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -93,6 +93,7 @@ cdef class MinHash(object):
def __init__(self, unsigned int n, unsigned int ksize,
bool is_protein=False,
bool dayhoff=False,
bool hp=False,
bool track_abundance=False,
uint32_t seed=MINHASH_DEFAULT_SEED,
HashIntoType max_hash=0,
Expand All @@ -112,9 +113,9 @@ cdef class MinHash(object):

cdef KmerMinHash *mh = NULL
if track_abundance:
mh = new KmerMinAbundance(n, ksize, is_protein, dayhoff, seed, max_hash)
mh = new KmerMinAbundance(n, ksize, is_protein, dayhoff, hp, seed, max_hash)
else:
mh = new KmerMinHash(n, ksize, is_protein, dayhoff, seed, max_hash)
mh = new KmerMinHash(n, ksize, is_protein, dayhoff, hp, seed, max_hash)

self._this.reset(mh)

Expand All @@ -128,6 +129,7 @@ cdef class MinHash(object):
def __copy__(self):
a = MinHash(deref(self._this).num, deref(self._this).ksize,
deref(self._this).is_protein, deref(self._this).dayhoff,
deref(self._this).hp,
self.track_abundance,
deref(self._this).seed, deref(self._this).max_hash)
a.merge(self)
Expand All @@ -142,23 +144,24 @@ cdef class MinHash(object):
deref(self._this).ksize,
deref(self._this).is_protein,
deref(self._this).dayhoff,
deref(self._this).hp,
self.get_mins(with_abundance=with_abundance),
None, self.track_abundance, deref(self._this).max_hash,
deref(self._this).seed)

def __setstate__(self, tup):
(n, ksize, is_protein, dayhoff, mins, _, track_abundance, max_hash, seed) =\
(n, ksize, is_protein, dayhoff, hp, mins, _, track_abundance, max_hash, seed) =\
tup

self._track_abundance = track_abundance

cdef KmerMinHash *mh = NULL
if track_abundance:
mh = new KmerMinAbundance(n, ksize, is_protein, dayhoff, seed, max_hash)
mh = new KmerMinAbundance(n, ksize, is_protein, dayhoff, hp, seed, max_hash)
self._this.reset(mh)
self.set_abundances(mins)
else:
mh = new KmerMinHash(n, ksize, is_protein, dayhoff, seed, max_hash)
mh = new KmerMinHash(n, ksize, is_protein, dayhoff, hp, seed, max_hash)
self._this.reset(mh)
self.add_many(mins)

Expand All @@ -168,6 +171,7 @@ cdef class MinHash(object):
deref(self._this).ksize,
deref(self._this).is_protein,
deref(self._this).dayhoff,
deref(self._this).hp,
self.track_abundance,
deref(self._this).seed,
deref(self._this).max_hash,
Expand All @@ -182,7 +186,7 @@ cdef class MinHash(object):
def copy_and_clear(self):
a = MinHash(deref(self._this).num, deref(self._this).ksize,
deref(self._this).is_protein, deref(self._this).dayhoff,
self.track_abundance,
deref(self._this).hp, self.track_abundance,
deref(self._this).seed, deref(self._this).max_hash)
return a

Expand Down Expand Up @@ -247,6 +251,10 @@ cdef class MinHash(object):
def dayhoff(self):
return deref(self._this).dayhoff

@property
def hp(self):
return deref(self._this).hp

@property
def ksize(self):
return deref(self._this).ksize
Expand All @@ -271,7 +279,7 @@ cdef class MinHash(object):

if v:
mh = new KmerMinAbundance(self.num, self.ksize, self.is_protein,
self.dayhoff, self.seed, self.max_hash)
self.dayhoff, self.hp, self.seed, self.max_hash)
self._this.reset(mh)

# At this point, if we are changing from track_abundance=True to False,
Expand All @@ -294,6 +302,7 @@ cdef class MinHash(object):

a = MinHash(new_num, deref(self._this).ksize,
deref(self._this).is_protein, deref(self._this).dayhoff,
deref(self._this).hp,
self.track_abundance,
deref(self._this).seed, 0)
if self.track_abundance:
Expand Down Expand Up @@ -326,6 +335,7 @@ cdef class MinHash(object):

a = MinHash(0, deref(self._this).ksize,
deref(self._this).is_protein, deref(self._this).dayhoff,
deref(self._this).hp,
self.track_abundance,
deref(self._this).seed, new_max_hash)
if self.track_abundance:
Expand All @@ -348,6 +358,7 @@ cdef class MinHash(object):
deref(self._this).ksize,
deref(self._this).is_protein,
deref(self._this).dayhoff,
deref(self._this).hp,
deref(self._this).seed,
deref(self._this).max_hash)

Expand All @@ -356,6 +367,7 @@ cdef class MinHash(object):
deref(self._this).ksize,
deref(self._this).is_protein,
deref(self._this).dayhoff,
deref(self._this).hp,
deref(self._this).seed,
deref(self._this).max_hash)

Expand Down Expand Up @@ -467,17 +479,25 @@ cdef class MinHash(object):
raise ValueError("cannot add amino acid sequence to DNA MinHash!")

aa_kmers = (sequence[i:i + ksize] for i in range(0, len(sequence) - ksize + 1))
if not self.dayhoff:
if not self.dayhoff and not self.hp:
for aa_kmer in aa_kmers:
deref(self._this).add_word(to_bytes(aa_kmer))
else:
elif self.dayhoff:
for aa_kmer in aa_kmers:
dayhoff_kmer = ''
for aa in aa_kmer:
dayhoff_letter = deref(self._this).aa_to_dayhoff(to_bytes(aa))
dayhoff_kmer += dayhoff_letter
# dayhoff_kmer = ''.join( for aa in aa_kmer)
deref(self._this).add_word(to_bytes(dayhoff_kmer))
else:
for aa_kmer in aa_kmers:
hp_kmer = ''
for aa in aa_kmer:
hp_letter = deref(self._this).aa_to_hp(to_bytes(aa))
hp_kmer += hp_letter
# hp_kmer = ''.join( for aa in aa_kmer)
deref(self._this).add_word(to_bytes(hp_kmer))

def is_molecule_type(self, molecule):
if molecule.upper() == 'DNA' and not self.is_protein:
Expand All @@ -489,4 +509,10 @@ cdef class MinHash(object):
else:
if molecule == 'protein':
return True
if self.hp:
if molecule == 'hp':
return True
else:
if molecule == 'protein':
return True
return False
23 changes: 22 additions & 1 deletion sourmash/command_compute.py
Original file line number Diff line number Diff line change
Expand Up @@ -152,6 +152,10 @@ def compute(args):
notify('Computing both nucleotide and Dayhoff-encoded protein '
'signatures.')
num_sigs = 2*len(ksizes)
elif args.dna and args.hp:
notify('Computing both nucleotide and Hp-encoded protein '
'signatures.')
num_sigs = 2*len(ksizes)
elif args.dna:
notify('Computing only nucleotide (and not protein) signatures.')
num_sigs = len(ksizes)
Expand All @@ -162,8 +166,12 @@ def compute(args):
notify('Computing only Dayhoff-encoded protein (and not nucleotide) '
'signatures.')
num_sigs = len(ksizes)
elif args.hp:
notify('Computing only hp-encoded protein (and not nucleotide) '
'signatures.')
num_sigs = len(ksizes)

if (args.protein or args.dayhoff) and not args.input_is_protein:
if (args.protein or args.dayhoff or args.hp) and not args.input_is_protein:
bad_ksizes = [ str(k) for k in ksizes if k % 3 != 0 ]
if bad_ksizes:
error('protein ksizes must be divisible by 3, sorry!')
Expand All @@ -190,6 +198,7 @@ def make_minhashes():
E = MinHash(ksize=k, n=args.num_hashes,
is_protein=True,
dayhoff=False,
hp=False,
track_abundance=args.track_abundance,
scaled=args.scaled,
seed=seed)
Expand All @@ -198,6 +207,16 @@ def make_minhashes():
E = MinHash(ksize=k, n=args.num_hashes,
is_protein=True,
dayhoff=True,
hp=False,
track_abundance=args.track_abundance,
scaled=args.scaled,
seed=seed)
Elist.append(E)
if args.hp:
E = MinHash(ksize=k, n=args.num_hashes,
is_protein=True,
dayhoff=False,
hp=True,
track_abundance=args.track_abundance,
scaled=args.scaled,
seed=seed)
Expand All @@ -206,6 +225,7 @@ def make_minhashes():
E = MinHash(ksize=k, n=args.num_hashes,
is_protein=False,
dayhoff=False,
hp=False,
track_abundance=args.track_abundance,
scaled=args.scaled,
seed=seed)
Expand Down Expand Up @@ -342,6 +362,7 @@ def save_siglist(siglist, output_fp, filename=None):
# make minhashes for the whole file
Elist = make_minhashes()

n = 0
total_seq = 0
for filename in args.filenames:
# consume & calculate signatures
Expand Down
12 changes: 10 additions & 2 deletions sourmash/commands.py
Original file line number Diff line number Diff line change
Expand Up @@ -1002,14 +1002,22 @@ def watch(args):
moltype = 'DNA'
is_protein = False
dayhoff = False
hp = False
elif args.protein:
moltype = 'protein'
is_protein = True
dayhoff = False
else:
hp = False
elif args.dayhoff:
moltype = 'dayhoff'
is_protein = True
dayhoff = True
hp = False
else:
moltype = 'hp'
is_protein = True
dayhoff = False
hp = True

tree = load_sbt_index(args.sbt_name)

Expand All @@ -1020,7 +1028,7 @@ def watch(args):
tree_mh = leaf.data.minhash
ksize = tree_mh.ksize

E = MinHash(ksize=ksize, n=args.num_hashes, is_protein=is_protein, dayhoff=dayhoff)
E = MinHash(ksize=ksize, n=args.num_hashes, is_protein=is_protein, dayhoff=dayhoff, hp=hp)
streamsig = sig.SourmashSignature(E, filename='stdin', name=args.name)

notify('Computing signature for k={}, {} from stdin', ksize, moltype)
Expand Down
Loading