# 3.4. Set cardinality with HyperLogLog

## Set cardinality


Let's revisit our beta-lactamase genes:

In [None]:
import nbimporter
import Background as utils

# These AMR genes are downloaded from the ResFinder database (accessed 05/2017):
## blaIMP-1_1_DQ522237
geneA = "ATGAGCAAGTTATCTGTATTCTTTATATTTTTGTTTTGCAGCATTGCTACCGCAGCAGAGTCTTTGCCAGATTTAAAAATTGAAAAGCTTGATGAAGGCGTTTATGTTCATACTTCGTTTAAAGAAGTTAACGGGTGGGGCGTTGTTCCTAAACATGGTTTGGTGGTTCTTGTAAATGCTGAGGCTTACCTAATTGACACTCCATTTACGGCTAAAGATACTGAAAAGTTAGTCACTTGGTTTGTGGAGCGTGGCTATAAAATAAAAGGCAGCATTTCCTCTCATTTTCATAGCGACAGCACGGGCGGAATAGAGTGGCTTAATTCTCGATCTATCCCCACGTATGCATCTGAATTAACAAATGAACTGCTTAAAAAAGACGGTAAGGTTCAAGCCACAAATTCATTTAGCGGAGTTAACTATTGGCTAGTTAAAAATAAAATTGAAGTTTTTTATCCAGGCCCGGGACACACTCCAGATAACGTAGTGGTTTGGTTGCCTGAAAGGAAAATATTATTCGGTGGTTGTTTTATTAAACCGTACGGTTTAGGCAATTTGGGTGACGCAAATATAGAAGCTTGGCCAAAGTCCGCCAAATTATTAAAGTCCAAATATGGTAAGGCAAAACTGGTTGTTCCAAGTCACAGTGAAGTTGGAGACGCATCACTCTTGAAACTTACATTAGAGCAGGCGGTTAAAGGGTTAAACGAAAGTAAAAAACCATCAAAACCAAGCAACTAA"

## blaIMP-2_1_AJ243491
geneB = "ATGAAGAAATTATTTGTTTTATGTGTATGCTTCCTTTGTAGCATTACTGCCGCGGGAGCGCGTTTGCCTGATTTAAAAATCGAGAAGCTTGAAGAAGGTGTTTATGTTCATACATCGTTCGAAGAAGTTAACGGTTGGGGTGTTGTTTCTAAACACGGTTTGGTGGTTCTTGTAAACACTGACGCCTATCTGATTGACACTCCATTTACTGCTACAGATACTGAAAAGTTAGTCAATTGGTTTGTGGAGCGCGGCTATAAAATCAAAGGCACTATTTCCTCACATTTCCATAGCGACAGCACAGGGGGAATAGAGTGGCTTAATTCTCAATCTATTCCCACGTATGCATCTGAATTAACAAATGAACTTCTTAAAAAAGACGGTAAGGTGCAAGCTAAAAACTCATTTAGCGGAGTTAGTTATTGGCTAGTTAAAAATAAAATTGAAGTTTTTTATCCCGGCCCGGGGCACACTCAAGATAACGTAGTGGTTTGGTTACCTGAAAAGAAAATTTTATTCGGTGGTTGTTTTGTTAAACCGGACGGTCTTGGTAATTTGGGTGACGCAAATTTAGAAGCTTGGCCAAAGTCCGCCAAAATATTAATGTCTAAATATGTTAAAGCAAAACTGGTTGTTTCAAGTCATAGTGAAATTGGGGACGCATCACTCTTGAAACGTACATGGGAACAGGCTGTTAAAGGGCTAAATGAAAGTAAAAAACCATCACAGCCAAGTAACTAA"

# get the canonical k-mers for each gene
kmersA = utils.getKmers(geneA, 7)
kmersB = utils.getKmers(geneB, 7)

# how many k-mers do we have
print("blaIMP-1 is {} bases long and has {} k-mers (k=7)" .format(len(geneA), len(kmersA)))
print("blaIMP-2 is {} bases long and has {} k-mers (k=7)" .format(len(geneB), len(kmersB)))


* we counted the number of k-mers using Python's len function; however, this also counts duplicate k-mers
* we will now use a HyperLogLog sketch to approximate the cardinality (the number of distinct k-mers) of each set of k-mers:

In [None]:
from datasketch import HyperLogLog

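# p is the precision: the sketch uses 2**p registers (here 2**4 = 16)
sketch = HyperLogLog(p=4)
for kmer in kmersA:
    # update() expects bytes, so encode each k-mer string
    sketch.update(kmer.encode('utf8'))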
print("Estimated cardinality of 7-mers in blaIMP-1 is {0:.0f}" .format(sketch.count()))
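
To make the idea of registers more concrete, here is a toy HyperLogLog written in plain Python. It is only a sketch under stated assumptions and is not how datasketch is implemented: the function name toy_hll_count, the use of the first 8 bytes of SHA-1 as a 64-bit hash, and the plain (non-canonical) sliding-window k-mers in the demo are all choices made for illustration.

```python
import hashlib
import math


def toy_hll_count(items, p):
    """Estimate the number of distinct strings in items using 2**p registers."""
    m = 1 << p                      # number of registers
    registers = [0] * m
    for item in items:
        # 64-bit hash value (illustrative choice: first 8 bytes of SHA-1)
        h = int.from_bytes(hashlib.sha1(item.encode("utf8")).digest()[:8], "big")
        idx = h >> (64 - p)                   # first p bits select a register
        rest = h & ((1 << (64 - p)) - 1)      # remaining 64 - p bits
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)

    # bias-correction constant from the HyperLogLog paper
    if m == 16:
        alpha = 0.673
    elif m == 32:
        alpha = 0.697
    elif m == 64:
        alpha = 0.709
    else:
        alpha = 0.7213 / (1 + 1.079 / m)

    # harmonic mean of 2**register across all registers
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    # small-range correction: fall back to linear counting on empty registers
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)
    return raw


# Demo on plain (non-canonical) 7-mers from a short prefix of blaIMP-1
seq = "ATGAGCAAGTTATCTGTATTCTTTATATTTTTGTTTTGCAGC"
kmers = [seq[i:i + 7] for i in range(len(seq) - 6)]
for p in (4, 14):
    print("p={}: estimate {:.0f}, exact {}".format(
        p, toy_hll_count(kmers, p), len(set(kmers))))
```

Each register only remembers the largest rank it has seen, so feeding the same k-mer twice never changes the sketch. This is why the estimate tracks the number of distinct k-mers rather than the total count.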

* let's work out the actual cardinality and compare this to our estimate:

In [None]:
setA = set(kmersA)
print("Actual cardinality of 7-mers in blaIMP-1 is {}" .format(len(setA)))

* the estimate is quite far from the actual cardinality
* let's increase the accuracy of the sketch by adding more registers
* the p parameter is the precision, and the sketch uses 2**p registers; change p from 4 (16 registers) to 14 (16384 registers):

In [None]:
# p=14 gives 2**14 = 16384 registers
sketch = HyperLogLog(p=14)
for kmer in kmersA:
    sketch.update(kmer.encode('utf8'))
print("Estimated cardinality of 7-mers in blaIMP-1 is {0:.0f}" .format(sketch.count()))
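
The gain from a higher precision can be estimated ahead of time. The relative standard error of a HyperLogLog estimate is roughly 1.04 / sqrt(m), where m = 2**p is the number of registers. The short sketch below evaluates this for the two precisions used above. The helper name hll_standard_error is made up for illustration, and the formula is an asymptotic approximation: for small sets like ours, implementations typically apply a small-range correction, so the observed error can differ.

```python
import math


def hll_standard_error(p):
    """Approximate relative standard error of a HyperLogLog with 2**p registers."""
    m = 2 ** p
    return 1.04 / math.sqrt(m)


for p in (4, 14):
    print("p={}: {} registers, standard error ~{:.1%}".format(
        p, 2 ** p, hll_standard_error(p)))
```

With p=4 the expected error is about 26%, while p=14 brings it down to about 0.8%, at the cost of storing 16384 registers instead of 16.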