The following script was written as a part of an open-ended independent research project conducted for CS 431: Data Intensive Distributed Analytics (big data).

I’m Sunny, and welcome to my solo project! I will be exploring simple methods of comparing genome sequences. In the end, I will be discussing some of the ways we can extrapolate my simple experiments in the field, limitations of my case study, and how to interpret my findings in the context of today’s knowledge. I apply some basic big data methods and assess their effectiveness in being able to compare genome sequences. I learned a lot from jumping into this project, and I hope you enjoy it as much as I did. :)

During my search for interesting datasets, I noticed a lot of “COVID-19 genome sequencing” datasets. After inquiring, I got this insight: there is more info about genes in databases than researchers have been able to run experiments on! Furthermore, finding patterns in DNA today depends on the validity of conclusions drawn from a limited number of experiments that were previously done, and so the propagation of error is a big issue here.

The main purpose of my project is to explore what data we can extract from having two samples and no database with previous knowledge to pull from. How can my current knowledge allow me to extrapolate data?

To begin, I wanted a way to just “eyeball” the data. It was a bit difficult to look at blocks of letters, so I thought of a really simple hash: for each block, I count the number of occurrences of each letter and store the four counts as a tuple, and that’s the hash.
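
A minimal sketch of that composition hash in plain Python (not Spark, so it runs anywhere). The function name `composition_hash` and the 10-base example block are made up for illustration; the notebook's own version is the `parsing` and `rbk` pair further down.

```python
from collections import Counter

def composition_hash(block):
    """Return (count_A, count_C, count_G, count_T) for one block of bases."""
    counts = Counter(block)
    # Counter returns 0 for a base that never appears in the block
    return tuple(counts[base] for base in "ACGT")

# Hypothetical block, used only for illustration
example_block = "ATTAAAGGTT"
print(composition_hash(example_block))  # (4, 0, 2, 4)
```

Because only the counts are kept, any reordering of the same letters produces the same hash, which is why the hash says nothing about the sequence itself.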

Obviously, this hash is not useful for comparing the actual sequences themselves, but taking a brief look at the resultant hashes for two genomes of coronavirus type 2, I quickly noticed that they looked pretty similar. Each line seemed to differ by approximately 1 character, and one character only. It was then that I actually looked at the data, and noticed that all my samples for this kind of viral species type were almost identical, and only differed in the first few characters (in spaces that might be referred to as junk DNA), and in the last characters, in the place that is often referred to as the poly-A tail.

This makes a lot of sense, because we do expect to see some variation in all of our samples.
Genome sequencing is often imperfect and may have errors or gaps in the data, so it is advantageous to have multiple samples, which improves the “resolution” of the data.
Based on contextual knowledge, we also know that the various samples are genetically very similar, as there has been limited time for the virus to spread and mutate.

So, rather than comparing two sequences from the same strain, I decided to compare a “coronavirus” dataset and a “coronavirus 2” dataset. These are the names of different kinds of coronaviruses.

The data that I had was divided into lines of 70 characters, and each sample had about 400 lines. Obviously a hash is pretty sensitive, and would be thrown off even in nearly identical samples. For me, this meant that I wanted to try to figure out a better way to divide the data to analyse, rather than arbitrary lines of length 70.

Comparing DNA sequences is not quite the same as comparing files. For example, to read DNA, we look for “open reading frames,” the beginnings of which are marked by start codons, which are pretty much always “ATG.”
But I thought I’d try some of the techniques that we learned about file comparison anyway.
One method that did occur to me was to create a dictionary of multiple permutations of letters. However, even with lines of 70 characters, we would develop a huge workload really fast. While this is completely doable in Spark, I was more interested in exploring shortcuts. Given more time, perhaps I would have implemented this model in order to compare it to the other models.

The model that I DID try was to flatMap and split the data by start codons. So, at every occurrence of “ATG,” split the data.
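
Here is a rough plain-Python sketch of that split, standing in for the `flatMap` call in the notebook. The helper name `split_on_start_codon` and the toy lines are assumptions for illustration only.

```python
def split_on_start_codon(lines, codon="ATG"):
    """Split every line at each occurrence of the codon and flatten the result."""
    fragments = []
    for line in lines:
        # str.split drops the codon itself and keeps empty pieces
        fragments.extend(line.split(codon))
    return fragments

# Hypothetical lines, used only for illustration
toy_lines = ["CCATGAAATGT", "GGGATG"]
fragments = split_on_start_codon(toy_lines)
print(fragments)  # ['CC', 'AA', 'T', 'GGG', '']
```

Two things are worth noticing about this design: the number of fragments is the number of lines plus the number of codons found, and a fragment never crosses a line break, so a codon that straddles two lines is not seen.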

For my two samples, from Kaggle, these are some numbers I pulled. As you can see, viruses which are more closely related had more similar results.
One sample had 777 start codons and 2143 stop codons. The second had 777 start codons and 2159 stop codons. The first had 428 lines of bases; the second had 427.

For comparison, I also calculated these metrics on a different strain. The previous samples were coronavirus 2. For a “coronavirus” strain, there were 802 start codons, 2038 stop codons, and 426 lines. This makes sense, as we’d expect to see more deviation as the species diverges.

So right away, we see the complexities of comparing gene sequences emerging. If there are mismatches between just two datasets, then we would certainly expect to see many more in larger datasets.

Then I applied my hashing technique again, reduced by key to count how many fragments shared each hash, joined the result with the other dataset, and kept only the pairs which had the same count. Obviously, the last filter was very unnecessary. However, I was still left with 92 pairs which were hashed the same.
If I filtered so that the only requirement was that both sequences have the same hash, I ended with 175 unique hashes in both datasets.
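
The join step can be sketched in plain Python with `collections.Counter` standing in for `reduceByKey` and a dictionary comprehension standing in for the RDD `join`. All names and the toy fragments below are hypothetical.

```python
from collections import Counter

def hash_counts(fragments):
    """Count how many fragments share each (A, C, G, T) composition hash."""
    return Counter(tuple(f.count(b) for b in "ACGT") for f in fragments)

def shared_hashes(counts_a, counts_b):
    """Inner join on the hash: {hash: (count_in_a, count_in_b)}."""
    return {h: (counts_a[h], counts_b[h]) for h in counts_a if h in counts_b}

# Hypothetical fragments, used only for illustration
frags_a = ["CC", "AA", "T", "AA"]
frags_b = ["AA", "CC", "GG"]

shared = shared_hashes(hash_counts(frags_a), hash_counts(frags_b))
same_count = {h: c for h, c in shared.items() if c[0] == c[1]}

print(len(shared))      # hashes present in both datasets
print(len(same_count))  # hashes present in both with equal counts
```

An inner join already drops any hash that is missing from either side, so only the equal-count filter narrows the result further.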

Given time, I would have used this information as a starting point to look for parallels. There would definitely have been challenges in figuring out ways to index in Spark. But in the meantime, I think this information could be useful as a metric of how possible it is to have similarities.
To assess validity, the number of items in the RDD which was split by start codons was 1129. Not having a lot of biology experience, this ratio seems pretty okay to me. I imagine if I were to further explore the validity of this method, I would be able to develop an ideal ratio to estimate the similarity of the two sets.

The problem of having too much data and not enough researchers to analyse it, especially digital data, is a huge untapped market in the digital age. Using simple case studies, one could eventually over time build libraries for researchers to use which could eventually handle very complicated genome analysis.

So there you have it, a summary of my brief explorations in playing with genome sequencing. There are a lot of ways to approach genome analysis, and this was more about getting a feel for the challenges than researching the methods, but I still had a lot of fun, and I hope you enjoyed. Thanks for watching, and thanks for a great term.


In [None]:
# INSTALL SPARK

!apt-get update -qq > /dev/null
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
!tar xf spark-2.4.7-bin-hadoop2.7.tgz
!pip install -q findspark

In [None]:
# SET UP SPARK CONTEXT; PYSPARK

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.7-bin-hadoop2.7"

import findspark
findspark.init()

from pyspark import SparkContext
sc = SparkContext(appName="YourTest", master="local[*]")

Journal notes:
This is a personal project, playing with genome datasets using big data methods.

I will start by preparing two datasets, of two separately sequenced complete COVID genomes, and compare them.

Prep the files, compare the length. Everything I learn about these sets will be new to me, and I'll explore the meaning of what I'm studying.

My hash can be a LIST of four numbers, each number counting the number of A,C,G,T?

Currently, my objective for this project is to explore various combinations of samples we can create out of two genome samples in order to compare the two. This will involve creating functions to identify start and stop codons, which have many combinations, as well as matching up start codons with potential stop codons, and then comparing the lists of genomes developed for both sets.

PREPARING GENOME SEQUENCE A (genome2)

In [None]:
!wget -q /content/drive/MyDrive/1 - Term 4A/CS 431/genome1.rtf

# hash data, one set (count_A, c_C, c_G, c_T) per line

# line is a list of single characters (one element of the RDD), such as ['A', 'T', 'G', 'G']
def parsing(line):
  # pair every base with a count of 1, e.g. [('A', 1), ('T', 1), ('G', 1), ('G', 1)]
  return [(letter, 1) for letter in line]

# fake reduce by key, takes in list of kv pairs
def rbk(kv):
  a = 0
  c = 0
  g = 0
  t = 0
  for pair in kv:
    if pair[0] == 'A':
      a = a + 1
    elif pair[0] == 'C':
      c = c + 1
    elif pair[0] == 'G':
      g = g + 1
    elif pair[0] == 'T':
      t = t + 1
    else:
      print('problem!!')
  return (a, c, g, t)

lines2 = sc.textFile('/content/drive/MyDrive/1 - Term 4A/CS 431/genome2.txt')

# LISTLINE
# makes an RDD of lists, ['A', 'T', 'G', 'G'], etc
# turns it into list of kv pairs, [('A',1), ('T',1), ('G',1), ('G',1)]
# end result is a list of hashes, [(22, 17, 8, 23),
#                                  (19, 16, 14, 21),
#                                  (13, 17, 17, 23),
#                                  (16, 16, 21, 17),
#                                  (16, 20, 15, 19),
#                                  (16, 13, 23, 18),
#                                  (20, 14, 15, 21),
#                                  (16, 16, 20, 18),
#                                  (13, 16, 21, 20),
#                                  (20, 13, 18, 19)]
listline2 = lines2.map(lambda x: list(x)).map(parsing).map(rbk)

listline2.take(10)

#print(listline2.count())
#lines2.take(2)



[(22, 17, 8, 23),
 (19, 16, 14, 21),
 (13, 17, 17, 23),
 (16, 16, 21, 17),
 (16, 20, 15, 19),
 (16, 13, 23, 18),
 (20, 14, 15, 21),
 (16, 16, 20, 18),
 (13, 16, 21, 20),
 (20, 13, 18, 19)]

PREPARING GENOME SEQUENCE B (genome3)

In [None]:
lines3 = sc.textFile('/content/drive/MyDrive/1 - Term 4A/CS 431/ggenome3.txt')

listline3 = lines3.map(lambda x: list(x)).map(parsing).map(rbk)

print(listline3.count())
listline3.take(10)

# Severe acute respiratory syndrome-related coronavirus isolate Tor2, complete genome


# every 3, NOT every one, starting at a start codon.
# read in sets of 3, but different sets can still produce the same amino acid. so genome diff != practical diff



426


[(21, 18, 10, 21),
 (22, 16, 13, 19),
 (13, 18, 15, 24),
 (16, 15, 21, 18),
 (17, 17, 18, 18),
 (16, 16, 23, 15),
 (19, 16, 17, 18),
 (18, 18, 20, 14),
 (18, 18, 17, 17),
 (20, 12, 19, 19)]
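
The comment in the cell above says that different triplets can produce the same amino acid, so a genome difference is not always a practical difference. A small sketch of that point follows, using a deliberately partial codon table from the standard genetic code. The two sequences are invented for illustration and do not come from the datasets.

```python
# Partial codon table (standard genetic code); only the codons used below
CODON_TABLE = {
    "ATG": "M",
    "GCT": "A", "GCC": "A",
    "GAT": "D", "GAC": "D",
    "AAA": "K", "AAG": "K",
}

def translate(seq):
    """Translate a sequence codon by codon, starting at index 0."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))

def base_differences(a, b):
    """Count positions where two equal-length sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

# Hypothetical sequences, used only for illustration
seq_a = "ATGGCTGATAAA"
seq_b = "ATGGCCGACAAG"

print(base_differences(seq_a, seq_b))         # 3
print(translate(seq_a), translate(seq_b))     # MADK MADK
```

Three bases differ, yet both sequences translate to the same four amino acids.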

In [None]:
lines4 = sc.textFile('/content/drive/MyDrive/1 - Term 4A/CS 431/genome4.txt')

listline4 = lines4.map(lambda x: list(x)).map(parsing).map(rbk)

print(listline4.count())
listline4.take(10)


427


[(21, 18, 8, 23),
 (19, 15, 14, 22),
 (13, 17, 17, 23),
 (16, 17, 21, 16),
 (16, 19, 16, 19),
 (16, 14, 22, 18),
 (20, 13, 16, 21),
 (16, 16, 20, 18),
 (13, 16, 20, 21),
 (20, 13, 19, 18)]

Okay, so I'm realizing that my hash method is not the right idea. Even though the matrix which is the DIFFERENCE between the two is not that much when comparing each individual number, there are some problems:
- ACGT is not a huge variety of options, so the spread varies really easily
- if ONE letter differs in a row, then TWO letter counts are affected (one goes up, one goes down), making the results skewed really fast.

But in any case, I'd love to calculate a matrix, one subtracting the other, to see what happens.
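
A minimal sketch of that subtraction in plain Python. The rows are the first three hashes printed above for genome2 and genome4; the function name `hash_difference` is made up for illustration.

```python
def hash_difference(rows_a, rows_b):
    """Subtract two lists of (A, C, G, T) hashes row by row.

    zip stops at the shorter list, so extra rows in the longer sample are ignored.
    """
    return [tuple(a - b for a, b in zip(ra, rb)) for ra, rb in zip(rows_a, rows_b)]

# First three rows of the genome2 and genome4 outputs shown above
genome2_rows = [(22, 17, 8, 23), (19, 16, 14, 21), (13, 17, 17, 23)]
genome4_rows = [(21, 18, 8, 23), (19, 15, 14, 22), (13, 17, 17, 23)]

diff = hash_difference(genome2_rows, genome4_rows)
print(diff)  # [(1, -1, 0, 0), (0, 1, 0, -1), (0, 0, 0, 0)]
```

Each nonzero row contains a +1 paired with a -1, which is the pattern a single substituted letter would leave in a fixed-length line: one count goes up and another goes down.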

In [None]:
# matrix A: listline2
# matrix B: listline3

#matrixA = listline2.collect()
#matrixA.take(10)

listline2.take(10)
listline3.take(10)


[(21, 18, 10, 21),
 (22, 16, 13, 19),
 (13, 18, 15, 24),
 (16, 15, 21, 18),
 (17, 17, 18, 18),
 (16, 16, 23, 15),
 (19, 16, 17, 18),
 (18, 18, 20, 14),
 (18, 18, 17, 17),
 (20, 12, 19, 19)]

current thoughts:

brute force the calculation by comparing EACH INDIVIDUAL position in the two genomes, and then sum the similarities,  divide by total number of letters.

maybe use this as my base to compare against

then i have my sketchy hash method

then look up some algorithms to try

then at the end, extend to explain how this could be applicable if we have an enormous dataset.
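
The brute-force baseline from these notes could look something like this in plain Python. Dividing by the shorter length is an assumption on my part, and the example sequences are invented.

```python
def positional_identity(seq_a, seq_b):
    """Fraction of positions holding the same base, over the shorter length."""
    n = min(len(seq_a), len(seq_b))
    if n == 0:
        return 0.0
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return matches / n

# Hypothetical sequences, used only for illustration
print(positional_identity("ATGCATGC", "ATGAATGC"))  # one substitution: 0.875
print(positional_identity("ATGCATGC", "TATGCATG"))  # one inserted base at the front: 0.0
```

The second call shows why position is a fragile basis for comparison: a single inserted base shifts everything after it, so two nearly identical sequences can score zero.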

QUESTION: are there currently tools out there which handle MULTIPLE huge sets? Bc if not, this is great.

QUESTION: why are the genome lengths different?

QUESTION: how important is position? is it reasonable to compare index (i1,i2)? Order matters, but does it matter from the start? DO THEY HAVE TO BE THE SAME SIZE

QUESTION: are there actual advantages to being able to compare huge masses of genomes? or is there more value in quantifying the individual differences? I'm sure by processing huge amounts, we'd be able to generate metrics about the mass of data, against which we can compare differences in extreme cases.

Some research: there are ways of comparing multiple genomes at once. the purpose is either really to show the genetic relationship between two strains, or possibly to isolate mutations.

https://academic.oup.com/bfg/article/10/6/322/233867 :
Compared with ‘traditional’ genome studies that focus on a single genome per study, comparative genomics provides an additional layer of detail. A single genome can reveal the functional potential encoded within an organism with ‘tried-and-true’ annotation strategies using BLASTX, or HMM searches of protein families (e.g. COG, Pfam, TIGRfam, etc.) in order to glean functional potential of a given organism. Comparisons between different pathogen genomes, however, often lead to faster identification of distinct mechanisms underlying pathogenicity. Many types of differences have been observed between pathogens and non- or less-pathogenic relatives, from very large genomic differences as in the genomic islands found among pathogenic or benign strains within a single species (Escherichia coli) [12, 13], to smaller genomic differences between closely related species of the same genus (e.g. between Yersinia pestis and Yersinia pseudotuberculosis) [14]. While the set of genome sequence differences alone may not provide any conclusive answer as to which sequence(s) may be responsible for a specific phenotype, nevertheless, genome comparisons do generate manageable lists of genomic regions and gene candidates for further study.

Obviously, there are a lot of advancements in comparative genetics that I cannot even approach for a single project like this. This changes my research question a bit, to exploring the variation in results some very simple algorithms have in comparative genetics.

exploring why some algorithms work, and some don't will need more research.

Here, I think for now I will just compare multiple methods on two similar genome datasets, and talk about how this would expand.

For example, to actually properly test these methods, and go even deeper, we'd need far more datasets.

Additionally, I think it would be cool to apply statistics. Maybe if we had a huge sample size of genomes, we could calculate an average matrix, and use that matrix to develop a standard deviation matrix, and we could easily identify at least areas which have outliers!
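
A rough sketch of the average-matrix idea, using only the standard library. Everything here is hypothetical: the function names, the threshold of two standard deviations, and the six one-row samples. It also assumes every sample has the same number of rows, which the real genomes do not.

```python
from statistics import mean, pstdev

def per_cell_stats(samples):
    """samples: list of matrices (lists of 4-tuples), all with the same shape."""
    means, stdevs = [], []
    for r in range(len(samples[0])):
        columns = [[s[r][c] for s in samples] for c in range(4)]
        means.append(tuple(mean(col) for col in columns))
        stdevs.append(tuple(pstdev(col) for col in columns))
    return means, stdevs

def outliers(sample, means, stdevs, z=2.0):
    """Return (row, column) cells more than z standard deviations from the mean."""
    flagged = []
    for r, row in enumerate(sample):
        for c, value in enumerate(row):
            if stdevs[r][c] > 0 and abs(value - means[r][c]) > z * stdevs[r][c]:
                flagged.append((r, c))
    return flagged

# Six hypothetical one-row samples; the last one is the odd one out
samples = [[(20, 15, 15, 20)]] * 5 + [[(26, 15, 9, 20)]]
means, stdevs = per_cell_stats(samples)
print(outliers(samples[5], means, stdevs))  # [(0, 0), (0, 2)]
print(outliers(samples[0], means, stdevs))  # []
```

With very few samples no cell can ever exceed two standard deviations, so an approach like this only becomes meaningful with a large collection of genomes.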

To be clear, there are absolutely methods out there to compare large datasets of genomes quickly. From what I've gathered, though, many of these methods are quite new and still developing (fun fact: my high school biology teacher, who was quite young, worked on the Human Genome Project himself). I think it's nice because genomes and big data emerged in a similar era. In terms of Kuhnian paradigms, they belong to the same paradigm and grow together. As someone who studies interdisciplinarity, I am very excited at this parallel.

Also from the link: "Due to technological advances in the past few decades, sequencing whole genomes is becoming cheaper at an exponential rate. As a result, an increasing collection of whole sequenced genomes is becoming publicly available, ranging in size from less than two hundred thousand base pairs [2] to more than 22 × 10^9 base pairs [3]...
However, to improve our understanding of these topics, new data are being included in databases, but doing so comes at a price: the more genomes and larger genomes trend is aggravating a scenario where it is unthinkable to perform genomic experiments without the aid of extensive computational resources, but furthermore, even current computational methods are being left behind with the increasing complexity."

"Currently, most of the handling, exploration and curation of genomic information (such as identifying coding regions or generating pairwise synteny maps) is performed manually and curated with platforms dedicated to this purpose, which often include a repository of precomputed genome comparisons. For instance, NARCISSE [4], Genomicus [5] and SynFind [6] provide such data along with other tools for exploration purposes such as genome browsers and karyotype analysers [7]. Although these resources contain high-quality curated information, they usually rely on previously computed data and typically do not allow users to run new experiments on demand. As a result, when new experiments are supported by these platforms, it is common to employ restricted comparison methodologies such as gene-based approaches (see CoGe [8]) to lower computational demands. However, these approaches are often based on the BLAST [9] algorithm, which was not initially designed for large pairwise genome comparisons."

A lot like with files, I could hope at the very least that these methods could be a fast way to isolate which pairs are good CANDIDATES for further examination or comparison.

Of the software I've looked into so far, I haven't found any websites which do that specifically.

Yes! The fact that there's more info in the databases than humans have been able to run experiments on! So finding patterns in the DNA depends on the validity of the conclusions we've drawn from a limited number of experiments that we HAVE done, so the propagation of error is a huge issue.



# step 1, find start codon ATG? always. find first instance, then process stuff after.
# find stop codons also. stop codon: TAG, TAA, TGA. (could quickly google).
# would expect to see these multiple times.
# for every sequence, lots of genes
# open reading frame: from start to stop. these are the actual genes. there will be lots orfs. millions/w/e

# the poly A tail, AAAAAAAAAAA, like a bumper to protect end of gene. depends on sequence type.

# the stuff before the start codon? read gene, make mrna to make a protein. junk dna wasn't relevant before, but it actually controls the amount
# of the protein, regulation, etc. it's not useless. not translated to protein, but used in expression.

# human genome project: 22,000-25,000 genes!
# COVID is a viral thing, so it's tinier
# DNA tries to be efficient. relationship bw how much dna it takes to code the exterior and some other stuff.

# website with annotated gene records (lots!! link)
# links me to different ways of comparing existing datasets
# mimick!
# "explore the data"



In [None]:
lines2 = sc.textFile('/content/drive/MyDrive/1 - Term 4A/CS 431/genome2.txt')

# find start codon
def start(line):
  count = 0
  A = 0
  T = 0
  for letter in line:
    #print(letter)

    if letter == 'A':
      if A == 1:
        continue
      elif A == 0:
        A = 1
        #print('change?')
        #print(A)

    elif letter == 'T':
      if A == 0:
        continue
      else:
        if T == 0:
          T = 1
          #print('change?')
          #print(T)
        else:
          A = 0
          T = 0
          #print('refresh')
          #print(A)
          #print(T)
    
    elif letter == 'G':
      if A == 0:
        continue
      else:
        if T == 0:
          A = 0
        else:
          count = count + 1
          A = 0
          T = 0
          #print('UPDATE')
          #print(count)
    else:
      A = 0
      T = 0

  return count

count_start = lines2.map(start)

count_start2 = count_start.reduce(lambda x,y: x+y)


count_start4 = lines4.map(start)

count_start44 = count_start4.reduce(lambda x,y: x+y)


print(count_start2)
print(count_start44)
print(start('ATCTATCGATGAA'))
print(start('ATCTATCGATGAATG'))
#start('ATG')

count_start.take(10)
#[0, 1, 0, 1, 0, 1, 1, 3, 1, 2]

lines2.take(10)

777
777
1
2


['ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAA',
 'CGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAAC',
 'TAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTG',
 'TTGCAGCCGATCATCAGCACATCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTC',
 'CCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTTTTACAGGTTCGCGACGTGCTCGTAC',
 'GTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCAGAGGCACGTCAACATCTTAAAGATGGCACTTGTGG',
 'CTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGAT',
 'GCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCTGGTAGCAGAACTCGAAGGCATTCAGTACGGTC',
 'GTAGTGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGAAATACCAGTGGCTTACCGCAAGGTTCT',
 'TCTTCGTAAGAACGGTAATAAAGGAGCTGGTGGCCATAGTTACGGCGCCGATCTAAAGTCATTTGACTTA']
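
As a cross-check on the hand-written state machine, here is a plain-Python sketch that counts literal "ATG" substrings with `str.count`. The helper name `count_codon` is made up. Tracing `start` by hand, an input such as 'ATAG' appears to increment the counter, because a repeated 'A' leaves the T flag set, so the two approaches may not agree. If genome2 has 428 lines, the 1129 fragments reported later by `split('ATG')` would imply about 1129 - 428 = 701 literal matches rather than 777.

```python
def count_codon(lines, codon="ATG"):
    """Count literal occurrences of a codon, per line and across line breaks."""
    per_line = sum(line.count(codon) for line in lines)
    # Joining the lines also catches codons that straddle two lines
    joined = "".join(lines).count(codon)
    return per_line, joined

# The two test strings used in the cell above
print(count_codon(["ATCTATCGATGAA"]))    # (1, 1)
print(count_codon(["ATCTATCGATGAATG"]))  # (2, 2)

# Hypothetical inputs, used only for illustration
print(count_codon(["ATAG"]))             # (0, 0)
print(count_codon(["CCAT", "GAA"]))      # (0, 1)
```

`str.count` counts non-overlapping matches, which is sufficient here because "ATG" cannot overlap with itself.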

In [None]:
lines2 = sc.textFile('/content/drive/MyDrive/1 - Term 4A/CS 431/genome2.txt')


# FIND END CODONS: TAG, TAA, TGA

# count stop codons (TAG, TAA, TGA) in one line
def end(line):
  count = ''
  total = 0
  for letter in line:
    #print(letter)

    if letter == 'T':
      if count == '':
        count = 'T'
      elif count == 'T':
        continue
      else:
        count = ''
        #print('change?')
        #print(count)

    elif letter == 'A':
      if count == '':
        continue
      elif count == 'T':
        count = 'TA'
        #print(count)
      else:
        total = total + 1
        count = ''
        #print('UPDATE')
        #print(total)
    
    elif letter == 'G':
      if count == '':
        continue
      elif count == 'T':
        count = 'TG'
        #print('change?')
        #print(count)
      elif count == 'TA':
        count = ''
        total = total + 1
        #print('UPDATE')
        #print(total)
      else:
        count = ''
    else:
      A = 0
      T = 0

  return total

count_end = lines2.map(end)

count_end2 = count_end.reduce(lambda x,y: x+y)


count_end4 = lines4.map(end)
count_end44 = count_end4.reduce(lambda x,y: x+y)

print(count_end2)
print(count_end44)
#print(start('ATCTATCGATGAA'))
#print(start('ATCTATCGATGAATG'))

#count_start.take(10)


2143
2159
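
A similar cross-check for stop codons, again as a plain-Python sketch with a hypothetical helper name. In the `end` function above, a 'C' falls through to the final `else`, which assigns `A` and `T` rather than clearing `count`, so a run such as 'TCAG' looks like it would be counted. A literal substring count does not have that behaviour, so the totals may differ from 2143 and 2159.

```python
STOP_CODONS = ("TAG", "TAA", "TGA")

def count_stop_codons(line):
    """Count literal occurrences of the three stop codons in one line."""
    # Each stop codon starts with T and has no other T, so matches cannot overlap
    return sum(line.count(codon) for codon in STOP_CODONS)

# Hypothetical inputs, used only for illustration
print(count_stop_codons("TAATAGTGA"))  # 3
print(count_stop_codons("TCAG"))       # 0
```

Like the notebook's version, this counts codons in any reading frame and ignores codons split across two lines.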


let's compare a coronavirus and a coronavirus 2. let's do it by building a keys dictionary.

In [None]:
listline2.take(30)

variations = listline2.map(lambda x: (x, 1)).reduceByKey(lambda x,y: x+y).sortByKey()

variations.take(10)


[((9, 15, 14, 32), 1),
 ((11, 9, 16, 34), 1),
 ((11, 14, 18, 27), 1),
 ((11, 15, 11, 33), 1),
 ((12, 9, 12, 37), 1),
 ((12, 12, 15, 31), 1),
 ((12, 14, 8, 36), 1),
 ((12, 14, 11, 33), 1),
 ((13, 0, 0, 0), 1),
 ((13, 15, 8, 34), 1)]

In [None]:

listline3.take(30)

variations3 = listline3.map(lambda x: (x, 1)).reduceByKey(lambda x,y: x+y).sortByKey()

variations3.take(10)


[((1, 0, 0, 0), 1),
 ((9, 14, 19, 28), 1),
 ((9, 17, 13, 31), 1),
 ((10, 11, 21, 28), 1),
 ((11, 13, 20, 26), 1),
 ((11, 15, 15, 29), 1),
 ((11, 16, 14, 29), 1),
 ((11, 18, 13, 28), 1),
 ((11, 19, 15, 25), 1),
 ((12, 10, 18, 30), 1)]

In [None]:
k = 30
# only looking at sequences of at least 30

# one method is looking at identical hashes, and comparing those lines

In [None]:
# trying from scratch, separating data by start codons

lines2 = sc.textFile('/content/drive/MyDrive/1 - Term 4A/CS 431/genome2.txt')

listline2 = lines2.flatMap(lambda x: x.split('ATG')).map(parsing).map(rbk).map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y).sortByKey(False)

#print(listline2.count())
listline2.take(20)

[((34, 15, 12, 9), 1),
 ((33, 7, 11, 19), 1),
 ((32, 17, 8, 13), 1),
 ((31, 9, 15, 15), 1),
 ((28, 15, 8, 19), 1),
 ((28, 14, 8, 20), 1),
 ((28, 13, 16, 13), 1),
 ((27, 11, 10, 22), 1),
 ((26, 12, 11, 21), 1),
 ((26, 11, 15, 18), 2),
 ((26, 9, 4, 19), 1),
 ((25, 18, 15, 12), 1),
 ((25, 14, 15, 16), 2),
 ((25, 14, 13, 15), 1),
 ((25, 12, 18, 15), 1),
 ((25, 11, 12, 22), 1),
 ((25, 11, 8, 19), 1),
 ((24, 14, 11, 21), 1),
 ((24, 13, 13, 20), 1),
 ((24, 12, 15, 19), 1)]

In [None]:
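# count the raw fragments under a separate name, so listline2 keeps the
# (hash, count) pairs that the join below still needs
fragments2 = lines2.flatMap(lambda x: x.split('ATG'))
fragments2.count()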

1129

In [None]:
lines3 = sc.textFile('/content/drive/MyDrive/1 - Term 4A/CS 431/ggenome3.txt')

listline3 = lines3.flatMap(lambda x: x.split('ATG')).map(parsing).map(rbk).map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y).sortByKey(False)

#print(listline3.count())
listline3.take(20)


[((32, 15, 16, 7), 1),
 ((31, 14, 9, 7), 1),
 ((28, 19, 12, 11), 1),
 ((27, 8, 13, 22), 1),
 ((26, 18, 14, 12), 1),
 ((26, 17, 14, 13), 1),
 ((26, 16, 13, 15), 1),
 ((26, 14, 14, 16), 1),
 ((26, 9, 14, 21), 1),
 ((25, 17, 11, 17), 1),
 ((25, 16, 14, 15), 1),
 ((25, 16, 12, 14), 1),
 ((25, 15, 14, 16), 1),
 ((25, 14, 15, 16), 1),
 ((25, 12, 15, 18), 1),
 ((24, 14, 12, 20), 1),
 ((24, 13, 14, 19), 1),
 ((24, 1, 0, 0), 1),
 ((23, 18, 16, 13), 1),
 ((23, 17, 16, 14), 1)]

In [None]:
merge = listline2.join(listline3)

merge.take(10)

[((25, 14, 15, 16), (2, 1)),
 ((15, 8, 6, 11), (1, 1)),
 ((9, 4, 3, 2), (1, 1)),
 ((9, 3, 3, 5), (1, 1)),
 ((8, 9, 5, 6), (1, 1)),
 ((8, 5, 5, 14), (1, 2)),
 ((7, 6, 2, 5), (1, 1)),
 ((7, 5, 3, 5), (1, 1)),
 ((5, 5, 2, 8), (1, 1)),
 ((5, 4, 7, 14), (1, 1))]

In [None]:
how_many = merge.filter(lambda x: x[1][0] != 0 and x[1][1] != 0)

print(how_many.count())
how_many.take(10)

175


[((25, 14, 15, 16), (2, 1)),
 ((15, 8, 6, 11), (1, 1)),
 ((9, 4, 3, 2), (1, 1)),
 ((9, 3, 3, 5), (1, 1)),
 ((8, 9, 5, 6), (1, 1)),
 ((8, 5, 5, 14), (1, 2)),
 ((7, 6, 2, 5), (1, 1)),
 ((7, 5, 3, 5), (1, 1)),
 ((5, 5, 2, 8), (1, 1)),
 ((5, 4, 7, 14), (1, 1))]