In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Introduction

Like most TPS competitions in the past, TPS February 2022 took an unexpected turn when Kagglers (e.g., in this [discussion thread](https://www.kaggle.com/c/tabular-playground-series-feb-2022/discussion/305364)) discovered that there are many duplicate rows in the training and test data. Since then there has been debate about whether these duplicates should be dropped, kept, or accounted for in some other way, such as by assigning sample weights.

Initially my reaction was that this phenomenon was mainly caused by combinatorics. I was led to this thinking because most duplicates occur in the low-resolution rows. Briefly, the original [paper](https://www.frontiersin.org/articles/10.3389/fmicb.2020.00257/full) speaks of the number \\(r\\) of "pyramid tips" (`num_reads` in their implementation), which determines the number of BOC reads for each sample. In that context, \\(r=100\\) gives a low-resolution histogram whereas \\(r=1000000\\) gives a high-resolution histogram. Combinatorially, a low-resolution histogram has fewer possible configurations: with 286 bins there are \\(\frac{(r+285)!}{r!\,285!}\\) possible histograms. This is still a huge number, but I thought that if the spectrum being sampled has a lot of zeroes, the effective number might be small enough to make repetitions likely. When I later took the time to run some simulations, it turned out that getting duplicates this way is next to impossible, even for \\(r=100\\), with either a genome distribution or the bias distribution.
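
To get a feel for the size of that count, here is a small sketch that evaluates the formula with exact integer arithmetic. The helper name `n_histograms` is mine, and the bin count assumes 10-mers, as used later in this notebook.

```python
import math

dna_length = 10  # k-mer length used later in this notebook

# Number of (A, T, G, C) count tuples that sum to dna_length: C(dna_length + 3, 3)
n_bins = math.comb(dna_length + 3, 3)

def n_histograms(r, bins=n_bins):
    """Number of ways to distribute r reads over `bins` bins (stars and bars)."""
    return math.comb(r + bins - 1, bins - 1)

for r in (100, 1000000):
    digits = len(str(n_histograms(r)))
    print(f"r = {r}: about 10^{digits - 1} possible histograms")
```

Even the low-resolution case gives an astronomically large count, which is why pure combinatorics alone does not explain the duplicates.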

I became curious. There must be a way to replicate this phenomenon. It is unlikely I can replicate the exact data in this competition, but I should be able to replicate the phenomenon. That's when I decided to look at the [accompanying implementation](https://github.com/rlwphd/DNAFingerprints) by the authors of the paper. The following are my findings.

**Disclaimer**: This is an investigation of the data generation process in the original paper. I do not have any knowledge of how the data was generated in this competition. For all I know, Kaggle might have taken some ideas from the paper but used a completely different methodology for data generation.

**Disclaimer\\({}^2\\):** The "CSI" part of the title of this notebook is just a joke. It is not intended to imply that any crime has occurred. 

# Extracting just enough code to replicate the phenomenon

First let us download the whole package. 

In [None]:
!wget https://github.com/rlwphd/DNAFingerprints/archive/refs/heads/master.zip
import zipfile
with zipfile.ZipFile('master.zip', 'r') as zip_ref:
    zip_ref.extractall('.')   

In [None]:
import os
import math
import time
import datetime
import dask.dataframe as ddf
from collections import Counter

We need to specify where to read the raw data and where to output the BOC reads.

In [None]:
bacteria = './DNAFingerprints-master/Bacteria/'
local_BOC = '.'

The code in the following cell is directly copied from the original source.

In [None]:
def str_count(str_part):
    A = str_part.count('A')
    T = str_part.count('T')
    G = str_part.count('G')
    C = str_part.count('C')

    return (A, T, G, C), (T, A, C, G)

def kmer_fingerprints(whole_str,dna_length,kmer_range):
    str_part = (whole_str[ii:ii+dna_length] for ii in range(len(whole_str)-dna_length+1))
    kmer_list = [item for string in str_part for item in str_count(string)]
    kmer_dict = Counter(kmer_list)
    results = [kmer_dict[val] for val in kmer_range]

    return results

def title_extraction(header, resistance):
    heading = header.split(' ')
    seq_record_id = heading[0]
    genus = heading[1]
    if len(heading) >= 3:
        species = heading[2]
        name = genus + ' ' + species
    else:
        name = genus
        
    for word in heading:
        if 'plasmid' in word:
            dna_type = 'Plasmid'
            break
        elif 'genome' in word:
            dna_type = 'Genome'
            break
        elif 'sequence'in word:
            dna_type = 'Sequence'
            break
        else:
            dna_type = ""
    
    bacteria_type = ""
    strain = ""
    data_add = [seq_record_id[1:], resistance, name, genus, dna_type, strain, bacteria_type,  header]
    
    return data_add

def multiple_str_check(file):
    header = []
    start = []
    with open(file, 'r') as f:
        for ii, line in enumerate(f):
            if '>' == line[0]:
                header.append(line)
                start.extend([ii+1])
            end = ii+1
        start.extend([end])
    return header, start

def str_extraction(file,start,end,dna_length):
    string = []
    with open(file,'r') as f:
        for ii, line in enumerate(f):
            if ii >= start and ii < end:
                string.append(line)
            elif ii == end:
                break
                
    
    section = ''.join(line.strip() for line in string)
    whole_string = ''.join([section,section[0:dna_length-1]])
    return whole_string

def kmer_main(file,dna_length,kmer_range,data_index,df_Genome,df_Plasmid, resistance):
    header,start = multiple_str_check(file)
    if len(header) > 1:
        for ii in range(len(header)):
            whole_str = str_extraction(file,start[ii],start[ii+1]-1,dna_length)
            species_data = title_extraction(header[ii], resistance)
            kmer_results = kmer_fingerprints(whole_str,dna_length,kmer_range)
            species_data.extend(kmer_results)
            df_kmer = pd.DataFrame(species_data, index=data_index)
            if species_data[4] != 'Plasmid':
                df_Genome = df_Genome.append(df_kmer.T, ignore_index=True)
            elif species_data[4] == 'Plasmid':
                df_Plasmid = df_Plasmid.append(df_kmer.T, ignore_index=True)
    else:
        whole_str = str_extraction(file,start[0],start[1]-1,dna_length)
        species_data = title_extraction(header[0], resistance)
        kmer_results = kmer_fingerprints(whole_str,dna_length,kmer_range)
        species_data.extend(kmer_results)
        df_kmer = pd.DataFrame(species_data, index=data_index)
        if species_data[4] != 'Plasmid':
            df_Genome = df_Genome.append(df_kmer.T, ignore_index=True)
        elif species_data[4] == 'Plasmid':
            df_Plasmid = df_Plasmid.append(df_kmer.T, ignore_index=True)
    
    type_change = {}
    for ii,name in enumerate(data_index):
        if ii < 8:
            type_change[name] = 'object'
        else:
            type_change[name]='int32'
    df_Genome = df_Genome.astype(type_change)
    df_Plasmid = df_Plasmid.astype(type_change)
    
    return df_Genome,df_Plasmid

def kmer_length():
    k = None
    while k is None:
        input_value = input("Please enter DNA segment length (5-100): ")
        try:
        # try and convert the string input to a number
            k = int(input_value)
            if k < 5:
                print("{input} is not a valid integer, please enter a valid integer between 5-100".format(input=input_value))
                k = None
            elif k > 100:
                print("{input} is not a valid integer, please enter a valid integer between 5-100".format(input=input_value))
                k = None
        except ValueError:
        # tell the user off
            print("{input} is not a valid integer, please enter a valid integer between 5-100".format(input=input_value))
    return k

The function `SERS_values` is an important function that generates the BOC reads. I put it in a separate cell for emphasis but otherwise it is a verbatim copy of the original source.

In [None]:
def SERS_values(sers,categories,num_reads,arr,mutate,mutation):

    jj = 0
    while 1:
    # Randomizing the pyramid array
        mr = np.random.permutation(arr)
    # Adding in the mutations
        mr = np.where(mutate > 0, mutation, mr)
    # Getting the respective kmer counts and dividing the values 
    # by the total value to get the frequencies
        for nn in range(len(num_reads)):
        # Setting the limits for how much of the sequence it's using
            if num_reads[nn] > 10000:
                ff = 10000
                max_depth = int(num_reads[nn]/10000)
            else:
                ff = num_reads[nn]
                max_depth = 1 
            for mm in range(mr.shape[1]):
                for kk in range(max_depth):
                    sers[nn,mm,jj,:] += np.bincount(mr[kk,mm,:ff],minlength=categories)
                    
        jj+=1
        if jj >= sers.shape[2]:
            break
    sers /= np.array(num_reads).reshape((-1,1,1,1))
    return sers

The function `SERS_reads` is also an important function because it performs the random sampling of FBC spectra. It loops over the genomes, pre-samples the FBC spectrum and the bias spectrum into \\(200\times10000\\) arrays, passes the pre-sampled arrays to `SERS_values` to get the BOC reads, and then writes the reads to a file for each combination of `error_rate` and `num_reads`.

I modified the original code in two ways:
* It breaks out of the main loop after processing one genome. I just want to confirm the statistics of duplicates, and processing one genome is enough.
* It saves the output to CSV instead of HDF5.
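
To make the array sizes in the description above concrete, this short sketch reproduces only the size arithmetic found in `SERS_reads` and `SERS_values`. The helper name `rows_and_columns_used` is mine and does not appear in the original source.

```python
num_reads = [100, 1000, 10000, 100000, 1000000]
chunk = 10000  # row length hard-coded in SERS_reads and SERS_values

# Rows in the pre-sampled arrays: int(2*max(num_reads)/10000)
n_rows = int(2 * max(num_reads) / chunk)

def rows_and_columns_used(r, chunk=chunk):
    """Return (max_depth, ff): rows read and columns read per row for a given num_reads."""
    if r > chunk:
        return int(r / chunk), chunk
    return 1, r

usage = {r: rows_and_columns_used(r) for r in num_reads}
print(n_rows)
print(usage)
```

So each generated sample reads a single row of the pre-sampled array when `num_reads` is at most 10000, and 10 or 100 rows for the two largest settings.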

In [None]:
def SERS_reads(dna_length,df,group,DNAtype,data_categories,bias,num_training_samples,num_reads,error_rate):

#Dividing the number of occurences of each bin by the total number of occurences
    df_prob = df.loc[:,data_categories[0]:data_categories[-1]].div(df.loc[:,data_categories[0]:data_categories[-1]].sum(axis=1),axis=0)

    for ii in range(len(df_prob.index)):
    # Getting just the probility values from the kmer counts
        print(F'Processing {df.iloc[ii]["Resistance"]} starts')
        prob = df_prob.iloc[ii,:].values
    # Getting the probility for the largest pyramid of interest and setting the data type
        read = np.random.RandomState(seed=231).choice(len(data_categories),(int(2*max(num_reads)/10000),10000),p=prob)
        read = np.stack([read for _ in range(len(error_rate))],axis=1)
        
    # Creating the mutation array
        mutate = np.zeros((int(2*max(num_reads)/10000),len(error_rate),10000),dtype=np.int16)
        for int_mut,mut in enumerate(error_rate):
            mutations = np.concatenate([np.random.RandomState(seed=123).choice([0,1],min(num_reads),p=[1-mut,mut]) for _ in range(int(2*max(num_reads)/min(num_reads)))]).reshape((-1,10000))
            mutate[:,int_mut,:] = mutations
        mutation = np.random.RandomState(seed=321).choice(len(data_categories),(int(2*max(num_reads)/10000),10000),p=bias)
        mutation = np.stack([mutation for _ in range(len(error_rate))],axis=1)
        
    # Getting the training samples for the species
        sers = np.empty((len(num_reads),len(error_rate),num_training_samples,len(data_categories)))
        sers_results = SERS_values(sers,len(data_categories),num_reads,read,mutate,mutation)
    # Subtracting off the random bias 
        sers_results -= bias
        
    # Cycling through each mutation array
        for mut_int in range(len(error_rate)):
        # Cycling through each pyramid size
            for read_int in range(len(num_reads)):
            # Storing the results as a CSV file (the original code stored HDF5)
                save_path = os.path.join(local_BOC,'SERS_%s_%s_%s_%s_%smer_data.csv' % (str(int(error_rate[mut_int]*100)), str(num_reads[read_int]), group, DNAtype, str(dna_length)))
                # Putting it into a pandas dataframe and storing it
                df_SERS = pd.DataFrame(sers_results[read_int,mut_int,:,:], columns=data_categories)
                df_SERS['Name'] = [df.iloc[ii,1]]*len(df_SERS.index)
                df_SERS.to_csv(save_path,index=False)
        
        print(F'Processing {df.iloc[ii]["Resistance"]} ends')
        break # processing just one genome
        
        if ii % 10 == 0:
            print(ii)
            print('Saved %s' % (datetime.datetime.now().isoformat()))
            

    return

The following cell is also a direct copy of the original code. It runs through each genome sequence to produce the FBC spectrum and saves it to an .h5 file.

In [None]:
#dna_length=kmer_length()
dna_length = 10
# Recording the time it takes to run everything
print(datetime.datetime.now().isoformat())
start = time.perf_counter()
# number of samples per species 
num_training_samples = 1000
# error rate
error_rate = [0,0.01,0.05,0.1,0.25,0.33,0.5,0.75,0.9,1]
# number of optical sequencing reads
num_reads = [100,1000,10000,100000,1000000]

# creating the correct tuples for how many A, T, G and C's are in each bin
kmer_range = [(aa, tt, gg, cc) for aa in range(dna_length + 1) for tt in range(dna_length + 1) for gg in range(dna_length + 1) for cc in range(dna_length + 1) if aa + tt + cc + gg == dna_length]
# setting dna length based bias
bias = np.array([(1/4**dna_length) * math.factorial(dna_length)/(math.factorial(kmer[0]) * math.factorial(kmer[1]) * math.factorial(kmer[2]) * math.factorial(kmer[3])) for kmer in kmer_range])
# creating the categorical labels for storing data
data_categories = ["A%sT%sG%sC%s" % (str(aa), str(tt), str(gg), str(cc)) for aa in range(dna_length + 1) for tt in range(dna_length + 1) for gg in range(dna_length + 1) for cc in range(dna_length + 1) if aa + tt + cc + gg == dna_length]
# creating the labels for the non-numerical information
data_index = ['Seq Record ID', 'Resistance', 'Name', 'Genus', 'DNA Type', 'Strain', 'Bacteria Type', 'Notes']
# combining the the non-numerical and categorical labels for the pandas dataframe
data_index.extend(data_categories)

# getting the list of all of the folders and files that have the DNA sequences in them
file_list = [(os.path.join(root,name),root[3:].split('/')[-1],name) for root, dirs, files in os.walk(bacteria) for name in files if name.endswith(".txt") or name.endswith(".fna")]

# Running through all of the DNA sequence files
for int_file,file in enumerate(file_list):
# creating the empty dataframes to store the data in
    df_Genome = pd.DataFrame(columns=data_index)
    df_Plasmid = pd.DataFrame(columns=data_index)

# creating the 10mer data files for the DNA sequences
    resistance = file[2][:-4]
    df_Genome, df_Plasmid = kmer_main(file[0],dna_length,kmer_range, data_index,df_Genome,df_Plasmid, resistance)
    print('Completed %s' % file[2][:-4])

# saving the dataframe
    if len(df_Genome.index) > 0:
        df_Genome.to_hdf('PandasDataFrame_%s_Genome_%smer_data.h5' % (file[1],str(dna_length)),'df%s' % (int_file),mode='a',format='table')
    if len(df_Plasmid.index) > 0:
        df_Plasmid.to_hdf('PandasDataFrame_%s_Plasmid_%smer_data.h5' % (file[1],str(dna_length)),'df%s' % (int_file),mode='a',format='table')
    print('file saved')

    del(df_Genome)
    del(df_Plasmid)

end = time.perf_counter()
print('# of hours to run code: %s' % ((end-start)/3600))
print(datetime.datetime.now().isoformat())

I am only interested in how the training data is generated, so I use the file `PandasDataFrame_Training_Genome_10mer_data.h5`. For the first genome (Bacteroides_fragilis; remember that I break out of the loop after one genome), `num_training_samples = 1000` samples are generated for each combination of `error_rate` and `num_reads`.

In [None]:
file = './PandasDataFrame_Training_Genome_10mer_data.h5'
file_split = file[:-3].split('_')
df = ddf.read_hdf(file, 'df*').compute()
SERS_reads(dna_length,df,file_split[1],file_split[2],data_categories,bias,num_training_samples,num_reads,error_rate)

Now let's see what we got. First, `error_rate = 0.0`, `num_reads = 100`

In [None]:
train_data = pd.read_csv('./SERS_0_100_Training_Genome_10mer_data.csv')
np.unique(train_data.drop('Name',axis=1).to_numpy().astype(np.float64),axis=0).shape

How about a higher `num_reads` but still less than 10000? `error_rate = 0.0`, `num_reads = 1000`

In [None]:
train_data = pd.read_csv('./SERS_0_1000_Training_Genome_10mer_data.csv')
np.unique(train_data.drop('Name',axis=1).to_numpy().astype(np.float64),axis=0).shape

What if `num_reads` is larger than 10000? `error_rate = 0.0`, `num_reads = 100000`

In [None]:
train_data = pd.read_csv('./SERS_0_100000_Training_Genome_10mer_data.csv')
np.unique(train_data.drop('Name',axis=1).to_numpy().astype(np.float64),axis=0).shape

Let's try a couple of intermediate cases, with `error_rate = 0.5` and a low and a high `num_reads`.

In [None]:
train_data = pd.read_csv('./SERS_50_100_Training_Genome_10mer_data.csv')
np.unique(train_data.drop('Name',axis=1).to_numpy().astype(np.float64),axis=0).shape

In [None]:
train_data = pd.read_csv('./SERS_50_100000_Training_Genome_10mer_data.csv')
np.unique(train_data.drop('Name',axis=1).to_numpy().astype(np.float64),axis=0).shape

Now what happens when the `error_rate` is 100%, even with the highest resolution, `num_reads = 1000000`?

In [None]:
train_data = pd.read_csv('./SERS_100_1000000_Training_Genome_10mer_data.csv')
np.unique(train_data.drop('Name',axis=1).to_numpy().astype(np.float64),axis=0).shape

# Interpretations

* `num_reads` \\(\le10000\\), `error_rate` \\(< 1.0\\): It is not hard to see why the number of unique rows is at most 200. The sampling scheme pre-samples a \\(200\times10000\\) array and, for each sample, uses one random row of it (the first row of the row-permuted array). So there can be at most 200 unique rows.

* `num_reads` \\(>10000\\), `error_rate` \\(< 1.0\\): In this case, the first `max_depth` rows of the permuted array are used, where `max_depth = int(num_reads[nn]/10000)`. There is a lot more variability, and you would have to be extremely lucky to get duplicate rows.

* `error_rate` \\(=1.0\\): The output is always a constant row?? Well, it is clear why this is happening if you look at the code more carefully.
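
The first two bullets can be checked on a toy scale. The sketch below is not the authors' code: it assumes a much smaller pool (20 rows instead of 200), a uniform distribution over 6 bins, and an arbitrary seed, and it keeps only the "permute the rows, then histogram the first few" step.

```python
import numpy as np

rng = np.random.default_rng(0)       # arbitrary seed, for reproducibility only
n_rows, row_len, n_bins = 20, 50, 6  # toy sizes; the notebook uses 200, 10000, 286
n_samples = 500

# Pre-sample a fixed pool of reads once, as SERS_reads does.
pool = rng.integers(0, n_bins, size=(n_rows, row_len))

def simulate(depth):
    """Histogram the first `depth` rows of a freshly row-permuted pool, n_samples times."""
    out = np.empty((n_samples, n_bins), dtype=np.int64)
    for j in range(n_samples):
        shuffled = rng.permutation(pool)  # shuffles whole rows (first axis only)
        out[j] = np.bincount(shuffled[:depth].ravel(), minlength=n_bins)
    return out

unique_low = np.unique(simulate(depth=1), axis=0).shape[0]    # like num_reads <= 10000
unique_high = np.unique(simulate(depth=10), axis=0).shape[0]  # like num_reads > 10000
print(unique_low, unique_high)
```

With `depth=1` the number of unique histograms can never exceed the number of rows in the pool, while with `depth=10` there are far more possible row subsets and duplicates become rare.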

At the beginning of the function `SERS_values` you can find these lines. `jj` indexes the samples to be generated. Inside the "infinite loop", the BOC reads `arr` are first shuffled row-wise and saved as `mr`. If the `error_rate` is 1.0, `mutate` is identically one. So, after "adding in the mutations", `mr` becomes the same as `mutation`, a \\(200\times10000\\) pre-sampled array from the bias distribution. But wait, this array has not been shuffled. Subsequently, only the first `max_depth` rows of this array are used, so the output is always the same. Should the `mutation` array have been permuted for each sample? My common sense says yes, but I am neither a biologist nor a DNA device engineer, so I cannot tell whether a real device would behave this way.

```
def SERS_values(sers,categories,num_reads,arr,mutate,mutation):

    jj = 0
    while 1:
    # Randomizing the pyramid array
        mr = np.random.permutation(arr)
    # Adding in the mutations
        mr = np.where(mutate > 0, mutation, mr)
    .
    .
    .
```
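
The effect can be reproduced on a toy scale. In the sketch below, the array sizes, the uniform distributions, the seed, and the `permute_mutation` switch are all my own assumptions, and the error-rate axis of the real arrays is dropped. The shuffled variant is only a hypothetical alternative for comparison, not something found in the original source.

```python
import numpy as np

rng = np.random.default_rng(1)       # arbitrary seed
n_rows, row_len, n_bins = 20, 50, 6  # toy sizes, not the notebook's

arr = rng.integers(0, n_bins, size=(n_rows, row_len))       # stands in for the genome reads
mutation = rng.integers(0, n_bins, size=(n_rows, row_len))  # stands in for the bias reads
mutate = np.ones((n_rows, row_len), dtype=np.int16)         # error_rate = 1.0: every position mutates

def one_sample(permute_mutation):
    """Generate one histogram, optionally shuffling the rows of `mutation` as well."""
    mr = rng.permutation(arr)
    mut = rng.permutation(mutation) if permute_mutation else mutation
    mr = np.where(mutate > 0, mut, mr)
    return np.bincount(mr[0], minlength=n_bins)

as_published = np.array([one_sample(False) for _ in range(200)])
with_shuffle = np.array([one_sample(True) for _ in range(200)])

n_unique_published = np.unique(as_published, axis=0).shape[0]  # always 1
n_unique_shuffled = np.unique(with_shuffle, axis=0).shape[0]   # between 1 and n_rows
print(n_unique_published, n_unique_shuffled)
```

Without shuffling `mutation`, every sample is the histogram of the same first row, so exactly one unique row comes out regardless of how `arr` is permuted.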

That concludes the reveal of the mystery of constant mutation.

# Conclusions

It is important to emphasize that this is an investigation of how the data might have been generated in the original paper, according to the published accompanying code. I have no knowledge of how the data was generated in this competition. Whether one should remove duplicates, keep them, or do something special about them is one of the modeling decisions a modeler needs to make. I do believe that understanding the data generation model is an important step in modeling, even if that data generation model might not reflect reality or might be buggy.