## Problem statement:
* Inputs: 
    * 10^9 (R1, R2) pairs of short nucleic acid sequences
        * R1: Possibly edited site
        * R2: Domain which was used to do the editing (probably slightly altered)
    * 10^5 Domains
    * Reference unedited R1 sequence.
* Output: Sparse Data Frame in csv format
    * Rows: domains
    * Columns: Edited R1s (reduced to "cigar string" alignment)
    * Data: count of how many times domain_i produced cigar_j
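
A minimal sketch of what "reduced to a cigar string" could mean, assuming the R1 read has already been aligned to the reference and both are available as equal-length gapped strings with `-` for gaps. The function name `cigar_from_alignment` is hypothetical and is not part of `hess_pipeline_util`; it uses the standard SAM operations `M`, `I` and `D`.

```python
from itertools import groupby


def cigar_from_alignment(ref_aln: str, read_aln: str) -> str:
    """Run-length encode a gapped pairwise alignment into a CIGAR string.

    Assumes both strings have equal length and use '-' for gaps.
    Matches and mismatches are both written as 'M' (SAM convention).
    """
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned strings must have the same length")
    ops = []
    for ref_base, read_base in zip(ref_aln, read_aln):
        if ref_base == "-" and read_base == "-":
            raise ValueError("alignment column has a gap in both sequences")
        if ref_base == "-":
            ops.append("I")  # base only in the read: insertion
        elif read_base == "-":
            ops.append("D")  # base only in the reference: deletion
        else:
            ops.append("M")  # aligned column (match or mismatch)
    # Collapse consecutive identical operations into "<run length><op>".
    return "".join(f"{len(list(run))}{op}" for op, run in groupby(ops))


print(cigar_from_alignment("ACGT-ACGT", "ACGTTAC-T"))  # 4M1I2M1D1M
```

With plain `M`, substitutions are indistinguishable from matches. If single-base edits matter, the SAM `=` / `X` operations would keep them apart.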
    

## Brute Force Plan:

* Parameters:
    * n reads from each file
    * m domains to check
    * (eventually) p jobs
    * location of input data. 
    
* Pseudocode:
    * for R1, R2 in get_reads()
        * domain_i = get_best_domain(R2)
        * cigar_j = get_cigar(R1) 
        * df.loc[domain_i, cigar_j] += 1
    * df_to_csv(output.csv)
    * (eventually) split-apply-combine wrapper
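
The pseudocode above can be sketched with a `Counter` keyed on `(domain, cigar)`, which also makes the split-apply-combine step a plain sum of partial counters. This is one possible shape, not the pipeline's actual implementation: `count_chunk`, `combine`, `toy_domain` and `toy_cigar` are hypothetical names, and the two toy functions only stand in for the real domain lookup and alignment.

```python
from collections import Counter


def count_chunk(pairs, get_best_domain, get_cigar):
    """Apply step: count (domain, cigar) occurrences for one chunk of read pairs."""
    counts = Counter()
    for r1, r2 in pairs:
        counts[(get_best_domain(r2), get_cigar(r1))] += 1
    return counts


def combine(partials):
    """Combine step: add up the per-chunk counters."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts, it does not overwrite
    return total


# Toy stand-ins (assumptions): domain = first two bases of R2, cigar = "<len>M".
def toy_domain(r2):
    return r2[:2]


def toy_cigar(r1):
    return f"{len(r1)}M"


# Split step: two chunks, as if handled by two jobs.
chunks = [
    [("ACGT", "AATT"), ("ACG", "AAGG")],
    [("ACGT", "AACC"), ("ACGT", "CCTT")],
]
partials = [count_chunk(chunk, toy_domain, toy_cigar) for chunk in chunks]
total = combine(partials)
print(total)
```

Keeping the counts in a dictionary until the end avoids growing a DataFrame cell by cell, and each job only needs to return its own counter.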
        

In [1]:
import htcondor
import os
import pandas as pd
import numpy as np
from pathlib import Path
import hess_pipeline_util as hpu

In [12]:
# pd.SparseDtype(np.dtype('int'))
# pd.DataFrame.sparse.from_spmatrix(pd.SparseDtype(np.dtype('int')))
# help(pd.arrays.SparseArray)
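df_result = pd.DataFrame()
for r1, r2 in hpu.get_pairs(n=10):
    domain_i = hpu.get_best_domain(r2)
    cigar_j = hpu.get_cigar(r1)
    # A missing label raises KeyError; a cell that exists but was never set is NaN,
    # and NaN + 1 stays NaN, so both cases must restart the count at 1.
    try:
        current = df_result.loc[domain_i, cigar_j]
    except KeyError:
        current = np.nan
    df_result.loc[domain_i, cigar_j] = 1 if pd.isna(current) else current + 1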
print(df_result.fillna(0))

      M0   M1   M2   M3   M4   M5   M6   M7   M8   M9
0    1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
-10  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
-20  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
-30  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0
-40  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0
-50  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0
-60  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0
-70  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0
-80  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0
-90  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0
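
Most cells in the table above are zero, which is the case the sparse output is meant for. A minimal sketch of two ways to keep that compact, using made-up counts: a long-format CSV with one row per non-zero `(domain, cigar)` cell, and a wide frame stored with `pd.SparseDtype`. The long format is an assumption about what "sparse csv" could look like, and the labels `d0`, `d1` are illustrative only.

```python
import io
from collections import Counter

import pandas as pd

# Made-up counts keyed on (domain, cigar).
counts = Counter({("d0", "4M"): 2, ("d0", "3M"): 1, ("d1", "4M"): 1})

# Long format: only non-zero cells are written.
long_df = pd.DataFrame(
    [(domain, cigar, n) for (domain, cigar), n in counts.items()],
    columns=["domain", "cigar", "count"],
)
buf = io.StringIO()  # stands in for an output file
long_df.to_csv(buf, index=False)
csv_text = buf.getvalue()
print(csv_text)

# Wide format: domains as rows, cigars as columns, zeros stored implicitly.
wide = long_df.pivot(index="domain", columns="cigar", values="count")
wide_sparse = wide.fillna(0).astype("int64").astype(pd.SparseDtype("int64", 0))
print(wide_sparse.sparse.density)  # fraction of cells that are explicitly stored
```

Writing the wide frame with `to_csv` still emits every zero as text, so the long format is likely the smaller file when most cells are empty.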


In [11]:
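a = list(hpu.FASTQ_HOME.glob(hpu.DEFAULT_GLOB_PATTERN))
# Drop the extension and the trailing mate digit so R1/R2 files share one header.
pair_headers = sorted({el.name.split('.')[0][:-1] for el in a})
print(pair_headers)
for ph in pair_headers:
    pass  # per-pair processing not written yet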


['Nuc_ABA_BE_Rep1_R', 'Nuc_ABA_BE_Rep2_R', 'Nuc_ABA_KO_Rep1_R', 'Nuc_ABA_KO_Rep2_R', 'Nuc_BE_Rep1_R', 'Nuc_BE_Rep2_R', 'Nuc_KO_Rep1_R', 'Nuc_KO_Rep2_R']
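
Each header above is a file name with its extension and final mate digit removed, so the R1 and R2 files of one sample share a header. A minimal sketch of grouping file names back into pairs under that assumption. `group_read_files` is a hypothetical helper, and the `.fastq.gz` extension in the example names is a guess, not taken from the data.

```python
from collections import defaultdict


def group_read_files(names):
    """Group file names into {header: {mate: name}}.

    Assumes each name looks like '<header><mate>.<extension>', where the
    mate digit ('1' or '2') is the last character before the first dot.
    """
    groups = defaultdict(dict)
    for name in names:
        stem = name.split(".")[0]
        header, mate = stem[:-1], stem[-1]
        groups[header][mate] = name
    return dict(groups)


names = [
    "Nuc_BE_Rep1_R1.fastq.gz",  # hypothetical extension
    "Nuc_BE_Rep1_R2.fastq.gz",
    "Nuc_KO_Rep1_R1.fastq.gz",
    "Nuc_KO_Rep1_R2.fastq.gz",
]
pairs = group_read_files(names)
for header in sorted(pairs):
    print(header, pairs[header]["1"], pairs[header]["2"])
```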


In [1]:
import pandas as pd
import scipy.sparse
mat = scipy.sparse.eye(3)
pd.DataFrame.sparse.from_spmatrix(mat)

ModuleNotFoundError: No module named 'scipy'