# Problem set 2
## "The adventure of ten Arcs"
### Zachary Miller

## Part 1: Repoducing Moriarty's Results
The first thing we want to do is to reproduce Moriarty's result using kallisto on the original data. First, let's run the kallisto command with no arguments to get the usage information.

In [1]:
! kallisto

kallisto 0.46.0

Usage: kallisto <CMD> [arguments] ..

Where <CMD> can be one of:

    index         Builds a kallisto index 
    quant         Runs the quantification algorithm 
    bus           Generate BUS files for single-cell data 
    pseudo        Runs the pseudoalignment step 
    merge         Merges several batch runs 
    h5dump        Converts HDF5-formatted results to plaintext
    inspect       Inspects and gives information about an index
    version       Prints version information
    cite          Prints citation information

Running kallisto <CMD> without arguments prints usage information for <CMD>



As is mentioned in the lecture notes, we are most interested in the `index` and `quant` commands, which we will use to reproduce Moriarty's expirement below. Notice that we tweaked the arguments to match Moriarty's expirement. 

In [2]:
! kallisto index -i moriarty_transcripts.idx arc.fasta.gz
! kallisto quant -i moriarty_transcripts.idx -o moriarty_output --single -l 150 -s 20 arc.fastq.gz


[build] loading fasta file arc.fasta.gz
[build] k-mer length: 31
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done 
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 19 contigs and contains 10000 k-mers 


[quant] fragment length distribution is truncated gaussian with mean = 150, sd = 20
[index] k-mer length: 31
[index] number of targets: 10
[index] number of k-mers: 10,000
[index] number of equivalence classes: 26
[quant] running in single-end mode
[quant] will process file 1: arc.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 100,000 reads, 99,983 reads pseudoaligned
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 58 rounds



Now let's take a look at the output of kallisto to make sure it matches the results Moriarty got.

In [3]:
import pandas as pd

m_data = pd.read_csv('moriarty_output/abundance.tsv', delim_whitespace=True)
m_data

Unnamed: 0,target_id,length,eff_length,est_counts,tpm
0,Arc1,4000,3851,2810.71,20570.4
1,Arc2,2000,1851,3640.85,55436.5
2,Arc3,3000,2851,28233.5,279105.0
3,Arc4,4000,3851,10376.2,75939.3
4,Arc5,4000,3851,12912.9,94504.2
5,Arc6,3000,2851,1983.17,19604.8
6,Arc7,2000,1851,5442.49,82868.8
7,Arc8,2000,1851,5635.31,85804.8
8,Arc9,3000,2851,3082.56,30472.9
9,Arc10,3000,2851,25865.2,255693.0


Looking at the tpm column, we see that kallisto is giving us tpm counts that are about $\pm500$ what Moriarty reported, and are certainly much closer to Moriarty's tpm counts than the ones we reported. 

## Part 2: Creating a RNA-Seq Simulator

Now, we decide that we want to test how well kallisto works when we feed it simulated data, since when we simulate data we have a sort of "God's eye" view and get to know the correct answer. To do so, we will write three functions. The first function will generate a synthetic Arc locus according to the paramters given in the problem statement. The second function will then create the simulated transcriptome from that Arc locus, using either the transcript lengths we found in our expirement, or randomly generated transcript lengths that are between 2 and 4 (inclusive) segments long. This function should write the transcriptome to a FASTA file and return the transcriptome as a 2d list. Lastly, we will need a third function that uses the transcriptome to generate the reads that will be fed into Kallisto, keeping in mind how the mRNA sequences are broken into reads in the actual expirement.

In [4]:
import numpy as np
import numpy.random as rand
import random
rand.seed(42)

# A few handy functions for this PSET
def trunc_gaussian(mean, sd, max_sample, min_sample):
    """Returns an integer sampled from a guassian distribution with mean=mean and standard deviation = sd,
    truncated to be between max_sample and min_sample (inclusive)"""
    while True:
        sample = int(rand.normal(mean, sd))
        if sample >= min_sample: break
    
    if sample > max_sample: sample = max_sample
    
    return sample

def reverse_comp(seq):
    """Returns the reverse compliment of a DNA sequence given as a string of the characters A C G T"""
    seq = seq[::-1]
    new_seq = ""
    for c in seq:
        if c == "A":
            new_seq += "T"
        elif c == "C":
            new_seq += "G"
        elif c == "G":
            new_seq += "C"
        elif c == "T":
            new_seq += "A"
            
    return new_seq

def base_error(seq, alpha):
    """Given a sequence seq and a base calling error rate alpha, will generate base calling errors in the 
    sequence at a probability equal to alpha"""
    new_seq = ""
    for base in seq:
        if rand.uniform(0,1) <= alpha:
            if base == "A":
                new_seq += rand.choice(["C","G","T"])
            elif base == "C":
                new_seq += rand.choice(["A","G","T"])
            elif base == "G":
                new_seq += rand.choice(["A","C","T"])
            elif base == "T":
                new_seq += rand.choice(["A","C","G"])
        
        else: new_seq += base
        
    return new_seq 

def calc_correct_tpm(nt_abunds, lengths):
    nt_abunds = np.array(nt_abunds)
    lengths = np.array(lengths)
    correct_tpm = []
    for idx, abund in enumerate(nt_abunds):
        fac_1 = abund/lengths[idx]
        fac_2 = sum(nt_abunds/lengths)
        correct_tpm.append((fac_1/fac_2)*(10**6))
    correct_tpm = [round(i,1) for i in correct_tpm]    
        
    return correct_tpm

# Arc locus parameters
S         = 10           # Number of segments in the Arc locus (A..J)
T         = S            # Number of different transcripts (the same, one starting on each segment, 1..10)
N         = 100000       # total number of observed reads we generate
alpha     = 0.999        # base calling accuracy (Q30 bases, typical of current Illumina)
S_len     = 1000         # length of each segment (nucleotides)
Arc_len   = S_len * S    # total length of the Arc locus (nucleotides)
R_len     = 75           # read length
frag_mean = 150          # fragment size: mean (of a truncated Gaussian)
frag_sd   = 20           # fragment size: stdev
bp_list   = ["A", "C",
            "G", "T"]
bp_probs  = [0.25, 0.25, # base pair probabilities in order: A C G T
            0.25, 0.25]

# Transcription parameters
v         = [0.008, 0.039, # default nucleotide abundances (equal to what we got from our expirement)
            0.291, 0.112,
            0.127, 0.008,
            0.059, 0.060,
            0.022, 0.273]

L         = [4000, 2000,   # default transcript lengths
            3000, 4000, 
            4000, 3000,
            2000, 2000,
            3000, 3000]

v_norm    = [i/sum(v) for  # value of v normalized to add up to one
            i in v] 

def create_arc_loc(new_arc_length):
    """Creates a DNA seqeunce, if user enters new_arc_length as False, it creates an arc locus according to 
    the length parameter above. If instead the user provides a specific length, then this returns an arc_locus
    of that length"""
    if new_arc_length == False:
        arc_locus = [""]*Arc_len
    
    else: arc_locus = [""]*new_arc_length
    
    for idx, element in enumerate(arc_locus):
        arc_locus[idx] = rand.choice(bp_list, p=bp_probs)
    
    return arc_locus

def create_arc_transcriptome(arc_locus, use_default_lengths, file_name):
    """Creates a tanscriptome based off an arc locus given as an input and outputs it to a FASTA file with 
    name "output". use_defualt_lengths allows the user to use the hardcoded lengths v by setting the 
    parameter to True, generate random lengths between 2 and 4 segments long (inclusive) by setting it to 
    False, or use their own given transcripts by setting the parameter equal to a list of lengths they want
    to use. file_name gives the name of the file that should be written to as a FASTA file"""
    
    arc_transcripts = [1]*S
    
    # Check if the user wants the random, defualt, or specified lengths
    if use_default_lengths == False:
        L_use = [1]*S
    
        for idx, element in enumerate(L_use):
            L_use[idx] = rand.choice([2*S_len, 3*S_len, 4*S_len])    
            
    elif use_default_lengths == True:
        L_use = L
        
    elif isinstance(use_default_lengths, list):
        L_use = use_default_lengths
        
    
    # Iterate over each transcript in the Arc locus
    for transcript_num, transcript_len in enumerate(L_use):
        seg_start = transcript_num*S_len
        
        # Correct for off-by-one error for non-zero segment start indices
        if seg_start > 0:
            seg_start -= 1  
        
        # Test if this iteration's transcript wraps fully around the circle, and then add the arc_locus
        # splits accordingly
        if (seg_start+1) + transcript_len > len(arc_locus):
            first_seg = arc_locus[seg_start:]
            arc_transcripts[transcript_num] = first_seg + arc_locus[:(transcript_len-len(first_seg))]   
        else:
            arc_transcripts[transcript_num] = arc_locus[seg_start:seg_start+transcript_len]
      
    # Write the arc_transcripts to a FASTA file
    file = open(file_name, "w")
    
    for transcript_num, transcript_seq in enumerate(arc_transcripts):
        fasta_list = [transcript_seq[i * 80:(i + 1) * 80] for i in range((len(transcript_seq) + 80 - 1)
                                                                         // 80 )]  
        file.write(">"+"Arc"+str(transcript_num+1)+"\n")
        
        for line in fasta_list:
            new_line = "".join(line)
            file.write(new_line+"\n")

    file.close()
    
    return arc_transcripts
    
def create_arc_reads(arc_transcripts, use_random_abund, file_name):
    
    v_use = []
    arc_reads = []
    file = open(file_name, "w")
    
    # Check if the user wants random abundances
    if use_random_abund == True:
        for i in range(10):
            v_use.append(rand.uniform(0,1))
        v_use = [j/sum(v_use) for j in v_use]
    elif use_random_abund == False:
        v_use = v_norm     
    else: v_use = use_random_abund
    
    # Iterate 100000 times, creating a new read each time
    for i in range(N):
        # Get a fragment of length read <= fragment <= transcript
        transcript_idx = rand.choice(len(arc_transcripts), p=v_use) # np did not work for square list
        transcript = arc_transcripts[transcript_idx]
        transcript_len = len(transcript)
        frag_len = trunc_gaussian(frag_mean, frag_sd, transcript_len, R_len)
        start_idx = rand.randint(0, (transcript_len-frag_len)+1)
        frag = transcript[start_idx:start_idx+frag_len]
        
        # With probability 1/2, take the reverse compliment of the frragment when creating the read to 
        # reflect the fact that our reads can come from either strand of the cDNA sequence. Note that we
        # write to the file each time instead of storing the sequences in a data structure to save time
        # and memory
        if rand.uniform(0,1) > 0.5:
            #arc_reads.append("".join(frag[:75])) want to test speed wrtiting directly to file, no list
            frag_str = "".join(frag[:75])
            frag_str = base_error(frag_str, 0.001)
            file.write("@read"+str(i)+"\n")
            file.write(frag_str+"\n")
            file.write("+\n")
            file.write("I"*len(frag_str)+"\n")
        else:
            rc_frag_str = "".join(reverse_comp(frag))
            rc_frag_str = base_error(rc_frag_str, 0.001)
            file.write("@read"+str(i)+"\n")
            file.write(rc_frag_str+"\n")
            file.write("+\n")
            file.write("I"*len(rc_frag_str)+"\n")

    file.close()
    
    return None

## Part 3: Testing Kallisto with a Postive Control

Now, with all of our functions written and parameter values set, let's create a simulated Arc Locus and transcriptome with the transcription abundances that we found in our original findings and see if Kallisto gets it correct. 

In [5]:
import time

start = time.time() # Record the time when we start our function pipeline

# Simulate arc transcriptome with the parameters from our PhD thesis conclusions
sim_arc_locus_1 = create_arc_loc(False)
sim_transcriptome_1 = create_arc_transcriptome(sim_arc_locus_1, True, "sim_transcriptome_1.fasta")
sim_reads_1 = create_arc_reads(sim_transcriptome_1, False, "sim_reads_1.fastq")

end = time.time() # Record the time when we end our function pipeline

print(end-start) # Print the difference to see the total run time to create the FASTQ read file

11.550817251205444


As a quick note, generating 100,000 reads takes under a minute, which is reasonable. Notice however that if we wanted to make this program scalable to create many more reads, we could use multi-threading to create many reads at once, and then save those reads to a list and write them to a file at the end. However, in that case we would need to be careful to consider memory usage (storing a massive list of millions of reads will use substansial RAM). However, since this is just a simulation, under a minute is an acceptable run time for our purposes here so multi-threading is not necessary.

Now that we have a FASTA file of our simulated transcriptome, and a FASTQ file for our simulated reads, we can feed the simulated data to Kallisto and see how it performs.

In [6]:
! kallisto index -i sim_kallisto_transcripts_1.idx sim_transcriptome_1.fasta
! kallisto quant -i sim_kallisto_transcripts_1.idx -o sim_kallisto_output_1 --single -l 150 -s 20 sim_reads_1.fastq


[build] loading fasta file sim_transcriptome_1.fasta
[build] k-mer length: 31
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done 
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 19 contigs and contains 10000 k-mers 


[quant] fragment length distribution is truncated gaussian with mean = 150, sd = 20
[index] k-mer length: 31
[index] number of targets: 10
[index] number of k-mers: 10,000
[index] number of equivalence classes: 26
[quant] running in single-end mode
[quant] will process file 1: sim_reads_1.fastq
[quant] finding pseudoalignments for the reads ... done
[quant] processed 100,000 reads, 99,985 reads pseudoaligned
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 55 rounds



In [7]:
# Print the tpm counts given by kallisto for our simulated data next to Moriarty's tpm counts and our
# calculation for what the correct tpm counts should be according to the parameters with which our 
# simulated data was generated
sim_data_1 = pd.read_csv('sim_kallisto_output_1/abundance.tsv', delim_whitespace=True)
m_vs_sim1_data = pd.DataFrame({"Moriarty tpm":list(m_data["tpm"]), "Simulated tpm":list(sim_data_1["tpm"]),
                              "Correct tpm":calc_correct_tpm(v_norm, L)})
m_vs_sim1_data

Unnamed: 0,Moriarty tpm,Simulated tpm,Correct tpm
0,20570.4,21816.0,5904.1
1,55436.5,50026.7,57564.6
2,279105.0,284044.0,286346.9
3,75939.3,73338.1,82656.8
4,94504.2,92521.3,93726.9
5,19604.8,21441.5,7872.1
6,82868.8,82128.0,87084.9
7,85804.8,88821.2,88560.9
8,30472.9,27867.7,21648.2
9,255693.0,257996.0,268634.7


Now we can clearly see that Kallisto is giving incorrect TPM values given the parameters this data was generated from. Furthermore, we can see that the TPM counts for our simulated data that was generated according to the parameters we found in our thesis expirement are basically the same as Moriarty's TPM counts, which means that our thesis conclusions are likely correct.

## Part 4: Debugging Kallisto

Now we want to find out why Kallisto is getting the TPM counts wrong. The first thing we notice that is unusual is that the Arc Locus is circular, rather than having a start and an end. This might affect the way Kallisto creates the de Bruijn graph, which could in turn affect its accuracy. We also see that there are many overlapping transcripts (up to 4 in some places), which means that we are relying on maximum likelyhood estimation to assign multi-mapped reads. This could be yet another reason for the error if our Arc Locus somehow violates the assumpotions underlying ML estimation. Therefore, we first want to test Kallisto's accuracy on the simplest case: a linear Arc Locus with no overlapping transcripts. 

In [8]:
# Test how kallisto performs on the ideal case by generating a transcriptome with no overlap

test_transcript_lengths_1 = [1000]*10
test_transcriptome_1 = create_arc_transcriptome(sim_arc_locus_1, test_transcript_lengths_1,
                                                "test_transcriptome_1.fasta")
test_reads_1 = create_arc_reads(test_transcriptome_1, False, "test_reads_1.fastq")

! kallisto index -i test_kallisto_transcripts_1.idx test_transcriptome_1.fasta
! kallisto quant -i test_kallisto_transcripts_1.idx -o test_kallisto_output_1 --single -l 150 -s 20 test_reads_1.fastq

test_data_1 = pd.read_csv('test_kallisto_output_1/abundance.tsv', delim_whitespace=True)
test_data_1 = pd.DataFrame({"Arc":list(test_data_1["target_id"]),
                            "Test 1 tpm":list(test_data_1["tpm"]),
                               "Correct tpm":calc_correct_tpm(v_norm, test_transcript_lengths_1)})
test_data_1


[build] loading fasta file test_transcriptome_1.fasta
[build] k-mer length: 31
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done 
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 10 contigs and contains 9700 k-mers 


[quant] fragment length distribution is truncated gaussian with mean = 150, sd = 20
[index] k-mer length: 31
[index] number of targets: 10
[index] number of k-mers: 9,700
[index] number of equivalence classes: 10
[quant] running in single-end mode
[quant] will process file 1: test_reads_1.fastq
[quant] finding pseudoalignments for the reads ... done
[quant] processed 100,000 reads, 99,068 reads pseudoaligned
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 52 rounds



Unnamed: 0,Arc,Test 1 tpm,Correct tpm
0,Arc1,8277.14,8008.0
1,Arc2,38932.9,39039.0
2,Arc3,289125.0,291291.3
3,Arc4,111711.0,112112.1
4,Arc5,126065.0,127127.1
5,Arc6,8277.14,8008.0
6,Arc7,57092.1,59059.1
7,Arc8,60160.7,60060.1
8,Arc9,22489.6,22022.0
9,Arc10,277870.0,273273.3


We can see that, in the ideal case above, Kallisto does an excellent job at estimating the TPM counts for each transcript. Now we want to detirmine if the error on the true Arc Locus is due to the circularity, the overlapping transcripts, or both. First, let's create a linear Arc Locus with overlaps to see if it is only the circularity that is causing the problem.

In [9]:
# Test how kallisto performs on linear arc locus with overlapping transcripts

test_arc_locus_2 = create_arc_loc(14000)
test_transcriptome_2 = create_arc_transcriptome(test_arc_locus_2, True,
                                                "test_transcriptome_2.fasta")
test_reads_2 = create_arc_reads(test_transcriptome_2, False, "test_reads_2.fastq")

! kallisto index -i test_kallisto_transcripts_2.idx test_transcriptome_2.fasta
! kallisto quant -i test_kallisto_transcripts_2.idx -o test_kallisto_output_2 --single -l 150 -s 20 test_reads_2.fastq

test_data_2 = pd.read_csv('test_kallisto_output_2/abundance.tsv', delim_whitespace=True)
test_data_2 = pd.DataFrame({"Arc":list(test_data_2["target_id"]),
                            "Test 2 tpm":test_data_2["tpm"], 
                            "Correct tpm":calc_correct_tpm(v_norm, L)})
test_data_2


[build] loading fasta file test_transcriptome_2.fasta
[build] k-mer length: 31
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done 
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 17 contigs and contains 11969 k-mers 


[quant] fragment length distribution is truncated gaussian with mean = 150, sd = 20
[index] k-mer length: 31
[index] number of targets: 10
[index] number of k-mers: 11,969
[index] number of equivalence classes: 23
[quant] running in single-end mode
[quant] will process file 1: test_reads_2.fastq
[quant] finding pseudoalignments for the reads ... done
[quant] processed 100,000 reads, 99,949 reads pseudoaligned
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 52 rounds



Unnamed: 0,Arc,Test 2 tpm,Correct tpm
0,Arc1,6365.35,5904.1
1,Arc2,59832.1,57564.6
2,Arc3,286153.0,286346.9
3,Arc4,83541.5,82656.8
4,Arc5,80836.0,93726.9
5,Arc6,26107.9,7872.1
6,Arc7,82852.5,87084.9
7,Arc8,84094.2,88560.9
8,Arc9,29311.0,21648.2
9,Arc10,260906.0,268634.7


Looking at the results, we can see that making the Arc Locus linear increases Kallisto's accuracy for Arc1, but it is still very far off for Arc6. This means that at least part of the problem arises from overlapping transcripts. Also notice that when we make the Arc Locus linear, we reduce the amount of overlap for Arc1 to 0, which means that the increased accuracy for Arc1 may have nothing to do with the linearization of the Arc Locus and instead be attributed to the lack of overlap for transcript 1.

In order to test how Kallisto's accuracy changes with the number of overlaps a given transcript has, let's make the abundances uniform and see how Kallisto handle's just finding which reads map to where. We can do this by setting Arc1's length to 4 segements, and all other lengths to 1 segement so that we know the number of overlapping transcripts for each transcript. If the accuracy truly relies on the number of overlapping transcripts, then we should see decreased accuracy for transciripts with the most overlap

In [10]:
# Test kallisto's performance for when the only difference between the transcripts is the number of
# overlapping transcripts

test_transcript_lengths_3 = [1000, 1000, 1000, 1000, 1000, 4000, 1000, 1000, 1000, 1000]
test_transcript_abunds_3 = [0.1]*10

test_arc_locus_3 = create_arc_loc(False)
test_transcriptome_3 = create_arc_transcriptome(test_arc_locus_3, test_transcript_lengths_3,
                                                "test_transcriptome_3.fasta")
test_reads_3 = create_arc_reads(test_transcriptome_3, test_transcript_abunds_3, "test_reads_3.fastq")

! kallisto index -i test_kallisto_transcripts_3.idx test_transcriptome_3.fasta
! kallisto quant -i test_kallisto_transcripts_3.idx -o test_kallisto_output_3 --single -l 150 -s 20 test_reads_3.fastq

test_data_3 = pd.read_csv('test_kallisto_output_3/abundance.tsv', delim_whitespace=True)
test_data_3 = pd.DataFrame({"Arc":list(test_data_3["target_id"]),
                            "Test 3 tpm":test_data_3["tpm"], 
                            "Correct tpm":calc_correct_tpm(test_transcript_abunds_3, test_transcript_lengths_3)})
test_data_3


[build] loading fasta file test_transcriptome_3.fasta
[build] k-mer length: 31
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done 
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 12 contigs and contains 9790 k-mers 


[quant] fragment length distribution is truncated gaussian with mean = 150, sd = 20
[index] k-mer length: 31
[index] number of targets: 10
[index] number of k-mers: 9,790
[index] number of equivalence classes: 13
[quant] running in single-end mode
[quant] will process file 1: test_reads_3.fastq
[quant] finding pseudoalignments for the reads ... done
[quant] processed 100,000 reads, 99,336 reads pseudoaligned
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 52 rounds



Unnamed: 0,Arc,Test 3 tpm,Correct tpm
0,Arc1,109048.0,108108.1
1,Arc2,107861.0,108108.1
2,Arc3,108113.0,108108.1
3,Arc4,109224.0,108108.1
4,Arc5,108014.0,108108.1
5,Arc6,26058.2,27027.0
6,Arc7,107138.0,108108.1
7,Arc8,107215.0,108108.1
8,Arc9,107358.0,108108.1
9,Arc10,109971.0,108108.1


Interestingly, there is no strong correlation between number of overlapping transcripts and Kallisto's accuracy when the abundances are all the same. Thinking back to Kallisto's origional error for our simulated data with the parameters from our thesis conclusion, we notice that the two transcripts that Kallisto performed poorly on (Arc1 and Arc6) were also the two with the lowest abundances. Therefore, it could be that Kallisto's accuracy is affected by a combination of both relative abundances and overlapping transcripts. Therefore, let's repeat the test run above, but set the transcript with the most overlap (Arc6) to also have either a very high abundance or a very low abundance. Our hypothesis is that Kallisto will over-estimate the TPM counts when the abundance is relativly low, and underestimate the TPM counts when the abundances are very high.

In [11]:
# Test kallisto's performance for when one transcript has many overlaps and low abundance

test_transcript_lengths_4 = [1000, 1000, 1000, 1000, 1000, 4000, 1000, 1000, 1000, 1000]
test_transcript_abunds_4 = [0.11, 0.11, 0.11, 0.11, 0.11, 0.01, 0.11, 0.11, 0.11, 0.11]

test_arc_locus_4 = create_arc_loc(False)
test_transcriptome_4 = create_arc_transcriptome(test_arc_locus_4, test_transcript_lengths_4,
                                                "test_transcriptome_4.fasta")
test_reads_4 = create_arc_reads(test_transcriptome_4, test_transcript_abunds_4, "test_reads_4.fastq")

! kallisto index -i test_kallisto_transcripts_4.idx test_transcriptome_4.fasta
! kallisto quant -i test_kallisto_transcripts_4.idx -o test_kallisto_output_4 --single -l 150 -s 20 test_reads_4.fastq

test_data_4 = pd.read_csv('test_kallisto_output_4/abundance.tsv', delim_whitespace=True)
test_data_4 = pd.DataFrame({"Arc":list(test_data_4["target_id"]),
                            "Test 4 tpm":test_data_4["tpm"], 
                            "Correct tpm":calc_correct_tpm(test_transcript_abunds_4, test_transcript_lengths_4)})
test_data_4


[build] loading fasta file test_transcriptome_4.fasta
[build] k-mer length: 31
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done 
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 12 contigs and contains 9790 k-mers 


[quant] fragment length distribution is truncated gaussian with mean = 150, sd = 20
[index] k-mer length: 31
[index] number of targets: 10
[index] number of k-mers: 9,790
[index] number of equivalence classes: 13
[quant] running in single-end mode
[quant] will process file 1: test_reads_4.fastq
[quant] finding pseudoalignments for the reads ... done
[quant] processed 100,000 reads, 99,294 reads pseudoaligned
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 52 rounds



Unnamed: 0,Arc,Test 4 tpm,Correct tpm
0,Arc1,110461.0,110831.2
1,Arc2,109469.0,110831.2
2,Arc3,113608.0,110831.2
3,Arc4,112178.0,110831.2
4,Arc5,112351.0,110831.2
5,Arc6,4201.31,2518.9
6,Arc7,106494.0,110831.2
7,Arc8,110204.0,110831.2
8,Arc9,109806.0,110831.2
9,Arc10,111227.0,110831.2


In [12]:
# Test kallisto's performance for when one transcript has many overlaps and high abundance

test_transcript_lengths_5 = [1000, 1000, 1000, 1000, 1000, 4000, 1000, 1000, 1000, 1000]
test_transcript_abunds_5 = [0.05, 0.05, 0.05, 0.05, 0.05, 0.55, 0.05, 0.05, 0.05, 0.05]

test_arc_locus_5 = create_arc_loc(False)
test_transcriptome_5 = create_arc_transcriptome(test_arc_locus_5, test_transcript_lengths_5,
                                                "test_transcriptome_5.fasta")
test_reads_5 = create_arc_reads(test_transcriptome_5, test_transcript_abunds_5, "test_reads_5.fastq")

! kallisto index -i test_kallisto_transcripts_5.idx test_transcriptome_5.fasta
! kallisto quant -i test_kallisto_transcripts_5.idx -o test_kallisto_output_5 --single -l 150 -s 20 test_reads_5.fastq

test_data_5 = pd.read_csv('test_kallisto_output_5/abundance.tsv', delim_whitespace=True)
test_data_5 = pd.DataFrame({"Arc":list(test_data_5["target_id"]),
                            "Test 5 tpm":test_data_5["tpm"], 
                            "Correct tpm":calc_correct_tpm(test_transcript_abunds_5, test_transcript_lengths_5)})
test_data_5


[build] loading fasta file test_transcriptome_5.fasta
[build] k-mer length: 31
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done 
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 12 contigs and contains 9790 k-mers 


[quant] fragment length distribution is truncated gaussian with mean = 150, sd = 20
[index] k-mer length: 31
[index] number of targets: 10
[index] number of k-mers: 9,790
[index] number of equivalence classes: 13
[quant] running in single-end mode
[quant] will process file 1: test_reads_5.fastq
[quant] finding pseudoalignments for the reads ... done
[quant] processed 100,000 reads, 99,547 reads pseudoaligned
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 52 rounds



Unnamed: 0,Arc,Test 5 tpm,Correct tpm
0,Arc1,87097.2,85106.4
1,Arc2,88615.9,85106.4
2,Arc3,87503.4,85106.4
3,Arc4,85861.0,85106.4
4,Arc5,88298.1,85106.4
5,Arc6,215008.0,234042.6
6,Arc7,85198.6,85106.4
7,Arc8,90208.1,85106.4
8,Arc9,83805.3,85106.4
9,Arc10,88404.0,85106.4


Looking at the results above, we see that the results are just as our hypothesis perdicted, with the high overlap/low abundance transcript having its TPM count overestimated and the high overlap/high abundance transcipt having its TPM count underestimated. Therefore, we beleive that soemething goes wrong with Kallisto when a transcript with a relativly high or low abundance also has many overlaps. One thing to note is that in the set up of the Arc Locus, the number of transcript overlaps is closely related to the length of a transcript, which in turn effects our calculation of the TPM vs abundance levels. Therefore, we should keep in mind that there may be some underlying relationship between these three things (abundance, transcript length, and number of transcript overlaps) that might change as you tweak any of the three values. 

One possible explanation for this behavior would be at the level of maximum likelyhood estimation. For example, say you have 15000 reads that map to three transcripts all of equal length, two of which have regular abundances and one of which has a relatively low abundance (but Kallisto does not know about the abundances). Kallisto's maximum likelyhood estimation might find that the reads were equally likely to have come from any of the three transcripts, in which case it would give 5,000 of the reads to each of the transcripts. Let's say that in reality, 7,250 reads came from each of the two regular abundance transcripts, and 500 reads came from the low abundace transcripts. Notice that the regular transcripts estimates are now off by 2,250 reads, while the low abundance transcript is off by 4,750 reads (over twice as many)! Now, further consider the fact that the error in the read counts relative to the true read count is even worse (7,250/5,000 = 1.45 vs 4,750/250 = 19)! However, to truly test if this is what is happening one would have to dig into the details of the maximum likelyhood calculation, equivilence class counts, and the realtionship between abundance, transcript length, and number of transcript overlaps.