# Cancer Expression Heatmap Testing Suite

### This testing suite tests the functions and outputs of the cancer_expression_heatmap notebook.

We first run the notebook to load all of its variables and functions, and then test the functions one by one.

The %%capture cell magic hides the output of that run.

In [5]:
%%capture
%run cancer_expression_heatmap.ipynb

## Test Data Imports and Preparation

1. Test remove_patients_from_list
    This test checks whether any patients from the to_remove list are still in the dataset. It returns True if all of those patients were successfully removed.

In [19]:
def test_remove_patients_from_list(to_remove_file):
    function_patients = remove_patients_from_list(to_remove_file)
    to_remove_list = pd.read_csv(to_remove_file, delimiter = '\t')["Patient ID"].tolist()
    test = function_patients['submitter_id'].isin(to_remove_list)
    # isin() already gives one boolean per row; looping over iteritems()
    # yields (index, value) tuples, which never compare equal to True
    return not test.any()

test_remove_patients_from_list('datasets/paad_tcga_clinical_data.tsv')

True
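
The idea behind this check can be sketched on toy data. The patient IDs and the remaining table below are hypothetical stand-ins, not values from the real clinical file, and the sketch does not call remove_patients_from_list itself.

```python
import pandas as pd

# Hypothetical stand-ins for the real inputs.
to_remove_list = ["TCGA-01", "TCGA-02"]
remaining = pd.DataFrame({"submitter_id": ["TCGA-03", "TCGA-04"]})

# isin() yields one boolean per row; any() is True if at least one
# removed patient is still present in the table.
still_present = remaining["submitter_id"].isin(to_remove_list)
all_removed = not still_present.any()
print(all_removed)
```

A vectorized check like this also sidesteps a pitfall of looping over a Series with items(), which yields (index, value) pairs rather than bare values.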

2. Test filter_genes_of_interest 1 checks that every gene in the filtered table appears in the provided neurotransmitter gene family file.

3. Test filter_genes_of_interest 2 checks that all of the neurotransmitter genes made it into the filtered table (meaning that the original table contained all 107 of the neurotransmitter genes).

In [26]:
def test_filter_genes_of_interest_1(table):
    function_filter = filter_genes_of_interest(table)
    test = function_filter['hgnc_symbol'].isin(neurotransmitter_genes["receptor gene"].tolist())
    # every row must be True; looping over iteritems() yields
    # (index, value) tuples, which never compare equal to False
    return bool(test.all())

test_filter_genes_of_interest_1(all_rnaseq)

True

In [32]:
def test_filter_genes_of_interest_2(table):
    function_filter = filter_genes_of_interest(table)
    filter_count = function_filter['hgnc_symbol'].count() 
    neuro_count = neurotransmitter_genes["receptor gene"].count()
    return filter_count == neuro_count

test_filter_genes_of_interest_2(all_rnaseq)

True
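
Both directions of this check (nothing unexpected in the table, nothing missing from it) can be sketched with set differences. The gene lists below are hypothetical toy data, and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-ins: a reference gene list and a filtered table.
reference_genes = pd.Series(["DRD1", "DRD2", "CHRM1"], name="receptor gene")
filtered = pd.DataFrame({"hgnc_symbol": ["CHRM1", "DRD1", "DRD2"]})

found = set(filtered["hgnc_symbol"])
expected = set(reference_genes)

unexpected = found - expected   # genes that should not be in the table
missing = expected - found      # reference genes that were filtered out
print(unexpected, missing)
```

Unlike a comparison of counts, the set differences name the offending genes. Sets discard duplicates, though, so a count comparison is still useful alongside them.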

4. Test create_counts_list checks that none of the entries returned by create_counts_list appear among the table's column names.

In [52]:
def test_create_counts_list(table):
    column_list = table.columns.values
    counts_list = create_counts_list()
    return not any(item in column_list for item in counts_list)

test_create_counts_list(rnaseq)

True

5. Test sort_genes_of_interest 1 checks that every hgnc_symbol stays with its row of counts after sorting.

6. Test sort_genes_of_interest 2 checks that the genes are sorted in the same order as the neurotransmitter gene families.

In [59]:
def test_sort_genes_of_interest_1(table):
    t_unsorted = table.copy()
    t_sorted = sort_genes_of_interest(table)
    
    t_merged = pd.merge(t_unsorted, t_sorted, on=list(t_unsorted.columns.values), how='inner')
    
    return t_merged.count() == t_unsorted.count()

test_sort_genes_of_interest_1(rnaseq_goi)

Unnamed: 0                                               True
hgnc_symbol                                              True
X00faf8ba.ff90.4214.9d03.6c5e14645d8f.htseq.counts.gz    True
X0143419f.2abe.4906.bb55.af6010fab05f.htseq.counts.gz    True
X01f84c45.2058.4e22.b234.52f0a82a97fc.htseq.counts.gz    True
                                                         ... 
f9f63982.b0ee.4cb8.8de5.f885d82137f0.htseq.counts.gz     True
fb65f821.92cb.402a.ad2f.d4044ca7de4d.htseq.counts.gz     True
fcd43085.7338.43fe.bc25.9d87b04e227f.htseq.counts.gz     True
feb22766.4282.47c8.bfe2.7d020b4a15d4.htseq.counts.gz     True
fef65b57.c58d.4050.8de4.f09f5cd616ce.htseq.counts.gz     True
Length: 156, dtype: bool

In [62]:
def test_sort_genes_of_interest_2(table):
    i = 0
    t_sorted = sort_genes_of_interest(table)
    for index, row in t_sorted.iterrows():
        if row['hgnc_symbol'] != receptor_gene_list[i]:
            return False
        i = i + 1
    return True

test_sort_genes_of_interest_2(rnaseq_goi)

True
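
The two sorting checks can be sketched together on a toy table. Everything here is hypothetical: the reordering step is only a stand-in for sort_genes_of_interest, and canonical is a helper name introduced for this example.

```python
import pandas as pd

# Hypothetical toy table and the order we expect after sorting.
unsorted = pd.DataFrame({"hgnc_symbol": ["DRD2", "CHRM1", "DRD1"],
                         "sample_a": [20, 30, 10]})
expected_order = ["DRD1", "DRD2", "CHRM1"]

# Stand-in for the function under test: reorder rows to match expected_order.
sorted_table = (unsorted.set_index("hgnc_symbol")
                        .loc[expected_order]
                        .reset_index())

# Order check: the symbol column matches the reference list exactly.
order_ok = sorted_table["hgnc_symbol"].tolist() == expected_order

# Integrity check: once both tables are put in a common order,
# every symbol must still carry its original counts.
def canonical(df):
    return df.sort_values("hgnc_symbol").reset_index(drop=True)

rows_ok = canonical(unsorted).equals(canonical(sorted_table))
print(order_ok, rows_ok)
```

Reducing each check to a single boolean makes the result easier to read than a per-column Series of True values.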

The draw_expression_heatmap function performs no data manipulation, so we only check the heatmap it produces to confirm that the function does what we want.

## Test draw_expression_log_heatmap data manipulations

7. Test z_score. Here we test z_score against scipy's zscore function, which is accurate and faster than a hand-built implementation. The test compares running zscore directly on a dataframe with converting the dataframe to numpy and running z_score on the array. This is a fairly weak test, but since the scipy library does all of the math anyway, it is not clear what else to check.

The output values are all very small, most likely because of rounding error and differences between pandas and numpy data types.

In [99]:
def test_z_score(table):
    function_table = table.drop('hgnc_symbol', axis=1)
    comparator_table = function_table.copy()
    comparator_values = comparator_table.apply(stats.zscore)
    function_numpy = function_table.to_numpy(copy=True)
    comparator_numpy = comparator_values.to_numpy(copy=True)
    function_values = z_score(function_numpy)
    return function_values == comparator_values, function_values

test_z_score_result, rnaseq_zscore = test_z_score(rnaseq)

In [100]:
test_z_score_result

Unnamed: 0,X00faf8ba.ff90.4214.9d03.6c5e14645d8f.htseq.counts.gz,X0143419f.2abe.4906.bb55.af6010fab05f.htseq.counts.gz,X01f84c45.2058.4e22.b234.52f0a82a97fc.htseq.counts.gz,X03094067.02d4.40c5.b6fa.bb5180dc7eab.htseq.counts.gz,X0349f526.7816.4a7d.9967.1f75dd9ff00a.htseq.counts.gz,X03630a0c.aa97.4e28.bac9.0206fff669cd.htseq.counts.gz,X03761959.a620.440f.bbaa.33bd75afae1c.htseq.counts.gz,X057aa9ac.f22c.4c11.a44d.ad52ae59b4cf.htseq.counts.gz,X05f0ced5.6976.4f43.9be5.fddb3f550adf.htseq.counts.gz,X0726996d.62f2.4880.808c.cfe3361b4b42.htseq.counts.gz,...,eb3894d4.fcae.43ef.ad68.b756c6aa56ea.htseq.counts.gz,f144de50.6126.4912.9c94.824d1eb0fac5.htseq.counts.gz,f2389819.b8fc.460e.821c.01dba313cce1.htseq.counts.gz,f6bd7191.a820.4d86.927a.b4b5f88ebd67.htseq.counts.gz,f748bf78.4dc1.47ad.8611.8186479d3e4b.htseq.counts.gz,f8551a29.d4bd.4954.bf9c.8e10265063de.htseq.counts.gz,f9f63982.b0ee.4cb8.8de5.f885d82137f0.htseq.counts.gz,fcd43085.7338.43fe.bc25.9d87b04e227f.htseq.counts.gz,feb22766.4282.47c8.bfe2.7d020b4a15d4.htseq.counts.gz,fef65b57.c58d.4050.8de4.f09f5cd616ce.htseq.counts.gz
6402,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
6403,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
6404,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
6405,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
6406,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4266,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
4267,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
4268,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
4271,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True


In [93]:
rnaseq_zscore

array([[ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
         1.11022302e-16,  1.11022302e-16,  1.11022302e-16],
       [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
         0.00000000e+00,  8.32667268e-17, -2.22044605e-16],
       [ 2.77555756e-17, -4.85722573e-17, -1.38777878e-17, ...,
         1.04083409e-16,  8.67361738e-17, -2.77555756e-17],
       ...,
       [ 2.77555756e-17, -5.55111512e-17,  0.00000000e+00, ...,
         1.11022302e-16,  1.11022302e-16,  5.55111512e-17],
       [ 0.00000000e+00,  0.00000000e+00, -1.21430643e-17, ...,
         1.11022302e-16,  8.32667268e-17, -2.77555756e-17],
       [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
         1.11022302e-16,  1.11022302e-16,  1.11022302e-16]])
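
Because rounding error is a concern when comparing floating-point results, a tolerance-based comparison may be more robust than ==. The sketch below uses hypothetical toy counts and computes the z-scores by hand with a population standard deviation (ddof=0, which is also scipy's zscore default); it is not the notebook's z_score function.

```python
import numpy as np
import pandas as pd

# Hypothetical toy counts: rows are genes, columns are samples.
counts = pd.DataFrame({"s1": [1.0, 2.0, 3.0, 4.0],
                       "s2": [10.0, 20.0, 40.0, 80.0]})

# Route 1: column-wise z-scores computed with pandas.
via_pandas = ((counts - counts.mean()) / counts.std(ddof=0)).to_numpy()

# Route 2: the same calculation on the underlying numpy array.
arr = counts.to_numpy()
via_numpy = (arr - arr.mean(axis=0)) / arr.std(axis=0)

# allclose() accepts differences within a small tolerance, whereas ==
# demands bit-for-bit equality.
print(np.allclose(via_pandas, via_numpy))
```

np.allclose returns a single boolean, so the test result does not need to be read off a large table of True values.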

8. Test pandas to numpy

In [78]:
def test_convert_pandas_to_numpy(p_table):
    n_table = convert_pandas_to_numpy(p_table)
    return p_table, n_table

pandas_rnaseq, numpy_rnaseq = test_convert_pandas_to_numpy(rnaseq)

In [79]:
pandas_rnaseq

Unnamed: 0,hgnc_symbol,X00faf8ba.ff90.4214.9d03.6c5e14645d8f.htseq.counts.gz,X0143419f.2abe.4906.bb55.af6010fab05f.htseq.counts.gz,X01f84c45.2058.4e22.b234.52f0a82a97fc.htseq.counts.gz,X03094067.02d4.40c5.b6fa.bb5180dc7eab.htseq.counts.gz,X0349f526.7816.4a7d.9967.1f75dd9ff00a.htseq.counts.gz,X03630a0c.aa97.4e28.bac9.0206fff669cd.htseq.counts.gz,X03761959.a620.440f.bbaa.33bd75afae1c.htseq.counts.gz,X057aa9ac.f22c.4c11.a44d.ad52ae59b4cf.htseq.counts.gz,X05f0ced5.6976.4f43.9be5.fddb3f550adf.htseq.counts.gz,...,eb3894d4.fcae.43ef.ad68.b756c6aa56ea.htseq.counts.gz,f144de50.6126.4912.9c94.824d1eb0fac5.htseq.counts.gz,f2389819.b8fc.460e.821c.01dba313cce1.htseq.counts.gz,f6bd7191.a820.4d86.927a.b4b5f88ebd67.htseq.counts.gz,f748bf78.4dc1.47ad.8611.8186479d3e4b.htseq.counts.gz,f8551a29.d4bd.4954.bf9c.8e10265063de.htseq.counts.gz,f9f63982.b0ee.4cb8.8de5.f885d82137f0.htseq.counts.gz,fcd43085.7338.43fe.bc25.9d87b04e227f.htseq.counts.gz,feb22766.4282.47c8.bfe2.7d020b4a15d4.htseq.counts.gz,fef65b57.c58d.4050.8de4.f09f5cd616ce.htseq.counts.gz
6402,DRD1,147,280,160,116,80,78,514,141,264,...,48,173,99,75,239,78,72,127,115,113
6403,DRD2,4112,2811,17294,3172,13653,5470,3053,15737,5267,...,10159,2557,8638,8886,9238,14085,6181,9665,1771,3394
6404,DRD3,1187,1428,1939,833,1555,1072,1037,2285,1689,...,1475,1105,1357,1286,1366,1016,795,2165,1472,1364
6405,DRD4,2,79,83,39,189,79,6,74,82,...,129,28,126,96,102,84,38,85,73,153
6406,DRD5,357,237,300,206,258,196,81,335,336,...,376,246,158,220,485,249,333,498,390,246
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4266,CHRM1,5685,7649,8086,3273,5948,3631,4643,7445,6605,...,6696,3255,5759,5961,6380,6388,4604,6820,6240,4534
4267,CHRM2,2511,7302,24450,2328,6852,2250,4499,6888,6787,...,4054,2257,4812,4628,6258,6866,3075,6401,2549,4360
4268,CHRM3,1073,689,1335,578,689,506,672,859,900,...,619,502,494,759,733,770,552,1186,538,600
4271,CHRM4,4702,2886,2049,1334,1432,2684,3089,2190,2864,...,2270,1474,729,1486,2581,1783,2244,5239,1790,1448


In [80]:
numpy_rnaseq

array([[  147.,   280.,   160., ...,   127.,   115.,   113.],
       [ 4112.,  2811., 17294., ...,  9665.,  1771.,  3394.],
       [ 1187.,  1428.,  1939., ...,  2165.,  1472.,  1364.],
       ...,
       [ 1073.,   689.,  1335., ...,  1186.,   538.,   600.],
       [ 4702.,  2886.,  2049., ...,  5239.,  1790.,  1448.],
       [    0.,     0.,     0., ...,     0.,     0.,     0.]])

9. Test numpy to pandas

In [86]:
def test_convert_numpy_to_pandas(n_table, p_table_columns):
    p_table = convert_numpy_to_pandas(n_table, p_table_columns)
    return p_table

test_convert_numpy_to_pandas(numpy_rnaseq, rnaseq.columns.values[1:])

Unnamed: 0,hgnc_symbol,X00faf8ba.ff90.4214.9d03.6c5e14645d8f.htseq.counts.gz,X0143419f.2abe.4906.bb55.af6010fab05f.htseq.counts.gz,X01f84c45.2058.4e22.b234.52f0a82a97fc.htseq.counts.gz,X03094067.02d4.40c5.b6fa.bb5180dc7eab.htseq.counts.gz,X0349f526.7816.4a7d.9967.1f75dd9ff00a.htseq.counts.gz,X03630a0c.aa97.4e28.bac9.0206fff669cd.htseq.counts.gz,X03761959.a620.440f.bbaa.33bd75afae1c.htseq.counts.gz,X057aa9ac.f22c.4c11.a44d.ad52ae59b4cf.htseq.counts.gz,X05f0ced5.6976.4f43.9be5.fddb3f550adf.htseq.counts.gz,...,eb3894d4.fcae.43ef.ad68.b756c6aa56ea.htseq.counts.gz,f144de50.6126.4912.9c94.824d1eb0fac5.htseq.counts.gz,f2389819.b8fc.460e.821c.01dba313cce1.htseq.counts.gz,f6bd7191.a820.4d86.927a.b4b5f88ebd67.htseq.counts.gz,f748bf78.4dc1.47ad.8611.8186479d3e4b.htseq.counts.gz,f8551a29.d4bd.4954.bf9c.8e10265063de.htseq.counts.gz,f9f63982.b0ee.4cb8.8de5.f885d82137f0.htseq.counts.gz,fcd43085.7338.43fe.bc25.9d87b04e227f.htseq.counts.gz,feb22766.4282.47c8.bfe2.7d020b4a15d4.htseq.counts.gz,fef65b57.c58d.4050.8de4.f09f5cd616ce.htseq.counts.gz
0,DRD1,147.0,280.0,160.0,116.0,80.0,78.0,514.0,141.0,264.0,...,48.0,173.0,99.0,75.0,239.0,78.0,72.0,127.0,115.0,113.0
1,DRD2,4112.0,2811.0,17294.0,3172.0,13653.0,5470.0,3053.0,15737.0,5267.0,...,10159.0,2557.0,8638.0,8886.0,9238.0,14085.0,6181.0,9665.0,1771.0,3394.0
2,DRD3,1187.0,1428.0,1939.0,833.0,1555.0,1072.0,1037.0,2285.0,1689.0,...,1475.0,1105.0,1357.0,1286.0,1366.0,1016.0,795.0,2165.0,1472.0,1364.0
3,DRD4,2.0,79.0,83.0,39.0,189.0,79.0,6.0,74.0,82.0,...,129.0,28.0,126.0,96.0,102.0,84.0,38.0,85.0,73.0,153.0
4,DRD5,357.0,237.0,300.0,206.0,258.0,196.0,81.0,335.0,336.0,...,376.0,246.0,158.0,220.0,485.0,249.0,333.0,498.0,390.0,246.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102,CHRM1,5685.0,7649.0,8086.0,3273.0,5948.0,3631.0,4643.0,7445.0,6605.0,...,6696.0,3255.0,5759.0,5961.0,6380.0,6388.0,4604.0,6820.0,6240.0,4534.0
103,CHRM2,2511.0,7302.0,24450.0,2328.0,6852.0,2250.0,4499.0,6888.0,6787.0,...,4054.0,2257.0,4812.0,4628.0,6258.0,6866.0,3075.0,6401.0,2549.0,4360.0
104,CHRM3,1073.0,689.0,1335.0,578.0,689.0,506.0,672.0,859.0,900.0,...,619.0,502.0,494.0,759.0,733.0,770.0,552.0,1186.0,538.0,600.0
105,CHRM4,4702.0,2886.0,2049.0,1334.0,1432.0,2684.0,3089.0,2190.0,2864.0,...,2270.0,1474.0,729.0,1486.0,2581.0,1783.0,2244.0,5239.0,1790.0,1448.0
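
The two conversion tests above are checked by eye. A round-trip check could automate them, as in this sketch on hypothetical toy data. It uses plain pandas and numpy calls as stand-ins, so the real convert_pandas_to_numpy and convert_numpy_to_pandas helpers may behave differently.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table shaped like the real one: a label column plus counts.
table = pd.DataFrame({"hgnc_symbol": ["DRD1", "DRD2"],
                      "s1": [147, 4112],
                      "s2": [280, 2811]})

# pandas -> numpy: keep only the numeric counts, as floats.
grid = table.drop(columns="hgnc_symbol").to_numpy(dtype=float)

# numpy -> pandas: restore the column names and the label column.
restored = pd.DataFrame(grid, columns=["s1", "s2"])
restored.insert(0, "hgnc_symbol", table["hgnc_symbol"].to_numpy())

# The counts survive the round trip; only the dtype changes to float.
round_trip_ok = np.array_equal(restored[["s1", "s2"]].to_numpy(), grid)
labels_ok = restored["hgnc_symbol"].tolist() == table["hgnc_symbol"].tolist()
print(round_trip_ok, labels_ok)
```

Comparing shapes as well would flag any row gained or lost during conversion.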


The following functions test the calculations performed in draw_expression_log_heatmap.

10. Test the natural log conversion performed in draw_expression_log_heatmap

11. Test the log 10 conversion performed in draw_expression_log_heatmap

12. Test skipping the conversion entirely in draw_expression_log_heatmap

13. Test the compute_zscore option in draw_expression_log_heatmap

14. Test the sorting option in draw_expression_log_heatmap

In [88]:
def test_natural_log_conversion(table, log_type):
    htseq_count_values = table.drop('hgnc_symbol', axis=1)

    expression_grid = htseq_count_values.to_numpy(copy=True, dtype=float)
    rnaseq_columns = list(table.columns.values)
    
    if log_type == 'natural':
        expression_grid = expression_grid + 1
        expression_logged = np.log(expression_grid)
        
    return expression_logged

test_natural_log_conversion(rnaseq, 'natural')

array([[4.99721227, 5.63835467, 5.08140436, ..., 4.85203026, 4.75359019,
        4.73619845],
       [8.32190797, 7.94165125, 9.75817272, ..., 9.17636985, 7.47986413,
        8.13005904],
       [7.0800265 , 7.26473018, 7.57044325, ..., 7.68063743, 7.29505642,
        7.21890971],
       ...,
       [6.97914528, 6.5366916 , 7.19743535, ..., 7.07918439, 6.28971557,
        6.39859493],
       [8.45595588, 7.96797318, 7.62559507, ..., 8.56407678, 7.4905294 ,
        7.27862894],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [89]:
def test_log_10_conversion(table, log_type):
    htseq_count_values = table.drop('hgnc_symbol', axis=1)

    expression_grid = htseq_count_values.to_numpy(copy=True, dtype=float)
    rnaseq_columns = list(table.columns.values)
    
    if log_type == 'base-10':
        expression_grid = expression_grid + 1
        expression_logged = np.log10(expression_grid)
        
    return expression_logged

test_log_10_conversion(rnaseq, 'base-10')

array([[2.17026172, 2.44870632, 2.20682588, ..., 2.10720997, 2.06445799,
        2.05690485],
       [3.61415871, 3.44901532, 4.23792057, ..., 3.98524679, 3.24846372,
        3.53083978],
       [3.07481644, 3.15503223, 3.28780173, ..., 3.33565845, 3.16820275,
        3.13513265],
       ...,
       [3.03100428, 2.83884909, 3.12580646, ..., 3.07445072, 2.73158877,
        2.77887447],
       [3.67237498, 3.46044678, 3.31175386, ..., 3.71933129, 3.25309559,
        3.16106839],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])
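
The two log conversions above only print their results. One way to turn them into pass/fail checks is to compare them against numpy's log1p and to invert the transform, as in this sketch on hypothetical toy counts.

```python
import numpy as np

# Hypothetical raw counts, including a zero to show why 1 is added first.
counts = np.array([[0.0, 147.0], [4112.0, 2811.0]])

natural = np.log(counts + 1)
base_10 = np.log10(counts + 1)

# log1p(x) computes ln(1 + x) directly, and dividing by ln(10)
# changes the base from e to 10.
natural_ok = np.allclose(natural, np.log1p(counts))
base_10_ok = np.allclose(base_10, np.log1p(counts) / np.log(10))

# Inverting the transform should recover the original counts.
inverse_ok = np.allclose(np.expm1(natural), counts)
print(natural_ok, base_10_ok, inverse_ok)
```

A count of 147 maps to ln(148), about 4.997, which agrees with the first value in the natural log output above.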

In [91]:
def test_no_conversion(table, log_type):
    htseq_count_values = table.drop('hgnc_symbol', axis=1)

    expression_grid = htseq_count_values.to_numpy(copy=True, dtype=float)
    rnaseq_columns = list(table.columns.values)
    
    if log_type == 'natural':
        expression_grid = expression_grid + 1
        expression_logged = np.log(expression_grid)
    elif log_type == 'base-10':
        expression_grid = expression_grid + 1
        expression_logged = np.log10(expression_grid)
    else:
        expression_logged = expression_grid
    
    return expression_grid == expression_logged

test_no_conversion(rnaseq, '')

array([[ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       ...,
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True]])

In [103]:
def test_compute_zscore_no_conversion(table, log_type, compute_zscore, precalculated_zscore):
    htseq_count_values = table.drop('hgnc_symbol', axis=1)

    expression_grid = htseq_count_values.to_numpy(copy=True, dtype=float)
    rnaseq_columns = list(table.columns.values)
    
    if log_type == 'natural':
        expression_grid = expression_grid + 1
        expression_logged = np.log(expression_grid)
    elif log_type == 'base-10':
        expression_grid = expression_grid + 1
        expression_logged = np.log10(expression_grid)
    else:
        expression_logged = expression_grid
    
    if compute_zscore:
        expression_logged = z_score(expression_logged)
        
    # compare against the argument rather than the global rnaseq_zscore
    return expression_logged == precalculated_zscore
        
test_compute_zscore_no_conversion(rnaseq, '', True, rnaseq_zscore)

array([[ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       ...,
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True]])

In [113]:
def test_log_heatmap_sorting(table, sort):
    htseq_count_values = table.drop('hgnc_symbol', axis=1)

    expression_grid = htseq_count_values.to_numpy(copy=True, dtype=float)
    rnaseq_columns = list(table.columns.values)
    
    expression_logged = expression_grid
    if sort:
        expression_logged_pandas = convert_numpy_to_pandas(expression_logged, rnaseq_columns[1:])
        expression_logged_pandas_sorted = sort_table(expression_logged_pandas)
        y_axis_list = expression_logged_pandas_sorted['hgnc_symbol'].tolist()
        expression_logged = convert_pandas_to_numpy(expression_logged_pandas_sorted)
        
    return expression_logged_pandas_sorted

test_log_heatmap_sorting(rnaseq, True)

Unnamed: 0,hgnc_symbol,X6423474d.60d7.4401.8e5b.46a3fbde5299.htseq.counts.gz,X0be94b2f.fccb.4482.b0ea.695c101aa65a.htseq.counts.gz,b6aa34d6.2b02.4317.8361.79536c7cb4e6.htseq.counts.gz,X09a677f2.d81d.4c3f.adf9.f8594e064e44.htseq.counts.gz,c19f102d.47a0.48c6.9443.63730d9ea6d1.htseq.counts.gz,X03094067.02d4.40c5.b6fa.bb5180dc7eab.htseq.counts.gz,X98b1beb5.8d4c.45d1.a618.2d43aafa056c.htseq.counts.gz,X0aac5e42.7554.4949.8b90.c16528c71ef8.htseq.counts.gz,X855d4a17.5c83.429d.919b.8c2a8e9bab0b.htseq.counts.gz,...,e38e0ced.093c.44e9.9f3b.7cdd0e6b912e.htseq.counts.gz,X44c3d518.14fa.4d63.b265.d7fc81c398e2.htseq.counts.gz,e7cc80ef.4b87.47d9.bebe.1fb05b5b04a2.htseq.counts.gz,X7bf647f0.c20e.42e6.b7d5.6510a8d066fc.htseq.counts.gz,X4929062b.3127.4038.8313.c20cbd274be4.htseq.counts.gz,b9ab7393.4abb.41ec.9d55.a3dc846c4a93.htseq.counts.gz,X16c63027.f745.41c4.a5e8.f6d9f1fbf1c8.htseq.counts.gz,X0f426284.c121.4860.bb80.8df032b0dea8.htseq.counts.gz,X1f2aa905.5022.4efe.afac.022d1acfdbe5.htseq.counts.gz,X8a799dfa.c1b5.4b13.9c91.6cbfe2abbc9f.htseq.counts.gz
0,DRD4,8,11,12,27,36,39,86,21,19,...,55,99,42,28,34,32,30,25,15,109
1,DRD1,15,3,54,152,59,116,88,55,300,...,276,384,545,462,582,145,432,284,784,211
2,DRD5,122,110,146,127,135,206,207,160,166,...,522,317,891,529,781,387,600,536,1102,643
3,DRD3,709,508,546,691,679,833,926,668,864,...,1851,1822,3031,2584,2477,1518,2597,3341,2503,2720
4,DRD2,814,677,753,2988,1607,3172,6131,1621,7557,...,9562,11122,3782,14711,25699,16597,19489,11817,28453,16498
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102,CHRNA7,2002,1963,1660,841,1642,1123,2245,2209,969,...,2752,3098,2646,3508,3872,4069,2272,4471,4763,4267
103,CHRNA2,599,996,2188,3047,2037,2225,3352,1435,6160,...,10085,12213,6459,8546,7875,7857,5260,4584,5089,7703
104,CHRNG,3526,4172,2261,2662,2449,2378,1970,3364,2785,...,6226,8213,5839,6367,7287,5934,9594,9809,8659,8961
105,CHRM1,749,1874,2547,2839,2545,3273,3235,2621,3696,...,9021,7885,10003,7063,6867,8481,8187,8961,8755,12527


## Test sort table

Let's test the calculations performed in sort_table:

16. Test create_sum_column checks that the sum is calculated across each row

In [114]:
def test_create_sum_column(table):
    rnaseq_orig = table.copy()
    
    excluded = rnaseq_orig.loc[:, 'hgnc_symbol']
    rnaseq_orig.drop('hgnc_symbol', axis=1, inplace=True)
    rnaseq_orig.loc[:, 'Total by row'] = rnaseq_orig.sum(axis=1)
    rnaseq_with_total = pd.concat([excluded.rename('hgnc_symbol'), rnaseq_orig], axis=1)
    
    return rnaseq_with_total

test_create_sum_column(rnaseq)

Unnamed: 0,hgnc_symbol,X00faf8ba.ff90.4214.9d03.6c5e14645d8f.htseq.counts.gz,X0143419f.2abe.4906.bb55.af6010fab05f.htseq.counts.gz,X01f84c45.2058.4e22.b234.52f0a82a97fc.htseq.counts.gz,X03094067.02d4.40c5.b6fa.bb5180dc7eab.htseq.counts.gz,X0349f526.7816.4a7d.9967.1f75dd9ff00a.htseq.counts.gz,X03630a0c.aa97.4e28.bac9.0206fff669cd.htseq.counts.gz,X03761959.a620.440f.bbaa.33bd75afae1c.htseq.counts.gz,X057aa9ac.f22c.4c11.a44d.ad52ae59b4cf.htseq.counts.gz,X05f0ced5.6976.4f43.9be5.fddb3f550adf.htseq.counts.gz,...,f144de50.6126.4912.9c94.824d1eb0fac5.htseq.counts.gz,f2389819.b8fc.460e.821c.01dba313cce1.htseq.counts.gz,f6bd7191.a820.4d86.927a.b4b5f88ebd67.htseq.counts.gz,f748bf78.4dc1.47ad.8611.8186479d3e4b.htseq.counts.gz,f8551a29.d4bd.4954.bf9c.8e10265063de.htseq.counts.gz,f9f63982.b0ee.4cb8.8de5.f885d82137f0.htseq.counts.gz,fcd43085.7338.43fe.bc25.9d87b04e227f.htseq.counts.gz,feb22766.4282.47c8.bfe2.7d020b4a15d4.htseq.counts.gz,fef65b57.c58d.4050.8de4.f09f5cd616ce.htseq.counts.gz,Total by row
6402,DRD1,147,280,160,116,80,78,514,141,264,...,173,99,75,239,78,72,127,115,113,29062
6403,DRD2,4112,2811,17294,3172,13653,5470,3053,15737,5267,...,2557,8638,8886,9238,14085,6181,9665,1771,3394,1218887
6404,DRD3,1187,1428,1939,833,1555,1072,1037,2285,1689,...,1105,1357,1286,1366,1016,795,2165,1472,1364,224709
6405,DRD4,2,79,83,39,189,79,6,74,82,...,28,126,96,102,84,38,85,73,153,11302
6406,DRD5,357,237,300,206,258,196,81,335,336,...,246,158,220,485,249,333,498,390,246,51510
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4266,CHRM1,5685,7649,8086,3273,5948,3631,4643,7445,6605,...,3255,5759,5961,6380,6388,4604,6820,6240,4534,864200
4267,CHRM2,2511,7302,24450,2328,6852,2250,4499,6888,6787,...,2257,4812,4628,6258,6866,3075,6401,2549,4360,971896
4268,CHRM3,1073,689,1335,578,689,506,672,859,900,...,502,494,759,733,770,552,1186,538,600,123923
4271,CHRM4,4702,2886,2049,1334,1432,2684,3089,2190,2864,...,1474,729,1486,2581,1783,2244,5239,1790,1448,371541
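
The row totals can also be verified against an independent calculation instead of being inspected by eye. This is a minimal sketch on a hypothetical two-gene table; the expected values are worked out by hand.

```python
import pandas as pd

# Hypothetical toy table: a label column plus two sample columns.
table = pd.DataFrame({"hgnc_symbol": ["DRD1", "DRD4"],
                      "s1": [147, 2],
                      "s2": [280, 79]})

# axis=1 sums across the columns of each row, giving one total per gene.
with_total = table.copy()
with_total["Total by row"] = table.drop(columns="hgnc_symbol").sum(axis=1)

# Recompute each total independently with plain arithmetic.
expected = [147 + 280, 2 + 79]
totals_ok = with_total["Total by row"].tolist() == expected
print(totals_ok)
```

Using axis=0 by mistake would sum down each column instead, so a hand-computed expectation catches that error.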


17. Test sorting rows 1 checks that the rows are unchanged apart from their order (each hgnc_symbol still goes with its own values)

In [128]:
def test_sorting_rows_1(table):
    
    #### --- Test Setup ---
    rnaseq_orig = table.copy()
    
    excluded = rnaseq_orig.loc[:, 'hgnc_symbol']
    rnaseq_orig.drop('hgnc_symbol', axis=1, inplace=True)
    rnaseq_orig.loc[:, 'Total by row'] = rnaseq_orig.sum(axis=1)
    rnaseq_with_total = pd.concat([excluded.rename('hgnc_symbol'), rnaseq_orig], axis=1)
    rnaseq_with_total = rnaseq_with_total.reset_index(drop=True)
    
    table_columns = list(table.columns.values)
    
    rnaseq_sorted = pd.DataFrame(columns=table_columns)
    
    # sorts the rows section by section, based on the size of each family of neurotransmitters
    index_begin = 0
    index_end = 0
    appended_data = []
    for family, gene_list in neuro_genes_dict.items():
        index_end = len(gene_list) + index_begin
        to_sort = rnaseq_with_total[index_begin : index_end].sort_values('Total by row', ascending=True)
        appended_data.append(to_sort)
        index_begin = index_end
    # the families were sorted as separate dataframes and then concat together
    rnaseq_sorted = pd.concat(appended_data)
    
    t_merged = pd.merge(rnaseq_with_total, rnaseq_sorted, on=list(rnaseq_with_total.columns.values), how='inner')
    
    return t_merged == rnaseq_with_total

test_sorting_rows_1(rnaseq)

Unnamed: 0,hgnc_symbol,X00faf8ba.ff90.4214.9d03.6c5e14645d8f.htseq.counts.gz,X0143419f.2abe.4906.bb55.af6010fab05f.htseq.counts.gz,X01f84c45.2058.4e22.b234.52f0a82a97fc.htseq.counts.gz,X03094067.02d4.40c5.b6fa.bb5180dc7eab.htseq.counts.gz,X0349f526.7816.4a7d.9967.1f75dd9ff00a.htseq.counts.gz,X03630a0c.aa97.4e28.bac9.0206fff669cd.htseq.counts.gz,X03761959.a620.440f.bbaa.33bd75afae1c.htseq.counts.gz,X057aa9ac.f22c.4c11.a44d.ad52ae59b4cf.htseq.counts.gz,X05f0ced5.6976.4f43.9be5.fddb3f550adf.htseq.counts.gz,...,f144de50.6126.4912.9c94.824d1eb0fac5.htseq.counts.gz,f2389819.b8fc.460e.821c.01dba313cce1.htseq.counts.gz,f6bd7191.a820.4d86.927a.b4b5f88ebd67.htseq.counts.gz,f748bf78.4dc1.47ad.8611.8186479d3e4b.htseq.counts.gz,f8551a29.d4bd.4954.bf9c.8e10265063de.htseq.counts.gz,f9f63982.b0ee.4cb8.8de5.f885d82137f0.htseq.counts.gz,fcd43085.7338.43fe.bc25.9d87b04e227f.htseq.counts.gz,feb22766.4282.47c8.bfe2.7d020b4a15d4.htseq.counts.gz,fef65b57.c58d.4050.8de4.f09f5cd616ce.htseq.counts.gz,Total by row
0,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
1,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
2,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
3,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
4,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
103,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
104,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
105,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True


18. Test sorting rows 2 checks that rows are sorted by increasing sum, and only within each family

In [138]:
def test_sorting_rows_2_helper(to_sort):
    # show every row and column instead of the truncated pandas view
    with pd.option_context('display.max_rows', None, 'display.max_columns', None):
        display(to_sort)

def test_sorting_rows_2(table):
    rnaseq_orig = table.copy()
    
    # sort table wasn't working right with decimals, so hgnc symbol column was removed
    excluded = rnaseq_orig.loc[:, 'hgnc_symbol']
    rnaseq_orig.drop('hgnc_symbol', axis=1, inplace=True)
    rnaseq_orig.loc[:, 'Total by row'] = rnaseq_orig.sum(axis=1)
    rnaseq_with_total = pd.concat([excluded.rename('hgnc_symbol'), rnaseq_orig], axis=1)
    
    table_columns = list(table.columns.values)
    
    # SORTING THE ROWS -------
    rnaseq_sorted = pd.DataFrame(columns=table_columns)
    
    # sorts the rows section by section, based on the size of each family of neurotransmitters
    index_begin = 0
    index_end = 0
    appended_data = []
    for family, gene_list in neuro_genes_dict.items():
        index_end = len(gene_list) + index_begin
        to_sort = rnaseq_with_total[index_begin : index_end].sort_values('Total by row', ascending=True)
        test_sorting_rows_2_helper(to_sort[['hgnc_symbol', 'Total by row']])
        index_begin = index_end
    # the families were sorted as separate dataframes and then concat together
    
    
test_sorting_rows_2(rnaseq)

Unnamed: 0,hgnc_symbol,Total by row
6405,DRD4,11302
6402,DRD1,29062
6406,DRD5,51510
6404,DRD3,224709
6403,DRD2,1218887


Unnamed: 0,hgnc_symbol,Total by row
9533,GRID1,11
9545,GRIN2A,20
9576,GRM8,22
9546,GRIN2B,126
9538,GRIK1,220
9540,GRIK2,788
9568,GRM4,10300
9571,GRM6,11911
9565,GRM2,21260
9550,GRIN3B,29480


Unnamed: 0,hgnc_symbol,Total by row
8669,GABRQ,3
8646,GABBR1,98
8668,GABRP,229
8653,GABRA1,825
8658,GABRA6,2375
8655,GABRA3,7569
8663,GABRE,15476
8672,GABRR3,20692
8670,GABRR1,22522
8657,GABRA5,34609


Unnamed: 0,hgnc_symbol,Total by row
521,ADRA2B,412
519,ADRA1D,1224
522,ADRA2C,2174
523,ADRB1,6577
525,ADRB3,8695
520,ADRA2A,23252
524,ADRB2,185517
518,ADRA1B,413763
517,ADRA1A,728596


Unnamed: 0,hgnc_symbol,Total by row
32876,TACR2,12646
32877,TACR3,16983
32875,TACR1,164411


Unnamed: 0,hgnc_symbol,Total by row
10946,HTR2C,8
10952,HTR3E,1096
10951,HTR3D,1696
10937,HTR1A,23168
10955,HTR5A,25389
10949,HTR3C,30549
10943,HTR2A,34116
10947,HTR3A,55848
10954,HTR4,60819
10945,HTR2B,76956


Unnamed: 0,hgnc_symbol,Total by row
10749,HRH1,10
10750,HRH2,293527
10752,HRH4,413852
10751,HRH3,562289


Unnamed: 0,hgnc_symbol,Total by row
4272,CHRM5,4
4277,CHRNA4,919
4276,CHRNA3,4705
4282,CHRNB1,14564
4274,CHRNA10,24480
4283,CHRNB2,29703
4285,CHRNB4,36072
4278,CHRNA5,42784
4273,CHRNA1,105704
4281,CHRNA9,106242


19. Test create sum row checks that a row called Total by col has been created and that it holds the sum of each column.

20. Test sorting columns 1 checks that the columns have been sorted by increasing sum. The Total by row column is still present because it is only deleted at the very end of the function.

21. Test sorting columns 2 checks that the columns have been sorted while the values under each column remain with that column. The check_like argument in the pandas testing library ignores the order of the index and columns, but each index label must still correspond to the same data.
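
The behavior of check_like can be illustrated on a small frame. This is a sketch with hypothetical toy data: it shows that a pure column reorder passes, while values that end up under the wrong label are caught.

```python
import pandas as pd

original = pd.DataFrame({"s1": [1, 2], "s2": [30, 40]},
                        index=["DRD1", "DRD2"])

# Reorder the columns, as a column sort would.
reordered = original[["s2", "s1"]]

# check_like=True ignores the order of index and columns but still
# requires each label to map to the same data, so this passes silently.
pd.testing.assert_frame_equal(original, reordered, check_like=True)

# Swapping the labels so that values sit under the wrong column is caught.
corrupted = original.rename(columns={"s1": "s2", "s2": "s1"})
try:
    pd.testing.assert_frame_equal(original, corrupted, check_like=True)
    corruption_detected = False
except AssertionError:
    corruption_detected = True
print(corruption_detected)
```

Without check_like, the reordered frame would also fail the comparison, because the column order differs.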

In [162]:
def test_create_sum_row(table):
    rnaseq_orig = table.copy()
    
    # sort table wasn't working right with decimals, so hgnc symbol column was removed
    excluded = rnaseq_orig.loc[:, 'hgnc_symbol']
    rnaseq_orig.drop('hgnc_symbol', axis=1, inplace=True)
    rnaseq_orig.loc[:, 'Total by row'] = rnaseq_orig.sum(axis=1)
    rnaseq_with_total = pd.concat([excluded.rename('hgnc_symbol'), rnaseq_orig], axis=1)
    
    table_columns = list(table.columns.values)
    
    # SORTING THE ROWS -------
    rnaseq_sorted = pd.DataFrame(columns=table_columns)
    
    # sorts the rows section by section, based on the size of each family of neurotransmitters
    index_begin = 0
    index_end = 0
    appended_data = []
    for family, gene_list in neuro_genes_dict.items():
        index_end = len(gene_list) + index_begin
        to_sort = rnaseq_with_total[index_begin : index_end].sort_values('Total by row', ascending=True)
        appended_data.append(to_sort)
        index_begin = index_end
    # the families were sorted as separate dataframes and then concat together
    rnaseq_sorted = pd.concat(appended_data)
    
    # adding the column sum back in so now we can sort by column
    rnaseq_sorted_2 = rnaseq_sorted.to_numpy(copy=True)
    rnaseq_sorted = pd.DataFrame(rnaseq_sorted_2)
    
    table_columns.append('Total by row')
    rnaseq_sorted.columns = table_columns
    rnaseq_sorted.loc['Total by col', :] = rnaseq_with_total.sum(axis=0)
    table_columns.remove('Total by row')
    
    return rnaseq_sorted.iloc[[-1]].transpose()

test_create_sum_row(rnaseq)

Unnamed: 0,Total by col
hgnc_symbol,DRD1DRD2DRD3DRD4DRD5GRM1GRM2GRM3GRM4GRM5GRM6GR...
X00faf8ba.ff90.4214.9d03.6c5e14645d8f.htseq.counts.gz,192982
X0143419f.2abe.4906.bb55.af6010fab05f.htseq.counts.gz,160991
X01f84c45.2058.4e22.b234.52f0a82a97fc.htseq.counts.gz,223678
X03094067.02d4.40c5.b6fa.bb5180dc7eab.htseq.counts.gz,86527
...,...
f9f63982.b0ee.4cb8.8de5.f885d82137f0.htseq.counts.gz,121813
fcd43085.7338.43fe.bc25.9d87b04e227f.htseq.counts.gz,247489
feb22766.4282.47c8.bfe2.7d020b4a15d4.htseq.counts.gz,151847
fef65b57.c58d.4050.8de4.f09f5cd616ce.htseq.counts.gz,123769
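
The hgnc_symbol entry in the output above is a run of gene names because summing a column of strings concatenates them. A small sketch, on made-up counts, of one way to keep that column out of the totals is to pass numeric_only=True to DataFrame.sum. The frame and sample names here are illustrative assumptions, not the notebook's data.

```python
import pandas as pd

# Toy table (hypothetical): one string column plus two numeric sample columns.
toy = pd.DataFrame({
    "hgnc_symbol": ["DRD1", "DRD2"],
    "sample_a": [2, 147],
    "sample_b": [79, 280],
})

# numeric_only=True skips the string column, so only the samples are totalled.
col_totals = toy.sum(axis=0, numeric_only=True)
```

col_totals then holds one total per sample column and has no hgnc_symbol entry.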


In [161]:
def test_sort_column_1(table):
    rnaseq_orig = table.copy()
    
    # sort table wasn't working right with decimals, so hgnc symbol column was removed
    excluded = rnaseq_orig.loc[:, 'hgnc_symbol']
    rnaseq_orig.drop('hgnc_symbol', axis=1, inplace=True)
    rnaseq_orig.loc[:, 'Total by row'] = rnaseq_orig.sum(axis=1)
    rnaseq_with_total = pd.concat([excluded.rename('hgnc_symbol'), rnaseq_orig], axis=1)
    
    table_columns = list(table.columns.values)
    
    # SORTING THE ROWS -------
    rnaseq_sorted = pd.DataFrame(columns=table_columns)
    
    # sorts the rows section by section, based on the size of each family of neurotransmitters
    index_begin = 0
    index_end = 0
    appended_data = []
    for family, gene_list in neuro_genes_dict.items():
        index_end = len(gene_list) + index_begin
        to_sort = rnaseq_with_total[index_begin : index_end].sort_values('Total by row', ascending=True)
        appended_data.append(to_sort)
        index_begin = index_end
    # the families were sorted as separate dataframes and then concat together
    rnaseq_sorted = pd.concat(appended_data)
    
    # adding the column sum back in so now we can sort by column
    rnaseq_sorted_2 = rnaseq_sorted.to_numpy(copy=True)
    rnaseq_sorted = pd.DataFrame(rnaseq_sorted_2)
    
    table_columns.append('Total by row')
    rnaseq_sorted.columns = table_columns
    rnaseq_sorted.loc['Total by col', :] = rnaseq_with_total.sum(axis=0)
    table_columns.remove('Total by row')
    
    # SORTING THE COLUMNS ----------
    
    # remove hgnc_symbol column, sort the values, and then remove the total col and total row   
    excluded_after_row_sorting = rnaseq_sorted.loc[:, 'hgnc_symbol']
    del rnaseq_sorted['hgnc_symbol']
    sorted_cases = rnaseq_sorted.sort_values('Total by col', axis=1, ascending=True)

    return sorted_cases.iloc[[-1]].transpose()

test_sort_column_1(rnaseq)

Unnamed: 0,Total by col
X6423474d.60d7.4401.8e5b.46a3fbde5299.htseq.counts.gz,51064
X0be94b2f.fccb.4482.b0ea.695c101aa65a.htseq.counts.gz,69172
b6aa34d6.2b02.4317.8361.79536c7cb4e6.htseq.counts.gz,73238
X09a677f2.d81d.4c3f.adf9.f8594e064e44.htseq.counts.gz,73458
c19f102d.47a0.48c6.9443.63730d9ea6d1.htseq.counts.gz,78557
...,...
X16c63027.f745.41c4.a5e8.f6d9f1fbf1c8.htseq.counts.gz,324419
X0f426284.c121.4860.bb80.8df032b0dea8.htseq.counts.gz,335669
X1f2aa905.5022.4efe.afac.022d1acfdbe5.htseq.counts.gz,354405
X8a799dfa.c1b5.4b13.9c91.6cbfe2abbc9f.htseq.counts.gz,362500


In [243]:
# pandas.util.testing is deprecated; pandas.testing is the public module
from pandas.testing import assert_frame_equal
from pandas.testing import assert_series_equal

In [182]:
def test_sorting_column_2(table):
    rnaseq_orig = table.copy()
    
    # sort table wasn't working right with decimals, so hgnc symbol column was removed
    excluded = rnaseq_orig.loc[:, 'hgnc_symbol']
    rnaseq_orig.drop('hgnc_symbol', axis=1, inplace=True)
    rnaseq_orig.loc[:, 'Total by row'] = rnaseq_orig.sum(axis=1)
    rnaseq_with_total = pd.concat([excluded.rename('hgnc_symbol'), rnaseq_orig], axis=1)
    
    table_columns = list(table.columns.values)
    
    # SORTING THE ROWS -------
    rnaseq_sorted = pd.DataFrame(columns=table_columns)
    
    # sorts the rows section by section, based on the size of each family of neurotransmitters
    index_begin = 0
    index_end = 0
    appended_data = []
    for family, gene_list in neuro_genes_dict.items():
        index_end = len(gene_list) + index_begin
        to_sort = rnaseq_with_total[index_begin : index_end].sort_values('Total by row', ascending=True)
        appended_data.append(to_sort)
        index_begin = index_end
    # the families were sorted as separate dataframes and then concat together
    rnaseq_sorted = pd.concat(appended_data)
    
    # adding the column sum back in so now we can sort by column
    rnaseq_sorted_2 = rnaseq_sorted.to_numpy(copy=True)
    rnaseq_sorted = pd.DataFrame(rnaseq_sorted_2)
    
    table_columns.append('Total by row')
    rnaseq_sorted.columns = table_columns
    rnaseq_sorted.loc['Total by col', :] = rnaseq_with_total.sum(axis=0)
    table_columns.remove('Total by row')
    
    # SORTING THE COLUMNS ----------
    
    # remove hgnc_symbol column, sort the values, and then remove the total col and total row   
    excluded_after_row_sorting = rnaseq_sorted.loc[:, 'hgnc_symbol']
    del rnaseq_sorted['hgnc_symbol']
    
    sorted_cases = rnaseq_sorted.sort_values('Total by col', axis=1, ascending=True)

    display (rnaseq_sorted)
    display (sorted_cases)
    try:
        assert_frame_equal(rnaseq_sorted, sorted_cases, check_like=True)
        return True
    except AssertionError:
        # assert_frame_equal raises AssertionError when the frames differ
        return False

test_sorting_column_2(rnaseq)

Unnamed: 0,X00faf8ba.ff90.4214.9d03.6c5e14645d8f.htseq.counts.gz,X0143419f.2abe.4906.bb55.af6010fab05f.htseq.counts.gz,X01f84c45.2058.4e22.b234.52f0a82a97fc.htseq.counts.gz,X03094067.02d4.40c5.b6fa.bb5180dc7eab.htseq.counts.gz,X0349f526.7816.4a7d.9967.1f75dd9ff00a.htseq.counts.gz,X03630a0c.aa97.4e28.bac9.0206fff669cd.htseq.counts.gz,X03761959.a620.440f.bbaa.33bd75afae1c.htseq.counts.gz,X057aa9ac.f22c.4c11.a44d.ad52ae59b4cf.htseq.counts.gz,X05f0ced5.6976.4f43.9be5.fddb3f550adf.htseq.counts.gz,X0726996d.62f2.4880.808c.cfe3361b4b42.htseq.counts.gz,...,f144de50.6126.4912.9c94.824d1eb0fac5.htseq.counts.gz,f2389819.b8fc.460e.821c.01dba313cce1.htseq.counts.gz,f6bd7191.a820.4d86.927a.b4b5f88ebd67.htseq.counts.gz,f748bf78.4dc1.47ad.8611.8186479d3e4b.htseq.counts.gz,f8551a29.d4bd.4954.bf9c.8e10265063de.htseq.counts.gz,f9f63982.b0ee.4cb8.8de5.f885d82137f0.htseq.counts.gz,fcd43085.7338.43fe.bc25.9d87b04e227f.htseq.counts.gz,feb22766.4282.47c8.bfe2.7d020b4a15d4.htseq.counts.gz,fef65b57.c58d.4050.8de4.f09f5cd616ce.htseq.counts.gz,Total by row
0,2,79,83,39,189,79,6,74,82,116,...,28,126,96,102,84,38,85,73,153,11302
1,147,280,160,116,80,78,514,141,264,163,...,173,99,75,239,78,72,127,115,113,29062
2,357,237,300,206,258,196,81,335,336,390,...,246,158,220,485,249,333,498,390,246,51510
3,1187,1428,1939,833,1555,1072,1037,2285,1689,1074,...,1105,1357,1286,1366,1016,795,2165,1472,1364,224709
4,4112,2811,17294,3172,13653,5470,3053,15737,5267,14914,...,2557,8638,8886,9238,14085,6181,9665,1771,3394,1218887
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
103,6097,7193,7599,2225,3402,5555,3444,6980,10279,5461,...,1801,6278,5181,4148,2366,2546,8456,3154,2365,754354
104,2656,3900,6633,2378,7341,3742,2118,5634,3639,6533,...,4109,3620,4963,5688,4729,3568,4680,5461,4635,820884
105,5685,7649,8086,3273,5948,3631,4643,7445,6605,6330,...,3255,5759,5961,6380,6388,4604,6820,6240,4534,864200
106,2511,7302,24450,2328,6852,2250,4499,6888,6787,6415,...,2257,4812,4628,6258,6866,3075,6401,2549,4360,971896


Unnamed: 0,X6423474d.60d7.4401.8e5b.46a3fbde5299.htseq.counts.gz,X0be94b2f.fccb.4482.b0ea.695c101aa65a.htseq.counts.gz,b6aa34d6.2b02.4317.8361.79536c7cb4e6.htseq.counts.gz,X09a677f2.d81d.4c3f.adf9.f8594e064e44.htseq.counts.gz,c19f102d.47a0.48c6.9443.63730d9ea6d1.htseq.counts.gz,X03094067.02d4.40c5.b6fa.bb5180dc7eab.htseq.counts.gz,X98b1beb5.8d4c.45d1.a618.2d43aafa056c.htseq.counts.gz,X0aac5e42.7554.4949.8b90.c16528c71ef8.htseq.counts.gz,X855d4a17.5c83.429d.919b.8c2a8e9bab0b.htseq.counts.gz,c642e018.f0cb.4be8.9b19.c944f1daf9cf.htseq.counts.gz,...,X44c3d518.14fa.4d63.b265.d7fc81c398e2.htseq.counts.gz,e7cc80ef.4b87.47d9.bebe.1fb05b5b04a2.htseq.counts.gz,X7bf647f0.c20e.42e6.b7d5.6510a8d066fc.htseq.counts.gz,X4929062b.3127.4038.8313.c20cbd274be4.htseq.counts.gz,b9ab7393.4abb.41ec.9d55.a3dc846c4a93.htseq.counts.gz,X16c63027.f745.41c4.a5e8.f6d9f1fbf1c8.htseq.counts.gz,X0f426284.c121.4860.bb80.8df032b0dea8.htseq.counts.gz,X1f2aa905.5022.4efe.afac.022d1acfdbe5.htseq.counts.gz,X8a799dfa.c1b5.4b13.9c91.6cbfe2abbc9f.htseq.counts.gz,Total by row
0,8,11,12,27,36,39,86,21,19,51,...,99,42,28,34,32,30,25,15,109,11302
1,15,3,54,152,59,116,88,55,300,95,...,384,545,462,582,145,432,284,784,211,29062
2,122,110,146,127,135,206,207,160,166,218,...,317,891,529,781,387,600,536,1102,643,51510
3,709,508,546,691,679,833,926,668,864,1056,...,1822,3031,2584,2477,1518,2597,3341,2503,2720,224709
4,814,677,753,2988,1607,3172,6131,1621,7557,3603,...,11122,3782,14711,25699,16597,19489,11817,28453,16498,1218887
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
103,599,996,2188,3047,2037,2225,3352,1435,6160,2788,...,12213,6459,8546,7875,7857,5260,4584,5089,7703,754354
104,3526,4172,2261,2662,2449,2378,1970,3364,2785,3358,...,8213,5839,6367,7287,5934,9594,9809,8659,8961,820884
105,749,1874,2547,2839,2545,3273,3235,2621,3696,2523,...,7885,10003,7063,6867,8481,8187,8961,8755,12527,864200
106,1824,1415,971,3621,3599,2328,5714,3073,3609,2917,...,9441,4271,10923,6186,8086,13310,7458,9078,12473,971896


True

22. Test scale data (TODO: lower priority for now)

23. Test add expression by family (TODO: lower priority for now)

### Test TPM:

24. Test calculate gene length. If this test passes, the dataframe shows the same gene length in every column of a given row.

25. Test the creation of the rpk table.

26. Test the calculation of rpk without the rpk function.

27. Test the calculation of rpk with the rpk function. The output is the difference between the computed and expected values, because floating-point results rarely match exactly.

28. Test calculate total reads. This tests that the sum of each column is being taken.

29. Test per million table. This tests that the per million scaling factor is being applied and that the final step of tpm is being calculated properly.
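
The steps in tests 24 to 29 can be sketched on a toy table. This is only an illustration under assumptions: gene lengths are taken to be in kilobases (consistent with values such as 8.314 in the outputs), RPK is taken as count divided by length, and the per million scaling factor as the column total of RPK divided by 1,000,000 (consistent with the outputs shown further down). The data and variable names are hypothetical; the notebook's own calculate_rpk and calculate_per_million are not reproduced here.

```python
import pandas as pd

# Toy raw counts (hypothetical): rows are genes, columns are samples.
counts = pd.DataFrame(
    {"s1": [10, 20, 30], "s2": [0, 5, 15]},
    index=["geneA", "geneB", "geneC"],
)

# Assumed gene lengths in kilobases, aligned with the gene index.
lengths_kb = pd.Series([2.0, 4.0, 1.0], index=counts.index)

# Step 1: reads per kilobase (RPK) = count / gene length, row by row.
rpk = counts.div(lengths_kb, axis=0)

# Step 2: per million scaling factor = column total of RPK / 1,000,000.
scaling_factor = rpk.sum(axis=0) / 1_000_000

# Step 3: TPM = RPK / scaling factor, column by column.
tpm = rpk.div(scaling_factor, axis=1)
```

A useful property to assert in a test is that every sample column of the resulting table sums to one million, up to floating-point tolerance.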

In [187]:
def test_calculate_gene_length(table):
    missing_genes_dict = {'C12orf74':'PLEKHG7',
                     'LINC00856':'LINC00595'}

    precalculated_gene_length_dict = {'CCL3L1': 3.090}

    table_numpy = table.to_numpy(copy=True)
    
    
    ## -------- FOR CALCULATING RPK ----------
    rpk_table = table_numpy
    gene_length = 0
    
    for index, value in np.ndenumerate(table_numpy):

        if index[1] == 0:
            gene = value
            if gene in precalculated_gene_length_dict:
                gene_length = precalculated_gene_length_dict.get(value)
            else:
                if gene in missing_genes_dict:
                    gene = missing_genes_dict.get(value)
                gene_length = find_gene_length_ensembl(gene)
            

        else:
            rpk_table[index[0], index[1]] = gene_length
            
    return pd.DataFrame(rpk_table)

test_calculate_gene_length(all_rnaseq_selected_samples)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,138,139,140,141,142,143,144,145,146,147
0,A1BG,8.314,8.314,8.314,8.314,8.314,8.314,8.314,8.314,8.314,...,8.314,8.314,8.314,8.314,8.314,8.314,8.314,8.314,8.314,8.314
1,A1BG-AS1,7.737,7.737,7.737,7.737,7.737,7.737,7.737,7.737,7.737,...,7.737,7.737,7.737,7.737,7.737,7.737,7.737,7.737,7.737,7.737
2,A1CF,86.266,86.266,86.266,86.266,86.266,86.266,86.266,86.266,86.266,...,86.266,86.266,86.266,86.266,86.266,86.266,86.266,86.266,86.266,86.266
3,A2M,48.565,48.565,48.565,48.565,48.565,48.565,48.565,48.565,48.565,...,48.565,48.565,48.565,48.565,48.565,48.565,48.565,48.565,48.565,48.565
4,A2M-AS1,3.526,3.526,3.526,3.526,3.526,3.526,3.526,3.526,3.526,...,3.526,3.526,3.526,3.526,3.526,3.526,3.526,3.526,3.526,3.526
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37333,ZYG11AP1,0.993,0.993,0.993,0.993,0.993,0.993,0.993,0.993,0.993,...,0.993,0.993,0.993,0.993,0.993,0.993,0.993,0.993,0.993,0.993
37334,ZYG11B,100.883,100.883,100.883,100.883,100.883,100.883,100.883,100.883,100.883,...,100.883,100.883,100.883,100.883,100.883,100.883,100.883,100.883,100.883,100.883
37335,ZYX,9.816,9.816,9.816,9.816,9.816,9.816,9.816,9.816,9.816,...,9.816,9.816,9.816,9.816,9.816,9.816,9.816,9.816,9.816,9.816
37336,ZYXP1,0.117,0.117,0.117,0.117,0.117,0.117,0.117,0.117,0.117,...,0.117,0.117,0.117,0.117,0.117,0.117,0.117,0.117,0.117,0.117
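
The table above is checked by eye. One possible way to automate the same check, sketched here on a tiny made-up frame, is to count the distinct values in each row after dropping the gene name column. The helper name rows_are_constant is hypothetical and not part of the notebook.

```python
import pandas as pd

def rows_are_constant(frame):
    """Return True if every row holds one repeated value after column 0."""
    values = frame.iloc[:, 1:]  # drop the gene name column
    return bool((values.nunique(axis=1) == 1).all())

# Toy gene length table (hypothetical), shaped like the output above.
lengths = pd.DataFrame([
    ["A1BG", 8.314, 8.314, 8.314],
    ["A2M", 48.565, 48.565, 48.565],
])
consistent = rows_are_constant(lengths)

# Corrupt one cell to confirm that the check can fail.
broken = lengths.copy()
broken.iloc[1, 2] = 1.0
inconsistent = rows_are_constant(broken)
```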


In [278]:
def test_calculate_rpk_table_string(table):
    missing_genes_dict = {'C12orf74':'PLEKHG7',
                     'LINC00856':'LINC00595'}

    precalculated_gene_length_dict = {'CCL3L1': 3.090}

    table_numpy = table.to_numpy(copy=True)
    
    
    ## -------- FOR CALCULATING RPK ----------
    rpk_table = table_numpy
    gene_length = 0
    
    for index, value in np.ndenumerate(table_numpy):

        if index[1] == 0:
            gene = value
            if gene in precalculated_gene_length_dict:
                gene_length = precalculated_gene_length_dict.get(value)
            else:
                if gene in missing_genes_dict:
                    gene = missing_genes_dict.get(value)
                gene_length = find_gene_length_ensembl(gene)
            

        else:
            rpk_table[index[0], index[1]] = str(value) + '/' + str(gene_length)
            
    # compare original table with the resultant rpk values
    display (table)
    return pd.DataFrame(rpk_table)

test_calculate_rpk_table_string(all_rnaseq_selected_samples)

Unnamed: 0,hgnc_symbol,X00faf8ba.ff90.4214.9d03.6c5e14645d8f.htseq.counts.gz,X0143419f.2abe.4906.bb55.af6010fab05f.htseq.counts.gz,X01f84c45.2058.4e22.b234.52f0a82a97fc.htseq.counts.gz,X03094067.02d4.40c5.b6fa.bb5180dc7eab.htseq.counts.gz,X0349f526.7816.4a7d.9967.1f75dd9ff00a.htseq.counts.gz,X03630a0c.aa97.4e28.bac9.0206fff669cd.htseq.counts.gz,X03761959.a620.440f.bbaa.33bd75afae1c.htseq.counts.gz,X057aa9ac.f22c.4c11.a44d.ad52ae59b4cf.htseq.counts.gz,X05f0ced5.6976.4f43.9be5.fddb3f550adf.htseq.counts.gz,...,eb3894d4.fcae.43ef.ad68.b756c6aa56ea.htseq.counts.gz,f144de50.6126.4912.9c94.824d1eb0fac5.htseq.counts.gz,f2389819.b8fc.460e.821c.01dba313cce1.htseq.counts.gz,f6bd7191.a820.4d86.927a.b4b5f88ebd67.htseq.counts.gz,f748bf78.4dc1.47ad.8611.8186479d3e4b.htseq.counts.gz,f8551a29.d4bd.4954.bf9c.8e10265063de.htseq.counts.gz,f9f63982.b0ee.4cb8.8de5.f885d82137f0.htseq.counts.gz,fcd43085.7338.43fe.bc25.9d87b04e227f.htseq.counts.gz,feb22766.4282.47c8.bfe2.7d020b4a15d4.htseq.counts.gz,fef65b57.c58d.4050.8de4.f09f5cd616ce.htseq.counts.gz
0,A1BG,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,A1BG-AS1,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,A1CF,4988,3492,3767,2697,4431,3635,5188,4714,3143,...,3047,2424,2757,4434,4202,2321,2500,6074,3980,2784
3,A2M,26,83,113,45,65,34,69,131,74,...,128,42,70,50,49,75,95,108,17,100
4,A2M-AS1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37333,ZYG11AP1,27,9,44,8,29,1,1,21,50,...,13,6,6,10,32,28,82,74,8,22
37334,ZYG11B,1799,2857,5383,1757,5474,2362,1575,4241,2492,...,2845,2067,2942,2992,3023,4155,1693,2870,3623,2814
37335,ZYX,1543,1130,1472,914,559,4893,13436,1859,2283,...,1532,2489,741,2192,1739,3457,1376,2559,1388,1098
37336,ZYXP1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,138,139,140,141,142,143,144,145,146,147
0,A1BG,0/8.314,0/8.314,0/8.314,0/8.314,0/8.314,0/8.314,0/8.314,0/8.314,0/8.314,...,0/8.314,0/8.314,0/8.314,0/8.314,0/8.314,0/8.314,0/8.314,0/8.314,0/8.314,0/8.314
1,A1BG-AS1,0/7.737,1/7.737,0/7.737,0/7.737,0/7.737,0/7.737,0/7.737,0/7.737,0/7.737,...,0/7.737,0/7.737,0/7.737,0/7.737,0/7.737,0/7.737,0/7.737,0/7.737,0/7.737,0/7.737
2,A1CF,4988/86.266,3492/86.266,3767/86.266,2697/86.266,4431/86.266,3635/86.266,5188/86.266,4714/86.266,3143/86.266,...,3047/86.266,2424/86.266,2757/86.266,4434/86.266,4202/86.266,2321/86.266,2500/86.266,6074/86.266,3980/86.266,2784/86.266
3,A2M,26/48.565,83/48.565,113/48.565,45/48.565,65/48.565,34/48.565,69/48.565,131/48.565,74/48.565,...,128/48.565,42/48.565,70/48.565,50/48.565,49/48.565,75/48.565,95/48.565,108/48.565,17/48.565,100/48.565
4,A2M-AS1,0/3.526,0/3.526,0/3.526,0/3.526,0/3.526,0/3.526,0/3.526,0/3.526,0/3.526,...,0/3.526,0/3.526,0/3.526,0/3.526,0/3.526,0/3.526,0/3.526,0/3.526,0/3.526,0/3.526
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37333,ZYG11AP1,27/0.993,9/0.993,44/0.993,8/0.993,29/0.993,1/0.993,1/0.993,21/0.993,50/0.993,...,13/0.993,6/0.993,6/0.993,10/0.993,32/0.993,28/0.993,82/0.993,74/0.993,8/0.993,22/0.993
37334,ZYG11B,1799/100.883,2857/100.883,5383/100.883,1757/100.883,5474/100.883,2362/100.883,1575/100.883,4241/100.883,2492/100.883,...,2845/100.883,2067/100.883,2942/100.883,2992/100.883,3023/100.883,4155/100.883,1693/100.883,2870/100.883,3623/100.883,2814/100.883
37335,ZYX,1543/9.816,1130/9.816,1472/9.816,914/9.816,559/9.816,4893/9.816,13436/9.816,1859/9.816,2283/9.816,...,1532/9.816,2489/9.816,741/9.816,2192/9.816,1739/9.816,3457/9.816,1376/9.816,2559/9.816,1388/9.816,1098/9.816
37336,ZYXP1,0/0.117,0/0.117,0/0.117,0/0.117,0/0.117,0/0.117,0/0.117,0/0.117,0/0.117,...,0/0.117,0/0.117,0/0.117,0/0.117,0/0.117,1/0.117,0/0.117,0/0.117,0/0.117,0/0.117


In [279]:
def test_calculate_rpk_table(table):
    missing_genes_dict = {'C12orf74':'PLEKHG7',
                     'LINC00856':'LINC00595'}

    precalculated_gene_length_dict = {'CCL3L1': 3.090}

    table_numpy = table.to_numpy(copy=True)
    
    
    ## -------- FOR CALCULATING RPK ----------
    rpk_table = table_numpy
    gene_length = 0
    
    for index, value in np.ndenumerate(table_numpy):

        if index[1] == 0:
            gene = value
            if gene in precalculated_gene_length_dict:
                gene_length = precalculated_gene_length_dict.get(value)
            else:
                if gene in missing_genes_dict:
                    gene = missing_genes_dict.get(value)
                gene_length = find_gene_length_ensembl(gene)
            

        else:
            rpk_table[index[0], index[1]] = calculate_rpk(value, gene_length)
     
    # compare original table with the resultant rpk values
    return pd.DataFrame(rpk_table)

test_calculate_rpk_table(all_rnaseq_selected_samples)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,138,139,140,141,142,143,144,145,146,147
0,A1BG,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,A1BG-AS1,0,0.129249,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,A1CF,57.8212,40.4794,43.6673,31.2638,51.3644,42.1371,60.1396,54.6449,36.4338,...,35.321,28.0991,31.9593,51.3992,48.7098,26.9052,28.9801,70.4101,46.1364,32.2723
3,A2M,0.535365,1.70905,2.32678,0.926593,1.33841,0.700093,1.42078,2.69742,1.52373,...,2.63564,0.86482,1.44137,1.02955,1.00896,1.54432,1.95614,2.22382,0.350046,2.0591
4,A2M-AS1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37333,ZYG11AP1,27.1903,9.06344,44.3102,8.05639,29.2044,1.00705,1.00705,21.148,50.3525,...,13.0916,6.0423,6.0423,10.0705,32.2256,28.1974,82.578,74.5217,8.05639,22.1551
37334,ZYG11B,17.8325,28.3199,53.3588,17.4162,54.2609,23.4133,15.6121,42.0388,24.7019,...,28.201,20.4891,29.1625,29.6581,29.9654,41.1863,16.7818,28.4488,35.9129,27.8937
37335,ZYX,157.192,115.118,149.959,93.1133,56.9478,498.472,1368.79,189.385,232.579,...,156.072,253.566,75.489,223.309,177.16,352.18,140.179,260.697,141.402,111.858
37336,ZYXP1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,8.54701,0,0,0,0


In [282]:
def test_calculate_rpk(sample_count, gene_length, expected):
    test = calculate_rpk(sample_count, gene_length)
    return test - expected

# test A1BG-AS1, patient 2
test_calculate_rpk(1, 7.737, 0.129249)

6.29442936384006e-08
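
Returning the raw difference leaves the pass or fail judgment to the reader. One possible alternative, sketched here with a hypothetical stand-in for calculate_rpk (assumed to be the count divided by the gene length in kilobases), is to compare against the expected value with a tolerance using math.isclose from the standard library.

```python
import math

def rpk_stand_in(sample_count, gene_length_kb):
    # Hypothetical stand-in for the notebook's calculate_rpk.
    return sample_count / gene_length_kb

def rpk_matches(sample_count, gene_length_kb, expected, rel_tol=1e-6):
    """Return True if the computed RPK is within rel_tol of expected."""
    computed = rpk_stand_in(sample_count, gene_length_kb)
    return math.isclose(computed, expected, rel_tol=rel_tol)

# A1BG-AS1, patient 2: the rounded value from the displayed table.
close_enough = rpk_matches(1, 7.737, 0.129249)

# A clearly wrong expected value should be rejected.
too_far = rpk_matches(1, 7.737, 0.13)
```

The tolerance of 1e-6 is an arbitrary choice for this sketch; it is loose enough to accept the six-digit value shown in the table and tight enough to reject 0.13.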

In [194]:
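# NOTE: this cell (In [194]) appears to have been run against an earlier
# two-argument version of test_calculate_rpk; with the three-argument
# definition above (In [282]) this call raises a TypeError.
test_calculate_rpk(531, 138.585)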

True

In [201]:
def test_calculate_total_reads(table):
    missing_genes_dict = {'C12orf74':'PLEKHG7',
                     'LINC00856':'LINC00595'}

    precalculated_gene_length_dict = {'CCL3L1': 3.090}

    table_numpy = table.to_numpy(copy=True)
    
    
    ## -------- FOR CALCULATING RPK ----------
    rpk_table = table_numpy
    gene_length = 0

    for index, value in np.ndenumerate(table_numpy):
        if index[1] == 0:
            gene = value
            if gene in precalculated_gene_length_dict:
                gene_length = precalculated_gene_length_dict.get(value)
            else:
                if gene in missing_genes_dict:
                    gene = missing_genes_dict.get(value)
                gene_length = find_gene_length_ensembl(gene)
            
                
        else:
            rpk_table[index[0], index[1]] = calculate_rpk(value, gene_length)
    
    ## ---- FOR CALCULATING PER MILLION SCALING FACTOR -----
    per_mil_table = rpk_table
    
    total_reads = np.sum(rpk_table[:, 1:], axis=0)
    
    rpk_pandas = pd.DataFrame(rpk_table)
    rpk_pandas.loc['total by col', :] = rpk_table.sum(axis=0)
    
    # comparing first table against second table, where second table is our function output
    display (rpk_pandas.iloc[[-1]])
    
    return pd.DataFrame(total_reads).transpose()

test_calculate_total_reads(all_rnaseq_selected_samples)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,138,139,140,141,142,143,144,145,146,147
total by col,A1BGA1BG-AS1A1CFA2MA2M-AS1A2ML1A2ML1-AS1A2ML1-...,25033300.0,40472100.0,46545100.0,10681700.0,33268900.0,20547800.0,17318300.0,26926200.0,20207000.0,...,19185900.0,25880400.0,26199900.0,19724400.0,34999500.0,29434000.0,13688000.0,41344000.0,44959200.0,23614700.0


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,137,138,139,140,141,142,143,144,145,146
0,25033300.0,40472100.0,46545100.0,10681700.0,33268900.0,20547800.0,17318300.0,26926200.0,20207000.0,22273300.0,...,19185900.0,25880400.0,26199900.0,19724400.0,34999500.0,29434000.0,13688000.0,41344000.0,44959200.0,23614700.0


In [205]:
def test_per_million_table(table):
    missing_genes_dict = {'C12orf74':'PLEKHG7',
                     'LINC00856':'LINC00595'}

    precalculated_gene_length_dict = {'CCL3L1': 3.090}
    table_numpy = table.to_numpy(copy=True)
    
    table_columns = table.columns.values
    
    ## -------- FOR CALCULATING RPK ----------
    rpk_table = table_numpy
    gene_length = 0
    
    for index, value in np.ndenumerate(table_numpy):
        if index[1] == 0:
            gene = value
            if gene in precalculated_gene_length_dict:
                gene_length = precalculated_gene_length_dict.get(value)
            else:
                if gene in missing_genes_dict:
                    gene = missing_genes_dict.get(value)
                gene_length = find_gene_length_ensembl(gene)
            
                
        else:
            rpk_table[index[0], index[1]] = calculate_rpk(value, gene_length)
    
    ## ---- FOR CALCULATING PER MILLION SCALING FACTOR -----
    per_mil_table = rpk_table
    
    total_reads = np.sum(rpk_table[:, 1:], axis=0)
    
    for index, value in np.ndenumerate(rpk_table):
        if (index[1] == 0):
            continue
        total_for_column = total_reads[index[1] - 1]
        
        scaling_factor = calculate_per_million(total_for_column)
        
        per_mil_table[index[0], index[1]] = str(value) + '/' + str(scaling_factor)
        
    return pd.DataFrame(per_mil_table, columns=table_columns)

test_per_million_table(all_rnaseq_selected_samples)

Unnamed: 0,hgnc_symbol,X00faf8ba.ff90.4214.9d03.6c5e14645d8f.htseq.counts.gz,X0143419f.2abe.4906.bb55.af6010fab05f.htseq.counts.gz,X01f84c45.2058.4e22.b234.52f0a82a97fc.htseq.counts.gz,X03094067.02d4.40c5.b6fa.bb5180dc7eab.htseq.counts.gz,X0349f526.7816.4a7d.9967.1f75dd9ff00a.htseq.counts.gz,X03630a0c.aa97.4e28.bac9.0206fff669cd.htseq.counts.gz,X03761959.a620.440f.bbaa.33bd75afae1c.htseq.counts.gz,X057aa9ac.f22c.4c11.a44d.ad52ae59b4cf.htseq.counts.gz,X05f0ced5.6976.4f43.9be5.fddb3f550adf.htseq.counts.gz,...,eb3894d4.fcae.43ef.ad68.b756c6aa56ea.htseq.counts.gz,f144de50.6126.4912.9c94.824d1eb0fac5.htseq.counts.gz,f2389819.b8fc.460e.821c.01dba313cce1.htseq.counts.gz,f6bd7191.a820.4d86.927a.b4b5f88ebd67.htseq.counts.gz,f748bf78.4dc1.47ad.8611.8186479d3e4b.htseq.counts.gz,f8551a29.d4bd.4954.bf9c.8e10265063de.htseq.counts.gz,f9f63982.b0ee.4cb8.8de5.f885d82137f0.htseq.counts.gz,fcd43085.7338.43fe.bc25.9d87b04e227f.htseq.counts.gz,feb22766.4282.47c8.bfe2.7d020b4a15d4.htseq.counts.gz,fef65b57.c58d.4050.8de4.f09f5cd616ce.htseq.counts.gz
0,A1BG,0.0/25.033308132169527,0.0/40.472115920488626,0.0/46.54505274181538,0.0/10.681741528796152,0.0/33.268893485418644,0.0/20.54780318487987,0.0/17.31833710629884,0.0/26.926222765452653,0.0/20.20700697909019,...,0.0/19.185930264558305,0.0/25.880426119096576,0.0/26.199890790768382,0.0/19.72435163550857,0.0/34.99952139550847,0.0/29.433968756463166,0.0/13.688023374632223,0.0/41.34395664569171,0.0/44.959218805980186,0.0/23.614682669910234
1,A1BG-AS1,0.0/25.033308132169527,0.12924906294429364/40.472115920488626,0.0/46.54505274181538,0.0/10.681741528796152,0.0/33.268893485418644,0.0/20.54780318487987,0.0/17.31833710629884,0.0/26.926222765452653,0.0/20.20700697909019,...,0.0/19.185930264558305,0.0/25.880426119096576,0.0/26.199890790768382,0.0/19.72435163550857,0.0/34.99952139550847,0.0/29.433968756463166,0.0/13.688023374632223,0.0/41.34395664569171,0.0/44.959218805980186,0.0/23.614682669910234
2,A1CF,57.82115781420258/25.033308132169527,40.47944729093733/40.472115920488626,43.66726172536109/46.54505274181538,31.263765562330466/10.681741528796152,51.364384577933365/33.268893485418644,42.13711079683769/20.54780318487987,60.13956831196531/17.31833710629884,54.64493543226764/26.926222765452653,36.43382097234136/20.20700697909019,...,35.32098393341525/19.185930264558305,28.099135232884333/25.880426119096576,31.959288711659283/26.199890790768382,51.3991607353998/19.72435163550857,48.709804557995035/34.99952139550847,26.905153826536527/29.433968756463166,28.98013122203417/13.688023374632223,70.41012681705422/41.34395664569171,46.1363689054784/44.959218805980186,32.272274128857255/23.614682669910234
3,A2M,0.5353649747760734/25.033308132169527,1.7090497271697727/40.472115920488626,2.326778544219088/46.54505274181538,0.9265932255739731/10.681741528796152,1.3384124369401833/33.268893485418644,0.7000926593225574/20.54780318487987,1.4207762792134253/17.31833710629884,2.697415834448677/26.926222765452653,1.5237310820549779/20.20700697909019,...,2.6356429527437455/19.185930264558305,0.8648203438690415/25.880426119096576,1.4413672397817359/26.199890790768382,1.0295480284155256/19.72435163550857,1.008957067847215/34.99952139550847,1.5443220426232884/29.433968756463166,1.9561412539894987/13.688023374632223,2.2238237413775352/41.34395664569171,0.3500463296612787/44.959218805980186,2.0590960568310512/23.614682669910234
4,A2M-AS1,0.0/25.033308132169527,0.0/40.472115920488626,0.0/46.54505274181538,0.0/10.681741528796152,0.0/33.268893485418644,0.0/20.54780318487987,0.0/17.31833710629884,0.0/26.926222765452653,0.0/20.20700697909019,...,0.0/19.185930264558305,0.0/25.880426119096576,0.0/26.199890790768382,0.0/19.72435163550857,0.0/34.99952139550847,0.0/29.433968756463166,0.0/13.688023374632223,0.0/41.34395664569171,0.0/44.959218805980186,0.0/23.614682669910234
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37333,ZYG11AP1,27.19033232628399/25.033308132169527,9.06344410876133/40.472115920488626,44.31017119838872/46.54505274181538,8.056394763343404/10.681741528796152,29.204431017119838/33.268893485418644,1.0070493454179255/20.54780318487987,1.0070493454179255/17.31833710629884,21.148036253776436/26.926222765452653,50.35246727089628/20.20700697909019,...,13.091641490433032/19.185930264558305,6.042296072507553/25.880426119096576,6.042296072507553/26.199890790768382,10.070493454179255/19.72435163550857,32.225579053373615/34.99952139550847,28.197381671701912/29.433968756463166,82.57804632426989/13.688023374632223,74.52165156092649/41.34395664569171,8.056394763343404/44.959218805980186,22.15508559919436/23.614682669910234
37334,ZYG11B,17.832538683425355/25.033308132169527,28.31993497417801/40.472115920488626,53.35884143017159/46.54505274181538,17.416214823111922/10.681741528796152,54.26087646085069/33.268893485418644,23.41326090619827/20.54780318487987,15.612144761753715/17.31833710629884,42.038797418792065/26.926222765452653,24.70188237859699/20.20700697909019,...,28.200985299802742/19.185930264558305,20.489081411139637/25.880426119096576,29.16249516766948/26.199890790768382,29.658118810899758/19.72435163550857,29.965405469702528/34.99952139550847,41.18632475243599/29.433968756463166,16.78181655977717/13.688023374632223,28.448797121417883/41.34395664569171,35.91288918846585/44.959218805980186,27.893698640999972/23.614682669910234
37335,ZYX,157.1923390383048/25.033308132169527,115.11817440912795/40.472115920488626,149.95925020374898/46.54505274181538,93.11328443357783/10.681741528796152,56.94784026079869/33.268893485418644,498.4718826405868/20.54780318487987,1368.7856560717196/17.31833710629884,189.3846780766096/26.926222765452653,232.57946210268946/20.20700697909019,...,156.07171964140178/19.185930264558305,253.56560717196413/25.880426119096576,75.48899755501222/26.199890790768382,223.3088834555827/19.72435163550857,177.15973920130398/34.99952139550847,352.1801140994295/29.433968756463166,140.17929910350446/13.688023374632223,260.6968215158924/41.34395664569171,141.40179299103502/44.959218805980186,111.85819070904644/23.614682669910234
37336,ZYXP1,0.0/25.033308132169527,0.0/40.472115920488626,0.0/46.54505274181538,0.0/10.681741528796152,0.0/33.268893485418644,0.0/20.54780318487987,0.0/17.31833710629884,0.0/26.926222765452653,0.0/20.20700697909019,...,0.0/19.185930264558305,0.0/25.880426119096576,0.0/26.199890790768382,0.0/19.72435163550857,0.0/34.99952139550847,8.547008547008547/29.433968756463166,0.0/13.688023374632223,0.0/41.34395664569171,0.0/44.959218805980186,0.0/23.614682669910234


30. Test filter and sort checks that when filter and sort are applied, each row still retains its original values.

In [210]:
def test_filter_and_sort(table_tpm):
    table_filter = filter_genes_of_interest(table_tpm)
    table_sort = sort_genes_of_interest(table_filter)
    try:
        assert_frame_equal(table_filter, table_sort, check_like=True)
        return True
    except AssertionError:
        # assert_frame_equal raises AssertionError when the frames differ
        return False
    
test_filter_and_sort(tpm_all_rnaseq_selected_samples)

True

## Test to see if our other tables also pass the above tests

### rnaseq_tpm table

In [211]:
test_sort_genes_of_interest_1(rnaseq_tpm)

hgnc_symbol                                              True
X00faf8ba.ff90.4214.9d03.6c5e14645d8f.htseq.counts.gz    True
X0143419f.2abe.4906.bb55.af6010fab05f.htseq.counts.gz    True
X01f84c45.2058.4e22.b234.52f0a82a97fc.htseq.counts.gz    True
X03094067.02d4.40c5.b6fa.bb5180dc7eab.htseq.counts.gz    True
                                                         ... 
f8551a29.d4bd.4954.bf9c.8e10265063de.htseq.counts.gz     True
f9f63982.b0ee.4cb8.8de5.f885d82137f0.htseq.counts.gz     True
fcd43085.7338.43fe.bc25.9d87b04e227f.htseq.counts.gz     True
feb22766.4282.47c8.bfe2.7d020b4a15d4.htseq.counts.gz     True
fef65b57.c58d.4050.8de4.f09f5cd616ce.htseq.counts.gz     True
Length: 148, dtype: bool

In [212]:
test_sort_genes_of_interest_2(rnaseq_tpm)

True

### normal_rnaseq table

In [213]:
test_filter_genes_of_interest_1(all_normal_rnaseq)

True

In [214]:
test_filter_genes_of_interest_2(all_normal_rnaseq)

True

In [215]:
test_sort_genes_of_interest_1(normal_rnaseq)

hgnc_symbol                                              True
X0be94b2f.fccb.4482.b0ea.695c101aa65a.htseq.counts.gz    True
X1f2aa905.5022.4efe.afac.022d1acfdbe5.htseq.counts.gz    True
X26a18ff4.ac77.47e8.9ef8.da442ac1325d.htseq.counts.gz    True
X3de80dcb.4ff2.4125.b8e6.9e06ec1cd833.htseq.counts.gz    True
X42bec5f7.7623.42e6.bbdf.514fe3805940.htseq.counts.gz    True
X5047576e.f3de.4244.8f47.f78bc1c10c22.htseq.counts.gz    True
X50e50114.97c0.46a6.ac5a.8c5c32abd6b2.htseq.counts.gz    True
X6020245b.2956.46cf.9048.fbc09709ab22.htseq.counts.gz    True
X82275c4f.5976.40e0.ac70.c74250de34ac.htseq.counts.gz    True
X9f124994.5787.488d.b679.a33419ab63e5.htseq.counts.gz    True
a6ad90fe.ccfe.47ce.9e5a.95f5e7acf761.htseq.counts.gz     True
b6d23de9.99bf.4412.b022.bab1332165bf.htseq.counts.gz     True
b9038119.a0a9.4987.b9e3.02ea055a644a.htseq.counts.gz     True
be13c589.e2f2.4505.9f12.2de3a8c97fdf.htseq.counts.gz     True
e7e83d39.85b9.45c2.a4f4.f92080ef770a.htseq.counts.gz     True
fb65f821

In [219]:
test_calculate_gene_length(all_normal_rnaseq)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,A1BG,8.314,8.314,8.314,8.314,8.314,8.314,8.314,8.314,8.314,8.314,8.314,8.314,8.314,8.314,8.314,8.314
1,A1BG-AS1,7.737,7.737,7.737,7.737,7.737,7.737,7.737,7.737,7.737,7.737,7.737,7.737,7.737,7.737,7.737,7.737
2,A1CF,86.266,86.266,86.266,86.266,86.266,86.266,86.266,86.266,86.266,86.266,86.266,86.266,86.266,86.266,86.266,86.266
3,A2M,48.565,48.565,48.565,48.565,48.565,48.565,48.565,48.565,48.565,48.565,48.565,48.565,48.565,48.565,48.565,48.565
4,A2M-AS1,3.526,3.526,3.526,3.526,3.526,3.526,3.526,3.526,3.526,3.526,3.526,3.526,3.526,3.526,3.526,3.526
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37333,ZYG11AP1,0.993,0.993,0.993,0.993,0.993,0.993,0.993,0.993,0.993,0.993,0.993,0.993,0.993,0.993,0.993,0.993
37334,ZYG11B,100.883,100.883,100.883,100.883,100.883,100.883,100.883,100.883,100.883,100.883,100.883,100.883,100.883,100.883,100.883,100.883
37335,ZYX,9.816,9.816,9.816,9.816,9.816,9.816,9.816,9.816,9.816,9.816,9.816,9.816,9.816,9.816,9.816,9.816
37336,ZYXP1,0.117,0.117,0.117,0.117,0.117,0.117,0.117,0.117,0.117,0.117,0.117,0.117,0.117,0.117,0.117,0.117


In [221]:
test_calculate_rpk_table(all_normal_rnaseq)

Unnamed: 0,hgnc_symbol,X0be94b2f.fccb.4482.b0ea.695c101aa65a.htseq.counts.gz,X1f2aa905.5022.4efe.afac.022d1acfdbe5.htseq.counts.gz,X26a18ff4.ac77.47e8.9ef8.da442ac1325d.htseq.counts.gz,X3de80dcb.4ff2.4125.b8e6.9e06ec1cd833.htseq.counts.gz,X42bec5f7.7623.42e6.bbdf.514fe3805940.htseq.counts.gz,X5047576e.f3de.4244.8f47.f78bc1c10c22.htseq.counts.gz,X50e50114.97c0.46a6.ac5a.8c5c32abd6b2.htseq.counts.gz,X6020245b.2956.46cf.9048.fbc09709ab22.htseq.counts.gz,X82275c4f.5976.40e0.ac70.c74250de34ac.htseq.counts.gz,X9f124994.5787.488d.b679.a33419ab63e5.htseq.counts.gz,a6ad90fe.ccfe.47ce.9e5a.95f5e7acf761.htseq.counts.gz,b6d23de9.99bf.4412.b022.bab1332165bf.htseq.counts.gz,b9038119.a0a9.4987.b9e3.02ea055a644a.htseq.counts.gz,be13c589.e2f2.4505.9f12.2de3a8c97fdf.htseq.counts.gz,e7e83d39.85b9.45c2.a4f4.f92080ef770a.htseq.counts.gz,fb65f821.92cb.402a.ad2f.d4044ca7de4d.htseq.counts.gz
0,A1BG,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0
1,A1BG-AS1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,A1CF,2339,6121,3825,7911,5348,6989,2839,2272,6153,5157,4286,6003,6021,4014,7944,4916
3,A2M,20,109,98,124,65,77,14,1,52,31,76,12,154,128,88,74
4,A2M-AS1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37333,ZYG11AP1,21,45,8,53,73,21,4,39,18,14,18,17,10,59,27,0
37334,ZYG11B,624,4225,3513,2737,2542,2485,2116,2342,3262,3150,3185,1834,2103,2841,3439,2145
37335,ZYX,438,8329,1317,3072,2134,3777,4870,326,1592,2549,1032,3638,2133,3658,5104,2757
37336,ZYXP1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,A1BG,0,0,0,0,0,0,0,0.240558,0,0,0,0,0,0,0,0
1,A1BG-AS1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.129249,0,0
2,A1CF,27.1138,70.955,44.3396,91.7047,61.9943,81.0169,32.9098,26.3371,71.3259,59.7802,49.6835,69.5871,69.7957,46.5305,92.0873,56.9865
3,A2M,0.411819,2.24441,2.01791,2.55328,1.33841,1.5855,0.288273,0.020591,1.07073,0.63832,1.56491,0.247092,3.17101,2.63564,1.812,1.52373
4,A2M-AS1,0,0,0,0,0,0,0,0,0.283607,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37333,ZYG11AP1,21.148,45.3172,8.05639,53.3736,73.5146,21.148,4.0282,39.2749,18.1269,14.0987,18.1269,17.1198,10.0705,59.4159,27.1903,0
37334,ZYG11B,6.18538,41.8802,34.8225,27.1304,25.1975,24.6325,20.9748,23.215,32.3345,31.2243,31.5712,18.1795,20.8459,28.1613,34.089,21.2623
37335,ZYX,44.621,848.513,134.169,312.958,217.4,384.78,496.129,33.2111,162.184,259.678,105.134,370.619,217.298,372.657,519.967,280.868
37336,ZYXP1,0,0,0,0,0,0,0,0,0,0,0,0,8.54701,0,0,0


In [222]:
# test A1CF for patient 1
test_calculate_rpk(2339, 86.266)

True

In [223]:
# test ZYX for the last patient (raw count 2757)
test_calculate_rpk(2757, 9.816)

True
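
The two spot checks above can be reproduced by hand. The sketch below assumes that RPK is the raw count divided by the gene length, and that the gene length table is in kilobases; the helper name `rpk` is hypothetical and is not the notebook's own `calculate_rpk`. The expected values are the ones shown in the RPK table above.

```python
import math


def rpk(read_count, gene_length_kb):
    """Reads per kilobase: raw count divided by gene length (assumed to be in kb)."""
    return read_count / gene_length_kb


# A1CF for the first patient: count 2339, gene length 86.266
a1cf_rpk = rpk(2339, 86.266)
# ZYX for the last patient: count 2757, gene length 9.816
zyx_rpk = rpk(2757, 9.816)

# Compare against the rounded values displayed in the RPK table
print(math.isclose(a1cf_rpk, 27.1138, abs_tol=1e-4))  # True
print(math.isclose(zyx_rpk, 280.868, abs_tol=1e-3))   # True
```

Comparing with a tolerance matters here because the displayed table values are rounded.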

In [224]:
test_calculate_total_reads(all_normal_rnaseq)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
total by col,A1BGA1BG-AS1A1CFA2MA2M-AS1A2ML1A2ML1-AS1A2ML1-...,28710000.0,46368400.0,40960000.0,34504200.0,46620600.0,28605700.0,28303600.0,23649100.0,25074800.0,25456000.0,28859300.0,27966800.0,23241800.0,29439800.0,28624500.0,23453000.0


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,28710000.0,46368400.0,40960000.0,34504200.0,46620600.0,28605700.0,28303600.0,23649100.0,25074800.0,25456000.0,28859300.0,27966800.0,23241800.0,29439800.0,28624500.0,23453000.0


In [225]:
test_per_million_table(all_normal_rnaseq)

Unnamed: 0,hgnc_symbol,X0be94b2f.fccb.4482.b0ea.695c101aa65a.htseq.counts.gz,X1f2aa905.5022.4efe.afac.022d1acfdbe5.htseq.counts.gz,X26a18ff4.ac77.47e8.9ef8.da442ac1325d.htseq.counts.gz,X3de80dcb.4ff2.4125.b8e6.9e06ec1cd833.htseq.counts.gz,X42bec5f7.7623.42e6.bbdf.514fe3805940.htseq.counts.gz,X5047576e.f3de.4244.8f47.f78bc1c10c22.htseq.counts.gz,X50e50114.97c0.46a6.ac5a.8c5c32abd6b2.htseq.counts.gz,X6020245b.2956.46cf.9048.fbc09709ab22.htseq.counts.gz,X82275c4f.5976.40e0.ac70.c74250de34ac.htseq.counts.gz,X9f124994.5787.488d.b679.a33419ab63e5.htseq.counts.gz,a6ad90fe.ccfe.47ce.9e5a.95f5e7acf761.htseq.counts.gz,b6d23de9.99bf.4412.b022.bab1332165bf.htseq.counts.gz,b9038119.a0a9.4987.b9e3.02ea055a644a.htseq.counts.gz,be13c589.e2f2.4505.9f12.2de3a8c97fdf.htseq.counts.gz,e7e83d39.85b9.45c2.a4f4.f92080ef770a.htseq.counts.gz,fb65f821.92cb.402a.ad2f.d4044ca7de4d.htseq.counts.gz
0,A1BG,0.0/28.709982399000495,0.0/46.36843674696419,0.0/40.96004263867878,0.0/34.50418500877338,0.0/46.62058070248897,0.0/28.60566861886804,0.0/28.303554144536495,0.24055809477988935/23.649056610410305,0.0/25.074824035565268,0.0/25.455992472910776,0.0/28.85933376286785,0.0/27.966813735913636,0.0/23.241767424606614,0.0/29.439759065312664,0.0/28.624494138598248,0.0/23.453032241204102
1,A1BG-AS1,0.0/28.709982399000495,0.0/46.36843674696419,0.0/40.96004263867878,0.0/34.50418500877338,0.0/46.62058070248897,0.0/28.60566861886804,0.0/28.303554144536495,0.0/23.649056610410305,0.0/25.074824035565268,0.0/25.455992472910776,0.0/28.85933376286785,0.0/27.966813735913636,0.0/23.241767424606614,0.12924906294429364/29.439759065312664,0.0/28.624494138598248,0.0/23.453032241204102
2,A1CF,27.11381077133517/28.709982399000495,70.95495328402846/46.36843674696419,44.339600769712284/40.96004263867878,91.70472723900494/34.50418500877338,61.9942967101755/46.62058070248897,81.01685484431873/28.60566861886804,32.909837015742006/28.303554144536495,26.337143254584657/23.649056610410305,71.3258989636705/25.074824035565268,59.78021468481209/25.455992472910776,49.68353696705538/28.85933376286785,69.58709109034845/27.966813735913636,69.7957480351471/23.241767424606614,46.530498690098064/29.439759065312664,92.08726497113578/28.624494138598248,56.986530035007995/23.453032241204102
3,A2M,0.41181921136621025/28.709982399000495,2.244414701945846/46.36843674696419,2.01791413569443/40.96004263867878,2.5532791104705037/34.50418500877338,1.3384124369401833/46.62058070248897,1.5855039637599095/28.60566861886804,0.2882734479563472/28.303554144536495,0.02059096056831051/23.649056610410305,1.0707299495521467/25.074824035565268,0.6383197776176259/25.455992472910776,1.564913003191599/28.85933376286785,0.24709152681972615/27.966813735913636,3.171007927519819/23.241767424606614,2.6356429527437455/29.439759065312664,1.8120045300113252/28.624494138598248,1.5237310820549779/23.453032241204102
4,A2M-AS1,0.0/28.709982399000495,0.0/46.36843674696419,0.0/40.96004263867878,0.0/34.50418500877338,0.0/46.62058070248897,0.0/28.60566861886804,0.0/28.303554144536495,0.0/23.649056610410305,0.2836074872376631/25.074824035565268,0.0/25.455992472910776,0.0/28.85933376286785,0.0/27.966813735913636,0.0/23.241767424606614,0.0/29.439759065312664,0.0/28.624494138598248,0.0/23.453032241204102
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37333,ZYG11AP1,21.148036253776436/28.709982399000495,45.31722054380665/46.36843674696419,8.056394763343404/40.96004263867878,53.373615307150054/34.50418500877338,73.51460221550856/46.62058070248897,21.148036253776436/28.60566861886804,4.028197381671702/28.303554144536495,39.274924471299094/23.649056610410305,18.12688821752266/25.074824035565268,14.098690835850956/25.455992472910776,18.12688821752266/28.85933376286785,17.119838872104733/27.966813735913636,10.070493454179255/23.241767424606614,59.4159113796576/29.439759065312664,27.19033232628399/28.624494138598248,0.0/23.453032241204102
37334,ZYG11B,6.185383067513853/28.709982399000495,41.88019785295838/46.36843674696419,34.82251717335924/40.96004263867878,27.130438230425344/34.50418500877338,25.197506021827266/46.62058070248897,24.632495068544753/28.60566861886804,20.97479258150531/28.303554144536495,23.21501144890616/23.649056610410305,32.33448648434325/25.074824035565268,31.22428952350743/25.455992472910776,31.571226073768624/28.85933376286785,18.17947523368655/27.966813735913636,20.845930434265437/23.241767424606614,28.16133540834432/29.439759065312664,34.08899418137843/28.624494138598248,21.26225429457887/23.453032241204102
37335,ZYX,44.62102689486552/28.709982399000495,848.5126324368377/46.36843674696419,134.1687041564792/40.96004263867878,312.9584352078239/34.50418500877338,217.400162999185/46.62058070248897,384.7799511002445/28.60566861886804,496.12876935615316/28.303554144536495,33.211083944580274/23.649056610410305,162.1841890790546/25.074824035565268,259.67807660961694/25.455992472910776,105.13447432762835/28.85933376286785,370.61939690301546/27.966813735913636,217.29828850855745/23.241767424606614,372.6568867155664/29.439759065312664,519.9674001629992/28.624494138598248,280.8679706601467/23.453032241204102
37336,ZYXP1,0.0/28.709982399000495,0.0/46.36843674696419,0.0/40.96004263867878,0.0/34.50418500877338,0.0/46.62058070248897,0.0/28.60566861886804,0.0/28.303554144536495,0.0/23.649056610410305,0.0/25.074824035565268,0.0/25.455992472910776,0.0/28.85933376286785,0.0/27.966813735913636,8.547008547008547/23.241767424606614,0.0/29.439759065312664,0.0/28.624494138598248,0.0/23.453032241204102
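
Each cell above is shown as an RPK value over its sample's per-million scaling factor. As a minimal sketch of that step, assuming the usual TPM definition (scaling factor = column sum of RPK divided by one million), the division can be done on a toy table. The gene and sample names below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy RPK table: rows are genes, columns are samples (hypothetical values)
rpk = pd.DataFrame(
    {"sample_a": [10.0, 30.0, 60.0], "sample_b": [5.0, 5.0, 40.0]},
    index=["GENE1", "GENE2", "GENE3"],
)

# Per-sample scaling factor: total RPK in the column, divided by one million
scaling = rpk.sum(axis=0) / 1_000_000

# Divide every column by its own scaling factor to get TPM
tpm = rpk.div(scaling, axis=1)

# Sanity check: every sample's TPM values should sum to one million
print(np.allclose(tpm.sum(axis=0), 1_000_000))  # True
```

The column-sum check is a cheap invariant that could be added to the suite for any TPM table, independent of the actual expression values.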


In [226]:
test_filter_and_sort(tpm_normal_rnaseq)

True

## Test log2fold

31. Test our log2 fold change calculation function, which should return log2(b/a)

In [228]:
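def test_log2fold(a, b):
    # Expected value: log2 of the ratio b/a, computed independently of log2fold
    calculation = math.log2(b / a)
    test = log2fold(a, b)
    # Compare with a tolerance rather than exact float equality
    return math.isclose(calculation, test)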

test_log2fold(10000, 10)

True

In [229]:
test_log2fold(2.4, 2.4)

True

In [230]:
test_log2fold(4, 1)

True

In [247]:
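def test_average_tpm(table):
    # Column produced by the function under test
    test = average_tpm(table)['Average by gene']
    # Recompute the per-gene mean independently, excluding the gene symbol column
    tpm = table.copy()
    tpm.drop(labels=['hgnc_symbol'], axis=1, inplace=True)
    tpm.loc[:, 'Average by gene'] = tpm.mean(axis=1)

    try:
        assert_series_equal(tpm['Average by gene'], test)
        return True
    except AssertionError:
        # Only a failed comparison counts as a failed test; other errors should surface
        return False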

test_average_tpm(rnaseq_tpm)

True

In [268]:
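def test_apply_log2fold(table1, table2, table1_columns):
    # Recompute the log2 fold change of table1 against table2's per-gene average by hand.
    # Note: this returns the recomputed table for inspection, not a pass/fail value.
    table1_values = table1.to_numpy(copy=True)
    table2_values = table2['Average by gene'].tolist()

    log2fold_values = table1_values.copy()

    for index, value in np.ndenumerate(table1_values):
        # Column 0 holds hgnc_symbol, so skip it
        if index[1] == 0:
            continue
        # Add 1 to both values (a pseudocount) before taking the ratio
        a = value + 1
        b = table2_values[index[0]] + 1
        result = log2fold(a, b)
        log2fold_values[index[0], index[1]] = result

    log2fold_table = pd.DataFrame(log2fold_values, columns=table1_columns)

    display(table1)
    display(table2)
    return log2fold_table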

test_apply_log2fold(rnaseq_tpm_avg, normal_rnaseq_tpm_avg, rnaseq_tpm_avg.columns.tolist())

Unnamed: 0,hgnc_symbol,Average by gene
6402,DRD1,1.825896
6403,DRD2,4.617089
6404,DRD3,0.803724
6405,DRD4,0.854273
6406,DRD5,5.415738
...,...,...
4266,CHRM1,16.931017
4267,CHRM2,1.677587
4268,CHRM3,0.059905
4271,CHRM4,63.173471


Unnamed: 0,hgnc_symbol,Average by gene
6402,DRD1,1.265019
6403,DRD2,2.393486
6404,DRD3,0.799979
6405,DRD4,0.550545
6406,DRD5,6.205692
...,...,...
4266,CHRM1,15.807902
4267,CHRM2,0.920309
4268,CHRM3,0.056760
4271,CHRM4,76.905979


Unnamed: 0,hgnc_symbol,Average by gene
0,DRD1,-0.319185
1,DRD2,-0.727054
2,DRD3,-0.00299827
3,DRD4,-0.258078
4,DRD5,0.167522
...,...,...
102,CHRM1,-0.0933177
103,CHRM2,-0.479595
104,CHRM3,-0.0042874
105,CHRM4,0.279757


In [270]:
# testing drd1
test_log2fold(1.825896, 1.265019)

True

In [272]:
def test_log2fold_2(a, b, value):
    test = log2fold(a, b)
    return test == value

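TODO: look at this function. The two calls below return False, most likely for two reasons: test_log2fold_2 passes the raw averages to log2fold, whereas test_apply_log2fold adds 1 to each value first, and it compares the result to a rounded table value with exact equality.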

In [273]:
test_log2fold_2(1.825896, 1.265019, -0.319185)

False

In [274]:
test_log2fold_2(4.617089, 2.393486, -0.727054)

False
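
A possible explanation for these failures is that the table values were computed with 1 added to both averages, and that the displayed values are rounded. The sketch below tests that idea on the two rows above. The helper names `log2fold_pseudo` and `matches_table` are hypothetical, and the formula log2((b + 1) / (a + 1)) is inferred from `test_apply_log2fold`, not taken from the notebook's `log2fold` itself.

```python
import math


def log2fold_pseudo(a, b, pseudocount=1.0):
    """log2 fold change of b over a, with a pseudocount added to both (assumed convention)."""
    return math.log2((b + pseudocount) / (a + pseudocount))


def matches_table(a, b, displayed, tol=1e-4):
    """Compare against a rounded table value with an absolute tolerance."""
    return math.isclose(log2fold_pseudo(a, b), displayed, abs_tol=tol)


# DRD1 and DRD2 averages and their displayed log2 fold values
print(matches_table(1.825896, 1.265019, -0.319185))  # True
print(matches_table(4.617089, 2.393486, -0.727054))  # True

# Without the pseudocount the ratio is different, so exact equality cannot hold
print(math.log2(1.265019 / 1.825896) == -0.319185)   # False
```

If the pseudocount belongs in the pipeline rather than in `log2fold`, the test could add 1 to its inputs before calling `log2fold` and keep the tolerance-based comparison.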

In [285]:
rnaseq_tpm

Unnamed: 0,hgnc_symbol,X00faf8ba.ff90.4214.9d03.6c5e14645d8f.htseq.counts.gz,X0143419f.2abe.4906.bb55.af6010fab05f.htseq.counts.gz,X01f84c45.2058.4e22.b234.52f0a82a97fc.htseq.counts.gz,X03094067.02d4.40c5.b6fa.bb5180dc7eab.htseq.counts.gz,X0349f526.7816.4a7d.9967.1f75dd9ff00a.htseq.counts.gz,X03630a0c.aa97.4e28.bac9.0206fff669cd.htseq.counts.gz,X03761959.a620.440f.bbaa.33bd75afae1c.htseq.counts.gz,X057aa9ac.f22c.4c11.a44d.ad52ae59b4cf.htseq.counts.gz,X05f0ced5.6976.4f43.9be5.fddb3f550adf.htseq.counts.gz,...,eb3894d4.fcae.43ef.ad68.b756c6aa56ea.htseq.counts.gz,f144de50.6126.4912.9c94.824d1eb0fac5.htseq.counts.gz,f2389819.b8fc.460e.821c.01dba313cce1.htseq.counts.gz,f6bd7191.a820.4d86.927a.b4b5f88ebd67.htseq.counts.gz,f748bf78.4dc1.47ad.8611.8186479d3e4b.htseq.counts.gz,f8551a29.d4bd.4954.bf9c.8e10265063de.htseq.counts.gz,f9f63982.b0ee.4cb8.8de5.f885d82137f0.htseq.counts.gz,fcd43085.7338.43fe.bc25.9d87b04e227f.htseq.counts.gz,feb22766.4282.47c8.bfe2.7d020b4a15d4.htseq.counts.gz,fef65b57.c58d.4050.8de4.f09f5cd616ce.htseq.counts.gz
6402,DRD1,1.41635,1.66868,0.82912,2.61931,0.579992,0.915588,7.15859,1.26303,3.15118,...,0.603433,1.6123,0.911395,0.917126,1.64705,0.63917,1.26871,0.740905,0.61695,1.15416
6403,DRD2,2.48557,1.05098,5.62228,4.49347,6.20984,4.02821,2.66754,8.84376,3.94414,...,8.01233,1.49503,4.98889,6.81701,3.99398,7.241,6.83295,3.53737,0.59606,2.1748
6404,DRD3,0.660153,0.49123,0.579985,1.08571,0.650735,0.726343,0.833652,1.18147,1.1637,...,1.07034,0.594433,0.721095,0.907717,0.543377,0.480571,0.808609,0.729051,0.455828,0.804164
6405,DRD4,0.0232451,0.567926,0.51883,1.06229,1.65289,1.11862,0.100801,0.799607,1.18068,...,1.95626,0.31478,1.39924,1.41608,0.847927,0.83033,0.807725,0.598174,0.472416,1.88508
6406,DRD5,6.00463,2.46564,2.71384,8.1201,3.26526,4.01631,1.96932,5.23849,7.00122,...,8.25166,4.00221,2.53918,4.69631,5.83467,3.56194,10.2433,5.0717,3.65243,4.38621
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4266,CHRM1,17.2987,14.3963,13.2331,23.3402,13.6186,13.4605,20.4218,21.0616,24.8984,...,26.5848,9.58034,16.7436,23.0207,13.8855,16.5317,25.621,12.5653,10.5722,14.6252
4267,CHRM2,0.661713,1.19022,3.46534,1.43774,1.35869,0.722367,1.71376,1.68756,2.21573,...,1.39393,0.575309,1.21162,1.54786,1.17954,1.53885,1.48199,1.02135,0.374018,1.21799
4268,CHRM3,0.0810436,0.0321885,0.0542307,0.102311,0.0391578,0.046561,0.0733669,0.0603191,0.0842127,...,0.0610021,0.0366749,0.0356504,0.0727572,0.0395985,0.0494628,0.0762492,0.0542388,0.0226256,0.0480404
4271,CHRM4,124.391,47.2241,29.1536,82.706,28.5054,86.5048,118.123,53.8631,93.8629,...,78.3549,37.718,18.4268,49.8929,48.837,40.1167,108.569,83.9188,26.3668,40.6078
