# Cancer Expression Heatmap Testing Suite

### This suite tests the functions and outputs of the cancer_expression_heatmap notebook.

We first run the notebook to load all of its variables and functions, then go through and test each function.

The %%capture magic hides the output of that run.

In [5]:
%%capture
%run cancer_expression_heatmap.ipynb

## Test Data Imports and Preparation

1. Test remove_patients_from_list
    This test checks whether any patients from the to_remove list are still in the dataset. It returns True if all of those patients were successfully removed.

In [19]:
def test_remove_patients_from_list(to_remove_file):
    function_patients = remove_patients_from_list(to_remove_file)
    to_remove_list = pd.read_csv(to_remove_file, delimiter='\t')["Patient ID"].tolist()
    # True for each remaining patient that is on the removal list
    test = function_patients['submitter_id'].isin(to_remove_list)
    # Pass only if no remaining patient is on the removal list
    return not test.any()

test_remove_patients_from_list('datasets/paad_tcga_clinical_data.tsv')

True
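The membership check can also be tried on toy data, which makes it easy to confirm that the test fails when it should. This is a minimal sketch: the `submitter_id` column name comes from the test above, while `patients_all_removed` and the sample IDs are hypothetical.

```python
import pandas as pd

def patients_all_removed(remaining, to_remove_ids):
    """Return True when no ID in to_remove_ids is still present in `remaining`."""
    still_present = remaining["submitter_id"].isin(to_remove_ids)
    return not still_present.any()

# Toy data (hypothetical IDs, not taken from the real dataset)
to_remove_ids = ["P-001", "P-002"]
clean = pd.DataFrame({"submitter_id": ["P-003", "P-004"]})
dirty = pd.DataFrame({"submitter_id": ["P-002", "P-003"]})

print(patients_all_removed(clean, to_remove_ids))  # True
print(patients_all_removed(dirty, to_remove_ids))  # False
```

Running a check against a deliberately "dirty" table is a cheap way to make sure the test can actually fail.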

2. Test filter_genes_of_interest 1 checks that every gene in the table is in the provided neurotransmitter gene family file.

3. Test filter_genes_of_interest 2 checks that all of the neurotransmitter genes made it into the table (meaning that the original table contained all 107 of the neurotransmitter genes).

In [26]:
def test_filter_genes_of_interest_1(table):
    function_filter = filter_genes_of_interest(table)
    # True for each gene in the filtered table that is a known neurotransmitter gene
    test = function_filter['hgnc_symbol'].isin(neurotransmitter_genes["receptor gene"].tolist())
    # Pass only if every gene in the filtered table is in the gene family file
    return bool(test.all())

test_filter_genes_of_interest_1(all_rnaseq)

True

In [32]:
def test_filter_genes_of_interest_2(table):
    function_filter = filter_genes_of_interest(table)
    filter_count = function_filter['hgnc_symbol'].count() 
    neuro_count = neurotransmitter_genes["receptor gene"].count()
    return filter_count == neuro_count

test_filter_genes_of_interest_2(all_rnaseq)

True
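Both directions of the filter check can be expressed as set comparisons. The sketch below uses toy data; the `hgnc_symbol` column name comes from the tests above, while `check_gene_filter` and the three-gene list are hypothetical stand-ins for the real gene family file.

```python
import pandas as pd

def check_gene_filter(filtered, expected_genes):
    """Return (no_extra, none_missing) for a filtered gene table."""
    found = set(filtered["hgnc_symbol"])
    expected = set(expected_genes)
    no_extra = found <= expected       # every gene in the table is expected
    none_missing = expected <= found   # every expected gene is in the table
    return no_extra, none_missing

expected_genes = ["DRD1", "DRD2", "HTR1A"]  # hypothetical short list
complete = pd.DataFrame({"hgnc_symbol": ["DRD1", "DRD2", "HTR1A"]})
partial = pd.DataFrame({"hgnc_symbol": ["DRD1", "DRD2"]})

print(check_gene_filter(complete, expected_genes))  # (True, True)
print(check_gene_filter(partial, expected_genes))   # (True, False)
```

Comparing sets rather than counts also shows which direction a mismatch is in.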

4. Test create_counts_list checks that none of the elements returned by create_counts_list appear among the table's column names.

In [52]:
def test_create_counts_list(table):
    column_list = table.columns.values
    counts_list = create_counts_list()
    return not any(item in column_list for item in counts_list)

test_create_counts_list(rnaseq)

True

5. Test sort_genes_of_interest 1 makes sure that each hgnc_symbol stays with its row of counts.

6. Test sort_genes_of_interest 2 checks that the genes are sorted in the same order as the neurotransmitter gene families.

In [59]:
def test_sort_genes_of_interest_1(table):
    t_unsorted = table.copy()
    t_sorted = sort_genes_of_interest(table)
    
    t_merged = pd.merge(t_unsorted, t_sorted, on=list(t_unsorted.columns.values), how='inner')
    
    return t_merged.count() == t_unsorted.count()

test_sort_genes_of_interest_1(rnaseq_goi)

Unnamed: 0                                               True
hgnc_symbol                                              True
X00faf8ba.ff90.4214.9d03.6c5e14645d8f.htseq.counts.gz    True
X0143419f.2abe.4906.bb55.af6010fab05f.htseq.counts.gz    True
X01f84c45.2058.4e22.b234.52f0a82a97fc.htseq.counts.gz    True
                                                         ... 
f9f63982.b0ee.4cb8.8de5.f885d82137f0.htseq.counts.gz     True
fb65f821.92cb.402a.ad2f.d4044ca7de4d.htseq.counts.gz     True
fcd43085.7338.43fe.bc25.9d87b04e227f.htseq.counts.gz     True
feb22766.4282.47c8.bfe2.7d020b4a15d4.htseq.counts.gz     True
fef65b57.c58d.4050.8de4.f09f5cd616ce.htseq.counts.gz     True
Length: 156, dtype: bool

In [62]:
def test_sort_genes_of_interest_2(table):
    i = 0
    t_sorted = sort_genes_of_interest(table)
    for index, row in t_sorted.iterrows():
        if row['hgnc_symbol'] != receptor_gene_list[i]:
            return False
        i = i + 1
    return True

test_sort_genes_of_interest_2(rnaseq_goi)

True
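One way to reduce a "same rows, different order" check to a single boolean is to sort both tables by the gene symbol and compare them directly. This is a sketch on toy data; `same_rows_any_order` is a hypothetical helper, and it assumes each hgnc_symbol appears only once.

```python
import pandas as pd

def same_rows_any_order(a, b, key="hgnc_symbol"):
    """Return True when a and b hold identical rows, ignoring row order."""
    a_sorted = a.sort_values(key).reset_index(drop=True)
    b_sorted = b.sort_values(key).reset_index(drop=True)
    return a_sorted.equals(b_sorted)

original = pd.DataFrame({"hgnc_symbol": ["DRD2", "DRD1"], "sample_a": [10, 20]})
reordered = original.iloc[::-1]                   # same rows, reversed order
corrupted = reordered.assign(sample_a=[10, 20])   # counts attached to the wrong genes

print(same_rows_any_order(original, reordered))  # True
print(same_rows_any_order(original, corrupted))  # False
```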

The draw_expression_heatmap function performs no data manipulation, so we'll just check the heatmap output to ensure that the function does what we want it to.

7. Test z-score. Here we test z-score using scipy's zscore function, which is accurate and faster than a hand-built implementation. We run the zscore function directly on a DataFrame and compare that with converting to NumPy and running z_score on the array. This is a fairly weak test, but it is not clear what else to check, since the scipy library does all of the math anyway.

The output values are all very small, most likely because of rounding error and differences between pandas and NumPy data types.

In [99]:
def test_z_score(table):
    function_table = table.drop('hgnc_symbol', axis=1)
    comparator_table = function_table.copy()
    comparator_values = comparator_table.apply(stats.zscore)
    function_numpy = function_table.to_numpy(copy=True)
    comparator_numpy = comparator_values.to_numpy(copy=True)
    function_values = z_score(function_numpy)
    return function_values == comparator_values, function_values

test_z_score_result, rnaseq_zscore = test_z_score(rnaseq)

In [100]:
test_z_score_result

Unnamed: 0,X00faf8ba.ff90.4214.9d03.6c5e14645d8f.htseq.counts.gz,X0143419f.2abe.4906.bb55.af6010fab05f.htseq.counts.gz,X01f84c45.2058.4e22.b234.52f0a82a97fc.htseq.counts.gz,X03094067.02d4.40c5.b6fa.bb5180dc7eab.htseq.counts.gz,X0349f526.7816.4a7d.9967.1f75dd9ff00a.htseq.counts.gz,X03630a0c.aa97.4e28.bac9.0206fff669cd.htseq.counts.gz,X03761959.a620.440f.bbaa.33bd75afae1c.htseq.counts.gz,X057aa9ac.f22c.4c11.a44d.ad52ae59b4cf.htseq.counts.gz,X05f0ced5.6976.4f43.9be5.fddb3f550adf.htseq.counts.gz,X0726996d.62f2.4880.808c.cfe3361b4b42.htseq.counts.gz,...,eb3894d4.fcae.43ef.ad68.b756c6aa56ea.htseq.counts.gz,f144de50.6126.4912.9c94.824d1eb0fac5.htseq.counts.gz,f2389819.b8fc.460e.821c.01dba313cce1.htseq.counts.gz,f6bd7191.a820.4d86.927a.b4b5f88ebd67.htseq.counts.gz,f748bf78.4dc1.47ad.8611.8186479d3e4b.htseq.counts.gz,f8551a29.d4bd.4954.bf9c.8e10265063de.htseq.counts.gz,f9f63982.b0ee.4cb8.8de5.f885d82137f0.htseq.counts.gz,fcd43085.7338.43fe.bc25.9d87b04e227f.htseq.counts.gz,feb22766.4282.47c8.bfe2.7d020b4a15d4.htseq.counts.gz,fef65b57.c58d.4050.8de4.f09f5cd616ce.htseq.counts.gz
6402,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
6403,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
6404,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
6405,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
6406,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4266,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
4267,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
4268,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
4271,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True


In [93]:
rnaseq_zscore

array([[ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
         1.11022302e-16,  1.11022302e-16,  1.11022302e-16],
       [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
         0.00000000e+00,  8.32667268e-17, -2.22044605e-16],
       [ 2.77555756e-17, -4.85722573e-17, -1.38777878e-17, ...,
         1.04083409e-16,  8.67361738e-17, -2.77555756e-17],
       ...,
       [ 2.77555756e-17, -5.55111512e-17,  0.00000000e+00, ...,
         1.11022302e-16,  1.11022302e-16,  5.55111512e-17],
       [ 0.00000000e+00,  0.00000000e+00, -1.21430643e-17, ...,
         1.11022302e-16,  8.32667268e-17, -2.77555756e-17],
       [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
         1.11022302e-16,  1.11022302e-16,  1.11022302e-16]])
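The z-score test compares floating-point results with ==, which can fail on rounding differences of this size. A tolerance-based comparison with np.allclose is one alternative. The sketch below is an illustration only: `column_zscore` is a hypothetical stand-in for the notebook's z_score, and it assumes a column-wise z-score with the population standard deviation (ddof=0).

```python
import numpy as np
import pandas as pd

def column_zscore(grid):
    """Z-score each column using the population standard deviation (ddof=0)."""
    return (grid - grid.mean(axis=0)) / grid.std(axis=0)

# A few counts borrowed from the first rows of the table, under made-up column names
counts = pd.DataFrame({"s1": [147.0, 4112.0, 1187.0],
                       "s2": [280.0, 2811.0, 1428.0]})

via_numpy = column_zscore(counts.to_numpy())
via_pandas = ((counts - counts.mean()) / counts.std(ddof=0)).to_numpy()

# allclose tolerates differences on the order of floating-point rounding error
print(np.allclose(via_numpy, via_pandas))
```

np.allclose also returns a single boolean, which is easier to read than a full table of True values.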

8. Test pandas to numpy

In [78]:
def test_convert_pandas_to_numpy(p_table):
    n_table = convert_pandas_to_numpy(p_table)
    return p_table, n_table

pandas_rnaseq, numpy_rnaseq = test_convert_pandas_to_numpy(rnaseq)

In [79]:
pandas_rnaseq

Unnamed: 0,hgnc_symbol,X00faf8ba.ff90.4214.9d03.6c5e14645d8f.htseq.counts.gz,X0143419f.2abe.4906.bb55.af6010fab05f.htseq.counts.gz,X01f84c45.2058.4e22.b234.52f0a82a97fc.htseq.counts.gz,X03094067.02d4.40c5.b6fa.bb5180dc7eab.htseq.counts.gz,X0349f526.7816.4a7d.9967.1f75dd9ff00a.htseq.counts.gz,X03630a0c.aa97.4e28.bac9.0206fff669cd.htseq.counts.gz,X03761959.a620.440f.bbaa.33bd75afae1c.htseq.counts.gz,X057aa9ac.f22c.4c11.a44d.ad52ae59b4cf.htseq.counts.gz,X05f0ced5.6976.4f43.9be5.fddb3f550adf.htseq.counts.gz,...,eb3894d4.fcae.43ef.ad68.b756c6aa56ea.htseq.counts.gz,f144de50.6126.4912.9c94.824d1eb0fac5.htseq.counts.gz,f2389819.b8fc.460e.821c.01dba313cce1.htseq.counts.gz,f6bd7191.a820.4d86.927a.b4b5f88ebd67.htseq.counts.gz,f748bf78.4dc1.47ad.8611.8186479d3e4b.htseq.counts.gz,f8551a29.d4bd.4954.bf9c.8e10265063de.htseq.counts.gz,f9f63982.b0ee.4cb8.8de5.f885d82137f0.htseq.counts.gz,fcd43085.7338.43fe.bc25.9d87b04e227f.htseq.counts.gz,feb22766.4282.47c8.bfe2.7d020b4a15d4.htseq.counts.gz,fef65b57.c58d.4050.8de4.f09f5cd616ce.htseq.counts.gz
6402,DRD1,147,280,160,116,80,78,514,141,264,...,48,173,99,75,239,78,72,127,115,113
6403,DRD2,4112,2811,17294,3172,13653,5470,3053,15737,5267,...,10159,2557,8638,8886,9238,14085,6181,9665,1771,3394
6404,DRD3,1187,1428,1939,833,1555,1072,1037,2285,1689,...,1475,1105,1357,1286,1366,1016,795,2165,1472,1364
6405,DRD4,2,79,83,39,189,79,6,74,82,...,129,28,126,96,102,84,38,85,73,153
6406,DRD5,357,237,300,206,258,196,81,335,336,...,376,246,158,220,485,249,333,498,390,246
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4266,CHRM1,5685,7649,8086,3273,5948,3631,4643,7445,6605,...,6696,3255,5759,5961,6380,6388,4604,6820,6240,4534
4267,CHRM2,2511,7302,24450,2328,6852,2250,4499,6888,6787,...,4054,2257,4812,4628,6258,6866,3075,6401,2549,4360
4268,CHRM3,1073,689,1335,578,689,506,672,859,900,...,619,502,494,759,733,770,552,1186,538,600
4271,CHRM4,4702,2886,2049,1334,1432,2684,3089,2190,2864,...,2270,1474,729,1486,2581,1783,2244,5239,1790,1448


In [80]:
numpy_rnaseq

array([[  147.,   280.,   160., ...,   127.,   115.,   113.],
       [ 4112.,  2811., 17294., ...,  9665.,  1771.,  3394.],
       [ 1187.,  1428.,  1939., ...,  2165.,  1472.,  1364.],
       ...,
       [ 1073.,   689.,  1335., ...,  1186.,   538.,   600.],
       [ 4702.,  2886.,  2049., ...,  5239.,  1790.,  1448.],
       [    0.,     0.,     0., ...,     0.,     0.,     0.]])

9. Test numpy to pandas

In [86]:
def test_convert_numpy_to_pandas(n_table, p_table_columns):
    p_table = convert_numpy_to_pandas(n_table, p_table_columns)
    return p_table

test_convert_numpy_to_pandas(numpy_rnaseq, rnaseq.columns.values[1:])

Unnamed: 0,hgnc_symbol,X00faf8ba.ff90.4214.9d03.6c5e14645d8f.htseq.counts.gz,X0143419f.2abe.4906.bb55.af6010fab05f.htseq.counts.gz,X01f84c45.2058.4e22.b234.52f0a82a97fc.htseq.counts.gz,X03094067.02d4.40c5.b6fa.bb5180dc7eab.htseq.counts.gz,X0349f526.7816.4a7d.9967.1f75dd9ff00a.htseq.counts.gz,X03630a0c.aa97.4e28.bac9.0206fff669cd.htseq.counts.gz,X03761959.a620.440f.bbaa.33bd75afae1c.htseq.counts.gz,X057aa9ac.f22c.4c11.a44d.ad52ae59b4cf.htseq.counts.gz,X05f0ced5.6976.4f43.9be5.fddb3f550adf.htseq.counts.gz,...,eb3894d4.fcae.43ef.ad68.b756c6aa56ea.htseq.counts.gz,f144de50.6126.4912.9c94.824d1eb0fac5.htseq.counts.gz,f2389819.b8fc.460e.821c.01dba313cce1.htseq.counts.gz,f6bd7191.a820.4d86.927a.b4b5f88ebd67.htseq.counts.gz,f748bf78.4dc1.47ad.8611.8186479d3e4b.htseq.counts.gz,f8551a29.d4bd.4954.bf9c.8e10265063de.htseq.counts.gz,f9f63982.b0ee.4cb8.8de5.f885d82137f0.htseq.counts.gz,fcd43085.7338.43fe.bc25.9d87b04e227f.htseq.counts.gz,feb22766.4282.47c8.bfe2.7d020b4a15d4.htseq.counts.gz,fef65b57.c58d.4050.8de4.f09f5cd616ce.htseq.counts.gz
0,DRD1,147.0,280.0,160.0,116.0,80.0,78.0,514.0,141.0,264.0,...,48.0,173.0,99.0,75.0,239.0,78.0,72.0,127.0,115.0,113.0
1,DRD2,4112.0,2811.0,17294.0,3172.0,13653.0,5470.0,3053.0,15737.0,5267.0,...,10159.0,2557.0,8638.0,8886.0,9238.0,14085.0,6181.0,9665.0,1771.0,3394.0
2,DRD3,1187.0,1428.0,1939.0,833.0,1555.0,1072.0,1037.0,2285.0,1689.0,...,1475.0,1105.0,1357.0,1286.0,1366.0,1016.0,795.0,2165.0,1472.0,1364.0
3,DRD4,2.0,79.0,83.0,39.0,189.0,79.0,6.0,74.0,82.0,...,129.0,28.0,126.0,96.0,102.0,84.0,38.0,85.0,73.0,153.0
4,DRD5,357.0,237.0,300.0,206.0,258.0,196.0,81.0,335.0,336.0,...,376.0,246.0,158.0,220.0,485.0,249.0,333.0,498.0,390.0,246.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102,CHRM1,5685.0,7649.0,8086.0,3273.0,5948.0,3631.0,4643.0,7445.0,6605.0,...,6696.0,3255.0,5759.0,5961.0,6380.0,6388.0,4604.0,6820.0,6240.0,4534.0
103,CHRM2,2511.0,7302.0,24450.0,2328.0,6852.0,2250.0,4499.0,6888.0,6787.0,...,4054.0,2257.0,4812.0,4628.0,6258.0,6866.0,3075.0,6401.0,2549.0,4360.0
104,CHRM3,1073.0,689.0,1335.0,578.0,689.0,506.0,672.0,859.0,900.0,...,619.0,502.0,494.0,759.0,733.0,770.0,552.0,1186.0,538.0,600.0
105,CHRM4,4702.0,2886.0,2049.0,1334.0,1432.0,2684.0,3089.0,2190.0,2864.0,...,2270.0,1474.0,729.0,1486.0,2581.0,1783.0,2244.0,5239.0,1790.0,1448.0
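The two conversion tests could be combined into a round-trip check with assertions instead of visual inspection. In this sketch `to_grid` and `to_table` are hypothetical stand-ins for convert_pandas_to_numpy and convert_numpy_to_pandas; their signatures are assumptions, not the notebook's actual ones.

```python
import numpy as np
import pandas as pd

def to_grid(table):
    """Drop the gene symbol column and return the counts as a float array."""
    return table.drop("hgnc_symbol", axis=1).to_numpy(dtype=float)

def to_table(grid, symbols, sample_columns):
    """Rebuild a DataFrame from a count grid, putting hgnc_symbol first."""
    table = pd.DataFrame(grid, columns=sample_columns)
    table.insert(0, "hgnc_symbol", symbols)
    return table

original = pd.DataFrame({"hgnc_symbol": ["DRD1", "DRD2"],
                         "s1": [147, 4112],
                         "s2": [280, 2811]})

grid = to_grid(original)
restored = to_table(grid, original["hgnc_symbol"].tolist(), ["s1", "s2"])

# The round trip should preserve shape, gene order, and count values
assert grid.shape == (2, 2)
assert restored["hgnc_symbol"].tolist() == ["DRD1", "DRD2"]
assert np.array_equal(restored[["s1", "s2"]].to_numpy(),
                      original[["s1", "s2"]].to_numpy())
```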


The following functions will test the calculations performed in draw_expression_log_heatmap.

10. Test the natural log conversions we do in draw expression log heatmap

11. Test the log 10 conversions we do in draw expression log heatmap 

12. Test not performing a conversion at all in draw expression log heatmap

13. Test calling compute_zscore in draw expression log heatmap

14. Test calling sorting from draw expression log heatmap

In [88]:
def test_natural_log_conversion(table, log_type):
    htseq_count_values = table.drop('hgnc_symbol', axis=1)

    expression_grid = htseq_count_values.to_numpy(copy=True, dtype=float)
    rnaseq_columns = list(table.columns.values)
    
    if log_type == 'natural':
        expression_grid = expression_grid + 1
        expression_logged = np.log(expression_grid)
        
    return expression_logged

test_natural_log_conversion(rnaseq, 'natural')

array([[4.99721227, 5.63835467, 5.08140436, ..., 4.85203026, 4.75359019,
        4.73619845],
       [8.32190797, 7.94165125, 9.75817272, ..., 9.17636985, 7.47986413,
        8.13005904],
       [7.0800265 , 7.26473018, 7.57044325, ..., 7.68063743, 7.29505642,
        7.21890971],
       ...,
       [6.97914528, 6.5366916 , 7.19743535, ..., 7.07918439, 6.28971557,
        6.39859493],
       [8.45595588, 7.96797318, 7.62559507, ..., 8.56407678, 7.4905294 ,
        7.27862894],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])
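The log conversions can be checked against independent NumPy functions rather than inspected by eye. The sketch below uses a small made-up grid and verifies that np.log(x + 1) agrees with np.log1p(x), that np.expm1 inverts it, and that zero counts map to zero.

```python
import numpy as np

# Small made-up grid; the second row mimics a gene with zero counts
counts = np.array([[147.0, 280.0],
                   [0.0, 0.0]])

natural = np.log(counts + 1)
base_10 = np.log10(counts + 1)

# log(x + 1) should agree with log1p, and expm1 should undo it
assert np.allclose(natural, np.log1p(counts))
assert np.allclose(np.expm1(natural), counts)

# The base-10 version is the natural log divided by ln(10)
assert np.allclose(base_10, natural / np.log(10))

# Zero counts become exactly zero after the +1 shift
assert np.all(natural[1] == 0)
print("log conversion checks passed")
```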

In [89]:
def test_log_10_conversion(table, log_type):
    htseq_count_values = table.drop('hgnc_symbol', axis=1)

    expression_grid = htseq_count_values.to_numpy(copy=True, dtype=float)
    rnaseq_columns = list(table.columns.values)
    
    if log_type == 'base-10':
        expression_grid = expression_grid + 1
        expression_logged = np.log10(expression_grid)
        
    return expression_logged

test_log_10_conversion(rnaseq, 'base-10')

array([[2.17026172, 2.44870632, 2.20682588, ..., 2.10720997, 2.06445799,
        2.05690485],
       [3.61415871, 3.44901532, 4.23792057, ..., 3.98524679, 3.24846372,
        3.53083978],
       [3.07481644, 3.15503223, 3.28780173, ..., 3.33565845, 3.16820275,
        3.13513265],
       ...,
       [3.03100428, 2.83884909, 3.12580646, ..., 3.07445072, 2.73158877,
        2.77887447],
       [3.67237498, 3.46044678, 3.31175386, ..., 3.71933129, 3.25309559,
        3.16106839],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [91]:
def test_no_conversion(table, log_type):
    htseq_count_values = table.drop('hgnc_symbol', axis=1)

    expression_grid = htseq_count_values.to_numpy(copy=True, dtype=float)
    rnaseq_columns = list(table.columns.values)
    
    if log_type == 'natural':
        expression_grid = expression_grid + 1
        expression_logged = np.log(expression_grid)
    elif log_type == 'base-10':
        expression_grid = expression_grid + 1
        expression_logged = np.log10(expression_grid)
    else:
        expression_logged = expression_grid
    
    return expression_grid == expression_logged

test_no_conversion(rnaseq, '')

array([[ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       ...,
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True]])

In [103]:
def test_compute_zscore_no_conversion(table, log_type, compute_zscore, precalculated_zscore):
    htseq_count_values = table.drop('hgnc_symbol', axis=1)

    expression_grid = htseq_count_values.to_numpy(copy=True, dtype=float)
    rnaseq_columns = list(table.columns.values)
    
    if log_type == 'natural':
        expression_grid = expression_grid + 1
        expression_logged = np.log(expression_grid)
    elif log_type == 'base-10':
        expression_grid = expression_grid + 1
        expression_logged = np.log10(expression_grid)
    else:
        expression_logged = expression_grid
    
    if compute_zscore:
        expression_logged = z_score(expression_logged)
        
    return expression_logged == precalculated_zscore
        
test_compute_zscore_no_conversion(rnaseq, '', True, rnaseq_zscore)

array([[ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       ...,
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True]])

In [113]:
def test_log_heatmap_sorting(table, sort):
    htseq_count_values = table.drop('hgnc_symbol', axis=1)

    expression_grid = htseq_count_values.to_numpy(copy=True, dtype=float)
    rnaseq_columns = list(table.columns.values)
    
    expression_logged = expression_grid
    if sort:
        expression_logged_pandas = convert_numpy_to_pandas(expression_logged, rnaseq_columns[1:])
        expression_logged_pandas_sorted = sort_table(expression_logged_pandas)
        y_axis_list = expression_logged_pandas_sorted['hgnc_symbol'].tolist()
        expression_logged = convert_pandas_to_numpy(expression_logged_pandas_sorted)
        
    return expression_logged_pandas_sorted

test_log_heatmap_sorting(rnaseq, True)

Unnamed: 0,hgnc_symbol,X6423474d.60d7.4401.8e5b.46a3fbde5299.htseq.counts.gz,X0be94b2f.fccb.4482.b0ea.695c101aa65a.htseq.counts.gz,b6aa34d6.2b02.4317.8361.79536c7cb4e6.htseq.counts.gz,X09a677f2.d81d.4c3f.adf9.f8594e064e44.htseq.counts.gz,c19f102d.47a0.48c6.9443.63730d9ea6d1.htseq.counts.gz,X03094067.02d4.40c5.b6fa.bb5180dc7eab.htseq.counts.gz,X98b1beb5.8d4c.45d1.a618.2d43aafa056c.htseq.counts.gz,X0aac5e42.7554.4949.8b90.c16528c71ef8.htseq.counts.gz,X855d4a17.5c83.429d.919b.8c2a8e9bab0b.htseq.counts.gz,...,e38e0ced.093c.44e9.9f3b.7cdd0e6b912e.htseq.counts.gz,X44c3d518.14fa.4d63.b265.d7fc81c398e2.htseq.counts.gz,e7cc80ef.4b87.47d9.bebe.1fb05b5b04a2.htseq.counts.gz,X7bf647f0.c20e.42e6.b7d5.6510a8d066fc.htseq.counts.gz,X4929062b.3127.4038.8313.c20cbd274be4.htseq.counts.gz,b9ab7393.4abb.41ec.9d55.a3dc846c4a93.htseq.counts.gz,X16c63027.f745.41c4.a5e8.f6d9f1fbf1c8.htseq.counts.gz,X0f426284.c121.4860.bb80.8df032b0dea8.htseq.counts.gz,X1f2aa905.5022.4efe.afac.022d1acfdbe5.htseq.counts.gz,X8a799dfa.c1b5.4b13.9c91.6cbfe2abbc9f.htseq.counts.gz
0,DRD4,8,11,12,27,36,39,86,21,19,...,55,99,42,28,34,32,30,25,15,109
1,DRD1,15,3,54,152,59,116,88,55,300,...,276,384,545,462,582,145,432,284,784,211
2,DRD5,122,110,146,127,135,206,207,160,166,...,522,317,891,529,781,387,600,536,1102,643
3,DRD3,709,508,546,691,679,833,926,668,864,...,1851,1822,3031,2584,2477,1518,2597,3341,2503,2720
4,DRD2,814,677,753,2988,1607,3172,6131,1621,7557,...,9562,11122,3782,14711,25699,16597,19489,11817,28453,16498
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102,CHRNA7,2002,1963,1660,841,1642,1123,2245,2209,969,...,2752,3098,2646,3508,3872,4069,2272,4471,4763,4267
103,CHRNA2,599,996,2188,3047,2037,2225,3352,1435,6160,...,10085,12213,6459,8546,7875,7857,5260,4584,5089,7703
104,CHRNG,3526,4172,2261,2662,2449,2378,1970,3364,2785,...,6226,8213,5839,6367,7287,5934,9594,9809,8659,8961
105,CHRM1,749,1874,2547,2839,2545,3273,3235,2621,3696,...,9021,7885,10003,7063,6867,8481,8187,8961,8755,12527


Let's test our various calculations in sort table:

16. Test create_sum_column to ensure that the sum is being calculated across rows

In [114]:
def test_create_sum_column(table):
    rnaseq_orig = table.copy()
    
    excluded = rnaseq_orig.loc[:, 'hgnc_symbol']
    rnaseq_orig.drop('hgnc_symbol', axis=1, inplace=True)
    rnaseq_orig.loc[:, 'Total by row'] = rnaseq_orig.sum(axis=1)
    rnaseq_with_total = pd.concat([excluded.rename('hgnc_symbol'), rnaseq_orig], axis=1)
    
    return rnaseq_with_total

test_create_sum_column(rnaseq)

Unnamed: 0,hgnc_symbol,X00faf8ba.ff90.4214.9d03.6c5e14645d8f.htseq.counts.gz,X0143419f.2abe.4906.bb55.af6010fab05f.htseq.counts.gz,X01f84c45.2058.4e22.b234.52f0a82a97fc.htseq.counts.gz,X03094067.02d4.40c5.b6fa.bb5180dc7eab.htseq.counts.gz,X0349f526.7816.4a7d.9967.1f75dd9ff00a.htseq.counts.gz,X03630a0c.aa97.4e28.bac9.0206fff669cd.htseq.counts.gz,X03761959.a620.440f.bbaa.33bd75afae1c.htseq.counts.gz,X057aa9ac.f22c.4c11.a44d.ad52ae59b4cf.htseq.counts.gz,X05f0ced5.6976.4f43.9be5.fddb3f550adf.htseq.counts.gz,...,f144de50.6126.4912.9c94.824d1eb0fac5.htseq.counts.gz,f2389819.b8fc.460e.821c.01dba313cce1.htseq.counts.gz,f6bd7191.a820.4d86.927a.b4b5f88ebd67.htseq.counts.gz,f748bf78.4dc1.47ad.8611.8186479d3e4b.htseq.counts.gz,f8551a29.d4bd.4954.bf9c.8e10265063de.htseq.counts.gz,f9f63982.b0ee.4cb8.8de5.f885d82137f0.htseq.counts.gz,fcd43085.7338.43fe.bc25.9d87b04e227f.htseq.counts.gz,feb22766.4282.47c8.bfe2.7d020b4a15d4.htseq.counts.gz,fef65b57.c58d.4050.8de4.f09f5cd616ce.htseq.counts.gz,Total by row
6402,DRD1,147,280,160,116,80,78,514,141,264,...,173,99,75,239,78,72,127,115,113,29062
6403,DRD2,4112,2811,17294,3172,13653,5470,3053,15737,5267,...,2557,8638,8886,9238,14085,6181,9665,1771,3394,1218887
6404,DRD3,1187,1428,1939,833,1555,1072,1037,2285,1689,...,1105,1357,1286,1366,1016,795,2165,1472,1364,224709
6405,DRD4,2,79,83,39,189,79,6,74,82,...,28,126,96,102,84,38,85,73,153,11302
6406,DRD5,357,237,300,206,258,196,81,335,336,...,246,158,220,485,249,333,498,390,246,51510
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4266,CHRM1,5685,7649,8086,3273,5948,3631,4643,7445,6605,...,3255,5759,5961,6380,6388,4604,6820,6240,4534,864200
4267,CHRM2,2511,7302,24450,2328,6852,2250,4499,6888,6787,...,2257,4812,4628,6258,6866,3075,6401,2549,4360,971896
4268,CHRM3,1073,689,1335,578,689,506,672,859,900,...,502,494,759,733,770,552,1186,538,600,123923
4271,CHRM4,4702,2886,2049,1334,1432,2684,3089,2190,2864,...,1474,729,1486,2581,1783,2244,5239,1790,1448,371541
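A small hand-computed case makes it easy to confirm that the total is taken across each row (axis=1) rather than down each column. This is a sketch: `add_row_total` is a hypothetical stand-in for create_sum_column, and the two sample columns are made up.

```python
import pandas as pd

def add_row_total(table):
    """Append a 'Total by row' column holding the sum of each gene's counts."""
    numeric = table.drop("hgnc_symbol", axis=1)
    return table.assign(**{"Total by row": numeric.sum(axis=1)})

table = pd.DataFrame({"hgnc_symbol": ["DRD1", "DRD2"],
                      "s1": [147, 4112],
                      "s2": [280, 2811]})

with_total = add_row_total(table)

# 147 + 280 = 427 and 4112 + 2811 = 6923
print(with_total["Total by row"].tolist())  # [427, 6923]
```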


17. Test sorting rows 1 checks that sorting only changes the order of the rows: each hgnc symbol still goes with its original values.

In [128]:
def test_sorting_rows_1(table):
    
    #### --- Test Setup ---
    rnaseq_orig = table.copy()
    
    excluded = rnaseq_orig.loc[:, 'hgnc_symbol']
    rnaseq_orig.drop('hgnc_symbol', axis=1, inplace=True)
    rnaseq_orig.loc[:, 'Total by row'] = rnaseq_orig.sum(axis=1)
    rnaseq_with_total = pd.concat([excluded.rename('hgnc_symbol'), rnaseq_orig], axis=1)
    rnaseq_with_total = rnaseq_with_total.reset_index(drop=True)
    
    table_columns = list(table.columns.values)
    
    rnaseq_sorted = pd.DataFrame(columns=table_columns)
    
    # sorts the rows section by section, based on the size of each family of neurotransmitters
    index_begin = 0
    index_end = 0
    appended_data = []
    for family, gene_list in neuro_genes_dict.items():
        index_end = len(gene_list) + index_begin
        to_sort = rnaseq_with_total[index_begin : index_end].sort_values('Total by row', ascending=True)
        appended_data.append(to_sort)
        index_begin = index_end
    # the families were sorted as separate dataframes and then concat together
    rnaseq_sorted = pd.concat(appended_data)
    
    t_merged = pd.merge(rnaseq_with_total, rnaseq_sorted, on=list(rnaseq_with_total.columns.values), how='inner')
    
    return t_merged == rnaseq_with_total

test_sorting_rows_1(rnaseq)

Unnamed: 0,hgnc_symbol,X00faf8ba.ff90.4214.9d03.6c5e14645d8f.htseq.counts.gz,X0143419f.2abe.4906.bb55.af6010fab05f.htseq.counts.gz,X01f84c45.2058.4e22.b234.52f0a82a97fc.htseq.counts.gz,X03094067.02d4.40c5.b6fa.bb5180dc7eab.htseq.counts.gz,X0349f526.7816.4a7d.9967.1f75dd9ff00a.htseq.counts.gz,X03630a0c.aa97.4e28.bac9.0206fff669cd.htseq.counts.gz,X03761959.a620.440f.bbaa.33bd75afae1c.htseq.counts.gz,X057aa9ac.f22c.4c11.a44d.ad52ae59b4cf.htseq.counts.gz,X05f0ced5.6976.4f43.9be5.fddb3f550adf.htseq.counts.gz,...,f144de50.6126.4912.9c94.824d1eb0fac5.htseq.counts.gz,f2389819.b8fc.460e.821c.01dba313cce1.htseq.counts.gz,f6bd7191.a820.4d86.927a.b4b5f88ebd67.htseq.counts.gz,f748bf78.4dc1.47ad.8611.8186479d3e4b.htseq.counts.gz,f8551a29.d4bd.4954.bf9c.8e10265063de.htseq.counts.gz,f9f63982.b0ee.4cb8.8de5.f885d82137f0.htseq.counts.gz,fcd43085.7338.43fe.bc25.9d87b04e227f.htseq.counts.gz,feb22766.4282.47c8.bfe2.7d020b4a15d4.htseq.counts.gz,fef65b57.c58d.4050.8de4.f09f5cd616ce.htseq.counts.gz,Total by row
0,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
1,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
2,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
3,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
4,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
103,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
104,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
105,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True


18. Test sorting rows 2 checks that rows are sorted by increasing sum, and only within each family.

In [138]:
def test_sorting_rows_2_helper(to_sort):
    with pd.option_context('display.max_rows', None, 'display.max_columns', None):
        display(to_sort)

def test_sorting_rows_2(table):
    rnaseq_orig = table.copy()
    
    # sort table wasn't working right with decimals, so hgnc symbol column was removed
    excluded = rnaseq_orig.loc[:, 'hgnc_symbol']
    rnaseq_orig.drop('hgnc_symbol', axis=1, inplace=True)
    rnaseq_orig.loc[:, 'Total by row'] = rnaseq_orig.sum(axis=1)
    rnaseq_with_total = pd.concat([excluded.rename('hgnc_symbol'), rnaseq_orig], axis=1)
    
    table_columns = list(table.columns.values)
    
    # SORTING THE ROWS -------
    rnaseq_sorted = pd.DataFrame(columns=table_columns)
    
    # sorts the rows section by section, based on the size of each family of neurotransmitters
    index_begin = 0
    index_end = 0
    appended_data = []
    for family, gene_list in neuro_genes_dict.items():
        index_end = len(gene_list) + index_begin
        to_sort = rnaseq_with_total[index_begin : index_end].sort_values('Total by row', ascending=True)
        test_sorting_rows_2_helper(to_sort[['hgnc_symbol', 'Total by row']])
        index_begin = index_end
    # the families were sorted as separate dataframes and then concat together
    
    
test_sorting_rows_2(rnaseq)

Unnamed: 0,hgnc_symbol,Total by row
6405,DRD4,11302
6402,DRD1,29062
6406,DRD5,51510
6404,DRD3,224709
6403,DRD2,1218887


Unnamed: 0,hgnc_symbol,Total by row
9533,GRID1,11
9545,GRIN2A,20
9576,GRM8,22
9546,GRIN2B,126
9538,GRIK1,220
9540,GRIK2,788
9568,GRM4,10300
9571,GRM6,11911
9565,GRM2,21260
9550,GRIN3B,29480


Unnamed: 0,hgnc_symbol,Total by row
8669,GABRQ,3
8646,GABBR1,98
8668,GABRP,229
8653,GABRA1,825
8658,GABRA6,2375
8655,GABRA3,7569
8663,GABRE,15476
8672,GABRR3,20692
8670,GABRR1,22522
8657,GABRA5,34609


Unnamed: 0,hgnc_symbol,Total by row
521,ADRA2B,412
519,ADRA1D,1224
522,ADRA2C,2174
523,ADRB1,6577
525,ADRB3,8695
520,ADRA2A,23252
524,ADRB2,185517
518,ADRA1B,413763
517,ADRA1A,728596


Unnamed: 0,hgnc_symbol,Total by row
32876,TACR2,12646
32877,TACR3,16983
32875,TACR1,164411


Unnamed: 0,hgnc_symbol,Total by row
10946,HTR2C,8
10952,HTR3E,1096
10951,HTR3D,1696
10937,HTR1A,23168
10955,HTR5A,25389
10949,HTR3C,30549
10943,HTR2A,34116
10947,HTR3A,55848
10954,HTR4,60819
10945,HTR2B,76956


Unnamed: 0,hgnc_symbol,Total by row
10749,HRH1,10
10750,HRH2,293527
10752,HRH4,413852
10751,HRH3,562289


Unnamed: 0,hgnc_symbol,Total by row
4272,CHRM5,4
4277,CHRNA4,919
4276,CHRNA3,4705
4282,CHRNB1,14564
4274,CHRNA10,24480
4283,CHRNB2,29703
4285,CHRNB4,36072
4278,CHRNA5,42784
4273,CHRNA1,105704
4281,CHRNA9,106242
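The per-family ordering shown in these tables could also be checked programmatically instead of by eye. This sketch assumes the sorted table stores the families as consecutive blocks of known sizes; `sorted_within_families` is a hypothetical helper, and the totals are copied from the first and fifth tables above.

```python
import pandas as pd

def sorted_within_families(table, family_sizes):
    """Return True when 'Total by row' is non-decreasing inside each family block."""
    start = 0
    for size in family_sizes:
        block = table["Total by row"].iloc[start:start + size]
        if not block.is_monotonic_increasing:
            return False
        start += size
    return True

table = pd.DataFrame({
    "hgnc_symbol": ["DRD4", "DRD1", "DRD5", "TACR2", "TACR3", "TACR1"],
    "Total by row": [11302, 29062, 51510, 12646, 16983, 164411],
})

# Two families of three genes each: sorted within each block
print(sorted_within_families(table, [3, 3]))  # True
# Treated as one block, the totals are not globally sorted
print(sorted_within_families(table, [6]))     # False
```

The second call shows that the ordering holds only within families, which is the behavior the test describes.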
