# Question 1: Only a small proportion of genuine miRNA-target interactions have been experimentally identified; at the same time, miRTarBase could also contain some proportion of false positives. Does our data reflect these observations? How and to what extent do miRTarBase interactions overlap with our data?

### Or: What’s the relationship between miRTarBase-reported interactions (of both regular and strong support type), and magnitudes of within- and pan-cancer correlations in the dataset?

In [1]:
import csv
import datetime
import datalab.bigquery as bq
import google.datalab.storage as storage
import io
import logging
import math as m
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import re
import scipy.stats as stats
import seaborn as sns
import statsmodels.sandbox.stats.multicomp as multicomp
import time

## Function definitions

#### Utils

In [2]:
def read_file(bucket, filepath, **kwargs):
  uri = bucket.object(filepath).uri
  get_ipython().run_line_magic('gcs', 'read --object ' + uri + ' --variable csv_data')
  return pd.read_csv(io.BytesIO(csv_data), **kwargs)

In [3]:
def write_df_to_csv(df, index_label, csv_filepath):
  df.to_csv('temp.csv', index_label = index_label)
  !gsutil cp 'temp.csv' $csv_filepath

In [4]:
def write_series_to_csv(series, index_label, csv_filepath):
  series.to_csv('temp.csv', index_label = index_label)
  !gsutil cp 'temp.csv' $csv_filepath

In [5]:
def get_corrs_df(bucket, filepath, index_col):
  df = read_file(bucket, filepath)
  df.set_index(index_col, inplace=True)
  return df

#### Analysis

In [6]:
def get_rank_indices(df, ranks):
  df_flattened = df.values.flatten()
  df_flattened_argsorted = np.argsort(df_flattened)
  df_flattened_sorted_nan_idxs = np.where(np.isnan(np.sort(df_flattened)))[0]
  if df_flattened_sorted_nan_idxs.size > 0:
    df_flattened_argsorted = df_flattened_argsorted[:df_flattened_sorted_nan_idxs[0]]
  return zip(map(lambda i: df.index[df_flattened_argsorted[i] / df.shape[1]], ranks), map(lambda i: df.columns[df_flattened_argsorted[i] % df.shape[1]], ranks))

## Data: Preliminaries

In [7]:
bucket = storage.Bucket('yfl-mirna')

### Correlation data

#### Across all samples

In [8]:
miRNAmRNA_corrs = read_file(bucket, 'explore/miRTar/pearson-corrs/data/mirtar-corrs.csv')
miRNAmRNA_corrs.set_index('miRNA', inplace=True)
miRNAmRNA_corrs_np = miRNAmRNA_corrs.values

In [9]:
miRNAmRNA_log_corrs = read_file(bucket, 'explore/miRTar/pearson-corrs/data/mirtar-log-corrs.csv')
miRNAmRNA_log_corrs.set_index('miRNA', inplace=True)
miRNAmRNA_log_corrs_np = miRNAmRNA_log_corrs.values

In [10]:
miRNAmRNA_spearman_corrs = read_file(bucket, 'explore/miRTar/spearman-corrs/data/mirtar-spearman-corrs.csv')
miRNAmRNA_spearman_corrs.set_index('miRNA', inplace=True)
miRNAmRNA_spearman_corrs_np = miRNAmRNA_spearman_corrs.values

### Sample metadata

In [11]:
sample_metadata = read_file(bucket, 'data/sample/PanCanAtlas_miRNA_sample_information_list.txt', delimiter='\t')

In [12]:
sample_metadata.rename(index=str, columns={'id': 'sample'}, inplace=True)
sample_metadata.set_index('sample', inplace=True)
sample_metadata.index = sample_metadata.index.map(lambda x: '-'.join(x.split('-')[0:4]))
sample_metadata.reset_index(inplace=True)
sample_metadata.drop_duplicates(subset='sample', keep='first', inplace=True)
sample_metadata.set_index('sample', inplace=True)

### Sample miRNA and mRNA expressions with cancer type

In [13]:
type1_sample_disease_mirtars = read_file(bucket, 'data/miRTar/type1-sample_disease_miRNAmRNA-exprs.csv')

In [14]:
type1_sample_disease_mirtars.set_index('sample', inplace=True)

In [15]:
cancer_types_and_counts = type1_sample_disease_mirtars['Disease'].value_counts().sort_values()
cancer_types_and_counts_with_pancan = cancer_types_and_counts.append(pd.Series([type1_sample_disease_mirtars.shape[0]], ['PAN']))
cancer_types = cancer_types_and_counts.index
cancer_types_list = cancer_types_and_counts_with_pancan.index

In [16]:
cancer_types_and_counts

CHOL      36
DLBC      47
UCS       56
KICH      65
ACC       78
UVM       80
MESO      87
SKCM      97
THYM     120
TGCT     149
READ     152
PAAD     177
PCPG     178
ESCA     181
SARC     254
KIRP     285
OV       301
CESC     303
LIHC     364
BLCA     405
STAD     409
COAD     421
LUSC     464
PRAD     490
KIRC     498
THCA     500
LUAD     505
LGG      510
HNSC     511
UCEC     521
BRCA    1066
Name: Disease, dtype: int64

### miRNA-mRNA miRTarBase masks

In [17]:
miRNAmRNA_pancan_corrs_in_mirtarbase_mask = read_file(bucket, 'data/miRTar/miRNAmRNA-corrs-pancan_miRTarBase_mask.csv')
miRNAmRNA_pancan_corrs_in_mirtarbase_mask.set_index('miRNA', inplace=True)

In [18]:
miRNAmRNA_pancan_corrs_mirtarbase_strong_mask = read_file(bucket, 'data/miRTar/miRNAmRNA-corrs-pancan_miRTarBase-strong_mask.csv')
miRNAmRNA_pancan_corrs_mirtarbase_strong_mask.set_index('miRNA', inplace=True)

In [19]:
miRNAmRNAs_in_mirtarbase_mask = read_file(bucket, 'data/miRTar/miRNAmRNA-corrs-full_miRTarBase_mask.csv')
miRNAmRNAs_in_mirtarbase_mask.set_index('miRNA', inplace=True)

In [20]:
miRNAmRNAs_mirtarbase_strong_mask = read_file(bucket, 'data/miRTar/miRNAmRNA-corrs-full_miRTarBase-strong_mask.csv')
miRNAmRNAs_mirtarbase_strong_mask.set_index('miRNA', inplace=True)

In [21]:
all_miRNAs = miRNAmRNAs_in_mirtarbase_mask.index
all_mRNAs = miRNAmRNAs_in_mirtarbase_mask.columns

In [22]:
# The number of miRTarBase interactions as a % of all possible miRNA-mRNA pairs
miRNAmRNAs_in_mirtarbase_mask.sum().sum() * 1.0 / miRNAmRNA_spearman_corrs.size

0.011448552987767474

## Use hypergeometric test to test for enrichment of miRTarBase relationships in miRNA-mRNA pairs with top n anticorrelations: n = 10, 50, 100, 250, 500, 750, 1000, 2000

### Definitions

In [24]:
hypergeom_test_ns = [10, 50, 100, 250, 500, 750, 1000, 2000]

In [25]:
hypergeom_test_n_strs = map(str, hypergeom_test_ns)

In [26]:
miRNAmRNA_corrs_nonnull = ~miRNAmRNA_corrs.isnull()

In [27]:
# even though all mirtar_corrs (computed on type-1 samples only) happen to be non-null
N = miRNAmRNA_corrs_nonnull.sum().sum()
N == miRNAmRNA_corrs.size

True

In [28]:
def get_ranks_intersection_counts(data, mask, rank_ns):
  return map(lambda n: sum(i == True for i in map(lambda idx: mask.loc[idx[0], idx[1]], get_rank_indices(data, list(range(n))))), rank_ns)

In [29]:
def get_top_n_miRNAs(data, miRNAs, n):
  miRNA_mask = pd.Series(None, miRNAs)
  miRNA_mask[pd.Series(map(lambda idx: idx[0], get_rank_indices(data, list(range(n))))).unique()] = True
  return miRNA_mask

In [30]:
def get_top_n_intersection_miRNAs_mask(data, mask, n):
  miRNA_mask = pd.Series(None, mask.index)
  miRNA_mask[[mir for mir in np.unique(np.array(map(lambda idx: idx[0] if mask.loc[idx[0], idx[1]] else None, get_rank_indices(data, list(range(n)))))) if mir is not None]] = True
  return miRNA_mask

In [31]:
def bonferroni_adj(pvals):
  return multicomp.multipletests(pvals, method='bonferroni')[1]

In [32]:
def benjaminihochberg(pvals):
  return multicomp.multipletests(pvals, method='fdr_bh')[1]

In [33]:
def benjaminihochberg_2stage(pvals):
  return multicomp.multipletests(pvals, method='fdr_tsbh')[1]

### All miRTarBase relationships

#### Pan-cancer (implicitly relying on all N miRNA-mRNA pairs having non-null correlations)

In [34]:
mirtars_nonnull_count = (miRNAmRNA_corrs_nonnull & miRNAmRNA_pancan_corrs_in_mirtarbase_mask).sum().sum()

In [35]:
hypergeom_test_n_rvs = map(lambda n: stats.hypergeom(N, mirtars_nonnull_count, n), hypergeom_test_ns)

In [36]:
cancer_type_top_n_log_corrs_mirtarbase_counts = pd.DataFrame(None, hypergeom_test_n_strs, cancer_types_list)
cancer_type_top_n_log_corrs_mirtarbase_pvals = pd.DataFrame(None, hypergeom_test_n_strs, cancer_types_list)

In [37]:
cancer_type_top_n_log_corrs_mirtarbase_counts['PAN'] = get_ranks_intersection_counts(miRNAmRNA_log_corrs, miRNAmRNA_pancan_corrs_in_mirtarbase_mask, hypergeom_test_ns)
cancer_type_top_n_log_corrs_mirtarbase_pvals['PAN'] = map(lambda i: 1 - hypergeom_test_n_rvs[i].cdf([cancer_type_top_n_log_corrs_mirtarbase_counts['PAN'][i]])[0],
                                                          range(len(hypergeom_test_ns)))

In [38]:
cancer_type_top_n_log_corrs_mirtarbase_counts_log2OE = pd.DataFrame(None, hypergeom_test_n_strs, cancer_types_list)
cancer_type_top_n_log_corrs_mirtarbase_counts_log2OE['PAN'] = cancer_type_top_n_log_corrs_mirtarbase_counts['PAN'] / map(lambda n: n * 1.0 * mirtars_nonnull_count / N, hypergeom_test_ns)
cancer_type_top_n_log_corrs_mirtarbase_counts_log2OE['PAN'] = cancer_type_top_n_log_corrs_mirtarbase_counts_log2OE['PAN'].apply(lambda x: m.log(x + 1, 2))

In [39]:
cancer_type_top1000_log_corrs_mtb_miRNAs = pd.DataFrame(None, cancer_types_list, miRNAmRNA_log_corrs.index)
cancer_type_top1000_log_corrs_mtb_miRNAs.loc['PAN'] = get_top_n_intersection_miRNAs_mask(miRNAmRNA_log_corrs, miRNAmRNAs_in_mirtarbase_mask, 1000)

In [41]:
cancer_type_top1000_log_corrs_miRNAs = pd.DataFrame(None, cancer_types_list, miRNAmRNA_log_corrs.index)
cancer_type_top1000_log_corrs_miRNAs.loc['PAN'] = get_top_n_miRNAs(miRNAmRNA_log_corrs, miRNAmRNA_log_corrs.index, 1000)

In [None]:
cancer_type_top_n_spearman_corrs_mirtarbase_counts = pd.DataFrame(None, hypergeom_test_n_strs, cancer_types_list)
cancer_type_top_n_spearman_corrs_mirtarbase_pvals = pd.DataFrame(None, hypergeom_test_n_strs, cancer_types_list)

In [None]:
cancer_type_top_n_spearman_corrs_mirtarbase_counts['PAN'] = get_ranks_intersection_counts(miRNAmRNA_spearman_corrs, miRNAmRNA_pancan_corrs_in_mirtarbase_mask, hypergeom_test_ns)
cancer_type_top_n_spearman_corrs_mirtarbase_pvals['PAN'] = map(lambda i: 1 - hypergeom_test_n_rvs[i].cdf([cancer_type_top_n_spearman_corrs_mirtarbase_counts['PAN'][i]])[0],
                                                               range(len(hypergeom_test_ns)))

In [None]:
cancer_type_top_n_spearman_corrs_mirtarbase_counts_log2OE = pd.DataFrame(None, hypergeom_test_n_strs, cancer_types_list)
cancer_type_top_n_spearman_corrs_mirtarbase_expecteds = map(lambda n: n * 1.0 * mirtars_nonnull_count / N, hypergeom_test_ns)
cancer_type_top_n_spearman_corrs_mirtarbase_counts_log2OE['PAN'] = cancer_type_top_n_spearman_corrs_mirtarbase_counts['PAN'] / cancer_type_top_n_spearman_corrs_mirtarbase_expecteds
cancer_type_top_n_spearman_corrs_mirtarbase_counts_log2OE['PAN'] = cancer_type_top_n_spearman_corrs_mirtarbase_counts_log2OE['PAN'].apply(lambda x: m.log(x + 1, 2))

In [None]:
cancer_type_top1000_spearman_corrs_mtb_miRNAs = pd.DataFrame(None, cancer_types_list, miRNAmRNA_spearman_corrs.index)
cancer_type_top1000_spearman_corrs_mtb_miRNAs.loc['PAN'] = get_top_n_intersection_miRNAs_mask(miRNAmRNA_spearman_corrs, miRNAmRNAs_in_mirtarbase_mask, 1000)

In [None]:
cancer_type_top1000_spearman_corrs_miRNAs = pd.DataFrame(None, cancer_types_list, miRNAmRNA_spearman_corrs.index)
cancer_type_top1000_spearman_corrs_miRNAs.loc['PAN'] = get_top_n_miRNAs(miRNAmRNA_spearman_corrs, miRNAmRNA_spearman_corrs.index, 1000)

In [None]:
cancer_type_hypergeom_test_params = pd.DataFrame(None, cancer_types_list, ['N', 'K'])
cancer_type_hypergeom_test_params.loc['PAN'] = [N, mirtars_nonnull_count]

#### Within-cancer

In [95]:
for cancer_type in cancer_types:
    type_miRNAmRNA_log_corrs = get_corrs_df(bucket, 'explore/miRTar/pearson-corrs/data/mirtar-log-corrs_' + cancer_type + '.csv', 'miRNA')
    miRNAmRNA_corrs_nonnull = ~type_miRNAmRNA_log_corrs.isnull()
    N = miRNAmRNA_corrs_nonnull.sum().sum()
    mirtars_nonnull_count = (miRNAmRNA_corrs_nonnull & miRNAmRNAs_in_mirtarbase_mask).sum().sum()
    cancer_type_hypergeom_test_params.loc[cancer_type] = [N, mirtars_nonnull_count]
    hypergeom_test_n_rvs = map(lambda n: stats.hypergeom(N, mirtars_nonnull_count, n), hypergeom_test_ns)
    cancer_type_top_n_log_corrs_mirtarbase_counts[cancer_type] = get_ranks_intersection_counts(type_miRNAmRNA_log_corrs, miRNAmRNAs_in_mirtarbase_mask, hypergeom_test_ns)
    cancer_type_top_n_log_corrs_mirtarbase_pvals[cancer_type] = map(lambda i: 1 - hypergeom_test_n_rvs[i].cdf([cancer_type_top_n_log_corrs_mirtarbase_counts[cancer_type][i]])[0],
                                                                    range(len(hypergeom_test_ns)))
    cancer_type_top_n_corrs_mirtarbase_expecteds = map(lambda n: n * 1.0 * mirtars_nonnull_count / N, hypergeom_test_ns)
    cancer_type_top_n_log_corrs_mirtarbase_counts_log2OE[cancer_type] = cancer_type_top_n_log_corrs_mirtarbase_counts[cancer_type] / cancer_type_top_n_corrs_mirtarbase_expecteds
    cancer_type_top_n_log_corrs_mirtarbase_counts_log2OE[cancer_type] = cancer_type_top_n_log_corrs_mirtarbase_counts_log2OE[cancer_type].apply(lambda x: m.log(x + 1, 2))
    cancer_type_top1000_log_corrs_mtb_miRNAs.loc[cancer_type] = get_top_n_intersection_miRNAs_mask(type_miRNAmRNA_log_corrs, miRNAmRNAs_in_mirtarbase_mask, 1000)
    cancer_type_top1000_log_corrs_miRNAs.loc[cancer_type] = get_top_n_miRNAs(type_miRNAmRNA_log_corrs, miRNAmRNA_log_corrs.index, 1000)
    type_miRNAmRNA_spearman_corrs = get_corrs_df(bucket, 'explore/miRTar/spearman-corrs/data/mirtar-spearman-corrs_' + cancer_type + '.csv', 'miRNA')
    cancer_type_top_n_spearman_corrs_mirtarbase_counts[cancer_type] = get_ranks_intersection_counts(type_miRNAmRNA_spearman_corrs, miRNAmRNAs_in_mirtarbase_mask, hypergeom_test_ns)
    def get_top_n_spearman_corrs_hypergeom_pval(i):
      return 1 - hypergeom_test_n_rvs[i].cdf([cancer_type_top_n_spearman_corrs_mirtarbase_counts[cancer_type][i]])[0]
    cancer_type_top_n_spearman_corrs_mirtarbase_pvals[cancer_type] = map(get_top_n_spearman_corrs_hypergeom_pval, range(len(hypergeom_test_ns)))
    cancer_type_top_n_spearman_corrs_mirtarbase_counts_log2OE[cancer_type] = cancer_type_top_n_spearman_corrs_mirtarbase_counts[cancer_type] / cancer_type_top_n_corrs_mirtarbase_expecteds
    cancer_type_top_n_spearman_corrs_mirtarbase_counts_log2OE[cancer_type] = cancer_type_top_n_spearman_corrs_mirtarbase_counts_log2OE[cancer_type].apply(lambda x: m.log(x + 1, 2))
    cancer_type_top1000_spearman_corrs_mtb_miRNAs.loc[cancer_type] = get_top_n_intersection_miRNAs_mask(type_miRNAmRNA_spearman_corrs, miRNAmRNAs_in_mirtarbase_mask, 1000)
    cancer_type_top1000_spearman_corrs_miRNAs.loc[cancer_type] = get_top_n_miRNAs(type_miRNAmRNA_spearman_corrs, miRNAmRNA_spearman_corrs.index, 1000)

In [77]:
cancer_type_top_n_log_corrs_mirtarbase_pvals_bf_adj = cancer_type_top_n_log_corrs_mirtarbase_pvals.loc[:, cancer_types_list].apply(bonferroni_adj, axis=1)
cancer_type_top_n_log_corrs_mirtarbase_pvals_bh_adj = cancer_type_top_n_log_corrs_mirtarbase_pvals.loc[:, cancer_types_list].apply(benjaminihochberg, axis=1)
cancer_type_top_n_log_corrs_mirtarbase_pvals_2sbh_adj = cancer_type_top_n_log_corrs_mirtarbase_pvals.loc[:, cancer_types_list].apply(benjaminihochberg_2stage, axis=1)

In [None]:
write_df_to_csv(cancer_type_top_n_log_corrs_mirtarbase_counts, 'n', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-n-log-corrs-mirtarbase_counts.csv')
write_df_to_csv(cancer_type_top_n_log_corrs_mirtarbase_pvals, 'n', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-n-log-corrs-mirtarbase_pvals.csv')
write_df_to_csv(cancer_type_top_n_log_corrs_mirtarbase_pvals_bf_adj, 'n', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-n-log-corrs-mirtarbase_pvals-bf-adj.csv')
write_df_to_csv(cancer_type_top_n_log_corrs_mirtarbase_pvals_bh_adj, 'n', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-n-log-corrs-mirtarbase_pvals-bh-adj.csv')
write_df_to_csv(cancer_type_top_n_log_corrs_mirtarbase_pvals_2sbh_adj, 'n', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-n-log-corrs-mirtarbase_pvals-2sbh-adj.csv')
write_df_to_csv(cancer_type_top_n_log_corrs_mirtarbase_counts_log2OE, 'n', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-n-log-corrs-mirtarbase_counts_log2OE.csv')

In [None]:
write_df_to_csv(cancer_type_top1000_log_corrs_mtb_miRNAs, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-1000-log-corrs-mirtarbase_miRNAs.csv')
write_df_to_csv(cancer_type_top1000_log_corrs_miRNAs, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-1000-log-corrs_miRNAs.csv')

In [82]:
cancer_type_top_n_spearman_corrs_mirtarbase_pvals_bf_adj = cancer_type_top_n_spearman_corrs_mirtarbase_pvals.loc[:, cancer_types_list].apply(bonferroni_adj, axis=1)
cancer_type_top_n_spearman_corrs_mirtarbase_pvals_bh_adj = cancer_type_top_n_spearman_corrs_mirtarbase_pvals.loc[:, cancer_types_list].apply(benjaminihochberg, axis=1)
cancer_type_top_n_spearman_corrs_mirtarbase_pvals_2sbh_adj = cancer_type_top_n_spearman_corrs_mirtarbase_pvals.loc[:, cancer_types_list].apply(benjaminihochberg_2stage, axis=1)

In [None]:
write_df_to_csv(cancer_type_top_n_spearman_corrs_mirtarbase_counts, 'n', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-n-spearman-corrs-mirtarbase_counts.csv')
write_df_to_csv(cancer_type_top_n_spearman_corrs_mirtarbase_pvals, 'n', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-n-spearman-corrs-mirtarbase_pvals.csv')
write_df_to_csv(cancer_type_top_n_spearman_corrs_mirtarbase_pvals_bf_adj, 'n', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-n-spearman-corrs-mirtarbase_pvals-bf-adj.csv')
write_df_to_csv(cancer_type_top_n_spearman_corrs_mirtarbase_pvals_bh_adj, 'n', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-n-spearman-corrs-mirtarbase_pvals-bh-adj.csv')
write_df_to_csv(cancer_type_top_n_spearman_corrs_mirtarbase_pvals_2sbh_adj, 'n', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-n-spearman-corrs-mirtarbase_pvals-2sbh-adj.csv')
write_df_to_csv(cancer_type_top_n_spearman_corrs_mirtarbase_counts_log2OE, 'n', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-n-spearman-corrs-mirtarbase_counts_log2OE.csv')

In [None]:
write_df_to_csv(cancer_type_top1000_spearman_corrs_mtb_miRNAs, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-1000-spearman-corrs-mirtarbase_miRNAs.csv')
write_df_to_csv(cancer_type_top1000_spearman_corrs_miRNAs, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-1000-spearman-corrs_miRNAs.csv')

In [128]:
cancer_type_top1000_mtb_stats = pd.DataFrame({'log_corrs_mtb_miRNAs': cancer_type_top1000_log_corrs_mtb_miRNAs.sum(axis=1),
                                              'log_corrs_all_miRNAs': cancer_type_top1000_log_corrs_miRNAs.sum(axis=1),
                                              'log_corrs_mtb': cancer_type_top_n_log_corrs_mirtarbase_counts.loc[1000].reindex(cancer_types_list),
                                              'log_corrs_mtb_log2OE': cancer_type_top_n_log_corrs_mirtarbase_counts_log2OE.loc[1000].reindex(cancer_types_list),
                                              'spearman_corrs_mtb_miRNAs': cancer_type_top1000_spearman_corrs_mtb_miRNAs.sum(axis=1),
                                              'spearman_corrs_all_miRNAs': cancer_type_top1000_spearman_corrs_miRNAs.sum(axis=1),                            
                                              'spearman_corrs_mtb': cancer_type_top_n_spearman_corrs_mirtarbase_counts.loc[1000].reindex(cancer_types_list),
                                              'spearman_corrs_mtb_log2OE': cancer_type_top_n_spearman_corrs_mirtarbase_counts_log2OE.loc[1000].reindex(cancer_types_list)})
cancer_type_top1000_mtb_stats = cancer_type_top1000_mtb_stats.reindex(columns=['log_corrs_mtb_miRNAs', 'log_corrs_all_miRNAs', 'log_corrs_mtb', 'log_corrs_mtb_log2OE',
                                                                               'spearman_corrs_mtb_miRNAs', 'spearman_corrs_all_miRNAs', 'spearman_corrs_mtb', 'spearman_corrs_mtb_log2OE'])
write_df_to_csv(cancer_type_top1000_mtb_stats.applymap(lambda x: int(x)), 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancertype_top1000-mtb_stats.csv')

Copying file://temp.csv [Content-Type=text/csv]...
/ [1 files][  1.0 KiB/  1.0 KiB]                                                
Operation completed over 1 objects/1.0 KiB.                                      


In [None]:
write_df_to_csv(cancer_type_hypergeom_test_params, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_mtbase_hypergeom-test-params.csv')

### Only strong miRTarBase relationships (at least 1 non-weak-support-type entry)

#### Pan-cancer

In [42]:
N = miRNAmRNA_corrs_nonnull.sum().sum()

In [43]:
mirtars_strong_nonnull_count = (miRNAmRNA_corrs_nonnull & miRNAmRNA_pancan_corrs_mirtarbase_strong_mask).sum().sum()

In [44]:
hypergeom_test_n_rvs = map(lambda n: stats.hypergeom(N, mirtars_strong_nonnull_count, n), hypergeom_test_ns)

In [37]:
cancer_type_top_n_log_corrs_mtbase_strong_counts = pd.DataFrame(None, hypergeom_test_n_strs, cancer_types_list)
cancer_type_top_n_log_corrs_mtbase_strong_pvals = pd.DataFrame(None, hypergeom_test_n_strs, cancer_types_list)

In [38]:
cancer_type_top_n_log_corrs_mtbase_strong_counts['PAN'] = get_ranks_intersection_counts(miRNAmRNA_log_corrs, miRNAmRNA_pancan_corrs_mirtarbase_strong_mask, hypergeom_test_ns)
cancer_type_top_n_log_corrs_mtbase_strong_pvals['PAN'] = map(lambda i: 1 - hypergeom_test_n_rvs[i].cdf([cancer_type_top_n_log_corrs_mtbase_strong_counts['PAN'][i]])[0],
                                                             range(len(hypergeom_test_ns)))

In [39]:
cancer_type_top_n_log_corrs_mtbase_strong_counts_log2OE = pd.DataFrame(None, hypergeom_test_n_strs, cancer_types_list)

In [65]:
cancer_type_top_n_log_corrs_mtbase_strong_exps = map(lambda n: n * mirtars_strong_nonnull_count / N, hypergeom_test_ns)
cancer_type_top_n_log_corrs_mtbase_strong_counts_log2OE['PAN'] = cancer_type_top_n_log_corrs_mtbase_strong_counts['PAN'] / cancer_type_top_n_log_corrs_mtbase_strong_exps
cancer_type_top_n_log_corrs_mtbase_strong_counts_log2OE['PAN'] = cancer_type_top_n_log_corrs_mtbase_strong_counts_log2OE['PAN'].apply(lambda x: m.log(x + 1, 2))

In [104]:
cancer_type_top1000_log_corrs_mtb_strong_miRNAs = pd.DataFrame(False, cancer_types_list, miRNAmRNA_log_corrs.index)
cancer_type_top1000_log_corrs_mtb_strong_miRNAs.loc['PAN'] = get_top_n_intersection_miRNAs_mask(miRNAmRNA_log_corrs, miRNAmRNAs_mirtarbase_strong_mask, 1000)

In [66]:
cancer_type_top_n_spearman_corrs_mtbase_strong_counts = pd.DataFrame(None, hypergeom_test_n_strs, cancer_types_list)
cancer_type_top_n_spearman_corrs_mtbase_strong_pvals = pd.DataFrame(None, hypergeom_test_n_strs, cancer_types_list)

In [67]:
cancer_type_top_n_spearman_corrs_mtbase_strong_counts['PAN'] = get_ranks_intersection_counts(miRNAmRNA_spearman_corrs, miRNAmRNA_pancan_corrs_mirtarbase_strong_mask, hypergeom_test_ns)
cancer_type_top_n_spearman_corrs_mtbase_strong_pvals['PAN'] = map(lambda i: 1 - hypergeom_test_n_rvs[i].cdf([cancer_type_top_n_spearman_corrs_mtbase_strong_counts['PAN'][i]])[0],
                                                                  range(len(hypergeom_test_ns)))

In [68]:
cancer_type_top_n_spearman_corrs_mtbase_strong_counts_log2OE = pd.DataFrame(None, hypergeom_test_n_strs, cancer_types_list)

In [69]:
cancer_type_top_n_spearman_corrs_mtbase_strong_exps = map(lambda n: n * mirtars_nonnull_count / N, hypergeom_test_ns)
cancer_type_top_n_spearman_corrs_mtbase_strong_counts_log2OE['PAN'] = cancer_type_top_n_spearman_corrs_mtbase_strong_counts['PAN'] / cancer_type_top_n_spearman_corrs_mtbase_strong_exps
cancer_type_top_n_spearman_corrs_mtbase_strong_counts_log2OE['PAN'] = cancer_type_top_n_spearman_corrs_mtbase_strong_counts_log2OE['PAN'].apply(lambda x: m.log(x + 1, 2))

In [105]:
cancer_type_top1000_spearman_corrs_mtb_strong_miRNAs = pd.DataFrame(False, cancer_types_list, miRNAmRNA_spearman_corrs.index)
cancer_type_top1000_spearman_corrs_mtb_strong_miRNAs.loc['PAN'] = get_top_n_intersection_miRNAs_mask(miRNAmRNA_spearman_corrs, miRNAmRNAs_mirtarbase_strong_mask, 1000)

In [70]:
cancer_type_mtbase_strong_hypergeom_test_params = pd.DataFrame(None, cancer_types_list, ['N', 'K'])
cancer_type_mtbase_strong_hypergeom_test_params.loc['PAN'] = [N, mirtars_strong_nonnull_count]

#### Within-cancer

In [107]:
for cancer_type in cancer_types_list:
    type_miRNAmRNA_log_corrs = get_corrs_df(bucket, 'explore/miRTar/pearson-corrs/data/mirtar-log-corrs_' + cancer_type + '.csv', 'miRNA')
    miRNAmRNA_corrs_nonnull = ~type_miRNAmRNA_log_corrs.isnull()
    N = miRNAmRNA_corrs_nonnull.sum().sum()
    mirtars_strong_nonnull_count = (miRNAmRNA_corrs_nonnull & miRNAmRNAs_mirtarbase_strong_mask).sum().sum()
    cancer_type_mtbase_strong_hypergeom_test_params.loc[cancer_type] = [N, mirtars_strong_nonnull_count]
    hypergeom_test_n_rvs = map(lambda n: stats.hypergeom(N, mirtars_strong_nonnull_count, n), hypergeom_test_ns)
    cancer_type_top_n_log_corrs_mtbase_strong_counts[cancer_type] = get_ranks_intersection_counts(type_miRNAmRNA_log_corrs, miRNAmRNAs_mirtarbase_strong_mask, hypergeom_test_ns)
    type_top_n_log_corrs_mtbase_strong_counts = cancer_type_top_n_log_corrs_mtbase_strong_counts[cancer_type]
    cancer_type_top_n_log_corrs_mtbase_strong_pvals[cancer_type] = map(lambda i: 1 - hypergeom_test_n_rvs[i].cdf([type_top_n_log_corrs_mtbase_strong_counts[i]])[0],
                                                                       range(len(hypergeom_test_ns)))
    cancer_type_top_n_corrs_mtbase_strong_exps = map(lambda n: n * mirtars_strong_nonnull_count / N, hypergeom_test_ns)
    cancer_type_top_n_log_corrs_mtbase_strong_counts_log2OE[cancer_type] = cancer_type_top_n_log_corrs_mtbase_strong_counts[cancer_type] / cancer_type_top_n_corrs_mtbase_strong_exps
    cancer_type_top_n_log_corrs_mtbase_strong_counts_log2OE[cancer_type] = cancer_type_top_n_log_corrs_mtbase_strong_counts_log2OE[cancer_type].apply(lambda x: m.log(x + 1, 2))
    cancer_type_top1000_log_corrs_mtb_strong_miRNAs.loc[cancer_type] = get_top_n_intersection_miRNAs_mask(type_miRNAmRNA_log_corrs, miRNAmRNAs_mirtarbase_strong_mask, 1000)
    type_miRNAmRNA_spearman_corrs = get_corrs_df(bucket, 'explore/miRTar/spearman-corrs/data/mirtar-spearman-corrs_' + cancer_type + '.csv', 'miRNA')
    cancer_type_top_n_spearman_corrs_mtbase_strong_counts[cancer_type] = get_ranks_intersection_counts(type_miRNAmRNA_spearman_corrs, miRNAmRNAs_mirtarbase_strong_mask, hypergeom_test_ns)
    type_top_n_spearman_corrs_mtbase_strong_counts = cancer_type_top_n_spearman_corrs_mtbase_strong_counts[cancer_type]
    cancer_type_top_n_spearman_corrs_mtbase_strong_pvals[cancer_type] = map(lambda i: 1 - hypergeom_test_n_rvs[i].cdf([type_top_n_spearman_corrs_mtbase_strong_counts[i]])[0],
                                                                            range(len(hypergeom_test_ns)))
    cancer_type_top_n_spearman_corrs_mtbase_strong_counts_log2OE[cancer_type] = cancer_type_top_n_spearman_corrs_mtbase_strong_counts[cancer_type] / cancer_type_top_n_corrs_mtbase_strong_exps
    cancer_type_top_n_spearman_corrs_mtbase_strong_counts_log2OE[cancer_type] = cancer_type_top_n_spearman_corrs_mtbase_strong_counts_log2OE[cancer_type].apply(lambda x: m.log(x+1, 2))
    cancer_type_top1000_spearman_corrs_mtb_strong_miRNAs.loc[cancer_type] = get_top_n_intersection_miRNAs_mask(type_miRNAmRNA_spearman_corrs, miRNAmRNAs_mirtarbase_strong_mask, 1000)


Source object gs://yfl-mirna/explore/miRTar/pearson-corrs/data/mirtar-log-corrs_PAN.csv does not exist
Source object gs://yfl-mirna/explore/miRTar/spearman-corrs/data/mirtar-spearman-corrs_PAN.csv does not exist

In [77]:
cancer_type_top_n_log_corrs_mtbase_strong_pvals_bf_adj = cancer_type_top_n_log_corrs_mtbase_strong_pvals.loc[:, cancer_types_list].apply(bonferroni_adj, axis=1)
cancer_type_top_n_log_corrs_mtbase_strong_pvals_bh_adj = cancer_type_top_n_log_corrs_mtbase_strong_pvals.loc[:, cancer_types_list].apply(benjaminihochberg, axis=1)
cancer_type_top_n_log_corrs_mtbase_strong_pvals_2sbh_adj = cancer_type_top_n_log_corrs_mtbase_strong_pvals.loc[:, cancer_types_list].apply(benjaminihochberg_2stage, axis=1)

In [None]:
write_df_to_csv(cancer_type_top_n_log_corrs_mtbase_strong_counts, 'n', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-n-log-corrs-mtbase-strong_counts.csv')
write_df_to_csv(cancer_type_top_n_log_corrs_mtbase_strong_pvals, 'n', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-n-log-corrs-mtbase-strong_pvals.csv')
write_df_to_csv(cancer_type_top_n_log_corrs_mtbase_strong_pvals_bf_adj, 'n', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-n-log-corrs-mtbase-strong_pvals-bf-adj.csv')
write_df_to_csv(cancer_type_top_n_log_corrs_mtbase_strong_pvals_bh_adj, 'n', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-n-log-corrs-mtbase-strong_pvals-bh-adj.csv')
write_df_to_csv(cancer_type_top_n_log_corrs_mtbase_strong_pvals_2sbh_adj, 'n', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-n-log-corrs-mtbase-strong_pvals-2sbh-adj.csv')
write_df_to_csv(cancer_type_top_n_log_corrs_mtbase_strong_counts_log2OE, 'n', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-n-log-corrs-mtbase-strong_counts_log2OE.csv')

In [None]:
write_df_to_csv(cancer_type_top1000_log_corrs_mtb_strong_miRNAs, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-1000-log-corrs-mtbase-strong_miRNAs.csv')

In [77]:
cancer_type_top_n_spearman_corrs_mtbase_strong_pvals_bf_adj = cancer_type_top_n_spearman_corrs_mtbase_strong_pvals.loc[:, cancer_types_list].apply(bonferroni_adj, axis=1)
cancer_type_top_n_spearman_corrs_mtbase_strong_pvals_bh_adj = cancer_type_top_n_spearman_corrs_mtbase_strong_pvals.loc[:, cancer_types_list].apply(benjaminihochberg, axis=1)
cancer_type_top_n_spearman_corrs_mtbase_strong_pvals_2sbh_adj = cancer_type_top_n_spearman_corrs_mtbase_strong_pvals.loc[:, cancer_types_list].apply(benjaminihochberg_2stage, axis=1)

In [None]:
write_df_to_csv(cancer_type_top_n_spearman_corrs_mtbase_strong_counts, 'n', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-n-spearman-corrs-mtbase-strong_counts.csv')
write_df_to_csv(cancer_type_top_n_spearman_corrs_mtbase_strong_pvals, 'n', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-n-spearman-corrs-mtbase-strong_pvals.csv')
write_df_to_csv(cancer_type_top_n_spearman_corrs_mtbase_strong_pvals_bf_adj, 'n', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-n-spearman-corrs-mtbase-strong_pvals-bf-adj.csv')
write_df_to_csv(cancer_type_top_n_spearman_corrs_mtbase_strong_pvals_bh_adj, 'n', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-n-spearman-corrs-mtbase-strong_pvals-bh-adj.csv')
write_df_to_csv(cancer_type_top_n_spearman_corrs_mtbase_strong_pvals_2sbh_adj, 'n', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-n-spearman-corrs-mtbase-strong_pvals-2sbh-adj.csv')
write_df_to_csv(cancer_type_top_n_spearman_corrs_mtbase_strong_counts_log2OE, 'n', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-n-spearman-corrs-mtbase-strong_counts_log2OE.csv')

In [None]:
write_df_to_csv(cancer_type_top1000_spearman_corrs_mtb_strong_miRNAs, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-1000-spearman-corrs-mtbase-strong_miRNAs.csv')

In [None]:
cancer_type_top1000_mtb_strong_stats = pd.DataFrame({'log_corrs_mtb_miRNAs': cancer_type_top1000_log_corrs_mtb_strong_miRNAs.sum(axis=1).map(lambda x: int(x)),
                                                     'log_corrs_all_miRNAs': cancer_type_top1000_log_corrs_miRNAs.sum(axis=1).map(lambda x: int(x)),
                                                     'log_corrs_mtb': cancer_type_top_n_log_corrs_mtbase_strong_counts.loc[1000].reindex(cancer_types_list),
                                                     'log_corrs_mtb_log2OE': cancer_type_top_n_log_corrs_mtbase_strong_counts_log2OE.loc[1000].reindex(cancer_types_list),
                                                     'spearman_corrs_mtb_miRNAs': cancer_type_top1000_spearman_corrs_mtb_strong_miRNAs.sum(axis=1).map(lambda x: int(x)),
                                                     'spearman_corrs_all_miRNAs': cancer_type_top1000_spearman_corrs_miRNAs.sum(axis=1).map(lambda x: int(x)),                            
                                                     'spearman_corrs_mtb': cancer_type_top_n_spearman_corrs_mtbase_strong_counts.loc[1000].reindex(cancer_types_list),
                                                     'spearman_corrs_mtb_log2OE': cancer_type_top_n_spearman_corrs_mtbase_strong_counts_log2OE.loc[1000].reindex(cancer_types_list)})
cancer_type_top1000_mtb_strong_stats = cancer_type_top1000_mtb_strong_stats.reindex(columns=['log_corrs_mtb_miRNAs', 'log_corrs_all_miRNAs', 'log_corrs_mtb', 'log_corrs_mtb_log2OE',
                                                                                             'spearman_corrs_mtb_miRNAs', 'spearman_corrs_all_miRNAs', 'spearman_corrs_mtb', 'spearman_corrs_mtb_log2OE'])
write_df_to_csv(cancer_type_top1000_mtb_strong_stats, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancertype_top1000-mtb-strong_stats.csv')

In [None]:
write_df_to_csv(cancer_type_mtbase_strong_hypergeom_test_params, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_mtbase-strong_hypergeom-test-params.csv')

### TODO: Edit to reflect one-pass workflow

### Because I'm an idiot and didn't write up conclusions right after analysis

In [45]:
cancer_type_top_n_log_corrs_mirtarbase_pvals = read_file(bucket, 'analysis/enrichment/cancer-type_top-n-log-corrs-mirtarbase_pvals.csv')
cancer_type_top_n_log_corrs_mirtarbase_pvals.set_index('n', inplace=True)

In [46]:
cancer_type_top_n_log_corrs_mirtarbase_pvals_bf_adj = read_file(bucket, 'analysis/enrichment/cancer-type_top-n-log-corrs-mirtarbase_pvals-bf-adj.csv')
cancer_type_top_n_log_corrs_mirtarbase_pvals_bf_adj.set_index('n', inplace=True)

In [47]:
cancer_type_top_n_log_corrs_mirtarbase_pvals_bh_adj = read_file(bucket, 'analysis/enrichment/cancer-type_top-n-log-corrs-mirtarbase_pvals-bh-adj.csv')
cancer_type_top_n_log_corrs_mirtarbase_pvals_bh_adj.set_index('n', inplace=True)

In [48]:
cancer_type_top_n_log_corrs_mirtarbase_pvals_2sbh_adj = read_file(bucket, 'analysis/enrichment/cancer-type_top-n-log-corrs-mirtarbase_pvals-2sbh-adj.csv')
cancer_type_top_n_log_corrs_mirtarbase_pvals_2sbh_adj.set_index('n', inplace=True)

In [49]:
cancer_type_top_n_spearman_corrs_mirtarbase_pvals = read_file(bucket, 'analysis/enrichment/cancer-type_top-n-spearman-corrs-mirtarbase_pvals.csv')
cancer_type_top_n_spearman_corrs_mirtarbase_pvals.set_index('n', inplace=True)

In [50]:
cancer_type_top_n_spearman_corrs_mirtarbase_pvals_bf_adj = read_file(bucket, 'analysis/enrichment/cancer-type_top-n-spearman-corrs-mirtarbase_pvals-bf-adj.csv')
cancer_type_top_n_spearman_corrs_mirtarbase_pvals_bf_adj.set_index('n', inplace=True)

In [51]:
cancer_type_top_n_spearman_corrs_mirtarbase_pvals_bh_adj = read_file(bucket, 'analysis/enrichment/cancer-type_top-n-spearman-corrs-mirtarbase_pvals-bh-adj.csv')
cancer_type_top_n_spearman_corrs_mirtarbase_pvals_bh_adj.set_index('n', inplace=True)

In [52]:
cancer_type_top_n_spearman_corrs_mirtarbase_pvals_2sbh_adj = read_file(bucket, 'analysis/enrichment/cancer-type_top-n-spearman-corrs-mirtarbase_pvals-2sbh-adj.csv')
cancer_type_top_n_spearman_corrs_mirtarbase_pvals_2sbh_adj.set_index('n', inplace=True)

In [53]:
cancer_type_top_n_log_corrs_mtbase_strong_pvals = read_file(bucket, 'analysis/enrichment/cancer-type_top-n-log-corrs-mtbase-strong_pvals.csv')
cancer_type_top_n_log_corrs_mtbase_strong_pvals.set_index('n', inplace=True)

In [54]:
cancer_type_top_n_log_corrs_mtbase_strong_pvals_bf_adj = read_file(bucket, 'analysis/enrichment/cancer-type_top-n-log-corrs-mtbase-strong_pvals-bf-adj.csv')
cancer_type_top_n_log_corrs_mtbase_strong_pvals_bf_adj.set_index('n', inplace=True)

In [55]:
cancer_type_top_n_log_corrs_mtbase_strong_pvals_bh_adj = read_file(bucket, 'analysis/enrichment/cancer-type_top-n-log-corrs-mtbase-strong_pvals-bh-adj.csv')
cancer_type_top_n_log_corrs_mtbase_strong_pvals_bh_adj.set_index('n', inplace=True)

In [56]:
cancer_type_top_n_log_corrs_mtbase_strong_pvals_2sbh_adj = read_file(bucket, 'analysis/enrichment/cancer-type_top-n-log-corrs-mtbase-strong_pvals-2sbh-adj.csv')
cancer_type_top_n_log_corrs_mtbase_strong_pvals_2sbh_adj.set_index('n', inplace=True)

In [57]:
cancer_type_top_n_spearman_corrs_mtbase_strong_pvals = read_file(bucket, 'analysis/enrichment/cancer-type_top-n-spearman-corrs-mtbase-strong_pvals.csv')
cancer_type_top_n_spearman_corrs_mtbase_strong_pvals.set_index('n', inplace=True)

In [58]:
cancer_type_top_n_spearman_corrs_mtbase_strong_pvals_bf_adj = read_file(bucket, 'analysis/enrichment/cancer-type_top-n-spearman-corrs-mtbase-strong_pvals-bf-adj.csv')
cancer_type_top_n_spearman_corrs_mtbase_strong_pvals_bf_adj.set_index('n', inplace=True)

In [59]:
cancer_type_top_n_spearman_corrs_mtbase_strong_pvals_bh_adj = read_file(bucket, 'analysis/enrichment/cancer-type_top-n-spearman-corrs-mtbase-strong_pvals-bh-adj.csv')
cancer_type_top_n_spearman_corrs_mtbase_strong_pvals_bh_adj.set_index('n', inplace=True)

In [60]:
cancer_type_top_n_spearman_corrs_mtbase_strong_pvals_2sbh_adj = read_file(bucket, 'analysis/enrichment/cancer-type_top-n-spearman-corrs-mtbase-strong_pvals-2sbh-adj.csv')
cancer_type_top_n_spearman_corrs_mtbase_strong_pvals_2sbh_adj.set_index('n', inplace=True)

In [61]:
cancer_type_top1000_spearman_corrs_miRNAs = read_file(bucket, 'analysis/enrichment/cancer-type_top-1000-spearman-corrs_miRNAs.csv')
cancer_type_top1000_spearman_corrs_miRNAs.set_index('cancer_type', inplace=True)

In [56]:
cancer_type_top1000_spearman_corrs_miRNAs.sum().describe()

count    743.000000
mean       3.429341
std        3.068851
min        0.000000
25%        1.000000
50%        3.000000
75%        5.000000
max       18.000000
dtype: float64

### TODO: Move following cells

#### TODO: Uncomment

In [62]:
cancer_type_top1000_spearman_corrs_miRNAs_sorted = cancer_type_top1000_spearman_corrs_miRNAs.sum().sort_values().apply(int)
#write_series_to_csv(cancer_type_top1000_spearman_corrs_miRNAs_sorted[-1:-26:-1], 'miRNA', 'gs://yfl-mirna/analysis/enrichment/cancertype-top1000-spearman-anticorrs_top26-miRNAs.csv')

In [63]:
top1000_spearman_corrs_miRNAcancer_count = cancer_type_top1000_spearman_corrs_miRNAs.sum().sum()

In [64]:
cancer_type_top1000_spearman_corrs_miRNAs_sorted[cancer_type_top1000_spearman_corrs_miRNAs_sorted >= 10].sum()

475

In [None]:
cancer_type_top1000_spearman_corrs_miRNAs_sorted[cancer_type_top1000_spearman_corrs_miRNAs_sorted >= 10].sum() / top1000_spearman_corrs_miRNAcancer_count

In [None]:
cancer_type_top1000_spearman_corrs_miRNAs_sorted[cancer_type_top1000_spearman_corrs_miRNAs_sorted >= 5].sum() / top1000_spearman_corrs_miRNAcancer_count

In [None]:
(cancer_type_top1000_spearman_corrs_miRNAs_sorted >= 10).sum()

In [None]:
((cancer_type_top1000_spearman_corrs_miRNAs_sorted < 10) & (cancer_type_top1000_spearman_corrs_miRNAs_sorted >= 5)).sum()

In [None]:
((cancer_type_top1000_spearman_corrs_miRNAs_sorted < 5) & (cancer_type_top1000_spearman_corrs_miRNAs_sorted > 0)).sum()

#### Probably obsolete: Delete?

In [65]:
all_ns = cancer_type_top_n_spearman_corrs_mirtarbase_pvals_bf_adj.loc[[100, 500, 1000], :].T.rename(columns=lambda n: 'all_' + str(n))
strong_ns = cancer_type_top_n_spearman_corrs_mtbase_strong_pvals_bf_adj.loc[[100, 500, 1000], :].T.rename(columns=lambda n: 'strong_' + str(n))
#write_df_to_csv(all_ns.join(strong_ns), 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-n-spearman_all-vs-strong-mtb_pvals-bf-adj.csv')

In [66]:
all_ns.join(strong_ns)

n,all_100,all_500,all_1000,strong_100,strong_500,strong_1000
CHOL,0.012928,0.000282,2.216039e-05,0.04421,0.08438281,0.0006699086
DLBC,1.0,1.0,0.04390561,1.0,1.0,1.0
UCS,0.000163,0.034313,0.001299621,0.041815,0.9183866,4.188977e-05
KICH,1.0,0.011297,0.0001719406,1.0,0.08017122,0.06782149
ACC,1.0,0.716888,0.2635288,1.0,0.9375176,1.0
UVM,1.0,1.0,1.0,1.0,1.0,1.0
MESO,0.012305,0.000944,0.02229928,0.000706,0.0,0.0
SKCM,0.46079,0.093506,0.04570719,0.041261,0.0764681,0.006483476
THYM,0.476149,0.003313,1.341815e-06,1.0,0.9163238,0.06512095
TGCT,1.0,0.009925,0.1016029,1.0,1.0,1.0


In [67]:
(all_ns < 0.05).sum()

n
all_100      8
all_500     18
all_1000    21
dtype: int64

In [68]:
(strong_ns < 0.05).sum()

n
strong_100     15
strong_500     12
strong_1000    20
dtype: int64

In [69]:
del all_ns
del strong_ns

#### Hypergeometric test p-values for enrichment of miRTarBase interactions (all or those with strong support type entries only) in top n anticorrelations per cancer type and pan-cancer; n = 100, 500, 1000. FDR control using two-stage Benjamini-Hochberg.

In [66]:
all_ns = cancer_type_top_n_spearman_corrs_mirtarbase_pvals_2sbh_adj.loc[[100, 500, 1000], :].T.rename(columns=lambda n: 'all_' + str(n))
strong_ns = cancer_type_top_n_spearman_corrs_mtbase_strong_pvals_2sbh_adj.loc[[100, 500, 1000], :].T.rename(columns=lambda n: 'strong_' + str(n))
write_df_to_csv(all_ns.join(strong_ns), 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-n-spearman_all-vs-strong-mtb_pvals-2sbh-adj.csv')

Copying file://temp.csv [Content-Type=text/csv]...
/ [1 files][  4.0 KiB/  4.0 KiB]                                                
Operation completed over 1 objects/4.0 KiB.                                      


In [None]:
all_ns.join(strong_ns)

In [None]:
(all_ns < 0.05).sum()

In [None]:
(strong_ns < 0.05).sum()

In [64]:
del all_ns
del strong_ns

#### More cruft?

In [None]:
pd.concat([cancer_type_top_n_log_corrs_mirtarbase_pvals_bf_adj.loc[[100, 500, 1000], :].T, cancer_type_top_n_log_corrs_mtbase_strong_pvals_bf_adj.loc[[100, 500, 1000], :].T], axis=1)

In [66]:
cancer_type_top_n_spearman_corrs_mtbase_strong_pvals_bh_adj = read_file(bucket, 'analysis/enrichment/cancer-type_top-n-spearman-corrs-mtbase-strong_pvals-bh-adj.csv')
cancer_type_top_n_spearman_corrs_mtbase_strong_pvals_bh_adj.set_index('n', inplace=True)

In [67]:
cancer_type_top_n_spearman_corrs_mtbase_strong_pvals_2sbh_adj = read_file(bucket, 'analysis/enrichment/cancer-type_top-n-spearman-corrs-mtbase-strong_pvals-2sbh-adj.csv')
cancer_type_top_n_spearman_corrs_mtbase_strong_pvals_2sbh_adj.set_index('n', inplace=True)

In [68]:
data = { 'log': (cancer_type_top_n_log_corrs_mirtarbase_pvals_2sbh_adj <= 0.05).sum(), 'spearman': (cancer_type_top_n_spearman_corrs_mirtarbase_pvals_2sbh_adj <= 0.05).sum(),
         'log_strong': (cancer_type_top_n_log_corrs_mtbase_strong_pvals_2sbh_adj <= 0.05).sum(),
         'spearman_strong': (cancer_type_top_n_spearman_corrs_mtbase_strong_pvals_2sbh_adj <= 0.05).sum() }
cancertype_topn_corrs_mtb_enrichment_2sbh_adj_5pct_sign_counts = pd.DataFrame(data, index=cancer_types_list)

In [69]:
data = {'log': (cancer_type_top_n_log_corrs_mirtarbase_pvals_2sbh_adj <= 0.05).sum(axis=1), 'spearman': (cancer_type_top_n_spearman_corrs_mirtarbase_pvals_2sbh_adj <= 0.05).sum(axis=1),
        'log_strong': (cancer_type_top_n_log_corrs_mtbase_strong_pvals_2sbh_adj <= 0.05).sum(axis=1),
        'spearman_strong': (cancer_type_top_n_spearman_corrs_mtbase_strong_pvals_2sbh_adj <= 0.05).sum(axis=1) }
cancertype_topn_corrs_mtb_enrichment_2sbh_adj_5pct_sigtype_counts = pd.DataFrame(data, index=hypergeom_test_ns)

## Hypergeometric test for miRTarBase enrichment in n strongest mRNA expression anticorrelations per miRNA

#### TODO: Delete all *2sbh?

In [70]:
hypergeom_test_ns = [50, 100, 250, 500, 750, 1000]

In [71]:
hypergeom_test_n_strs = map(str, hypergeom_test_ns)

### All miRTarBase relationships

In [46]:
cancertype_n_log_corrs_sig_pval_counts_bh = pd.DataFrame(None, cancer_types_list, hypergeom_test_n_strs)
#cancertype_n_log_corrs_sig_pval_counts_2sbh = pd.DataFrame(None, cancer_types_list, hypergeom_test_n_strs)

In [47]:
cancertype_n_spearman_corrs_sig_pval_counts_bh = pd.DataFrame(None, cancer_types_list, hypergeom_test_n_strs)
#cancertype_n_spearman_corrs_sig_pval_counts_2sbh = pd.DataFrame(None, cancer_types_list, hypergeom_test_n_strs)

In [48]:
cancertype_miRNA_top50_log_corrs_log2OE = pd.DataFrame(None, cancer_types_list, miRNAmRNAs_in_mirtarbase_mask.index)
cancertype_miRNA_top50_log_corrs_pval_bh_adj = pd.DataFrame(None, cancer_types_list, miRNAmRNAs_in_mirtarbase_mask.index)
cancertype_top50_log_corrs_mtbase_sig_miRNAs_mask = pd.DataFrame(False, cancer_types_list, miRNAmRNAs_in_mirtarbase_mask.index)

In [49]:
cancertype_miRNA_top50_spearman_corrs_log2OE = pd.DataFrame(None, cancer_types_list, miRNAmRNAs_in_mirtarbase_mask.index)
cancertype_miRNA_top50_spearman_corrs_pval_bh_adj = pd.DataFrame(None, cancer_types_list, miRNAmRNAs_in_mirtarbase_mask.index)
cancertype_top50_spearman_corrs_mtbase_sig_miRNAs_mask = pd.DataFrame(False, cancer_types_list, miRNAmRNAs_in_mirtarbase_mask.index)

#### TODO: Delete the next two (and all subsequent references/usage)?

In [36]:
cancertype_miRNA_top500_log_corrs_log2OE = pd.DataFrame(None, cancer_types_list, miRNAmRNAs_in_mirtarbase_mask.index)
cancertype_miRNA_top500_log_corrs_pval_bh_adj = pd.DataFrame(None, cancer_types_list, miRNAmRNAs_in_mirtarbase_mask.index)
cancertype_top500_log_corrs_mtbase_sig_miRNAs_mask = pd.DataFrame(False, cancer_types_list, miRNAmRNAs_in_mirtarbase_mask.index)
cancertype_miRNA_top500_log_corrs_pval_2sbh_adj = pd.DataFrame(None, cancer_types_list, miRNAmRNAs_in_mirtarbase_mask.index)

In [37]:
cancertype_miRNA_top500_spearman_corrs_log2OE = pd.DataFrame(None, cancer_types_list, miRNAmRNAs_in_mirtarbase_mask.index)
cancertype_miRNA_top500_spearman_corrs_pval_bh_adj = pd.DataFrame(None, cancer_types_list, miRNAmRNAs_in_mirtarbase_mask.index)
cancertype_top500_spearman_corrs_mtbase_sig_miRNAs_mask = pd.DataFrame(False, cancer_types_list, miRNAmRNAs_in_mirtarbase_mask.index)
cancertype_miRNA_top500_spearman_corrs_pval_2sbh_adj = pd.DataFrame(None, cancer_types_list, miRNAmRNAs_in_mirtarbase_mask.index)

#### Pan-cancer

In [72]:
miRNAs = miRNAmRNAs_in_mirtarbase_mask.index # making use of this coinciding with miRNAmRNA_log_corrs.index
mRNAs = miRNAmRNA_pancan_corrs_in_mirtarbase_mask.columns.intersection(miRNAmRNA_log_corrs.columns)
pancan_corrs_in_mirtarbase_mask = miRNAmRNA_pancan_corrs_in_mirtarbase_mask.loc[miRNAs, mRNAs]
miRNA_pancan_target_counts = pancan_corrs_in_mirtarbase_mask.sum(axis=1)

In [73]:
(miRNAs == miRNAmRNAs_in_mirtarbase_mask.index).sum()

743

In [74]:
N = mRNAs.size

In [75]:
hypergeom_test_miRNA_n_rvs = pd.DataFrame(None, miRNAs, hypergeom_test_n_strs)
hypergeom_test_miRNA_n_rvs = hypergeom_test_miRNA_n_rvs.apply(lambda n: miRNA_pancan_target_counts.map(lambda count: stats.hypergeom(N, count, int(n.name))))

In [41]:
miRNA_top_n_log_corrs_mask = {}
miRNA_top_n_spearman_corrs_mask = {}

In [42]:
for n in hypergeom_test_n_strs:
  miRNA_top_n_log_corrs_mask[n] = pd.DataFrame(False, miRNAs, mRNAs)
  miRNA_top_n_log_corrs = miRNAmRNA_log_corrs.loc[miRNAs, mRNAs].apply(lambda row: row.argsort()[:int(n)], axis=1)
  for miRNA in miRNA_top_n_log_corrs_mask[n].index:
    miRNA_top_n_log_corrs_mask[n].loc[miRNA][miRNA_top_n_log_corrs.loc[miRNA]] = True
  miRNA_top_n_spearman_corrs_mask[n] = pd.DataFrame(False, miRNAs, mRNAs)
  miRNA_top_n_spearman_corrs = miRNAmRNA_spearman_corrs.loc[miRNAs, mRNAs].apply(lambda row: row.argsort()[:int(n)], axis=1)
  for miRNA in miRNA_top_n_spearman_corrs_mask[n].index:
    miRNA_top_n_spearman_corrs_mask[n].loc[miRNA][miRNA_top_n_spearman_corrs.loc[miRNA]] = True

In [38]:
def get_pval(rvs, top_n_counts, miRNA):
  return (1 - rvs[top_n_counts.name][miRNA].cdf(top_n_counts[miRNA]))

In [44]:
miRNA_top_n_log_corrs_mtbase_counts = pd.DataFrame(None, miRNAs, hypergeom_test_n_strs)
miRNA_top_n_log_corrs_mtbase_counts = miRNA_top_n_log_corrs_mtbase_counts.apply(lambda n: (miRNA_top_n_log_corrs_mask[n.name] & pancan_corrs_in_mirtarbase_mask).sum(axis=1))
miRNA_top_n_log_corrs_mtbase_pvals = miRNA_top_n_log_corrs_mtbase_counts.apply(lambda counts: miRNAs.map(lambda miRNA: get_pval(hypergeom_test_miRNA_n_rvs, counts, miRNA)))

In [46]:
miRNA_top_n_spearman_corrs_mtbase_counts = pd.DataFrame(None, miRNAs, hypergeom_test_n_strs)
miRNA_top_n_spearman_corrs_mtbase_counts = miRNA_top_n_spearman_corrs_mtbase_counts.apply(lambda n: (miRNA_top_n_spearman_corrs_mask[n.name] & pancan_corrs_in_mirtarbase_mask).sum(axis=1))
miRNA_top_n_spearman_corrs_mtbase_pvals = miRNA_top_n_spearman_corrs_mtbase_counts.apply(lambda counts: miRNAs.map(lambda miRNA: get_pval(hypergeom_test_miRNA_n_rvs, counts, miRNA)))

In [48]:
pancan_miRNA_top_n_log_corrs_mtbase_pvals_bh_adj = miRNA_top_n_log_corrs_mtbase_pvals.apply(benjaminihochberg)
pancan_miRNA_top_n_log_corrs_mtbase_pvals_2sbh_adj = miRNA_top_n_log_corrs_mtbase_pvals.apply(benjaminihochberg_2stage)
pancan_miRNA_top_n_spearman_corrs_mtbase_pvals_bh_adj = miRNA_top_n_spearman_corrs_mtbase_pvals.apply(benjaminihochberg)
pancan_miRNA_top_n_spearman_corrs_mtbase_pvals_2sbh_adj = miRNA_top_n_spearman_corrs_mtbase_pvals.apply(benjaminihochberg_2stage)

In [None]:
write_df_to_csv(miRNA_top_n_log_corrs_mtbase_counts, 'miRNA', 'gs://yfl-mirna/analysis/enrichment/miRNA_top-n-log-corrs-mtbase_counts.csv')
write_df_to_csv(miRNA_top_n_spearman_corrs_mtbase_counts, 'miRNA', 'gs://yfl-mirna/analysis/enrichment/miRNA_top-n-spearman-corrs-mtbase_counts.csv')
write_df_to_csv(pancan_miRNA_top_n_log_corrs_mtbase_pvals_bh_adj, 'miRNA', 'gs://yfl-mirna/analysis/enrichment/miRNA_top-n-log-corrs-mtbase_pvals-bh-adj.csv')
write_df_to_csv(pancan_miRNA_top_n_spearman_corrs_mtbase_pvals_bh_adj, 'miRNA', 'gs://yfl-mirna/analysis/enrichment/miRNA_top-n-spearman-corrs-mtbase_pvals-bh-adj.csv')

In [57]:
cancertype_n_log_corrs_sig_pval_counts_bh.loc['PAN'] = (pancan_miRNA_top_n_log_corrs_mtbase_pvals_bh_adj <= 0.05).sum()
cancertype_n_log_corrs_sig_pval_counts_2sbh.loc['PAN'] = (pancan_miRNA_top_n_log_corrs_mtbase_pvals_2sbh_adj <= 0.05).sum()

In [58]:
cancertype_n_spearman_corrs_sig_pval_counts_bh.loc['PAN'] = (pancan_miRNA_top_n_spearman_corrs_mtbase_pvals_bh_adj <= 0.05).sum()
cancertype_n_spearman_corrs_sig_pval_counts_2sbh.loc['PAN'] = (pancan_miRNA_top_n_spearman_corrs_mtbase_pvals_2sbh_adj <= 0.05).sum()

In [52]:
cancertype_miRNA_top50_log_corrs_log2OE.loc['PAN', miRNAs] = ((miRNA_top_n_log_corrs_mtbase_counts['50'] * 1.0) / (miRNA_pancan_target_counts * 50.0 / N))
cancertype_miRNA_top50_log_corrs_log2OE.loc['PAN', miRNAs] = cancertype_miRNA_top50_log_corrs_log2OE.loc['PAN', miRNAs].map(lambda oe: m.log(oe + 1, 2))
cancertype_miRNA_top50_log_corrs_pval_bh_adj.loc['PAN'] = pancan_miRNA_top_n_log_corrs_mtbase_pvals_bh_adj['50']
cancertype_top50_log_corrs_mtbase_sig_miRNAs_mask.loc['PAN'] = (pancan_miRNA_top_n_log_corrs_mtbase_pvals_bh_adj['50'] <= 0.05)

In [53]:
cancertype_miRNA_top50_spearman_corrs_log2OE.loc['PAN', miRNAs] = ((miRNA_top_n_spearman_corrs_mtbase_counts['50'] * 1.0) / (miRNA_pancan_target_counts * 50.0 / N))
cancertype_miRNA_top50_spearman_corrs_log2OE.loc['PAN', miRNAs] = cancertype_miRNA_top50_spearman_corrs_log2OE.loc['PAN', miRNAs].map(lambda oe: m.log(oe + 1, 2))
cancertype_miRNA_top50_spearman_corrs_pval_bh_adj.loc['PAN'] = pancan_miRNA_top_n_spearman_corrs_mtbase_pvals_bh_adj['50']
cancertype_top50_spearman_corrs_mtbase_sig_miRNAs_mask.loc['PAN'] = (pancan_miRNA_top_n_spearman_corrs_mtbase_pvals_bh_adj['50'] <= 0.05)

#### TODO: Delete next 2 cells?

In [52]:
cancertype_miRNA_top500_log_corrs_log2OE.loc['PAN', miRNAs] = ((miRNA_top_n_log_corrs_mtbase_counts['500'] * 1.0) / (miRNA_pancan_target_counts * 500.0 / N))
cancertype_miRNA_top500_log_corrs_log2OE.loc['PAN', miRNAs] = cancertype_miRNA_top500_log_corrs_log2OE.loc['PAN', miRNAs].map(lambda oe: m.log(oe + 1, 2))
cancertype_miRNA_top500_log_corrs_pval_bh_adj.loc['PAN'] = pancan_miRNA_top_n_log_corrs_mtbase_pvals_bh_adj['500']
cancertype_top500_log_corrs_mtbase_sig_miRNAs_mask.loc['PAN'] = (pancan_miRNA_top_n_log_corrs_mtbase_pvals_bh_adj['500'] <= 0.05)
cancertype_miRNA_top500_log_corrs_pval_2sbh_adj.loc['PAN'] = pancan_miRNA_top_n_log_corrs_mtbase_pvals_2sbh_adj['500']

In [53]:
cancertype_miRNA_top500_spearman_corrs_log2OE.loc['PAN', miRNAs] = ((miRNA_top_n_spearman_corrs_mtbase_counts['500'] * 1.0) / (miRNA_pancan_target_counts * 500.0 / N))
cancertype_miRNA_top500_spearman_corrs_log2OE.loc['PAN', miRNAs] = cancertype_miRNA_top500_spearman_corrs_log2OE.loc['PAN', miRNAs].map(lambda oe: m.log(oe + 1, 2))
cancertype_miRNA_top500_spearman_corrs_pval_bh_adj.loc['PAN'] = pancan_miRNA_top_n_spearman_corrs_mtbase_pvals_bh_adj['500']
cancertype_top500_spearman_corrs_mtbase_sig_miRNAs_mask.loc['PAN'] = (pancan_miRNA_top_n_spearman_corrs_mtbase_pvals_bh_adj['500'] <= 0.05)
cancertype_miRNA_top500_spearman_corrs_pval_2sbh_adj.loc['PAN'] = pancan_miRNA_top_n_spearman_corrs_mtbase_pvals_2sbh_adj['500']

In [149]:
for cancer_type in cancer_types:
  type_miRNAmRNA_log_corrs = get_corrs_df(bucket, 'explore/miRTar/pearson-corrs/data/mirtar-log-corrs_' + cancer_type + '.csv', 'miRNA')
  type_miRNAmRNA_spearman_corrs = get_corrs_df(bucket, 'explore/miRTar/spearman-corrs/data/mirtar-spearman-corrs_' + cancer_type + '.csv', 'miRNA')
  miRNAs = miRNAmRNAs_in_mirtarbase_mask.index.intersection(type_miRNAmRNA_log_corrs.index)
  mRNAs = miRNAmRNAs_in_mirtarbase_mask.columns.intersection(type_miRNAmRNA_log_corrs.columns)
  corrs_in_mirtarbase_mask = miRNAmRNAs_in_mirtarbase_mask.loc[miRNAs, mRNAs]
  miRNA_target_counts = corrs_in_mirtarbase_mask.sum(axis=1)
  N = mRNAs.size
  hypergeom_test_miRNA_n_rvs = pd.DataFrame(None, miRNAs, hypergeom_test_n_strs)
  hypergeom_test_miRNA_n_rvs = hypergeom_test_miRNA_n_rvs.apply(lambda n: miRNA_target_counts.map(lambda count: stats.hypergeom(N, count, int(n.name))))
  type_miRNA_top_n_log_corrs_mask = {}
  type_miRNA_top_n_spearman_corrs_mask = {}
  for n in hypergeom_test_n_strs:  
    type_miRNA_top_n_log_corrs_mask[n] = pd.DataFrame(False, miRNAs, mRNAs)
    type_miRNA_top_n_log_corrs = type_miRNAmRNA_log_corrs.loc[miRNAs, mRNAs].apply(lambda row: row.argsort()[:int(n)], axis=1)
    for miRNA in type_miRNA_top_n_log_corrs_mask[n].index:
      type_miRNA_top_n_log_corrs_mask[n].loc[miRNA][type_miRNA_top_n_log_corrs.loc[miRNA]] = True
    type_miRNA_top_n_spearman_corrs_mask[n] = pd.DataFrame(False, miRNAs, mRNAs)
    type_miRNA_top_n_spearman_corrs = type_miRNAmRNA_spearman_corrs.apply(lambda row: row.argsort()[:int(n)], axis=1)
    for miRNA in type_miRNA_top_n_spearman_corrs_mask[n].index:
      type_miRNA_top_n_spearman_corrs_mask[n].loc[miRNA][type_miRNA_top_n_spearman_corrs.loc[miRNA]] = True
  
  type_miRNA_top_n_log_corrs_mtbase_counts = pd.DataFrame(None, miRNAs, hypergeom_test_n_strs)
  type_miRNA_top_n_log_corrs_mtbase_counts = type_miRNA_top_n_log_corrs_mtbase_counts.apply(lambda n: (type_miRNA_top_n_log_corrs_mask[n.name] & corrs_in_mirtarbase_mask).sum(axis=1))
  type_miRNA_top_n_log_corrs_mtbase_pvals = type_miRNA_top_n_log_corrs_mtbase_counts.apply(lambda counts: miRNAs.map(lambda miRNA: get_pval(hypergeom_test_miRNA_n_rvs, counts, miRNA)))
  type_miRNA_top_n_log_corrs_mtbase_pvals_bh_adj = type_miRNA_top_n_log_corrs_mtbase_pvals.apply(benjaminihochberg)
  type_miRNA_top_n_log_corrs_mtbase_pvals_2sbh_adj = type_miRNA_top_n_log_corrs_mtbase_pvals.apply(benjaminihochberg_2stage)
  
  type_miRNA_top_n_spearman_corrs_mtbase_counts = pd.DataFrame(None, miRNAs, hypergeom_test_n_strs)
  type_miRNA_top_n_spearman_corrs_mtbase_counts = type_miRNA_top_n_spearman_corrs_mtbase_counts.apply(lambda n: (type_miRNA_top_n_spearman_corrs_mask[n.name] & corrs_in_mirtarbase_mask).sum(axis=1))
  type_miRNA_top_n_spearman_corrs_mtbase_pvals = type_miRNA_top_n_spearman_corrs_mtbase_counts.apply(lambda counts: miRNAs.map(lambda miRNA: get_pval(hypergeom_test_miRNA_n_rvs, counts, miRNA)))
  type_miRNA_top_n_spearman_corrs_mtbase_pvals_bh_adj = type_miRNA_top_n_spearman_corrs_mtbase_pvals.apply(benjaminihochberg)
  type_miRNA_top_n_spearman_corrs_mtbase_pvals_2sbh_adj = type_miRNA_top_n_spearman_corrs_mtbase_pvals.apply(benjaminihochberg_2stage)
  
  write_df_to_csv(type_miRNA_top_n_log_corrs_mtbase_counts, 'miRNA', 'gs://yfl-mirna/analysis/enrichment/miRNA_top-n-log-corrs-mtbase_counts_' + cancer_type + '.csv')
  write_df_to_csv(type_miRNA_top_n_spearman_corrs_mtbase_counts, 'miRNA', 'gs://yfl-mirna/analysis/enrichment/miRNA_top-n-spearman-corrs-mtbase_counts_' + cancer_type + '.csv')
  write_df_to_csv(type_miRNA_top_n_log_corrs_mtbase_pvals_bh_adj, 'miRNA', 'gs://yfl-mirna/analysis/enrichment/miRNA_top-n-log-corrs-mtbase_pvals-bh-adj_' + cancer_type + '.csv')
  write_df_to_csv(type_miRNA_top_n_spearman_corrs_mtbase_pvals_bh_adj, 'miRNA', 'gs://yfl-mirna/analysis/enrichment/miRNA_top-n-spearman-corrs-mtbase_pvals-bh-adj_' + cancer_type + '.csv')
  
  cancertype_n_log_corrs_sig_pval_counts_bh.loc[cancer_type] = (type_miRNA_top_n_log_corrs_mtbase_pvals_bh_adj <= 0.05).sum()
  cancertype_n_log_corrs_sig_pval_counts_2sbh.loc[cancer_type] = (type_miRNA_top_n_log_corrs_mtbase_pvals_2sbh_adj <= 0.05).sum()
  cancertype_n_spearman_corrs_sig_pval_counts_bh.loc[cancer_type] = (type_miRNA_top_n_spearman_corrs_mtbase_pvals_bh_adj <= 0.05).sum()
  cancertype_n_spearman_corrs_sig_pval_counts_2sbh.loc[cancer_type] = (type_miRNA_top_n_spearman_corrs_mtbase_pvals_2sbh_adj <= 0.05).sum()
  
  cancertype_n_log_corrs_sig_pval_counts_bh.loc[cancer_type] = (type_miRNA_top_n_log_corrs_mtbase_pvals_bh_adj <= 0.05).sum()
  cancertype_miRNA_top50_log_corrs_log2OE.loc[cancer_type, miRNAs] = ((type_miRNA_top_n_log_corrs_mtbase_counts['50'] * 1.0) / (miRNA_target_counts * 50.0 / N))
  cancertype_miRNA_top50_log_corrs_log2OE.loc[cancer_type, miRNAs] = cancertype_miRNA_top50_log_corrs_log2OE.loc[cancer_type, miRNAs].map(lambda oe: m.log(oe + 1, 2))
  cancertype_miRNA_top50_log_corrs_pval_bh_adj.loc[cancer_type] = type_miRNA_top_n_log_corrs_mtbase_pvals_bh_adj['50']
  cancertype_top50_log_corrs_mtbase_sig_miRNAs_mask.loc[cancer_type] = (type_miRNA_top_n_log_corrs_mtbase_pvals_bh_adj['50'] <= 0.05)
  
  cancertype_n_spearman_corrs_sig_pval_counts_bh.loc[cancer_type] = (type_miRNA_top_n_spearman_corrs_mtbase_pvals_bh_adj <= 0.05).sum()
  cancertype_miRNA_top50_spearman_corrs_log2OE.loc[cancer_type, miRNAs] = ((type_miRNA_top_n_spearman_corrs_mtbase_counts['50'] * 1.0) / (miRNA_target_counts * 50.0 / N))
  cancertype_miRNA_top50_spearman_corrs_log2OE.loc[cancer_type, miRNAs] = cancertype_miRNA_top50_spearman_corrs_log2OE.loc[cancer_type, miRNAs].map(lambda oe: m.log(oe + 1, 2))
  cancertype_miRNA_top50_spearman_corrs_pval_bh_adj.loc[cancer_type] = type_miRNA_top_n_spearman_corrs_mtbase_pvals_bh_adj['50']
  cancertype_top50_spearman_corrs_mtbase_sig_miRNAs_mask.loc[cancer_type] = (type_miRNA_top_n_spearman_corrs_mtbase_pvals_bh_adj['50'] <= 0.05)
  
  # TODO: Delete commented?
  #   cancertype_miRNA_top500_log_corrs_log2OE.loc[cancer_type, miRNAs] = ((type_miRNA_top_n_log_corrs_mtbase_counts['500'] * 1.0) / (miRNA_target_counts * 500.0 / N))
#   cancertype_miRNA_top500_log_corrs_log2OE.loc[cancer_type, miRNAs] = cancertype_miRNA_top500_log_corrs_log2OE.loc[cancer_type, miRNAs].map(lambda oe: m.log(oe + 1, 2))
#   cancertype_miRNA_top500_log_corrs_pval_bh_adj.loc[cancer_type] = type_miRNA_top_n_log_corrs_mtbase_pvals_bh_adj['500']
#   cancertype_top500_log_corrs_mtbase_sig_miRNAs_mask.loc[cancer_type] = (type_miRNA_top_n_log_corrs_mtbase_pvals_bh_adj['500'] <= 0.05)
#   cancertype_miRNA_top500_log_corrs_pval_2sbh_adj.loc[cancer_type] = type_miRNA_top_n_log_corrs_mtbase_pvals_2sbh_adj['500']
#   cancertype_miRNA_top500_spearman_corrs_log2OE.loc[cancer_type, miRNAs] = ((type_miRNA_top_n_spearman_corrs_mtbase_counts['500'] * 1.0) / (miRNA_target_counts * 500.0 / N))
#   cancertype_miRNA_top500_spearman_corrs_log2OE.loc[cancer_type, miRNAs] = cancertype_miRNA_top500_spearman_corrs_log2OE.loc[cancer_type, miRNAs].map(lambda oe: m.log(oe + 1, 2))
#   cancertype_miRNA_top500_spearman_corrs_pval_bh_adj.loc[cancer_type] = type_miRNA_top_n_spearman_corrs_mtbase_pvals_bh_adj['500']
#   cancertype_top500_spearman_corrs_mtbase_sig_miRNAs_mask.loc[cancer_type] = (type_miRNA_top_n_spearman_corrs_mtbase_pvals_bh_adj['500'] <= 0.05)
#   cancertype_miRNA_top500_spearman_corrs_pval_2sbh_adj.loc[cancer_type] = type_miRNA_top_n_spearman_corrs_mtbase_pvals_2sbh_adj['500']

In [None]:
write_df_to_csv(cancertype_n_log_corrs_sig_pval_counts_bh, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-n-log-corrs-mtbase_sig-bh-adj-pval-counts.csv')
write_df_to_csv(cancertype_n_log_corrs_sig_pval_counts_2sbh, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-n-log-corrs-mtbase_sig-2sbh-adj-pval-counts.csv')
write_df_to_csv(cancertype_n_spearman_corrs_sig_pval_counts_bh, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-n-spearman-corrs-mtbase_sig-bh-adj-pval-counts.csv')
write_df_to_csv(cancertype_n_spearman_corrs_sig_pval_counts_2sbh, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-n-spearman-corrs-mtbase_sig-2sbh-adj-pval-counts.csv')

In [None]:
write_df_to_csv(cancertype_miRNA_top50_log_corrs_log2OE, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_miRNA_top50-log-corrs_log2OE.csv')
write_df_to_csv(cancertype_miRNA_top50_log_corrs_pval_bh_adj, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_miRNA_top50-log-corrs_pval-bh-adj.csv')
write_df_to_csv(cancertype_top50_log_corrs_mtbase_sig_miRNAs_mask, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top50-log-corrs-mtbase_sig-bh-adj-pval_miRNAs_mask.csv')

In [None]:
write_df_to_csv(cancertype_miRNA_top50_spearman_corrs_log2OE, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_miRNA_top50-spearman-corrs_log2OE.csv')
write_df_to_csv(cancertype_miRNA_top50_spearman_corrs_pval_bh_adj, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_miRNA_top50-spearman-corrs_pval-bh-adj.csv')
write_df_to_csv(cancertype_top50_spearman_corrs_mtbase_sig_miRNAs_mask, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top50-spearman-corrs-mtbase_sig-bh-adj-pval_miRNAs_mask.csv')

#### TODO: Delete next two?

In [None]:
write_df_to_csv(cancertype_miRNA_top500_log_corrs_log2OE, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_miRNA_top500-log-corrs_log2OE.csv')
write_df_to_csv(cancertype_miRNA_top500_log_corrs_pval_bh_adj, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_miRNA_top500-log-corrs_pval-bh-adj.csv')
write_df_to_csv(cancertype_top500_log_corrs_mtbase_sig_miRNAs_mask, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top500-log-corrs-mtbase_sig-bh-adj-pval_miRNAs_mask.csv')
write_df_to_csv(cancertype_miRNA_top500_log_corrs_pval_2sbh_adj, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_miRNA_top500-log-corrs_pval-2sbh-adj.csv')

In [None]:
write_df_to_csv(cancertype_miRNA_top500_spearman_corrs_log2OE, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_miRNA_top500-spearman-corrs_log2OE.csv')
write_df_to_csv(cancertype_miRNA_top500_spearman_corrs_pval_bh_adj, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_miRNA_top500-spearman-corrs_pval-bh-adj.csv')
write_df_to_csv(cancertype_top500_spearman_corrs_mtbase_sig_miRNAs_mask, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top500-spearman-corrs-mtbase_sig-bh-adj-pval_miRNAs_mask.csv')
write_df_to_csv(cancertype_miRNA_top500_spearman_corrs_pval_2sbh_adj, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_miRNA_top500-spearman-corrs_pval-2sbh-adj.csv')

### Only strong miRTarBase relationships (at least 1 non-weak-support-type entry)

In [44]:
cancertype_n_log_corrs_mtb_strong_sig_pval_counts_bh = pd.DataFrame(None, cancer_types_list, hypergeom_test_n_strs)
#cancertype_n_log_corrs_mtb_strong_sig_pval_counts_2sbh = pd.DataFrame(None, cancer_types_list, hypergeom_test_n_strs)

In [45]:
cancertype_n_spearman_corrs_mtb_strong_sig_pval_counts_bh = pd.DataFrame(None, cancer_types_list, hypergeom_test_n_strs)
#cancertype_n_spearman_corrs_mtb_strong_sig_pval_counts_2sbh = pd.DataFrame(None, cancer_types_list, hypergeom_test_n_strs)

In [46]:
cancertype_miRNA_top50_log_corrs_mtb_strong_log2OE = pd.DataFrame(None, cancer_types_list, miRNAmRNAs_mirtarbase_strong_mask.index)
cancertype_miRNA_top50_log_corrs_mtb_strong_pval_bh_adj = pd.DataFrame(None, cancer_types_list, miRNAmRNAs_mirtarbase_strong_mask.index)
cancertype_top50_log_corrs_mtb_strong_sig_miRNAs_mask = pd.DataFrame(False, cancer_types_list, miRNAmRNAs_mirtarbase_strong_mask.index)

In [47]:
cancertype_miRNA_top50_spearman_corrs_mtb_strong_log2OE = pd.DataFrame(None, cancer_types_list, miRNAmRNAs_mirtarbase_strong_mask.index)
cancertype_miRNA_top50_spearman_corrs_mtb_strong_pval_bh_adj = pd.DataFrame(None, cancer_types_list, miRNAmRNAs_mirtarbase_strong_mask.index)
cancertype_top50_spearman_corrs_mtb_strong_sig_miRNAs_mask = pd.DataFrame(False, cancer_types_list, miRNAmRNAs_mirtarbase_strong_mask.index)

#### TODO: Delete next two cells?

In [50]:
cancertype_miRNA_top500_log_corrs_mtb_strong_log2OE = pd.DataFrame(None, cancer_types_list, miRNAmRNAs_mirtarbase_strong_mask.index)
cancertype_miRNA_top500_log_corrs_mtb_strong_pval_bh_adj = pd.DataFrame(None, cancer_types_list, miRNAmRNAs_mirtarbase_strong_mask.index)
cancertype_top500_log_corrs_mtb_strong_sig_miRNAs_mask = pd.DataFrame(False, cancer_types_list, miRNAmRNAs_mirtarbase_strong_mask.index)
cancertype_miRNA_top500_log_corrs_mtb_strong_pval_2sbh_adj = pd.DataFrame(None, cancer_types_list, miRNAmRNAs_mirtarbase_strong_mask.index)

In [51]:
cancertype_miRNA_top500_spearman_corrs_mtb_strong_log2OE = pd.DataFrame(None, cancer_types_list, miRNAmRNAs_mirtarbase_strong_mask.index)
cancertype_miRNA_top500_spearman_corrs_mtb_strong_pval_bh_adj = pd.DataFrame(None, cancer_types_list, miRNAmRNAs_mirtarbase_strong_mask.index)
cancertype_top500_spearman_corrs_mtb_strong_sig_miRNAs_mask = pd.DataFrame(False, cancer_types_list, miRNAmRNAs_mirtarbase_strong_mask.index)
cancertype_miRNA_top500_spearman_corrs_mtb_strong_pval_2sbh_adj = pd.DataFrame(None, cancer_types_list, miRNAmRNAs_mirtarbase_strong_mask.index)

#### Pan-cancer

In [76]:
miRNAs = miRNAmRNAs_mirtarbase_strong_mask.index # making use of this coinciding with miRNAmRNA_log_corrs.index
mRNAs = miRNAmRNA_pancan_corrs_mirtarbase_strong_mask.columns.intersection(miRNAmRNA_log_corrs.columns)
pancan_corrs_in_mtbase_strong_mask = miRNAmRNA_pancan_corrs_mirtarbase_strong_mask.loc[miRNAs, mRNAs]
miRNA_pancan_strong_target_counts = pancan_corrs_in_mtbase_strong_mask.sum(axis=1)

In [77]:
N = mRNAs.size

In [78]:
hypergeom_test_miRNA_n_rvs = pd.DataFrame(None, miRNAs, hypergeom_test_n_strs)
hypergeom_test_miRNA_n_rvs = hypergeom_test_miRNA_n_rvs.apply(lambda n: miRNA_pancan_strong_target_counts.map(lambda count: stats.hypergeom(N, count, int(n.name))))

In [55]:
miRNA_top_n_log_corrs_mask = {}
miRNA_top_n_spearman_corrs_mask = {}

In [56]:
for n in hypergeom_test_n_strs:
  miRNA_top_n_log_corrs_mask[n] = pd.DataFrame(False, miRNAs, mRNAs)
  miRNA_top_n_log_corrs = miRNAmRNA_log_corrs.loc[miRNAs, mRNAs].apply(lambda row: row.argsort()[:int(n)], axis=1)
  for miRNA in miRNA_top_n_log_corrs_mask[n].index:
    miRNA_top_n_log_corrs_mask[n].loc[miRNA][miRNA_top_n_log_corrs.loc[miRNA]] = True
  miRNA_top_n_spearman_corrs_mask[n] = pd.DataFrame(False, miRNAs, mRNAs)
  miRNA_top_n_spearman_corrs = miRNAmRNA_spearman_corrs.loc[miRNAs, mRNAs].apply(lambda row: row.argsort()[:int(n)], axis=1)
  for miRNA in miRNA_top_n_spearman_corrs_mask[n].index:
    miRNA_top_n_spearman_corrs_mask[n].loc[miRNA][miRNA_top_n_spearman_corrs.loc[miRNA]] = True

In [57]:
miRNA_top_n_log_corrs_mtb_strong_counts = pd.DataFrame(None, miRNAs, hypergeom_test_n_strs)
miRNA_top_n_log_corrs_mtb_strong_counts = miRNA_top_n_log_corrs_mtb_strong_counts.apply(lambda n: (miRNA_top_n_log_corrs_mask[n.name] & pancan_corrs_in_mtbase_strong_mask).sum(axis=1))
miRNA_top_n_log_corrs_mtb_strong_pvals = miRNA_top_n_log_corrs_mtb_strong_counts.apply(lambda counts: miRNAs.map(lambda miRNA: get_pval(hypergeom_test_miRNA_n_rvs, counts, miRNA)))

In [58]:
miRNA_top_n_spearman_corrs_mtb_strong_counts = pd.DataFrame(None, miRNAs, hypergeom_test_n_strs)
miRNA_top_n_spearman_corrs_mtb_strong_counts = miRNA_top_n_spearman_corrs_mtb_strong_counts.apply(lambda n: (miRNA_top_n_spearman_corrs_mask[n.name] & pancan_corrs_in_mtbase_strong_mask).sum(axis=1))
miRNA_top_n_spearman_corrs_mtb_strong_pvals = miRNA_top_n_spearman_corrs_mtb_strong_counts.apply(lambda counts: miRNAs.map(lambda miRNA: get_pval(hypergeom_test_miRNA_n_rvs, counts, miRNA)))

In [59]:
pancan_miRNA_top_n_log_corrs_mtb_strong_pvals_bh_adj = miRNA_top_n_log_corrs_mtb_strong_pvals.apply(benjaminihochberg)
pancan_miRNA_top_n_log_corrs_mtb_strong_pvals_2sbh_adj = miRNA_top_n_log_corrs_mtb_strong_pvals.apply(benjaminihochberg_2stage)
pancan_miRNA_top_n_spearman_corrs_mtb_strong_pvals_bh_adj = miRNA_top_n_spearman_corrs_mtb_strong_pvals.apply(benjaminihochberg)
pancan_miRNA_top_n_spearman_corrs_mtb_strong_pvals_2sbh_adj = miRNA_top_n_spearman_corrs_mtb_strong_pvals.apply(benjaminihochberg_2stage)

In [None]:
write_df_to_csv(miRNA_top_n_log_corrs_mtb_strong_counts, 'miRNA', 'gs://yfl-mirna/analysis/enrichment/miRNA_top-n-log-corrs-mtb-strong_counts.csv')
write_df_to_csv(miRNA_top_n_spearman_corrs_mtb_strong_counts, 'miRNA', 'gs://yfl-mirna/analysis/enrichment/miRNA_top-n-spearman-corrs-mtb-strong_counts.csv')
write_df_to_csv(pancan_miRNA_top_n_log_corrs_mtb_strong_pvals_bh_adj, 'miRNA', 'gs://yfl-mirna/analysis/enrichment/miRNA_top-n-log-corrs-mtb-strong_pvals-bh-adj.csv')
write_df_to_csv(pancan_miRNA_top_n_spearman_corrs_mtb_strong_pvals_bh_adj, 'miRNA', 'gs://yfl-mirna/analysis/enrichment/miRNA_top-n-spearman-corrs-mtb-strong_pvals-bh-adj.csv')

In [60]:
cancertype_n_log_corrs_mtb_strong_sig_pval_counts_bh.loc['PAN'] = (pancan_miRNA_top_n_log_corrs_mtb_strong_pvals_bh_adj <= 0.05).sum()
cancertype_n_log_corrs_mtb_strong_sig_pval_counts_2sbh.loc['PAN'] = (pancan_miRNA_top_n_log_corrs_mtb_strong_pvals_2sbh_adj <= 0.05).sum()

In [61]:
cancertype_n_spearman_corrs_mtb_strong_sig_pval_counts_bh.loc['PAN'] = (pancan_miRNA_top_n_spearman_corrs_mtb_strong_pvals_bh_adj <= 0.05).sum()
cancertype_n_spearman_corrs_mtb_strong_sig_pval_counts_2sbh.loc['PAN'] = (pancan_miRNA_top_n_spearman_corrs_mtb_strong_pvals_2sbh_adj <= 0.05).sum()

In [53]:
cancertype_miRNA_top50_log_corrs_mtb_strong_log2OE.loc['PAN', miRNAs] = ((miRNA_top_n_log_corrs_mtb_strong_counts['50'] * 1.0) / (miRNA_pancan_strong_target_counts * 50.0 / N))
cancertype_miRNA_top50_log_corrs_mtb_strong_log2OE.loc['PAN', miRNAs] = cancertype_miRNA_top50_log_corrs_mtb_strong_log2OE.loc['PAN', miRNAs].map(lambda oe: m.log(oe + 1, 2))
cancertype_miRNA_top50_log_corrs_mtb_strong_pval_bh_adj.loc['PAN'] = pancan_miRNA_top_n_log_corrs_mtb_strong_pvals_bh_adj['50']
cancertype_top50_log_corrs_mtb_strong_sig_miRNAs_mask.loc['PAN'] = (pancan_miRNA_top_n_log_corrs_mtb_strong_pvals_bh_adj['50'] <= 0.05)

In [54]:
cancertype_miRNA_top50_spearman_corrs_mtb_strong_log2OE.loc['PAN', miRNAs] = ((miRNA_top_n_spearman_corrs_mtb_strong_counts['50'] * 1.0) / (miRNA_pancan_strong_target_counts * 50.0 / N))
cancertype_miRNA_top50_spearman_corrs_mtb_strong_log2OE.loc['PAN', miRNAs] = cancertype_miRNA_top50_spearman_corrs_mtb_strong_log2OE.loc['PAN', miRNAs].map(lambda oe: m.log(oe + 1, 2))
cancertype_miRNA_top50_spearman_corrs_mtb_strong_pval_bh_adj.loc['PAN'] = pancan_miRNA_top_n_spearman_corrs_mtb_strong_pvals_bh_adj['50']
cancertype_top50_spearman_corrs_mtb_strong_sig_miRNAs_mask.loc['PAN'] = (pancan_miRNA_top_n_spearman_corrs_mtb_strong_pvals_bh_adj['50'] <= 0.05)

#### TODO: Delete next two cells (and subsequence references)?

In [62]:
cancertype_miRNA_top500_log_corrs_mtb_strong_log2OE.loc['PAN', miRNAs] = ((miRNA_top_n_log_corrs_mtb_strong_counts['500'] * 1.0) / (miRNA_pancan_strong_target_counts * 500.0 / N))
cancertype_miRNA_top500_log_corrs_mtb_strong_log2OE.loc['PAN', miRNAs] = cancertype_miRNA_top500_log_corrs_mtb_strong_log2OE.loc['PAN', miRNAs].map(lambda oe: m.log(oe + 1, 2))
cancertype_miRNA_top500_log_corrs_mtb_strong_pval_bh_adj.loc['PAN'] = pancan_miRNA_top_n_log_corrs_mtb_strong_pvals_bh_adj['500']
cancertype_top500_log_corrs_mtb_strong_sig_miRNAs_mask.loc['PAN'] = (pancan_miRNA_top_n_log_corrs_mtb_strong_pvals_bh_adj['500'] <= 0.05)
cancertype_miRNA_top500_log_corrs_mtb_strong_pval_2sbh_adj.loc['PAN'] = pancan_miRNA_top_n_log_corrs_mtb_strong_pvals_2sbh_adj['500']

In [63]:
cancertype_miRNA_top500_spearman_corrs_mtb_strong_log2OE.loc['PAN', miRNAs] = ((miRNA_top_n_spearman_corrs_mtb_strong_counts['500'] * 1.0) / (miRNA_pancan_strong_target_counts * 500.0 / N))
cancertype_miRNA_top500_spearman_corrs_mtb_strong_log2OE.loc['PAN', miRNAs] = cancertype_miRNA_top500_spearman_corrs_mtb_strong_log2OE.loc['PAN', miRNAs].map(lambda oe: m.log(oe + 1, 2))
cancertype_miRNA_top500_spearman_corrs_mtb_strong_pval_bh_adj.loc['PAN'] = pancan_miRNA_top_n_spearman_corrs_mtb_strong_pvals_bh_adj['500']
cancertype_top500_spearman_corrs_mtb_strong_sig_miRNAs_mask.loc['PAN'] = (pancan_miRNA_top_n_spearman_corrs_mtb_strong_pvals_bh_adj['500'] <= 0.05)
cancertype_miRNA_top500_spearman_corrs_mtb_strong_pval_2sbh_adj.loc['PAN'] = pancan_miRNA_top_n_spearman_corrs_mtb_strong_pvals_2sbh_adj['500']

In [None]:
for cancer_type in cancer_types:
  type_miRNAmRNA_log_corrs = get_corrs_df(bucket, 'explore/miRTar/pearson-corrs/data/mirtar-log-corrs_' + cancer_type + '.csv', 'miRNA')
  type_miRNAmRNA_spearman_corrs = get_corrs_df(bucket, 'explore/miRTar/spearman-corrs/data/mirtar-spearman-corrs_' + cancer_type + '.csv', 'miRNA')
  miRNAs = miRNAmRNAs_mirtarbase_strong_mask.index.intersection(type_miRNAmRNA_log_corrs.index)
  mRNAs = miRNAmRNAs_mirtarbase_strong_mask.columns.intersection(type_miRNAmRNA_log_corrs.columns)
  corrs_in_mirtarbase_strong_mask = miRNAmRNAs_mirtarbase_strong_mask.loc[miRNAs, mRNAs]
  miRNA_strong_target_counts = corrs_in_mirtarbase_strong_mask.sum(axis=1)
  N = mRNAs.size
  hypergeom_test_miRNA_n_rvs = pd.DataFrame(None, miRNAs, hypergeom_test_n_strs)
  hypergeom_test_miRNA_n_rvs = hypergeom_test_miRNA_n_rvs.apply(lambda n: miRNA_strong_target_counts.map(lambda count: stats.hypergeom(N, count, int(n.name))))
  type_miRNA_top_n_log_corrs_mask = {}
  type_miRNA_top_n_spearman_corrs_mask = {}
  for n in hypergeom_test_n_strs:  
    type_miRNA_top_n_log_corrs_mask[n] = pd.DataFrame(False, miRNAs, mRNAs)
    type_miRNA_top_n_log_corrs = type_miRNAmRNA_log_corrs.loc[miRNAs, mRNAs].apply(lambda row: row.argsort()[:int(n)], axis=1)
    for miRNA in type_miRNA_top_n_log_corrs_mask[n].index:
      type_miRNA_top_n_log_corrs_mask[n].loc[miRNA][type_miRNA_top_n_log_corrs.loc[miRNA]] = True
    type_miRNA_top_n_spearman_corrs_mask[n] = pd.DataFrame(False, miRNAs, mRNAs)
    type_miRNA_top_n_spearman_corrs = type_miRNAmRNA_spearman_corrs.apply(lambda row: row.argsort()[:int(n)], axis=1)
    for miRNA in type_miRNA_top_n_spearman_corrs_mask[n].index:
      type_miRNA_top_n_spearman_corrs_mask[n].loc[miRNA][type_miRNA_top_n_spearman_corrs.loc[miRNA]] = True
  
  type_miRNA_top_n_log_corrs_mtb_strong_counts = pd.DataFrame(None, miRNAs, hypergeom_test_n_strs)
  type_miRNA_top_n_log_corrs_mtb_strong_counts = type_miRNA_top_n_log_corrs_mtb_strong_counts.apply(lambda n: (type_miRNA_top_n_log_corrs_mask[n.name] & corrs_in_mirtarbase_strong_mask).sum(axis=1))
  type_miRNA_top_n_log_corrs_mtb_strong_pvals = type_miRNA_top_n_log_corrs_mtb_strong_counts.apply(lambda counts: miRNAs.map(lambda miRNA: get_pval(hypergeom_test_miRNA_n_rvs, counts, miRNA)))
  type_miRNA_top_n_log_corrs_mtb_strong_pvals_bh_adj = type_miRNA_top_n_log_corrs_mtb_strong_pvals.apply(benjaminihochberg)
  type_miRNA_top_n_log_corrs_mtb_strong_pvals_2sbh_adj = type_miRNA_top_n_log_corrs_mtb_strong_pvals.apply(benjaminihochberg_2stage)
  
  type_miRNA_top_n_spearman_corrs_mtb_strong_counts = pd.DataFrame(None, miRNAs, hypergeom_test_n_strs)
  type_miRNA_top_n_spearman_corrs_mtb_strong_counts = type_miRNA_top_n_spearman_corrs_mtb_strong_counts.apply(lambda n: (type_miRNA_top_n_spearman_corrs_mask[n.name] & corrs_in_mirtarbase_strong_mask).sum(axis=1))
  type_miRNA_top_n_spearman_corrs_mtb_strong_pvals = type_miRNA_top_n_spearman_corrs_mtb_strong_counts.apply(lambda counts: miRNAs.map(lambda miRNA: get_pval(hypergeom_test_miRNA_n_rvs, counts, miRNA)))
  type_miRNA_top_n_spearman_corrs_mtb_strong_pvals_bh_adj = type_miRNA_top_n_spearman_corrs_mtb_strong_pvals.apply(benjaminihochberg)
  type_miRNA_top_n_spearman_corrs_mtb_strong_pvals_2sbh_adj = type_miRNA_top_n_spearman_corrs_mtb_strong_pvals.apply(benjaminihochberg_2stage)
  
  write_df_to_csv(type_miRNA_top_n_log_corrs_mtb_strong_counts, 'miRNA', 'gs://yfl-mirna/analysis/enrichment/miRNA_top-n-log-corrs-mtb-strong_counts_' + cancer_type + '.csv')
  write_df_to_csv(type_miRNA_top_n_spearman_corrs_mtb_strong_counts, 'miRNA', 'gs://yfl-mirna/analysis/enrichment/miRNA_top-n-spearman-corrs-mtb-strong_counts_' + cancer_type + '.csv')
  write_df_to_csv(type_miRNA_top_n_log_corrs_mtb_strong_pvals_bh_adj, 'miRNA', 'gs://yfl-mirna/analysis/enrichment/miRNA_top-n-log-corrs-mtb-strong_pvals-bh-adj_' + cancer_type + '.csv')
  write_df_to_csv(type_miRNA_top_n_spearman_corrs_mtb_strong_pvals_bh_adj, 'miRNA', 'gs://yfl-mirna/analysis/enrichment/miRNA_top-n-spearman-corrs-mtb-strong_pvals-bh-adj_' + cancer_type + '.csv')
  
  cancertype_n_log_corrs_mtb_strong_sig_pval_counts_bh.loc[cancer_type] = (type_miRNA_top_n_log_corrs_mtb_strong_pvals_bh_adj <= 0.05).sum()
  cancertype_miRNA_top50_log_corrs_mtb_strong_log2OE.loc[cancer_type, miRNAs] = ((type_miRNA_top_n_log_corrs_mtb_strong_counts['50'] * 1.0) / (miRNA_strong_target_counts * 50.0 / N))
  cancertype_miRNA_top50_log_corrs_mtb_strong_log2OE.loc[cancer_type, miRNAs] = cancertype_miRNA_top50_log_corrs_mtb_strong_log2OE.loc[cancer_type, miRNAs].map(lambda oe: m.log(oe + 1, 2))
  cancertype_miRNA_top50_log_corrs_mtb_strong_pval_bh_adj.loc[cancer_type] = type_miRNA_top_n_log_corrs_mtb_strong_pvals_bh_adj['50']
  cancertype_top50_log_corrs_mtb_strong_sig_miRNAs_mask.loc[cancer_type] = (type_miRNA_top_n_log_corrs_mtb_strong_pvals_bh_adj['50'] <= 0.05)
  
  cancertype_n_spearman_corrs_mtb_strong_sig_pval_counts_bh.loc[cancer_type] = (type_miRNA_top_n_spearman_corrs_mtb_strong_pvals_bh_adj <= 0.05).sum()
  cancertype_miRNA_top50_spearman_corrs_mtb_strong_log2OE.loc[cancer_type, miRNAs] = ((type_miRNA_top_n_spearman_corrs_mtb_strong_counts['50'] * 1.0) / (miRNA_strong_target_counts * 50.0 / N))
  cancertype_miRNA_top50_spearman_corrs_mtb_strong_log2OE.loc[cancer_type, miRNAs] = cancertype_miRNA_top50_spearman_corrs_mtb_strong_log2OE.loc[cancer_type, miRNAs].map(lambda oe: m.log(oe + 1, 2))
  cancertype_miRNA_top50_spearman_corrs_mtb_strong_pval_bh_adj.loc[cancer_type] = type_miRNA_top_n_spearman_corrs_mtb_strong_pvals_bh_adj['50']
  cancertype_top50_spearman_corrs_mtb_strong_sig_miRNAs_mask.loc[cancer_type] = (type_miRNA_top_n_spearman_corrs_mtb_strong_pvals_bh_adj['50'] <= 0.05)
  
  # TODO: Delete following?
#   cancertype_n_log_corrs_mtb_strong_sig_pval_counts_bh.loc[cancer_type] = (type_miRNA_top_n_log_corrs_mtb_strong_pvals_bh_adj <= 0.05).sum()
#   cancertype_n_log_corrs_mtb_strong_sig_pval_counts_2sbh.loc[cancer_type] = (type_miRNA_top_n_log_corrs_mtb_strong_pvals_2sbh_adj <= 0.05).sum()
#   cancertype_miRNA_top500_log_corrs_mtb_strong_log2OE.loc[cancer_type, miRNAs] = ((type_miRNA_top_n_log_corrs_mtb_strong_counts['500'] * 1.0) / (miRNA_strong_target_counts * 500.0 / N))
#   cancertype_miRNA_top500_log_corrs_mtb_strong_log2OE.loc[cancer_type, miRNAs] = cancertype_miRNA_top500_log_corrs_mtb_strong_log2OE.loc[cancer_type, miRNAs].map(lambda oe: m.log(oe + 1, 2))
#   cancertype_miRNA_top500_log_corrs_mtb_strong_pval_bh_adj.loc[cancer_type] = type_miRNA_top_n_log_corrs_mtb_strong_pvals_bh_adj['500']
#   cancertype_top500_log_corrs_mtb_strong_sig_miRNAs_mask.loc[cancer_type] = (type_miRNA_top_n_log_corrs_mtb_strong_pvals_bh_adj['500'] <= 0.05)
#   cancertype_miRNA_top500_log_corrs_mtb_strong_pval_2sbh_adj.loc[cancer_type] = type_miRNA_top_n_log_corrs_mtb_strong_pvals_2sbh_adj['500']
  
#   cancertype_n_spearman_corrs_mtb_strong_sig_pval_counts_bh.loc[cancer_type] = (type_miRNA_top_n_spearman_corrs_mtb_strong_pvals_bh_adj <= 0.05).sum()
#   cancertype_n_spearman_corrs_mtb_strong_sig_pval_counts_2sbh.loc[cancer_type] = (type_miRNA_top_n_spearman_corrs_mtb_strong_pvals_2sbh_adj <= 0.05).sum()
#   cancertype_miRNA_top500_spearman_corrs_mtb_strong_log2OE.loc[cancer_type, miRNAs] = ((type_miRNA_top_n_spearman_corrs_mtb_strong_counts['500'] * 1.0) / (miRNA_strong_target_counts * 500.0 / N))
#   cancertype_miRNA_top500_spearman_corrs_mtb_strong_log2OE.loc[cancer_type, miRNAs] = cancertype_miRNA_top500_spearman_corrs_mtb_strong_log2OE.loc[cancer_type, miRNAs].map(lambda oe: m.log(oe + 1, 2))
#   cancertype_miRNA_top500_spearman_corrs_mtb_strong_pval_bh_adj.loc[cancer_type] = type_miRNA_top_n_spearman_corrs_mtb_strong_pvals_bh_adj['500']
#   cancertype_top500_spearman_corrs_mtb_strong_sig_miRNAs_mask.loc[cancer_type] = (type_miRNA_top_n_spearman_corrs_mtb_strong_pvals_bh_adj['500'] <= 0.05)
#   cancertype_miRNA_top500_spearman_corrs_mtb_strong_pval_2sbh_adj.loc[cancer_type] = type_miRNA_top_n_spearman_corrs_mtb_strong_pvals_2sbh_adj['500']

In [None]:
write_df_to_csv(cancertype_n_log_corrs_mtb_strong_sig_pval_counts_bh, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-n-log-corrs-mtb-strong_sig-bh-adj-pval-counts.csv')
write_df_to_csv(cancertype_n_log_corrs_mtb_strong_sig_pval_counts_2sbh, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-n-log-corrs-mtb-strong_sig-2sbh-adj-pval-counts.csv')
write_df_to_csv(cancertype_n_spearman_corrs_mtb_strong_sig_pval_counts_bh, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-n-spearman-corrs-mtb-strong_sig-bh-adj-pval-counts.csv')
write_df_to_csv(cancertype_n_spearman_corrs_mtb_strong_sig_pval_counts_2sbh, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top-n-spearman-corrs-mtb-strong_sig-2sbh-adj-pval-counts.csv')

### TODO: Delete next 2?

In [None]:
write_df_to_csv(cancertype_miRNA_top500_log_corrs_mtb_strong_log2OE, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_miRNA_top500-log-corrs-mtb-strong_log2OE.csv')
write_df_to_csv(cancertype_miRNA_top500_log_corrs_mtb_strong_pval_bh_adj, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_miRNA_top500-log-corrs-mtb-strong_pval-bh-adj.csv')
write_df_to_csv(cancertype_top500_log_corrs_mtb_strong_sig_miRNAs_mask, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top500-log-corrs-mtb-strong_sig-bh-adj-pval_miRNAs_mask.csv')
write_df_to_csv(cancertype_miRNA_top500_log_corrs_mtb_strong_pval_2sbh_adj, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_miRNA_top500-log-corrs-mtb-strong_pval-2sbh-adj.csv')

In [None]:
write_df_to_csv(cancertype_miRNA_top500_spearman_corrs_mtb_strong_log2OE, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_miRNA_top500-spearman-corrs-mtb-strong_log2OE.csv')
write_df_to_csv(cancertype_miRNA_top500_spearman_corrs_mtb_strong_pval_bh_adj, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_miRNA_top500-spearman-corrs-mtb-strong_pval-bh-adj.csv')
write_df_to_csv(cancertype_top500_spearman_corrs_mtb_strong_sig_miRNAs_mask, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top500-spearman-corrs-mtb-strong-mtbase_sig-bh-adj-pval_miRNAs_mask.csv')
write_df_to_csv(cancertype_miRNA_top500_spearman_corrs_mtb_strong_pval_2sbh_adj, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_miRNA_top500-spearman-corrs-mtb-strong_pval-2sbh-adj.csv')

### Summary, illustration & interpretation

In [79]:
cancertype_miRNA_top50_spearman_corrs_log2OE = read_file(bucket, 'analysis/enrichment/cancer-type_miRNA_top50-spearman-corrs_log2OE.csv')
cancertype_miRNA_top50_spearman_corrs_log2OE.set_index('cancer_type', inplace=True)

In [None]:
cancertype_miRNA_top50_spearman_corrs_log2OE.T.describe()

In [74]:
cancertype_miRNA_top50_spearman_corrs_log2OE.describe().loc['max'].describe()

count    738.000000
mean       2.731844
std        0.987079
min        0.000000
25%        2.212197
50%        2.740034
75%        3.261035
max        6.377193
Name: max, dtype: float64

In [75]:
cancertype_miRNA_top50_spearman_corrs_pval_bh_adj = read_file(bucket, 'analysis/enrichment/cancer-type_miRNA_top50-spearman-corrs_pval-bh-adj.csv')
cancertype_miRNA_top50_spearman_corrs_pval_bh_adj.set_index('cancer_type', inplace=True)

In [76]:
cancertype_miRNA_top50_spearman_corrs_pval_bh_adj.T.describe()

cancer_type,CHOL,DLBC,UCS,KICH,ACC,UVM,MESO,SKCM,THYM,TGCT,...,LUSC,PRAD,KIRC,THCA,LUAD,LGG,HNSC,UCEC,BRCA,PAN
count,743.0,743.0,743.0,743.0,743.0,743.0,743.0,743.0,743.0,743.0,...,743.0,743.0,743.0,743.0,743.0,743.0,743.0,743.0,743.0,0.0
mean,0.355704,0.337603,0.334062,0.330386,0.340698,0.355078,0.344907,0.3502,0.318444,0.336156,...,0.34053,0.269108,0.328191,0.331739,0.324232,0.335185,0.317737,0.321239,0.307422,
std,0.129295,0.125322,0.129911,0.123657,0.120742,0.120554,0.129286,0.125572,0.134244,0.113546,...,0.123884,0.145092,0.122535,0.111594,0.11871,0.124219,0.118155,0.147792,0.128193,
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
25%,0.29289,0.276172,0.279049,0.276172,0.287924,0.314671,0.28768,0.296861,0.234139,0.288361,...,0.283969,0.170979,0.266224,0.272008,0.270555,0.278631,0.23964,0.221016,0.227638,
50%,0.328513,0.299071,0.311302,0.299071,0.310262,0.322101,0.321237,0.318747,0.303405,0.308245,...,0.313669,0.265807,0.308818,0.304807,0.298275,0.308929,0.294361,0.308123,0.289474,
75%,0.388499,0.366248,0.362467,0.356727,0.361601,0.39542,0.374837,0.371489,0.356727,0.360542,...,0.363761,0.333227,0.357924,0.362467,0.359439,0.36269,0.348323,0.375627,0.352375,
max,0.933997,0.893429,0.915975,0.915975,0.980558,0.874541,0.896904,0.85578,0.878648,0.869698,...,0.97376,0.844694,0.835492,0.846741,0.930333,0.900747,0.900747,0.94105,0.853199,


### TODO: Move remarks; delete file-read cells later

#### Remarks
- Using Benjamini-Hochberg instead of the two-stage version to control the false discovery rate (FDR) because:
  - Going by e.g. log2(observed/expected) for overlap of per-miRNA top 50 mRNA expression anticorrelations with miRTarBase interactions, which have a median of 0 and 75th percentile close to 1 for all cancer types, miRNAs with significant overlap seem sparse (# insignificant close to # of miRNAs)
    - see e.g. http://www.stat.cmu.edu/~genovese/talks/hannover1-04.pdf: "BH performs best in very sparse cases"
- Counts of miRNAs with significant overlap for various values of n, adjusted using Benjamini-Hochberg or two-stage Benjamini-Hochberg, are very similar

In [80]:
cancertype_n_log_corrs_sig_pval_counts_bh = read_file(bucket, 'analysis/enrichment/cancer-type_top-n-log-corrs-mtbase_sig-bh-adj-pval-counts.csv')
cancertype_n_log_corrs_sig_pval_counts_bh.set_index('cancer_type', inplace=True)

In [81]:
cancertype_n_spearman_corrs_sig_pval_counts_bh = read_file(bucket, 'analysis/enrichment/cancer-type_top-n-spearman-corrs-mtbase_sig-bh-adj-pval-counts.csv')
cancertype_n_spearman_corrs_sig_pval_counts_bh.set_index('cancer_type', inplace=True)

In [80]:
cancertype_n_spearman_corrs_sig_pval_counts_bh.describe()

Unnamed: 0,50,100,250,500,750,1000
count,32.0,32.0,32.0,32.0,32.0,32.0
mean,10.375,10.46875,11.53125,13.40625,16.25,16.90625
std,7.169604,9.224737,12.753202,18.023031,22.426942,25.33563
min,5.0,5.0,5.0,5.0,5.0,5.0
25%,6.75,6.0,6.0,5.75,5.0,5.0
50%,8.0,8.0,7.0,6.0,7.0,6.0
75%,10.0,9.25,9.0,8.0,8.0,8.0
max,42.0,50.0,70.0,90.0,106.0,113.0


#### Observations
- With both Pearson and Spearman correlations, the number of miRNAs with significant top-n anticorrelation overlap with miRTarBase interactions is small, and decreases (or rises then falls) with increasing n in most cancer types, with a few highly visible exceptions
- Exceptions: READ, ESCA, OV (notable exception—Pearson correlations only), STAD, COAD, PRAD, UCEC, PAN (pan-cancer). The numbers of such miRNAs in these cases also start higher than average to begin with.

#### Follow-up questions
- How do (miRNA and mRNA) expression profiles compare across the two groups of cancers described above? In particular, how do the expression levels of miRNAs testing as significant compare?
- Similarly, how do the numbers of miRTarBase targets for these miRNAs compare?

In [82]:
cancer_type_top1000_log_corrs_mtb_miRNAs = read_file(bucket, 'analysis/enrichment/cancer-type_top-1000-log-corrs-mirtarbase_miRNAs.csv')
cancer_type_top1000_log_corrs_mtb_miRNAs.set_index('cancer_type', inplace=True)

In [83]:
cancer_type_top1000_log_corrs_miRNAs = read_file(bucket, 'analysis/enrichment/cancer-type_top-1000-log-corrs_miRNAs.csv')
cancer_type_top1000_log_corrs_miRNAs.set_index('cancer_type', inplace=True)

In [84]:
cancer_type_top1000_spearman_corrs_mtb_miRNAs = read_file(bucket, 'analysis/enrichment/cancer-type_top-1000-spearman-corrs-mirtarbase_miRNAs.csv')
cancer_type_top1000_spearman_corrs_mtb_miRNAs.set_index('cancer_type', inplace=True)

#### Overlap between miRNAs in top 1000 Pearson and Spearman correlations as indicator of linearity (or lack thereof) of strongest monotonic miRNA-mRNA expression relationships

In [None]:
cols = ['# miRNAs in top 1000 Pearson correlations', '% also in top 1000 Spearman correlations',
        '# miRNAs in top 1000 Spearman correlations', '% also in top 1000 Pearson correlations']
intersection = (cancer_type_top1000_log_corrs_miRNAs & cancer_type_top1000_spearman_corrs_miRNAs).sum(axis=1)
pearson_counts = cancer_type_top1000_log_corrs_miRNAs.sum(axis=1)
spearman_counts = cancer_type_top1000_spearman_corrs_miRNAs.sum(axis=1)
data = { '# miRNAs in top 1000 Pearson correlations': pearson_counts, '% also in top 1000 Spearman correlations': intersection / pearson_counts,
         '# miRNAs in top 1000 Spearman correlations': spearman_counts, '% also in top 1000 Pearson correlations': intersection / spearman_counts }
cancer_type_top1000_log_and_spearman_corrs_miRNA_counts = pd.DataFrame(data, index=cancer_types_list, columns=cols)
del intersection, pearson_counts, spearman_counts
cancer_type_top1000_log_and_spearman_corrs_miRNA_counts

In [None]:
cols = ['# miRNAs in intersection of top 1000 Pearson correlations and miRTarBase', '# miRNAs in intersection of top 1000 Spearman correlations and miRTarBase',
        '# miRNAs in intersection of the preceding']
data = { '# miRNAs in intersection of top 1000 Pearson correlations and miRTarBase': cancer_type_top1000_log_corrs_mtb_miRNAs.sum(axis=1),
         '# miRNAs in intersection of top 1000 Spearman correlations and miRTarBase': cancer_type_top1000_spearman_corrs_mtb_miRNAs.sum(axis=1),
         '# miRNAs in intersection of the preceding': (cancer_type_top1000_log_corrs_mtb_miRNAs & cancer_type_top1000_spearman_corrs_mtb_miRNAs).sum(axis=1) }
cancer_type_top1000_log_and_spearman_corrs_mtb_miRNA_counts = pd.DataFrame(data, index=cancer_types_list, columns=cols)
cancer_type_top1000_log_and_spearman_corrs_mtb_miRNA_counts

### Observations

- Within individual cancer types, only 12 to 183, i.e. a small fraction of about 1.5% to 25%, of miRNAs have a Pearson (log) anticorrelation in the top 1000, with the corresponding figures for Spearman correlations being 16 to 315 miRNAs
  - The proportion is very roughly decreasing as sample size increases
  - This also means that each miRNA in the top 1000 anticorrelations is paired with several mRNAs
  - The corresponding ratio, of mRNAs to miRNAs, in the intersection of top 1000 anticorrelations with miRTarBase interactions is much lower
    - This reflects the relatively low ratio of miRTarBase miRNA-target interactions to miRNA-mRNA pairs
  - However, there are notable differences between the figures arising from rankings based on Pearson and Spearman correlations
    - Generally fewer miRNAs in the Spearman rankings (1), but at the same time, a significantly higher proportion intersecting with miRTarBase (2)
      - (1) I.e. at least some/several of the strongest linear miRNA-mRNA expression relationships are monotonically weaker
      - (2) I.e. the strongest monotonic relationships show a greater enrichment for miRTarBase interactions than the strongest linear (but monotonically weaker) ones
        - Credible support for validity of miRTarBase because the former should show greater enrichment for genuine miRNA-target interactions than the latter
      - And several of the strongest monotonic relationships are significantly non-/less linear
      - The strongest monotonic relationships are relatively concentrated on fewer miRNAs
        - TODO: Might the distributions of miRNA expression and number of miRTarBase targets in the two rankings show some interesting patterns?
      - What's the degree of overlap between the miRNAs and miRNA-mRNA pairs in the two rankings? A lower overlap suggests that the strongest monotonic relationships are less or non-linear, and the strongest linear relationships are somewhat noisy as indicators of (monotonic) dependence
        - TODO: What's the degree of overlap between the miRNAs and miRNA-mRNA pairs in the two rankings?
- In turn, about 1.5% to 33% of those miRNAs have an (top 1000) anticorrelation recorded in miRTarBase, with even fewer of strong support type
  - Nevertheless, as observed elsewhere, the top 1000 anticorrelations are enriched for miRTarBase interactions, though arguably not strongly:
    - log2(O/E) below 3 for enrichment of any miRTarBase interactions in top-1000 anticorrelations
    - Slightly stronger enrichment considering only miRTarBase interactions reported with strong support: a few log2(O/E) exceeding 4

### Questions
- TODO: Is the distribution of miRNAs in the intersection of top-1000 anticorrelations and miRTarBase interactions different from that of all miRNAs involved in top-1000 anticorrelations? E.g. how do their expression profiles compare?

### TODO: Delete when done

In [85]:
cancertype_miRNA_top50_log_corrs_log2OE = read_file(bucket, 'analysis/enrichment/cancer-type_miRNA_top50-log-corrs_log2OE.csv')
cancertype_miRNA_top50_log_corrs_log2OE.set_index('cancer_type', inplace=True)
cancertype_miRNA_top50_log_corrs_pval_bh_adj = read_file(bucket, 'analysis/enrichment/cancer-type_miRNA_top50-log-corrs_pval-bh-adj.csv')
cancertype_miRNA_top50_log_corrs_pval_bh_adj.set_index('cancer_type', inplace=True)
cancertype_top50_log_corrs_mtbase_sig_miRNAs_mask = read_file(bucket, 'analysis/enrichment/cancer-type_top50-log-corrs-mtbase_sig-bh-adj-pval_miRNAs_mask.csv')
cancertype_top50_log_corrs_mtbase_sig_miRNAs_mask.set_index('cancer_type', inplace=True)

cancertype_miRNA_top50_spearman_corrs_log2OE = read_file(bucket, 'analysis/enrichment/cancer-type_miRNA_top50-spearman-corrs_log2OE.csv')
cancertype_miRNA_top50_spearman_corrs_log2OE.set_index('cancer_type', inplace=True)
cancertype_miRNA_top50_spearman_corrs_pval_bh_adj = read_file(bucket, 'analysis/enrichment/cancer-type_miRNA_top50-spearman-corrs_pval-bh-adj.csv')
cancertype_miRNA_top50_spearman_corrs_pval_bh_adj.set_index('cancer_type', inplace=True)
cancertype_top50_spearman_corrs_mtbase_sig_miRNAs_mask = read_file(bucket, 'analysis/enrichment/cancer-type_top50-spearman-corrs-mtbase_sig-bh-adj-pval_miRNAs_mask.csv')
cancertype_top50_spearman_corrs_mtbase_sig_miRNAs_mask.set_index('cancer_type', inplace=True)

In [86]:
cancertype_miRNA_top50_log_corrs_mtb_strong_log2OE = read_file(bucket, 'analysis/enrichment/cancer-type_miRNA_top50-log-corrs-mtb-strong_log2OE.csv')
cancertype_miRNA_top50_log_corrs_mtb_strong_log2OE.set_index('cancer_type', inplace=True)
cancertype_miRNA_top50_log_corrs_mtb_strong_pval_bh_adj = read_file(bucket, 'analysis/enrichment/cancer-type_miRNA_top50-log-corrs-mtb-strong_pval-bh-adj.csv')
cancertype_miRNA_top50_log_corrs_mtb_strong_pval_bh_adj.set_index('cancer_type', inplace=True)
cancertype_top50_log_corrs_mtb_strong_sig_miRNAs_mask = read_file(bucket, 'analysis/enrichment/cancer-type_top50-log-corrs-mtb-strong_sig-bh-adj-pval_miRNAs_mask.csv')
cancertype_top50_log_corrs_mtb_strong_sig_miRNAs_mask.set_index('cancer_type', inplace=True)

cancertype_miRNA_top50_spearman_corrs_mtb_strong_log2OE = read_file(bucket, 'analysis/enrichment/cancer-type_miRNA_top50-spearman-corrs-mtb-strong_log2OE.csv')
cancertype_miRNA_top50_spearman_corrs_mtb_strong_log2OE.set_index('cancer_type', inplace=True)
cancertype_miRNA_top50_spearman_corrs_mtb_strong_pval_bh_adj = read_file(bucket, 'analysis/enrichment/cancer-type_miRNA_top50-spearman-corrs-mtb-strong_pval-bh-adj.csv')
cancertype_miRNA_top50_spearman_corrs_mtb_strong_pval_bh_adj.set_index('cancer_type', inplace=True)
cancertype_top50_spearman_corrs_mtb_strong_sig_miRNAs_mask = read_file(bucket, 'analysis/enrichment/cancer-type_top50-spearman-corrs-mtb-strong-mtbase_sig-bh-adj-pval_miRNAs_mask.csv')
cancertype_top50_spearman_corrs_mtb_strong_sig_miRNAs_mask.set_index('cancer_type', inplace=True)

In [87]:
cancertype_miRNA_top50_log_corrs_log2OE.T.describe()

cancer_type,CHOL,DLBC,UCS,KICH,ACC,UVM,MESO,SKCM,THYM,TGCT,...,LUSC,PRAD,KIRC,THCA,LUAD,LGG,HNSC,UCEC,BRCA,PAN
count,738.0,738.0,738.0,738.0,738.0,738.0,738.0,738.0,738.0,738.0,...,738.0,738.0,738.0,738.0,738.0,738.0,738.0,738.0,738.0,0.0
mean,0.471941,0.540487,0.540063,0.39793,0.442971,0.45281,0.474411,0.521547,0.557593,0.538629,...,0.492293,0.720291,0.54596,0.490415,0.591637,0.532026,0.59495,0.588695,0.552402,
std,0.907733,0.986805,0.940857,0.821127,0.86979,0.879896,0.87323,0.925187,0.974868,0.934342,...,0.901434,1.061707,0.950901,0.886388,0.984555,0.912345,0.974959,0.95482,0.965476,
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
75%,0.604673,0.886536,1.022947,0.0,0.490477,0.622102,0.831802,0.933924,1.029181,0.97685,...,0.84684,1.559617,1.015845,0.923407,1.14897,1.031516,1.229754,1.195361,1.077615,
max,4.069795,5.260368,5.394446,6.377193,5.543015,5.394446,4.652723,4.873745,5.543015,5.260368,...,4.57363,5.291898,5.394446,5.026187,5.394446,6.695646,5.260368,4.143502,4.612588,


### Observations

- miRNA expression
- number of miRNAs in top 1000, intersection with miRTarBase or all

In [88]:
cancer_type_top_n_log_corrs_mirtarbase_counts = read_file(bucket, 'analysis/enrichment/cancer-type_top-n-log-corrs-mirtarbase_counts.csv')
cancer_type_top_n_log_corrs_mirtarbase_counts.set_index('n', inplace=True)

In [89]:
cancer_type_top_n_log_corrs_mtbase_strong_counts = read_file(bucket, 'analysis/enrichment/cancer-type_top-n-log-corrs-mtbase-strong_counts.csv')
cancer_type_top_n_log_corrs_mtbase_strong_counts.set_index('n', inplace=True)

In [90]:
cancer_type_top_n_spearman_corrs_mirtarbase_counts = read_file(bucket, 'analysis/enrichment/cancer-type_top-n-spearman-corrs-mirtarbase_counts.csv')
cancer_type_top_n_spearman_corrs_mirtarbase_counts.set_index('n', inplace=True)

In [91]:
cancer_type_top_n_spearman_corrs_mtbase_strong_counts = read_file(bucket, 'analysis/enrichment/cancer-type_top-n-spearman-corrs-mtbase-strong_counts.csv')
cancer_type_top_n_spearman_corrs_mtbase_strong_counts.set_index('n', inplace=True)

In [95]:
pancan_miRNA_top_n_spearman_corrs_mtbase_pvals_bh_adj = read_file(bucket, 'analysis/enrichment/miRNA_top-n-spearman-corrs-mtbase_pvals-bh-adj.csv')
pancan_miRNA_top_n_spearman_corrs_mtbase_pvals_bh_adj.set_index('miRNA', inplace=True)

In [111]:
cancertype_top1000_spearman_corrs_mtbase_sig_miRNAs_mask = pd.DataFrame(False, cancer_types_list, miRNAmRNAs_in_mirtarbase_mask.index)
cancertype_top1000_spearman_corrs_mtbase_sig_miRNAs_mask.loc['PAN'] = (pancan_miRNA_top_n_spearman_corrs_mtbase_pvals_bh_adj['1000'] <= 0.05)
for cancer_type in cancer_types:
  type_miRNA_top_n_spearman_corrs_mtbase_pvals_bh_adj = read_file(bucket, 'analysis/enrichment/miRNA_top-n-spearman-corrs-mtbase_pvals-bh-adj_' + cancer_type + '.csv')
  type_miRNA_top_n_spearman_corrs_mtbase_pvals_bh_adj.set_index('miRNA', inplace=True)
  cancertype_top1000_spearman_corrs_mtbase_sig_miRNAs_mask.loc[cancer_type] = (type_miRNA_top_n_spearman_corrs_mtbase_pvals_bh_adj['1000'] <= 0.05)
write_df_to_csv(cancertype_top1000_spearman_corrs_mtbase_sig_miRNAs_mask, 'cancer_type', 'gs://yfl-mirna/analysis/enrichment/cancer-type_top1000-spearman-corrs-mtbase_sig-bh-adj-pval_miRNAs_mask.csv')

Copying file://temp.csv [Content-Type=text/csv]...
/ [1 files][149.9 KiB/149.9 KiB]                                                
Operation completed over 1 objects/149.9 KiB.                                    


In [112]:
cols = ['top1000_corrs_mtb & per-miRNA_top1000_corrs_mtb_sig # miRNAs', 'top1000_corrs & per-miRNA_top1000_corrs_mtb_sig # miRNAs',
        'per-miRNA_top1000_corrs_mtb_sig # miRNAs', 'top1000_corrs # miRNAs', 'top1000_corrs_mtb # miRNAs', 'top1000_corrs_mtb']
data = { 'top1000_corrs & per-miRNA_top1000_corrs_mtb_sig # miRNAs': (cancer_type_top1000_spearman_corrs_miRNAs & cancertype_top1000_spearman_corrs_mtbase_sig_miRNAs_mask).sum(axis=1),
         'top1000_corrs_mtb & per-miRNA_top1000_corrs_mtb_sig # miRNAs': (cancer_type_top1000_spearman_corrs_mtb_miRNAs & cancertype_top1000_spearman_corrs_mtbase_sig_miRNAs_mask).sum(axis=1),
         'per-miRNA_top1000_corrs_mtb_sig # miRNAs': cancertype_top1000_spearman_corrs_mtbase_sig_miRNAs_mask.sum(axis=1),
         'top1000_corrs # miRNAs': cancer_type_top1000_spearman_corrs_miRNAs.sum(axis=1),
         'top1000_corrs_mtb # miRNAs': cancer_type_top1000_spearman_corrs_mtb_miRNAs.sum(axis=1),
         'top1000_corrs_mtb': cancer_type_top_n_spearman_corrs_mirtarbase_counts.loc[1000].reindex(cancer_types_list) }
cancertype_top1000_spearman_corrs_and_permiRNA_top1000_corrs_mtb_sig_counts = pd.DataFrame(data, index=cancer_types_list, columns=cols)

In [113]:
cancertype_top1000_spearman_corrs_and_permiRNA_top1000_corrs_mtb_sig_counts

Unnamed: 0,top1000_corrs_mtb & per-miRNA_top1000_corrs_mtb_sig # miRNAs,top1000_corrs & per-miRNA_top1000_corrs_mtb_sig # miRNAs,per-miRNA_top1000_corrs_mtb_sig # miRNAs,top1000_corrs # miRNAs,top1000_corrs_mtb # miRNAs,top1000_corrs_mtb
CHOL,0.0,2.0,5,315.0,18.0,27
DLBC,0.0,1.0,7,109.0,9.0,19
UCS,0.0,1.0,6,152.0,17.0,23
KICH,0.0,1.0,5,181.0,18.0,25
ACC,0.0,1.0,6,94.0,14.0,17
UVM,0.0,1.0,5,57.0,7.0,7
MESO,0.0,1.0,5,87.0,10.0,20
SKCM,0.0,0.0,5,171.0,15.0,19
THYM,0.0,0.0,5,61.0,19.0,32
TGCT,0.0,2.0,5,79.0,13.0,18


#### Table S2

In [137]:
cols = ['top1000_pearson_corrs_miRNAs', 'top1000_pearson_corrs_mtb_miRNA_pcts', 'top1000_spearman_corrs_miRNAs', 'top1000_spearman_corrs_mtb_miRNA_pcts']
top1000_pearson_corrs_miRNAs = cancer_type_top1000_log_corrs_miRNAs.sum(axis=1)
top1000_spearman_corrs_miRNAs = cancer_type_top1000_spearman_corrs_miRNAs.sum(axis=1)
data = { 'top1000_pearson_corrs_miRNAs': top1000_pearson_corrs_miRNAs,
        'top1000_pearson_corrs_mtb_miRNA_pcts': cancer_type_top1000_log_corrs_mtb_miRNAs.sum(axis=1) / top1000_pearson_corrs_miRNAs,
         'top1000_spearman_corrs_miRNAs': top1000_spearman_corrs_miRNAs,
         'top1000_spearman_corrs_mtb_miRNA_pcts': cancer_type_top1000_spearman_corrs_mtb_miRNAs.sum(axis=1) / top1000_spearman_corrs_miRNAs }
cancertype_top1000_spearman_pearson_corrs_mtb_miRNAs = pd.DataFrame(data, index=cancer_types_list, columns=cols)
del top1000_pearson_corrs_miRNAs, top1000_spearman_corrs_miRNAs

In [138]:
cancertype_top1000_spearman_pearson_corrs_mtb_miRNAs

Unnamed: 0,top1000_pearson_corrs_miRNAs,top1000_pearson_corrs_mtb_miRNA_pcts,top1000_spearman_corrs_miRNAs,top1000_spearman_corrs_mtb_miRNA_pcts
CHOL,158.0,0.063291,315.0,0.057143
DLBC,129.0,0.085271,109.0,0.082569
UCS,112.0,0.089286,152.0,0.111842
KICH,129.0,0.015504,181.0,0.099448
ACC,152.0,0.098684,94.0,0.148936
UVM,69.0,0.086957,57.0,0.122807
MESO,109.0,0.100917,87.0,0.114943
SKCM,183.0,0.076503,171.0,0.087719
THYM,102.0,0.127451,61.0,0.311475
TGCT,84.0,0.130952,79.0,0.164557


#### Observations
- There is no enrichment in most cancer types in miRNAs in the top 1000 anticorrelations for miRNAs with significant top-50 mRNA expression anticorrelations
  - Mild enrichment in a few, and significant enrichment in a few others: LUSC, PRAD, LUAD, UCEC
- In turn, in most cancer types, there is no visible enrichment in miRNAs in the top 1000 anticorrelations with miRTarBase support for miRNAs with significant top-50 mRNA expression anticorrelations, with a few possible exceptions: OV, STAD, LUSC, LUAD, UCEC
- This suggests that there may be no easily identifiable subset of miRNAs or miRNA-mRNA pairs, e.g. the most highly expressed or most anticorrelated, consistently enriched for miRTarBase interactions.
  - However, further exploratory work could be done to refine/investigate this, e.g.:
    - Compare the distributions of expression and/or # of (miRTarBase) targets for miRNAs with significant enrichment of miRTarBase interactions in their top 50 mRNA expression anticorrelations.

In [76]:
cancer_type_top_n_log_corrs_mirtarbase_counts_log2OE = read_file(bucket, 'analysis/enrichment/cancer-type_top-n-log-corrs-mirtarbase_counts_log2OE.csv')
cancer_type_top_n_log_corrs_mirtarbase_counts_log2OE.set_index('n', inplace=True)

In [77]:
cancer_type_top_n_log_corrs_mtbase_strong_counts_log2OE = read_file(bucket, 'analysis/enrichment/cancer-type_top-n-log-corrs-mtbase-strong_counts_log2OE.csv')
cancer_type_top_n_log_corrs_mtbase_strong_counts_log2OE.set_index('n', inplace=True)

In [78]:
cancer_type_top_n_spearman_corrs_mirtarbase_counts_log2OE = read_file(bucket, 'analysis/enrichment/cancer-type_top-n-spearman-corrs-mirtarbase_counts_log2OE.csv')
cancer_type_top_n_spearman_corrs_mirtarbase_counts_log2OE.set_index('n', inplace=True)

In [79]:
cancer_type_top_n_spearman_corrs_mtbase_strong_counts_log2OE = read_file(bucket, 'analysis/enrichment/cancer-type_top-n-spearman-corrs-mtbase-strong_counts_log2OE.csv')
cancer_type_top_n_spearman_corrs_mtbase_strong_counts_log2OE.set_index('n', inplace=True)

In [80]:
cancer_type_top_n_spearman_corrs_mtbase_strong_counts_log2OE = read_file(bucket, 'analysis/enrichment/cancer-type_top-n-spearman-corrs-mtbase-strong_counts_log2OE.csv')
cancer_type_top_n_spearman_corrs_mtbase_strong_counts_log2OE.set_index('n', inplace=True)

In [81]:
cancer_type_top1000_mtb_strong_stats = read_file(bucket, 'analysis/enrichment/cancertype_top1000-mtb-strong_stats.csv')
cancer_type_top1000_mtb_strong_stats.set_index('cancer_type', inplace=True)

In [96]:
cancer_type_top1000_mtb_strong_stats

Unnamed: 0_level_0,log_corrs_mtb_miRNAs,log_corrs_all_miRNAs,log_corrs_mtb,log_corrs_mtb_log2OE,spearman_corrs_mtb_miRNAs,spearman_corrs_all_miRNAs,spearman_corrs_mtb,spearman_corrs_mtb_log2OE
cancer_type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
CHOL,0,158,0,,5,315,5,inf
DLBC,0,129,0,,0,109,0,
UCS,3,112,3,inf,6,152,6,inf
KICH,0,129,0,,2,181,3,inf
ACC,1,152,1,inf,1,94,1,inf
UVM,0,69,0,,0,57,0,
MESO,5,109,10,inf,5,87,9,inf
SKCM,3,183,4,inf,3,171,4,inf
THYM,1,102,1,inf,3,61,3,inf
TGCT,1,84,1,inf,1,79,1,inf


In [94]:
cancer_type_top1000_stats = read_file(bucket, 'analysis/enrichment/cancertype_top1000-mtb_stats.csv')
cancer_type_top1000_stats.set_index('cancer_type', inplace=True)

#### Possible reasons for relatively weak relationship between miRTarBase support and (unadjusted/uncontrolled) miRNA-mRNA anticorrelation strength

- Unaccounted-for genomic & epigenetic factors, e.g. copy-number variation and gene methylation, affecting mRNA expression
  - Could obscure downregulation in genuine miRNA-target pairs
  - Could cause anticorrelation between miRNA and non-target mRNA expression
- Changes in cancer cell transcriptome, e.g. shortening of 3'-UTR region (thus reducing miRNA-target binding potential)
  - Weakens downregulation, i.e. reduces anticorrelation, between miRNA and affected targets
- ceRNA effects, e.g.
  - All else being equal, miRNAs might be expected to bind preferentially to higher-affinity targets
  - Sufficiently high expression of high-affinity targets could reduce amount of miRNA available to downregulate lower-affinity targets
  - Outside conditions of exceptional transcriptome changes, there may be insufficient miRNA available to appreciably downregulate all or even many targets
- Other gene network interactions, e.g. causal relationships between expressions of various genes. Their aggregate effects could:
  - Strengthen or weaken anticorrelation resulting from miRNA-target downregulation
  - Cause anticorrelation between miRNA and non-target mRNA expressions
- Other factors present in cell type or microenvironment affecting miRNA or target expression
- Last but not least: incompleteness or inaccuracy of miRTarBase entries
  - Many genuine miRNA-target interactions haven't yet been discovered or recorded in miRTarBase, likely including pairs with strong anticorrelations
  - Some interactions reported in miRTarBase may be false positives, possibly accounting for some spurious weak anticorrelations

- Comparison with normal tissue could help identify causal miRNA-target relationships and distinguish them from spurious correlations

- Stacked histogram of miRTarBase and non-miRTarBase pair correlation distributions

In [82]:
cancer_type_top1000_log_corrs_mtb_strong_miRNAs = read_file(bucket, 'analysis/enrichment/cancer-type_top-1000-log-corrs-mtbase-strong_miRNAs.csv')
cancer_type_top1000_log_corrs_mtb_strong_miRNAs.set_index('cancer_type', inplace=True)

In [83]:
cancer_type_top1000_spearman_corrs_mtb_strong_miRNAs = read_file(bucket, 'analysis/enrichment/cancer-type_top-1000-spearman-corrs-mtbase-strong_miRNAs.csv')
cancer_type_top1000_spearman_corrs_mtb_strong_miRNAs.set_index('cancer_type', inplace=True)

In [84]:
cancertype_top500_log_corrs_mtbase_sig_miRNAs_mask = read_file(bucket, 'analysis/enrichment/cancer-type_top500-spearman-corrs-mtbase_sig-bh-adj-pval_miRNAs_mask.csv')
cancertype_top500_log_corrs_mtbase_sig_miRNAs_mask.set_index('cancer_type', inplace=True)

In [85]:
cancertype_top500_spearman_corrs_mtbase_sig_miRNAs_mask = read_file(bucket, 'analysis/enrichment/cancer-type_top500-log-corrs-mtbase_sig-bh-adj-pval_miRNAs_mask.csv')
cancertype_top500_spearman_corrs_mtbase_sig_miRNAs_mask.set_index('cancer_type', inplace=True)

In [86]:
cancertype_top500_log_corrs_mtb_strong_sig_miRNAs_mask = read_file(bucket, 'analysis/enrichment/cancer-type_top500-log-corrs-mtb-strong_sig-bh-adj-pval_miRNAs_mask.csv')
cancertype_top500_log_corrs_mtb_strong_sig_miRNAs_mask.set_index('cancer_type', inplace=True)

In [87]:
cancertype_top500_spearman_corrs_mtb_strong_sig_miRNAs_mask = read_file(bucket, 'analysis/enrichment/cancer-type_top500-spearman-corrs-mtb-strong-mtbase_sig-bh-adj-pval_miRNAs_mask.csv')
cancertype_top500_spearman_corrs_mtb_strong_sig_miRNAs_mask.set_index('cancer_type', inplace=True)

### Test

In [27]:
miRNAmRNA_log_corrs.columns

Index([u'100133144', u'100134869', u'10357', u'10431', u'155060', u'388795',
       u'390284', u'57714', u'645851', u'653553',
       ...
       u'55055', u'11130', u'7789', u'158586', u'79364', u'440590', u'79699',
       u'7791', u'23140', u'26009'],
      dtype='object', length=16335)

In [28]:
miRNAmRNA_log_corrs.index

Index([u'hsa-let-7a-2-3p', u'hsa-let-7a-3p', u'hsa-let-7a-5p',
       u'hsa-let-7b-3p', u'hsa-let-7b-5p', u'hsa-let-7c-3p', u'hsa-let-7c-5p',
       u'hsa-let-7d-3p', u'hsa-let-7d-5p', u'hsa-let-7e-3p',
       ...
       u'hsa-miR-527', u'hsa-miR-548x-3p', u'hsa-miR-5584-5p',
       u'hsa-miR-670-3p', u'hsa-miR-885-3p', u'hsa-miR-888-5p', u'hsa-miR-890',
       u'hsa-miR-891b', u'hsa-miR-892b', u'hsa-miR-892c-3p'],
      dtype='object', name=u'miRNA', length=743)

In [37]:
miRNAmRNA_log_corrs.loc['hsa-let-7a-2-3p', '100133144']

-0.052223221148815856

In [40]:
pd.Series([3,2,4,1], index=['a', 'b', 'c', 'd']).argsort()

a    3
b    1
c    0
d    2
dtype: int64

In [51]:
(miRNAmRNA_log_corrs.loc['hsa-let-7a-2-3p'].argsort() < 500)

100133144    False
100134869    False
10357        False
10431        False
155060       False
388795       False
390284       False
57714        False
645851       False
653553       False
8225         False
90288        False
1            False
87769        False
2            False
144568       False
53947        False
8086         False
65985        False
51166        False
79719        False
22848        False
14           False
15           False
16           False
57505        False
80755        False
132949       False
60496        False
10157        False
             ...  
9753         False
221584       False
80345        False
65982        False
7579         False
7589         False
342945       False
222696       False
54993        False
146050       False
79149        False
342933       False
90204        False
140831       False
65249        False
57643        False
57688        False
125150       False
221302       False
9183         False
55055        False
11130       

### Questions

- What are the overlaps (in top n anticorrelations) across correlation types and miRTarBase support types?
- What are the overlaps across cancer types (within each correlation and miRTarBase support type)?
- What's the overlap between pairs in the top n anticorrelations across many/all cancer types, and those in top n pan-cancer anticorrelations? (Also addresses aux. question)
- What's the distribution of (miRNA) expression and correlations across cancers?

### Scratch

In [52]:
df1 = pd.DataFrame({
    'A': [1,2,3,4,5],
    'B': [1,2,3,4,5]
})
df2 = pd.DataFrame({
    'C': [1,2,3,4,5],
    'D': [1,2,3,4,5]
})

In [53]:
df_concat = pd.concat([df1, df2], axis=1)

In [54]:
df_concat

Unnamed: 0,A,B,C,D
0,1,1,1,1
1,2,2,2,2
2,3,3,3,3
3,4,4,4,4
4,5,5,5,5


### TODO / Bonus: Rename the following
- cancer_type_top1000_log_corrs_mtb_miRNAs -> cancer_type_top1000_log_corrs_mtb_miRNAs_mask
- cancer_type_top1000_log_corrs_miRNAs -> cancer_type_top1000_log_corrs_miRNAs_mask
- cancer_type_top1000_spearman_corrs_mtb_miRNAs -> cancer_type_top1000_spearman_corrs_mtb_miRNAs_mask
- cancer_type_top1000_spearman_corrs_miRNAs -> cancer_type_top1000_spearman_corrs_miRNAs_mask
- cancer_type_top1000_log_corrs_mtb_strong_miRNAs -> cancer_type_top1000_log_corrs_mtb_strong_miRNAs_mask
- cancer_type_top1000_spearman_corrs_mtb_strong_miRNAs -> cancer_type_top1000_spearman_corrs_mtb_strong_miRNAs_mask

### Scraps: Delete after finalising notebook if not used

#### Attempt to account for differences in enrichment test p-values across cancer types

In [None]:
heatmap(cancer_type_top_n_log_corrs_mirtarbase_counts_log2OE.T.reindex(cancer_types_list).T,
        'log2(O/E) for enrichment of miRTarBase relationships in top n Spearman anticorrelation pairs, cancer types sorted by sample size', 'cancer type', 'n',
        np.arange(cancer_types_list.size) + 0.5, np.arange(len(hypergeom_test_ns)) + 0.5, cancer_types_list,
        hypergeom_test_ns, 30, 12, 'cubehelix_r', ha='center', va='center', label_fontsize=16).savefig('temp.png')
#save_as('temp.png', 'gs://yfl-mirna/analysis/enrichment/plots/cancer-types_top-n-spearman-corrs-mtbase-log2OE_heatmap.png')

In [None]:
heatmap(cancer_type_top_n_spearman_corrs_mirtarbase_pvals.T.reindex(cancer_types_list).T,
        'p-values for enrichment of miRTarBase relationships in top n Spearman anticorrelation pairs, cancer types sorted by sample size', 'cancer type', 'n',
        np.arange(cancer_types_list.size) + 0.5, np.arange(len(hypergeom_test_ns)) + 0.5, cancer_types_list,
        hypergeom_test_ns, 30, 12, 'cubehelix_r', ha='center', va='center', label_fontsize=16).savefig('temp.png')
#save_as('temp.png', 'gs://yfl-mirna/analysis/enrichment/plots/cancer-types_top-n-spearman-corrs-mtbase-pvals_heatmap.png')

#### Scrap of scraps

In [None]:
pd.melt(cancer_type_top_n_spearman_corrs_mirtarbase_counts_log2OE.T.reset_index(), id_vars=['index'], value_vars=map(lambda x: int(x), hypergeom_test_n_strs),
        var_name='n', value_name='log2(O/E)')

### FMI: For reference purposes. Delete/Move after report is done.

#### Analysis plan
- Use hypergeometric test to test for enrichment of miRTarBase relationships in miRNA-mRNA pairs with top n (most negative) correlations: n = 10, 50, 100, 500, 1000
  - Bonus: and vice versa—enrichment of top n correlations in miRTarBase relationships (I haven't been able to convince myself the test is symmetric)
  - If a pattern is visible: Any way to demonstrate formally?
- If the preceding item is encouraging enough:
  - Get summary statistics for distributions of # of miRTarBase entries and proportion of entries with “strong” support type (or any support type) for each miRTar pair
    - Tabulate (intersections of categories of the two) and visualise in heatmap
  - Use the Chi-squared test to test for enrichment of relationships (considering only those in miRTarBase) in top n with at least a certain # of miRTarBase entries and a certain % of entries with strong support type
    - How many degrees of freedom???

### Observations

- After Bonferroni correction for multiple testing, cancer types with larger sample sizes are much more consistently enriched (across all miRTarBase interactions or strong ones only, and Pearson as well as Spearman correlations), and to a greater degree, than cancer types with smaller sample sizes
- Across cancer types, there is greater enrichment at higher values of n
- At the lowest values of n (e.g. <= 100), enrichment is generally not or borderline significant, with a few exceptions
- At the pan-cancer level, Spearman correlation-based rankings are enriched starting at lower values of n than Pearson correlation-based rankings
- Spearman correlations: There is significant enrichment starting at lower values of n for strong miRTarBase interactions
- Some cancer types are highly enriched across all correlation types. In roughly decreasing order:
  - LUSC, STAD, THCA, LIHC, UCEC, MESO, BLCA, HNSC, PRAD
- Other cancer types showing consistent enrichment, again in roughly decreasing order:
  - SKCM, UCSM SARC, OV, CHOL (Spearman), BRCA, KIRC (Spearman), ESCA (regular miRTarBase interactions), LGG, THYM, PAAD (Spearman), COAD (strong miRTarBase support type interactions), PCPG (Spearman), DLBC (regular miRTarBase interactions)
- Cancer types showing little enrichment:
  - KICH (only strong enrichment for regular miRTarBase interactions in Spearman correlations)
  - DLBC (only some enrichment for regular miRTarBase interactions)
  - KIRP (weak enrichment for strong miRTarBase interactions)
  - TGCT, ACC
- Cancer types showing no enrichment, in increasing order of sample size:
  - UVM, READ, CESC