# MPRA training (for training Malinois' weights)

This notebook processes the MPRA results that are fed to Malinois during the training stage. The table this notebook generates was used exclusively for optimizing Malinois. Every analysis and performance report that uses the MPRA activity from these libraries was created using a slightly differently processed table (with no DNA-count filter before averaging across libraries), provided in the notebook "CODA_preprocess_MPRA_validation".

In [10]:
import pandas as pd
import gzip
import os

from IPython.display import display, HTML
display(HTML("<style>.container { width:98% !important; }</style>"))

pd.set_option('display.max_columns', None)



### Fetch the MPRA results

The dataset consists of oligo libraries (OL) from the MPRA projects UKBB (OL27-33), GTEx (OL41,42), and CRE (OL15).

In [11]:
rootdir = '/training'

cell_types_lower = ['k562', 'hepg2', 'sknsh']
cell_types = ['K562', 'HepG2', 'SKNSH']
std_cell_names = dict(zip(cell_types_lower, cell_types))

UKBB_libraries = [f'OL{n}' for n in range(27, 34)]
GTEx_libraries = ['OL41', 'OL42', 'OL41-42', 'OL41B']
CREs_libraries = ['OL15']

data_project_dict = {OL: 'UKBB' for OL in UKBB_libraries}
data_project_dict.update({OL: 'GTEx' for OL in GTEx_libraries})
data_project_dict.update({OL: 'CRE' for OL in CREs_libraries})

print('Fetching files:\n')
in_df = []
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        filepath = subdir + os.sep + file
        if filepath.endswith(".out") and ('Counts' not in filepath):
            library_num, cell_type = file.split('_')[:2]
            if cell_type.lower() in cell_types_lower:
                df_temp = pd.read_csv(filepath, sep='\t', low_memory=False)
                df_temp['OL'] = library_num
                df_temp['cell_type'] = std_cell_names[cell_type.lower()]
                df_temp['data_project'] = data_project_dict[library_num]
                in_df.append(df_temp)
                print(file)

in_columns = ['ID', 'chr', 'project', 'ctrl_exp', 'Ctrl.Mean',
       'Exp.Mean', 'log2FoldChange', 'lfcSE', 'OL',
       'cell_type', 'data_project']

in_df = pd.concat(in_df)[in_columns]

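# Fill in missing chromosomes: synthetic oligos ('sample' in the ID) are
# labeled 'synth'; otherwise the chromosome is the first ':'-separated ID field.
row_filter = in_df['chr'].isna()
chrs = []
for ID in in_df[row_filter]['ID'].tolist():
    if 'sample' in ID:
        chrs.append('synth')
    else:
        chrs.append(ID.split(':')[0])

in_df.loc[row_filter, 'chr'] = chrs

# Replace missing project labels with an empty string.
in_df.loc[in_df['project'].isna(), 'project'] = ''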

Fetching files:

OL42_HepG2_20211010.out
OL28_SKNSH_20210428.out
OL41_SKNSH_20211010.out
OL42_SKNSH_20211010.out
OL28_HEPG2_20210428.out
OL27_K562_20210428.out
OL33_HEPG2_20210428.out
OL15_HEPG2_Neon_20200904.out
OL30_SKNSH_20210428.out
OL33_SKNSH_20210428.out
OL41B_HepG2_20211010.out
OL30_HEPG2_20210428.out
OL32_SKNSH_20210428.out
OL33_K562_20210428.out
OL15_K562_20200904.out
OL31_HEPG2_20210428.out
OL15_SKNSH_20200904.out
OL32_HEPG2_20210428.out
OL32_K562_20210428.out
OL31_SKNSH_20210428.out
OL27_SKNSH_20210428.out
OL28_K562_20210428.out
OL31_K562_20210428.out
OL29_HEPG2_20210428.out
OL27_HEPG2_20210428.out
OL29_K562_20210428.out
OL30_K562_20210428.out
OL29_SKNSH_20210428.out
OL41-42_K562_20211010.out
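
The file names listed above encode the library and the cell type in their first two underscore-separated fields. The following is a minimal sketch of that parsing step, assuming the same naming scheme; the helper name `parse_results_filename` is hypothetical and not part of the notebook.

```python
# Hypothetical helper (not part of the notebook): mirrors how the cell above
# derives the library and the cell type from a results file name.
std_cell_names = {'k562': 'K562', 'hepg2': 'HepG2', 'sknsh': 'SKNSH'}

def parse_results_filename(filename):
    """Return (library, standardized cell type), or None for other cell types."""
    library_num, cell_type = filename.split('_')[:2]
    std_name = std_cell_names.get(cell_type.lower())
    if std_name is None:
        return None
    return library_num, std_name

examples = [
    'OL15_HEPG2_Neon_20200904.out',
    'OL41-42_K562_20211010.out',
    'OL41B_HepG2_20211010.out',
]
parsed = [parse_results_filename(name) for name in examples]
print(parsed)
```

Lower-casing the cell type before the lookup is what lets both "HEPG2" and "HepG2" map to the same standardized name.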


### Filter low-count oligos

To ensure Malinois is trained only on oligos of acceptable quality, we dropped potentially noisy log2FC results before averaging activities across libraries in UKBB. We imposed an oligo count threshold of at least 20 and required a non-zero RNA count. For general purposes, a more suitable oligo count threshold might be at least 100. However, for the purposes of training Malinois, we did not observe any significant gains from further increasing the stringency of these thresholds. (Note that for analyses and figures we use a version of this table with no oligo-count threshold, only the non-zero RNA count threshold. Instead, we filter only on an optional log2(Fold-Change) standard error cutoff imposed after processing the results.)

Note: in order to reproduce the test set reported in the paper, one would need to set oligo_count_cutoff = 0 and filter by lfcSE after generating the table (as described in the Methods section of the paper).
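
The effect of the thresholds described above can be illustrated on a toy table. This is only a sketch: the rows are invented for illustration, and the wrapper function `quality_filter` is a hypothetical convenience, although the column names and the comparison operators follow the notebook.

```python
import pandas as pd

# Invented rows, for illustration only.
toy = pd.DataFrame({
    'ID': ['a', 'b', 'c', 'd'],
    'Ctrl.Mean': [150.0, 25.0, 12.0, 300.0],  # oligo (DNA) counts
    'Exp.Mean': [200.0, 40.0, 30.0, 0.0],     # RNA counts
})

def quality_filter(df, oligo_count_cutoff, RNA_count_cutoff=0):
    """Keep rows with enough oligo counts and an RNA count above the cutoff."""
    keep = (df['Ctrl.Mean'] >= oligo_count_cutoff) & (df['Exp.Mean'] > RNA_count_cutoff)
    return df[keep].reset_index(drop=True)

training_like = quality_filter(toy, oligo_count_cutoff=20)    # training setting
stricter = quality_filter(toy, oligo_count_cutoff=100)        # stricter alternative
no_count_filter = quality_filter(toy, oligo_count_cutoff=0)   # only RNA count > 0

print(training_like['ID'].tolist(), stricter['ID'].tolist(), no_count_filter['ID'].tolist())
```

Row "d" is dropped under every setting because its RNA count is zero, while row "c" survives only when the oligo-count threshold is disabled.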

In [12]:
in_df

Unnamed: 0,ID,chr,project,ctrl_exp,Ctrl.Mean,Exp.Mean,log2FoldChange,lfcSE,OL,cell_type,data_project
0,1:1000156:C:T:A:wC,1,GTEx,GTEx,90.451376,116.787452,0.370168,0.315274,OL42,HepG2,GTEx
1,1:1000156:C:T:R:wC,1,GTEx,GTEx,81.044871,138.000723,0.764444,0.316890,OL42,HepG2,GTEx
2,1:10002921:T:G:A:wC,1,GTEx,GTEx,1151.172596,4035.711481,1.809318,0.082221,OL42,HepG2,GTEx
3,1:10002921:T:G:R:wC,1,GTEx,GTEx,974.523926,1669.391661,0.776754,0.099932,OL42,HepG2,GTEx
4,1:10003320:G:A:A:wC,1,GTEx,GTEx,161.422841,456.978661,1.499823,0.186002,OL42,HepG2,GTEx
...,...,...,...,...,...,...,...,...,...,...,...
493780,X:99996102:A:G:R:wC,X,GTEx,GTEx,1138.914651,1375.044964,0.272142,0.166480,OL41-42,K562,GTEx
493781,X:99998829:C:T:A:wC,X,GTEx,GTEx,151.525843,171.747787,0.181686,0.381688,OL41-42,K562,GTEx
493782,X:99998829:C:T:R:wC,X,GTEx,GTEx,165.073711,174.544210,0.081406,0.425423,OL41-42,K562,GTEx
493783,X:99999349:G:A:A:wC,X,GTEx,GTEx,886.880586,739.964383,-0.261358,0.211378,OL41-42,K562,GTEx


In [13]:
oligo_count_cutoff = 20
RNA_count_cutoff = 0

quality_filter = (in_df['Ctrl.Mean'] >= oligo_count_cutoff) & \
                         (in_df['Exp.Mean'] > RNA_count_cutoff)

in_df = in_df[quality_filter].reset_index(drop=True) 

### Get the nucleotide sequence

We load the nucleotide sequence information from each corresponding FASTA file and assign each oligo ID its sequence. It is worth noting that, in very rare edge cases, the sequence for the same oligo ID might differ by one position between the UKBB and GTEx databases. To our knowledge, there is only one such case. Nonetheless, fetching the sequence from the corresponding OL FASTA file (as opposed to just creating a single set of ID-to-sequence pairs) ensures we get the correct measured sequence.

In [14]:
rootdir = '/Fastas'

ignore_libraries = ['OL46']

fasta_dict = {'UKBB': {}, 'GTEx': {}, 'CRE': {}}
print('Processing fasta files:\n')
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        filepath = subdir + os.sep + file
        if filepath.endswith(".gz") and ('fasta' in filepath):
            library_num = file.split('_')[0]
            if library_num not in ignore_libraries:
                print(file)
                project = data_project_dict[library_num]
                with gzip.open(filepath, 'rt') as f:
                    for line_str in f:
                        if line_str[0] == '>':
                            oligo_id = line_str.lstrip('>').rstrip('\n')
                            fasta_dict[project][oligo_id] = ''
                        else:
                            fasta_dict[project][oligo_id] += line_str.rstrip('\n')

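# Some FASTA headers pack several oligo IDs into one entry, e.g. '(ID1;ID2)'.
# Register each individual ID so that it maps to the shared sequence.
for project in fasta_dict.keys():
    project_fasta_dict = fasta_dict[project]
    compressed_IDs = [ID for ID in project_fasta_dict.keys() if ';' in ID]
    for compressed_ID in compressed_IDs:
        stripped_ID = compressed_ID.lstrip('(').rstrip(')')
        for single_ID in stripped_ID.split(';'):
            project_fasta_dict[single_ID] = project_fasta_dict[compressed_ID]

# Look up each oligo's sequence in the FASTA dictionary of its own data project.
in_df['sequence'] = in_df.apply(lambda x: fasta_dict[x['data_project']][x['ID']], axis=1)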

Processing fasta files:

OL41_reference.fasta.gz
OL32_reference.fasta.gz
OL31_reference.fasta.gz
OL42_reference.fasta.gz
OL27_reference.fasta.gz
OL29_reference.fasta.gz
OL15_reference.fasta.gz
OL28_reference.fasta.gz
OL30_reference.fasta.gz
OL33_reference.fasta.gz
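
The FASTA parsing and the expansion of compressed IDs can be tried on a few in-memory lines. This is a minimal sketch: the IDs and sequences below are made up, and the "(ID1;ID2)" header layout is inferred from the parsing code in this notebook rather than from a format specification.

```python
# Made-up FASTA content; the second record uses a compressed '(ID1;ID2)' header.
fasta_lines = [
    '>1:100:A:G:R:wC\n',
    'ACGT\n',
    'TTGA\n',
    '>(2:200:C:T:A:wC;2:200:C:T:A:wC_alt)\n',
    'GGCC\n',
]

sequences = {}
oligo_id = None
for line_str in fasta_lines:
    if line_str.startswith('>'):
        # Header line: start a new record.
        oligo_id = line_str.lstrip('>').rstrip('\n')
        sequences[oligo_id] = ''
    else:
        # Sequence line: append to the current record (handles wrapped FASTA).
        sequences[oligo_id] += line_str.rstrip('\n')

# Expand compressed headers so each individual ID points to the shared sequence.
for compressed_ID in [ID for ID in sequences if ';' in ID]:
    for single_ID in compressed_ID.lstrip('(').rstrip(')').split(';'):
        sequences[single_ID] = sequences[compressed_ID]

print(sequences)
```

The compressed key is kept alongside the expanded ones, as in the notebook, so a lookup by either form succeeds.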


### Grouping results for oligos shared across libraries in a data project

This step is needed only for sequences in the UKBB oligo libraries, since these libraries share some sequences. However, it does not hurt to apply it to the GTEx and CRE libraries as well. For practical purposes, we average the log2(Fold-Change) readout of sequences shared across libraries in order to have a single sequence-to-activity pair for each sequence. For measures that can inform us of empirical noise (such as oligo count, RNA count, and log2FC standard error), we keep the "worst" value (min, min, and max, respectively) across libraries.

Note 1: For legacy reasons, we do not average shared sequences across projects since the generation of these projects was asynchronous. However, the consistency of the MPRA results probably allows for averaging across different projects as well.

Note 2: We believe that many other ways of aggregating the results can achieve equal or better results for training models. For example, it could be equally useful (if not more so) to show a model several sequence-to-activity pairs associated with a single sequence, as this could implicitly show the empirical noise to the model during training.

In [15]:
grouped_project_dfs = []
for data_project in ['UKBB', 'GTEx', 'CRE']:
    print(f'Grouping sequences across cell types and across libraries in {data_project}')
    project_df = in_df[in_df['data_project'] == data_project].reset_index(drop=True)
    groups = project_df.groupby(by=['sequence', 'cell_type']).agg(
                log2FC_mean = ('log2FoldChange', 'mean'),
                chromosome = ('chr', 'first'),
                OL = ('OL', set),
                classes = ('project', set),
                IDs = ('ID', set),
                exp_mean = ('Exp.Mean', 'min'),
                ctrl_mean = ('Ctrl.Mean', 'min'),
                lfcSE = ('lfcSE', 'max'))
    grouped_project_df = groups.unstack(level=-1)
    grouped_project_df.columns = ['_'.join(col).rstrip('_') for col in grouped_project_df.columns.values]
    grouped_project_df.reset_index(inplace=True)
    grouped_project_df['data_project'] = data_project
    grouped_project_dfs.append(grouped_project_df)

Grouping sequences across cell types and across libraries in UKBB
Grouping sequences across cell types and across libraries in GTEx
Grouping sequences across cell types and across libraries in CRE
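
The aggregation rule (mean for log2FC, "worst" value for the noise measures) can be checked on a toy table. This is a sketch with invented values; to keep it small, the contributing libraries are joined into a string with a lambda instead of being collected into a set as in the notebook.

```python
import pandas as pd

# Invented measurements: 'AAAC' appears in two libraries, 'GGGT' in one.
toy = pd.DataFrame({
    'sequence': ['AAAC', 'AAAC', 'GGGT'],
    'cell_type': ['K562', 'K562', 'K562'],
    'OL': ['OL28', 'OL31', 'OL32'],
    'log2FoldChange': [1.0, 2.0, -0.5],
    'Ctrl.Mean': [120.0, 40.0, 300.0],
    'Exp.Mean': [200.0, 90.0, 250.0],
    'lfcSE': [0.1, 0.3, 0.2],
})

grouped = toy.groupby(['sequence', 'cell_type']).agg(
    log2FC_mean=('log2FoldChange', 'mean'),              # average activity
    OL=('OL', lambda values: ','.join(sorted(values))),  # contributing libraries
    exp_mean=('Exp.Mean', 'min'),                        # lowest RNA count
    ctrl_mean=('Ctrl.Mean', 'min'),                      # lowest oligo count
    lfcSE=('lfcSE', 'max'),                              # largest standard error
)
print(grouped)
```

For the shared sequence, the activity is averaged over both libraries while the reported counts and standard error come from whichever library looks noisiest on each measure.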


### Clean the grouped columns

These lines just clean up the columns generated by unstacking the groups from the previous cell.

In [16]:
column_drop_list = ['chromosome_K562', 'chromosome_SKNSH',
               'classes_K562','classes_SKNSH',
               'IDs_K562', 'IDs_SKNSH',
               'OL_K562', 'OL_HepG2',
              ]

new_column_name_dict = {'sequence': 'sequence',
                    'log2FC_mean_HepG2': 'HepG2_log2FC',
                    'log2FC_mean_K562': 'K562_log2FC',
                    'log2FC_mean_SKNSH': 'SKNSH_log2FC',
                    'chromosome_HepG2': 'chr',
                    'classes_HepG2': 'class',
                    'IDs_HepG2': 'IDs',
                    'OL_SKNSH': 'OL',
                    'exp_mean_HepG2': 'exp_mean_HepG2',
                    'exp_mean_K562': 'exp_mean_K562',
                    'exp_mean_SKNSH': 'exp_mean_SKNSH',
                    'ctrl_mean_HepG2': 'ctrl_mean_HepG2',
                    'ctrl_mean_K562': 'ctrl_mean_K562',
                    'ctrl_mean_SKNSH': 'ctrl_mean_SKNSH',
                    'lfcSE_HepG2': 'HepG2_lfcSE',
                    'lfcSE_K562': 'K562_lfcSE',
                    'lfcSE_SKNSH': 'SKNSH_lfcSE',
                    'data_project': 'data_project',
                   }

out_column_order = ['IDs', 'sequence', 'data_project', 'OL', 'class', 'chr',
                    'K562_log2FC', 'HepG2_log2FC', 'SKNSH_log2FC',
                    'K562_lfcSE', 'SKNSH_lfcSE', 'HepG2_lfcSE', 
#                     'ctrl_mean_K562', 'ctrl_mean_HepG2', 'ctrl_mean_SKNSH',
#                     'exp_mean_K562', 'exp_mean_HepG2', 'exp_mean_SKNSH',
                   ]

out_df = pd.concat(grouped_project_dfs)

out_df.drop(column_drop_list, axis=1, inplace=True)
out_df.columns = [new_column_name_dict[old_name] for old_name in out_df.columns.values]

null_filter = (out_df['HepG2_log2FC'].notnull()) & (out_df['K562_log2FC'].notnull()) & (out_df['SKNSH_log2FC'].notnull())
out_df = out_df[null_filter].reset_index(drop=True)

out_df['OL'] = out_df['OL'].apply(lambda x: ','.join(sorted(x)))
out_df['class'] = out_df['class'].apply(lambda x: ','.join(sorted(x)))
out_df['IDs'] = out_df['IDs'].apply(lambda x: ';'.join(sorted(x)))

compressed_ids_filter = out_df['IDs'].str.contains(';')
out_df.loc[compressed_ids_filter, 'IDs'] = '(' + out_df.loc[compressed_ids_filter, 'IDs'] + ')'

out_df = out_df[out_column_order]
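
The wide, per-cell-type column names handled in the cell above come from unstacking the cell-type index level and joining the resulting two-level column labels. A minimal sketch with invented values and only two cell types:

```python
import pandas as pd

# Invented long-format table indexed by (sequence, cell_type).
long_df = pd.DataFrame({
    'sequence': ['AAAC', 'AAAC', 'GGGT'],
    'cell_type': ['K562', 'HepG2', 'K562'],
    'log2FC_mean': [1.5, 0.2, -0.5],
    'lfcSE': [0.3, 0.1, 0.2],
}).set_index(['sequence', 'cell_type'])

# Move the cell type from the row index into the columns.
wide_df = long_df.unstack(level=-1)

# Flatten ('log2FC_mean', 'K562') style labels into 'log2FC_mean_K562'.
wide_df.columns = ['_'.join(col).rstrip('_') for col in wide_df.columns.values]
wide_df = wide_df.reset_index()

# Keep only sequences measured in every cell type, as the null filter does.
activity_columns = ['log2FC_mean_K562', 'log2FC_mean_HepG2']
complete = wide_df[wide_df[activity_columns].notnull().all(axis=1)].reset_index(drop=True)
print(complete)
```

Here "GGGT" has no HepG2 measurement, so unstacking leaves a missing value in its HepG2 columns and the null filter removes it.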

### Drop duplicated sequences across projects

In order to keep a single sequence-to-activity pair for each sequence, we drop duplicated oligos across projects. If a duplicated oligo is present in UKBB, we keep only the UKBB information. If a duplicated oligo is present in GTEx and CRE but not UKBB, we keep only the GTEx information. Finally, we remove synthetic sequences (those labeled 'synth' in the chr column).

In [17]:
out_df = out_df.sort_values(by=['data_project'], key=lambda x: x.map({'UKBB': 0, 'GTEx': 1, 'CRE':2})).reset_index(drop=True)
out_df = out_df.drop_duplicates('sequence', keep='first').reset_index(drop=True)

out_df = out_df[out_df['chr'] != 'synth'].reset_index(drop=True)

out_df

Unnamed: 0,IDs,sequence,data_project,OL,class,chr,K562_log2FC,HepG2_log2FC,SKNSH_log2FC,K562_lfcSE,SKNSH_lfcSE,HepG2_lfcSE
0,12:103293737:G:A:A:wC,AAAAAAAAAAAAAAAAAAAAAAAAAAAAGTGAGTAACAAAAACAAC...,UKBB,OL33,"DVT,Asthma,Cholelithiasis",12,1.114658,0.785664,0.715749,0.113273,0.102159,0.199215
1,12:56396768:G:C:R:wC,GGTGTGGTGGTACATACCTGTAATCTCAGCTACTTGAGAGGCTGAG...,UKBB,"OL28,OL31,OL33","Asthma,DVT,Eosino,FEV1FVC",12,-0.387188,-0.208768,-0.201981,0.291659,0.729915,0.653503
2,12:56396768:G:C:A:wC,GGTGTGGTGGTACATACCTGTAATCTCAGCTACTTGAGAGGCTGAG...,UKBB,"OL28,OL31,OL33","Asthma,DVT,Eosino,FEV1FVC",12,-0.491576,-0.300131,-0.370913,0.474038,1.305596,1.606495
3,15:73041767:A:G:A:wC,GGTGTGGTGGCTCATGCCTGTAATCCCAGCACTTTGGGAGGCCGAG...,UKBB,OL32,UA,15,0.353154,-0.287985,0.527210,0.171574,0.478370,0.147259
4,15:73041767:A:G:R:wC,GGTGTGGTGGCTCATGCCTGTAATCCCAGCACTTTGGGAGGCCGAG...,UKBB,OL32,UA,15,0.469194,-0.132364,0.816555,0.147521,0.422527,0.161479
...,...,...,...,...,...,...,...,...,...,...,...,...
777930,2:70141730:NA:NA,CAGGTCCCCGGAAGTTTTCCCCGGGGACAGGTCTTGGCAACAGGTC...,CRE,OL15,,2,2.523162,2.951357,2.888325,0.100441,0.141465,0.079976
777931,10:85302269:NA:NA,CAGGTCAGGCATTTGCCTCTGCCACCAAACACTAGTTATTTCCATT...,CRE,OL15,,10,-0.304211,-0.589980,-0.654606,0.150466,0.215701,0.201285
777932,8:146012002:NA:NA,CAGGTATTCGGGAGGCTGAGACAGGAGAATCCCTTGAACTCGGGAG...,CRE,OL15,,8,-0.187047,-0.193959,0.404501,0.253936,0.270250,0.337266
777933,8:6652084:NA:NA,CAGTCCTGTCTGTCCTTTCCAGCAGCCCCGCAGAGGCTGAACCCCC...,CRE,OL15,,8,-0.199414,0.773116,2.511860,0.130757,0.132544,0.111419
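
The project priority used when dropping duplicates (UKBB, then GTEx, then CRE) can be verified on a toy table. This is a sketch with invented rows; `kind='stable'` is an addition here so that ties keep their original order, which the notebook cell does not specify.

```python
import pandas as pd

# Invented rows: 'AAAC' is in CRE and UKBB, 'GGGT' in CRE and GTEx, 'CCCA' only in CRE.
toy = pd.DataFrame({
    'sequence': ['AAAC', 'AAAC', 'GGGT', 'GGGT', 'CCCA'],
    'data_project': ['CRE', 'UKBB', 'CRE', 'GTEx', 'CRE'],
    'K562_log2FC': [0.9, 1.1, -0.4, -0.6, 2.0],
})

priority = {'UKBB': 0, 'GTEx': 1, 'CRE': 2}

# Sort by project priority, then keep the first (highest-priority) row per sequence.
deduped = (
    toy.sort_values(by='data_project', key=lambda col: col.map(priority), kind='stable')
       .drop_duplicates('sequence', keep='first')
       .reset_index(drop=True)
)
print(deduped)
```

Each sequence ends up with the measurement from its highest-priority project, and sequences found in a single project are kept unchanged.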
