# MPRA validation

This notebook processes the MPRA results for oligo library (OL) 46. This library includes all of the natural and synthetic cell type-specific sequences proposed in the CODA paper, experimental controls, and other synthetic sequences pertaining to other internal optimization experiments and projects.

In [1]:
import os
import numpy as np
import pandas as pd
import gzip

from IPython.display import display, HTML
display(HTML("<style>.container { width:98% !important; }</style>"))

pd.set_option('display.max_columns', None)


### Fetch the MPRA results

This dataset comes from a single oligo library, so averaging across libraries, as done in the MPRA training notebooks, is neither possible nor needed. We load the result file for each cell type directly and concatenate them into a single dataframe. We also load an updated version of the attributes file that records the intended target cell type for sequences in the project "BODA:genomics", which the original attributes file does not provide.

In [2]:
#--- Load results
K562_df  = pd.read_table('/validation/OL46_K562_20220720.out', sep='\t', header=0, index_col='ID').reset_index(drop=False)
HepG2_df = pd.read_table('/validation/OL46_HepG2_20220720.out', sep='\t', header=0, index_col='ID').reset_index(drop=False)
SKNSH_df = pd.read_table('/validation/OL46_SKNSH_20220720.out', sep='\t', header=0, index_col='ID').reset_index(drop=False)

#--- Concatenate into a single dataframe
out_df = pd.concat([K562_df.loc[:,('ID', 'project','log2FoldChange', 'lfcSE')],
                    HepG2_df.loc[:,('log2FoldChange', 'lfcSE')],
                    SKNSH_df.loc[:,('log2FoldChange', 'lfcSE')]],
                   axis=1)

#--- Rename concatenated columns
out_df.columns = ['ID', 'project', 'K562_log2FC', 'K562_lfcSE', 'HepG2_log2FC', 'HepG2_lfcSE', 'SKNSH_log2FC', 'SKNSH_lfcSE']

#--- Load the attributes file to update the 'project' attribute 
attributes_df = pd.read_csv('/validation/OL46.attributes_v2', sep='\t')
new_project_dict = dict(zip(attributes_df['ID'], attributes_df['project']))

out_df['project'] = out_df.apply(lambda x: new_project_dict[x['ID']], axis=1)
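
The column-wise concatenation above pairs rows by position, so it implicitly assumes that the three result files list the oligos in the same order. If that assumption were ever in doubt, an ID-keyed join would be one way to guard against misaligned rows. The sketch below uses small made-up frames in place of the real `.out` files; the toy IDs, values, and the helper `tag` are illustrative assumptions, not part of the original pipeline.

```python
import pandas as pd

# Toy stand-ins for two per-cell-type result tables (hypothetical values).
# Note that the second table lists the same IDs in a different order.
k562 = pd.DataFrame({'ID': ['a', 'b', 'c'], 'project': ['p1', 'p2', 'p3'],
                     'log2FoldChange': [1.0, 2.0, 3.0], 'lfcSE': [0.1, 0.2, 0.3]})
hepg2 = pd.DataFrame({'ID': ['c', 'a', 'b'],
                      'log2FoldChange': [30.0, 10.0, 20.0], 'lfcSE': [0.3, 0.1, 0.2]})

def tag(df, cell):
    """Index by ID and prefix the effect-size columns with the cell type."""
    return (df.set_index('ID')[['log2FoldChange', 'lfcSE']]
              .rename(columns={'log2FoldChange': f'{cell}_log2FC',
                               'lfcSE': f'{cell}_lfcSE'}))

# Join on the ID index instead of relying on row position.
merged = (k562.set_index('ID')[['project']]
              .join(tag(k562, 'K562'))
              .join(tag(hepg2, 'HepG2'), how='inner')
              .reset_index())
```

With an inner join, any oligo missing from one of the tables is dropped rather than silently paired with the wrong row.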

### Get the nucleotide sequence

We load the nucleotide sequences from the FASTA file and assign each oligo ID its corresponding sequence.

In [3]:
#--- Add nucleotide sequences
filepath = '/Fastas/OL46_reference.fasta.gz'

library_num = 'OL46'
fasta_dict = {}
with gzip.open(filepath, 'rt') as f:
    for line_str in f:
        if line_str[0] == '>':
            oligo_id = line_str.lstrip('>').rstrip('\n')
            fasta_dict[oligo_id] = ''
        else:
            fasta_dict[oligo_id] += line_str.rstrip('\n')


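# Headers containing ';' bundle several oligo IDs (wrapped in parentheses);
# register the shared sequence under each individual ID as well.
compressed_IDs = [ID for ID in fasta_dict.keys() if ';' in ID]
for compressed_ID in compressed_IDs:
    stripped_ID = compressed_ID.lstrip('(').rstrip(')')
    for single_ID in stripped_ID.split(';'):
        fasta_dict[single_ID] = fasta_dict[compressed_ID]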
        
out_df['sequence'] = out_df.apply(lambda x: fasta_dict[x['ID']], axis=1)
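
To make the parsing logic above easier to follow, here is a minimal, self-contained sketch that runs the same steps on an in-memory gzipped FASTA. The record names and sequences are made up for illustration, and the `(id1;id2)` header shape is inferred from the parenthesis stripping and `;` splitting in the cell above rather than from the actual file.

```python
import gzip
import io

# Toy FASTA (hypothetical records); the second header bundles two oligo IDs.
toy_fasta = ">oligo_1\nACGT\nTTGA\n>(oligo_2;oligo_3)\nGGCC\n"
buffer = io.BytesIO(gzip.compress(toy_fasta.encode()))

fasta_dict = {}
with gzip.open(buffer, 'rt') as f:
    for line in f:
        line = line.rstrip('\n')
        if line.startswith('>'):
            # Header line: start a new record
            oligo_id = line[1:]
            fasta_dict[oligo_id] = ''
        else:
            # Sequence line: append to the current record (handles wrapped sequences)
            fasta_dict[oligo_id] += line

# Expand bundled headers so each individual ID maps to the shared sequence
for compressed_id in [key for key in fasta_dict if ';' in key]:
    for single_id in compressed_id.strip('()').split(';'):
        fasta_dict[single_id] = fasta_dict[compressed_id]
```

After this runs, `fasta_dict` still contains the bundled key alongside the expanded ones, which matches the behavior of the cell above.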

### Get the design/proposal method

Each oligo ID contains information about the method used to propose the sequence. We extract that information and include it in a column 'method'.

In [4]:
#--- Add method

# Malinois natural
row_filter = (out_df['ID'].str.contains('chr')) & (~ out_df['ID'].str.contains('DHS'))
assert row_filter.sum() > 0
out_df.loc[row_filter, 'method'] = 'Malinois-natural'

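# all others (substring match; longer tags such as 'fsp_uc' are listed after
# 'fsp' so that they overwrite the shorter match)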
methods = ['DHS', 'fsp', 'al', 'sa', 'hmc', 'fsp_uc', 'al_uc', 'sa_uc', 'sa_rep']
method_rename = {'DHS': 'DHS-natural', 'fsp': 'Fast SeqProp',
                 'al': 'AdaLead', 'sa': 'Simulated Annealing',
                 'hmc': 'hmc', 'fsp_uc': 'fsp_uc', 'al_uc': 'al_uc',
                 'sa_uc': 'sa_uc', 'sa_rep': 'sa_rep'}
for method in methods:
    row_filter = (out_df['ID'].str.contains(method))
    assert row_filter.sum() > 0
    out_df.loc[row_filter, 'method'] = method_rename[method]
    
out_df.sort_values('method', inplace=True)
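
The substring matching above depends on the order of the `methods` list, because a short tag such as `sa` also matches IDs tagged `sa_rep`. As a hedged alternative, the tag could be parsed out explicitly. The sketch below assumes that synthetic IDs carry the tag between `::` and the next `__`, as in the IDs visible in this notebook's output; the third ID is hypothetical and only there to show the overlap.

```python
import pandas as pd

ids = pd.Series([
    '20211206_221956__59439361__0::al__k562__0',   # ID format seen in the output table
    '20211207_205407__862697__202::sa__k562__0',   # ID format seen in the output table
    '20220101_000000__1__0::sa_rep__hepg2__1',     # hypothetical ID for illustration
])

# Subset of the renaming dictionary used in the cell above
method_rename = {'al': 'AdaLead', 'sa': 'Simulated Annealing', 'sa_rep': 'sa_rep'}

# A plain substring test for 'sa' hits both the 'sa' and the 'sa_rep' IDs
substring_hits = ids.str.contains('sa', regex=False)

# Extracting the exact tag between '::' and the next '__' avoids the overlap
tags = ids.str.extract(r'::([A-Za-z_]+?)__', expand=False)
methods = tags.map(method_rename)
```

IDs without a `::` tag (for example the natural sequences) would get `NaN` here and would still need the separate `chr`/`DHS` handling shown above.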

### Get the intended target cell type

The project attribute column contains information about the intended target cell type (the predicted target cell type before we obtained the empirical results). 

In [5]:
#--- Add target cell type

cell_types_lower = ['k562', 'hepg2', 'sknsh']
cell_types = ['K562', 'HepG2', 'SKNSH']
std_cell_names = dict(zip(cell_types_lower, cell_types))

non_control_filter = out_df['method'].notnull()
target_cell_values = [std_cell_names[x[-1]] for x in out_df[non_control_filter]['project'].str.split(':')]

out_df.loc[non_control_filter, 'target_cell'] = target_cell_values
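
The same lookup can be written without a Python-level loop. This is a minimal sketch on toy project strings; the first three follow the `BODA:<method>:<cell>` pattern visible in the output table, while `ctrl:example` is a hypothetical label standing in for a sequence with no target cell type.

```python
import pandas as pd

std_cell_names = {'k562': 'K562', 'hepg2': 'HepG2', 'sknsh': 'SKNSH'}

projects = pd.Series(['BODA:al:k562', 'BODA:sa:k562', 'BODA:al:sknsh',
                      'ctrl:example'])  # last entry is hypothetical

# Take the text after the last ':' and map it to the standard cell name;
# values that are not a known cell type become NaN instead of raising.
target_cell = projects.str.rsplit(':', n=1).str[-1].map(std_cell_names)
```

Unlike the list comprehension above, which raises a `KeyError` on an unexpected suffix, `map` leaves unmatched rows as `NaN`, so it is worth checking for missing values afterwards.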

### Get the penalization round

The oligo IDs of synthetic sequences include the motif-penalization round explained in the paper. The motif penalization described in the paper was designed to work with the Fast SeqProp algorithm. However, other groups of sequences that are not part of the CODA paper also carry penalization rounds.

In [6]:
#--- Add penalty round

out_df['round'] = 0

row_filter = out_df['method'].isin(['Fast SeqProp', 'AdaLead', 'Simulated Annealing'])
round_list = [int(split[-1]) for split in out_df.loc[row_filter, 'ID'].str.split('__').tolist()]
out_df.loc[row_filter, 'round'] = round_list
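
As an illustration of the round extraction, the sketch below applies the same idea to a toy frame using vectorized string methods. It assumes the round is the integer after the last `__` in the ID, as in the IDs shown in this notebook's output; the second and third IDs are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'ID': ['20211206_221956__59439361__0::al__k562__0',  # format seen in the output table
           'x__1__2::fsp__hepg2__4',                      # hypothetical penalized ID
           'chr1:100-300'],                               # hypothetical natural ID
    'method': ['AdaLead', 'Fast SeqProp', 'Malinois-natural'],
})

# Default round is 0; only the synthetic methods encode a round in the ID
toy['round'] = 0
synthetic = toy['method'].isin(['Fast SeqProp', 'AdaLead', 'Simulated Annealing'])

# The round is the integer after the last '__' in the ID
toy.loc[synthetic, 'round'] = (toy.loc[synthetic, 'ID']
                               .str.rsplit('__', n=1).str[-1]
                               .astype(int))
```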

### Filter the CODA paper sequences

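The lines below filter out all of the sequences that belong to other optimization objectives not described in the CODA paper. We keep the natural sequences and the round-0 (unpenalized) AdaLead and Simulated Annealing sequences, plus the Fast SeqProp sequences from every penalization round.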

In [7]:
#--- Filter only CODA paper sequences

non_penalized_methods = ['DHS-natural', 'Malinois-natural', 'AdaLead', 'Simulated Annealing']
CODA_filter_1 = out_df['method'].isin(non_penalized_methods) & (out_df['round'] == 0)
CODA_filter_2 = (out_df['method'] == 'Fast SeqProp')

out_df = out_df[CODA_filter_1 | CODA_filter_2].reset_index(drop=True)

out_column_order = ['ID', 'sequence', 'method', 'target_cell', 'round', 'project',
                    'K562_log2FC', 'K562_lfcSE', 'HepG2_log2FC',
                    'HepG2_lfcSE', 'SKNSH_log2FC', 'SKNSH_lfcSE']

out_df = out_df[out_column_order]
out_df

| | ID | sequence | method | target_cell | round | project | K562_log2FC | K562_lfcSE | HepG2_log2FC | HepG2_lfcSE | SKNSH_log2FC | SKNSH_lfcSE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20211206_221956__59439361__0::al__k562__0 | TCGAAGCGATGTAATCACCCATGAACTGTCTCTCCAAGAGTAGCAA... | AdaLead | K562 | 0 | BODA:al:k562 | 6.465092 | 0.412323 | -1.307851 | 0.440709 | -1.654066 | 0.314627 |
| 1 | 20211208_41545__46700778__99::al__sknsh__0 | ATGTTGCCGTCGAGGATTCTCTCGTTCGGTTGCCGTCTAAAGGATG... | AdaLead | SKNSH | 0 | BODA:al:sknsh | -0.712902 | 0.176119 | 0.525922 | 0.302298 | 3.280751 | 0.155179 |
| 2 | 20211208_41545__46700778__98::al__sknsh__0 | CCTCCAACTTCTTCGACGGTGCGGACCAATATCCGGTTGGGGATGC... | AdaLead | SKNSH | 0 | BODA:al:sknsh | -1.000488 | 0.423602 | -0.766991 | 0.355013 | 1.993535 | 0.258053 |
| 3 | 20211208_41545__46700778__97::al__sknsh__0 | GATGGATACCGACGTAGGATGCATCATTGGAGCTGTCAAATCTGTG... | AdaLead | SKNSH | 0 | BODA:al:sknsh | -0.745058 | 0.233764 | -0.710009 | 0.265298 | 1.596357 | 0.124595 |
| 4 | 20211208_41545__46700778__96::al__sknsh__0 | TCTCGACTCGACGGGCCGTCGAAGTTACTGGAGCATCACTATCTGT... | AdaLead | SKNSH | 0 | BODA:al:sknsh | -0.998574 | 0.319906 | 0.383423 | 0.221527 | 5.125296 | 0.208422 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 74995 | 20211207_205407__862697__202::sa__k562__0 | TCGAGGAGCTGACACCGCCAGTTGATTCATATCTTCGACGGTGACA... | Simulated Annealing | K562 | 0 | BODA:sa:k562 | 4.349904 | 0.204375 | -0.074157 | 0.802863 | -7.026503 | 3.907715 |
| 74996 | 20211207_205407__862697__1995::sa__k562__0 | CGCCTTATCGCGTTCTACACGCCGTCGACAGCAGAAGAGCTTATCG... | Simulated Annealing | K562 | 0 | BODA:sa:k562 | 1.225483 | 0.176462 | -1.977623 | 0.556474 | -2.576055 | 0.539065 |
| 74997 | 20211207_205407__862697__200::sa__k562__0 | CCTTTTAGGCGGGAAAGTCCCACGCCAGACATATGCTGCACGTGAT... | Simulated Annealing | K562 | 0 | BODA:sa:k562 | 0.676201 | 0.198427 | -1.027023 | 0.419903 | -1.574123 | 0.403715 |
| 74998 | 20211207_205407__862697__203::sa__k562__0 | CTTTCGACGTGTGATGCTAAACCATCGCTTATCATGCTAGCACACA... | Simulated Annealing | K562 | 0 | BODA:sa:k562 | 5.780654 | 0.393954 | -0.911137 | 0.375642 | -1.437424 | 0.417942 |
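
To see what the CODA filter in the cell above keeps and drops, here is a small sketch that applies the same two conditions to a toy frame. The method and round combinations are made up for illustration and are not taken from OL46.

```python
import pandas as pd

# Toy method/round combinations (hypothetical)
toy = pd.DataFrame({
    'method': ['AdaLead', 'AdaLead', 'Fast SeqProp', 'Fast SeqProp', 'hmc', 'DHS-natural'],
    'round':  [0, 2, 0, 3, 0, 0],
})

non_penalized_methods = ['DHS-natural', 'Malinois-natural', 'AdaLead', 'Simulated Annealing']

# Condition 1: natural, AdaLead, and Simulated Annealing sequences at round 0 only
keep_round_zero = toy['method'].isin(non_penalized_methods) & (toy['round'] == 0)
# Condition 2: Fast SeqProp sequences from every round
keep_fsp = toy['method'] == 'Fast SeqProp'

kept = toy[keep_round_zero | keep_fsp].reset_index(drop=True)
```

In this toy case the penalized AdaLead row (round 2) and the `hmc` row are dropped, while both Fast SeqProp rows survive.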
