In [3]:
import numpy as np
import pandas as pd

# Preprocess MPRA data

In this example, I'm using the CMS directory that Hannah shared with us (unzipped below). It should contain three types of files:

1. A FASTA file (e.g., \*.fa) that contains ID strings and DNA sequences. (ref: https://en.wikipedia.org/wiki/FASTA_format)
2. An MPRA activity file (e.g., \*.(?!emVar)\*.out), which is a TSV file
3. An MPRA "skew" file (e.g., \*.(emVar)\*.out), which is also a TSV file

We will use the FASTA file to get DNA sequences and the activity files to get the measured activity of each sequence. The skew files capture the differential activity of two MPRA sequences that differ at one nucleotide (i.e., one character). The name emVar stands for "expression-modulating variant", i.e., a variant whose alleles drive different reporter activity.
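The three file types can be told apart by name alone. The sketch below is one way to do that; `classify_cms_file` is a hypothetical helper, and the rule it applies (".fa" for FASTA, "emVar" anywhere in a ".out" name for skew) is an assumption based on the patterns listed above. The match is case-insensitive because the files in this archive spell the tag "emVAR".

```python
from pathlib import Path

def classify_cms_file(filename):
    """Return 'fasta', 'skew', 'activity', or 'other' based on the file name.

    Hypothetical helper: the naming rules are assumed from the patterns
    described in the text, not taken from any official specification.
    """
    name = Path(filename).name
    if name.endswith('.fa'):
        return 'fasta'
    if name.endswith('.out'):
        # Skew files carry the emVar tag; compare in lower case so that
        # both "emVar" and "emVAR" are recognized.
        return 'skew' if 'emvar' in name.lower() else 'activity'
    return 'other'

example_names = [
    'CMS_example_summit_shift_HEPG2_emVAR_20201013.out',
    'CMS_example_summit_shift_SKNSH_20201013.out',
    'CMS_MRPA_092018_60K.balanced.collapsed.seqOnly.fa',
]
file_kinds = {name: classify_cms_file(name) for name in example_names}
```

With real files on disk, the same function could be applied to the names returned by `Path('.').iterdir()`.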

In [1]:
! unzip CMS.zip

Archive:  CMS.zip
 extracting: CMS_example_summit_shift_HEPG2_emVAR_20201013.out  
 extracting: CMS_example_summit_shift_SKNSH_emVAR_20201013.out  
 extracting: CMS_MRPA_092018_60K.balanced.collapsed.seqOnly.fa  
 extracting: CMS_example_summit_shift_SKNSH_20201013.out  
 extracting: CMS_example_summit_shift_HEPG2_20201013.out  


## Build ID to Seq dictionary

Each FASTA entry spans two or more lines. The first line begins with a ">" character followed by an ID string; the following line(s) hold the DNA sequence associated with that ID. FASTA files often wrap sequence lines at a fixed width (e.g., 50 characters) for readability, so the sequence associated with one ID may be split across many lines.

In [2]:
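fasta_dict = {}
with open('CMS_MRPA_092018_60K.balanced.collapsed.seqOnly.fa', 'r') as f:
    for line in f:
        if line.startswith('>'):
            # Header line: drop the leading '>' and start a new entry.
            my_id = line.rstrip()[1:]
            fasta_dict[my_id] = ''
        else:
            # Sequence line: append to the current entry (sequences may be wrapped).
            fasta_dict[my_id] += line.rstrip()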
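To check that wrapped sequences are joined correctly, the same logic can be run on a small in-memory FASTA. This is only a sketch: `read_fasta` and the toy records are hypothetical and not part of the CMS data. It also raises an error if sequence text appears before any ">" header, a case the loop above does not guard against.

```python
import io

def read_fasta(handle):
    """Parse FASTA text from an open file-like object into {id: sequence}."""
    records = {}
    current_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith('>'):
            # Header line: everything after '>' is used as the ID.
            current_id = line[1:]
            records[current_id] = []
        elif current_id is None:
            raise ValueError('sequence data found before the first ">" header')
        else:
            # Sequence line: collect chunks and join them once at the end.
            records[current_id].append(line)
    return {seq_id: ''.join(chunks) for seq_id, chunks in records.items()}

# Toy input: the first record is wrapped over two lines.
toy_fasta = io.StringIO('>oligo_a\nACGT\nTTGA\n>oligo_b\nGGCC\n')
toy_records = read_fasta(toy_fasta)
```

Collecting chunks in a list and joining once avoids repeated string concatenation, which can matter for files with long, heavily wrapped sequences.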

## Collect activity measurements

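Now we'll collect (DNA sequence, MPRA activity) pairs and store them in a pandas DataFrame. First we'll grab data from a neural cell type, SKNSH. IDs that are missing from the FASTA dictionary are reported and skipped; activities that cannot be parsed as numbers (e.g., NA) are reported and stored as NaN.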

In [18]:
data = []

with open('CMS_example_summit_shift_SKNSH_20201013.out', 'r') as f:
    # Locate the activity column from the header row.
    header = f.readline().rstrip()
    activity_idx = header.split().index('log2FoldChange')
    for line in f:
        entry = line.rstrip().split()
        try:
            data.append([fasta_dict[entry[0]], float(entry[activity_idx])])
        except KeyError:
            # ID has no sequence in the FASTA file: report and skip the row.
            print(f'Could not find key: {entry[0]}')
        except ValueError:
            # Activity is not numeric (e.g. 'NA'): keep the sequence, store NaN.
            print(f'For key: {entry[0]}, cannot convert value: {entry[activity_idx]}')
            data.append([fasta_dict[entry[0]], np.nan])

data = pd.DataFrame(data, columns=['sequence', 'activity'])

Could not find key: 5860set_5overlaps_oligo3741
Could not find key: 5860set_5overlaps_oligo3742
Could not find key: 5860set_5overlaps_oligo3743
Could not find key: 5860set_5overlaps_oligo3744
Could not find key: 5860set_5overlaps_oligo3745
Could not find key: 5860set_5overlaps_oligo3746
Could not find key: 5860set_5overlaps_oligo3747
Could not find key: 5860set_5overlaps_oligo3748
For key: 5860set_5overlaps_oligo4181, cannot convert value: NA
For key: 5860set_5overlaps_oligo4183, cannot convert value: NA
For key: 5860set_5overlaps_oligo4186, cannot convert value: NA
For key: 5860set_5overlaps_oligo4188, cannot convert value: NA
For key: BEB_rs10262647_alt, cannot convert value: NA
For key: BEB_rs10262647_ref, cannot convert value: NA
For key: BEB_rs1049254_ref, cannot convert value: NA
For key: BEB_rs1078587_alt, cannot convert value: NA
For key: BEB_rs11090866_ref, cannot convert value: NA
For key: BEB_rs11228164_ref, cannot convert value: NA
Could not find key: BEB_rs11263861_alt
...

Could not find key: ESN_rs34393345_alt
Could not find key: ESN_rs34393345_ref
For key: ESN_rs34592426_alt, cannot convert value: NA
For key: ESN_rs34592426_ref, cannot convert value: NA
Could not find key: ESN_rs34714667_alt
Could not find key: ESN_rs34714667_ref
Could not find key: ESN_rs35180670_ref
Could not find key: ESN_rs35406945_alt
Could not find key: ESN_rs35406945_ref
Could not find key: ESN_rs35624343_alt
Could not find key: ESN_rs35624343_ref
Could not find key: ESN_rs35837993_alt
Could not find key: ESN_rs35837993_ref
For key: ESN_rs372839294_alt, cannot convert value: NA
For key: ESN_rs372839294_ref, cannot convert value: NA
For key: ESN_rs3811422_alt, cannot convert value: NA
For key: ESN_rs3811422_ref, cannot convert value: NA
Could not find key: ESN_rs3814921_ref
Could not find key: ESN_rs3814922_ref
Could not find key: ESN_rs3819306_alt
Could not find key: ESN_rs397843773_alt
Could not find key: ESN_rs397843773_ref
Could not find key: ESN_rs4010957_alt
...

For key: IBS_rs940234_ref, cannot convert value: NA
Could not find key: ITU_rs10796828_alt
Could not find key: ITU_rs10826185_alt
Could not find key: ITU_rs11134728_alt
Could not find key: ITU_rs11134728_ref
For key: ITU_rs111569430_alt, cannot convert value: NA
For key: ITU_rs111569430_ref, cannot convert value: NA
For key: ITU_rs11165252_ref, cannot convert value: NA
Could not find key: ITU_rs11263861_alt
Could not find key: ITU_rs11263861_ref
Could not find key: ITU_rs11264110_alt
For key: ITU_rs11541137_alt, cannot convert value: NA
For key: ITU_rs11541137_ref, cannot convert value: NA
For key: ITU_rs11584821_ref, cannot convert value: NA
Could not find key: ITU_rs11589697_alt
Could not find key: ITU_rs11589697_ref
For key: ITU_rs11910705_alt, cannot convert value: NA
Could not find key: ITU_rs12056045_alt
Could not find key: ITU_rs12109157_alt
Could not find key: ITU_rs12109157_ref
Could not find key: ITU_rs1268644_alt
For key: ITU_rs13034722_alt, cannot convert value: NA
...

Could not find key: STU_rs456195_alt
Could not find key: STU_rs470468_alt
For key: STU_rs4989268_alt, cannot convert value: NA
For key: STU_rs4989268_ref, cannot convert value: NA
Could not find key: STU_rs547793267_alt
Could not find key: STU_rs549793936_alt
Could not find key: STU_rs549793936_ref
For key: STU_rs559926519_alt, cannot convert value: NA
Could not find key: STU_rs56303742_ref
Could not find key: STU_rs564149910_ref
Could not find key: STU_rs6095279_ref
For key: STU_rs62065781_ref, cannot convert value: NA
For key: STU_rs62068735_alt, cannot convert value: NA
For key: STU_rs6453646_alt, cannot convert value: NA
For key: STU_rs6453646_ref, cannot convert value: NA
For key: STU_rs6453648_alt, cannot convert value: NA
For key: STU_rs6565915_ref, cannot convert value: NA
Could not find key: STU_rs6969329_ref
For key: STU_rs71462347_alt, cannot convert value: NA
For key: STU_rs71462347_ref, cannot convert value: NA
Could not find key: STU_rs7404028_alt
...
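The cell above prints one line per problem row, which produces very long output. A possible alternative is to collect the problem IDs and report only counts. The sketch below assumes rows have already been split into fields; `collect_pairs`, its parameters, and the toy inputs are hypothetical names used for illustration.

```python
import math

def collect_pairs(rows, id_to_seq, id_col=0, activity_col=1):
    """Pair sequences with activities, tracking rows that cannot be used as-is.

    Returns (pairs, missing_ids, na_ids):
      pairs       - list of (sequence, activity) tuples; activity is NaN if not numeric
      missing_ids - IDs with no entry in id_to_seq (these rows are skipped)
      na_ids      - IDs whose activity could not be converted to float
    """
    pairs, missing_ids, na_ids = [], [], []
    for row in rows:
        seq_id = row[id_col]
        if seq_id not in id_to_seq:
            missing_ids.append(seq_id)
            continue
        try:
            activity = float(row[activity_col])
        except ValueError:
            na_ids.append(seq_id)
            activity = math.nan
        pairs.append((id_to_seq[seq_id], activity))
    return pairs, missing_ids, na_ids

# Toy data: 'b' has a non-numeric activity, 'c' has no sequence.
toy_seqs = {'a': 'ACGT', 'b': 'GGCC'}
toy_rows = [['a', '1.5'], ['b', 'NA'], ['c', '0.2']]
pairs, missing_ids, na_ids = collect_pairs(toy_rows, toy_seqs)
print(f'{len(missing_ids)} IDs missing from FASTA, '
      f'{len(na_ids)} with non-numeric activity')
```

Keeping the ID lists around also makes it easy to inspect which oligos were dropped, or to write them to a file for follow-up.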

In [20]:
data.to_csv('example_SKNSH.txt', sep='\t')
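Note that `to_csv` writes the row index as an unnamed first column by default, so the saved file needs `index_col=0` when it is read back (passing `index=False` when writing would omit the index instead). The sketch below shows the round trip on a toy DataFrame written to an in-memory buffer rather than to `example_SKNSH.txt`; the toy values are made up for illustration.

```python
import io

import numpy as np
import pandas as pd

# Toy stand-in for the real table: one measured and one missing activity.
toy = pd.DataFrame({'sequence': ['ACGT', 'GGCC'], 'activity': [1.5, np.nan]})

buffer = io.StringIO()
toy.to_csv(buffer, sep='\t')  # the row index is written as the first column
buffer.seek(0)

# Read the table back, treating the first column as the index.
reloaded = pd.read_csv(buffer, sep='\t', index_col=0)

# Rows with missing activity can be dropped before any downstream use.
measured = reloaded.dropna(subset=['activity'])
```

NaN values are written as empty fields and parsed back as NaN, so the missing measurements survive the round trip.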