# Target Promoter extraction

## Introduction

We have generated a synthetic promoter library that includes all sequences within the exploratory space, i.e. sequences with representative samples in the training set of the random forest machine learner. Here we evaluate their performance with the random forest machine learner and select six sequences spanning the range of expression strength: two samples each for low, medium, and high expression.

## System initiation

Loading all necessary libraries.

In [None]:
import os
import time
import pandas as pd
import numpy as np
from ExpressionExpert_Functions import Data_Src_Load, Extract_MultLibSeq, Find_Near_Seq
%matplotlib inline

### Variable setting

We load the naming conventions from the configuration file (here 'config_EcolPtai.txt').

In [None]:
Name_Dict = dict()
with open('config_EcolPtai.txt') as Conf:
    for line in Conf.read().splitlines():
        # skip comment lines and blank lines (a blank line has no ':' to split on)
        if line.strip() and not line.startswith('#'):
            (key, val) = line.split(':', 1)
            Name_Dict[key.strip()] = val.strip()


Data_File = Name_Dict['Data_File']
# extract the filename for naming of newly generated files
File_Base = Name_Dict['File_Base']
# the generated files will be stored in a subfolder with custom name
Data_Folder = Name_Dict['Data_Folder']
# column name of expression values
Y_Col_Name = eval(Name_Dict['Y_Col_Name'])
# figure file type
Fig_Type = Name_Dict['Figure_Type']
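
The parsing step above expects one `key: value` pair per line, with `#` marking comments. The following is a minimal sketch of that format, assuming a hypothetical config content: the keys are taken from this notebook, but all values and the helper name `parse_config` are made up for illustration. It uses `ast.literal_eval` instead of `eval` to turn the list-valued entry into a Python list, which is a safer alternative rather than what the notebook itself does.

```python
import ast

# Hypothetical config content; keys follow the notebook, values are invented.
sample_config = """\
# comment lines start with '#'
Data_File: example_library.csv
File_Base: example_library

Y_Col_Name: ['promoter activity']
Library_Expression: 1
"""

def parse_config(text):
    """Parse 'key: value' lines, skipping comments and blank lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        # split only on the first ':' so values may contain colons
        key, val = line.split(':', 1)
        settings[key.strip()] = val.strip()
    return settings

config = parse_config(sample_config)
# literal_eval only accepts Python literals, unlike eval
y_cols = ast.literal_eval(config['Y_Col_Name'])
print(config['Data_File'], y_cols)
```

All values are stored as strings, so numeric or list-valued entries need an explicit conversion before use.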

## Data loading

General information on the data source CSV file is stored in the 'config.txt' file generated in the '0-Workflow' notebook. The sequence and expression data are stored in a CSV file with an identifier in column 'ID' (not used for anything), the DNA sequence in column 'Sequence', and the expression strength in column 'promoter activity'. While loading, each sequence is converted to a label-encoded sequence, with ['A','C','G','T'] replaced by [0,1,2,3], and to a one-hot encoding.
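
To illustrate the two encodings, here is a minimal sketch using NumPy. It is not the code inside `Data_Src_Load`; the function names `label_encode` and `one_hot_encode` are hypothetical and only mirror the mapping described above.

```python
import numpy as np

NUCLEOTIDES = ['A', 'C', 'G', 'T']
# map each nucleotide to its integer label: A->0, C->1, G->2, T->3
LABELS = {base: idx for idx, base in enumerate(NUCLEOTIDES)}

def label_encode(sequence):
    """Return the integer label of each nucleotide in the sequence."""
    return np.array([LABELS[base] for base in sequence.upper()])

def one_hot_encode(sequence):
    """Return a (length, 4) matrix with a single 1 per row."""
    labels = label_encode(sequence)
    # indexing the identity matrix by label picks the matching unit vector
    return np.eye(len(NUCLEOTIDES), dtype=int)[labels]

labels = label_encode('ACGT')
onehot = one_hot_encode('GATTACA')
print(labels)
print(onehot.shape)
```

Each row of the one-hot matrix corresponds to one sequence position and each column to one nucleotide.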

In [None]:
SeqDat = Data_Src_Load(Name_Dict)

# loading synthetic promoter library
Csv_ID = Name_Dict['Csv_ID']
SynCsv_File = '{}_{}_{}.csv'.format(Name_Dict['SynLib_Date'], File_Base, Csv_ID)
# copy the dictionary so that changing 'Data_File' does not alter Name_Dict
Name_Synth = Name_Dict.copy()
Name_Synth['Data_File'] = SynCsv_File
SynDat = Data_Src_Load(Name_Synth)


## Target sequence extraction
### Selection of expression strength

In [None]:
# The absolute magnitudes of expression are shown in the violin plot of expression strength (cf. 'Expression strength' in the statistical analysis notebook)
print('Choose expression strength in absolute numbers and the number of output sequences:\n')

Target = list()
for Lib_idx in range(eval(Name_Dict['Library_Expression'])):
    # split() without arguments tolerates repeated or trailing spaces in the input
    Target.append(input('Target {}: '.format(Y_Col_Name[Lib_idx])).split())
    
Seq_Numb = int(input('Sequence samples:'))

Target_Expr = np.array(Target, dtype=float)
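

The cell above builds one row of target values per expression library. As a non-interactive sketch of the resulting array, assume a single library and three hypothetical targets entered as a space-separated string; the helper name `parse_targets` is an illustration and not part of the notebook.

```python
import numpy as np

def parse_targets(raw_inputs):
    """Convert one space-separated string per library into a 2D float array."""
    return np.array([raw.split() for raw in raw_inputs], dtype=float)

# one library, three hypothetical targets for low, medium, and high expression
target_expr = parse_targets(['0.1 0.5 0.9'])
print(target_expr.shape)
```

The array has one row per library and one column per target, so every library needs the same number of target values.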


In [None]:
Exp_SeqObj, Ref_Target_lst = Extract_MultLibSeq(SeqDat, Target_Expr, 1, Y_Col_Name)
Syn_SeqObj, Syn_Idx_lst = Extract_MultLibSeq(SynDat, Target_Expr, Seq_Numb, Y_Col_Name)

Ref_idx = np.vstack(Ref_Target_lst).reshape(-1)
mydict_ref = {Y_Col_Name[index]:SeqDat[Y_Col_Name[index]].iloc[Ref_idx].values for index in range(eval(Name_Dict['Library_Expression']))}

ExpProm_df = pd.DataFrame({'Promoter ID': SeqDat[Name_Dict['ID_Col_Name']].iloc[Ref_idx].values, 'Sequence': np.vstack(Exp_SeqObj).reshape(-1)})
ExpProm_df = ExpProm_df.join(pd.DataFrame(mydict_ref), how='right')

Targ_idx = np.vstack(Syn_Idx_lst).reshape(-1)
mydict_syn = {Y_Col_Name[index]:SynDat[Y_Col_Name[index]].iloc[Targ_idx].values for index in range(eval(Name_Dict['Library_Expression']))}
SynProm_df = pd.DataFrame({'Sequence': np.vstack(Syn_SeqObj).reshape(-1)})
SynProm_df = SynProm_df.join(pd.DataFrame(mydict_syn), how='right')

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
Out_df = pd.concat([ExpProm_df, SynProm_df], sort=False)
Csv_ID = 'Predicted-Target-Promoter'
TarCsv_File = os.path.join('{}_{}_{}.csv'.format(time.strftime('%Y%m%d'), File_Base, Csv_ID))
Out_df.to_csv(TarCsv_File, index=False)
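

The internals of `Extract_MultLibSeq` are not shown in this notebook. The following sketch only illustrates one plausible way to pick the sequences closest to each target expression and to combine them with a reference table; the helper `closest_rows`, the toy library, and all values are hypothetical.

```python
import pandas as pd

def closest_rows(df, column, target, n):
    """Return the n rows of df whose `column` value is closest to target."""
    distance = (df[column] - target).abs()
    return df.loc[distance.nsmallest(n).index]

# toy library with invented sequences and expression values
library = pd.DataFrame({
    'Sequence': ['AAAA', 'ACGT', 'GGCC', 'TTTT', 'CATG'],
    'promoter activity': [0.05, 0.45, 0.55, 0.95, 0.12],
})

# two sequences for each of three hypothetical targets
picks = pd.concat(
    [closest_rows(library, 'promoter activity', t, 2) for t in (0.1, 0.5, 0.9)],
    ignore_index=True,
)

# a reference promoter carries an ID, the synthetic picks do not
reference = pd.DataFrame({
    'Promoter ID': ['ref1'],
    'Sequence': ['ACGT'],
    'promoter activity': [0.45],
})
combined = pd.concat([reference, picks], sort=False, ignore_index=True)
print(combined)
```

Because the synthetic rows have no 'Promoter ID' column, `pd.concat` fills that column with missing values for them, which is also what to expect in the exported CSV file.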