# Random forest regressor training

## Introduction

This notebook walks through the training of a random forest regressor. The data is split into a training and a test set, by default at a ratio of 9:1. The expression values are scaled to zero mean and unit variance based on the training set. Positions that are non-informative because no alternative nucleotides have been tested are removed. Performance is evaluated with the R^2 score from sklearn, and the correlation of measured and predicted expression values is plotted. The feature importances of the random forest regressor represent the contribution of each nucleotide-position to the prediction. They are extracted and visualized with a Logo-plot.
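
The R^2 score used for performance evaluation can be reproduced by hand. The following is a minimal sketch with made-up measured and predicted values (they are not data from this project); it compares the textbook formula with `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Illustrative values only, not taken from the promoter library.
measured = np.array([1.0, 2.0, 3.0, 4.0])
predicted = np.array([1.1, 1.9, 3.2, 3.8])

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((measured - predicted) ** 2)
ss_tot = np.sum((measured - measured.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

r2_sklearn = r2_score(measured, predicted)
print('manual: {:.3f}, sklearn: {:.3f}'.format(r2_manual, r2_sklearn))
```

A value of 1 means perfect prediction, 0 corresponds to always predicting the mean, and negative values are possible for models that do worse than the mean.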

## System initiation

Loading all necessary libraries.

In [None]:
import os
import pickle
import time
import timeit

import joblib
from sklearn.model_selection import GroupShuffleSplit

from ExpressionExpert_Functions import (Data_Src_Load, make_DataDir, split_train_test,
                                        ExpressionScaler, Sequence_Conserved_Adjusted,
                                        Est_Grad_Save, Est_Grad_Feat)

%matplotlib inline

### Variable setting

We load the naming conventions from the configuration file (here 'config_EcolPtai.txt').

In [None]:
Name_Dict = dict()
with open('config_EcolPtai.txt') as Conf:
    myline = Conf.read().splitlines()
    for line in myline:
        # skip blank lines and comment lines
        if line.strip() and not line.startswith('#'):
            (key, val) = line.split(':', 1)
            Name_Dict[str(key.strip())] = val.strip()

Data_File = Name_Dict['Data_File']
# extract the filename for naming of newly generated files
File_Base = Name_Dict['File_Base']
# the generated files will be stored in a subfolder with custom name
Data_Folder = Name_Dict['Data_Folder']
# column name of expression values
Y_Col_Name = eval(Name_Dict['Y_Col_Name'])
# figure file type
Fig_Type = Name_Dict['Figure_Type']
make_DataDir(Name_Dict)
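
The parsing loop above can be tried without the real configuration file. This is a minimal sketch that assumes "key: value" lines with "#" comments; the sample keys and values are invented for illustration, `parse_config` is a hypothetical helper, and `ast.literal_eval` is swapped in for `eval` as a more restrictive way to turn a list-like string into a Python list.

```python
import ast
import io

# Hypothetical config content; the real file is generated in the '0-Workflow' notebook.
sample_config = """\
# comment lines are ignored
Data_File: example.csv
Y_Col_Name: ['promoter activity']
TestRatio: 0.1
"""

def parse_config(handle):
    """Read 'key: value' lines into a dict, skipping blank and comment lines."""
    names = {}
    for line in handle.read().splitlines():
        if not line.strip() or line.startswith('#'):
            continue
        # split only on the first colon so values may contain colons
        key, val = line.split(':', 1)
        names[key.strip()] = val.strip()
    return names

name_dict = parse_config(io.StringIO(sample_config))
y_col_name = ast.literal_eval(name_dict['Y_Col_Name'])  # string -> list
test_ratio = float(name_dict['TestRatio'])
print(name_dict['Data_File'], y_col_name, test_ratio)
```

All values are stored as strings, so numeric or list-valued entries have to be converted explicitly before use.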

## Data loading

General information on the data source csv-file is stored in the 'config.txt' file generated in the '0-Workflow' notebook. The sequence and expression data are stored in a csv file with an identifier in column 'ID' (not used in the analysis), the DNA sequence in column 'Sequence', and the expression strength in column 'promoter activity'. While loading, the sequence is converted to a label-encoded sequence, with ['A','C','G','T'] replaced by [0,1,2,3], and to a one-hot encoding.

In [None]:
SeqDat = Data_Src_Load(Name_Dict)
SeqDat.head(3)
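
The two encodings described above can be illustrated on a short sequence. This is a sketch only: `label_encode` and `one_hot_encode` are hypothetical helpers, and the layout (one row per position, one column per nucleotide) is an assumption, not necessarily the layout produced by `Data_Src_Load`.

```python
import numpy as np

NUCLEOTIDES = ['A', 'C', 'G', 'T']
LABEL = {nt: idx for idx, nt in enumerate(NUCLEOTIDES)}

def label_encode(seq):
    """Replace each nucleotide letter by its integer label (A=0, C=1, G=2, T=3)."""
    return [LABEL[nt] for nt in seq.upper()]

def one_hot_encode(seq):
    """Return a (sequence length x 4) matrix with a single 1 per row."""
    labels = label_encode(seq)
    # row i of the identity matrix is the one-hot vector for label i
    return np.eye(len(NUCLEOTIDES), dtype=int)[labels]

labels = label_encode('ACGT')
onehot = one_hot_encode('GATC')
print(labels)
print(onehot)
```

For the regressor, such a matrix is typically flattened into one feature vector per sequence, as done with `np.reshape` in the code tests at the end of this notebook.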

## Data manipulation

For the machine learning, the data is first separated into training and test sets. The training set is used to fit a standard scaler that standardizes the expression values to zero mean and unit variance. At each position, the entropy is calculated to assess how much nucleotide diversity has been sampled. If the entropy at a position is zero, i.e. only one nucleotide is present in all samples, this position is removed because it is non-informative for further analysis (Position entropy analysis).
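
The position entropy check can be sketched in a few lines. The example below assumes Shannon entropy in bits and uses four invented sequences; `position_entropy` is a hypothetical helper and not the implementation inside `Sequence_Conserved_Adjusted`.

```python
import math
from collections import Counter

def position_entropy(sequences):
    """Shannon entropy (bits) of the nucleotide distribution at each position."""
    entropies = []
    for column in zip(*sequences):  # iterate over alignment columns
        counts = Counter(column)
        total = len(column)
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(h)
    return entropies

# Illustrative sequences: positions 0 and 2 are fully conserved.
seqs = ['ACGT', 'ACGA', 'ATGC', 'AGGG']
entropy = position_entropy(seqs)

# positions with zero entropy carry no information and would be removed
removed = [i for i, h in enumerate(entropy) if math.isclose(h, 0.0, abs_tol=1e-12)]
print(removed)
```

With four nucleotides the entropy per position ranges from 0 bits (one nucleotide only) to 2 bits (all four nucleotides equally frequent).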

### Split data to train and test set

In [None]:
# SeqTrain, SeqTest = split_train_test(SeqDat)
train_size = 1 - eval(Name_Dict['TestRatio'])
# split number '1' because we only use one final test set. Cross validation comes later
gss = GroupShuffleSplit(n_splits=1, train_size=train_size)
X = SeqDat['Sequence']
y = SeqDat[Y_Col_Name]
groups = SeqDat['Sequence_letter-encrypted'].str.upper()
Train_Idx, Test_Idx = list(gss.split(X, y, groups))[0]
SeqTest = SeqDat.iloc[Test_Idx].reset_index(drop=True)
SeqTrain = SeqDat.iloc[Train_Idx].reset_index(drop=True)

TrainTest_Data = {'Train': SeqTrain, 'Test': SeqTest}
TrainTest_File = os.path.join(Data_Folder, '{}_{}_TrainTest-Data.pkl'.format(time.strftime('%Y%m%d'), File_Base))
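# the context manager closes the file handle after writing
with open(TrainTest_File, 'wb') as TrainTest_Handle:
    pickle.dump(TrainTest_Data, TrainTest_Handle)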
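
Using the sequence itself as the group label keeps replicate measurements of the same sequence on one side of the split. The sketch below shows this on invented data; the sequences, the expression values, and the fixed `random_state` are assumptions added for reproducibility of the example.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Illustrative data: 'ACGT' and 'TTGA' were each measured twice.
sequences = np.array(['ACGT', 'ACGT', 'TTGA', 'TTGA', 'GGCC',
                      'CATG', 'AAAA', 'CCCC', 'GTGT', 'TATA'])
expression = np.arange(len(sequences), dtype=float)

# one split, grouping by sequence so replicates are never separated
gss = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=0)
train_idx, test_idx = next(gss.split(sequences, expression, groups=sequences))

train_groups = set(sequences[train_idx])
test_groups = set(sequences[test_idx])
print('sequences shared between train and test:', train_groups & test_groups)
```

Note that `train_size` refers to the fraction of groups, not of samples, so the sample ratio can deviate from 9:1 when some sequences have more replicates than others.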


### Normalization of variables

In [None]:
SeqTrain, Expr_Scaler = ExpressionScaler(SeqTrain, Name_Dict)
# removing non-informative positions where no base diversity exists, base one hot encoding
SeqTrain_Hadj, Positions_removed, PSEntropy = Sequence_Conserved_Adjusted(SeqTrain, Name_Dict)
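
The scaler is fitted on the training set only, so no information from the test set leaks into the preprocessing. A minimal sketch of this idea with `sklearn.preprocessing.StandardScaler` and invented expression values follows; it is assumed, not confirmed, that `ExpressionScaler` works along these lines.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative expression values, one column per measurement.
train_expr = np.array([[1.0], [2.0], [3.0], [4.0]])
test_expr = np.array([[2.5], [5.0]])

# fit mean and standard deviation on the training data only
scaler = StandardScaler().fit(train_expr)
train_scaled = scaler.transform(train_expr)

# the test data is transformed with the training parameters
test_scaled = scaler.transform(test_expr)

# predictions in scaled units can be mapped back to the original units
restored = scaler.inverse_transform(test_scaled)
print(train_scaled.ravel(), test_scaled.ravel())
```

The fitted scaler has to be stored together with the regressor, because new predictions are made in scaled units.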

## Random forest regression with grid search on shuffle split

You can either choose to start a new training of a random forest regressor or load an existing regressor. If you load an existing random forest regressor, the parameters of the standard scaler are loaded based on names in the config-file. For the estimation, the training set is repeatedly separated into a new training and test set at a 9:1 ratio (parameter 'test_ratio') using random shuffle splits (parameter 'split_number', set to 100 in the cell below). The training takes about 5 minutes on 16 CPU cores.

**User input:** <br>
 * Decision whether a new random-forest training is started or an existing regressor is loaded.
 
*Example:*<br>
 Start new random-forest training by setting:<br>
 RFR_File = 0<br>
 otherwise, insert the file address:<br>
 RFR_File = 'data-Example1-Pput\\20191106_Example1-Pput_RFR_ML-File.pkl'

In [None]:
from ExpressionExpert_Functions import my_SVR
# Number of independent promoter library measurements
Measure_Numb = int(Name_Dict['Library_Expression'])
RFR_Best = dict()
# ML Random Forest training for number of independent promoter library measurements
for Meas_Idx in range(Measure_Numb):
    print('Starting new regressor training')

    # starting the machine learning with random forest and grid search
    # This can take a while
    start_time = timeit.default_timer()
    test_ratio = .1
    split_number = 100
    Norm_Meas_Input = '{}_scaled'.format(Y_Col_Name[Meas_Idx])
    AddFeat = eval(Name_Dict['Add_Feat'])
    forest_regr = Est_Grad_Feat(SeqTrain_Hadj, test_ratio, split_number, Norm_Meas_Input, AddFeat)
#     forest_regr = my_SVR(SeqTrain_Hadj, Norm_Meas_Input)
    run_time = timeit.default_timer() - start_time
    print('grid search on {} measurements, run time: {:.0f} sec'.format(Y_Col_Name[Meas_Idx], run_time))

    # prediction of training and test data sets
    # getting the best estimator
#     RFR_Best = forest_regr
    RFR_Best = forest_regr.best_estimator_

    # saving the best estimator
    ML_ID = Name_Dict['RFR_ML_File']
    Regressor_File = os.path.join(Data_Folder, '{}_{}_{}_{}.pkl'.format(time.strftime('%Y%m%d'), File_Base, Y_Col_Name[Meas_Idx].replace(' ','-'), ML_ID))
    joblib.dump(RFR_Best, Regressor_File)
    # saving: 
    # 1. conserved positions not used as input for the regressor
    # 2. Mean and standard deviation of training set expression used for normalizing
    # 3. The standard scaler default name is the name of the expression measurement column with suffix: '_Scaler'
    Scaler_DictName = '{}_Scaler'.format(Y_Col_Name[Meas_Idx])
    Data_Prep_Params = {'Positions_removed': Positions_removed, Scaler_DictName: Expr_Scaler[Scaler_DictName]}
    Param_ID = Name_Dict['RFR_Params_File']
    Parameter_File = os.path.join(Data_Folder, '{}_{}_{}_{}.pkl'.format(time.strftime('%Y%m%d'), File_Base, Y_Col_Name[Meas_Idx].replace(' ','-'), Param_ID))
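    # the context manager closes the file handle after writing
    with open(Parameter_File, 'wb') as Parameter_Handle:
        pickle.dump(Data_Prep_Params, Parameter_Handle)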
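
The internals of `Est_Grad_Feat` are not shown in this notebook. As a rough sketch of what a grid search for a random forest regressor on shuffle splits could look like, the example below uses synthetic binary features as a stand-in for the flattened one-hot encoding; the parameter grid, the data, and the small number of splits are assumptions chosen to keep the run time short.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, ShuffleSplit

rng = np.random.default_rng(0)
# synthetic stand-in for flattened one-hot features (60 samples, 8 features)
X = rng.integers(0, 2, size=(60, 8)).astype(float)
# the target depends on features 0 and 3 plus a little noise
y = 2.0 * X[:, 0] - X[:, 3] + rng.normal(scale=0.1, size=60)

# repeated random 9:1 splits of the training data
cv = ShuffleSplit(n_splits=5, test_size=0.1, random_state=0)
param_grid = {'n_estimators': [10, 20], 'max_features': [2, 4]}

search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=cv)
search.fit(X, y)

# by default the best parameter set is refit on all data passed to fit()
best = search.best_estimator_
print(search.best_params_)
print(best.feature_importances_.round(2))
```

The `best_estimator_` attribute is what the training cell above stores with `joblib.dump`.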


### Code tests

# SeqTrain_Hadj['Sequence'].values.tolist()
import joblib
import numpy as np

myRFR = joblib.load('data-XHost_Replicates_Seq+Spacer/20200413_Spacer_all_Ptai-Activity_RFR_ML-File.pkl')
Nr_Feat = len(myRFR.feature_importances_)
# importance rank (1 = most important) of the last feature, which is assumed to be the GC content
GC_Importance = Nr_Feat - np.arange(Nr_Feat)[np.argsort(myRFR.feature_importances_)==Nr_Feat-1]
print('Position of GC-content in List: {}'.format(GC_Importance))
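
The ranking expression above is compact, so a small worked example may help. The importances below are invented; the last entry plays the role of the additional feature whose rank is of interest.

```python
import numpy as np

# Illustrative importances; the last feature (index 4) is the one to rank.
importances = np.array([0.10, 0.40, 0.05, 0.25, 0.20])
n_feat = len(importances)

# feature indices ordered from least to most important
order = np.argsort(importances)

# position of the last feature within the ascending order
ascending_pos = np.flatnonzero(order == n_feat - 1)[0]

# convert to a rank counted from the top (1 = most important)
rank_from_top = n_feat - ascending_pos
print('rank of last feature: {}'.format(rank_from_top))
```

Here the last feature has the third-largest importance, so its rank from the top is 3.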

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

X = np.reshape(np.array(SeqTrain_Hadj['OneHot'].values.tolist()), (len(SeqTrain_Hadj),-1))
Y = SeqTrain_Hadj[Norm_Meas_Input].values.tolist()
Number_Estimators = np.arange(20,35,1)
Max_Features = np.arange(9,15,1)
param_grid = [{'bootstrap':[False], 'n_estimators': Number_Estimators, 'max_features': Max_Features}]

ML_sub = RandomForestRegressor()
gridML = GridSearchCV(ML_sub, param_grid, cv=5, n_jobs=-1)
gridML.fit(X, Y)

from sklearn import svm
from sklearn.model_selection import cross_val_score, ShuffleSplit, GroupShuffleSplit

Norm_Meas_Input = '{}_scaled'.format(Y_Col_Name[0])
# myML = RandomForestRegressor(**gridML.best_params_)
X = np.reshape(np.array(SeqTrain_Hadj['OneHot'].values.tolist()), (len(SeqTrain_Hadj),-1))
Y = SeqTrain_Hadj[Norm_Meas_Input].values.tolist()
groups = SeqTrain_Hadj['Sequence_letter-encrypted']
splits = 20
test_size = .1
cv = GroupShuffleSplit(n_splits=splits, test_size=test_size) #, random_state=42
scores = cross_val_score(gridML.best_estimator_, X, Y, cv=cv, groups=groups)
# cross_val_score uses the regressor's default score, which is R^2 (not accuracy)
print("R^2: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))