## OVERVIEW

My hypothesis is the following.

If we add these two variables to the model data:

m6A: number of reported m6A peaks
microRNA: number of microRNA sites (miR-291 for mouse, miR-17 for human)

we should see an increase in predictive performance for human and mouse, where the data represent homeostatic conditions. In the following analysis, I will train the same final model (GradientBoosting) and check whether the R2 score on the test data improves. Note that for Xenopus and fish these two variables are meaningless, because in the alpha-amanitin data the pathways are inactive.
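The comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the final model: it uses scikit-learn's `GradientBoostingRegressor` with default settings, only numeric columns, and synthetic stand-in data. The helper `compare_r2` and the `toy` data frame are hypothetical names introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score


def compare_r2(train, test, base_features, extra_features, target="decay_rate"):
    """Fit the same model with and without the extra features.

    Returns a dict with the test-set R2 of each variant.
    """
    scores = {}
    variants = [
        ("baseline", base_features),
        ("with_pathways", base_features + extra_features),
    ]
    for name, cols in variants:
        # Same model class and seed for both variants, so only the inputs differ
        model = GradientBoostingRegressor(random_state=0)
        model.fit(train[cols], train[target])
        scores[name] = r2_score(test[target], model.predict(test[cols]))
    return scores


# Synthetic stand-in data (assumption: NOT the real decay-rate data)
rng = np.random.default_rng(0)
n = 600
toy = pd.DataFrame({
    "utrlenlog": rng.normal(7, 1, n),
    "cdslenlog": rng.normal(7, 1, n),
    "m6A": rng.integers(0, 4, n),
    "microRNAsites": rng.integers(0, 3, n),
})
# By construction, the toy target depends on both pathway variables
toy["decay_rate"] = (
    0.3 * toy["utrlenlog"]
    - 0.5 * toy["m6A"]
    - 0.5 * toy["microRNAsites"]
    + rng.normal(0, 0.3, n)
)
toy_train, toy_test = toy.iloc[:450], toy.iloc[450:]

scores = compare_r2(
    toy_train, toy_test,
    base_features=["utrlenlog", "cdslenlog"],
    extra_features=["m6A", "microRNAsites"],
)
print(scores)
```

On the real data, the same comparison would be run separately for human and mouse, since the hypothesis only concerns those species.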


## In this notebook

1. Load the train and test data.
2. Add the two variables: m6A and microRNA sites.
3. Replace missing values for these variables with 0s.
4. Save train and test with the new variables.

In [1]:
import os
import numpy as np
import pandas as pd

In [2]:
## 1. LOAD DATA
path_to_data = "../19-04-30-PredictiveModelDecayAllSpecies/19-04-30-EDA/results_data"

train = pd.read_csv(os.path.join(path_to_data, "training_set.csv"))
test = pd.read_csv(os.path.join(path_to_data, "validation_set.csv"))

train.sample(10)

Unnamed: 0,specie,cell_type,datatype,gene_id,decay_rate,coding,utrlenlog,cdslenlog
65093,mouse,mES cells,slam-seq,ENSMUSG00000029131,-0.144498,ATGGTGGATTACTATGAAGTGCTGGGCGTGCAGAGACATGCCTCAC...,7.219642,7.002156
67208,mouse,mES cells,slam-seq,ENSMUSG00000059273,-1.363289,ATGCTGAGGGGAAGGGAAGATGGTGAGTTGGAAGAAGGTGAACTGG...,7.690286,8.234565
14212,human,k562,endogenous,ENSG00000162616,-0.914279,ATGGGGAAAGACTATTATTGCATTTTGGGAATTGAGAAAGGAGCTT...,7.448916,6.922644
34343,human,hela,endogenous,ENSG00000148090,0.459557,ATGGCGGCCGCGGTGGCGGCGGCACCTGGGGCCTTGGGATCCCTGC...,6.280396,6.928538
49358,fish,embryo mzt,aamanitin polya,ENSDARG00000019060,0.231966,ATGTCCGAAGCCACAGTGGCCGATACTCGCCGGTTAAACTCCAAAC...,8.211211,6.74876
27428,human,293t,endogenous,ENSG00000184164,0.642966,ATGCGCCTGCCGCGCCGGGCCGCGCTGGGGCTCCTGCCGCTTCTGC...,5.455321,7.098376
48894,fish,embryo mzt,aamanitin polya,ENSDARG00000015343,0.787867,ATGAGGAAGTTCAGCAGAGACCGAATAACAGCAGTCAAAACCCTCA...,5.828946,7.337588
56160,fish,embryo mzt,aamanitin polya,ENSDARG00000079374,0.983276,ATGCTCTTGTCTGTCTTTGGCGGACCTGTTGAAGAAGTGTTAGCAG...,7.378384,8.702676
61087,fish,embryo mzt,aamanitin ribo,ENSDARG00000071259,-0.715353,ATGCCTCAGAAATGTGAAACTATCAGTTACTGCAACCCCACCCCTT...,6.973543,7.687997
43734,human,RPE,endogenous,ENSG00000156140,-0.241959,ATGGTTCTCCTGTCACTTTGGTTGATAGCAGCCGCTCTGGTAGAGG...,7.68156,8.193953


In [3]:
# get m6a and microRNA sites

m6a_micro_data = pd.read_csv("../19-07-15-OptimalityStrongestPredictorOfMrnaStability/results_data/pathways_mRNAstability.csv")
m6a_micro_data = m6a_micro_data.loc[:,['gene_id','specie', 'm6A', 'microRNAsites']]
m6a_micro_data.head()

Unnamed: 0,gene_id,specie,m6A,microRNAsites
0,ENSG00000000419,human,0,0
1,ENSG00000000457,human,1,1
2,ENSG00000000460,human,0,0
3,ENSG00000001036,human,1,0
4,ENSG00000001084,human,0,0


In [4]:
# MERGE DATA AND REPLACE MISSING VALUES
train = pd.merge(train, m6a_micro_data, how='left', on=['gene_id', 'specie'])
test = pd.merge(test, m6a_micro_data, how='left', on=['gene_id', 'specie'])


train['m6A'] = train['m6A'].fillna(0)
train['microRNAsites'] = train['microRNAsites'].fillna(0)

test['m6A'] = test['m6A'].fillna(0)
test['microRNAsites'] = test['microRNAsites'].fillna(0)
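A left merge silently duplicates rows if the annotation table contains the same (`gene_id`, `specie`) pair more than once, and `fillna(0)` hides how many genes had no annotation at all. The sketch below shows one way to check both, using the `validate` and `indicator` options of `pd.merge`. The helper `merge_pathways` and the toy frames are hypothetical and only illustrate the idea; they are not part of the original analysis.

```python
import pandas as pd


def merge_pathways(df, pathways, keys=("gene_id", "specie"),
                   fill_cols=("m6A", "microRNAsites")):
    """Left-merge pathway counts onto df and fill missing counts with 0.

    Returns the merged frame and the number of rows without annotation.
    """
    keys, fill_cols = list(keys), list(fill_cols)
    merged = pd.merge(
        df, pathways, how="left", on=keys,
        validate="many_to_one",  # raises MergeError if keys repeat in `pathways`
        indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
    )
    n_unmatched = int((merged["_merge"] == "left_only").sum())
    merged = merged.drop(columns="_merge")
    merged[fill_cols] = merged[fill_cols].fillna(0)
    # A left merge against unique keys must keep the row count unchanged
    assert len(merged) == len(df)
    return merged, n_unmatched


# Toy example with made-up gene ids
toy_df = pd.DataFrame({
    "gene_id": ["g1", "g2", "g1"],
    "specie": ["human", "human", "mouse"],
    "decay_rate": [0.1, -0.2, 0.3],
})
toy_pathways = pd.DataFrame({
    "gene_id": ["g1"],
    "specie": ["human"],
    "m6A": [2],
    "microRNAsites": [1],
})

toy_merged, toy_unmatched = merge_pathways(toy_df, toy_pathways)
print(toy_merged)
print("rows without annotation:", toy_unmatched)
```

Reporting the unmatched count per species would also show how many of the zeros are real and how many stand in for missing annotation, which matters for fish and Xenopus.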

In [5]:
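# Save train and test with the new variables
os.makedirs("results_data", exist_ok=True)  # make sure the output directory exists

train.to_csv("results_data/training_set.csv", index=False)
test.to_csv("results_data/validation_set.csv", index=False)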