##### We prepare the dataset for the prognosis part of the AML analysis.
##### To do so, we need:
##### -ELN data: following Yoann's methodology, we changed the ELN classification coming from the Master dataset.
##### -Clinical data: taken from the Master dataset.
##### -Demographic data: taken from the Master dataset.
##### -Genetic data with ITD: ITD was added using rules we established from the distribution plots of NGS and clinical ITD and from the read counts. The rules depend on the provenance of the ITD (clinical, NGS, or both) and on the standard deviation of the read counts, to account for the high variability of read counts and its impact on ITD.
##### -Cytogenetic data: already modified using the rules established with Elli (compute the frequency of additions and deletions and keep those above 2%).
##### -Translocation data: modified using Elli's rules. We keep only translocations with a count greater than or equal to 2 and summarize the translocations that were not kept in a column called other translocs.
##### -Complex column to integrate into the cytogenetic data: we modified this column to correct earlier mistakes in the complex classification. The rule adopted is the presence of 3 or more aneuploidies (additions and deletions combined) and no translocation t(8;21) or t(15;17).
##### -Component data: the components we established with the HDP method.
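
##### The cytogenetic and translocation filters described above can be sketched on toy data. This is only an illustration under assumptions: the data frames `cyto` and `transloc`, their column names, and the choice to encode the pooled column as a 0/1 flag named `other_translocs` are hypothetical and are not taken from the actual preprocessing code.

```r
# Toy binary matrices for 100 hypothetical patients (1 = alteration present).
cyto <- data.frame(
  add_8 = rep(c(1, 0), times = c(10, 90)),  # present in 10% of patients
  del_5 = rep(c(1, 0), times = c(1, 99))    # present in 1% of patients
)
transloc <- data.frame(
  t_8_21 = rep(c(1, 0), times = c(5, 95)),        # 5 carriers
  t_1_3  = rep(c(0, 1, 0), times = c(50, 1, 49))  # 1 carrier
)

# Keep additions/deletions observed in more than 2% of patients.
keep_cyto <- colMeans(cyto) > 0.02
cyto_kept <- cyto[, keep_cyto, drop = FALSE]

# Keep translocations with at least 2 carriers.
keep_tr <- colSums(transloc) >= 2
transloc_kept <- transloc[, keep_tr, drop = FALSE]

# Assumption: the discarded translocations are pooled into a single 0/1 flag.
transloc_kept$other_translocs <-
  as.integer(rowSums(transloc[, !keep_tr, drop = FALSE]) > 0)

names(cyto_kept)
names(transloc_kept)
```

##### Using `drop = FALSE` keeps the result a data frame even when a single column survives the filter.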


In [1]:
library('hdp')
library('clusterCrit')
library('grid')
library('gridExtra')
library('ggplot2')
library('ggrepel')
library('RColorBrewer')
library('dplyr')
library('reshape2')
library('IRdisplay')
library('scales')
library('survival')
library('corrplot')
library('Hmisc')
source('../../../src/tools.R')     # custom tools function
source('../../../src/hdp_tools_yanis.R')
###       
source("../../../src/merge_df.R")
source("../../../src/my_toolbox.R")
source("../../../src/my_components.R")
source("../../../src/my_utils.R")
source("../../../src/ggstyles.R")
source("../../../src/my_hotspots.R")
###


theme_set(theme_minimal())

# set Jupyter notebook parameters
options(repr.plot.res        = 100, # set a medium-definition resolution for the jupyter notebooks plots (DPI)
        repr.matrix.max.rows = 200, # set the maximum number of rows displayed
        repr.matrix.max.cols = 200) # set the maximum number of columns displayed

Run citation('hdp') for citation instructions,
    and file.show(system.file('LICENSE', package='hdp')) for license details.

Attaching package: ‘dplyr’

The following object is masked from ‘package:gridExtra’:

    combine

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Attaching package: ‘survival’

The following object is masked from ‘package:clusterCrit’:

    concordance

corrplot 0.84 loaded
Loading required package: lattice
Loading required package: Formula

Attaching package: ‘Hmisc’

The following objects are masked from ‘package:dplyr’:

    src, summarize

The following objects are masked from ‘package:base’:

    format.pval, units


Attaching package: ‘DescTools’

The following objects are masked from ‘package:Hmisc’:

    %nin%, Label, Mean, Quantile

Loading required package: ggpubr
Loading required package: magrittr


##### The dataframe already prepared (df_all_components) contains the genetic data with ITD, the translocations, the cytogenetic data and the components.
##### We also have demographic and clinical data from the Master dataset.
##### We still need to add:
##### -ELN, coming from another dataset that I set up for the new rule
##### -complex: we will add a new column called new_complex


In [2]:
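# prepared dataset: genetics with ITD, translocations, cytogenetics and components
df_all_components <- read.table("../../../data/updated_dataset/all_components.tsv", sep = '\t', header = TRUE)
# Master dataset: source of the clinical and demographic data
df_initial <- read.table("../../../data/initial_dataset/Master_04_10_2019.csv", sep = ',', header = TRUE)
rownames(df_initial) <- df_initial$data_pd   # index rows by the data_pd identifier
df_initial <- df_initial[,-1:-4]             # drop the first four columns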


In [3]:
df_eln <- read.table("../../../data/updated_dataset/modif_final_eln.csv",sep = ',' , header = T)
rownames(df_eln) <- df_eln$data_pd
df_eln <- df_eln[,-1:-2]
names(df_eln)[names(df_eln) == "eln_2017"] <- "new_eln"

In [4]:
df_all <- merge(df_eln['new_eln'],df_all_components,by=0)
rownames(df_all) <- df_all$Row.names
df_all <- df_all[,-1]
# now we have genetics with ITD, translocations, cytogenetics, components and ELN
# we move inv_3 (column 2) to the second-to-last position so that the columns are ordered: ELN, genetics, cytogenetics, translocations and components
df_all <- df_all[,c(1,seq(3,ncol(df_all)-1),2,ncol(df_all))]
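
##### Note that `merge(..., by = 0)` performs an inner join on row names, so patients missing from either table are dropped silently. A minimal sketch of how this could be checked, using toy data frames (the identifiers and the `NPM1` column are hypothetical):

```r
# Toy tables indexed by patient ID (hypothetical values).
left  <- data.frame(new_eln = c("adverse", "favorable", "intermediate"),
                    row.names = c("PD001", "PD002", "PD003"))
right <- data.frame(NPM1 = c(1, 0), complex = c(0, 1),
                    row.names = c("PD002", "PD003"))

# by = 0 joins on row names and returns them in a column called Row.names.
merged <- merge(left, right, by = 0)
rownames(merged) <- merged$Row.names
merged$Row.names <- NULL

# Patients present in `left` but lost by the inner join.
dropped <- setdiff(rownames(left), rownames(merged))
dropped
```

##### Comparing `nrow()` before and after the merge, or listing the dropped identifiers as here, makes any loss of patients explicit.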

##### Now we use df_initial for the clinical and demographic data and for the new complex rules.
##### Let's first apply the new rules for complex:

In [5]:
###
# additions
###
for (i in 1:22){
    col <- paste0("add_", i)
    # sum the p-arm, q-arm and whole-chromosome (plus) columns of chromosome i
    df_initial[,col] <- df_initial[,paste0("add_", i, "p")] + df_initial[,paste0("add_", i, "q")] + df_initial[,paste0("plus", i)]
    df_initial[,col][df_initial[,col]>=2] <- 1   # cap at 1 so that each chromosome counts once
}
df_initial$add_x <- df_initial$add_xp + df_initial$add_xq + df_initial$plusx   # this sum is not capped at 1
df_initial$add_y <- df_initial$plusy
###
# deletions
###
for (i in c(c(1,2,3),5:13,15:19)){
    col <- paste0("del_", i)
    df_initial[,col] <- df_initial[,paste0("del_", i, "p")] + df_initial[,paste0("del_", i, "q")] + df_initial[,paste0("minus", i)]
    df_initial[,col][df_initial[,col]>=2] <- 1   # cap at 1 so that each chromosome counts once
}
# chromosomes for which only the q-arm and minus columns are used; these sums are not capped at 1
for (i in c(c(4,14,20,21,22),"x")){
    df_initial[,paste0("del_", i)] <- df_initial[,paste0("del_", i, "q")] + df_initial[,paste0("minus", i)]
}
df_initial$del_y <- df_initial$minusy
# number of aneuploidies per patient; 519:566 is expected to span the 48 add_*/del_* columns created above
df_initial$sum <- rowSums(df_initial[,519:566],na.rm=TRUE)
# complex: 3 or more aneuploidies and neither t(8;21) nor t(15;17)
df_initial$new_complex <- ifelse((df_initial$sum>=3) & (df_initial$t_8_21==0) & (df_initial$t_15_17==0), 1,0)
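
##### The complex rule itself can be illustrated on a toy table. This is a sketch under assumptions, not the notebook's code: the data are invented, and the aneuploidy columns are selected by name with a regular expression instead of by position. On the real Master dataset this pattern would also match the arm-level columns (for example add_1p), so the selection would have to be adapted.

```r
# Toy per-chromosome flags for four hypothetical patients.
toy <- data.frame(
  add_8   = c(1, 1, 0, 1),
  del_5   = c(1, 1, 0, 1),
  del_7   = c(1, 1, 0, 0),
  t_8_21  = c(0, 1, 0, 0),
  t_15_17 = c(0, 0, 0, 0),
  row.names = c("PD001", "PD002", "PD003", "PD004")
)

# Select the aneuploidy columns by name (valid for this toy table only).
aneuploidy_cols <- grep("^(add|del)_", names(toy), value = TRUE)

# Count aneuploidies per patient.
toy$n_aneuploidies <- rowSums(toy[, aneuploidy_cols], na.rm = TRUE)

# Complex: 3 or more aneuploidies and neither t(8;21) nor t(15;17).
toy$new_complex <- as.integer(
  toy$n_aneuploidies >= 3 & toy$t_8_21 == 0 & toy$t_15_17 == 0
)

toy[, c("n_aneuploidies", "new_complex")]
```

##### Here PD001 is the only complex case: PD002 also has three aneuploidies but carries t(8;21), and PD004 has only two.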

##### Let's now add the clinical and demographic data and merge everything together:

In [6]:
clinical <-c("ahd","perf_status","bm_blasts","secondary","wbc","hb","plt")
demographical <- c("gender","age")
survival <- c("os","os_status","cr")

In [7]:
df_all <- merge(df_all,df_initial[,c(clinical,demographical,survival,"new_complex")],by=0)
rownames(df_all) <- df_all$Row.names
df_all <- df_all[,-1]

In [8]:
df_all$complex <- NULL   # drop the old complex column, superseded by new_complex
dim(df_all)

In [9]:
# move the last column (new_complex) to position 153
df_all <- df_all[,c(1:152,ncol(df_all),seq(153,ncol(df_all)-1))]


In [20]:
row.has.na <- apply(df_all, 1, function(x){any(is.na(x))})
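
##### A minimal sketch of how the missing values could be summarized, on a toy data frame with invented values. `complete.cases()` gives the same row-level information as the `apply()` call above without converting the data frame to a matrix, and `colSums(is.na(...))` shows which columns are responsible for the missing rows.

```r
# Toy table with a few missing values (hypothetical data).
toy <- data.frame(age = c(60, NA, 45),
                  wbc = c(10.5, 3.2, NA),
                  os  = c(12, 30, 8))

# TRUE for every row that contains at least one NA.
row_has_na <- !complete.cases(toy)

# Number of missing values per column.
na_per_column <- colSums(is.na(toy))

sum(row_has_na)       # rows that na.omit() would remove
na_per_column
nrow(na.omit(toy))    # rows remaining after removal
```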

In [25]:
table(df_all$os_status)


   0    1 
 732 1342 

In [24]:
dim(df_all)
dim(na.omit(df_all))