# Workflow for Ascertaining the Significance of the Model with Simulated Data

## Background

<p align = 'justify'> We start with the optimal ML model derived previously and the read-coverage data for each cell type. In this section, we restrict the data to chromosome 21 and perturb the feature that was found to be most influential for each machine learning model (<b>Logistic Regression</b> and <b>Random Forests</b>). </p>
<p align = 'justify'> For the Logistic Regression models, we pick the variable that has the lowest p-value and Variance Inflation Factor (VIF). For Random Forests, we pick the "most important variable" reported by the <b>varImp</b> function. The results are tabulated below.</p>

| S.No. | Algorithm | Optimum Model | Cell Line | Significant Variable
| --- | --- | --- | --- | ---
| 1. | Logistic Regression | a549modelSMOTE | A549 | H3K27me3
| 2. | Logistic Regression | h1escparetoModelSmote | H1ESC | H3K36me3
| 3. | Logistic Regression | helamodel1SMOTE | HELA | RNAPol2
| 4. | Logistic Regression | imr90paretoModelSmote | IMR90 | H3K9me3
| 5. | Logistic Regression | k562modelSMOTE | K562 | RNA.Seq
| 6. | Logistic Regression | mcf7paretoModelSmote | MCF7 | RNA.Seq
| 7. | Random Forests | a549rfSmote1 | A549 | RNAPol2
| 8. | Random Forests | h1escrf1 | H1ESC | H3K27me3
| 9. | Random Forests | helarf1 | HELA | RNAPol2
| 10. | Random Forests | imr90rf1 | IMR90 | H3K27me3
| 11. | Random Forests | k562rf1 | K562 | H3K4me1
| 12. | Random Forests | mcf7rf1 | MCF7 | RAD21
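
The selection rule for the linear models can be illustrated with a minimal base-R sketch. It is not the code used to produce the table above: the toy data, the column names and the effect sizes are all assumptions made for illustration, and the VIF is computed from its textbook definition (1 / (1 - R²) when regressing one predictor on the others) rather than through any particular package.

```r
# Toy data: the columns and effect sizes are hypothetical.
set.seed(1)
n <- 500
toy <- data.frame(
  CTCF    = rnorm(n),
  RAD21   = rnorm(n),
  RNAPol2 = rnorm(n)
)
# Make the class depend mostly on RNAPol2.
toy$Class <- factor(
  ifelse(runif(n) < plogis(2 * toy$RNAPol2), "Non-Hub", "Hub"),
  levels = c("Hub", "Non-Hub")
)

fit <- glm(Class ~ ., data = toy, family = binomial)

# p-value of each predictor (the intercept row is dropped).
coefs <- summary(fit)$coefficients
pvals <- coefs[-1, "Pr(>|z|)"]

# VIF of a predictor: 1 / (1 - R^2) from regressing it on the other predictors.
predictors <- setdiff(names(toy), "Class")
vifs <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = toy))$r.squared
  1 / (1 - r2)
})

# Rank by p-value first, then by VIF.
ranking <- data.frame(variable = predictors, p.value = pvals[predictors], VIF = vifs)
ranking <- ranking[order(ranking$p.value, ranking$VIF), ]
print(ranking)
```

The first row of `ranking` is the candidate "significant variable" under this rule; with real data, the ordering criterion may need to weigh p-value and VIF differently.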


We shall consider these models one at a time and assess their performance on the simulated data.

* [1. Logistic Regression | A549](#link1)
* [2. Logistic Regression | H1ESC](#link2)
* [3. Logistic Regression | HELA](#link3)
* [4. Logistic Regression | IMR90](#link4)
* [5. Logistic Regression | K562](#link5)
* [6. Logistic Regression | MCF7](#link6)
* [7. Random Forests | A549](#link7)
* [8. Random Forests | H1ESC](#link8)
* [9. Random Forests | HELA](#link9)
* [10. Random Forests | IMR90](#link10)
* [11. Random Forests | K562](#link11)
* [12. Random Forests | MCF7](#link12)
* [13. Session Information](#link13)

### <a id=link1>1. Logistic Regression | A549 </a>

In [1]:
## Loading the simulated data

source("../../R/dataSimulation.R")
simData <- dataSimulation(dataFile = "../../data/A549forML.txt", chrName = "chr21", featureName = "H3K27me3")

The function <i>dataSimulation</i> replaces the read coverage in each bin of the chosen variable with a random number between the maximum read coverage of that variable and twice that maximum.
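
The source of `dataSimulation` is not shown here, so the following is only a sketch of the perturbation as described. The helper `simulateFeature` is hypothetical: it takes a data frame rather than a file path, and it assumes that the maximum is taken within the selected chromosome and that the draws are uniform.

```r
# Hypothetical helper; the real dataSimulation() may differ in its details.
simulateFeature <- function(data, chrName, featureName) {
  chrData <- data[data$chr == chrName, ]
  # Assumption: the maximum is computed within the selected chromosome.
  maxCov <- max(chrData[[featureName]])
  # Assumption: draws are uniform between the maximum and twice the maximum.
  chrData[[featureName]] <- runif(nrow(chrData), min = maxCov, max = 2 * maxCov)
  chrData
}

# Toy input with two chromosomes and one feature.
set.seed(42)
toy <- data.frame(
  chr      = rep(c("chr20", "chr21"), each = 4),
  start    = rep(seq(1, 6001, by = 2000), 2),
  end      = rep(seq(2000, 8000, by = 2000), 2),
  H3K27me3 = c(0, 1.5, 0.2, 3, 0, 2, 4, 0.5),
  Class    = "Non-Hub"
)

simToy <- simulateFeature(toy, "chr21", "H3K27me3")
print(simToy)
```

In this toy input the largest chr21 value is 4, so every simulated value falls between 4 and 8, while the other columns, including `Class`, are left untouched.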

In [2]:
head(simData)

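| | chr | start | end | CTCF | EP300 | H3K27me3 | H3K36me3 | H3K4me1 | H3K4me2 | H3K4me3 | H3K9ac | H3K9me3 | RAD21 | RNAPol2 | YY1 | Class
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
| | `<chr>` | `<int>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>`
| 797085 | chr21 | 1 | 2000 | 0 | 0 | 13.903848 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Non-Hub
| 797086 | chr21 | 2001 | 4000 | 0 | 0 | 14.528959 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Non-Hub
| 797087 | chr21 | 4001 | 6000 | 0 | 0 | 8.766082 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Non-Hub
| 797088 | chr21 | 6001 | 8000 | 0 | 0 | 10.564129 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Non-Hub
| 797089 | chr21 | 8001 | 10000 | 0 | 0 | 12.151623 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Non-Hub
| 797090 | chr21 | 10001 | 12000 | 0 | 0 | 8.055926 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Non-Hub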


In [3]:
table(simData$Class)


    Hub Non-Hub 
      4   24061 

With this data, we shall execute the model and examine the results.

In [4]:
## Loading model

load("../../results/optimalModels/a549modelLR")
predictions <- predict(a549modelSMOTE, simData[, -c(1:3, 16)], type = "response")

In [5]:
head(predictions)

In [6]:
# Convert the predicted probabilities into class labels. Recall that 0 denotes Hub and 1 denotes Non-Hub.

labels <- ifelse(predictions < 0.5, 0, 1)

In [7]:
table(labels)

labels
    0     1 
24022    43 
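
The 0/1 convention follows from how `glm` codes a two-level factor response: the first level is treated as 0, and `type = "response"` returns the probability of the second level. The sketch below illustrates this on toy data. It assumes that `Class` was a factor with the default alphabetical levels when the model was fitted, which is not shown in this section; the predictor `x` and its effect size are made up.

```r
set.seed(7)
x <- rnorm(200)
# Larger x makes "Non-Hub" more likely.
Class <- factor(ifelse(runif(200) < plogis(3 * x), "Non-Hub", "Hub"))
print(levels(Class))  # "Hub" sorts first, so it is the level coded as 0

fit <- glm(Class ~ x, family = binomial)

# type = "response" gives P(second level), here P(Non-Hub).
p <- predict(fit, data.frame(x = c(-3, 3)), type = "response")
labels <- ifelse(p < 0.5, 0, 1)
print(labels)
```

A strongly negative `x` yields a probability below 0.5 and hence label 0 (Hub), while a strongly positive `x` yields label 1 (Non-Hub).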

In [8]:
## Alternatively, as named classes

classes <- ifelse(labels == 1, "Non-Hub", "Hub")
table(classes)

classes
    Hub Non-Hub 
  24022      43 
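
To compare predicted classes against the labels stored in the data, a cross-tabulation is often more informative than two separate counts. The sketch below uses short toy vectors in place of `classes` and `simData$Class`; the values are invented for illustration.

```r
# Toy vectors standing in for simData$Class and the predicted classes.
actual    <- c("Non-Hub", "Non-Hub", "Hub", "Non-Hub", "Hub", "Non-Hub")
predicted <- c("Hub",     "Hub",     "Hub", "Non-Hub", "Hub", "Hub")

# Fix the level order so both margins list the classes identically.
lv <- c("Hub", "Non-Hub")
confusion <- table(
  Predicted = factor(predicted, levels = lv),
  Actual    = factor(actual, levels = lv)
)
print(confusion)

# Share of bins where the prediction and the stored label agree.
agreement <- sum(diag(confusion)) / sum(confusion)
print(agreement)
```

With the notebook's objects, the equivalent call would be along the lines of `table(Predicted = classes, Actual = simData$Class)`.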

This is in complete contrast to the class labels in the simulated data, where only 4 of the 24065 bins are hubs. Let us see how the model behaves with the original data.

In [9]:
## Loading the original, full data

a549gregStandard <- read.table("../../data/A549forML.txt", header = TRUE)
head(a549gregStandard)

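| | chr | start | end | CTCF | EP300 | H3K27me3 | H3K36me3 | H3K4me1 | H3K4me2 | H3K4me3 | H3K9ac | H3K9me3 | RAD21 | RNAPol2 | YY1 | Class
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
| | `<chr>` | `<int>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>`
| 1 | chr1 | 1 | 2000 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Non-Hub
| 2 | chr1 | 2001 | 4000 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Non-Hub
| 3 | chr1 | 4001 | 6000 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Non-Hub
| 4 | chr1 | 6001 | 8000 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Non-Hub
| 5 | chr1 | 8001 | 10000 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Non-Hub
| 6 | chr1 | 10001 | 12000 | 0 | 0 | 0.02727825 | 0.02613314 | 1.58534 | 0 | 0.03331304 | 1.537907 | 0.02891425 | 0.1631014 | 0.06050439 | 0.05796767 | Non-Hub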


In [10]:
table(a549gregStandard$Class)


    Hub Non-Hub 
   1948 1546344 

In [11]:
## Subsetting the data to chromosome 21 only

a549gregStandardChr21 <- a549gregStandard[a549gregStandard$chr == "chr21", ]
head(a549gregStandardChr21)

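| | chr | start | end | CTCF | EP300 | H3K27me3 | H3K36me3 | H3K4me1 | H3K4me2 | H3K4me3 | H3K9ac | H3K9me3 | RAD21 | RNAPol2 | YY1 | Class
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
| | `<chr>` | `<int>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>`
| 797085 | chr21 | 1 | 2000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Non-Hub
| 797086 | chr21 | 2001 | 4000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Non-Hub
| 797087 | chr21 | 4001 | 6000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Non-Hub
| 797088 | chr21 | 6001 | 8000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Non-Hub
| 797089 | chr21 | 8001 | 10000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Non-Hub
| 797090 | chr21 | 10001 | 12000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Non-Hub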


In [12]:
# Predict on the original chromosome 21 data with the same 0.5 threshold (0 = Hub, 1 = Non-Hub)
predictions1 <- predict(a549modelSMOTE, a549gregStandardChr21[, -c(1:3, 16)], type = "response")
labels1 <- ifelse(predictions1 < 0.5, 0, 1)
table(labels1)

labels1
    0     1 
 3028 21037 

With the original chromosome 21 data, the model predicts far more non-hubs (21037) than hubs (3028), the reverse of what it predicts on the simulated data.
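
One way to quantify how strongly the perturbed feature drives the model is to count the bins whose predicted label changes between the original and the simulated data. The sketch below uses invented 0/1 vectors in place of `labels1` and `labels`, and assumes that both vectors cover the same bins in the same order.

```r
# Toy 0/1 labels standing in for labels1 (original) and labels (simulated).
original  <- c(1, 1, 0, 1, 1, 0, 1, 1)
simulated <- c(0, 0, 0, 0, 1, 0, 0, 0)

# Rows: prediction on the original data; columns: prediction on the simulated data.
shift <- table(Original = original, Simulated = simulated)
print(shift)

# Fraction of bins whose predicted label changed after perturbing the feature.
flipped <- mean(original != simulated)
print(flipped)
```

The off-diagonal cells of `shift` show the direction of the change: here, most bins move from label 1 (Non-Hub) to label 0 (Hub).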

### <a id=link13>13. Session Information </a>

In [14]:
sessionInfo()

R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] C/UTF-8/C/C/C/C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] digest_0.6.25   crayon_1.3.4    IRdisplay_0.7.0 repr_1.1.0     
 [5] lifecycle_0.2.0 jsonlite_1.7.1  evaluate_0.14   pillar_1.4.6   
 [9] rlang_0.4.7     uuid_0.1-4      vctrs_0.3.4     ellipsis_0.3.1 
[13] IRkernel_1.1.1  tools_4.0.2     compiler_4.0.2  base64enc_0.1-3
[17] htmltools_0.5.0 pbdZMQ_0.3-3   