# Preparing input data for MASH analysis
This includes the input max Z scores from univariate analysis and the training data, both before and after LD pruning.

This is in part a response to the reviewer's request that we use an independent subset of SNPs to fit and analyze the mash model, and compare the results with those that ignore LD (as done in Urbut 2017).

## Obtain data

The required data has [previously been extracted](https://github.com/stephenslab/gtex-eqtls/blob/v6-archive/archive/src/python/analysis_admin.py#L318) from GTEx V6 and is stored on midway:

```
/project/mstephens/data/internal_supp/gtex-v6-sumstat-hdf5/MatrixEQTLSumStats.Portable.h5
```
It's a 58MB file containing only `max` and `null` summary statistics. For convenience I copy it to my local computer and process from there (see cell below).

In [1]:
[global]
cwd = '~/Documents/GTEx/mash_revision'
input_db = "${cwd!a}/MatrixEQTLSumStats.Portable.h5"

In [2]:
%sosrun get_input
[get_input: provides = input_db]
output: input_db
task:
run:
    rsync -auzP mw:/project/mstephens/data/internal_supp/gtex-v6-sumstat-hdf5/MatrixEQTLSumStats.Portable.h5 ${input_db}




## Extract training and testing data
The `extract_zscore` step below loads the data, computes z-scores, extracts a training set (and its correlation estimate $\hat{V}$), and saves the results to an RDS file for use with `mash` analysis. The parameter `snp_list` specifies the path to a file of row names to keep while extracting: by default all SNPs are used, but if a list is provided (e.g. the list after LD pruning), only results for those SNPs are meant to be extracted. Note that the subsetting itself is still a `FIXME` placeholder in the code.
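
As a quick sanity check of the p-value to z-score conversion used in this step, the sketch below runs a round trip on synthetic values: it starts from known z-scores, derives two-sided p-values, and converts them back using the sign of the effect estimate. The function name `p2z` and all inputs are made up for illustration; this is not the workflow's own code.

```r
# Two-sided p-value to signed z-score: |z| = |qnorm(p / 2)|, sign taken from beta.
# `p2z` is a hypothetical name used only for this illustration.
p2z <- function(pval, beta) {
  z <- abs(qnorm(pval / 2))
  ifelse(beta < 0, -z, z)
}

# Synthetic check: start from known z-scores, derive p-values, convert back.
z_true <- c(-2.5, -0.3, 0.8, 3.1)
pval   <- 2 * pnorm(-abs(z_true))   # two-sided p-values
beta   <- 0.1 * z_true              # any effect estimate with the same sign
z_back <- p2z(pval, beta)

# Compare recovered and original z-scores up to numerical tolerance.
all.equal(z_back, z_true)
```

One caveat worth keeping in mind: a p-value that has underflowed to exactly 0 maps to an infinite z-score under this conversion, so such rows would need separate handling.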

In [3]:
%sosrun extract_zscore
[extract_zscore]
parameter: snp_list = 'NULL'
parameter: num_train = 20000
depends: R_library("rhdf5")
input: input_db
output: "${input_db!n}.Z.rds"
task: workdir = cwd
R:
    # Convert two-sided p-values to z-scores, taking the sign from beta
    ConvertP2Z <- function(pval, beta) {
      z <- abs(qnorm(pval / 2))
      z[which(beta < 0)] <- -1 * z[which(beta < 0)]
      return(z)
    }

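    # Load one group of summary statistics ('max' or 'null'), add z-scores,
    # and orient each matrix as gene-SNP pairs (rows) by tissues (columns)
    GetSS <- function(gene, db) {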
      dat <- rhdf5::h5read(db, gene)
      dat$"z-score" <- ConvertP2Z(dat$"p-value", dat$"beta")
      for (name in c("beta", "t-stat", "p-value", "z-score")) {
        dat[[name]] <- t(dat[[name]])
        colnames(dat[[name]]) <- dat$colnames
        rownames(dat[[name]]) <- dat$rownames
      }
      dat$colnames <- dat$rownames <- NULL
      return(dat)
    }
    # load data
    mdat = GetSS('max', ${input!r})
    ndat = GetSS('null', ${input!r})
    # select rows to keep
    if (${snp_list!r} != "NULL") {
        # FIXME: will subset mdat and ndat here
        x = 1
    }
    num_train = ${num_train}
    if (num_train >= nrow(ndat$z)) {
        num_train = floor(nrow(ndat$z) / 2)
    }
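    # get vhat (SVS): correlation across tissues among "nullish" rows,
    # i.e. rows whose largest |z| across all tissues is below 2
    # (ndat$z resolves to the "z-score" element via partial matching)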
    max_absz = apply(abs(ndat$z),1, max)
    nullish = which(max_absz < 2)
    nz = ndat$z[nullish,]
    vhat = cor(nz)
    # save output
    saveRDS(list(train.z = ndat$z[1:num_train,],
                 validate.z = ndat$z[(num_train+1):nrow(ndat$z),],
                 test.z = mdat$z, vhat = vhat), ${output!r})
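
The row subsetting that `snp_list` is meant to trigger is left as a `FIXME` above. The sketch below shows one way it could fit together with the $\hat{V}$ estimate and the train/validate split, using a small synthetic z-score matrix. It assumes the list file holds one row name per line, which the notebook does not state. The matrix, its row and column names, and the temporary file are all made up for illustration.

```r
set.seed(1)

# Synthetic stand-in for the null z-score matrix: 200 gene-SNP pairs x 4 tissues.
# Row and column names are invented for this example.
z <- matrix(rnorm(200 * 4), nrow = 200,
            dimnames = list(paste0("gene", 1:200, "_snp"),
                            paste0("tissue", 1:4)))

# Hypothetical snp_list file: one row name per line (an assumption).
snp_file <- tempfile(fileext = ".txt")
writeLines(rownames(z)[seq(1, 200, by = 2)], snp_file)

# Keep only rows named in the list; drop = FALSE preserves the matrix shape.
keep  <- readLines(snp_file)
z_sub <- z[rownames(z) %in% keep, , drop = FALSE]

# Estimate vhat from "nullish" rows: largest |z| across tissues below 2.
nullish <- which(apply(abs(z_sub), 1, max) < 2)
vhat    <- cor(z_sub[nullish, , drop = FALSE])

# Split into training and validation rows, halving if the request is too large.
num_train <- 60
if (num_train >= nrow(z_sub)) num_train <- floor(nrow(z_sub) / 2)
train.z    <- z_sub[1:num_train, , drop = FALSE]
validate.z <- z_sub[(num_train + 1):nrow(z_sub), , drop = FALSE]

# Inspect the resulting shapes.
dim(z_sub); dim(vhat); dim(train.z); dim(validate.z)
```

In the actual step, the same subsetting would have to be applied consistently to every matrix in both `mdat` and `ndat` before `vhat` and the split are computed.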

In [8]:
R:
    dat = readRDS("${input_db!n}.Z.rds")
    str(dat)

List of 4
 $ train.z   : num [1:20000, 1:44] -0.184 0.161 -1.291 -1.628 0.778 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:20000] "ENSG00000000419.8_20_49782767_C_G_b37" "ENSG00000000419.8_20_49654572_A_G_b37" "ENSG00000000419.8_20_49392478_A_G_b37" "ENSG00000000457.9_1_169117725_TG_T_b37" ...
  .. ..$ : chr [1:44] "Adipose_Subcutaneous" "Adipose_Visceral_Omentum" "Adrenal_Gland" "Artery_Aorta" ...
 $ validate.z: num [1:28198, 1:44] 0.107 1.1672 0.2172 0.3499 -0.0481 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:28198] "ENSG00000151665.8_2_47601407_A_G_b37" "ENSG00000151687.10_2_190118691_T_G_b37" "ENSG00000151687.10_2_191462133_C_T_b37" "ENSG00000151687.10_2_190620957_G_A_b37" ...
  .. ..$ : chr [1:44] "Adipose_Subcutaneous" "Adipose_Visceral_Omentum" "Adrenal_Gland" "Artery_Aorta" ...
 $ test.z    : num [1:16069, 1:44] 0.14 0.955 0.32 1.844 -4.235 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:16069(1d)] "ENSG00000000419.8_20_49461813_G_C_b37" "ENS