## 05-MACAU 

In this notebook I use MACAU - Mixed model Association for Count data via data AUgmentation - to assess the influence of Olympia oyster size/weight on DNA methylation while controlling for relatedness. I use sequence data from bisulfite-treated DNA from 2 Olympia oyster populations - Dabob Bay (in Hood Canal) and Oyster Bay (in South Puget Sound). 

MACAU runs on Lenox. The software is available on [Zhou Lab website](http://www.xzlab.org/software.html). The R program seems to have some issues, so I'm using the binary version installed by Sam on Roadrunner. The [MACAU user manual](http://www.xzlab.org/software/macau/MACAUmanual.pdf) is the basis for this analysis. 

### Input files 

MACAU inputs for this type of analysis include: 

(1) Methylated read counts (raw counts, not percentages)
(2) Total read counts (raw counts)
(3) Relatedness matrix
(4) Predictor variable, or covariates 

First, I download all the files I'll need for MACAU. The files were generated as follows: 

Both read count files - (1) and (2) above - were generated by Laura Spencer in RStudio in the [04-raw-count-files.Rmd notebook](https://github.com/sr320/paper-oly-mbdbs-gen/blob/master/code/04-raw-count-files.Rmd), using the MethylKit object that Steven Roberts generated in MethylKit in the [01-methylkit.Rmd notebook](https://github.com/sr320/paper-oly-mbdbs-gen/blob/master/code/01-methylkit.Rmd). Note that in both files, counts from + and - strand were combined.
- [counts-total-destrand.txt](https://github.com/sr320/paper-oly-mbdbs-gen/blob/master/analyses/counts-total-destrand.txt)
- [counts-meth-destrand.txt](https://github.com/sr320/paper-oly-mbdbs-gen/blob/master/analyses/counts-meth-destrand.txt)

The relatedness matrix - (3) - was generated by Katherine Silliman from 2bRAD data, using SNPs and the program ANGSD. Check out her notebook entry, [SOS_angsd.ipynb](https://github.com/sr320/paper-oly-mbdbs-gen/blob/master/analyses/2bRAD/SOS_angsd.ipynb). 
- [HSmbdsamples_rab.txt](https://github.com/sr320/paper-oly-mbdbs-gen/blob/master/analyses/2bRAD/HSmbdsamples_rab.txt) - generated using only HC/SS samples  
- [mbdsamples_rab.txt](https://github.com/sr320/paper-oly-mbdbs-gen/blob/master/analyses/2bRAD/mbdsamples_rab.txt) - using all three populations - HC/SS/NF  

The predictor variable file - (4) - contains the variables we will use to assess differential DNA methylation: shell length and wet weight. 
- [size.macau.txt](https://github.com/sr320/paper-oly-mbdbs-gen/blob/master/data/size.macau.txt)

NOTE: all 4 files need to have samples in the same order. They are all ordered using the mbd seq. sample numbers, 1-18. The count data files were automatically ordered this way when I generated them. For the size covariate file, I re-ordered samples in the [04-raw-count-files.Rmd notebook](https://github.com/sr320/paper-oly-mbdbs-gen/blob/master/code/04-raw-count-files.Rmd). Katherine re-ordered her relatedness matrix in the [SOS_angsd.ipynb notebook](https://github.com/sr320/paper-oly-mbdbs-gen/blob/master/analyses/2bRAD/SOS_angsd.ipynb). 
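
Before launching a long run, it may help to confirm that the four inputs at least agree on the number of samples. The sketch below is a minimal check, not part of the original workflow. It assumes the count files are tab-delimited with a header row and a leading site ID column, and that the relatedness and predictor files have no header; the function names and the three-sample toy data are invented for illustration. It checks sample counts only, not sample order, which still has to be verified by eye.

```python
def table_shape(text, has_header=False, has_id_column=False):
    """Return (n_rows, n_cols) of tab-delimited text.

    The header row and the leading ID column are excluded when flagged.
    """
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    if has_header:
        rows = rows[1:]
    widths = {len(row) for row in rows}
    if len(widths) != 1:
        raise ValueError("empty or ragged table, row widths: %s" % sorted(widths))
    n_cols = widths.pop() - (1 if has_id_column else 0)
    return len(rows), n_cols


def sample_counts_agree(meth, total, kinship, predictors):
    """True when all four inputs describe the same number of samples."""
    _, n_meth = table_shape(meth, has_header=True, has_id_column=True)
    _, n_total = table_shape(total, has_header=True, has_id_column=True)
    k_rows, k_cols = table_shape(kinship)      # must be square: samples x samples
    n_pred, _ = table_shape(predictors)        # one row per sample
    return n_meth == n_total == k_rows == k_cols == n_pred


# Toy three-sample inputs (invented for illustration, not real data)
meth_txt = "siteID\tnumCs1\tnumCs2\tnumCs3\nsiteA\t1\t2\t3\n"
total_txt = "siteID\tcoverage1\tcoverage2\tcoverage3\nsiteA\t5\t6\t7\n"
kin_txt = "0\t0.1\t0.2\n0.1\t0\t0.3\n0.2\t0.3\t0\n"
pred_txt = "2.2\t17.41\n1.9\t20.43\n2.2\t25.33\n"

print(sample_counts_agree(meth_txt, total_txt, kin_txt, pred_txt))
```

With the real files, the same functions could be fed the result of `open(path).read()` for each input.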

Here are screen shots to confirm. 

_Hold for SSs of other dataframes from RStudio_

Katherine's relatedness matrix order 
<img src="attachment:image.png" width="600">

See if we can access MACAU 

In [1]:
! /home/shared/macau/macau -h


 MACAU version 1.00alpha, released on 06/05/2015
 implemented by Xiang Zhou

 type ./macau -h [num] for detailed helps
 options: 
 1: quick guide
 2: file I/O related
 3: fit binomial mixed models
 4: note



In [2]:
! /home/shared/macau/macau -h 1

 QUICK GUIDE
 To fit binomial mixed models: 
         ./macau -r [filename] -t [filename] -p [filename] -k [filename] -bmm -o [prefix]



In [3]:
pwd

u'/home/srlab/laura/paper-oly-mbdbs-gen/code'

In [4]:
# create a macau/ directory  

! mkdir ../analyses/macau/

In [7]:
# MACAU writes output files to current directory, so move there 
%cd ../analyses/macau/

/home/srlab/laura/paper-oly-mbdbs-gen/analyses/macau


In [8]:
pwd

u'/home/srlab/laura/paper-oly-mbdbs-gen/analyses/macau'

In [9]:
# Confirm which column to use as the predictor variable. 
# predictors.size.macau.txt has two columns: 
#   1 - wet weight (grams, in shell) <-- supplied as a covariate via cov.weight.macau.txt
#   2 - shell length (mm) <-- use this as predictor (-n 2)

! head ../../data/predictors.size.macau.txt

2.2	17.41
1.9	20.43
2.2	25.33
1.1	19.38
2.2	26.79
1.2	19.8
2.1	20.54
1.9	19.5
1.4	18.43
2.2	21.02


Use the following options: 
    
`-g`  specify the methylated read counts file name.  
`-t`  specify the total read counts file name.  
`-p`  specify the predictor variable file name.  
`-n 2` specify which predictor variable column to use in analysis. In our case column 2 (length in mm).  
`-c`  specify the covariates file name (here: an intercept column plus wet weight).  
`-k`  specify the kinship/relatedness matrix file name.  
`-bmm`  specifies binomial mixed model.  
`-o`  specify output file prefix (default “result”).  
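
For repeated runs, the same call can be assembled programmatically. The sketch below only builds the argument list from the options listed above plus the `-c` covariate option that the run command itself uses; `macau_command` and its parameter names are hypothetical helpers, not part of MACAU. Nothing is executed here; the list could be passed to `subprocess.run` on a machine where the binary exists.

```python
def macau_command(binary, meth, total, predictors, kinship, prefix,
                  predictor_col=None, covariates=None):
    """Build the MACAU argument list for a binomial mixed model run."""
    cmd = [binary, "-g", meth, "-t", total, "-p", predictors]
    if predictor_col is not None:
        cmd += ["-n", str(predictor_col)]   # which predictor column to test
    if covariates is not None:
        cmd += ["-c", covariates]           # optional covariate file
    cmd += ["-k", kinship, "-bmm", "-o", prefix]
    return cmd


cmd = macau_command(
    "/home/shared/macau/macau",
    meth="../counts-meth-destrand.txt",
    total="../counts-total-destrand.txt",
    predictors="../../data/predictors.size.macau.txt",
    kinship="../2bRAD/HSmbdsamples_rab.txt",
    prefix="20190812-macau",
    predictor_col=2,
    covariates="../../data/cov.weight.macau.txt",
)
print(" ".join(cmd))
```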

Run started at 1:15pm on 08/12/2019. Run ended at about 6:45pm the same day (run log timestamp 18:44:58; total computation time 331.402 min).

In [10]:
! /home/shared/macau/macau \
-g ../counts-meth-destrand.txt \
-t ../counts-total-destrand.txt \
-p ../../data/predictors.size.macau.txt \
-n 2 \
-c ../../data/cov.weight.macau.txt \
-k ../2bRAD/HSmbdsamples_rab.txt \
-bmm \
-o 20190812-macau

Reading Files ... 
## number of total individuals = 18
## number of analyzed individuals = 18
## number of covariates = 2
## number of total genes/sites = 256043


In [11]:
ls

[0m[01;34moutput[0m/


In [12]:
# Peek at the results file 
! head output/20190812-macau.assoc.txt

id	n	acpt_rate	beta	se_beta	pvalue	h	se_h	sigma2	se_sigma2	alpha0	se_alpha0	alpha1	se_alpha1
Contig038973	18	4.069e-01	-6.949e-02	5.548e-02	2.104e-01	7.713e-01	2.336e-01	2.793e+00	5.452e+00	1.936e+00	1.027e+00	4.199e-01	2.701e-01
Contig039226	18	4.844e-01	-1.022e-02	3.708e-02	7.829e-01	6.961e-01	2.519e-01	9.479e-01	1.905e+00	8.591e-01	6.584e-01	-2.180e-01	1.730e-01
Contig039234	18	3.831e-01	4.603e-02	3.661e-02	2.087e-01	7.543e-01	2.308e-01	3.532e+00	1.117e+01	-1.400e+00	7.399e-01	-1.432e-01	1.573e-01
Contig039252	18	4.388e-01	-4.211e-02	5.445e-02	4.393e-01	7.156e-01	2.512e-01	3.239e+00	8.253e+00	1.283e+00	1.019e+00	1.868e-01	2.675e-01
Contig041234	18	3.909e-01	7.483e-02	9.470e-02	4.294e-01	6.285e-01	2.689e-01	5.002e+00	8.564e+00	4.363e-01	1.708e+00	-4.537e-01	3.745e-01
Contig064124	18	3.505e-01	5.754e-02	9.419e-02	5.413e-01	4.358e-01	2.380e-01	4.554e+00	6.052e+00	-3.804e-01	1.737e+00	-4.357e-01	4.309e-01
Contig064179	18	2.235e-01	1.405e-01	7.007e-02	4.495e-02	8.094e-01	1.432e-01

In [13]:
# List column names - p-value is 6th column (use for indexing)
! head -n 1 output/20190812-macau.assoc.txt

id	n	acpt_rate	beta	se_beta	pvalue	h	se_h	sigma2	se_sigma2	alpha0	se_alpha0	alpha1	se_alpha1


In [14]:
# Count # hits with p-value <0.05
! awk '$6<0.05{pvalue++} END{print pvalue+0}' \
output/20190812-macau.assoc.txt

15616


In [15]:
# Count # hits with p-value <0.01
! awk '$6<0.01{pvalue++} END{print pvalue+0}' \
output/20190812-macau.assoc.txt

4226


In [16]:
# Count # hits with p-value <0.001
! awk '$6<0.001{pvalue++} END{print pvalue+0}' \
output/20190812-macau.assoc.txt

849
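
The three awk counts above can also be collected in a single pass. This is a minimal Python sketch, assuming the tab-delimited `.assoc.txt` layout shown above with a `pvalue` column in the header; `count_below` is a hypothetical helper. The two demo rows reuse the id and p-value of two rows from the results preview, with the other columns dropped for brevity.

```python
import csv
import io


def count_below(lines, thresholds=(0.05, 0.01, 0.001), column="pvalue"):
    """Count rows whose p-value is strictly below each threshold."""
    reader = csv.DictReader(lines, delimiter="\t")
    counts = {t: 0 for t in thresholds}
    for row in reader:
        try:
            p = float(row[column])
        except (KeyError, TypeError, ValueError):
            continue  # skip truncated rows or non-numeric values
        for t in thresholds:
            if p < t:
                counts[t] += 1
    return counts


# Two rows from the results preview, reduced to the id and pvalue columns
demo = io.StringIO(
    "id\tpvalue\n"
    "Contig038973\t2.104e-01\n"
    "Contig064179\t4.495e-02\n"
)
print(count_below(demo))
```

For the real output, `open("output/20190812-macau.assoc.txt")` could be passed in place of the `StringIO` object.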


In [18]:
# Review run log 
! cat output/20190812-macau.log.txt

##
## MACAU Version = 1.00alpha
##
## Command Line Input = /home/shared/macau/macau -g ../counts-meth-destrand.txt -t ../counts-total-destrand.txt -p ../../data/predictors.size.macau.txt -n 2 -c ../../data/cov.weight.macau.txt -k ../2bRAD/HSmbdsamples_rab.txt -bmm -o 20190812-macau 
##
## Date = Mon Aug 12 18:44:58 2019
##
## Summary Statistics:
## number of total individuals = 18
## number of analyzed individuals = 18
## number of total genes/sites = 256043
## number of analyzed genes/sites = 512086
## number of covariates = 2
##
## Computation time:
## total computation time = 331.402 min 
## computation time break down: 
##      time on eigen-decomposition = 7.16685 min 
##      time on calculating matrix-vector multiplication = 38.4124 min 
##      time on sampling Z = 21.0581 min 
##      time on sampling MW = 32.2596 min 
##      time on sampling Hyperparameters = 218.413 min 
##      time on sampling BAU = 30.8721 min 
##


## Re-do MACAU with filtered count matrices 

Katherine suggested re-running MACAU with count files that have already been filtered for coverage. Dropping low-coverage loci reduces the number of comparisons, which should increase our power to detect associations. I created two sets of filtered matrices: 

1) Contigs retained if 75% of samples (14 out of 18) had **10x coverage**
  - Total count data: [counts.tot.destrand.10x75.txt](https://github.com/sr320/paper-oly-mbdbs-gen/blob/master/analyses/counts.tot.destrand.10x75.txt?raw=true)
  - Methylated count data: [counts.meth.destrand.10x75.txt](https://github.com/sr320/paper-oly-mbdbs-gen/blob/master/analyses/counts.meth.destrand.10x75.txt?raw=true)
  
2) Contigs retained if 75% of samples (14 out of 18) had **5x coverage**
  - Total count data: [counts.tot.destrand.5x75.txt](https://github.com/sr320/paper-oly-mbdbs-gen/blob/master/analyses/counts.tot.destrand.5x75.txt?raw=true)
  - Methylated count data: [counts.meth.destrand.5x75.txt](https://github.com/sr320/paper-oly-mbdbs-gen/blob/master/analyses/counts.meth.destrand.5x75.txt?raw=true)
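
The retention rule can be sketched as follows. This is an illustration rather than the code that produced the filtered files: it assumes the filter is applied to total coverage, that "10x coverage" means at least 10 reads, and that 75% is rounded up (14 of 18 samples). `keep_site` is a hypothetical name. The two coverage rows are copied from the total-count file preview later in this notebook.

```python
import math


def keep_site(coverages, min_cov=10, min_frac=0.75):
    """True if at least min_frac of samples have coverage >= min_cov."""
    needed = math.ceil(min_frac * len(coverages))  # 14 when there are 18 samples
    return sum(c >= min_cov for c in coverages) >= needed


# Total coverage per sample, copied from the file preview in this notebook
contig0_38973 = [15, 16, 16, 13, 15, 17, 13, 11, 19, 6, 6, 15, 11, 8, 9, 5, 8, 17]
contig0_39226 = [21, 22, 22, 23, 23, 21, 20, 16, 31, 17, 5, 23, 17, 10, 20, 13, 19, 24]

for name, cov in [("Contig0_38973", contig0_38973), ("Contig0_39226", contig0_39226)]:
    print(name, "10x:", keep_site(cov, min_cov=10), "5x:", keep_site(cov, min_cov=5))
```

Under these assumptions Contig0_38973 passes only the 5x filter (12 of its 18 samples reach 10 reads), while Contig0_39226 passes both.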

In [16]:
# Confirm access to MACAU software 
! /home/shared/macau/macau -h 


 MACAU version 1.00alpha, released on 06/05/2015
 implemented by Xiang Zhou

 type ./macau -h [num] for detailed helps
 options: 
 1: quick guide
 2: file I/O related
 3: fit binomial mixed models
 4: note



In [15]:
# Move to the MACAU directory 
%cd ../analyses/macau/

[Errno 2] No such file or directory: '../analyses/macau/'
/home/srlab/GitHub/paper-oly-mbdbs-gen/analyses/macau


I will first run MACAU using the 10x coverage count files. Again, I will use shell length as the predictor variable. 

Use the following options: 
    
`-g`  specify the methylated read counts file name.  
`-t`  specify the total read counts file name.  
`-p`  specify the predictor variable file name.  
`-n 2` specify which predictor variable column to use in analysis. In our case column 2 (length in mm).  
`-c`  specify the covariates file name (here: an intercept column plus wet weight).  
`-k`  specify the kinship/relatedness matrix file name.  
`-bmm`  specifies binomial mixed model.  
`-o`  specify output file prefix (default “result”).  

Check all files to ensure access

In [12]:
! head ../counts.meth.destrand.10x75.txt -n 3

siteID	numCs1	numCs2	numCs3	numCs4	numCs5	numCs6	numCs7	numCs8	numCs9	numCs10	numCs11	numCs12	numCs13	numCs14	numCs15	numCs16	numCs17	numCs18
Contig0_39226	11	17	10	14	9	13	9	10	18	9	4	15	10	5	8	8	8	8
Contig0_39234	10	7	10	5	12	11	7	8	8	7	1	10	8	5	12	5	12	10


In [13]:
# Note: this previews the 5x75 total counts file; the run below uses the 10x75 version
! head ../counts.tot.destrand.5x75.txt -n 3

siteID	coverage1	coverage2	coverage3	coverage4	coverage5	coverage6	coverage7	coverage8	coverage9	coverage10	coverage11	coverage12	coverage13	coverage14	coverage15	coverage16	coverage17	coverage18
Contig0_38973	15	16	16	13	15	17	13	11	19	6	6	15	11	8	9	5	8	17
Contig0_39226	21	22	22	23	23	21	20	16	31	17	5	23	17	10	20	13	19	24


In [8]:
# As before, confirm which column to use as the predictor variable. 
# predictors.size.macau.txt has two columns: 
#   1 - wet weight (grams, in shell) <-- supplied as a covariate via cov.weight.macau.txt
#   2 - shell length (mm) <-- use this as predictor (-n 2)

! head ../../data/predictors.size.macau.txt 

2.2	17.41
1.9	20.43
2.2	25.33
1.1	19.38
2.2	26.79
1.2	19.8
2.1	20.54
1.9	19.5
1.4	18.43
2.2	21.02


In [9]:
! head ../../data/cov.weight.macau.txt

1	2.2
1	1.9
1	2.2
1	1.1
1	2.2
1	1.2
1	2.1
1	1.9
1	1.4
1	2.2


In [10]:
! head ../2bRAD/HSmbdsamples_rab.txt

0	0.00017	0.024933	0.010539	0.005465	0.46994	2e-06	0.09685	0.01807	0	0	2e-06	2e-06	0	0	0	5e-06	0
0.00017	0	0.132784	0.281732	0.17863	0.023179	0.303812	5.2e-05	0.045491	1e-06	0	3.6e-05	0	0	0	1e-06	0.004368	0
0.024933	0.132784	0	0.091539	0.105285	0.010917	0.202775	0.215806	6e-05	2.4e-05	7e-06	0.019305	4e-06	2e-06	4e-06	1.3e-05	0.032585	5e-06
0.010539	0.281732	0.091539	0	0.178819	0.028375	0.269211	0.002831	0.01419	1e-06	1e-06	7e-06	0	2e-06	0	1e-06	5e-06	1e-06
0.005465	0.17863	0.105285	0.178819	0	0	0.505431	2e-06	0.052675	0	0	3e-06	1e-06	0	0	0	0	0
0.46994	0.023179	0.010917	0.028375	0	0	1e-06	0.098603	0.017877	5e-06	0	0	1e-06	0	3e-06	0	0	0
2e-06	0.303812	0.202775	0.269211	0.505431	1e-06	0	3e-06	0.01963	0	1e-06	2e-06	1e-06	1e-06	0	0	0	0
0.09685	5.2e-05	0.215806	0.002831	2e-06	0.098603	3e-06	0	0.000859	2e-06	4e-06	2.4e-05	0.00169	0	0.059431	0	0.000272	1.6e-05
0.01807	0.045491	6e-05	0.01419	0.052675	0.017877	0.01963	0.000859	0	1e-06	0	9e-05	0	0	0	2e-06	2e-06	0
0	1e-06	2.4e-05	1e-06	0	

In [11]:
! /home/shared/macau/macau \
-g ../counts.meth.destrand.10x75.txt \
-t ../counts.tot.destrand.10x75.txt \
-p ../../data/predictors.size.macau.txt \
-n 2 \
-c ../../data/cov.weight.macau.txt \
-k ../2bRAD/HSmbdsamples_rab.txt \
-bmm \
-o 20200107-macau

Reading Files ... 
The total read count file should either contain one data row or match rows in the read count file.
## number of total individuals = 18
## number of analyzed individuals = 18
## number of covariates = 2
## number of total genes/sites = 86791
error! fail to read files. 
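
The error message says the total read count file must match the methylated read count file row for row. One way to look for the problem is to compare the site IDs of the two files in order. The sketch below is a hypothetical diagnostic, assuming both files have a header row and a leading `siteID` column as in the previews above; whether a row mismatch is the actual cause of this failure is not established here. The demo rows are the first site IDs from the two previews (10x75 methylated counts and 5x75 total counts), so they are expected to differ.

```python
def site_ids(lines):
    """Return the first column of each data row, skipping the header."""
    rows = iter(lines)
    next(rows, None)  # drop header line
    return [line.rstrip("\n").split("\t", 1)[0] for line in rows if line.strip()]


def first_mismatch(meth_lines, total_lines):
    """Return (row_index, meth_id, total_id) for the first differing row, or None."""
    meth_ids, total_ids = site_ids(meth_lines), site_ids(total_lines)
    for i, (m, t) in enumerate(zip(meth_ids, total_ids)):
        if m != t:
            return i, m, t
    if len(meth_ids) != len(total_ids):
        i = min(len(meth_ids), len(total_ids))  # first row present in only one file
        return (i,
                meth_ids[i] if i < len(meth_ids) else None,
                total_ids[i] if i < len(total_ids) else None)
    return None


# Site IDs taken from the two file previews above (counts omitted)
meth_demo = ["siteID\tnumCs1\n", "Contig0_39226\t11\n", "Contig0_39234\t10\n"]
total_demo = ["siteID\tcoverage1\n", "Contig0_38973\t15\n", "Contig0_39226\t21\n"]
print(first_mismatch(meth_demo, total_demo))
```

With the real inputs, open file handles for `../counts.meth.destrand.10x75.txt` and `../counts.tot.destrand.10x75.txt` could be passed directly.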
