-
Notifications
You must be signed in to change notification settings - Fork 1
/
200_fwsw_notebook.Rmd
11874 lines (9584 loc) · 519 KB
/
200_fwsw_notebook.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "FWSW lab Notebook"
author: "Sarah P. Flanagan"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: html_notebook
editor_options:
chunk_output_type: console
---
```{r setup}
# Evaluate all chunks relative to the results directory. root.dir is a
# package (opts_knit) option; fig.pos is a *chunk* option controlling
# LaTeX figure placement, so it must be set via opts_chunk$set — setting
# it through opts_knit$set (as before) had no effect.
knitr::opts_knit$set(root.dir='../fwsw_results/')
knitr::opts_chunk$set(fig.pos='H')
```
```{r source}
# Load the gwscaR helper scripts (population-genomics utilities) and the
# dadi analysis code for this project, then attach knitr for kable().
gwscar_scripts <- c("gwscaR.R", "gwscaR_plot.R", "gwscaR_utility.R",
                    "gwscaR_fsts.R", "gwscaR_popgen.R", "vcf2dadi.R")
invisible(lapply(file.path("../../gwscaR/R", gwscar_scripts), source))
source("../R/250_dadi_analysis.R")
library(knitr)
# Population codes used throughout: state prefix (AL/FL/LA/TX) plus
# habitat suffix (FW = freshwater, ST/CC/LG = saltwater sites).
pop.list <- c("ALFW", "ALST", "FLCC", "FLLG", "LAFW", "TXCC", "TXFW")
```
# 30 October 2020
A few runs are still going but I'm moving the ones that are done to move forward a bit. Here is what is remaining on abacus:
- ALFW_ALST_SC2mG 1-10 (9 is still running)
- LAFW_ALST_SC2mG 11-21 (15 is still running)
- LAFW_ALST_SC2NG 11-21 (17 is still running)
- FLLG_FLFW_IM2m 11-20 (20 is still running)
- FLLG_FLFW_IM2mG 11-20 (20 is still running)
- FLLG_FLFW_IM2NG 11-20 (16 is still running)
- TXFW_TXCC_IM2mG 1-20 (18 is still running)
- TXFW_TXCC_IM2NG 1-20 (12 is still running)
But for the TX ones I scp'ed everything anyway so that I'll have at least reps 1-10.
Now I've updated dadi doc. Intriguingly, the AL and LA pops are best modeled by SC (with 2mG or 2N2mG). TX and FL best are both IM2m.
# 3 September 2020
A number of models had finished running so I moved them to C001KR and re-analyzed the results. I've found that ALFW-ALST simple comparisons for both IM and SC have dAIC <10, so I'm going to run the SC complex models as well (#1-20).
IM is the best for the TX populations so I'll start those (#1-20).
# 27 August 2020
For LAFW_ALFW, FLLG_FLFW, and ALFW_ALST, it took 2-3 days to run 1 round of each of the more complex models, so I'll start rounds 2-11 now.
# 25 August 2020
The TX simple models were finished running but I realized that some of them had not ever completed (didn't have all the way through 40 reps) and some of them hadn't run. So I'm re-running some of each of those models on abacus. The one extra LA run that I started yesterday is complete, so after these runs from today are finished I'll have 20 reps of AM, IM, SC, and SI for each pairwise comparison. From there I'll choose which complex models to run for TX.
The complex runs from yesterday are still running but are almost done.
# 24 August 2020
The AL and LA comparisons are done running (just that SC, SI, AM, and IM models) for all 20 reps. Taking a look at the updated 250_fwsw_dadi doc, I first observed that one of the LA AM runs was apparently incomplete -- it's run #1, and I'm going to re-start that on abacus.
Other things to note:
- the FLFW-FLCC simple models are all quite similar to each other, but the best one is the IM model -- it has both the highest median log likelihood and the highest log likelihood recorded.
- IM is also the best of the four simple models for ALFW-ALST.
- SC is the best of the four simple models for LAFW-ALST.
- TX is still running but it looks like IM will be the winner there.
So now I need to run the more complex models. I've started one run of each of the complex models for the FL, AL, and LA combinations mainly to get a sense for how long this might take. I've got lots of time -- the deadline for resubmission has been pushed to 30 Nov (of course I would prefer to be done sooner than that).
# 13 August 2020
Rounds 1-10 are done for the simple models (SI, SC, AM, and IM) for all four comparisons, and I've copied them to C001KR from abacus. I deleted the FLLG_FLCC ones but not the others.
I started rounds 10-21 for the simple models for the four pairwise comparisons on abacus.
The runs on rccuser are still going (which was all models for all comparisons).
Re-knit the document and it's looking like there was clearly migration in all population pairs, but the SI model is pretty high up there for the FL comparisons (not the best though).
# 3 August 2020
I installed stacks v1.40 on C001KR and copied Stacks data to BigData/ so that I can re-filter data for dadi.
Then ran populations with the following command:
```
populations -b 2 -P ./stacks -M ./stacks/fwsw_sub_strat.txt -t 5 --min_maf 0 --vcf
```
I also downloaded all the results from abacus and rccuser, stopped the runs on rccuser, and archived all of the files (saved in tar.gz and deleted the un-zipped ones).
I downloaded the code from Rougemont for the folded models.
TODO: update code to make it a bit easier to run (maybe)
Once populations is done, re-filter for dadi.
# 30 June 2020
I've done a final update of the dadi runs, scp-ing all of the ones from abacus and rccuser so that I've got as complete a set as possible.
# 29 June 2020
I moved some sets of results from rccuser and abacus even though they aren't totally finished, mainly because I want to check in on the dadi results.
I also realized that maybe I should be using the very final round/replicate from each run, and that maybe that will solve some of my issues with the dadi plots...this ended up changing some of the results (FLFW and FLCC are now "SI" model!) and we'll see what the output looks like in the end.
UGH No, now it's FLFW_ALST and FLFW_TXCC with problems (rather than FLFW_TXFW and FLFW_TXCC). but the model running on C001KR is done! So I can install updates and restart, FINALLY.
Ok, so I'm an idiot, and I hadn't actually been using the median model for any other than FLFW_FLCC. So I'll do that and go back to not using the final replicate and see how that goes. - that solved the problem!
# 25 June 2020
FLCC_ALFW #13 finally done on abacus, so I moved it.
The permutations are also finished so I can look into that.
```{r}
# Load the saved permutation Fsts (one data frame per comparison) and
# keep only the rows significant after multiple-testing correction
# (adjusted P < 0.05).
perms <- readRDS("permuted_fsts_22062020.RDS")
sigPerm <- lapply(perms, function(d) d[which(d$adjP < 0.05), ])
```
I think one of the issues with the permutations is that there are many/several loci with Fst=0 with the correct labels but when the labels are permuted they end up with Fst >0 all the time, so those ones end up being outliers. I could re-vamp the permutations again to only go for loci with actual Fst > the distribution of permutations, but I'm not sure it's worth the effort.
Are any other loci in the dataset in that 6-bp region? Yes, >200.
# 24 June 2020
One of the papers I read says that LRRCC1 is tightly linked to CA2 in zebrafish, and CA2 is one of the putative freshwater genes
```{r}
# Check whether LRRCC1 and CA2 (a putative freshwater gene) are near each
# other on LG6, using the no-Florida SNP info and the genome annotation.
fw_SNPinfo<-readRDS("fw_SNPinfo_noFL.RDS")
# Outlier SNP IDs per analysis: top 1% XtX / salinity Bayes factors,
# permutation outliers shared by TX+AL+LA, pcadapt q < 0.01, and
# per-state (plus shared) Stacks Fst outliers at P < 0.05.
outliers<-list(xtx=fw_SNPinfo$ID[which(fw_SNPinfo$XtX_noFL >= quantile(fw_SNPinfo$XtX_noFL,0.99,na.rm = TRUE))],
salBF=fw_SNPinfo$ID[which(fw_SNPinfo$logSalBF_noFL>=
quantile(fw_SNPinfo$logSalBF_noFL,0.99,na.rm = TRUE))],
permutations=fw_SNPinfo$ID[
rowSums(fw_SNPinfo[,c("perm_TX","perm_AL","perm_LA")])==3],
pcadapt=fw_SNPinfo$ID[which(fw_SNPinfo$pcadaptQ<0.01)],
Alabama=fw_SNPinfo$ID[which(fw_SNPinfo$stacks_AL_P < 0.05)],
Louisiana=fw_SNPinfo$ID[which(fw_SNPinfo$stacks_LA_P < 0.05)],
Texas=fw_SNPinfo$ID[which(fw_SNPinfo$stacks_TX_P < 0.05)],
Florida=fw_SNPinfo$ID[which(fw_SNPinfo$stacks_FL_P < 0.05)],
sharedStacks=fw_SNPinfo$ID[which(fw_SNPinfo$stacks_AL_P < 0.05 &
fw_SNPinfo$stacks_LA_P < 0.05 &
fw_SNPinfo$stacks_TX_P < 0.05)])
# LG6 SNPs that are shared-Stacks or salinity-BF outliers.
lg6<-fw_SNPinfo[fw_SNPinfo$Chrom=="LG6",]
outdat<-data.frame(rbind(lg6[which(lg6$ID %in% outliers$sharedStacks),],
lg6[which(lg6$ID %in% outliers$salBF),]),
stringsAsFactors = FALSE)
# Load the genome annotation (gzipped or plain gff) and subset to LG6.
gff.name<-"ssc_2016_12_20_chromlevel.gff.gz"
if(length(grep("gz",gff.name))>0){
gff<-read.delim(gzfile(paste("../../scovelli_genome/",gff.name,sep="")),header=F)
} else{
gff<-read.delim(paste("../../scovelli_genome/",gff.name,sep=""),header=F)
}
colnames(gff)<-c("seqname","source","feature","start","end","score",
"strand","frame","attribute")
gff6<-gff[gff$seqname=="LG6",]
# Pull the gff records for the carbonic anhydrase (CA) gene IDs listed in
# the putative freshwater genes table, then restrict to LG6.
put_genes<-read.delim("putative_genes.txt",stringsAsFactors = FALSE)
cas<-unlist(strsplit(put_genes$Scovelli_geneID[put_genes$Gene=="CA"],","))
cas_gff<-gff[unlist(lapply(cas,grep,x=gff$attribute)),]
cas_gff6<-cas_gff[cas_gff$seqname == "LG6",]
```
Ok, they're not near each other here, even if they're on the same LG -- they're `r min(cas_gff6$start)-778986` (6448813) bp apart.
What about other genes nearby?
```{r}
# Find genes whose start falls within 100 kb of the focal SNP position
# (778986 bp on LG6), strip the gff attribute down to the gene ID, and
# look those IDs up in the genome-wide blast annotation.
focal_bp <- 778986
window <- 100000
in_window <- gff6$feature == "gene" &
  gff6$start > (focal_bp - window) &
  gff6$start < (focal_bp + window)
nearby_genes <- gsub("ID=(.*);Name.*", "\\1", gff6$attribute[in_window])
genome.blast <- read.csv("../../scovelli_genome/ssc_2016_12_20_cds_nr_blast_results.csv",
                         skip = 1, header = TRUE) # saved as a csv
genome.blast[genome.blast$sscv4_gene_ID %in% nearby_genes, ]
```
None of those are putative fw genes, so I'll just move on.
In the supplement there are inconsistencies around how many SNPs are outliers in the stacks analysis, so i need to fix that.
# 23 June 2020
The permutations are still running this morning but hopefully they'll be done by the afternoon so I can dig into it. They seem to have finished, whoot...but I screwed up and they didn't save properly. facepalm.
```{r}
# Reload the permutation Fsts after the re-run (the first attempt on
# 23 June did not save properly).
perms<-readRDS("permuted_fsts_22062020.RDS")
```
So I'll start it again and look at them tomorrow.
Right now I want to get to the bottom of the PCAdapt weirdness. Ok, so I've discovered that with min.maf=0.05, only 781 SNPs pass PCAdapt's pruning. With min.maf=0.01 only 2358 SNPs pass the pruning. With min.maf=0, 12094 pass. With an alpha level of 0.01, there are still 3330 outliers, and most of these are ones with really small minor allele frequencies (median=0.014). But many of these have negative z-scores -- perhaps I can focus my attention only on those with positive z-scores. No, that probably won't work. What I can do is restrict attention to those loci with statistics in the top 99% quantile and then only keep those with qvalues <= alpha (0.01)
This sort of worked (code below for posterity) but I decided to revert to the old way with lots being NA because otherwise the overall analysis seems a bit whack.
```{r}
# Kept for posterity (not used downstream): restrict pcadapt outliers to
# loci whose test statistic is in the top 1% before keeping q < alpha.
# NOTE(review): statOut holds row indices (from which()), while
# `outliers` elsewhere in this notebook holds SNP IDs -- confirm both
# are on the same scale before reusing this snippet.
statOut<-which(res$stat >= quantile(res$stat,0.99,na.rm=TRUE))
outliers<-statOut[statOut %in% outliers]
```
# 22 June 2020
A few things:
1. I see what Adam was saying about the permutation outliers being small Fst values -- it's on the Manhattan plot. Perhaps there needs to be a different way of identifying these as outliers.
2. PCAdapt doesn't share many of the same outliers because those loci are removed by the minor allele frequency cutoff. If I remove this cutoff, then thousands of loci have really small q-values, which also doesn't seem quite right.
3. I need to look into the set of outliers that shows up on LG6.
Working on 2:
I'll try a min.maf=0.01 and see if that helps include more outliers without creating chaos.
Working on 3:
I need to identify which outliers are coming up on LG6, how far apart they are, and whether they're annotated. Then I could visualize it somehow -- some packages exist, but they seem to be bioconductor related. This package might be useful: https://bioconductor.org/packages/release/bioc/vignettes/genomation/inst/doc/GenomationManual.html
```{r}
# Build per-analysis outlier SNP ID lists from the combined SNP info:
# top 1% XtX and salinity Bayes factors, permutation outliers shared by
# TX+AL+LA, pcadapt q < 0.01, and per-state / shared Stacks Fst outliers.
# NOTE(review): xtx and salBF use bare logical indexing, so NA statistics
# propagate NA entries into the ID vectors (unlike the which()-wrapped
# entries below) -- confirm that is intended.
fw_SNPinfo<-readRDS("fw_SNPinfo.RDS")
outliers<-list(xtx=fw_SNPinfo$ID[fw_SNPinfo$XtX >= quantile(fw_SNPinfo$XtX,0.99,na.rm = TRUE)],
salBF=fw_SNPinfo$ID[fw_SNPinfo$logSalBF>=
quantile(fw_SNPinfo$logSalBF,0.99,na.rm = TRUE)],
permutations=fw_SNPinfo$ID[
rowSums(fw_SNPinfo[,c("perm_TX","perm_AL","perm_LA")])==3],
pcadapt=fw_SNPinfo$ID[which(fw_SNPinfo$pcadaptQ<0.01)],
Alabama=fw_SNPinfo$ID[which(fw_SNPinfo$stacks_AL_P < 0.05)],
Louisiana=fw_SNPinfo$ID[which(fw_SNPinfo$stacks_LA_P < 0.05)],
Texas=fw_SNPinfo$ID[which(fw_SNPinfo$stacks_TX_P < 0.05)],
Florida=fw_SNPinfo$ID[which(fw_SNPinfo$stacks_FL_P < 0.05)],
sharedStacks=fw_SNPinfo$ID[which(fw_SNPinfo$stacks_AL_P < 0.05 &
fw_SNPinfo$stacks_LA_P < 0.05 &
fw_SNPinfo$stacks_TX_P < 0.05)])
```
There are `r nrow(fw_SNPinfo[fw_SNPinfo$Chrom=="LG6",])` SNPs on LG6 and `r nrow(fw_SNPinfo[fw_SNPinfo$ID %in% unlist(outliers) & fw_SNPinfo$Chrom=="LG6",])` are outliers in one or more analyses.
```{r}
# Plot average Fst along LG6 and overlay outliers from each analysis.
# Semi-transparent colors, one per analysis type.
cols<-c(perm=alpha('#e41a1c',0.75),sal=alpha('#377eb8',0.75),pc=alpha('#a65628',0.75),
stacks=alpha('#f781bf',0.75),xtx=alpha('#ff7f00',0.75))
lg6<-fw_SNPinfo[fw_SNPinfo$Chrom=="LG6",]
# Average Fst across the three non-Florida pairwise Stacks comparisons.
lg6$avgFst<-rowMeans(lg6[,c("stacks_AL","stacks_LA","stacks_TX")],na.rm = TRUE)
plot_dat<-fst.plot(lg6,scaffs.to.plot = "LG6",fst.name = "avgFst",
chrom.name = "Chrom",bp.name = "Pos",axis.size = 1,pch=19,pt.cex = 1.5)
# Black: SNPs mapped to an annotated gene (have an SSC gene ID).
points(plot_dat$plot.pos[grep("SSC",plot_dat$SSCID)],
plot_dat$avgFst[grep("SSC",plot_dat$SSCID)],
col="black",cex=2,pch=19,lwd=2)
# Red: SNPs located in UTRs.
points(plot_dat$plot.pos[grep("UTR",plot_dat$region)],
plot_dat$avgFst[grep("UTR",plot_dat$region)],
col="red",cex=2,pch=19,lwd=2)
# Stars: Stacks Fst outliers shared by AL, LA, and TX.
points(plot_dat$plot.pos[which(plot_dat$stacks_AL_P < 0.05 &
plot_dat$stacks_LA_P < 0.05 &
plot_dat$stacks_TX_P < 0.05 )],
plot_dat$avgFst[which(plot_dat$stacks_AL_P < 0.05 &
plot_dat$stacks_LA_P < 0.05 &
plot_dat$stacks_TX_P < 0.05)],
col=cols["stacks"],cex=2,pch=8,lwd=3)
# Triangles: salinity Bayes factor outliers (top 1% within this subset).
points(plot_dat$plot.pos[plot_dat$logSalBF>=quantile(plot_dat$logSalBF,0.99)],
plot_dat$avgFst[plot_dat$logSalBF>=quantile(plot_dat$logSalBF,0.99)],
col=cols["sal"],cex=2,pch=2,lwd=2)
```
Ok, that's a start for visualizing. But I wonder if I can show the genomic information below too
```{r}
# Load the S. scovelli genome annotation (gzipped or plain gff), name the
# standard nine gff columns, and subset to LG6. Uses grepl()/paste0()
# instead of the non-idiomatic length(grep(...))>0 / paste(sep="").
gff.name<-"ssc_2016_12_20_chromlevel.gff.gz"
gff.path <- paste0("../../scovelli_genome/", gff.name)
if (grepl("gz", gff.name, fixed = TRUE)) {
  gff <- read.delim(gzfile(gff.path), header = FALSE)
} else {
  gff <- read.delim(gff.path, header = FALSE)
}
colnames(gff)<-c("seqname","source","feature","start","end","score",
"strand","frame","attribute")
gff6<-gff[gff$seqname=="LG6",]
```
I should identify what the genes are in these regions.
```{r}
# Inspect gene descriptions for the shared-Stacks and salinity-BF
# outliers on LG6, then attach short gene-name abbreviations by hand.
lg6$description[which(lg6$stacks_AL_P < 0.05 &
lg6$stacks_LA_P < 0.05 &
lg6$stacks_TX_P < 0.05 )]
lg6$description[lg6$logSalBF>=quantile(lg6$logSalBF,0.99,na.rm = TRUE) &
!is.na(lg6$description)]
lg6$abbr<-NA
# NOTE(review): these abbreviations are assigned positionally from the
# printed descriptions above -- if the data or filters change, the
# pairing silently breaks; re-check against the descriptions.
lg6$abbr[which(lg6$stacks_AL_P < 0.05 &
lg6$stacks_LA_P < 0.05 &
lg6$stacks_TX_P < 0.05 )]<-c("WRNIP1",
"uncharacterised",NA,
"PTPRF6",
"LRRCC1",NA)
lg6$abbr[lg6$logSalBF>=quantile(lg6$logSalBF,0.99,na.rm = TRUE) &
!is.na(lg6$description)]<-c("RP1","WRNIP1",NA,"LRRCC1")
```
```{r}
# Draw a simple gene-model track for LG6: grey rectangles are annotated
# genes on a horizontal backbone, vertical lines mark outlier SNPs
# (solid = shared Stacks, dashed = salinity BF), with gene abbreviations
# as rotated labels above the track.
plot(x=c(min(gff6$start),max(gff6$end)),
y=c(0,1),type='n', axes=FALSE,
xlab="",ylab="")#,xlim=c(200,2000000)
abline(h=0.5)
rect(gff6$start[gff6$feature=="gene"],0.25,
gff6$end[gff6$feature=="gene"],0.75,col = "grey",border="grey")
stacksout<-lg6$Pos[which(lg6$stacks_AL_P < 0.05 &
lg6$stacks_LA_P < 0.05 &
lg6$stacks_TX_P < 0.05 )]
abline(v=stacksout,col=cols["stacks"],lwd=2)
abline(v=lg6$Pos[lg6$logSalBF>=quantile(lg6$logSalBF,0.99,na.rm = TRUE)],
col=cols["sal"],lty=2,lwd=2)
text(x =lg6$Pos[!is.na(lg6$abbr)],y=rep(1.1,nrow(lg6[!is.na(lg6$abbr),])),
lg6$abbr[!is.na(lg6$abbr)],xpd=TRUE,srt=90, pos=3)
```
```{r}
# Same gene-model track as above, but zoomed to the first ~2 Mb of LG6
# (xlim) where the outlier cluster sits.
plot(x=c(min(gff6$start),max(gff6$end)),
y=c(0,1),type='n', axes=FALSE,
xlab="",ylab="",xlim=c(200,2000000))
abline(h=0.5)
rect(gff6$start[gff6$feature=="gene"],0.25,
gff6$end[gff6$feature=="gene"],0.75,col = "grey",border="grey")
stacksout<-lg6$Pos[which(lg6$stacks_AL_P < 0.05 &
lg6$stacks_LA_P < 0.05 &
lg6$stacks_TX_P < 0.05 )]
abline(v=stacksout,col=cols["stacks"],lwd=2)
abline(v=lg6$Pos[lg6$logSalBF>=quantile(lg6$logSalBF,0.99,na.rm = TRUE)],
col=cols["sal"],lty=2,lwd=2)
text(x =lg6$Pos[!is.na(lg6$abbr)],y=rep(1.1,nrow(lg6[!is.na(lg6$abbr),])),
lg6$abbr[!is.na(lg6$abbr)],xpd=TRUE,srt=90, pos=3)
```
Are the ones in gene regions in UTRs or in coding regions?
```{r}
# Check whether the outlier SNPs fall in UTRs or coding regions, for the
# shared-Stacks outliers and the annotated salinity-BF outliers.
lg6$region[which(lg6$stacks_AL_P < 0.05 &
lg6$stacks_LA_P < 0.05 &
lg6$stacks_TX_P < 0.05 )]
lg6$region[lg6$logSalBF>=quantile(lg6$logSalBF,0.99,na.rm = TRUE) &
!is.na(lg6$description)]
```
They are all in coding regions. Weirdly, one of the SNPs is apparently in a gene if we look at the gff but it's got NAs instead. I'll fix that and then re-plot the above and create a table.
```{r}
# Combine the shared-Stacks and salinity-BF outliers on LG6 into one
# de-duplicated table, patch the SNP with missing annotation by hand,
# and redraw the zoomed gene track with the corrected labels.
outdat<-data.frame(rbind(lg6[which(lg6$stacks_AL_P < 0.05&
lg6$stacks_LA_P < 0.05 &
lg6$stacks_TX_P < 0.05 ),],
lg6[lg6$logSalBF>=quantile(lg6$logSalBF,0.99,na.rm = TRUE) &
!is.na(lg6$description),]),stringsAsFactors = FALSE)
outdat<-unique(outdat)
genome.blast<-read.csv("../../scovelli_genome/ssc_2016_12_20_cds_nr_blast_results.csv",
skip=1,header=T)#I saved it as a csv
# Manual fix: SNP 9871 sits in gene SSCG00000016415 (GPC5) per the gff,
# but its annotation columns were NA; fill them from the blast table.
outdat$SSCID[outdat$ID==9871]<-"SSCG00000016415"
outdat$description[outdat$ID==9871]<-genome.blast$blastp_hit_description[genome.blast$sscv4_gene_ID=="SSCG00000016415"]
outdat$abbr[outdat$ID==9871]<-"GPC5"
# replot
plot(x=c(min(gff6$start),max(gff6$end)),
y=c(0,1),type='n', axes=FALSE,
xlab="",ylab="",xlim=c(200,2000000))
abline(h=0.5)
rect(gff6$start[gff6$feature=="gene"],0.25,
gff6$end[gff6$feature=="gene"],0.75,col = "grey",border="grey")
stacksout<-lg6$Pos[which(lg6$stacks_AL_P < 0.05 &
lg6$stacks_LA_P < 0.05 &
lg6$stacks_TX_P < 0.05 )]
abline(v=stacksout,col=cols["stacks"],lwd=2)
abline(v=lg6$Pos[lg6$logSalBF>=quantile(lg6$logSalBF,0.99,na.rm = TRUE)],
col=cols["sal"],lty=2,lwd=2)
text(x =outdat$Pos[!is.na(outdat$abbr)],y=rep(1.1,nrow(outdat[!is.na(outdat$abbr),])),
outdat$abbr[!is.na(outdat$abbr)],xpd=TRUE,srt=90, pos=3)
```
I realize I should use the outlier thing instead of calling quartile every time for salinity loci because the quartile from LG6 will be different from overall...fixing that in the supplement.
I'm a bit concerned that the annotation wasn't there for the one locus, so let me try the function again but just for LG6 outliers. ...but wait, maybe I was just dumb and confused? because looking at it now I don't think that it is in a gene.
```{r}
# Re-annotate just the LG6 outliers with the gwscaR annotate_snps()
# helper, to double-check the suspect locus against the gff + blast hits.
lg6_annotate<-annotate_snps(outdat,gff,genome.blast,ID="ID",
chrom="Chrom",bp="BP",pos = "Pos")
```
Also, what if this isn't the best way to do it? I could look for the element that has the closest start/end points and see if it falls between them. But I don't actually think this is necessary.
Working on #1, the permutation analysis:
I could also return the SEM and do one-sample t-tests, apply a bonferroni correction, and then look at loci that are significant. Ok, I'm running this now, we'll see how it goes.
# 19 June 2020
Goals:
1. Update text with Adam's comments
2. Look into Bayenv results without Florida pops.
dadi updates:
I was able to transfer some of the dadi results from abacus to C001KR. Also started ALFW_TXFW_13 (run 660) and FLCC_TXCC_14 (run 661, all models except SC2N2mG, which is still running with # 13 in run 619). I noticed that FLLG_TXFW nothing's been updated since Jun 15 and the only log remaining is SC2N2mG.log.txt, which was updated on 10 June. So I'm thinking of stopping that one (job 626) and re-starting just SC2N2mG. - yep, done.
Analyzing Bayenv no Florida results:
XTX and salinity are correlated with a correlation coefficient ~0.5. They share ~30-50 outliers and the analysis without Florida has a lot less noise. There's a peak in LG6 that looks like it might be shared by the stacks Fst outliers -- this is the next thing to check.
```{r}
# Compare Bayenv results with and without the Florida populations:
# find SNPs that are top-1% outliers in BOTH runs (XtX and salinity BF),
# then cross-reference them against the shared Stacks / permutation /
# pcadapt outlier sets and inspect individual hits.
fw_SNPinfo<-readRDS("fw_SNPinfo.RDS")
bayenv_noFL<-read.delim("bayenv_output_noFL.txt",header=TRUE)
bothXtX<-bayenv_noFL$ID[bayenv_noFL$XtX_FL>=quantile(bayenv_noFL$XtX_FL,0.99) &
bayenv_noFL$XtX_noFL>=quantile(bayenv_noFL$XtX_noFL,0.99)]
bothSal<-bayenv_noFL$ID[bayenv_noFL$logSalBF_FL>=quantile(bayenv_noFL$logSalBF_FL,0.99) &
bayenv_noFL$logSalBF_noFL>=quantile(bayenv_noFL$logSalBF_noFL,0.99)]
sharedStacks=fw_SNPinfo$ID[which(fw_SNPinfo$stacks_AL_P < 0.05 &
fw_SNPinfo$stacks_LA_P < 0.05 &
fw_SNPinfo$stacks_TX_P < 0.05)]
sharedPerm=fw_SNPinfo$ID[
rowSums(fw_SNPinfo[,c("perm_TX","perm_AL","perm_LA")])==3]
pcout<-fw_SNPinfo$ID[which(fw_SNPinfo$pcadaptQ<0.01)]
# Overlap of the robust XtX outliers with the other analyses.
bothXtX[bothXtX %in% sharedPerm]
bothXtX[bothXtX %in% sharedStacks]
bothXtX[bothXtX %in% pcout]
# Inspect specific SNPs of interest by ID.
fw_SNPinfo[fw_SNPinfo$ID==11220,]
fw_SNPinfo[fw_SNPinfo$ID==50576,]
fw_SNPinfo[fw_SNPinfo$ID==50892,]
fw_SNPinfo[fw_SNPinfo$ID==46960,]
fw_SNPinfo[fw_SNPinfo$ID==7507,]
fw_SNPinfo[fw_SNPinfo$ID==7925,]
```
So it looks like some of the ones that are shared among some analyses are not in others (e.g., PCAdapt) because they were removed from that analysis for one reason or another.
# 18 June 2020
Goals:
1. figure out Bayenv matrix issues
2. annotate shared stacks/permutation outliers
3. update supplement 3 (include histograms of Stacks outliers and manhattan plots of permutation outliers)
4. email Adam
Starting with # 1:
This command:
`../../scripts/run_bayenv2_matrix_general.sh MATRIX noFL ~/Programs/bayenv/ 5`
is not properly creating the matrices, so let's dig into it a bit. It runs bayenv in 'matrix estimation mode', which requires input of the SNPSFILE and the number of populations. So, is the SNPSFILE generated correctly? It seems to be, it's got 24206 lines (12103 * 2) and 5 columns, and they appear to be tab-separated. BUT! it says there should only be polymorphic sites, and looking at the SNPSFILE there are some loci that had another allele in the FL pops but is not in these pops, so those should be removed. I'm adding that to the `SNPSFILEfromPLINKfrq.R` script, and removing monomorphic sites results in a dataset of 10743 SNPs and a SNPSFILE with 21486 rows. I'm also saving the SNP names to a file when creating the SNPSFILE.
Now it seems to be working!
I've also updated supplement 3 to show more graphs of stacks and permutation outliers, and this should also update the annotations of the outliers (in the minimal sense).
So let's think about the Fst outliers and their annotations.
```{r}
# Rebuild the per-analysis outlier ID lists (same construction as the
# 22 June chunk): top 1% XtX / salinity BF, shared TX+AL+LA permutation
# outliers, pcadapt q < 0.01, and per-state / shared Stacks outliers.
# NOTE(review): xtx and salBF use bare logical indexing, so NA statistics
# yield NA entries in the ID vectors, unlike the which()-wrapped entries.
fw_SNPinfo<-readRDS("fw_SNPinfo.RDS")
outliers<-list(xtx=fw_SNPinfo$ID[fw_SNPinfo$XtX >= quantile(fw_SNPinfo$XtX,0.99,na.rm = TRUE)],
salBF=fw_SNPinfo$ID[fw_SNPinfo$logSalBF>=
quantile(fw_SNPinfo$logSalBF,0.99,na.rm = TRUE)],
permutations=fw_SNPinfo$ID[
rowSums(fw_SNPinfo[,c("perm_TX","perm_AL","perm_LA")])==3],
pcadapt=fw_SNPinfo$ID[which(fw_SNPinfo$pcadaptQ<0.01)],
Alabama=fw_SNPinfo$ID[which(fw_SNPinfo$stacks_AL_P < 0.05)],
Louisiana=fw_SNPinfo$ID[which(fw_SNPinfo$stacks_LA_P < 0.05)],
Texas=fw_SNPinfo$ID[which(fw_SNPinfo$stacks_TX_P < 0.05)],
Florida=fw_SNPinfo$ID[which(fw_SNPinfo$stacks_FL_P < 0.05)],
sharedStacks=fw_SNPinfo$ID[which(fw_SNPinfo$stacks_AL_P < 0.05 &
fw_SNPinfo$stacks_LA_P < 0.05 &
fw_SNPinfo$stacks_TX_P < 0.05)])
```
```{r}
# Classify each SNP's genomic context into broad categories and test
# (Fisher's exact test) whether shared Stacks outliers are enriched in
# particular region types or among putative salinity genes.
annInfo<-fw_SNPinfo[,c("ID","region")]
annInfo$region<-as.character(annInfo$region)
annInfo$region[grep("UTR",annInfo$region)]<-"regulatory"
annInfo$region[grep("gene",annInfo$region)]<-"coding"
annInfo$region[annInfo$region %in% "contig"]<-"non-coding"
annInfo$region[annInfo$region %in% "scaffNotFound"]<-"unknown" # fixed typo ("unkown")
annInfo$outlier<-"not-outlier"
annInfo$outlier[annInfo$ID %in% outliers$sharedStacks]<-"outlier"
# "putative" = SNP has a non-missing entry in the putative-gene column.
annInfo$salgene<-"not-putative"
annInfo$salgene[annInfo$ID %in% fw_SNPinfo$ID[!is.na(fw_SNPinfo$Gene)]]<-"putative"
fisher.test(table(annInfo$region,annInfo$outlier))
fisher.test(table(annInfo$outlier,annInfo$salgene))
```
```{r}
# Table of SNPs that are outliers in BOTH the permutation and the shared
# Stacks analyses, with cleaned-up region labels, rendered for LaTeX.
fst_outs<-outliers$permutations[outliers$permutations %in% outliers$sharedStacks]
fst_out_dat<-fw_SNPinfo[fw_SNPinfo$ID %in% fst_outs,
c("ID","SSCID","Chrom","BP","REF","ALT","region","description")]
fst_out_dat$region<-as.character(fst_out_dat$region)
fst_out_dat$region[grep("UTR",fst_out_dat$region)]<-"regulatory"
fst_out_dat$region[grep("gene",fst_out_dat$region)]<-"coding"
fst_out_dat$region[fst_out_dat$region %in% "contig"]<-"non-coding"
fst_out_dat$region[fst_out_dat$region %in% "scaffNotFound"]<-"unknown" # fixed typo ("unkown")
# Insert spaces after semicolons so multi-gene IDs wrap in the table.
fst_out_dat$SSCID<-gsub(";","; ",fst_out_dat$SSCID)
kable(fst_out_dat,"latex",booktabs=TRUE,
caption="SNPs that were Fst outliers in all pairwise Texas, Alabama, and Louisiana Stacks and permutation analyses. Shown are the SNP ID in this dataset, the ID of the gene it is mapped to in the S. scovelli genome, its location and reference and alternative alleles. The final two columns show the type of genomic region the SNP is in and the gene description (if relevant).") %>%
kable_styling(latex_options="HOLD_position")%>%
column_spec(2,width="10em") %>%
column_spec(8,width="20em")
```
This should be good for a preliminary investigation.
# 17 June 2020
I got comments back from Adam, which were useful. He recommended the following:
1. Re-run bayenv without the Florida populations and compare the results to the one that I've already done.
- I started the bayenv run this AM after subsetting the files.
2. Focus on shared outliers from just the TX, AL, and LA Fst analyses
3. Revisit the permutations, as the Fst values are surprisingly small (and use those with the Stacks Fsts)
4. Make Fig 2 focused on Fsts (stacks + permutations) and increase font sizes and possibly change colors.
5. Make Fig 3 focused on Bayenv and increase font sizes and possibly change colors.
6. Have more discussion of shared outliers (possibly) and definitely the lack of genomic islands.
I think that possibly the positional data in the bayenv analyses has been wrong this whole time...oh no, because I merged the locus IDs and then pulled it from fw_SNPinfo, so it's ok. Phew.
Digging into the permutations:
```{r}
# Genome-wide Manhattan-style plots of the actual Fst values from each
# permutation data frame, stacked 4-up.
# NOTE(review): `lgs` (the linkage groups to plot) is defined elsewhere
# in the notebook -- confirm it is in scope when this chunk runs.
perms<-readRDS("permuted_fsts.RDS")
par(mfrow=c(4,1))
plts<-lapply(perms,fst.plot,scaffs.to.plot = lgs,fst.name = "Fst",
chrom.name = "Chrom",bp.name = "Pos",
pch=19,pt.cex = 1,y.lim=c(0,1))
```
I see what Adam means, the Florida comparison's Fst values are really small, and these are the actuals.
```{r}
# Same 4-up Fst plots for the re-run permutations (29 May 2020).
perms_new<-readRDS("permuted_fsts_29052020.RDS")
par(mfrow=c(4,1))
# Bug fix: this chunk previously called lapply(perms, ...) -- plotting
# the ORIGINAL permutations a second time -- so the re-run set loaded
# into perms_new was never actually visualized.
plts<-lapply(perms_new,fst.plot,scaffs.to.plot = lgs,fst.name = "Fst",
chrom.name = "Chrom",bp.name = "Pos",
pch=19,pt.cex = 1,y.lim=c(0,1))
```
It's true for both of these permutations. It's strange...
*Note: population_whitelist includes ones run with the radiator-filtered whitelist on all 16 populations.*
Now, looking into the Fst values, I'm noticing that we have way too many loci included, and somehow we've got multiple SNPs per locus! I have no idea how that happened. It's almost like the fwsw.tx etc haven't had the correct SNPs retained...and they didn't! But it doesn't matter because it's all ok in the fw_SNPinfo container.
I've now saved the subsetted files in the populations_subset75 directory. The 'raw' Fst values from Stacks includes negative values, which is silly, but I can plot histograms and see what's going on there.
```{r}
# Histograms of the raw (uncorrected) Stacks Fst values for each of the
# four pairwise comparisons; raw values can be negative, hence the
# breaks running from -1.
par(mfrow=c(2,2))
hist(fwsw.tx$Fst,xlim=c(-0.2,1),breaks=seq(-1,1,0.1))
hist(fwsw.al$Fst,xlim=c(-0.2,1),breaks=seq(-1,1,0.1))
hist(fwsw.la$Fst,xlim=c(-0.2,1),breaks=seq(-1,1,0.1))
hist(fwsw.fl$Fst,xlim=c(-0.2,1),breaks=seq(-1,1,0.1))
```
```{r}
# Same histograms for the corrected AMOVA Fst values (bounded at 0, so
# the display window starts at 0).
par(mfrow=c(2,2))
hist(fwsw.tx$Corrected.AMOVA.Fst,xlim=c(0,1),breaks=seq(-1,1,0.1))
hist(fwsw.al$Corrected.AMOVA.Fst,xlim=c(0,1),breaks=seq(-1,1,0.1))
hist(fwsw.la$Corrected.AMOVA.Fst,xlim=c(0,1),breaks=seq(-1,1,0.1))
hist(fwsw.fl$Corrected.AMOVA.Fst,xlim=c(0,1),breaks=seq(-1,1,0.1))
```
The reason I'm doing this is to try to see how different the permutations are. They do show substantially zero-skewed distributions, so that's not nothing. But the genome-wide plots of permutations above are much more damning, and more troubling really. I'm not sure this is the correct approach...let's dive into the details of the permutations a bit more.
None of the permutations have lost loci and seem to have approximately the same numbers as the other analyses -- but in different orders!
```{r}
# Sanity check: total loci per permutation data frame, and how many are
# actually polymorphic (more than one allele).
lapply(perms,nrow)
lapply(perms,function(x) nrow(x[x$NumAlleles>1,]))
```
These seem to be in the order:
[[1]] Texas
[[2]] Florida
[[3]] Alabama
[[4]] Louisiana
Oh, this is the order their histograms are plotted too.
```{r}
# Column summaries for each permutation data frame (order: TX, FL, AL, LA).
lapply(perms,summary)
```
Ok, I think I've convinced myself that the permutations are done correctly. But the fact remains that the outliers don't match other things.
What if I ignore FL? -- this increases the number of permutation outliers. I could make an upset plot of the permutation outliers and compare them to the shared Fst outliers
```{r}
# UpSet plot comparing the per-state permutation outliers (and their
# 4-way intersection) against the shared Stacks outliers.
outliers<-list(permutations=fw_SNPinfo$ID[
rowSums(fw_SNPinfo[,c("perm_TX","perm_FL","perm_AL","perm_LA")])==4],
Alabama=fw_SNPinfo$ID[which(fw_SNPinfo$perm_AL==1)],
Louisiana=fw_SNPinfo$ID[which(fw_SNPinfo$perm_LA ==1)],
Texas=fw_SNPinfo$ID[which(fw_SNPinfo$perm_TX==1)],
Florida=fw_SNPinfo$ID[which(fw_SNPinfo$perm_FL==1)],
sharedStacks=fw_SNPinfo$ID[which(fw_SNPinfo$stacks_AL_P < 0.05 &
fw_SNPinfo$stacks_LA_P < 0.05 &
fw_SNPinfo$stacks_TX_P < 0.05)])
# Fixed color per analysis/set, reused across figures.
cols<-c(permutations='#e41a1c',salBF='#377eb8',pcadapt='#a65628',
xtx='#ff7f00',sharedStacks='#f781bf',
Alabama='#af8dc3',Louisiana='#e7d4e8',Texas='#762a83',Florida='#1b7837')
# NOTE(review): margin1scale is not a documented UpSetR::upset()
# argument -- confirm it is supported by the version/function in use.
upset(fromList(outliers),sets=c("Texas","Alabama","Louisiana","Florida","sharedStacks"),
point.size=5,line.size=2,mainbar.y.label = "Number of Shared Outliers",
sets.x.label = "# Outliers",text.scale=rep(2,6),
sets.bar.color =cols[c("Texas","Alabama","Louisiana","Florida","sharedStacks")],
margin1scale = 0.2,
sets.pt.color=cols[c("Texas","Alabama","Louisiana","Florida","sharedStacks")],
keep.order = TRUE)
```
FOR WHATEVER REASON Bayenv is not liking my environ file that I'm giving it. It's standardized, tab-separated, and the populations are the columns and variables the rows. I've made sure it's got proper line endings, I've added an extra tab at the end of each row, and cannot for the life of me figure out what's wrong.
It's possible it's not the environ file that's causing the problem -- maybe there's an issue with the SNPFILEs? no, but the matrix file is empty!
I will need to figure out what's going on there.
Tomorrow:
- figure out Bayenv matrix issues
- annotate shared stacks/permutation outliers
- update supplement 3 (include histograms of Stacks outliers and manhattan plots of permutation outliers)
- email Adam
# 16 June 2020
ALST_LAFW # 14 is done on abacus and so is FLLG_ALFW_SC2N2mG_6 on C001KR (now only # 7 needs to finish running).
# 15 June 2020
Sweet, some runs on abacus are done:
- ALST_TXFW # 13
- ALST_LAFW # 13
- ALST_TXCC # 14
- ALST_ALFW # 14 finished during the day so I moved it & started another one.
I finally quit job 439 ALST_TXFW_12 & made sure all the data had been moved to C001KR.
Need to run:
- ALFW vs ALST (# 15) - wait for 628 to finish (just one more model)
- FLCC vs TXCC (# 14) - wait for 619 to finish (just one more model)
- FLCC vs TXFW (anything > 10) - running # 14 models 238-259 on abacus (job 645)
- FLCC vs LAFW (#14) models 212-233 (job 646)
For now that's what I'll go with, and hopefully some more of these models will finish running on abacus in the next couple days.
On rccuser FLCC_ALFW_SC2NG_1 is running, and ALFW_TXFW_2 IM2N and IM2nG, plus several FLCC_ALST_2 and FLCC_ALST_SC2N2mG_20, plus FLCC_ALFW_2_AM2NG and SC2mG, and FLCC_LAFW_2 (various models).
tail FLCC-ALFW_SC2NG_1_2.log -- Round 4, 24 of 40
tail FLCC_ALST_20.SC2N2mG.log.txt -- Round 3, 24 of 30 (number 2 is on Round 3, 6 of 30) -- hopefully these aren't interfering with each other, but they might be.
Other stuff:
- I need to run a maximum likelihood unrooted tree with the consensus in treemix to generate covariances. [done]
- I need to re-make the summary figure with the treemix FLLG label turned into FLFW. And change the colors of the tip labels. [done]
- re-run the dadi best models plotting script.
- The FLLG_TXFW and FLLG_TXCC jSFS plots look strange -- need to investigate that.
# 12 June 2020
I think I've decided that a summary table is better than trying to come up with a figure for the dadi results, at least at the moment. So I've added code to save that table to the Rmarkdown doc. I've also plotted 2D data vs model figures made by dadi for the best models and their optimized parameters, but some of them look really strange. Others look really nice. So that means, I think, that I haven't found very good runs of some of these models. Comparisons with bad plots:
FLLG_ALFW
FLLG_ALST
FLLG_LAFW
FLLG_TXFW
FLLG_TXCC
FLLG_ALFW
FLCC_ALST
FLCC_LAFW
ALFW_LAFW
ALFW_TXCC
But the regular 2D plots of them look fine, so it's probably something to do with my plotting function, and of course I didn't save a log. So I'll try running it again and this time save the output to a log file.
Looking at the log, it looks like I've passed some strange parameters including NAs, and I'm guessing that's the source of my issues. This arose from the simplification I made of using sumTabs to fill the outTable, but the columns were in different orders. So I think I've fixed that and I'm working on making the new figures.
To have more confidence in my current results, it would be really nice if I could easily know which runs were complete and move those from abacus and rccuser.
This would be done using some fun unix code
```{bash}
find . -type f -iname "$remote_file" -mtime +3
```
# 11 June 2020
My goals for today are:
1. create a generic script that I can launch easily/automatically to create joint SFS images from dadi best models.
2. decide on best way to present the dadi results in the main document.
- I find my current summary somewhat difficult to interpret
- I feel like I should be showing the model and data joint SFS figures.
3. outline/draft discussion.
4. check text about outliers results -- is it accurate/up-to-date?
I adjusted the python script to accept arguments, and I added R code to create the script at the end of the dadi Rmd. I didn't see an easy way of *just* plotting the model summary bit with the dadi code, but I might be able to use the Plotting_Functions.Plot_2D(fs, model_fit, prefix, "sym_mig") from the dadi_pipeline. The only possible issue is that it creates two figures, and I'm not sure how to capture this via command line.
Eventually I'll also need to improve the captions in the dadi document.
# 10 June 2020
Yesterday I finished up the treemix analyses, including the threepop and fourpop ones and re-making the main text figure. So the goals for today are to update the main text and to finish up/make progress on the dadi results.
I'm going to try to fix table output for dadi so that it's like the summary table but for each population. Done! wrote a function to do this, which I think should work well. Hmm for some reason my function to extract dadi information does not return all of the models to put into the table...this has to do with the switch to the median. This is because when there are even numbers the median is in between two so neither is chosen. How should I deal with this? I fixed it by choosing the model with the minimum difference between its log likelihood and the median. This is simply to choose which model is used for model comparisons.
# 9 June 2020
Today's goals:
1. Fix optM stuff and update text around choosing number of migration edges in supplement.
2. Include the maximum likelihood tree with 1 migration edge in the main text figure instead of the old 2 edge one.
3. Update fourpop and threepop analyses.
4. If I have time, add text about dadi to results.
5. check up on dadi results & move any that have finished.
So the Evanno method plot was throwing a fit because of an odd if statement and lack of dev.off() issue. I can modify the function...
```{r}
my_plot_optm<-function (input, method = "Evanno", plot = TRUE, pdf = NULL)
{
if (missing(method)) {
method = "Evanno"
}
else method = method
if (!is.character(method) | length(method) > 1)
stop("The 'method' argument was not set correctly\n")
methods = c("SiZer", "linear", "Evanno")
if (!(method %in% methods))
stop("Could not find the selected 'method'. Please check.\n")
if (is.null(pdf)) {
message("No output file will be saved. To save an output file, run with 'pdf = \"file.pdf\"'\n")
}
else {
if (length(pdf) != 1 | !is.character(pdf))
stop("Output pdf file is incorrectly specified.\n")
pdf = pdf
}
if (!is.logical(plot))
stop("Please set 'plot' as either TRUE or FALSE.\n")
if (!plot & is.null(pdf))
stop(" You want neither a plot opened or an input file saved. No reason to continue. Exiting.\n")
if (method == "Evanno") {
message(paste0("Plotting the treemix results using the ",
method, " method.\n"))
if (is.null(input) | !is.data.frame(input))
stop("Proper input data frame was not detected.\n")
c = ncol(input)
if (c != 17)
warning(paste0("Warning: This function expects 17 columns in the table but detected ",
c, " columns. Proceeding anyways even if this is incorrect!\n"))
m = max(input$m, na.rm = T)
low = min(input$m, na.rm = T)
runs = mean(input$runs[2:length(input$runs)])
if (ceiling(runs) != runs) {
warning(paste0("The mean number of runs detected is ",
runs, ", but this must be a whole number. Rounding up and continuing anyway.\n"))
runs = round(runs)
}
if (!plot) {
pdf(pdf, width = 7, height = 7)
graphics::par(mfrow = c(2, 1), mar = c(4.1, 4.1,
1.1, 5.1), mgp = c(3, 1, 0))
plot(input$m, input$"mean(Lm)", pch = 1, axes = F,
ann = F)
graphics::axis(2, las = 1)
graphics::axis(1)
graphics::box()
graphics::segments(input$m, input$"mean(Lm)" -
input$"sd(Lm)", input$m, input$"mean(Lm)" +
input$"sd(Lm)")
epsilon = 0.1
graphics::segments(input$m - epsilon, input$"mean(Lm)" -
input$"sd(Lm)", input$m + epsilon, input$"mean(Lm)" -
input$"sd(Lm)")
graphics::segments(input$m - epsilon, input$"mean(Lm)" +
input$"sd(Lm)", input$m + epsilon, input$"mean(Lm)" +
input$"sd(Lm)")
graphics::title(ylab = "Mean L(m) +/- SD")
f.means = input$"mean(f)"
f.sd = input$"sd(f)"
graphics::par(new = T)
plot(input$m, f.means, pch = 19, col = grDevices::rgb(255/255,
0, 0, 89.25/255), axes = F, ann = F,ylim=c(0,1))
graphics::axis(4, las = 1)
graphics::mtext("Variance Explained", side = 4,
line = 3.5)
y.limits = graphics::par("usr")[3:4]
print(y.limits)
if ((y.limits[1]) < 0.998 && (y.limits[2] > 0.998)) {
graphics::abline(h = 0.998, col = "black",
lty = "dotted")
}
else {
warning("Horizontal line at 99.8% variation cutoff is out of bounds. This is not a big deal and the program is continuing anyway without plotting the line.\n",
immediate. = T)
}
graphics::segments(input$m, f.means - f.sd, input$m,
f.means + f.sd, col = "red")
epsilon = 0.1
graphics::segments(input$m - epsilon, f.means - f.sd,
input$m + epsilon, f.means - f.sd, col = "red")
graphics::segments(input$m - epsilon, f.means + f.sd,
input$m + epsilon, f.means + f.sd, col = "red")
graphics::legend("bottomright", legend = c("likelihoods +/- SD",
"% variance", "99.8% threshold"),
col = c("black", grDevices::rgb(255/255,
0, 0, 89.25/255), "black"), bty = "n",
pch = c(1, 19, NA), lty = c(NA, NA, "dotted"))
plot(input$m, input$Deltam, col = "blue", pch = 19,
xlab = "m", ylab = expression(italic(paste(symbol(Delta),
"m"))))
graphics::points(input$m, input$Deltam, col = "blue",
type = "l")
grDevices::dev.off()
}
else {
grDevices::dev.new(width = 7, height = 7)
graphics::par(mfrow = c(2, 1), mar = c(4.1, 4.1,
1.1, 5.1), mgp = c(3, 1, 0))
plot(input$m, input$"mean(Lm)", pch = 1, axes = F,
ann = F)
graphics::axis(2, las = 1)
graphics::axis(1)
graphics::box()
graphics::segments(input$m, input$"mean(Lm)" -
input$"sd(Lm)", input$m, input$"mean(Lm)" +
input$"sd(Lm)")
epsilon = 0.1
graphics::segments(input$m - epsilon, input$"mean(Lm)" -
input$"sd(Lm)", input$m + epsilon, input$"mean(Lm)" -
input$"sd(Lm)")
graphics::segments(input$m - epsilon, input$"mean(Lm)" +
input$"sd(Lm)", input$m + epsilon, input$"mean(Lm)" +
input$"sd(Lm)")
graphics::title(ylab = "Mean L(m) +/- SD")
f.means = input$"mean(f)"
f.sd = input$"sd(f)"
graphics::par(new = T)
plot(input$m, f.means, pch = 19, col = grDevices::rgb(255/255,
0, 0, 89.25/255), axes = F, ann = F,ylim=c(0,1))
graphics::axis(4, las = 1)
graphics::mtext("Variance Explained", side = 4,
line = 3.5)
y.limits = graphics::par("usr")[3:4]
print(y.limits)
if ((y.limits[1]) < 0.998 && (y.limits[2] > 0.998)) {
graphics::abline(h = 0.998, col = "black",
lty = "dotted")
}
else {
warning("Horizontal line at 99.8% variation cutoff is out of bounds. This is not a big deal and the program is continuing anyway without plotting the line.\n",
immediate. = T)
}
graphics::segments(input$m, f.means - f.sd, input$m,
f.means + f.sd, col = "red")
epsilon = 0.1
graphics::segments(input$m - epsilon, f.means - f.sd,
input$m + epsilon, f.means - f.sd, col = "red")
graphics::segments(input$m - epsilon, f.means + f.sd,
input$m + epsilon, f.means + f.sd, col = "red")
graphics::legend("bottomright", legend = c("likelihoods +/- SD",
"% variance", "99.8% threshold"),
col = c("black", grDevices::rgb(255/255,
0, 0, 89.25/255), "black"), bty = "n",
pch = c(1, 19, NA), lty = c(NA, NA, "dotted"))
plot(input$m, input$Deltam, col = "blue", pch = 19,
xlab = "m (migration edges)", ylab = expression(italic(paste(symbol(Delta),
"m"))))
graphics::points(input$m, input$Deltam, col = "blue",
type = "l")
if (!is.null(pdf)) {
grDevices::dev.copy(grDevices::pdf, file = pdf)
grDevices::dev.off()
message(paste0("Plot saved to file ", pdf,
".\n"))
}
else if (is.null(pdf))
message("No plot file has been saved.\n")
}
}
else if (method == "linear") {
message("Plotting the treemix results using various linear models.\n")
if (is.null(input) | !is.list(input) | length(input) !=
5)
stop("Proper input list was not detected.\n")
if (plot) {
x = input$PiecewiseLinear$model$model[, 2]
y = input$PiecewiseLinear$model$model[, 1]
pl = input$PiecewiseLinear
bc = input$BentCable
sim.exp = input$SimpleExponential
nl_ls = input$NonLinearLeastSquares
plot(x, y, ylab = "Log Likelihood", xlab = "m (migration edges)",
axes = F)
graphics::box()
graphics::axis(1)
graphics::axis(2, las = 1)
x.grid <- seq(min(x), max(x), length = 200)
graphics::lines(x.grid, stats::predict(pl, x.grid),
col = "darkgreen", lwd = 2)
graphics::lines(x.grid, stats::predict(bc, x.grid),
col = "orange", lwd = 2)
z = max(y) + 1
y2 = log(-y + z)
graphics::lines(x.grid, z - exp(stats::predict(sim.exp,
newdata = data.frame(x = x.grid))), col = "red",
lwd = 2)
graphics::lines(x.grid, z - exp(stats::predict(nl_ls,
newdata = data.frame(x = x.grid))), col = "blue",
lwd = 2)
graphics::legend("bottomright", legend = c("Observed data",
"Piecewise Linear", "Bent Cable",
"Simple Exponential", "Non-linear Least Squares",
"change points"), col = c("black",
"darkgreen", "orange", "red",
"blue", "black"), bty = "y",
pch = c(1, NA, NA, NA, NA, 8), lty = c(NA, 1,
1, 1, 1, NA))
cp.pl = input$out[which(rownames(input$out) == "PiecewiseLinear"),
4]
lnPD.pl = stats::predict(pl, cp.pl)
cp.bc = input$out[which(rownames(input$out) == "BentCable"),
4]
lnPD.bc = stats::predict(bc, cp.bc)
cp.simexp = input$out[which(rownames(input$out) ==
"SimpleExponential"), 4]
lnPD.simexp = z - exp(stats::predict(sim.exp, newdata = data.frame(x = cp.simexp)))
cp.nlls = input$out[which(rownames(input$out) ==
"NonLinearLeastSquares"), 4]
lnPD.nlls = z - exp(stats::predict(nl_ls, newdata = data.frame(x = cp.nlls)))
graphics::points(x = c(cp.pl, cp.bc, cp.simexp, cp.nlls),
y = c(lnPD.pl, lnPD.bc, lnPD.simexp, lnPD.nlls),
pch = 8, col = c("darkgreen", "orange",
"red", "blue"), cex = 1.5)
if (!is.null(pdf)) {
grDevices::dev.copy(grDevices::pdf, file = pdf)
grDevices::dev.off()
message(paste0("Plot saved to file ", pdf,
".\n"))
}
else if (is.null(pdf))
message("No plot file has been saved.\n")
}
else if (!plot) {
x = input$PiecewiseLinear$model$model[, 2]
y = input$PiecewiseLinear$model$model[, 1]
pl = input$PiecewiseLinear
bc = input$BentCable
sim.exp = input$SimpleExponential
nl_ls = input$NonLinearLeastSquares
pdf(pdf, width = 7, height = 7)
plot(x, y, ylab = "Log Likelihood", xlab = "m (migration edges)",
axes = F)
graphics::box()
graphics::axis(1)
graphics::axis(2, las = 1)
x.grid <- seq(min(x), max(x), length = 200)
graphics::lines(x.grid, stats::predict(pl, x.grid),
col = "darkgreen", lwd = 2)
graphics::lines(x.grid, stats::predict(bc, x.grid),
col = "orange", lwd = 2)
z = max(y) + 1
y2 = log(-y + z)
graphics::lines(x.grid, z - exp(stats::predict(sim.exp,
newdata = data.frame(x = x.grid))), col = "red",
lwd = 2)
graphics::lines(x.grid, z - exp(stats::predict(nl_ls,
newdata = data.frame(x = x.grid))), col = "blue",
lwd = 2)
graphics::legend("bottomright", legend = c("Observed data",
"Piecewise Linear", "Bent Cable",
"Simple Exponential", "Non-linear Least Squares",
"change points"), col = c("black",
"darkgreen", "orange", "red",
"blue", "black"), bty = "y",
pch = c(1, NA, NA, NA, NA, "X"), lty = c(NA,
1, 1, 1, 1, NA))
cp.pl = input$out[which(rownames(input$out) == "PiecewiseLinear"),
4]
lnPD.pl = stats::predict(pl, cp.pl)