-
Notifications
You must be signed in to change notification settings - Fork 1
/
100_popgen_analysis.txt
593 lines (482 loc) · 31 KB
/
100_popgen_analysis.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
Population genetics Analysis Log File
LAST UPDATED: 3 May 2016
This is started after receving first round of reviews from Molecular Ecology
22 February 2016
The reviewers want a general re-structuring of the paper, plus I need to check/fix/add the following things:
X-Migration analysis with MIGRATE
X-Calculate isolation by distance on contigs or single SNP per contig
X-use a more statistical approach to IBD
X-only do IBD on neutral loci
X-Look for the distribution of IBD across loci
X-Use neutral loci to generate Fst CIs
X-Test the association between Pst and Fst on a locus-by-locus basis or use morphological data in Bayenv
X-Apply common subspace approach to complete set of population P-matrices
X-use GIS to calculate distances instead of Google Maps
X-Standardize the traits for size (possibly--or at least justify why not)
X-Summary statistics output from stacks needs to be modified
X-Report sequencing coverage
X-How many of the 66 SNPs were unlinked?
X-Better description of evaluating K from structure
X-Why are there so many extreme Pst values (1)
X-Did LD pruning work as expected? Are SNPs from the same RAD locus removed?
X-Instead of mean temperature, use a temperature anomaly?
#THE PIPELINE
1. align with bowtie2
2. ref_map.pl
3. populations -r 0.75 -p 12 -a 0.05
4. prune_snps.cpp
5. plink --file pruned --hwe 0.001 --noweb --allow-no-sex --write-snplist
plink --file pruned --extract plink.snplist --recode --recodeA --recode-structure --out subset --noweb --allow-no-sex
6. PCAdapt
7. Adegenet
8. FastStructure
9. Structure
Structure Harvester
10. Bayenv2
11. calculate_global_fst
##################################################################################################
#####Monday, 2 May 2016
I've re-done the analysis, blastx is running, and the supplemental has been re-done.
Migrate map:
calculated by multiplying M2->1 by theta 1 for each arrow pointing to pop 1.
So for TXCC, it's got an arrow that's MTXCB->TXCC and MTXSP->TXCC, plus its own average theta. so I calculated Nm as:
Nm=((thetaTXCC*MTXCB->TXCC)+(thetaTXSP*MTXCB->TXCC))/2
OK, need Nm for each arrow.
(thetaTXCC*MTXCB->TXCC)/4
#####Sunday, 1 May 2016
Re-doing the Pst-Fst analysis with linear model instead of the anova.
This changes things slightly.
#####Tuesday, 12 April 2016
Comparing analyses:
summary(all.outliers[all.outliers$SNP %in% band.sig$V1,"analysis"])
CollectionSalinity CollectionTemp Fst MeanTemp PCAdapt Salinity Seagrass TempVariance XtX
2 4 0 3 0 4 1 6 1
> summary(all.outliers[all.outliers$SNP %in% head.sig$V1,"analysis"])
CollectionSalinity CollectionTemp Fst MeanTemp PCAdapt Salinity Seagrass TempVariance XtX
28 36 11 38 5 31 43 34 42
> summary(all.outliers[all.outliers$SNP %in% svl.sig$V1,"analysis"])
CollectionSalinity CollectionTemp Fst MeanTemp PCAdapt Salinity Seagrass TempVariance XtX
34 35 4 33 4 29 29 26 36
> summary(all.outliers[all.outliers$SNP %in% tail.sig$V1,"analysis"])
CollectionSalinity CollectionTemp Fst MeanTemp PCAdapt Salinity Seagrass TempVariance XtX
21 23 3 23 1 23 24 19 23
>
setwd("E:/ubuntushare/popgen/sw_results")
all.outliers<-read.table("AllOutliers.txt",header=T)
svl.sig<-read.table("pstfst/SVL_pstfst.txt")
tail.sig<-read.table("pstfst/TailLength_pstfst.txt")
band.sig<-read.table("pstfst/Bands_pstfst.txt")
head.sig<-read.table("pstfst/HeadLength_pstfst.txt")
length(levels(factor(svl.sig[svl.sig$V1 %in% all.outliers$SNP,])))
length(levels(factor(tail.sig[tail.sig$V1 %in% all.outliers$SNP,])))
length(levels(factor(head.sig[head.sig$V1 %in% all.outliers$SNP,])))
length(levels(factor(band.sig[band.sig$V1 %in% all.outliers$SNP,])))
summary(all.outliers[all.outliers$SNP %in% band.sig$V1,"analysis"])
summary(all.outliers[all.outliers$SNP %in% head.sig$V1,"analysis"])
summary(all.outliers[all.outliers$SNP %in% svl.sig$V1,"analysis"])
summary(all.outliers[all.outliers$SNP %in% tail.sig$V1,"analysis"])
bf.tempvar<-read.table("bayenv2/new_bayenv/tempvar_bf.txt",header=T)
bf.dat<-read.table("bayenv2/new_bayenv/BF_summary.pruned_snps.txt",header=T)
bf.dat$locus<-sub("./pruned_snps/(\\d+.*)","\\1",bf.dat$locus)
bf.dat<-merge(bf.dat,bf.tempvar,by.x="locus",by.y="SNP")
sub.map<-read.table("stacks/populations_subset/batch_1.plink.map")
bf.scaff<-merge(sub.map, bf.dat, by.x="V2", by.y="locus")
colnames(bf.scaff)[1:4]<-c("locus","scaffold","dist","BP")
bf<-bf.scaff[,c(1,2,4,5,8,11,14,17,20)]
bf.co<-apply(bf[,4:9],2,quantile,0.95)
temp.bf.sig<-bf[bf$Temp_BF>bf.co["Temp_BF"],c(1,2,3,4)]
sal.bf.sig<-bf[bf$Salinity_BF>bf.co["Salinity_BF"],c(1,2,3,5)]
ctemp.bf.sig<-bf[bf$coll.temp_BF>bf.co["coll.temp_BF"],c(1,2,3,6)]
csal.bf.sig<-bf[bf$coll.sal_BF>bf.co["coll.sal_BF"],c(1,2,3,7)]
grass.bf.sig<-bf[bf$seagrass_BF>bf.co["seagrass_BF"],c(1,2,3,8)]
tvar.bf.sig<-bf[bf$BFtempvar>bf.co["BFtempvar"],c(1,2,3,9)]
bf.names<-c("CollectionTemp","TempVariance","Salinity","Seagrass",
"CollectionSalinity","MeanTemp")
locad.out<-all.outliers[!(all.outliers$analysis %in% bf.names),]
loc.bf<-c(as.character(temp.bf.sig[temp.bf.sig$locus %in% locad.out$SNP,"locus"]),
as.character(ctemp.bf.sig[ctemp.bf.sig$locus %in% locad.out$SNP,"locus"]),
as.character(tvar.bf.sig[tvar.bf.sig$locus %in% locad.out$SNP,"locus"]),
as.character(sal.bf.sig[sal.bf.sig$locus %in% locad.out$SNP,"locus"]),
as.character(csal.bf.sig[csal.bf.sig$locus %in% locad.out$SNP,"locus"]),
as.character(grass.bf.sig[grass.bf.sig$locus %in% locad.out$SNP,"locus"]))
length(levels(factor(loc.bf)))
#####Monday, 11 April 2016
I need to do the summary stats. I'm thinking of just including more from the stacks output. That should be fine.
I'm also re-making all of the figures to be pdfs, as per the molecular ecology instructions.
#####Sunday, 10 April 2016
OK, bayenv2 finished and provided useful output. There were 88 loci associated with variance in temperature, and 4 of those were also in the mean temperature analysis. What now? I should blast those four. And make a plot for the temperature comparisons.
Still to do:
-update paper
-coverage analysis
-summary statistics
-use quantile() to ID outlier Fsts
-update MS
-organize files and annotate info on github
I re-did the Fst analysis using quantile() and 0.99 and now there's one overlapping SNP. This is fine (there are 4 with quantile() and 0.95).
I'm running coverage_from_stacks.
Comparing loadings etc. from PCA:
fem.pheno.pca (body traits)
PC1 PC2 PC3 PC4 PC5 PC6
SVL -7.8222 -2.67375 -0.13739 0.0003987 -0.01577 -0.002399
TailLength -9.3092 2.28507 -0.07646 -0.0027316 0.02766 -0.001429
depth -1.2765 0.02343 0.53818 -0.0644711 -0.48566 -0.007264
SnoutLength -0.9573 -0.20964 0.53620 -0.4570516 0.29055 -0.016289
SountDepth -0.3009 -0.03534 0.07418 -0.0211756 0.01171 0.212012
HeadLength -1.0488 -0.16777 0.53763 0.5230155 0.19467 -0.006552
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Eigenvalue 101.3515 8.32052 0.59936 0.32565 0.24025 0.03030
Proportion Explained 0.9142 0.07505 0.00541 0.00294 0.00217 0.00027
Cumulative Proportion 0.9142 0.98922 0.99462 0.99756 0.99973 1.00000
mal.pheno.pca
PC1 PC2 PC3 PC4 PC5 PC6
SVL -7.8154 2.10852 0.023154 0.08940 -0.03854 2.334e-04
TailLength -10.7317 -1.60014 -0.018902 0.02363 -0.01443 2.294e-05
depth -1.3536 0.12363 0.316103 -0.11005 0.34219 2.535e-02
SnoutLength -1.0014 0.25384 -0.552500 -0.15971 0.16346 1.864e-02
SountDepth -0.3602 0.05437 -0.005883 -0.05611 0.05398 -1.893e-01
HeadLength -1.2963 0.19451 0.115248 -0.48074 -0.14682 1.015e-02
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Eigenvalue 129.4529 5.10017 0.30013 0.20068 0.1216 0.02644
Proportion Explained 0.9575 0.03772 0.00222 0.00148 0.0009 0.00020
Cumulative Proportion 0.9575 0.99520 0.99742 0.99890 0.9998 1.00000
band.pca
PC1 PC2
MBandArea 0.131 -0.74800
BandNum 5.593 0.01752
Importance of components:
PC1 PC2
Eigenvalue 4.0198 0.07191
Proportion Explained 0.9824 0.01757
Cumulative Proportion 0.9824 1.00000
The loadings of the eigenanalysis of each population are generally the same..
Females:
the first eigenvector is mainly explained by SVL/TailLength, and second is also SVL/TailLength, with headlength and bnad num also having larger loadings. The third one is band num. There are some exceptions like FLHB, FLSG, FLSI, TXCC, and TXSP which flip e2 and e3.
Males:
e1 is SVL/TailLength, e2 is also SVL/TailLength, and e3 is HeadLength, snoutlength, or body depth
These generally amtch the PCA results above. It also matches the eigenanalysis of the H matrix. I'm not sure if this is at all equivalent to the eigentensor analyses.
Looking at the tensor summary:
Females:
e11 is mostly SVL, e21 is mostly tail and SVL, e31 is mostly SVL, Tail, and Band num (look at tensorsummary)
Males:
e11 is SVL and tail length, e21 is SVL and tail length, e31 is SVL and tail length.
Coverage from Stacks summary:
Num individuals 524
Num loci 194294
Average per-locus coverage 7.07612
average per-individual coverage 10.0782
Num Loci with >=5x coverage 95145
Num loci with >=10x coverage 36227
Overall minimum per-locus coverage 0
Overall maximum per-locus coverage 6905
dim(temp.bf.sig[temp.bf.sig$locus %in% locad.out,])
[1] 10 4
> dim(ctemp.bf.sig[ctemp.bf.sig$locus %in% locad.out,])
[1] 14 4
> dim(tvar.bf.sig[tvar.bf.sig$locus %in% locad.out,])
[1] 14 4
> dim(sal.bf.sig[sal.bf.sig$locus %in% locad.out,])
[1] 21 4
> dim(csal.bf.sig[csal.bf.sig$locus %in% locad.out,])
[1] 20 4
> dim(grass.bf.sig[grass.bf.sig$locus %in% locad.out,])
[1] 18 4
Total of 383 SNPs in all 6.
I made a table of all of the outliers..I should add the RAD sequences probably and share it as a supplement--did that!
#####Saturday, 9 April 2016
I changed the temperature variance file to have the unix line endings. Now I'm re-running Bayenv. It's still giving me nan...maybe I'll try it on the lab computer. Last time it didn't work right. It's still giving me nan...probably because it's not formatted correctly. I needed to remove the row and column names. That seems to have fixed it!
I've been updating the manuscript. I've also been running the pst-fst significant ones in blast2go to see if there are any differences/similarities.
#####Friday, 8 April 2016
Of course bayenv didn't run properly on the temp variance data and I didn't send myself the temp var data. So I'll have to figure that out back at home.
So I guess that means I'll need to do the Pmatrix analysis instead.
I've output the P matrices and they are all pretty close to 1...the biggest differences seem to be related to body size, with FLFD and FLAB being different (because they're larger)
A p matrix of just the bands yields all 1s..I think maybe because there are only two traits. or something.
In the Aguirre et al review, they compare eigenvectors of H to values from "a common set of principal components". They also have confidence intervals, which I don't know how they generated. Maybe it's in their supplement? They did MCMC resampling. Also this for the null: "The null model we use assumes the G are sampled from the same population, and hence the randomised G will have all subspaces in common up to sampling error. Thus, to compare our observed and null model we will apply the kr.subspace function to the first four eigenvectors of the randomised G array".
Randomly re-assigned each individual to a population.
For calculation of H I'm using the first 3 eigenvectors of each population.
The angles between eac eigenvector of H and each of the p population subspaces quantify how close the corresponding eigenvector of H is to each population's subspace.
Aguirre et al says "Insight into differences among populations in genetic variance associated with common subspaces can be gained in this context by using projection to find the genetic variance in each population for those bi that are judged to form part of the common subspace."
I've made a new figure 7 and am working on a supplementary document with more tables and figures that will also help me with my interpretation.
#####Thursday, 7 April 2016
I'm just reading in the fem.pheno and mal.pheno files from home. Re-doing the Pst analysis. This is slow, I had kind of forgotten that.
Meanwhile, some of my blast searches have finished so I'll run them through blast2go.
The eigenvalues of the female H are:
[1] 11.973398525 11.374808392 10.489751252 1.851644007 0.191285884
[6] 0.097418370 0.012777080 0.008916491
eigenvalues of the male H are:
[1] 11.9978140 11.8908686 10.9745211 0.6891948 0.3391816 0.1084199
where the order is SVL,length, depth,snout length, snout depth, head length,
band area, band num
UMM this doesn't have the modifications! Maybe that's the difference between my heatmap results. Yep, that's it. FML.
Both males and females differ quite a bit if we consider only the male and female body traits (excluding female bands). the heatmap is difficult to read and interpret.
New
female eigenvalues for H: [1] 11.9937325 11.6188491 7.4354786 3.2870541 1.6045495 0.0603361
male eigenvalues for H: [1] 11.9972371 11.7489118 6.2810093 4.5904273 1.1963887 0.1860257
Angles are <0.03. I don't know how to best depict this.
####Wednesday, 6 April 2016
I made a PCA plot of the phenotypes. It looks like what I'd expect. PC1 explains SVL/TailLength vs other traits, and the populations basically group by size. FLAB and FLFD are the largest. FLAB has a weird pattern in the bands pc on PC2 where it's separate from the rest.
The p-matrix analysis is still reversed, and I don't know why. I'm wondering if I messed up the analysis before? But the calculations don't match exactly either so I can't explain it. It worked right on the files from home!weird.
I wrote code plot migrate data with the arrows and points scaled to migration rate and theta. I also output a matrix with the migration rates...it should be similar to Fst matrix. I can make a heatmap of each and compare them.
#####Tuesday, 5 April 2016
Migrate finished running on my subset of data. It gives me a theta for each locus for each pop and a migration rate for each direction of travel from each pop to each pop. No estimates of Ne. Just these giant tables of posterior distributions. I don't know what theta is..OK, 4Nem
I'm extracting the scaffolds with significant fst-pst loci. Ran subset_fasta_file and moved the scaffold-specific fasta files to pstfst/sig_regions. Then used extract_sequence_part to remove the 5kb region and then cat'ed the files together. Also extracted the rad regions using fasta_from_stacks_catalog. Submitted all of the merged results on TIGGS for blastx (xml output).
#####Monday, 4 April 2016
Working on the Pst-Fst by locus analysis:
798 of 1753 are significant in one or more female traits
277 of 1753 are significant in one or more male traits
Num Sig in males and females for each trait:
SVL 634
TailLength 414
BodyLength 782
SnoutLength 540
SnoutDepth 582
HeadLength 770
BandNum or BandArea 132
None are present in all of these.
315 loci are significant for IBD. None of those are in the above significant loci sets.
I should export those loci and create a supplementary table of blast results for those.
315 loci are significant for IBD. None of those are in the above significant loci sets.
Are those that are significant in Fst-Pst comparison significantly IBD? NO! none of them have a significant IBD pattern. This fits.
Exporting all of the rad loci to files and then I'll extract the RAD locus. Also should output the 5bk region...
#####Friday, 1 April 2016
Migrate quit but it didn't finish correctly. So I used plink to randomly sample 25% of the subsetted data (408 ish loci). Then re-ran migrate.
I'm also working on the pst-fst analysis bylocus.
#####Thursday, 31 March 2016
I'm working of Fst-Pst and Fst-IBD locus-by-locus analyses. I'm running into an error message dueto some 0s in the Fst matrices that the distance matrix is non-euclidean, but I don't thinkthis affectsthe Mantel calculations.
#####Wednesday, 30 March 2016
For some reason my P-matrix Blows analysis is reversed--females are not as different as males!?! weird.
The fst-pst analysis doesn't work on band area and band number when standardized...the pst matrix is all ones. I think it's because of a lack of variance? or something? It works for males. and female body stuff. Doing PCA on standardized traits, pst matrix for bands is all 1s. But the patterns look basically the same as before. AHA!!! BandNum and Band Area mean = 0. What if I standardize by trait/mean(trait) instead of trait/mean(trait)-1? Nope, that's not solving the problem.
:::::::RESULTS::::::
STANDARDIZED
mal.fst.pst
Obs P
SVL -0.04095571 0.6062
TailLength -0.15937982 0.8618
BodyDepth -0.02983801 0.6011
SnoutLength -0.09345849 0.7336
SnoutDepth -0.02595385 0.5703
HeadLength -0.06003236 0.6436
> mal.pst.dist
Obs P
SVL 0.19370982 0.0679
TailLength 0.13618662 0.1826
BodyDepth 0.03568641 0.4351
SnoutLength 0.08921589 0.2981
SnoutDepth 0.02090614 0.4442
HeadLength 0.19282189 0.0818
PCA versions:
> mantel.rtest(as.dist(t(sband.pst)),as.dist(t(dist)), nrepet=9999)
Monte-Carlo test
Observation: 0.1785789
Call: mantel.rtest(m1 = as.dist(t(sband.pst)), m2 = as.dist(t(dist)),
nrepet = 9999)
Based on 9999 replicates
Simulated p-value: 0.1261
> mantel.rtest(as.dist(t(sfem.pst)),as.dist(t(dist)), nrepet=9999)
Monte-Carlo test
Observation: -0.2014884
Call: mantel.rtest(m1 = as.dist(t(sfem.pst)), m2 = as.dist(t(dist)),
nrepet = 9999)
Based on 9999 replicates
Simulated p-value: 0.8943
> mantel.rtest(as.dist(t(smal.pst)),as.dist(t(dist)), nrepet=9999)
Monte-Carlo test
Observation: 0.09107063
Call: mantel.rtest(m1 = as.dist(t(smal.pst)), m2 = as.dist(t(dist)),
nrepet = 9999)
Based on 9999 replicates
Simulated p-value: 0.2882
> mantel.rtest(as.dist(t(sband.pst)),as.dist(t(pwise.fst)), nrepet=9999)
Monte-Carlo test
Observation: NA
Call: mantelnoneuclid(m1 = m1, m2 = m2, nrepet = nrepet)
Based on 9999 replicates
Simulated p-value: NA
There were 50 or more warnings (use warnings() to see the first 50)
> mantel.rtest(as.dist(t(sfem.pst)),as.dist(t(pwise.fst)), nrepet=9999)
Monte-Carlo test
Observation: 0.01594444
Call: mantelnoneuclid(m1 = m1, m2 = m2, nrepet = nrepet)
Based on 9999 replicates
Simulated p-value: 0.4864
> mantel.rtest(as.dist(t(smal.pst)),as.dist(t(pwise.fst)), nrepet=9999)
Monte-Carlo test
Observation: -0.06373443
Call: mantelnoneuclid(m1 = m1, m2 = m2, nrepet = nrepet)
Based on 9999 replicates
Simulated p-value: 0.6768
UNSTANDARDIZED
fem.fst.upst
Obs P
SVL 0.13144683 0.1799
TailLength 0.08060939 0.3139
BodyDepth 0.03756990 0.4288
SnoutLength 0.02517979 0.4989
SnoutDepth 0.09016056 0.3063
HeadLength 0.05842674 0.3528
BandArea 0.04007397 0.4444
BandNum -0.20954717 0.8187
> mal.fst.upst
Obs P
SVL 0.01723110 0.4743
TailLength 0.01728534 0.4632
BodyDepth 0.03865810 0.4595
SnoutLength -0.06330220 0.6633
SnoutDepth 0.04661631 0.4132
HeadLength -0.02334358 0.5645
fem.upst.dist
Obs P
SVL -0.15632610 0.8695
TailLength -0.16330518 0.8806
BodyDepth -0.12692967 0.8576
SnoutLength -0.20835525 0.8916
SnoutDepth -0.10891345 0.7732
HeadLength -0.14761062 0.8603
BandArea 0.03079835 0.4718
BandNum -0.54223261 1.0000
> mal.upst.dist
Obs P
SVL -0.001084414 0.5225
TailLength 0.003154747 0.5085
BodyDepth 0.028435422 0.4778
SnoutLength -0.170531471 0.8571
SnoutDepth -0.052837081 0.6628
HeadLength -0.026574775 0.5902
PCA versions:
mantel.rtest(as.dist(t(band.pst)),as.dist(t(dist)), nrepet=9999)
Monte-Carlo test
Observation: -0.5409249
Call: mantel.rtest(m1 = as.dist(t(band.pst)), m2 = as.dist(t(dist)),
nrepet = 9999)
Based on 9999 replicates
Simulated p-value: 1
> mantel.rtest(as.dist(t(fem.pst)),as.dist(t(dist)), nrepet=9999)
Monte-Carlo test
Observation: -0.1224953
Call: mantelnoneuclid(m1 = m1, m2 = m2, nrepet = nrepet)
Based on 9999 replicates
Simulated p-value: 0.8224
> mantel.rtest(as.dist(t(mal.pst)),as.dist(t(dist)), nrepet=9999)
Monte-Carlo test
Observation: -0.004192046
Call: mantelnoneuclid(m1 = m1, m2 = m2, nrepet = nrepet)
Based on 9999 replicates
Simulated p-value: 0.5284
>
>
> mantel.rtest(as.dist(t(band.pst)),as.dist(t(pwise.fst)), nrepet=9999)
Monte-Carlo test
Observation: -0.2099312
Call: mantelnoneuclid(m1 = m1, m2 = m2, nrepet = nrepet)
Based on 9999 replicates
Simulated p-value: 0.8218
> mantel.rtest(as.dist(t(fem.pst)),as.dist(t(pwise.fst)), nrepet=9999)
Monte-Carlo test
Observation: 0.1124728
Call: mantelnoneuclid(m1 = m1, m2 = m2, nrepet = nrepet)
Based on 9999 replicates
Simulated p-value: 0.238
> mantel.rtest(as.dist(t(mal.pst)),as.dist(t(pwise.fst)), nrepet=9999)
Monte-Carlo test
Observation: 0.01852087
Call: mantelnoneuclid(m1 = m1, m2 = m2, nrepet = nrepet)
Based on 9999 replicates
Simulated p-value: 0.4628
>
#####Tuesday, 29 March 2016
There are 88 Bayes Factor outliers in the temperature analysis but 0 for the other environmental factors using the "quantile" function with a cutoff of 0.95.
Now there aren't any overlapping outliers. Partly because XTX at 1% has 18 loci..let's try 5%?
If I use 5% and the pruned Fsts then there's overlap. Whew.
I think I need to present the outliers and the temperature-associated ones separately. Maybe when I do the anomaly analysis there will be something to show there?
#####Monday, 28 March 2016
I need to really try to finish up re-running things. First, I need to edit the fst summary file so that it's in the same order as the distance file (which is geographic order).
PCAdapt rho2 values are in the *.stats file, which I didn't read into R (I just looked it up and entered the correct values in the plotting of the graph)
I made figure 3 in R (rather than piecing them together in gimp).
Now i need to process the structure output and make the structure figure again.
Reviewer 1 wants the colors to make more sense so that colors are consistent in each population. I think I've figured out a way to do this, but my faststructure output is a bit weird.
I don't know why but the faststructure groupings are not very good. They also don't really match the distruct plots. I need to look at the distruct code/info to see what they use, maybe I'm just plotting the wrong values.
I'm re-running bayenv on my work computer because the wrapper didn't work on all of the loci apparently and it didn't record the locus names. If I want to compare these outliers to others I need to record the runs!
Meanwhile I'm stilla little stumped by the Pst-Fst comparison not working well.
#####Saturday, 26 March 2016
I found the original raw environmental file and re-standardized the variables. Now I'm running the bayenv wrapper using run_bayenv2_wrapper.sh... This seems to have worked.
I'm also standardizing the traits by population (not by sex) by dividing each trait value by population mean.
Now I'm trying to re-do the pst-fst comparisons.
For the band area I'm getting Pst=1 because the sums of squares is so small.
#####Friday, 25 March 2016
I realized that the reason I was getting NAN values from Bayenv2 was because my standardized.env file was messed up--full of NANs. So I found the real version and am re-running bayenv2.
Huh. for some reason it's getting overwritten or something because after running several bayenv2 runs it's back to having nans...
And it doesn't look like it's standardized properly. I need to return to this when I can pay more attention to it.
#####Monday, 21 March 2016
I'm choosing a 5% cutoff for the XtX and Bayes Factors from Bayenv, same as before.
That yields 88 of each.
Something didn't work right with the environmental associations. 1633 of them have NAs and the rest, except one, all have the same BF value. Weird.
Re-starting migrate with Bayesian inference with number of recorded steps in chain to 1000.
On my home computer I found a more complete popgen analysis.
#####Saturday, 19 March 2016
My bayenv matrix was formatted incorrectly. Once I fixed it I was able to run the environmental correlations no problem.
#####Friday, 11 March 2016
Ran populations on the pruned SNPs to generate Fsts, summary stats, etc.
populations -b 1 -P ./stacks -M ./stacks/marine_map.txt -s -W ./stacks/populations/subset.whitelist.txt -t 2 --fstats --plink --vcf
#####Thursday, 10 March 2016
I had to re-start structure because of a power outage.
#***IBD***#
Reading up on IBD: it can reflect population structure, and pop structure can reflect it.
In Meirmans 2012: One way to take the hierarchical structure into account is to use a stratified Mantel test in which the permutation scheme is changed to permute the locations of populations within the clusters (Oksanen et al. 2009)...it is advised to use clustering approaches that take the geographical position of the samples into account (e.g. François et al. 2006)...A partial Mantel test can also be of help here, testing the association between the matrix of genetic distances and a model matrix of cluster membership with the matrix of geographical distances as a covariate (Drummond & Hamilton 2007)
#***MIGRATE***#
I also downloaded migrate for Windows and am trying to run it but something's not working correctly.
I'm running the hapmap format parameter...maybe theres a designation about filenames or something that I need to change. I'm trying replacing the locus ID with a locus count (an integer instead of a string)
There are a number of loci that appear not to be polymorphic. I've added some code to add a 0 count if necessary. That didn't fix it (but it's certainly a good thing to do.) I found a post on the google forum that says, "the migrate manual contains an error and you will need to enter the maximal number of individual for each population in the population line." <-I'm pretty sure he means max number of alleles, since in the example the number preceding the pop name is the max number found in the 'total' column. That fixed it!!! Now it's running
#####Wednesday, 9 March 2016
I wrote plink_to_migrate to convert my subset.ped and subset.map files into a format compatible with migrate.
I need to re-code the FamID part of the ped file probably to refer to the populations. I did that in R (popgen_analysis.R)
#####Monday, 7 March 2016
#***Bayenv2***#
Working on Bayenv2:
Once the subset files have been fixed up in R, in bayenv2/:
plink --file bayenv.plink --freq --within plink.clust.txt --noweb --allow-no-sex --out subset.bayenv.plink
saved new snpsfile as pruned.SNPSFILE in bayenv2/
Now I'm running bayenv2 on pruned.SNPSFILE, 10 iterations, to generate a SNPS matrix.
All of the matrices look the same.
plink --file ../stacks/populations/batch_1.plink --freq --within all.6348.clust.txt --allow-no-sex --noweb --out all.bayenv.plink
Then turned it into all.SNPSFILE in R.
~/Programs/bayenv_2/bayenv2 -i all.SNPSFILE -m matrix.1.out -n 1 -p 12 -k 1000000 -t -X -r 13425
^^that doesn't work. I need to use individual snps from the pruned snps. It is better to use unlinked SNPs, so I should use the pruned dataset.
So I need to split the file...did that in R.
But now I'm getting a segmentation fault! Why??
#***FastStructure***#
To run faststructure, I needed to edit the plink --recode-structure output by removing the first two lines, which I did in jEdit.
but faststructure isn't finding the files..it has to be /home/sarah/sf_ubuntushare/popgen/... not ~/sf_ubuntushare/popgen/..
Now it's running. distruct says,
Model complexity that maximizes marginal likelihood = 2
Model components used to explain structure in data = 5
So now I'm running it with K=2 to K=5 with logistic prior instead of --full. (This part probably isn't necessary).
#####Saturday, 5 March 2016
#***Structure***#
Started running Structure with K=1 through K=12, 10 repetitions of each, with admixture model.
#####Friday, 4 March 2016
#***PRUNING***#
The pruning seems to work, it retains 3802 loci. Now for HWE. Doing that in R, since I've already got a function to do it (I just have to modify it for a ped file).
Only 838 pass my hwe pruning in R...1581 pass at 0.001 level, which is plink's default. Plink retains 1753.
plink --file pruned --hwe 0.001 --noweb --allow-no-sex --recodeA --recode --out subset
Plink is easier because I don't have to reformat the files myself, so I'll go with that.
#***ADEGENET***#
OK, I've done that. Now I need to run adegenet. Done!
#***PCAdapt***#
I apparently didn't save the pcadapt commands. No, I did, it was just in PCAdapt folder in ~/Programs/
Now I'm running PCAdapt from ~/sf_ubuntushare/popgen/ directory with K = 1 through K = 12.
K=4 was the best. So I'm running it 10 times with K=4. They're not super consistent but they're probably good enough. Good enough for the one to be a representative run.
#***Bayenv***#
For Bayenv2, I ran the R code to reformat plink. After first bit I ran plink --file subset --freq --within --plink.clust.txt --noweb --allow-no-sex --out subset.bayenv.plink. I'm having some problems with this.
#####Thursday, 3 March 2016
First thing I need to do is to re-prune to ensure that SNPs are not actually on the same locus. So, I'll take a PLINK file (pruned for hwe in PLINK) and then write a program that will read the file and for each locus randomly choose one of the SNPs to be the SNP for that locus. I will also evaluate how far RAD loci are from each other and only keep one SNP every ~2kb (so if a RAD locus is 1kb away from another locus in either direction I'll randomly choose one of those SNPs to keep, I guess).
#####28 February 2016
I've fixed the map as per a reviewer's request.
Now for the hard part. First, I need to check to see if my pruned SNPs set does not include multiple SNPs per RAD locus and see how many are on the scaffold, and how far apart they are. (e.g. make sure the pruning actually worked).
So...the 1833 SNPs are not all in individual RAD loci. They represent 1562 RAD loci. And 1248 scaffolds.
So I think I need to consider re-doing that by taking one SNP from each RAD locus. Then I can do the HWE pruning and some LD pruning, maybe.
#####22 February 2016
Thoughts on how to proceed:
-should I just take one SNP from the RAD locus rather than do the pruning-for-LD-approach?
-I need to figure out which files I used for what and re-acquaint myself with these data.
-Calculate average distance between RAD loci on the same scaffold??