# Prepare genotypic data
```{r setup, include=F, echo=F}
library(tidyverse); library(magrittr) # magrittr provides the %<>% and %$% operators used below
library(genomicMateSelectR)
```
- **Context and Purpose:**
- Depending on whether parent selection or mate selection is intended, several formats must be constructed / computed from the downloaded genotypic data.
- Optionally, do an exploratory / preliminary assessment of population structure, especially any divergence between "training" samples and selection candidates ("test").
- **Upstream:** Section \@ref(prepare-phenotype-data) - quality control steps for phenotype data
- **Downstream:** All analyses relying on genotypic data
- **Inputs:**
- For parent selection: An imputed allele-dosage matrix downloaded from BreedBase (`data/BreedBaseGenotypesDownload.tsv`)
- For mate selection:
- An imputed variant call format (VCF) file downloaded from BreedBase (`data/BreedBaseGenotypesDownload.vcf`)
- A centimorgan-scale genetic map with positions on the same reference genome version as the VCF. It can be sourced from the Cassavabase FTP archive or another source.
- **Expected outputs:**
- For parent selection: a MAF-filtered dosage matrix, possibly with some samples removed relative to the database download.
- For mate selection:
- a haplotype matrix extracted from the VCF
- a dosage matrix (computed from the haplotype matrix)
- a filtered SNP list
- An interpolated genetic map, i.e. a cM position for each SNP
- For parent and mate selection: genomic relationship matrices (GRMs, aka kinship matrices), possibly both additive *and* dominance relationship matrices, constructed based on the dosage matrix.
## Process Map
![](images/prepare_genotypic_data_process_map.png){width="100%"}
## Parent vs. Mate Selection?
The following sections will exemplify the **genomic mate selection** pathway along the process map, as opposed to the somewhat simpler **genomic parent selection** pathway. This chapter can be simplified or mostly skipped if parent selection is sufficient.
## Learning command line navigation
- <https://www.codecademy.com/learn/learn-the-command-line/modules/learn-the-command-line-navigation/cheatsheet>
- [SSH Client on Windows 10 Using the Command Prompt \| SSH from Windows to Linux and Other Systems](https://www.youtube.com/watch?v=JbMgOKlj5fE&ab_channel=SavvyNik)
- [Secure Copy Protocol (SCP) - Transfer Files using SSH & Command Line on Windows 10 to Linux / Other](https://www.youtube.com/watch?v=2u0I-U0D7Uk&ab_channel=SavvyNik)
## vcftools and bcftools manuals
[`vcftools`](https://vcftools.github.io/man_latest.html)
[`bcftools`](https://samtools.github.io/bcftools/bcftools.html)
## Remote access the server
These steps will vary a bit depending on your system.
Instructions for BioHPC access are here: <https://biohpc.cornell.edu/lab/doc/BioHPCLabexternal.pdf>
First, remote login, using `ssh`
`ssh userid@serverid.biohpc.cornell.edu`
For example:
```{bash, eval=F}
ssh mw489@cbsumm21.biohpc.cornell.edu
```
Input your password when prompted.
```{bash, eval=F}
pwd
```
This should show something like `/home/mw489/` (but with your userid).
## [INSTRUCTOR'S STEPs]
**DON'T DO THESE STEPS, UNLESS YOU ARE ME, WHICH YOU ARE NOT \^\_\~!!**
**Create a sub-folder for analysis**
I created a sub-folder for the analysis, with the same name as my example GitHub repository for this workshop, like so:
```{bash, eval=F}
mkdir GSexample2022;
cd GSexample2022;
mkdir data;
mkdir output;
```
**Transfer data to CBSU**
Now, on my local computer, navigate in the command line to `data/`.
Use `scp` to transfer the VCF file downloaded from Cassavabase to the server.
Also copy the `phenotypes_cleaned.rds` from last week, which is in `output/`:
```{bash, eval=F}
scp BreedBaseGenotypesDownload.vcf mw489@cbsulogin.biohpc.cornell.edu:GSexample2022/data/
cd ../output;
scp phenotypes_cleaned.rds mw489@cbsulogin.biohpc.cornell.edu:GSexample2022/output/
```
**Copy data to the reserved server's `/workdir/`**
```{bash, eval=F}
mkdir /workdir/mw489/;
```
Then I copy my `GSexample2022` directory and its contents into that sub-folder.
```{bash, eval=F}
cp -r ~/GSexample2022 /workdir/mw489/;
```
## [EVERYONE'S STEPS]
### Create a directory on `/workdir/`
Go to `/workdir/`
```{bash, eval=F}
cd /workdir/
```
Create a directory with *your* username:
Example:
```{bash, eval=F}
mkdir youruserid/;
```
Create a sub-directory for *this* analysis.
Example:
```{bash, eval=F}
mkdir /workdir/youruserid/yourprojectname
```
Navigate into your subdirectory on `/workdir/` (empty now).
Example:
```{bash, eval=F}
cd /workdir/youruserid/yourprojectname
```
Copy the `data/` and `output/` from *my* `/workdir/mw489/GSexample2022/` location, into yours.
Example:
```{bash, eval=F}
cp -r /workdir/mw489/GSexample2022/data /workdir/youruserid/yourprojectname/
cp -r /workdir/mw489/GSexample2022/output /workdir/youruserid/yourprojectname/
```
Check that it's there. Example:
```{bash, eval=F}
ls /workdir/youruserid/yourprojectname/data/
```
Now you should be ready to go.
Do all remaining steps *within* the directory you just created.
### Install R packages to server
First time on the server and using R there?
Type `R` and then press "enter" to start R. (Later, type `q()` + "enter" to quit R.)
Then install packages like so:
```{r, eval=F}
install.packages(c("tidyverse","workflowr", "sommer", "lme4", "devtools"))
devtools::install_github("wolfemd/genomicMateSelectR", ref = 'master')
```
If asked "do you want to install dependencies" and given a long list, type "1" + "enter" to install _ALL_ the packages it asks about. This will ensure you have everything you need.
## Introducing VCF files
Some links:
- <https://en.wikipedia.org/wiki/Variant_Call_Format>
- Detailed specifications: <https://samtools.github.io/hts-specs/VCFv4.2.pdf>
- VCF poster: <http://vcftools.sourceforge.net/VCF-poster.pdf>
## Check the VCF
Check the VCF 'manually'.
- Are the number of samples and sites what you expected?
```{bash, eval=F}
vcftools --vcf data/BreedBaseGenotypesDownload.vcf
# VCFtools - 0.1.16
# (C) Adam Auton and Anthony Marcketta 2009
#
# Parameters as interpreted:
# --vcf data/BreedBaseGenotypesDownload.vcf
#
# After filtering, kept 1207 out of 1207 Individuals
# After filtering, kept 61239 out of a possible 61239 Sites
# Run Time = 16.00 seconds
```
- Are the data phased?
- What FORMAT fields are present? At a minimum, the GT field should be included.
- Do the column-names after FORMAT (should be sample names) look right / make sense?
- Do the SNP IDs in the "ID" field make sense?
Let's take a look at *our* VCF file:
```{bash, eval=F}
# look at the header of the VCF file
# print the "top-left" corner of the file
cat data/BreedBaseGenotypesDownload.vcf | head -n50 | cut -c1-100
```
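One of the checklist items above is whether the genotypes are phased. In the VCF `GT` field, phased genotypes use `|` as the allele separator, while unphased ones use `/`. A minimal sketch of that check, using a toy genotype string rather than the real file:

```bash
# Toy GT value for illustration (not taken from the real VCF)
gt="0|1"
case "$gt" in
  *"|"*) phasing="phased" ;;
  *) phasing="unphased" ;;
esac
echo "$phasing" # prints "phased"
```

On the real file, you could apply the same check to a genotype field pulled from a data line, e.g. `grep -v "^#" data/BreedBaseGenotypesDownload.vcf | head -n1 | cut -f10`.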
## Subset VCF
- There *may* be multiple imputed DNA samples corresponding to a single unique 'germplasmName'. For matching to the phenotypic observations in downstream analyses, a single entry in the VCF file must be chosen per 'germplasmName'.
- Remove extraneous samples, if any.
- **For example / tutorial purposes ONLY:** randomly sample and subset the number of SNPs to only a few thousand, for quick, local computations.
### Remove duplicate samples
Cassavabase returned multiple columns in the VCF file with the same column-name, because of multiple tissue_samples per germplasmName. This prevents using certain tools, e.g. `bcftools`, which errors out.
Try this and you'll see:
```{bash, eval=F}
bcftools query --list-samples data/BreedBaseGenotypesDownload.vcf
```
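To see why this is a problem, here is the underlying logic on a toy `#CHROM` header line (the sample names are made up): sample columns start at field 10, and `sort | uniq -d` prints any name occurring more than once.

```bash
# Toy header line with a duplicated sample column (made-up names)
header=$(printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tcloneA\tcloneB\tcloneA')
# Fields 10+ are sample names; list any that are duplicated
dups=$(printf '%s\n' "$header" | tr '\t' '\n' | tail -n +10 | sort | uniq -d)
echo "$dups" # prints "cloneA"
```

Applied to the real `data/vcf_colnames.txt` (created below), the same pipeline would list the duplicated germplasm names.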
**Manual solution required:**
Write just the column-names of the VCF to disk. Here's a one-liner:
```{bash, eval=F}
egrep "^#CHROM" data/BreedBaseGenotypesDownload.vcf | head -n1 > data/vcf_colnames.txt
```
Read the column-names file into R. Exclude the first 9 elements, as they are standard VCF columns, not sample names ("germplasmName").
```{r}
library(genomicMateSelectR) # to get the %>% loaded
vcf_sample_names<-readLines("data/vcf_colnames.txt") %>%
strsplit(.,"\t") %>% unlist() %>%
.[10:length(.)]
# Check how many sample names are duplicated?
table(duplicated(vcf_sample_names))
```
Quite a few duplicates.
Next, I (1) create unique names for each sample column in the VCF file, (2) write the "unique_names_for_vcf.txt" to disk and (3) use it to replace the current sample column names in the VCF. Finally, I (4) subset the VCF to only one unique instance of each name.
First, manipulate names using **R**.
```{r}
# create unique names for each VCF
unique_names_for_vcf<-tibble(vcfName=vcf_sample_names) %>%
# create an overall index to ensure I can recover the original column order
mutate(vcfIndex=1:n()) %>%
# now for each vcfName create a sub-index, to distinguish among duplicates
group_by(vcfName) %>%
# sub-index
mutate(vcfNameIndex=1:n(),
# For the first (or only) instance of each unique vcfName
vcfName_Unique=ifelse(vcfNameIndex==1,
# return the original name
vcfName,
# for all subsequent (duplicate) names,
#put a unique-ified name by pasting the sub-index
paste0(vcfName,".",vcfNameIndex)))
# Write the "unique_names_for_vcf.txt" to disk
write.table(unique_names_for_vcf$vcfName_Unique,file = "data/unique_names_for_vcf.txt",
row.names = F, col.names = F, quote = F)
# Create also a list containing only one instance of each unique name, the first instance
subset_unique_names_for_vcf<-unique_names_for_vcf %>%
filter(vcfNameIndex==1) %$%
vcfName_Unique
# Write that list to disk for subsetting the VCF downstream
write.table(subset_unique_names_for_vcf,file = "data/subset_unique_names_for_vcf.txt",
row.names = F, col.names = F, quote = F)
```
Now in the command line:
`bcftools reheader` to replace the sample names in the VCF with the unique ones.
```{bash, eval=F}
# replace sample names in original VCF with unique ones (creates a new VCF)
bcftools reheader --samples data/unique_names_for_vcf.txt data/BreedBaseGenotypesDownload.vcf > data/BreedBaseGenotypesDownload_1.vcf;
# overwrite the original VCF with the new that has unique names
mv data/BreedBaseGenotypesDownload_1.vcf data/BreedBaseGenotypesDownload.vcf;
# check that the names are now unique by printing sample list
bcftools query --list-samples data/BreedBaseGenotypesDownload.vcf
```
Now subset the VCF to only a single instance of each "germplasmName" with `vcftools`.
```{bash, eval=F}
vcftools --vcf data/BreedBaseGenotypesDownload.vcf --keep data/subset_unique_names_for_vcf.txt --recode --stdout | bgzip -c > data/BreedBaseGenotypes_subset.vcf.gz
# uses stdout and bgzip to output a gzipped vcf file; saves disk space!
```
```{bash, eval=F}
vcftools --gzvcf data/BreedBaseGenotypes_subset.vcf.gz
#VCFtools - 0.1.16
#(C) Adam Auton and Anthony Marcketta 2009
# Parameters as interpreted:
# --gzvcf data/BreedBaseGenotypes_subset.vcf.gz
#
# Using zlib version: 1.2.11
# After filtering, kept 963 out of 963 Individuals
# After filtering, kept 61239 out of a possible 61239 Sites
# Run Time = 2.00 seconds
```
### Check genotype-to-phenotype matches
- Does the number of unique **germplasmName** values (in the "cleaned phenos" from the previous step) matching samples in the VCF make sense? Are there as many as expected? If not, you will need to figure out why.
```{r}
phenos<-readRDS(here::here("output","phenotypes_cleaned.rds"))
# vector of the unique germplasmName in the field trial data
germplasm_with_phenos<-unique(phenos$germplasmName)
length(germplasm_with_phenos)
```
How many matches to the VCF?
```{r}
subset_unique_names_for_vcf<-read.table(file = "data/subset_unique_names_for_vcf.txt",
stringsAsFactors = F)$V1
table(germplasm_with_phenos %in% subset_unique_names_for_vcf)
```
350 matches. Does that make sense? Yes. We ended up excluding the "genetic gain" trial from the phenotypes because it actually had no trait scores.
To be sure, I look at the names of: (1) the genotyped *and* phenotyped, (2) the genotyped *but not* phenotyped, (3) the phenotyped *but not* genotyped.
```{r, eval=F}
# geno and pheno
subset_unique_names_for_vcf[subset_unique_names_for_vcf %in% germplasm_with_phenos]
# pheno not geno
germplasm_with_phenos[!germplasm_with_phenos %in% subset_unique_names_for_vcf]
# geno not pheno
subset_unique_names_for_vcf[!subset_unique_names_for_vcf %in% germplasm_with_phenos]
```
To diagnose some of the phenotyped-but-not-genotyped, I actually resorted to searching for a few on Cassavabase to verify that there were non-genotyped lines in the trials I downloaded.
For the genotyped-but-not-phenotyped, indeed the names are all "genetic gain" population clones.
**Probably the details above will change if I go back and choose better example trials.**
**The checklist and approach to verifying it all should stay the same.**
### Subset SNPs (for tutorial purposes)
**For example / tutorial purposes ONLY:** randomly sample and subset the number of SNPs to only a few thousand, for quick, local computations.
```{bash, eval=F}
# write the positions list
# first two columns (chrom. and position) of the VCF
# ignoring the header rows
cat data/BreedBaseGenotypesDownload.vcf | grep -v "^#" | cut -f1-2 > data/BreedBaseGenotypesDownload.positions
```
Read the positions into R and sample 4000 at random:
```{r, eval=F}
library(genomicMateSelectR)
set.seed(1234)
read.table(here::here("data","BreedBaseGenotypesDownload.positions"),
header = F, stringsAsFactors = F) %>%
dplyr::slice_sample(n=4000) %>%
arrange(V1,V2) %>%
write.table(.,file = "data/BreedBaseGenotypes_subset.positions",
row.names = F, col.names = F, quote = F)
```
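A command-line alternative is sketched below, assuming GNU coreutils' `shuf` is available (for reproducibility, `shuf` also accepts `--random-source`); it uses a toy positions file rather than the real one:

```bash
# Toy positions file (chromosome <TAB> position); illustration only
printf '1\t100\n1\t200\n2\t50\n2\t300\n' > toy.positions
# Sample 2 rows at random, then restore chromosome/position sort order
shuf -n 2 toy.positions | sort -k1,1n -k2,2n > toy_subset.positions
wc -l < toy_subset.positions # prints 2
```

On the real data, that would be e.g. `shuf -n 4000 data/BreedBaseGenotypesDownload.positions | sort -k1,1 -k2,2n`.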
Subset the VCF using the randomly sampled list of positions.
```{bash, eval=F}
vcftools --vcf data/BreedBaseGenotypesDownload.vcf \
--keep data/subset_unique_names_for_vcf.txt \
--positions data/BreedBaseGenotypes_subset.positions \
--recode --stdout | bgzip -c > data/BreedBaseGenotypes_subset.vcf.gz
# VCFtools - 0.1.16
# (C) Adam Auton and Anthony Marcketta 2009
#
# Parameters as interpreted:
# --vcf data/BreedBaseGenotypesDownload.vcf
# --keep data/subset_unique_names_for_vcf.txt
# --positions data/BreedBaseGenotypes_subset.positions
# --recode
# --stdout
#
# Keeping individuals in 'keep' list
# After filtering, kept 963 out of 1207 Individuals
# Outputting VCF file...
# After filtering, kept 4000 out of a possible 61239 Sites
# Run Time = 8.00 seconds
```
```{bash, eval=F}
vcftools --gzvcf data/BreedBaseGenotypes_subset.vcf.gz
```
### LD-pruning SNPs (for computational savings)
**NOT DEMONSTRATED HERE, YET.** In practice, predicting cross-variances can still be very computationally intensive with large numbers of markers. [Previously](https://wolfemd.github.io/IITA_2021GS/04-PreprocessDataFiles.html#Variant_filters), I used `plink --indep-pairwise` to prune markers based on linkage disequilibrium. I [found an LD-pruned subset that had similar accuracy to the full set](https://wolfemd.github.io/IITA_2021GS/07-Results.html#Prediction_accuracy_estimates), but with less than half the markers. Subsequently, I used the full set to predict cross means, but the LD-pruned marker subset for the cross-variance [predictions of \>250K crosses of 719 candidate parents in IITA's 2021 crossing block](https://wolfemd.github.io/IITA_2021GS/07-Results.html#Genomic_Predictions).
## Haplotype matrix from VCF
Extract haplotypes from VCF with `bcftools convert --hapsample`
```{bash, eval=F}
bcftools convert --hapsample data/BreedBaseGenotypes_subset data/BreedBaseGenotypes_subset.vcf.gz
# Hap file: data/BreedBaseGenotypes_subset.hap.gz
# Sample file: data/BreedBaseGenotypes_subset.samples
# [W::vcf_parse_format] FORMAT 'NT' at 1:652699 is not defined in the header, assuming Type=String
# 4000 records written, 0 skipped: 0/0/0 no-ALT/non-biallelic/filtered
```
Read the haps into R and format them.
```{r}
library(genomicMateSelectR)
library(data.table)
vcfName<-"BreedBaseGenotypes_subset"
haps<-fread(paste0("data/",vcfName,".hap.gz"),
stringsAsFactors = F,header = F) %>%
as.data.frame
sampleids<-fread(paste0("data/",vcfName,".samples"),
stringsAsFactors = F,header = F,skip = 2) %>%
as.data.frame
```
Add sample IDs.
```{r}
hapids<-sampleids %>%
select(V1,V2) %>%
mutate(SampleIndex=1:nrow(.)) %>%
rename(HapA=V1,HapB=V2) %>%
pivot_longer(cols=c(HapA,HapB),
names_to = "Haplo",values_to = "SampleID") %>%
mutate(HapID=paste0(SampleID,"_",Haplo)) %>%
arrange(SampleIndex)
colnames(haps)<-c("Chr","HAP_ID","Pos","REF","ALT",hapids$HapID)
dim(haps)
```
Format, transpose, convert to matrix.
```{r}
haps %<>%
mutate(HAP_ID=gsub(":","_",HAP_ID)) %>%
column_to_rownames(var = "HAP_ID") %>%
select(-Chr,-Pos,-REF,-ALT) %>%
t(.) %>%
as.matrix(.)
dim(haps)
```
## Dosage matrix from haps
To ensure consistency in allele counting, create the dosage matrix from the haps manually.
The counted allele in the dosage matrix, which will be used downstream to construct kinship matrices and estimate marker effects, should be the ALT allele. This ensures a match to the haplotype matrix, where a "1" indicates the presence of the ALT allele.
The BreedBase system currently (as of Jan. 2022) gives dosages that count the REF allele, so we need to fix this.
Here's my tidyverse-based approach, using `group_by()` plus `summarise()` to sum the two haplotypes for each individual across all loci.
```{r}
dosages<-haps %>%
as.data.frame(.) %>%
rownames_to_column(var = "GID") %>%
separate(GID,c("SampleID","Haplo"),"_Hap",remove = T) %>%
select(-Haplo) %>%
group_by(SampleID) %>%
summarise(across(everything(),~sum(.))) %>%
ungroup() %>%
column_to_rownames(var = "SampleID") %>%
as.matrix %>%
# preserve same order as in haps
.[sampleids$V1,]
dim(dosages)
dosages[1:5,1:5]
```
```{r}
haps[1:10,1:5]
```
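Stated compactly, the `group_by()` + `summarise()` step above implements the following: with ALT-allele counting, the dosage for individual $i$ at locus $j$ is the sum over its two haplotypes, where a haplotype entry of 1 indicates the ALT allele:

$$x_{ij} = h_{ij}^{A} + h_{ij}^{B} \in \{0, 1, 2\}$$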
## Variant filters {#filter-variants}
In this case, simple: keep only positions with \>1% minor allele frequency.
```{r}
# use function built into genomicMateSelectR
dosages<-maf_filter(dosages,thresh = 0.01)
dim(dosages)
# subset haps to match
haps<-haps[,colnames(dosages)]
```
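For reference, the standard minor allele frequency computation on ALT-dosages $x_{ij} \in \{0,1,2\}$ over $n$ individuals is shown below; I assume `maf_filter()` implements this (check the package documentation to confirm the exact rule, e.g. strict vs. non-strict inequality at the threshold):

$$p_j = \frac{1}{2n} \sum_{i=1}^{n} x_{ij}, \qquad \mathrm{MAF}_j = \min\left(p_j,\; 1 - p_j\right)$$

Loci with $\mathrm{MAF}_j$ at or below the threshold (here 0.01) are dropped.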
### Save dosages and haps
```{r}
saveRDS(dosages,file=here::here("data","dosages.rds"))
saveRDS(haps,file=here::here("data","haplotypes.rds"))
```
## Genomic Relationship Matrices (GRMs) {#construct-grms}
In the example below, I use the `genomicMateSelectR` function `kinship()` to construct **additive** (A) and **dominance** (D) relationship matrices.
```{r}
A<-kinship(dosages,type="add")
D<-kinship(dosages,type="domGenotypic")
saveRDS(A,file=here::here("output","kinship_add.rds"))
saveRDS(D,file=here::here("output","kinship_dom.rds"))
```
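For orientation: a typical **additive** GRM follows VanRaden (2008), built from the dosage matrix $\mathbf{X}$ centered by the allele frequencies $p_j$. Whether `kinship()` uses exactly this scaling (and the corresponding genotypic coding for the dominance matrix) is best confirmed in the genomicMateSelectR documentation:

$$\mathbf{A} = \frac{\mathbf{W}\mathbf{W}'}{2 \sum_j p_j \left(1 - p_j\right)}, \qquad w_{ij} = x_{ij} - 2 p_j$$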
For more information on the models to be implemented downstream, see [this vignette for **genomicMateSelectR**](https://wolfemd.github.io/genomicMateSelectR/articles/non_additive_models.html#genetic-models-implemented-1) and the references cited therein.
## Recombination Frequency Matrix {#recomb-freq-mat}
This matrix is needed for cross-variance predictions.
### Source a genetic map
- Must match the reference genome of the marker set to be used in prediction
- It need not be the exact marker set, but overlap is ideal
I source a single genome-wide file representing the ICGMC consensus genetic map on the V6 Cassava reference genome. The file is on the Cassavabase FTP archive, [here](https://cassavabase.org/ftp/marnin_datasets/NGC_BigData/CassavaGeneticMap/cassava_cM_pred.v6.allchr.txt).
```{r}
genmap<-read.table("https://cassavabase.org/ftp/marnin_datasets/NGC_BigData/CassavaGeneticMap/cassava_cM_pred.v6.allchr.txt",
header = F, sep=';', stringsAsFactors = F) %>%
rename(SNP_ID=V1,Pos=V2,cM=V3) %>%
as_tibble
genmap %>% dim
```
120K positions.
```{r}
genmap %>% head
```
```{r}
snps_genmap<-tibble(DoseSNP_ID=colnames(dosages)) %>%
separate(DoseSNP_ID,c("Chr","Pos","Ref","Alt"),remove = F) %>%
mutate(SNP_ID=paste0("S",Chr,"_",Pos)) %>%
full_join(genmap %>%
separate(SNP_ID,c("Chr","POS"),"_",remove = F) %>%
select(-POS) %>%
mutate(Chr=gsub("S","",Chr)) %>%
mutate(across(everything(),as.character)))
snps_genmap %>%
ggplot(.,aes(x=as.integer(Pos)/1000/1000,y=as.numeric(cM))) +
geom_point() +
theme_bw() +
facet_wrap(~as.integer(Chr))
```
### Interpolate genetic map
```{r}
interpolate_genmap<-function(data){
# for each chromosome map:
# find any _decrements_ in the genetic map distance
# fix them to the cumulative max to force the map to be only increasing
# fit a spline for each chromosome
# use it to predict values for positions not previously on the map
# fix them AGAIN (just in case) to the cumulative max, forcing the map to only increase
data_forspline<-data %>%
filter(!is.na(cM)) %>%
mutate(cumMax=cummax(cM),
cumIncrement=cM-cumMax) %>%
filter(cumIncrement>=0) %>%
select(-cumMax,-cumIncrement)
spline<-data_forspline %$% smooth.spline(x=Pos,y=cM,spar = 0.75)
splinemap<-predict(spline,x = data$Pos) %>%
as_tibble(.) %>%
rename(Pos=x,cM=y) %>%
mutate(cumMax=cummax(cM),
cumIncrement=cM-cumMax) %>%
mutate(cM=cumMax) %>%
select(-cumMax,-cumIncrement)
return(splinemap)
}
```
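The cumulative-maximum step in the function above enforces monotonicity: after spline prediction, each cM value is replaced by the running maximum along the chromosome,

$$\widetilde{cM}_k = \max_{j \le k} cM_j,$$

so the interpolated map can never decrease with physical position.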
```{r}
splined_snps_genmap<-snps_genmap %>%
select(-cM) %>%
mutate(Pos=as.numeric(Pos)) %>%
left_join(snps_genmap %>%
mutate(across(c(Pos,cM),as.numeric)) %>%
arrange(Chr,Pos) %>%
nest(-Chr) %>%
mutate(data=map(data,interpolate_genmap)) %>%
unnest(data)) %>%
distinct
```
```{r}
all(colnames(dosages) %in% splined_snps_genmap$DoseSNP_ID)
```
```{r}
splined_snps_genmap %>%
filter(DoseSNP_ID %in% colnames(dosages)) %>%
mutate(Map="Spline") %>%
bind_rows(snps_genmap %>%
filter(DoseSNP_ID %in% colnames(dosages),
!is.na(cM)) %>%
mutate(across(c(Pos,cM),as.numeric)) %>%
arrange(Chr,Pos) %>% mutate(Map="Data")) %>%
ggplot(.,aes(x=Pos/1000/1000,y=cM,color=Map, shape=Map)) +
geom_point(alpha=0.3,size=0.75) +
theme_bw() + facet_wrap(~as.integer(Chr), scales='free_x')
```
Save the interpolated map, just for the marker loci to be used downstream.
```{r}
splined_snps_genmap %>%
filter(DoseSNP_ID %in% colnames(dosages)) %>%
saveRDS(.,file=here::here("output","interpolated_genmap.rds"))
```
### Recomb. freq. matrix
```{r}
genmap<-readRDS(file=here::here("output","interpolated_genmap.rds"))
m<-genmap$cM;
names(m)<-genmap$DoseSNP_ID
recombFreqMat<-1-(2*genmap2recombfreq(m,nChr = 18))
saveRDS(recombFreqMat,file=here::here("output","recombFreqMat_1minus2c.rds"))
```
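For context: if `genmap2recombfreq()` converts cM distances to recombination frequencies via Haldane's map function (an assumption worth verifying in the package documentation), then for two loci separated by $d$ cM,

$$c = \frac{1}{2}\left(1 - e^{-2d/100}\right),$$

and the saved matrix stores $1 - 2c$ for each pair of loci on the same chromosome (for loci on different chromosomes, $c = 0.5$ and hence $1 - 2c = 0$).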
See also, [**genomicMateSelectR** vignette](https://wolfemd.github.io/genomicMateSelectR/articles/start_here.html#recombination-frequency-matrix-1).
## Copy files back to your computer
Option 1) FileZilla
Option 2) Command line:
On your local computer, on the command line, navigate to the project directory containing the `data/` and `output/` directories.
```{bash, eval=F}
scp youruserid@cbsumm21.biohpc.cornell.edu:/workdir/youruserid/yourprojectname/data/BreedBaseGenotypes_subset.vcf.gz data/
scp youruserid@cbsumm21.biohpc.cornell.edu:/workdir/youruserid/yourprojectname/data/dosages.rds data/
scp youruserid@cbsumm21.biohpc.cornell.edu:/workdir/youruserid/yourprojectname/data/haplotypes.rds data/
scp youruserid@cbsumm21.biohpc.cornell.edu:/workdir/youruserid/yourprojectname/output/kinship_add.rds output/
scp youruserid@cbsumm21.biohpc.cornell.edu:/workdir/youruserid/yourprojectname/output/kinship_dom.rds output/
scp youruserid@cbsumm21.biohpc.cornell.edu:/workdir/youruserid/yourprojectname/output/recombFreqMat_1minus2c.rds output/
scp youruserid@cbsumm21.biohpc.cornell.edu:/workdir/youruserid/yourprojectname/output/interpolated_genmap.rds output/
```
Option 3) I'll provide Google Drive links to download the big files, and post the small ones to my GitHub repo.