-
Notifications
You must be signed in to change notification settings - Fork 2
/
plink_workflow.Rmd
204 lines (132 loc) · 9.93 KB
/
plink_workflow.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
### PLINK workflow used in Geno data editing
* The workflow starts with all the data already converted into PLINK format using Harald's scripts
* The collectioned1 and collectioned2 files are already split (see Tim's script for that)
* The file hetero_remove was created using the script 'hetero_geno' in this repository (scripts folder)
* The file merge1.missnp is created using this trick. Merge the illumina files with the affymetrix files and PLINK will produce a list of snps that are not compatible (the SNPs you will need to flip). This list is the merge1.missnp file.
```bash
#This is the pipeline i used with PLINK to edit Geno data
#1.CONVERSION OF ALL THE FILES IN PLINK FORMAT USING HARALD'S SCRIPTS (.....)
#2.CONVERSION OF ALL THE FILES IN BINARY FORMAT. Merge the files into three groups: one for 777k , one for 54k illumina and one for the 54k affymetrix.
#3.REMOVE ALL THE ID'S IN THE 54K THAT ARE PRESENT IN THE 54K (REDUNDANT INFORMATION) AND ALL THE DOUBLE ID'S.
#4.REMOVE SOME MESSED ID'S (SEE ISSUE NUMBER .....). THOSE ID'S HAVE DIFFERENT SAMPLE ID BUT SAME ID. WE MUST DELETE THEM.
#5.CHANGE ALL THE SAMPLE ID'S WITH THE PEDIGREE ID'S (GENO INFORMATION)
#6.HETEROZYGOSITY TEST AND REMOTION OF THE ID'S THAT FAILED IT
#7.CONVERSION OF ALL THE FILES INTO ATCG FORMAT (THIS PASSAGE IS NOT FUNDAMENTAL)
#8.MISSINGNESS PER ANIMAL.FILTER
#9.MISSINGNESS PER SNPS
#10.MAF FILTER. 0.01 THRESHOLD. SKIP THIS PART IF YOU DON'T WANT TO FILTER FOR MAF. I DIDN'T FILETER FOR IT (FOLLOWING YOUR INSTRUCTIONS)
#11.MERGE OF ALL THE ILLUMINA FILES
#12.MERGE AFFYMETRIX FILES WITH ILLUMINA FILES IN ORDER TO HAVE THE LIST OF THE SNPS TO FLIP
#13.FLIP ILLUMINA SNPS IN ORDER TO ENSURE COMPATIBILITY WITH AFFYMETRIX
#14.MERGE ILLUMINA WITH AFFIMETRYX
#15. HW FILTER 0.0000001
#16.LOOP IN ORDER TO EXTRACT ALL THE CHROMOSOMES
#17.LOOP IN ORDER TO CONVERT EVERYTHING IN ALPHAIMPUTE FORMAT
#2.CONVERSION OF ALL THE FILES IN BINARY FORMAT. Merge the files into two groups: one for 777k , one for 54k illumina and one for the 54k affymetrix. Note: you can do this with a single comand line using the --merge-list option in plink. I did't use it because i wanted to check each merge individually.
./plink --file FinalReport_54kV1_apr2009_ed1 --cow --make-bed --out FinalReport_54kV1_apr2009_ed1
./plink --file FinalReport_54kV1_ed1 --cow --make-bed --out FinalReport_54kV1_ed1
./plink --file FinalReport_54kV2_collection_ed1 --cow --make-bed --out FinalReport_54kV2_collection_ed1
./plink --file FinalReport_54kV2_collection_ed2 --cow --make-bed --out FinalReport_54kV2_collection_ed2
./plink --file FinalReport_54kV2_ed1 --cow --make-bed --out FinalReport_54kV2_ed1
./plink --file FinalReport_54kV2_feb2011_ed1 --cow --make-bed --out FinalReport_54kV2_feb2011_ed1
./plink --file FinalReport_54kV2_genoskan --cow --make-bed --out FinalReport_54kV2_genoskan
./plink --file FinalReport_54kV2_nov2011_ed1 --cow --make-bed --out FinalReport_54kV2_nov2011_ed1
./plink --file Nordic_54k --cow --make-bed --out Nordic_54k
./plink --file Nordic_54k_2012_ed1 --cow --make-bed --out Nordic_54k_2012_ed1
./plink --file Swedish_54k_ed1 --cow --make-bed --out Swedish_54k_ed1
./plink --file FinalReport_777k_jun2015 --cow --make-bed --out FinalReport_777k_jun2015
./plink --file FinalReport_777k_apr2015 --cow --make-bed --out FinalReport_777k_apr2015
./plink --file FinalReport_777k_jan2015 --cow --make-bed --out FinalReport_777k_jan2015
./plink --file FinalReport_777k --cow --make-bed --out FinalReport_777k
./plink --file Nordic_HDexchange_201110 --cow --make-bed --out Nordic_HDexchange_201110
./plink --file Batch1 --cow --make-bed --out Batch1
./plink --file Batch2 --cow --make-bed --out Batch2
./plink --file Batch3 --cow --make-bed --out Batch3
#merge of the 54k illumina using the --merge-list option
./plink --bfile FinalReport_54kV1_apr2009_ed1 --merge-list list --cow --recode --out merge_54k
#merge of the 777k illumina using the --merge-list option
./plink --bfile FinalReport_777k_jun2015 --merge-list list --cow --recode --out merge_777k
#merge of the 54k affymetrix using the --merge-list option
./plink --bfile merge --bmerge Batch1 --merge-list list --cow --recode --out merge54k_affymetrix_temp
./plink --bfile affy_25K_flipped_2_affy50K --allele1234 --cow --make-bed
./plink --bfile merge54k_affymetrix_temp --bmerge plink --cow --recode --out merge54k_affymetrix
#3.REMOVE ALL THE ID'S IN THE 54K THAT ARE PRESENT IN THE 54K (REDUNDANT INFORMATION) AND ALL THE DOUBLE ID'S.
./plink --bfile merge54k --cow --remove id_to_delete54k --make-bed --out merge54k_cleaned
./plink --bfile merge_777k --cow --remove id_to_delete54k --make-bed --out merge777k_cleaned
./plink --bfile merge54k_affymetrix --cow --remove id_to_delete54k --make-bed --out merge54k_affymetrix_cleaned
#4.REMOVE SOME MESSED ID'S (SEE ISSUE NUMBER .....). THOSE ID'S HAVE DIFFERENT SAMPLE ID BUT SAME ID. WE MUST DELETE THEM.
./plink --bfile merge54k_cleaned --cow --remove id_remove_scientific_54k --make-bed --out merge54k_cleaned1
#5.CHANGE ALL THE SAMPLE ID'S WITH THE PEDIGREE ID'S (GENO INFORMATION)
./plink --bfile merge54k_cleaned1 --update-ids list_changeID_plink_54k --make-bed --cow --out merge54k_cleaned1_update
./plink --bfile merge777k_cleaned --update-ids list_changeID_plink --make-bed --cow --out merge777k_cleaned_update
./plink --bfile merge54k_affymetrix_cleaned --update-ids list_changeID_plink --make-bed --cow --out merge54k_affymetrix_update
#6.HETEROZYGOSITY TEST AND REMOTION OF THE ID'S THAT FAILED IT
./plink --bfile merge54k_cleaned1_update --cow --het --out het_illumina54k
./plink --bfile merge777k_cleaned_update--cow --het --out het_illumina54k
./plink --bfile merge54k_affymetrix_update --cow --het --out het_affymetrix
#deletion of the ID's that failed the hetero test (NOTE: you must have one hetero_remove for each array). The R script that produce the file is called hetero_geno.R. You need to run it for each array you have.
./plink --bfile merge54k_cleaned1_update --cow --remove hetero_remove1 --make-bed --out merge54k_update1
./plink --bfile merge777k_cleaned_update --cow --remove hetero_remove2 --make-bed --out merge777k_update1
./plink --bfile merge54k_affymetrix_update --cow --remove hetero_remove3 --make-bed --out merge54k_affymetrix_update1
#7.CONVERSION OF ALL THE FILES INTO ATCG FORMAT (THIS PASSAGE IS NOT FUNDAMENTAL)
./plink --file merge54k_update1 --cow --alleleACGT --recode --out merge54k_update1_ATCG
./plink --file merge777k_update1 --cow --alleleACGT --recode --out merge777k_update1_ATCG
./plink --file merge54k_affymetrix_update1 --cow --alleleACGT --recode --out merge54k_affymetrix_update1_ATCG
#8.MISSINGNESS PER ANIMAL.FILTER
./plink --bfile merge54k_update1_ATCG --cow --missing --out missing_54k
./plink --file merge777k_update1_ATCG --cow --missing --out missing_777k
./plink --bfile merge54k_affymetrix_update1_ATCG --cow --missing --out missing_54k_affy #don't filter this since there are low density genotyped animals (3k)
#9.MISSINGNESS PER SNPS.
#illumina 777k 12115 variants removed due to missing genotype data (--geno)
./plink --file merge777k_update1_ATCG --cow --geno 0.1 --make-bed --out merge777k_update1_ATCG_filtered
#illumina 54k 15303 variants removed due to missing genotype data (--geno)
./plink --file merge54k_update1_ATCG --cow --geno 0.1 --make-bed --out merge54k_update1_ATCG_filtered
#affymetrix 54k 0 missing in the 54k. file== merge54k_affymetrix_update1_ATCG
#10.MAF FILTER. 0.01 THRESHOLD. SKIP THIS PART IF YOU DON'T WANT TO FILTER FOR MAF. I DIDN'T FILTER FOR IT (FOLLOWING YOUR INSTRUCTIONS).
#Illumina 777k
./plink --bfile merge777k_update1_ATCG_filtered --maf 0.01 --cow --make-bed --out merge777k_update1_ATCG_filtered1
#Illumina 54k
./plink --bfile merge54k_update1_ATCG_filtered --maf 0.01 --cow --make-bed --out merge54k_update1_ATCG_filtered1
#Affymetrix
./plink --bfile merge54k_affymetrix_update1_ATCG --maf 0.01 --cow --make-bed --out merge54k_affymetrix_update1_ATCG_filtered1
#11.MERGE OF ALL THE ILLUMINA FILES
./plink --bfile merge777k_update1_ATCG_filtered --bmerge merge54k_update1_ATCG_filtered --cow --recode --out merge_illumina_ATCG
#12.MERGE AFFYMETRIX FILES WITH ILLUMINA FILES IN ORDER TO HAVE THE LIST OF THE SNPS TO FLIP
./plink --bfile merge54k_affymetrix_update1_ATCG --bmerge merge_illumina_ATCG --cow --recode --out merge1
#13.FLIP ILLUMINA SNPS IN ORDER TO ENSURE COMPATIBILITY WITH AFFYMETRIX
./plink --file merge_illumina_ATCG --flip merge1.missnp --cow --make-bed --out flipped
#14.MERGE ILLUMINA WITH AFFIMETRYX
./plink --bfile flipped --bmerge merge54k_affymetrix__update2_ATCG_filtered1 --cow --recode --out merge_final_ATCG
#15. HW FILTER 0.0000001. 24685 variants removed due to Hardy-Weinberg exact test
./plink --bfile merge_final_ATCG --cow --hwe 0.0000001 --make-bed --out merge_final_ATCG_no_maf
#16.LOOP IN ORDER TO EXTRACT ALL THE CHROMOSOMES
for i in {1..29}; do
./plink --file merge_final2_ATCG_no_maf --cow --recode A --chr ${i} --out chr${i}
done
#17.LOOP IN ORDER TO CONVERT EVERYTHING IN ALPHAIMPUTE FORMAT
for i in {1..29} ; do
tail -n +2 chr${i}.raw > file_flipped
cut -f 7- -d ' ' file_flipped > SNPs
cut -f2 -d ' ' file_flipped > ID
sed -i -e "s/[[:<:]]NA[[:>:]]/9/g" SNPs
paste -d ' ' ID SNPs > ch${i}_coded
rm ID SNPs file_flipped SNPs-e
done
######END######
#ALPHAIMPUTE RUN
.
.
.
.
.
.
.
.
#IMPUTED DATA
#REMOVE ID_MESSED (402 more or less)
#REMOVE ANIMALS THAT ARE IN THE PEDIGREE BUT ARE NOT GENOTYPED
############CORRELATIONS
Scenario 1. cross validation keeping all the SNPs in the dataframe, masking 100 animals and 20% of the snps for those animals…totally random with one condition: the snps must not have a low MAF. Overall correlations 0.98.
Scenario 2. 30 rounds of validations masking one animal at a time (10 animals for each array). Condition: the snps must not have a low MAF. Overall correlations 0.98.
Scenario 3. 30 rounds of validations masking one animal at a time . Condition: the snps must not have a low MAF.Condition2: I’m now masking animals that belongs only to the 6k snp chip . Overall correlations 0.90.
```