# Roadmap
- [x] New way of plotting missing values
  - So that it is easier to determine which SNPs/IDs to filter out
- [x] Check the Norwegian data set.
  - Request new data on platform $V_2$
- [x] Remove low-quality loci
  - [x] Also compute brief statistics
- [x] Remove low-quality IDs
- [x] Merge the above results into a new 3-set data for imputation
- [x] Remove platform-specific SNPs again.
- [x] Impute for GBLUP
  - ? With some forward $R^2$ info about imputation quality

## Memo
- [ ] Check Norwegian 50k-v2 data.
  - I used bed$\rightarrow$vcf$\rightarrow$bed to circumvent the problem.
  - But obviously something is wrong in the original files.

# QC with new plots
## SNP control
### Missing values and Hardy-Weinberg Equilibrium

Note: below I refer to the Dutch LD platforms 10690, 10993, 11483 as $dd_{1,2,3}$, and to the Illumina 50k-v1, 50k-v2, 50k-v3, 777k platforms as $v_{1,2,3,7}$, prefixed with a country initial $d$, $g$ or $n$ where needed.

![](fig/dutch-d1-ms-hwe.png)

![](fig/dutch-d2-ms-hwe.png)

![](fig/dutch-d3-ms-hwe.png)

![](fig/dutch-v7-ms-hwe.png)

![](fig/dutch-v2-ms-hwe.png)

![](fig/dutch-v3-ms-hwe.png)

![](fig/german-v2-ms-hwe.png)

![](fig/german-v3-ms-hwe.png)

![](fig/norge-v7-ms-hwe.png)

![](fig/norge-v1-ms-hwe.png)

![](fig/norge-v2-ms-hwe-old.png)

### To circumvent the HWE issue about Norge-V2
If I just convert the Norwegian $V_2$ data from `bed` to `vcf` and back to `bed` again, the HWE problem is gone.

I will check what went wrong with the original data from GENO.

![](fig/norge-v2-ms-hwe.png)

### HWE again
Another way to remove low-quality SNPs is to require, e.g., $|O_{\mathrm{HET}} - E_{\mathrm{HET}}| < 0.15$.
This is obviously **not** a good criterion for filtering loci, especially for datasets with only a few IDs, so I ignored it.
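
To make the criterion concrete, here is a minimal sketch of the heterozygosity gap for one locus. It assumes $E_{\mathrm{HET}} = 2pq$ computed from the observed allele frequency; the function name `het_gap` and the toy counts are hypothetical and not taken from the pipeline.

```python
def het_gap(n_aa, n_ab, n_bb):
    """Return |O_HET - E_HET| for one locus from its three genotype counts.

    Assumption: E_HET is the Hardy-Weinberg expectation 2pq, with p
    estimated from the same genotype counts.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no called genotypes at this locus")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    o_het = n_ab / n                 # observed heterozygosity
    e_het = 2 * p * (1 - p)          # expected heterozygosity under HWE
    return abs(o_het - e_het)


# Toy counts only: a large sample in exact HWE proportions,
# and a locus genotyped on just 10 IDs.
print(het_gap(250, 500, 250))
print(het_gap(1, 5, 4))
```

With only a handful of IDs, a change of one or two heterozygotes moves the gap by several hundredths, so a fixed cut-off such as 0.15 mostly reflects sampling noise on the small platforms.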

#### Summary
There are too many figures, as there are too many platforms. It is now much easier to decide which loci to exclude.
- Above figures used only autosomes
- The statistics cover only SNPs in the target set,
  - meaning there are at most ~50k loci
- I will remove
  - loci with more than 10% missing
  - loci with $-\log_{10}t>4$, where $t$ is the P-value of the HWE test
  - cutting off the right tails is similar to using FDR
- The Norwegian $V_2$ data has some, so far unknown, problems.
- Of the $V_1$ and 777k data from Norway
  - 777k data is the best.
  - $V_1$ is OK, but of lower quality than the data from other countries
- The German data came last, but seems to be of the best quality.
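
The two removal rules listed above can be sketched as a per-locus predicate. This is only an illustration under the assumption that a missing rate and an HWE P-value are already available for each locus; `keep_locus`, the constants and the toy values are hypothetical names and numbers.

```python
import math

MAX_MISSING = 0.10       # remove loci with more than 10% missing calls
MAX_NEG_LOG10_P = 4.0    # remove loci with -log10(HWE P-value) > 4


def keep_locus(missing_rate, hwe_p):
    """Return True if a locus passes both the missingness and HWE rules."""
    if missing_rate > MAX_MISSING:
        return False
    if hwe_p <= 0.0:
        # A P-value that underflowed to 0 has an infinite -log10.
        return False
    return -math.log10(hwe_p) <= MAX_NEG_LOG10_P


# Toy per-locus statistics: name -> (missing rate, HWE P-value).
loci = {
    "snp_a": (0.02, 0.50),   # passes both rules
    "snp_b": (0.20, 0.50),   # too many missing calls
    "snp_c": (0.02, 1e-6),   # -log10(p) = 6, beyond the cut-off
}
kept = [name for name, (miss, p) in loci.items() if keep_locus(miss, p)]
print(kept)
```

Keeping both cut-offs as named constants makes it cheap to rerun the filter with other values.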

#### Action
- [x] I will check Norwegian $V_2$ first
  - Problem temporarily circumvented.
- More inspection needed.

### Results - 1
- [x] Remove loci with missing rate >0.1 (10%)
- [x] Remove loci with HWE P-value <0.0001
- [x] Remove loci with MAF $<0.05$.
  - Memo $nv_1$: `Warning: --hwe observation counts vary by more than 10%.  Consider using --geno, and/or applying different p-value thresholds to distinct subsets of your data.`
  - In `data/genotypes/flt-snp.plk` run `bash ../../../src/snp-superset.sh` for the statistics.
- [x] Find a SNP superset = $\cap(\cup(dd_{1,2,3}, dv_{2,3,7}), \cup(gv_{2,3}), \cup(nv_{1,2,7}))$
- [x] Remove country specific SNP

Note that the steps above take only a few seconds, so the thresholds can easily be changed.
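
The set expression above (union over platforms within a country, then intersection across countries) can be sketched with plain Python sets. The SNP names and the helper `snp_superset` are hypothetical; the real lists would come from the per-platform SNP files.

```python
def snp_superset(platforms_by_country):
    """Intersect, across countries, the union of SNPs over each country's platforms.

    platforms_by_country: dict mapping a country to a non-empty list of
    SNP-name sets, one set per platform. At least one country is required.
    """
    per_country = [set().union(*platforms)
                   for platforms in platforms_by_country.values()]
    return set.intersection(*per_country)


# Toy SNP names only.
toy = {
    "dutch":  [{"s1", "s2"}, {"s2", "s3"}],
    "german": [{"s2", "s3", "s4"}],
    "norge":  [{"s1"}, {"s2", "s3", "s5"}],
}
shared = snp_superset(toy)
print(sorted(shared))  # ['s2', 's3']
```

Any SNP outside `shared` is missing from every platform of at least one country, which is what the "country specific" removal step targets.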

### Final number of SNPs

| maf | $dd_1$ | $dd_2$ | $dd_3$ | $dv_2$ | $dv_3$ | $dv_7$ | $gv_2$ | $gv_3$ | $nv_1$ | $nv_2$ | $nv_7$ | Total |
| --: | --: | --: | --: | --: | --: | --: | --: | --: | --: | --: | --: | --: |
| 0.01 | 7041 | 7158 | 7188 | 40795 | 43182 | 36142 | 41923 | 44028 | 31802 | 41091 | 38917 | 44348 |
| 0.02 | 7003 | 7102 | 7149 | 39371 | 41455 | 35778 | 40644 | 42493 | 30509 | 39555 | 37564 | 42912 |
| 0.05 | 6864 | 6955 | 6985 | 36006 | 37568 | 34397 | 37477 | 39204 | 27826 | 36085 | 34656 | 39754 |

For comparison, the numbers before:

| Target | 50k-V1 | 50k-V2 |    HD | PF-10690 | PF-10993 | PF-11483 |
| --: | --: | --: | --: | --: | --: | --: |
|  51626 |  47877 |  49465 | 47155 |     7364 |     7438 |     7448 |

## ID quality control
I temporarily used the previous results with MAF $=0.02$.

With the missing-ratio threshold set to 0.1:

<img src="fig/dutch-d1-imiss.png" style="width: 400px;"/><img src="fig/dutch-d2-imiss.png" style="width: 400px;"/>

<img src="fig/dutch-d3-imiss.png" style="width: 400px;"/><img src="fig/dutch-v2-imiss.png" style="width: 400px;"/>

<img src="fig/dutch-v3-imiss.png" style="width: 400px;"/><img src="fig/dutch-v7-imiss.png" style="width: 400px;"/>

<img src="fig/german-v2-imiss.png" style="width: 400px;"/><img src="fig/german-v3-imiss.png" style="width: 400px;"/>

<img src="fig/norge-v1-imiss.png" style="width: 400px;"/><img src="fig/norge-v2-imiss.png" style="width: 400px;"/>

<img src="fig/norge-v7-imiss.png" style="width: 400px;"/>

### In Summary
The affected platforms are:

|  | $dd_1$ | $dd_2$ | $dd_3$ | $dv_2$ | $dv_3$ | $dv_7$ | $gv_2$ | $gv_3$ | $nv_1$ | $nv_2$ | $nv_7$ |
| -- | --: | --: | --: | --: | --: | --: | --: | --: | --: | --: | --: |
| Before | 1994 | 102 | 109 | 343 | 44 | 10 | 69 | 732 | 5664 | 5621 | 2206 | 
| After | 1888 | - | 107 | 338 | - | - | - | - | 5627 | 5594 | 2192 |

Note:
- The number of reference IDs in the Dutch data, $dv_{2,3,7}$, is 392. A further 2017 IDs genotyped with the LD platforms are to be imputed. The results **might be ugly**.
- I removed country-specific SNPs at this stage
  - 42861 SNPs left for downstream
- Imputation was based on the above set.
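
A minimal sketch of an ID filter of this kind follows, assuming genotypes are held per ID as a list with `None` for a missing call and that an ID is dropped when its missing ratio exceeds the threshold. The data layout and the function names are hypothetical, chosen only for illustration.

```python
def id_missing_rates(genotypes):
    """Return the missing ratio per ID; None marks a missing call."""
    rates = {}
    for sample_id, calls in genotypes.items():
        if not calls:
            rates[sample_id] = 1.0  # no loci at all: treat as fully missing
        else:
            rates[sample_id] = sum(c is None for c in calls) / len(calls)
    return rates


def filter_ids(genotypes, threshold=0.1):
    """Keep IDs whose missing ratio does not exceed the threshold."""
    rates = id_missing_rates(genotypes)
    return [sample_id for sample_id, rate in rates.items() if rate <= threshold]


# Toy data: two IDs, ten loci each.
toy = {
    "id1": [0, 1, 2, 0, 1, 2, 0, 1, 2, 0],           # no missing calls
    "id2": [0, None, 2, None, 1, None, 0, 1, 2, 0],  # 3 of 10 missing
}
kept = filter_ids(toy)
print(len(toy), "before,", len(kept), "after")
```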

## Considerations
- Remove country-specific SNPs after imputation
- Imputation took most of the time in the above steps, ca. 30-60 minutes.
- $nv_2$ is a troublemaker.
  - Actually, none of the 3 file sets from Norway, $nv_{1,2,7}$, can go through `beagle.jar`
  - After QC, the missing rate can go up to **>80%** if converted from `bed` to `vcf` again.
- Another issue is that the same SNP may have different allele-1 and allele-2 on different platforms
  - Will this affect genotypes if not unified?
  - How to unify them? Any experience?
- Going through the above pipeline takes at most 1-2 h
  - Hence suggestions are very important.
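
On the allele-1/allele-2 question above, one way to look for inconsistencies is to compare the allele pair of each shared SNP between two platforms. The sketch below assumes alleles are given as single letters and that genotypes are stored as counts of allele-1 (0/1/2); `compare_alleles` and `recode_count` are hypothetical helpers, not part of the pipeline.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def compare_alleles(ref_pair, other_pair):
    """Classify how other_pair (allele-1, allele-2) relates to ref_pair.

    Caveat: A/T and C/G SNPs are strand-ambiguous; a swap and a strand
    flip cannot be told apart from the letters alone, and this function
    reports them as "swapped".
    """
    a1, a2 = ref_pair
    b1, b2 = other_pair
    if (b1, b2) == (a1, a2):
        return "same"
    if (b1, b2) == (a2, a1):
        return "swapped"
    f1, f2 = COMPLEMENT.get(b1, b1), COMPLEMENT.get(b2, b2)
    if (f1, f2) == (a1, a2):
        return "flipped"
    if (f1, f2) == (a2, a1):
        return "flipped+swapped"
    return "mismatch"


def recode_count(count, status):
    """Re-express an allele-1 count (0/1/2) in the reference coding."""
    if status in ("same", "flipped"):
        return count
    if status in ("swapped", "flipped+swapped"):
        return 2 - count
    raise ValueError("alleles cannot be reconciled: " + status)


print(compare_alleles(("A", "G"), ("G", "A")))
print(recode_count(2, "swapped"))
```

Under the count-of-allele-1 assumption, a swapped pair turns a count $c$ into $2 - c$, so genotypes would change unless the coding is unified first; whether the tools used here already handle this is not something the sketch can answer.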
  
### About the codes
- ~2k lines
  - many thrown away
  - many can be thrown away
  - many can be reused
- ~2 h to run to date.