# 1000 Genomes Data - Human Genome Diversity Project 

We obtained the HGDP dataset using PLINK (https://www.cog-genomics.org/plink/2.0/resources#phase3_1kg) and added some pre-processing steps.
Execute the following cells to obtain the data.
Note that approximately 200 GB of storage is required!
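
Before starting the download, it can help to confirm that enough disk space is available. The following is a minimal sketch using only the Python standard library; the helper names and the choice to check the current working directory are assumptions made for illustration, and the 200 GB threshold is the approximate figure stated above.

```python
import shutil

# Approximate storage requirement stated above, in gigabytes (10^9 bytes).
REQUIRED_GB = 200


def free_space_gb(path="."):
    """Return the free space, in GB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_space(path=".", required_gb=REQUIRED_GB):
    """Return True if at least `required_gb` GB are free at `path`."""
    return free_space_gb(path) >= required_gb


print(f"Free space: {free_space_gb():.1f} GB, sufficient: {has_enough_space()}")
```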

### Workflow:
1. Download the data
2. Filter the dataset to keep only the autosomal data
3. Convert the dataset to PLINK binary data (PACKEDPED: `.bed`, `.bim`, `.fam` files)
4. LD prune the dataset
5. Convert the dataset to EIGENSTRAT (`.ind`, `.snp`, `.geno` files)
6. Reannotate the correct populations

In [None]:
! wget https://www.dropbox.com/s/hppj1g1gzygcocq/hgdp_all.pgen.zst?dl=1 && mv 'hgdp_all.pgen.zst?dl=1' hgdp_all.pgen.zst
! wget https://www.dropbox.com/s/1mmkq0bd9ax8rng/hgdp_all.pvar.zst?dl=1 && mv 'hgdp_all.pvar.zst?dl=1' hgdp_all.pvar.zst
! wget https://www.dropbox.com/s/0zg57558fqpj3w1/hgdp.psam?dl=1 && mv 'hgdp.psam?dl=1' hgdp_all.psam

In [None]:
! plink2 --zst-decompress hgdp_all.pgen.zst  > hgdp_all.pgen
! plink2 --pfile hgdp_all vzs --max-alleles 2 --make-bed --out hgdp_all

In [None]:
# "--chr 1-22" excludes all variants not on the listed chromosomes
# "--output-chr 26" uses numeric chromosome codes
# "--max-alleles 2": PLINK 1 binary does not allow multi-allelic variants
# "--rm-dup" removes duplicate-ID variants
# "--set-missing-var-ids" replaces missing variant IDs with IDs built from the given template

! plink2 --pfile hgdp_all vzs \
       --chr 1-22 \
       --output-chr 26 \
       --max-alleles 2 \
       --rm-dup exclude-mismatch \
       --set-missing-var-ids '@_#_$1_$2' \
       --new-id-max-allele-len 510 \
       --make-pgen \
       --out hgdp_autosomes
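
The ID template `'@_#_$1_$2'` used above can be illustrated in plain Python. This is only a sketch of how we understand the template to expand, assuming PLINK 2's documented semantics (`@` is the chromosome code, `#` the base-pair position, and `$1`/`$2` the two alleles in ASCII-sorted order). The function name `make_variant_id` is hypothetical and not part of PLINK, and the allele-length limit set by `--new-id-max-allele-len` is not modeled.

```python
def make_variant_id(chrom, pos, ref, alt):
    """Mimic the '@_#_$1_$2' template: chromosome, position, sorted alleles.

    Assumption: '$1' and '$2' are the two alleles in ASCII-sorted order,
    so the resulting ID does not depend on which allele is the reference.
    """
    first, second = sorted([ref, alt])
    return f"{chrom}_{pos}_{first}_{second}"


# Toy example: swapping ref and alt yields the same ID.
print(make_variant_id("1", 12345, "T", "A"))
print(make_variant_id("1", 12345, "A", "T"))
```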

In [None]:
# Convert pgen to bed and drop variants with a minor allele frequency below 0.005
# (this removes monomorphic and very rare SNPs)

! plink2 --pfile hgdp_autosomes \
       --maf 0.005 \
       --make-bed \
       --out hgdp_autosomes

In [None]:
# LD pruning with an r^2 threshold of 0.5, using a sliding window of 50 variants and a step size of 5 variants
! plink2 --bfile hgdp_autosomes \
       --indep-pairwise 50 5 0.5 \
       --rm-dup exclude-mismatch \
       --set-missing-var-ids '@_#_$1_$2' \
       --new-id-max-allele-len 341 \
       --out hgdp_autosomes_ld_pruned
! plink2 --bfile hgdp_autosomes --extract hgdp_autosomes_ld_pruned.prune.in --out hgdp_autosomes_ld_pruned --make-bed
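
To check how many variants survive LD pruning, the line counts of the `.bim` file (one variant per line) and the `.prune.in` file (one retained variant ID per line) can be compared. The sketch below uses only the standard library; the helper names are hypothetical, and the file names are the ones produced by the commands above.

```python
from pathlib import Path


def count_lines(path):
    """Count the non-empty lines of a text file."""
    with Path(path).open() as handle:
        return sum(1 for line in handle if line.strip())


def pruning_summary(bim, prune_in):
    """Return (total variants, retained variants, retained fraction)."""
    total = count_lines(bim)
    kept = count_lines(prune_in)
    fraction = kept / total if total else 0.0
    return total, kept, fraction


bim = Path("hgdp_autosomes.bim")
prune_in = Path("hgdp_autosomes_ld_pruned.prune.in")
# Only report if the pruning step has already been run in this directory.
if bim.exists() and prune_in.exists():
    total, kept, fraction = pruning_summary(bim, prune_in)
    print(f"Retained {kept} of {total} variants ({fraction:.1%})")
```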

In [None]:
# convert the dataset to EIGENSTRAT format
import pathlib
import pandas as pd

from pandora.converter import run_convertf
from pandora.custom_types import FileFormat

run_convertf(
    convertf="convertf",
    in_prefix=pathlib.Path("hgdp_autosomes_ld_pruned"),
    in_format=FileFormat.PACKEDPED,
    out_prefix=pathlib.Path("hgdp_autosomes_ld_pruned"),
    out_format=FileFormat.EIGENSTRAT
)

# and reannotate the population information in the .ind file for visualization later on
sample_info = pd.read_parquet("1000genomes/HGDP.parquet")
ind_file = pathlib.Path("hgdp_autosomes_ld_pruned.ind")

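new_ind_content = []
for line in ind_file.read_text().splitlines():
    sample_id, sex, _ = line.split()
    population = sample_info.loc[lambda x: x.sample_id == sample_id].population
    # exactly one metadata row must match each sample
    assert population.shape[0] == 1
    # .ind columns are whitespace-separated, so population labels must not contain spaces;
    # Series.replace only matches whole values, so replace on the extracted string instead
    population = population.item().replace(" ", "_")
    new_ind_content.append(f"{sample_id} {sex} {population}")

ind_file.write_text("\n".join(new_ind_content) + "\n")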
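
Because the `.ind` file is whitespace-separated with three columns (sample ID, sex, population), a population label that still contains a space would produce a line with more than three fields. The following sketch checks for such lines; the helper name `find_invalid_ind_lines` is hypothetical, and the sample lines in the comment are toy data rather than real HGDP entries.

```python
from pathlib import Path


def find_invalid_ind_lines(lines):
    """Return (line number, line) pairs that do not have exactly three fields."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if line.strip() and len(line.split()) != 3
    ]


# Toy illustration: the second line has a space inside the population label.
# find_invalid_ind_lines(["sampleA M Pop_One", "sampleB F Pop Two"])

ind_path = Path("hgdp_autosomes_ld_pruned.ind")
# Only validate if the reannotated file exists in this directory.
if ind_path.exists():
    invalid = find_invalid_ind_lines(ind_path.read_text().splitlines())
    print(f"{len(invalid)} malformed line(s) in {ind_path}")
```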

## Run Pandora

To run Pandora for the sliding-window analyses, paste the following into a file called `sliding_window.yaml` and then run the analysis with `pandora -c sliding_window.yaml`.
We ran the analysis three times with different values of `n_replicates`: `12`, `50`, and `100`. Make sure to adjust `result_dir` accordingly for each run.

```yaml
dataset_prefix: hgdp_autosomes_ld_pruned
result_dir: sliding_window/hgdp_12_windows
n_replicates: 12
analysis_mode: SLIDING_WINDOW
seed: 0
```
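
Since the three runs differ only in `n_replicates` and `result_dir`, the config files can also be generated programmatically. The sketch below is one way to do this with the standard library. The output file names (`sliding_window_<n>.yaml`) and the helper names are hypothetical, and the `result_dir` pattern for `50` and `100` is an assumption extrapolated from the `12`-replicate example above.

```python
from pathlib import Path

# The three settings of n_replicates mentioned above.
REPLICATE_SETTINGS = (12, 50, 100)


def render_config(n_replicates):
    """Render the sliding-window config for one value of n_replicates."""
    return (
        "dataset_prefix: hgdp_autosomes_ld_pruned\n"
        # Assumed naming pattern, following the 12-replicate example.
        f"result_dir: sliding_window/hgdp_{n_replicates}_windows\n"
        f"n_replicates: {n_replicates}\n"
        "analysis_mode: SLIDING_WINDOW\n"
        "seed: 0\n"
    )


def write_configs(out_dir):
    """Write one config file per setting into out_dir and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for n_replicates in REPLICATE_SETTINGS:
        path = out_dir / f"sliding_window_{n_replicates}.yaml"
        path.write_text(render_config(n_replicates))
        paths.append(path)
    return paths


print(render_config(12))
```

Calling `write_configs(".")` would then create the three files, each of which can be passed to Pandora in turn, for example `pandora -c sliding_window_50.yaml`.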