# pVCF to PLINK 2.0

> This notebook shows how to interact with genomic data in bed/bim/fam format using PLINK 2.0. We will learn how to convert between PLINK 1.x and PLINK 2.x file formats, merge variants from different chromosomes into a single file, and filter them based on variant completeness and minor allele frequency (MAF). Please note that this notebook has a long runtime and that no subsequent analyses depend on its output files.

- runtime: 4hrs
- recommended instance: mem1_ssd1_v2_x16
- estimated cost: <£1.50

This notebook depends on:
* **PLINK install**


## List the exome sequences data directories in your project

Please note that, depending on your project's MTA, the list of files might differ.

In [None]:
ls /mnt/project/Bulk/'Exome sequences'/

## List the population-level variant files in pVCF format

In [None]:
ls -lah /mnt/project/Bulk/'Exome sequences'/'Population level exome OQFE variants, pVCF format - final release'/*c1_b1_*gz

### Install and test the PLINK2 binary
#### We recommend installing PLINK using the links available here:
https://www.cog-genomics.org/plink/2.0/

#### You can download the binary (AVX2 Intel; for example, using `wget <URL>`), unzip it (`unzip <zip file>`), then make it executable (`chmod a+x <name>`)

#### If preferred, PLINK is also available in the following locations:
https://anaconda.org/bioconda/plink2; https://github.com/chrchang/plink-ng

#### Once installed, continue with the below code chunks.


In [None]:
# Test plink works
docker run quay.io/biocontainers/plink2:2.0.0a.6.9--h9948957_0 plink2 --help

### Next install and test BCFTOOLS
#### Following instructions here: http://samtools.github.io/bcftools/howtos/install.html, enter the following code (NB a large amount of text output will follow):

In [None]:
docker run quay.io/biocontainers/bcftools:1.21--h3a4d415_1 bcftools --help

## Get reference genome

In [None]:
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa.fai

In [None]:
# Upload reference genome
dx mkdir ref_gen
dx upload GRCh38* --path ref_gen/

In [None]:
REF=`ls *fa`
echo $REF

## Find pVCF path(s)

In [None]:
# Find the UKB block pVCF file for chromosome 1 (block 1) by name and download it
dx find data --brief --name ukb23157_c1_b1_v1.vcf.gz | xargs dx download

In [None]:
VCF=`ls *vcf.gz`
echo $VCF

## Run bcftools normalization
This procedure left-aligns and normalizes indels, checks that REF alleles match the reference, and splits multiallelic sites into multiple rows. More info here: https://samtools.github.io/bcftools/bcftools.html#norm

In [None]:
%%bash
# Single file run: just to test IO-bound threads
time docker run -v "$PWD":"$PWD" -w "$PWD" quay.io/biocontainers/bcftools:1.21--h3a4d415_1 \
  bcftools norm --threads "$(nproc)" -f "$REF" -m -any -Oz -o "${VCF%.*.*}.norm.vcf.gz" "$VCF"

# Now for batch processing using xargs (multiple processes with multiple threads in bcftools)
batch_file_names="filelist.txt"
echo "$VCF" > "$batch_file_names"

# Read files from batch_file_names into array
mapfile -t file_array < "$batch_file_names"

# Define function to run dockerized bcftools
process_file() {
    local file="$1"
    docker run -v "$PWD":"$PWD" -w "$PWD" quay.io/biocontainers/bcftools:1.21--h3a4d415_1 \
      bcftools norm --threads 1 -f "$REF" -m -any -Oz -o "$(basename "$file" .vcf.gz).norm.vcf.gz" "$file"
}

export -f process_file
export REF

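# Run parallel docker jobs using xargs (-P = number of parallel processes, -I {} = one input line per command).
# -n is omitted because GNU xargs treats -n and -I as mutually exclusive and warns when both are given.
printf "%s\n" "${file_array[@]}" | xargs -P "$(($(nproc)/2))" -I {} bash -c 'process_file "$@"' _ {}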
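
Before launching the real batch, it can help to dry-run the fan-out. The sketch below reuses the `xargs -P ... -I {}` pattern but swaps the dockerized `bcftools` call for an inline command that only prints the output name each worker would write. The three file names are hypothetical placeholders, not real block files.

```shell
# Hypothetical block file names (placeholders), one per line.
# Dry run: each worker only prints the output name it would write.
# -P 2 allows up to two workers at once; -I {} hands one input line to each worker.
planned_outputs=$(printf '%s\n' ukb_c1_b1.vcf.gz ukb_c1_b2.vcf.gz ukb_c1_b3.vcf.gz |
    xargs -P 2 -I {} sh -c 'echo "$(basename "$1" .vcf.gz).norm.vcf.gz"' _ {} |
    sort)   # workers finish in arbitrary order, so sort for stable output
echo "$planned_outputs"
```

Sorting matters only for display here: with `-P` greater than 1 the completion order of workers is not guaranteed.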


In [None]:
VCF=`ls *norm.vcf.gz`
echo $VCF
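
A quick sanity check after normalization is to compare record counts: since multiallelic sites are split into multiple rows, the normalized file should have at least as many records as the input. The sketch below is one way to do that; `count_records` is a hypothetical helper, and the two tiny files are synthetic stand-ins written by hand to mimic a split, not real bcftools output.

```shell
# Count non-header records in a gzip- or bgzip-compressed VCF
count_records() {
    gzip -dc "$1" | grep -vc '^#'
}

# Synthetic demo files in a scratch directory (assumption: hand-written examples)
demo_dir=$(mktemp -d)
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\nchr1\t100\t.\tA\tC,G\n' |
    gzip > "$demo_dir/demo.vcf.gz"
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\nchr1\t100\t.\tA\tC\nchr1\t100\t.\tA\tG\n' |
    gzip > "$demo_dir/demo.norm.vcf.gz"

before=$(count_records "$demo_dir/demo.vcf.gz")
after=$(count_records "$demo_dir/demo.norm.vcf.gz")
echo "records before: $before, after: $after"
```

On the real data, the same helper could be pointed at the downloaded pVCF block and its `.norm.vcf.gz` counterpart.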

## Make a PLINK bed file

In [None]:
docker run -v "$PWD":"$PWD" -w "$PWD" quay.io/biocontainers/plink2:2.0.0a.6.9--h9948957_0 plink2 \
    --vcf "$VCF" \
    --vcf-idspace-to _ \
    --double-id \
    --allow-extra-chr 0 \
    --make-bed \
    --vcf-half-call m \
    --out "${VCF/.vcf.gz/""}"

## Convert the pVCF to a PLINK 2.x formatted dataset (pgen/pvar/psam)
PLINK 2.x formatted files are faster to work with and significantly smaller than PLINK 1.x formatted files.
However, the PLINK 1.x format is more popular and more widely supported.

In [None]:
time docker run -v "$PWD":"$PWD" -w "$PWD" quay.io/biocontainers/plink2:2.0.0a.6.9--h9948957_0 plink2 \
  --no-pheno \
  --vcf "$VCF" \
  --vcf-half-call 'haploid' \
  --make-pgen \
  --out "${VCF/.vcf.gz/""}"
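
The `--out` prefix in the commands above is derived from the VCF name by removing the `.vcf.gz` suffix. The sketch below shows the same derivation with the portable suffix-removal form `${VAR%suffix}`, which only strips a trailing match. `VCF_DEMO` and `PREFIX` are illustrative names, and the file name follows the pattern of the normalized file used in this notebook.

```shell
# Illustrative file name following the notebook's naming pattern
VCF_DEMO="ukb23157_c1_b1_v1.norm.vcf.gz"

# Strip the trailing .vcf.gz to get the output prefix
PREFIX="${VCF_DEMO%.vcf.gz}"
echo "$PREFIX"

# File names a pgen/pvar/psam fileset would use with this prefix
for ext in pgen pvar psam; do
    echo "$PREFIX.$ext"
done
```

Note that this assumes the variable holds exactly one file name; if `ls *norm.vcf.gz` matches several files, the prefix would need to be derived per file.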

## Convert to BED/BIM/FAM (PLINK 1.x format)

`--max-alleles` excludes variants with more alleles than the indicated value (REF counts as one allele, so `--max-alleles 2` keeps biallelic variants only). When a variant has exactly one ALT allele and it's a missing-code, these filters treat it as having only one allele.
> see here: https://groups.google.com/g/plink2-users/c/rxMlVLIX-JA?pli=1 and https://github.com/meyer-lab-cshl/plinkQC/issues/10
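
To make the allele-count rule concrete, the sketch below mimics it with `awk` on three synthetic records: the allele count is taken as 1 (REF) plus the number of comma-separated ALT entries, and records with more than two alleles are dropped. This is only an illustration of the counting, not PLINK's implementation, and it ignores the missing-code special case described above. Since the input here was already split with `bcftools norm -m -any`, the filter probably acts mainly as a safeguard.

```shell
# Three synthetic records (CHROM POS ID REF ALT); only the ALT column matters here
demo_records=$(printf 'chr1\t100\t.\tA\tC\nchr1\t200\t.\tG\tT,C\nchr1\t300\t.\tT\tA\n')

# Keep positions whose allele count (REF + ALT entries) is at most 2
kept=$(printf '%s\n' "$demo_records" |
    awk -F'\t' '{ n = split($5, alts, ","); if (n + 1 <= 2) print $2 }')
echo "$kept"
```

The record at position 200 has two ALT alleles (three alleles in total), so it is the only one filtered out.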

In [None]:
docker run -v "$PWD":"$PWD" -w "$PWD" quay.io/biocontainers/plink2:2.0.0a.6.9--h9948957_0 plink2 \
  --no-pheno \
  --vcf "$VCF" \
  --vcf-half-call 'haploid' \
  --max-alleles 2 \
  --make-bed \
  --out test_vcf_bed

## Validate the output files

In [None]:
docker run -v "$PWD":"$PWD" -w "$PWD" quay.io/biocontainers/plink2:2.0.0a.6.9--h9948957_0 plink2 \
  --pfile "${VCF/.vcf.gz/""}" \
  --validate

In [None]:
docker run -v "$PWD":"$PWD" -w "$PWD" quay.io/biocontainers/plink2:2.0.0a.6.9--h9948957_0 plink2 \
  --bfile test_vcf_bed \
  --validate
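
Alongside `--validate`, a lightweight check that every file of a fileset exists and is non-empty can catch an incomplete conversion early. The sketch below uses a hypothetical helper, `check_fileset`, and demonstrates it on placeholder files in a scratch directory rather than on real PLINK output.

```shell
# Report whether all files <prefix>.<ext> exist and are non-empty
check_fileset() {
    prefix="$1"
    shift
    for ext in "$@"; do
        if [ ! -s "$prefix.$ext" ]; then
            echo "missing or empty: $prefix.$ext"
            return 1
        fi
    done
    echo "complete: $prefix"
}

# Demonstration on placeholder files (assumption: not real PLINK output)
demo_dir=$(mktemp -d)
for e in bed bim fam; do
    printf 'x' > "$demo_dir/demo.$e"
done
check_fileset "$demo_dir/demo" bed bim fam
```

For the outputs of this notebook, the calls would look like `check_fileset test_vcf_bed bed bim fam` and `check_fileset "${VCF%.vcf.gz}" pgen pvar psam`.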

In [None]:
!dx upload ukb23157_c1_b1_v1.norm.vcf.gz --path bed_maf/