In [1]:
import os.path as osp
import pandas as pd
%run ../init/benchmark.py
register_timeop_magic(get_ipython(), 'plink')
data_dir = osp.expanduser('~/data/gwas/tutorial/2_PS_GWAS')
data_dir

'/home/eczech/data/gwas/tutorial/2_PS_GWAS'

### Step 1: Convert VCF to PLINK

In [2]:
%%timeop -o ps1-convert
%%bash -s "$data_dir"
set -e
cd $1

## Download 1000 Genomes data ##
# This 1000 Genomes file contains genetic data for 629 individuals from different ethnic backgrounds.
# Note: the file is very large (>60 GB).
#*# Do this externally in a background process:
# wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz

# Convert vcf to Plink format.
plink --vcf ALL.2of4intersection.20100804.genotypes.vcf.gz --make-bed --out ALL.2of4intersection.20100804.genotypes

PLINK v1.90b6.14 64-bit (7 Jan 2020)           www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ALL.2of4intersection.20100804.genotypes.log.
Options in effect:
  --make-bed
  --out ALL.2of4intersection.20100804.genotypes
  --vcf ALL.2of4intersection.20100804.genotypes.vcf.gz

128535 MB RAM detected; reserving 64267 MB for main workspace.
--vcf: ALL.2of4intersection.20100804.genotypes-temporary.bed +
ALL.2of4intersection.20100804.genotypes-temporary.bim +
ALL.2of4intersection.20100804.genotypes-temporary.fam written.
25488488 variants loaded from .bim file.
629 people (0 males, 0 females, 629 ambiguous) loaded from .fam.
Ambiguous sex IDs written to ALL.2of4intersection.20100804.genotypes.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 629 founders and 0 nonfounders present.
Calculating allele frequencies...

Note that the original 62G gzipped VCF converts to a PLINK dataset of roughly 4.4G (`.bed` plus `.bim`):

In [5]:
%%bash -s "$data_dir"
set -e
cd $1
du -ch ALL.2of4intersection.20100804.genotypes*

3.8G	ALL.2of4intersection.20100804.genotypes.bed
607M	ALL.2of4intersection.20100804.genotypes.bim
16K	ALL.2of4intersection.20100804.genotypes.fam
4.0K	ALL.2of4intersection.20100804.genotypes.log
12K	ALL.2of4intersection.20100804.genotypes.nosex
62G	ALL.2of4intersection.20100804.genotypes.vcf.gz
66G	total
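
The size of the `.bed` file can be sanity-checked from the variant and sample counts in the conversion log. The following is a minimal sketch, assuming the standard PLINK 1 variant-major layout (3 header bytes, then 2 bits per genotype, with each variant padded to a whole byte). `expected_bed_bytes` is an illustrative helper, not part of PLINK.

```python
import math

def expected_bed_bytes(n_variants: int, n_samples: int) -> int:
    """Expected size in bytes of a variant-major PLINK 1 .bed file."""
    # Four 2-bit genotypes fit in one byte; each variant is padded to a full byte.
    bytes_per_variant = math.ceil(n_samples / 4)
    # The file starts with 3 header bytes (2 magic bytes + 1 mode byte).
    return 3 + n_variants * bytes_per_variant

# Counts taken from the PLINK log above: 25488488 variants, 629 people.
size = expected_bed_bytes(25_488_488, 629)
print(size, f"{size / 2**30:.2f} GiB")
```

This comes to about 3.75 GiB, in line with the 3.8G that `du -h` reports for the `.bed` file.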


In [2]:
%%timeop -o ps1-assignrsid
%%bash -s "$data_dir"
set -e
cd $1

# Note that 'ALL.2of4intersection.20100804.genotypes.bim' contains SNPs without an rs-identifier; these SNPs are indicated with ".".
# The same can be observed in 'ALL.2of4intersection.20100804.genotypes.vcf.gz'. To inspect that file, use: zmore ALL.2of4intersection.20100804.genotypes.vcf.gz
# The missing rs-identifiers in the 1000 Genomes data are not a problem for this tutorial.
# However, as good practice, we assign unique identifiers to the SNPs with a missing rs-identifier (i.e., the SNPs with ".").
plink --bfile ALL.2of4intersection.20100804.genotypes --set-missing-var-ids @:#[b37]\$1,\$2 --make-bed --out ALL.2of4intersection.20100804.genotypes_no_missing_IDs

PLINK v1.90b6.14 64-bit (7 Jan 2020)           www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ALL.2of4intersection.20100804.genotypes_no_missing_IDs.log.
Options in effect:
  --bfile ALL.2of4intersection.20100804.genotypes
  --make-bed
  --out ALL.2of4intersection.20100804.genotypes_no_missing_IDs
  --set-missing-var-ids @:#[b37]$1,$2

128535 MB RAM detected; reserving 64267 MB for main workspace.
25488488 variants loaded from .bim file.
10375501 missing IDs set.
629 people (0 males, 0 females, 629 ambiguous) loaded from .fam.
Ambiguous sex IDs written to
ALL.2of4intersection.20100804.genotypes_no_missing_IDs.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 629 founders and 0 nonfounders present.
Calculating allele frequencies...
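
To make the ID template concrete: in `@:#[b37]$1,$2`, `@` stands for the chromosome code and `#` for the base-pair position, and the `$` signs are escaped in the bash cell so the shell does not replace them with its own positional arguments. The sketch below emulates the template on a tiny in-memory `.bim` table. It assumes, per the PLINK 1.9 documentation, that `$1` and `$2` are the two allele codes in ASCII-sorted order. The toy rows and the `fill_missing_ids` helper are made up for illustration.

```python
import io
import pandas as pd

# Toy .bim content (tab-separated, no header); the values are invented.
BIM_TEXT = "1\trs123\t0\t1000\tA\tG\n1\t.\t0\t2000\tT\tC\n2\t.\t0\t3000\tG\tA\n"
BIM_COLS = ["chrom", "snp", "cm", "pos", "a1", "a2"]

def fill_missing_ids(bim: pd.DataFrame, build: str = "b37") -> pd.DataFrame:
    """Emulate the '@:#[b37]$1,$2' template for variants whose ID is '.'."""
    out = bim.copy()
    missing = out["snp"] == "."
    rows = out.loc[missing, ["chrom", "pos", "a1", "a2"]].itertuples(index=False)
    # Assumption: $1/$2 are the allele codes in ASCII-sorted order.
    out.loc[missing, "snp"] = [
        f"{chrom}:{pos}[{build}]{','.join(sorted([a1, a2]))}"
        for chrom, pos, a1, a2 in rows
    ]
    return out

bim = pd.read_csv(io.StringIO(BIM_TEXT), sep="\t", header=None,
                  names=BIM_COLS, dtype={"chrom": str})
filled = fill_missing_ids(bim)
print(filled["snp"].tolist())
```

Variants that already carry an rs-identifier are left untouched; only the `.` entries receive a generated ID. In the run above, PLINK reports 10375501 such IDs being set.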

### Step 2: Sample/Variant Absence Filter

As with the HapMap data in the QC steps, remove variants and samples with low call rates, then remove variants with a low minor allele frequency.
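
What the `--geno` and `--mind` thresholds do can be illustrated on a toy genotype matrix before running them on the real data. This is a minimal NumPy sketch, not PLINK's implementation: the matrix, the 0.5 thresholds, and the helper names are assumptions chosen so the effect is visible on four samples and four variants.

```python
import numpy as np

# Toy genotype matrix: rows = samples, columns = variants; np.nan = missing call.
G = np.array([
    [0,      1,      np.nan, 2],
    [1,      np.nan, np.nan, 0],
    [2,      0,      np.nan, 1],
    [np.nan, np.nan, np.nan, np.nan],
], dtype=float)

def keep_variants(G: np.ndarray, geno: float) -> np.ndarray:
    """Mask of variants whose missing-call rate does not exceed `geno`."""
    return np.isnan(G).mean(axis=0) <= geno

def keep_samples(G: np.ndarray, mind: float) -> np.ndarray:
    """Mask of samples whose missing-call rate does not exceed `mind`."""
    return np.isnan(G).mean(axis=1) <= mind

# Variant filter first, then sample filter on the remaining variants,
# mirroring the order of the separate PLINK runs in this step.
keep_var = keep_variants(G, geno=0.5)
G1 = G[:, keep_var]
keep_smp = keep_samples(G1, mind=0.5)
G2 = G1[keep_smp]
print(keep_var, keep_smp, G2.shape)
```

Here the third variant (missing in every sample) and the fourth sample (missing at every variant) are dropped. The PLINK commands in the next cell apply the same idea in two passes, first with a lenient 0.2 threshold and then with a stricter 0.02 threshold.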

In [4]:
%%timeop -o ps2-qc
%%bash -s "$data_dir"
set -e
cd $1

## QC on 1000 Genomes data.
# Remove variants with a missing-genotype rate above 20%.
plink --bfile ALL.2of4intersection.20100804.genotypes_no_missing_IDs --geno 0.2 --allow-no-sex --make-bed --out 1kG_MDS

# Remove individuals with a missing-genotype rate above 20%.
plink --bfile 1kG_MDS --mind 0.2 --allow-no-sex --make-bed --out 1kG_MDS2

# Repeat the variant filter with a stricter 2% threshold.
plink --bfile 1kG_MDS2 --geno 0.02 --allow-no-sex --make-bed --out 1kG_MDS3

# Repeat the individual filter with a stricter 2% threshold.
plink --bfile 1kG_MDS3 --mind 0.02 --allow-no-sex --make-bed --out 1kG_MDS4

# Remove variants with a minor allele frequency (MAF) below 5%.
plink --bfile 1kG_MDS4 --maf 0.05 --allow-no-sex --make-bed --out 1kG_MDS5

PLINK v1.90b6.14 64-bit (7 Jan 2020)           www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to 1kG_MDS.log.
Options in effect:
  --allow-no-sex
  --bfile ALL.2of4intersection.20100804.genotypes_no_missing_IDs
  --geno 0.2
  --make-bed
  --out 1kG_MDS

128535 MB RAM detected; reserving 64267 MB for main workspace.
25488488 variants loaded from .bim file.
629 people (0 males, 0 females, 629 ambiguous) loaded from .fam.
Ambiguous sex IDs written to 1kG_MDS.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 629 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.615305.
16481066 variants removed due to missing genotype data (--geno).
9007422 variant