This repo will take you through the steps to merge the HGDP and 1000G reference files into a single plink binary file.
Once the repo has been downloaded, make sure that you meet all of the requirements and download the necessary files into their respective folders.
To do that, first go to the DataBases folder and read the README files indicating what needs to be downloaded in each subfolder (both the HGDP and the 1000G folders).
After everything has been downloaded, you can start running the script, located in the Code folder.
This is a Python notebook, so you can run it interactively and modify it to suit your needs.
In summary, the script follows these steps:
- Transform the HGDP into plink files
- LiftOver the HGDP from hg18 to hg19
- Extract only the SNPs found in the HGDP from the 1000G vcf files
- Concatenate the different chromosomes and export to plink files
- Merge the HGDP and 1000G
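The steps above can be sketched as a sequence of command lines. This is only an illustrative outline, not the notebook's actual code: every file name and prefix below (hgdp_hg18.bed, 1000G_chr21.vcf.gz, hgdp_snps.txt, kg_all, and so on) is a hypothetical placeholder, and the first step (converting the HGDP Stanford files into plink files) is left out because it depends on the notebook's own parsing. The function only builds the commands, so it can be inspected without any of the tools installed.

```python
from typing import List


def build_commands(chromosomes: List[str]) -> List[List[str]]:
    """Build (but do not run) one possible command sequence for the merge.

    All file names are hypothetical placeholders for illustration only.
    """
    commands: List[List[str]] = []

    # LiftOver HGDP positions from hg18 to hg19:
    # liftOver <input.bed> <chain file> <output.bed> <unmapped.bed>
    commands.append([
        "liftOver", "hgdp_hg18.bed", "hg18ToHg19.over.chain.gz",
        "hgdp_hg19.bed", "hgdp_unmapped.bed",
    ])

    # Keep only the SNPs found in the HGDP, one 1000G chromosome at a time.
    recoded = []
    for chrom in chromosomes:
        out_prefix = f"kg_chr{chrom}"
        commands.append([
            "vcftools", "--gzvcf", f"1000G_chr{chrom}.vcf.gz",
            "--snps", "hgdp_snps.txt", "--recode", "--out", out_prefix,
        ])
        # vcftools --recode writes <prefix>.recode.vcf
        recoded.append(f"{out_prefix}.recode.vcf")

    # Concatenate the chromosomes and export to plink binary files.
    commands.append(["bcftools", "concat", "-O", "z", "-o", "kg_all.vcf.gz", *recoded])
    commands.append(["plink", "--vcf", "kg_all.vcf.gz", "--make-bed", "--out", "kg_all"])

    # Merge the HGDP and 1000G plink files.
    commands.append([
        "plink", "--bfile", "hgdp_hg19", "--bmerge", "kg_all",
        "--make-bed", "--out", "hgdp_1000g",
    ])
    return commands


if __name__ == "__main__":
    for cmd in build_commands(["21", "22"]):
        print(" ".join(cmd))
```

Each list could be passed to subprocess.run once the placeholder names are replaced with the real paths used in the notebook.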
This script was run on a Linux machine, using Ubuntu 18.04. You will need the following programs:
- Python 3.x: I recommend installing Python 3.x using Anaconda.
For the following programs, you can use the bioconda channel to install them through Anaconda. To do that, once you've installed Anaconda, follow the instructions here. The script will assume that all of the following programs are in your path.
- Plink: to install it using bioconda use the following command
conda install plink
- Vcftools: to install it using bioconda use the following command
conda install vcftools
- Bcftools: to install it using bioconda use the following command
conda install bcftools
- UCSC liftOver: to install it using bioconda use the following command
conda install ucsc-liftover
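Since the script assumes these programs are in your path, a quick check before running the notebook can save time. The snippet below is a minimal sketch, assuming the executables are named plink, vcftools, bcftools, and liftOver (the names the bioconda packages normally provide); adjust the list if your installation differs.

```python
import shutil
from typing import List

# Executable names are an assumption based on the usual bioconda packages.
REQUIRED_TOOLS = ["plink", "vcftools", "bcftools", "liftOver"]


def find_missing(tools: List[str]) -> List[str]:
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_TOOLS)
    if missing:
        print("Not found in your path:", ", ".join(missing))
    else:
        print("All required programs were found.")
```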
To ease the process, there is a conda environment file in Code/mergeref.yml.
With Anaconda already installed, you can create the same environment used to run the script:
conda env create -f mergeref.yml
conda activate mergeref
In the DataBases folder you'll need to download the respective files.
In each folder (HGDP and 1000G) there is a README file with the same information.
Download the following files and place them in the DataBases/HGDP folder.
The HGDP Stanford files can be downloaded from here.
You will also need to download the Sample Information from here.
You'll need to download the chain file that tells liftOver how to convert from hg18 to hg19 from here.
Finally, you'll need to download the file listing each SNP's rsID, chromosome, and position, as indicated in the respective README file.
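liftOver applies the chain file to BED-style intervals by default rather than to plink map files, so the HGDP positions have to be written in that format first. The helper below is a hypothetical sketch of one common way to do this (it is not taken from the notebook): it turns one line of a plink .map file into a BED line, using a 0-based start and a chr prefix as UCSC chain files expect.

```python
# plink numeric codes for non-autosomal chromosomes and their UCSC names.
# Code 25 (XY, pseudo-autosomal region) is deliberately not handled here.
PLINK_TO_UCSC = {"23": "X", "24": "Y", "26": "M"}


def map_line_to_bed(line: str) -> str:
    """Convert one plink .map line (chrom, SNP id, cM, bp) to a BED line."""
    chrom, snp_id, _genetic_distance, pos = line.split()
    position = int(pos)
    chrom = PLINK_TO_UCSC.get(chrom, chrom)
    # BED intervals are 0-based and half-open: [position - 1, position)
    return f"chr{chrom}\t{position - 1}\t{position}\t{snp_id}"


if __name__ == "__main__":
    print(map_line_to_bed("1 rs123 0 1000"))
```

After liftOver runs, the fourth column (the SNP id) can be used to match the new hg19 positions back to the original markers, and the unmapped output lists the SNPs that would need to be dropped.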
Download the following files and place them in the DataBases/1000G folder.
The 1000G Phase 3 files can be downloaded from here.
Here you can see the steps for merging the HGDP and 1000G databases.