tomszar/HGDP_1000G_Merge

Merging HGDP and 1000G

This repo takes you through the steps to merge the HGDP and 1000G reference files into a single plink binary file. Once the repo has been downloaded, make sure that you meet all of the requirements, and download the necessary files into their respective folders. To do that, first go to the DataBases folder and read the README files (one in the HGDP folder and one in the 1000G folder), which indicate what needs to be downloaded into each. After everything has been downloaded, you can start running the script, located in the Code folder. The script is a Python notebook, so you can run it interactively and modify it to your needs. In summary, the script follows these steps:

  • Transform the HGDP into plink files
  • LiftOver the HGDP from hg18 to hg19
  • Extract only the SNPs found in the HGDP from the 1000G vcf files
  • Concatenate the different chromosomes and export to plink files
  • Merge the HGDP and 1000G
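
The steps above can be sketched as a list of command lines, as a rough orientation before opening the notebook. This is a minimal sketch under stated assumptions, not the notebook's actual code: every file name and prefix (hgdp_hg18.bed, 1000g_chr1.vcf.gz, hgdp_snps.txt, and so on) is hypothetical, the chain file name is assumed to be hg18ToHg19.over.chain, and the choice of vcftools for SNP extraction and bcftools for concatenation is one plausible reading of the step list. The helper map_to_bed_lines shows how plink .map rows can be turned into the BED rows that liftOver expects; it only handles numeric autosome codes.

```python
def map_to_bed_lines(map_lines):
    """Convert plink .map rows (chrom, SNP id, cM, bp) to UCSC BED rows.

    BED uses 0-based starts and "chr"-prefixed names, so a SNP at
    1-based position p becomes the interval [p - 1, p).
    """
    bed = []
    for line in map_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        chrom, snp_id, _cm, pos = fields[:4]
        pos = int(pos)
        bed.append(f"chr{chrom}\t{pos - 1}\t{pos}\t{snp_id}")
    return bed


def pipeline_commands(hgdp="hgdp", kg="1000g",
                      chain="hg18ToHg19.over.chain", chroms=range(1, 23)):
    """Return the command lines for the liftOver, extract, concat and merge steps.

    All prefixes and file names are hypothetical placeholders.
    """
    cmds = []
    # Lift the HGDP positions from hg18 to hg19.
    cmds.append(["liftOver", f"{hgdp}_hg18.bed", chain,
                 f"{hgdp}_hg19.bed", f"{hgdp}_unmapped.bed"])
    # Keep only the HGDP SNPs in each 1000G chromosome file.
    per_chrom = []
    for c in chroms:
        out = f"{kg}_chr{c}_hgdp"
        cmds.append(["vcftools", "--gzvcf", f"{kg}_chr{c}.vcf.gz",
                     "--snps", f"{hgdp}_snps.txt", "--recode", "--out", out])
        per_chrom.append(f"{out}.recode.vcf")  # vcftools appends .recode.vcf
    # Concatenate the chromosomes and export to plink binary files.
    cmds.append(["bcftools", "concat", "-o", f"{kg}_hgdp.vcf", *per_chrom])
    cmds.append(["plink", "--vcf", f"{kg}_hgdp.vcf",
                 "--make-bed", "--out", f"{kg}_hgdp"])
    # Merge the lifted HGDP fileset with the 1000G fileset.
    cmds.append(["plink", "--bfile", f"{hgdp}_hg19", "--bmerge", f"{kg}_hgdp",
                 "--make-bed", "--out", f"{hgdp}_{kg}"])
    return cmds
```

Each inner list can be passed to subprocess.run(cmd, check=True) once the hypothetical file names are replaced with the real ones. The first step of the list (transforming the HGDP files into plink files) is left out because it depends on the layout of the downloaded files.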

Requirements

This script was run on a Linux machine, using Ubuntu 18.04. You will need the programs listed below.

You can install them from the bioconda channel through Anaconda. To do that, once you've installed Anaconda, follow the instructions here. The script assumes that all of these programs are in your path.

  • Plink: to install it using bioconda, use the following command: conda install plink
  • Vcftools: to install it using bioconda, use the following command: conda install vcftools
  • Bcftools: to install it using bioconda, use the following command: conda install bcftools
  • UCSC liftOver: to install it using bioconda, use the following command: conda install ucsc-liftover
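
Because the script assumes these programs are in your path, a quick check before running the notebook can save a failed run. The following is a minimal sketch, not part of the repo: it assumes the executables are named plink, vcftools, bcftools and liftOver, and the function name missing_tools is hypothetical.

```python
import shutil

# Executable names assumed for the four required programs.
REQUIRED_TOOLS = ("plink", "vcftools", "bcftools", "liftOver")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found in PATH:", ", ".join(missing))
    else:
        print("All required programs were found.")
```

Run it from the same shell (and the same conda environment) in which you will start the notebook, since the path can differ between environments.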

To ease the process, there is a conda environment file in Code/mergeref.yml. With Anaconda already installed, you can recreate the environment used to run the script:

conda env create -f mergeref.yml
conda activate mergeref

Files to download

You'll need to download the respective files into the DataBases folder. Each subfolder (HGDP and 1000G) contains a README file with the same information given here.

HGDP

Download the following files and place them in the DataBases/HGDP folder. The HGDP Stanford files can be downloaded from here. You will also need to download the Sample Information from here, and the chain file that tells liftOver how to convert from hg18 to hg19 from here. Finally, you'll need to download the file listing each SNP's RSID, chromosome and position, as indicated in the respective README file.

1000G

Download the following files and place them in the DataBases/1000G folder. The 1000G Phase 3 files can be downloaded from here.

Steps

Here you can see the steps for merging the HGDP and 1000G databases.
