This README file explains the basics of how to set up and run the HumanLinkedSelectionMaps
code and replicate the results presented in our paper. The code can be adapted to work with data from different species or different human datasets not explored in our paper (see "Advanced usage" section). If you are just interested in downloading and using the B-maps, these can be found in the Bmaps/
directory, along with relevant README files. B-maps are formatted using the hg19 genome build.
The minimal data required to estimate the effects of linked selection in a given population are:
- polymorphism data
- a list of neutral sites
- a genetic map (ideally not based on polymorphism data, which can be confounded with effects of linked selection)
- estimates of the neutral substitution rate with fairly high resolution (e.g., 6kb windows)
- annotations of candidate targets of purifying selection (optional)
- annotations of candidate targets of selective sweeps (optional)
Note that while either selection target annotation is optional, at least one of them is required: either targets of purifying selection resulting in background selection AND/OR targets of selective sweeps. See Appendix 1 Sections 2 and 3 for details on how we constructed the dataset for the results presented in our paper. In the sections below, we detail the directories where each type of file can be found in our downloadable dataset.
The following "Quick start" guide will allow you to repdroduce the inference results presented in the paper (i.e. the CADD 6% and phastCons 6% results shown in the figures).
NOTE: The code in this repository was run using Python 2.7, so some compatibility changes may be needed for Python 3.x+.
- Download software and update
root_dir
- Download data
- Create new
RunStruct
file - Compress data
- Run inference
- Create figures
First, download the entire contents of this repository. Name the folder lsm_python
(or whatever you like). Paths to input and output files are set in the RunStruct
class, which indlues a hard-coded root_dir
variable at the top of the file lsm_python/classes/runstruct.py
. This was set to have two options, a local option and a remote option, depending on the User path on your system. Both paths must be updated here:
# ROOT DIRECTORIES FOR MY COMPUTER AND CLUSTER
if os.getcwd().startswith('/Users/davidmurphy'):
root_dir = '{}/linked_selection'.format(goog_dir)
else:
root_dir = '/ifs/data/c2b2/gs_lab/dam2214/linked_selection'
NOTE: Don't forget to add the full path to lsm_python
to your PYTHONPATH
variable in the terminal. This will enable modules to be imported within python code.
Download the following google drive folder, which includes all of the data to recreate the main results. To preserve the expected file structure, the contents of the folder must be put into the same directory where you've put lsm_python
. This folder contains files processed to the second to last step required before running the inference. The final step, compressing the data into "bins", is run by the user (see Appendix 1 Section 1.3 for description and rationale for binning).
The RunStruct
class includes a representation of the complete file structure used for input and output as well as inference settings, inference results, and other important variables. The code for this class is found in lsm_python/classes/runstruct.py
(see above section on the root directory). For a given configuration of the inference, a new RunStruct
is initialized and saved to a text file that is used as the input to a number of other programs. The file lsm_python/likelihood/initialize_runstruct.py
is set up to initialize a RunStruct
and save it in text format for the inference results presented as "CADD 6%" in the main paper. It will also create some folders for initialization and results files. Run this code and check that the file /<ROOT_DIR>/result/init_files/YRI.cadd94_gmask_v1.6_without_bstat.BS1.6.CS0.0.NOT_STARTED.initial.txt
was created in your root directory.
To initialize a RunStruct
for the other main result shown in the paper (phastCons 6%), change the line b_an_1 = 'cadd94_gmask_v1.6_without_bstat'
to b_an_1 = 'fish_cons94_new'
in lsm_python/likelihood/initialize_runstruct.py
and run the code again. This should create a comparable text file with the new annotation in the file name.
The RunStruct
is a multilayered data structure which includes several sub-classes as well, including the FileStruct
which contains the full file structure for input and output files and the OptimizerParams
which contains the initial and final parameters of the optimization.
As detailed in Appendix 1 Section 1.3, we compress neutral sites into "bins" where the values of exogenous parameters or pre-calculated "raw" B values are constant, which vastly downscales the number of calculations needed in the inference without any loss of precision in our predictions. The file lsm_python/precalc/lh_inputs.py
will perform this step by taking as input the RunStruct
file created in the previous step. The processing pipeline takes one chromosome at a time and can be used with external command line arguments to run each autosome in parallel on a cluster (see code segments in the main
function that are commented out). In the current configuration of the main
function, a for
loop will just go through each chromosome in serial. Note that this step may take 10-20 minutes (depending on your system).
This step builds the entire data structure needed to run the inference and includes the key files listed here:
- Zipped numpy arrays of neutral sites (neutral "mask"
.npz
files) for each chromosome. These arrays are the length of the chromosome in their filename, with a 1 for neutral sites and a 0 for non-neutral sites (see more details of the process of curating neutral sites in Appendix 1 Sections 2 and 3 in our paper). These files are found in the folderdata/nmsk
in the Google Drive folder linked above. - Pre-calculated maps of the effects of background selection for each chromosome and annotation (optional). These text files are found in the
precalc/cadd94_gmask_v1.6_without_bstat
directory for the current example. To calculate these files directly from genetic maps and annotated targets of background selection, see below "ADVANCED USAGE: 3. Pre-calculating background selection maps usingcalc_bkgd
". - Pre-calculated maps of the effects of selective sweeps for each chromosome and annotation (optional). These text files will be found in the
precalc/
directory but are not used in the current example. To calculate these files directly from genetic maps and annotated targets of sweeps, see below "ADVANCED USAGE: 4. Pre-calculating selective sweep maps". - Estimates of the neutral substitution rate in bins of 6000 neutral sites. These are used to scale predictions of the effects of linked selection on diveristy levels by accounting for variation in the neutral substitution rate along the genome (see Appendix 1 Section 3 of our paper for details). These text files are found in the
data/phast/aligned_8_win_6000
directory for the current example. - Polymorphism data from the population of interest (YRI in the current example). These are tables of counts of reference and alternate alleles at all biallelic SNPs in the 1000 Genomes Project Phase 3 dataset and are found in the
data/snps/frq/YRI
directory for the current example. This polymorphism data will be filtered in combination with the neutral mask files described above to only include neutral polymorphic sites.
As described in Appendix 1 Section 1.4, we run the inference in two optimization steps. In the first step, we run the optimization from 15 different starting conditions and keep the 3 results with the greatest likelihood. In the second step, we take the mean parameters of these 3 "best" runs from stage 1 and use these as the input parameters to the optimization.
To run the first step, run the script lsm_python/likelihood/inf_step_1.py
. This will read the RunStruct
file (created above) and loop through the 15 different starting conditions (indexed 0-14), leading to 15 different results files in the folder <ROOT_DIR>/result/final_files/cadd94_gmask_v1.6_without_bstat/
, with index 0-14 in the file labels. Note that this step is much faster if you can run each of the 15 optimizations in parallel and take the index as an external command line argument (see code segments in the main
function that are commented out).
After all 15 optimizations finish, step 1 is complete and you can move on to step 2. To do so, run the script lsm_python/likelihood/inf_step_2.py
. As in the above steps, this script is hard coded to continue running through the example for CADD 6% which we've started in the previous steps. The main
function will just read the name of the results folder (i.e., cadd94_gmask_v1.6_without_bstat
) as input. After the step 2 optimization is complete, a new file with the suffix .composite.txt
will be saved to the folder <ROOT_DIR>/result/final_files/cadd94_gmask_v1.6_without_bstat/
. This file contains the best-fitting parameters the current model configuration and data.
After the inference is run and the final .composite.txt
file has been created, you can use the inferred parameters to perform the analyses presented in the paper. Here we will cover the figures from the main paper (Figs. 2-5); to generate figures from the Appendix, see ADVANCED USAGE below.
Each of the figures requires an analysis step that results in a small table of values that are used to render the plots in Figs. 2-5. To simplify, each of these steps has been put into a single python file: lsm_python/figures/main_figs_analysis.py
. Running this script will run each analysis for Figs. 2-5 and save the results in text or numpy array format to the folder cadd94_gmask_v1.6_without_bstat
. You can explore each of the called analysis functions in more detail if you'd like to better understand the analyses, or change certain paramters (e.g., you could change the chromosome in figure 2a from chr1 to chr5).
To plot the figures, use the script lsm_python/figures/main_figs_quickstart.py
. This script is set up like a python notebook with individual cells for each figure. The individual cells will make each of the plots (Figs. 2-5). You can also just run the whole script and generate all of the figures. Saved files will be written to <ROOT_DIR>/result/final_files/mainfigs
.
NOTE: The results/final_files/
folder includes some additional results folders that are used for mutation rate estimates plotted in figure 4.
- Download and install third-party software used for data processing
- Download publicly available datasets used for the inference
- Pre-calculating background selection maps using
calc_bkgd
- Pre-calculating selective sweep maps
To run the inference starting from scratch, you will need to generate new "raw" maps of the effects of background selection for a grid of selection effects using the modified version Graham McVicker's calc_bkgd
program (McVicker et al., 2009) found in debug_bkgd/
. To compile this C program, you will most likely need to install some dependencies, which may vary depending on your system, and this step can bit a bit of a pain, depending on your programming skill level. I have included a list of the required C libraries that were needed to get this compiled on MacOS and some notes on installation that were helpful in my case in the /debug_bkgd/notes-on-dependencies.txt
file.
To identify conserved regions and also estimate substitution rates, data were processed with both the phastCons
and phyloFit
tools from the PHAST suite (Siepel and Haussler 2004; Siepel, Bejerano et al. 2005) found here. These programs are well-documented very easy to install and use.
To process raw VCF files from the 1000 Genomes Project, we used VCFTools (Danecek et al., 2011), which is also well-documented and easy to use.
All of the annotations, genetic maps, polymorphism data, etc., are downloaded from publicly available datasets (see Appendix 1 Section 2 in our paper for details). Some of these, like the 100-vertebrate alignment MAF files used to generate phastCons scores and the 1000 Genomes Project VCF files, are quite large, so be sure you have ~ 1TB available if you plan to download everything at once. Many of the larger files (like the 100-vertebrate alignment MAF files) will only be used once to create smaller processed datasets, so the large originals can be deleted. Alternatively, some of the processing can be done by streaming remote files into a pipe using curl
, although you will need to a stable connection for this option (see, for example, /1000_genomes/sh/recodesubpop.sh
script, which processes a stream of 1000 Genomes Project VCF file data directly).
Most file paths used for input/output processing data are found in the FileStruct
data structure defined in lsm_python/classes/filestruct.py
and will be automatically updated once you set the root_dir
variable (see above). Some file paths are also embedded in the relevant bash scripts or python files directly, so these must be updated accordingly (a search for /Users/davidmurphy/
, /ifs/data
and ifs/scratch
should help you track down any leftover hard coded paths to update for in system, or better yet incorporate into the FileStruct
data structure).
The most intensive step in the pre-processing of data is the calculation of background selection maps using the modified version of Graham McVicker's calc_bkgd
program (McVicker et al., 2009) found in debug_bkgd/
. To speed up this step, we made a version of the code that calculates effects of background selection on partitions of chromosomes (we used 10M bp partitions in our results) that are created in parallel. The partitions are then reassembled into whole chromosome maps when they are all finishing running. Once the calc_bkgd
exectutable file is compiled, you won't have to directly interact with the c program at all - the python program lsm_python/precalc/calc_bkgd.py
will read a few parameters, built a configuration file and then use this to run the c program.
The lsm_python/precalc/calc_bkgd.py
program can be called by the shell script sh/calc_bkgd.sh
by setting arguments in the command line. These arguments are:
init
:RunStruct
initialization file (see Quickstart Guide 3. Create newRunStruct
file above)chrom
: chromosome (chr1-chr22)coef
: the selection coefficient for the effects of background selection given as a float (i.e., 10^-2 is input 0.01)anno
: annotation label for annotation files of selected sites that will be used for calculating effects of background selection. Annotations files should be formatted and stored as they are currently found in the folder/data/bsanno
. For example, the full path to annotations for chromosome 1 for the annotationfish_cons94_new
would be:<ROOT_DIR>/data/bsanno/fish_cons94_new/chr1.fish_cons94_new.bed
.anno
only needs the labelfish_cons94_new
and the code will fill in the full path. If you introduce new background selection targets, make sure they conform to this formatting.udel
: deleterious mutation rate (fixed at a default of 7.4e-8)fexe
: path to executable c programcalc_bkgd
pidx
: the partition index of the chromosome to calculate effects of background selection (units of 0:length(chrom) / partition size). We used a default partition size of 10M bp; for instance chr1, which is just under 250M bp, would have 25 10M bp partitions andpidx
from 0 to 24plen
: this is the above referenced partition size for breaking up chromosomes. We used 10M bp.pts
: this is a flag (set to 1 for true, 0 for false) to indicate whether or not to apply a technical fix in the background selection calculation to avoid a rare loss of precision that occurs when thecalc_bkgd
program very rarely "jumps" over selected sites and misses their effect on neutral diversity in the resulting maps. In practice, setting this to 0 (false) will have very little effect on precision in most cases and will make the program run much faster. We set it to 1 (true) to deal with some of the issues of precision described in Appendix 1 Section 1.5 of the paper.
The key data files needed for calculating the effects of background selection are the set of annotations (set in the command line as describd above) and the genetic map. The genetic map is set in the RunStruct
and the program lsm_python/precalc/calc_bkgd.py
will look up the file paths to the genetic map files automatically. The files we used can be found in data/genetic_maps/AA_Map_10kb
in the Google Drive folder linked above.
NOTE: An additional wrapper script sh/bkgd_batch.sh
calls sh/calc_bkgd.sh
for a given range of parameters (for example, chromomosomes 1-10, all partition indices, a given range of selection coefficients, etc.). This can speed up the process of running dozens or hundreds of jobs in parallel instead of writing all of the commands to sh/calc_bkgd.sh
independently.
The program lsm_python/precalc/process_partitions.py
is used reassemble the partitioned chromosome maps once they are finished running. Once again, a shell wrapper sh/prcs_bparts.sh
executes this code with arguments from the command line. These arguments are:
chrom
: the chromosomegmap
: name of the genetic map used (set in theRunStruct
initilization file described above)bdir
: name of the folder where background selection maps are stored (set in theRunStruct
initilization file described above)anno
: annotation label (as described in previous step)coef
: selection coefficient (as described in previous step)plen
: partition size (as described in previous step)
After running this script, a new background selection map made up of the merged partitions will be created and saved in the same directory as the partitioned files (<ROOT_DIR>/precalc/<bdir>
where <bdir>
is the argument referenced above and stored in the RunStruct
; typically bdir
and anno
will be the same). These merged files are the raw input used in the data compression step described above in the quickstart guide prior to inference (4. Compress data).
The calculation of selective sweep maps requires far less computation power and is therefore much more straightforward. The program lsm_python/precalc/sweep_map.py
calculates these maps and is called by the shell wrapper sh/sweep_map.sh
. The arguments given in the command line are:
chrom
: the chromosome (chr1-chr22)anno
: the annotation label. As with theanno
argument used in calculating background selection maps, the annotation label will be used to fill in the full file paths by the code. For example, the full path to nonsynonymous substitutions on chromosome 1 would be:<ROOT_DIR>/data/csanno/nonsyn/chr1.nonsyn.npy
.anno
only needs the labelnonsyn
and the code will fill in the rest. If you introduce new selective sweep targets, make sure they conform to this formatting.coef
: the selection coefficient as a float (seecoef
for background selection maps above).lim
: an optional flag to ignore substitutions beyond a certain limit where they are estimated to have no effect on diversity levels for a given neutral site. Setting this to "True" can speed up the calculations somewhat but in practice it doesn't make a very big difference because the calculation of sweep maps is very efficient.
As with the maps of background selection described above, the key data files needed for calculating the effects of selective sweeps are the set of substitutions (set in the command line as describd above) and the genetic map. The genetic map files can be found in data/genetic_maps/AA_Map_10kb
in the Google Drive folder linked above.
Completed maps will be saved in the path (<ROOT_DIR>/precalc/<cdir>
where <cdir>
is set in the RunStruct
; typically cdir
and anno
will be the same).