Population Genetics Scripts

This repository contains scripts that I often use when analyzing data (mainly for population genetics purposes).

compute_diversity

This directory contains scripts used to compute genetic diversity (defined as average pairwise differences) in each window of the genome.
The main script is main_compute_diversity.py. Usage is:

python main_compute_diversity.py -h                     
usage: main_compute_diversity.py [-h] --windows_bed WINDOWS_BED --targets_bed
                                 TARGETS_BED --variants VARIANTS --numAllele
                                 NUMALLELE --outfile OUTFILE

This script computes pairwise diversity.

optional arguments:
  -h, --help            show this help message and exit
  --windows_bed WINDOWS_BED
                        REQUIRED. BED file for Xkb window.
  --targets_bed TARGETS_BED
                        REQUIRED. BED file specifying the regions to be
                        partitioned into Xkb window. For example, give the
                        path for the bed file where regions represent neutral
                        region
  --variants VARIANTS   REQUIRED. Variant file. The format should be CHROM POS
                        ind1 ind2 etc. Should be tab delimit. Because of VCF
                        format, it is 1-based
  --numAllele NUMALLELE
                        REQUIRED. Indicate the number of alleles, which is
                        equal to the number of individuals in your sample
                        times 2.
  --outfile OUTFILE     REQUIRED. Name of output file.

Several notes:

The target bed file should represent the region that includes both variant and invariant sites. For instance, let's consider a 100kb window. If this is a simulation, then you should expect all 100,000 sites to be sequenced. However, in reality, not every site in this 100kb window is callable. Therefoere, before computing diversity, one should already determine the regions of the genome that are callable.
For the variant file, note that this is not a VCF file. One should use tools such as VCFtool to extract out the GT information for each individual. Alternatively, one can rewrite this part so that the script also accept the VCF format. However, I found that the script runs faster this way.
In the compute_AF.py script, note that my genotypes are not phased (represented by a dash such as 0/0). If your genotypes are phased (represented by a pipe such as 0|0), then you should change this script.

Name		Name	Last commit message	Last commit date
Latest commit History 38 Commits
Determine_GeneticSex		Determine_GeneticSex
Group_Sites_Into_Loci		Group_Sites_Into_Loci
compute_divergence		compute_divergence
compute_diversity		compute_diversity
generate_SFS		generate_SFS
popgen_tools		popgen_tools
README.md		README.md
tabulate_SNPs_inVCF.py		tabulate_SNPs_inVCF.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Population Genetics Scripts

compute_diversity

About

Releases

Packages

Languages

tanyaphung/popgen_scripts

Folders and files

Latest commit

History

Repository files navigation

Population Genetics Scripts

compute_diversity

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages