# Input file descriptions:

The data required are

- Allele counts for various sampling units with the following requirements:
    - This should be a matrix with one row per geographical sampling site and one column per locus
    - The markers should be bi-allelic co-dominant or dominant.
    - The sampling units can be either individuals or groups of individuals observed at the same site
    - Missing data are allowed
- Sample sizes
    - This should be a matrix with one row per geographical sampling site and one column per locus
    - This should be haploid sample sizes, so two times the number of individuals for diploid organisms and so on.
- Spatial coordinates of the sampling units.
    - This should be a matrix with two columns and one row per sampling site. It can be Lon-Lat coordinates or UTM coordinates.
- Measurements of environmental variables at the same geographical locations as genetic data.
    - This should be a matrix with one row per sampling site and one column per environmental variable.
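
As a rough illustration of how the four inputs relate to each other, the sketch below builds toy matrices and checks that their shapes agree. All values and the names `shape` and `check_inputs` are made up for this example; they are not part of gINLAnd.

```python
# Toy inputs: 3 sites x 2 loci and 1 environmental variable (all values made up).
allele_count = [[4, 0], [3, 12], [0, 8]]               # ALT allele counts
sample_sizes = [[24, 22], [48, 48], [28, 26]]          # haploid sample sizes
site_coords = [[31.7, 1.7], [33.7, 0.8], [33.3, 1.9]]  # two columns per site
environmental_data = [[1330], [1322], [1312]]          # one column per variable


def shape(matrix):
    """Return (n_rows, n_cols), refusing ragged matrices."""
    n_cols = len(matrix[0])
    if any(len(row) != n_cols for row in matrix):
        raise ValueError("matrix rows differ in length")
    return len(matrix), n_cols


def check_inputs(allele_count, sample_sizes, site_coords, environmental_data):
    """Hypothetical consistency check; returns (n_sites, n_loci) on success."""
    n_sites, n_loci = shape(allele_count)
    if shape(sample_sizes) != (n_sites, n_loci):
        raise ValueError("sample_sizes must match allele_count in shape")
    if shape(site_coords) != (n_sites, 2):
        raise ValueError("site_coords needs one row per site and two columns")
    if shape(environmental_data)[0] != n_sites:
        raise ValueError("environmental_data needs one row per site")
    for counts, sizes in zip(allele_count, sample_sizes):
        if any(c > n for c, n in zip(counts, sizes)):
            raise ValueError("an allele count exceeds its haploid sample size")
    return n_sites, n_loci


print(check_inputs(allele_count, sample_sizes, site_coords, environmental_data))
```

Every matrix shares the same row order (one row per site), which is why the row counts are compared against each other.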


In [1]:
# %matplotlib inline
%matplotlib
import re
from collections import defaultdict


import numpy as np
import pandas as pd
import seaborn as sns

import ggplot as gp
import matplotlib.pyplot as plt

import vcf

Using matplotlib backend: TkAgg


In [2]:
# IN Paths
vcf_path = "/home/gus/remote_mounts/louise/data/genomes/glossina_fuscipes/annotations/SNPs/vcftools_out/ddrad58_populations/individuals/tsetseFINAL_14Oct2014_f2_53.recode.renamed_scaffolds.maf0_05.OT_MS_NB_indv.recode.vcf"
pop_coords_and_bioclim_data_path = "/home/gus/data/ddrad/environmental/www.worldclim.org/bioclim/pop_coords_and_bioclim_data.csv"


In [3]:
# OUT Paths
ginland_dir = "/home/gus/data/ddrad/gINLAnd_input"

allele_count_path = ginland_dir + "/allele_count"
sample_sizes_path = ginland_dir + "/sample_sizes"
site_coords_path = ginland_dir + "/site_coords"
environmental_data_path = ginland_dir + "/environmental_data"

In [4]:
# helper functions

def nested_defaultdict():
    return defaultdict(nested_defaultdict)

# Prepare allele_count file

- Allele counts for various sampling units with the following requirements:
    - This should be a matrix with one row per geographical sampling site and one column per locus
    - The markers should be bi-allelic co-dominant or dominant.
    - The sampling units can be either individuals or groups of individuals observed at the same site
    - Missing data are allowed
    
## Notes:

- Each "cell" in the table is the total count of the alternative (ALT) allele at that locus, summed over all members of the sampling site. Each diploid individual contributes:
    - homo-REF = 0
    - hetero = 1
    - homo-ALT = 2
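
A minimal sketch of this encoding for a single diploid genotype string is shown here. The function name `alt_allele_count` is hypothetical, and accepting phased (`|`) genotypes and `.` alleles is an assumption of this sketch rather than something the notebook's own `sum_hap_gt` does.

```python
def alt_allele_count(gt):
    """Return the number of ALT alleles in a bi-allelic GT string.

    Returns None for a missing genotype (None, or any '.' allele).
    """
    if gt is None:
        return None
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None
    counts = [int(allele) for allele in alleles]
    if not set(counts) <= {0, 1}:
        raise ValueError("not a bi-allelic genotype: %r" % gt)
    return sum(counts)


# A site's cell is the sum over its called members; missing calls are skipped.
site_calls = ["0/0", "0/1", "1/1", "./."]
called = [c for c in map(alt_allele_count, site_calls) if c is not None]
print(sum(called))  # 0 + 1 + 2 = 3
```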

-----------

# Prepare sample_sizes file

- Sample sizes
    - This should be a matrix with one row per geographical sampling site and one column per locus
    - This should be haploid sample sizes, so two times the number of individuals for diploid organisms and so on.
    
## Notes:

- This matters because not every individual has a called SNP at every locus, so each `allele_count` entry needs a matching per-locus count of sampled chromosomes for the ALT:REF ratio to come out right.
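
To make the effect concrete, here is a small worked example. It uses the MS values for locus `JFJR01006593.1:119582` from the tables in this notebook (ALT count 11, haploid sample size 22); reading those as 12 diploid individuals with one uncalled is an inference from the numbers, and `alt_frequency` is a hypothetical helper.

```python
def alt_frequency(alt_count, haploid_sample_size):
    """ALT allele frequency, or None when nothing was called at the locus."""
    if haploid_sample_size == 0:
        return None
    return float(alt_count) / haploid_sample_size


n_individuals = 12   # inferred: MS has a haploid sample size of 24 at most loci
n_uncalled = 1       # inferred: sample size drops to 22 at this locus
alt_count = 11

naive_size = 2 * n_individuals                  # ignores missing data
called_size = 2 * (n_individuals - n_uncalled)  # what sample_sizes records

print(round(alt_frequency(alt_count, naive_size), 3))  # 0.458
print(alt_frequency(alt_count, called_size))           # 0.5
```

Dividing by the naive size would understate the ALT frequency at every locus with missing calls.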

--------------

In [5]:
vcf_reader = vcf.Reader(open(vcf_path, 'r'))

In [6]:
r = next(vcf_reader)
print(r)

Record(CHROM=Scaffold0, POS=13388, REF=T, ALT=[C])


In [9]:
!wc -l $vcf_path

75728 /home/gus/remote_mounts/louise/data/genomes/glossina_fuscipes/annotations/SNPs/vcftools_out/ddrad58_populations/individuals/tsetseFINAL_14Oct2014_f2_53.recode.renamed_scaffolds.maf0_05.OT_MS_NB_indv.recode.vcf


In [8]:
s = r.samples[0]
s

Call(sample=MS11_0001, CallData(GT=1/1, PL=[255, 72, 0], DP=24, SP=0, GQ=73))

In [89]:
s.called

True

In [98]:
def vcf_to_allele_count_and_sample_sizes(vcf_path):
    """Generate dataframes with per-locus allele_count and sample_sizes gINLAnd data.

    Args:
        vcf_path (str): Path to source VCF

    Returns:
        allele_count (pandas.DataFrame)
        sample_sizes (pandas.DataFrame)

    """
    allele_count_dict = nested_defaultdict()
    sample_sizes_dict = nested_defaultdict()
    site_members = map_site_members_to_site_code(vcf_path=vcf_path)
    site_codes = tuple(set(site_members.values()))
    vcf_reader = vcf.Reader(open(vcf_path, 'r'))
    
    for snp_rec in vcf_reader:
        chrom_pos = init_nested_dicts_for_locus(allele_count_dict, sample_sizes_dict, snp_rec, site_codes)

        for sample in snp_rec.samples:
            sample_name = sample.sample
            sample_site = site_members[sample_name]

            try:
                allele_count_dict[chrom_pos][sample_site] += sum_hap_gt(sample=sample)
                sample_sizes_dict[chrom_pos][sample_site] += 2
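            except TypeError:
                # sum_hap_gt raises TypeError when GT is None (uncalled genotype);
                # skip it so the call adds to neither table
                pass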
                
    # sort rows (sites) and columns (loci) by label
    allele_count = pd.DataFrame(data=allele_count_dict).sort_index(axis=0).sort_index(axis=1)
    sample_sizes = pd.DataFrame(data=sample_sizes_dict).sort_index(axis=0).sort_index(axis=1)
    
    return allele_count,sample_sizes


def map_site_members_to_site_code(vcf_path):
    """maps site members to site codes.

    Args:
        vcf_path (str): Path to source VCF

    Returns:
        site_members (dict): `dict` containing each sample name as `key` and its site code (first two characters of the name) as `value`

    """
    
    vcf_reader = vcf.Reader(open(vcf_path, 'r'))
    
    site_members = defaultdict(str)
    
    for sample in vcf_reader.samples:
        site_members[sample] = sample[:2]
        
    return site_members


def init_nested_dicts_for_locus(allele_count_dict, sample_sizes_dict, snp_rec, site_codes):
    chrom_pos = "{chrom}:{pos}".format(chrom=snp_rec.CHROM,pos=snp_rec.POS)
    for site in site_codes:
        allele_count_dict[chrom_pos][site] = 0
        sample_sizes_dict[chrom_pos][site] = 0
    return chrom_pos


def sum_hap_gt(sample):
    gt = sample.data.GT
    assert '/' in gt
    
    hap_gts = [int(hap) for hap in gt.split('/')]
    assert set(hap_gts).issubset(set([0,1]))
    
    return sum(hap_gts)

In [94]:
allele_count,sample_sizes = vcf_to_allele_count_and_sample_sizes(vcf_path=vcf_path)

In [95]:
allele_count.head()

Unnamed: 0,JFJR01006593.1:10330,JFJR01006593.1:10357,JFJR01006593.1:119566,JFJR01006593.1:119582,JFJR01006593.1:119584,JFJR01006593.1:123271,JFJR01006593.1:123276,JFJR01006593.1:123278,JFJR01006593.1:124260,JFJR01006593.1:132284,...,Scaffold9:972035,Scaffold9:972069,Scaffold9:979793,Scaffold9:979801,Scaffold9:984986,Scaffold9:985024,Scaffold9:988009,Scaffold9:99,Scaffold9:997698,Scaffold9:997702
MS,4,4,0,11,11,5,5,5,7,4,...,9,4,0,8,16,0,0,2,9,0
NB,3,3,0,20,20,0,0,0,12,13,...,0,26,22,0,21,0,19,2,5,43
OT,0,0,12,4,3,8,8,8,1,4,...,0,0,0,0,16,8,0,7,8,0


In [97]:
sample_sizes

Unnamed: 0,JFJR01006593.1:10330,JFJR01006593.1:10357,JFJR01006593.1:119566,JFJR01006593.1:119582,JFJR01006593.1:119584,JFJR01006593.1:123271,JFJR01006593.1:123276,JFJR01006593.1:123278,JFJR01006593.1:124260,JFJR01006593.1:132284,...,Scaffold9:972035,Scaffold9:972069,Scaffold9:979793,Scaffold9:979801,Scaffold9:984986,Scaffold9:985024,Scaffold9:988009,Scaffold9:99,Scaffold9:997698,Scaffold9:997702
MS,24,24,22,22,22,24,24,24,24,24,...,24,24,24,24,24,24,24,24,22,22
NB,48,48,48,48,48,48,48,48,46,48,...,48,48,48,48,46,46,48,48,48,48
OT,28,28,28,28,28,28,28,28,26,28,...,28,28,28,28,26,26,28,28,24,24


# Prepare site_coords file

- Spatial coordinates of the sampling units.
    - This should be a matrix with two columns and one row per sampling site. It can be Lon-Lat coordinates or UTM coordinates.
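
A minimal sketch of building such a two-column matrix in Lon-Lat order, using the coordinates of the three sites in this notebook. The record layout and the name `lon_lat_matrix` are assumptions for illustration; whether gINLAnd depends on the column order is not stated here.

```python
records = [
    {"code": "OT", "lat": 1.918258, "long": 33.302457},
    {"code": "MS", "lat": 1.683327, "long": 31.734009},
    {"code": "NB", "lat": 0.836069, "long": 33.68582},
]


def lon_lat_matrix(records):
    """Return [lon, lat] rows sorted by site code, with a basic range check."""
    rows = []
    for rec in sorted(records, key=lambda r: r["code"]):
        lon, lat = rec["long"], rec["lat"]
        if not (-180 <= lon <= 180 and -90 <= lat <= 90):
            raise ValueError("coordinates out of range for site %s" % rec["code"])
        rows.append([lon, lat])
    return rows


for row in lon_lat_matrix(records):
    print(row)
```

The range check also catches swapped columns whenever a longitude falls outside the valid latitude range, although not for sites this close to the equator and prime meridian.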
    
--------

# Prepare environmental_data file

- Measurements of environmental variables at the same geographical locations as genetic data.
    - This should be a matrix with one row per sampling site and one column per environmental variable.
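
Because the files carry no labels, the rows of the environmental matrix have to follow the same site order as the genetic matrices. The sketch below shows one way to enforce that; `align_rows` is a hypothetical helper, the `bio12`/`bio13` values are the ones shown in this notebook, and site `XX` is invented to illustrate filtering.

```python
# bio12 and bio13 per site; "XX" is a made-up site with no genetic data.
env_by_site = {
    "OT": [1312, 194],
    "XX": [1000, 100],
    "MS": [1330, 174],
    "NB": [1322, 202],
}


def align_rows(site_order, table):
    """Return the rows of `table` in `site_order`, dropping unlisted sites."""
    missing = [site for site in site_order if site not in table]
    if missing:
        raise KeyError("no environmental data for sites: %s" % ", ".join(missing))
    return [table[site] for site in site_order]


genetic_site_order = ["MS", "NB", "OT"]  # row order of allele_count
for row in align_rows(genetic_site_order, env_by_site):
    print(row)
```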

---------

In [110]:
def bioclim_pop_data_to_site_coords_and_env_data(pop_coords_and_bioclim_data_path,geo_sites):
    """Generate dataframes with per-site site_coords and environmental_data gINLAnd data.

    Args:
        pop_coords_and_bioclim_data_path (str): Path to CSV with `code`, `lat`, `long` and `bio*` columns
        geo_sites (iterable): Site codes present in the genetic data

    Returns:
        site_coords (pandas.DataFrame)
        environmental_data (pandas.DataFrame)

    """
    coords_and_bioclim = pd.read_csv(pop_coords_and_bioclim_data_path)
    
    # filter for sites we actually have
    geo_site_mask = coords_and_bioclim.code.isin(geo_sites)
    coords_and_bioclim_filtered = coords_and_bioclim[geo_site_mask]
    
    site_coords = coords_and_bioclim_filtered[["lat","long"]].set_index(coords_and_bioclim_filtered.code.values)
    environmental_data = coords_and_bioclim_filtered[[x for x in coords_and_bioclim_filtered.columns if x.startswith('bio')]].set_index(coords_and_bioclim_filtered.code.values)
    
    return site_coords.sort_index(axis=0),environmental_data.sort_index(axis=0).sort_index(axis=1)

# Run the functions

- Because gINLAnd does not expect column or row headers, each file is written once without headers.
- Because it is MUCH easier to trace interesting results back to loci when labels are present, each file is ALSO written once WITH header info.
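
The sketch below illustrates why the labelled copy is useful: a column index taken from the header-less file can be mapped back to a locus through the header row of the labelled copy, provided both were written from the same table. The helpers `to_csv_lines` and `locus_for_column` are hypothetical; the labels and counts are the first three loci of the `allele_count` table shown in this notebook.

```python
site_codes = ["MS", "NB", "OT"]
loci = ["JFJR01006593.1:10330", "JFJR01006593.1:10357", "JFJR01006593.1:119566"]
counts = [[4, 4, 0], [3, 3, 0], [0, 0, 12]]


def to_csv_lines(matrix, row_labels, col_labels, with_headers):
    """Render a matrix as CSV lines, with or without row and column labels."""
    lines = []
    if with_headers:
        lines.append(",".join([""] + col_labels))  # empty cell above row labels
    for label, row in zip(row_labels, matrix):
        cells = [str(value) for value in row]
        if with_headers:
            cells = [label] + cells
        lines.append(",".join(cells))
    return lines


def locus_for_column(labelled_lines, column):
    """Map a 0-based column of the header-less file to its locus label."""
    header = labelled_lines[0].split(",")
    return header[column + 1]  # skip the row-label column


bare = to_csv_lines(counts, site_codes, loci, with_headers=False)
labelled = to_csv_lines(counts, site_codes, loci, with_headers=True)

print(bare[2])                        # 0,0,12
print(locus_for_column(labelled, 2))  # JFJR01006593.1:119566
```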

In [105]:
allele_count,sample_sizes = vcf_to_allele_count_and_sample_sizes(vcf_path=vcf_path)

In [106]:
geo_sites = allele_count.index.values
geo_sites

array(['MS', 'NB', 'OT'], dtype=object)

In [113]:
site_coords,environmental_data = bioclim_pop_data_to_site_coords_and_env_data(pop_coords_and_bioclim_data_path=pop_coords_and_bioclim_data_path,
                                                                              geo_sites=geo_sites)

In [114]:
site_coords

Unnamed: 0,lat,long
MS,1.683327,31.734009
NB,0.836069,33.68582
OT,1.918258,33.302457


In [115]:
environmental_data

Unnamed: 0,bio12,bio13,bio14,bio15,bio18,bio19,bio2,bio3,bio4,bio5,bio6,bio7,bio8,bio9
MS,1330,174,33,41,186,388,119,81,795,312,166,146,223,238
NB,1322,202,40,40,214,310,121,82,691,313,166,147,234,238
OT,1312,194,17,51,134,447,129,79,1015,328,165,163,222,245


## Write files WITHOUT header info

In [119]:
allele_count.to_csv(allele_count_path,index=False,header=False)
sample_sizes.to_csv(sample_sizes_path,index=False,header=False)
site_coords.to_csv(site_coords_path,index=False,header=False)
environmental_data.to_csv(environmental_data_path,index=False,header=False)

## Write files WITH header info

In [117]:
allele_count.to_csv(allele_count_path + ".with_headers.csv")
sample_sizes.to_csv(sample_sizes_path + ".with_headers.csv")
site_coords.to_csv(site_coords_path + ".with_headers.csv")
environmental_data.to_csv(environmental_data_path + ".with_headers.csv")

In [118]:
ginland_dir

'/home/gus/data/ddrad/gINLAnd_input'