### __Pipeline 2: SNP processing__

#### __Requirements__
- a Python environment (installed with conda, for example);
- the `.tsv` files obtained with pipeline 1;
- Python `subprocess`;
- Python `NumPy`;
- Python `pandas`.

Load the required Python libraries:

In [1]:
import numpy as np
import pandas as pd
from snp_utils import filter_short_introns_from_bed, filter_snps_by_interval, create_snp_total_counts_dict
from sfs_utils import create_sfs_dict, downsample_sfs_in_dict

#### __Import the tables into Pandas__

We are going to use pandas to import the SNP table. Pandas is a great (if used with caution) Python package built on NumPy that allows easy DataFrame manipulation.

In [2]:
%cd examples/
!ls 

/Users/tur92196/Documents/Repositories/dmel_data/create_sfss/examples
chr2L_example_ann_table.tsv chr3R_example_ann_table.tsv
chr2R_example_ann_table.tsv results
chr3L_example_ann_table.tsv




Load the `.tsv` file for each chromosome (except for __chr4__ and __chrX__).

In [3]:
# Load the files with pd.read_table()
chr2L_table = pd.read_table("chr2L_example_ann_table.tsv")
chr2R_table = pd.read_table("chr2R_example_ann_table.tsv")
chr3L_table = pd.read_table("chr3L_example_ann_table.tsv")
chr3R_table = pd.read_table("chr3R_example_ann_table.tsv")

Check the number of SNPs in each file:

In [4]:
# Take a look at the number of SNPs with .shape
"chr2L = {}, chr2R = {}, chr3L = {}, chr3R = {} SNPs!".format(chr2L_table.shape[0], chr2R_table.shape[0], chr3L_table.shape[0], chr3R_table.shape[0])

'chr2L = 1000, chr2R = 1000, chr3L = 1000, chr3R = 1000 SNPs!'

In [5]:
# Take a look at chr2L
chr2L_table.head()

Unnamed: 0,chrom,pos,id,ref,alt,refcount,altcount,refflank,altflank,refcodon,...,snpeff_trnscid,sift_trnscid,sift_geneid,sift_genename,sift_region,sift_vartype,sifts_core,sift_median,sift_pred,deleteriousness
0,chr2L,5090,.,T,C,103,3,CATTTTCTC,CATTCTCTC,,...,FBtr0475186,,,,,,,,,
1,chr2L,5095,.,T,A,16,104,TCTCTCCCA,TCTCACCCA,,...,FBtr0475186,,,,,,,,,
2,chr2L,5110,.,T,A,130,2,AGGGTGAAA,AGGGAGAAA,,...,FBtr0475186,,,,,,,,,
3,chr2L,5118,.,G,T,147,1,ATATGATCG,ATATTATCG,,...,FBtr0475186,,,,,,,,,
4,chr2L,5140,.,C,T,158,2,AGTGCCAAC,AGTGTCAAC,,...,FBtr0475186,,,,,,,,,


#### __Subset SNPs__

__Subset each chromosome to retain only:__
- `SNPs in short-introns`;
- `Synonymous SNPs`;
- `Non-synonymous SNPs`;

__(1) Subset to retain only SNPs annotated as introns__

In [6]:

chr2L_introns = chr2L_table[(chr2L_table['effect'] == "INTRON")]
chr2L_introns.shape

(146, 29)

Do the same for the other chromosomes

In [7]:
# Subset the other chromosomes to retain only SNPs annotated as introns
chr2R_introns = chr2R_table[(chr2R_table['effect'] == "INTRON")]
chr3L_introns = chr3L_table[(chr3L_table['effect'] == "INTRON")]
chr3R_introns = chr3R_table[(chr3R_table['effect'] == "INTRON")]

In [8]:
# Take a look at the number of SNPs with .shape
"chr2L = {}, chr2R = {}, chr3L = {}, chr3R = {} SNPs!".format(chr2L_introns.shape[0], chr2R_introns.shape[0], chr3L_introns.shape[0], chr3R_introns.shape[0])

'chr2L = 146, chr2R = 146, chr3L = 146, chr3R = 146 SNPs!'
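The same subsetting can be written once and applied to every chromosome, which keeps each mask tied to its own table. This is a minimal sketch that assumes only the `effect` column shown above; the helper name `subset_by_effect` and the toy tables are hypothetical.

```python
import pandas as pd

def subset_by_effect(table, effects):
    """Return the rows of `table` whose `effect` value is in `effects`."""
    return table[table["effect"].isin(effects)]

# Toy stand-ins for the per-chromosome tables (hypothetical values).
tables = {
    "chr2L": pd.DataFrame({"pos": [10, 20, 30],
                           "effect": ["INTRON", "SYNONYMOUS_CODING", "INTRON"]}),
    "chr2R": pd.DataFrame({"pos": [15, 25],
                           "effect": ["NON_SYNONYMOUS_CODING", "INTRON"]}),
}

# One subset per chromosome, each built from that chromosome's own table.
introns = {chrom: subset_by_effect(t, ["INTRON"]) for chrom, t in tables.items()}
intron_counts = {chrom: len(df) for chrom, df in introns.items()}
print(intron_counts)
```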

__(2) Keep only short intronic SNPs__

For each chromosome, retain only SNPs in `short introns`. To do that, you need to identify the short introns in the dm6 genome and trim the leading and trailing 8 bp from each short intron, since these ends might be under selective constraint. Download the intron regions as a `.BED` file from the [UCSC Table Browser](https://genome.ucsc.edu/cgi-bin/hgTables). Select the *D. melanogaster* assembly known as *dm6* (Aug. 2014, BDGP Release 6 + ISO 1 MT/dm6). Define the region of interest as `genome`, select `BED` as the output format, and name the output file. On the next page, select only `Introns plus 0`, then hit `Get BED`. The next step is to create a pandas DataFrame with the short-intron intervals.

In [9]:
%cd ..
!ls

/Users/tur92196/Documents/Repositories/dmel_data/create_sfss
README.md
__init__.py
__pycache__
examples
intervals
mutational_context_utils.py
pipeline_to_process_snps.ipynb
pipeline_to_process_snps_with_imputation.ipynb
sfs_utils.py
snp_utils.py
test_create_sfs_dict.py
test_create_snp_total_counts_dict.py
test_create_unfolded_sfs_from_df.py
test_downsample_sfs.py
test_filter_snps_by_interval.py
test_fold_sfs.py
test_impute_missing_haplotypes.py




In [10]:
short_introns = filter_short_introns_from_bed("intervals/dm6_introns.bed", 
                                              "intervals/dm6_short_introns.bed", 
                                              ["chr2L", "chr2R", "chr3L", "chr3R"],
                                              short_intron_size=86,
                                              trailling_size=8)
short_introns.head()

Unnamed: 0,0,1,2,3,4,5
0,chr2L,8387838,8387882,NM_001201797.2_intron_5_0_chr2L_8387832_f,0,+
1,chr2L,8387838,8387882,NM_001201795.2_intron_5_0_chr2L_8387832_f,0,+
2,chr2L,8387838,8387882,NM_164812.5_intron_4_0_chr2L_8387832_f,0,+
3,chr2L,8387838,8387882,NM_205936.3_intron_5_0_chr2L_8387832_f,0,+
4,chr2L,8387838,8387882,NM_205935.3_intron_5_0_chr2L_8387832_f,0,+
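The short-intron selection described above (keep short introns on the chosen chromosomes, then trim both ends) can be sketched as follows. This is not the implementation of `filter_short_introns_from_bed`. The column names `chrom`, `start`, and `end`, the helper name `trim_short_introns`, and the inclusive length threshold are all assumptions made for illustration.

```python
import pandas as pd

def trim_short_introns(bed, chroms, short_intron_size=86, trim=8):
    """Keep introns no longer than `short_intron_size` on `chroms`, then trim both ends."""
    bed = bed[bed["chrom"].isin(chroms)].copy()
    length = bed["end"] - bed["start"]
    # Assumption: the size threshold is inclusive.
    bed = bed[length <= short_intron_size].copy()
    bed["start"] += trim
    bed["end"] -= trim
    # Drop any interval that became empty after trimming.
    return bed[bed["end"] > bed["start"]].reset_index(drop=True)

# Toy BED-like table (hypothetical coordinates).
toy_bed = pd.DataFrame({
    "chrom": ["chr2L", "chr2L", "chr4"],
    "start": [100, 500, 100],
    "end":   [160, 900, 150],
})
trimmed = trim_short_introns(toy_bed, ["chr2L", "chr2R", "chr3L", "chr3R"])
print(trimmed)
```

In the toy table, only the first intron (60 bp, on `chr2L`) survives; the 400 bp intron is too long and the `chr4` intron is on an excluded chromosome.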


Now, apply the filter based on the BED file intervals (to make sure to only keep `short-intronic` SNPs)

In [11]:
# Keep only intronic SNPs that fall inside the short-intron intervals
chr2L_short_introns = filter_snps_by_interval(chr2L_introns, short_introns)
chr2R_short_introns = filter_snps_by_interval(chr2R_introns, short_introns)
chr3L_short_introns = filter_snps_by_interval(chr3L_introns, short_introns)
chr3R_short_introns = filter_snps_by_interval(chr3R_introns, short_introns)

Check how many short-introns SNPs were retained:

In [12]:
# Take a look at the number of SNPs with .shape
"chr2L = {}, chr2R = {}, chr3L = {}, chr3R = {} SNPs!".format(chr2L_short_introns.shape[0], chr2R_short_introns.shape[0], chr3L_short_introns.shape[0], chr3R_short_introns.shape[0])

'chr2L = 0, chr2R = 0, chr3L = 0, chr3R = 0 SNPs!'
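For reference, an interval filter of this kind can be sketched as below. It is a hedged illustration, not the code of `filter_snps_by_interval`. It assumes that `pos` is 1-based (as in VCF files) and that the intervals follow the BED convention (0-based start, exclusive end); the column names and the helper name `snps_in_intervals` are hypothetical.

```python
import pandas as pd

def snps_in_intervals(snps, intervals):
    """Keep SNPs whose 1-based `pos` falls inside any 0-based, half-open BED interval."""
    keep = pd.Series(False, index=snps.index)
    for chrom, start, end in intervals[["chrom", "start", "end"]].itertuples(index=False):
        # BED [start, end) covers the 1-based positions start + 1 .. end.
        keep |= (snps["chrom"] == chrom) & (snps["pos"] > start) & (snps["pos"] <= end)
    return snps[keep]

# Toy data (hypothetical positions).
toy_snps = pd.DataFrame({
    "chrom": ["chr2L", "chr2L", "chr2L", "chr2L", "chr2R"],
    "pos":   [100, 101, 150, 151, 120],
})
toy_intervals = pd.DataFrame({"chrom": ["chr2L"], "start": [100], "end": [150]})

kept = snps_in_intervals(toy_snps, toy_intervals)
print(kept)
```

A result of zero SNPs, as in the example output above, may simply mean that none of the example intronic SNPs overlaps a short-intron interval.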

__(3) Take synonymous and non-synonymous SNPs__

In [13]:
chr2L_exons = chr2L_table[(chr2L_table['effect'] == "NON_SYNONYMOUS_CODING") |
                          (chr2L_table['effect'] == "SYNONYMOUS_CODING")]
chr2L_exons.shape

(242, 29)

Do the same on the other chromosomes

In [14]:
# Subset the other chromosomes to retain only synonymous and non-synonymous SNPs
# (each mask must be built from the same table it is applied to)
chr2R_exons = chr2R_table[(chr2R_table['effect'] == "NON_SYNONYMOUS_CODING") | (chr2R_table['effect'] == "SYNONYMOUS_CODING")]
chr3L_exons = chr3L_table[(chr3L_table['effect'] == "NON_SYNONYMOUS_CODING") | (chr3L_table['effect'] == "SYNONYMOUS_CODING")]
chr3R_exons = chr3R_table[(chr3R_table['effect'] == "NON_SYNONYMOUS_CODING") | (chr3R_table['effect'] == "SYNONYMOUS_CODING")]

In [15]:
# Take a look at the number of SNPs with .shape
"chr2L = {}, chr2R = {}, chr3L = {}, chr3R = {} SNPs!".format(chr2L_exons.shape[0], chr2R_exons.shape[0], chr3L_exons.shape[0], chr3R_exons.shape[0])

'chr2L = 242, chr2R = 242, chr3L = 242, chr3R = 242 SNPs!'

__Subset each chromosome to retain only SNPs with a total count (`refcount` + `altcount`) at or above a threshold__

Let's start with SNPs genotyped in at least 160 haplotypes (`total count` >= 160).
The filter is applied separately to the intronic and exonic SNPs of each chromosome.

In [16]:
total_counts = 160

In [17]:
# Keep only intronic SNPs with total counts >= 160 haplotypes (the example uses all introns)
chr2L_introns_mincounts = chr2L_introns[chr2L_introns['refcount'] + chr2L_introns['altcount'] >= total_counts]
chr2R_introns_mincounts = chr2R_introns[chr2R_introns['refcount'] + chr2R_introns['altcount'] >= total_counts]
chr3L_introns_mincounts = chr3L_introns[chr3L_introns['refcount'] + chr3L_introns['altcount'] >= total_counts]
chr3R_introns_mincounts = chr3R_introns[chr3R_introns['refcount'] + chr3R_introns['altcount'] >= total_counts]

In [18]:
# Take a look at the number of SNPs with .shape
"chr2L = {}, chr2R = {}, chr3L = {}, and chr3R = {}  SNPs!".format(chr2L_introns_mincounts.shape[0], chr2R_introns_mincounts.shape[0], chr3L_introns_mincounts.shape[0], chr3R_introns_mincounts.shape[0])

'chr2L = 144, chr2R = 144, chr3L = 144, and chr3R = 144  SNPs!'

In [19]:
# Keep only exonic SNPs with total counts >= 160 haplotypes
chr2L_exons_mincounts = chr2L_exons[chr2L_exons['refcount'] + chr2L_exons['altcount'] >= total_counts]
chr2R_exons_mincounts = chr2R_exons[chr2R_exons['refcount'] + chr2R_exons['altcount'] >= total_counts]
chr3L_exons_mincounts = chr3L_exons[chr3L_exons['refcount'] + chr3L_exons['altcount'] >= total_counts]
chr3R_exons_mincounts = chr3R_exons[chr3R_exons['refcount'] + chr3R_exons['altcount'] >= total_counts]

In [20]:
# Take a look at the number of SNPs with .shape
"chr2L = {}, chr2R = {}, chr3L = {}, and chr3R = {}  SNPs!".format(chr2L_exons_mincounts.shape[0], chr2R_exons_mincounts.shape[0], chr3L_exons_mincounts.shape[0], chr3R_exons_mincounts.shape[0])

'chr2L = 242, chr2R = 242, chr3L = 242, and chr3R = 242  SNPs!'
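The count filter used in the cells above is inclusive (`>=`), so a SNP with exactly `total_counts` haplotypes is kept. A minimal sketch with toy values; the helper name `filter_min_total_count` is hypothetical.

```python
import pandas as pd

def filter_min_total_count(table, min_total):
    """Keep rows where refcount + altcount is at least `min_total`."""
    total = table["refcount"] + table["altcount"]
    return table[total >= min_total]

# Toy counts: the totals are 106, 160, and 159.
toy = pd.DataFrame({"refcount": [103, 158, 150], "altcount": [3, 2, 9]})
kept = filter_min_total_count(toy, 160)
print(kept)
```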

#### __SNPs counts dictionary__

Create a dictionary of SNP counts for each DataFrame of intronic SNPs (the example uses all intronic SNPs rather than the short-intron subset).

In [21]:
chr2L_introns_dict = create_snp_total_counts_dict(chr2L_introns_mincounts)
chr2R_introns_dict = create_snp_total_counts_dict(chr2R_introns_mincounts)
chr3L_introns_dict = create_snp_total_counts_dict(chr3L_introns_mincounts)
chr3R_introns_dict = create_snp_total_counts_dict(chr3R_introns_mincounts)

Then build the intron SFS for each total-count value on each chromosome (not yet downsampled):

In [22]:

chr2L_introns_raw_sfs = create_sfs_dict(chr2L_introns_dict)
chr2R_introns_raw_sfs = create_sfs_dict(chr2R_introns_dict)
chr3L_introns_raw_sfs = create_sfs_dict(chr3L_introns_dict)
chr3R_introns_raw_sfs = create_sfs_dict(chr3R_introns_dict)

In [24]:
chr2L_introns_sfs = downsample_sfs_in_dict(chr2L_introns_raw_sfs, total_counts, fold=True)
chr2R_introns_sfs = downsample_sfs_in_dict(chr2R_introns_raw_sfs, total_counts, fold=True)
chr3L_introns_sfs = downsample_sfs_in_dict(chr3L_introns_raw_sfs, total_counts, fold=True)
chr3R_introns_sfs = downsample_sfs_in_dict(chr3R_introns_raw_sfs, total_counts, fold=True)
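Downsampling and folding an SFS are commonly done with a hypergeometric projection followed by merging complementary frequency bins. The sketch below illustrates that general technique on a plain list. Whether `downsample_sfs_in_dict` works this way is an assumption, and the function names here are hypothetical.

```python
from math import comb

def downsample_unfolded_sfs(sfs, n, m):
    """Project an unfolded SFS from n haplotypes down to m by hypergeometric sampling.

    sfs[i] is the number of SNPs with i alternate alleles out of n (len(sfs) == n + 1).
    """
    projected = [0.0] * (m + 1)
    denom = comb(n, m)
    for i, count in enumerate(sfs):
        if count == 0:
            continue
        # j alternate alleles can appear in the subsample only within this range.
        for j in range(max(0, m - (n - i)), min(i, m) + 1):
            projected[j] += count * comb(i, j) * comb(n - i, m - j) / denom
    return projected

def fold_sfs(sfs):
    """Fold an unfolded SFS so that bin k holds sites with minor-allele count k."""
    m = len(sfs) - 1
    folded = []
    for k in range(m // 2 + 1):
        if k == m - k:
            folded.append(sfs[k])             # middle bin has no partner
        else:
            folded.append(sfs[k] + sfs[m - k])
    return folded

# Toy example: 10 singletons observed in 4 haplotypes, projected to 2.
projected = downsample_unfolded_sfs([0, 10, 0, 0, 0], n=4, m=2)
folded = fold_sfs(projected)
print(projected, folded)
```

The projection preserves the total number of sites, so some mass can land in the monomorphic bins (0 and `m`) of the downsampled spectrum.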

Then combine the per-chromosome intron SFSs:

In [27]:
introns_sfs_array = np.array([
    chr2L_introns_sfs,
    chr2R_introns_sfs,
    chr3L_introns_sfs,
    chr3R_introns_sfs
])

introns_sfs = np.sum(introns_sfs_array, 0).tolist()
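Summing along axis 0 adds the spectra bin by bin, which requires every per-chromosome SFS to have the same length. A toy illustration with made-up values:

```python
import numpy as np

# Three per-chromosome spectra with the same number of bins (hypothetical values).
per_chrom = [
    [5, 3, 1],
    [4, 2, 0],
    [6, 1, 1],
]

# axis=0 sums across chromosomes while keeping one entry per frequency bin.
combined = np.sum(np.array(per_chrom), axis=0).tolist()
print(combined)  # [15, 6, 2]
```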

For the synonymous SFS

In [29]:
# Create total counts dictionary for synonymous SNPs
chr2L_synonymous_dict = create_snp_total_counts_dict(chr2L_exons_mincounts[(chr2L_exons_mincounts['effect'] == "SYNONYMOUS_CODING")])
chr2R_synonymous_dict = create_snp_total_counts_dict(chr2R_exons_mincounts[(chr2R_exons_mincounts['effect'] == "SYNONYMOUS_CODING")])
chr3L_synonymous_dict = create_snp_total_counts_dict(chr3L_exons_mincounts[(chr3L_exons_mincounts['effect'] == "SYNONYMOUS_CODING")])
chr3R_synonymous_dict = create_snp_total_counts_dict(chr3R_exons_mincounts[(chr3R_exons_mincounts['effect'] == "SYNONYMOUS_CODING")])

# Create synonymous SFS dictionary
chr2L_synonymous_raw_sfs = create_sfs_dict(chr2L_synonymous_dict)
chr2R_synonymous_raw_sfs = create_sfs_dict(chr2R_synonymous_dict)
chr3L_synonymous_raw_sfs = create_sfs_dict(chr3L_synonymous_dict)
chr3R_synonymous_raw_sfs = create_sfs_dict(chr3R_synonymous_dict)

# Downsample, fold, and stack the per-chromosome synonymous SFSs
synonymous_sfs_array = np.array([
    downsample_sfs_in_dict(chr2L_synonymous_raw_sfs, total_counts, fold=True),
    downsample_sfs_in_dict(chr2R_synonymous_raw_sfs, total_counts, fold=True),
    downsample_sfs_in_dict(chr3L_synonymous_raw_sfs, total_counts, fold=True),
    downsample_sfs_in_dict(chr3R_synonymous_raw_sfs, total_counts, fold=True)
])

synonymous_sfs = np.sum(synonymous_sfs_array, 0).tolist()

And the non-synonymous one:

In [30]:
# Create total counts dictionary for non-synonymous SNPs
chr2L_nonsynonymous_dict = create_snp_total_counts_dict(chr2L_exons_mincounts[(chr2L_exons_mincounts['effect'] == "NON_SYNONYMOUS_CODING")])
chr2R_nonsynonymous_dict = create_snp_total_counts_dict(chr2R_exons_mincounts[(chr2R_exons_mincounts['effect'] == "NON_SYNONYMOUS_CODING")])
chr3L_nonsynonymous_dict = create_snp_total_counts_dict(chr3L_exons_mincounts[(chr3L_exons_mincounts['effect'] == "NON_SYNONYMOUS_CODING")])
chr3R_nonsynonymous_dict = create_snp_total_counts_dict(chr3R_exons_mincounts[(chr3R_exons_mincounts['effect'] == "NON_SYNONYMOUS_CODING")])

# Create non-synonymous SFS dictionary
chr2L_nonsynonymous_raw_sfs = create_sfs_dict(chr2L_nonsynonymous_dict)
chr2R_nonsynonymous_raw_sfs = create_sfs_dict(chr2R_nonsynonymous_dict)
chr3L_nonsynonymous_raw_sfs = create_sfs_dict(chr3L_nonsynonymous_dict)
chr3R_nonsynonymous_raw_sfs = create_sfs_dict(chr3R_nonsynonymous_dict)

# Downsample, fold, and stack the per-chromosome non-synonymous SFSs
nonsynonymous_sfs_array = np.array([
    downsample_sfs_in_dict(chr2L_nonsynonymous_raw_sfs, total_counts, fold=True),
    downsample_sfs_in_dict(chr2R_nonsynonymous_raw_sfs, total_counts, fold=True),
    downsample_sfs_in_dict(chr3L_nonsynonymous_raw_sfs, total_counts, fold=True),
    downsample_sfs_in_dict(chr3R_nonsynonymous_raw_sfs, total_counts, fold=True)
])

nonsynonymous_sfs = np.sum(nonsynonymous_sfs_array, 0).tolist()

Save the three SFSs to a file:

In [31]:
%cd examples/results/
!ls

/Users/tur92196/Documents/Repositories/dmel_data/create_sfss/examples/results
example_imputed_folded_sfs.txt




In [32]:
output_sfs_file = "example_downsampled_folded_sfs.txt"

with open(output_sfs_file, "w") as of:
    of.write("Introns, synonymous, and nonsynonymous SFS of " + str(total_counts) + " samples" + "\n")
    of.write("\t".join(str(item) for item in introns_sfs) + "\n")
    of.write("\n")
    of.write("\t".join(str(item) for item in synonymous_sfs) + "\n")
    of.write("\n")
    of.write("\t".join(str(item) for item in nonsynonymous_sfs) + "\n")
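To check the result, the file can be read back. The sketch below assumes the layout written above (one header line, then three tab-separated SFS rows separated by blank lines); the helper name `read_sfs_file` and the toy file are hypothetical.

```python
import os
import tempfile

def read_sfs_file(path):
    """Return (header, list of SFS vectors) from a file written in the layout above."""
    with open(path) as handle:
        # Skip the blank separator lines.
        lines = [line.strip() for line in handle if line.strip()]
    header, rows = lines[0], lines[1:]
    return header, [[float(x) for x in row.split("\t")] for row in rows]

# Write a toy file in the same layout, then parse it.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_sfs.txt")
    with open(path, "w") as out:
        out.write("Introns, synonymous, and nonsynonymous SFS of 4 samples\n")
        out.write("5.0\t3.0\t1.0\n\n")
        out.write("4.0\t2.0\t0.5\n\n")
        out.write("6.0\t1.0\t1.0\n")
    header, (introns, synonymous, nonsynonymous) = read_sfs_file(path)

print(header)
print(introns, synonymous, nonsynonymous)
```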