### __Pipeline 2: SNP processing with downsampling__

#### __Requirements__
- a Python environment (installed with conda, for example);
- the `.tsv` files obtained with pipeline 1;
- Python `subprocess`;
- Python `NumPy`;
- Python `pandas`.

Load the required Python libraries:

In [1]:
import numpy as np
import pandas as pd
from snp_utils import filter_short_introns_from_bed, filter_snps_by_interval, create_snp_total_counts_dict
from sfs_utils import create_sfs_dict, downsample_sfs_in_dict
from mutational_context_utils import *

#### __Import the tables into Pandas__

We are going to use pandas to import the SNP tables. Pandas is a great (if used with caution) Python package built on NumPy that allows easy DataFrame manipulation.

In [3]:
%cd ../../masked/vcfs/tables/
!ls 

/Users/tur92196/DGN/dpgp3/masked/vcfs/tables
ZI_chr2L_ann_table.tsv          chr2R_short_introns_imputed.pkl
ZI_chr2R_ann_table.tsv          chr2R_short_introns_sampled.pkl
ZI_chr3L_ann_table.tsv          chr3L_exons_imputed.pkl
ZI_chr3R_ann_table.tsv          chr3L_exons_sampled.pkl
chr2L_exons_imputed.pkl         chr3L_short_introns_imputed.pkl
chr2L_exons_sampled.pkl         chr3L_short_introns_sampled.pkl
chr2L_short_introns_imputed.pkl chr3R_exons_imputed.pkl
chr2L_short_introns_sampled.pkl chr3R_exons_sampled.pkl
chr2R_exons_imputed.pkl         chr3R_short_introns_imputed.pkl
chr2R_exons_sampled.pkl         chr3R_short_introns_sampled.pkl




Load the `.tsv` file for each chromosome (except for __chrom4__ and __chromX__).

In [5]:
# Load the files with pd.read_table()
chr2L_table = pd.read_table("ZI_chr2L_ann_table.tsv")
chr2R_table = pd.read_table("ZI_chr2R_ann_table.tsv")
chr3L_table = pd.read_table("ZI_chr3L_ann_table.tsv")
chr3R_table = pd.read_table("ZI_chr3R_ann_table.tsv")

Check the number of SNPs in each file:

In [27]:
# Take a look at the number of SNPs with .shape
"chr2L = {}, chr2R = {}, chr3L = {}, chr3R = {} SNPs!".format(chr2L_table.shape[0], chr2R_table.shape[0], chr3L_table.shape[0], chr3R_table.shape[0])

'chr2L = 2016260, chr2R = 1582937, chr3L = 1795061, chr3R = 2087683 SNPs!'

In [28]:
# Take a look at chr2L
chr2L_table.head()

Unnamed: 0,chrom,pos,id,ref,alt,refcount,altcount,refflank,altflank,refcodon,...,snpeff_trnscid,sift_trnscid,sift_geneid,sift_genename,sift_region,sift_vartype,sifts_core,sift_median,sift_pred,deleteriousness
0,chr2L,5090,.,T,C,103,3,CATTTTCTC,CATTCTCTC,,...,FBtr0475186,,,,,,,,,
1,chr2L,5095,.,T,A,16,104,TCTCTCCCA,TCTCACCCA,,...,FBtr0475186,,,,,,,,,
2,chr2L,5110,.,T,A,130,2,AGGGTGAAA,AGGGAGAAA,,...,FBtr0475186,,,,,,,,,
3,chr2L,5118,.,G,T,147,1,ATATGATCG,ATATTATCG,,...,FBtr0475186,,,,,,,,,
4,chr2L,5140,.,C,T,158,2,AGTGCCAAC,AGTGTCAAC,,...,FBtr0475186,,,,,,,,,


#### __Prepare the DataFrame for downsampling__

Get the total number of haplotypes for each SNP.

In [29]:
# Add a total count column to all datasets:
chr2L_table['totalcount'] = chr2L_table['refcount'] + chr2L_table['altcount']
chr2R_table['totalcount'] = chr2R_table['refcount'] + chr2R_table['altcount']
chr3L_table['totalcount'] = chr3L_table['refcount'] + chr3L_table['altcount']
chr3R_table['totalcount'] = chr3R_table['refcount'] + chr3R_table['altcount']

In [30]:
# Take a look at chr2L
chr2L_table.head()

Unnamed: 0,chrom,pos,id,ref,alt,refcount,altcount,refflank,altflank,refcodon,...,sift_trnscid,sift_geneid,sift_genename,sift_region,sift_vartype,sifts_core,sift_median,sift_pred,deleteriousness,totalcount
0,chr2L,5090,.,T,C,103,3,CATTTTCTC,CATTCTCTC,,...,,,,,,,,,,106
1,chr2L,5095,.,T,A,16,104,TCTCTCCCA,TCTCACCCA,,...,,,,,,,,,,120
2,chr2L,5110,.,T,A,130,2,AGGGTGAAA,AGGGAGAAA,,...,,,,,,,,,,132
3,chr2L,5118,.,G,T,147,1,ATATGATCG,ATATTATCG,,...,,,,,,,,,,148
4,chr2L,5140,.,C,T,158,2,AGTGCCAAC,AGTGTCAAC,,...,,,,,,,,,,160


Remove SNPs whose `totalcount` is below 160. This value is the lower bound for downsampling: every SNP with a larger number of haplotypes is downsampled to this number.

In [31]:
# Set the lower bound for the number of haplotypes
min_number_of_haplotypes = 160

In [32]:
chr2L_table = chr2L_table[chr2L_table['totalcount'] >= min_number_of_haplotypes]
chr2R_table = chr2R_table[chr2R_table['totalcount'] >= min_number_of_haplotypes]
chr3L_table = chr3L_table[chr3L_table['totalcount'] >= min_number_of_haplotypes]
chr3R_table = chr3R_table[chr3R_table['totalcount'] >= min_number_of_haplotypes]
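The per-chromosome statements above repeat the same two operations four times. As a minimal sketch, assuming the tables are kept in a dictionary keyed by chromosome name (the helper `add_total_and_filter` and the toy table are hypothetical, built from the first rows of chr2L shown earlier), the same steps can be written once:

```python
import pandas as pd

MIN_HAPLOTYPES = 160  # same lower bound as min_number_of_haplotypes

# Toy table with the first five chr2L rows printed above (illustration only).
toy_chr2L = pd.DataFrame({
    "chrom": ["chr2L"] * 5,
    "pos": [5090, 5095, 5110, 5118, 5140],
    "refcount": [103, 16, 130, 147, 158],
    "altcount": [3, 104, 2, 1, 2],
})
tables = {"chr2L": toy_chr2L}  # in practice: one entry per chromosome


def add_total_and_filter(table, min_haplotypes):
    """Add a totalcount column and keep rows at or above the lower bound."""
    out = table.assign(totalcount=table["refcount"] + table["altcount"])
    return out[out["totalcount"] >= min_haplotypes].copy()


filtered = {
    chrom: add_total_and_filter(table, MIN_HAPLOTYPES)
    for chrom, table in tables.items()
}
print(filtered["chr2L"][["pos", "totalcount"]])
```

Returning a `.copy()` keeps later column insertions from acting on a view of the original table.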

In [33]:
# Take a look at the number of SNPs with .shape
"chr2L = {}, chr2R = {}, chr3L = {}, chr3R = {} SNPs!".format(chr2L_table.shape[0], chr2R_table.shape[0], chr3L_table.shape[0], chr3R_table.shape[0])

'chr2L = 1919950, chr2R = 1514541, chr3L = 1719317, chr3R = 2017777 SNPs!'

#### __Subset SNPs__

__Subset each chromosome to retain only:__
- `SNPs in short-introns`;
- `Synonymous SNPs`;
- `Non-synonymous SNPs`;

__(1) Subset to retain only SNPs annotated as introns__

The same filter is applied to every chromosome.

In [35]:
# Subset to retain only SNPs annotated as introns
chr2L_introns = chr2L_table[(chr2L_table['effect'] == "INTRON")]
chr2R_introns = chr2R_table[(chr2R_table['effect'] == "INTRON")]
chr3L_introns = chr3L_table[(chr3L_table['effect'] == "INTRON")]
chr3R_introns = chr3R_table[(chr3R_table['effect'] == "INTRON")]

In [37]:
# Take a look at the number of SNPs with .shape
"chr2L = {}, chr2R = {}, chr3L = {}, chr3R = {} SNPs!".format(chr2L_introns.shape[0], chr2R_introns.shape[0], chr3L_introns.shape[0], chr3R_introns.shape[0])

'chr2L = 353457, chr2R = 258584, chr3L = 321868, chr3R = 435809 SNPs!'

__(2) Keep only short-intron SNPs__

For each chromosome, retain only SNPs in `short introns`. To do that, identify the short introns in the Dm6 genome and trim 8 bp from the head and tail of each one, since these ends might be under selective constraint. Download the intron regions as a `.BED` file from [here](https://genome.ucsc.edu/cgi-bin/hgTables). Select the *D. melanogaster* assembly known as *Dm6* (Aug. 2014, BDGP Release 6 + ISO 1 MT/dm6). Define the region of interest as `genome`, select `BED` as the output format, and name the output file. On the next page, select only `Introns plus 0`, then hit `Get BED`. The next step is to create a pandas DataFrame with the short-intron intervals.
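The internals of `filter_short_introns_from_bed` are not shown in this notebook. The sketch below only illustrates the idea described above under stated assumptions: an intron counts as short when its length is below the threshold (whether the real helper uses `<` or `<=` is not known), and 8 bp are removed from both ends. The function name `trim_short_introns` is hypothetical.

```python
def trim_short_introns(bed_rows, short_intron_size=86, trim=8):
    """Keep short introns and trim both ends.

    bed_rows: iterable of (chrom, start, end) in BED coordinates
    (0-based start, end exclusive).
    Assumption: 'short' means length < short_intron_size.
    """
    kept = []
    for chrom, start, end in bed_rows:
        length = end - start
        # Skip long introns and introns that would vanish after trimming.
        if length >= short_intron_size or length <= 2 * trim:
            continue
        kept.append((chrom, start + trim, end - trim))
    return kept


toy_introns = [
    ("chr2L", 100, 160),    # 60 bp: kept and trimmed
    ("chr2L", 1000, 1200),  # 200 bp: too long
    ("chr2L", 50, 60),      # 10 bp: nothing left after trimming
]
print(trim_short_introns(toy_introns))
```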

In [14]:
%cd ../../../../reference/
!ls

/Users/tur92196/DGN/reference
dm3.fa                            dmel-2L-chromosome-r5.57.fasta.gz
dm3.fa.fai                        dmel-2L-r5.57.gff.gz
dm3ToDm6.over.chain               dmel-2R-chromosome-r5.57.fasta.gz
dm6.fa                            dmel-2R-r5.57.gff.gz
dm6.fa.dict                       dmel-3L-chromosome-r5.57.fasta.gz
dm6.fa.fai                        dmel-3L-r5.57.gff.gz
dm6_introns.bed                   dmel-3R-chromosome-r5.57.fasta.gz
dm6_short_introns.bed             dmel-3R-r5.57.gff.gz
dmel-2L-chromosome-r5.57.fasta




In [16]:
short_introns = filter_short_introns_from_bed("dm6_introns.bed", 
                                              "dm6_short_introns.bed", 
                                              ["chr2L", "chr2R", "chr3L", "chr3R"],
                                              short_intron_size=86,
                                              trailling_size=8)
short_introns.head()

Unnamed: 0,0,1,2,3,4,5
0,chr2L,8387838,8387882,NM_001201797.2_intron_5_0_chr2L_8387832_f,0,+
1,chr2L,8387838,8387882,NM_001201795.2_intron_5_0_chr2L_8387832_f,0,+
2,chr2L,8387838,8387882,NM_164812.5_intron_4_0_chr2L_8387832_f,0,+
3,chr2L,8387838,8387882,NM_205936.3_intron_5_0_chr2L_8387832_f,0,+
4,chr2L,8387838,8387882,NM_205935.3_intron_5_0_chr2L_8387832_f,0,+


Now apply the filter based on the BED file intervals, so that only `short-intron` SNPs are kept:

In [38]:
# Keep only short-introns SNPs.
chr2L_short_introns = filter_snps_by_interval(chr2L_introns, short_introns)
chr2R_short_introns = filter_snps_by_interval(chr2R_introns, short_introns)
chr3L_short_introns = filter_snps_by_interval(chr3L_introns, short_introns)
chr3R_short_introns = filter_snps_by_interval(chr3R_introns, short_introns)
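`filter_snps_by_interval` comes from the local `snp_utils` module and its implementation is not shown. As a rough sketch of interval filtering for one chromosome, the hypothetical function below merges overlapping intervals (the BED file lists the same intron once per transcript) and uses binary search to test each SNP. It assumes SNP positions are 1-based, as in a VCF, and intervals are 0-based with an exclusive end, as in BED.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals into a sorted list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def positions_in_intervals(positions, intervals):
    """Return the 1-based positions that fall inside any BED interval."""
    merged = merge_intervals(intervals)
    starts = [start for start, _ in merged]
    kept = []
    for pos in positions:
        zero_based = pos - 1  # convert VCF-style position to BED coordinates
        i = bisect_right(starts, zero_based) - 1
        if i >= 0 and zero_based < merged[i][1]:
            kept.append(pos)
    return kept


toy_intervals = [(10, 20), (15, 25), (40, 50)]
toy_positions = [10, 11, 25, 26, 40, 41, 50, 51]
print(positions_in_intervals(toy_positions, toy_intervals))
```

Merging first avoids reporting the same SNP once per overlapping transcript.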

Dump these DataFrames as pickle files to avoid re-running the filter:

In [39]:
%cd ../dpgp3
!ls 

Popfly         dmel_data_bckp masked
dmel_data      dpgp3.txt      originals


In [40]:
# Pickle chrm short-introns DataFrames
# chr2L_short_introns.to_pickle("masked/vcfs/tables/chr2L_short_introns_downsampled.pkl")
# chr2R_short_introns.to_pickle("masked/vcfs/tables/chr2R_short_introns_downsampled.pkl")
# chr3L_short_introns.to_pickle("masked/vcfs/tables/chr3L_short_introns_downsampled.pkl")
# chr3R_short_introns.to_pickle("masked/vcfs/tables/chr3R_short_introns_downsampled.pkl")

In [None]:
# Load pickled chrm short-introns DataFrames
# chr2L_short_introns = pd.read_pickle("masked/vcfs/tables/chr2L_short_introns_downsampled.pkl")
# chr2R_short_introns = pd.read_pickle("masked/vcfs/tables/chr2R_short_introns_downsampled.pkl")
# chr3L_short_introns = pd.read_pickle("masked/vcfs/tables/chr3L_short_introns_downsampled.pkl")
# chr3R_short_introns = pd.read_pickle("masked/vcfs/tables/chr3R_short_introns_downsampled.pkl")

Check how many short-introns SNPs were retained:

In [41]:
# Take a look at the number of SNPs with .shape
"chr2L = {}, chr2R = {}, chr3L = {}, chr3R = {} SNPs!".format(chr2L_short_introns.shape[0], chr2R_short_introns.shape[0], chr3L_short_introns.shape[0], chr3R_short_introns.shape[0])

'chr2L = 2728, chr2R = 2384, chr3L = 2073, chr3R = 2967 SNPs!'

__(3) Take synonymous and non-synonymous SNPs__

In [43]:
# Subset each chromosome to retain only synonymous and non-synonymous SNPs
chr2L_exons = chr2L_table[(chr2L_table['effect'] == "NON_SYNONYMOUS_CODING") | (chr2L_table['effect'] == "SYNONYMOUS_CODING")]
chr2R_exons = chr2R_table[(chr2R_table['effect'] == "NON_SYNONYMOUS_CODING") | (chr2R_table['effect'] == "SYNONYMOUS_CODING")]
chr3L_exons = chr3L_table[(chr3L_table['effect'] == "NON_SYNONYMOUS_CODING") | (chr3L_table['effect'] == "SYNONYMOUS_CODING")]
chr3R_exons = chr3R_table[(chr3R_table['effect'] == "NON_SYNONYMOUS_CODING") | (chr3R_table['effect'] == "SYNONYMOUS_CODING")]
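The two equality tests joined with `|` can also be expressed with `Series.isin`, which scales better when more effect classes are added. This is a small sketch on a toy table; the column name `effect` and the two coding labels come from the cells above, while the toy rows and the name `CODING_EFFECTS` are made up for illustration.

```python
import pandas as pd

CODING_EFFECTS = ["NON_SYNONYMOUS_CODING", "SYNONYMOUS_CODING"]

# Toy annotation table (illustration only).
toy_table = pd.DataFrame({
    "pos": [10, 20, 30, 40],
    "effect": ["INTRON", "SYNONYMOUS_CODING",
               "NON_SYNONYMOUS_CODING", "INTRON"],
})

# Keep rows whose effect is one of the coding classes.
toy_exons = toy_table[toy_table["effect"].isin(CODING_EFFECTS)].copy()

# Split the coding SNPs by effect for later use.
by_effect = {effect: group for effect, group in toy_exons.groupby("effect")}
print(toy_exons)
```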

In [44]:
# Pickle chrm exons DataFrames
# chr2L_exons.to_pickle("masked/vcfs/tables/chr2L_exons_downsampled.pkl")
# chr2R_exons.to_pickle("masked/vcfs/tables/chr2R_exons_downsampled.pkl")
# chr3L_exons.to_pickle("masked/vcfs/tables/chr3L_exons_downsampled.pkl")
# chr3R_exons.to_pickle("masked/vcfs/tables/chr3R_exons_downsampled.pkl")

In [None]:
# Load pickled chrm exons DataFrames
# chr2L_exons = pd.read_pickle("masked/vcfs/tables/chr2L_exons_downsampled.pkl")
# chr2R_exons = pd.read_pickle("masked/vcfs/tables/chr2R_exons_downsampled.pkl")
# chr3L_exons = pd.read_pickle("masked/vcfs/tables/chr3L_exons_downsampled.pkl")
# chr3R_exons = pd.read_pickle("masked/vcfs/tables/chr3R_exons_downsampled.pkl")

In [45]:
# Take a look at the number of SNPs with .shape
"chr2L = {}, chr2R = {}, chr3L = {}, chr3R = {} SNPs!".format(chr2L_exons.shape[0], chr2R_exons.shape[0], chr3L_exons.shape[0], chr3R_exons.shape[0])

'chr2L = 255680, chr2R = 236781, chr3L = 226182, chr3R = 271278 SNPs!'

#### __SNP counts dictionary__

Create a dictionary of SNP counts for each DataFrame containing short-intron SNPs.

In [47]:
chr2L_short_introns_dict = create_snp_total_counts_dict(chr2L_short_introns)
chr2R_short_introns_dict = create_snp_total_counts_dict(chr2R_short_introns)
chr3L_short_introns_dict = create_snp_total_counts_dict(chr3L_short_introns)
chr3R_short_introns_dict = create_snp_total_counts_dict(chr3R_short_introns)

Then get the short-intron SFS for each total count value on each chromosome (these are not downsampled yet):

In [50]:

chr2L_short_introns_raw_sfs = create_sfs_dict(chr2L_short_introns_dict)
chr2R_short_introns_raw_sfs = create_sfs_dict(chr2R_short_introns_dict)
chr3L_short_introns_raw_sfs = create_sfs_dict(chr3L_short_introns_dict)
chr3R_short_introns_raw_sfs = create_sfs_dict(chr3R_short_introns_dict)
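The data structures returned by `create_snp_total_counts_dict` and `create_sfs_dict` are not shown here. One plausible layout, given purely as an assumption, groups the alternative-allele counts by total haplotype count and then builds one unfolded SFS per sample size. Both function names below are hypothetical.

```python
from collections import Counter, defaultdict


def group_alt_counts_by_total(snps):
    """Map each total haplotype count to the list of alt-allele counts.

    snps: iterable of (refcount, altcount) pairs.
    """
    grouped = defaultdict(list)
    for refcount, altcount in snps:
        grouped[refcount + altcount].append(altcount)
    return dict(grouped)


def sfs_per_sample_size(grouped):
    """Build one unfolded SFS (bins 0..n) for every sample size n."""
    spectra = {}
    for total, alt_counts in grouped.items():
        tally = Counter(alt_counts)
        spectra[total] = [tally.get(i, 0) for i in range(total + 1)]
    return spectra


toy_snps = [(158, 2), (159, 1), (160, 2), (157, 3)]
toy_grouped = group_alt_counts_by_total(toy_snps)
toy_spectra = sfs_per_sample_size(toy_grouped)
print(sorted(toy_spectra))
```

Keeping one spectrum per sample size is what makes a later projection to a common size possible.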

In [52]:
chr2L_short_introns_sfs = downsample_sfs_in_dict(chr2L_short_introns_raw_sfs, min_number_of_haplotypes, fold=True)
chr2R_short_introns_sfs = downsample_sfs_in_dict(chr2R_short_introns_raw_sfs, min_number_of_haplotypes, fold=True)
chr3L_short_introns_sfs = downsample_sfs_in_dict(chr3L_short_introns_raw_sfs, min_number_of_haplotypes, fold=True)
chr3R_short_introns_sfs = downsample_sfs_in_dict(chr3R_short_introns_raw_sfs, min_number_of_haplotypes, fold=True)
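How `downsample_sfs_in_dict` projects the spectra is not documented in this notebook. A common approach, shown here only as an assumption about the general technique, is hypergeometric projection: a site with `i` derived alleles among `n_from` haplotypes contributes to bin `j` of the smaller sample with probability C(i, j) C(n_from - i, n_to - j) / C(n_from, n_to). The names `project_sfs` and `fold_sfs` are hypothetical.

```python
from math import comb


def project_sfs(sfs, n_to):
    """Project an unfolded SFS with bins 0..n_from down to bins 0..n_to."""
    n_from = len(sfs) - 1
    if n_to > n_from:
        raise ValueError("cannot project to a larger sample size")
    projected = [0.0] * (n_to + 1)
    denom = comb(n_from, n_to)
    for i, count in enumerate(sfs):
        if count == 0:
            continue
        # Only j values with non-zero hypergeometric probability.
        for j in range(max(0, n_to - (n_from - i)), min(i, n_to) + 1):
            projected[j] += count * comb(i, j) * comb(n_from - i, n_to - j) / denom
    return projected


def fold_sfs(sfs):
    """Fold an unfolded SFS by adding bin k and bin n - k."""
    n = len(sfs) - 1
    return [sfs[k] + (sfs[n - k] if n - k != k else 0) for k in range(n // 2 + 1)]


toy_sfs = [0, 10, 0, 0, 0]           # ten singletons in a sample of 4
toy_projected = project_sfs(toy_sfs, 2)
toy_folded = fold_sfs(toy_projected)
print(toy_projected, toy_folded)
```

The projection preserves the total number of sites; some of them land in the monomorphic bins (0 and `n_to`) of the smaller sample.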

Then combine the short-intron SFSs of all chromosomes:

In [53]:
short_introns_sfs_array = np.array([
    chr2L_short_introns_sfs,
    chr2R_short_introns_sfs,
    chr3L_short_introns_sfs,
    chr3R_short_introns_sfs
])

short_introns_sfs = np.sum(short_introns_sfs_array, 0).tolist()

Repeat the same steps for the synonymous SFS:

In [55]:
# Create total counts dictionary for synonymous SNPs
chr2L_synonymous_dict = create_snp_total_counts_dict(chr2L_exons[(chr2L_exons['effect'] == "SYNONYMOUS_CODING")])
chr2R_synonymous_dict = create_snp_total_counts_dict(chr2R_exons[(chr2R_exons['effect'] == "SYNONYMOUS_CODING")])
chr3L_synonymous_dict = create_snp_total_counts_dict(chr3L_exons[(chr3L_exons['effect'] == "SYNONYMOUS_CODING")])
chr3R_synonymous_dict = create_snp_total_counts_dict(chr3R_exons[(chr3R_exons['effect'] == "SYNONYMOUS_CODING")])

# Create synonymous SFS dictionary
chr2L_synonymous_raw_sfs = create_sfs_dict(chr2L_synonymous_dict)
chr2R_synonymous_raw_sfs = create_sfs_dict(chr2R_synonymous_dict)
chr3L_synonymous_raw_sfs = create_sfs_dict(chr3L_synonymous_dict)
chr3R_synonymous_raw_sfs = create_sfs_dict(chr3R_synonymous_dict)

# Then combine the synonymous SFSs of all chromosomes
synonymous_sfs_array = np.array([
    downsample_sfs_in_dict(chr2L_synonymous_raw_sfs, min_number_of_haplotypes, fold=True),
    downsample_sfs_in_dict(chr2R_synonymous_raw_sfs, min_number_of_haplotypes, fold=True),
    downsample_sfs_in_dict(chr3L_synonymous_raw_sfs, min_number_of_haplotypes, fold=True),
    downsample_sfs_in_dict(chr3R_synonymous_raw_sfs, min_number_of_haplotypes, fold=True)
])

synonymous_sfs = np.sum(synonymous_sfs_array, 0).tolist()

And the non-synonymous one:

In [56]:
# Create total counts dictionary for nonsynonymous SNPs
chr2L_nonsynonymous_dict = create_snp_total_counts_dict(chr2L_exons[(chr2L_exons['effect'] == "NON_SYNONYMOUS_CODING")])
chr2R_nonsynonymous_dict = create_snp_total_counts_dict(chr2R_exons[(chr2R_exons['effect'] == "NON_SYNONYMOUS_CODING")])
chr3L_nonsynonymous_dict = create_snp_total_counts_dict(chr3L_exons[(chr3L_exons['effect'] == "NON_SYNONYMOUS_CODING")])
chr3R_nonsynonymous_dict = create_snp_total_counts_dict(chr3R_exons[(chr3R_exons['effect'] == "NON_SYNONYMOUS_CODING")])

# Create nonsynonymous SFS dictionary
chr2L_nonsynonymous_raw_sfs = create_sfs_dict(chr2L_nonsynonymous_dict)
chr2R_nonsynonymous_raw_sfs = create_sfs_dict(chr2R_nonsynonymous_dict)
chr3L_nonsynonymous_raw_sfs = create_sfs_dict(chr3L_nonsynonymous_dict)
chr3R_nonsynonymous_raw_sfs = create_sfs_dict(chr3R_nonsynonymous_dict)

# Then combine the nonsynonymous SFSs of all chromosomes
nonsynonymous_sfs_array = np.array([
    downsample_sfs_in_dict(chr2L_nonsynonymous_raw_sfs, min_number_of_haplotypes, fold=True),
    downsample_sfs_in_dict(chr2R_nonsynonymous_raw_sfs, min_number_of_haplotypes, fold=True),
    downsample_sfs_in_dict(chr3L_nonsynonymous_raw_sfs, min_number_of_haplotypes, fold=True),
    downsample_sfs_in_dict(chr3R_nonsynonymous_raw_sfs, min_number_of_haplotypes, fold=True)
])

nonsynonymous_sfs = np.sum(nonsynonymous_sfs_array, 0).tolist()

Save the three SFSs to a file:

In [58]:
%cd masked/vcfs/
!ls

/Users/tur92196/DGN/dpgp3/masked/vcfs
README.md    ZI_Chr3L.vcf chrms        remade       snpeff
ZI_Chr2L.vcf ZI_Chr3R.vcf filtered     sfss         tables
ZI_Chr2R.vcf ZI_ChrX.vcf  liftover     sift4g




In [None]:
%%bash

# Create the sfss folder for the SFS files if it does not exist yet
sfss_folder="sfss"
if [ !  -d "$sfss_folder" ]; 
then
    mkdir -p $sfss_folder && ls $sfss_folder
else
    ls $sfss_folder
fi

In [59]:
%cd sfss/
!ls

/Users/tur92196/DGN/dpgp3/masked/vcfs/sfss
no-pairing paired


In [60]:
output_sfs_file = "no-pairing/ZI_sfs_si_nopairing_downsampled_folded___.txt"

with open(output_sfs_file, "w") as of:
    of.write("Introns, synonymous, and nonsynonymous SFS of " + str(min_number_of_haplotypes) + " samples" + "\n")
    of.write("\t".join(str(item) for item in short_introns_sfs) + "\n")
    of.write("\n")
    of.write("\t".join(str(item) for item in synonymous_sfs) + "\n")
    of.write("\n")
    of.write("\t".join(str(item) for item in nonsynonymous_sfs) + "\n")
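The output file written above has one header line followed by tab-separated spectra separated by blank lines. As a small sketch for checking such a file, the hypothetical helpers below write the same layout to a temporary path and read it back; the toy spectra are made up.

```python
import os
import tempfile


def write_sfs_file(path, header, spectra):
    """Write a header line, then one tab-separated SFS per block."""
    with open(path, "w") as handle:
        handle.write(header + "\n")
        blocks = ["\t".join(str(item) for item in sfs) for sfs in spectra]
        handle.write("\n\n".join(blocks) + "\n")


def read_sfs_file(path):
    """Return (header, list of spectra) from a file in the layout above."""
    with open(path) as handle:
        header = handle.readline().rstrip("\n")
        spectra = [
            [float(item) for item in line.split("\t")]
            for line in (raw.strip() for raw in handle)
            if line  # skip the blank separator lines
        ]
    return header, spectra


toy_spectra_out = [[3.0, 2.5, 1.0], [10.0, 4.0, 2.0], [7.0, 1.5, 0.5]]
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_sfs.txt")
    write_sfs_file(toy_path, "Toy SFS of 4 samples", toy_spectra_out)
    toy_header, toy_spectra_in = read_sfs_file(toy_path)
print(toy_header, toy_spectra_in)
```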

#### __Pair neutral and non-neutral SNPs by mutational context__

Now that we have a simple way to get the SFSs, we are going to add some complexity and pair neutral and non-neutral SNPs by their mutational context.

In [61]:
!ls

no-pairing paired


Remove SNPs that are close to each other, as they might cause ambiguity in the sequence category.

For Short-introns:

In [62]:
# Get True for SNPs far apart and False otherwise, and insert a column on each chromosome DataFrame:
# Insert the new column at position 3
chr2L_short_introns.insert(2, "pos_to_keep", find_consecutive_positions(list(chr2L_short_introns["pos"])))
chr2R_short_introns.insert(2, "pos_to_keep", find_consecutive_positions(list(chr2R_short_introns["pos"])))
chr3L_short_introns.insert(2, "pos_to_keep", find_consecutive_positions(list(chr3L_short_introns["pos"])))
chr3R_short_introns.insert(2, "pos_to_keep", find_consecutive_positions(list(chr3R_short_introns["pos"])))

In [63]:
chr2L_short_introns = chr2L_short_introns[chr2L_short_introns['pos_to_keep'] == True]
chr2R_short_introns = chr2R_short_introns[chr2R_short_introns['pos_to_keep'] == True]
chr3L_short_introns = chr3L_short_introns[chr3L_short_introns['pos_to_keep'] == True]
chr3R_short_introns = chr3R_short_introns[chr3R_short_introns['pos_to_keep'] == True]
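`find_consecutive_positions` is imported from `mutational_context_utils` and its definition is not shown. Judging by its name, it may flag SNPs at adjacent positions; the sketch below makes that assumption explicit through a `min_distance` parameter and expects positions sorted in ascending order. The function name `flag_isolated_positions` is hypothetical.

```python
def flag_isolated_positions(positions, min_distance=2):
    """Return True for positions with no neighbor closer than min_distance.

    positions must be sorted in ascending order.
    Assumption: min_distance=2 flags only directly adjacent SNPs.
    """
    flags = []
    for i, pos in enumerate(positions):
        near_prev = i > 0 and pos - positions[i - 1] < min_distance
        near_next = i + 1 < len(positions) and positions[i + 1] - pos < min_distance
        flags.append(not (near_prev or near_next))
    return flags


toy_positions_sorted = [5090, 5095, 5096, 5110, 5111, 5112, 5140]
toy_flags = flag_isolated_positions(toy_positions_sorted)
print(toy_flags)
```

A SNP next to another one changes the flanking bases of its neighbor, which is why both members of a close pair are dropped in this sketch.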

In [64]:
# Take a look at the number of SNPs with .shape
"chr2L = {}, chr2R = {}, chr3L = {}, chr3R = {} SNPs!".format(chr2L_short_introns.shape[0], chr2R_short_introns.shape[0], chr3L_short_introns.shape[0], chr3R_short_introns.shape[0])

'chr2L = 1850, chr2R = 1632, chr3L = 1484, chr3R = 2134 SNPs!'

For Exons:

In [65]:
# Get True for SNPs far apart and False otherwise, and insert a column on each chromosome DataFrame:
# Insert the new column at position 3
chr2L_exons.insert(2, "pos_to_keep", find_consecutive_positions(list(chr2L_exons["pos"])))
chr2R_exons.insert(2, "pos_to_keep", find_consecutive_positions(list(chr2R_exons["pos"])))
chr3L_exons.insert(2, "pos_to_keep", find_consecutive_positions(list(chr3L_exons["pos"])))
chr3R_exons.insert(2, "pos_to_keep", find_consecutive_positions(list(chr3R_exons["pos"])))

In [66]:
chr2L_exons = chr2L_exons[chr2L_exons['pos_to_keep'] == True]
chr2R_exons = chr2R_exons[chr2R_exons['pos_to_keep'] == True]
chr3L_exons = chr3L_exons[chr3L_exons['pos_to_keep'] == True]
chr3R_exons = chr3R_exons[chr3R_exons['pos_to_keep'] == True]

In [67]:
# Take a look at the number of SNPs with .shape
"chr2L = {}, chr2R = {}, chr3L = {}, chr3R = {} SNPs!".format(chr2L_exons.shape[0], chr2R_exons.shape[0], chr3L_exons.shape[0], chr3R_exons.shape[0])

'chr2L = 222389, chr2R = 209955, chr3L = 199710, chr3R = 240024 SNPs!'

##### Add the sequence category for each SNP

In [68]:
# Insert the new column at position 9
# For introns:
chr2L_short_introns.insert(8, "sequence_category", set_snp_sequence_category(chr2L_short_introns))
chr2R_short_introns.insert(8, "sequence_category", set_snp_sequence_category(chr2R_short_introns))
chr3L_short_introns.insert(8, "sequence_category", set_snp_sequence_category(chr3L_short_introns))
chr3R_short_introns.insert(8, "sequence_category", set_snp_sequence_category(chr3R_short_introns))

# For exons:
chr2L_exons.insert(8, "sequence_category", set_snp_sequence_category(chr2L_exons))
chr2R_exons.insert(8, "sequence_category", set_snp_sequence_category(chr2R_exons))
chr3L_exons.insert(8, "sequence_category", set_snp_sequence_category(chr3L_exons))
chr3R_exons.insert(8, "sequence_category", set_snp_sequence_category(chr3R_exons))

##### Create the sequence category dict

In [69]:
# First create an object for the sequence categories dictionary
sequence_categories_dict = create_sequence_categories_dict()
print(sequence_categories_dict)

{1: ['AA/CA', 'AC/AA'], 2: ['CA/CA', 'CC/AA'], 3: ['GA/CA', 'GC/AA'], 4: ['TA/CA', 'TC/AA'], 5: ['AA/CC', 'AC/AC'], 6: ['CA/CC', 'CC/AC'], 7: ['GA/CC', 'GC/AC'], 8: ['TA/CC', 'TC/AC'], 9: ['AA/CG', 'AC/AG'], 10: ['CA/CG', 'CC/AG'], 11: ['GA/CG', 'GC/AG'], 12: ['TA/CG', 'TC/AG'], 13: ['AA/CT', 'AC/AT'], 14: ['CA/CT', 'CC/AT'], 15: ['GA/CT', 'GC/AT'], 16: ['TA/CT', 'TC/AT'], 17: ['AA/GA', 'AG/AA'], 18: ['CA/GA', 'CG/AA'], 19: ['GA/GA', 'GG/AA'], 20: ['TA/GA', 'TG/AA'], 21: ['AA/GC', 'AG/AC'], 22: ['CA/GC', 'CG/AC'], 23: ['GA/GC', 'GG/AC'], 24: ['TA/GC', 'TG/AC'], 25: ['AA/GG', 'AG/AG'], 26: ['CA/GG', 'CG/AG'], 27: ['GA/GG', 'GG/AG'], 28: ['TA/GG', 'TG/AG'], 29: ['AA/GT', 'AG/AT'], 30: ['CA/GT', 'CG/AT'], 31: ['GA/GT', 'GG/AT'], 32: ['TA/GT', 'TG/AT'], 33: ['AA/TA', 'AT/AA'], 34: ['CA/TA', 'CT/AA'], 35: ['GA/TA', 'GT/AA'], 36: ['TA/TA', 'TT/AA'], 37: ['AA/TC', 'AT/AC'], 38: ['CA/TC', 'CT/AC'], 39: ['GA/TC', 'GT/AC'], 40: ['TA/TC', 'TT/AC'], 41: ['AA/TG', 'AT/AG'], 42: ['CA/TG', 'CT/AG'], 
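Reading the printed dictionary, each label appears to have the form left base + reference base + `/` + alternative base + right base, with a mutation and its reverse sharing one category. This reading is an inference from the output, not something stated by the helper modules. Under that assumption, a label can be derived from the 9-mer `refflank` and `altflank` columns as sketched below (both function names are hypothetical, and the category excerpt is copied from the output above).

```python
def mutational_context_label(refflank, altflank):
    """Build a 'left ref / alt right' label from two flanking sequences.

    Assumes both flanks have the same odd length with the SNP in the center.
    """
    center = len(refflank) // 2
    left, right = refflank[center - 1], refflank[center + 1]
    return f"{left}{refflank[center]}/{altflank[center]}{right}"


def invert_categories(categories):
    """Map each label to its category number."""
    return {label: number for number, labels in categories.items() for label in labels}


# Excerpt of the dictionary printed above.
category_excerpt = {
    37: ["AA/TC", "AT/AC"],
    38: ["CA/TC", "CT/AC"],
    39: ["GA/TC", "GT/AC"],
}
label_to_category = invert_categories(category_excerpt)

# Second chr2L row shown earlier: ref T, alt A.
toy_label = mutational_context_label("TCTCTCCCA", "TCTCACCCA")
print(toy_label, label_to_category.get(toy_label))
```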

In [70]:
#  Create three dictionaries for each chromosome
# For introns
chr2L_short_introns_seqclasses = create_snp_dict_wrapper(chr2L_short_introns, sequence_categories_dict, "introns")
chr2R_short_introns_seqclasses = create_snp_dict_wrapper(chr2R_short_introns, sequence_categories_dict, "introns")
chr3L_short_introns_seqclasses = create_snp_dict_wrapper(chr3L_short_introns, sequence_categories_dict, "introns")
chr3R_short_introns_seqclasses = create_snp_dict_wrapper(chr3R_short_introns, sequence_categories_dict, "introns")

# For exons
chr2L_nonsyns_seqclasses, chr2L_syns_seqclasses = create_snp_dict_wrapper(chr2L_exons, sequence_categories_dict, "exons")
chr2R_nonsyns_seqclasses, chr2R_syns_seqclasses = create_snp_dict_wrapper(chr2R_exons, sequence_categories_dict, "exons")
chr3L_nonsyns_seqclasses, chr3L_syns_seqclasses = create_snp_dict_wrapper(chr3L_exons, sequence_categories_dict, "exons")
chr3R_nonsyns_seqclasses, chr3R_syns_seqclasses = create_snp_dict_wrapper(chr3R_exons, sequence_categories_dict, "exons")


##### Find mutational context pairs and get the SFSs
Now, find the pairs and obtain the SFSs for each member of the pair. These are the valid pairs we are looking to have:
- `short-intron SNP` and `non-synonymous SNPs`;
- `short-intron SNP` and `synonymous SNPs`;
- `synonymous SNP` and `non-synonymous SNPs` (maybe?)

For `short-intron SNP` and `non-synonymous SNPs`

In [71]:
# Find pairs of SNPs that are closest to each other
chr2L_si_nonsyn_pairs_si, chr2L_si_nonsyn_pairs_nonsyn = find_closest_snp_pairs(chr2L_short_introns_seqclasses, chr2L_nonsyns_seqclasses)
chr2R_si_nonsyn_pairs_si, chr2R_si_nonsyn_pairs_nonsyn = find_closest_snp_pairs(chr2R_short_introns_seqclasses, chr2R_nonsyns_seqclasses)
chr3L_si_nonsyn_pairs_si, chr3L_si_nonsyn_pairs_nonsyn = find_closest_snp_pairs(chr3L_short_introns_seqclasses, chr3L_nonsyns_seqclasses)
chr3R_si_nonsyn_pairs_si, chr3R_si_nonsyn_pairs_nonsyn = find_closest_snp_pairs(chr3R_short_introns_seqclasses, chr3R_nonsyns_seqclasses)
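The pairing logic inside `find_closest_snp_pairs` is not shown. One simple strategy, given here as an assumption rather than the actual method, is a greedy match within a single sequence category: each neutral SNP takes the closest selected SNP that has not been used yet. The function name `pair_closest` is hypothetical.

```python
from bisect import bisect_left


def pair_closest(neutral_positions, selected_positions):
    """Greedily pair each neutral SNP with the closest unused selected SNP."""
    available = sorted(selected_positions)
    pairs = []
    for pos in sorted(neutral_positions):
        if not available:
            break  # no selected SNP left to pair with
        i = bisect_left(available, pos)
        # The closest candidate is either just below or just above pos.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(available)]
        best = min(candidates, key=lambda j: abs(available[j] - pos))
        pairs.append((pos, available.pop(best)))
    return pairs


toy_neutral = [100, 500]
toy_selected = [90, 120, 480, 1000]
toy_pairs = pair_closest(toy_neutral, toy_selected)
print(toy_pairs)
```

A greedy match is not guaranteed to minimize the total distance over all pairs, but it keeps each SNP in at most one pair.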

In [73]:
# Get the SFS for the short-intron SNPs
# Remember: min_sample_size is the lower bound set above (160).
# Use the "downsampled" argument, since the data is downsampled.
short_introns_paired_nonsyn_sfs_seqclasses_array = np.array([
    create_unfolded_sfs_from_snp_dict(chr2L_si_nonsyn_pairs_si, "downsampled", max_sample_size=197, min_sample_size=min_number_of_haplotypes, folded=True),
    create_unfolded_sfs_from_snp_dict(chr2R_si_nonsyn_pairs_si, "downsampled", max_sample_size=197, min_sample_size=min_number_of_haplotypes, folded=True),
    create_unfolded_sfs_from_snp_dict(chr3L_si_nonsyn_pairs_si, "downsampled", max_sample_size=197, min_sample_size=min_number_of_haplotypes, folded=True),
    create_unfolded_sfs_from_snp_dict(chr3R_si_nonsyn_pairs_si, "downsampled", max_sample_size=197, min_sample_size=min_number_of_haplotypes, folded=True)
])

short_introns_paired_nonsyn_seqclasses_sfs_folded = np.sum(short_introns_paired_nonsyn_sfs_seqclasses_array, 0).tolist()

# Get the SFS for the nonsynonymous SNPs
nonsynonymous_paired_si_sfs_seqclasses_array = np.array([
    create_unfolded_sfs_from_snp_dict(chr2L_si_nonsyn_pairs_nonsyn, "downsampled", max_sample_size=197, min_sample_size=min_number_of_haplotypes, folded=True),
    create_unfolded_sfs_from_snp_dict(chr2R_si_nonsyn_pairs_nonsyn, "downsampled", max_sample_size=197, min_sample_size=min_number_of_haplotypes, folded=True),
    create_unfolded_sfs_from_snp_dict(chr3L_si_nonsyn_pairs_nonsyn, "downsampled", max_sample_size=197, min_sample_size=min_number_of_haplotypes, folded=True),
    create_unfolded_sfs_from_snp_dict(chr3R_si_nonsyn_pairs_nonsyn, "downsampled", max_sample_size=197, min_sample_size=min_number_of_haplotypes, folded=True)
])

nonsynonymous_paired_si_seqclasses_sfs_folded = np.sum(nonsynonymous_paired_si_sfs_seqclasses_array, 0).tolist()

Working with downsampled data...
Working with downsampled data...
Working with downsampled data...
Working with downsampled data...
Working with downsampled data...
Working with downsampled data...
Working with downsampled data...
Working with downsampled data...


In [74]:
# Save the pairs to a file
si_nonsyn_pair_output_sfs_file = "paired/ZI_sfs_si_paired_with_nonsynonymous_downsampled_folded.txt"

with open(si_nonsyn_pair_output_sfs_file, "w") as of:
    of.write("Introns and nonsynonymous SFS of " + str(min_number_of_haplotypes) + " samples" + "\n")
    of.write("\t".join(str(item) for item in short_introns_paired_nonsyn_seqclasses_sfs_folded) + "\n")
    of.write("\n")
    of.write("\t".join(str(item) for item in nonsynonymous_paired_si_seqclasses_sfs_folded) + "\n")

For `short-intron SNP` and `synonymous SNPs`

In [75]:
# Find pairs of SNPs that are closest to each other
chr2L_si_syn_pairs_si, chr2L_si_syn_pairs_syn = find_closest_snp_pairs(chr2L_short_introns_seqclasses, chr2L_syns_seqclasses)
chr2R_si_syn_pairs_si, chr2R_si_syn_pairs_syn = find_closest_snp_pairs(chr2R_short_introns_seqclasses, chr2R_syns_seqclasses)
chr3L_si_syn_pairs_si, chr3L_si_syn_pairs_syn = find_closest_snp_pairs(chr3L_short_introns_seqclasses, chr3L_syns_seqclasses)
chr3R_si_syn_pairs_si, chr3R_si_syn_pairs_syn = find_closest_snp_pairs(chr3R_short_introns_seqclasses, chr3R_syns_seqclasses)

In [76]:
# Get the SFS for the short introns SNPs
short_introns_paired_syn_sfs_seqclasses_array = np.array([
    create_unfolded_sfs_from_snp_dict(chr2L_si_syn_pairs_si, "downsampled", max_sample_size=197, min_sample_size=min_number_of_haplotypes, folded=True),
    create_unfolded_sfs_from_snp_dict(chr2R_si_syn_pairs_si, "downsampled", max_sample_size=197, min_sample_size=min_number_of_haplotypes, folded=True),
    create_unfolded_sfs_from_snp_dict(chr3L_si_syn_pairs_si, "downsampled", max_sample_size=197, min_sample_size=min_number_of_haplotypes, folded=True),
    create_unfolded_sfs_from_snp_dict(chr3R_si_syn_pairs_si, "downsampled", max_sample_size=197, min_sample_size=min_number_of_haplotypes, folded=True)
])

short_introns_paired_syn_seqclasses_sfs_folded = np.sum(short_introns_paired_syn_sfs_seqclasses_array, 0).tolist()

# Get the SFS for the synonymous SNPs
synonymous_paired_si_sfs_seqclasses_array = np.array([
    create_unfolded_sfs_from_snp_dict(chr2L_si_syn_pairs_syn, "downsampled", max_sample_size=197, min_sample_size=min_number_of_haplotypes, folded=True),
    create_unfolded_sfs_from_snp_dict(chr2R_si_syn_pairs_syn, "downsampled", max_sample_size=197, min_sample_size=min_number_of_haplotypes, folded=True),
    create_unfolded_sfs_from_snp_dict(chr3L_si_syn_pairs_syn, "downsampled", max_sample_size=197, min_sample_size=min_number_of_haplotypes, folded=True),
    create_unfolded_sfs_from_snp_dict(chr3R_si_syn_pairs_syn, "downsampled", max_sample_size=197, min_sample_size=min_number_of_haplotypes, folded=True)
])

synonymous_paired_si_seqclasses_sfs_folded = np.sum(synonymous_paired_si_sfs_seqclasses_array, 0).tolist()

Working with downsampled data...
Working with downsampled data...
Working with downsampled data...
Working with downsampled data...
Working with downsampled data...
Working with downsampled data...
Working with downsampled data...
Working with downsampled data...


In [77]:
# Save the pairs to a file
si_syn_pair_output_sfs_file = "paired/ZI_sfs_si_paired_with_synonymous_downsampled_folded_.txt"

with open(si_syn_pair_output_sfs_file, "w") as of:
    of.write("Introns and synonymous SFS of " + str(min_number_of_haplotypes) + " samples" + "\n")
    of.write("\t".join(str(item) for item in short_introns_paired_syn_seqclasses_sfs_folded) + "\n")
    of.write("\n")
    of.write("\t".join(str(item) for item in synonymous_paired_si_seqclasses_sfs_folded) + "\n")

For `synonymous SNP` and `non-synonymous SNPs`

In [78]:
# Find pairs of SNPs that are closest to each other
chr2L_syn_nonsyn_pairs_syn, chr2L_syn_nonsyn_pairs_nonsyn = find_closest_snp_pairs(chr2L_syns_seqclasses, chr2L_nonsyns_seqclasses)
chr2R_syn_nonsyn_pairs_syn, chr2R_syn_nonsyn_pairs_nonsyn = find_closest_snp_pairs(chr2R_syns_seqclasses, chr2R_nonsyns_seqclasses)
chr3L_syn_nonsyn_pairs_syn, chr3L_syn_nonsyn_pairs_nonsyn = find_closest_snp_pairs(chr3L_syns_seqclasses, chr3L_nonsyns_seqclasses)
chr3R_syn_nonsyn_pairs_syn, chr3R_syn_nonsyn_pairs_nonsyn = find_closest_snp_pairs(chr3R_syns_seqclasses, chr3R_nonsyns_seqclasses)

In [79]:
# Get the SFS for the synonymous SNPs
synonymous_paired_nonsyn_sfs_seqclasses_array = np.array([
    create_unfolded_sfs_from_snp_dict(chr2L_syn_nonsyn_pairs_syn, "downsampled", max_sample_size=197, min_sample_size=min_number_of_haplotypes, folded=True),
    create_unfolded_sfs_from_snp_dict(chr2R_syn_nonsyn_pairs_syn, "downsampled", max_sample_size=197, min_sample_size=min_number_of_haplotypes, folded=True),
    create_unfolded_sfs_from_snp_dict(chr3L_syn_nonsyn_pairs_syn, "downsampled", max_sample_size=197, min_sample_size=min_number_of_haplotypes, folded=True),
    create_unfolded_sfs_from_snp_dict(chr3R_syn_nonsyn_pairs_syn, "downsampled", max_sample_size=197, min_sample_size=min_number_of_haplotypes, folded=True)
])

synonymous_paired_nonsyn_seqclasses_sfs_folded = np.sum(synonymous_paired_nonsyn_sfs_seqclasses_array, 0).tolist()

# Get the SFS for the nonsynonymous SNPs
nonsynonymous_paired_syn_sfs_seqclasses_array = np.array([
    create_unfolded_sfs_from_snp_dict(chr2L_syn_nonsyn_pairs_nonsyn, "downsampled", max_sample_size=197, min_sample_size=min_number_of_haplotypes, folded=True),
    create_unfolded_sfs_from_snp_dict(chr2R_syn_nonsyn_pairs_nonsyn, "downsampled", max_sample_size=197, min_sample_size=min_number_of_haplotypes, folded=True),
    create_unfolded_sfs_from_snp_dict(chr3L_syn_nonsyn_pairs_nonsyn, "downsampled", max_sample_size=197, min_sample_size=min_number_of_haplotypes, folded=True),
    create_unfolded_sfs_from_snp_dict(chr3R_syn_nonsyn_pairs_nonsyn, "downsampled", max_sample_size=197, min_sample_size=min_number_of_haplotypes, folded=True)
])

nonsynonymous_paired_syn_seqclasses_sfs_folded = np.sum(nonsynonymous_paired_syn_sfs_seqclasses_array, 0).tolist()

Working with downsampled data...
Working with downsampled data...
Working with downsampled data...
Working with downsampled data...
Working with downsampled data...
Working with downsampled data...
Working with downsampled data...
Working with downsampled data...


In [80]:
# Save the pairs to a file
syn_nonsyn_pair_output_sfs_file = "paired/ZI_sfs_synonymous_paired_with_nonsynnonymous_downsampled_folded.txt"

with open(syn_nonsyn_pair_output_sfs_file, "w") as of:
    of.write("Synonymous and nonsynonymous SFS of " + str(min_number_of_haplotypes) + " samples" + "\n")
    of.write("\t".join(str(item) for item in synonymous_paired_nonsyn_seqclasses_sfs_folded) + "\n")
    of.write("\n")
    of.write("\t".join(str(item) for item in nonsynonymous_paired_syn_seqclasses_sfs_folded) + "\n")